Buffer Configuration

Configure destination buffering in Edge Delta to prevent data loss during destination outages, slowdowns, or network issues.

Overview

When a destination becomes unavailable or slow, the Edge Delta agent can buffer data to disk and automatically retry delivery when the destination recovers. This prevents data loss during outages without requiring manual intervention.

The persistent_queue configuration is available on destination nodes and provides disk-based buffering with configurable behavior for different failure scenarios. When enabled, the agent writes data to disk during destination issues and drains the buffer when normal operation resumes.

Why sender-side buffering

Most observability destinations follow a stateless receiver pattern: they accept data via HTTP or similar protocols but have no knowledge of sender state or identity. Destinations cannot request retries, signal senders to slow down, or track what data is missing from a specific source. Because the destination is unaware of sender state, the sender must own durability. The Edge Delta agent is the only component that knows what data hasn’t been acknowledged, what needs to be retried, and how long to buffer before dropping. This is why persistent_queue is configured on the agent rather than relying on destination-side buffering.

This architecture prioritizes loose coupling between components. The agent operates autonomously while destinations remain simple HTTP endpoints. You can swap destinations (Splunk, Datadog, S3) without changing the delivery mechanism because they all follow the same stateless contract: accept data, return success or failure. Agent failures don’t corrupt destination state, and destination failures don’t corrupt agent state.

The trade-off is delivery semantics. True exactly-once delivery would require tight coupling: both sender and receiver participating in distributed transactions, sharing state, and coordinating acknowledgments. Instead, Edge Delta provides at-least-once delivery where the agent retries until the destination acknowledges success. The benefit is a simpler, more resilient architecture where each component owns its domain completely: the agent owns buffering, retry logic, and rate limiting, while the destination owns ingestion.

How buffering works

The buffering process operates through four stages:

  1. Normal operation - Data flows directly to the destination (for error and backpressure modes) or through the disk buffer (for always mode).
  2. Issue detected - Based on the configured mode, the agent detects destination failures or slowdowns and begins writing data to disk.
  3. Recovery - When the destination becomes healthy, buffered data drains at the configured rate while new data continues flowing.
  4. Completion - The buffer clears and normal operation resumes.

Configuration parameters

Configure the persistent_queue block within a destination node:

nodes:
  - name: my_http_destination
    type: http_output
    endpoint: https://my-destination.example.com/ingest
    persistent_queue:
      path: /var/lib/edgedelta/outputbuffer
      mode: error
      max_byte_size: 1GB
      drain_rate_limit: 1000
      strict_ordering: false

path

The path parameter specifies the directory where buffered data is stored on disk. This parameter is required when configuring a persistent queue.

Requirements:

  • The directory must have sufficient disk space for the configured max_byte_size
  • The agent process must have read/write permissions to this location
  • The path should be on a persistent volume (not tmpfs or memory-backed filesystem)
  • For Kubernetes deployments, the Helm chart pre-configures /var/lib/edgedelta as a hostPath volume

Best practices:

  • Use dedicated storage for buffer data separate from logs
  • Monitor disk usage to prevent the buffer from filling available space
  • Ensure the path persists across agent restarts to maintain buffered data
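For example, on a Kubernetes deployment the buffer can sit under the hostPath volume the Helm chart already mounts. The sketch below assumes a per-destination subdirectory, which is an illustrative choice rather than a requirement:

nodes:
  - name: my_http_destination
    type: http_output
    endpoint: https://my-destination.example.com/ingest
    persistent_queue:
      # Subdirectory on the hostPath volume mounted by the Helm chart;
      # keeps this destination's buffer separate from other agent data.
      path: /var/lib/edgedelta/outputbuffer/my_http_destination
      mode: error
      max_byte_size: 1GB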

mode

The mode parameter determines when data is buffered to disk. Three modes are available:

  • error (default): buffers only when the destination returns errors. Trade-off: no protection during slow responses. Recommended for reliable destinations with consistent response times.
  • backpressure: buffers when the in-memory queue reaches 80% capacity or on errors. Trade-off: slightly more disk writes during slowdowns. Recommended for most production deployments.
  • always: write-ahead-log behavior; all data is written to disk before sending. Trade-off: disk I/O on every item reduces throughput. Recommended for maximum durability requirements.

Error mode provides the minimal protection layer needed to prevent data loss when destinations temporarily fail. Without any persistent queue, a destination outage means data is lost. With error mode enabled, data is preserved on disk during failures and delivered automatically when the destination recovers.

Backpressure mode provides everything error mode offers, plus protection against slow destinations. When a destination is slow but not completely down, the agent spills data to disk and continues processing, isolating itself from the slow backend. This keeps slowness in one destination from cascading into your agent cluster.

Always mode forces the agent to write every item to disk before attempting delivery, then reads from disk for transmission. This guarantees that data survives even sudden agent crashes or restarts. Only enable always mode if you have a specific, well-understood requirement where the durability guarantee outweighs the throughput reduction.
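For most production deployments, a backpressure configuration along the following lines is a reasonable starting point; the node name, endpoint, and buffer size are placeholders to adapt:

nodes:
  - name: my_http_destination
    type: http_output
    endpoint: https://my-destination.example.com/ingest
    persistent_queue:
      path: /var/lib/edgedelta/outputbuffer
      # Spill to disk on destination errors or when the in-memory
      # queue reaches 80% capacity, rather than only on hard failures.
      mode: backpressure
      max_byte_size: 2GB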

max_byte_size

The max_byte_size parameter defines the maximum disk space the persistent buffer is allowed to use. Once this limit is reached, new incoming items are dropped.

Sizing guidance:

  • Small (1-10 logs/sec): 100MB - 500MB, covering roughly a 15-60 minute outage
  • Medium (10-100 logs/sec): 500MB - 2GB, covering roughly a 30-120 minute outage
  • Large (100+ logs/sec): 2GB - 10GB, covering roughly a 1-3 hour outage

Calculate your buffer size based on expected outage duration:

Buffer size = Average log size × Log rate × Expected outage duration

Example:
Average log size: 1KB
Log rate: 100 logs/sec
Expected outage: 1 hour

Buffer size = 1KB × 100 logs/sec × 3600 sec = 360MB
Recommended: 500MB - 1GB (with safety margin)
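Applied to the destination node from the earlier example, that estimate might translate into a persistent_queue block like this, rounding 360MB up to 1GB for headroom:

    persistent_queue:
      path: /var/lib/edgedelta/outputbuffer
      mode: error
      # 1KB × 100 logs/sec × 3600 sec = 360MB for a one-hour outage;
      # 1GB leaves roughly a 2-3x safety margin.
      max_byte_size: 1GB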

drain_rate_limit

The drain_rate_limit parameter controls the maximum items per second when draining the buffer after a destination recovers.

The default value is 0 (unlimited), meaning the buffer drains as fast as the destination accepts data.

When a destination recovers from an outage, it may still be fragile. Immediately flooding it with hours of backlogged data can trigger another failure. The drain rate limit allows gradual, controlled recovery.

  • Stable, well-provisioned destination: 0 (unlimited), to minimize recovery time
  • Shared or multi-tenant destination: 20-50% of capacity, to leave headroom for live traffic
  • Recently recovered destination: 10-25% of capacity, for a gentle ramp-up that avoids re-triggering the failure
  • Rate-limited destination (SaaS): below the API rate limit, to avoid throttling or quota exhaustion

Impact on recovery time:

Buffer size: 1GB (~1,000,000 logs at 1KB each)

At unlimited (0):  Depends on destination capacity
At 5000 items/sec: ~3.5 minutes to drain
At 1000 items/sec: ~17 minutes to drain
At 100 items/sec:  ~2.8 hours to drain
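For example, to hold a rate-limited SaaS destination to roughly 1000 items per second during recovery, the persistent_queue block from the earlier example could be extended as follows; the exact figure is illustrative, not a vendor recommendation:

    persistent_queue:
      path: /var/lib/edgedelta/outputbuffer
      mode: backpressure
      max_byte_size: 1GB
      # ~1,000,000 buffered 1KB items drain in roughly 17 minutes at this
      # rate while staying under the destination's API rate limit.
      drain_rate_limit: 1000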

strict_ordering

The strict_ordering parameter controls how items are consumed from the persistent buffer.

  • false (default): parallel workers; buffered data drains in the background; lower recovery latency because the current state is visible immediately
  • true: single-threaded processing; buffered items always drain first; higher recovery latency because the queue must drain before new data

When strict_ordering: true, the agent runs with a single processing thread and prioritizes draining buffered items first. New incoming data waits until all buffered items are processed in exact chronological order.

When strict_ordering: false (default), multiple workers process data in parallel and new data flows directly to the destination while buffered data drains in the background.

Note: When strict_ordering: true, you must set parallel_workers: 1 on the destination node. Pipeline validation fails if parallel_workers is greater than 1 because parallel processing breaks ordering guarantees.
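A sketch of a strictly ordered destination might pair strict_ordering with the required single worker as shown below; the node name, endpoint, and mode choice are illustrative:

nodes:
  - name: my_audit_destination
    type: http_output
    endpoint: https://audit.example.com/ingest
    # Required when strict_ordering is true; pipeline validation fails otherwise.
    parallel_workers: 1
    persistent_queue:
      path: /var/lib/edgedelta/outputbuffer
      mode: always
      strict_ordering: true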

When to keep default (false):

Most observability use cases benefit from the default. When a destination recovers from an outage, operators typically want to see current system state on dashboards immediately, while historical data backfills in the background.

When to enable strict ordering:

Strict ordering is primarily needed for security-focused use cases where events must arrive in exact order:

  • Stateful security streaming engines that maintain state across events
  • Audit and compliance logs with regulatory requirements for exact temporal sequence
  • State reconstruction systems that replay events to rebuild state

Verifying buffer operation

Monitor the ed.buffer.disk.bytes metric to verify buffering is working:

  1. During normal operation, this metric should be 0 (for error and backpressure modes)
  2. During destination failures, this metric increases as data buffers to disk
  3. After recovery, this metric gradually decreases as the buffer drains
  4. When drain completes, the metric returns to 0

You can view this metric in the Edge Delta UI under agent metrics.

Troubleshooting

Buffer not activating during failures

Verify your configuration:

  1. Confirm persistent_queue is configured on the destination node
  2. Check that path points to a writable directory with sufficient space
  3. For error mode, verify the destination is actually returning errors (check agent logs)
  4. For backpressure mode, verify the in-memory queue is reaching 80% capacity

Slow buffer drain

If the buffer takes longer than expected to drain:

  1. Check drain_rate_limit setting - a low value extends drain time
  2. Verify destination health - the destination must be accepting requests
  3. Check network connectivity between agent and destination
  4. Review batch_size on the destination - larger batches may improve throughput

Buffer filling to maximum

If ed.buffer.disk.bytes reaches max_byte_size and stops increasing:

  1. New data is being dropped - this is expected behavior at capacity
  2. Consider increasing max_byte_size if you expect longer outages
  3. Review destination health and recovery timeline
  4. Check available disk space on the agent host

Data not appearing after recovery

If the destination recovered but buffered data is not appearing:

  1. Allow sufficient time for drain - check the drain time calculation above
  2. Monitor ed.buffer.disk.bytes to confirm drain is progressing
  3. Check destination logs for any ingestion errors
  4. Verify strict_ordering setting - if true, new data waits for buffer to drain

Agent-level buffer settings

The agent has separate buffer settings at the pipeline level that control internal processing buffers. These are different from the destination persistent_queue:

  • item_buffer_flush_interval: interval after which internal item buffers flush their contents (default: 5s)
  • item_buffer_max_byte_limit: size limit that triggers an internal item buffer flush (default: 1MiB)

These settings control how the agent batches items internally before sending to destinations. They do not provide disk-based persistence or data loss protection - use persistent_queue for that purpose.

Note: The item_buffer_flush_interval and item_buffer_max_byte_limit settings are legacy parameters primarily used for internal agent operation. For data loss prevention during destination failures, always use the persistent_queue configuration on destination nodes.
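For reference, these settings might be declared along the following lines. The placement under a top-level settings block is an assumption for illustration; confirm the exact location in your agent version's configuration schema:

settings:
  # Flush internal item buffers every 5 seconds (the default)...
  item_buffer_flush_interval: 5s
  # ...or as soon as a buffer reaches 1MiB (the default).
  item_buffer_max_byte_limit: 1MiB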

See also