Buffer Configuration
Overview
When a destination becomes unavailable or slow, the Edge Delta agent can buffer data to disk and automatically retry delivery when the destination recovers. This prevents data loss during outages without requiring manual intervention.
The persistent_queue configuration is available on destination nodes and provides disk-based buffering with configurable behavior for different failure scenarios. When enabled, the agent writes data to disk during destination issues and drains the buffer when normal operation resumes.
Why sender-side buffering
Most observability destinations follow a stateless receiver pattern: they accept data via HTTP or similar protocols but have no knowledge of sender state or identity. Destinations cannot request retries, signal senders to slow down, or track what data is missing from a specific source. Because the destination is unaware of sender state, the sender must own durability. The Edge Delta agent is the only component that knows what data hasn’t been acknowledged, what needs to be retried, and how long to buffer before dropping. This is why persistent_queue is configured on the agent rather than relying on destination-side buffering.
This architecture prioritizes loose coupling between components. The agent operates autonomously while destinations remain simple HTTP endpoints. You can swap destinations (Splunk, Datadog, S3) without changing the delivery mechanism because they all follow the same stateless contract: accept data, return success or failure. Agent failures don’t corrupt destination state, and destination failures don’t corrupt agent state.
The trade-off is delivery semantics. True exactly-once delivery would require tight coupling: both sender and receiver participating in distributed transactions, sharing state, and coordinating acknowledgments. Instead, Edge Delta provides at-least-once delivery where the agent retries until the destination acknowledges success. The benefit is a simpler, more resilient architecture where each component owns its domain completely: the agent owns buffering, retry logic, and rate limiting, while the destination owns ingestion.
How buffering works
The buffering process operates through four stages:
- Normal operation - Data flows directly to the destination (for `error` and `backpressure` modes) or through the disk buffer (for `always` mode).
- Issue detected - Based on the configured `mode`, the agent detects destination failures or slowdowns and begins writing data to disk.
- Recovery - When the destination becomes healthy, buffered data drains at the configured rate while new data continues flowing.
- Completion - The buffer clears and normal operation resumes.
Configuration parameters
Configure the persistent_queue block within a destination node:
```yaml
nodes:
  - name: my_http_destination
    type: http_output
    endpoint: https://my-destination.example.com/ingest
    persistent_queue:
      path: /var/lib/edgedelta/outputbuffer
      mode: error
      max_byte_size: 1GB
      drain_rate_limit: 1000
      strict_ordering: false
```
path
The path parameter specifies the directory where buffered data is stored on disk. This parameter is required when configuring a persistent queue.
Requirements:
- The directory must have sufficient disk space for the configured `max_byte_size`
- The agent process must have read/write permissions to this location
- The path should be on a persistent volume (not tmpfs or a memory-backed filesystem)
- For Kubernetes deployments, the Helm chart pre-configures `/var/lib/edgedelta` as a hostPath volume (see the sketch after the best practices below)
Best practices:
- Use dedicated storage for buffer data separate from logs
- Monitor disk usage to prevent the buffer from filling available space
- Ensure the path persists across agent restarts to maintain buffered data
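For self-managed Kubernetes deployments, the buffer path must survive pod restarts. The following is an illustrative sketch of the kind of hostPath volume the Helm chart pre-configures for you; the resource names and image reference are placeholders, not the chart's actual values:

```yaml
# Illustrative sketch only - the Edge Delta Helm chart configures an
# equivalent volume automatically; names and image are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: edgedelta-agent
spec:
  selector:
    matchLabels:
      app: edgedelta-agent
  template:
    metadata:
      labels:
        app: edgedelta-agent
    spec:
      containers:
        - name: edgedelta
          image: edgedelta/agent:latest   # illustrative image reference
          volumeMounts:
            - name: buffer-storage
              mountPath: /var/lib/edgedelta   # must match persistent_queue.path
      volumes:
        - name: buffer-storage
          hostPath:
            path: /var/lib/edgedelta
            type: DirectoryOrCreate   # create the directory on the node if missing
```

Because the buffer lives on the node's filesystem rather than inside the container, buffered data survives agent pod restarts and rescheduling onto the same node.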
mode
The mode parameter determines when data is buffered to disk. Three modes are available:
| Mode | Behavior | Trade-off | Recommended for |
|---|---|---|---|
| `error` (default) | Buffers only when the destination returns errors | No protection during slow responses | Reliable destinations with consistent response times |
| `backpressure` | Buffers when the in-memory queue reaches 80% capacity or on errors | Slightly more disk writes during slowdowns | Most production deployments |
| `always` | Write-ahead-log behavior; all data is written to disk before sending | Disk I/O on every item reduces throughput | Maximum durability requirements |
Error mode provides the minimal protection layer needed to prevent data loss when destinations temporarily fail. Without any persistent queue, a destination outage means data is lost. With error mode enabled, data is preserved on disk during failures and delivered automatically when the destination recovers.
Backpressure mode provides everything error mode offers, plus protection against slow destinations. When a destination is slow but not completely down, the agent spills data to disk and continues processing, isolating itself from the slow backend. This prevents a slow destination from cascading failures into your agent cluster.
Always mode forces the agent to write every item to disk before attempting delivery, then reads from disk for transmission. This guarantees that data survives even sudden agent crashes or restarts. Only enable always mode if you have a specific, well-understood requirement where the durability guarantee outweighs the throughput reduction.
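For most production deployments, the table above points to `backpressure`. A minimal sketch of that variant, based on the configuration example earlier in this page (the node name, endpoint, and sizes are illustrative):

```yaml
nodes:
  - name: my_http_destination                       # illustrative name
    type: http_output
    endpoint: https://my-destination.example.com/ingest
    persistent_queue:
      path: /var/lib/edgedelta/outputbuffer
      mode: backpressure    # spill to disk on errors or when the in-memory queue hits 80%
      max_byte_size: 2GB
```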
max_byte_size
The max_byte_size parameter defines the maximum disk space the persistent buffer is allowed to use. Once this limit is reached, new incoming items are dropped.
Sizing guidance:
| Deployment size | Recommended size | Approximate coverage |
|---|---|---|
| Small (1-10 logs/sec) | 100MB - 500MB | 15-60 minute outage |
| Medium (10-100 logs/sec) | 500MB - 2GB | 30-120 minute outage |
| Large (100+ logs/sec) | 2GB - 10GB | 1-3 hour outage |
Calculate your buffer size based on expected outage duration:
```
Buffer size = Average log size × Log rate × Expected outage duration
```

Example:

```
Average log size: 1KB
Log rate:         100 logs/sec
Expected outage:  1 hour

Buffer size = 1KB × 100 logs/sec × 3600 sec = 360MB
Recommended:  500MB - 1GB (with safety margin)
```
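Applied to the destination configuration, the worked example above might translate into something like the following fragment (the values are illustrative):

```yaml
persistent_queue:
  path: /var/lib/edgedelta/outputbuffer
  mode: backpressure
  max_byte_size: 1GB    # calculated need of ~360MB plus safety margin
```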
drain_rate_limit
The drain_rate_limit parameter controls the maximum items per second when draining the buffer after a destination recovers.
The default value is 0 (unlimited), meaning the buffer drains as fast as the destination accepts data.
When a destination recovers from an outage, it may still be fragile. Immediately flooding it with hours of backlogged data can trigger another failure. The drain rate limit allows gradual, controlled recovery.
| Scenario | Recommended rate | Reasoning |
|---|---|---|
| Stable, well-provisioned destination | 0 (unlimited) | Minimize recovery time |
| Shared or multi-tenant destination | 20-50% of capacity | Leave headroom for live traffic |
| Recently recovered destination | 10-25% of capacity | Gentle ramp-up to prevent re-triggering failure |
| Rate-limited destination (SaaS) | Below API rate limit | Avoid throttling or quota exhaustion |
Impact on recovery time:
```
Buffer size: 1GB (~1,000,000 logs at 1KB each)

At unlimited (0):   depends on destination capacity
At 5000 items/sec:  ~3.3 minutes to drain
At 1000 items/sec:  ~17 minutes to drain
At 100 items/sec:   ~2.8 hours to drain
```
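As an example, the fragment below caps the drain rate to stay under an assumed SaaS rate limit; the specific limit of 1000 items per second is illustrative:

```yaml
persistent_queue:
  path: /var/lib/edgedelta/outputbuffer
  mode: error
  max_byte_size: 1GB
  drain_rate_limit: 1000    # cap drain at 1,000 items/sec (~17 minutes to empty a full 1GB buffer)
```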
strict_ordering
The strict_ordering parameter controls how items are consumed from the persistent buffer.
| Value | Processing model | Buffer priority | Recovery latency |
|---|---|---|---|
| `false` (default) | Parallel workers | Buffered data drains in the background | Lower - current state visible immediately |
| `true` | Single-threaded | Buffered items always drain first | Higher - queue must drain before new data |
When strict_ordering: true, the agent runs with a single processing thread and prioritizes draining buffered items first. New incoming data waits until all buffered items are processed in exact chronological order.
When strict_ordering: false (default), multiple workers process data in parallel and new data flows directly to the destination while buffered data drains in the background.
Note: When `strict_ordering: true`, you must set `parallel_workers: 1` on the destination node. Pipeline validation fails if `parallel_workers` is greater than 1 because parallel processing breaks ordering guarantees.
When to keep default (false):
Most observability use cases benefit from the default. When a destination recovers from an outage, operators typically want to see current system state on dashboards immediately, while historical data backfills in the background.
When to enable strict ordering:
Strict ordering is primarily needed for security-focused customers who build systems where events must arrive in exact delivery order:
- Stateful security streaming engines that maintain state across events
- Audit and compliance logs with regulatory requirements for exact temporal sequence
- State reconstruction systems that replay events to rebuild state
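A sketch of a destination configured for strict ordering, reflecting the `parallel_workers: 1` requirement noted above. The node name and endpoint are illustrative, and the choice of `always` mode here is an example pairing for maximum durability, not a requirement:

```yaml
nodes:
  - name: siem_destination              # illustrative name
    type: http_output
    endpoint: https://siem.example.com/ingest
    parallel_workers: 1                 # required; validation fails if greater than 1
    persistent_queue:
      path: /var/lib/edgedelta/outputbuffer
      mode: always                      # write-ahead-log behavior (illustrative pairing)
      strict_ordering: true             # drain buffered items first, in exact order
```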
Verifying buffer operation
Monitor the `ed.buffer.disk.bytes` metric to verify that buffering is working:
- During normal operation, this metric should be `0` (for `error` and `backpressure` modes)
- During destination failures, this metric increases as data buffers to disk
- After recovery, this metric gradually decreases as the buffer drains
- When the drain completes, the metric returns to `0`
You can view this metric in the Edge Delta UI under agent metrics.
Troubleshooting
Buffer not activating during failures
Verify your configuration:
- Confirm `persistent_queue` is configured on the destination node
- Check that `path` points to a writable directory with sufficient space
- For `error` mode, verify the destination is actually returning errors (check agent logs)
- For `backpressure` mode, verify the in-memory queue is reaching 80% capacity
Slow buffer drain
If the buffer takes longer than expected to drain:
- Check the `drain_rate_limit` setting - a low value extends drain time
- Verify destination health - the destination must be accepting requests
- Check network connectivity between the agent and the destination
- Review `batch_size` on the destination - larger batches may improve throughput
Buffer filling to maximum
If `ed.buffer.disk.bytes` reaches `max_byte_size` and stops increasing:
- New data is being dropped - this is expected behavior at capacity
- Consider increasing `max_byte_size` if you expect longer outages
- Review destination health and the expected recovery timeline
- Check available disk space on the agent host
Data not appearing after recovery
If the destination recovered but buffered data is not appearing:
- Allow sufficient time for the drain - check the drain time calculation above
- Monitor `ed.buffer.disk.bytes` to confirm the drain is progressing
- Check destination logs for any ingestion errors
- Verify the `strict_ordering` setting - if `true`, new data waits for the buffer to drain
Agent-level buffer settings
The agent has separate buffer settings at the pipeline level that control internal processing buffers. These are different from the destination persistent_queue:
| Setting | Purpose | Default |
|---|---|---|
| `item_buffer_flush_interval` | Interval after which internal item buffers flush their contents | 5s |
| `item_buffer_max_byte_limit` | Size limit that triggers an internal item buffer flush | 1MiB |
These settings control how the agent batches items internally before sending to destinations. They do not provide disk-based persistence or data loss protection - use persistent_queue for that purpose.
Note: The `item_buffer_flush_interval` and `item_buffer_max_byte_limit` settings are legacy parameters primarily used for internal agent operation. For data loss prevention during destination failures, always use the `persistent_queue` configuration on destination nodes.
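If you do need to tune these internal buffers, a rough sketch follows. The placement under a top-level settings block is an assumption here; consult the Agent Settings reference for where these parameters live in your agent version:

```yaml
# Sketch only - assumes a top-level settings block; see the Agent Settings
# reference for the exact placement in your agent version.
settings:
  item_buffer_flush_interval: 5s      # flush internal item buffers every 5 seconds (default)
  item_buffer_max_byte_limit: 1MiB    # or flush earlier once a buffer reaches 1MiB (default)

nodes:
  - name: my_http_destination
    type: http_output
    endpoint: https://my-destination.example.com/ingest
    persistent_queue:                 # disk-based data loss protection lives here, not in settings
      path: /var/lib/edgedelta/outputbuffer
      mode: error
      max_byte_size: 1GB
```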
See also
- HTTP Destination - persistent_queue configuration reference
- Agent Settings - Pipeline-level agent configuration
- Troubleshooting Overview - General troubleshooting resources