Buffer Configuration

Configure destination buffering in Edge Delta to prevent data loss during destination outages, slowdowns, or network issues.

Overview

When a destination becomes unavailable or slow, the Edge Delta agent can buffer data to disk and automatically retry delivery when the destination recovers. This prevents data loss during outages without requiring manual intervention.

The persistent_queue configuration is available on destination nodes and provides disk-based buffering with configurable behavior for different failure scenarios. When enabled, the agent writes data to disk during destination issues and drains the buffer when normal operation resumes.

Why sender-side buffering

Most observability destinations follow a stateless receiver pattern: they accept data via HTTP or similar protocols but have no knowledge of sender state or identity. Destinations cannot request retries, signal senders to slow down, or track what data is missing from a specific source. Because the destination is unaware of sender state, the sender must own durability. The Edge Delta agent is the only component that knows what data hasn’t been acknowledged, what needs to be retried, and how long to buffer before dropping. This is why persistent_queue is configured on the agent rather than relying on destination-side buffering.

This architecture prioritizes loose coupling between components. The agent operates autonomously while destinations remain simple HTTP endpoints. You can swap destinations (Splunk, Datadog, S3) without changing the delivery mechanism because they all follow the same stateless contract: accept data, return success or failure. Agent failures don’t corrupt destination state, and destination failures don’t corrupt agent state.

The trade-off is delivery semantics. True exactly-once delivery would require tight coupling: both sender and receiver participating in distributed transactions, sharing state, and coordinating acknowledgments. Instead, Edge Delta provides at-least-once delivery where the agent retries until the destination acknowledges success. The benefit is a simpler, more resilient architecture where each component owns its domain completely: the agent owns buffering, retry logic, and rate limiting, while the destination owns ingestion.

How buffering works

The buffering process operates through four stages:

  1. Normal operation - Data flows directly to the destination (for error and backpressure modes) or through the disk buffer (for always mode).
  2. Issue detected - Based on the configured mode, the agent detects destination failures or slowdowns and begins writing data to disk.
  3. Recovery - When the destination becomes healthy, buffered data drains at the configured rate while new data continues flowing.
  4. Completion - The buffer clears and normal operation resumes.

Configuration parameters

Configure the persistent_queue block within a destination node:

nodes:
  - name: my_http_destination
    type: http_output
    endpoint: https://my-destination.example.com/ingest
    persistent_queue:
      path: /var/lib/edgedelta/outputbuffer
      mode: error
      max_byte_size: 1GB
      drain_rate_limit: 1000
      strict_ordering: false

path

The path parameter specifies the directory where buffered data is stored on disk. This parameter is required when configuring a persistent queue.

Requirements:

  • The directory must have sufficient disk space for the configured max_byte_size
  • The agent process must have read/write permissions to this location
  • The path should be on a persistent volume (not tmpfs or memory-backed filesystem)
  • For Kubernetes deployments, the Helm chart pre-configures /var/lib/edgedelta as a hostPath volume

Best practices:

  • Use dedicated storage for buffer data separate from logs
  • Monitor disk usage to prevent the buffer from filling available space
  • Ensure the path persists across agent restarts to maintain buffered data
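For example, on a Kubernetes deployment the buffer can sit under the hostPath volume the Helm chart already mounts. The sketch below assumes a per-destination subdirectory, which is an illustrative choice rather than a requirement:

nodes:
  - name: my_http_destination
    type: http_output
    endpoint: https://my-destination.example.com/ingest
    persistent_queue:
      # Subdirectory on the hostPath volume mounted by the Helm chart;
      # keeps this destination's buffer separate from other agent data.
      path: /var/lib/edgedelta/outputbuffer/my_http_destination
      mode: error
      max_byte_size: 1GB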

mode

The mode parameter determines when data is buffered to disk. Three modes are available:

  • error (default): buffers only when the destination returns errors. Trade-off: no protection during slow responses. Recommended for reliable destinations with consistent response times.
  • backpressure: buffers when the in-memory queue reaches 80% capacity or on errors. Trade-off: slightly more disk writes during slowdowns. Recommended for most production deployments.
  • always: write-ahead-log behavior; all data is written to disk before sending. Trade-off: disk I/O on every item reduces throughput. Recommended for maximum durability requirements.

Error mode provides the minimal protection layer needed to prevent data loss when destinations temporarily fail. Without any persistent queue, a destination outage means data is lost. With error mode enabled, data is preserved on disk during failures and delivered automatically when the destination recovers.

Backpressure mode provides everything error mode offers, plus protection against slow destinations. When a destination is slow but not completely down, the agent spills data to disk and continues processing, isolating itself from the slow backend. This keeps slowness in one destination from cascading into your agent cluster.

Always mode forces the agent to write every item to disk before attempting delivery, then reads from disk for transmission. This guarantees that data survives even sudden agent crashes or restarts. Only enable always mode if you have a specific, well-understood requirement where the durability guarantee outweighs the throughput reduction.
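For most production deployments, a backpressure configuration along the following lines is a reasonable starting point; the node name, endpoint, and buffer size are placeholders to adapt:

nodes:
  - name: my_http_destination
    type: http_output
    endpoint: https://my-destination.example.com/ingest
    persistent_queue:
      path: /var/lib/edgedelta/outputbuffer
      # Spill to disk on destination errors or when the in-memory
      # queue reaches 80% capacity, rather than only on hard failures.
      mode: backpressure
      max_byte_size: 2GB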

max_byte_size

The max_byte_size parameter defines the maximum disk space the persistent buffer is allowed to use. Once this limit is reached, new incoming items are dropped.

Sizing guidance:

  • Small (1-10 logs/sec): 100MB - 500MB, covering roughly a 15-60 minute outage
  • Medium (10-100 logs/sec): 500MB - 2GB, covering roughly a 30-120 minute outage
  • Large (100+ logs/sec): 2GB - 10GB, covering roughly a 1-3 hour outage

Calculate your buffer size based on expected outage duration:

Buffer size = Average log size × Log rate × Expected outage duration

Example:
Average log size: 1KB
Log rate: 100 logs/sec
Expected outage: 1 hour

Buffer size = 1KB × 100 logs/sec × 3600 sec = 360MB
Recommended: 500MB - 1GB (with safety margin)
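Applied to the destination node from the earlier example, that estimate might translate into a persistent_queue block like this, rounding 360MB up to 1GB for headroom:

    persistent_queue:
      path: /var/lib/edgedelta/outputbuffer
      mode: error
      # 1KB × 100 logs/sec × 3600 sec = 360MB for a one-hour outage;
      # 1GB leaves roughly a 2-3x safety margin.
      max_byte_size: 1GB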

drain_rate_limit

The drain_rate_limit parameter controls the maximum items per second when draining the buffer after a destination recovers.

The default value is 0 (unlimited), meaning the buffer drains as fast as the destination accepts data.

When a destination recovers from an outage, it may still be fragile. Immediately flooding it with hours of backlogged data can trigger another failure. The drain rate limit allows gradual, controlled recovery.

  • Stable, well-provisioned destination: 0 (unlimited), to minimize recovery time
  • Shared or multi-tenant destination: 20-50% of capacity, to leave headroom for live traffic
  • Recently recovered destination: 10-25% of capacity, for a gentle ramp-up that avoids re-triggering the failure
  • Rate-limited destination (SaaS): below the API rate limit, to avoid throttling or quota exhaustion

Impact on recovery time:

Buffer size: 1GB (~1,000,000 logs at 1KB each)

At unlimited (0):  Depends on destination capacity
At 5000 items/sec: ~3.5 minutes to drain
At 1000 items/sec: ~17 minutes to drain
At 100 items/sec:  ~2.8 hours to drain
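For example, to hold a rate-limited SaaS destination to roughly 1000 items per second during recovery, the persistent_queue block from the earlier example could be extended as follows; the exact figure is illustrative, not a vendor recommendation:

    persistent_queue:
      path: /var/lib/edgedelta/outputbuffer
      mode: backpressure
      max_byte_size: 1GB
      # ~1,000,000 buffered 1KB items drain in roughly 17 minutes at this
      # rate while staying under the destination's API rate limit.
      drain_rate_limit: 1000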

strict_ordering

The strict_ordering parameter controls how items are consumed from the persistent buffer.

  • false (default): parallel workers; buffered data drains in the background; lower recovery latency because the current state is visible immediately
  • true: single-threaded processing; buffered items always drain first; higher recovery latency because the queue must drain before new data

When strict_ordering: true, the agent runs with a single processing thread and prioritizes draining buffered items first. New incoming data waits until all buffered items are processed in exact chronological order.

When strict_ordering: false (default), multiple workers process data in parallel and new data flows directly to the destination while buffered data drains in the background.

Note: When strict_ordering: true, you must set parallel_workers: 1 on the destination node. Pipeline validation fails if parallel_workers is greater than 1 because parallel processing breaks ordering guarantees.
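A sketch of a strictly ordered destination might pair strict_ordering with the required single worker as shown below; the node name, endpoint, and mode choice are illustrative:

nodes:
  - name: my_audit_destination
    type: http_output
    endpoint: https://audit.example.com/ingest
    # Required when strict_ordering is true; pipeline validation fails otherwise.
    parallel_workers: 1
    persistent_queue:
      path: /var/lib/edgedelta/outputbuffer
      mode: always
      strict_ordering: true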

When to keep default (false):

Most observability use cases benefit from the default. When a destination recovers from an outage, operators typically want to see current system state on dashboards immediately, while historical data backfills in the background.

When to enable strict ordering:

Strict ordering is primarily needed for security-focused use cases where events must arrive in exact order:

  • Stateful security streaming engines that maintain state across events
  • Audit and compliance logs with regulatory requirements for exact temporal sequence
  • State reconstruction systems that replay events to rebuild state

Verifying buffer operation

Monitor the ed.buffer.disk.bytes metric to verify buffering is working:

  1. During normal operation, this metric should be 0 (for error and backpressure modes)
  2. During destination failures, this metric increases as data buffers to disk
  3. After recovery, this metric gradually decreases as the buffer drains
  4. When drain completes, the metric returns to 0

You can view this metric in the Edge Delta UI under agent metrics.

Troubleshooting

Buffer not activating during failures

Verify your configuration:

  1. Confirm persistent_queue is configured on the destination node
  2. Check that path points to a writable directory with sufficient space
  3. For error mode, verify the destination is actually returning errors (check agent logs)
  4. For backpressure mode, verify the in-memory queue is reaching 80% capacity

Slow buffer drain

If the buffer takes longer than expected to drain:

  1. Check drain_rate_limit setting - a low value extends drain time
  2. Verify destination health - the destination must be accepting requests
  3. Check network connectivity between agent and destination
  4. Review batch_size on the destination - larger batches may improve throughput

Buffer filling to maximum

If ed.buffer.disk.bytes reaches max_byte_size and stops increasing:

  1. New data is being dropped - this is expected behavior at capacity
  2. Consider increasing max_byte_size if you expect longer outages
  3. Review destination health and recovery timeline
  4. Check available disk space on the agent host

Data not appearing after recovery

If the destination recovered but buffered data is not appearing:

  1. Allow sufficient time for drain - check the drain time calculation above
  2. Monitor ed.buffer.disk.bytes to confirm drain is progressing
  3. Check destination logs for any ingestion errors
  4. Verify strict_ordering setting - if true, new data waits for buffer to drain

Agent-level buffer settings

The agent has separate buffer settings at the pipeline level that control internal processing buffers. These are different from the destination persistent_queue:

  • item_buffer_flush_interval: interval after which internal item buffers flush their contents (default: 5s)
  • item_buffer_max_byte_limit: size limit that triggers an internal item buffer flush (default: 1MiB)

These settings control how the agent batches items internally before sending to destinations. They do not provide disk-based persistence or data loss protection - use persistent_queue for that purpose.

Note: The item_buffer_flush_interval and item_buffer_max_byte_limit settings are legacy parameters primarily used for internal agent operation. For data loss prevention during destination failures, always use the persistent_queue configuration on destination nodes.
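For reference, these settings might be declared along the following lines. The placement under a top-level settings block is an assumption for illustration; confirm the exact location in your agent version's configuration schema:

settings:
  # Flush internal item buffers every 5 seconds (the default)...
  item_buffer_flush_interval: 5s
  # ...or as soon as a buffer reaches 1MiB (the default).
  item_buffer_max_byte_limit: 1MiB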

See also