Circuit Breaker

Implement circuit breaker protections for output nodes to prevent cascading failures and protect telemetry delivery pipelines.

Overview

The Circuit Breaker feature adds resilience and fault tolerance to Edge Delta’s output nodes. When enabled, this mechanism monitors downstream destination health and automatically blocks traffic when failures occur, preventing overload, enabling recovery, and preserving data integrity.

Use circuit breakers to:

  • Isolate failure-prone destinations
  • Maintain agent stability under load or error spikes
  • Reduce retry storms and preserve critical resources
  • Enable fallback strategies (reroute, sample, drop)

How It Works

The circuit breaker wraps the output connection logic and tracks success/failure metrics in real time. If a threshold of consecutive failures is reached, the circuit transitions through the following states:

State      Behavior
CLOSED     Normal operation. Requests are sent. Failures are counted.
OPEN       Requests are blocked. Fallback strategies are triggered.
HALF-OPEN  Limited test requests allowed. Determines if recovery is possible.

State transitions are controlled by parameters such as failure_threshold, open_timeout, and half_open_max_calls.
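The state transitions can be sketched as a small state machine. The following is a minimal Python illustration, not Edge Delta's implementation: the class and method names are hypothetical, the rule that the circuit closes once every allowed test call has succeeded is an assumption, and half_open_timeout and check_interval are left out for brevity.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF-OPEN"


class CircuitBreaker:
    """Illustrative three-state breaker (hypothetical, not the agent's code)."""

    def __init__(self, failure_threshold=10, open_timeout=30.0,
                 half_open_max_calls=5, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout          # seconds
        self.half_open_max_calls = half_open_max_calls
        self.clock = clock                        # injectable for testing
        self.state = State.CLOSED
        self.failures = 0        # consecutive failures while CLOSED
        self.opened_at = 0.0     # time of the last transition to OPEN
        self.test_calls = 0      # test calls admitted while HALF-OPEN
        self.test_successes = 0  # test calls that succeeded while HALF-OPEN

    def allow_request(self):
        """Return True if a request may be sent right now."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.open_timeout:
                return False
            # open_timeout expired: start testing the destination.
            self.state = State.HALF_OPEN
            self.test_calls = 0
            self.test_successes = 0
        if self.state is State.HALF_OPEN:
            if self.test_calls >= self.half_open_max_calls:
                return False
            self.test_calls += 1
        return True

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self.test_successes += 1
            # Assumption: close once every allowed test call has succeeded.
            if self.test_successes >= self.half_open_max_calls:
                self.state = State.CLOSED
                self.failures = 0
        elif self.state is State.CLOSED:
            self.failures = 0  # a success resets the consecutive count

    def record_failure(self):
        if self.state is State.OPEN:
            return             # ignore late results while already OPEN
        if self.state is State.HALF_OPEN:
            self._open()       # a failed test call reopens the circuit
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.failures = 0
```

A caller would check allow_request() before each send and report the outcome with record_success() or record_failure(). Passing a fake clock makes the timing behavior easy to exercise in tests.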

In addition to failure-based transitions, Global Health Monitoring can proactively open circuits based on memory and queue pressure—adding resource-based protection to the error-based model.

YAML Configuration

Basic Example

outputs:
  - type: ed_gateway_output
    name: "production-gateway"
    endpoint: "gateway.example.com"
    port: 4317
    protocol: grpc
    resilience:
      circuit_breaker:
        enabled: true
        failure_threshold: 10
        open_timeout: 30s
        half_open_max_calls: 5
        half_open_timeout: 90s
        check_interval: 10s

This configuration defines an Edge Delta output node that sends telemetry data to a gateway at gateway.example.com over gRPC on port 4317. The resilience section enables a circuit breaker to protect the pipeline from repeated failures. If 10 consecutive failures occur (failure_threshold: 10), the circuit transitions to an open state for 30 seconds (open_timeout: 30s), during which traffic is blocked to prevent overload. After this timeout, the circuit enters a half-open state where up to 5 test calls are allowed (half_open_max_calls: 5) over a 90-second window (half_open_timeout: 90s) to evaluate recovery. The circuit’s state is evaluated every 10 seconds (check_interval: 10s), enabling timely transitions based on destination health.

With Global Health Monitoring

outputs:
  - type: ed_gateway_output
    name: "production-gateway"
    endpoint: "gateway.example.com"
    port: 4317
    protocol: grpc
    resilience:
      circuit_breaker:
        enabled: true
        failure_threshold: 10
        open_timeout: 30s
        half_open_max_calls: 5
        half_open_timeout: 90s
        check_interval: 10s
        global_health:
          enabled: true
          memory_check_enabled: true
          memory_threshold: 1024MB
          queue_check_enabled: true
          queue_threshold_percent: 80
          check_interval: 30s

This configuration extends the circuit breaker setup by adding global health monitoring to the ed_gateway_output node. The standard circuit breaker behavior still applies: the circuit opens after 10 consecutive failures (failure_threshold: 10) and attempts recovery through controlled test calls. In addition, this configuration monitors system-level metrics. When global_health.enabled is set to true, the node proactively trips the circuit if memory usage exceeds 1024MB (memory_threshold: 1024MB) or if the output queue reaches 80% capacity (queue_threshold_percent: 80). These checks are performed every 30 seconds (the check_interval: 30s nested under global_health, which is separate from the circuit breaker's own check_interval: 10s). This hybrid model protects both destination integrity and local system stability by combining failure-based detection with resource-aware triggers.
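The resource-based trip decision can be illustrated with a short Python sketch. It is not the agent's implementation: the GlobalHealthConfig class, the should_trip function, and the way memory and queue readings are supplied are all hypothetical. The comparisons follow the wording above: memory must exceed its threshold, while the queue trips on reaching its percentage.

```python
from dataclasses import dataclass


@dataclass
class GlobalHealthConfig:
    # Field names mirror the YAML keys; the class itself is hypothetical.
    enabled: bool = True
    memory_check_enabled: bool = True
    memory_threshold_mb: float = 1024.0
    queue_check_enabled: bool = True
    queue_threshold_percent: float = 80.0


def should_trip(cfg, memory_used_mb, queue_len, queue_capacity):
    """Return the reasons to open the circuit (empty list means healthy)."""
    reasons = []
    if not cfg.enabled:
        return reasons
    # Memory trips only when usage exceeds the threshold.
    if cfg.memory_check_enabled and memory_used_mb > cfg.memory_threshold_mb:
        reasons.append("memory")
    # Queue trips when saturation reaches the configured percentage.
    if cfg.queue_check_enabled and queue_capacity > 0:
        queue_percent = 100.0 * queue_len / queue_capacity
        if queue_percent >= cfg.queue_threshold_percent:
            reasons.append("queue")
    return reasons
```

In this sketch a periodic task would call should_trip once per global health check interval and open the circuit whenever the returned list is non-empty.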

Parameters Reference

Parameter            Description
enabled              Enables the circuit breaker mechanism.
failure_threshold    Number of consecutive failures before transitioning to OPEN.
open_timeout         Duration the circuit remains open before attempting recovery.
half_open_max_calls  Number of test calls allowed in the HALF-OPEN state.
half_open_timeout    Duration of half-open testing before resetting to OPEN or CLOSED.
check_interval       Frequency of background checks for circuit state transitions.

Global Health Parameters

Parameter                Description
global_health.enabled    Activates system-level health monitoring.
memory_check_enabled     Enables memory usage tracking.
memory_threshold         Memory usage limit before circuit trips.
queue_check_enabled      Enables queue size tracking.
queue_threshold_percent  Queue saturation percentage to trip circuit.
check_interval           Frequency of global health checks.

Circuit States

CLOSED (Healthy)

  • All requests processed normally
  • Tracks failure count
  • Transitions to OPEN when failure_threshold is reached

OPEN (Isolating)

  • Requests are blocked to prevent downstream overload
  • Fallback strategies activated (reroute, sample, drop)
  • Transitions to HALF-OPEN after open_timeout expires

HALF-OPEN (Testing)

  • Limited test traffic sent
  • If test calls succeed, circuit closes
  • If test calls fail, circuit returns to OPEN

Fallback Strategy

If the circuit transitions to OPEN, Edge Delta executes a tiered fallback sequence to ensure system stability and best-effort data delivery:

  1. Reroute: The system attempts to deliver the data to a different healthy destination. This includes up to three retry attempts with alternate endpoints, when available.
     • Success → data is delivered.
     • Failure → proceeds to health-based sampling.
  2. Health-Based Sampling: The system dynamically calculates a sampling rate based on the overall health of available destinations. A higher rate is applied when more destinations are healthy. Sampling ensures that representative data is preserved without overwhelming the system.
     • Success → sampled data is transmitted.
     • Failure → proceeds to final strategy.
  3. Graceful Drop: As a last resort, data is dropped in a controlled manner to protect system resources. This avoids infinite retry loops and ensures the agent remains responsive under sustained failure conditions.

This fallback sequence prioritizes delivery and observability continuity while ensuring the system does not degrade under load or repeated downstream failures.
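The tiered sequence can be sketched in Python as follows. This is an illustration under stated assumptions, not the agent's code: the send and send_sampled callables are hypothetical, and the sampling rule (keep a fraction of items equal to the share of healthy destinations) is one simple way to make the rate rise with destination health. The documentation does not specify the actual formula.

```python
import random

MAX_REROUTE_ATTEMPTS = 3  # "up to three retry attempts" from the text above


def sampling_rate(healthy, total):
    """Assumed rule: the rate equals the share of healthy destinations."""
    return healthy / total if total else 0.0


def handle_blocked(item, alternates, healthy, total, send, send_sampled,
                   rng=random.random):
    """Run reroute -> health-based sampling -> graceful drop for one item.

    send(endpoint, item) and send_sampled(item) are hypothetical callables
    that return True on success. Returns "rerouted", "sampled", or "dropped".
    """
    # 1. Reroute: try at most three alternate endpoints, in order.
    for endpoint in alternates[:MAX_REROUTE_ATTEMPTS]:
        if send(endpoint, item):
            return "rerouted"
    # 2. Health-based sampling: keep the item with probability equal to
    #    the sampling rate, then try to transmit it.
    if rng() < sampling_rate(healthy, total) and send_sampled(item):
        return "sampled"
    # 3. Graceful drop: give up on this item without retrying again.
    return "dropped"
```

Because each tier runs at most a bounded number of attempts, the function always returns, which matches the goal of avoiding infinite retry loops.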

Use Cases

Use circuit breakers in environments where:

  • Downstream systems are unstable or intermittently unreachable
  • Telemetry volumes are bursty or unpredictable
  • You require high availability without risking pipeline collapse
  • System memory or queue resources must be actively guarded

Troubleshooting

Symptom                      Solution
Frequent circuit openings    Increase failure_threshold or adjust timeout values
Slow recovery                Reduce half_open_timeout and increase half_open_max_calls
Premature trips from memory  Raise memory_threshold or tune global health check intervals

Additional Resources