Circuit Breaker
Overview
The Circuit Breaker feature adds resilience and fault tolerance to Edge Delta’s output nodes. When enabled, it monitors the health of the downstream destination and automatically blocks traffic once failures reach a configured threshold. This prevents overload, gives the destination time to recover, and preserves data integrity.
Use circuit breakers to:
- Isolate failure-prone destinations
- Maintain agent stability under load or error spikes
- Reduce retry storms and preserve critical resources
- Enable fallback strategies (reroute, sample, drop)
How It Works
The circuit breaker wraps the output connection logic and tracks success/failure metrics in real time. If a threshold of consecutive failures is reached, the circuit transitions through the following states:
State | Behavior |
---|---|
CLOSED | Normal operation. Requests are sent. Failures are counted. |
OPEN | Requests are blocked. Fallback strategies are triggered. |
HALF-OPEN | Limited test requests allowed. Determines if recovery is possible. |
State transitions are controlled by parameters such as failure_threshold, open_timeout, and half_open_max_calls.
In addition to failure-based transitions, Global Health Monitoring can proactively open circuits based on memory and queue pressure—adding resource-based protection to the error-based model.
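The state transitions described above can be sketched as a small state machine. This is an illustrative Python model, not Edge Delta's implementation: the class and method names are hypothetical, half_open_timeout is not modeled, and the rule for closing the circuit (every allowed test call must succeed) is an assumption, since the exact success criterion is not stated here.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF-OPEN"


class CircuitBreaker:
    """Illustrative three-state breaker (hypothetical, for explanation only)."""

    def __init__(self, failure_threshold=10, open_timeout=30.0,
                 half_open_max_calls=5, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout            # seconds
        self.half_open_max_calls = half_open_max_calls
        self.clock = clock                          # injectable for testing
        self.state = State.CLOSED
        self.failures = 0         # consecutive failures while CLOSED
        self.opened_at = 0.0
        self.test_calls = 0       # test calls admitted while HALF-OPEN
        self.test_successes = 0

    def allow_request(self):
        """Return True if a request may be sent right now."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.open_timeout:
                return False
            # open_timeout has expired: begin testing the destination.
            self.state = State.HALF_OPEN
            self.test_calls = 0
            self.test_successes = 0
        if self.state is State.HALF_OPEN:
            if self.test_calls >= self.half_open_max_calls:
                return False
            self.test_calls += 1
        return True

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self.test_successes += 1
            # Assumption: close once every allowed test call has succeeded.
            if self.test_successes >= self.half_open_max_calls:
                self.state = State.CLOSED
                self.failures = 0
        elif self.state is State.CLOSED:
            self.failures = 0     # a success ends the consecutive-failure streak

    def record_failure(self):
        if self.state is State.OPEN:
            return                # already isolating; nothing to count
        if self.state is State.HALF_OPEN:
            self._open()          # a failed test call reopens the circuit
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
```

A caller would check allow_request() before each send and report the outcome with record_success() or record_failure(). Passing a fake clock makes the timeout behavior easy to exercise without waiting.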
YAML Configuration
Basic Example
outputs:
  - type: ed_gateway_output
    name: "production-gateway"
    endpoint: "gateway.example.com"
    port: 4317
    protocol: grpc
    resilience:
      circuit_breaker:
        enabled: true
        failure_threshold: 10
        open_timeout: 30s
        half_open_max_calls: 5
        half_open_timeout: 90s
        check_interval: 10s
This configuration defines an Edge Delta output node that sends telemetry data to a gateway at gateway.example.com over gRPC on port 4317. The resilience section enables a circuit breaker to protect the pipeline from repeated failures. If 10 consecutive failures occur (failure_threshold: 10), the circuit transitions to an open state for 30 seconds (open_timeout: 30s), during which traffic is blocked to prevent overload. After this timeout, the circuit enters a half-open state where up to 5 test calls are allowed (half_open_max_calls: 5) over a 90-second window (half_open_timeout: 90s) to evaluate recovery. The circuit’s state is evaluated every 10 seconds (check_interval: 10s), enabling timely transitions based on destination health.
With Global Health Monitoring
outputs:
  - type: ed_gateway_output
    name: "production-gateway"
    endpoint: "gateway.example.com"
    port: 4317
    protocol: grpc
    resilience:
      circuit_breaker:
        enabled: true
        failure_threshold: 10
        open_timeout: 30s
        half_open_max_calls: 5
        half_open_timeout: 90s
        check_interval: 10s
        global_health:
          enabled: true
          memory_check_enabled: true
          memory_threshold: 1024MB
          queue_check_enabled: true
          queue_threshold_percent: 80
          check_interval: 30s
This configuration extends the circuit breaker setup by adding global health monitoring to the ed_gateway_output node. In addition to the standard circuit breaker behavior, where the circuit opens after 10 consecutive failures (failure_threshold: 10) and attempts recovery through controlled test calls, this configuration also monitors system-level metrics. When global_health.enabled is set to true, the node proactively trips the circuit if memory usage exceeds 1024MB (memory_threshold: 1024MB) or if the output queue reaches 80% capacity (queue_threshold_percent: 80). These checks are performed every 30 seconds (check_interval: 30s). This hybrid model protects both destination integrity and local system stability by combining error-rate detection with resource-aware triggers.
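The resource-based trigger can be pictured as a simple predicate evaluated on each health check. The sketch below is a hypothetical Python model of that decision, not the agent's actual code: the dataclass, the function name, and the way memory and queue figures are passed in are all assumptions. It follows the wording above, tripping when memory usage exceeds the threshold or when queue fill reaches the configured percentage.

```python
from dataclasses import dataclass


@dataclass
class GlobalHealthConfig:
    """Mirrors the global_health keys; memory is expressed in MB here."""
    enabled: bool = True
    memory_check_enabled: bool = True
    memory_threshold_mb: int = 1024
    queue_check_enabled: bool = True
    queue_threshold_percent: int = 80


def should_trip(cfg, memory_used_mb, queue_len, queue_capacity):
    """Return the list of reasons (possibly empty) to proactively open the circuit."""
    reasons = []
    if not cfg.enabled:
        return reasons
    # Memory: trip only when usage exceeds the threshold.
    if cfg.memory_check_enabled and memory_used_mb > cfg.memory_threshold_mb:
        reasons.append("memory")
    # Queue: trip when fill level reaches the configured percentage.
    if cfg.queue_check_enabled and queue_capacity > 0:
        fill_percent = 100.0 * queue_len / queue_capacity
        if fill_percent >= cfg.queue_threshold_percent:
            reasons.append("queue")
    return reasons
```

For example, with the defaults above, should_trip(GlobalHealthConfig(), 900, 80, 100) reports a queue trip, because 80 of 100 slots is exactly the 80% threshold while memory is still below its limit.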
Parameters Reference
Parameter | Description |
---|---|
enabled | Enables the circuit breaker mechanism. |
failure_threshold | Number of consecutive failures before transitioning to OPEN. |
open_timeout | Duration the circuit remains open before attempting recovery. |
half_open_max_calls | Number of test calls allowed in the HALF-OPEN state. |
half_open_timeout | Duration of half-open testing before resetting to OPEN or CLOSED. |
check_interval | Frequency of background checks for circuit state transitions. |
Global Health Parameters
Parameter | Description |
---|---|
global_health.enabled | Activates system-level health monitoring. |
memory_check_enabled | Enables memory usage tracking. |
memory_threshold | Memory usage limit before circuit trips. |
queue_check_enabled | Enables queue size tracking. |
queue_threshold_percent | Queue saturation percentage to trip circuit. |
check_interval | Frequency of global health checks. |
Circuit States
CLOSED (Healthy)
- All requests processed normally
- Tracks failure count
- Transitions to OPEN when failure_threshold is reached
OPEN (Isolating)
- Requests are blocked to prevent downstream overload
- Fallback strategies activated (reroute, sample, drop)
- Transitions to HALF-OPEN after open_timeout expires
HALF-OPEN (Testing)
- Limited test traffic sent
- If test calls succeed, circuit closes
- If test calls fail, circuit returns to OPEN
Fallback Strategy
If the circuit transitions to OPEN, Edge Delta executes a tiered fallback sequence to ensure system stability and best-effort data delivery:
- Reroute: The system attempts to deliver the data to a different healthy destination. This includes up to three retry attempts with alternate endpoints, when available.
  - Success → data is delivered.
  - Failure → proceeds to health-based sampling.
- Health-Based Sampling: The system dynamically calculates a sampling rate based on the overall health of available destinations. A higher rate is applied when more destinations are healthy. Sampling ensures that representative data is preserved without overwhelming the system.
  - Success → sampled data is transmitted.
  - Failure → proceeds to final strategy.
- Graceful Drop: As a last resort, data is dropped in a controlled manner to protect system resources. This avoids infinite retry loops and ensures the agent remains responsive under sustained failure conditions.
This fallback sequence prioritizes delivery and observability continuity while ensuring the system does not degrade under load or repeated downstream failures.
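The tiered sequence can be sketched as a single function that tries each strategy in order. This Python example is a rough illustration under stated assumptions: the callables send and send_sampled are hypothetical stand-ins for the delivery paths, and the sampling rule (rate equals the fraction of healthy destinations) is one plausible reading of "a higher rate is applied when more destinations are healthy", not a documented formula.

```python
import random


def sampling_rate(healthy, total):
    """Assumed rule: sample in proportion to the share of healthy destinations."""
    return healthy / total if total else 0.0


def deliver_with_fallback(record, alternates, send, send_sampled,
                          healthy, total, max_reroutes=3, rng=random.random):
    """Return the tier that handled the record: 'rerouted', 'sampled', or 'dropped'.

    send(endpoint, record) and send_sampled(record) are hypothetical callables
    that return True on successful delivery.
    """
    # Tier 1: reroute, trying at most max_reroutes alternate endpoints.
    for endpoint in alternates[:max_reroutes]:
        if send(endpoint, record):
            return "rerouted"

    # Tier 2: health-based sampling; keep the record with probability `rate`.
    rate = sampling_rate(healthy, total)
    if rng() < rate and send_sampled(record):
        return "sampled"

    # Tier 3: graceful drop; give up without retrying further.
    return "dropped"
```

Returning the tier name rather than raising keeps the caller's loop simple and makes it straightforward to count how much traffic was rerouted, sampled, or dropped while the circuit was open.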
Use Cases
Use circuit breakers in environments where:
- Downstream systems are unstable or intermittently unreachable
- Telemetry volumes are bursty or unpredictable
- You require high availability without risking pipeline collapse
- System memory or queue resources must be actively guarded
Troubleshooting
Symptom | Solution |
---|---|
Frequent circuit openings | Increase failure_threshold or adjust timeout values |
Slow recovery | Reduce half_open_timeout and increase half_open_max_calls |
Premature trips from memory | Raise memory_threshold or tune global health check intervals |