Monitoring and Visibility
Overview
Edge Delta provides continuous monitoring of agent health, throughput, and performance across your entire infrastructure. Built-in telemetry inputs capture operational data from every agent, enabling real-time detection of issues and proactive capacity planning.
Agent Health Monitoring
Edge Delta agents emit health telemetry that provides visibility into operational status. Built-in health inputs include:
- ed_component_health: Component-level health status
- ed_node_health: Node-level health metrics
- ed_agent_stats: Agent performance statistics
- ed_pipeline_io_stats: Input/output throughput data
Each agent sends a heartbeat every minute to the Edge Delta backend, enabling real-time detection of connectivity issues, crashes, or configuration problems.
Health Indicators
Health indicators show agent state (see the classification sketch after the table):
| State | Description |
|---|---|
| Healthy | Agent is running and processing data normally |
| Warning | Performance degradation or partial failures detected |
| Critical | Agent is down or experiencing severe issues |
| Unknown | No recent heartbeat received |
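As a rough illustration of how heartbeat age and error signals can map onto these states, here is a minimal Python sketch. The thresholds are illustrative assumptions, not Edge Delta defaults, and the function is hypothetical rather than part of any Edge Delta API.

```python
from datetime import datetime, timedelta, timezone

# Illustrative thresholds -- assumptions for this sketch, not Edge Delta defaults.
HEARTBEAT_INTERVAL = timedelta(minutes=1)   # agents heartbeat every minute
UNKNOWN_AFTER = timedelta(minutes=5)        # no heartbeat at all -> Unknown
WARNING_ERROR_RATE = 0.001                  # 0.1% errors -> Warning
CRITICAL_ERROR_RATE = 0.05                  # 5% errors -> Critical

def classify_agent(last_heartbeat: datetime, error_rate: float) -> str:
    """Map heartbeat age and error rate onto the four health states above."""
    age = datetime.now(timezone.utc) - last_heartbeat
    if age > UNKNOWN_AFTER:
        return "Unknown"            # no recent heartbeat received
    if age > 2 * HEARTBEAT_INTERVAL or error_rate >= CRITICAL_ERROR_RATE:
        return "Critical"           # agent down or experiencing severe issues
    if error_rate >= WARNING_ERROR_RATE:
        return "Warning"            # performance degradation or partial failures
    return "Healthy"
```

Classification along these lines is what lets a missed heartbeat surface as Unknown within a few minutes rather than going unnoticed.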
Pipeline Dashboard
The Pipeline Dashboard provides:
- Pipeline overview with visual status of all pipelines at a glance
- Individual agent status with deployment details
- Deployment status to track agent versions and configuration state
- Heartbeat monitoring with minute-by-minute agent availability checks
Throughput Monitoring
Track data volume and processing rates across all pipeline stages:
| Metric | Description | Use Case |
|---|---|---|
| Input Rate | Events/sec ingested by sources | Capacity planning |
| Processing Rate | Events/sec through processors | Performance tuning |
| Output Rate | Events/sec sent to destinations | Destination health |
| Drop Rate | Events/sec filtered or dropped | Filter effectiveness |
| Backpressure | Queue depth and latency | Flow control |
Pipeline I/O statistics show the flow through each stage. For example, a production logs pipeline might show (the sketch after this list shows how the percentages are derived):
- 45,000 events/sec input
- 12,000 events/sec filtered (26.7% of input)
- 33,000 events/sec processed
- 28,000 events/sec enriched
- 28,000 events/sec output (62.2% of input, a 37.8% reduction)
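The percentages come straight from the stage counts. A minimal sketch of that arithmetic in Python, using the figures above (the function and its names are illustrative, not Edge Delta node types):

```python
def stage_percentages(input_eps: float, filtered_eps: float, output_eps: float):
    """Derive filter and reduction percentages from per-stage event rates."""
    filter_pct = 100 * filtered_eps / input_eps           # share of input dropped by filters
    retained_pct = 100 * output_eps / input_eps           # share of input reaching destinations
    reduction_pct = 100 * (input_eps - output_eps) / input_eps
    return filter_pct, retained_pct, reduction_pct

# Figures from the example pipeline above.
print(stage_percentages(45_000, 12_000, 28_000))
# -> roughly (26.7, 62.2, 37.8) after rounding
```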
These metrics enable teams to:
- Identify bottlenecks in processing pipelines
- Validate filter effectiveness and data reduction
- Detect anomalies in traffic patterns
- Optimize resource allocation
Performance Metrics
Monitor resource utilization and processing efficiency. Agent performance metrics include:
- CPU usage: Per-agent utilization and trends
- Memory usage: Heap allocation and garbage collection
- Disk I/O: Buffer usage for output queuing
- Network: Egress bandwidth to destinations
- Latency: End-to-end processing latency by node
Each processor node reports individual performance metrics including:
- Events processed per second
- Processing latency (P50, P95, P99)
- Error rate and retry statistics
- Cache hit rates for stateful processors
For example, an agent might report CPU usage of 245m/500m (49%), memory usage of 512MB/1GB (51%), throughput of 12,500 events/sec, P95 latency of 45ms, P99 latency of 120ms, and an error rate of 0.02%.
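A minimal sketch of how latency percentiles and utilization figures like these can be derived from raw samples, assuming you export them to your own tooling. The nearest-rank percentile method and the sample values are assumptions for illustration:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile over raw latency samples (milliseconds)."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

def utilization(used: float, limit: float) -> float:
    """Resource usage as a percentage of the configured limit."""
    return 100 * used / limit

latencies_ms = [12, 18, 25, 31, 45, 47, 52, 80, 95, 120]           # example samples
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))  # -> 45 120
print(utilization(245, 500))                                        # CPU: 245m / 500m -> 49.0
```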
Monitoring Strategy Best Practices
Establish observability practices that scale with your pipelines:
Establish Baselines
- Measure normal throughput and latency
- Track resource utilization patterns
- Document expected behavior (see the baseline sketch after this list)
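One hypothetical way to capture a baseline, assuming you collect throughput samples in your own tooling: record the mean and standard deviation of normal traffic so later deviations have something to be measured against. The sample values are placeholders.

```python
from statistics import mean, stdev

def build_baseline(samples: list[float]) -> dict:
    """Summarize normal throughput (events/sec) as a mean +/- stddev baseline."""
    return {"mean": mean(samples), "stdev": stdev(samples)}

def deviates(value: float, baseline: dict, sigmas: float = 3.0) -> bool:
    """Flag values more than `sigmas` standard deviations from the baseline."""
    return abs(value - baseline["mean"]) > sigmas * baseline["stdev"]

# A week of hourly throughput readings would go here; these are placeholders.
baseline = build_baseline([44_800, 45_100, 45_300, 44_950, 45_200])
print(deviates(52_000, baseline))   # True -> worth investigating
```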
Define SLOs
Typical targets for pipeline SLOs include (a compliance-check sketch follows this list):
- 99.9% agent availability
- P99 processing latency under 200ms
- Error rate below 0.1%
- Zero data loss
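A minimal sketch of checking measured values against targets like these. The field names and measured numbers are assumptions for illustration, not an Edge Delta API:

```python
# Targets from the list above; thresholds expressed as fractions where applicable.
SLO_TARGETS = {
    "availability": 0.999,      # 99.9% agent availability
    "p99_latency_ms": 200,      # P99 processing latency under 200 ms
    "error_rate": 0.001,        # error rate below 0.1%
}

def slo_violations(measured: dict) -> list[str]:
    """Return the names of any SLOs the measured values violate."""
    violations = []
    if measured["availability"] < SLO_TARGETS["availability"]:
        violations.append("availability")
    if measured["p99_latency_ms"] > SLO_TARGETS["p99_latency_ms"]:
        violations.append("p99_latency_ms")
    if measured["error_rate"] > SLO_TARGETS["error_rate"]:
        violations.append("error_rate")
    return violations

print(slo_violations({"availability": 0.9995, "p99_latency_ms": 240, "error_rate": 0.0002}))
# -> ['p99_latency_ms']
```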
Alert on Trends
Reduce alert fatigue by alerting on sustained trends rather than momentary spikes (see the sketch after this list):
- Use rate-of-change alerts
- Apply moving averages
- Set appropriate thresholds
- Configure meaningful alert windows
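A minimal sketch of trend-based alerting, assuming you evaluate alerts over a moving window in your own tooling. The window size and 25% threshold are illustrative: comparing the recent window's average to the preceding window's average smooths out single spikes while still catching sustained drift.

```python
from collections import deque

class TrendAlert:
    """Fire when the recent moving average drifts too far from the prior window.

    window: samples per comparison window (two consecutive windows are kept)
    max_change: allowed relative change between the two window averages
    """

    def __init__(self, window: int = 10, max_change: float = 0.25):
        self.window = window
        self.samples = deque(maxlen=2 * window)
        self.max_change = max_change

    def observe(self, value: float) -> bool:
        """Record a sample; return True when the smoothed trend breaches the threshold."""
        self.samples.append(value)
        if len(self.samples) < 2 * self.window:
            return False                          # not enough history yet
        older = list(self.samples)[: self.window]
        recent = list(self.samples)[self.window :]
        older_avg = sum(older) / self.window
        recent_avg = sum(recent) / self.window
        return older_avg > 0 and abs(recent_avg - older_avg) / older_avg > self.max_change

alert = TrendAlert(window=10, max_change=0.25)
for eps in [45_000] * 20 + [90_000] * 5:          # steady traffic, then a sustained jump
    fired = alert.observe(eps)
print(fired)                                       # True: the sustained shift trips the alert
```

A single spike barely moves the window average, so it does not fire; a sustained shift does, which is the behavior the bullet points above are aiming for.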
Related Documentation
- Pipeline Dashboard - View pipeline health and status
- Self Telemetry Source - Configure agent self-telemetry
- Reducing Agent Resource Consumption - Performance tuning
- Flow Control - Manage data volume dynamically
- Anomaly Detection - Detect anomalies in telemetry patterns