Monitoring and Visibility

Monitor agent health, throughput, and performance metrics across your Edge Delta pipelines and agents.

Overview

Edge Delta provides continuous monitoring of agent health, throughput, and performance across your entire infrastructure. Built-in telemetry inputs capture operational data from every agent, enabling real-time detection of issues and proactive capacity planning.

Agent Health Monitoring

Edge Delta agents emit health telemetry that provides visibility into operational status. Built-in health inputs include:

  • ed_component_health: Component-level health status
  • ed_node_health: Node-level health metrics
  • ed_agent_stats: Agent performance statistics
  • ed_pipeline_io_stats: Input/output throughput data

Each agent sends a heartbeat to the Edge Delta backend every minute, enabling near-real-time detection of connectivity issues, crashes, or configuration problems.

Health Indicators

Health indicators show agent state:

State | Description
Healthy | Agent is running and processing data normally
Warning | Performance degradation or partial failures detected
Critical | Agent is down or experiencing severe issues
Unknown | No recent heartbeat received
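
As an illustration of how the Unknown state relates to the one-minute heartbeat, here is a minimal sketch. It is not Edge Delta's implementation: the function names and the two-missed-beats tolerance are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL = timedelta(minutes=1)  # from the text: one heartbeat per minute
MISSED_BEATS_ALLOWED = 2  # assumption: tolerate two missed beats before "Unknown"


def is_heartbeat_stale(last_heartbeat: datetime, now: datetime) -> bool:
    """Return True when the last heartbeat is older than the allowed window."""
    return now - last_heartbeat > HEARTBEAT_INTERVAL * MISSED_BEATS_ALLOWED


def display_state(reported_state: str, last_heartbeat: datetime, now: datetime) -> str:
    """Show "Unknown" when the heartbeat is stale, else keep the reported state."""
    if is_heartbeat_stale(last_heartbeat, now):
        return "Unknown"
    return reported_state


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(display_state("Healthy", now - timedelta(seconds=90), now))
print(display_state("Healthy", now - timedelta(minutes=5), now))
```

The tolerance trades detection speed against false alarms: a smaller value flags a silent agent sooner but is more sensitive to a single delayed heartbeat.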

Pipeline Dashboard

The Pipeline Dashboard provides:

  • Pipeline overview with visual status of all pipelines at a glance
  • Individual agent status with deployment details
  • Deployment status to track agent versions and configuration state
  • Heartbeat monitoring with minute-by-minute agent availability checks

Throughput Monitoring

Track data volume and processing rates across all pipeline stages:

Metric | Description | Use Case
Input Rate | Events/sec ingested by sources | Capacity planning
Processing Rate | Events/sec through processors | Performance tuning
Output Rate | Events/sec sent to destinations | Destination health
Drop Rate | Events/sec filtered or dropped | Filter effectiveness
Backpressure | Queue depth and latency | Flow control

Pipeline I/O statistics show the flow through each stage. For example, a production logs pipeline might show:

  • 45,000 events/sec input
  • 12,000 events/sec filtered (26.7%)
  • 33,000 events/sec processed
  • 28,000 events/sec enriched
  • 28,000 events/sec output (62.2% of input, a 37.8% reduction)
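
The percentages in this example can be checked with a few lines of arithmetic. The sketch below uses a hypothetical helper, `stage_summary`, which is not part of any Edge Delta API. It shows that an output of 28,000 events/sec is 62.2% of the 45,000 events/sec input, which corresponds to a 37.8% volume reduction.

```python
def stage_summary(input_rate: float, filtered_rate: float, output_rate: float) -> dict:
    """Express filtered and output rates as percentages of the input rate."""
    if input_rate <= 0:
        raise ValueError("input_rate must be positive")
    filtered_pct = 100 * filtered_rate / input_rate
    retained_pct = 100 * output_rate / input_rate
    return {
        "filtered_pct": round(filtered_pct, 1),
        "retained_pct": round(retained_pct, 1),
        # Reduction is the share of input volume that never reaches the output.
        "reduction_pct": round(100 - retained_pct, 1),
    }


summary = stage_summary(input_rate=45_000, filtered_rate=12_000, output_rate=28_000)
print(summary)
# {'filtered_pct': 26.7, 'retained_pct': 62.2, 'reduction_pct': 37.8}
```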

These metrics enable teams to:

  • Identify bottlenecks in processing pipelines
  • Validate filter effectiveness and data reduction
  • Detect anomalies in traffic patterns
  • Optimize resource allocation

Performance Metrics

Monitor resource utilization and processing efficiency. Agent performance metrics include:

  • CPU usage: Per-agent utilization and trends
  • Memory usage: Heap allocation and garbage collection
  • Disk I/O: Buffer usage for output queuing
  • Network: Egress bandwidth to destinations
  • Latency: End-to-end processing latency by node

Each processor node reports individual performance metrics including:

  • Events processed per second
  • Processing latency (P50, P95, P99)
  • Error rate and retry statistics
  • Cache hit rates for stateful processors

For example, an agent might show CPU at 245m/500m (49%), memory at 512MB/1GB (51%), processing at 12,500 events/sec, latency at P95=45ms and P99=120ms, and error rate at 0.02%.
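
To make figures such as P95 and the CPU percentage concrete, the following sketch computes a nearest-rank percentile over latency samples and a utilization ratio. Nearest-rank is one common percentile definition and is an assumption here; the source does not say how Edge Delta computes its percentiles. The function names are hypothetical.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def utilization_pct(used: float, limit: float) -> float:
    """Resource usage as a percentage of its limit, rounded to one decimal place."""
    return round(100 * used / limit, 1)


latencies_ms = list(range(1, 101))  # illustrative samples: 1 ms through 100 ms
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95), percentile(latencies_ms, 99))
print(utilization_pct(245, 500))  # CPU at 245m of a 500m limit
```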

Monitoring Strategy Best Practices

Establish observability practices that scale with your pipelines:

Establish Baselines

  • Measure normal throughput and latency
  • Track resource utilization patterns
  • Document expected behavior

Define SLOs

Typical targets for pipeline SLOs include:

  • 99.9% agent availability
  • P99 processing latency under 200ms
  • Error rate below 0.1%
  • Zero data loss
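
A short sketch shows what these targets mean in practice. It converts an availability target into a downtime budget and checks measured values against the four targets above. The 30-day window, the function names, and the sample measurements are assumptions for illustration.

```python
def allowed_downtime_minutes(availability_target_pct: float, window_days: int = 30) -> float:
    """Downtime budget implied by an availability target over a window (30 days assumed)."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_target_pct / 100)


def check_slos(availability_pct: float, p99_latency_ms: float,
               error_rate_pct: float, events_lost: int) -> dict:
    """Compare measured values against the targets listed in the text."""
    return {
        "availability": availability_pct >= 99.9,
        "p99_latency": p99_latency_ms < 200,
        "error_rate": error_rate_pct < 0.1,
        "data_loss": events_lost == 0,
    }


# A 99.9% target over 30 days leaves roughly 43 minutes of downtime.
print(round(allowed_downtime_minutes(99.9), 1))

# Hypothetical measurements for one agent.
print(check_slos(availability_pct=99.95, p99_latency_ms=120, error_rate_pct=0.02, events_lost=0))
```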

Alert on Trends

Reduce alert fatigue by alerting on trends rather than spikes:

  • Use rate-of-change alerts
  • Apply moving averages
  • Set appropriate thresholds
  • Configure meaningful alert windows
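
One way to combine moving averages with a rate-of-change threshold is sketched below. It is an illustrative approach, not an Edge Delta feature: the `TrendAlert` class, the window of 5 samples, and the 20% threshold are all assumptions. The alert compares the mean of the most recent window with the mean of the window before it, so a single spike is smoothed out while a sustained shift triggers.

```python
from collections import deque
from statistics import fmean


class TrendAlert:
    """Fire when the recent moving average shifts by more than a relative threshold."""

    def __init__(self, window: int = 5, max_relative_change: float = 0.2):
        self.window = window
        self.max_relative_change = max_relative_change
        # Hold two consecutive windows: the previous one and the most recent one.
        self.samples = deque(maxlen=2 * window)

    def observe(self, value: float) -> bool:
        """Record a sample and return True if the trend alert should fire."""
        self.samples.append(value)
        if len(self.samples) < 2 * self.window:
            return False  # not enough history yet
        values = list(self.samples)
        previous = fmean(values[: self.window])
        recent = fmean(values[self.window:])
        if previous == 0:
            return False  # relative change is undefined against a zero baseline
        return abs(recent - previous) / previous > self.max_relative_change


spike = TrendAlert()
spike_alerts = [spike.observe(v) for v in [100] * 10 + [150] + [100] * 10]

shift = TrendAlert()
shift_alerts = [shift.observe(v) for v in [100] * 10 + [200] * 5]

print(any(spike_alerts), any(shift_alerts))
```

A longer window smooths more noise but delays detection, so the window and threshold are best chosen against the baselines measured for each pipeline.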