Fleet Management

Comprehensive monitoring, audit, and control capabilities for managing Edge Delta agents and pipelines at scale across your entire fleet.

Overview

Edge Delta's fleet management capabilities enable teams to monitor, control, and audit telemetry pipelines and agents across their entire infrastructure. Fleet management delivers deep visibility into throughput, performance, and operational health, enabling proactive flow control, intelligent alerting, and complete audit trails for compliance and troubleshooting.

Note: In Edge Delta, a pipeline refers to a group of deployed agents that share a common configuration—such as processor pods, compactor pods in Kubernetes, or agent binaries on VMs. The term “fleet” in this document is used semantically to describe managing multiple pipelines and agents at scale. For more on Edge Delta’s architecture, see Edge Delta Architecture.

Key capabilities include real-time monitoring of agent health and pipeline performance, throughput and data flow metrics across all nodes, integrated audit trails with CI/CD and change history, flow control mechanisms for managing data volume, centralized alerting and anomaly detection, and configuration versioning and rollback.

Fleet Architecture

Cloud Fleets and Edge Fleets

Edge Delta organizes agents into fleets that provide logical grouping and unified management.

  • Edge Fleets are deployed close to data sources (Kubernetes clusters, VMs, containers). They perform local processing, filtering, and aggregation, reduce network egress and centralized processing load, and scale horizontally based on local workload.
  • Cloud Fleets are deployed in centralized locations (gateway pattern). They handle cross-fleet aggregation and analysis, provide unified routing to destinations, and enable organization-wide policies. See Cloud Pipelines for configuration details.

This two-tier architecture enables both edge intelligence and centralized control, giving teams the flexibility to optimize for latency, cost, and operational requirements.

Learn more: How Fleets Work with Telemetry Pipelines

Monitoring and Visibility

Agent Health Monitoring

Edge Delta provides continuous monitoring of agent health across your entire fleet. Built-in health inputs include ed_component_health for component-level health status, ed_node_health for node-level health metrics, ed_agent_stats for agent performance statistics, and ed_pipeline_io_stats for input/output throughput data.
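As an illustrative sketch, these built-in health inputs could be declared as source nodes in a pipeline configuration. The node layout below is an assumption modeled on the other YAML examples in this document, not a verified schema:

```yaml
# Hypothetical sketch: surfacing built-in health inputs as pipeline sources.
# The type values mirror the inputs named above; the surrounding node
# structure is assumed, so check your agent's configuration reference.
nodes:
  - name: component_health
    type: ed_component_health   # component-level health status
  - name: node_health
    type: ed_node_health        # node-level health metrics
  - name: agent_stats
    type: ed_agent_stats        # agent performance statistics
  - name: io_stats
    type: ed_pipeline_io_stats  # input/output throughput data
```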

The health dashboard provides a fleet overview with visual status of all fleets at a glance, individual agent status with deployment details, deployment status to track agent versions and configuration state, and heartbeat monitoring with minute-by-minute agent availability checks.

Each agent sends a heartbeat every minute to the Edge Delta backend, enabling real-time detection of connectivity issues, crashes, or configuration problems. The dashboard aggregates this data to provide instant visibility into fleet-wide health.

Health indicators show agent state: Healthy means the agent is running and processing data normally, Warning indicates performance degradation or partial failures, Critical means the agent is down or experiencing severe issues, and Unknown means no recent heartbeat was received.

View your fleet: Pipeline Dashboard

Throughput Monitoring

Track data volume and processing rates across all pipeline stages:

| Metric | Description | Use Case |
| --- | --- | --- |
| Input Rate | Events/sec ingested by sources | Capacity planning |
| Processing Rate | Events/sec through processors | Performance tuning |
| Output Rate | Events/sec sent to destinations | Destination health |
| Drop Rate | Events/sec filtered or dropped | Filter effectiveness |
| Backpressure | Queue depth and latency | Flow control |

Pipeline I/O statistics show the flow through each stage. For example, a production logs pipeline might show 45,000 events/sec input, 12,000 events/sec filtered (26.7%), 33,000 events/sec processed, 28,000 events/sec enriched, and 28,000 events/sec output, which is 62.2% of the input volume (a 37.8% overall reduction).

These metrics enable teams to:

  • Identify bottlenecks in processing pipelines
  • Validate filter effectiveness and data reduction
  • Detect anomalies in traffic patterns
  • Optimize resource allocation

Performance Metrics

Monitor resource utilization and processing efficiency. Agent performance metrics include:

  • CPU usage (per-agent utilization and trends)
  • Memory usage (heap allocation and garbage collection)
  • Disk I/O (buffer usage for output queuing)
  • Network (egress bandwidth to destinations)
  • Latency (end-to-end processing latency by node)

Each processor node reports individual performance metrics including events processed per second, processing latency (P50, P95, P99), error rate and retry statistics, and cache hit rates for stateful processors.

For example, an agent might show CPU at 245m/500m (49%), memory at 512MB/1GB (50%), processing at 12,500 events/sec, latency at P95=45ms and P99=120ms, and error rate at 0.02%.

Flow Control

Dynamic Sampling and Data Quotas

Edge Delta provides intelligent flow control that balances full-fidelity data routing during incidents with cost-effective sampling during normal operations. Rather than static sampling rates, flow control adjusts data volume dynamically based on real-time conditions, enabling teams to meet both cost objectives and troubleshooting requirements.

Flow control enables teams to automatically send 100% of data during alerts or incidents while maintaining aggressive sampling (1-10%) during steady-state operations. This approach reduces observability costs by up to 95% while preserving complete context when troubleshooting matters most.

How Dynamic Rate Sampling Works

Flow control operates through a four-stage pipeline that separates configuration from execution:

  1. Initial Tagging - Incoming telemetry is marked with a default sampling rate in its attributes using OTTL transformation statements.
  2. Lookup Consultation - The system checks a lookup table (typically a CSV file) for the current flow rate (0-100%) and expiration date based on service name or other attributes.
  3. Conditional Logic - The pipeline verifies if the expiration date has passed, applying either the default rate or the lookup value. This time-based expiration ensures temporary rate changes automatically revert to normal.
  4. Probabilistic Sampling - The Sample Processor executes the final sampling decision using the dynamically assigned rate.
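The four stages above can be pictured as a pipeline fragment. This is a minimal, unverified sketch: the node type names, OTTL statements, and attribute keys are illustrative assumptions, not documented Edge Delta schema.

```yaml
# Hypothetical sketch of the four-stage flow-control pipeline.
# Node types, OTTL functions, and attribute keys are assumptions;
# consult the Edge Delta processor reference for real syntax.
nodes:
  # 1. Initial tagging: stamp a default sampling rate on every item
  - name: tag_default_rate
    type: ottl_transform
    statements: |
      set(attributes["flow_rate"], 10)

  # 2. Lookup consultation: enrich with the current rate and expiry
  #    from a CSV keyed on service name
  - name: consult_lookup
    type: lookup
    location_path: flow_rates.csv
    key_fields:
      - event_field: attributes["service.name"]
        lookup_field: service_name

  # 3. Conditional logic: apply the looked-up rate only while unexpired
  - name: apply_unexpired_rate
    type: ottl_transform
    statements: |
      set(attributes["flow_rate"], attributes["lookup.flow_rate"])
        where attributes["lookup.expires_at"] > Now()

  # 4. Probabilistic sampling: sample at the dynamically assigned rate
  - name: dynamic_sampler
    type: sample
    percentage: attributes["flow_rate"]
```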

This architecture decouples sampling configuration from pipeline deployment, enabling operational changes without code modifications or pipeline restarts.

Management and Automation

Teams manage flow control by updating a simple CSV lookup file containing service names, flow rates, and expiration dates. Changes can be made manually via the Edge Delta UI, programmatically via API, or automatically through monitor-triggered actions.
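A minimal lookup file might look like the following; the column names are illustrative assumptions:

```csv
service_name,flow_rate,expires_at
payment-service,100,2025-12-03T22:15:00Z
checkout-service,25,2025-12-04T08:00:00Z
batch-reporting,5,2099-01-01T00:00:00Z
```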

Edge Delta monitors can automatically trigger flow rate adjustments when specific conditions occur: high error rates can switch to 100% sampling, SLA violations can increase sampling for affected services, budget alerts can reduce sampling for non-critical services, and incident creation can enable full-fidelity for troubleshooting.
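One way to picture monitor-triggered adjustment is a monitor whose action writes a temporary row into the lookup file. The `actions` schema, templating helper, and endpoint below are hypothetical placeholders, not a documented Edge Delta API:

```yaml
# Hypothetical sketch: bump a service to 100% sampling for two hours
# when its error rate spikes. The actions block and the lookup-update
# endpoint are illustrative assumptions.
monitors:
  - name: payment_error_spike
    type: metric
    query: |
      rate(http_requests_total{service="payment",status=~"5.."}[5m]) > 50
    severity: critical
    actions:
      - type: webhook
        url: https://api.example.com/v1/lookup-tables/flow_rates/rows  # placeholder URL
        method: POST
        body: |
          {"service_name": "payment", "flow_rate": 100,
           "expires_at": "{{ now_plus_2h }}"}   # assumed templating helper
```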

This tight integration eliminates the need for external workflow orchestration and ensures sampling rates respond instantly to changing conditions.

Use Cases and Value

For incident response, consider a payment service running at 10% sampling ($100/month cost) during normal operations. When an alert triggers, flow control automatically switches to 100% sampling for 2 hours, providing full-fidelity data for troubleshooting at an incremental cost of just $6.67. After the expiration period, sampling automatically reverts to 10%.

You can assign different sampling rates based on service criticality: critical services like payment and auth at 100% sampling, standard services at 20%, batch jobs at 5%, and development services at 1%.

Sampling rates can also adjust based on time of day or business cycles. During peak hours with high traffic, 5% sampling helps manage volume, while off-hours and weekends can use 20% sampling for better visibility.

For cost management, when approaching budget limits (such as 96% of monthly budget consumed), flow control can automatically reduce sampling for non-critical services while maintaining 100% for business-critical workloads.

Benefits

Aggressive sampling (1-10%) during stable periods can reduce observability costs by 90-95% compared to full-fidelity ingestion while maintaining continuous visibility. Automatic full-fidelity routing during alerts ensures rich context for incident investigation without manual intervention. Integration with Edge Delta monitors enables closed-loop automation where the observability system itself manages data volume based on operational conditions. Point-and-click CSV management eliminates complex engineering workflows and decouples data volume configuration from pipeline deployment cycles.

For step-by-step instructions on implementing flow control with dynamic sampling, including complete configuration examples and lookup table setup, see How to Implement Flow Control with Dynamic Sampling.

Alerting and Anomaly Detection

Alert Types

Edge Delta supports multiple alert mechanisms for fleet monitoring.

Health alerts cover:

  • Agent connectivity failures
  • Version drift across fleet
  • Configuration sync failures
  • Resource exhaustion (CPU, memory, disk)

Performance alerts trigger on:

  • Throughput drops below threshold
  • Processing latency exceeding SLA
  • Error rate spikes
  • Backpressure buildup

Data quality alerts detect:

  • Unexpected traffic patterns
  • Missing expected data sources
  • Schema changes or parsing failures
  • Anomalous field values

The following example shows an alert configuration:

monitors:
  - name: high_drop_rate
    type: metric
    query: |
      rate(pipeline_events_dropped_total[5m]) > 1000      
    severity: warning
    annotations:
      summary: "High event drop rate detected"
      description: "Pipeline {{ $labels.pipeline }} dropping {{ $value }} events/sec"

Integration with Alerting Platforms

Edge Delta integrates with external alerting systems including PagerDuty for incident management and on-call routing, Slack and Microsoft Teams for team notifications and collaboration, webhooks for custom integrations with internal systems, and email for direct notifications to operations teams.

Setup guides: Monitor Configuration

Audit and Compliance

Configuration Audit Trail

Edge Delta maintains comprehensive audit logs for all configuration changes including pipeline configuration updates, processor node additions and deletions, destination modifications, sampling policy changes, and access control updates.

The following example shows audit log fields:

{
  "timestamp": "2025-12-03T20:15:00Z",
  "user": "alex.cain@edgedelta.com",
  "action": "pipeline.update",
  "resource": "production-logs",
  "changes": {
    "nodes": {
      "added": ["mask_pii_processor"],
      "modified": ["filter_debug_logs"],
      "removed": []
    }
  },
  "version": "v2.1.47",
  "commit_id": "a7b3c9d2"
}

Change history tracks who made the change, what was changed (with diff view), when the change occurred, why the change was made (commit message), and deployment status with rollback options.

CI/CD Integration

Edge Delta integrates seamlessly with CI/CD pipelines to enable Configuration as Code and Monitoring as Code. Pipeline configurations can be stored in Git with a directory structure like pipeline-configs/production/, pipeline-configs/staging/, and pipeline-configs/tests/ containing your YAML pipeline definitions.

The following example shows a CI/CD workflow:

# .github/workflows/deploy-pipeline.yml
name: Deploy Pipeline Configuration

on:
  push:
    branches: [main]
    paths: ['pipeline-configs/**']

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Validate Pipeline Syntax
        run: edgedelta validate pipeline-configs/production/*.yaml

      - name: Run Integration Tests
        run: edgedelta test pipeline-configs/production/*.yaml

  deploy:
    needs: validate
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to Production
        run: edgedelta deploy --config pipeline-configs/production/
        env:
          EDGEDELTA_API_TOKEN: ${{ secrets.EDGEDELTA_API_TOKEN }}

This approach provides version control with complete history of configuration changes, code review before production deployment, automated testing to validate configurations before deployment, rollback capability to instantly revert to previous versions, and compliance through audit trails for regulatory requirements.

Learn more: Mastering CI/CD Monitoring

Monitoring as Code

Define monitoring policies alongside pipeline configurations. The following example shows a monitor definition:

# monitors/high-error-rate.yaml
apiVersion: edgedelta.com/v1
kind: Monitor
metadata:
  name: high-error-rate
spec:
  query: |
    rate(http_requests_total{status=~"5.."}[5m]) > 100    
  severity: critical
  annotations:
    summary: High error rate detected
    runbook_url: https://wiki.example.com/runbooks/high-error-rate
  notifications:
    - pagerduty: ops-team
    - slack: "#alerts"

Version control benefits include tracking monitor changes over time, reviewing and approving monitoring updates, synchronizing monitors with application deployments, and sharing monitor definitions across teams.

Deep dive: Monitoring as Code

Compliance and Governance

Fleet management supports compliance requirements including data residency (control where data is processed and stored), retention policies (enforce data retention rules), access controls (role-based access to configurations), encryption (in-transit and at-rest), and audit logging (complete audit trail for compliance audits).

Example compliance use cases include GDPR (PII masking before cross-border transfer), HIPAA (PHI redaction and audit trails), SOC 2 (configuration change tracking and approval workflows), and PCI DSS (cardholder data tokenization and access logs).
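For instance, GDPR-style PII masking at the edge might be sketched as a processor node like the one below. The node type, pattern field names, and selector syntax are assumptions modeled on the `mask_pii_processor` mentioned in the audit example, not verified schema:

```yaml
# Hypothetical sketch: mask SSN-like values before cross-border transfer.
# Field names and selector syntax are illustrative assumptions.
nodes:
  - name: mask_pii_processor
    type: mask
    pattern: '\b\d{3}-\d{2}-\d{4}\b'   # SSN-like pattern (illustrative)
    mask: '<REDACTED>'
    field_path: item["body"]           # assumed field selector syntax
```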

Troubleshooting and Diagnostics

Live Debugging

Edge Delta provides in-stream debugging capabilities without disrupting production. The following example shows a debug output configuration:

nodes:
  - name: debug_suspicious_traffic
    type: debug_output
    filter: |
      attributes["http.status_code"] >= 500      
    sample_rate: 100           # Capture all matching events
    max_events: 1000           # Limit capture size
    ttl: 1h                    # Auto-expire after 1 hour

This allows engineers to:

  • Capture live data matching specific criteria
  • Inspect event structure and attributes
  • Validate filter and processor logic
  • Debug issues without log aggregation

Learn more: Live Capture In-Stream Debugging

Agent Logs

Access detailed agent logs for troubleshooting. Log levels include INFO for normal operational messages, WARN for potential issues or degraded performance, ERROR for failed operations requiring attention, and DEBUG for verbose logging during deep troubleshooting.

Access agent logs using the following commands:

# Kubernetes
kubectl logs -n edgedelta <agent-pod-name>

# Docker
docker logs <agent-container-id>

# Systemd
journalctl -u edgedelta-agent

Common troubleshooting scenarios include agent not connecting to backend, high memory or CPU usage, destination connection failures, configuration syntax errors, and permission or credential issues.

Troubleshooting guide: Troubleshoot Edge Delta Agent with Helm

Best Practices

Fleet Organization

Organize your pipelines and agents for clarity and maintainability:

  • For logical fleet grouping, group by environment (prod, staging, dev), by region or data center, by application or service, or by compliance zone.
  • Use consistent naming conventions following a pattern like <environment>-<region>-<purpose>. Examples include prod-us-east-1-logs, staging-eu-west-1-metrics, and dev-global-traces.
  • For configuration inheritance, define base configurations at organization level, override at fleet level for specific needs, and use templates for common patterns.

Monitoring Strategy

Establish observability practices that scale with your fleet:

  • Establish baselines by measuring normal throughput and latency, tracking resource utilization patterns, and documenting expected behavior.
  • Define SLOs for your fleet. Typical targets include 99.9% agent availability, P99 processing latency under 200ms, error rate below 0.1%, and zero data loss.
  • Alert on trends rather than spikes by using rate-of-change alerts, applying moving averages, setting appropriate thresholds, and reducing alert fatigue.
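A trend-based alert along these lines could be expressed with a rate-of-change query. The monitor schema follows the earlier examples in this document; the metric name and query are assumptions:

```yaml
# Hypothetical trend alert: fire when throughput drops more than 50%
# relative to the previous hour, rather than on an absolute threshold.
monitors:
  - name: throughput_trend_drop
    type: metric
    query: |
      rate(pipeline_events_in_total[10m])
        < 0.5 * rate(pipeline_events_in_total[1h] offset 1h)
    severity: warning
    annotations:
      summary: "Sustained throughput drop versus the previous hour"
```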

Security and Compliance

Protect your telemetry infrastructure and data:

  • For least privilege access, limit who can modify configurations, use role-based access control, and audit access regularly.
  • For secrets management, never commit credentials to Git, use secret stores like Vault or AWS Secrets Manager, and rotate credentials regularly.
  • For data protection, mask PII at collection time, encrypt data in transit, and enforce data retention policies.

Summary

Edge Delta’s fleet management capabilities provide real-time visibility into agent health, throughput, and performance; flow control with dynamic sampling and data quotas; comprehensive audit trails for compliance and troubleshooting; CI/CD integration for Configuration as Code and Monitoring as Code; intelligent alerting with integration to external platforms; configuration versioning with instant rollback; and API access for programmatic management and automation.

These capabilities enable teams to operate telemetry pipelines at scale with confidence, maintaining high availability while reducing operational overhead.