Fleet Management
Overview
Edge Delta provides comprehensive fleet management capabilities that enable teams to monitor, control, and audit telemetry pipelines and agents across their entire infrastructure. Fleet management delivers deep visibility into throughput, performance, and operational health, enabling proactive flow control, intelligent alerting, and complete audit trails for compliance and troubleshooting.
Note: In Edge Delta, a pipeline refers to a group of deployed agents that share a common configuration—such as processor pods, compactor pods in Kubernetes, or agent binaries on VMs. The term “fleet” in this document is used semantically to describe managing multiple pipelines and agents at scale. For more on Edge Delta’s architecture, see Edge Delta Architecture.
Key capabilities include real-time monitoring of agent health and pipeline performance, throughput and data flow metrics across all nodes, integrated audit trails with CI/CD and change history, flow control mechanisms for managing data volume, centralized alerting and anomaly detection, and configuration versioning and rollback.
Fleet Architecture
Cloud Fleets and Edge Fleets
Edge Delta organizes agents into fleets that provide logical grouping and unified management.
- Edge Fleets are deployed close to data sources (Kubernetes clusters, VMs, containers). They perform local processing, filtering, and aggregation, reduce network egress and centralized processing load, and scale horizontally based on local workload.
- Cloud Fleets are deployed in centralized locations (gateway pattern). They handle cross-fleet aggregation and analysis, provide unified routing to destinations, and enable organization-wide policies. See Cloud Pipelines for configuration details.
This two-tier architecture enables both edge intelligence and centralized control, giving teams the flexibility to optimize for latency, cost, and operational requirements.
Learn more: How Fleets Work with Telemetry Pipelines
Monitoring and Visibility
Agent Health Monitoring
Edge Delta provides continuous monitoring of agent health across your entire fleet. Built-in health inputs include ed_component_health for component-level health status, ed_node_health for node-level health metrics, ed_agent_stats for agent performance statistics, and ed_pipeline_io_stats for input/output throughput data.
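These built-in inputs can be added to a pipeline like any other source node. The following sketch shows how two of them might appear; the node names are illustrative, and the exact schema may differ by agent version:
nodes:
  - name: component_health
    type: ed_component_health   # component-level health status
  - name: io_stats
    type: ed_pipeline_io_stats  # per-node input/output throughput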
The health dashboard provides a fleet overview with visual status of all fleets at a glance, individual agent status with deployment details, deployment status to track agent versions and configuration state, and heartbeat monitoring with minute-by-minute agent availability checks.
Each agent sends a heartbeat every minute to the Edge Delta backend, enabling real-time detection of connectivity issues, crashes, or configuration problems. The dashboard aggregates this data to provide instant visibility into fleet-wide health.
Health indicators show agent state: Healthy means the agent is running and processing data normally, Warning indicates performance degradation or partial failures, Critical means the agent is down or experiencing severe issues, and Unknown means no recent heartbeat was received.
View your fleet: Pipeline Dashboard
Throughput Monitoring
Track data volume and processing rates across all pipeline stages:
| Metric | Description | Use Case |
|---|---|---|
| Input Rate | Events/sec ingested by sources | Capacity planning |
| Processing Rate | Events/sec through processors | Performance tuning |
| Output Rate | Events/sec sent to destinations | Destination health |
| Drop Rate | Events/sec filtered or dropped | Filter effectiveness |
| Backpressure | Queue depth and latency | Flow control |
Pipeline I/O statistics show the flow through each stage. For example, a production logs pipeline might show 45,000 events/sec input, 12,000 events/sec filtered out (26.7%), 33,000 events/sec processed, 28,000 events/sec enriched, and 28,000 events/sec output, a 37.8% overall reduction (62.2% of the input retained).
These metrics enable teams to:
- Identify bottlenecks in processing pipelines
- Validate filter effectiveness and data reduction
- Detect anomalies in traffic patterns
- Optimize resource allocation
Performance Metrics
Monitor resource utilization and processing efficiency. Agent performance metrics include:
- CPU usage (per-agent utilization and trends)
- Memory usage (heap allocation and garbage collection)
- Disk I/O (buffer usage for output queuing)
- Network (egress bandwidth to destinations)
- Latency (end-to-end processing latency by node)
Each processor node reports individual performance metrics including events processed per second, processing latency (P50, P95, P99), error rate and retry statistics, and cache hit rates for stateful processors.
For example, an agent might show CPU at 245m/500m (49%), memory at 512MB/1GB (51%), processing at 12,500 events/sec, latency at P95=45ms and P99=120ms, and error rate at 0.02%.
Flow Control
Dynamic Sampling and Data Quotas
Edge Delta provides intelligent flow control that balances full-fidelity data routing during incidents with cost-effective sampling during normal operations. Rather than relying on static sampling rates, flow control adjusts data volume dynamically based on real-time conditions, enabling teams to meet both cost objectives and troubleshooting requirements.
Flow control enables teams to automatically send 100% of data during alerts or incidents while maintaining aggressive sampling (1-10%) during steady-state operations. This approach reduces observability costs by up to 95% while preserving complete context when troubleshooting matters most.
How Dynamic Rate Sampling Works
Flow control operates through a four-stage pipeline that separates configuration from execution:
- Initial Tagging - Incoming telemetry is marked with a default sampling rate in its attributes using OTTL transformation statements.
- Lookup Consultation - The system checks a lookup table (typically a CSV file) for the current flow rate (0-100%) and expiration date based on service name or other attributes.
- Conditional Logic - The pipeline verifies if the expiration date has passed, applying either the default rate or the lookup value. This time-based expiration ensures temporary rate changes automatically revert to normal.
- Probabilistic Sampling - The Sample Processor executes the final sampling decision using the dynamically assigned rate.
This architecture decouples sampling configuration from pipeline deployment, enabling operational changes without code modifications or pipeline restarts.
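A minimal sketch of stages 1 and 3, assuming an OTTL transform node; the node type, attribute names, and timestamp format are illustrative rather than confirmed syntax:
nodes:
  - name: tag_default_rate
    type: ottl_transform   # hypothetical node type
    statements: |
      # Stage 1: mark every event with the default 10% rate
      set(attributes["sample_rate"], 10)
  - name: apply_lookup_rate
    type: ottl_transform
    statements: |
      # Stage 3: use the looked-up rate only while it has not expired
      set(attributes["sample_rate"], attributes["lookup_rate"]) where Time(attributes["lookup_expiry"], "%Y-%m-%dT%H:%M:%SZ") > Now()
The Sample Processor (stage 4) then reads attributes["sample_rate"] to make the final probabilistic decision.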
Management and Automation
Teams manage flow control by updating a simple CSV lookup file containing service names, flow rates, and expiration dates. Changes can be made manually via the Edge Delta UI, programmatically via API, or automatically through monitor-triggered actions.
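A hypothetical lookup file might look like the following; the column names are illustrative:
service_name,flow_rate,expiration_date
payment-service,100,2025-12-03T22:15:00Z
checkout-service,20,2026-01-01T00:00:00Z
batch-reporting,5,2026-01-01T00:00:00Z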
Edge Delta monitors can automatically trigger flow rate adjustments when specific conditions occur: high error rates can switch to 100% sampling, SLA violations can increase sampling for affected services, budget alerts can reduce sampling for non-critical services, and incident creation can enable full-fidelity for troubleshooting.
This tight integration eliminates the need for external workflow orchestration and ensures sampling rates respond instantly to changing conditions.
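As an illustration only, such an automation could be expressed in the Monitor format shown later in this document, assuming a hypothetical actions field for updating the lookup table:
# Hypothetical sketch; the actions block is assumed, not confirmed syntax
apiVersion: edgedelta.com/v1
kind: Monitor
metadata:
  name: payment-error-spike
spec:
  query: |
    rate(http_requests_total{service="payment",status=~"5.."}[5m]) > 50
  severity: critical
  actions:
    - set_flow_rate:        # hypothetical action type
        service: payment-service
        flow_rate: 100
        expires_in: 2h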
Use Cases and Value
For incident response, consider a payment service running at 10% sampling ($100/month cost) during normal operations. When an alert triggers, flow control automatically switches to 100% sampling for 2 hours, providing full-fidelity data for troubleshooting at an incremental cost of just $6.67. After the expiration period, sampling automatically reverts to 10%.
You can assign different sampling rates based on service criticality: critical services like payment and auth at 100% sampling, standard services at 20%, batch jobs at 5%, and development services at 1%.
Sampling rates can also adjust based on time of day or business cycles. During peak hours with high traffic, 5% sampling helps manage volume, while off-hours and weekends can use 20% sampling for better visibility.
For cost management, when approaching budget limits (such as 96% of monthly budget consumed), flow control can automatically reduce sampling for non-critical services while maintaining 100% for business-critical workloads.
Benefits
Aggressive sampling (1-10%) during stable periods can reduce observability costs by 90-95% compared to full-fidelity ingestion while maintaining continuous visibility. Automatic full-fidelity routing during alerts ensures rich context for incident investigation without manual intervention. Integration with Edge Delta monitors enables closed-loop automation where the observability system itself manages data volume based on operational conditions. Point-and-click CSV management eliminates complex engineering workflows and decouples data volume configuration from pipeline deployment cycles.
For step-by-step instructions on implementing flow control with dynamic sampling, including complete configuration examples and lookup table setup, see How to Implement Flow Control with Dynamic Sampling.
Alerting and Anomaly Detection
Alert Types
Edge Delta supports multiple alert mechanisms for fleet monitoring.
Health alerts cover:
- Agent connectivity failures
- Version drift across fleet
- Configuration sync failures
- Resource exhaustion (CPU, memory, disk)
Performance alerts trigger on:
- Throughput drops below threshold
- Processing latency exceeding SLA
- Error rate spikes
- Backpressure buildup
Data quality alerts detect:
- Unexpected traffic patterns
- Missing expected data sources
- Schema changes or parsing failures
- Anomalous field values
The following example shows an alert configuration:
monitors:
  - name: high_drop_rate
    type: metric
    query: |
      rate(pipeline_events_dropped_total[5m]) > 1000
    severity: warning
    annotations:
      summary: "High event drop rate detected"
      description: "Pipeline {{ $labels.pipeline }} dropping {{ $value }} events/sec"
Integration with Alerting Platforms
Edge Delta integrates with external alerting systems including PagerDuty for incident management and on-call routing, Slack and Microsoft Teams for team notifications and collaboration, webhooks for custom integrations with internal systems, and email for direct notifications to operations teams.
Setup guides: Monitor Configuration
Audit and Compliance
Configuration Audit Trail
Edge Delta maintains comprehensive audit logs for all configuration changes including pipeline configuration updates, processor node additions and deletions, destination modifications, sampling policy changes, and access control updates.
The following example shows audit log fields:
{
  "timestamp": "2025-12-03T20:15:00Z",
  "user": "alex.cain@edgedelta.com",
  "action": "pipeline.update",
  "resource": "production-logs",
  "changes": {
    "nodes": {
      "added": ["mask_pii_processor"],
      "modified": ["filter_debug_logs"],
      "removed": []
    }
  },
  "version": "v2.1.47",
  "commit_id": "a7b3c9d2"
}
Change history tracks who made the change, what was changed (with diff view), when the change occurred, why the change was made (commit message), and deployment status with rollback options.
CI/CD Integration
Edge Delta integrates seamlessly with CI/CD pipelines to enable Configuration as Code and Monitoring as Code. Pipeline configurations can be stored in Git with a directory structure like pipeline-configs/production/, pipeline-configs/staging/, and pipeline-configs/tests/ containing your YAML pipeline definitions.
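The individual file names below are illustrative, but a repository following that structure might look like:
pipeline-configs/
  production/
    logs-pipeline.yaml
    metrics-pipeline.yaml
  staging/
    logs-pipeline.yaml
  tests/
    logs-pipeline-test.yaml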
The following example shows a CI/CD workflow:
# .github/workflows/deploy-pipeline.yml
name: Deploy Pipeline Configuration
on:
  push:
    branches: [main]
    paths: ['pipeline-configs/**']
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Pipeline Syntax
        run: edgedelta validate pipeline-configs/production/*.yaml
      - name: Run Integration Tests
        run: edgedelta test pipeline-configs/production/*.yaml
  deploy:
    needs: validate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Production
        run: edgedelta deploy --config pipeline-configs/production/
        env:
          EDGEDELTA_API_TOKEN: ${{ secrets.EDGEDELTA_API_TOKEN }}
This approach provides version control with complete history of configuration changes, code review before production deployment, automated testing to validate configurations before deployment, rollback capability to instantly revert to previous versions, and compliance through audit trails for regulatory requirements.
Learn more: Mastering CI/CD Monitoring
Monitoring as Code
Define monitoring policies alongside pipeline configurations. The following example shows a monitor definition:
# monitors/high-error-rate.yaml
apiVersion: edgedelta.com/v1
kind: Monitor
metadata:
  name: high-error-rate
spec:
  query: |
    rate(http_requests_total{status=~"5.."}[5m]) > 100
  severity: critical
  annotations:
    summary: High error rate detected
    runbook_url: https://wiki.example.com/runbooks/high-error-rate
  notifications:
    - pagerduty: ops-team
    - slack: "#alerts"
Version control benefits include tracking monitor changes over time, reviewing and approving monitoring updates, synchronizing monitors with application deployments, and sharing monitor definitions across teams.
Deep dive: Monitoring as Code
Compliance and Governance
Fleet management supports compliance requirements including data residency (control where data is processed and stored), retention policies (enforce data retention rules), access controls (role-based access to configurations), encryption (in-transit and at-rest), and audit logging (complete audit trail for compliance audits).
Example compliance use cases include GDPR (PII masking before cross-border transfer), HIPAA (PHI redaction and audit trails), SOC 2 (configuration change tracking and approval workflows), and PCI DSS (cardholder data tokenization and access logs).
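For the GDPR case, a masking step might look like the following sketch; the mask node type and its fields are assumptions for illustration, not confirmed Edge Delta syntax:
nodes:
  - name: mask_pii_processor
    type: mask                      # hypothetical node type
    pattern: \b\d{3}-\d{2}-\d{4}\b  # e.g., US Social Security numbers
    mask: REDACTED                  # substituted before cross-border transfer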
Troubleshooting and Diagnostics
Live Debugging
Edge Delta provides in-stream debugging capabilities without disrupting production. The following example shows a debug output configuration:
nodes:
  - name: debug_suspicious_traffic
    type: debug_output
    filter: |
      attributes["http.status_code"] >= 500
    sample_rate: 100   # Capture all matching events
    max_events: 1000   # Limit capture size
    ttl: 1h            # Auto-expire after 1 hour
This allows engineers to:
- Capture live data matching specific criteria
- Inspect event structure and attributes
- Validate filter and processor logic
- Debug issues without log aggregation
Learn more: Live Capture In-Stream Debugging
Agent Logs
Access detailed agent logs for troubleshooting. Log levels include INFO for normal operational messages, WARN for potential issues or degraded performance, ERROR for failed operations requiring attention, and DEBUG for verbose logging during deep troubleshooting.
Access agent logs using the following commands:
# Kubernetes
kubectl logs -n edgedelta <agent-pod-name>
# Docker
docker logs <agent-container-id>
# Systemd
journalctl -u edgedelta-agent
Common troubleshooting scenarios include agent not connecting to backend, high memory or CPU usage, destination connection failures, configuration syntax errors, and permission or credential issues.
Troubleshooting guide: Troubleshoot Edge Delta Agent with Helm
Best Practices
Fleet Organization
Organize your pipelines and agents for clarity and maintainability:
- For logical fleet grouping, group by environment (prod, staging, dev), by region or data center, by application or service, or by compliance zone.
- Use consistent naming conventions following a pattern like <environment>-<region>-<purpose>. Examples include prod-us-east-1-logs, staging-eu-west-1-metrics, and dev-global-traces.
- For configuration inheritance, define base configurations at the organization level, override at the fleet level for specific needs, and use templates for common patterns.
Monitoring Strategy
Establish observability practices that scale with your fleet:
- Establish baselines by measuring normal throughput and latency, tracking resource utilization patterns, and documenting expected behavior.
- Define SLOs for your fleet. Typical targets include 99.9% agent availability, P99 processing latency under 200ms, error rate below 0.1%, and zero data loss; a monitor sketch for the latency target follows this list.
- Alert on trends rather than spikes by using rate-of-change alerts, applying moving averages, setting appropriate thresholds, and reducing alert fatigue.
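For example, the latency SLO above could be written in the same Monitor format used earlier; the metric name in the query is an assumption for illustration:
# monitors/fleet-latency-slo.yaml (metric name is illustrative)
apiVersion: edgedelta.com/v1
kind: Monitor
metadata:
  name: fleet-latency-slo
spec:
  query: |
    histogram_quantile(0.99, rate(ed_pipeline_latency_seconds_bucket[5m])) > 0.2
  severity: warning
  annotations:
    summary: Fleet P99 processing latency exceeds the 200ms SLO
  notifications:
    - slack: "#fleet-alerts"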
Security and Compliance
Protect your telemetry infrastructure and data:
- For least privilege access, limit who can modify configurations, use role-based access control, and audit access regularly.
- For secrets management, never commit credentials to Git, use secret stores like Vault or AWS Secrets Manager, and rotate credentials regularly.
- For data protection, mask PII at collection time, encrypt data in transit, and enforce data retention policies.
Related Topics
Documentation
- Pipeline Dashboard - View fleet health and status
- Install Edge Delta Fleet with Helm - Kubernetes deployment guide
- Circuit Breaker Configuration - Flow control setup
- Agent Settings - Global configuration options
- API Reference - Programmatic access
Blog Posts
- How Fleets Work with Telemetry Pipelines - Fleet architecture deep dive
- Mastering CI/CD Monitoring - CI/CD integration best practices
- Monitoring as Code - Configuration management strategies
- Observability as Code - GitOps for observability
- Telemetry Pipelines Architecture - Architecture overview
Summary
Edge Delta’s fleet management capabilities provide real-time visibility into agent health, throughput, and performance; dynamic flow control with lookup-driven sampling that adapts to alerts, budgets, and business cycles; comprehensive audit trails for compliance and troubleshooting; CI/CD integration for Configuration as Code and Monitoring as Code; intelligent alerting with integration to external platforms; configuration versioning and rollback for safe deployments; and API access for programmatic management and automation.
These capabilities enable teams to operate telemetry pipelines at scale with confidence, maintaining high availability while reducing operational overhead.