Troubleshooting Ingest Failures
Overview
Ingest failures often cascade across multiple system layers. A broker redundancy loss can trigger frontend overload, disk pressure, and consumer lag, with each symptom masking the underlying cause. This guide provides a structured approach to diagnosing and resolving ingest failures using Edge Delta’s telemetry capabilities.
For AI-assisted investigation of ingest failures, see Ingest Failure Investigation.
Recommended Pipeline Configuration
Configure your Telemetry Pipeline to capture the signals needed for effective troubleshooting:
Sources
| Source | Purpose |
|---|---|
| `kubernetes_input` | Collect logs from ingest services and pods |
| `kafka_input` | Capture broker logs and partition events |
| `otlp_input` | Ingest traces and host metrics (CPU, memory, disk) |
| `http_input` | Receive control-plane change events |
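A minimal sketch of how these sources could appear in a pipeline configuration is shown below. The node names and parameter fields (endpoints, ports, include filters) are illustrative assumptions, not a complete reference; see each source node's documentation for the full option set.

```yaml
# Sketch of the source nodes listed above. Parameter names and values are
# illustrative placeholders, not a complete reference.
nodes:
  - name: ingest_pods
    type: kubernetes_input
    include:
      - "k8s.namespace.name=ingest"     # hypothetical namespace filter
  - name: kafka_brokers
    type: kafka_input
    endpoint: "kafka-0.kafka:9092"      # hypothetical broker endpoint
    topic: "broker-logs"                # hypothetical topic
  - name: otlp_signals
    type: otlp_input
    port: 4317                          # standard OTLP/gRPC port (field name assumed)
  - name: control_plane_events
    type: http_input
    port: 8080                          # hypothetical listen port
```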
Enrichment
Add OTTL transforms to include scoping metadata:
- `region`: geographic or availability zone
- `cluster`: Kafka cluster identifier
- `partition`: affected partition number
- `tenant`: customer or service owner
Consistent enrichment enables fast impact scoping during incidents.
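For illustration, the scoping attributes could be set with OTTL statements inside a transform node, roughly as sketched below. The node type, statement field, and source attributes are assumptions; adapt them to your pipeline's transform processor and available metadata.

```yaml
# Sketch of an enrichment transform. The node type, statements field, and
# source attributes are assumptions for illustration only.
nodes:
  - name: enrich_scope
    type: ottl_transform
    statements:
      - set(resource.attributes["region"], "us-east-1")                   # static example value
      - set(resource.attributes["cluster"], "kafka-prod")                 # static example value
      - set(attributes["partition"], attributes["kafka.partition"])       # hypothetical source attribute
      - set(attributes["tenant"], resource.attributes["service.owner"])   # hypothetical source attribute
```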
Pattern and Metric Extraction
| Processor | Purpose |
|---|---|
| `log_to_pattern_metric` | Emit pattern metrics for broker errors and ingest failures |
| `extract_metric` | Derive ingest error rate, consumer lag, and retry volume |
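The sketch below shows where these processors might sit in the node list. The rule definitions are intentionally left as placeholders because their syntax is processor-specific.

```yaml
# Sketch only: processor nodes for pattern and metric extraction. Rule syntax
# is omitted because it is processor-specific; treat these entries as placeholders.
nodes:
  - name: broker_error_patterns
    type: log_to_pattern_metric
    # pattern grouping/reporting options go here
  - name: ingest_health_metrics
    type: extract_metric
    # rules deriving ingest error rate, consumer lag, and retry volume go here
```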
Destinations
| Destination | Purpose |
|---|---|
| `ed_output` | Send to Edge Delta for analysis and alerting |
| `s3_output` | Archive for rehydration and compliance |
| `ed_ai_event_output` | Route events to AI Team (optional) |
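Putting the pieces together, the destinations and the links between nodes might look like the following sketch. The `links` layout and the S3 parameters are assumptions for illustration; wire the nodes to match your own pipeline.

```yaml
# Sketch of destination nodes plus example links. The S3 parameters are
# illustrative; node names reference the earlier sketches in this guide.
nodes:
  - name: edge_delta
    type: ed_output
  - name: archive
    type: s3_output
    bucket: "telemetry-archive"   # hypothetical bucket
    region: "us-east-1"           # hypothetical region
  - name: ai_team_events
    type: ed_ai_event_output      # optional
links:
  - from: kafka_brokers           # source node from the Sources sketch
    to: enrich_scope              # transform node from the Enrichment sketch
  - from: enrich_scope
    to: edge_delta
  - from: enrich_scope
    to: archive
```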
Troubleshooting by Symptom
Broker Partition Errors
Symptoms:
- Leaderless partition warnings in broker logs
- Uneven write availability across partitions
- Increased producer retries
Diagnostic Steps:
1. Query broker logs filtered by `partition` to identify affected partitions: `resource.k8s.container.name = "kafka" AND body CONTAINS "leaderless"`
2. Check pattern metrics for broker error signatures in the Logs Explorer
3. Review the service map to identify dependent services
Resolution:
- Trigger partition leader election if brokers are available
- Reduce producer load to affected partitions
- Route low-priority logs to archive to reduce broker pressure
Frontend 5xx Errors
Symptoms:
- Intermittent 5xx responses from ingest API
- Elevated queue depth at frontends
- Customer-reported ingestion failures
Diagnostic Steps:
1. Query extracted metrics for 5xx error rate trends
2. Examine traces showing request latency breakdown:
   - Queue wait time
   - Broker write latency
   - Serialization overhead
3. Check the service map for degraded edges between ingest API and brokers
Resolution:
- Enable pipeline sampling to reduce low-priority log volume (see the sketch after this list)
- Scale ingest frontends horizontally if queue depth is high
- Check broker health before adding frontend capacity
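One way to apply sampling is a sampling processor placed ahead of the Edge Delta destination, as in the sketch below. The `sample` node type, `percentage` field, and wiring are assumptions for illustration.

```yaml
# Sketch only: sample low-priority logs ahead of the Edge Delta destination.
# The node type and fields are assumptions; in practice, route only
# low-priority streams through the sampler.
nodes:
  - name: low_priority_sampler
    type: sample
    percentage: 10                # keep roughly 10% of matching logs (illustrative)
links:
  - from: enrich_scope            # transform node from the Enrichment sketch
    to: low_priority_sampler
  - from: low_priority_sampler
    to: edge_delta                # ed_output node from the Destinations sketch
```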
Disk Pressure
Symptoms:
- Broker disk utilization exceeding thresholds
- Write failures or slowdowns
- Retention policy violations
Diagnostic Steps:
1. Query OTLP host metrics for disk free percentage: `metric.name = "system.disk.free" AND resource.host.name CONTAINS "kafka"`
2. Correlate disk trends with retry volume and buffer growth
3. Review dashboards showing buffer accumulation by source
Resolution:
- Route low-priority telemetry to S3 using pipeline routing (see the sketch after this list)
- Reduce retention period temporarily
- Add broker capacity or storage
- Validate rehydration path before archiving
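A rough sketch of routing low-priority telemetry to the archive follows. The `filter` node type, its `condition` field, and the `priority` attribute are hypothetical; any processor that can split traffic on an attribute set during enrichment would serve the same purpose.

```yaml
# Sketch only: divert low-priority telemetry to the S3 archive instead of the
# hot path. The node type, condition field, and priority attribute are hypothetical.
nodes:
  - name: low_priority_only
    type: filter
    condition: attributes["priority"] == "low"    # illustrative condition syntax
links:
  - from: enrich_scope            # transform node from the Enrichment sketch
    to: low_priority_only
  - from: low_priority_only
    to: archive                   # s3_output node from the Destinations sketch
```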
Consumer Lag
Symptoms:
- Consumer groups falling behind
- Stale data in SLO calculations
- Missing or delayed alerts
Diagnostic Steps:
1. Query consumer lag metrics extracted from broker logs
2. Check monitors for lag threshold violations
3. Review dashboards showing lag alongside SLO freshness
Resolution:
- Scale consumer group instances
- Prioritize SLO-related telemetry in pipeline routing
- Investigate upstream producer rate increases
Control Plane Failures
Symptoms:
- Administrative API failures (topic creation, config changes)
- Metadata inconsistencies
- Cluster management operations timing out
Diagnostic Steps:
1. Query change events captured via `http_input`
2. Correlate admin failures with broker health metrics
3. Review dashboards showing admin operation success rates
Resolution:
- Wait for cluster stabilization before retrying admin operations
- Use staged configuration changes with approval steps
- Document changes for post-incident review
Recovery Checklist
After resolving the immediate issue:
- Validate service map continuity: Confirm dependency relationships are accurate
- Check dashboard data freshness: Ensure metrics and logs are current
- Initiate rehydration: Backfill missing windows from S3 archives using Rehydration
- Review monitors: Adjust thresholds based on incident learnings
- Update runbooks: Document new failure modes and resolutions
Prevention Checklist
Use these guidelines to improve resilience:
- Add enrichment fields (`region`, `cluster`, `partition`, `tenant`) for fast scoping
- Enable pattern extraction for broker and ingest error signatures
- Extract metrics for error rate, consumer lag, and retry volume
- Connect metrics to monitors with appropriate thresholds
- Build dashboards combining traces and core metrics
- Route low-priority telemetry to archive and validate rehydration before incidents
- Review the service map regularly to understand dependency changes
Related Resources
- Ingest Failure Investigation: AI-assisted investigation workflow
- Monitors: Configure alerting thresholds
- Dashboards: Build operational dashboards
- Rehydration: Recover archived data
- Service Map: Visualize service dependencies