Troubleshooting Ingest Failures

Diagnose and resolve ingest failures across message buses, frontends, and storage tiers.

Overview

Ingest failures often cascade across multiple system layers. A broker redundancy loss can trigger frontend overload, disk pressure, and consumer lag, with each symptom masking the underlying cause. This guide provides a structured approach to diagnosing and resolving ingest failures using Edge Delta’s telemetry capabilities.

For AI-assisted investigation of ingest failures, see Ingest Failure Investigation.

Configure your Telemetry Pipeline to capture the signals needed for effective troubleshooting:

Sources

Source            Purpose
kubernetes_input  Collect logs from ingest services and pods
kafka_input       Capture broker logs and partition events
otlp_input        Ingest traces and host metrics (CPU, memory, disk)
http_input        Receive control-plane change events

Enrichment

Add OTTL transforms to include scoping metadata:

  • region: geographic region or availability zone
  • cluster: Kafka cluster identifier
  • partition: affected partition number
  • tenant: customer or service owner

Consistent enrichment enables fast impact scoping during incidents.
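
A minimal sketch of such a transform, written as OTTL set statements as they would appear in a transform node's statement list: the attribute keys match the list above, while the literal values and the kafka.partition source field are placeholders for your own environment.

    # Static scoping fields; the values shown are placeholders
    set(attributes["region"], "us-east-1")
    set(attributes["cluster"], "kafka-prod-1")
    set(attributes["tenant"], "payments")
    # Partition is usually copied from a field already present on the record;
    # the kafka.partition source attribute is an assumption
    set(attributes["partition"], attributes["kafka.partition"])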

Pattern and Metric Extraction

Processor              Purpose
log_to_pattern_metric  Emit pattern metrics for broker errors and ingest failures
extract_metric         Derive ingest error rate, consumer lag, and retry volume

Destinations

Destination         Purpose
ed_output           Send to Edge Delta for analysis and alerting
s3_output           Archive for rehydration and compliance
ed_ai_event_output  Route events to AI Team (optional)
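
Taken together, a pipeline covering these sources, processors, and destinations might declare nodes along the following lines. This is a sketch only: node names are arbitrary, links and per-node parameters are omitted, and the exact fields each node type accepts should be taken from its reference page rather than from this example.

    nodes:
      - name: ingest_pod_logs
        type: kubernetes_input        # logs from ingest services and pods
      - name: broker_logs
        type: kafka_input             # broker logs and partition events
      - name: traces_and_host_metrics
        type: otlp_input              # traces and host metrics
      - name: change_events
        type: http_input              # control-plane change events
      - name: broker_error_patterns
        type: log_to_pattern_metric   # pattern metrics for broker and ingest errors
      - name: ingest_health_metrics
        type: extract_metric          # error rate, consumer lag, retry volume
      - name: edge_delta
        type: ed_output               # analysis and alerting
      - name: archive
        type: s3_output               # rehydration and compliance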

Troubleshooting by Symptom

Broker Partition Errors

Symptoms:

  • Leaderless partition warnings in broker logs
  • Uneven write availability across partitions
  • Increased producer retries

Diagnostic Steps:

  1. Query broker logs filtered by partition to identify affected partitions:

    resource.k8s.container.name = "kafka" AND body CONTAINS "leaderless"
    
  2. Check pattern metrics for broker error signatures in the Logs Explorer

  3. Review the service map to identify dependent services

Resolution:

  • Trigger partition leader election if brokers are available
  • Reduce producer load to affected partitions
  • Route low-priority logs to archive to reduce broker pressure

Frontend 5xx Errors

Symptoms:

  • Intermittent 5xx responses from ingest API
  • Elevated queue depth at frontends
  • Customer-reported ingestion failures

Diagnostic Steps:

  1. Query extracted metrics for 5xx error rate trends (see the example query after this list)

  2. Examine traces showing request latency breakdown:

    • Queue wait time
    • Broker write latency
    • Serialization overhead

  3. Check the service map for degraded edges between ingest API and brokers
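
For step 1, a query along these lines can surface the trend. The metric name ingest.error.rate and the resource.service.name facet are illustrative assumptions; substitute the metric name your extract_metric processor actually emits and the label that identifies your ingest frontends.

    metric.name = "ingest.error.rate" AND resource.service.name CONTAINS "ingest"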

Resolution:

  • Enable pipeline sampling to reduce low-priority log volume
  • Scale ingest frontends horizontally if queue depth is high
  • Check broker health before adding frontend capacity

Disk Pressure

Symptoms:

  • Broker disk utilization exceeding thresholds
  • Write failures or slowdowns
  • Retention policy violations

Diagnostic Steps:

  1. Query OTLP host metrics for disk free percentage:

    metric.name = "system.disk.free" AND resource.host.name CONTAINS "kafka"
    
  2. Correlate disk trends with retry volume and buffer growth

  3. Review dashboards showing buffer accumulation by source

Resolution:

  • Route low-priority telemetry to S3 using pipeline routing
  • Reduce retention period temporarily
  • Add broker capacity or storage
  • Validate rehydration path before archiving

Consumer Lag

Symptoms:

  • Consumer groups falling behind
  • Stale data in SLO calculations
  • Missing or delayed alerts

Diagnostic Steps:

  1. Query consumer lag metrics extracted from broker logs (see the example query after this list)

  2. Check monitors for lag threshold violations

  3. Review dashboards showing lag alongside SLO freshness
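
For step 1, a query of this shape works if your extract_metric processor emits a lag metric. The metric name consumer.lag and the attributes.consumer_group facet are illustrative assumptions; replace them with the names your pipeline actually produces.

    metric.name = "consumer.lag" AND attributes.consumer_group CONTAINS "slo"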

Resolution:

  • Scale consumer group instances
  • Prioritize SLO-related telemetry in pipeline routing
  • Investigate upstream producer rate increases

Control Plane Failures

Symptoms:

  • Administrative API failures (topic creation, config changes)
  • Metadata inconsistencies
  • Cluster management operations timing out

Diagnostic Steps:

  1. Query change events captured via http_input (see the example query after this list)

  2. Correlate admin failures with broker health metrics

  3. Review dashboards showing admin operation success rates
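
For step 1, a query like the one below narrows change events to failed administrative operations. The body terms are illustrative, since the exact wording depends on what your control plane posts to the http_input source.

    body CONTAINS "topic" AND body CONTAINS "failed"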

Resolution:

  • Wait for cluster stabilization before retrying admin operations
  • Use staged configuration changes with approval steps
  • Document changes for post-incident review

Recovery Checklist

After resolving the immediate issue:

  1. Validate service map continuity: Confirm dependency relationships are accurate
  2. Check dashboard data freshness: Ensure metrics and logs are current
  3. Initiate rehydration: Backfill missing windows from S3 archives using Rehydration
  4. Review monitors: Adjust thresholds based on incident learnings
  5. Update runbooks: Document new failure modes and resolutions

Prevention Checklist

Use these guidelines to improve resilience:

  • Add enrichment fields (region, cluster, partition, tenant) for fast scoping
  • Enable pattern extraction for broker and ingest error signatures
  • Extract metrics for error rate, consumer lag, and retry volume
  • Connect metrics to monitors with appropriate thresholds
  • Build dashboards combining traces and core metrics
  • Route low-priority telemetry to archive and validate rehydration before incidents
  • Review the service map regularly to understand dependency changes