Troubleshooting Ingest Failures

Diagnose and resolve ingest failures across message buses, frontends, and storage tiers.

Overview

Ingest failures often cascade across multiple system layers. A broker redundancy loss can trigger frontend overload, disk pressure, and consumer lag, with each symptom masking the underlying cause. This guide provides a structured approach to diagnosing and resolving ingest failures using Edge Delta’s telemetry capabilities.

For AI-assisted investigation of ingest failures, see Ingest Failure Investigation.

Configure your Telemetry Pipeline to capture the signals needed for effective troubleshooting:

Sources

Source            Purpose
kubernetes_input  Collect logs from ingest services and pods
kafka_input       Capture broker logs and partition events
otlp_input        Ingest traces and host metrics (CPU, memory, disk)
http_input        Receive control-plane change events

Enrichment

Add OTTL transforms to include scoping metadata:

  • region: geographic region or availability zone
  • cluster: Kafka cluster identifier
  • partition: affected partition number
  • tenant: customer or service owner

Consistent enrichment enables fast impact scoping during incidents.
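
A minimal sketch of such a transform, written as OTTL set statements as they would appear in a transform node's statement list: the attribute keys match the list above, while the literal values and the kafka.partition source field are placeholders for your own environment.

    # Static scoping fields; the values shown are placeholders
    set(attributes["region"], "us-east-1")
    set(attributes["cluster"], "kafka-prod-1")
    set(attributes["tenant"], "payments")
    # Partition is usually copied from a field already present on the record;
    # the kafka.partition source attribute is an assumption
    set(attributes["partition"], attributes["kafka.partition"])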

Pattern and Metric Extraction

Processor              Purpose
log_to_pattern_metric  Emit pattern metrics for broker errors and ingest failures
extract_metric         Derive ingest error rate, consumer lag, and retry volume

Destinations

Destination         Purpose
ed_output           Send to Edge Delta for analysis and alerting
s3_output           Archive for rehydration and compliance
ed_ai_event_output  Route events to AI Team (optional)
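
Taken together, a pipeline covering these sources, processors, and destinations might declare nodes along the following lines. This is a sketch only: node names are arbitrary, links and per-node parameters are omitted, and the exact fields each node type accepts should be taken from its reference page rather than from this example.

    nodes:
      - name: ingest_pod_logs
        type: kubernetes_input        # logs from ingest services and pods
      - name: broker_logs
        type: kafka_input             # broker logs and partition events
      - name: traces_and_host_metrics
        type: otlp_input              # traces and host metrics
      - name: change_events
        type: http_input              # control-plane change events
      - name: broker_error_patterns
        type: log_to_pattern_metric   # pattern metrics for broker and ingest errors
      - name: ingest_health_metrics
        type: extract_metric          # error rate, consumer lag, retry volume
      - name: edge_delta
        type: ed_output               # analysis and alerting
      - name: archive
        type: s3_output               # rehydration and compliance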

Troubleshooting by Symptom

Broker Partition Errors

Symptoms:

  • Leaderless partition warnings in broker logs
  • Uneven write availability across partitions
  • Increased producer retries

Diagnostic Steps:

  1. Query broker logs filtered by partition to identify affected partitions:

    resource.k8s.container.name = "kafka" AND body CONTAINS "leaderless"
    
  2. Check pattern metrics for broker error signatures in the Logs Explorer

  3. Review the service map to identify dependent services

Resolution:

  • Trigger partition leader election if brokers are available
  • Reduce producer load to affected partitions
  • Route low-priority logs to archive to reduce broker pressure

Frontend 5xx Errors

Symptoms:

  • Intermittent 5xx responses from ingest API
  • Elevated queue depth at frontends
  • Customer-reported ingestion failures

Diagnostic Steps:

  1. Query extracted metrics for 5xx error rate trends (see the example query after this list)

  2. Examine traces showing request latency breakdown:

    • Queue wait time
    • Broker write latency
    • Serialization overhead

  3. Check the service map for degraded edges between ingest API and brokers
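
For step 1, a query along these lines can surface the trend. The metric name ingest.error.rate and the resource.service.name facet are illustrative assumptions; substitute the metric name your extract_metric processor actually emits and the label that identifies your ingest frontends.

    metric.name = "ingest.error.rate" AND resource.service.name CONTAINS "ingest"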

Resolution:

  • Enable pipeline sampling to reduce low-priority log volume
  • Scale ingest frontends horizontally if queue depth is high
  • Check broker health before adding frontend capacity

Disk Pressure

Symptoms:

  • Broker disk utilization exceeding thresholds
  • Write failures or slowdowns
  • Retention policy violations

Diagnostic Steps:

  1. Query OTLP host metrics for disk free percentage:

    metric.name = "system.disk.free" AND resource.host.name CONTAINS "kafka"
    
  2. Correlate disk trends with retry volume and buffer growth

  3. Review dashboards showing buffer accumulation by source

Resolution:

  • Route low-priority telemetry to S3 using pipeline routing
  • Reduce retention period temporarily
  • Add broker capacity or storage
  • Validate rehydration path before archiving

Consumer Lag

Symptoms:

  • Consumer groups falling behind
  • Stale data in SLO calculations
  • Missing or delayed alerts

Diagnostic Steps:

  1. Query consumer lag metrics extracted from broker logs (see the example query after this list)

  2. Check monitors for lag threshold violations

  3. Review dashboards showing lag alongside SLO freshness
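
For step 1, a query of this shape works if your extract_metric processor emits a lag metric. The metric name consumer.lag and the attributes.consumer_group facet are illustrative assumptions; replace them with the names your pipeline actually produces.

    metric.name = "consumer.lag" AND attributes.consumer_group CONTAINS "slo"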

Resolution:

  • Scale consumer group instances
  • Prioritize SLO-related telemetry in pipeline routing
  • Investigate upstream producer rate increases

Control Plane Failures

Symptoms:

  • Administrative API failures (topic creation, config changes)
  • Metadata inconsistencies
  • Cluster management operations timing out

Diagnostic Steps:

  1. Query change events captured via http_input (see the example query after this list)

  2. Correlate admin failures with broker health metrics

  3. Review dashboards showing admin operation success rates
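
For step 1, a query like the one below narrows change events to failed administrative operations. The body terms are illustrative, since the exact wording depends on what your control plane posts to the http_input source.

    body CONTAINS "topic" AND body CONTAINS "failed"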

Resolution:

  • Wait for cluster stabilization before retrying admin operations
  • Use staged configuration changes with approval steps
  • Document changes for post-incident review

Recovery Checklist

After resolving the immediate issue:

  1. Validate service map continuity: Confirm dependency relationships are accurate
  2. Check dashboard data freshness: Ensure metrics and logs are current
  3. Initiate rehydration: Backfill missing windows from S3 archives using Rehydration
  4. Review monitors: Adjust thresholds based on incident learnings
  5. Update runbooks: Document new failure modes and resolutions

Prevention Checklist

Use these guidelines to improve resilience:

  • Add enrichment fields (region, cluster, partition, tenant) for fast scoping
  • Enable pattern extraction for broker and ingest error signatures
  • Extract metrics for error rate, consumer lag, and retry volume
  • Connect metrics to monitors with appropriate thresholds
  • Build dashboards combining traces and core metrics
  • Route low-priority telemetry to archive and validate rehydration before incidents
  • Review the service map regularly to understand dependency changes