Troubleshooting ingest failures

A troubleshooting guide showing how Edge Delta links Telemetry Pipelines, AI Team, and recovery workflows during ingest failures.

Overview

Ingest failures are not just monitoring problems. They are control problems. When Telemetry Pipelines are coupled to operational response, you can scope impact, reduce load, and recover data without losing context. This guide uses a single incident to compare two response patterns: a generic monitoring stack and an Edge Delta response.

Troubleshooting model: a controllable feedback loop

Edge Delta treats telemetry flow as a first-class control surface. The feedback loop looks like this:

  • Sources collect logs, metrics, traces, and events for Telemetry Pipelines.
  • Processors enrich data and extract patterns and metrics.
  • Destinations route data to Edge Delta for analysis, to archive for rehydration, and to AI connectors for coordination.
  • Monitors and dashboards surface thresholds. The service map and traces show dependency and latency changes.
  • AI Team uses the Edge Delta MCP connector to query logs, metrics, events, and dashboards, then proposes or stages pipeline changes.

The loop is safe because you can adjust routing and sampling at the edge, then repair missing windows with rehydration. The incident below is a concrete walkthrough.

The following diagram captures the feedback loop at a glance:

```mermaid
flowchart LR
    Sources --> Pipeline["Telemetry pipelines"]
    Pipeline --> Analysis["Monitors & AI Team"]
    Analysis --> Actions["Pipeline changes"]
    Actions --> Pipeline
```
Telemetry feedback loop

Baseline pipeline and AI Team setup

The Telemetry Pipelines configuration includes:

  • Sources: kubernetes_input for ingest services, kafka_input for broker logs, otlp_input for traces and host metrics, and http_input for control-plane change events.
  • Enrichment: OTTL transforms add region, cluster, partition, and tenant metadata so scoping is consistent across logs, metrics, traces, and events.
  • Pattern extraction: log_to_pattern_metric emits pattern metrics and anomaly events for broker errors and ingest failures.
  • Metric extraction: extract_metric derives ingest error rate, consumer lag, and retry volume from logs.
  • Destinations: ed_output for Edge Delta, s3_output for archive and rehydration, and ed_ai_event_output for AI Team event intake.
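
Taken together, those building blocks form a single pipeline definition. The sketch below shows one way the layout could look; the node types are the ones named above, but the surrounding YAML shape and the type name used for the OTTL transform are assumptions, so confirm exact syntax against the Telemetry Pipelines reference for your agent version.

```yaml
# Illustrative layout of the baseline pipeline described above. Node types
# such as kubernetes_input, kafka_input, otlp_input, http_input, ed_output,
# s3_output, and ed_ai_event_output come from this guide; field names and the
# OTTL processor type name are assumptions to be checked against the docs.
nodes:
  # Sources
  - name: ingest_services
    type: kubernetes_input
  - name: broker_logs
    type: kafka_input
  - name: traces_and_host_metrics
    type: otlp_input
  - name: change_events
    type: http_input

  # Processors
  - name: enrich_scope_fields        # OTTL transforms: region, cluster, partition, tenant
    type: ottl_transform             # assumed type name
  - name: broker_error_patterns      # pattern metrics and anomaly events
    type: log_to_pattern_metric
  - name: ingest_health_metrics      # ingest error rate, consumer lag, retry volume
    type: extract_metric

  # Destinations
  - name: edge_delta
    type: ed_output
  - name: archive
    type: s3_output
  - name: ai_team_events
    type: ed_ai_event_output
```

Sources feed the enrichment and extraction processors, which fan out to all three destinations. The routing changes in the phases below are edits to that fan-out rather than changes to the sources themselves.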

Operationally, monitors track ingest error rate, consumer lag, disk headroom, and pipeline drop rate. Dashboards summarize pipeline health and broker stability. The service map is built from traces and shows the ingest API, message bus, and storage tiers as a single path. AI Team is enabled with the Edge Delta MCP connector plus incident and change management event connectors.

Troubleshooting by phase

In this example, a cascade starts with a partial loss of broker redundancy, which creates leaderless partitions and uneven write availability. That unevenness increases retries and queue depth at the ingest frontends, which in turn accelerates disk pressure on the brokers. As consumer groups stall, SLO processing and alerting drift out of date, and the control plane becomes harder to manage just as the blast radius expands. The phases below show how that same chain of events plays out with different operational tooling.

Each phase compares the artifacts of a generic monitoring stack with an Edge Delta response. This is the same incident under two operating models.

Phase 1: Replication loss in the message bus

Broker redundancy loss leaves multiple partitions leaderless.

Generic monitoring stack artifacts:

  • Alert dedupe suppresses repeated partition errors and does not include partition or tenant scope.
  • Dashboards remain green because ingest lag panels are delayed and not tied to broker health.
  • SREs grep broker and API logs manually and reconcile impact in a shared spreadsheet.

Edge Delta response:

  • Pattern extraction creates anomaly events for leaderless partitions.
  • Enrichment adds partition and tenant, so log and metric queries scope quickly.
  • AI Team uses the MCP connector to query scoped logs and metrics, then stages a route to move debug logs to archive.
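
The fast scoping above leans entirely on the enrichment step from the baseline pipeline. A minimal sketch of the kind of OTTL statements involved is below; the source attribute keys being copied from (the Kafka partition and Kubernetes metadata fields) are assumptions, so substitute the keys your own sources actually emit.

```yaml
# Illustrative OTTL enrichment statements. The target keys (region, cluster,
# partition, tenant) come from this guide; the source attribute names are
# assumptions and will differ by collector and semantic-convention version.
- name: enrich_scope_fields
  type: ottl_transform               # assumed type name for the OTTL processor
  statements: |
    set(attributes["partition"], attributes["kafka.partition"])
    set(attributes["tenant"], resource.attributes["k8s.namespace.name"])
    set(attributes["region"], resource.attributes["cloud.region"])
    set(attributes["cluster"], resource.attributes["k8s.cluster.name"])
```

With those fields on every record, the scoped queries in this phase reduce to equality filters on tenant and partition instead of ad hoc grep patterns.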

Phase 2: Ingest frontends overload

The ingest API returns intermittent 5xx responses across the region.

Generic monitoring stack artifacts:

  • 5xx alerts trigger late because thresholds are tuned for past peaks, so the first signal is customer tickets.
  • Dashboards show elevated 5xx counts but do not correlate broker write failures or queue depth.
  • Triage requires manual sampling across API pods and brokers to infer the cause.

Edge Delta response:

  • Metric extraction emits 5xx counts and monitors trigger on ingest error rate.
  • Traces show queue latency before broker writes and the service map shows the ingest API to broker edge degrading.
  • Pipeline sampling reduces low-priority logs while preserving error logs and traces for diagnosis.
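
The sampling change in this phase has a simple intent: keep everything at error severity and above, and thin out the rest while the frontends are saturated. The node below is a hypothetical sketch of that intent only; the processor type and field names are assumptions, not Edge Delta's actual sampling syntax, so treat it as the shape of the change rather than copy-paste configuration.

```yaml
# Hypothetical severity-aware sampling during the overload. Processor type and
# field names are assumptions; the intent is to never drop error logs or
# traces and to sample low-priority logs down until pressure eases.
- name: overload_sampler
  type: sample                                                       # assumed processor type
  passthrough_condition: severity_number >= SEVERITY_NUMBER_ERROR    # errors bypass sampling
  sample_rate: 0.1                                                   # keep ~10% of the remaining logs
```

Because the change is staged as a pipeline edit, it can be reverted as soon as broker writes recover, and nothing about the error or trace path changes in the meantime.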

Phase 3: Disk pressure and emergency ingest pause

Broker disk headroom collapses as retries and backlog accumulate.

Generic monitoring stack artifacts:

  • Disk alerts fire late and are not correlated with ingest volume or retry rate.
  • Operators pause ingest entirely because there is no safe load-shedding path.
  • Data loss is accepted because archives are not usable for later replay.

Edge Delta response:

  • OTLP metrics show disk free dropping alongside retry volume, and dashboards show buffer growth by source.
  • Monitors flag low headroom early, prompting a controlled reduction in ingest load.
  • Pipeline routing sends low-priority logs to s3_output for later rehydration while keeping critical logs and traces in the live path.
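
The load shedding here is a routing decision rather than a drop: low-priority logs move to the archive destination for later rehydration while errors and traces stay on the live path. The sketch below assumes a condition-based route processor and a links-style wiring section; confirm the exact syntax in the Telemetry Pipelines documentation before relying on it.

```yaml
# Illustrative split of low-priority logs to archive during disk pressure.
# The route processor fields and the links wiring are assumptions; the
# edge_delta (ed_output) and archive (s3_output) nodes come from the
# baseline pipeline sketch earlier in this guide.
nodes:
  - name: priority_router
    type: route                      # assumed processor type
    paths:
      - path: keep_live
        condition: severity_number >= SEVERITY_NUMBER_ERROR
      - path: to_archive
        condition: severity_number < SEVERITY_NUMBER_ERROR

links:
  - from: priority_router
    path: keep_live
    to: edge_delta                   # stays in the live analysis path
  - from: priority_router
    path: to_archive
    to: archive                      # recoverable later via rehydration
```

Nothing is discarded in this arrangement; the archived windows are the same ones rehydration backfills in Phase 5.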

Phase 4: Silent consumer lag stalls SLO processing

Consumer group metadata degrades and SLO processing stalls without obvious errors.

Generic monitoring stack artifacts:

  • Consumer lag is not monitored, so health checks read as “up” while data freshness degrades.
  • SLO dashboards show normal values because data is missing or delayed, masking the outage.
  • Manual backlog checks start only after customers report missing alerts.

Edge Delta response:

  • Metric extraction turns consumer lag logs into time series, and monitors trigger on lag thresholds.
  • Dashboards show lag and SLO freshness side-by-side to highlight stale windows.
  • AI Team flags stale SLO windows and the pipeline prioritizes SLO-related telemetry.
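
The difference in this phase is that consumer lag already exists as a time series before anyone goes looking for it. A sketch of the extraction step is below; the log field carrying the lag value and the rule fields are illustrative assumptions, and the real extract_metric schema should be taken from the Telemetry Pipelines reference.

```yaml
# Illustrative consumer-lag extraction. Field names are assumptions; the idea
# is to turn the lag value already present in consumer-group logs into a
# gauge that monitors and dashboards can threshold on.
- name: consumer_lag_metric
  type: extract_metric
  rules:
    - metric_name: kafka_consumer_lag
      value: attributes["consumer_lag"]      # numeric lag field on the log record
      dimensions:
        - attributes["tenant"]
        - attributes["partition"]
```

A monitor on that gauge catches staleness even while health checks still report the consumers as up.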

Phase 5: Control plane degradation and migration

Administrative actions fail and the cluster becomes difficult to manage.

Generic monitoring stack artifacts:

  • Admin API failures are not visible in alerts, so retries continue without root cause.
  • Runbooks rely on manual CLI checks and ad hoc change tracking across teams.
  • Migration takes weeks because there is no safe reprocessing path for missing data.

Edge Delta response:

  • Change events are captured as telemetry and tied to affected services in dashboards.
  • AI Team coordinates pipeline cutover with approval steps, using pipeline history for staged deployment.
  • Rehydration backfills missing windows so service map data and dashboard data regain continuity.

What changes in practice

With Edge Delta in place, the response path changes in concrete ways:

  • Faster scoping: Enriched logs and metrics allow impact slicing by tenant, partition, and service within minutes.
  • Safer load shedding: Sampling and routing changes reduce pressure without losing critical signals.
  • Clearer incident narrative: The service map and traces show where latency and errors accumulate.
  • Higher SLO confidence: Consumer lag monitors prevent silent staleness in SLO and dashboard data.
  • Predictable recovery: Rehydration restores missing windows after disruptive changes.

Applying the troubleshooting checklist in your environment

Use the incident as a checklist for your own Telemetry Pipelines:

  • Identify the critical path and add enrichment fields that make scoping trivial.
  • Enable pattern extraction for broker and ingest error signatures so anomaly events are explicit.
  • Extract metrics for error rate, consumer lag, and retry volume and connect them to monitors.
  • Build dashboards that place traces and core metrics on the same screen.
  • Review the service map separately to spot dependency shifts during the incident.
  • Route low-priority telemetry to archive and validate a rehydration path before you need it.
  • Enable AI Team with the Edge Delta MCP connector and require approval on pipeline deployment actions.