Ingest Failure Investigation
When ingest failures cascade across message buses, frontends, and storage tiers, AI teammates correlate telemetry from each layer to identify the root cause and coordinate pipeline adjustments.
Environment Setup
| Component | Purpose |
|---|---|
| Edge Delta MCP Connector | Query logs, metrics, traces, and dashboards from Edge Delta backend |
| GitHub Connector | Check recent deployments for consumer code or configuration changes |
| AWS Connector | Query infrastructure metrics and resource limits (optional) |
| Kubernetes Source | Ingest logs from Kafka brokers, ingest frontends, and storage services |
| Log to Pattern Processor | Extract pattern metrics for broker errors and ingest failures |
| Pattern Anomaly Monitor | Detect broker errors and ingest failure patterns |
| Monitor Notifications | Route alerts to an AI Team channel |
| AI Team Channel | Receive monitor notifications and route to OnCall AI |
Configure a Kubernetes Source to capture logs from Kafka brokers, ingest APIs, and storage services. Use OTTL transforms to enrich logs with region, cluster, partition, and tenant metadata for precise scoping during investigations. Add a Log to Pattern processor to extract pattern metrics for broker errors and ingest failures, and configure metric extraction for ingest error rate, consumer lag, and retry volume. A Pattern Anomaly Monitor watches for pattern spikes and routes notifications to an AI Team channel. The Edge Delta MCP and GitHub connectors are provisioned for the AI team, enabling teammates to query telemetry and correlate it with recent deployments. Optionally, add the AWS connector to check for infrastructure-level resource limits or throttling.
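The exact OTTL statements depend on your pipeline, but the enrichment step amounts to copying cluster and tenant metadata onto every log record so later queries can be scoped precisely. A minimal Python sketch of that logic, assuming records are dictionaries and using illustrative field names rather than the connector's actual schema:

```python
# Illustrative sketch of the enrichment performed by the OTTL transform.
# The resource/attribute keys below are assumptions, not Edge Delta's schema.
def enrich(record: dict) -> dict:
    resource = record.get("resource", {})
    attrs = record.setdefault("attributes", {})
    attrs["region"] = resource.get("cloud.region", "unknown")
    attrs["cluster"] = resource.get("k8s.cluster.name", "unknown")
    attrs["partition"] = attrs.get("kafka.partition", "unknown")
    attrs["tenant"] = attrs.get("tenant.id", "unknown")
    return record

log = {
    "resource": {"cloud.region": "us-east-1", "k8s.cluster.name": "ingest-prod"},
    "attributes": {"kafka.partition": "42", "tenant.id": "acme"},
}
print(enrich(log)["attributes"])
```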
Data Flow
flowchart LR
A[Kafka Brokers] --> B[Telemetry Pipeline]
C[Ingest APIs] --> B
D[Storage Services] --> B
B --> E[Edge Delta Backend]
E --> F[Pattern Anomaly Monitor]
F -->|Anomaly Detected| G[OnCall AI]
G --> H[SRE Teammate]
G --> I[Code Analyzer]
H -->|Queries| J[Edge Delta MCP]
J --> E
I -->|Queries| K[GitHub]

Logs from Kafka brokers, ingest APIs, and storage services flow through the Telemetry Pipeline to the Edge Delta backend. The Log to Pattern processor extracts pattern metrics, and the Pattern Anomaly Monitor compares current patterns against baselines. When broker errors or ingest failures spike, the monitor notifies OnCall AI through the configured channel. SRE queries across the ingest stack using the Edge Delta MCP connector to identify the root cause, while Code Analyzer checks GitHub for recent deployments that may have introduced the failure.
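Conceptually, the monitor's check is a baseline comparison: count occurrences of each pattern per window and flag windows that deviate sharply from recent history. The sketch below illustrates the idea with a simple z-score; the real monitor's baselining is configured in Edge Delta, and the threshold and window counts here are illustrative.

```python
from statistics import mean, stdev

def is_anomalous(history: list[int], current: int, z_threshold: float = 3.0) -> bool:
    """Flag a pattern whose current window count deviates sharply from its baseline."""
    if len(history) < 2:
        return False
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return current > baseline
    return (current - baseline) / spread > z_threshold

# Per-minute counts of a "broker error" pattern, then a sudden spike.
history = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5]
print(is_anomalous(history, current=48))  # True -> notify the AI Team channel
```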
Investigation Workflow
- OnCall AI receives pattern anomaly events indicating broker errors or ingest failures and initiates an investigation thread
- SRE queries logs scoped by partition, tenant, and region to identify affected services and understand the blast radius
- SRE examines metrics for consumer lag, retry volume, and error rates across the ingest stack (a lag-computation sketch follows this list)
- Code Analyzer reviews recent deployments for changes to consumer code, partition assignments, or configuration
- SRE reviews the service map to understand dependency relationships between brokers, ingest APIs, and storage services
- SRE correlates disk metrics with buffer growth to identify resource pressure points
- OnCall AI synthesizes findings and proposes remediation steps, such as sampling low-priority logs, routing to archive, or rolling back a problematic deployment
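The lag check above reduces to end offset minus committed offset per partition. A minimal sketch with the kafka-python client; the broker address, topic, and consumer group are placeholders, not values from this setup.

```python
from kafka import KafkaConsumer, TopicPartition

# Placeholder connection details; substitute your brokers, topic, and group.
consumer = KafkaConsumer(
    bootstrap_servers="kafka-0.kafka:9092",
    group_id="ingest-consumers",
    enable_auto_commit=False,
)
topic = "ingest-events"
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
end_offsets = consumer.end_offsets(partitions)

for tp in sorted(partitions, key=lambda tp: tp.partition):
    committed = consumer.committed(tp) or 0
    print(f"partition={tp.partition} lag={end_offsets[tp] - committed}")
```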
Failure Scenarios
Replication loss: When the loss of broker redundancy leaves partitions leaderless, SRE queries logs by partition and tenant to identify affected services, then OnCall AI proposes routing debug logs to archive to reduce broker load.
Frontend overload: When ingest APIs return intermittent 5xx responses, SRE examines traces showing queue latency before broker writes and uses the service map to identify the degrading edge. OnCall AI proposes pipeline sampling to reduce low-priority logs while preserving error logs and traces.
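The sampling OnCall AI proposes here keeps every error log and trace while admitting only a fraction of low-priority records. A small sketch of that policy; the severity field name and sample rate are illustrative, not the pipeline's actual configuration.

```python
import random

ALWAYS_KEEP = {"error", "fatal"}

def admit(record: dict, low_priority_rate: float = 0.1) -> bool:
    """Keep all error-level logs and all traces; sample everything else."""
    if record.get("kind") == "trace":
        return True
    severity = record.get("attributes", {}).get("severity", "info").lower()
    if severity in ALWAYS_KEEP:
        return True
    return random.random() < low_priority_rate

records = [
    {"kind": "log", "attributes": {"severity": "debug"}},
    {"kind": "log", "attributes": {"severity": "error"}},
    {"kind": "trace", "attributes": {}},
]
print([admit(r) for r in records])  # errors and traces are always kept
```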
Disk pressure: When broker disk headroom collapses, SRE correlates disk metrics with retry volume and buffer growth. OnCall AI proposes routing low-priority logs to S3 for later rehydration.
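One way to confirm that chain is to check how tightly disk utilization tracks retry volume and buffer growth over the same windows. A sketch using the standard library's correlation (Python 3.10+); the sample series are illustrative, not real measurements.

```python
from statistics import correlation

# Illustrative per-minute samples over the same ten windows.
disk_used_pct = [62, 64, 67, 71, 76, 80, 84, 88, 91, 94]
retry_volume = [120, 150, 210, 260, 340, 400, 480, 560, 610, 700]
buffer_mb = [512, 540, 600, 690, 800, 930, 1080, 1240, 1380, 1550]

print(f"disk vs retries: {correlation(disk_used_pct, retry_volume):.2f}")
print(f"disk vs buffers: {correlation(disk_used_pct, buffer_mb):.2f}")
```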
Consumer lag: When consumer groups stall without obvious errors, SRE queries consumer lag metrics and reviews dashboards showing lag and SLO freshness side-by-side. OnCall AI flags stale SLO windows and proposes prioritizing SLO-related telemetry.
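The link between lag and freshness is simple arithmetic: at the group's current consume rate, the backlog translates into how stale fully processed data is. A rough sketch with illustrative numbers; real dashboards would use measured rates.

```python
def estimated_staleness_seconds(lag_records: int, consume_rate_per_sec: float) -> float:
    """Time for the consumer group to work through its current backlog."""
    if consume_rate_per_sec <= 0:
        return float("inf")  # stalled consumers: freshness only degrades
    return lag_records / consume_rate_per_sec

lag = 1_800_000        # records behind, summed across partitions
rate = 2_500.0         # records/sec the group is currently processing
slo_freshness_s = 300  # e.g. data must be queryable within 5 minutes
staleness = estimated_staleness_seconds(lag, rate)
print(staleness, staleness > slo_freshness_s)  # 720.0 True -> stale SLO window
```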
Recovery coordination: After stabilization, SRE verifies service map data and dashboard continuity. OnCall AI coordinates pipeline cutover with approval steps, and SRE initiates rehydration to backfill missing windows from S3 archives.
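Rehydration starts from knowing which archived objects cover the missing window. Assuming archives land in S3 under time-based key prefixes (the bucket name and hourly prefix layout below are assumptions), a boto3 sketch to enumerate candidates:

```python
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")
bucket = "ingest-archive"  # illustrative bucket name
window_start = datetime(2024, 6, 1, 14, 0, tzinfo=timezone.utc)
window_end = window_start + timedelta(hours=2)

keys = []
hour = window_start
while hour < window_end:
    prefix = hour.strftime("logs/%Y/%m/%d/%H/")  # assumed hourly prefix layout
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    hour += timedelta(hours=1)

print(f"{len(keys)} archived objects to rehydrate for the missing window")
```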