Ingest Failure Investigation
When ingest failures cascade across message buses, frontends, and storage tiers, AI teammates correlate telemetry from each layer to identify the root cause and coordinate pipeline adjustments.
Environment Setup
| Component | Purpose |
|---|---|
| Edge Delta MCP Connector | Query logs, metrics, traces, and dashboards from Edge Delta backend |
| GitHub Connector | Check recent deployments for consumer code or configuration changes |
| AWS Connector | Query infrastructure metrics and resource limits (optional) |
| Kubernetes Source | Ingest logs from Kafka brokers, ingest frontends, and storage services |
| Log to Pattern Processor | Extract pattern metrics for broker errors and ingest failures |
| Pattern Anomaly Monitor | Detect broker errors and ingest failure patterns |
| Monitor Notifications | Route alerts to an AI Team channel |
| AI Team Channel | Receive monitor notifications and route to OnCall AI |
Configure a Kubernetes Source to capture logs from Kafka brokers, ingest APIs, and storage services. Use OTTL transforms to enrich logs with region, cluster, partition, and tenant metadata for precise scoping during investigations. Add a Log to Pattern processor to extract pattern metrics for broker errors and ingest failures, and configure metric extraction for ingest error rate, consumer lag, and retry volume. A Pattern Anomaly Monitor watches for pattern spikes and routes notifications to an AI Team channel. The Edge Delta MCP and GitHub connectors are provisioned for the AI team, enabling teammates to query telemetry and correlate it with recent deployments. Optionally, add the AWS connector to check for infrastructure-level resource limits or throttling.
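The exact OTTL statements depend on your pipeline, but the enrichment step amounts to copying cluster and tenant metadata onto every log record so later queries can be scoped precisely. A minimal Python sketch of that logic, assuming records are dictionaries and using illustrative field names rather than the connector's actual schema:

```python
# Illustrative sketch of the enrichment performed by the OTTL transform.
# The resource/attribute keys below are assumptions, not Edge Delta's schema.
def enrich(record: dict) -> dict:
    resource = record.get("resource", {})
    attrs = record.setdefault("attributes", {})
    attrs["region"] = resource.get("cloud.region", "unknown")
    attrs["cluster"] = resource.get("k8s.cluster.name", "unknown")
    attrs["partition"] = attrs.get("kafka.partition", "unknown")
    attrs["tenant"] = attrs.get("tenant.id", "unknown")
    return record

log = {
    "resource": {"cloud.region": "us-east-1", "k8s.cluster.name": "ingest-prod"},
    "attributes": {"kafka.partition": "42", "tenant.id": "acme"},
}
print(enrich(log)["attributes"])
```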
Data Flow
flowchart LR
A[Kafka Brokers] --> B[Telemetry Pipeline]
C[Ingest APIs] --> B
D[Storage Services] --> B
B --> E[Edge Delta Backend]
E --> F[Pattern Anomaly Monitor]
F -->|Anomaly Detected| G[OnCall AI]
G --> H[SRE Teammate]
G --> I[Code Analyzer]
H -->|Queries| J[Edge Delta MCP]
J --> E
I -->|Queries| K[GitHub]

Logs from Kafka brokers, ingest APIs, and storage services flow through the Telemetry Pipeline to the Edge Delta backend. The Log to Pattern processor extracts pattern metrics, and the Pattern Anomaly Monitor compares current patterns against baselines. When broker errors or ingest failures spike, the monitor notifies OnCall AI through the configured channel. SRE queries across the ingest stack using the Edge Delta MCP connector to identify the root cause, while Code Analyzer checks GitHub for recent deployments that may have introduced the failure.
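Conceptually, the monitor's check is a baseline comparison: count occurrences of each pattern per window and flag windows that deviate sharply from recent history. The sketch below illustrates the idea with a simple z-score; the real monitor's baselining is configured in Edge Delta, and the threshold and window counts here are illustrative.

```python
from statistics import mean, stdev

def is_anomalous(history: list[int], current: int, z_threshold: float = 3.0) -> bool:
    """Flag a pattern whose current window count deviates sharply from its baseline."""
    if len(history) < 2:
        return False
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return current > baseline
    return (current - baseline) / spread > z_threshold

# Per-minute counts of a "broker error" pattern, then a sudden spike.
history = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5]
print(is_anomalous(history, current=48))  # True -> notify the AI Team channel
```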
Investigation Workflow
- OnCall AI receives pattern anomaly events indicating broker errors or ingest failures and initiates an investigation thread
- SRE queries logs scoped by partition, tenant, and region to identify affected services and understand the blast radius
- SRE examines metrics for consumer lag, retry volume, and error rates across the ingest stack (a lag-computation sketch follows this list)
- Code Analyzer reviews recent deployments for changes to consumer code, partition assignments, or configuration
- SRE reviews the service map to understand dependency relationships between brokers, ingest APIs, and storage services
- SRE correlates disk metrics with buffer growth to identify resource pressure points
- OnCall AI synthesizes findings and proposes remediation steps, such as sampling low-priority logs, routing to archive, or rolling back a problematic deployment
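The lag check above reduces to end offset minus committed offset per partition. A minimal sketch with the kafka-python client; the broker address, topic, and consumer group are placeholders, not values from this setup.

```python
from kafka import KafkaConsumer, TopicPartition

# Placeholder connection details; substitute your brokers, topic, and group.
consumer = KafkaConsumer(
    bootstrap_servers="kafka-0.kafka:9092",
    group_id="ingest-consumers",
    enable_auto_commit=False,
)
topic = "ingest-events"
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
end_offsets = consumer.end_offsets(partitions)

for tp in sorted(partitions, key=lambda tp: tp.partition):
    committed = consumer.committed(tp) or 0
    print(f"partition={tp.partition} lag={end_offsets[tp] - committed}")
```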
Failure Scenarios
Replication loss: When the loss of broker redundancy leaves partitions leaderless, SRE queries logs by partition and tenant to identify affected services, then OnCall AI proposes routing debug logs to archive to reduce broker load.
Frontend overload: When ingest APIs return intermittent 5xx responses, SRE examines traces showing queue latency before broker writes and uses the service map to identify the degrading edge. OnCall AI proposes pipeline sampling to reduce low-priority logs while preserving error logs and traces.
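The sampling OnCall AI proposes here keeps every error log and trace while admitting only a fraction of low-priority records. A small sketch of that policy; the severity field name and sample rate are illustrative, not the pipeline's actual configuration.

```python
import random

ALWAYS_KEEP = {"error", "fatal"}

def admit(record: dict, low_priority_rate: float = 0.1) -> bool:
    """Keep all error-level logs and all traces; sample everything else."""
    if record.get("kind") == "trace":
        return True
    severity = record.get("attributes", {}).get("severity", "info").lower()
    if severity in ALWAYS_KEEP:
        return True
    return random.random() < low_priority_rate

records = [
    {"kind": "log", "attributes": {"severity": "debug"}},
    {"kind": "log", "attributes": {"severity": "error"}},
    {"kind": "trace", "attributes": {}},
]
print([admit(r) for r in records])  # errors and traces are always kept
```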
Disk pressure: When broker disk headroom collapses, SRE correlates disk metrics with retry volume and buffer growth. OnCall AI proposes routing low-priority logs to S3 for later rehydration.
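One way to confirm that chain is to check how tightly disk utilization tracks retry volume and buffer growth over the same windows. A sketch using the standard library's correlation (Python 3.10+); the sample series are illustrative, not real measurements.

```python
from statistics import correlation

# Illustrative per-minute samples over the same ten windows.
disk_used_pct = [62, 64, 67, 71, 76, 80, 84, 88, 91, 94]
retry_volume = [120, 150, 210, 260, 340, 400, 480, 560, 610, 700]
buffer_mb = [512, 540, 600, 690, 800, 930, 1080, 1240, 1380, 1550]

print(f"disk vs retries: {correlation(disk_used_pct, retry_volume):.2f}")
print(f"disk vs buffers: {correlation(disk_used_pct, buffer_mb):.2f}")
```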
Consumer lag: When consumer groups stall without obvious errors, SRE queries consumer lag metrics and reviews dashboards showing lag and SLO freshness side-by-side. OnCall AI flags stale SLO windows and proposes prioritizing SLO-related telemetry.
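The link between lag and freshness is simple arithmetic: at the group's current consume rate, the backlog translates into how stale fully processed data is. A rough sketch with illustrative numbers; real dashboards would use measured rates.

```python
def estimated_staleness_seconds(lag_records: int, consume_rate_per_sec: float) -> float:
    """Time for the consumer group to work through its current backlog."""
    if consume_rate_per_sec <= 0:
        return float("inf")  # stalled consumers: freshness only degrades
    return lag_records / consume_rate_per_sec

lag = 1_800_000        # records behind, summed across partitions
rate = 2_500.0         # records/sec the group is currently processing
slo_freshness_s = 300  # e.g. data must be queryable within 5 minutes
staleness = estimated_staleness_seconds(lag, rate)
print(staleness, staleness > slo_freshness_s)  # 720.0 True -> stale SLO window
```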
Recovery coordination: After stabilization, SRE verifies service map data and dashboard continuity. OnCall AI coordinates pipeline cutover with approval steps, and SRE initiates rehydration to backfill missing windows from S3 archives.
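Rehydration starts from knowing which archived objects cover the missing window. Assuming archives land in S3 under time-based key prefixes (the bucket name and hourly prefix layout below are assumptions), a boto3 sketch to enumerate candidates:

```python
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")
bucket = "ingest-archive"  # illustrative bucket name
window_start = datetime(2024, 6, 1, 14, 0, tzinfo=timezone.utc)
window_end = window_start + timedelta(hours=2)

keys = []
hour = window_start
while hour < window_end:
    prefix = hour.strftime("logs/%Y/%m/%d/%H/")  # assumed hourly prefix layout
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    hour += timedelta(hours=1)

print(f"{len(keys)} archived objects to rehydrate for the missing window")
```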