Kubernetes Troubleshooting

Automate Kubernetes troubleshooting by correlating pod logs, cluster events, and resource metrics to resolve common operational issues.

When Kubernetes workloads fail, AI teammates autonomously gather pod logs, cluster events, and resource metrics to identify root causes and recommend remediation steps.

Environment setup

| Component | Purpose |
|---|---|
| Kubernetes Source | Ingest pod logs, cluster events, and resource metrics |
| Edge Delta MCP Connector | Query Kubernetes telemetry from the Edge Delta backend |
| GitHub Connector | Correlate failures with deployment manifests and recent changes |
| AI Team Channel | Receive Kubernetes alerts and route to OnCall AI |

The Kubernetes Source collects pod logs, Kubernetes events, and resource metrics, uploading them to the Edge Delta backend. AI teammates query this indexed telemetry through the Edge Delta MCP connector, correlating signals across clusters to automate investigation. The GitHub connector provides access to deployment manifests and recent changes for root cause analysis.
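
For orientation, the sketch below pulls the same three signal types directly from a cluster with the official Kubernetes Python client: pod logs, cluster events, and resource metrics (the latter via the metrics.k8s.io API, which requires metrics-server). It only illustrates the data involved; it is not how the Kubernetes Source or the MCP connector is implemented, and the pod and namespace names are placeholders.

```python
# Illustrative only: pulls the same three signal types the Kubernetes Source
# collects (pod logs, cluster events, resource metrics) straight from a cluster.
# Requires a reachable kubeconfig; metrics also require metrics-server.
from kubernetes import client, config

config.load_kube_config()            # or config.load_incluster_config() in-cluster
core = client.CoreV1Api()

# 1. Pod logs (pod and namespace names are placeholders)
print(core.read_namespaced_pod_log("example-pod", "default", tail_lines=20))

# 2. Cluster events (warnings only)
for ev in core.list_event_for_all_namespaces(field_selector="type=Warning").items[:10]:
    print(ev.last_timestamp, ev.reason, ev.involved_object.name, ev.message)

# 3. Resource metrics via the metrics.k8s.io API
metrics = client.CustomObjectsApi().list_cluster_custom_object(
    "metrics.k8s.io", "v1beta1", "pods"
)
for item in metrics["items"][:5]:
    print(item["metadata"]["name"], item["containers"][0]["usage"])
```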

Data flow

```mermaid
flowchart LR
    A[Kubernetes Cluster] --> B[Kubernetes Source]
    B --> C[Edge Delta Backend]
    C --> D[Pattern/Metric Monitors]
    D -->|Alert| E[OnCall AI]
    E --> F[SRE Teammate]
    F -->|Queries| G[Edge Delta MCP]
    G --> C
    F -->|Queries| H[GitHub]
```

The Kubernetes Source streams pod logs, cluster events, and resource metrics to the Edge Delta backend. Pattern and metric monitors detect anomalies such as CrashLoopBackOff events, scaling churn, or scheduling failures, then route alerts to an AI Team channel. OnCall AI delegates to SRE, who queries the backend through the Edge Delta MCP connector to correlate events across the affected workloads.
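
As a rough, hypothetical illustration of the detection idea, the sketch below counts recent BackOff events per pod and emits an alert-shaped payload once a pod crosses a threshold. The real checks run as Edge Delta pattern and metric monitors configured in the backend, and the payload fields shown here are invented.

```python
# Hypothetical sketch of a CrashLoopBackOff-style pattern check. Real detection
# runs as Edge Delta pattern/metric monitors; this only makes the idea concrete.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Count BackOff events per pod across the cluster
backoffs = core.list_event_for_all_namespaces(field_selector="reason=BackOff")
per_pod = Counter(ev.involved_object.name for ev in backoffs.items)

THRESHOLD = 5  # arbitrary example threshold
for pod, count in per_pod.items():
    if count >= THRESHOLD:
        alert = {  # invented payload shape, not the Edge Delta alert format
            "monitor": "crashloopbackoff-pattern",
            "pod": pod,
            "backoff_events": count,
            "route_to": "ai-team-channel",
        }
        print(alert)
```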

Investigation scenarios

Pod CrashLoopBackOff

When pods repeatedly crash, SRE gathers pod logs, Kubernetes events, and deployment history to identify root causes (a query sketch follows the steps below).

  1. OnCall AI receives a CrashLoopBackOff alert and initiates an investigation thread
  2. SRE queries recent pod logs and BackOff events from the affected deployment
  3. SRE correlates events to identify patterns such as missing environment variables, failed health checks, or dependency failures
  4. Code Analyzer reviews recent deployment changes to determine whether a configuration or code change triggered the failure
  5. OnCall AI recommends prioritized remediation steps, such as adding missing secrets or fixing connection strings
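
A minimal sketch of the evidence gathering in steps 2 and 3, querying the cluster directly with the Kubernetes Python client rather than the Edge Delta backend; the namespace and deployment names are placeholders.

```python
# Sketch of steps 2-3: pull BackOff events and crashed-container logs for one
# deployment. Queries the cluster directly; the SRE teammate reaches the same
# signals through the Edge Delta MCP connector instead.
from kubernetes import client, config

NAMESPACE, DEPLOYMENT = "payments", "checkout-api"   # placeholders

config.load_kube_config()
core, apps = client.CoreV1Api(), client.AppsV1Api()

# Resolve the deployment's pods via its label selector
deploy = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
labels = deploy.spec.selector.match_labels
pods = core.list_namespaced_pod(
    NAMESPACE, label_selector=",".join(f"{k}={v}" for k, v in labels.items())
).items

for pod in pods:
    name = pod.metadata.name
    # BackOff events show *that* the pod keeps restarting...
    events = core.list_namespaced_event(
        NAMESPACE, field_selector=f"involvedObject.name={name},reason=BackOff"
    )
    for ev in events.items:
        print(name, ev.last_timestamp, ev.message)
    # ...while the previous container's last log lines usually show *why*
    # (missing env vars, failed health checks, unreachable dependencies).
    try:
        print(core.read_namespaced_pod_log(name, NAMESPACE, previous=True, tail_lines=50))
    except client.ApiException:
        pass  # no previous container instance to read from yet
```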

HPA scaling inefficiency

When horizontal pod autoscaler settings cause unnecessary scaling churn, SRE investigates whether thresholds need adjustment (a churn-analysis sketch follows the steps below).

  1. OnCall AI receives an alert about rapid replica oscillation and initiates an investigation
  2. SRE queries scaling events and resource utilization metrics over the affected period
  3. SRE identifies scaling volatility patterns, such as frequent scale-up and scale-down cycles within short intervals
  4. SRE analyzes CPU and memory utilization to determine whether current thresholds are too aggressive
  5. OnCall AI recommends tuned HPA parameters and estimates the impact on cost and stability
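
A minimal sketch of the churn analysis in steps 2 and 3: read the HPA's current thresholds and count SuccessfulRescale events over a recent window. The namespace, HPA name, and one-hour window are placeholders, and the autoscaling/v1 API used here only exposes the CPU target; autoscaling/v2 would be needed for memory or custom metrics.

```python
# Sketch of steps 2-3: quantify HPA churn by reading the current target and
# counting SuccessfulRescale events in the last hour. Names are placeholders.
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

NAMESPACE, HPA_NAME = "payments", "checkout-api"   # placeholders

config.load_kube_config()
core = client.CoreV1Api()
hpa = client.AutoscalingV1Api().read_namespaced_horizontal_pod_autoscaler(HPA_NAME, NAMESPACE)

print("replicas:", hpa.spec.min_replicas, "to", hpa.spec.max_replicas,
      "| CPU target %:", hpa.spec.target_cpu_utilization_percentage,
      "| current CPU %:", hpa.status.current_cpu_utilization_percentage)

# Frequent rescales inside a short window are the volatility signal from step 3
cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
events = core.list_namespaced_event(
    NAMESPACE,
    field_selector=f"involvedObject.name={HPA_NAME},reason=SuccessfulRescale",
)
recent = [ev for ev in events.items if ev.last_timestamp and ev.last_timestamp >= cutoff]
print(f"{len(recent)} rescale events in the last hour")
for ev in recent:
    print(ev.last_timestamp, ev.message)
```

Many flips inside such a window usually mean the CPU target sits too close to steady-state utilization or the scale-down stabilization window is too short, which is the kind of tuning OnCall AI surfaces in step 5.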

Pod scheduling failures

When pods fail to schedule, SRE distinguishes between actual resource usage and Kubernetes allocatable capacity (a capacity-comparison sketch follows the steps below).

  1. OnCall AI receives an alert about pending pods and initiates an investigation
  2. SRE queries pod scheduling events and node resource allocations
  3. SRE correlates host-level resource usage with Kubernetes allocatable capacity to identify the bottleneck
  4. SRE evaluates whether the issue stems from resource requests, node affinity rules, or actual capacity constraints
  5. OnCall AI recommends remediation options such as adjusting resource requests, adding nodes, or rebalancing workloads
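
A minimal sketch of the comparison in steps 2 and 3: surface FailedScheduling events for pending pods, then contrast each node's allocatable CPU with the CPU its running pods actually request. Quantity parsing is simplified to millicores, and memory is omitted for brevity.

```python
# Sketch of steps 2-3: surface FailedScheduling events, then compare requested
# vs. allocatable CPU per node. CPU parsing is simplified; memory is omitted.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

def to_millicores(cpu: str) -> int:
    """Convert Kubernetes CPU quantities such as '2' or '500m' to millicores."""
    return int(cpu[:-1]) if cpu.endswith("m") else int(float(cpu) * 1000)

# Why the scheduler says pods are pending
for ev in core.list_event_for_all_namespaces(field_selector="reason=FailedScheduling").items:
    print("PENDING:", ev.involved_object.namespace, ev.involved_object.name, "-", ev.message)

# Scheduling is gated by *requests* against allocatable capacity, not by live
# usage, so a node can look idle and still have no room for new pods.
requested = defaultdict(int)
for pod in core.list_pod_for_all_namespaces(field_selector="status.phase=Running").items:
    for c in pod.spec.containers:
        cpu = (c.resources.requests or {}).get("cpu") if c.resources else None
        if cpu and pod.spec.node_name:
            requested[pod.spec.node_name] += to_millicores(cpu)

for node in core.list_node().items:
    alloc = to_millicores(node.status.allocatable["cpu"])
    name = node.metadata.name
    print(f"{name}: {requested[name]}m requested of {alloc}m allocatable CPU")
```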

Learn more