Kubernetes Troubleshooting
When Kubernetes workloads fail, AI teammates autonomously gather pod logs, cluster events, and resource metrics to identify root causes and recommend remediation steps.
Environment setup
| Component | Purpose |
|---|---|
| Kubernetes Source | Ingest pod logs, cluster events, and resource metrics |
| Edge Delta MCP Connector | Query Kubernetes telemetry from Edge Delta backend |
| GitHub Connector | Correlate failures with deployment manifests and recent changes |
| AI Team Channel | Receive Kubernetes alerts and route to OnCall AI |
The Kubernetes Source collects pod logs, Kubernetes events, and resource metrics, uploading them to the Edge Delta backend. AI teammates query this indexed telemetry through the Edge Delta MCP connector, correlating signals across clusters to automate investigation. The GitHub connector provides access to deployment manifests and recent changes for root cause analysis.
Data flow
flowchart LR
A[Kubernetes Cluster] --> B[Kubernetes Source]
B --> C[Edge Delta Backend]
C --> D[Pattern/Metric Monitors]
D -->|Alert| E[OnCall AI]
E --> F[SRE Teammate]
F -->|Queries| G[Edge Delta MCP]
G --> C
F -->|Queries| H[GitHub]
The Kubernetes Source streams pod logs, cluster events, and resource metrics to the Edge Delta backend. Pattern and metric monitors detect anomalies such as CrashLoopBackOff events, scaling churn, or scheduling failures, then route alerts to an AI Team channel. OnCall AI delegates to SRE, who queries the backend through the Edge Delta MCP connector to correlate events across the affected workloads.
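The detection itself runs in Edge Delta's pattern and metric monitors, but the underlying signals are ordinary Kubernetes warning events. As a rough illustration only (not the monitor implementation), the following Python sketch counts those signals directly through the Kubernetes API, assuming the kubernetes client and local kubeconfig access:

```python
# Illustration only, not the Edge Delta monitor implementation: count the
# Kubernetes warning events that the monitors watch for, using the Python
# client and local kubeconfig access (both assumptions).
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Warning events across the cluster, grouped by reason.
events = core.list_event_for_all_namespaces(field_selector="type=Warning")
reasons = Counter(e.reason for e in events.items)

for reason in ("BackOff", "FailedScheduling", "FailedMount", "Unhealthy"):
    print(f"{reason}: {reasons.get(reason, 0)} warning events")
```

A sustained rise in BackOff or FailedScheduling counts is the kind of pattern that produces an alert in the AI Team channel.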
Investigation scenarios
Pod CrashLoopBackOff
When pods repeatedly crash, SRE gathers pod logs, Kubernetes events, and deployment history to identify root causes.
- OnCall AI receives a CrashLoopBackOff alert and initiates an investigation thread
- SRE queries recent pod logs and BackOff events from the affected deployment, as sketched below
- SRE correlates events to identify patterns such as missing environment variables, failed health checks, or dependency failures
- Code Analyzer reviews recent deployment changes to determine whether a configuration or code change triggered the failure
- OnCall AI recommends prioritized remediation steps, such as adding missing secrets or fixing connection strings
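For illustration, the data SRE gathers in the second and third steps can be approximated with direct Kubernetes API calls. This is a minimal sketch with hypothetical namespace and deployment names, assuming kubeconfig access; in the actual workflow SRE queries the indexed telemetry through the Edge Delta MCP connector instead:

```python
# Hypothetical names for illustration; in practice SRE runs the equivalent
# query against indexed telemetry via the Edge Delta MCP connector.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

NAMESPACE = "payments"          # assumption: affected namespace
DEPLOYMENT = "checkout-api"     # assumption: affected deployment

# Pods belonging to the deployment (assumes the standard app label).
pods = core.list_namespaced_pod(NAMESPACE, label_selector=f"app={DEPLOYMENT}")

for pod in pods.items:
    # Logs from the previous (crashed) container instance.
    try:
        logs = core.read_namespaced_pod_log(
            pod.metadata.name, NAMESPACE, previous=True, tail_lines=50
        )
        print(f"--- previous logs: {pod.metadata.name} ---\n{logs}")
    except client.ApiException:
        pass  # no previous container instance recorded yet

    # BackOff events recorded for this pod.
    events = core.list_namespaced_event(
        NAMESPACE,
        field_selector=f"involvedObject.name={pod.metadata.name},reason=BackOff",
    )
    for e in events.items:
        print(f"{e.last_timestamp}  {e.reason}  {e.message}")
```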
HPA scaling inefficiency
When Horizontal Pod Autoscaler (HPA) settings cause unnecessary scaling churn, SRE investigates whether thresholds need adjustment.
- OnCall AI receives an alert about rapid replica oscillation and initiates an investigation
- SRE queries scaling events and resource utilization metrics over the affected period, as sketched below
- SRE identifies scaling volatility patterns, such as frequent scale-up and scale-down cycles within short intervals
- SRE analyzes CPU and memory utilization to determine whether current thresholds are too aggressive
- OnCall AI recommends tuned HPA parameters and estimates the impact on cost and stability
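A minimal sketch of the scaling data involved, assuming the autoscaling/v2 API and hypothetical resource names; in the workflow above this telemetry comes from the Edge Delta backend rather than direct API calls:

```python
# Hypothetical names for illustration; actual scaling telemetry is queried
# from the Edge Delta backend in the workflow described above.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
autoscaling = client.AutoscalingV2Api()  # assumes autoscaling/v2 is available

NAMESPACE = "payments"        # assumption
HPA_NAME = "checkout-api"     # assumption

hpa = autoscaling.read_namespaced_horizontal_pod_autoscaler(HPA_NAME, NAMESPACE)
print(f"replicas: current={hpa.status.current_replicas} desired={hpa.status.desired_replicas}")
print(f"bounds:   min={hpa.spec.min_replicas} max={hpa.spec.max_replicas}")

# Rescale events over the retention window; many events close together
# suggest oscillation rather than genuine load changes.
events = core.list_namespaced_event(
    NAMESPACE,
    field_selector=f"involvedObject.name={HPA_NAME},reason=SuccessfulRescale",
)
rescales = sorted(
    events.items, key=lambda e: e.last_timestamp or e.metadata.creation_timestamp
)
print(f"{len(rescales)} rescale events retained")
for e in rescales:
    print(f"{e.last_timestamp}  {e.message}")
```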
Pod scheduling failures
When pods cannot schedule, SRE distinguishes between actual resource usage and Kubernetes allocatable capacity.
- OnCall AI receives an alert about pending pods and initiates an investigation
- SRE queries pod scheduling events and node resource allocations
- SRE correlates host-level resource usage with Kubernetes allocatable capacity to identify the bottleneck (see the sketch after this list)
- SRE evaluates whether the issue stems from resource requests, node affinity rules, or actual capacity constraints
- OnCall AI recommends remediation options such as adjusting resource requests, adding nodes, or rebalancing workloads
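The allocatable-versus-requested comparison can be illustrated with a short sketch against the Kubernetes API. The names are assumptions, and parse_quantity requires a recent version of the kubernetes Python client:

```python
# Hypothetical illustration of the allocatable-vs-requested comparison;
# assumes kubeconfig access and a recent kubernetes Python client.
from kubernetes import client, config
from kubernetes.utils import parse_quantity   # parses "500m", "2Gi", etc.

config.load_kube_config()
core = client.CoreV1Api()

# Why is each pod pending? FailedScheduling events carry the scheduler's reason.
pending = core.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
for pod in pending.items:
    events = core.list_namespaced_event(
        pod.metadata.namespace,
        field_selector=f"involvedObject.name={pod.metadata.name},reason=FailedScheduling",
    )
    for e in events.items:
        print(f"{pod.metadata.name}: {e.message}")

# Compare each node's allocatable CPU with the CPU already requested on it;
# allocatable capacity, not host-level usage, is what the scheduler checks.
for node in core.list_node().items:
    allocatable = parse_quantity(node.status.allocatable["cpu"])
    pods = core.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node.metadata.name},status.phase=Running"
    )
    requested = sum(
        parse_quantity(c.resources.requests.get("cpu", "0"))
        for p in pods.items
        for c in p.spec.containers
        if c.resources and c.resources.requests
    )
    print(f"{node.metadata.name}: requested {requested} / allocatable {allocatable} CPU")
```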