Kubernetes Troubleshooting

Automate Kubernetes troubleshooting by correlating pod logs, cluster events, and resource metrics to resolve common operational issues.

When Kubernetes workloads fail, AI teammates autonomously gather pod logs, cluster events, and resource metrics to identify root causes and recommend remediation steps.

Environment setup

| Component | Purpose |
|---|---|
| Kubernetes Source | Ingest pod logs, cluster events, and resource metrics |
| Edge Delta MCP Connector | Query Kubernetes telemetry from the Edge Delta backend |
| GitHub Connector | Correlate failures with deployment manifests and recent changes |
| AI Team Channel | Receive Kubernetes alerts and route to OnCall AI |

The Kubernetes Source collects pod logs, Kubernetes events, and resource metrics, uploading them to the Edge Delta backend. AI teammates query this indexed telemetry through the Edge Delta MCP connector, correlating signals across clusters to automate investigation. The GitHub connector provides access to deployment manifests and recent changes for root cause analysis.
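
For orientation, the sketch below pulls the same three signal types directly from a cluster with the official Kubernetes Python client: pod logs, cluster events, and resource metrics (the latter via the metrics.k8s.io API, which requires metrics-server). It only illustrates the data involved; it is not how the Kubernetes Source or the MCP connector is implemented, and the pod and namespace names are placeholders.

```python
# Illustrative only: pulls the same three signal types the Kubernetes Source
# collects (pod logs, cluster events, resource metrics) straight from a cluster.
# Requires a reachable kubeconfig; metrics also require metrics-server.
from kubernetes import client, config

config.load_kube_config()            # or config.load_incluster_config() in-cluster
core = client.CoreV1Api()

# 1. Pod logs (pod and namespace names are placeholders)
print(core.read_namespaced_pod_log("example-pod", "default", tail_lines=20))

# 2. Cluster events (warnings only)
for ev in core.list_event_for_all_namespaces(field_selector="type=Warning").items[:10]:
    print(ev.last_timestamp, ev.reason, ev.involved_object.name, ev.message)

# 3. Resource metrics via the metrics.k8s.io API
metrics = client.CustomObjectsApi().list_cluster_custom_object(
    "metrics.k8s.io", "v1beta1", "pods"
)
for item in metrics["items"][:5]:
    print(item["metadata"]["name"], item["containers"][0]["usage"])
```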

Data flow

```mermaid
flowchart LR
    A[Kubernetes Cluster] --> B[Kubernetes Source]
    B --> C[Edge Delta Backend]
    C --> D[Pattern/Metric Monitors]
    D -->|Alert| E[OnCall AI]
    E --> F[SRE Teammate]
    F -->|Queries| G[Edge Delta MCP]
    G --> C
    F -->|Queries| H[GitHub]
```

The Kubernetes Source streams pod logs, cluster events, and resource metrics to the Edge Delta backend. Pattern and metric monitors detect anomalies such as CrashLoopBackOff events, scaling churn, or scheduling failures, then route alerts to an AI Team channel. OnCall AI delegates to SRE, who queries the backend through the Edge Delta MCP connector to correlate events across the affected workloads.
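
As a rough, hypothetical illustration of the detection idea, the sketch below counts recent BackOff events per pod and emits an alert-shaped payload once a pod crosses a threshold. The real checks run as Edge Delta pattern and metric monitors configured in the backend, and the payload fields shown here are invented.

```python
# Hypothetical sketch of a CrashLoopBackOff-style pattern check. Real detection
# runs as Edge Delta pattern/metric monitors; this only makes the idea concrete.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Count BackOff events per pod across the cluster
backoffs = core.list_event_for_all_namespaces(field_selector="reason=BackOff")
per_pod = Counter(ev.involved_object.name for ev in backoffs.items)

THRESHOLD = 5  # arbitrary example threshold
for pod, count in per_pod.items():
    if count >= THRESHOLD:
        alert = {  # invented payload shape, not the Edge Delta alert format
            "monitor": "crashloopbackoff-pattern",
            "pod": pod,
            "backoff_events": count,
            "route_to": "ai-team-channel",
        }
        print(alert)
```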

Investigation scenarios

Pod CrashLoopBackOff

When pods repeatedly crash, SRE gathers pod logs, Kubernetes events, and deployment history to identify root causes (a query sketch follows the steps below).

  1. OnCall AI receives a CrashLoopBackOff alert and initiates an investigation thread
  2. SRE queries recent pod logs and BackOff events from the affected deployment
  3. SRE correlates events to identify patterns such as missing environment variables, failed health checks, or dependency failures
  4. Code Analyzer reviews recent deployment changes to determine whether a configuration or code change triggered the failure
  5. OnCall AI recommends prioritized remediation steps, such as adding missing secrets or fixing connection strings
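
A minimal sketch of the evidence gathering in steps 2 and 3, querying the cluster directly with the Kubernetes Python client rather than the Edge Delta backend; the namespace and deployment names are placeholders.

```python
# Sketch of steps 2-3: pull BackOff events and crashed-container logs for one
# deployment. Queries the cluster directly; the SRE teammate reaches the same
# signals through the Edge Delta MCP connector instead.
from kubernetes import client, config

NAMESPACE, DEPLOYMENT = "payments", "checkout-api"   # placeholders

config.load_kube_config()
core, apps = client.CoreV1Api(), client.AppsV1Api()

# Resolve the deployment's pods via its label selector
deploy = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
labels = deploy.spec.selector.match_labels
pods = core.list_namespaced_pod(
    NAMESPACE, label_selector=",".join(f"{k}={v}" for k, v in labels.items())
).items

for pod in pods:
    name = pod.metadata.name
    # BackOff events show *that* the pod keeps restarting...
    events = core.list_namespaced_event(
        NAMESPACE, field_selector=f"involvedObject.name={name},reason=BackOff"
    )
    for ev in events.items:
        print(name, ev.last_timestamp, ev.message)
    # ...while the previous container's last log lines usually show *why*
    # (missing env vars, failed health checks, unreachable dependencies).
    try:
        print(core.read_namespaced_pod_log(name, NAMESPACE, previous=True, tail_lines=50))
    except client.ApiException:
        pass  # no previous container instance to read from yet
```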

HPA scaling inefficiency

When horizontal pod autoscaler settings cause unnecessary scaling churn, SRE investigates whether thresholds need adjustment (a churn-analysis sketch follows the steps below).

  1. OnCall AI receives an alert about rapid replica oscillation and initiates an investigation
  2. SRE queries scaling events and resource utilization metrics over the affected period
  3. SRE identifies scaling volatility patterns, such as frequent scale-up and scale-down cycles within short intervals
  4. SRE analyzes CPU and memory utilization to determine whether current thresholds are too aggressive
  5. OnCall AI recommends tuned HPA parameters and estimates the impact on cost and stability
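
A minimal sketch of the churn analysis in steps 2 and 3: read the HPA's current thresholds and count SuccessfulRescale events over a recent window. The namespace, HPA name, and one-hour window are placeholders, and the autoscaling/v1 API used here only exposes the CPU target; autoscaling/v2 would be needed for memory or custom metrics.

```python
# Sketch of steps 2-3: quantify HPA churn by reading the current target and
# counting SuccessfulRescale events in the last hour. Names are placeholders.
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

NAMESPACE, HPA_NAME = "payments", "checkout-api"   # placeholders

config.load_kube_config()
core = client.CoreV1Api()
hpa = client.AutoscalingV1Api().read_namespaced_horizontal_pod_autoscaler(HPA_NAME, NAMESPACE)

print("replicas:", hpa.spec.min_replicas, "to", hpa.spec.max_replicas,
      "| CPU target %:", hpa.spec.target_cpu_utilization_percentage,
      "| current CPU %:", hpa.status.current_cpu_utilization_percentage)

# Frequent rescales inside a short window are the volatility signal from step 3
cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
events = core.list_namespaced_event(
    NAMESPACE,
    field_selector=f"involvedObject.name={HPA_NAME},reason=SuccessfulRescale",
)
recent = [ev for ev in events.items if ev.last_timestamp and ev.last_timestamp >= cutoff]
print(f"{len(recent)} rescale events in the last hour")
for ev in recent:
    print(ev.last_timestamp, ev.message)
```

Many flips inside such a window usually mean the CPU target sits too close to steady-state utilization or the scale-down stabilization window is too short, which is the kind of tuning OnCall AI surfaces in step 5.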

Pod scheduling failures

When pods fail to schedule, SRE distinguishes between actual resource usage and Kubernetes allocatable capacity (a capacity-comparison sketch follows the steps below).

  1. OnCall AI receives an alert about pending pods and initiates an investigation
  2. SRE queries pod scheduling events and node resource allocations
  3. SRE correlates host-level resource usage with Kubernetes allocatable capacity to identify the bottleneck
  4. SRE evaluates whether the issue stems from resource requests, node affinity rules, or actual capacity constraints
  5. OnCall AI recommends remediation options such as adjusting resource requests, adding nodes, or rebalancing workloads
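
A minimal sketch of the comparison in steps 2 and 3: surface FailedScheduling events for pending pods, then contrast each node's allocatable CPU with the CPU its running pods actually request. Quantity parsing is simplified to millicores, and memory is omitted for brevity.

```python
# Sketch of steps 2-3: surface FailedScheduling events, then compare requested
# vs. allocatable CPU per node. CPU parsing is simplified; memory is omitted.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

def to_millicores(cpu: str) -> int:
    """Convert Kubernetes CPU quantities such as '2' or '500m' to millicores."""
    return int(cpu[:-1]) if cpu.endswith("m") else int(float(cpu) * 1000)

# Why the scheduler says pods are pending
for ev in core.list_event_for_all_namespaces(field_selector="reason=FailedScheduling").items:
    print("PENDING:", ev.involved_object.namespace, ev.involved_object.name, "-", ev.message)

# Scheduling is gated by *requests* against allocatable capacity, not by live
# usage, so a node can look idle and still have no room for new pods.
requested = defaultdict(int)
for pod in core.list_pod_for_all_namespaces(field_selector="status.phase=Running").items:
    for c in pod.spec.containers:
        cpu = (c.resources.requests or {}).get("cpu") if c.resources else None
        if cpu and pod.spec.node_name:
            requested[pod.spec.node_name] += to_millicores(cpu)

for node in core.list_node().items:
    alloc = to_millicores(node.status.allocatable["cpu"])
    name = node.metadata.name
    print(f"{name}: {requested[name]}m requested of {alloc}m allocatable CPU")
```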

Learn more