CI/CD Pipeline Failure Investigation
When build failures block deployments, teammates collaborate to determine whether the issue stems from code changes, test flakiness, or environmental problems.
Environment Setup
| Component | Purpose |
|---|---|
| CircleCI Connector | Receive build failure webhooks and retrieve job metadata, logs, and test results |
| GitHub Connector | Access pull request context, diffs, and code ownership |
| Edge Delta MCP Connector | Query infrastructure metrics if environmental issues are suspected |
| AI Team Channel | Receive build failure notifications and route them to OnCall AI |
Configure the CircleCI connector with webhooks enabled to receive build failure notifications automatically. The CircleCI connector normalizes build results, test outputs, and job metadata into structured data that Code Analyzer can query. The GitHub connector provides pull request context, diffs, and code ownership signals. Add the Edge Delta MCP connector to query infrastructure metrics when environmental factors are suspected.
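To make the idea of "normalizing" a webhook concrete, here is a minimal sketch of turning a build failure event into a structured record. It is an illustration only, not the connector's implementation: `BuildFailure` and `normalize_event` are hypothetical names, and the payload field names (`type`, `job.status`, `pipeline.vcs.branch`, and so on) are assumed to follow the general shape of CircleCI webhook payloads, so verify them against the CircleCI webhook reference before relying on them.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BuildFailure:
    """Hypothetical structured record for one failed CI job."""
    project: str
    job_name: str
    job_number: Optional[int]
    branch: Optional[str]
    revision: Optional[str]


def normalize_event(payload: dict[str, Any]) -> Optional[BuildFailure]:
    """Return a BuildFailure for failed job events, or None for anything else.

    The field names read here are assumptions about the webhook payload shape.
    """
    job = payload.get("job") or {}
    if payload.get("type") != "job-completed" or job.get("status") != "failed":
        return None
    vcs = (payload.get("pipeline") or {}).get("vcs") or {}
    return BuildFailure(
        project=(payload.get("project") or {}).get("slug", "unknown"),
        job_name=job.get("name", "unknown"),
        job_number=job.get("number"),
        branch=vcs.get("branch"),
        revision=vcs.get("revision"),
    )


# Illustrative payload with made-up values.
sample = {
    "type": "job-completed",
    "project": {"slug": "github/example-org/example-repo"},
    "job": {"name": "unit-tests", "status": "failed", "number": 128},
    "pipeline": {"vcs": {"branch": "feature/login", "revision": "abc1234"}},
}

print(normalize_event(sample))
```

Events that are not failed jobs return `None`, so only genuine build failures would be passed on for investigation.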
Data Flow
flowchart LR
A[CircleCI Build Failure] -->|Webhook| B[AI Team Channel]
B --> C[OnCall AI]
C --> D[Code Analyzer]
C --> E[SRE Teammate]
D -->|Queries| F[CircleCI Connector]
D -->|Queries| G[GitHub]
E -->|Queries| H[Edge Delta MCP]

When CircleCI sends a build failure webhook, the event flows through an AI Team channel to OnCall AI. Code Analyzer leads the investigation, pulling structured job metadata from the CircleCI connector and correlating with GitHub context. SRE joins if environmental factors are suspected, querying infrastructure metrics through the Edge Delta MCP connector.
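The decision to bring in SRE can be pictured as a simple check for environmental signals in the job log. The sketch below is a rough heuristic under stated assumptions, not how OnCall AI actually decides: the regular expressions, `environmental_signals`, and `teammates_for` are all hypothetical, chosen to match the two examples of environmental issues this page mentions (resource exhaustion and network connectivity).

```python
import re

# Hypothetical log patterns; real signals would depend on the build environment.
ENVIRONMENT_PATTERNS = {
    "resource exhaustion": re.compile(
        r"out of memory|\boom|no space left on device", re.IGNORECASE
    ),
    "network connectivity": re.compile(
        r"connection (refused|reset|timed out)|could not resolve host",
        re.IGNORECASE,
    ),
}


def environmental_signals(log_text: str) -> list[str]:
    """Return the labels of every environmental pattern found in the log."""
    return [
        label
        for label, pattern in ENVIRONMENT_PATTERNS.items()
        if pattern.search(log_text)
    ]


def teammates_for(log_text: str) -> list[str]:
    """Code Analyzer always investigates; SRE joins on environmental signals."""
    team = ["Code Analyzer"]
    if environmental_signals(log_text):
        team.append("SRE")
    return team


print(teammates_for("curl: (6) Could not resolve host: registry.example.com"))
print(teammates_for("AssertionError: expected 3, got 4"))
```

A keyword match like this only suggests where to look. The infrastructure metrics queried through the Edge Delta MCP connector are what would confirm or rule out an environmental cause.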
Investigation Workflow
- OnCall AI receives a build failure notification from CircleCI and initiates an investigation thread
- Code Analyzer pulls structured job metadata and test results from CircleCI, including JUnit artifacts, and enriches them with GitHub pull request context (diffs, recent commits, code ownership)
- Code Analyzer queries historical CI telemetry to determine whether this failure represents a new regression or recurring flakiness
- SRE correlates build timing with infrastructure metrics if environmental issues are suspected (such as resource exhaustion or network connectivity problems)
- OnCall AI synthesizes findings and recommends either re-running the build (for transient issues) or specific code fixes with links to the relevant changes
This workflow distinguishes between code problems requiring developer attention and environmental issues that resolve through re-runs, reducing time spent investigating false positives.
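The regression-versus-flakiness step can be sketched as a comparison against earlier runs of the same test. This is one possible heuristic, assuming history is available as a list of past outcomes. `PastRun`, `classify_failure`, and `recommend` are hypothetical names, and the rules are illustrative rather than the logic Code Analyzer uses.

```python
from dataclasses import dataclass


@dataclass
class PastRun:
    """Hypothetical record of one earlier run of the failing test."""
    revision: str
    passed: bool


def classify_failure(history: list[PastRun], failing_revision: str) -> str:
    """Classify a failure using earlier runs (the current failure excluded)."""
    same_revision = [r for r in history if r.revision == failing_revision]
    if any(r.passed for r in same_revision):
        # Identical code both passed and failed, so the code is not the cause.
        return "recurring flakiness"
    earlier = [r for r in history if r.revision != failing_revision]
    if not earlier:
        return "undetermined"
    failures = sum(1 for r in earlier if not r.passed)
    if failures == 0:
        # The test was stable until this revision.
        return "new regression"
    if failures < len(earlier):
        # Intermittent failures on older revisions point to flakiness.
        return "recurring flakiness"
    return "undetermined"


def recommend(classification: str, failing_revision: str) -> str:
    """Map a classification to the kind of recommendation described above."""
    if classification == "recurring flakiness":
        return "Re-run the build"
    if classification == "new regression":
        return f"Review the changes in {failing_revision}"
    return "Needs manual review"


history = [
    PastRun("a1", True),
    PastRun("a2", True),
    PastRun("a3", True),
]
label = classify_failure(history, "b7")
print(label, "->", recommend(label, "b7"))
```

Keeping an explicit "undetermined" outcome avoids forcing a recommendation when the history is empty or the test has never passed.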