CI/CD Pipeline Failure Investigation

When build failures block deployments, teammates collaborate to determine whether the issue stems from code changes, test flakiness, environmental problems, or integration divergence from long-lived branches.

Environment setup

Component | Purpose
--- | ---
CircleCI Connector | Receive build failure webhooks and retrieve job metadata, logs, and test results
GitHub Connector | Access pull request context, diffs, and code ownership
Edge Delta MCP Connector | Query infrastructure metrics if environmental issues are suspected
AI Team Channel | Receive build failure notifications and route to OnCall AI

Configure the CircleCI connector with webhooks enabled to receive build failure notifications automatically. The CircleCI connector normalizes build results, test outputs, and job metadata into structured data that Code Analyzer can query. The GitHub connector provides pull request context, diffs, and code ownership signals. Add the Edge Delta MCP connector to query infrastructure metrics when environmental factors are suspected.
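As a sketch of what the normalization step might produce, the handler below flattens a failure webhook into the fields the investigation needs. The payload shape and field names here are illustrative assumptions, not the connector's actual schema.

```python
from dataclasses import dataclass

@dataclass
class BuildFailure:
    project: str
    branch: str
    job_name: str
    status: str

def normalize(payload: dict) -> BuildFailure:
    # Field names are assumptions for illustration; consult the
    # connector's real schema for the actual structure.
    return BuildFailure(
        project=payload["project"]["name"],
        branch=payload["pipeline"]["branch"],
        job_name=payload["job"]["name"],
        status=payload["job"]["status"],
    )
```

Normalizing at ingestion means every downstream teammate queries the same structured fields instead of re-parsing raw webhook bodies.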

Data flow

flowchart LR
    A[CircleCI Build Failure] -->|Webhook| B[AI Team Channel]
    B --> C[OnCall AI]
    C --> D[Code Analyzer]
    C --> E[SRE Teammate]
    D -->|Queries| F[CircleCI Connector]
    D -->|Queries| G[GitHub Connector]
    E -->|Queries| H[Edge Delta MCP Connector]

When CircleCI sends a build failure webhook, the event flows through an AI Team channel to OnCall AI. Code Analyzer leads the investigation, pulling structured job metadata from the CircleCI connector and correlating with GitHub context. SRE joins if environmental factors are suspected, querying infrastructure metrics through the Edge Delta MCP connector.
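One way to picture the conditional routing above is a small heuristic that scans the failure output for environment-smelling signals before pulling SRE into the thread. The signal list and function name are hypothetical, shown only to make the branching concrete.

```python
# Hypothetical signals that suggest an environmental cause rather than
# a code change; a real setup would tune these to its own runners.
ENV_SIGNALS = ("oom-killed", "connection refused",
               "no space left on device", "timed out")

def route_investigation(failure_log: str) -> list[str]:
    teammates = ["Code Analyzer"]  # always leads the investigation
    log = failure_log.lower()
    if any(signal in log for signal in ENV_SIGNALS):
        teammates.append("SRE")  # environmental factors suspected
    return teammates
```

Code Analyzer always participates; SRE joins only when the evidence warrants it, which keeps routine code-regression triage lightweight.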

Investigation workflow

The following is an example of how the teammates might triage a build failure. The exact behavior depends on your connector configuration, teammate instructions, and the failure context.

  1. OnCall AI receives a build failure notification from CircleCI and initiates an investigation thread
  2. Code Analyzer pulls structured job metadata and test results from CircleCI, including JUnit artifacts, and enriches them with GitHub pull request context (diffs, recent commits, code ownership)
  3. Code Analyzer checks whether the PR branch has diverged significantly from the base branch and whether the failure correlates with recent mainline changes that have not been integrated
  4. Code Analyzer queries historical CI telemetry to determine whether this failure represents a new regression or recurring flakiness. Automatic retries can mask non-determinism, so Code Analyzer looks for pass-fail cycles without corresponding code changes rather than relying on single-run results. Correlating logs, metrics, and traces exposes patterns that log inspection alone misses, such as timing variability, resource contention, or shared mutable state between tests
  5. SRE correlates build timing with infrastructure metrics if environmental issues are suspected, such as resource exhaustion, network connectivity problems, or configuration drift between environments. Small differences in runner images, dependency versions, or container configurations accumulate gradually and cause artifacts that pass validation in one environment to behave differently in another
  6. OnCall AI classifies the failure into one of four categories and recommends the appropriate response:
    • Code regression - specific code fixes with links to the relevant changes
    • Test flakiness - re-run the build and flag the test for isolation review
    • Environment drift - investigate runner or dependency differences and standardize
    • Integration divergence - rebase against mainline to resolve accumulated drift
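The four-way classification in step 6 could be sketched as an ordered series of checks, with flakiness and environment evidence tested first so a noisy failure is not misattributed to the diff. All evidence keys and the divergence threshold are illustrative assumptions, not a real connector schema.

```python
def classify_failure(evidence: dict) -> tuple[str, str]:
    """Map investigation evidence to one of the four failure
    categories and a recommended response. Keys are illustrative."""
    if evidence.get("mixed_outcomes_same_commit"):
        return ("test flakiness",
                "re-run the build and flag the test for isolation review")
    if evidence.get("infra_anomaly"):
        return ("environment drift",
                "investigate runner or dependency differences and standardize")
    if evidence.get("commits_behind_base", 0) > 50:  # threshold assumed
        return ("integration divergence",
                "rebase against mainline to resolve accumulated drift")
    return ("code regression",
            "propose specific fixes with links to the relevant changes")
```

Treating code regression as the fallback means the pipeline only pages a developer after the cheaper explanations have been ruled out.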

This workflow distinguishes between code problems requiring developer attention, integration divergence that resolves through rebasing, environmental issues that resolve through standardization, and transient failures that resolve through re-runs, reducing time spent investigating false positives.

Over time, CI telemetry reveals systemic patterns beyond individual failures. Repeated merge-related breakages can signal that branches are living too long or that integration is being deferred. Rising flaky test rates can indicate infrastructure instability rather than test logic problems. Teammates surface these trends so teams can address the root cause rather than triaging the same class of failure repeatedly.
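As an illustration of mining CI telemetry for these patterns, the sketch below flags tests with mixed outcomes on the same commit, the pass-fail cycles without corresponding code changes described in step 4. The record format is an assumption for the example.

```python
from collections import defaultdict

def flaky_tests(records):
    """records: iterable of (test_name, commit_sha, passed) tuples
    drawn from historical CI telemetry (format assumed here)."""
    outcomes = defaultdict(set)
    for test, sha, passed in records:
        outcomes[(test, sha)].add(passed)
    # Both a pass and a fail on the same commit means the code did not
    # change between runs, so the variance comes from the test itself
    # or its environment, not from the diff under review.
    return {test for (test, _), seen in outcomes.items() if len(seen) == 2}
```

Tracking how this set grows over time gives the trend signal described above: a rising count points at infrastructure instability rather than any single test's logic.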

Learn more