CI/CD Pipeline Failure Investigation

Determine whether build failures stem from code changes, test flakiness, or environmental problems.

When a build failure blocks a deployment, AI teammates collaborate to identify the cause and recommend either a re-run or a specific code fix.

Environment Setup

| Component | Purpose |
| --- | --- |
| CircleCI Connector | Receive build failure webhooks and retrieve job metadata, logs, and test results |
| GitHub Connector | Access pull request context, diffs, and code ownership |
| Edge Delta MCP Connector | Query infrastructure metrics if environmental issues are suspected |
| AI Team Channel | Receive build failure notifications and route to OnCall AI |

Configure the CircleCI connector with webhooks enabled to receive build failure notifications automatically. The CircleCI connector normalizes build results, test outputs, and job metadata into structured data that Code Analyzer can query. The GitHub connector provides pull request context, diffs, and code ownership signals. Add the Edge Delta MCP connector to query infrastructure metrics when environmental factors are suspected.
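To make the webhook step concrete, the following is a minimal sketch of what a generic receiver for CircleCI webhooks might do before forwarding an event: verify the `circleci-signature` header (an HMAC-SHA256 of the request body, prefixed with `v1=`) and keep only completed workflows or jobs that did not succeed. It is not the CircleCI connector's implementation; the function names `verify_signature`, `is_build_failure`, and `handle_webhook` are hypothetical, and the status values are taken from CircleCI's public webhook payloads as an assumption about what counts as a failure.

```python
import hashlib
import hmac
import json
from typing import Optional


def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Check the v1 HMAC-SHA256 signature sent in the circleci-signature header."""
    # The header is a comma-separated list of "version=hexdigest" pairs.
    parts = dict(
        item.strip().split("=", 1)
        for item in signature_header.split(",")
        if "=" in item
    )
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(parts.get("v1", "").encode(), expected.encode())


def is_build_failure(event: dict) -> bool:
    """Return True for completed workflows or jobs that did not succeed."""
    kind = event.get("type")
    if kind == "workflow-completed":
        return event.get("workflow", {}).get("status") in {"failed", "error"}
    if kind == "job-completed":
        return event.get("job", {}).get("status") == "failed"
    return False


def handle_webhook(secret: str, body: bytes, signature_header: str) -> Optional[dict]:
    """Hypothetical entry point: return the event only if it is a verified failure."""
    if not verify_signature(secret, body, signature_header):
        return None
    event = json.loads(body)
    return event if is_build_failure(event) else None
```

Filtering at this stage keeps successful builds and unverified requests out of the investigation channel, so downstream teammates only see events worth investigating.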

Data Flow

flowchart LR
    A[CircleCI Build Failure] -->|Webhook| B[AI Team Channel]
    B --> C[OnCall AI]
    C --> D[Code Analyzer]
    C --> E[SRE Teammate]
    D -->|Queries| F[CircleCI Connector]
    D -->|Queries| G[GitHub Connector]
    E -->|Queries| H[Edge Delta MCP]

When CircleCI sends a build failure webhook, the event flows through an AI Team channel to OnCall AI. Code Analyzer leads the investigation, pulling structured job metadata from the CircleCI connector and correlating it with pull request context from the GitHub connector. The SRE teammate joins only if environmental factors are suspected, querying infrastructure metrics through the Edge Delta MCP connector.

Investigation Workflow

  1. OnCall AI receives a build failure notification from CircleCI and initiates an investigation thread
  2. Code Analyzer pulls structured job metadata and test results from CircleCI, including JUnit artifacts, and enriches them with GitHub pull request context (diffs, recent commits, code ownership)
  3. Code Analyzer queries historical CI telemetry to determine whether this failure represents a new regression or recurring flakiness
  4. SRE correlates build timing with infrastructure metrics if environmental issues are suspected (such as resource exhaustion or network connectivity problems)
  5. OnCall AI synthesizes findings and recommends either re-running the build (for transient issues) or specific code fixes with links to the relevant changes
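Step 2 mentions JUnit artifacts. As an illustration of the kind of structured data they contain, the sketch below extracts failed and errored test cases from a JUnit-style XML report using Python's standard library. It assumes the common `<testsuite>`/`<testcase>` layout with `<failure>` or `<error>` child elements; the `FailedTest` type and `failed_tests` function are hypothetical names, not part of the CircleCI connector or Code Analyzer.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class FailedTest:
    classname: str
    name: str
    kind: str      # "failure" (assertion failed) or "error" (unexpected exception)
    message: str


def failed_tests(junit_xml: str) -> list[FailedTest]:
    """Return every test case that contains a <failure> or <error> element."""
    root = ET.fromstring(junit_xml)
    results = []
    # iter() walks the whole tree, so a <testsuites> wrapper or a bare
    # <testsuite> root are both handled.
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            node = case.find(kind)
            if node is not None:
                results.append(
                    FailedTest(
                        classname=case.get("classname", ""),
                        name=case.get("name", ""),
                        kind=kind,
                        message=node.get("message", ""),
                    )
                )
                break  # record each test case at most once
    return results
```

The resulting class and test names are what would typically be matched against the files changed in the pull request diff and against the history of earlier runs.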

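This workflow distinguishes code problems that require developer attention from transient failures, such as flaky tests or environmental issues, that typically clear on a re-run. Making that distinction early reduces the time developers spend investigating failures their changes did not cause.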
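The triage logic can be sketched as a simple heuristic. The example below is an assumption-laden illustration, not the product's actual decision procedure: `classify_failure`, `recommended_action`, and the 5% `flaky_threshold` are all hypothetical. It treats an infrastructure anomaly during the build window as an environmental cause, an intermittent failure history on unrelated code as flakiness, and everything else as a likely regression.

```python
from enum import Enum


class Verdict(Enum):
    ENVIRONMENTAL = "environmental"
    FLAKY = "flaky"
    REGRESSION = "regression"


def classify_failure(
    history: list[bool],
    infra_anomaly: bool,
    flaky_threshold: float = 0.05,  # hypothetical cutoff, tune per project
) -> Verdict:
    """Classify a failing test.

    history: outcomes of recent runs of the same test on code the change
             did not touch (True = passed), for example on the default branch.
    infra_anomaly: whether infrastructure metrics showed a problem
                   (resource exhaustion, network errors) during the build.
    """
    if infra_anomaly:
        return Verdict.ENVIRONMENTAL
    if history:
        failure_rate = history.count(False) / len(history)
        if failure_rate >= flaky_threshold:
            return Verdict.FLAKY
    # No anomaly and no record of intermittent failures: suspect the change.
    return Verdict.REGRESSION


def recommended_action(verdict: Verdict) -> str:
    """Map a verdict to the two outcomes described in the workflow."""
    if verdict is Verdict.REGRESSION:
        return "review the linked code changes"
    return "re-run the build"
```

For example, a test that failed 2 of its last 20 runs on the default branch, with no infrastructure anomaly, would be classified as flaky under these assumptions, and the recommendation would be a re-run rather than a code fix.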
