PagerDuty Incident Response Automation

Reduce alert fatigue by automating PagerDuty incident triage and response with AI teammates and confidence-based action routing.

You can reduce alert fatigue by using AI teammates to triage PagerDuty incidents, investigate likely causes, and coordinate response actions while keeping humans in control for higher-risk changes.

Data flow

flowchart LR
    A[PagerDuty Incident Triggered] -->|Webhook| B[AI Team Channel]
    B --> C[OnCall AI]
    C --> D[SRE Teammate]
    C --> E[Code Analyzer]
    D -->|Queries| F[Edge Delta MCP]
    E -->|Queries| G[GitHub]
    D -->|Incident Updates| H[PagerDuty Connector]
    C -->|Approval Requests| I[Human On-Call]
    H --> J[PagerDuty Incident Timeline]

PagerDuty sends incident events through a webhook to the configured channel. OnCall AI creates an investigation thread and delegates analysis tasks to the SRE and Code Analyzer teammates. SRE gathers telemetry evidence from Edge Delta through MCP tools, while Code Analyzer checks GitHub for recent code or deployment changes. OnCall AI then coordinates responder actions, requesting approval from the human on-call where needed, and findings are written back to the PagerDuty incident timeline through the PagerDuty connector.
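To make the first hop of this flow concrete, the sketch below shows what a receiver of the webhook might do with an incoming event. It assumes the payload shape and the `X-PagerDuty-Signature` header (an HMAC-SHA256 of the raw body, prefixed with `v1=`) described in PagerDuty's v3 webhook documentation. The function names and the selected fields are illustrative only and are not part of any connector API.

```python
import hashlib
import hmac
import json


def verify_signature(secret: str, raw_body: bytes, signature_header: str) -> bool:
    """Compare the signature header against an HMAC-SHA256 of the raw body."""
    expected = "v1=" + hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # The header may carry several comma-separated signatures.
    candidates = [part.strip() for part in signature_header.split(",")]
    return any(hmac.compare_digest(expected, c) for c in candidates)


def summarize_event(raw_body: bytes) -> dict:
    """Extract the fields a triage thread would most likely need (illustrative)."""
    event = json.loads(raw_body)["event"]
    data = event.get("data", {})
    return {
        "event_type": event.get("event_type"),
        "incident_id": data.get("id"),
        "title": data.get("title"),
        "urgency": data.get("urgency"),
        "service": (data.get("service") or {}).get("summary"),
    }
```

In the setup described here the connector and channel handle delivery for you, so code like this is only useful for understanding or testing what arrives in the channel.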

Environment setup

| Component | Purpose |
|---|---|
| PagerDuty Connector | Receive incident events via webhook and manage incident status, urgency, assignees, and notes |
| PagerDuty Integration Guide | Configure Generic Webhooks (v3), authorization headers, and event subscriptions |
| Edge Delta MCP Connector | Query logs, metrics, traces, and service context for incident investigation |
| GitHub Connector | Correlate incidents with recent deployments, pull requests, and configuration changes (optional) |
| AI Team Channel | Receive PagerDuty webhook events and route to OnCall AI for orchestration |

Configure the PagerDuty connector and enable webhook delivery so incident lifecycle events are posted into an AI Team channel such as #alerts. Add the Edge Delta MCP connector for telemetry investigation, and optionally add GitHub for change-correlation checks. For production guardrails, keep read operations set to Allow and configure write operations with Ask Permission where human approval is required.
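The guardrail in the last sentence can be thought of as a per-operation policy. The following is a minimal sketch of that idea, not the connector's actual configuration format. The operation names are hypothetical placeholders, and only the two permission labels, Allow and Ask Permission, come from the text above.

```python
from enum import Enum


class Permission(Enum):
    ALLOW = "Allow"
    ASK = "Ask Permission"


# Hypothetical operation names; the real connector exposes its own tool list.
READ_OPERATIONS = {"get_incident", "list_incidents", "list_oncalls"}
WRITE_OPERATIONS = {"update_incident", "add_note", "reassign_incident"}


def build_policy(auto_approved_writes: frozenset = frozenset()) -> dict:
    """Allow all reads; gate every write unless it is explicitly auto-approved."""
    unknown = set(auto_approved_writes) - WRITE_OPERATIONS
    if unknown:
        raise ValueError(f"not a known write operation: {sorted(unknown)}")
    policy = {op: Permission.ALLOW for op in READ_OPERATIONS}
    for op in WRITE_OPERATIONS:
        policy[op] = Permission.ALLOW if op in auto_approved_writes else Permission.ASK
    return policy
```

Defaulting writes to Ask Permission and opting individual operations into Allow keeps the approval gate in place for anything not reviewed yet.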

Investigation workflow

The following is an example of how the teammates might handle an incoming PagerDuty incident. The exact behavior depends on your connector configuration, teammate instructions, and incident context.

  1. OnCall AI receives the incident event and opens an investigation thread in the target channel
  2. SRE queries logs, metrics, and traces to determine service health, blast radius, and likely root cause
  3. Code Analyzer checks recent deployments and pull requests to identify potential change-related regressions
  4. OnCall AI synthesizes the findings, classifies urgency, and proposes next actions
  5. OnCall AI routes actions based on confidence level: it applies autonomous updates for low-risk tasks, or requests human approval for higher-impact remediation
  6. SRE and OnCall AI update the PagerDuty incident with timeline notes, assigned responders, and remediation status
  7. OnCall AI continues monitoring until service recovery is confirmed, then recommends closure and post-incident follow-up
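Steps 4 and 6 above involve turning teammate findings into a timeline note. As an illustration of what a structured note could look like, here is a small sketch. The `Finding` type, its fields, and the note layout are assumptions made for this example and do not describe the teammates' actual output format.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    source: str    # which teammate produced it, e.g. "SRE"
    summary: str   # one-line conclusion
    evidence: str  # pointer to the supporting query or change


def format_incident_note(title: str, findings: list, next_steps: list) -> str:
    """Render findings and proposed next steps as a plain-text incident note."""
    lines = [f"Investigation summary: {title}", "", "Findings:"]
    for f in findings:
        lines.append(f"- [{f.source}] {f.summary} (evidence: {f.evidence})")
    lines += ["", "Proposed next steps:"]
    lines += [f"{i}. {step}" for i, step in enumerate(next_steps, start=1)]
    return "\n".join(lines)
```

Keeping the originating teammate and an evidence pointer on each line lets a human responder check a claim without rereading the whole thread.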

Automation confidence levels

The teammates assess confidence levels to determine which actions to take autonomously and which to escalate for human approval. The examples below illustrate typical behavior, but teammates may adapt based on their instructions and the available evidence.

High confidence (autonomous)

When evidence is strong and the change is low-risk and reversible, the teammates act autonomously:

  • Update incident priority and urgency based on telemetry-backed impact
  • Assign responders using service ownership and on-call schedule context
  • Add structured incident notes with findings and runbook links
  • Suppress or de-prioritize clearly noisy, non-actionable alerts

Medium confidence (approval-gated)

When a proposed action can affect service behavior or rollout state, the teammates request human approval before proceeding:

  • Escalate to the on-call engineer for manual restart actions
  • Trigger a rollback through your CI/CD workflow
  • Run cloud remediation playbooks

Low confidence (investigate and recommend)

When evidence is incomplete or conflicting, the teammates gather context and surface recommendations without taking write actions:

  • Collect additional signals and identify gaps in the investigation
  • Ask follow-up questions to refine incident scope and impact
  • Recommend next steps for human responders to evaluate
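The three levels above can be summarized as a routing decision over how strong the evidence is and how risky the proposed action is. The sketch below is one way to express that logic under stated assumptions: the numeric `evidence_score`, the thresholds, and the type names are hypothetical, and the teammates themselves decide based on their instructions rather than a fixed formula.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTONOMOUS = "act autonomously"
    APPROVAL = "request human approval"
    RECOMMEND = "investigate and recommend"


@dataclass(frozen=True)
class ProposedAction:
    name: str
    evidence_score: float  # 0.0 to 1.0; a hypothetical measure of evidence strength
    affects_service: bool  # can change service behavior or rollout state
    reversible: bool


def route(action: ProposedAction, strong: float = 0.8, weak: float = 0.5) -> Route:
    """Map a proposed action to one of the three confidence levels."""
    if action.evidence_score < weak:
        # Incomplete or conflicting evidence: no write actions.
        return Route.RECOMMEND
    if action.affects_service or not action.reversible:
        return Route.APPROVAL
    if action.evidence_score >= strong:
        return Route.AUTONOMOUS
    # Assumption: moderate evidence on a low-risk action still goes to a human.
    return Route.APPROVAL
```

For example, `route(ProposedAction("add incident note", 0.9, False, True))` returns `Route.AUTONOMOUS`, while a rollback with the same evidence score is routed to `Route.APPROVAL` because it affects rollout state.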
