πŸ“š Table of Contents

  1. The 3 AM Call That Changed How I Think About Incidents
  2. What Actually Happens During a Production Incident
  3. Your Role as PM: What You Should and Shouldn’t Do
  4. Incident Severity Levels: The Framework That Matters
  5. The First 30 Minutes: Your Action Checklist
  6. Stakeholder Communication: Templates That Work
  7. The War Room: How to Participate Without Getting in the Way
  8. Escalation: When and How to Raise the Alarm
  9. Resolution and Recovery: Getting Back to Normal
  10. Post-Mortems: The Real Learning Opportunity
  11. Building Incident Resilience: What to Do Before Things Break
  12. Common PM Mistakes During Incidents
  13. Your Incident Readiness Checklist
  14. The Bottom Line

The 3 AM Call That Changed How I Think About Incidents

March 2024. My phone rang at 3:17 AM.

I didn’t recognize the number. I almost didn’t answer. But something made me pick up.

“Hey, it’s Marcus from engineering. We’ve got a production incident. Users can’t log in. We’re trying to figure out what’s wrong, but we need someone to make a call on whether we roll back.”

I was groggy. Confused. I asked questions that didn’t make sense. I said “let me check with stakeholders” (at 3 AM). I tried to micromanage the technical response.

I was the worst possible version of a PM in that moment.

The incident lasted 4 hours. It should have been 90 minutes. My confusion and hesitation added 2.5 hours to the resolution time. Users were locked out for most of the morning. The company lost $75,000 in revenue.

Afterwards, the engineering lead pulled me aside.

“Next time,” he said quietly, “just tell us you need time to wake up. We would have made the call and updated you after. Your panic helped nobody.”

He was right.

That experience sent me on a mission to understand what PMs should actually do during incidents. I interviewed engineering leads, SREs, and experienced PMs at companies with great incident response. I read every incident post-mortem I could find.

Here’s what I learned: Your role during incidents is specific, important, and completely different from what most PMs think it is.


What Actually Happens During a Production Incident

Before we talk about your role, you need to understand the anatomy of an incident.

The Incident Lifecycle

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  DETECTION  β”‚ ──→ β”‚  TRIAGE     β”‚ ──→ β”‚  RESPONSE   β”‚
β”‚             β”‚     β”‚             β”‚     β”‚             β”‚
β”‚ Alert fires β”‚     β”‚ Severity    β”‚     β”‚ Investigate β”‚
β”‚ User report β”‚     β”‚ Assignment  β”‚     β”‚ Mitigation  β”‚
β”‚ Monitoring  β”‚     β”‚ Escalation  β”‚     β”‚ Resolution  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                              β”‚
                                              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  RECOVERY   β”‚ ←── β”‚ RESOLUTION  β”‚ ←── β”‚ COMMUNICATE β”‚
β”‚             β”‚     β”‚             β”‚     β”‚             β”‚
β”‚ Monitoring  β”‚     β”‚ Fix applied β”‚     β”‚ Stakeholdersβ”‚
β”‚ Validation  β”‚     β”‚ Verified    β”‚     β”‚ Users       β”‚
β”‚ Normal ops  β”‚     β”‚ Stable      β”‚     β”‚ Teams       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                              β”‚
                                              β–Ό
                                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                       β”‚ POST-MORTEM β”‚
                                       β”‚             β”‚
                                       β”‚ Learn       β”‚
                                       β”‚ Improve     β”‚
                                       β”‚ Document    β”‚
                                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Who Does What

| Role | Responsibility |
|------|----------------|
| On-Call Engineer | First responder; investigates and attempts a fix |
| Incident Commander | Coordinates the response, makes calls, assigns tasks |
| Subject Matter Expert | Deep technical knowledge of the affected system |
| Communications Lead | Updates stakeholders, manages messaging |
| Product Manager | Business context, prioritization, stakeholder management |
| Engineering Manager | Resource allocation, team support, escalation |

The key insight: You’re not the incident commander. You’re not debugging. You have a specific role.


Your Role as PM: What You Should and Shouldn’t Do

What You SHOULD Do

βœ… Provide business context

Engineers need to know business impact to make prioritization decisions. Is this affecting paying customers? Which features are down? What’s the revenue impact?

"This is affecting checkout, which is $X/hour in revenue.
The users impacted are primarily enterprise customers.
We have a major demo in 4 hours."

βœ… Make business trade-offs

Sometimes engineers need to make a call that affects the product. Roll back a feature? Disable a service? Prioritize one fix over another?

"If we have to choose, prioritize checkout over search.
Checkout is revenue-critical; search can be degraded."

βœ… Communicate with stakeholders

While engineers fix the problem, you handle communication. Executives, customer success, sales, marketingβ€”they all need updates.

"Update at 3:45 AM: We're aware of login issues affecting
approximately 15% of users. Engineering is investigating.
Will update in 30 minutes."

βœ… Document decisions

During the chaos, someone needs to record what decisions were made and why. This helps with post-mortems and future incidents.

"3:23 AM: Decided to roll back payment service to v2.3.1
instead of hotfixing. Reason: Rollback is faster and
safer given unknown root cause."

βœ… Support the team

Bring coffee. Order food. Shield them from interruptions. Your job is to enable engineers to focus.

What You Should NOT Do

❌ Don’t try to debug

You’re not qualified. Your questions will slow people down. Let the engineers do their job.

❌ Don’t micromanage

“Have you tried X?” “What about Y?” “Why is this taking so long?” These questions help nobody and frustrate everyone.

❌ Don’t make technical decisions

You don’t know whether to restart the database or scale the cluster. Don’t pretend you do.

❌ Don’t escalate prematurely

Calling the VP of Engineering at 3 AM because you’re scared is not helpful. Follow escalation procedures.

❌ Don’t make promises

“We’ll be back up in 30 minutes” is a promise you can’t keep. Communicate status, not predictions.


Incident Severity Levels: The Framework That Matters

Different incidents require different responses. Most teams use a severity framework:

SEV1: Critical

Definition: Complete outage, data loss risk, security breach, or major revenue impact

Examples:

  • All users unable to access product
  • Payment processing completely down
  • Data breach detected
  • Database corruption

Response:

  • All hands on deck
  • Immediate escalation to leadership
  • Wake people up
  • Customer communication within 15 minutes
  • Target resolution: ASAP

PM Role: Full engagement, executive communication, customer messaging

SEV2: High

Definition: Significant feature degradation affecting many users

Examples:

  • Login slow for 50%+ of users
  • Checkout errors >5%
  • Major feature unavailable
  • Significant data inconsistencies

Response:

  • On-call + relevant team
  • Page if outside business hours
  • Customer communication within 30 minutes
  • Target resolution: <2 hours

PM Role: Stakeholder communication, business context, monitoring escalation

SEV3: Medium

Definition: Minor degradation affecting some users

Examples:

  • Specific feature slow or broken
  • Errors affecting <5% of users
  • Non-critical service degraded

Response:

  • On-call handles during business hours
  • Slack update to team
  • No customer communication unless asked
  • Target resolution: <8 hours

PM Role: Awareness, potential prioritization input

SEV4: Low

Definition: Minor issues with no user impact

Examples:

  • Internal tool slow
  • Non-critical background job failing
  • Monitoring alert with no visible impact

Response:

  • Ticket created
  • Address in normal workflow
  • No urgency

PM Role: None required

The Decision Matrix

User Impact?
β”œβ”€β”€ None β†’ SEV4 (Low)
β”œβ”€β”€ Small number of users β†’ SEV3 (Medium)
β”œβ”€β”€ Many users, degraded β†’ SEV2 (High)
└── All users, complete outage β†’ SEV1 (Critical)
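If your team tracks incidents in tooling, the matrix above is mechanical enough to encode. Here's a minimal Python sketch; the function name is my own, and the 5% cutoff for "small number of users" is an assumption borrowed from the SEV3 definition above:

```python
def classify_severity(pct_users_affected: float, complete_outage: bool) -> str:
    """Map user impact to a severity level, following the decision matrix."""
    if complete_outage:
        return "SEV1"  # all users, complete outage
    if pct_users_affected == 0:
        return "SEV4"  # no user impact
    if pct_users_affected < 5:
        return "SEV3"  # small number of users (assumed: the SEV3 "<5%" line)
    return "SEV2"  # many users, degraded

print(classify_severity(0, False))   # SEV4
print(classify_severity(3, False))   # SEV3
print(classify_severity(20, False))  # SEV2
```

The point isn't the code; it's that severity classification should be a lookup, not a debate, at 3 AM.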

The First 30 Minutes: Your Action Checklist

When you learn about an incident, here’s exactly what to do:

Minutes 0-5: Assess

  • Understand the severity level
  • Confirm who’s responding (incident commander, on-call)
  • Identify the user impact (how many, which segment)
  • Determine if you’re needed immediately or can join later

Script:

“I understand we have an incident. What’s the current severity? Who’s the incident commander? What’s the user impact? Do you need me right now?”

Minutes 5-15: Gather Context

  • Understand what feature/system is affected
  • Identify business impact (revenue, customers, demos)
  • Check if there’s a scheduled event that amplifies impact
  • Identify who needs to be informed

Script:

“The checkout flow is affected. That’s about $X/hour in revenue. We have a major customer demo at 10 AM. I’ll start stakeholder communication.”

Minutes 15-30: Communicate

  • Send initial update to stakeholders (see templates)
  • Set expectations for next update time
  • Join incident channel/bridge
  • Identify if you can help or should stand by

Script:

“Initial update sent. I’m in the incident channel if you need business input. I’ll send another update in 30 minutes unless something changes.”


Stakeholder Communication: Templates That Work

Communication during incidents is its own skill. Here are templates that work:

Initial Acknowledgment (Within 15 minutes of SEV1/SEV2)

To: Leadership, Customer Success, Support
Subject: [SEV2] Incident: Checkout errors affecting users

We are aware of an issue affecting the checkout process.
Engineering is actively investigating.

Impact: Users may experience errors when attempting to complete purchases
Affected: Approximately 20% of checkout attempts
Status: Investigating
Next update: [30 minutes from now]

If you receive customer inquiries, please direct them to support.
I will provide updates every 30 minutes until resolved.

[Your name]

Status Update (Every 30-60 minutes)

To: Same as above
Subject: [SEV2] Update #2: Checkout errors

Update on the checkout incident:

Current Status: We have identified the issue as related to the payment
processor integration. Engineering is implementing a fix.

Progress:
- Identified root cause: Payment API timeout
- Fix in progress: Implementing fallback payment gateway
- ETA: Expecting resolution in approximately 30 minutes

Impact: Issue continues to affect ~20% of checkout attempts
Next update: [30 minutes from now]

[Your name]

Resolution Notice

To: Same as above
Subject: [RESOLVED] SEV2: Checkout errors

The checkout incident has been resolved.

Summary:
- Duration: 2 hours 15 minutes
- Root Cause: Payment processor API timeout
- Resolution: Implemented fallback to secondary payment gateway
- Impact: ~20% of checkout attempts failed during incident

Next Steps:
- Full post-mortem will be completed within 48 hours
- Preventive measures will be documented and shared
- Customer Success: Please follow up with affected customers

Thank you to everyone who helped respond.

[Your name]

Customer-Facing Message (If needed)

For status page or direct communication:

We experienced an issue with checkout processing between
[TIME] and [TIME] UTC. During this time, some customers
may have encountered errors when completing purchases.

The issue has been resolved. If your payment was affected:
- Failed transactions were not charged
- Please retry your purchase
- Contact support if you continue to experience issues

We apologize for any inconvenience.

The War Room: How to Participate Without Getting in the Way

During major incidents, teams gather (virtually or physically) in a “war room.” Here’s how to participate effectively:

Your Job in the War Room

  1. Listen. Don’t interrupt with questions. Engineers need to communicate with each other.
  2. Note business context. If someone asks about impact, provide it quickly.
  3. Handle external communication. Shield the team from stakeholder interruptions.
  4. Document decisions. Keep a running log.

What Not to Do

  • Don’t ask “what’s happening?” every 5 minutes
  • Don’t suggest technical solutions
  • Don’t interrupt debugging conversations
  • Don’t pull people away for non-urgent updates

The Incident Log Template

Keep this updated during the incident:

INCIDENT LOG: [DATE] - [INCIDENT NAME]
Severity: SEV[X]
Incident Commander: [Name]
Start Time: [Time]
Affected Systems: [List]

TIMELINE:
[Time] - Incident detected via [alert/user report]
[Time] - On-call [Name] acknowledged
[Time] - Severity set to SEV[X]
[Time] - [Decision made] - [Reason]
[Time] - [Action taken] - [By whom]
[Time] - [Update] - [Progress]
...

DECISIONS MADE:
1. Roll back payment service at [Time] - Faster than debugging
2. Wake senior engineer at [Time] - Needed SME knowledge
3. Customer communication sent at [Time] - Per SEV2 protocol

STAKEHOLDER UPDATES:
- [Time] Email sent to leadership
- [Time] Status page updated
- [Time] Support team notified

RESOLUTION:
[Time] - Fix deployed
[Time] - Verified working
[Time] - Incident closed
Duration: [X hours Y minutes]

Escalation: When and How to Raise the Alarm

Knowing when to escalate is crucial. Here’s a framework:

Automatic Escalation Triggers

Always escalate if:

  • SEV1 is declared
  • Resolution ETA exceeds 2 hours
  • Customer churn risk is high
  • Media/legal/regulatory exposure
  • Data breach suspected
  • Multiple major customers affected

How to Escalate

Step 1: Inform the Incident Commander

“Given the customer impact and approaching renewal deadline, I believe we need to escalate this to leadership. Do you agree?”

Step 2: Send escalation message

To: VP Engineering, VP Product (or appropriate leadership)
Subject: [ESCALATION] SEV2: Checkout incident - Customer impact

Escalating due to customer impact.

Summary: Checkout errors affecting 20% of attempts for 90 minutes
Impact: [Major customer] renewal decision tomorrow; estimated $X revenue at risk
Current Status: Engineering investigating, no ETA yet
What I Need: Guidance on customer communication; decision on mitigation options

I'm available to discuss immediately.

[Your name]
[Phone number]

What NOT to Do

  • Don’t escalate silently (tell the incident commander)
  • Don’t escalate to bypass decisions you don’t like
  • Don’t copy too many people (creates noise)
  • Don’t send long emails (executives need summaries)

Resolution and Recovery: Getting Back to Normal

Once the fix is deployed, there’s still work to do.

Immediate (First Hour)

  • Verify fix is working
  • Monitor for recurrence
  • Update all stakeholders
  • Close incident channel/bridge
  • Thank the responders

Short-Term (First 24 Hours)

  • Send follow-up to affected customers (if appropriate)
  • Update status page with resolution
  • Gather initial data for post-mortem
  • Schedule post-mortem meeting

Medium-Term (First 48 Hours)

  • Complete post-mortem
  • Create action items for prevention
  • Share learnings with broader team
  • Update runbooks if needed

Post-Mortems: The Real Learning Opportunity

The post-mortem is where incidents become valuable. Here’s how to run them effectively.

The Blameless Principle

Critical rule: Post-mortems are never about blame. They’re about system improvement.

❌ Blameful: "John pushed bad code"
βœ… Blameless: "The deployment process lacks automated validation"

Why this matters: If people fear blame, they hide information. You can’t improve if you don’t know what happened.

The Post-Mortem Template

# Incident Post-Mortem: [Incident Name]

**Date:** [Date]
**Severity:** SEV[X]
**Duration:** [Start] to [End] ([Total])
**Author:** [Name]

## Summary
[2-3 sentences describing what happened]

## Impact
- User impact: [X users affected, Y% of total]
- Business impact: [Revenue lost, customers affected]
- Duration: [X hours Y minutes]

## Timeline
[Detailed timeline of what happened]

## Root Cause
[The underlying reason this happened - not the immediate trigger]

## Contributing Factors
- Factor 1
- Factor 2
- Factor 3

## What Went Well
- [Things that helped resolve the incident quickly]

## What Could Be Improved
- [Things that slowed resolution or made it worse]

## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| [Action 1] | [Name] | [Date] |
| [Action 2] | [Name] | [Date] |

## Lessons Learned
[Key takeaways for future incidents]

Your Role in Post-Mortems

As PM, you contribute:

  1. Business impact analysis: Quantify the damage (revenue, users, churn risk)
  2. Customer perspective: What did users experience? How did they react?
  3. Prioritization input: Which action items matter most?
  4. Follow-up ownership: Own non-technical action items (customer communication, documentation)

Questions to Ask in Post-Mortems

  • “What would have prevented this entirely?”
  • “How could we have detected this faster?”
  • “What slowed down our resolution?”
  • “What would we do differently next time?”
  • “Is there a pattern with similar incidents?”

Building Incident Resilience: What to Do Before Things Break

The best incident response is preparation. Here’s what to do now:

Know Your Incident Process

  • Where is the incident runbook?
  • Who’s on-call this week?
  • What’s the escalation path?
  • Where is the incident channel?
  • What’s the severity criteria?

Build Relationships Before Crises

  • Know the engineering leads by name
  • Understand which teams own which systems
  • Have executive contact info ready
  • Build trust so they trust your judgment during incidents

Pre-Write Communication Templates

  • Have email templates ready to customize
  • Know the status page update process
  • Have customer messaging approved in advance
  • Create a stakeholder distribution list
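Pre-written templates can live anywhere, including as a snippet you fill in code. A sketch using Python's stdlib `string.Template`; the field names are illustrative, mirroring the acknowledgment template earlier in this post:

```python
from string import Template

INITIAL_ACK = Template(
    "Subject: [$severity] Incident: $title\n\n"
    "We are aware of an issue affecting $area.\n"
    "Engineering is actively investigating.\n\n"
    "Impact: $impact\n"
    "Status: Investigating\n"
    "Next update: $next_update\n"
)

# substitute() raises KeyError if any field is missing, so a
# half-filled template never goes out.
msg = INITIAL_ACK.substitute(
    severity="SEV2",
    title="Checkout errors",
    area="the checkout process",
    impact="Approximately 20% of checkout attempts",
    next_update="30 minutes from now",
)
print(msg)
```

The mechanism matters less than the preparation: at 3 AM you should be filling in blanks, not composing from scratch.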

Practice

  • Participate in incident drills (game days)
  • Review past post-mortems
  • Shadow an actual incident if possible
  • Know what you don’t know

Common PM Mistakes During Incidents

Mistake 1: Trying to Be Technical

What happens: You ask technical questions, suggest solutions, or try to debug.

The result: You distract engineers, slow down resolution, and look incompetent.

The fix: Stay in your lane. Business context. Communication. Support.

Mistake 2: Going Silent

What happens: You don’t know what to do, so you do nothing.

The result: Stakeholders are uninformed, customers are angry, and you look disengaged.

The fix: Always send initial acknowledgment. Set update cadence. Even “still investigating” is an update.

Mistake 3: Over-Promising

What happens: You say “we’ll be back up in 30 minutes” based on hope.

The result: When 30 minutes passes, stakeholders lose trust. You’ve created an expectation you can’t control.

The fix: Communicate status, not predictions. “Engineering is working on a fix” is honest. “We’ll be back in 30 minutes” is guessing.

Mistake 4: Escalating Emotionally

What happens: You’re scared, so you call leadership. Or you wake up the VP because “this is serious.”

The result: You create panic, damage trust with engineering, and distract leadership without cause.

The fix: Follow escalation procedures. Have criteria. Escalate strategically, not emotionally.

Mistake 5: Not Doing the Post-Mortem

What happens: Incident resolved, everyone moves on. No documentation, no learning.

The result: Same incident happens again. You didn’t improve.

The fix: Always do post-mortems for SEV1 and SEV2. Create action items. Track completion.


Your Incident Readiness Checklist

Right Now (Before the Next Incident)

  • Save the incident channel name/link
  • Know who’s on-call (bookmark the schedule)
  • Bookmark the incident runbook
  • Save communication templates somewhere accessible
  • Know your severity criteria

This Week

  • Introduce yourself to the on-call engineers
  • Review the last 3 post-mortems
  • Understand escalation procedures
  • Create stakeholder distribution list
  • Ask engineering if there are any known risks

This Month

  • Participate in a game day/drill
  • Review incident metrics (MTTR, frequency)
  • Identify gaps in incident process
  • Propose improvements based on patterns
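For the metrics review, MTTR is just the mean of incident durations. A sketch assuming you can export start/end timestamps from your incident tracker (the data shape here is my assumption):

```python
from datetime import datetime, timedelta

def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to resolution over (start, end) pairs."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)

last_quarter = [
    (datetime(2024, 3, 1, 3, 17), datetime(2024, 3, 1, 5, 32)),   # 2h 15m
    (datetime(2024, 3, 8, 10, 0), datetime(2024, 3, 8, 10, 45)),  # 45m
]
print(mttr(last_quarter))  # 1:30:00
```

Track the trend, not the single number: an MTTR that creeps up quarter over quarter is a process gap worth raising.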

The Bottom Line

Incidents are inevitable. Your response is not.

Good PM incident response:

  • Provides business context when needed
  • Handles stakeholder communication
  • Supports the team without interfering
  • Learns from every incident

Bad PM incident response:

  • Tries to be technical
  • Goes silent or over-communicates
  • Creates more noise than signal
  • Skips the post-mortem

The difference isn’t experience or technical knowledge. It’s understanding your role and executing it well.

Your action item: Find your team’s incident runbook. Read it. Bookmark it. Then ask an engineer to walk you through what they need from you during an incident.

Because the next 3 AM call is coming. The question is: will you be ready?


What’s your biggest concern about incident response? What would help you feel more prepared?

Related Reading: