📋 Table of Contents
- The 3 AM Call That Changed How I Think About Incidents
- What Actually Happens During a Production Incident
- Your Role as PM: What You Should and Shouldn’t Do
- Incident Severity Levels: The Framework That Matters
- The First 30 Minutes: Your Action Checklist
- Stakeholder Communication: Templates That Work
- The War Room: How to Participate Without Getting in the Way
- Escalation: When and How to Raise the Alarm
- Resolution and Recovery: Getting Back to Normal
- Post-Mortems: The Real Learning Opportunity
- Building Incident Resilience: What to Do Before Things Break
- Common PM Mistakes During Incidents
- Your Incident Readiness Checklist
- The Bottom Line
The 3 AM Call That Changed How I Think About Incidents
March 2024. My phone rang at 3:17 AM.
I didn’t recognize the number. I almost didn’t answer. But something made me pick up.
“Hey, it’s Marcus from engineering. We’ve got a production incident. Users can’t log in. We’re trying to figure out what’s wrong, but we need someone to make a call on whether we roll back.”
I was groggy. Confused. I asked questions that didn’t make sense. I said “let me check with stakeholders” (at 3 AM). I tried to micromanage the technical response.
I was the worst possible version of a PM in that moment.
The incident lasted 4 hours. It should have been 90 minutes. My confusion and hesitation added 2.5 hours to the resolution time. Users were locked out for most of the morning. The company lost $75,000 in revenue.
Afterwards, the engineering lead pulled me aside.
“Next time,” he said quietly, “just tell us you need time to wake up. We would have made the call and updated you after. Your panic helped nobody.”
He was right.
That experience sent me on a mission to understand what PMs should actually do during incidents. I interviewed engineering leads, SREs, and experienced PMs at companies with great incident response. I read every incident post-mortem I could find.
Here’s what I learned: Your role during incidents is specific, important, and completely different from what most PMs think it is.
What Actually Happens During a Production Incident
Before we talk about your role, you need to understand the anatomy of an incident.
The Incident Lifecycle
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  DETECTION   │────▶│    TRIAGE    │────▶│   RESPONSE   │
│              │     │              │     │              │
│ Alert fires  │     │ Severity     │     │ Investigation│
│ User report  │     │ Assignment   │     │ Mitigation   │
│ Monitoring   │     │ Escalation   │     │ Resolution   │
└──────────────┘     └──────────────┘     └──────────────┘
                                                 │
                                                 ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   RECOVERY   │◀────│  RESOLUTION  │◀────│ COMMUNICATE  │
│              │     │              │     │              │
│ Monitoring   │     │ Fix applied  │     │ Stakeholders │
│ Validation   │     │ Verified     │     │ Users        │
│ Normal ops   │     │ Stable       │     │ Teams        │
└──────────────┘     └──────────────┘     └──────────────┘
       │
       ▼
┌──────────────┐
│ POST-MORTEM  │
│              │
│ Learn        │
│ Improve      │
│ Document     │
└──────────────┘
Who Does What
| Role | Responsibility |
|---|---|
| On-Call Engineer | First responder, investigates, attempts fix |
| Incident Commander | Coordinates response, makes calls, assigns tasks |
| Subject Matter Expert | Deep technical knowledge of affected system |
| Communications Lead | Updates stakeholders, manages messaging |
| Product Manager | Business context, prioritization, stakeholder management |
| Engineering Manager | Resource allocation, team support, escalation |
The key insight: You’re not the incident commander. You’re not debugging. You have a specific role.
Your Role as PM: What You Should and Shouldn’t Do
What You SHOULD Do
✅ Provide business context
Engineers need to know business impact to make prioritization decisions. Is this affecting paying customers? Which features are down? What’s the revenue impact?
"This is affecting checkout, which is $X/hour in revenue.
The users impacted are primarily enterprise customers.
We have a major demo in 4 hours."
✅ Make business trade-offs
Sometimes engineers need to make a call that affects the product. Roll back a feature? Disable a service? Prioritize one fix over another?
"If we have to choose, prioritize checkout over search.
Checkout is revenue-critical; search can be degraded."
✅ Communicate with stakeholders
While engineers fix the problem, you handle communication. Executives, customer success, sales, marketingβthey all need updates.
"Update at 3:45 AM: We're aware of login issues affecting
approximately 15% of users. Engineering is investigating.
Will update in 30 minutes."
✅ Document decisions
During the chaos, someone needs to record what decisions were made and why. This helps with post-mortems and future incidents.
"3:23 AM: Decided to roll back payment service to v2.3.1
instead of hotfixing. Reason: Rollback is faster and
safer given unknown root cause."
✅ Support the team
Bring coffee. Order food. Shield them from interruptions. Your job is to enable engineers to focus.
What You Should NOT Do
❌ Don’t try to debug
You’re not qualified. Your questions will slow people down. Let the engineers do their job.
❌ Don’t micromanage
“Have you tried X?” “What about Y?” “Why is this taking so long?” These questions help nobody and frustrate everyone.
❌ Don’t make technical decisions
You don’t know whether to restart the database or scale the cluster. Don’t pretend you do.
❌ Don’t escalate prematurely
Calling the VP of Engineering at 3 AM because you’re scared is not helpful. Follow escalation procedures.
❌ Don’t make promises
“We’ll be back up in 30 minutes” is a promise you can’t keep. Communicate status, not predictions.
Incident Severity Levels: The Framework That Matters
Different incidents require different responses. Most teams use a severity framework:
SEV1: Critical
Definition: Complete outage, data loss risk, security breach, or major revenue impact
Examples:
- All users unable to access product
- Payment processing completely down
- Data breach detected
- Database corruption
Response:
- All hands on deck
- Immediate escalation to leadership
- Wake people up
- Customer communication within 15 minutes
- Target resolution: ASAP
PM Role: Full engagement, executive communication, customer messaging
SEV2: High
Definition: Significant feature degradation affecting many users
Examples:
- Login slow for 50%+ of users
- Checkout errors >5%
- Major feature unavailable
- Significant data inconsistencies
Response:
- On-call + relevant team
- Page if outside business hours
- Customer communication within 30 minutes
- Target resolution: <2 hours
PM Role: Stakeholder communication, business context, monitoring escalation
SEV3: Medium
Definition: Minor degradation affecting some users
Examples:
- Specific feature slow or broken
- Errors affecting <5% of users
- Non-critical service degraded
Response:
- On-call handles during business hours
- Slack update to team
- No customer communication unless asked
- Target resolution: <8 hours
PM Role: Awareness, potential prioritization input
SEV4: Low
Definition: Minor issues with no user impact
Examples:
- Internal tool slow
- Non-critical background job failing
- Monitoring alert with no visible impact
Response:
- Ticket created
- Address in normal workflow
- No urgency
PM Role: None required
The Decision Matrix
User Impact?
├── None → SEV4 (Low)
├── Small number of users → SEV3 (Medium)
├── Many users, degraded → SEV2 (High)
└── All users, complete outage → SEV1 (Critical)
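The matrix above is mechanical enough to sketch as a tiny helper. This is illustrative only; the 50% cutoff for "many users" is an assumption, not an official threshold, so adapt it to your own severity criteria.

```python
def classify_severity(affected_fraction: float, complete_outage: bool = False) -> str:
    """Rough mapping from user impact to severity, per the decision matrix.

    The 0.5 cutoff for "many users" is an illustrative assumption,
    not an official threshold.
    """
    if complete_outage:
        return "SEV1"   # all users, complete outage
    if affected_fraction >= 0.5:
        return "SEV2"   # many users, degraded
    if affected_fraction > 0:
        return "SEV3"   # small number of users
    return "SEV4"       # no user impact
```

In practice the engineer or incident commander sets severity; this sketch is just the matrix made explicit so you can sanity-check the call.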
The First 30 Minutes: Your Action Checklist
When you learn about an incident, here’s exactly what to do:
Minutes 0-5: Assess
- Understand the severity level
- Confirm who’s responding (incident commander, on-call)
- Identify the user impact (how many, which segment)
- Determine if you’re needed immediately or can join later
Script:
“I understand we have an incident. What’s the current severity? Who’s the incident commander? What’s the user impact? Do you need me right now?”
Minutes 5-15: Gather Context
- Understand what feature/system is affected
- Identify business impact (revenue, customers, demos)
- Check if there’s a scheduled event that amplifies impact
- Identify who needs to be informed
Script:
“The checkout flow is affected. That’s about $X/hour in revenue. We have a major customer demo at 10 AM. I’ll start stakeholder communication.”
Minutes 15-30: Communicate
- Send initial update to stakeholders (see templates)
- Set expectations for next update time
- Join incident channel/bridge
- Identify if you can help or should stand by
Script:
“Initial update sent. I’m in the incident channel if you need business input. I’ll send another update in 30 minutes unless something changes.”
Stakeholder Communication: Templates That Work
Communication during incidents is its own skill. Here are templates that work:
Initial Acknowledgment (Within 15 minutes of SEV1/SEV2)
To: Leadership, Customer Success, Support
Subject: [SEV2] Incident: Checkout errors affecting users
We are aware of an issue affecting the checkout process.
Engineering is actively investigating.
Impact: Users may experience errors when attempting to complete purchases
Affected: Approximately 20% of checkout attempts
Status: Investigating
Next update: [30 minutes from now]
If you receive customer inquiries, please direct them to support.
I will provide updates every 30 minutes until resolved.
[Your name]
Status Update (Every 30-60 minutes)
To: Same as above
Subject: [SEV2] Update #2: Checkout errors
Update on the checkout incident:
Current Status: We have identified the issue as related to the payment
processor integration. Engineering is implementing a fix.
Progress:
- Identified root cause: Payment API timeout
- Fix in progress: Implementing fallback payment gateway
- ETA: Expecting resolution in approximately 30 minutes
Impact: Issue continues to affect ~20% of checkout attempts
Next update: [30 minutes from now]
[Your name]
Resolution Notice
To: Same as above
Subject: [RESOLVED] SEV2: Checkout errors
The checkout incident has been resolved.
Summary:
- Duration: 2 hours 15 minutes
- Root Cause: Payment processor API timeout
- Resolution: Implemented fallback to secondary payment gateway
- Impact: ~20% of checkout attempts failed during incident
Next Steps:
- Full post-mortem will be completed within 48 hours
- Preventive measures will be documented and shared
- Customer Success: Please follow up with affected customers
Thank you to everyone who helped respond.
[Your name]
Customer-Facing Message (If needed)
For status page or direct communication:
We experienced an issue with checkout processing between
[TIME] and [TIME] UTC. During this time, some customers
may have encountered errors when completing purchases.
The issue has been resolved. If your payment was affected:
- Failed transactions were not charged
- Please retry your purchase
- Contact support if you continue to experience issues
We apologize for any inconvenience.
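Pre-writing these messages as fill-in templates pays off at 3 AM. A minimal sketch using Python's `string.Template`; the field names here are assumptions for illustration, not a standard format:

```python
from string import Template

# Illustrative pre-written status update; field names are assumptions --
# adapt them to your own templates.
STATUS_UPDATE = Template(
    "[$severity] Update: $title\n"
    "Current Status: $status\n"
    "Impact: $impact\n"
    "Next update: $next_update"
)

def render_update(**fields: str) -> str:
    """Fill in the template. Template.substitute raises KeyError on a
    missing field, which catches incomplete updates before they go out."""
    return STATUS_UPDATE.substitute(**fields)
```

The deliberate choice of `substitute` over `safe_substitute` means a half-filled update fails loudly instead of going out with a blank field.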
The War Room: How to Participate Without Getting in the Way
During major incidents, teams gather (virtually or physically) in a “war room.” Here’s how to participate effectively:
Your Job in the War Room
- Listen. Don’t interrupt with questions. Engineers need to communicate with each other.
- Note business context. If someone asks about impact, provide it quickly.
- Handle external communication. Shield the team from stakeholder interruptions.
- Document decisions. Keep a running log.
What Not to Do
- Don’t ask “what’s happening?” every 5 minutes
- Don’t suggest technical solutions
- Don’t interrupt debugging conversations
- Don’t pull people away for non-urgent updates
The Incident Log Template
Keep this updated during the incident:
INCIDENT LOG: [DATE] - [INCIDENT NAME]
Severity: SEV[X]
Incident Commander: [Name]
Start Time: [Time]
Affected Systems: [List]
TIMELINE:
[Time] - Incident detected via [alert/user report]
[Time] - On-call [Name] acknowledged
[Time] - Severity set to SEV[X]
[Time] - [Decision made] - [Reason]
[Time] - [Action taken] - [By whom]
[Time] - [Update] - [Progress]
...
DECISIONS MADE:
1. Roll back payment service at [Time] - Faster than debugging
2. Wake senior engineer at [Time] - Needed SME knowledge
3. Customer communication sent at [Time] - Per SEV2 protocol
STAKEHOLDER UPDATES:
- [Time] Email sent to leadership
- [Time] Status page updated
- [Time] Support team notified
RESOLUTION:
[Time] - Fix deployed
[Time] - Verified working
[Time] - Incident closed
Duration: [X hours Y minutes]
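If you prefer to keep the running log programmatically rather than in a doc, a minimal sketch mirroring the template above (assuming UTC timestamps and plain-text output; real teams often just log in the incident channel):

```python
from datetime import datetime, timezone

class IncidentLog:
    """Minimal running log for the PM to keep during an incident.

    A sketch only; the shape mirrors the plain-text template above.
    """

    def __init__(self, name: str, severity: str):
        self.header = f"INCIDENT LOG: {name}\nSeverity: {severity}"
        self.entries: list[str] = []

    def note(self, text: str) -> str:
        """Append a UTC-timestamped entry and return it."""
        stamp = datetime.now(timezone.utc).strftime("%H:%M")
        entry = f"{stamp} - {text}"
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        """Render the full log as plain text."""
        return "\n".join([self.header, "TIMELINE:", *self.entries])
```

Whatever tool you use, the point is the same: timestamp every decision as it happens, because nobody remembers the order of events afterwards.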
Escalation: When and How to Raise the Alarm
Knowing when to escalate is crucial. Here’s a framework:
Automatic Escalation Triggers
Always escalate if:
- SEV1 is declared
- Resolution ETA exceeds 2 hours
- Customer churn risk is high
- Media/legal/regulatory exposure
- Data breach suspected
- Multiple major customers affected
How to Escalate
Step 1: Inform the Incident Commander
“Given the customer impact and approaching renewal deadline, I believe we need to escalate this to leadership. Do you agree?”
Step 2: Send escalation message
To: VP Engineering, VP Product (or appropriate leadership)
Subject: [ESCALATION] SEV2: Checkout incident - Customer impact
Escalating due to customer impact.
Summary: Checkout errors affecting 20% of attempts for 90 minutes
Impact: [Major customer] renewal decision tomorrow; estimated $X revenue at risk
Current Status: Engineering investigating, no ETA yet
What I Need: Guidance on customer communication; decision on mitigation options
I'm available to discuss immediately.
[Your name]
[Phone number]
What NOT to Do
- Don’t escalate silently (tell the incident commander)
- Don’t escalate to bypass decisions you don’t like
- Don’t copy too many people (creates noise)
- Don’t send long emails (executives need summaries)
Resolution and Recovery: Getting Back to Normal
Once the fix is deployed, there’s still work to do.
Immediate (First Hour)
- Verify fix is working
- Monitor for recurrence
- Update all stakeholders
- Close incident channel/bridge
- Thank the responders
Short-Term (First 24 Hours)
- Send follow-up to affected customers (if appropriate)
- Update status page with resolution
- Gather initial data for post-mortem
- Schedule post-mortem meeting
Medium-Term (First 48 Hours)
- Complete post-mortem
- Create action items for prevention
- Share learnings with broader team
- Update runbooks if needed
Post-Mortems: The Real Learning Opportunity
The post-mortem is where incidents become valuable. Here’s how to run them effectively.
The Blameless Principle
Critical rule: Post-mortems are never about blame. They’re about system improvement.
❌ Blameful: "John pushed bad code"
        ↓
✅ Blameless: "The deployment process lacks automated validation"
Why this matters: If people fear blame, they hide information. You can’t improve if you don’t know what happened.
The Post-Mortem Template
# Incident Post-Mortem: [Incident Name]
**Date:** [Date]
**Severity:** SEV[X]
**Duration:** [Start] to [End] ([Total])
**Author:** [Name]
## Summary
[2-3 sentences describing what happened]
## Impact
- User impact: [X users affected, Y% of total]
- Business impact: [Revenue lost, customers affected]
- Duration: [X hours Y minutes]
## Timeline
[Detailed timeline of what happened]
## Root Cause
[The underlying reason this happened - not the immediate trigger]
## Contributing Factors
- Factor 1
- Factor 2
- Factor 3
## What Went Well
- [Things that helped resolve the incident quickly]
## What Could Be Improved
- [Things that slowed resolution or made it worse]
## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| [Action 1] | [Name] | [Date] |
| [Action 2] | [Name] | [Date] |
## Lessons Learned
[Key takeaways for future incidents]
Your Role in Post-Mortems
As PM, you contribute:
- Business impact analysis: Quantify the damage (revenue, users, churn risk)
- Customer perspective: What did users experience? How did they react?
- Prioritization input: Which action items matter most?
- Follow-up ownership: Own non-technical action items (customer communication, documentation)
Questions to Ask in Post-Mortems
- “What would have prevented this entirely?”
- “How could we have detected this faster?”
- “What slowed down our resolution?”
- “What would we do differently next time?”
- “Is there a pattern with similar incidents?”
Building Incident Resilience: What to Do Before Things Break
The best incident response is preparation. Here’s what to do now:
Know Your Incident Process
- Where is the incident runbook?
- Who’s on-call this week?
- What’s the escalation path?
- Where is the incident channel?
- What are the severity criteria?
Build Relationships Before Crises
- Know the engineering leads by name
- Understand which teams own which systems
- Have executive contact info ready
- Build trust now so engineers rely on your judgment during incidents
Pre-Write Communication Templates
- Have email templates ready to customize
- Know the status page update process
- Have customer messaging approved in advance
- Create a stakeholder distribution list
Practice
- Participate in incident drills (game days)
- Review past post-mortems
- Shadow an actual incident if possible
- Know what you don’t know
Common PM Mistakes During Incidents
Mistake 1: Trying to Be Technical
What happens: You ask technical questions, suggest solutions, or try to debug.
The result: You distract engineers, slow down resolution, and look incompetent.
The fix: Stay in your lane. Business context. Communication. Support.
Mistake 2: Going Silent
What happens: You don’t know what to do, so you do nothing.
The result: Stakeholders are uninformed, customers are angry, and you look disengaged.
The fix: Always send initial acknowledgment. Set update cadence. Even “still investigating” is an update.
Mistake 3: Over-Promising
What happens: You say “we’ll be back up in 30 minutes” based on hope.
The result: When 30 minutes passes, stakeholders lose trust. You’ve created an expectation you can’t control.
The fix: Communicate status, not predictions. “Engineering is working on a fix” is honest. “We’ll be back in 30 minutes” is guessing.
Mistake 4: Escalating Emotionally
What happens: You’re scared, so you call leadership. Or you wake up the VP because “this is serious.”
The result: You create panic, damage trust with engineering, and distract leadership without cause.
The fix: Follow escalation procedures. Have criteria. Escalate strategically, not emotionally.
Mistake 5: Not Doing the Post-Mortem
What happens: Incident resolved, everyone moves on. No documentation, no learning.
The result: Same incident happens again. You didn’t improve.
The fix: Always do post-mortems for SEV1 and SEV2. Create action items. Track completion.
Your Incident Readiness Checklist
Right Now (Before the Next Incident)
- Save the incident channel name/link
- Know who’s on-call (bookmark the schedule)
- Bookmark the incident runbook
- Save communication templates somewhere accessible
- Know your severity criteria
This Week
- Introduce yourself to the on-call engineers
- Review the last 3 post-mortems
- Understand escalation procedures
- Create stakeholder distribution list
- Ask engineering if there are any known risks
This Month
- Participate in a game day/drill
- Review incident metrics (MTTR, frequency)
- Identify gaps in incident process
- Propose improvements based on patterns
The Bottom Line
Incidents are inevitable. Your response is not.
Good PM incident response:
- Provides business context when needed
- Handles stakeholder communication
- Supports the team without interfering
- Learns from every incident
Bad PM incident response:
- Tries to be technical
- Goes silent or over-communicates
- Creates more noise than signal
- Skips the post-mortem
The difference isn’t experience or technical knowledge. It’s understanding your role and executing it well.
Your action item: Find your team’s incident runbook. Read it. Bookmark it. Then ask an engineer to walk you through what they need from you during an incident.
Because the next 3 AM call is coming. The question is: will you be ready?
What’s your biggest concern about incident response? What would help you feel more prepared?