📋 Table of Contents
- The 3 AM Call That Changed How I Think About Incidents
- What Actually Happens During a Production Incident
- Your Role as PM: What You Should and Shouldn’t Do
- Incident Severity Levels: The Framework That Matters
- The First 30 Minutes: Your Action Checklist
- Stakeholder Communication: Templates That Work
- The War Room: How to Participate Without Getting in the Way
- Escalation: When and How to Raise the Alarm
- Resolution and Recovery: Getting Back to Normal
- Post-Mortems: The Real Learning Opportunity
- Building Incident Resilience: What to Do Before Things Break
- Common PM Mistakes During Incidents
- Your Incident Readiness Checklist
- The Bottom Line
The 3 AM Call That Changed How I Think About Incidents
March 2024. My phone rang at 3:17 AM.
I didn’t recognize the number. I almost didn’t answer. But something made me pick up.
“Hey, it’s Marcus from engineering. We’ve got a production incident. Users can’t log in. We’re trying to figure out what’s wrong, but we need someone to make a call on whether we roll back.”
I was groggy. Confused. I asked questions that didn’t make sense. I said “let me check with stakeholders” (at 3 AM). I tried to micromanage the technical response.
I was the worst possible version of a PM in that moment.
The incident lasted 4 hours. It should have been 90 minutes. My confusion and hesitation added 2.5 hours to the resolution time. Users were locked out for most of the morning. The company lost $75,000 in revenue.
Afterwards, the engineering lead pulled me aside.
“Next time,” he said quietly, “just tell us you need time to wake up. We would have made the call and updated you after. Your panic helped nobody.”
He was right.
That experience sent me on a mission to understand what PMs should actually do during incidents. I interviewed engineering leads, SREs, and experienced PMs at companies with great incident response. I read every incident post-mortem I could find.
Here’s what I learned: Your role during incidents is specific, important, and completely different from what most PMs think it is.
What Actually Happens During a Production Incident
Before we talk about your role, you need to understand the anatomy of an incident.
The Incident Lifecycle
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  DETECTION   │────▶│    TRIAGE    │────▶│   RESPONSE   │
│              │     │              │     │              │
│ Alert fires  │     │ Severity     │     │ Investigation│
│ User report  │     │ Assignment   │     │ Mitigation   │
│ Monitoring   │     │ Escalation   │     │ Resolution   │
└──────────────┘     └──────────────┘     └──────────────┘
                                                 │
                                                 ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   RECOVERY   │◀────│  RESOLUTION  │◀────│ COMMUNICATE  │
│              │     │              │     │              │
│ Monitoring   │     │ Fix applied  │     │ Stakeholders │
│ Validation   │     │ Verified     │     │ Users        │
│ Normal ops   │     │ Stable       │     │ Teams        │
└──────────────┘     └──────────────┘     └──────────────┘
       │
       ▼
┌──────────────┐
│ POST-MORTEM  │
│              │
│ Learn        │
│ Improve      │
│ Document     │
└──────────────┘
Who Does What
| Role | Responsibility |
|---|---|
| On-Call Engineer | First responder, investigates, attempts fix |
| Incident Commander | Coordinates response, makes calls, assigns tasks |
| Subject Matter Expert | Deep technical knowledge of affected system |
| Communications Lead | Updates stakeholders, manages messaging |
| Product Manager | Business context, prioritization, stakeholder management |
| Engineering Manager | Resource allocation, team support, escalation |
The key insight: You’re not the incident commander. You’re not debugging. You have a specific role.
Your Role as PM: What You Should and Shouldn’t Do
What You SHOULD Do
✅ Provide business context
Engineers need to know business impact to make prioritization decisions. Is this affecting paying customers? Which features are down? What’s the revenue impact?
"This is affecting checkout, which is $X/hour in revenue.
The users impacted are primarily enterprise customers.
We have a major demo in 4 hours."
✅ Make business trade-offs
Sometimes engineers need to make a call that affects the product. Roll back a feature? Disable a service? Prioritize one fix over another?
"If we have to choose, prioritize checkout over search.
Checkout is revenue-critical; search can be degraded."
✅ Communicate with stakeholders
While engineers fix the problem, you handle communication. Executives, customer success, sales, marketingβthey all need updates.
"Update at 3:45 AM: We're aware of login issues affecting
approximately 15% of users. Engineering is investigating.
Will update in 30 minutes."
✅ Document decisions
During the chaos, someone needs to record what decisions were made and why. This helps with post-mortems and future incidents.
"3:23 AM: Decided to roll back payment service to v2.3.1
instead of hotfixing. Reason: Rollback is faster and
safer given unknown root cause."
✅ Support the team
Bring coffee. Order food. Shield them from interruptions. Your job is to enable engineers to focus.
What You Should NOT Do
❌ Don’t try to debug
You’re not qualified. Your questions will slow people down. Let the engineers do their job.
❌ Don’t micromanage
“Have you tried X?” “What about Y?” “Why is this taking so long?” These questions help nobody and frustrate everyone.
❌ Don’t make technical decisions
You don’t know whether to restart the database or scale the cluster. Don’t pretend you do.
❌ Don’t escalate prematurely
Calling the VP of Engineering at 3 AM because you’re scared is not helpful. Follow escalation procedures.
❌ Don’t make promises
“We’ll be back up in 30 minutes” is a promise you can’t keep. Communicate status, not predictions.
Incident Severity Levels: The Framework That Matters
Different incidents require different responses. Most teams use a severity framework:
SEV1: Critical
Definition: Complete outage, data loss risk, security breach, or major revenue impact
Examples:
- All users unable to access product
- Payment processing completely down
- Data breach detected
- Database corruption
Response:
- All hands on deck
- Immediate escalation to leadership
- Wake people up
- Customer communication within 15 minutes
- Target resolution: ASAP
PM Role: Full engagement, executive communication, customer messaging
SEV2: High
Definition: Significant feature degradation affecting many users
Examples:
- Login slow for 50%+ of users
- Checkout errors >5%
- Major feature unavailable
- Significant data inconsistencies
Response:
- On-call + relevant team
- Page if outside business hours
- Customer communication within 30 minutes
- Target resolution: <2 hours
PM Role: Stakeholder communication, business context, monitoring escalation
SEV3: Medium
Definition: Minor degradation affecting some users
Examples:
- Specific feature slow or broken
- Errors affecting <5% of users
- Non-critical service degraded
Response:
- On-call handles during business hours
- Slack update to team
- No customer communication unless asked
- Target resolution: <8 hours
PM Role: Awareness, potential prioritization input
SEV4: Low
Definition: Minor issues with no user impact
Examples:
- Internal tool slow
- Non-critical background job failing
- Monitoring alert with no visible impact
Response:
- Ticket created
- Address in normal workflow
- No urgency
PM Role: None required
The Decision Matrix
User Impact?
├── None → SEV4 (Low)
├── Small number of users → SEV3 (Medium)
├── Many users, degraded → SEV2 (High)
└── All users, complete outage → SEV1 (Critical)
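The matrix above is mechanical enough to sketch as a tiny helper. This is illustrative only; the 50% cutoff for "many users" is an assumption, not an official threshold, so adapt it to your own severity criteria.

```python
def classify_severity(affected_fraction: float, complete_outage: bool = False) -> str:
    """Rough mapping from user impact to severity, per the decision matrix.

    The 0.5 cutoff for "many users" is an illustrative assumption,
    not an official threshold.
    """
    if complete_outage:
        return "SEV1"   # all users, complete outage
    if affected_fraction >= 0.5:
        return "SEV2"   # many users, degraded
    if affected_fraction > 0:
        return "SEV3"   # small number of users
    return "SEV4"       # no user impact
```

In practice the engineer or incident commander sets severity; this sketch is just the matrix made explicit so you can sanity-check the call.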
The First 30 Minutes: Your Action Checklist
When you learn about an incident, here’s exactly what to do:
Minutes 0-5: Assess
- Understand the severity level
- Confirm who’s responding (incident commander, on-call)
- Identify the user impact (how many, which segment)
- Determine if you’re needed immediately or can join later
Script:
“I understand we have an incident. What’s the current severity? Who’s the incident commander? What’s the user impact? Do you need me right now?”
Minutes 5-15: Gather Context
- Understand what feature/system is affected
- Identify business impact (revenue, customers, demos)
- Check if there’s a scheduled event that amplifies impact
- Identify who needs to be informed
Script:
“The checkout flow is affected. That’s about $X/hour in revenue. We have a major customer demo at 10 AM. I’ll start stakeholder communication.”
Minutes 15-30: Communicate
- Send initial update to stakeholders (see templates)
- Set expectations for next update time
- Join incident channel/bridge
- Identify if you can help or should stand by
Script:
“Initial update sent. I’m in the incident channel if you need business input. I’ll send another update in 30 minutes unless something changes.”
Stakeholder Communication: Templates That Work
Communication during incidents is its own skill. Here are templates that work:
Initial Acknowledgment (Within 15 minutes of SEV1/SEV2)
To: Leadership, Customer Success, Support
Subject: [SEV2] Incident: Checkout errors affecting users
We are aware of an issue affecting the checkout process.
Engineering is actively investigating.
Impact: Users may experience errors when attempting to complete purchases
Affected: Approximately 20% of checkout attempts
Status: Investigating
Next update: [30 minutes from now]
If you receive customer inquiries, please direct them to support.
I will provide updates every 30 minutes until resolved.
[Your name]
Status Update (Every 30-60 minutes)
To: Same as above
Subject: [SEV2] Update #2: Checkout errors
Update on the checkout incident:
Current Status: We have identified the issue as related to the payment
processor integration. Engineering is implementing a fix.
Progress:
- Identified root cause: Payment API timeout
- Fix in progress: Implementing fallback payment gateway
- ETA: Expecting resolution in approximately 30 minutes
Impact: Issue continues to affect ~20% of checkout attempts
Next update: [30 minutes from now]
[Your name]
Resolution Notice
To: Same as above
Subject: [RESOLVED] SEV2: Checkout errors
The checkout incident has been resolved.
Summary:
- Duration: 2 hours 15 minutes
- Root Cause: Payment processor API timeout
- Resolution: Implemented fallback to secondary payment gateway
- Impact: ~20% of checkout attempts failed during incident
Next Steps:
- Full post-mortem will be completed within 48 hours
- Preventive measures will be documented and shared
- Customer Success: Please follow up with affected customers
Thank you to everyone who helped respond.
[Your name]
Customer-Facing Message (If needed)
For status page or direct communication:
We experienced an issue with checkout processing between
[TIME] and [TIME] UTC. During this time, some customers
may have encountered errors when completing purchases.
The issue has been resolved. If your payment was affected:
- Failed transactions were not charged
- Please retry your purchase
- Contact support if you continue to experience issues
We apologize for any inconvenience.
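Pre-writing these messages as fill-in templates pays off at 3 AM. A minimal sketch using Python's `string.Template`; the field names here are assumptions for illustration, not a standard format:

```python
from string import Template

# Illustrative pre-written status update; field names are assumptions --
# adapt them to your own templates.
STATUS_UPDATE = Template(
    "[$severity] Update: $title\n"
    "Current Status: $status\n"
    "Impact: $impact\n"
    "Next update: $next_update"
)

def render_update(**fields: str) -> str:
    """Fill in the template. Template.substitute raises KeyError on a
    missing field, which catches incomplete updates before they go out."""
    return STATUS_UPDATE.substitute(**fields)
```

The deliberate choice of `substitute` over `safe_substitute` means a half-filled update fails loudly instead of going out with a blank field.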
The War Room: How to Participate Without Getting in the Way
During major incidents, teams gather (virtually or physically) in a “war room.” Here’s how to participate effectively:
Your Job in the War Room
- Listen. Don’t interrupt with questions. Engineers need to communicate with each other.
- Note business context. If someone asks about impact, provide it quickly.
- Handle external communication. Shield the team from stakeholder interruptions.
- Document decisions. Keep a running log.
What Not to Do
- Don’t ask “what’s happening?” every 5 minutes
- Don’t suggest technical solutions
- Don’t interrupt debugging conversations
- Don’t pull people away for non-urgent updates
The Incident Log Template
Keep this updated during the incident:
INCIDENT LOG: [DATE] - [INCIDENT NAME]
Severity: SEV[X]
Incident Commander: [Name]
Start Time: [Time]
Affected Systems: [List]
TIMELINE:
[Time] - Incident detected via [alert/user report]
[Time] - On-call [Name] acknowledged
[Time] - Severity set to SEV[X]
[Time] - [Decision made] - [Reason]
[Time] - [Action taken] - [By whom]
[Time] - [Update] - [Progress]
...
DECISIONS MADE:
1. Roll back payment service at [Time] - Faster than debugging
2. Wake senior engineer at [Time] - Needed SME knowledge
3. Customer communication sent at [Time] - Per SEV2 protocol
STAKEHOLDER UPDATES:
- [Time] Email sent to leadership
- [Time] Status page updated
- [Time] Support team notified
RESOLUTION:
[Time] - Fix deployed
[Time] - Verified working
[Time] - Incident closed
Duration: [X hours Y minutes]
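If you prefer to keep the running log programmatically rather than in a doc, a minimal sketch mirroring the template above (assuming UTC timestamps and plain-text output; real teams often just log in the incident channel):

```python
from datetime import datetime, timezone

class IncidentLog:
    """Minimal running log for the PM to keep during an incident.

    A sketch only; the shape mirrors the plain-text template above.
    """

    def __init__(self, name: str, severity: str):
        self.header = f"INCIDENT LOG: {name}\nSeverity: {severity}"
        self.entries: list[str] = []

    def note(self, text: str) -> str:
        """Append a UTC-timestamped entry and return it."""
        stamp = datetime.now(timezone.utc).strftime("%H:%M")
        entry = f"{stamp} - {text}"
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        """Render the full log as plain text."""
        return "\n".join([self.header, "TIMELINE:", *self.entries])
```

Whatever tool you use, the point is the same: timestamp every decision as it happens, because nobody remembers the order of events afterwards.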
Escalation: When and How to Raise the Alarm
Knowing when to escalate is crucial. Here’s a framework:
Automatic Escalation Triggers
Always escalate if:
- SEV1 is declared
- Resolution ETA exceeds 2 hours
- Customer churn risk is high
- Media/legal/regulatory exposure
- Data breach suspected
- Multiple major customers affected
How to Escalate
Step 1: Inform the Incident Commander
“Given the customer impact and approaching renewal deadline, I believe we need to escalate this to leadership. Do you agree?”
Step 2: Send escalation message
To: VP Engineering, VP Product (or appropriate leadership)
Subject: [ESCALATION] SEV2: Checkout incident - Customer impact
Escalating due to customer impact.
Summary: Checkout errors affecting 20% of attempts for 90 minutes
Impact: [Major customer] renewal decision tomorrow; estimated $X revenue at risk
Current Status: Engineering investigating, no ETA yet
What I Need: Guidance on customer communication; decision on mitigation options
I'm available to discuss immediately.
[Your name]
[Phone number]
What NOT to Do
- Don’t escalate silently (tell the incident commander)
- Don’t escalate to bypass decisions you don’t like
- Don’t copy too many people (creates noise)
- Don’t send long emails (executives need summaries)
Resolution and Recovery: Getting Back to Normal
Once the fix is deployed, there’s still work to do.
Immediate (First Hour)
- Verify fix is working
- Monitor for recurrence
- Update all stakeholders
- Close incident channel/bridge
- Thank the responders
Short-Term (First 24 Hours)
- Send follow-up to affected customers (if appropriate)
- Update status page with resolution
- Gather initial data for post-mortem
- Schedule post-mortem meeting
Medium-Term (First 48 Hours)
- Complete post-mortem
- Create action items for prevention
- Share learnings with broader team
- Update runbooks if needed
Post-Mortems: The Real Learning Opportunity
The post-mortem is where incidents become valuable. Here’s how to run them effectively.
The Blameless Principle
Critical rule: Post-mortems are never about blame. They’re about system improvement.
❌ Blameful: "John pushed bad code"
        ↓
✅ Blameless: "The deployment process lacks automated validation"
Why this matters: If people fear blame, they hide information. You can’t improve if you don’t know what happened.
The Post-Mortem Template
# Incident Post-Mortem: [Incident Name]
**Date:** [Date]
**Severity:** SEV[X]
**Duration:** [Start] to [End] ([Total])
**Author:** [Name]
## Summary
[2-3 sentences describing what happened]
## Impact
- User impact: [X users affected, Y% of total]
- Business impact: [Revenue lost, customers affected]
- Duration: [X hours Y minutes]
## Timeline
[Detailed timeline of what happened]
## Root Cause
[The underlying reason this happened - not the immediate trigger]
## Contributing Factors
- Factor 1
- Factor 2
- Factor 3
## What Went Well
- [Things that helped resolve the incident quickly]
## What Could Be Improved
- [Things that slowed resolution or made it worse]
## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| [Action 1] | [Name] | [Date] |
| [Action 2] | [Name] | [Date] |
## Lessons Learned
[Key takeaways for future incidents]
Your Role in Post-Mortems
As PM, you contribute:
- Business impact analysis: Quantify the damage (revenue, users, churn risk)
- Customer perspective: What did users experience? How did they react?
- Prioritization input: Which action items matter most?
- Follow-up ownership: Own non-technical action items (customer communication, documentation)
Questions to Ask in Post-Mortems
- “What would have prevented this entirely?”
- “How could we have detected this faster?”
- “What slowed down our resolution?”
- “What would we do differently next time?”
- “Is there a pattern with similar incidents?”
Building Incident Resilience: What to Do Before Things Break
The best incident response is preparation. Here’s what to do now:
Know Your Incident Process
- Where is the incident runbook?
- Who’s on-call this week?
- What’s the escalation path?
- Where is the incident channel?
- What are the severity criteria?
Build Relationships Before Crises
- Know the engineering leads by name
- Understand which teams own which systems
- Have executive contact info ready
- Build trust now so engineers rely on your judgment during incidents
Pre-Write Communication Templates
- Have email templates ready to customize
- Know the status page update process
- Have customer messaging approved in advance
- Create a stakeholder distribution list
Practice
- Participate in incident drills (game days)
- Review past post-mortems
- Shadow an actual incident if possible
- Know what you don’t know
Common PM Mistakes During Incidents
Mistake 1: Trying to Be Technical
What happens: You ask technical questions, suggest solutions, or try to debug.
The result: You distract engineers, slow down resolution, and look incompetent.
The fix: Stay in your lane. Business context. Communication. Support.
Mistake 2: Going Silent
What happens: You don’t know what to do, so you do nothing.
The result: Stakeholders are uninformed, customers are angry, and you look disengaged.
The fix: Always send initial acknowledgment. Set update cadence. Even “still investigating” is an update.
Mistake 3: Over-Promising
What happens: You say “we’ll be back up in 30 minutes” based on hope.
The result: When 30 minutes passes, stakeholders lose trust. You’ve created an expectation you can’t control.
The fix: Communicate status, not predictions. “Engineering is working on a fix” is honest. “We’ll be back in 30 minutes” is guessing.
Mistake 4: Escalating Emotionally
What happens: You’re scared, so you call leadership. Or you wake up the VP because “this is serious.”
The result: You create panic, damage trust with engineering, and distract leadership without cause.
The fix: Follow escalation procedures. Have criteria. Escalate strategically, not emotionally.
Mistake 5: Not Doing the Post-Mortem
What happens: Incident resolved, everyone moves on. No documentation, no learning.
The result: Same incident happens again. You didn’t improve.
The fix: Always do post-mortems for SEV1 and SEV2. Create action items. Track completion.
Your Incident Readiness Checklist
Right Now (Before the Next Incident)
- Save the incident channel name/link
- Know who’s on-call (bookmark the schedule)
- Bookmark the incident runbook
- Save communication templates somewhere accessible
- Know your severity criteria
This Week
- Introduce yourself to the on-call engineers
- Review the last 3 post-mortems
- Understand escalation procedures
- Create stakeholder distribution list
- Ask engineering if there are any known risks
This Month
- Participate in a game day/drill
- Review incident metrics (MTTR, frequency)
- Identify gaps in incident process
- Propose improvements based on patterns
The Bottom Line
Incidents are inevitable. Your response is not.
Good PM incident response:
- Provides business context when needed
- Handles stakeholder communication
- Supports the team without interfering
- Learns from every incident
Bad PM incident response:
- Tries to be technical
- Goes silent or over-communicates
- Creates more noise than signal
- Skips the post-mortem
The difference isn’t experience or technical knowledge. It’s understanding your role and executing it well.
Your action item: Find your team’s incident runbook. Read it. Bookmark it. Then ask an engineer to walk you through what they need from you during an incident.
Because the next 3 AM call is coming. The question is: will you be ready?
What’s your biggest concern about incident response? What would help you feel more prepared?