Table of Contents
- Introduction: When Production Breaks
- Understanding Incident Severity Levels
- The PM’s Role During Incidents
- Communication Templates for Stakeholders
- Coordinating with Engineering During Crises
- Post-Incident Review Best Practices
- Building Incident Response Playbooks
- Preventing Incidents: PM’s Proactive Role
- Learning from Incidents: Continuous Improvement
- Incident Response Checklist for PMs
Introduction: When Production Breaks
It’s 2:47 AM. Your phone screams with the unmistakable sound of your PagerDuty alert. You fumble for it, eyes barely adjusting to the harsh screen light. The message is brief but terrifying: “SEV1 - Payment processing down - All users affected.”
Your heart rate spikes. Payment processing down? For ALL users? In the next 30 seconds, your brain cycles through panic, denial, and then—hopefully—into action mode.
This is the moment that separates good Product Managers from great ones.
I’ve lived through this scenario more times than I care to remember. The SEV1 that took down checkout during Black Friday. The database migration that went sideways at 3 AM. The certificate expiration that nobody noticed until users started screaming on Twitter.
Each incident taught me something. But the biggest lesson? Your behavior during an incident matters as much as the technical fix.
The Naked Truth: Incidents are when your users and stakeholders see the real you. Anyone can manage a product when everything works. How you handle the moments when everything breaks—that’s your legacy.
This guide isn’t about fixing technical issues. Your engineers will handle that. This is about what YOU do when production breaks: how to communicate, how to coordinate, how to make decisions under pressure, and how to turn disasters into learning opportunities.
Let’s dive in.
Understanding Incident Severity Levels
Not all incidents are created equal. Before you can respond appropriately, you need to understand what you’re dealing with.
The Standard Severity Framework
Most engineering organizations use a severity scale. Here’s the typical framework:
SEV1 (Critical)
- Complete service outage or critical functionality broken
- Affects all or most users
- Immediate revenue/customer impact
- Requires all-hands-on-deck response
- Examples: Payment system down, complete site outage, data breach
SEV2 (High)
- Major functionality impaired but not completely broken
- Affects significant portion of users
- Workarounds exist but are painful
- Requires immediate response
- Examples: Search functionality degraded, checkout slow for 30% of users, specific feature broken
SEV3 (Medium)
- Non-critical functionality impaired
- Affects small portion of users or edge cases
- Reasonable workarounds exist
- Response within business hours acceptable
- Examples: Minor feature bug, performance degradation in specific region, non-critical API errors
SEV4 (Low)
- Cosmetic issues or minor inconveniences
- Minimal user impact
- Can be addressed in normal sprint work
- Examples: UI glitches, typos, non-blocking errors in logs
Why This Matters for PMs
Severity levels determine your level of involvement:
| Severity | PM Involvement | Communication | Escalation |
|---|---|---|---|
| SEV1 | Immediate, active role | All stakeholders immediately | Executive team, possibly CEO |
| SEV2 | Prompt involvement | Key stakeholders within 15 min | Engineering leadership |
| SEV3 | Awareness, monitoring | Team lead, support lead | Engineering manager |
| SEV4 | Ticket created | As needed | None |
The Naked Truth: Severity isn’t just about technical impact—it’s about business impact. A minor technical issue during peak sales hours might be SEV2. The same issue at 3 AM on a Tuesday might be SEV3. Context matters.
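If your team encodes this routing in tooling (a paging policy, a Slack bot), it can be as small as a lookup table. Here's a minimal Python sketch of the matrix above; the field names and cadences are illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityPolicy:
    pm_involvement: str
    notify: list[str]          # who hears about it
    escalate_to: str
    update_interval_min: int   # stakeholder update cadence; 0 = none

# Illustrative mapping of the table above -- adapt names and cadences to your org.
SEVERITY_POLICIES = {
    "SEV1": SeverityPolicy("immediate, active role", ["all stakeholders"],
                           "executive team", 15),
    "SEV2": SeverityPolicy("prompt involvement", ["key stakeholders"],
                           "engineering leadership", 30),
    "SEV3": SeverityPolicy("awareness, monitoring", ["team lead", "support lead"],
                           "engineering manager", 120),
    "SEV4": SeverityPolicy("ticket created", [], "none", 0),
}

policy = SEVERITY_POLICIES["SEV1"]
print(f"Notify {', '.join(policy.notify)}; update every {policy.update_interval_min} min")
```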
Common Severity Assessment Mistakes
Over-classifying: Declaring SEV1 for every issue. This desensitizes the team and wastes resources. Reserve SEV1 for genuine emergencies.
Under-classifying: Downplaying issues to avoid alarming stakeholders. This leads to delayed response and worse outcomes.
Not reassessing: An incident that starts as SEV3 can escalate to SEV1. Keep reassessing as you learn more.
Your Role in Severity Assessment
As a PM, you bring the business context that engineers might lack:
- Is this our peak usage time?
- Are major customers affected?
- Is there a press event or important demo happening?
- What’s the revenue impact per hour of downtime?
You should be part of the severity assessment conversation, providing context that helps the team calibrate appropriately.
The PM’s Role During Incidents
Let’s be crystal clear about what you should and shouldn’t do during incidents.
What You SHOULD Do
1. Provide Business Context
Engineers focus on technical aspects. You focus on business impact:
- Which customers are affected?
- What’s the revenue impact?
- Are there SLAs or contracts at risk?
- Is there PR exposure?
2. Communicate with Stakeholders
You’re the bridge between the incident response team and the outside world. This includes:
- Internal stakeholders (leadership, other teams)
- Customer-facing teams (support, sales, customer success)
- External stakeholders (major customers, possibly public)
3. Make Business Decisions
Some incident decisions are business decisions, not technical:
- Do we disable a feature to restore service?
- Do we communicate publicly about the issue?
- Do we offer compensation to affected customers?
- Do we need to invoke disaster recovery?
4. Coordinate Cross-Team Efforts
Incidents often require coordination across teams:
- Support needs talking points
- Sales needs customer updates
- Marketing might need to pause campaigns
- Legal might need to be involved
5. Document Everything
Keep a timeline of events, decisions, and communications. This feeds into the post-mortem and might be needed for compliance or legal reasons.
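A shared doc or your incident tool is usually where this lives, but even a tiny script beats reconstructing the timeline from memory afterward. A minimal sketch of the structure; the helper and field names are my own, not from any particular tool:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TimelineEntry:
    timestamp: datetime
    actor: str   # who acted or decided
    event: str   # what happened, stated factually

timeline: list[TimelineEntry] = []

def log_event(actor: str, event: str) -> None:
    """Append a timestamped entry; UTC avoids timezone confusion in the post-mortem."""
    timeline.append(TimelineEntry(datetime.now(timezone.utc), actor, event))

log_event("PM", "Declared SEV1; notified leadership and support")
log_event("Eng lead", "Identified exhausted DB connection pool as likely cause")

for e in timeline:
    print(f"{e.timestamp:%H:%M:%S} UTC | {e.actor}: {e.event}")
```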
What You Should NOT Do
1. Don’t Interfere with Technical Response
The engineers are working on the fix. Your job isn’t to hover, ask “how much longer,” or suggest technical solutions. Let them work.
2. Don’t Add Pressure
They’re already under pressure. Adding “the CEO is asking” or “we’re losing $50K per hour” doesn’t help and might hurt.
3. Don’t Make Technical Decisions
Unless you’re also an engineer, you’re not qualified to decide which technical approach to take. Ask questions, understand tradeoffs, but let engineers decide.
4. Don’t Speculate
When communicating, stick to facts. Speculation creates confusion and can come back to haunt you.
The Incident Command Framework
Many organizations use an Incident Command System (ICS) adapted from emergency response. In this framework:
- Incident Commander (IC): Leads the response, makes final decisions
- Technical Lead: Directs the technical investigation and fix
- Communications Lead: Handles all internal and external communication
- Scribe: Documents everything
As a PM, you might serve as Communications Lead or support the IC with business decisions.
The Naked Truth: The worst incidents I’ve seen involved PMs who couldn’t resist micromanaging the technical response. Trust your engineers. Your job is different but equally important. Stay in your lane and excel there.
Communication Templates for Stakeholders
Communication during incidents is an art. You need to be clear, honest, and appropriately urgent without causing panic. Here are templates you can adapt.
Initial Internal Notification (First 15 minutes)
Subject: [SEV LEVEL] - Brief Description - STATUS
INCIDENT SUMMARY
- Issue: [Brief description]
- Severity: [SEV1/2/3]
- Impact: [Who/what is affected]
- Current Status: [Investigating/Mitigating/Resolved]
- Incident Lead: [Name]
- Incident Channel: [Slack channel/link]
NEXT UPDATE
- [Time, typically 15-30 minutes for SEV1/2]
DO
- [Specific actions recipients should take]
DO NOT
- [Specific actions recipients should avoid]
QUESTIONS
- Direct to: [Incident channel or designated person]
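If you send these often, templating them in code means no field gets forgotten at 3 AM. A minimal sketch using Python's standard library; the fields mirror the template above and the values are invented for illustration:

```python
from string import Template

INITIAL_NOTIFICATION = Template("""\
[$severity] - $description - $status

INCIDENT SUMMARY
- Issue: $description
- Severity: $severity
- Impact: $impact
- Current Status: $status
- Incident Lead: $lead
- Incident Channel: $channel

NEXT UPDATE
- $next_update
""")

# substitute() raises KeyError if any field is missing -- a feature at 3 AM.
print(INITIAL_NOTIFICATION.substitute(
    severity="SEV1",
    description="Payment processing down",
    status="Investigating",
    impact="All users; checkout unavailable",
    lead="On-call engineering lead",
    channel="#inc-payments",
    next_update="15 minutes",
))
```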
Stakeholder Update (Every 15-30 minutes for SEV1)
Subject: [SEV LEVEL] UPDATE #X - Brief Description
CURRENT STATUS
- Time since incident started: [X hours/minutes]
- Current status: [Investigating/Mitigating/Resolved]
- Progress update: [What we've learned, what we're trying]
IMPACT
- Users affected: [Number or percentage]
- Business impact: [Revenue, key customers, etc.]
- Customer reports: [Number of tickets, social mentions]
NEXT STEPS
- [What we're doing next]
- Next update in: [X minutes]
ADDITIONAL NOTES
- [Any other relevant information]
Customer-Facing Communication (Public Status Page)
TITLE: [Service Name] - Service Degradation
We are currently experiencing issues with [service name].
Impact: [Describe what users are experiencing]
Duration: Started at [time]
Status: Our team is actively investigating
We will provide updates every [X minutes] until resolved.
For the latest status, visit: [status page URL]
Resolution Communication
Subject: [RESOLVED] [SEV LEVEL] - Brief Description
INCIDENT SUMMARY
- Duration: [Start time] - [End time] (Total: X hours Y minutes)
- Root Cause: [Brief description - avoid speculation if not confirmed]
- Resolution: [What fixed it]
IMPACT SUMMARY
- Users affected: [Number/percentage]
- Business impact: [Revenue, customers, etc.]
NEXT STEPS
- Post-mortem scheduled: [Date/time]
- Follow-up actions: [Key items already identified]
PREVENTIVE MEASURES
- [Any immediate steps taken to prevent recurrence]
Thank you to everyone who responded to this incident.
Communication Principles
Be timely: Communicate early and often. Silence breeds panic and speculation.
Be honest: Don’t minimize the issue. Don’t sugarcoat. State the facts.
Be clear: Avoid jargon. Your audience might not be technical.
Be helpful: Tell people what they should do (or not do).
Be consistent: All channels should have the same message.
The Naked Truth: Your communication during an incident becomes part of the permanent record. Write everything as if it might be read by executives, customers, or lawyers. Because it might be.
Coordinating with Engineering During Crises
Your relationship with engineering during incidents can make or break the response. Here’s how to do it right.
Establish Communication Channels Before Incidents
Don’t figure out communication during the incident. Have established channels:
- Incident Slack channel: A dedicated channel for active incidents
- War room: A physical or virtual space for real-time collaboration
- On-call rotation: Know who’s on-call and how to reach them
- Escalation path: Know who to escalate to and when
The Information Flow
Here’s how information should flow:
Detection → Engineering response → Assessment → PM notification → Stakeholder communication

The loop also runs in reverse: as you communicate outward, you feed stakeholder questions and business context back to the response team.
You need to be in the loop but not in the way. Some practices:
1. Join the incident channel silently
When you join, don’t immediately ask questions. Read the history, understand the situation, then ask focused questions.
2. Ask the right questions
Good questions:
- “What’s the current impact?” (factual)
- “What’s blocking progress?” (helpful)
- “What do you need from me?” (supportive)
Bad questions:
- “When will this be fixed?” (adds pressure)
- “How did this happen?” (distraction during response)
- “Why didn’t we catch this earlier?” (blame, not helpful)
3. Offer help, don’t demand it
“This is the information I have about customer impact. Let me know if this changes your priorities.” vs. “You need to prioritize the payment issue because customers are complaining.”
4. Create space for them to work
Sometimes the best thing you can do is handle all the non-technical coordination so engineers can focus entirely on the fix.
When to Escalate
Sometimes you need to escalate to leadership or bring in additional resources:
Escalate when:
- The incident is getting worse, not better
- You need decisions that are above your pay grade
- Additional teams or resources need to be mobilized
- There are major customer or PR implications
- The technical team is stuck and needs fresh perspectives
How to escalate:
- Briefly summarize the situation
- State what you need
- Provide context on why it’s important
- Be specific about what you’re asking for
Managing Customer Escalations
VIP customers or major accounts might require special handling:
- Identify affected VIPs early
- Proactively reach out through account teams
- Provide dedicated communication channels if appropriate
- Coordinate any compensation discussions
- Follow up personally after resolution
The Naked Truth: The engineers fixing the incident are stressed, tired, and under pressure. Be the calm in the storm. Your job is to make their job easier, not to add to their burden.
Post-Incident Review Best Practices
Once the incident is resolved, the real work begins. Post-incident reviews (also called post-mortems) are where learning happens.
The Blameless Post-Mortem
The most important principle: blamelessness.
The goal is to understand what happened and prevent recurrence, not to assign blame. When people fear blame, they hide information. When they hide information, you can’t prevent future incidents.
Language matters:
- “The engineer deployed without testing” → “The deployment process didn’t have adequate testing gates”
- “Someone deleted the production database” → “The database deletion command didn’t have confirmation safeguards”
Focus on systems and processes, not individuals.
Post-Mortem Structure
1. Incident Timeline
Document what happened, when, and by whom:
- Detection time
- Response start time
- Key diagnostic steps
- Attempted fixes
- Resolution time
2. Impact Analysis
Quantify the impact:
- How many users were affected?
- How long was the service degraded?
- What was the revenue impact?
- What was the reputational impact?
3. Root Cause Analysis
This is the heart of the post-mortem. Common techniques:
5 Whys:
- Why did the service go down? The database connection pool was exhausted.
- Why was the pool exhausted? A slow query was holding connections open.
- Why was the query slow? It was missing an index.
- Why was the index missing? It wasn't added during the schema migration.
- Why wasn't it added during the migration? The migration checklist didn't include index verification.
Fishbone Diagram: Group contributing factors into categories (People, Process, Technology, Environment) and ask what in each category contributed.
4. Contributing Factors
Rarely is there a single cause. Usually multiple factors align:
- Technical cause (the immediate trigger)
- Process gap (why it wasn’t caught)
- Human factors (why someone made a mistake)
- Organizational factors (why systems weren’t in place)
5. Action Items
Specific, assigned, and tracked:
- What will we do to prevent recurrence?
- What will we do to detect similar issues earlier?
- What will we do to respond faster next time?
Each action item needs:
- Clear description
- Owner
- Due date
- Tracking mechanism
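The tool doesn't matter (Jira, a spreadsheet, a script); the four fields do. A minimal sketch of what "assigned and tracked" means in practice; the record shape and example item are illustrative:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str   # clear, specific statement of the work
    owner: str         # one accountable person, not a team
    due: date
    done: bool = False

items = [
    ActionItem("Add index verification to the migration checklist",
               "dana@example.com", date(2025, 7, 1)),
]

# A weekly review flags anything overdue and unfinished.
for item in items:
    if not item.done and item.due < date.today():
        print(f"OVERDUE: {item.description} (owner: {item.owner})")
```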
Your Role in Post-Mortems
As a PM, you contribute:
Business Context:
- Impact quantification
- Customer perspective
- Business implications
Action Item Ownership:
- Some action items will be product-related (not technical)
- Examples: Customer communication improvements, documentation updates, feature changes
Prioritization:
- Help prioritize action items against other product work
- Balance prevention investment against other priorities
Follow-Up:
- Ensure action items actually get done
- Track trends across incidents
The Naked Truth: A post-mortem without action items is just storytelling. A post-mortem with action items but no follow-up is theater. The value is in the learning and improvement that follows.
Building Incident Response Playbooks
The best time to prepare for incidents is before they happen. Playbooks help your team respond faster and more consistently.
What Should Be in Your Playbook
1. Severity Classification Guide
- Clear criteria for each severity level
- Examples of each severity
- Who to notify for each level
2. Communication Templates
- All the templates we discussed earlier
- Contact lists for different scenarios
- Approval workflows for external communication
3. Role Assignments
- Who is incident commander?
- Who handles communications?
- Who is the technical lead?
- Backup assignments for each role
4. Escalation Paths
- When to escalate
- Who to escalate to
- How to escalate (phone, page, email)
5. Technical Runbooks
- Common incident types and responses
- Diagnostic commands
- Known workarounds
6. Customer Playbooks
- How to identify affected customers
- Communication cadence for different customer tiers
- Compensation guidelines
Creating and Maintaining Playbooks
Creating playbooks:
- Start with your most common incident types
- Write what you know from experience
- Don’t try to cover everything at once
Maintaining playbooks:
- Review after every major incident
- Update based on what you learned
- Practice regularly (incident simulations)
Practice Makes Perfect
Playbooks are useless if nobody knows they exist. Regular practice:
Tabletop exercises: Walk through a hypothetical incident scenario. Discuss what each person would do.
Game days: Simulate a real incident in a controlled environment. Practice the full response.
Blameless post-mortem reviews: Periodically review past incidents and playbooks together.
The Naked Truth: The first time you use a playbook should not be during a real incident. Practice until the response is muscle memory. When the real thing happens, you want to be executing, not reading.
Preventing Incidents: PM’s Proactive Role
The best incident is the one that never happens. Here’s how PMs can prevent incidents proactively.
During Feature Development
1. Risk Assessment
For every feature, ask:
- What could go wrong?
- What’s the blast radius if it fails?
- How will we know if something goes wrong?
- What’s the rollback plan?
2. Feature Flags
Advocate for feature flags on all significant changes:
- Gradual rollouts reduce blast radius
- Instant disable capability if issues arise
- A/B testing potential before full launch
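Under the hood, a percentage rollout is typically just a stable hash of the user ID checked against a threshold, so the same user always gets the same answer. A toy sketch of the idea; real flag systems add targeting rules and kill switches on top:

```python
import hashlib

def flag_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into 0-99; the same user always lands
    in the same bucket, so their experience stays stable as the rollout widens."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

# Start at 5%, widen while metrics stay healthy, or set 0 to disable instantly.
print(flag_enabled("new_checkout", "user-42", rollout_percent=5))
```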
3. Monitoring Requirements
Every feature should have monitoring:
- What metrics indicate health?
- What alerts should be set up?
- What dashboards need to be updated?
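The health question can be made concrete as a threshold check. A toy sketch of the logic behind an error-rate alert; in practice this lives in your monitoring system, and the thresholds here are invented:

```python
def error_rate_alert(errors: int, requests: int, threshold: float = 0.01) -> bool:
    """Fire when more than 1% of requests fail, ignoring tiny samples."""
    if requests < 100:   # avoid noisy alerts on low traffic
        return False
    return errors / requests > threshold

print(error_rate_alert(errors=37, requests=2500))  # True: ~1.5% error rate
```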
4. Documentation
Ensure documentation is complete:
- Runbooks for the new feature
- Updated system diagrams
- Known limitations and edge cases
During Planning
1. Account for Technical Debt
Allocate capacity for:
- Reliability improvements
- Security patches
- Performance optimization
- Deprecation of old systems
2. Balance Speed and Stability
Use error budgets (from SRE practices):
- If error budget is healthy, you can take more risks
- If error budget is depleted, prioritize stability
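The arithmetic is worth seeing once: a 99.9% availability SLO over a 30-day window allows roughly 43 minutes of downtime. A quick sketch:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime per window for a given availability SLO."""
    return window_days * 24 * 60 * (1 - slo)

budget = error_budget_minutes(0.999)   # ~43.2 minutes per 30 days
spent = 30.0                           # minutes of downtime so far this window
print(f"Budget: {budget:.1f} min, remaining: {budget - spent:.1f} min")
# Little budget left? Prioritize stability work over risky launches.
```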
3. Code Review Standards
Advocate for:
- Adequate code review time
- Review checklist that includes reliability considerations
- Testing requirements
During Releases
1. Release Criteria
Insist on clear release criteria:
- What tests must pass?
- What performance thresholds?
- What sign-offs are required?
2. Gradual Rollout
For significant changes:
- Canary deployments (small percentage first)
- Monitor closely during rollout
- Have rollback criteria defined upfront
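"Rollback criteria defined upfront" means a predicate everyone agreed to before the deploy, not a judgment call mid-incident. A minimal sketch; the thresholds are illustrative:

```python
def should_roll_back(canary_errors: float, baseline_errors: float,
                     canary_p99_ms: float, baseline_p99_ms: float) -> bool:
    """Roll back if the canary is meaningfully worse than the baseline."""
    return (canary_errors > baseline_errors * 2.0
            or canary_p99_ms > baseline_p99_ms * 1.5)

# Checked automatically (or by the on-call) at each rollout stage.
print(should_roll_back(0.012, 0.004, 310.0, 240.0))  # True: error rate tripled
```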
3. Post-Release Monitoring
After release:
- Monitor key metrics
- Have someone on-call for the feature
- Schedule post-release check-in
The Naked Truth: You can’t prevent every incident. But you can make incidents less frequent, less severe, and easier to recover from. Every hour you invest in prevention pays off during the next incident.
Learning from Incidents: Continuous Improvement
Incidents are expensive learning opportunities. Don’t waste them.
Incident Trend Analysis
Look for patterns across incidents:
Common themes:
- Same system failing repeatedly?
- Same type of error occurring?
- Same time of day/week?
- Same team involved?
Create a trend dashboard:
- Incidents by severity over time
- Incidents by system/team
- Time to resolution trends
- Action item completion rates
Building Institutional Knowledge
Incident database:
- Store all post-mortems in a searchable location
- Tag incidents by system, type, cause
- Make it easy to find similar past incidents
Knowledge sharing:
- Share learnings at team meetings
- Rotate post-mortem facilitation
- Create a “lessons learned” newsletter
Onboarding:
- Include incident history in new engineer onboarding
- Review past incidents as training material
- Pair new PMs with experienced incident responders
Measuring Improvement
Key metrics:
- MTTD (Mean Time to Detect): How quickly do we notice issues?
- MTTR (Mean Time to Resolve): How quickly do we fix issues?
- Incident frequency: Are we having more or fewer incidents?
- Severity distribution: Are incidents getting less severe?
- Action item completion: Are we actually doing the work to improve?
Track these over time and set improvement goals.
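MTTD and MTTR fall straight out of the timestamps you're already recording during incidents. A minimal sketch, assuming each record carries started, detected, and resolved times (the sample data is invented):

```python
from datetime import datetime
from statistics import mean

# (started, detected, resolved) -- invented sample records
incidents = [
    (datetime(2025, 5, 1, 2, 47), datetime(2025, 5, 1, 2, 55),
     datetime(2025, 5, 1, 4, 10)),
    (datetime(2025, 5, 9, 14, 0), datetime(2025, 5, 9, 14, 3),
     datetime(2025, 5, 9, 14, 40)),
]

mttd = mean((d - s).total_seconds() / 60 for s, d, _ in incidents)   # detect delay
mttr = mean((r - d).total_seconds() / 60 for _, d, r in incidents)   # fix time
print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")
```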
Creating a Learning Culture
The best teams treat incidents as learning opportunities, not failures to be hidden:
- Celebrate good incident responses publicly
- Share post-mortems widely (internally)
- Encourage transparency about mistakes
- Recognize people who raise concerns early
The Naked Truth: If your team hides incidents or avoids reporting them, you have a culture problem, not a process problem. Fear of blame is the enemy of improvement. Fix the culture first.
Incident Response Checklist for PMs
Here’s a practical checklist to keep handy. Save this. Print it. Put it on your wall.
At Incident Detection
- Assess severity level (SEV1/2/3/4)
- Join incident response channel
- Understand current status and impact
- Identify who’s leading the technical response
- Determine your role (communications, coordination, etc.)
During Active Incident
- Provide business context to responders
- Communicate with stakeholders (use templates)
- Update stakeholders at regular intervals
- Document timeline of events
- Handle cross-team coordination
- Manage customer escalations
- Escalate to leadership if needed
At Resolution
- Confirm resolution with technical team
- Send resolution communication
- Update status page
- Schedule post-mortem
- Collect initial feedback from stakeholders
Post-Incident
- Attend post-mortem meeting
- Contribute business impact analysis
- Own product-related action items
- Follow up on action item completion
- Update playbooks if needed
- Thank everyone who responded
Proactive (Ongoing)
- Know your incident response process
- Keep contact lists updated
- Participate in incident simulations
- Review incident trends monthly
- Advocate for reliability improvements
- Build relationships with engineering leads
Conclusion: Embrace the Inevitable
Production incidents are inevitable. Systems will fail. Things will break. The question isn’t whether you’ll face incidents—it’s how you’ll handle them when they come.
The Product Managers who excel during incidents are the ones who:
- Understand their role and stay in their lane
- Communicate clearly and honestly
- Support their engineering team rather than adding pressure
- Learn from every incident and drive improvement
- Build systems and culture that prevent future incidents
The Naked Truth: An incident is a terrible thing to waste. Every incident is a chance to learn, improve, and build a more resilient product and team. The best Product Managers I know don’t just survive incidents—they use them to get better.
The next time your phone screams at 2:47 AM, take a breath. Remember what you’ve learned here. And know that how you handle the next few hours will define you as a Product Manager.
Now go create your incident playbooks. Before you need them.
Want to learn more about building reliable products? Check out my guide on DevOps for Product Managers: The Complete 2025 Update for more on how DevOps practices can help prevent incidents before they happen.
About the Author
Karthick Sivaraj is the founder of The Naked PM blog and a Product Manager who’s survived his fair share of 2 AM pages. He’s led incident response for products serving millions of users and believes that how you handle failure matters as much as how you celebrate success. Connect with him on LinkedIn or Twitter for more honest takes on product management and DevOps.
