7 min read·Lesson 6 of 8

Postmortems and Blameless Culture

Writing postmortems that generate lasting improvement — structure, facilitation, root cause analysis, and the cultural practices that make them effective.

The incident is resolved. Users are happy. The team is tired. The temptation is to move on. The organisations that move on are the ones that experience the same incident three months later. The organisations that do postmortems properly are the ones that get measurably more reliable over time.

What Is a Postmortem?

A postmortem (also: incident review, after-action review, retrospective) is a structured analysis of a significant incident that produces:

  1. A shared, accurate understanding of what happened and why
  2. Identified contributing factors (plural — incidents are rarely single-cause)
  3. Concrete action items with owners and due dates
  4. A written record for institutional memory

The goal is learning, not assigning blame. This distinction is the foundation of blameless postmortems.

Why Blameless?

When postmortems are blame-focused:

  • Engineers hide information to avoid punishment
  • On-call engineers hesitate to escalate, fearing they'll be blamed for the disruption
  • The timeline is sanitised to omit embarrassing decisions
  • Root causes remain hidden because people focus on defending their actions
  • New engineers fear the on-call rotation

When postmortems are learning-focused:

  • Engineers share full information, including mistakes
  • Root causes are actually found and fixed
  • The organisation builds resilience over time
  • Engineers feel safe escalating and declaring incidents

"We approach incidents with the assumption that engineers did the best they could with the information, tools, and context they had at the time."

Blameless does not mean consequence-free for systemic bad behaviour — but the postmortem is not the venue for HR issues. Those are handled separately.

The Postmortem Template

# Postmortem: Payments Service Outage — 2026-05-27

**Severity:** SEV2  
**Duration:** 14:22 – 14:42 UTC (20 minutes)  
**Author:** @carol  
**Reviewers:** @alice, @bob, @dave  
**Published:** 2026-05-29  

---

## Summary
A connection leak in the payments service v2.14 deploy caused the database connection 
pool to saturate, resulting in 502 errors for approximately 20% of payment transactions 
over 20 minutes. Service was restored via rollback. No data loss occurred.

## Timeline
| Time (UTC) | Event |
|-----------|-------|
| 13:45 | payments-service v2.14 deployed (automated, part of release train) |
| 14:22 | Alert: payments error rate exceeded 5% threshold |
| 14:24 | @bob acknowledged; INC-2934 opened |
| 14:26 | SEV2 declared; @alice joined as IC |
| 14:30 | DB connection pool at 98%; correlation with v2.14 noted |
| 14:32 | Rollback initiated |
| 14:42 | Error rate normalised; incident resolved |

## Root Cause
A connection leak was introduced in v2.14 when a new retry mechanism failed to return 
connections to the pool on timeout. Over the 37 minutes after the deploy, the pool saturated, 
and requests that could not obtain a connection failed with a connection timeout.

## Contributing Factors
1. Unit test coverage for the new retry mechanism did not include the timeout code path.
2. Load testing does not simulate connection pool exhaustion.
3. The deploy passed all health checks because saturation occurs slowly — 
   health checks don't measure pool utilisation.
4. No alert existed for connection pool utilisation (only for error rate).

## What Went Well
- Alert fired at 14:22 when the error rate crossed the 5% threshold, and was acknowledged within 2 minutes
- Rollback was executed without delay once the hypothesis was confirmed
- Incident channel stayed focused; no noise from additional check-ins
- Postmortem runbook was up to date

## What Could Be Improved
- Connection pool saturation took 37 minutes to cause visible user impact — 
  we should detect it earlier
- No runbook for DB connection pool investigation existed — created by @bob post-incident

## Action Items
| # | Action | Owner | Due | Status |
|---|--------|-------|-----|--------|
| 1 | Add connection pool utilisation alert (threshold: 80%) | @bob | 2026-06-03 | Open |
| 2 | Add unit tests for retry timeout → pool return path | @carol | 2026-06-06 | Open |
| 3 | Include pool saturation scenario in load test suite | @dave | 2026-06-13 | Open |
| 4 | Create RB-052 (DB connection pool investigation) | @bob | 2026-05-31 | Done |

## Lessons Learned
Connection pool exhaustion is insidious because the failure is gradual and doesn't 
trip health checks until it's severe. We need proactive monitoring of pool utilisation 
in addition to outcome-based (error rate) monitoring.
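
The leak described in the template's Root Cause section can be sketched in a few lines. This is a minimal illustration in Python, not the payments service's actual code: `TinyPool`, `charge`, `leaky_call`, `safe_call`, and `run` are hypothetical names, and the toy pool only counts connections rather than opening real ones.

```python
class PoolExhausted(Exception):
    """Raised when every connection in the pool is already checked out."""


class TinyPool:
    """Toy fixed-size pool (hypothetical); it only counts checked-out connections."""

    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.size:
            raise PoolExhausted("no free connections")
        self.in_use += 1
        return object()  # placeholder for a real connection

    def release(self, conn):
        self.in_use -= 1


def charge(conn, should_time_out):
    """Stand-in for a database call that sometimes times out."""
    if should_time_out:
        raise TimeoutError("query timed out")
    return "ok"


def leaky_call(pool, should_time_out):
    """Buggy shape: the timeout path returns without releasing the connection."""
    conn = pool.acquire()
    try:
        result = charge(conn, should_time_out)
    except TimeoutError:
        return None  # BUG: conn is never returned to the pool
    pool.release(conn)
    return result


def safe_call(pool, should_time_out):
    """Fixed shape: finally releases on success, timeout, and any other error."""
    conn = pool.acquire()
    try:
        return charge(conn, should_time_out)
    except TimeoutError:
        return None
    finally:
        pool.release(conn)


def run(call, requests=10, pool_size=5):
    """Send requests where every other one times out; report pool state."""
    pool = TinyPool(pool_size)
    rejected = 0
    for i in range(requests):
        try:
            call(pool, should_time_out=(i % 2 == 0))
        except PoolExhausted:
            rejected += 1
    return pool.in_use, rejected


print("leaky:", run(leaky_call))  # (connections still held, requests rejected)
print("safe: ", run(safe_call))
```

In the leaky version each timeout strands one connection, so the pool fills gradually and only then starts rejecting requests. That gradual fill is why the deploy in the template passed its health checks. A unit test that forces the timeout path and asserts `in_use == 0` afterwards is one way action item 2 could be approached.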

Root Cause Analysis Techniques

Five Whys

Ask "why" repeatedly until you reach a systemic cause, not an individual action:

Why did the payments service fail?
  → Because the DB connection pool was exhausted.
Why was the pool exhausted?
  → Because connections were not being returned on timeout.
Why were connections not returned on timeout?
  → Because the retry mechanism had a bug.
Why did the bug reach production?
  → Because the unit tests didn't cover the timeout code path.
Why didn't tests cover that path?
  → Because we had no requirement to cover timeout and error paths that hold 
    pooled connections, and load testing doesn't simulate slow connection acquisition.

Root cause: No test coverage requirement or load-test scenario for connection 
pool behaviour under resource contention.

Stop at systemic causes — "because the engineer made a mistake" is never a useful stopping point. Continue asking why the system allowed that mistake to reach production.

Fishbone (Ishikawa) Diagram

For complex incidents with many contributing factors, a fishbone diagram maps causes across categories:

  • People — training gaps, communication failures
  • Process — missing review steps, absent runbooks
  • Technology — software bugs, tooling gaps
  • Environment — load patterns, external dependencies
  • Measurements — missing alerts, incorrect thresholds

This is useful when a single "five whys" chain would miss interdependent causes.
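
A fishbone does not need a drawing tool; grouping causes under the five categories is enough to show where the analysis is thin. The sketch below is one possible way to do that in Python. The function names are hypothetical, and the assignment of the example incident's causes to categories is an illustrative reading, not part of the template.

```python
from typing import Dict, List, Tuple

CATEGORIES = ("People", "Process", "Technology", "Environment", "Measurements")


def build_fishbone(causes: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (category, cause) pairs under the five fishbone categories."""
    bones: Dict[str, List[str]] = {name: [] for name in CATEGORIES}
    for category, cause in causes:
        if category not in bones:
            raise ValueError(f"unknown category: {category}")
        bones[category].append(cause)
    return bones


def empty_categories(bones: Dict[str, List[str]]) -> List[str]:
    """Categories with no causes yet; prompts for the facilitator."""
    return [name for name in CATEGORIES if not bones[name]]


# Causes taken from the example postmortem; the category labels are assumptions.
CAUSES = [
    ("Technology", "Retry mechanism did not return connections on timeout"),
    ("Process", "Unit tests did not cover the timeout code path"),
    ("Process", "Load testing does not simulate pool exhaustion"),
    ("Process", "No runbook for DB connection pool investigation"),
    ("Measurements", "Health checks do not measure pool utilisation"),
    ("Measurements", "No alert on connection pool utilisation"),
]

bones = build_fishbone(CAUSES)
for name in CATEGORIES:
    print(f"{name}: {len(bones[name])} cause(s)")
    for cause in bones[name]:
        print(f"  - {cause}")
print("Not yet explored:", ", ".join(empty_categories(bones)))
```

An empty category is not proof that nothing went wrong there. It is a prompt to ask, for example, whether load patterns or an external dependency played a part before closing the analysis.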

Action Item Quality

Action items are where postmortems deliver — or fail. Poor action items:

  • "Improve monitoring" — not specific
  • "Consider adding tests" — not committed
  • "Engineering team to review" — no owner

Good action items follow SMART criteria:

Bad:  "Improve connection pool monitoring"
Good: "Add Prometheus alert on DB connection pool utilisation > 80% 
       for the payments service. 
       Owner: @bob. Due: 2026-06-03. 
       Tracked in Jira: INFRA-4521."

Track action items in your issue tracker (Jira, GitHub Issues, Linear) — not just in the postmortem doc. Review completion in sprint planning and in the next postmortem if incidents recur.
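
Some of these checks can be automated before a postmortem is published. The following is a rough sketch, assuming action items are available as structured records. `ActionItem`, `lint`, the list of vague openers, and the `@handle` owner convention are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Openers that usually signal an uncommitted or unspecific item (assumed list).
VAGUE_OPENERS = ("improve", "consider", "review", "look into")


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None   # assumed convention: an individual handle such as "@bob"
    due: Optional[date] = None
    ticket: Optional[str] = None  # issue tracker key


def lint(item: ActionItem) -> List[str]:
    """Return a list of problems; an empty list means the item passes."""
    problems: List[str] = []
    if item.description.strip().lower().startswith(VAGUE_OPENERS):
        problems.append("vague wording: state the concrete change")
    if not (item.owner and item.owner.startswith("@")):
        problems.append("no individual owner")
    if item.due is None:
        problems.append("no due date")
    if not item.ticket:
        problems.append("not tracked in the issue tracker")
    return problems


bad = ActionItem("Improve connection pool monitoring")
good = ActionItem(
    "Add Prometheus alert on DB connection pool utilisation > 80% "
    "for the payments service",
    owner="@bob",
    due=date(2026, 6, 3),
    ticket="INFRA-4521",
)

for item in (bad, good):
    print(item.description[:45], "->", lint(item) or "OK")
```

A linter like this catches missing fields and obviously soft wording. It cannot judge whether an item is realistic or actually addresses a contributing factor, so that still needs a human reviewer.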

The Postmortem Meeting

For SEV1/2 incidents, hold a 60–90 minute meeting within 72 hours of resolution:

  1. Facilitator (often the IC or a senior engineer who was not the primary responder) sets a learning-focused tone
  2. Walk the timeline collaboratively — allow corrections and additions
  3. Identify contributing factors — brainstorm broadly before narrowing
  4. Generate action items — every contributing factor should map to at least one action
  5. Close with "What did we learn?"
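
Step 4's rule, that every contributing factor should map to at least one action, is easy to check mechanically once the mapping is written down. The sketch below uses the factor and action numbers from the example postmortem. The mapping in `ACTION_ADDRESSES` is one plausible reading rather than something the template states, and `uncovered_factors` is a hypothetical helper.

```python
from typing import Dict, List, Set

# Contributing factors from the example postmortem, keyed by their number.
FACTORS: Dict[int, str] = {
    1: "Unit tests did not cover the timeout code path",
    2: "Load testing does not simulate pool exhaustion",
    3: "Health checks do not measure pool utilisation",
    4: "No alert on connection pool utilisation",
}

# Action item number -> factor numbers it addresses (assumed mapping).
ACTION_ADDRESSES: Dict[int, Set[int]] = {
    1: {4},    # pool utilisation alert
    2: {1},    # unit tests for the retry timeout path
    3: {2},    # pool saturation scenario in load tests
    4: set(),  # runbook RB-052 comes from "What Could Be Improved", not a factor
}


def uncovered_factors(
    factors: Dict[int, str], addresses: Dict[int, Set[int]]
) -> List[int]:
    """Return factor numbers that no action item addresses."""
    covered: Set[int] = set()
    for factor_ids in addresses.values():
        covered |= factor_ids
    return sorted(number for number in factors if number not in covered)


for number in uncovered_factors(FACTORS, ACTION_ADDRESSES):
    print(f"Factor {number} has no action item: {FACTORS[number]}")
```

Under this reading, factor 3 is left without an action. The group might decide that the new alert covers it well enough, or add an item to include pool utilisation in deploy health checks. Either way the gap becomes an explicit decision rather than an oversight.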

Avoid:

  • Starting with "what went wrong" before building a shared timeline
  • Letting the discussion turn into individuals defending their actions
  • Vague action items generated in the last 5 minutes
  • Meeting without a prepared draft postmortem — the doc should be ready before the meeting, updated during

Publishing Postmortems

Internal publication: Always. Publish to a shared wiki or postmortem database accessible to all engineers. Learning compounds across teams when incidents are visible.

External publication: Consider. Many leading tech companies publish postmortems publicly (GitHub, Cloudflare, AWS, Stripe). Benefits: builds customer trust, demonstrates engineering maturity, contributes to the industry's shared learning.

External postmortems require:

  • Plain language — no jargon
  • No internal system names that reveal security posture
  • Focus on customer impact and remediation, with some technical detail
  • Review by Legal and Comms before publication

Building a Postmortem Culture

Culture is established by what leadership does, not what it says. Signs of a healthy postmortem culture:

  • Senior engineers attend and participate as equals
  • Action items from previous postmortems are reviewed at the start of new ones
  • Engineers voluntarily write postmortems for near-misses, not just actual incidents
  • The frequency of recurring incident types decreases over quarters
  • On-call engineers feel that postmortems make their job easier, not more stressful

The postmortem is one of the highest-leverage reliability practices available. One good postmortem that produces two completed action items is worth more than ten that produce nothing.

Key Takeaways

  • A blameless postmortem focuses on system and process failures, not individual mistakes.
  • The five-whys and fishbone diagram are structured root cause analysis tools.
  • Action items must be SMART (specific, measurable, assignable, realistic, time-bound), or they are unlikely to be completed.
  • Publish postmortems internally and consider external publication — it builds trust.
  • The postmortem review meeting is a learning forum, not a performance review.
