
Blameless Postmortems Explained: Lessons From Real Outages

By Akshat Sandhaliya, Co-founder and CTO, Sherlocks.ai
Published on: May 22, 2026 · Last updated: May 22, 2026 · 20 min read

TL;DR

Blameless postmortems are not a process innovation. They are an organizational mechanism for making truth easier to surface than fear. Most teams operate at Level 2 of the Postmortem Maturity Ladder, compliance theater, while believing they are at Level 3 or 4.

The practitioners who built this for real (Allspaw, Milstein, Jones, Majors) all arrived at the same conclusion: the consequences of honesty must feel better than the consequences of silence.

This article covers what the guides on templates and severity tiers do not: what actually happened when experienced engineering teams tried to make blameless culture real.

Every engineering team after a major outage faces the same quiet fork in the road.

One path asks:

Who caused this?

The other asks:

What made this possible?

Most teams say they take the second path. Most teams are wrong, not out of malice, but because the first path is psychologically easier, organizationally convenient, and feels like accountability even when it produces none.

This is the gap that blameless postmortems were designed to close. And it is the reason building a genuine blameless culture is so much harder than any template or process document suggests.

The internet already has many good explanations of what a blameless postmortem is. Google's SRE Book wrote the canonical version. John Allspaw at Etsy made it operational. Atlassian, PagerDuty, and incident.io documented the mechanics in detail. We link to all of them below.

Blameless postmortems are not primarily a process innovation. They are an organizational mechanism for making truth easier to surface than fear. Everything else, the templates, the facilitation scripts, the severity tiers, is scaffolding around that one idea.

Most engineering teams have the scaffolding. Very few have the idea. This article is about the difference.

What is a blameless postmortem?

A blameless postmortem is a structured review conducted after an incident or outage. The goal is to understand what happened, why it happened, and how to prevent recurrence, without assigning fault to individuals.

Blameless does not mean consequence-free. It means the investigation focuses on the system that produced the failure rather than the person who was closest to it when it broke. The accountability shifts from punishment to responsibility. Engineers are not off the hook. They are on the hook for helping the organization become safer.

The concept did not originate in software engineering. It came from aviation, healthcare, and nuclear safety, industries where failures kill people. Researchers in all three fields reached the same conclusion independently: when organizations punish individuals for failures in complex systems, they get less information, not more. And with less information, the same failures repeat.

Sidney Dekker, an aviation safety professor who also flew commercially as a Boeing 737 First Officer, captured it precisely:

“Underneath every simple, obvious story about human error, there is a deeper, more complex story about the organization.”

John Allspaw read Dekker. In 2012, while leading technical operations at Etsy (he later became its CTO), he published “Blameless PostMortems and a Just Culture” on Etsy's engineering blog. Google's SRE team formalized the practice. The industry followed.

The misconception worth addressing directly

The most common pushback: blameless postmortems protect bad actors. That without blame, there is no accountability. This gets the logic backwards.

Blame-based cultures produce concealment, not accountability. Engineers learn that admitting involvement in a failure carries professional risk. So they minimize their role and avoid ownership. The organization gets a sanitized story and learns nothing.

You cannot fix what you cannot see. Blame makes failures invisible.

Before we go further: resources on the mechanics

The mechanics of blameless postmortems are already well documented. These are the resources we recommend for that foundation.

The Postmortem Maturity Ladder

Most organizations believe their postmortem practice is more mature than it actually is. The Postmortem Maturity Ladder describes the five levels seen in practice, not as a ranking of organizations, but as a description of where any given postmortem tends to land.

Level 1: Punitive Review

The postmortem exists to assign fault. Engineers attend defensively. The document records who was responsible, not what the system allowed. The same failure mode recurs.

Level 2: Compliance Theater (most common)

The template is followed. Action items are written. But items have no owners, no deadlines, and no follow-up. This is the most common level. Most teams that believe they are doing blameless postmortems are doing this.

Level 3: Honest Diagnosis

The timeline is reconstructed accurately. Contributing factors are named without softening. Some action items get completed. Engineers leave the meeting feeling heard rather than scrutinized.

Level 4: Systemic Learning

The postmortem produces architecture changes, not just behavior changes. It is shared broadly. Action items are funded, owned, and tracked to completion.

Level 5: Organizational Memory

Past postmortems actively inform future system design. New engineers read incident history during onboarding. Learning compounds over time.

Most teams are at Level 2. The best teams operate consistently at Level 3, with occasional Level 4. Level 5 is rare. The gap between where most teams think they are and where they actually are is what this article is really about.

Why blame happens anyway

If blameless postmortems are clearly better, why do so many organizations default to blame? The honest answer is that blame is not irrational. It is a completely natural response to a stressful situation. Understanding why it happens is more useful than pretending it should not.

Hindsight bias

After an incident is resolved, the failure looks obvious. But the engineer who made the triggering decision was operating with incomplete information, time pressure, and assumptions that had worked correctly many times before. What looks like negligence from the outside was reasonable judgment from the inside. Hindsight bias makes ordinary decisions look like obvious errors, and makes punishment feel justified when it rarely is.

The organizational need for a simple story

Major outages cost money. Leadership is under pressure. A person is a simpler answer than a system. "The engineer pushed a bad config" is a clean, containable narrative. The accurate version, a combination of insufficient validation, deployment pressure, and missing monitoring, implies the organization itself has work to do. Blame lets organizations close the story quickly. Systems thinking keeps it open longer but produces better outcomes.

Performance review anxiety

This one rarely gets named directly, but every engineer knows it exists. Even in organizations that profess blameless culture, engineers quietly calculate: will this affect how my manager sees me at the next review cycle? That calculation shapes what people say, what they admit, and how much of the real story they surface.

Executive pressure in the room

When a senior leader walks into a postmortem, the dynamic shifts. Engineers read the room. They soften their accounts. The Google SRE Workbook documents this explicitly: a single leading question, along the lines of "someone must have known this was a problem, why didn't anyone raise it?", is enough to make the meeting defensive in seconds.

The cognitive cost of complexity

Modern incidents are almost never caused by a single thing. Tracing every contributing thread is exhausting. "Someone made a mistake" is a relief. It ends the investigation at a point that feels satisfying rather than at a point that is actually complete. Blameless postmortem facilitation exists specifically to keep the investigation open when it naturally wants to close.

Blame does not survive in engineering organizations because people are cruel. It survives because it is cognitively convenient, organizationally satisfying, and feels like accountability, right up until the same incident happens again six months later.

What experienced practitioners actually learned

The theory fits on a slide. The practice does not. What follows are eight accounts from people who tried to build this for real, what they found, what broke, and what they learned that no guide prepared them for.

1 of 8 · Sidney Dekker · sidneydekker.com

Professor of Safety Science, Griffith University · Boeing 737 First Officer

The idea came from people who could not afford to get it wrong

Before blameless postmortems became an SRE practice, they were an aviation safety principle. Sidney Dekker studied why complex systems fail. To make sure he understood failure from the inside, he flew commercially as a Boeing 737 First Officer for Sterling and Cimber Airlines out of Copenhagen while teaching safety science at Lund University. Not as a hobby, but because he believed you cannot study operational failure from a desk.

His argument, from The Field Guide to Understanding Human Error:

“Underneath every simple, obvious story about human error, there is a deeper, more complex story about the organization.”

Dekker called the dominant approach the “Bad Apple Theory”: find the person, remove them, assume the system is now safe. His research across aviation, healthcare, and nuclear safety showed the same result consistently: when organizations punish individuals for failures in complex systems, people hide what happened, near misses go unreported, and the same failure repeats with a different person next time.

The alternative: instead of asking who made the mistake, ask why the system placed that person in a position where the mistake was the most likely outcome. John Allspaw read Dekker. What followed changed how the software industry handles incidents.

What to do with this today

Look at your last postmortem. If it accepted “human error” as a finding, the investigation stopped too early. The real question starts there: what about the system made that error possible, likely, and uncatchable?

2 of 8 · John Allspaw · linkedin.com/in/jallspaw

Former CTO, Etsy · Founder, Adaptive Capacity Labs

The post that changed the industry

In May 2012, John Allspaw published “Blameless PostMortems and a Just Culture” on Etsy's engineering blog. More than a decade later, it remains one of the most linked-to articles in all of SRE writing.

His contribution was translation: taking safety science and making it operational inside a fast-moving engineering team. He described what happens when organizations default to blame:

“This cycle of name, blame, shame... Engineers become silent on details about actions, situations, observations, resulting in Cover-Your-Ass engineering from fear of punishment. Management becomes less aware and informed on how work is being performed day to day.”

His framework was the “second story.” Every incident has two. The first story is what happened on the surface. The second story is why the decision made sense to the person making it, given what they knew at the time:

“The action made sense to the person at the time they took it, because if it hadn't made sense to them at the time, they wouldn't have taken the action in the first place.”

He was explicit that blamelessness does not mean engineers escape responsibility: “Engineers are not at all off the hook with a blameless PostMortem process. They are very much on the hook for helping Etsy become safer and more resilient, in the end.”

Etsy built an open-source postmortem tracker called Morgue to operationalize the practice, because Allspaw understood that philosophy without tooling stays philosophy.

What to do with this today

In your next postmortem, ask the facilitator to find the second story. Not what happened, but why it made sense to the person involved to do what they did. That is where the real learning begins.

Cover-Your-Ass engineering is not a character flaw. It is a rational response to an environment where honesty is punished.

Adapted from John Allspaw, Etsy

3 of 8 · Dan Milstein · linkedin.com/in/danmilstein

Former Principal Engineer, HubSpot

One sentence that changed 250 meetings

HubSpot discovered the same pattern Allspaw described, from the inside of 250 postmortem sessions. Dan Milstein noticed it early. Someone would walk into the review, take personal ownership, and say: “Look, this was my fault. I'll be more careful next time.” The room would accept it. The meeting would end. Nothing would change.

As Milstein wrote:

“The very short summary of which is: We're going to fix this problem by being less stupid in the future. Which, well, you can guess how that's going to turn out.”

His response was a single sentence he now says at the start of every session:

“We're trying to prepare for a future where we're all just as stupid as we are today.”

That opening does something specific before anyone speaks. It removes the assumption that the fix is behavioral. It makes “be more careful” an unacceptable conclusion before the meeting begins. Engineers stop managing their reputation and start describing what actually happened.

What to do with this today

Before your next postmortem, say this sentence out loud to the room. Then read your last three action items. If any of them require a human to behave differently without changing the system, they are wishes, not fixes.
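Milstein's test can be partly automated. The sketch below is one rough way to flag action items that read as wishes: it checks for a few behavioral phrases and for the missing owners and deadlines that characterize compliance theater. The `ActionItem` fields, the phrase list, and the warning texts are all assumptions made for illustration, not part of any tool described in this article, and a keyword match is only a prompt for human review.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical phrase list: wording that asks a human to behave differently
# instead of describing a change to the system.
BEHAVIORAL_PATTERNS = [
    r"\bbe more careful\b",
    r"\bremember to\b",
    r"\bdouble[- ]check\b",
    r"\bpay (more )?attention\b",
]


@dataclass
class ActionItem:
    text: str
    owner: Optional[str] = None
    due: Optional[str] = None  # e.g. an ISO date string; format is an assumption


def lint_action_item(item):
    """Return a list of warnings; an empty list means nothing was flagged."""
    warnings = []
    lowered = item.text.lower()
    if any(re.search(pattern, lowered) for pattern in BEHAVIORAL_PATTERNS):
        warnings.append("relies on human behavior; name a system change instead")
    if not item.owner:
        warnings.append("no owner")
    if not item.due:
        warnings.append("no deadline")
    return warnings


wish = ActionItem("Engineers should be more careful with configs")
fix = ActionItem("Add pre-load config validation to the deploy pipeline",
                 owner="platform-team", due="2026-06-30")
print(lint_action_item(wish))
print(lint_action_item(fix))
```

An empty result does not prove an item is a real fix. It only means the item cleared the cheapest filter.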

4 of 8 · Atlassian

Enterprise software · Incident management at scale

The engineer who still works there

One engineer. One syntax error in a configuration file. The entire company went dark for 45 minutes. The financial cost ran into hundreds of thousands of dollars. The engineer was not fired. Atlassian ran a blameless postmortem.

The investigation did not stop at “engineer made a syntax error.” It asked why a single syntax error could take down an entire company. The answer: there was no automated check that validated whether a config file would work before it was loaded. Human interaction with that configuration was the fragile point. Nobody had addressed it because nothing had forced the issue until that day.

The fix was an automated “will it start” validation check. Eventually they removed all human interaction with that system's configuration entirely. The engineer still works at Atlassian.
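A “will it start” check can be very small. The sketch below is not Atlassian's implementation, which the source does not describe. It assumes a JSON config file and a hypothetical schema (`service_name`, `listen_port`) purely to show the shape of the idea: parse and validate the file before anything loads it, and refuse to proceed if problems are found.

```python
import json

# Hypothetical schema for illustration; a real service defines its own.
REQUIRED_KEYS = {"service_name", "listen_port"}


def validate_config(raw_text):
    """Return a list of problems. An empty list means the file is safe to load."""
    try:
        config = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        # A syntax error is caught here, before the service ever sees the file.
        return [f"syntax error at line {exc.lineno}: {exc.msg}"]

    if not isinstance(config, dict):
        return ["top-level value must be an object"]

    problems = [f"missing required key: {key}"
                for key in sorted(REQUIRED_KEYS - set(config))]

    if "listen_port" in config:
        port = config["listen_port"]
        is_valid_port = (isinstance(port, int) and not isinstance(port, bool)
                         and 1 <= port <= 65535)
        if not is_valid_port:
            problems.append("listen_port must be an integer between 1 and 65535")
    return problems


good = '{"service_name": "api", "listen_port": 8080}'
broken = '{"service_name": "api", "listen_port": 8080'  # missing closing brace
print(validate_config(good))
print(validate_config(broken))
```

Run as a required step in the deploy pipeline, a check like this turns a single typo from an outage into a failed build.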

A blame-based response would have removed the engineer and left the fragile architecture in place. Their postmortem guide captures the principle:

“Ensure that the postmortem timeline, causal chain, and mitigations are framed in the context of systems, process, and roles, not individuals.”

What to do with this today

After your next postmortem, ask one question: did we change the system, or did we change our expectations of the humans inside it? Only one of those is a real fix.

5 of 8 · Nora Jones

Founder and CEO, Jeli (acquired by PagerDuty) · Sr Director of Product, PagerDuty

Most organizations are doing this wrong

Nora Jones spent years at Netflix, Jet.com, and Slack leading chaos engineering, deliberately breaking production systems to understand how they fail. Along the way, something bothered her more than the incidents themselves. Organizations were running postmortems. Using the templates. Saying the right words. Learning almost nothing.

She tweeted once, on a whim: would anyone be interested in a community for sharing postmortems and incident learnings across organizations? She got 200 DMs overnight. The practitioners running these reviews knew the practice was broken. They just had nowhere to say it. She founded Jeli to fix that. And she named the gap directly in a 2021 conference talk:

“We've all heard about blameless postmortems, but yet we all use it a little bit incorrectly.”

Her diagnosis: organizations adopt the language without building the conditions that make honesty safe. Her contribution is a reframe she calls “blame-aware” culture. Blameless, taken literally, asks people to pretend blame does not exist. Blame-aware acknowledges it as a natural human response, works with it openly, and focuses on moving past it.

She also founded learningfromincidents.io, an open community for sharing postmortems across organizations. Jeli was acquired by PagerDuty in November 2023.

What to do with this today

In your last postmortem, did people say what actually happened, or what was safe to say? If you cannot answer that with confidence, the conditions for honest learning are not yet in place.

6 of 8 · Noel Pullen · Hootsuite

Software Engineer, Hootsuite

Naming is not blaming

On April 6th, 2017, Noel Pullen, an engineer at Hootsuite, accidentally deleted a critical LinkedIn connection on a production social account, taking down users' ability to post to LinkedIn for 39 minutes. The team named their five-whys writeup: “That Time Noel Deleted LinkedIn.”

Two people immediately flagged the contradiction. You named the postmortem after the engineer. How is that blameless? Noel's response, in his own writeup, was precise: naming who was involved is not the same as blaming them.

The investigation did not stop at Noel. It asked why deleting that connection was even possible. The answer: personal social accounts had administrative access to production systems, a structural vulnerability that had existed for years, waiting for exactly this moment.

“Human error is a symptom, never the cause, of trouble deeper within the system.”

What to do with this today

Name who was involved. Name what they did. Then keep asking why until you reach the system condition that made it possible. Stop before you get there and you have only told the first story.

7 of 8 · Google SRE Workbook

Site Reliability Engineering · Postmortem culture at scale

What to say when the senior leader breaks the culture

Google's SRE team encountered the problem at its most difficult level: when blame comes from the top of the room. The SRE Workbook gives you exact words for the moment the most senior person present says something blameful:

“I know we are supposed to be blameless, but this is a safe space. Someone must have known beforehand this was a bad idea, so why didn't you listen to that person?”

Said quietly by the most powerful person in the room, every engineer recalibrates. The Workbook gives the facilitator a word-for-word response:

“Hmmm, I'm sure everyone had the best intent, so to keep it blameless, maybe we ask generically if there were any warning signs we could have heeded, and why we might have dismissed them.”

That redirect does not confront the VP. It does not validate the blame framing. It moves the conversation forward without making the senior leader an obstacle.

What to do with this today

Write your version of the redirect before you need it. The moment a senior leader uses blameful language in a postmortem is not the time to improvise.

8 of 8 · Charity Majors · linkedin.com/in/charity-majors

Co-founder and CTO, Honeycomb

Transparency is not a risk, it is infrastructure

Charity Majors, CTO and co-founder of Honeycomb, is direct: managers who obscure what happened during an outage, whether to protect the team's image or their own, actively destroy the conditions that make learning possible:

“My managers would always be like, no, if we say this, they're going to think that we're stupid. I'm like, that does not earn you trust with engineers. They're going to think less of us if it's a mystery to them.”

At Honeycomb, every outage gets a public postmortem. She went further, importing actual event data from outages into Honeycomb's public demo dataset so that incident learning is publicly interactive, not just internally filed.

Every time an organization communicates openly about what went wrong, the next engineer becomes slightly more willing to surface a near miss or flag a risk before it becomes an incident. Opacity works in the opposite direction.

What to do with this today

How your organization talks publicly about its failures is a signal to every engineer inside it. Polished external communication that contradicts internal reality is one of the fastest ways to erode the culture without anyone naming what happened.

Patterns across the stories

Eight stories. Eight different organizations, roles, and moments. But read together, three patterns emerge so consistently they are worth naming directly.

Pattern 1: The best incident leaders reduce fear before they ask questions

In every story where blameless culture actually worked, the conditions for honesty were built before honesty was requested. Allspaw built a Just Culture at Etsy before asking engineers to give detailed accounts. Jones built learningfromincidents.io as a trusted community before asking practitioners to share publicly. Majors published Honeycomb's own outage data before asking engineers to be transparent. Organizations that skip this step are asking engineers to take a risk the organization has not yet earned.

The first job of anyone running a postmortem is not to gather information. It is to make the room safe enough that the information people share is actually true.

Pattern 2: Most root causes are organizational tradeoffs that were never resolved

The Atlassian config file had no automated validation, a known gap nobody had fixed because nothing had forced the issue. Hootsuite's production systems were accessible through personal social accounts, a structural risk that had existed for years. Milstein's engineers kept saying "I'll be more careful next time" because the system had no mechanism to prevent the same class of error.

A genuinely honest postmortem does not just explain what happened last Tuesday. It exposes what the organization chose not to fix for the past two years.

Pattern 3: The postmortem document is the least important part

Allspaw built Morgue because writing postmortems without tracking them produced nothing. Milstein found vague action items were indistinguishable from no action items. Jones watched organizations file technically correct postmortems and learn nothing from them. The postmortem document is evidence that a conversation happened. The conversation is what matters.

Most teams measure postmortem culture by whether postmortems get written. The better measure is whether the same contributing factors appear in the next incident.
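That measure is easy to approximate if contributing factors are recorded with consistent labels. The following is a minimal sketch under that assumption: the incident IDs and factor names are made up, and real postmortem data would need normalization before a count like this means anything.

```python
from collections import Counter


def recurring_factors(incidents, min_occurrences=2):
    """Map each contributing factor to the number of incidents it appeared in,
    keeping only factors seen in at least `min_occurrences` incidents."""
    counts = Counter()
    for factors in incidents.values():
        # set() so a factor listed twice in one postmortem counts once.
        counts.update(set(factors))
    return {factor: n for factor, n in counts.items() if n >= min_occurrences}


# Hypothetical data: incident ID -> contributing factors named in its postmortem.
incidents = {
    "INC-1": ["no config validation", "missing alert"],
    "INC-2": ["missing alert"],
    "INC-3": ["missing alert", "no config validation", "no config validation"],
    "INC-4": ["expired certificate"],
}
print(recurring_factors(incidents))
```

Any factor that shows up here was identified by an earlier postmortem and was still present for the next incident, which is the signal worth tracking.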

The consequences of honesty must feel better than the consequences of silence. Everything else follows from that.

Arrived at independently by every practitioner above

Where blameless culture actually fails

Every organization in the previous section believed in blameless postmortems. Several still got it wrong at critical moments. Blameless culture is not a policy you install. It is a set of conditions you maintain, and those conditions are hardest to maintain precisely when they matter most.

Postmortem theater

The most common failure mode, and the hardest to detect because it looks exactly like success. The postmortem gets scheduled. The template gets filled in. Action items are written. Everyone leaves feeling like something was accomplished. Six months later, the same class of failure recurs with a different name.

Check what percentage of action items from your last five postmortems were completed: not merely started, but completed, verified, and closed. A postmortem that changes no operational behavior is a blameless filing exercise.
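The check is simple arithmetic once action items carry a status. Here is a rough sketch, assuming a hypothetical status vocabulary (`open`, `in_progress`, `done`, `verified`) in which only `verified` counts as closed. Your tracker's statuses will differ.

```python
from dataclasses import dataclass


@dataclass
class TrackedItem:
    title: str
    status: str  # hypothetical values: "open", "in_progress", "done", "verified"


def completion_rate(items, closed_statuses=("verified",)):
    """Fraction of action items whose status counts as truly closed."""
    if not items:
        return 0.0
    closed = sum(1 for item in items if item.status in closed_statuses)
    return closed / len(items)


# Hypothetical action items gathered from recent postmortems.
items = [
    TrackedItem("Add pre-load config validation", "verified"),
    TrackedItem("Alert on replication lag", "in_progress"),
    TrackedItem("Document rollback procedure", "done"),
    TrackedItem("Remove manual config edits", "open"),
]
print(f"{completion_rate(items):.0%} of action items verified and closed")
```

Counting `done` but unverified items as closed would double the figure in this example, which is exactly the kind of softening that lets postmortem theater look like progress.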

Blameless as shield

Some organizations use blamelessness to avoid accountability entirely. When every failure is attributed to "the system" and no individual owns an outcome, action items drift and patterns repeat.

A blameless postmortem should make it easier to identify what needs to change and who is responsible for changing it, not harder.

Performative transparency

A major incident occurs. The public postmortem is carefully written. Contributing factors are described at a level of abstraction that protects reputation without exposing what actually happened. Internally, engineers read it and recognize the gap. The Atlassian 13-day cloud outage in 2022 illustrates this. Their public review described "a communication gap" as a contributing cause, a framing that edges back toward human error under the pressure of 800,000 affected users.

Performative transparency is hardest to avoid when the stakes are highest. That is the moment the culture gets tested, not when incidents are small.

The senior leader exception

Every organization has a version of VP Ash, the senior leader in the Google SRE Workbook example who says "I know we are supposed to be blameless, but..." Under pressure, leaders revert to patterns that feel like accountability because they are simpler than confronting systemic complexity. Engineers are watching. Always.

The Google SRE Workbook is explicit: blameful language from senior leadership is the single most common way postmortem culture erodes. The fix is not a training program. It is leaders who hold the line on their own behavior, in the room, when the pressure is highest.

Why this matters more now

Everything in this article has been true for the past decade. So why does it matter more in 2026 than it did in 2016? Because the systems engineers are responsible for have fundamentally changed, and the old failure modes have not gone away. They have scaled.

Systems are more complex, failures are less traceable

Modern systems look different: hundreds of microservices, distributed architectures spanning multiple cloud providers, CI/CD pipelines deploying dozens of times a day, AI-generated code that passes review and fails in production because of interactions nobody anticipated. Blame-based investigation fails not just culturally but technically in this environment. See our guide on observability trends shaping engineering teams in 2026 for how teams are building the infrastructure to surface second stories faster.

Deploy velocity increases the surface area for incidents

In high-velocity environments, blame-based cultures become operationally dysfunctional. Engineers hedge their deploys. They avoid taking ownership of services. They wait for others to push first. The organization's ability to ship slows down, not because of technical constraints, but because cultural incentives around incidents create friction. Blameless culture is not just ethically better in this environment. It is operationally necessary.

AI-generated code is blurring ownership

When code is generated by an AI tool, reviewed quickly, and shipped, the ownership chain is murkier. Who is responsible for the failure? The engineer who accepted the suggestion? The team that set the review standards? The organization that chose the tool? The answer is almost always: all of them, plus the system conditions that made it possible for that code to reach production without catching the failure mode.

The human cost has not gone away

Engineer burnout is not primarily a workload problem. It is often a safety problem. Engineers who feel that every incident is a potential career event become risk-averse, defensive, and eventually disengaged. For teams thinking about how on-call culture connects to incident learning, our on-call playbook for 2026 covers the intersection in detail.

The organizations that learn faster than their failures evolve will outperform the ones that do not. Blameless culture, done properly, is one of the primary mechanisms by which that faster learning happens.

The only question that matters

Incidents are inevitable in complex systems. The question was never whether failures would happen. It was always what the organization would do when they did.

The practitioners in this article, Dekker, Allspaw, Milstein, Jones, and the teams at Atlassian, Hootsuite, Google, and Honeycomb, arrived at the same answer from different directions and different decades. Not through idealism, but through the repeated, expensive experience of watching blame-based investigation produce worse outcomes than honest inquiry.

Make truth easier to surface than fear. Everything else follows.

The conclusion reached independently by every practitioner in this article

Not because it is kind. Because it is the only way complex systems actually get safer. Because engineers managing their exposure during a postmortem are not investigating the incident. Because the second story, the one that explains why the failure was possible, only gets told when the people who lived through the incident believe it is safe to tell it.

The organizations that do this well are not the ones where incidents never happen. They are the ones where, when an incident does happen, the full story gets told, the real contributing factors get named, and the system that produced the failure gets changed rather than the expectations placed on the humans inside it.

That is not a process. It is a decision, made by leaders, repeatedly, especially when it is hardest, about what kind of organization they are building.

If you are choosing tooling to support that process, read our guide to incident response platforms for DevOps teams in 2026.

Frequently Asked Questions

A regular postmortem often focuses on identifying who made a mistake and preventing that person from repeating it. A blameless postmortem shifts the focus to the system conditions that made the mistake possible in the first place. The difference is not just cultural; it is investigative. Blameless postmortems produce more complete information because engineers are not managing their exposure while describing what happened. Regular postmortems frequently produce sanitized accounts that protect individuals but teach the organization nothing useful.

No, and this is the most common misconception. Blameless postmortems do not eliminate accountability. They redirect it. Instead of holding someone accountable for the failure, the organization holds people accountable for the fix. As John Allspaw put it: engineers are very much on the hook for helping the organization become safer. The question blameless culture asks is not “who do we punish?” but “who owns making sure this cannot happen again?”

This is the hardest practical challenge, and the Google SRE Workbook addresses it directly. When a senior leader uses blameful language in a postmortem, the facilitator's job is to redirect without confrontation: “Hmmm, I'm sure everyone had the best intent, so to keep it blameless, maybe we ask generically if there were any warning signs we could have heeded, and why we might have dismissed them.” The goal is to acknowledge the underlying concern while keeping the investigation focused on systemic conditions.

At minimum: an accurate timeline reconstructed from system data rather than memory, the contributing factors that allowed the failure to cascade, a clear distinction between proximate causes and underlying systemic conditions, action items with specific owners and deadlines, and a section on what went well. The quality test for action items: does completing this item change the system, or does it just change what we expect from the humans inside it? Only the former counts as a real fix.

The most reliable signal is not whether postmortems are being written. It is whether the same contributing factors keep appearing in new incidents. If your last five major incidents share a common systemic condition that earlier postmortems identified but nobody fixed, the postmortem culture is not working. A second signal is whether engineers volunteer information in postmortems or have to be drawn out. When people share the full story without prompting, the culture is working.

Treating the document as the deliverable. Most teams declare success when the postmortem is written, formatted, and filed. But a well-written postmortem that produces no architectural changes, no completed action items, and no organizational learning is just postmortem theater. The document is evidence that a conversation happened. The measure of success is what changes in the system after that conversation ends.
