Side-by-side edition · 18 observations · Cook 1998 / 2000 → SRE 2026


How Complex Systems Fail:
An SRE Perspective.

Richard Cook wrote 18 observations about failure in emergency medicine, aviation, and nuclear power. Read his original framing on the left, the SRE translation on the right. How closely they line up may surprise you.

Cook · 1998
SRE · 2026

Cook · TL;DR

Failure in complex systems is the natural result of the system's complexity. Defenses contain hazard but never eliminate it. Catastrophe is a combination, not a cause.

SRE · TL;DR

Production software is a complex system in Cook's sense. The 18 points describe outages, post-mortems, and on-call life with uncomfortable accuracy.

01

Complex systems are intrinsically hazardous

Hazard / Latent

Cook · 1998

The frequency of hazard exposure can sometimes be changed but the processes involved in the system are themselves intrinsically and irreducibly hazardous. The presence of these hazards drives the creation of defenses against them.

↓ TRANSLATED

SRE · 2026

Every production system ships with latent hazard: data loss, cascading failure, credential leak, silent corruption. You don't add the hazard by deploying; it was there the moment you chose distributed state.

SRE work is not the elimination of hazard, it is the continuous containment of it.

02

Complex systems are heavily and successfully defended against failure

Defense / Layered

Cook · 1998

The high consequences of failure lead over time to the construction of multiple layers of defense against failure. These defenses include obvious technical components and human components, and also organizational and regulatory ones. Their effect is to provide a series of shields that normally divert operations away from accidents.

↓ TRANSLATED

SRE · 2026

The reason your platform doesn't melt down daily is that it is layered with defenses: retries, circuit breakers, quotas, health checks, canaries, multi-AZ, multi-region, human review, change freezes, alert routing. Most of these defenses are invisible when they work. You only notice them when one is missing.
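
A minimal sketch of how two of those layers compose in code, retries nested inside a circuit breaker. The class, thresholds, and the fetch_upstream call are illustrative, not any particular library's API:

```python
import random
import time

class CircuitBreaker:
    """Stops calling a failing dependency after too many consecutive errors."""

    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

def retry(fn, attempts=3, base_delay=0.2):
    """Retries transient failures with jittered exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# One defense nests inside another: the breaker wraps the retried call, so a
# dependency that is truly down trips the breaker instead of costing every
# request three timeouts forever.
breaker = CircuitBreaker()
# result = breaker.call(retry, lambda: fetch_upstream())  # fetch_upstream is hypothetical
```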

03

Catastrophe requires multiple failures

Swiss-cheese / Compound

Cook · 1998

Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient. There are many more such failure opportunities than overt system accidents.

↓ TRANSLATED

SRE · 2026

Nobody gets paged because of one bad commit. They get paged because a bad commit merged on a Friday during a deploy freeze exception, shipped through a canary with a broken metric, into a region whose autoscaler was already saturated, while the on-call was at dinner with their phone on silent.

Single points of failure exist (an expired TLS cert, a corrupted config), but the catastrophe still required the surrounding process defenses (monitoring, runbooks, ownership) to fail too. The single point is necessary; it is rarely sufficient.

04

Complex systems contain changing mixtures of failures latent within them

Latent / Drift

Cook · 1998

The complexity of these systems makes it impossible for them to run without multiple flaws being present. Eradication of all latent failures is limited by economic cost and because it is difficult to see how such failures might contribute to an accident before the fact.

↓ TRANSLATED

SRE · 2026

Your system right now contains dozens of bugs, misconfigurations, expired-but-cached credentials, drift between IaC and reality, and dependencies pinned to versions that have since been yanked. You don't know which ones. You will learn about some of them during the next incident.
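
One way to surface a slice of that latent inventory before an incident does it for you is a periodic audit job. A minimal sketch that checks TLS certificate expiry; the endpoint list and 30-day threshold are placeholders:

```python
import socket
import ssl
import time

def days_until_cert_expiry(host, port=443):
    """Days remaining on the TLS certificate served by host:port."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expires - time.time()) // 86400)

# Hypothetical endpoint list; in practice this would come from service discovery.
for host in ["api.example.com", "internal-gateway.example.com"]:
    remaining = days_until_cert_expiry(host)
    if remaining < 30:
        print(f"LATENT HAZARD: {host} cert expires in {remaining} days")
```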

05

Complex systems run in degraded mode

Degraded / Always

Cook · 1998

A corollary to the preceding point is that complex systems run as broken systems. The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.

↓ TRANSLATED

SRE · 2026

The production system you imagine in your head, where every pod is healthy, every queue is drained, every replica is in sync, does not exist and never has. Real systems always have something broken.

The art is running usefully despite constant partial failure.

06

Catastrophe is always just around the corner

Tail-risk

Cook · 1998

Complex systems possess potential for catastrophic failure. Human practitioners are nearly always in close physical and temporal proximity to these potential failures; disaster can occur at almost any time and in nearly any place. The potential for catastrophic outcome is a hallmark of complex systems. It is impossible to eliminate the potential for such catastrophic failure.

↓ TRANSLATED

SRE · 2026

The same system that served a billion requests today can serve zero tomorrow. Nothing about yesterday's uptime grants you tomorrow's. This is why leaders who say "we haven't had an outage in 90 days, we must be doing something right" terrify experienced SREs.

07

Post-accident attribution to a 'root cause' is fundamentally wrong

Post-mortem / Causality

Cook · 1998

Because overt failure requires multiple faults, there is no isolated cause of an accident. There are multiple contributors to accidents. Each is necessary but only jointly sufficient. The label 'root cause' reflects the social, cultural need to blame specific, localized forces or events for outcomes.

↓ TRANSLATED

SRE · 2026

There is no root cause. There is a tree of contributing factors, and the node you pick to call "root" says more about your organizational incentives than about the system. Good post-mortems resist that choice for as long as possible.

Take the 2017 S3 us-east-1 outage. The "root cause" was an engineer typing the wrong argument. But equally true: no confirmation prompt, blast radius unbounded, subsystem hadn't been restarted in years, dependent services had no fallback. Pick all of them and you change how the system is built.
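
If the post-mortem artifact is supposed to resist the single-root framing, let the data model resist it too. A minimal sketch, using illustrative factor names modeled on the outage above, where every factor carries its own remediation:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Factor:
    """One contributing factor; children are the conditions that let it matter."""
    description: str
    remediation: str = ""
    children: list[Factor] = field(default_factory=list)

outage = Factor(
    "Large-scale customer impact",
    children=[
        Factor("Operator removed more capacity than intended",
               remediation="Add confirmation and bound the argument"),
        Factor("Tooling allowed an unbounded blast radius",
               remediation="Enforce a maximum removal per invocation"),
        Factor("Subsystem had not been restarted in years",
               remediation="Exercise cold-start paths regularly"),
        Factor("Dependent services had no fallback",
               remediation="Degrade gracefully when the dependency is gone"),
    ],
)

def remediations(node: Factor):
    """Every factor that carries a fix becomes an action item, not just the 'root'."""
    if node.remediation:
        yield node.remediation
    for child in node.children:
        yield from remediations(child)

for item in remediations(outage):
    print("-", item)
```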

08

Hindsight biases post-accident assessments

Cognition / Bias

Cook · 1998

Knowledge of the outcome makes it seem that events leading to the outcome should have appeared more salient to practitioners at the time than was actually the case. After an accident, practitioners' actions are judged with knowledge of the eventual outcome, which biases assessment of those actions toward the negative.

↓ TRANSLATED

SRE · 2026

After the incident, the signal looks obvious. "CPU was climbing for 20 minutes, why didn't anyone see it?" Because at the time, CPU was climbing on 400 other dashboards too, and 399 of them resolved themselves. Hindsight collapses a fan of possibilities into a single narrative arc. Evaluate decisions by what was knowable at the time, not what the timeline reveals later.

09

Operators have dual roles: producers and defenders

Org / Tension

Cook · 1998

Practitioners and the organizations of which they are part must simultaneously deal with production demands and the possibility of failure. The need to balance these dual roles is a source of perpetual conflict for practitioners and managers.

↓ TRANSLATED

SRE · 2026

Cook's framing doesn't map cleanly onto a modern SRE org chart. You'll find pure producers (product engineers), pure defenders (platform teams), and hybrids. The dual role isn't a property of every individual, it's a property of the system as a whole.

The faster the org ships, the more defense it owes. Organizations that pretend this tradeoff doesn't exist pay for it in burnout or outages, usually both.

10

All practitioner actions are gambles

Risk / Action

Cook · 1998

After accidents, the gamble of practitioner action looks ill-considered, almost reckless. But all practitioner action is gambling, that is, it is action that takes place in the face of uncertain outcomes. The degree of uncertainty may change from moment to moment.

↓ TRANSLATED

SRE · 2026

Every deploy, every rollback, every kubectl delete pod, every "let me try restarting it" is a bet against a system you cannot fully observe. Most bets pay off. The ones that don't become incidents. The goal is not to stop gambling, it is to make the gambles smaller and more reversible.
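
One concrete way to shrink a gamble is to make the rollback a config flip instead of a deploy. A minimal sketch, assuming a hypothetical ENABLE_NEW_BATCH_WRITER flag; the writer functions are stubs:

```python
import os

def flag_enabled(name: str) -> bool:
    """Hypothetical flag lookup; in practice this would read a flag service,
    so rolling back means flipping a value, not shipping a deploy."""
    return os.environ.get(name, "false").lower() == "true"

def legacy_writer(records):
    return len(records)  # known-good path (stub for illustration)

def new_batch_writer(records):
    return len(records)  # the gamble: new, less-observed path (stub for illustration)

def write_records(records):
    # The bet is scoped to one code path and reversible in seconds.
    if flag_enabled("ENABLE_NEW_BATCH_WRITER"):
        return new_batch_writer(records)
    return legacy_writer(records)
```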

11

Actions at the sharp end resolve all ambiguity

On-call / Decision

Cook · 1998

Organizations are ambiguous, often intentionally, about the relationship between production targets, efficient use of resources, economy and costs of operations, and acceptable risks of low and high consequence accidents. All ambiguity is resolved by actions of practitioners at the sharp end of the system.

↓ TRANSLATED

SRE · 2026

Dashboards, runbooks, and alerts give you signals, not answers. At 3 AM, when 429s are spiking in eu-west-1 only for authenticated traffic, no document tells you whether to roll back, drain the region, or wait it out. The on-call engineer is the one who decides, and that decision is what actually changes the system. All the prep work is judged by whether it sharpens that decision or muddies it.

12

Human practitioners are the adaptable element

Human-in-loop

Cook · 1998

Practitioners and first-line management actively adapt the system to maximize production and minimize accidents. These adaptations often occur on a moment-by-moment basis. Some of these adaptations include: restructuring the system, concentrating critical resources, retraining and recombining workers.

↓ TRANSLATED

SRE · 2026

When the incident is novel, tooling cannot save you. A human decides to failover, to drain a node, to call the vendor, to accept the data loss and restore from backup. Automation handles the known; humans handle the unknown.

This is why the right framing for AI in SRE is assistive, not autonomous. AI that hands an on-call engineer better context faster makes the human decisive. AI that tries to replace the human decision makes the human unaccountable.

13

Human expertise is constantly changing

Tribal-knowledge

Cook · 1998

Complex systems require substantial human expertise in their operation and management. This expertise changes in character as technology changes but it also changes because of the need to replace experts who leave. Training and refinement of skill and expertise is one part of the function of every complex system.

↓ TRANSLATED

SRE · 2026

The expert on your payments pipeline last year is not the expert this year, because the pipeline has changed and so has the expert. Expertise is perishable. Tribal knowledge is the most expensive form of knowledge: it leaves with the person. Documented, searchable, queryable system knowledge is the antidote.

14

Change introduces new forms of failure

Change / Surface

Cook · 1998

The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes may actually create opportunities for new, low frequency but high consequence failures.

↓ TRANSLATED

SRE · 2026

Every deploy, every migration, every provider switch, every version bump opens a new failure surface. The stability you feel right before a major change is partly an illusion: you have simply not yet discovered the failure modes the change will introduce. Plan rollbacks accordingly.
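
A minimal sketch of what "plan rollbacks accordingly" can look like in a deploy script: record the version you are replacing, probe health after the rollout, and revert automatically if the probe fails. The deployctl command and health URL are stand-ins, not a real CLI:

```python
import subprocess
import time
import urllib.request

def healthy(url="https://service.example.com/healthz", checks=5, interval=10):
    """Hypothetical health probe; any failed check fails the rollout."""
    for _ in range(checks):
        try:
            with urllib.request.urlopen(url, timeout=3) as resp:
                if resp.status != 200:
                    return False
        except OSError:
            return False
        time.sleep(interval)
    return True

def deploy(new_version, current_version):
    # 'deployctl' stands in for whatever actually ships your artifact.
    subprocess.run(["deployctl", "rollout", new_version], check=True)
    if not healthy():
        # The new failure surface showed itself; go back to the version we recorded.
        subprocess.run(["deployctl", "rollout", current_version], check=True)
        raise RuntimeError(f"{new_version} failed post-deploy checks, rolled back")
```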

15

Views of 'cause' limit the effectiveness of defenses

Causality / Defense

Cook · 1998

Most of the proposed countermeasures to failure are predicated on a particular notion of cause. Locating 'the cause' of an accident in a particular component restricts the kinds of countermeasures that can be considered for application to the component itself.

↓ TRANSLATED

SRE · 2026

If your post-mortem concludes "engineer pushed bad YAML," your remediation is "more review." If it concludes "our validation pipeline didn't catch an invalid resource spec," your remediation is "better validation." Same incident, very different defenses. Where you locate the cause determines what defenses you build.
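
A minimal sketch of the second framing, assuming a PyYAML-based check that runs in CI before anything ships; the required field paths are illustrative:

```python
import sys
import yaml  # PyYAML

# Illustrative contract: every manifest must provide these field paths.
REQUIRED = [
    ("metadata", "name"),
    ("spec", "replicas"),
    ("spec", "template", "spec", "containers"),
]

def missing_fields(doc):
    """Returns the required paths that a parsed manifest does not provide."""
    missing = []
    for path in REQUIRED:
        node = doc
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing

def main(paths):
    failed = False
    for p in paths:
        with open(p) as f:
            for doc in yaml.safe_load_all(f):
                for missing in missing_fields(doc or {}):
                    print(f"{p}: missing {missing}")
                    failed = True
    sys.exit(1 if failed else 0)

if __name__ == "__main__":
    main(sys.argv[1:])
```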

16

Safety is a property of systems, not components

Emergence

Cook · 1998

Safety is an emergent property of systems; it does not reside in a person, device, or department of an organization or system. Safety cannot be purchased or manufactured; it is not a feature that is separate from the other components of the system.

↓ TRANSLATED

SRE · 2026

You cannot buy safety by buying safe components. A 99.99% service composed of 99.99% components is not 99.99% reliable; the composition matters. Reliability is an emergent property of the whole, which is why "we use AWS, so we're fine" is not a reliability strategy.

Buying safer components is necessary, just not sufficient. Good components don't compose into a safe system on their own. You still have to design for how they fail together.
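
The arithmetic behind that claim, as a quick sketch: every component a request path depends on in series multiplies in, so the composite is always worse than its weakest part.

```python
def serial_availability(*components):
    """Availability of a request path that needs every component to succeed."""
    result = 1.0
    for a in components:
        result *= a
    return result

# Ten 99.99% dependencies in the critical path:
path = serial_availability(*[0.9999] * 10)
print(f"{path:.6f}")                                   # ~0.999000, i.e. about 99.90%
print(f"{(1 - path) * 8760:.1f} hours of expected unavailability per year")  # ~8.8
```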

17

People continuously create safety

Invisible-labor

Cook · 1998

Failure free operations are the result of activities of people who work to keep the system within the boundaries of tolerable performance. These activities are, for the most part, part of normal operations and superficially straightforward. But because system operations are never trouble free, human practitioner adaptations to changing conditions actually create safety from moment to moment.

↓ TRANSLATED

SRE · 2026

The reason your system is up right now is that, at this moment, dozens of people are quietly choosing not to do risky things, catching near-misses in review, noticing a weird metric, and fixing a fragile deploy. This invisible labor is the bulk of reliability work, and it never shows up in any dashboard.

The loud hero breaks something on Friday and gets a Slack post. The quiet hero prevented the break on Tuesday by rewriting a fragile script and gets nothing. Most orgs reward the first and depend on the second.

18

Failure-free operations require experience with failure

Practice / Chaos

Cook · 1998

Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure. More robust system performance is likely to arise in systems where operators can discern the "edge of the envelope." It also depends on calibration of how their actions move system performance towards or away from the edge.

↓ TRANSLATED

SRE · 2026

Teams that never fail never learn. Game days, chaos engineering, and controlled failure injection generate the experience that pure uptime denies you. A team that has rehearsed its failover will execute it. A team that has only read the runbook will not.

You don't need real failures, you need exposure to failure. A team that hasn't had a real incident in two years can mistake that for resilience. The first real one will be larger and less informed precisely because nobody has practiced.
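
What controlled failure injection can look like at its smallest: a wrapper you own around a dependency call that fails a small fraction of requests on purpose, only when the experiment is explicitly switched on. The environment variable, rate, and payments_client call are illustrative:

```python
import os
import random

class DependencyUnavailable(Exception):
    pass

def call_with_fault_injection(fn, *args, failure_rate=0.05, **kwargs):
    """During a game day, a small fraction of calls fail on purpose so the team
    exercises timeouts, fallbacks, and paging end to end."""
    # Inject only where the experiment is explicitly switched on.
    if os.environ.get("CHAOS_ENABLED") == "true" and random.random() < failure_rate:
        raise DependencyUnavailable("injected failure (game day)")
    return fn(*args, **kwargs)

# Usage: wrap the real call site and watch whether retries, fallbacks, and
# alerts behave the way the runbook claims they do.
# result = call_with_fault_injection(payments_client.charge, order)  # hypothetical client
```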

// the spec

An AI system useful during incidents has to reason across contributing factors, respect the on-call as the decider, stay accurate in degraded mode, and carry knowledge across engineers who come and go.

That's the design center for Sherlocks.ai. Cook's 18 points are the spec.

$start with point #7. the rest follows._

Questions

The left column paraphrases Cook's original explanation of each observation, condensed for readability and faithful to his framing in medicine, aviation, and high-consequence systems. For the source text, see how.complexsystems.fail.

Cook's 18 points translate almost verbatim to running production software. Reading them side by side makes the parallel concrete: distributed systems are not a new domain of failure, they are the same domain with different vocabulary.
