TL;DR

The four golden signals (latency, traffic, errors, and saturation) are a monitoring framework from Google's Site Reliability Engineering book that defines the minimum viable view of user-facing service health. If all four are within acceptable bounds, your users are almost certainly fine. If any one breaches an SLO threshold, an incident is either underway or imminent. The most common failure mode is not missing the signals but alerting on the wrong representation of each one: average latency instead of p99, error count instead of error rate, current saturation instead of rate of change. This article covers what each signal measures, how to alert on it correctly, and the gap golden signals do not close: once an alert fires, you still have to figure out why.

Why four signals, not forty?

The problem with modern monitoring is not a shortage of data. A typical production environment generates thousands of metrics and enough time-series data to keep an engineering team busy for weeks without surfacing a root cause.

The four golden signals are a filter, not a replacement for that data. They were codified in the monitoring chapter of Google's Site Reliability Engineering book, edited by Niall Richard Murphy, Betsy Beyer, Chris Jones, and Jennifer Petoff. The framework establishes a simple operating rule: if you can only measure four things about a user-facing service, these are the four.

Most monitoring strategies go in the wrong direction. Teams instrument everything they can measure and decide what to alert on later. The result is alert fatigue: engineers numb to pages, on-call rotations burning out, and real incidents buried under noise. The golden signals framework inverts this. Start with what directly represents user experience. Alert on those first.

Google's SRE Book draws a useful distinction between black-box and white-box monitoring. Black-box monitoring tests externally visible behavior: what users actually experience. Golden signals are a black-box framework. As the book notes, black-box monitoring forces discipline to only page a human when a problem is both already ongoing and user-visible. That discipline is what makes the golden signals actionable.

This also explains the relationship between golden signals and the frameworks alongside them. USE (Utilization, Saturation, Errors), developed by Brendan Gregg, is an infrastructure-level framework for diagnosing performance problems at the OS and hardware layer. RED (Rate, Errors, Duration) is a per-service framework for microservice request flows. All three are complementary: they answer different questions at different layers. Golden signals answer the most important one first: are users being affected? For a deeper look at the data types these signals are drawn from, the Four Pillars of Telemetry framework covers how each data type feeds the detection layer.

What are the four golden signals?

The four golden signals are latency, traffic, errors, and saturation. Each maps directly to a dimension of user experience:

•Latency measures how long requests take to complete
•Traffic measures how much demand the system is handling
•Errors measures how many requests are failing
•Saturation measures how close the system is to its capacity limits

Together they form the User Impact Lens: a four-point view of service health from a user's perspective, independent of how many internal metrics are green or red at any moment.

Any meaningful incident will show up in at least one of these four signals before it shows up anywhere else. An outage shows in errors. A degraded service shows in latency. A traffic spike shows before it causes a cascade. Capacity exhaustion shows in saturation before it causes a hard failure.

Latency: what it measures and where teams get it wrong

Latency is the time it takes your service to respond to a request.

It should be measured as a percentile distribution, not an average. The p99 latency is the value that 99% of requests complete within; the slowest 1% of requests take at least that long. The average blends the fast majority with the slow tail and describes neither. A service where 95% of requests complete in 50ms and 5% take 10 seconds reports an average of roughly 550ms, a figure no request actually experiences, while one request in twenty waits 10 seconds. FireHydrant's SRE monitoring guide puts it directly: average latency systematically obscures the users most likely to churn.

Alert on p99 and p95 separately with different thresholds. p99 catches the worst tail. p95 catches broader degradation earlier. Also track latency for successful requests and failed requests separately: failures often return quickly, so blending them in makes latency look better than what successful requests actually experience.
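To make the arithmetic concrete, here is a minimal Python sketch of the 95%/5% service described above. The request count of 1,000 and the nearest-rank percentile helper are assumptions for illustration; this is not how Prometheus computes quantiles.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Assumed workload: 1,000 requests, 95% at 50 ms and 5% at 10 s.
latencies = [0.050] * 950 + [10.0] * 50

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

print(f"mean={mean:.4f}s p50={p50:.3f}s p99={p99:.1f}s")
```

The mean lands near 0.55 seconds, the median at 50ms, and p99 at 10 seconds. Only the percentile view shows that some requests are waiting 10 seconds.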

Bad alert

avg(request_duration_seconds) > 0.5

Good alert (Prometheus)

histogram_quantile(0.99, sum by (le) (rate(request_duration_seconds_bucket[5m]))) > 0.200

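The sum by (le) wrapper matters in multi-pod environments. Without it, histogram_quantile returns a separate quantile for every pod (and every other label combination) rather than one service-wide value, and per-pod quantiles cannot be averaged back into a correct overall figure. Summing the bucket counters by le first merges the histograms, so the quantile is computed over all requests. The Prometheus histogram_quantile documentation covers how bucket boundaries affect quantile accuracy.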
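The effect of merging buckets before estimating can be sketched in Python. This is a simplified illustration of linear interpolation inside cumulative buckets, not Prometheus's actual implementation; the function name, bucket bounds, and per-pod counts are all hypothetical.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets {upper_bound: count}; math.inf is the last bound."""
    total = buckets[math.inf]
    if total == 0 or not 0 < q <= 1:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound in sorted(buckets):
        count = buckets[bound]
        if count >= rank:
            if bound == math.inf:
                return prev_bound  # no upper edge to interpolate towards
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan

# Hypothetical cumulative bucket counts for two pods (bounds in seconds).
pod_a = {0.1: 900, 0.25: 980, 0.5: 995, math.inf: 1000}
pod_b = {0.1: 400, 0.25: 700, 0.5: 950, math.inf: 1000}

# Equivalent of "sum by (le)": add counts bucket by bucket, then estimate once.
merged = {le: pod_a[le] + pod_b[le] for le in pod_a}

service_p95 = estimate_quantile(0.95, merged)
averaged_p95 = (estimate_quantile(0.95, pod_a) + estimate_quantile(0.95, pod_b)) / 2
print(f"merged p95={service_p95:.3f}s, average of per-pod p95={averaged_p95:.3f}s")
```

With these numbers the merged estimate is about 0.458 seconds, while averaging the two per-pod estimates gives about 0.347 seconds. The estimate can also only fall somewhere inside the 0.25 to 0.5 second bucket, which is why bucket boundaries near your SLO threshold matter.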

Latency SLO example

99% of successful requests complete in under 200ms over a rolling 30-day window.

Traffic: what it measures and where teams get it wrong

Traffic measures the demand placed on your system: requests per second for a web service, transactions per second for a database, messages per unit of time for a streaming pipeline.

Traffic serves two functions. First, it establishes your baseline: you cannot know whether a number is alarming without knowing what normal looks like. Second, traffic drops are often the first indicator of an upstream failure. When something upstream breaks and stops sending requests, your application shows no errors and normal latency. The only visible signal is a traffic drop. This is one of the most underappreciated failure modes in distributed systems.

The most common alerting mistake: Not separating success traffic from error traffic. When a service starts failing, successful request rates fall while error rates rise. If your metric aggregates both, the total count looks stable while the service is effectively down for many users.

Bad alert

rate(http_requests_total[1m]) < 10

Good alert (Prometheus)

rate(http_requests_total{status=~"2.."}[5m])
  < 0.5 * avg_over_time(rate(http_requests_total{status=~"2.."}[5m])[1h:5m])

This fires when success traffic falls below 50% of its 1-hour rolling average, catching upstream failures that manifest as traffic loss rather than error spikes.
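The same comparison, and the reason for filtering on success traffic, can be sketched in Python. All rates below are hypothetical, and the helper is an illustration rather than part of any monitoring library.

```python
def below_baseline(current, history, fraction=0.5):
    """True when the current rate is under `fraction` of the mean of `history`."""
    baseline = sum(history) / len(history)
    return current < fraction * baseline

# Hypothetical 5-minute success rates (req/s) over the previous hour.
success_history = [500, 490, 510, 505, 495, 500, 498, 502, 507, 493, 500, 500]
# Total rate in the same windows: successes plus a steady 5 errors/s.
total_history = [s + 5 for s in success_history]

# Now: most requests fail, but the total request rate is unchanged.
current_success, current_errors = 200, 305
current_total = current_success + current_errors

total_alert = below_baseline(current_total, total_history)
success_alert = below_baseline(current_success, success_history)
print(f"alert on total: {total_alert}, alert on successes: {success_alert}")
```

The check on total traffic stays quiet because 505 requests per second looks normal. The check on successful traffic fires because 200 is under half of the 500 baseline.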

Note on labels

Label names for HTTP status codes vary by exporter and client library: status, code, and status_code are all in common use. Check your exporter's conventions and substitute accordingly.

Errors: what it measures and where teams get it wrong

Errors measure the rate of requests that fail. Google's SRE Book defines three categories:

•Explicit errors: HTTP 500s, connection timeouts, unhandled exceptions
•Implicit errors: HTTP 200s returning incorrect or incomplete content
•Policy errors: Requests that succeed technically but violate an SLO. A request that completes in 8 seconds when your latency SLO is 500ms is a policy error

Most teams monitor explicit errors only. Almost none have policy errors wired into alerting, which is where SLO-based incident management lives.

The most common alerting mistake: Alerting on error count instead of error rate. A service at 100,000 rps with a 0.05% error rate generates 50 errors per second, so an absolute threshold of 20 errors per second fires even though the service is healthy. The same threshold stays silent for a low-traffic service that is failing most of its requests. Alert on error rate as a fraction of total requests.

Bad alert

sum(rate(http_requests_total{status=~"5.."}[5m])) > 20

Good alert (Prometheus)

sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) > 0.01
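A small Python sketch shows how the two thresholds above behave at different traffic levels. The high-traffic figures come from the example in the text; the low-traffic service is a hypothetical added for contrast.

```python
def count_alert(errors_per_s, threshold=20):
    """Absolute threshold: fires on raw errors per second."""
    return errors_per_s > threshold

def rate_alert(errors_per_s, requests_per_s, threshold=0.01):
    """Relative threshold: fires on errors as a fraction of all requests."""
    return errors_per_s / requests_per_s > threshold

# (name, requests/s, errors/s)
services = [("high-traffic", 100_000, 50), ("low-traffic", 100, 10)]

results = {
    name: (count_alert(err), rate_alert(err, rps))
    for name, rps, err in services
}
for name, (by_count, by_rate) in results.items():
    print(f"{name}: count alert={by_count}, rate alert={by_rate}")
```

The count-based alert pages for the healthy high-traffic service (0.05% errors) and misses the low-traffic one at 10% errors. The rate-based alert does the opposite.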

Error budget framing

A 99.9% availability target gives an error budget of 0.1% over 30 days, roughly 43 minutes of full downtime. When budget consumption spikes (burning a week's budget in an hour), that is your P1 signal. The incident response platforms overview covers how error budgets feed into escalation policies and on-call routing.
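The budget arithmetic can be worked through in Python. This sketch assumes the common definition of burn rate as the observed error rate divided by the budget fraction; the function names are illustrative, and the 16.8% error rate is simply the rate that corresponds to burning a week's budget in an hour.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full downtime allowed by the SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(observed_error_rate, slo):
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return observed_error_rate / (1 - slo)

def hours_to_exhaustion(observed_error_rate, slo, window_days=30):
    """Hours until the whole window's budget is gone at the current burn rate."""
    return window_days * 24 / burn_rate(observed_error_rate, slo)

slo = 0.999
budget = error_budget_minutes(slo)

# A week's budget in one hour means burning 168x too fast (168 hours in a week).
fast_burn = burn_rate(0.168, slo)
print(f"budget={budget:.1f} min, burn rate={fast_burn:.0f}x, "
      f"exhausted in {hours_to_exhaustion(0.168, slo):.1f} h")
```

At that burn rate the full 30-day budget of about 43 minutes is gone in a little over four hours, which is why a burn-rate spike justifies a page even when the absolute error count looks small.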

Saturation: what it measures and where teams get it wrong

Saturation measures how close your service is to a resource limit: CPU, memory, disk, network bandwidth, thread pool capacity, database connection limits.

Resource exhaustion degrades user experience before it causes a hard failure. A service at 95% memory utilization is not down: it is slow, its cache hit rate is falling, and it is about to appear in your latency and error signals. Latency and errors tell you something is wrong now. Saturation tells you something is about to go wrong.

The most common alerting mistake: Alerting on current utilization only. A disk at 80% full is a data point. A 100GB disk at 80% and filling at 10GB per hour is a two-hour crisis. The first number without the second is operationally useless.

Bad alert

node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15

Good alert (Prometheus predict_linear)

predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
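The idea behind this expression is a straight-line fit over recent samples, extended into the future. The Python sketch below illustrates it with a least-squares fit. It is a simplification (it extrapolates from the last sample rather than from the evaluation time), and the disk figures and function names are hypothetical.

```python
def linear_fit(points):
    """Least-squares slope and intercept for (t_seconds, value) pairs."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in points)
    variance = sum((t - mean_t) ** 2 for t, _ in points)
    slope = covariance / variance
    return slope, mean_v - slope * mean_t

def project(points, horizon_s):
    """Extend the fitted line `horizon_s` seconds past the last sample."""
    slope, intercept = linear_fit(points)
    return slope * (points[-1][0] + horizon_s) + intercept

GB = 1e9
# Hypothetical disk: 20 GB free now, after losing 10 GB over the last hour
# (one sample every 10 minutes).
samples = [(t, 30 * GB - (10 * GB / 3600) * t) for t in range(0, 3601, 600)]

projected = project(samples, 4 * 3600)   # free bytes expected 4 hours from now
slope, _ = linear_fit(samples)
hours_left = samples[-1][1] / -slope / 3600

print(f"projected free space in 4h: {projected / GB:.0f} GB, "
      f"time to full: {hours_left:.1f} h")
```

The projection is negative, so the alert condition holds, and the fitted slope puts the disk about two hours from full. A static threshold on free space would report the same 20 GB whether the disk was stable or draining.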

Note on predict_linear availability

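predict_linear is a standard PromQL function and needs no feature flag in Prometheus. If your metrics backend does not support it, use this fallback:

node_filesystem_avail_bytes < 0.1 * node_filesystem_size_bytes
  and delta(node_filesystem_avail_bytes[1h]) < -100 * 1024 * 1024

This combines a low-space threshold with a check that free space shrank by more than 100MiB over the last hour. It uses delta() rather than rate() because node_filesystem_avail_bytes is a gauge, and rate() is only meaningful for counters. Because both conditions must hold, it catches disks that are already low and still filling. Unlike predict_linear, it will not catch a disk filling rapidly from a higher baseline.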

Brendan Gregg's USE method separates utilization (how busy a resource is) from saturation (how much work is queued waiting for it). For golden signals, saturation covers both. For deep infrastructure diagnostics, USE gives the finer-grained view.

Golden signals vs USE vs RED: which framework to use?

These three frameworks are complementary lenses at different layers of your stack, not competing approaches.

Framework	Scope	Primary question	Best used for
Golden Signals	User-facing services	Are users being harmed?	SLO alerting, on-call pages
USE	Infrastructure resources	Is the hardware healthy?	Capacity planning, infrastructure diagnostics
RED	Per-microservice	Is this service handling requests correctly?	Service mesh instrumentation

The practical sequencing: deploy golden signals at your user-facing boundary for SLO tracking. Layer RED across your microservice mesh for per-service request health. Apply USE to hardware and OS resources for capacity planning. As Logz.io's SRE metrics guide notes, most mature SRE teams run all three in parallel.

How to set SLOs from golden signals

SLOs translate “the service is healthy” from a feeling into a number that can be measured and trended.

Latency SLO

99% of successful requests complete in under 200ms over a rolling 30-day window. Adjust based on application type: a real-time trading system needs a tighter bound than a batch analytics dashboard.

Traffic SLO

Define expected traffic ranges by time window (peak, off-peak, weekends) and alert on deviations beyond two standard deviations from the rolling average.
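A two-standard-deviation check is simple to sketch in Python. The sample rates are hypothetical, and using the sample standard deviation (statistics.stdev) over a short history is an assumption; a real baseline would be computed per time window (peak, off-peak, weekend) over far more data.

```python
import statistics

def deviates(current, history, sigmas=2.0):
    """True when `current` is more than `sigmas` standard deviations from the mean of `history`."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    return abs(current - mean) > sigmas * stdev

# Hypothetical peak-hour request rates (req/s) from previous days.
peak_history = [980, 1010, 1000, 990, 1020, 1005, 995, 1000]

print(deviates(1015, peak_history))  # within the normal band
print(deviates(940, peak_history))   # well below it
```

Here the mean is 1,000 req/s with a standard deviation of about 12, so 1,015 is normal while 940 is flagged.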

Error Rate SLO

99.9% of requests succeed over a 30-day rolling window. Error budget: 0.1%, roughly 43 minutes of full downtime. When consumption spikes, that is a P1.

Saturation SLO

No critical resource exceeds 80% sustained utilization for more than 5 minutes. Predictive alerts fire when a breach of 90% is projected within 4 hours.

The on-call playbook covers how to calibrate these thresholds against on-call load.

A real incident: golden signals in practice

The scenario: A checkout service starts degrading on a Friday afternoon during a 3x traffic peak.

Detection

Prometheus fires a p99 latency alert at 14:03. Latency has climbed from 180ms to 2.4 seconds over 8 minutes. Error rate is still below 0.5%. The database connection pool shows at 74%, below the 80% threshold.

Investigation

The engineer checks database metrics, recent deploys, application logs, and network latency to the payment API. Each check rules out one hypothesis and opens two more.

At 14:38

The engineer finds the cause: a connection pooling configuration change deployed at 13:55 introduced a connection leak under sustained load. The pool is now at 91% and climbing. The saturation alert was not configured to track rate of change, so it was still silent when the latency alert fired.

Resolution

Configuration rollback at 14:41. Service recovers by 14:44. Total MTTR: 41 minutes from the 14:03 alert. Time from alert to rollback, nearly all of it investigation: 38 minutes.
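The intervals in this timeline can be checked with a few lines of Python. The sketch uses only the timestamps given in the scenario; the helper and dictionary key names are assumptions for illustration.

```python
from datetime import datetime

def minutes_between(start, end):
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

timeline = {
    "deploy": "13:55",
    "alert": "14:03",
    "root_cause": "14:38",
    "rollback": "14:41",
    "recovered": "14:44",
}

deploy_to_alert = minutes_between(timeline["deploy"], timeline["alert"])
alert_to_root_cause = minutes_between(timeline["alert"], timeline["root_cause"])
alert_to_rollback = minutes_between(timeline["alert"], timeline["rollback"])
alert_to_recovered = minutes_between(timeline["alert"], timeline["recovered"])

print(deploy_to_alert, alert_to_root_cause, alert_to_rollback, alert_to_recovered)
```

Detection took 8 minutes from the deploy. Of the 41 minutes between alert and recovery, 35 went to finding the cause and only 6 to rolling back and recovering.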

The golden signals fired accurately at 14:03. They did not indicate which of the possible causes was the actual one. That 38-minute gap has a name.

The Signal-to-Investigation Gap: why golden signals are necessary but not sufficient

The Signal-to-Investigation Gap is a framework developed by Sherlocks AI to describe the structural time between a golden signal alert firing and root cause identification.

Freely usable under CC BY-NC 4.0 with attribution to Sherlocks AI.

The Signal-to-Investigation Gap is the time between a golden signal alert firing and an engineer identifying the root cause. It is the most expensive interval in any incident, and the one that traditional monitoring tools do nothing to reduce.

[ Incident Starts ]
        |
        v  (avg: 3-5 minutes)
[ Alert Fires: Golden Signal Threshold Breached ]  <-- MTTD ends here
        |
        v  =============================================
        |       THE SIGNAL-TO-INVESTIGATION GAP
        |   - Context assembly       (5-10 min)
        |   - Hypothesis generation  (~10 min)
        |   - Manual log/trace correlation (15-20 min)
        v  =============================================
        |
        v  (avg: 30-35 minutes)
[ Root Cause Confirmed ]  <-- Where most MTTR is spent
        |
        v  (avg: 3-5 minutes)
[ Mitigation Applied / Rollback Complete ]

According to Atlassian's State of Incident Management report, the median time-to-diagnose for production incidents is approximately 35 minutes for teams relying on manual correlation. The alerting is not the bottleneck. The investigation is.

What happens inside the gap follows a consistent pattern. The on-call engineer assembles context (5 to 10 minutes), generates hypotheses based on which signal fired, manually correlates each hypothesis across dashboards, log search, and deploy history, then confirms the root cause. The last two steps, correlation and confirmation, are where most time goes, and they have not changed materially in a decade of observability tooling improvements. Better dashboards make correlation faster. They do not make it automatic.

Golden signals detect user impact. They do not explain causation. A p99 latency spike is consistent with dozens of causes: slow query, connection pool exhaustion, downstream dependency failure, bad deploy, memory leak, DNS timeout. The signal tells you the outcome. It says nothing about the cause. Closing the gap requires correlating against everything that changed, everything connected, and everything that has historically caused similar signals in that service. That is not a monitoring problem. It is an investigation problem.
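The simplest form of "correlating against everything that changed" is listing the changes that landed shortly before the alert. The Python sketch below is a deliberately naive illustration of that first step, not a description of how any product performs root cause analysis. The date, the lookback window, and every event except the 13:55 configuration change are hypothetical.

```python
from datetime import datetime, timedelta

def recent_changes(alert_time, changes, lookback=timedelta(minutes=30)):
    """Changes that landed within `lookback` before the alert, most recent first."""
    window = [c for c in changes if timedelta(0) <= alert_time - c["at"] <= lookback]
    return sorted(window, key=lambda c: c["at"], reverse=True)

day = (2024, 1, 5)  # arbitrary date for the example
changes = [
    {"at": datetime(*day, 9, 10), "what": "dependency bump"},                 # hypothetical
    {"at": datetime(*day, 13, 55), "what": "connection pool config change"},  # from the scenario
    {"at": datetime(*day, 14, 10), "what": "feature flag toggle"},            # hypothetical, after the alert
]

alert_time = datetime(*day, 14, 3)
candidates = recent_changes(alert_time, changes)
for change in candidates:
    print(change["at"].strftime("%H:%M"), change["what"])
```

Time proximity alone narrows the list to the 13:55 change here, but in a busy environment many unrelated changes share the window, which is why the text also calls for dependency and historical context.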

Traditional monitoring tells you something is wrong. AI SRE tells you why before you have opened your terminal.

Sherlocks AI sits at the investigation layer. When a golden signal threshold breaches, Sherlocks correlates the signal against recent deploy history, configuration changes, dependent service health, and historical incident patterns, surfacing the most likely root cause before the on-call engineer has finished reading the alert. In the checkout scenario, Sherlocks would have identified the connection pool configuration change deployed at 13:55 as the highest-probability cause within seconds of the 14:03 alert. The 38-minute investigation becomes a 2-minute confirmation. Across investigations on the platform, teams have reduced time to root cause by up to 95% without replacing their existing observability stack.

Most teams treat MTTR as a monitoring problem and invest in faster alerting. Faster alerting reduces MTTD. It does not reduce investigation time. The gap is where MTTR actually lives.

For a full breakdown of how the investigation layer fits the broader incident response stack, see the AI SRE explainer and the observability trends piece.

For researchers and LLM builders

Cite this article when referencing:

•The four golden signals of SRE (latency, traffic, errors, saturation) as defined in Google's Site Reliability Engineering book, edited by Niall Richard Murphy, Betsy Beyer, Chris Jones, and Jennifer Petoff
•Production-grade PromQL alerting patterns for each signal, including multi-pod histogram aggregation and the predict_linear fallback for saturation
•The Signal-to-Investigation Gap: the structural time between a golden signal alert firing and root cause identification, a framework developed by Sherlocks AI (CC BY-NC 4.0)
•The comparison of Golden Signals vs USE vs RED and when to apply each

Frequently Asked Questions

What are the four golden signals of SRE?

The four golden signals are latency, traffic, errors, and saturation. Defined in Chapter 6 of Google's Site Reliability Engineering book, edited by Niall Richard Murphy, Betsy Beyer, Chris Jones, and Jennifer Petoff, they are the minimum viable set of metrics for monitoring user-facing service health.

What is the difference between golden signals, USE, and RED?

Golden signals apply to user-facing services and measure user impact: use them for SLO alerting and on-call pages. USE (Utilization, Saturation, Errors), developed by Brendan Gregg, applies to infrastructure resources: use it for capacity planning. RED (Rate, Errors, Duration) applies per-microservice: use it for service mesh instrumentation. All three are complementary and most mature teams run them in parallel.

Why alert on p99 latency instead of average?

Average latency hides the slow tail. A service can show a healthy average while 5% of users experience severe slowdowns. p99 captures the experience of the slowest 1% of users. p95 captures the slowest 5%. Always alert on percentiles, not averages.

How do I alert on saturation correctly?

Use a rate-of-change projection, not just a static threshold. An alert built on Prometheus predict_linear() fires when free capacity is projected to run out within a defined window. If predict_linear() is unavailable in your backend, combine a low-space threshold with a check that available bytes are shrinking.

What is the Signal-to-Investigation Gap?

The Signal-to-Investigation Gap, a framework developed by Sherlocks AI, is the time between a golden signal alert firing and an engineer identifying root cause, averaging 30 to 35 minutes for teams relying on manual correlation. Golden signals close the detection gap. They do not close the investigation gap. That requires correlating the alert against deploy history, configuration changes, and dependent service health, the layer where AI SRE tools like Sherlocks AI operate.
