On-call is one of the most consequential and most neglected systems in engineering organisations. When it works, engineers sleep. Systems recover fast. Reliability compounds. When it breaks, the best engineers leave first.
This playbook covers what sustainable on-call actually requires in 2026: why traditional rotations fail, which model fits your team, how to design alerts that earn trust, and what metrics reveal whether your system is healthy before attrition does.
The central argument runs through every section: most on-call problems are signal quality problems wearing the disguise of scheduling problems. Redesigning your rotation without fixing your alerts is rearranging the furniture. The teams that run clean, sustainable on-call start with what pages them, not how often the pager changes hands.
It Is 2:13 AM. Everything Is On Fire. And None of It Should Have Paged.
The pager fires. You grab your phone, squinting at the brightness. Three alerts. Then five. Then twelve.
You open your laptop, pull up Datadog, check Grafana, scan the Kubernetes dashboard. Slack is already moving. Someone asks if it is related to the deploy from earlier. Someone else says probably not. Nobody knows yet.
Twenty-two minutes later, the alerts resolve themselves. A transient CPU spike. Self-correcting. Nothing broke. Nobody needed to wake up.
You close the laptop. It is 2:35 AM. You have a 9 AM planning meeting.
This is not a war story. This is a Tuesday.
Here is the uncomfortable truth: the rotation schedule did not cause this. An alert fired that should never have existed, and a human being lost two hours of sleep because nobody had ever stopped to ask: does this alert represent a real customer-impacting problem, or is it noise we have learned to live with?
Most teams respond to burnout by redesigning their rotation. Add a secondary. Move to follow-the-sun. Shorten the shifts. These are real fixes for real problems, but they treat the symptom, not the cause.
The real problem is signal quality. What pages you determines how tired your engineers are, how fast they respond, how much they trust their own pager, and whether your best people stick around. Fix the signal. Everything else gets easier.
Why Traditional On-Call Rotations Break
Most on-call systems were designed for monolithic applications, predictable failure modes, and teams where one or two engineers knew every corner of the system. That world no longer exists.
Alert noise. Modern observability tools generate massive telemetry volumes. Without disciplined alert design, everything becomes a page. Engineers get woken up for transient spikes, duplicate notifications, and cascading alerts that trace back to one root cause. After enough false alarms, they stop trusting the pager. That desensitisation is not laziness. It is a rational response to a broken system. Research from PagerDuty's State of Digital Operations consistently shows alert fatigue as a leading driver of on-call burnout and engineer attrition.
Understaffed rotations. Many teams run on-call with three or four engineers. Google's SRE Workbook puts the minimum at eight for single-site 24/7 coverage. Below that number, the math does not work. Engineers carry the pager too often, recover too little, and burn out faster than teams can hire.
Missing context at 3 AM. When an alert fires, the on-call engineer often spends more time investigating than fixing. They are hunting across dashboards, logs, and deployment history before they can act. The investigation is the bottleneck, not the resolution.
Knowledge silos. When only one engineer truly understands a service, every related incident becomes a crisis. The team is quietly one resignation away from a reliability emergency.
Repeat incidents. The same issues fire week after week because there is no postmortem, no systemic fix, no automation. Each repeat page is a failure that was already paid for once.
For a deeper look at how the SRE discipline has evolved to address these challenges, see Traditional SRE vs Modern SRE: What Every Engineering Leader Needs to Know in 2026.
The Five Principles of Sustainable On-Call
Healthy rotations share one underlying logic: reduce what pages, reduce the thinking required when it does page, and make sure any engineer in the rotation can resolve it. If a system fails at any of those three, the rotation eventually collapses.
Page only for user impact. The fastest way to destroy a rotation is paging on raw infrastructure metrics. CPU spikes. Memory thresholds. Queue depth warnings. These represent conditions, not incidents. Teams that run clean rotations tie alerts to SLO burn rate or error budget consumption. If the pager fires, something should be broken for a user. Anything else belongs in a dashboard.
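To make the distinction concrete, here is a minimal sketch of an SLO burn-rate check in the multi-window style. The thresholds (14.4 for the fast window, 6.0 for the slow one) are the commonly cited defaults from Google's SRE Workbook; the function names and inputs are illustrative, not from any particular monitoring tool.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.
    1.0 means exactly on budget; 14.4 sustained for an hour would
    exhaust a 30-day budget in about two days."""
    budget = 1.0 - slo_target          # e.g. a 99.9% SLO leaves a 0.1% budget
    return error_ratio / budget

def should_page(err_1h: float, err_6h: float, slo_target: float = 0.999) -> bool:
    """Multi-window check: both a fast and a slow window must burn hot,
    which filters out the transient spikes that wake people for nothing."""
    return (burn_rate(err_1h, slo_target) >= 14.4
            and burn_rate(err_6h, slo_target) >= 6.0)

# A sustained 2% error rate against a 99.9% SLO pages immediately;
# a brief CPU blip with no user-facing errors never reaches this code.
print(should_page(err_1h=0.02, err_6h=0.015))     # True: budget burning fast
print(should_page(err_1h=0.0005, err_6h=0.0002))  # False: within budget
```

The point of the two windows is exactly the 2:13 AM story from the introduction: a transient spike trips the fast window but not the slow one, so nobody gets paged.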
Protect recovery time, not just shift time. Shorter shifts do not fix noisy on-call. A three-day rotation with eight night pages exhausts engineers faster than a one-week rotation with two. What matters is interruption frequency. Most shifts should be quiet enough to sleep through.
If an alert has no runbook, it should not exist. Every page should answer three questions immediately: what does this mean, what should I check first, what usually fixes it. Good runbooks turn incidents into procedures instead of investigations.
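In practice a runbook is just those three answers attached to the alert definition, which also makes the "no runbook, no page" rule enforceable in code. The field and alert names below are illustrative:

```python
# Minimal runbook: every pageable alert carries these three answers.
RUNBOOK = {
    "alert": "checkout-api-high-error-rate",  # illustrative alert name
    "meaning": "Checkout requests are failing for real users; error budget is burning.",
    "check_first": [
        "Checkout API error dashboard",
        "Deploys to checkout-api in the last two hours",
    ],
    "usual_fix": "Roll back the most recent checkout-api deploy.",
}

def is_pageworthy(runbook: dict) -> bool:
    """Enforce the rule: an alert with no runbook should not page."""
    return all(runbook.get(key) for key in ("meaning", "check_first", "usual_fix"))

print(is_pageworthy(RUNBOOK))  # True
```

A check like this can run in CI against the alerting config, so an alert without its three answers never ships in the first place.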
Make reliability owned by the people who build the system. When developers participate in on-call for their own services, incidents resolve faster and systems get designed to fail less. Shared ownership is an operational forcing function, not a cultural aspiration.
Automate what you already know how to fix. If engineers perform the same fix repeatedly, that is automation waiting to happen. And apply the 90-day rule: if an alert has not required human action in 90 days, delete it. Not tune it. Delete it. Several SRE teams that adopted this rule saw paging volume drop dramatically within a quarter.
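The 90-day rule is easy to mechanise once alert history is exportable. A sketch, assuming each alert record carries a `last_actioned` timestamp (`None` if a human has never had to act on it) — the record shape is an assumption, not any tool's real schema:

```python
from datetime import datetime, timedelta, timezone

NINETY_DAYS = timedelta(days=90)

def alerts_to_delete(alerts: list[dict], now: datetime) -> list[str]:
    """Names of alerts whose last human-actioned page is older than 90 days.
    These are deletion candidates -- not tuning candidates."""
    return [
        a["name"]
        for a in alerts
        if a["last_actioned"] is None or now - a["last_actioned"] > NINETY_DAYS
    ]

now = datetime(2026, 1, 15, tzinfo=timezone.utc)
alerts = [
    {"name": "cpu-spike-warning", "last_actioned": None},
    {"name": "checkout-error-budget", "last_actioned": now - timedelta(days=3)},
    {"name": "queue-depth-high", "last_actioned": now - timedelta(days=200)},
]
print(alerts_to_delete(alerts, now))  # ['cpu-spike-warning', 'queue-depth-high']
```

Run quarterly, a sweep like this turns the 90-day rule from a resolution into a habit.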
Understanding how to build alerts that reflect user impact rather than raw infrastructure thresholds is foundational to this approach. Alert on Causes, Not Symptoms: The Fastest Way to Reduce MTTR covers the mechanics of cause-based alerting in depth.
Choosing the Right Rotation Model for Your Team
There is no universal on-call rotation. The right model depends on three variables: team size, geographic distribution, and actionable page volume per shift. Pick the wrong model and no amount of good alerting will save you.
Weekly primary is the most common and most misapplied model. One engineer owns the pager for a full week. It works for small teams because it is simple and maintains context continuity. It breaks the moment page volume gets heavy. Five or more nightly interruptions on a weekly rotation is a burnout path, not a schedule.
Primary plus secondary is Google's recommended baseline for 24/7 coverage. The primary handles pages; the secondary backs up and takes over if an incident runs long. The math is unforgiving: a single-site team needs a minimum of eight engineers to keep each engineer's on-call time within a 25% budget. Below eight, the rotation looks sustainable on paper but is not.
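The budget arithmetic is worth making explicit. With a primary and a secondary, two engineers are on-call at any given moment, so in a fair rotation each engineer spends 2/N of their time on-call:

```python
def on_call_fraction(team_size: int, concurrent_roles: int = 2) -> float:
    """Fraction of time each engineer is on-call in a fair rotation
    with `concurrent_roles` people (primary + secondary) on at once."""
    return concurrent_roles / team_size

print(f"{on_call_fraction(8):.0%}")  # 25% -- exactly at the budget
print(f"{on_call_fraction(6):.0%}")  # 33% -- over budget; fine on paper, not in practice
```

This is why the eight-engineer floor is not a style preference: at six engineers the same rotation structure quietly pushes every engineer a third of their life onto the pager.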
Follow-the-sun is the only model that eliminates overnight pages entirely. Each regional team covers their daylight hours and hands off to the next time zone. With three sites, it can reduce individual on-call duration by up to 67%. It requires genuine geographic presence, at least nine engineers across three locations, and handoff discipline that most teams underestimate. It breaks when teams treat it as a scheduling solution rather than a coordination system.
Service-based rotations work for larger organisations running microservices. Each team owns on-call for the services they build, which solves knowledge silos directly. It breaks when service boundaries are unclear or when smaller services lack enough engineers to sustain their own rotation.
| Team Size | Geography | Recommended Model | Where It Breaks |
|---|---|---|---|
| 4 to 5 engineers | Single site | Weekly primary | Heavy page volume |
| 6 to 8 engineers | Single site | Primary plus secondary | Below 8 engineers |
| 10 or more engineers | Single site | Service-based | Unclear service ownership |
| Any size | Multi-region | Follow-the-sun | Poor handoff discipline |
Google's SRE Workbook puts the minimum for single-site 24/7 at eight engineers; six per site for multi-site teams. If your team is smaller, 24/7 coverage is not yet sustainable, and your schedule should reflect that even if business pressure says otherwise.
The Handoff: The Most Underrated Practice in On-Call
Most on-call discussions focus on alerts, rotations, and tooling. Very few talk seriously about handoffs. A well-designed rotation with poor handoff culture will fail just as reliably as an understaffed one.
When a handoff is weak, the incoming engineer starts blind. They inherit active incidents with no context, open questions with no thread, services they need to reorient to from scratch. That disorientation shows up directly in MTTR.
The most common failure is not intentional. Engineers finishing a shift are tired. The outgoing note becomes “all quiet” when reality is more nuanced: a service flapping for two hours, a late deploy nobody validated, a noisy alert the incoming engineer will chase for twenty minutes before realising it has been doing this for weeks.
A good handoff transfers operational context, not just a list of alerts. Five things, nothing more: active incidents and their current state, changes and deploys from the shift, known noisy or flapping alerts, open questions still under investigation, and anything being watched that has not paged yet.
This does not need a meeting. A structured Slack message takes four minutes and saves thirty minutes of disorientation. Teams that treat handoffs as optional are the ones whose follow-the-sun rotations quietly accumulate MTTR debt every time a shift changes.
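If the handoff lives in Slack, the structure can be enforced with a few lines of code rather than willpower. A sketch with illustrative section names; an empty section renders as an explicit "none", so "all quiet" becomes a claim the outgoing engineer actually makes rather than a default:

```python
def handoff_message(active_incidents, recent_changes, noisy_alerts,
                    open_questions, watching) -> str:
    """Render a structured end-of-shift handoff message."""
    sections = [
        ("Active incidents", active_incidents),
        ("Changes this shift", recent_changes),
        ("Known-noisy alerts", noisy_alerts),
        ("Open questions", open_questions),
        ("Watching", watching),
    ]
    lines = ["*On-call handoff*"]
    for title, items in sections:
        lines.append(f"{title}:")
        for item in (items or ["none"]):  # empty section -> explicit 'none'
            lines.append(f"  - {item}")
    return "\n".join(lines)

msg = handoff_message(
    active_incidents=["checkout-api flapping since 22:10, mitigated not fixed"],
    recent_changes=["payments deploy at 23:40, not yet validated"],
    noisy_alerts=["queue-depth-high fires hourly, known false positive"],
    open_questions=[],
    watching=["db replica lag creeping up"],
)
print(msg)
```

Four minutes to fill in, and the incoming engineer starts the shift knowing exactly which alert to ignore and which deploy to suspect.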
Metrics That Reveal a Broken On-Call System
Most teams track MTTR. Far fewer track the metrics that explain why MTTR is high. If you only measure resolution time, you are measuring the outcome of a broken system, not the system itself.
| Metric | What It Measures | Healthy Benchmark | What It Signals When Broken |
|---|---|---|---|
| Pages per engineer per week | Raw interrupt volume | Under 5 actionable pages | Alerting stack needs an audit, not the rotation |
| Alert-to-action ratio | Alerts requiring human action | 30 to 50% actionable | Below 10% means most on-call burden is noise |
| Repeat incident rate | Incidents recurring within 30 days | Zero repeats | A postmortem that never happened |
| MTTR variance across shifts | Resolution time consistency | Minimal variance | Missing runbooks, poor handoffs, or unfamiliar engineers |
| On-call load distribution | Incident load per engineer | Evenly distributed | One engineer at 3x load is a hero problem becoming an attrition problem |
| Engineer-reported satisfaction | Monthly 1-to-5 self-reported experience | Trending upward | The earliest predictor of burnout and resignation |
If you cannot answer all six of these today, close that gap before redesigning anything. Rotation changes made without this data are guesses. For a practical guide to bringing MTTR down once you have identified where the bottlenecks sit, see How to Reduce MTTR in 2026: From Alert to Root Cause in Minutes.
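The first two rows of the table can be computed directly from exported paging history. A sketch, assuming each page record carries a `required_action` flag — the field name is an assumption, not any platform's real schema:

```python
def paging_health(pages: list[dict], engineers: int, weeks: int) -> dict:
    """Pages per engineer per week, and the fraction of pages
    that actually required a human to do something."""
    actionable = sum(1 for p in pages if p["required_action"])
    return {
        "pages_per_engineer_per_week": len(pages) / engineers / weeks,
        "alert_to_action_ratio": actionable / len(pages) if pages else 0.0,
    }

# 200 pages in a month for a team of six, only 10% actionable:
pages = [{"required_action": i % 10 == 0} for i in range(200)]
stats = paging_health(pages, engineers=6, weeks=4)
print(stats)  # ~8.3 pages/engineer/week, 0.1 ratio -- both past the red line
```

In this example both numbers fail the table's benchmarks at once, and they point at the same fix: the alerting stack, not the rotation.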
The On-Call Audit: 10 Questions to Ask Your Team This Week
Before changing your rotation, adding tooling, or investing in AI-assisted triage, run this audit. Most teams feel the symptoms of a broken system long before they diagnose the causes. If engineering leaders cannot answer most of these without digging through months of reports, that absence is itself the finding.
1. How many of last month's pages required human action? If most resolved themselves, the problem is alert quality, not the rotation.
2. When did you last delete an alert? Not tune it. Not deprioritise it. Delete it. If the answer is never, start there.
3. Does every page have a runbook? Any alert without one is asking an engineer to reverse-engineer your system at 3 AM.
4. Can any engineer in the rotation resolve a typical incident? If only one or two people can, the system still depends on heroes.
5. Which incidents have recurred in the last 30 days? Every repeat page is a fix that was never made permanent.
6. Where does incident time actually go? If context-gathering takes longer than resolution, observability and documentation are the gaps.
7. Is the incident load evenly distributed across the team? Uneven distribution is usually a knowledge silo wearing a scheduling mask.
8. Is there a structured handoff between shifts? Without one, context resets every time the pager changes hands.
9. What did your last five postmortems change? Postmortems that generate no automation or permanent remediation are documentation, not improvement.
10. How do your engineers feel about on-call? Ask them directly. The answer predicts your next attrition events better than any metric you track.
If several of these are difficult to answer, that is the signal. Sustainable on-call rarely comes from a single change. It comes from gradually improving the systems behind it.
When evaluating AI-assisted triage tools to reduce investigation time, it is worth understanding how purpose-built platforms compare to general-purpose AI agents. Claude Code vs. Sherlocks.ai breaks down what actually happens when you put both approaches in production at 3 AM. Similarly, if your team is assessing dedicated AI SRE platforms, Resolve AI vs. Sherlocks.ai offers a detailed side-by-side on data architecture, autonomy, and pricing.
Conclusion
Sustainable on-call is not a single fix. It is a system you build incrementally: starting with the signals that interrupt your engineers, then the knowledge they need when it does page, then the rotation that distributes the load fairly.
The teams that get this right do not have better tooling. They have better discipline around the basics: alerts tied to user impact, runbooks that actually work, handoffs that transfer context instead of resetting it, postmortems that produce systemic change.
If you take one thing from this playbook, make it the 90-day rule. Pull up your alerting stack this week. Find every alert that has not required human action in 90 days. Delete it.
What remains is closer to a signal. Build your rotation around that, and the rest gets easier.
If you are thinking about the broader role of AI in on-call operations, Being An SRE is Nothing Short of Chaotic is an honest look at why the complexity compounds and what teams are doing to tame it.
Frequently Asked Questions
What is an on-call rotation?
An on-call rotation is a schedule where engineers take turns responding to production incidents and alerts. During their shift, the on-call engineer monitors system health and investigates issues that could impact reliability or user experience.

How many engineers do you need for sustainable 24/7 on-call?
The minimum for single-site 24/7 coverage is eight engineers, per Google's SRE Workbook. Below that, engineers carry the pager too frequently to recover properly. For multi-site teams, six per site.

What is alert fatigue?
Alert fatigue happens when engineers are paged for events that do not require human action—transient spikes, duplicate notifications, or cascading alerts from a single root cause. Over time, engineers stop trusting the pager, which slows response to real incidents.

What is the 90-day rule for alerts?
If an alert has not required human action in 90 days, delete it. Not tune it—delete it. This rule forces teams to confront alert debt directly and has been shown to dramatically reduce paging volume within a quarter.

How does a follow-the-sun rotation work?
Follow-the-sun distributes on-call coverage across time zones so each team covers only their daylight hours and hands off to the next region. It eliminates overnight pages entirely, but requires at least nine engineers across three locations and disciplined handoff practice.

What should an on-call runbook include?
Three things at minimum: what the alert means, what to check first, and what usually fixes it. The best runbooks also include links to relevant dashboards, recent incident history, and an escalation path if the standard fix does not work.

How do you reduce alert fatigue?
Shift from threshold-based alerts to SLO-based alerts. If an alert does not represent user impact or a burning error budget, it should not page. Delete anything that has not required human action in 90 days.

Should developers be on-call for their own services?
Yes. When developers carry the pager for what they ship, they design systems that fail less often and resolve incidents faster because they understand the code. Shared ownership is an operational forcing function, not a punishment.

Which on-call metrics should teams track?
Track pages per engineer per week (healthy: under five), alert-to-action ratio (healthy: 30 to 50%), and repeat incident rate (healthy: zero within 30 days). Engineer-reported satisfaction is the earliest leading indicator of burnout.

How can AI help with on-call?
AI tools can automatically correlate alerts, analyse telemetry, and surface likely root causes during incidents. They reduce investigation time, but work best when teams already maintain clean alerting and reliable operational practices.
Get SRE Insights Delivered Weekly
Join the Sherlocks.ai LinkedIn newsletter for practical breakdowns on on-call operations, incident response, and AI-powered reliability engineering.
Subscribe on LinkedIn