
The On-Call Playbook for 2026: How to Build Sustainable Rotations

By Akshat Sandhaliya · Published on: Mar 9, 2026 · Last edited: Mar 9, 2026 · 10 min read

On-call is one of the most consequential and most neglected systems in engineering organisations. When it works, engineers sleep. Systems recover fast. Reliability compounds. When it breaks, the best engineers leave first.

This playbook covers what sustainable on-call actually requires in 2026: why traditional rotations fail, which model fits your team, how to design alerts that earn trust, and what metrics reveal whether your system is healthy before attrition does.

The central argument runs through every section: most on-call problems are signal quality problems wearing the disguise of scheduling problems. Redesigning your rotation without fixing your alerts is rearranging the furniture. The teams that run clean, sustainable on-call start with what pages them, not how often the pager changes hands.

It Is 2:13 AM. Everything Is On Fire. And None of It Should Have Paged.

The pager fires. You grab your phone, squinting at the brightness. Three alerts. Then five. Then twelve.

You open your laptop, pull up Datadog, check Grafana, scan the Kubernetes dashboard. Slack is already moving. Someone asks if it is related to the deploy from earlier. Someone else says probably not. Nobody knows yet.

Twenty-two minutes later, the alerts resolve themselves. A transient CPU spike. Self-correcting. Nothing broke. Nobody needed to wake up.

You close the laptop. It is 2:35 AM. You have a 9 AM planning meeting.

This is not a war story. This is a Tuesday.

Here is the uncomfortable truth: the rotation schedule did not cause this. An alert fired that should never have existed, and a human being lost two hours of sleep because nobody had ever stopped to ask: does this alert represent a real customer-impacting problem, or is it noise we have learned to live with?

Most teams respond to burnout by redesigning their rotation. Add a secondary. Move to follow-the-sun. Shorten the shifts. These are real fixes for real problems, but they treat the symptom, not the cause.

The real problem is signal quality. What pages you determines how tired your engineers are, how fast they respond, how much they trust their own pager, and whether your best people stick around. Fix the signal. Everything else gets easier.

Why Traditional On-Call Rotations Break

Most on-call systems were designed for monolithic applications, predictable failure modes, and teams where one or two engineers knew every corner of the system. That world no longer exists.

Alert noise. Modern observability tools generate massive telemetry volumes. Without disciplined alert design, everything becomes a page. Engineers get woken up for transient spikes, duplicate notifications, and cascading alerts that trace back to one root cause. After enough false alarms, they stop trusting the pager. That desensitisation is not laziness. It is a rational response to a broken system. Research from PagerDuty's State of Digital Operations consistently shows alert fatigue as a leading driver of on-call burnout and engineer attrition.

Understaffed rotations. Many teams run on-call with three or four engineers. Google's SRE Workbook puts the minimum at eight for single-site 24/7 coverage. Below that number, the math does not work. Engineers carry the pager too often, recover too little, and burn out faster than teams can hire.

Missing context at 3 AM. When an alert fires, the on-call engineer often spends more time investigating than fixing. They are hunting across dashboards, logs, and deployment history before they can act. The investigation is the bottleneck, not the resolution.

Knowledge silos. When only one engineer truly understands a service, every related incident becomes a crisis. The team is quietly one resignation away from a reliability emergency.

Repeat incidents. The same issues fire week after week because there is no postmortem, no systemic fix, no automation. Each repeat page is a failure that was already paid for once.

For a deeper look at how the SRE discipline has evolved to address these challenges, see Traditional SRE vs Modern SRE: What Every Engineering Leader Needs to Know in 2026.

The Five Principles of Sustainable On-Call

Healthy rotations share one underlying logic: reduce what pages, reduce the thinking required when it does page, and make sure any engineer in the rotation can resolve it. If a system fails at any of those three, the rotation eventually collapses.

Page only for user impact. The fastest way to destroy a rotation is paging on raw infrastructure metrics. CPU spikes. Memory thresholds. Queue depth warnings. These represent conditions, not incidents. Teams that run clean rotations tie alerts to SLO burn rate or error budget consumption. If the pager fires, something should be broken for a user. Anything else belongs in a dashboard.
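The SLO-tied paging described above can be sketched in a few lines. This is a minimal illustration of the multi-window burn-rate pattern popularised by Google's SRE Workbook, assuming a hypothetical 99.9% availability SLO over a 30-day window; the 14.4x threshold corresponds to burning roughly 2% of a month's error budget in one hour, and all the numbers are illustrative.

```python
# Sketch of an error-budget burn-rate check for a hypothetical 99.9% SLO.
# Page only when the budget is burning fast enough to matter to users.
SLO_TARGET = 0.999                   # 99.9% of requests succeed
ERROR_BUDGET = 1.0 - SLO_TARGET      # 0.1% of requests may fail

def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'sustainable' the budget is burning."""
    return error_ratio / ERROR_BUDGET

def should_page(short_window_errors: float, long_window_errors: float) -> bool:
    """Page only if both a short (e.g. 5m) and long (e.g. 1h) window show
    fast burn. Requiring both windows filters out transient spikes."""
    return (burn_rate(short_window_errors) > 14.4
            and burn_rate(long_window_errors) > 14.4)

# A transient blip with almost no user-facing errors never pages:
print(should_page(0.0001, 0.0002))   # False
# A sustained 2% error ratio burns a month of budget in under two days:
print(should_page(0.02, 0.02))       # True
```

The point of the dual window is exactly the 2:13 AM story above: a self-correcting spike fails the long-window check and stays on a dashboard instead of a pager.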

Protect recovery time, not just shift time. Shorter shifts do not fix noisy on-call. A three-day rotation with eight night pages exhausts engineers faster than a one-week rotation with two. What matters is interruption frequency. Most shifts should be quiet enough to sleep through.

If an alert has no runbook, it should not exist. Every page should answer three questions immediately: what does this mean, what should I check first, what usually fixes it. Good runbooks turn incidents into procedures instead of investigations.

Make reliability owned by the people who build the system. When developers participate in on-call for their own services, incidents resolve faster and systems get designed to fail less. Shared ownership is an operational forcing function, not a cultural aspiration.

Automate what you already know how to fix. If engineers perform the same fix repeatedly, that is automation waiting to happen. And apply the 90-day rule: if an alert has not required human action in 90 days, delete it. Not tune it. Delete it. Several SRE teams that adopted this rule saw paging volume drop dramatically within a quarter.
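The 90-day rule is easy to mechanise if your paging tool can export per-alert history. The sketch below assumes a hypothetical export of (alert name, last human action) pairs; the alert names and dates are invented for illustration.

```python
# Sketch of a 90-day-rule audit over a hypothetical per-alert action export.
from datetime import datetime, timedelta

NOW = datetime(2026, 3, 9)
RULE_WINDOW = timedelta(days=90)

# Made-up data standing in for your paging tool's export.
alert_history = {
    "checkout-error-budget-burn": NOW - timedelta(days=12),
    "node-cpu-above-80pct": NOW - timedelta(days=210),  # 7 months untouched
    "payment-queue-depth": NOW - timedelta(days=95),
}

def deletion_candidates(history: dict) -> list:
    """Alerts with no human action inside the window: delete, don't tune."""
    return sorted(name for name, last_action in history.items()
                  if NOW - last_action > RULE_WINDOW)

print(deletion_candidates(alert_history))
# -> ['node-cpu-above-80pct', 'payment-queue-depth']
```

Running an audit like this quarterly turns the 90-day rule from a resolution into a routine.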

Understanding how to build alerts that reflect user impact rather than raw infrastructure thresholds is foundational to this approach. Alert on Causes, Not Symptoms: The Fastest Way to Reduce MTTR covers the mechanics of cause-based alerting in depth.

Choosing the Right Rotation Model for Your Team

There is no universal on-call rotation. The right model depends on three variables: team size, geographic distribution, and actionable page volume per shift. Pick the wrong model and no amount of good alerting will save you.

Weekly primary is the most common and most misapplied model. One engineer owns the pager for a full week. It works for small teams because it is simple and maintains context continuity. It breaks the moment page volume gets heavy. Five or more nightly interruptions on a weekly rotation is a burnout path, not a schedule.

Primary plus secondary is Google's recommended baseline for 24/7 coverage. The primary handles pages; the secondary backs up and takes over if an incident runs long. The math is unforgiving: keeping each engineer's on-call time within Google's recommended 25% cap requires at least eight engineers at a single site. Below eight, the rotation looks sustainable on paper but is not.

Follow-the-sun is the only model that eliminates overnight pages entirely. Each regional team covers their daylight hours and hands off to the next time zone. With three sites, it can reduce individual on-call duration by up to 67%. It requires genuine geographic presence, at least nine engineers across three locations, and handoff discipline that most teams underestimate. It breaks when teams treat it as a scheduling solution rather than a coordination system.
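As a rough illustration of the coordination involved, the sketch below models a hypothetical three-site schedule in UTC. The sites, offsets, and shift boundaries are assumptions for the example, not a recommendation; the useful property to check is that every hour maps to exactly one site's daylight.

```python
# Sketch of a hypothetical three-site follow-the-sun schedule in UTC.
# Each site covers an 8-hour block roughly aligned with its working day.
SHIFTS_UTC = [
    ("Bengaluru", 3, 11),      # ~08:30-16:30 local (UTC+5:30)
    ("Dublin", 11, 19),        # ~11:00-19:00 local (UTC+0)
    ("San Francisco", 19, 3),  # ~11:00-19:00 local (UTC-8), wraps midnight UTC
]

def on_shift(hour_utc: int) -> str:
    """Return which site owns the pager at a given UTC hour."""
    for site, start, end in SHIFTS_UTC:
        if start < end and start <= hour_utc < end:
            return site
        if start > end and (hour_utc >= start or hour_utc < end):
            return site
    raise ValueError("coverage gap")

print(on_shift(2))   # 'San Francisco' -- their afternoon, nobody's 2 AM
```

Every hour of the day resolves to someone's daylight, which is the entire value proposition; the handoffs at 3, 11, and 19 UTC are where the model quietly fails if context does not travel with the pager.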

Service-based rotations work for larger organisations running microservices. Each team owns on-call for the services they build, which solves knowledge silos directly. It breaks when service boundaries are unclear or when smaller services lack enough engineers to sustain their own rotation.

| Team Size | Geography | Recommended Model | Where It Breaks |
| --- | --- | --- | --- |
| 4 to 5 engineers | Single site | Weekly primary | Heavy page volume |
| 6 to 8 engineers | Single site | Primary plus secondary | Below 8 engineers |
| 10 or more engineers | Single site | Service-based | Unclear service ownership |
| Any size | Multi-region | Follow-the-sun | Poor handoff discipline |

Google's SRE Workbook puts the minimum for single-site 24/7 at eight engineers; six per site for multi-site teams. If your team is smaller, 24/7 coverage is not yet sustainable, and your schedule should reflect that even if business pressure says otherwise.

The Handoff: The Most Underrated Practice in On-Call

Most on-call discussions focus on alerts, rotations, and tooling. Very few talk seriously about handoffs. A well-designed rotation with poor handoff culture will fail just as reliably as an understaffed one.

When a handoff is weak, the incoming engineer starts blind. They inherit active incidents with no context, open questions with no thread, services they need to reorient to from scratch. That disorientation shows up directly in MTTR.

The most common failure is not intentional. Engineers finishing a shift are tired. The outgoing note becomes “all quiet” when reality is more nuanced: a service flapping for two hours, a late deploy nobody validated, a noisy alert the incoming engineer will chase for twenty minutes before realising it has been doing this for weeks.

A good handoff transfers operational context, not just a list of alerts. Five things, nothing more:

1. Active incidents. What is unresolved and what mitigation is already in place?
2. Recent deployments. Even unrelated changes become clues during an incident.
3. Anything unusual. Traffic spikes, instability, dependency outages in the last few hours.
4. Known flapping alerts. Signals that triggered but needed no action.
5. One thing to watch. A system that looks unstable but has not crossed a threshold yet.

This does not need a meeting. A structured Slack message takes four minutes and saves thirty minutes of disorientation. Teams that treat handoffs as optional are the ones whose follow-the-sun rotations quietly accumulate MTTR debt every time a shift changes.
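A note covering those five items is trivial to standardise. The sketch below generates a Slack-style handoff message; the field names and the output format are illustrative, not a standard.

```python
# Sketch of a structured handoff note covering the five items above.
# Field names and Slack-style formatting are illustrative.
def format_handoff(active_incidents, recent_deploys, anomalies,
                   flapping_alerts, watch_item) -> str:
    def section(title, items):
        body = "\n".join(f"  - {i}" for i in items) if items else "  - none"
        return f"*{title}*\n{body}"
    return "\n".join([
        section("Active incidents", active_incidents),
        section("Recent deployments", recent_deploys),
        section("Anything unusual", anomalies),
        section("Known flapping alerts", flapping_alerts),
        section("One thing to watch", [watch_item]),
    ])

print(format_handoff(
    active_incidents=["INC-4821: search latency, mitigated by cache bump"],
    recent_deploys=["payments v2.14 at 16:40 UTC, not yet validated"],
    anomalies=["traffic +30% from APAC since 14:00"],
    flapping_alerts=["node-cpu-above-80pct (fires hourly, never actionable)"],
    watch_item="replica lag on orders-db trending up, still below threshold",
))
```

The value is not the code; it is that an empty section prints "none" explicitly, so "all quiet" becomes a deliberate statement rather than a tired omission.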

Metrics That Reveal a Broken On-Call System

Most teams track MTTR. Far fewer track the metrics that explain why MTTR is high. If you only measure resolution time, you are measuring the outcome of a broken system, not the system itself.

| Metric | What It Measures | Healthy Benchmark | What It Signals When Broken |
| --- | --- | --- | --- |
| Pages per engineer per week | Raw interrupt volume | Under 5 actionable pages | Alerting stack needs an audit, not the rotation |
| Alert-to-action ratio | Alerts requiring human action | 30 to 50% actionable | Below 10% means most on-call burden is noise |
| Repeat incident rate | Incidents recurring within 30 days | Zero repeats | A postmortem that never happened |
| MTTR variance across shifts | Resolution time consistency | Minimal variance | Missing runbooks, poor handoffs, or unfamiliar engineers |
| On-call load distribution | Incident load per engineer | Evenly distributed | One engineer at 3x load is a hero problem becoming an attrition problem |
| Engineer-reported satisfaction | Monthly 1-to-5 self-reported experience | Trending upward | The earliest predictor of burnout and resignation |
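Two of these metrics fall straight out of raw page records. The sketch below assumes a hypothetical export of (engineer, required-human-action) pairs for one week; the data is invented to show both an uneven load and a 30% alert-to-action ratio.

```python
# Sketch of computing pages-per-engineer and alert-to-action ratio from a
# hypothetical weekly export of (engineer, required_human_action) records.
pages_last_week = [
    ("ana", True), ("ana", False), ("ana", False), ("ana", False),
    ("ben", True), ("ben", True), ("ben", False),
    ("ana", False), ("ana", False), ("ana", False),
]

def pages_per_engineer(pages):
    """Raw interrupt volume per engineer: exposes load imbalance."""
    counts = {}
    for engineer, _ in pages:
        counts[engineer] = counts.get(engineer, 0) + 1
    return counts

def alert_to_action_ratio(pages):
    """Fraction of pages that actually required a human to act."""
    actionable = sum(1 for _, acted in pages if acted)
    return actionable / len(pages)

print(pages_per_engineer(pages_last_week))    # {'ana': 7, 'ben': 3}
print(alert_to_action_ratio(pages_last_week)) # 0.3
```

In this invented week the ratio sits right on the healthy floor, but the load distribution (one engineer carrying more than twice the pages) is the earlier warning sign.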

If you cannot answer all six of these today, close that gap before redesigning anything. Rotation changes made without this data are guesses. For a practical guide to bringing MTTR down once you have identified where the bottlenecks sit, see How to Reduce MTTR in 2026: From Alert to Root Cause in Minutes.

The On-Call Audit: 10 Questions to Ask Your Team This Week

Before changing your rotation, adding tooling, or investing in AI-assisted triage, run this audit. Most teams feel the symptoms of a broken system long before they diagnose the causes. If engineering leaders cannot answer most of these without digging through months of reports, that absence is itself the finding.

1. What percentage of last month's pages required human action?

If most resolved themselves, the problem is alert quality, not the rotation.

2. When did your team last delete an alert?

Not tune it. Not deprioritise it. Delete it. If the answer is never, start there.

3. Does every alert link to a runbook?

Any alert without one is asking an engineer to reverse-engineer your system at 3 AM.

4. Can any engineer resolve your top five incident types?

If only one or two people can, the system still depends on heroes.

5. What is your repeat incident rate over the last 30 days?

Every repeat page is a fix that was never made permanent.

6. How long does investigation take before the fix begins?

If context-gathering takes longer than resolution, observability and documentation are the gaps.

7. Is on-call load evenly distributed?

Uneven distribution is usually a knowledge silo wearing a scheduling mask.

8. Do you have a standard handoff template used every shift?

Without one, context resets every time the pager changes hands.

9. When did a postmortem last produce a systemic fix?

Postmortems that generate no automation or permanent remediation are documentation, not improvement.

10. Would your engineers describe on-call as sustainable right now?

Ask them directly. The answer predicts your next attrition events better than any metric you track.

If several of these are difficult to answer, that is the signal. Sustainable on-call rarely comes from a single change. It comes from gradually improving the systems behind it.

When evaluating AI-assisted triage tools to reduce investigation time, it is worth understanding how purpose-built platforms compare to general-purpose AI agents. Claude Code vs. Sherlocks.ai breaks down what actually happens when you put both approaches in production at 3 AM. Similarly, if your team is assessing dedicated AI SRE platforms, Resolve AI vs. Sherlocks.ai offers a detailed side-by-side on data architecture, autonomy, and pricing.

Conclusion

Sustainable on-call is not a single fix. It is a system you build incrementally: starting with the signals that interrupt your engineers, then the knowledge they need when it does page, then the rotation that distributes the load fairly.

The teams that get this right do not have better tooling. They have better discipline around the basics: alerts tied to user impact, runbooks that actually work, handoffs that transfer context instead of resetting it, postmortems that produce systemic change.

If you take one thing from this playbook, make it the 90-day rule. Pull up your alerting stack this week. Find every alert that has not required human action in 90 days. Delete it.

What remains is closer to a signal. Build your rotation around that, and the rest gets easier.

If you are thinking about the broader role of AI in on-call operations, Being An SRE is Nothing Short of Chaotic is an honest look at why the complexity compounds and what teams are doing to tame it.

Frequently Asked Questions

What is an on-call rotation?

An on-call rotation is a schedule where engineers take turns responding to production incidents and alerts. During their shift, the on-call engineer monitors system health and investigates issues that could impact reliability or user experience.

How many engineers does sustainable 24/7 on-call require?

The minimum for single-site 24/7 coverage is eight engineers, per Google's SRE Workbook. Below that, engineers carry the pager too frequently to recover properly. For multi-site teams, six per site.

What is alert fatigue?

Alert fatigue happens when engineers are paged for events that do not require human action—transient spikes, duplicate notifications, or cascading alerts from a single root cause. Over time, engineers stop trusting the pager, which slows response to real incidents.

What is the 90-day rule?

If an alert has not required human action in 90 days, delete it. Not tune it—delete it. This rule forces teams to confront alert debt directly and has been shown to dramatically reduce paging volume within a quarter.

What is a follow-the-sun rotation?

Follow-the-sun distributes on-call coverage across time zones so each team covers only their daylight hours and hands off to the next region. It eliminates overnight pages entirely, but requires at least nine engineers across three locations and disciplined handoff practice.

What should a runbook include?

Three things at minimum: what the alert means, what to check first, and what usually fixes it. The best runbooks also include links to relevant dashboards, recent incident history, and an escalation path if the standard fix does not work.

How do you reduce alert noise?

Shift from threshold-based alerts to SLO-based alerts. If an alert does not represent user impact or a burning error budget, it should not page. Delete anything that has not required human action in 90 days.

Should developers be on-call for their own services?

Yes. When developers carry the pager for what they ship, they design systems that fail less often and resolve incidents faster because they understand the code. Shared ownership is an operational forcing function, not a punishment.

Which metrics matter most for on-call health?

Track pages per engineer per week (healthy: under five), alert-to-action ratio (healthy: 30 to 50%), and repeat incident rate (healthy: zero within 30 days). Engineer-reported satisfaction is the earliest leading indicator of burnout.

How do AI tools fit into on-call?

AI tools can automatically correlate alerts, analyse telemetry, and surface likely root causes during incidents. They reduce investigation time, but work best when teams already maintain clean alerting and reliable operational practices.

Get SRE Insights Delivered Weekly

Join the Sherlocks.ai LinkedIn newsletter for practical breakdowns on on-call operations, incident response, and AI-powered reliability engineering.
