
IT Ops vs DevOps vs SRE vs Agentic Ops

The same job, four stances toward it.

By Gaurav Toshniwal | Published on: May 27, 2026 | 12 min read
The arc of operations, in one line
IT Ops: Keep the lights on.
DevOps: Ship faster, safely, together.
SRE: Engineer reliability.
Agentic Ops: Let AI agents do the investigation.

Do it by hand → speed up delivery → engineer reliability → let agents do the toil

TL;DR

IT Ops, DevOps, SRE, and Agentic Ops are not competitors. They are four answers to one question that has gotten harder every decade: how do you keep software running for users as systems get bigger and faster?

The clearest way to tell them apart is their relationship to toil, the manual, repetitive work of running a system. IT Ops does the toil by hand. DevOps removes toil from the delivery path. SRE treats toil as a defect and engineers it away. Agentic Ops puts AI agents on the toil that is left.

This guide makes you fluent in all four, whether you write the software, run it, or pay for it. No jargon left undefined, lots of pictures.

First, the shared mental model

Before you can see how these four roles differ, you need to see what they all work on. Every piece of software you use sits on a stack of layers. Picture a restaurant: the building and kitchen are the infrastructure, the ingredients and suppliers are the services, the dishes on the menu are the applications, and getting a new dish from the test kitchen onto the menu is build and deploy.

For example, here is how each of the four roles touches the Infrastructure layer: compute, storage, and networking, whether cloud or on-prem.

IT Ops: The classic IT Ops domain (servers, networks, keeping it all up).
DevOps: Treats infrastructure as code (Terraform, Helm) instead of hand-built servers.
SRE: Designs for failure (redundancy, chaos testing, blast-radius limits).
Agentic Ops: Checks infra metrics and saturation across the fleet during an incident.

The evolution: four eras that accumulated

Here is the key thing most explanations miss. These are not rivals you choose between. They are eras, each one invented to fix a pain the previous one could not handle. Newer eras did not delete the older ones; they layered on top.

IT Ops (1990s - 2000s)
The pain it fixed

Computers became business-critical, but running them was ad hoc and unrepeatable.

What it introduced

IT Ops (and frameworks like ITIL/ITSM) formalized running and supporting production: tickets, runbooks, change management, a NOC watching dashboards.

What each one actually is

All four want the same outcome (a reliable system), so defining them by their goal blurs them together. The trick is to define each by what makes it distinct: its posture, its unit of work, and how it shows up in a real org. Each one shows both an enterprise and a startup example.

IT Ops: Keep the lights on.
Posture: Reactive and manual | Unit of work: Tickets and runbooks

IT Ops keeps production running. Classic IT Ops is broad (it can include corporate laptops and helpdesk), but the part that matters here is production operations: set up monitoring, watch alerts, and troubleshoot when something breaks. The defining trait is that it is mostly reactive and manual. A problem fires, a human picks up a ticket, follows a runbook, and fixes it. It is process-heavy and usually run by a separate team that work gets handed off to.

Enterprise

A Network Operations Center (NOC) watches dashboards; incidents flow through ServiceNow and a formal change-management process.

Startup

Whoever is on call gets paged, opens the dashboard, and fixes it by hand. The runbook is a page in Notion, if it exists.

DevOps: Ship faster, safely, together.
Posture: Proactive on delivery | Unit of work: Pipelines (CI/CD) and infrastructure as code

DevOps came from a real pain: developers used to write code and throw it over the wall to a separate operations team to deploy, and that loop was painfully slow. DevOps is, first, a culture of shared ownership (you build it, you run it) that tears down the dev-versus-ops wall. The way that culture shows up in tooling is the pipeline: continuous integration, infrastructure as code, and continuous deployment. DevOps owns the road from a commit to running safely in production.

Enterprise

A platform or release-engineering team maintains the CI/CD pipelines and golden paths that every product team ships through.

Startup

The DevOps engineer wires up GitHub Actions and Terraform so the whole team can deploy to the cloud many times a day.

SRE: Engineer reliability.
Posture: Proactive on reliability | Unit of work: SLOs and error budgets

SRE (Site Reliability Engineering) came out of Google with a simple provocation: what if you asked a software engineer to design your operations function? The answer is that you solve ops problems with software, and you treat reliability as something you engineer, not something you hope for. Reliability gets measured (SLIs and SLOs), risk gets governed (error budgets), and toil gets capped and automated away. It is an ongoing control loop, not a one-time setup: measure, govern, respond, learn, automate, repeat.

Enterprise

A dedicated SRE org owns an error-budget policy: burn the budget and feature releases freeze until reliability is restored.

Startup

The senior engineer who owns reliability sets a couple of SLOs, watches the burn, and automates the most painful manual chores first.
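
To make the measure-and-govern part of that loop concrete, here is a minimal sketch of a request-based SLO check with an error-budget freeze rule. The names `SloWindow` and `releases_frozen`, and the choice to count requests rather than minutes, are assumptions made for illustration; real error-budget policies vary from team to team.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one measurement window (hypothetical structure)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int

    def sli(self) -> float:
        """Measured success ratio (the SLI) for this window."""
        if self.total_requests == 0:
            return 1.0  # no traffic, so nothing has failed
        return 1 - self.failed_requests / self.total_requests

    def budget_consumed(self) -> float:
        """Fraction of the error budget spent; 1.0 or more means it is gone."""
        allowed_failures = (1 - self.slo_target) * self.total_requests
        if allowed_failures <= 0:
            return 0.0 if self.failed_requests == 0 else float("inf")
        return self.failed_requests / allowed_failures


def releases_frozen(window: SloWindow) -> bool:
    """Illustrative policy: freeze feature releases once the budget is spent."""
    return window.budget_consumed() >= 1.0


if __name__ == "__main__":
    window = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"SLI: {window.sli():.4%}")
    print(f"Budget consumed: {window.budget_consumed():.0%}")
    print(f"Releases frozen: {releases_frozen(window)}")
```

The point of the sketch is that the freeze decision falls out of two counters and one target, which is what lets a team argue about a number instead of a feeling.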

Agentic Ops: Let AI agents do the investigation.
Posture: Autonomous investigation, human oversight | Unit of work: AI agents plus guardrails

Agentic Ops is the newest stance. SRE is the gold standard, but great SREs are scarce and expensive, so for most teams the reactive grind (investigating every alert and chasing root cause across logs, metrics, and deploys) still lands on humans. That is the exact toil SRE wanted to engineer away. Agentic Ops puts AI agents on that work: they investigate and perform root cause analysis autonomously the moment an alert fires, then hand a person the cause and the next best action, acting only inside the guardrails the team has approved. It is often delivered as an AI SRE.

Enterprise

AI agents run first-line investigation across hundreds of services, draft the RCA, and escalate to humans with full context for high-stakes decisions.

Startup

A small team gets Google-grade investigation without a Google-grade headcount: the agent does the 2am digging, the human approves the fix.
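
The "agents plus guardrails" idea can be sketched as a simple allowlist check. This is an illustrative model only, not how any particular product implements it: `ProposedAction`, `Guardrails`, `Decision`, and the action names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Decision(Enum):
    AUTO_RUN = "auto_run"              # inside the approved guardrails
    NEEDS_APPROVAL = "needs_approval"  # a human must sign off first


@dataclass(frozen=True)
class ProposedAction:
    """Something the agent wants to do (hypothetical structure)."""
    name: str    # e.g. "query_metrics" or "rollback_deploy"
    target: str  # the service the action applies to


@dataclass(frozen=True)
class Guardrails:
    """Actions the team has pre-approved for the agent to run on its own."""
    auto_approved: FrozenSet[str]

    def decide(self, action: ProposedAction) -> Decision:
        # Anything not explicitly pre-approved is escalated to a human.
        if action.name in self.auto_approved:
            return Decision.AUTO_RUN
        return Decision.NEEDS_APPROVAL


if __name__ == "__main__":
    rails = Guardrails(auto_approved=frozenset({"collect_logs", "query_metrics"}))
    for name in ("query_metrics", "rollback_deploy"):
        action = ProposedAction(name=name, target="checkout-service")
        print(name, "->", rails.decide(action).value)
```

Defaulting to escalation keeps read-only investigation autonomous while anything that changes production waits for a person, which matches the split described above: the agent does the digging, the human approves the fix.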

The sharpest difference: their relationship to toil

If you remember one thing, remember this. Toil is the manual, repetitive, reactive work of running a system, the kind that grows as the system grows. The single cleanest way to separate the four eras is to ask: what does each one do with toil?

IT Ops: Is the toil. Humans do the manual, repetitive work by hand. Toil still done by humans: 100%.

The human-toil figures are illustrative, to show the direction of travel, not measured benchmarks.

Key differences at a glance

Beyond toil, the four eras differ along a few clean axes. Reading across each row tells the story of how operations matured.

Posture
IT Ops: Reactive
DevOps: Proactive on delivery
SRE: Proactive on reliability
Agentic Ops: Autonomous, supervised

Unit of work
IT Ops: Tickets
DevOps: Pipelines
SRE: SLOs
Agentic Ops: Agents + guardrails

Who owns it
IT Ops: Separate ops team
DevOps: Shared dev + ops
SRE: Embedded reliability engineers
Agentic Ops: Agents working for the team

How success is measured
IT Ops: Tickets closed, uptime
DevOps: Deploy frequency, lead time
SRE: SLO attainment, error-budget burn
Agentic Ops: Toil removed, MTTR collapsed

Key similarities (they overlap more than turf wars admit)

Same north star

All four exist to keep production serving users. The disagreement is about method, not goal.

All live on observability

Metrics, logs, and traces are the shared bloodstream. You cannot run, improve, or automate what you cannot see.

All trend toward automation

Each era automates more of the previous era's manual work. The direction of travel is always less toil.

Modern teams blend them

A startup's DevOps engineer often does all three older jobs at once. As Google puts it, class SRE implements DevOps.

The honest take: these are overlapping disciplines and accumulated eras, not mutually exclusive boxes on an org chart.

The whole picture in one table

Dimension | IT Ops | DevOps | SRE | Agentic Ops
One line | Keep the lights on | Ship faster, safely, together | Engineer reliability | Let agents do the investigation
Posture | Reactive, manual | Proactive on delivery | Proactive on reliability | Autonomous, supervised
Unit of work | Tickets, runbooks | Pipelines, IaC | SLOs, error budgets | AI agents, guardrails
Toil | Is toil | Removes delivery toil | Engineers toil away | Automates the investigation toil
Who owns it | Separate ops team | Shared dev + ops | Embedded reliability engineers | Agents working for the team
Success metric | Tickets closed, uptime | Deploy frequency, lead time | SLO attainment, error-budget burn | Toil removed, MTTR collapsed
Born | ITIL / ITSM era | ~2009, dev/ops wall | Google ~2003, book 2016 | Emerging now
Startup face | Whoever is on call | The DevOps engineer | The senior eng who owns reliability | An AI SRE on the team

Make it concrete: try an error budget

The SRE idea that trips people up most is the error budget. It is easier to feel than to define. Pick a reliability target and see exactly how much downtime you are allowed to spend.

Downtime budget / month: 43.2 min
Downtime budget / year: 8 hr 46 min

With a 99.9% SLO, you are allowed about 43.2 min of downtime per month. That allowance is your error budget. Stay inside it and you can ship features fast. Burn through it and the team freezes releases to invest in reliability. One number, and the endless speed-versus-stability argument is settled.
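
The arithmetic behind those figures is short enough to sketch. The function names are hypothetical, and the 30-day month and 365-day year are assumptions, chosen because they reproduce the 43.2 min and 8 hr 46 min figures quoted here.

```python
def downtime_budget_minutes(slo_percent: float, window_days: float) -> float:
    """Minutes of downtime allowed in a window for a given availability SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    allowed_fraction = (100 - slo_percent) / 100  # the error budget as a fraction
    return allowed_fraction * window_days * 24 * 60


def format_minutes(minutes: float) -> str:
    """Render a minute count as whole hours and minutes."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours} hr {mins} min"


if __name__ == "__main__":
    # Assumed windows: a 30-day month and a 365-day year.
    for slo in (99.0, 99.9, 99.99):
        per_month = downtime_budget_minutes(slo, window_days=30)
        per_year = downtime_budget_minutes(slo, window_days=365)
        print(f"{slo}% SLO: {per_month:.1f} min/month, {format_minutes(per_year)}/year")
```

Each extra nine cuts the budget by a factor of ten, which is why the choice of target deserves more debate than any single outage.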

The rise of Agentic Ops

Each era in this story solved the previous era's bottleneck. IT Ops kept systems alive by hand. DevOps sped up and shared delivery. SRE made reliability a measured, engineered discipline. Every step pushed more toil out of human hands.

But SRE left one stubborn piece of toil behind. The discipline is brilliant, yet great SREs are rare and expensive, so the reactive investigation work, the 2am digging through logs, metrics, and deploys to find the root cause, still lands on tired humans. That is the gap Agentic Ops fills.

SRE said: treat reliability as an engineering problem and solve operations with software. Agentic Ops is the next step: let AI agents do the investigation and the toil, so reliability engineering is finally within reach for every team, and your humans focus on the architecture and the SLOs instead of the firefight.

This is exactly what an AI SRE does. Sherlocks.ai, for example, dispatches an army of specialized AI agents that investigate every alert, draft the root cause analysis, and hand your team the next best action, all inside guardrails you approve. It is Agentic Ops in practice.

Frequently Asked Questions

What is the difference between IT Ops, DevOps, and SRE?

They are three stances toward the same job: keeping software running for users. IT Ops is mostly reactive and manual, organized as a separate team that gets a ticket and follows a runbook when something breaks. DevOps is a culture of shared ownership between developers and operations that shortens and de-silos the path from code to production, with CI/CD pipelines and infrastructure as code as its visible artifacts. SRE (Site Reliability Engineering) treats operations as a software engineering problem: it measures reliability with SLOs, governs risk with error budgets, and caps and automates away toil. The sharpest single difference is their relationship to toil: IT Ops is toil, DevOps removes toil from the delivery path, and SRE treats toil as a defect to engineer away.

Is SRE the same as DevOps?

No, but they are closely related. Google's own framing is class SRE implements DevOps. DevOps is the philosophy and culture (shared ownership, fast and safe delivery, no silos). SRE is a concrete, standing engineering discipline that puts that philosophy into daily practice using specific instruments: Service Level Objectives, error budgets, toil caps, and blameless postmortems. Think of DevOps as the goal and SRE as one rigorous way a team actually runs it every day.

What are an SLO and an error budget?

An SLO (Service Level Objective) is a reliability target you commit to, for example 99.9% of requests succeed in a given month. The gap below 100% is your error budget: the amount of unreliability you are allowed to spend. At 99.9%, that budget is about 43 minutes of downtime per month. While you are within budget, you ship features fast. If you burn through it, you stop shipping and invest in reliability. The error budget turns the endless speed-versus-stability argument into a single number everyone agrees on.

What is Agentic Ops?

Agentic Ops (often delivered as an AI SRE) is the next stage in the evolution of operations. SRE is the gold standard, but great SREs are scarce and expensive, so the reactive grind of investigating alerts and chasing root cause across logs, metrics, and deploys still falls on humans. That is the exact toil SRE wanted to engineer away. Agentic Ops puts AI agents on that work: they investigate and perform root cause analysis autonomously the moment an alert fires, then hand a human the cause and the next best action, operating inside guardrails the team defines. It brings the SRE discipline within reach of every team, not just the ones who can hire a Google-grade reliability org.

Which one should my team adopt?

It is rarely either-or. These are overlapping disciplines and eras, not mutually exclusive job boxes. Most teams blend them: a startup's single DevOps engineer often does all three jobs at once, while a large enterprise may run a formal IT Ops or NOC function alongside DevOps platform teams and a dedicated SRE org. The useful question is not which label to adopt but how reactive and manual your operations currently are, and how much of that toil you want to automate or hand to AI agents.

Can non-technical leaders use this distinction?

Yes, and they should. The distinction is fundamentally about cost, risk, and leverage, not just tooling. IT Ops scales by adding people (cost grows with the system). DevOps and SRE scale by adding engineering leverage (automation and measurement, so the system can grow without headcount growing one-for-one). Agentic Ops pushes leverage further by automating the investigation work itself. For a leader, the question is simple: is reliability something you pay for linearly in people and downtime, or something you engineer down over time?

Sherlocks.ai

Building a more resilient, autonomous ecosystem without the strain of traditional on-call work. © 2026 Sherlocks.ai. All rights reserved.