On AI SRE Tools: What Actually Works vs. What's Just Marketing Fluff
Look, we need to talk. You're probably drowning in vendor demos right now, watching slick presentations about how AI will solve all your SRE problems. Everyone's promising the moon, but here's what they're not telling you: most of these tools are very good at exactly one thing and mediocre at everything else.
I’ll be upfront about something: as the founder of Sherlocks.ai, I’m obviously biased. But I didn't start this company because I thought the market needed another monitoring tool.
Let me tell you the story of why Sherlocks exists, and why I think we're solving a fundamentally different problem than everyone else in this space.
PagerDuty: The Alert Fatigue "Solution" That Misses the Point
As an SRE, you know this pain well. It's midnight, you're getting 15 Slack notifications about database CPU spikes, and half of them are probably noise. PagerDuty Intelligence has built their entire value proposition around this nightmare.
PagerDuty will tell you they reduce alert fatigue by 91%. That sounds good until you realize what they actually mean: instead of getting 100 alerts, you get 9. Great! But guess what? You're still getting woken up to investigate those 9 alerts manually. They've made you smarter about your problems, but the problems still exist.
Last month I watched a team spend 40 minutes debugging what turned out to be the exact same Redis connection issue they'd solved back in March. PagerDuty dutifully grouped all the related alerts, escalated properly, and even suggested some investigation paths. But they still had to drag themselves out of bed and fix the damn thing manually.
PagerDuty's Process Automation tries to bridge this gap, and I must give them credit for that. They'll absolutely execute predefined runbooks when incidents match specific patterns. But step outside those exact boundaries—say your database CPU spike happens during a deployment instead of normal operations—and you're back to manual mode faster than you can say “edge case.”
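To make that concrete, here's a toy sketch in Python — made-up service and runbook names, not PagerDuty's actual API — of how rule-based runbook automation behaves: the fix fires only when an incident matches the exact pattern someone wrote down in advance, and everything else falls through to a human.

```python
# Hypothetical sketch of rule-based runbook automation (not any vendor's real API).
# The rule fires only on the exact pattern it anticipated; edge cases page a human.

from dataclasses import dataclass

@dataclass
class Incident:
    service: str
    metric: str
    during_deployment: bool

def match_runbook(incident: Incident) -> str | None:
    """Return a runbook name only when the incident matches the predefined pattern."""
    if (
        incident.service == "postgres-primary"
        and incident.metric == "cpu_utilization"
        and not incident.during_deployment   # nobody wrote a rule for the deploy case
    ):
        return "restart_connection_pool"
    return None  # no match: fall back to paging the on-call engineer

# The same CPU spike during a deployment misses the rule and wakes someone up anyway.
print(match_runbook(Incident("postgres-primary", "cpu_utilization", True)))  # None
```

The automation isn't wrong; it's just brittle. Every condition you didn't anticipate is another 3 AM page.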
New Relic AI: Beautiful Insights, Manual Actions
New Relic AI has built some genuinely impressive monitoring intelligence. Their anomaly detection is sophisticated, their correlation capabilities can connect seemingly unrelated performance issues, and honestly? The dashboards are beautiful.
But here's what bothers me about this whole approach: it's 2025, and we're still expecting humans to stare at dashboards to understand what's happening with our systems. Why are we building monitoring intelligence when we should be building monitoring action?
Don't get me wrong; New Relic Applied Intelligence is legitimately useful. When your API response times spike, it'll show you exactly which deployment, configuration change, or infrastructure event might be related. The AI can spot patterns that would take human analysts hours to identify.
But then what? You still need to investigate. You still need to implement a fix. You still need to validate that the fix worked. We've essentially built a really smart detective that never makes any arrests.
Maybe it's just me, but I am a little miffed at how their auto-refresh feature works. Every 30 seconds, the whole dashboard reloads, and I lose my place in whatever investigation I was doing. Small thing, but it drives me nuts.
Datadog Watchdog: Brilliant Analysis, Zero Follow-Through
Datadog Watchdog might be the most technically impressive of the bunch. The correlation analysis it does is genuinely sophisticated. It's like a forensic team that can reconstruct the crime scene perfectly and hand you a comprehensive case file.
When your application latency increases, Watchdog will correlate it with database query performance, memory usage patterns, and even external service dependencies. Their Bits AI feature lets you literally ask "Why was our checkout service slow yesterday?" and get intelligent responses based on your actual telemetry data.
Here's a specific example of what I mean. A team I worked with had a weird latency spike that only affected their mobile API endpoints. Watchdog correctly identified that it correlated with increased memory usage in their Redis cluster, a recent deployment of their auth service, and some third-party payment processor issues. It even ranked the most likely root causes.
But here's the thing—and this is where I get frustrated with all these tools—correlation isn't causation, and even perfect correlation doesn't equal resolution. Datadog can tell you with high confidence what's probably wrong, but it doesn't know your system's quirks, your team's previous solutions, or what actually works in your specific environment.
That team spent another 25 minutes manually testing each of Watchdog's suggested root causes before finding the real issue (a memory leak in a library they'd updated). The analysis was spot-on, but they still had to do all the actual work.
The Integration Reality
Something that never comes up in those polished vendor demos is this: you're probably going to end up using all three of these tools together. PagerDuty for alerting and escalation, New Relic or Datadog for monitoring and analysis, plus whatever automation you can cobble together with bash scripts and Terraform.
Each tool has its own interface, its own data model, and its own idea of what constitutes an "incident." So when something breaks at 3 AM, you're jumping between three different systems, copying and pasting incident IDs, and losing context every time you switch tools.
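That "whatever automation you can cobble together" usually ends up as a pile of glue scripts. Here's a minimal, hypothetical example of the genre: a forwarder that takes an alert payload from your monitoring tool and pushes it into PagerDuty's Events API v2 so the incident IDs at least line up. The alert field names and the env var are invented; adapt them to whatever your webhook template actually sends.

```python
# Minimal glue-script sketch: forward a monitoring alert into PagerDuty's
# Events API v2. Alert field names are hypothetical; so is the env var name.

import os
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def forward_alert(alert: dict) -> str:
    """Create (or dedupe into) a PagerDuty incident from a monitoring alert payload."""
    event = {
        "routing_key": os.environ["PD_ROUTING_KEY"],
        "event_action": "trigger",
        "dedup_key": alert.get("alert_id", "unknown"),  # keeps retries from double-paging
        "payload": {
            "summary": alert.get("title", "Anomaly detected"),
            "source": alert.get("service", "unknown-service"),
            "severity": alert.get("severity", "warning"),
            "custom_details": {"monitor_url": alert.get("url", "")},
        },
    }
    resp = requests.post(PAGERDUTY_EVENTS_URL, json=event, timeout=10)
    resp.raise_for_status()
    return resp.json()["dedup_key"]
```

It works, but notice what it doesn't do: it carries an ID across tools, not context.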
Your incident timeline ends up a mess:
- Minute 0: Datadog detects an anomaly
- Minute 2: PagerDuty creates an incident and pages the on-call engineer
- Minute 8: Engineer wakes up, acknowledges the page (if you’re lucky; sometimes it takes longer)
- Minute 12: Engineer logs into New Relic to investigate performance metrics
- Minute 18: Engineer switches to Datadog to correlate with infrastructure data
- Minute 25: Engineer identifies likely root cause (maybe)
- Minute 35: Engineer implements fix manually
- Minute 45: Engineer updates PagerDuty incident status and goes back to bed angry
Notice how much time gets lost in context switching and manual investigation? The tools are individually intelligent, but they don't share intelligence with each other in any meaningful way. Three smart people in one room, and none of them are talking to each other.
Sherlocks.ai: A Different Approach
So here's where I tell you about Sherlocks.ai, and before you roll your eyes and think "here comes the sales pitch," let me be clear about something: we're not perfect either. Our initial setup is more involved than I'd like, and there's a learning period of 2-3 months before the system has enough context to run autonomously.
But we're doing something fundamentally different that actually addresses the core problem. Instead of just organizing alerts or creating prettier dashboards, Sherlocks reads your previous incident reports, your team's Slack conversations, and your documentation. It learns your system's personality.
Traditional approach:
- Anomaly detected → Alert fired → Engineer paged → 15 minutes of investigation → Manual fix implementation → 45-minute total resolution time
Sherlocks approach:
- Anomaly detected → Historical context retrieved → Previous solution applied automatically → Success verified → Team notified → 3-minute total resolution time
The difference is institutional memory. PagerDuty knows what's alerting right now. Datadog knows what correlates with what. But Sherlocks knows what actually worked last time this happened to your specific system, with your specific configuration, solved by your specific team.
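If it helps to picture that loop as code, here's a deliberately simplified sketch — hypothetical helper names, not our actual pipeline — of detect, recall, re-apply, verify, and escalate the moment automation can't prove it worked:

```python
# Deliberately simplified sketch of the loop described above (hypothetical names,
# not Sherlocks' real implementation): recall a precedent, re-apply it, verify,
# and hand back to a human when the automation can't prove it worked.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PastFix:
    incident_id: str
    apply: Callable[[], bool]      # re-runs the remediation, returns success/failure
    rollback: Callable[[], None]   # undoes it if verification fails

def handle_anomaly(
    anomaly: dict,
    find_precedent: Callable[[dict], Optional[PastFix]],  # lookup into incident history
    verify_recovery: Callable[[dict], bool],              # did the metrics actually recover?
    notify: Callable[[str], None],
    page_oncall: Callable[[str], None],
) -> None:
    precedent = find_precedent(anomaly)
    if precedent is None:
        page_oncall("no trusted precedent for this anomaly")  # unknown territory: wake a human
        return

    if precedent.apply() and verify_recovery(anomaly):
        notify(f"auto-resolved using the fix from incident {precedent.incident_id}")
    else:
        precedent.rollback()
        page_oncall(f"fix from {precedent.incident_id} did not hold")
```

The branch that matters most is the last one: automation that can't verify its own success has to hand the incident back to a human, along with what it already tried.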
Building this wasn't just a matter of connecting APIs and writing some correlation logic. We had to solve some genuinely hard problems. How do you determine if two incidents are "similar enough" that the same solution should apply? How do you know when to apply an automated fix versus escalating to humans? What happens when an automated solution doesn't work? We had to work through every one of those questions before we could trust the system to act on its own.
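For the first of those questions, here's one illustrative way to score "similar enough" — shown purely as an example, not our production approach: extract a bag of signals from each incident, compare them with Jaccard overlap, and only trust matches above a threshold you'd tune to your own risk tolerance.

```python
# Illustrative similarity scoring (not a production approach): compare the bag of
# signals pulled from two incident descriptions and gate automation on a threshold.

def extract_signals(incident_text: str) -> set[str]:
    """Crude stand-in for real feature extraction: lowercase tokens longer than 3 chars."""
    return {w.strip(".,:()").lower() for w in incident_text.split() if len(w) > 3}

def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two signal sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

past = extract_signals("Redis connection pool exhausted, checkout latency spike, auth-service deploy")
new = extract_signals("Checkout latency spike after auth-service deploy, Redis connections maxed out")

score = jaccard(past, new)
print(f"similarity: {score:.2f}")              # 0.50 for this pair

AUTO_FIX_THRESHOLD = 0.8                       # below this, escalate instead of auto-applying
print("auto-apply" if score >= AUTO_FIX_THRESHOLD else "escalate")   # prints "escalate"
```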
What This Actually Means for Your Sanity
We're not saying these other tools are bad. PagerDuty's alert management is genuinely sophisticated. New Relic's observability platform is comprehensive. Datadog's anomaly detection is impressive. And any of them could ship new features tomorrow that change this comparison.
The point is this: we need to stop optimizing for faster incident response and start optimizing for making incident response unnecessary.
Your senior engineers are still spending their nights troubleshooting problems. Your on-call rotation is still stressful. Your team is still fighting the same fires over and over because none of these tools actually learn from your specific problem-solving approaches.
We had this one issue last year where Datadog correctly identified an anomaly in our API response times, but it turned out to be caused by a developer testing something in production. (Don't ask. We've had conversations.) The traditional tools handled it perfectly—detected the issue, alerted appropriately, provided good analysis. But we still had to wake someone up to figure out it was a false alarm.
With Sherlocks, that same scenario would have been cross-referenced against our incident history, recognized as a testing pattern, and either ignored or handled automatically based on our previous decisions about similar events.
The Bottom Line
I genuinely believe we're at an inflection point in SRE tooling. As an industry, we've largely solved monitoring and alerting; detecting issues is no longer the hard part. The question is what we do next.
Do we keep building better interfaces for humans to respond to incidents? Or do we build systems that learn how humans respond to incidents and start doing it for us?
Most of the tools on the market are optimizing for the former. At Sherlocks.ai, we’re betting on the latter. Your mileage may vary, and there are definitely scenarios where the traditional approach makes more sense. But if you’re tired of being called in to fix the same problems over and over, let’s hop on a call.