Akshat Sandhaliya

Co-founder and CTO of Sherlocks.ai. A former CTO who lived through the 3 a.m. pages, Akshat writes about AI for SRE, observability, and the engineering behind autonomous incident investigation.

Posts by Akshat Sandhaliya

What Happened at KubeCon India 2026? A Complete Recap

KubeConCNCFSREDevOpsKubernetesCommunityOpen SourceCloud Native

June 21, 2026

What Happened at KubeCon India 2026? A Complete Recap

A complete, simple recap of KubeCon India 2026 in Mumbai. The stat everyone repeated, platform engineering, security, the show floor, community, and the AI SRE

How Do You Detect Hallucinations in Production? A Monitoring Guide for AI Systems

ObservabilityAIHallucinationReliabilitySREProductionIncident Management

June 16, 2026

How Do You Detect Hallucinations in Production? A Monitoring Guide for AI Systems

Hallucinations fail users silently. No error. No alert. Just a confident wrong answer your users act on. Here is how to catch them before they cause damage.

The Four Golden Signals of SRE: Latency, Traffic, Errors, and Saturation Explained

SREPerformanceReliability

June 10, 2026

The Four Golden Signals of SRE: Latency, Traffic, Errors, and Saturation Explained

The four golden signals of SRE: latency, traffic, errors, and saturation are explained with correct PromQL alerting patterns and the Signal-to-Investigation Gap

Why AI Agents Fail in Production: The Agent Failure Stack Explained

SREDevOpsAutomationPerformanceObservability

June 3, 2026

Why AI Agents Fail in Production: The Agent Failure Stack Explained

AI agents fail in production not because models are weak, but because the systems around them are incomplete. Learn the Agent Failure Stack — a six-layer framework for understanding where agents break, why standard observability misses it, and how to fix each layer before it compounds.

Blameless Postmortems Explained: Lessons From Real Outages

SREIncident ManagementDevOpsBlameless Postmortems

May 27, 2026

Blameless Postmortems Explained: Lessons From Real Outages

Most engineering teams have blameless postmortem templates. Very few have blameless cultures. This guide explores what experienced practitioners at Etsy, HubSpot, Atlassian, Google, and Honeycomb actually learned when they tried to build incident review cultures that stick.

Agent Observability for Autonomous AI SREs in 2026

ObservabilityAI SREAI AgentsOpenTelemetry

May 25, 2026

Agent Observability for Autonomous AI SREs in 2026

Traditional APM wasn't built for AI agents. What agent observability means for autonomous AI SREs in 2026: the semantic gap, the market, and how to start.

Four Paths to AI-Driven Reliability: Native, OSS, Hybrid, and Agentic SRE Stacks

SREDevOpsReliability

May 14, 2026

Four Paths to AI-Driven Reliability: Native, OSS, Hybrid, and Agentic SRE Stacks

Compare native, OSS, hybrid, and agentic AI SRE approaches. Learn the AI-SRE Maturity Curve and choose the right stack to reduce MTTR.

How Complex Systems Fail: An SRE Perspective

SREReliabilityPost-mortemsIncident Response

April 29, 2026

How Complex Systems Fail: An SRE Perspective

Richard Cook's 18 observations on complex system failure, paired line by line with the SRE translation. Medicine and aviation on the left, distributed systems on the right.

The Hallucination Gap: Why General LLMs Fail at Kubernetes RCA

PerformanceSRERCAKubernetes

April 25, 2026

The Hallucination Gap: Why General LLMs Fail at Kubernetes RCA

LLMs can’t debug your Kubernetes cluster. Discover the Hallucination Gap and why AI without live system context produces misleading root cause analysis.

Three Approaches to AI SRE: How Your Telemetry Philosophy Shapes Everything

AutomationObservabilityReliabilitySRE

April 20, 2026

Three Approaches to AI SRE: How Your Telemetry Philosophy Shapes Everything

Three philosophies define today's AI SRE tools: work with existing telemetry, collect your own, or assume no monitoring. Here is how to choose the right fit.

Observability Trend in 2026: More Data, Fewer Answers

ObservabilityAutomation

April 9, 2026

Observability Trend in 2026: More Data, Fewer Answers

Observability in 2026 is more expensive than ever but still failing teams during incidents. Learn why the Visibility-Understanding Gap exists, what the industry is getting wrong, and what is actually changing.

Best Incident Response Platforms for DevOps (2026 Guide)

SREDevOpsIncident Management

March 30, 2026

Best Incident Response Platforms for DevOps (2026 Guide)

Compare the best incident response platforms for DevOps in 2026. Learn the 4-layer IR stack, top tools by category, and how to reduce MTTR fast.

The On-Call Playbook for 2026: How to Build Sustainable Rotations

ReliabilityObservabilityOn callSRE

March 9, 2026

The On-Call Playbook for 2026: How to Build Sustainable Rotations

A practical guide to sustainable on-call rotations: reduce alert fatigue, design better alerts, choose the right rotation model, and improve MTTR.

Traditional SRE vs Modern SRE: What Every Engineering Leader Needs to Know in 2026

SREDevOpsReliabilityObservabilityIncident Management

February 25, 2026

Traditional SRE vs Modern SRE: What Every Engineering Leader Needs to Know in 2026

Traditional SRE vs Modern SRE: how the discipline has evolved from reactive runbooks to AI-driven, autonomous reliability. A practical guide for CTOs and engineering leaders on SLOs, AIOps, platform engineering, and what to do next.

Alert on Causes, Not Symptoms: The Fastest Way to Reduce MTTR

AlertingSREMonitoringObservabilityIncident Management

February 5, 2026

Alert on Causes, Not Symptoms: The Fastest Way to Reduce MTTR

Learn why cause-based alerting eliminates 10-35 minutes of investigation time per incident. A deep dive into building alerting systems that actually work.

AI SRE Incident Triage and Root Cause Analysis Demo

SREDevOpsIncident ManagementAlertingAutomation

January 20, 2026

AI SRE Incident Triage and Root Cause Analysis Demo

Watch a demo of Sherlocks.ai automatically investigating a critical production alert, identifying the real root cause, and recommending actionable fixes to speed up incident resolution.

DevOpsSREToolscomparison2026

February 19, 2026

Best SRE and DevOps Tools for 2026

Compare 30+ SRE and DevOps tools for 2026 across CI/CD, monitoring, incident management, Kubernetes, and AI. Includes pricing, integration depth, and which tools actually work together.

What Should Be Your N+1 Tool for Predictable Uptime in 2026?

SREReliabilityDevOpsAI Tools

January 13, 2026

What Should Be Your N+1 Tool for Predictable Uptime in 2026?

You already have dashboards, logs, traces, and alerts. The missing piece? An AI agent that connects them all during incidents. Learn why your N+1 tool is the key to predictable uptime in 2026.

What’s an AI SRE, and What Does it Address?

SREMonitoring

January 17, 2026

What’s an AI SRE, and What Does it Address?

AI SRE agents investigate incidents autonomously, correlating logs, metrics, and code changes in seconds. Learn what makes AI SRE possible now and how to evaluate tools for your team.

January 14, 2026

Sherlocks.ai vs k8sgpt vs RunWhen – A Straight-Up Field Report

How is Sherlocks.ai different from k8sgpt or RunWhen? A field report comparing scope, production readiness, and what each tool actually does when an incident hits your Kubernetes cluster.

January 14, 2026

The Future of SRE: AI-Powered Incident Management

The future of SRE is autonomous — AI agents now handle alert triage, root cause analysis, and remediation in minutes. Learn how AI is reshaping the SRE role in incident management for 2026 and beyond.