Sherlocks.ai Documentation

Introduction

Learn about Sherlocks.ai and how it transforms SRE operations

What is Sherlocks.ai?

Sherlocks.ai is an AI-powered SRE platform that acts as an autonomous reliability engineer for your team. It continuously monitors your infrastructure, applications, and services to detect issues, perform root cause analysis, and provide actionable insights, all in real time through Slack.

Think of Sherlocks as having expert SREs who work 24/7/365, never sleep, and have perfect memory of every incident, deployment, and system behavior across your entire stack.

Why Sherlocks.ai Exists

The Problem:

  • Modern systems are too complex for manual monitoring
  • Mean Time To Resolution (MTTR) is increasing as systems grow
  • SRE teams spend 60-80% of their time on toil and firefighting
  • Incident context is scattered across multiple tools and platforms
  • Knowledge is siloed in individual team members' heads
  • Post-mortems are written but rarely referenced

The Solution:

Sherlocks consolidate all your telemetry, learn from every incident, and provide instant, context-aware analysis when things go wrong. They don't replace your SRE team; they amplify it.

High-Level Value Proposition

Reduce MTTR by 70%

Get instant root cause analysis instead of hours of investigation

Improve Customer Satisfaction (CSAT)

Resolve Engineering Support tickets faster by cutting the back-and-forth with engineering and reaching the RCA sooner

Institutional Memory

Never lose incident knowledge when team members leave

Slack-Native

Work where your team already collaborates

Architecture Summary

Sherlocks offers flexible deployment models to meet your security and compliance requirements:

SaaS

Fully managed cloud service with Watson agent deployed in your VPC for secure data access

Self-Hosted

Deploy the entire Sherlocks platform in your own infrastructure

Hybrid

Mix and match components based on your security requirements

[Figure: Sherlocks Platform Architecture Diagram]

Core Concepts

Understanding the building blocks of Sherlocks

Supported Integrations

Sherlocks integrate with your existing tools and infrastructure

Cloud Providers

AWS

EC2, RDS, Lambda, S3, CloudWatch, ECS

Google Cloud Platform

Compute Engine, Cloud SQL, GKE, Cloud Monitoring

Microsoft Azure

VMs, Azure SQL, AKS, Azure Monitor

Kubernetes

Kubernetes

Pods, Deployments, Services, Events, Logs, Resource Metrics

Helm

Release tracking and version management

Datastores

MySQL

Query performance, replication status, deadlocks

PostgreSQL

Query stats, connection pools, replication lag

MongoDB

Operations, index stats, replication state

Redis

Memory usage, client connections, keyspace stats

Cassandra

Node health, compaction metrics, latency

Message Queues

Apache Kafka

Consumer lag, partition health, broker metrics

RabbitMQ

Queue metrics, bindings, node health

Amazon SQS

Queue depth, message age, DLQ stats

Azure Service Bus

Message throughput, DLQ monitoring

CI/CD & Version Control

GitHub Actions

Workflow runs, failures, deployment tracking

Jenkins

Build history, job failures, pipeline correlation

Azure Pipelines

Pipeline runs, logs, artifacts

GitHub

Commits, PRs, branches, deployment events

Observability & APM

Prometheus

Metrics, time-series, alert rules

Datadog

Infrastructure metrics, APM traces, logs, dashboards

New Relic

APM, error tracking, transaction metrics

Sentry

Error events, performance traces

Coralogix

Log aggregation, queries, correlations

Logging

Elasticsearch (ELK)

Log search, aggregations, cluster health

Coralogix

Centralized logging and analysis

Loki

Log aggregation, queries, alerts

Collaboration

Slack

Incident channels, thread analysis, bot interactions

Microsoft Teams

Coming soon

How Sherlocks.ai Works

Understanding the end-to-end flow from data ingestion to incident resolution

1. Slack Ingestion

Sherlocks monitor your Slack workspace for incident-related conversations, questions, and alerts. They learn from team discussions, incident channels, and post-mortems to build contextual understanding.

@sherlocks why is the API slow?

2. Telemetry Access for RCA via Watson

During an RCA, the Watson agent accesses telemetry from your infrastructure as needed to provide context and insights:

  • Metrics from Prometheus, Datadog, CloudWatch
  • Logs from ELK, Coralogix, or cloud logging
  • Traces from APM tools
  • Infrastructure state from Kubernetes, cloud APIs
  • Database and queue health metrics
  • CI/CD pipeline events
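
As a rough illustration of what read-only metric access can look like, the sketch below builds a Prometheus instant-query URL and parses a response. The `/api/v1/query` endpoint and the response shape follow the Prometheus HTTP API; the host name, the `service` label, the sample body, and both function names are assumptions made for this example. It does not describe how Watson itself is implemented.

```python
import json
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, promql: str, at_unix: int) -> str:
    """Return the URL for a Prometheus instant query evaluated at a given time."""
    params = urlencode({"query": promql, "time": at_unix})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_vector(body: str) -> dict:
    """Map each series' `service` label (assumed to exist) to its sample value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    values = {}
    for series in payload["data"]["result"]:
        service = series["metric"].get("service", "unknown")
        # Instant-vector samples are [unix_timestamp, "value-as-string"].
        values[service] = float(series["value"][1])
    return values


# Hand-written sample response for illustration, not real data.
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"service": "api"}, "value": [1700000000, "0.95"]},
            {"metric": {"service": "web"}, "value": [1700000000, "0.12"]},
        ],
    },
})
```

Only GET requests against the query API are needed here, which is consistent with the read-only access described in this guide.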

3. Awareness Graph Building

Sherlocks construct a living map of your system by correlating:

  • Service dependencies and communication patterns
  • Normal vs. abnormal behavior baselines
  • Historical incident patterns
  • Deployment and code change timelines
  • Team knowledge from Slack conversations

This graph continuously evolves as your system changes and new incidents occur.
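
One way to picture the service-dependency part of such a graph is a plain adjacency map. The sketch below is a toy model, not the Awareness Graph's real data structure: the service names and the `blast_radius` helper are hypothetical. It walks the reversed edges to find every service that transitively depends on a failing one.

```python
from collections import deque

# Hypothetical service map: each key calls the services in its list.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["user-service", "payment-service"],
    "payment-service": ["postgres"],
    "user-service": ["postgres", "redis"],
    "postgres": [],
    "redis": [],
}


def blast_radius(depends_on: dict, failing: str) -> set:
    """Return every service that directly or transitively depends on `failing`."""
    # Invert the edges so we can walk from a dependency to its callers.
    dependents = {service: set() for service in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected
```

With this map, a Redis failure reaches `user-service`, `api`, and `web`, while `payment-service` is unaffected.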

4. AI Reasoning Engine

When an investigation is triggered (manually, by alert, or proactively), Sherlocks:

  • Identify relevant signals from the Awareness Graph
  • Correlate metrics, logs, and traces across time and services
  • Generate hypotheses about potential root causes
  • Test hypotheses against available data
  • Rank causes by likelihood and impact
  • Consider historical similar incidents

The LLM reasoning happens in your chosen environment (cloud or self-hosted).
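
The ranking step can be illustrated with a very small scoring model. This guide does not document the actual scoring method, so the product of likelihood and impact used below is an assumption, as are the `Hypothesis` class and the sample values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    cause: str
    likelihood: float  # 0.0 to 1.0, how well the data supports this cause
    impact: float      # 0.0 to 1.0, how much of the incident it would explain


def rank_hypotheses(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Order hypotheses by likelihood x impact, highest first (assumed scoring)."""
    return sorted(hypotheses, key=lambda h: h.likelihood * h.impact, reverse=True)


# Illustrative candidates with made-up scores.
CANDIDATES = [
    Hypothesis("noisy neighbor on shared node", likelihood=0.5, impact=0.4),
    Hypothesis("N+1 query introduced by latest deployment", likelihood=0.8, impact=0.9),
    Hypothesis("connection pool exhaustion", likelihood=0.3, impact=1.0),
]
```

A real engine would presumably also weigh similar historical incidents; that part is left out to keep the sketch short.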

5. RCA Generator

Sherlocks generate a comprehensive Root Cause Analysis including:

  • Primary root cause with confidence level
  • Contributing factors
  • Timeline of events leading to the issue
  • Affected services and blast radius
  • Recommended remediation steps
  • Links to relevant logs, metrics, and commits
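
The fields listed above map naturally onto a small record type. The sketch below is one possible shape, assuming a Python consumer; the class name, field names, and the 0.9 confidence value are hypothetical and not taken from the Sherlocks API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RootCauseAnalysis:
    primary_cause: str
    confidence: float  # 0.0 to 1.0
    contributing_factors: List[str] = field(default_factory=list)
    timeline: List[str] = field(default_factory=list)
    affected_services: List[str] = field(default_factory=list)
    remediation_steps: List[str] = field(default_factory=list)
    evidence_links: List[str] = field(default_factory=list)

    def summary(self) -> str:
        """Render a short plain-text summary, one item per line."""
        lines = [f"Root cause ({self.confidence:.0%} confidence): {self.primary_cause}"]
        if self.affected_services:
            lines.append("Affected: " + ", ".join(self.affected_services))
        for step in self.remediation_steps:
            lines.append(f"Recommend: {step}")
        return "\n".join(lines)


# Values borrowed from the example response in this guide; confidence is made up.
EXAMPLE_RCA = RootCauseAnalysis(
    primary_cause="N+1 query in UserService.getProfile() introduced in commit abc123",
    confidence=0.9,
    affected_services=["api"],
    remediation_steps=["Roll back to v2.3.3", "Apply hotfix to add eager loading"],
)
```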

6. Slack Response

Sherlocks deliver findings directly in Slack with:

  • Clear, actionable summary
  • Interactive elements for deeper investigation
  • Links to dashboards and relevant resources
  • Suggested next steps
  • Option to ask follow-up questions

Example Response:

"The API slowdown started at 14:23 UTC, 2 minutes after deployment v2.3.4. Root cause: N+1 query in UserService.getProfile() introduced in commit abc123. This is causing 50x more database queries. Recommend: rollback to v2.3.3 or apply hotfix to add eager loading."

Deployment Models

Flexible deployment options to meet your security and compliance requirements

SaaS Model
Recommended for most teams

The Sherlocks platform runs in our cloud, while the Watson agent runs inside your VPC/infrastructure.

Architecture:

  • Watson agent deployed via Helm in your Kubernetes cluster
  • Agent has read-only access to your infrastructure
  • Telemetry metadata sent to Sherlocks cloud (encrypted in transit)
  • AI reasoning happens in Sherlocks cloud
  • Results delivered via Slack
Security: No raw application data leaves your environment. Only metadata and metrics are transmitted.

Cloud-Native SaaS
Quickest to get started

The entire Sherlocks platform, including Investigators and the Watson Data Agent, is deployed and managed within Sherlocks.ai's secure cloud infrastructure. This model connects directly to your cloud accounts and other monitoring sources, eliminating the need for any agent installation or additional infrastructure on your end.

Architecture:

  • Sherlocks Investigators and Watson Data Agent deployed in Sherlocks.ai cloud
  • Connects directly to your cloud accounts via secure API authentication
  • No agents, Kubernetes pods, or VMs required in your infrastructure
  • All monitoring sources integrated directly from Sherlocks cloud
  • Managed updates and scaling by Sherlocks.ai
Benefits: Get started instantly with no agent installation or infrastructure management. Leverage Sherlocks.ai's secure, SOC 2 Type 2 certified cloud environment for all your monitoring needs.

Fully In-VPC Sherlocks
Maximum control

Deploy the entire Sherlocks platform within your own infrastructure with no external dependencies.

Architecture:

  • Complete Sherlocks stack runs in your VPC
  • No data leaves your environment
  • Self-managed updates and scaling
  • Requires self-hosted LLM or private cloud LLM
Best for: Highly regulated industries (finance, healthcare) or air-gapped environments.

In-VPC LLM (Azure OpenAI)
Enterprise AI security

Use Azure OpenAI Service within your own Azure tenant for enterprise-grade AI with data residency guarantees.

Benefits:

  • LLM runs in your Azure subscription
  • Microsoft's enterprise data protection guarantees
  • No training on your data
  • Compliance with SOC 2, HIPAA, GDPR
  • Data residency in your chosen Azure region

Installation & Integration Steps

Step-by-step guide to get Sherlocks running in your environment

Prerequisites

  • Kubernetes cluster (v1.20+) with Helm 3 installed
  • Admin access to grant IAM roles for cloud providers
  • Slack workspace admin access
  • Access to observability tools (Prometheus, Datadog, etc.)
  • Read-only credentials for databases and queues

Example Use Cases

Real-world scenarios where Sherlocks accelerate incident resolution

Slow API Calls

Problem

API response time increased from 100ms to 1s

Sherlocks Investigation

Sherlocks correlate the latency spike with a recent deployment, identify a new N+1 query pattern in the database, and point to the specific commit that introduced it.

Outcome

MTTR reduced from 2 hours to 5 minutes
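
The correlation between a latency spike and a recent deployment can be sketched as a simple time-window lookup. The function name, the 30-minute window, and the calendar date are assumptions; the 14:21 and 14:23 UTC times mirror the example response earlier in this guide.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def suspect_deployment(
    spike_at: datetime,
    deployments: Dict[str, datetime],
    window: timedelta = timedelta(minutes=30),
) -> Optional[str]:
    """Return the most recent deployment made within `window` before the spike."""
    candidates = [
        (deployed_at, version)
        for version, deployed_at in deployments.items()
        if timedelta(0) <= spike_at - deployed_at <= window
    ]
    return max(candidates)[1] if candidates else None


UTC = timezone.utc
# Illustrative deployment history; the date is arbitrary.
DEPLOYMENTS = {
    "v2.3.3": datetime(2024, 1, 15, 9, 0, tzinfo=UTC),
    "v2.3.4": datetime(2024, 1, 15, 14, 21, tzinfo=UTC),
}
```

A spike at 14:23 UTC points to `v2.3.4`, deployed two minutes earlier. Time proximity alone is only a hint, which is why the investigation above also looks at the query pattern and the specific commit.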

Build Failure Leading to Partial Deployment

Problem

Some microservices deployed, others failed silently

Sherlocks Investigation

Sherlocks correlate CI/CD pipeline failures with missing services in Kubernetes, identify the failed build step, and link to the problematic commit.

Outcome

Complete context provided immediately

Kafka Backlog + Consumer Lag

Problem

Message processing falling behind, queue growing

Sherlocks Investigation

Sherlocks identify that a consumer pod is crash-looping due to OOM, correlate with recent traffic spike, and suggest scaling.

Outcome

Proactive detection before customer impact
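
Consumer lag itself is simple arithmetic: for each partition, the log end offset minus the consumer group's committed offset. The sketch below computes it from plain dictionaries and flags a growing backlog. Both function names and the "strictly increasing" growth rule are assumptions for illustration.

```python
from typing import Dict, List


def partition_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition = log end offset minus the group's committed offset."""
    return {
        # Simplification: a partition with no committed offset counts from 0.
        partition: max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def backlog_is_growing(total_lag_samples: List[int]) -> bool:
    """True when every sample is larger than the one before it."""
    return len(total_lag_samples) >= 2 and all(
        later > earlier
        for earlier, later in zip(total_lag_samples, total_lag_samples[1:])
    )
```

For example, end offsets `{0: 1500, 1: 900}` with committed offsets `{0: 1200, 1: 900}` give a lag of 300 on partition 0 and 0 on partition 1.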

Kubernetes Crash Loop

Problem

Pod repeatedly restarting

Sherlocks Investigation

Sherlocks analyze pod logs, identify a missing environment variable introduced in the latest deployment, and provide the exact fix.

Outcome

Root cause identified in seconds

Low DB Connections or Replication Lag

Problem

Database connection pool exhausted, queries timing out

Sherlocks Investigation

Sherlocks detect connection pool saturation, identify a long-running query holding connections, and correlate with a recent code change.

Outcome

Prevented database outage

Permissions Model

Detailed breakdown of required permissions and security guarantees

Security Guarantee

All permissions are read-only. Sherlocks cannot modify your infrastructure, databases, or application data. We collect only metadata and metrics—never raw application data, PII, or secrets.

| System | Category | Required Permissions | Purpose | Business Value |
|---|---|---|---|---|
| MySQL | Database | SELECT on INFORMATION_SCHEMA, SHOW commands for metadata | Monitor query performance, replication status, connection pools, and deadlocks | Identify slow queries and database bottlenecks affecting user experience |
| MongoDB | Database | Read-only access to admin and local databases, serverStatus command | Track operations, index statistics, replication state, and connection metrics | Detect performance degradation and replication lag before it impacts users |
| Redis | Database | INFO command, read-only key access for metadata | Monitor memory usage, key counts, replication status, and eviction metrics | Prevent cache misses and memory issues that slow down applications |
| Elasticsearch | Database | Read-only cluster and index stats API access | Track index health, search performance, and cluster resource usage | Ensure search functionality remains responsive during high traffic |
| Cassandra | Database | Read-only access to system tables and nodetool metrics | Monitor cluster health, compaction status, and read/write latencies | Maintain database availability and prevent query timeouts |
| Kafka | Queue | Read-only consumer group and topic metadata access | Track consumer lag, partition health, broker metrics, and throughput | Prevent message backlog that causes delayed processing and user complaints |
| RabbitMQ | Queue | Read-only access to management API for queue and exchange stats | Monitor queue depths, message rates, and connection health | Detect message processing bottlenecks before they cause system failures |
| Amazon SQS | Queue | CloudWatch metrics read access, SQS GetQueueAttributes | Track queue depth, message age, and throughput metrics | Identify and resolve message processing delays proactively |
| Azure Service Bus (Queues/Topics) | Queue | Read-only access to queue/topic metrics and message counts | Monitor active message counts, dead letter queues, and processing rates | Prevent message accumulation that leads to service degradation |
| Kubernetes | Orchestration | Read-only access to pods, services, deployments, events, and nodes | Map service dependencies, track resource usage, and monitor pod health | Enable rapid incident response by understanding service topology and health |
| Clouds (AWS, GCP, Azure) | Cloud | Read-only access to CloudWatch/Stackdriver/Monitor metrics and resource metadata | Collect infrastructure metrics, resource utilization, and service health data | Correlate application issues with infrastructure problems for faster root cause analysis |
| Prometheus | Observability | Read-only query API access to metrics and time-series data | Query metrics for baseline establishment and anomaly detection | Leverage existing metrics infrastructure for comprehensive system visibility |
| Datadog | Observability | Read-only API access to metrics, events, and dashboards | Query metrics and correlate with application performance data | Unify observability data for faster incident investigation |
| New Relic / Sentry (APM & Error Tracking) | Observability | Read-only API access to application performance metrics and error traces | Correlate infrastructure issues with application errors and performance degradation | Connect infrastructure problems to user-facing issues for complete incident understanding |
| Coralogix / ELK | Observability | Read-only access to log aggregation and search APIs | Query logs during incidents to understand system behavior and errors | Speed up root cause analysis by correlating metrics with log events |
| GitHub (Code Repository) | CI/CD | Read-only access to repository metadata, commit history, and code | Analyze code changes that may have caused incidents and understand service dependencies | Link incidents to code changes for faster resolution and prevention |
| Jenkins (CI Server) | CI/CD | Read-only access to build history and job metadata | Correlate deployments and builds with incident timelines | Identify if recent deployments caused issues, enabling quick rollback decisions |
| GitHub Actions (CI/CD) | CI/CD | Read-only access to workflow runs and job metadata | Track deployment history and correlate with incident occurrences | Understand deployment impact on system stability |
| Azure Pipelines (Azure DevOps) | CI/CD | Read-only access to pipeline runs and release metadata | Monitor deployment frequency and correlate releases with incidents | Enable data-driven deployment decisions and faster incident resolution |

What Sherlocks Does With This Data

  • Builds the Awareness Graph mapping service dependencies
  • Establishes baselines for normal system behavior
  • Correlates signals during incident investigations
  • Identifies anomalies and potential issues
  • Generates root cause analyses
  • Learns from incidents to improve future investigations

Data Exfiltration Protection

Watson agent runs inside your VPC with no ability to access application data from databases or queues. Only aggregated metrics and metadata are transmitted to the Sherlocks platform (or kept entirely in your environment with self-hosted deployment).

Data Security & Privacy

How Sherlocks protect your data and maintain security

Isolation Model

Watson agent runs entirely within your VPC or infrastructure, ensuring your data never leaves your control:

  • Agent deployed via Helm in your Kubernetes cluster
  • Direct access to your infrastructure using your network
  • No VPN or external access required
  • Telemetry processed locally before transmission

Read-Only IAM Roles

Watson operates with strictly read-only permissions across all integrations:

  • Cannot modify infrastructure, databases, or queues
  • Cannot execute commands or deploy changes
  • Cannot access secrets or credentials
  • Permissions follow principle of least privilege
Guarantee: Even if compromised, Watson cannot modify your systems or exfiltrate application data.

No Raw Data Access

Watson collects metadata and metrics, not application data:

What We Collect
  • Database connection counts
  • Query execution times
  • Replication lag metrics
  • Queue depth and message age
  • Error rates and types
What We Don't Collect
  • Table data or records
  • Message contents
  • Customer PII
  • API keys or secrets
  • Source code

TLS & Encryption

  • All data in transit encrypted with TLS 1.3
  • Data at rest encrypted using AES-256
  • Separate encryption keys per customer
  • Key rotation policies enforced
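
If you write your own client that talks to a TLS 1.3 endpoint, Python's standard `ssl` module can refuse anything older. This is a generic client-side sketch, assuming a Python runtime built against an OpenSSL version that supports TLS 1.3; the function name is made up, and it says nothing about how Sherlocks configures its own endpoints.

```python
import ssl


def tls13_client_context() -> ssl.SSLContext:
    """Create a client context that verifies certificates and requires TLS 1.3."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Reject TLS 1.2 and older during the handshake.
    context.minimum_version = ssl.TLSVersion.TLSv1_3
    return context
```

The returned context can be passed to standard-library clients, for example `urllib.request.urlopen(url, context=tls13_client_context())`.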

Optional Private LLM

For maximum data privacy, use a private LLM deployment:

  • Azure OpenAI: LLM runs in your Azure tenant with Microsoft's data protection guarantees
  • AWS Bedrock: Fully managed in your AWS account
  • Self-Hosted: Run open-source models in your infrastructure
With private LLM, your telemetry data never leaves your cloud environment.

Retention Policy

  • Telemetry metadata retained for 90 days by default (configurable)
  • Incident analyses retained for 1 year
  • Awareness Graph continuously updated, old patterns pruned
  • Data deletion requests honored within 30 days
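
The retention windows above can be expressed as a small lookup plus a date comparison. The record kinds, the `is_expired` helper, and the treatment of "1 year" as 365 days are assumptions for this sketch.

```python
from datetime import datetime, timedelta, timezone

# Retention windows from the policy above.
RETENTION = {
    "telemetry_metadata": timedelta(days=90),  # default; documented as configurable
    "incident_analysis": timedelta(days=365),  # "1 year", approximated as 365 days
}


def is_expired(kind: str, created_at: datetime, now: datetime) -> bool:
    """True when a record of the given kind is older than its retention window."""
    return now - created_at > RETENTION[kind]


NOW = datetime(2024, 6, 1, tzinfo=timezone.utc)       # illustrative "current" time
CREATED = datetime(2024, 1, 1, tzinfo=timezone.utc)   # 152 days before NOW
```

With these values, telemetry metadata created on 1 January has expired by 1 June, while an incident analysis from the same day is still retained.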

Self-Hosting Options

For complete control, deploy Sherlocks entirely in your infrastructure:

  • All components run in your VPC
  • No external dependencies
  • You control all data retention and deletion
  • Suitable for air-gapped environments

Bring Your Own Cloud (BYOC) LLM

Use your own LLM API keys and accounts:

  • Sherlocks use your Azure OpenAI or Anthropic account
  • LLM costs billed directly to you
  • Full visibility into LLM usage and costs
  • Compliance with your existing LLM vendor agreements

Custom Instructions

Teach Sherlocks about your team's processes and preferences

Global Instructions

Set team-wide guidelines that apply to all investigations:

Example Global Instructions:

  • Always check the #deployments channel for recent changes
  • Our peak traffic hours are 9 AM - 5 PM EST
  • Database queries over 1s are considered slow
  • Escalate to @oncall-sre for production issues
  • We prefer rollback over hotfix for critical issues

Per-Service Overrides

Provide service-specific context and thresholds:

Payment Service:

  • Normal latency: 50-100ms
  • Depends on: wallet-service, stripe-api
  • Known issue: Stripe rate limits at 100 req/s
  • Owner: @payments-team

User Service:

  • Cache hit rate should be > 90%
  • Redis failover takes 30s (expected)
  • Owner: @backend-team
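
Conceptually, per-service overrides layer on top of global instructions. The sketch below models that as a dictionary merge in which service-specific keys win. The key names and the structured format are hypothetical; this guide does not specify how instructions are stored, and they may well be free-form text.

```python
# Hypothetical structured form of the global instructions above.
GLOBAL_INSTRUCTIONS = {
    "slow_query_seconds": 1.0,
    "escalate_to": "@oncall-sre",
    "critical_issue_preference": "rollback",
}

# Hypothetical structured form of the per-service overrides above.
SERVICE_OVERRIDES = {
    "payment-service": {
        "normal_latency_ms": (50, 100),
        "depends_on": ["wallet-service", "stripe-api"],
        "owner": "@payments-team",
    },
    "user-service": {
        "min_cache_hit_rate": 0.90,
        "expected_redis_failover_seconds": 30,
        "owner": "@backend-team",
    },
}


def effective_instructions(service: str) -> dict:
    """Merge global instructions with a service's overrides; overrides win."""
    return {**GLOBAL_INSTRUCTIONS, **SERVICE_OVERRIDES.get(service, {})}
```

A service with no overrides simply receives a copy of the global instructions.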

Assistant Levels

Choose the depth of analysis for different situations:

Intern

Quick Triage

Fast, surface-level analysis. Good for initial assessment or low-priority issues.

Senior

Detailed Investigation

Comprehensive RCA with correlation across multiple signals. Default mode.

Architect

Strategic Analysis

Deep analysis with architectural recommendations and long-term solutions.

@sherlocks investigate this as an architect
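
To make the level selection concrete, here is a toy parser that extracts the requested level from a message and falls back to Senior, the documented default. The "as a/an <level>" phrasing rule and the function name are assumptions; Sherlocks presumably interprets natural language far more flexibly than a regular expression.

```python
import re

LEVELS = ("intern", "senior", "architect")
_LEVEL_PATTERN = re.compile(r"\bas an? (intern|senior|architect)\b", re.IGNORECASE)


def requested_level(message: str, default: str = "senior") -> str:
    """Return the assistant level named in the message, or the default."""
    match = _LEVEL_PATTERN.search(message)
    return match.group(1).lower() if match else default
```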

Safety Boundaries

Define what Sherlocks should never do:

  • Never suggest deleting production data
  • Never recommend scaling down during business hours
  • Always require approval before suggesting rollbacks
  • Don't investigate PII-related logs

Company SRE Cultural Preferences

Encode your team's debugging philosophy:

Example Cultural Instructions:

  • "We value blameless post-mortems"
  • "Always consider blast radius before suggesting changes"
  • "Prefer gradual rollouts over big-bang deploys"
  • "Document everything in Confluence"
  • "We communicate outages in #customer-updates within 5 minutes"

Escalation Rules

Define when and how to escalate:

  • Page @oncall-sre for SEV-1 incidents (customer-facing outage)
  • Notify @backend-lead for database issues
  • Alert @security-team for authentication failures > 10/min
  • Create Jira ticket for non-urgent issues
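
Rules like these are easy to reason about when written as a function. The sketch below encodes the four example rules; the function signature, the `action:target` string format, and treating the Jira ticket as the fallback when nothing else matches are all assumptions.

```python
from typing import List


def escalation_targets(severity: int, category: str, auth_failures_per_min: int = 0) -> List[str]:
    """Return the escalation actions that apply to an issue, in rule order."""
    targets = []
    if severity == 1:
        targets.append("page:@oncall-sre")
    if category == "database":
        targets.append("notify:@backend-lead")
    if auth_failures_per_min > 10:
        targets.append("alert:@security-team")
    if not targets:
        # Assumed fallback: anything that matches no rule is non-urgent.
        targets.append("jira:create-ticket")
    return targets
```

A SEV-1 database incident therefore both pages on-call and notifies the backend lead.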

Example Templates

"How we debug production issues"

1. Check recent deployments
2. Review error rates in Datadog
3. Examine logs for stack traces
4. Check database connection pools
5. Verify external API health
6. Consider rollback if deployed < 1 hour ago

"How we talk to customers on outages"

- Be transparent about impact
- Provide ETAs only if confident
- Update every 30 minutes
- Never blame third parties publicly

"Our rollback policies"

- Rollback immediately for SEV-1
- Rollback within 15 min for SEV-2
- Hotfix acceptable for SEV-3+
- Always notify #engineering before rollback

Sherlocks in Slack

How to interact with Sherlocks through Slack

Query Formats

Ask Sherlocks questions using natural language:

@sherlocks why is the API slow?

General investigation request

@sherlocks what caused the deployment failure?

Specific incident investigation

@sherlocks show me the health of the payment service

Service health check

@sherlocks what changed in the last hour?

Change tracking

@sherlocks explain this error: [paste stack trace]

Error analysis

Slack Shortcuts

Use Slack shortcuts for quick actions:

  • /investigate - Start a new investigation
  • /sherlock-status - Check system health
  • /sherlock-recent - View recent incidents
  • /sherlock-help - Get help and examples

Screen Share & Voice Debugging

During Slack huddles or calls, Sherlocks can participate in real-time debugging sessions:

  • Share your screen showing dashboards or logs
  • Ask Sherlocks questions verbally
  • Sherlocks respond in the thread with analysis
  • Collaborative investigation with your team
Coming Soon: Voice responses and interactive screen analysis

Incident Channel Automation

When you create an incident channel (e.g., #incident-2024-01-15), Sherlocks automatically:

  • Join the channel
  • Begin investigating based on channel name or initial messages
  • Post preliminary findings within minutes
  • Monitor the conversation for context
  • Update analysis as new information emerges
  • Generate RCA summary at incident resolution

Auto-Generated RCA Summaries

At the end of an incident, Sherlocks automatically generate a comprehensive RCA including:

  • Timeline of events
  • Root cause with supporting evidence
  • Services and users impacted
  • Remediation steps taken
  • Recommendations to prevent recurrence
  • Links to relevant metrics, logs, and commits

This RCA is posted in the incident channel and stored in the Awareness Graph for future reference.

Interactive Elements

Sherlocks' responses include interactive buttons and menus:

  • Dig Deeper: Request more detailed analysis
  • Show Logs: View relevant log entries
  • View Metrics: Open related dashboards
  • Similar Incidents: See past similar issues
  • Mark Resolved: Close the investigation
  • Escalate: Page on-call engineer
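
In Slack, buttons like these are typically delivered as Block Kit blocks. The sketch below builds a section block followed by an actions block with one button per action listed above. The block and button structure follows Slack's Block Kit format; the `action_id` values and the function name are made up, and the payload Sherlocks actually sends is not documented here.

```python
import json

ACTIONS = [
    "Dig Deeper",
    "Show Logs",
    "View Metrics",
    "Similar Incidents",
    "Mark Resolved",
    "Escalate",
]


def build_blocks(summary: str) -> list:
    """Build a Block Kit message: a summary section followed by action buttons."""
    buttons = [
        {
            "type": "button",
            "text": {"type": "plain_text", "text": label},
            # Hypothetical identifiers, e.g. "dig_deeper".
            "action_id": label.lower().replace(" ", "_"),
        }
        for label in ACTIONS
    ]
    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": summary}},
        {"type": "actions", "elements": buttons},
    ]


# The result is JSON-serializable and can be used as a message's `blocks` field.
EXAMPLE_JSON = json.dumps(build_blocks("API slowdown traced to deployment v2.3.4"))
```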

Troubleshooting

Common issues and how to resolve them

Agent not visible in Slack

Possible Causes
  • Sherlocks app not installed in workspace
  • App not invited to channel
  • Permissions not granted
Solutions
  • Reinstall Sherlocks app from Slack App Directory
  • Invite @sherlocks to the channel with /invite @sherlocks
  • Check that a workspace admin has approved the app

Missing permissions errors

Possible Causes
  • IAM roles not configured correctly
  • Credentials expired
  • Insufficient permissions granted
Solutions
  • Review permissions table and ensure all required permissions are granted
  • Rotate credentials if expired
  • Check Watson logs for specific permission errors
  • Use the /sherlock-status command to see integration health

Telemetry not flowing

Possible Causes
  • Watson agent not running
  • Network connectivity issues
  • Integration credentials incorrect
Solutions
  • Check Watson pod status: kubectl get pods -n sherlocks
  • View Watson logs: kubectl logs -n sherlocks -l app=watson
  • Verify network policies allow outbound connections
  • Test integration credentials manually
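
If you want to script the pod status check, the default table printed by `kubectl get pods` is easy to scan. The sketch below parses that output and reports pods that are not `Running`. The sample output and pod names are invented for illustration, and the `sherlocks` namespace is the one used in the commands above.

```python
# Argument list for the status command shown above.
KUBECTL_GET_PODS = ["kubectl", "get", "pods", "-n", "sherlocks"]

# Invented sample of the default `kubectl get pods` table.
SAMPLE_OUTPUT = """\
NAME                      READY   STATUS             RESTARTS   AGE
watson-7c9d8f6b5d-abcde   1/1     Running            0          3d
watson-7c9d8f6b5d-fghij   0/1     CrashLoopBackOff   12         3d
"""


def unhealthy_pods(kubectl_output: str) -> dict:
    """Map pod name to status for every pod whose STATUS is not Running."""
    pods = {}
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        columns = line.split()
        name, status = columns[0], columns[2]
        if status != "Running":
            pods[name] = status
    return pods
```

To run it against a live cluster, pass the output of `subprocess.run(KUBECTL_GET_PODS, capture_output=True, text=True, check=True).stdout` to `unhealthy_pods`. Note that this simple check also reports `Completed` pods, which are not necessarily a problem.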

Slack not responding

Possible Causes
  • Sherlocks platform outage
  • Rate limiting
  • Investigation queue backlog
Solutions
  • Check status.sherlocks.ai for platform status
  • Wait a few minutes and retry
  • Use /sherlock-status to check queue depth
  • Contact support if issue persists

Stuck investigations

Possible Causes
  • LLM provider timeout
  • Insufficient data to analyze
  • Complex query requiring more time
Solutions
  • Wait up to 5 minutes for complex investigations
  • Rephrase query to be more specific
  • Check LLM provider status
  • Cancel and restart investigation

LLM timeouts or restrictions

Possible Causes
  • LLM API rate limits hit
  • LLM provider outage
  • Content filtering triggered
Solutions
  • Configure fallback LLM provider
  • Increase rate limits with your LLM provider
  • Review content filtering policies
  • Switch to self-hosted LLM
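
The idea behind a fallback LLM provider can be sketched as trying providers in order and moving on when one times out or is unreachable. Everything here is hypothetical: the function names, the provider callables, and the choice of which exceptions count as retryable. How Sherlocks configures fallbacks is not covered in this guide.

```python
from typing import Callable, List, Tuple


def ask_with_fallback(
    providers: List[Tuple[str, Callable[[str], str]]],
    prompt: str,
) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, answer) from the first success."""
    failures = []
    for name, ask in providers:
        try:
            return name, ask(prompt)
        except (TimeoutError, ConnectionError) as exc:
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all LLM providers failed: " + "; ".join(failures))


def flaky_primary(prompt: str) -> str:
    """Stand-in for a provider that is timing out."""
    raise TimeoutError("request timed out")


def stub_fallback(prompt: str) -> str:
    """Stand-in for a healthy fallback provider."""
    return f"analysis for: {prompt}"
```

Errors outside the retryable set, such as a content-filter rejection, propagate immediately in this sketch so they are not silently masked by the fallback.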

Getting Help

If you're still experiencing issues:

  • Email: support@sherlocks.ai
  • Slack: Join our community Slack workspace
  • Documentation: docs.sherlocks.ai
  • Status Page: status.sherlocks.ai

Include Watson logs and investigation IDs when contacting support for faster resolution.

Frequently Asked Questions

Common questions about Sherlocks capabilities and deployment

Glossary

Key terms and concepts in the Sherlocks platform

Awareness Graph

A living knowledge base that maps your entire system's architecture, dependencies, behavior patterns, and incident history. It continuously evolves as your system changes.

Investigator

The AI reasoning engine that analyzes telemetry, correlates signals, and generates root cause analyses.

Watson

The data collection agent that runs in your infrastructure with read-only access to gather telemetry and metadata.

Signals

Data points used in investigations: metrics, logs, traces, events, and Slack conversations.

RCA (Root Cause Analysis)

A comprehensive analysis identifying the primary cause of an incident, contributing factors, timeline, and recommended remediation.

Hypothesis

A potential explanation for an incident that Sherlocks generates and tests against available data.

Forensics

The process of analyzing historical data to understand what happened during an incident.

Incident Memory

Sherlocks' record of past incidents, their causes, and resolutions, used to identify patterns and accelerate future investigations.

Playbooks

Predefined investigation patterns and remediation steps for common incident types.

Blast Radius

The scope of services and users affected by an incident.

MTTR (Mean Time To Resolution)

The average time it takes to resolve an incident from detection to fix.
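
As a worked example of the definition, MTTR is the mean of (resolved time minus detected time) across incidents. The helper name and the sample incidents below are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_resolution(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta(0))
    return total / len(incidents)


# Two illustrative incidents lasting 30 and 90 minutes.
INCIDENTS = [
    (datetime(2024, 1, 15, 10, 0), datetime(2024, 1, 15, 10, 30)),
    (datetime(2024, 1, 15, 12, 0), datetime(2024, 1, 15, 13, 30)),
]
```

For these two incidents the MTTR is 60 minutes.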

Toil

Repetitive, manual operational work that doesn't provide lasting value (e.g., manually correlating logs during incidents).

Custom Instructions

Team-specific guidance you provide to Sherlocks about how to investigate issues, communicate findings, and prioritize concerns.

Assistant Levels

Different modes of Sherlocks operation: Intern (basic analysis), Senior (detailed investigation), Architect (strategic recommendations).