Sherlocks.ai Documentation

Introduction

Learn about Sherlocks.ai and how it transforms SRE operations

What is Sherlocks.ai?

Sherlocks.ai is an AI-powered SRE platform that acts as an autonomous reliability engineer for your team. It continuously monitors your infrastructure, applications, and services to detect issues, perform root cause analysis, and provide actionable insights—all in real-time through Slack.

Think of Sherlocks as having expert SREs who work 24/7/365, never sleep, and have perfect memory of every incident, deployment, and system behavior across your entire stack.

Why Sherlocks.ai Exists

The Problem:

Modern systems are too complex for manual monitoring
Mean Time To Resolution (MTTR) is increasing as systems grow
SRE teams spend 60-80% of their time on toil and firefighting
Incident context is scattered across multiple tools and platforms
Knowledge is siloed in individual team members' heads
Post-mortems are written but rarely referenced

The Solution:

Sherlocks consolidate all your telemetry, learn from every incident, and provide instant, context-aware analysis when things go wrong. They don't replace your SRE team—they amplify your team.

High-Level Value Proposition

Reduce MTTR by 70%

Get instant root cause analysis instead of hours of investigation

Improve Customer Satisfaction (CSAT)

Resolve engineering support tickets faster by cutting the back-and-forth with engineering and reaching the RCA sooner

Institutional Memory

Never lose incident knowledge when team members leave

Slack-Native

Work where your team already collaborates

Architecture Summary

Sherlocks offers flexible deployment models to meet your security and compliance requirements:

SaaS

Fully managed cloud service with Watson agent deployed in your VPC for secure data access

Self-Hosted

Deploy the entire Sherlocks platform in your own infrastructure

Hybrid

Mix and match components based on your security requirements

Core Concepts

Understanding the building blocks of Sherlocks

Supported Integrations

Sherlocks integrate with your existing tools and infrastructure

Cloud Providers

AWS

EC2, RDS, Lambda, S3, CloudWatch, ECS

Google Cloud Platform

Compute Engine, Cloud SQL, GKE, Cloud Monitoring

Microsoft Azure

VMs, Azure SQL, AKS, Azure Monitor

Kubernetes

Pods, Deployments, Services, Events, Logs, Resource Metrics

Helm

Release tracking and version management

Datastores

MySQL

Query performance, replication status, deadlocks

PostgreSQL

Query stats, connection pools, replication lag

MongoDB

Operations, index stats, replication state

Redis

Memory usage, client connections, keyspace stats

Cassandra

Node health, compaction metrics, latency

Message Queues

Apache Kafka

Consumer lag, partition health, broker metrics

RabbitMQ

Queue metrics, bindings, node health

Amazon SQS

Queue depth, message age, DLQ stats

Azure Service Bus

Message throughput, DLQ monitoring

CI/CD & Version Control

GitHub Actions

Workflow runs, failures, deployment tracking

Jenkins

Build history, job failures, pipeline correlation

Azure Pipelines

Pipeline runs, logs, artifacts

GitHub

Commits, PRs, branches, deployment events

Observability & APM

Prometheus

Metrics, time-series, alert rules

Datadog

Infrastructure metrics, APM traces, logs, dashboards

New Relic

APM, error tracking, transaction metrics

Sentry

Error events, performance traces

Coralogix

Log aggregation, queries, correlations

Logging

Elasticsearch (ELK)

Log search, aggregations, cluster health

Coralogix

Centralized logging and analysis

Loki

Log aggregation, queries, alerts

Collaboration

Slack

Incident channels, thread analysis, bot interactions

Microsoft Teams

Coming soon

How Sherlocks.ai Works

Understanding the end-to-end flow from data ingestion to incident resolution

1. Slack Ingestion

Sherlocks monitor your Slack workspace for incident-related conversations, questions, and alerts. They learn from team discussions, incident channels, and post-mortems to build contextual understanding.

@sherlocks why is the API slow?

2. Telemetry Access for RCA via Watson

During an RCA, the Watson agent accesses telemetry from your infrastructure as needed to provide context and insights:

Metrics from Prometheus, Datadog, CloudWatch
Logs from ELK, Coralogix, or cloud logging
Traces from APM tools
Infrastructure state from Kubernetes, cloud APIs
Database and queue health metrics
CI/CD pipeline events
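
As an illustration of the read-only access described above, the sketch below builds a range query against the Prometheus HTTP API (`/api/v1/query_range`). It is not Watson's actual implementation. The host name and metric are hypothetical, and the function only constructs the URL; it does not send the request.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url: str, promql: str, start: int, end: int, step_seconds: int) -> str:
    """Build a GET URL for Prometheus's /api/v1/query_range endpoint (read-only)."""
    params = urlencode(
        {
            "query": promql,
            "start": start,  # Unix timestamp, seconds
            "end": end,  # Unix timestamp, seconds
            "step": f"{step_seconds}s",  # query resolution
        }
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Hypothetical host and metric name, for illustration only.
url = build_range_query_url(
    "http://prometheus.example.internal:9090/",
    "rate(http_requests_total[5m])",
    start=1700000000,
    end=1700000600,
    step_seconds=30,
)
print(url)
```

A GET request to this URL returns time-series data without changing anything on the Prometheus server, which fits the read-only permissions model.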

3. Awareness Graph Building

Sherlocks construct a living map of your system by correlating:

Service dependencies and communication patterns
Normal vs. abnormal behavior baselines
Historical incident patterns
Deployment and code change timelines
Team knowledge from Slack conversations

This graph continuously evolves as your system changes and new incidents occur.
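
The service-dependency part of such a graph can be pictured with a small data structure. The sketch below is an assumption about shape only, not the real Awareness Graph: the `DependencyGraph` class and the `checkout-service` name are hypothetical, while `payment-service`, `wallet-service`, and `stripe-api` come from the per-service example later in this document.

```python
from collections import defaultdict, deque


class DependencyGraph:
    """Minimal service dependency map with a 'who is affected' lookup."""

    def __init__(self):
        # Maps a service to the set of services that call it directly.
        self._dependents = defaultdict(set)

    def add_dependency(self, caller: str, callee: str) -> None:
        self._dependents[callee].add(caller)

    def affected_by(self, service: str) -> set:
        """Return every service that depends on `service`, directly or transitively."""
        seen = set()
        queue = deque([service])
        while queue:
            current = queue.popleft()
            for dependent in self._dependents.get(current, ()):
                if dependent != service and dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
        return seen


graph = DependencyGraph()
graph.add_dependency("payment-service", "wallet-service")
graph.add_dependency("payment-service", "stripe-api")
graph.add_dependency("checkout-service", "payment-service")  # hypothetical service
print(sorted(graph.affected_by("wallet-service")))
```

The breadth-first walk over reverse edges is one simple way to estimate which services sit downstream of a failing dependency.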

4. AI Reasoning Engine

When an investigation is triggered (manually, by alert, or proactively), Sherlocks:

Identify relevant signals from the Awareness Graph
Correlate metrics, logs, and traces across time and services
Generate hypotheses about potential root causes
Test hypotheses against available data
Rank causes by likelihood and impact
Consider historical similar incidents

The LLM reasoning happens in your chosen environment (cloud or self-hosted).
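
The ranking step above can be sketched as follows. This document does not say how likelihood and impact are combined, so multiplying the two scores is an assumption made for illustration, and the `Hypothesis` type and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    description: str
    likelihood: float  # 0.0 to 1.0, how well the data supports it
    impact: float  # 0.0 to 1.0, how much of the incident it would explain


def rank(hypotheses):
    """Order hypotheses from most to least plausible (assumed score: likelihood * impact)."""
    return sorted(hypotheses, key=lambda h: h.likelihood * h.impact, reverse=True)


candidates = [
    Hypothesis("Connection pool exhausted", likelihood=0.3, impact=0.5),
    Hypothesis("N+1 query introduced by latest deployment", likelihood=0.8, impact=0.9),
    Hypothesis("Upstream API rate limiting", likelihood=0.6, impact=0.4),
]
for hypothesis in rank(candidates):
    print(hypothesis.description)
```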

5. RCA Generator

Sherlocks generate a comprehensive Root Cause Analysis including:

Primary root cause with confidence level
Contributing factors
Timeline of events leading to the issue
Affected services and blast radius
Recommended remediation steps
Links to relevant logs, metrics, and commits
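
One way to picture the fields listed above is as a simple record. The sketch below is not the actual Sherlocks output format. The `RootCauseAnalysis` class, its field names, and the 0.9 confidence value are hypothetical; the incident details are taken from the example response in the Slack Response step.

```python
from dataclasses import dataclass, field


@dataclass
class RootCauseAnalysis:
    root_cause: str
    confidence: float  # 0.0 to 1.0
    contributing_factors: list[str] = field(default_factory=list)
    timeline: list[tuple[str, str]] = field(default_factory=list)  # (time, event)
    affected_services: list[str] = field(default_factory=list)
    remediation_steps: list[str] = field(default_factory=list)
    links: list[str] = field(default_factory=list)

    def summary(self) -> str:
        """Render a short plain-text summary, one item per line."""
        lines = [f"Root cause ({self.confidence:.0%} confidence): {self.root_cause}"]
        lines += [f"{time}: {event}" for time, event in self.timeline]
        if self.affected_services:
            lines.append("Affected: " + ", ".join(self.affected_services))
        lines += [f"Recommended: {step}" for step in self.remediation_steps]
        return "\n".join(lines)


report = RootCauseAnalysis(
    root_cause="N+1 query in UserService.getProfile() introduced in commit abc123",
    confidence=0.9,  # hypothetical value
    timeline=[("14:21 UTC", "deployment v2.3.4"), ("14:23 UTC", "API slowdown starts")],
    affected_services=["API"],
    remediation_steps=["rollback to v2.3.3", "apply hotfix to add eager loading"],
)
print(report.summary())
```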

6. Slack Response

Sherlocks deliver findings directly in Slack with:

Clear, actionable summary
Interactive elements for deeper investigation
Links to dashboards and relevant resources
Suggested next steps
Option to ask follow-up questions

Example Response:

"The API slowdown started at 14:23 UTC, 2 minutes after deployment v2.3.4. Root cause: N+1 query in UserService.getProfile() introduced in commit abc123. This is causing 50x more database queries. Recommend: rollback to v2.3.3 or apply hotfix to add eager loading."

Deployment Models

Flexible deployment options to meet your security and compliance requirements

SaaS Model
Recommended for most teams

The Sherlocks platform runs in our cloud, while the Watson agent runs inside your VPC/infrastructure.

Architecture:

Watson agent deployed via Helm in your Kubernetes cluster
Agent has read-only access to your infrastructure
Telemetry metadata sent to Sherlocks cloud (encrypted in transit)
AI reasoning happens in Sherlocks cloud
Results delivered via Slack

Security: No raw application data leaves your environment. Only metadata and metrics are transmitted.

Cloud-Native SaaS
Quickest to get started

The entire Sherlocks platform, including Investigators and the Watson Data Agent, is deployed and managed within Sherlocks.ai's secure cloud infrastructure. This model connects directly to your cloud accounts and other monitoring sources, eliminating the need for any agent installation or additional infrastructure on your end.

Architecture:

Sherlocks Investigators and Watson Data Agent deployed in Sherlocks.ai cloud
Connects directly to your cloud accounts via secure API authentication
No agents, Kubernetes pods, or VMs required in your infrastructure
All monitoring sources integrated directly from Sherlocks cloud
Managed updates and scaling by Sherlocks.ai

Benefits: Get started instantly with no agent installation or infrastructure management. Leverage Sherlocks.ai's secure, SOC 2 Type 2 certified cloud environment for all your monitoring needs.

Fully In-VPC Sherlocks
Maximum control

Deploy the entire Sherlocks platform within your own infrastructure with no external dependencies.

Architecture:

Complete Sherlocks stack runs in your VPC
No data leaves your environment
Self-managed updates and scaling
Requires self-hosted LLM or private cloud LLM

Best for: Highly regulated industries (finance, healthcare) or air-gapped environments.

In-VPC LLM (Azure OpenAI)
Enterprise AI security

Use Azure OpenAI Service within your own Azure tenant for enterprise-grade AI with data residency guarantees.

Benefits:

LLM runs in your Azure subscription
Microsoft's enterprise data protection guarantees
No training on your data
Compliance with SOC 2, HIPAA, GDPR
Data residency in your chosen Azure region

Installation & Integration Steps

Step-by-step guide to get Sherlocks running in your environment. The installation steps are covered in this guide.

Prerequisites

Kubernetes cluster (v1.20+) with Helm 3 installed
Admin access to grant IAM roles for cloud providers
Slack workspace admin access

Steps

Step 1: Connect Slack
Step 2: Preparation
Step 3: Install Watson Agent via Helm
Step 4: Connect Observability Tools
Step 5: Connect Databases and Queues
Step 6: Connect GitHub/Jenkins
Step 7: Test Connection
Step 8: Verify Initial Ingestion

Permissions Model

Detailed breakdown of required permissions and security guarantees

Security Guarantee

All permissions are read-only. Sherlocks cannot modify your infrastructure, databases, or application data. We collect only metadata and metrics—never raw application data, PII, or secrets.

System	Category	Required Permissions	Purpose	Business Value
MySQL	Database	SELECT on INFORMATION_SCHEMA, SHOW commands for metadata	Monitor query performance, replication status, connection pools, and deadlocks	Identify slow queries and database bottlenecks affecting user experience
MongoDB	Database	Read-only access to admin and local databases, serverStatus command	Track operations, index statistics, replication state, and connection metrics	Detect performance degradation and replication lag before it impacts users
Redis	Database	INFO command, read-only key access for metadata	Monitor memory usage, key counts, replication status, and eviction metrics	Prevent cache misses and memory issues that slow down applications
Elasticsearch	Database	Read-only cluster and index stats API access	Track index health, search performance, and cluster resource usage	Ensure search functionality remains responsive during high traffic
Cassandra	Database	Read-only access to system tables and nodetool metrics	Monitor cluster health, compaction status, and read/write latencies	Maintain database availability and prevent query timeouts
Kafka	Queue	Read-only consumer group and topic metadata access	Track consumer lag, partition health, broker metrics, and throughput	Prevent message backlog that causes delayed processing and user complaints
RabbitMQ	Queue	Read-only access to management API for queue and exchange stats	Monitor queue depths, message rates, and connection health	Detect message processing bottlenecks before they cause system failures
Amazon SQS	Queue	CloudWatch metrics read access, SQS GetQueueAttributes	Track queue depth, message age, and throughput metrics	Identify and resolve message processing delays proactively
Azure Service Bus (Queues/Topics)	Queue	Read-only access to queue/topic metrics and message counts	Monitor active message counts, dead letter queues, and processing rates	Prevent message accumulation that leads to service degradation
Kubernetes	Orchestration	Read-only access to pods, services, deployments, events, and nodes	Map service dependencies, track resource usage, and monitor pod health	Enable rapid incident response by understanding service topology and health
Clouds (AWS, GCP, Azure)	Cloud	Read-only access to CloudWatch/Stackdriver/Monitor metrics and resource metadata	Collect infrastructure metrics, resource utilization, and service health data	Correlate application issues with infrastructure problems for faster root cause analysis
Prometheus	Observability	Read-only query API access to metrics and time-series data	Query metrics for baseline establishment and anomaly detection	Leverage existing metrics infrastructure for comprehensive system visibility
Datadog	Observability	Read-only API access to metrics, events, and dashboards	Query metrics and correlate with application performance data	Unify observability data for faster incident investigation
New Relic / Sentry (APM & Error Tracking)	Observability	Read-only API access to application performance metrics and error traces	Correlate infrastructure issues with application errors and performance degradation	Connect infrastructure problems to user-facing issues for complete incident understanding
Coralogix / ELK	Observability	Read-only access to log aggregation and search APIs	Query logs during incidents to understand system behavior and errors	Speed up root cause analysis by correlating metrics with log events
GitHub (Code Repository)	CI/CD	Read-only access to repository metadata, commit history, and code	Analyze code changes that may have caused incidents and understand service dependencies	Link incidents to code changes for faster resolution and prevention
Jenkins (CI Server)	CI/CD	Read-only access to build history and job metadata	Correlate deployments and builds with incident timelines	Identify if recent deployments caused issues, enabling quick rollback decisions
GitHub Actions (CI/CD)	CI/CD	Read-only access to workflow runs and job metadata	Track deployment history and correlate with incident occurrences	Understand deployment impact on system stability
Azure Pipelines (Azure DevOps)	CI/CD	Read-only access to pipeline runs and release metadata	Monitor deployment frequency and correlate releases with incidents	Enable data-driven deployment decisions and faster incident resolution

What Sherlocks Does With This Data

Builds the Awareness Graph mapping service dependencies
Establishes baselines for normal system behavior
Correlates signals during incident investigations
Identifies anomalies and potential issues
Generates root cause analyses
Learns from incidents to improve future investigations

Data Exfiltration Protection

The Watson agent runs inside your VPC and cannot access application data in databases or queues. Only aggregated metrics and metadata are transmitted to the Sherlocks platform (or kept entirely in your environment with a self-hosted deployment).

Data Security & Privacy

How Sherlocks protect your data and maintain security

Isolation Model

The Watson agent runs entirely within your VPC or infrastructure, ensuring your data never leaves your control:

Agent deployed via Helm in your Kubernetes cluster
Direct access to your infrastructure using your network
No VPN or external access required
Telemetry processed locally before transmission

Read-Only IAM Roles

Watson operates with strictly read-only permissions across all integrations:

Cannot modify infrastructure, databases, or queues
Cannot execute commands or deploy changes
Cannot access secrets or credentials
Permissions follow principle of least privilege

Guarantee: Even if compromised, Watson cannot modify your systems or exfiltrate application data.
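
Before granting a cloud role to Watson, you may want to confirm that it really is read-only. The sketch below checks an AWS IAM policy document for actions that do not look like reads. It is a rough heuristic under stated assumptions, not a Sherlocks tool. The prefix list is an assumption, `NotAction` statements are ignored, and some `Get*` actions can still return sensitive data, so review the result manually.

```python
READ_ONLY_PREFIXES = ("Describe", "Get", "List")  # assumed naming convention for read actions


def non_read_only_actions(policy: dict) -> list[str]:
    """Return allowed actions whose operation name does not start with a read-style prefix."""
    offending = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM allows a single action string
            actions = [actions]
        for action in actions:
            _, _, operation = action.partition(":")
            # Wildcards such as "*" or "ec2:*" fail the prefix check and are flagged.
            if not operation.startswith(READ_ONLY_PREFIXES):
                offending.append(action)
    return offending


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricData",
                "ec2:DescribeInstances",
                "sqs:GetQueueAttributes",
                "rds:DeleteDBInstance",
            ],
            "Resource": "*",
        }
    ],
}
print(non_read_only_actions(policy))
```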

No Raw Data Access

Watson collects metadata and metrics, not application data:

What We Collect

Database connection counts
Query execution times
Replication lag metrics
Queue depth and message age
Error rates and types

What We Don't Collect

Table data or records
Message contents
Customer PII
API keys or secrets
Source code

TLS & Encryption

All data in transit encrypted with TLS 1.3
Data at rest encrypted using AES-256
Separate encryption keys per customer
Key rotation policies enforced

Optional Private LLM

For maximum data privacy, use a private LLM deployment:

Azure OpenAI: LLM runs in your Azure tenant with Microsoft's data protection guarantees
AWS Bedrock: Fully managed in your AWS account
Self-Hosted: Run open-source models in your infrastructure

With private LLM, your telemetry data never leaves your cloud environment.

Retention Policy

Telemetry metadata retained for 90 days by default (configurable)
Incident analyses retained for 1 year
Awareness Graph continuously updated, old patterns pruned
Data deletion requests honored within 30 days
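
The default retention windows above can be expressed as a small lookup. This sketch is illustrative only: the record kind names are hypothetical, and treating "1 year" as 365 days is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Default retention windows from the policy above ("1 year" assumed to be 365 days).
RETENTION = {
    "telemetry_metadata": timedelta(days=90),
    "incident_analysis": timedelta(days=365),
}


def is_expired(kind: str, created_at: datetime, now: datetime) -> bool:
    """Return True when a record of the given kind is older than its retention window."""
    return now - created_at > RETENTION[kind]


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
now = datetime(2024, 4, 15, tzinfo=timezone.utc)  # 105 days later
print(is_expired("telemetry_metadata", created, now))
print(is_expired("incident_analysis", created, now))
```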

Self-Hosting Options

For complete control, deploy Sherlocks entirely in your infrastructure:

All components run in your VPC
No external dependencies
You control all data retention and deletion
Suitable for air-gapped environments

Bring Your Own Cloud (BYOC) LLM

Use your own LLM API keys and accounts:

Sherlocks use your Azure OpenAI or Anthropic account
LLM costs billed directly to you
Full visibility into LLM usage and costs
Compliance with your existing LLM vendor agreements

Custom Instructions

Teach Sherlocks about your team's processes and preferences

Global Instructions

Set team-wide guidelines that apply to all investigations:

Example Global Instructions:

Always check the #deployments channel for recent changes
Our peak traffic hours are 9 AM - 5 PM EST
Database queries over 1s are considered slow
Escalate to @oncall-sre for production issues
We prefer rollback over hotfix for critical issues

Per-Service Overrides

Provide service-specific context and thresholds:

Payment Service:

Normal latency: 50-100ms
Depends on: wallet-service, stripe-api
Known issue: Stripe rate limits at 100 req/s
Owner: @payments-team

User Service:

Cache hit rate should be > 90%
Redis failover takes 30s (expected)
Owner: @backend-team
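
Conceptually, per-service overrides layer on top of global instructions. The sketch below shows one way that merge could work, assuming service-level keys win over global ones. The dictionary keys are hypothetical names chosen for this example, not a documented Sherlocks configuration schema.

```python
# Global guidelines (values taken from the examples above, key names are hypothetical).
GLOBAL_INSTRUCTIONS = {
    "slow_query_threshold_s": 1.0,
    "escalation_contact": "@oncall-sre",
    "prefer_rollback_over_hotfix": True,
}

# Service-specific context (key names are hypothetical).
SERVICE_OVERRIDES = {
    "payment-service": {
        "normal_latency_ms": (50, 100),
        "depends_on": ["wallet-service", "stripe-api"],
        "owner": "@payments-team",
    },
    "user-service": {
        "min_cache_hit_rate": 0.90,
        "owner": "@backend-team",
    },
}


def instructions_for(service: str) -> dict:
    """Merge global instructions with any overrides for the service (overrides win)."""
    return {**GLOBAL_INSTRUCTIONS, **SERVICE_OVERRIDES.get(service, {})}


print(instructions_for("payment-service"))
```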

Assistant Levels

Choose the depth of analysis for different situations:

Intern

Quick Triage

Fast, surface-level analysis. Good for initial assessment or low-priority issues.

Senior

Detailed Investigation

Comprehensive RCA with correlation across multiple signals. Default mode.

Architect

Strategic Analysis

Deep analysis with architectural recommendations and long-term solutions.

@sherlocks investigate this as an architect
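
To make the level selection concrete, the sketch below extracts an assistant level from a message such as the one above and falls back to Senior, the documented default. The parsing rule (the phrase "as a/an <level>") is an assumption for illustration; this document does not specify how Sherlocks interprets the request.

```python
import re

LEVELS = ("intern", "senior", "architect")
DEFAULT_LEVEL = "senior"  # documented default mode

# Matches phrases such as "as an architect" or "as a senior".
_LEVEL_PATTERN = re.compile(r"\bas an? (" + "|".join(LEVELS) + r")\b", re.IGNORECASE)


def assistant_level(message: str) -> str:
    """Return the requested assistant level, or the default when none is mentioned."""
    match = _LEVEL_PATTERN.search(message)
    return match.group(1).lower() if match else DEFAULT_LEVEL


print(assistant_level("@sherlocks investigate this as an architect"))
print(assistant_level("@sherlocks why is the API slow?"))
```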

Safety Boundaries

Define what Sherlocks should never do:

Never suggest deleting production data
Never recommend scaling down during business hours
Always require approval before suggesting rollbacks
Don't investigate PII-related logs

Company SRE Cultural Preferences

Encode your team's debugging philosophy:

Example Cultural Instructions:

"We value blameless post-mortems"
"Always consider blast radius before suggesting changes"
"Prefer gradual rollouts over big-bang deploys"
"Document everything in Confluence"
"We communicate outages in #customer-updates within 5 minutes"

Escalation Rules

Define when and how to escalate:

Page @oncall-sre for SEV-1 incidents (customer-facing outage)
Notify @backend-lead for database issues
Alert @security-team for authentication failures > 10/min
Create Jira ticket for non-urgent issues
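
The rules above read naturally as a small decision function. The sketch below is one interpretation, not Sherlocks' implementation: the `Incident` fields and the target strings are hypothetical, and treating the Jira ticket as the fallback when no other rule matches is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    severity: int  # 1 means SEV-1
    category: str  # for example "database" or "auth"
    auth_failures_per_min: int = 0


def escalation_targets(incident: Incident) -> list[str]:
    """Apply the example escalation rules and return the actions to take."""
    targets = []
    if incident.severity == 1:
        targets.append("page:@oncall-sre")
    if incident.category == "database":
        targets.append("notify:@backend-lead")
    if incident.auth_failures_per_min > 10:
        targets.append("alert:@security-team")
    if not targets:
        # Assumed fallback: anything that matches no rule is non-urgent.
        targets.append("jira:ticket")
    return targets


print(escalation_targets(Incident(severity=1, category="database")))
print(escalation_targets(Incident(severity=3, category="auth", auth_failures_per_min=25)))
```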

Example Templates

"How we debug production issues"

1. Check recent deployments
2. Review error rates in Datadog
3. Examine logs for stack traces
4. Check database connection pools
5. Verify external API health
6. Consider rollback if deployed < 1 hour ago

"How we talk to customers on outages"

- Be transparent about impact
- Provide ETAs only if confident
- Update every 30 minutes
- Never blame third parties publicly

"Our rollback policies"

- Rollback immediately for SEV-1
- Rollback within 15 min for SEV-2
- Hotfix acceptable for SEV-3+
- Always notify #engineering before rollback

Sherlocks in Slack

How to interact with Sherlocks through Slack

Query Formats

Ask Sherlocks questions using natural language:

@sherlocks why is the API slow?

General investigation request

@sherlocks what caused the deployment failure?

Specific incident investigation

@sherlocks show me the health of the payment service

Service health check

@sherlocks what changed in the last hour?

Change tracking

@sherlocks explain this error: [paste stack trace]

Error analysis

Slack Shortcuts

Use Slack shortcuts for quick actions:

/investigate - Start a new investigation
/sherlock-status - Check system health
/sherlock-recent - View recent incidents
/sherlock-help - Get help and examples

Screen Share & Voice Debugging

During Slack huddles or calls, Sherlocks can participate in real-time debugging sessions:

Share your screen showing dashboards or logs
Ask Sherlocks questions verbally
Sherlocks respond in the thread with analysis
Collaborative investigation with your team

Coming Soon: Voice responses and interactive screen analysis

Incident Channel Automation

When you create an incident channel (e.g., #incident-2024-01-15), Sherlocks automatically:

Join the channel
Begin investigating based on channel name or initial messages
Post preliminary findings within minutes
Monitor the conversation for context
Update analysis as new information emerges
Generate RCA summary at incident resolution
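
Recognizing an incident channel by name could look like the sketch below. The only naming example in this document is #incident-2024-01-15, so the `incident-YYYY-MM-DD` pattern is an assumption; your workspace may use a different convention.

```python
import re
from datetime import date

# Assumed convention: "incident-YYYY-MM-DD", with an optional leading "#".
_INCIDENT_CHANNEL = re.compile(r"^#?incident-(\d{4})-(\d{2})-(\d{2})$")


def incident_date(channel_name: str):
    """Return the date encoded in an incident channel name, or None if it does not match."""
    match = _INCIDENT_CHANNEL.match(channel_name)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # digits matched but do not form a real calendar date
        return None


print(incident_date("#incident-2024-01-15"))
print(incident_date("#general"))
```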

Auto-Generated RCA Summaries

At the end of an incident, Sherlocks automatically generate a comprehensive RCA including:

Timeline of events
Root cause with supporting evidence
Services and users impacted
Remediation steps taken
Recommendations to prevent recurrence
Links to relevant metrics, logs, and commits

This RCA is posted in the incident channel and stored in the Awareness Graph for future reference.

Interactive Elements

Sherlocks' responses include interactive buttons and menus:

Dig Deeper: Request more detailed analysis
Show Logs: View relevant log entries
View Metrics: Open related dashboards
Similar Incidents: See past similar issues
Mark Resolved: Close the investigation
Escalate: Page on-call engineer
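
In Slack, buttons like these are typically delivered as Block Kit blocks. The sketch below builds a plausible payload with a summary section and one button per action listed above. It is an illustration rather than the payload Sherlocks actually sends: the `action_id` values and the investigation ID are hypothetical.

```python
import json

# Button labels from the list above, paired with hypothetical action IDs.
BUTTONS = [
    ("Dig Deeper", "dig_deeper"),
    ("Show Logs", "show_logs"),
    ("View Metrics", "view_metrics"),
    ("Similar Incidents", "similar_incidents"),
    ("Mark Resolved", "mark_resolved"),
    ("Escalate", "escalate"),
]


def build_response_blocks(summary: str, investigation_id: str) -> list[dict]:
    """Build Slack Block Kit blocks: a summary section followed by action buttons."""
    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": summary}},
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": label},
                    "action_id": action_id,
                    "value": investigation_id,  # lets a handler find the investigation
                }
                for label, action_id in BUTTONS
            ],
        },
    ]


blocks = build_response_blocks("The API slowdown started at 14:23 UTC.", "inv-123")
print(json.dumps(blocks, indent=2))
```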

Troubleshooting

Common issues and how to resolve them

Agent not visible in Slack

Possible Causes

Sherlocks app not installed in workspace
App not invited to channel
Permissions not granted

Solutions

Reinstall Sherlocks app from Slack App Directory
Invite @sherlocks to the channel with /invite @sherlocks
Check that a workspace admin has approved the app

Missing permissions errors

Possible Causes

IAM roles not configured correctly
Credentials expired
Insufficient permissions granted

Solutions

Review permissions table and ensure all required permissions are granted
Rotate credentials if expired
Check Watson logs for specific permission errors
Use the /sherlock-status command to see integration health

Telemetry not flowing

Possible Causes

Watson agent not running
Network connectivity issues
Integration credentials incorrect

Solutions

Check Watson pod status: kubectl get pods -n sherlocks
View Watson logs: kubectl logs -n sherlocks -l app=watson
Verify network policies allow outbound connections
Test integration credentials manually

Slack not responding

Possible Causes

Sherlocks platform outage
Rate limiting
Investigation queue backlog

Solutions

Check status.sherlocks.ai for platform status
Wait a few minutes and retry
Use /sherlock-status to check queue depth
Contact support if issue persists

Stuck investigations

Possible Causes

LLM provider timeout
Insufficient data to analyze
Complex query requiring more time

Solutions

Wait up to 5 minutes for complex investigations
Rephrase query to be more specific
Check LLM provider status
Cancel and restart investigation

LLM timeouts or restrictions

Possible Causes

LLM API rate limits hit
LLM provider outage
Content filtering triggered

Solutions

Configure fallback LLM provider
Increase rate limits with your LLM provider
Review content filtering policies
Switch to self-hosted LLM
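
The fallback idea can be sketched independently of any particular LLM vendor. In the example below, providers are plain callables tried in order. The function name, the exception type, and the two fake providers are hypothetical and stand in for real client calls.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider raised an error."""


def complete_with_fallback(prompt: str, providers: Sequence[tuple[str, Callable[[str], str]]]):
    """Try each (name, call) provider in order and return (name, response) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: timeouts, rate limits, outages
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def primary(prompt: str) -> str:
    raise TimeoutError("rate limit hit")  # simulates a failing primary provider


def secondary(prompt: str) -> str:
    return f"analysis of: {prompt}"  # simulates a healthy fallback provider


result = complete_with_fallback("why is the API slow?", [("primary", primary), ("secondary", secondary)])
print(result)
```

Ordering the list by preference keeps the primary provider in use whenever it is healthy and only pays the fallback cost on failure.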

Getting Help

If you're still experiencing issues:

Email: support@sherlocks.ai
Slack: Join our community Slack workspace
Documentation: docs.sherlocks.ai
Status Page: status.sherlocks.ai

Include Watson logs and investigation IDs when contacting support for faster resolution.

Frequently Asked Questions

Common questions about Sherlocks capabilities and deployment

Glossary

Key terms and concepts in the Sherlocks platform

Awareness Graph

A living knowledge base that maps your entire system's architecture, dependencies, behavior patterns, and incident history. It continuously evolves as your system changes.

Investigator

The AI reasoning engine that analyzes telemetry, correlates signals, and generates root cause analyses.

Watson

The data collection agent that runs in your infrastructure with read-only access to gather telemetry and metadata.

Signals

Data points used in investigations: metrics, logs, traces, events, and Slack conversations.

RCA (Root Cause Analysis)

A comprehensive analysis identifying the primary cause of an incident, contributing factors, timeline, and recommended remediation.

Hypothesis

A potential explanation for an incident that Sherlocks generates and tests against available data.

Forensics

The process of analyzing historical data to understand what happened during an incident.

Incident Memory

Sherlocks' record of past incidents, their causes, and resolutions, used to identify patterns and accelerate future investigations.

Playbooks

Predefined investigation patterns and remediation steps for common incident types.

Blast Radius

The scope of services and users affected by an incident.

MTTR (Mean Time To Resolution)

The average time it takes to resolve an incident from detection to fix.
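
As a worked example of this definition, the sketch below averages resolution times over a list of incidents. The timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents) -> timedelta:
    """Average (resolved - detected) over a list of (detected, resolved) pairs."""
    durations = [resolved - detected for detected, resolved in incidents]
    if not durations:
        raise ValueError("at least one incident is required")
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents: one resolved in 30 minutes, one in 90 minutes.
incidents = [
    (datetime(2024, 1, 15, 10, 0), datetime(2024, 1, 15, 10, 30)),
    (datetime(2024, 1, 15, 12, 0), datetime(2024, 1, 15, 13, 30)),
]
mttr = mean_time_to_resolution(incidents)
print(mttr)
```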

Toil

Repetitive, manual operational work that doesn't provide lasting value (e.g., manually correlating logs during incidents).

Custom Instructions

Team-specific guidance you provide to Sherlocks about how to investigate issues, communicate findings, and prioritize concerns.

Assistant Levels

Different modes of Sherlocks operation: Intern (basic analysis), Senior (detailed investigation), Architect (strategic recommendations).
