Alert on Causes, Not Symptoms: The Fastest Way to Reduce MTTR
Your Pingdom alert fires at 3 AM. The site is down.
You fumble for your laptop, log into Slack, and begin the investigation. Check the deployment history. Pull up Grafana. Scroll through logs. SSH into the server. Query the database. Ten minutes later, you discover the problem: a deployment scaled your replicas to zero, and Kubernetes dutifully reported zero endpoints for your service.
The truth is, your infrastructure was aware of the issue. It just didn't communicate it clearly.
This is the key difference between symptom-based and cause-based alerting. It can mean the difference between spending 10 minutes or just 10 seconds on an investigation.
The Symptom-Cause Gap
Most alerting systems work backward. They detect the problem's effect and leave you to discover the cause.
Symptom-Based Alerting
- Alert fires: "Site unreachable" (Pingdom)
- Engineer wakes up, logs in
- Gathers context from dashboards, logs, and recent changes
- Investigates multiple possibilities
- Applies the fix

Cause-Based Alerting
- Alert fires: "payment-service: replicas=0, endpoints=0"
- Engineer wakes up, logs in
- The alert IS the context
- The cause is already known
- Applies the fix

Time saved per incident: 10-35 minutes
When downtime costs $14,056 per minute on average - rising to $23,750 for large enterprises according to EMA Research's 2024 report - those minutes aren't academic. They're existential.
Why We Default to Symptom Alerts
Symptom-based alerts dominate because they're easy to set up. You don't need deep system knowledge to configure a Pingdom check. You don't need to understand Kubernetes internals to set a latency threshold.
This is a feature when you're getting started. It becomes a liability at scale.
The problem compounds as more symptom alerts are added. You might set up a Pingdom check, then add alerts for latency, error rates, CPU, and memory. Each alert responds to a different sign of the same underlying problem.
- Average alerts per day that teams field (Vectra, 2023 State of Threat Detection)
- Share of alerts ignored due to false positives and alert fatigue (Vectra, 2023)
The Cause-Alert Catalog
Common symptom alerts and their cause-based alternatives:

Site Down
- Symptom alert: Pingdom/synthetic check fails
- Cause alert: deployment.replicas == 0 && service.endpoints == 0
- Payoff: eliminates 15-20 minutes of investigation per incident

High Latency
- Symptom alert: P99 latency > 500ms
- Cause alert: db.connection_pool.active / db.connection_pool.max > 0.9
- Payoff: points directly to connection pool exhaustion

API Errors
- Symptom alert: 5xx error rate > 1%
- Cause alert: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
- Payoff: identifies memory issues immediately

Request Timeouts
- Symptom alert: API timeout rate > 0.5%
- Cause alert: kafka_consumer_lag > 10000 or sqs_queue_depth > threshold
- Payoff: leading indicator that fires before user-visible impact

Health Check Failures
- Symptom alert: health check failure count > 3
- Cause alert: (cert_expiry_timestamp - now()) / 86400 < 7
- Payoff: alerts 7 days before certificate expiry, not after the outage

Slow Page Loads
- Symptom alert: page load time > 3s
- Cause alert: cdn_cache_miss_rate > 50%
- Payoff: points directly to CDN configuration

Service Unavailable
- Symptom alert: service returning 503s
- Cause alert: upstream_service_health == unhealthy or circuit_breaker_state == open
- Payoff: identifies the actual failure point, not its downstream effects
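In Prometheus terms, the first catalog entry might look like the following rule sketch. This is an illustration, not a prescribed implementation: the metric names assume kube-state-metrics is installed, and "payment-service" is a placeholder.

```yaml
groups:
  - name: cause-based-alerts
    rules:
      # Cause alert: the deployment was scaled to zero AND the service
      # has no ready endpoints. Metric names come from kube-state-metrics;
      # "payment-service" is a placeholder for your own workload.
      - alert: PaymentServiceScaledToZero
        expr: |
          kube_deployment_spec_replicas{deployment="payment-service"} == 0
          and on() kube_endpoint_address_available{endpoint="payment-service"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "payment-service: replicas=0, endpoints=0"
          description: "A recent change likely scaled payment-service to zero replicas."
```

Note the `on()` in the vector match: the two metrics carry different label sets (`deployment` vs `endpoint`), so without it the `and` would never match.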
The Math of Investigation Time
Research from the University of California, Irvine (Dr. Gloria Mark) shows it takes an average of 23 minutes and 15 seconds to fully regain focus after an interruption.
[Interactive calculator: incidents/month and investigation time (symptom-based vs cause-based) roll up into dollars saved per year at $150/hr]
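The underlying arithmetic is simple. Here is a sketch with illustrative inputs; the 20 incidents/month and the 30-minute vs 5-minute investigation times are assumptions, while $150/hr is the rate used above.

```python
def annual_savings(incidents_per_month: int,
                   symptom_minutes: float,
                   cause_minutes: float,
                   hourly_rate: float = 150.0) -> float:
    """Engineer-time dollars saved per year by cutting investigation time."""
    minutes_saved = symptom_minutes - cause_minutes
    hours_saved_per_year = incidents_per_month * 12 * minutes_saved / 60
    return hours_saved_per_year * hourly_rate

# Illustrative inputs (assumptions): 20 incidents/month,
# 30 min symptom-based vs 5 min cause-based investigation.
print(annual_savings(20, 30, 5))  # → 15000.0
```

That is $15,000/year in raw engineer time alone, before counting any of the second-order costs below.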
The real savings are even larger:
Context switching costs: Each incident pulls engineers out of flow state
After-hours impact: 3 AM investigations are disproportionately expensive
Cascade prevention: Faster identification prevents incidents from escalating
Knowledge capture: Cause-based alerts encode institutional knowledge
Implementing Cause-Based Alerts
The shift from symptom-based to cause-based alerting requires three things:
Instrument the Causes
You can't alert on what you don't measure.
- Resource utilization (connections, memory, CPU, file descriptors)
- Queue depths and consumer lag
- Certificate expiry dates
- Dependency health states
- Kubernetes pod states and termination reasons
- Recent deployment status
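As a sketch of what "measuring the causes" can look like at the application level, here is a minimal, hypothetical snapshot of cause-level signals and the checks against them. The field names and thresholds mirror the catalog above but are illustrative, not a standard API.

```python
import time
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CauseSignals:
    """A snapshot of cause-level metrics an app might export."""
    pool_active: int        # DB connections currently checked out
    pool_max: int           # DB connection pool size
    consumer_lag: int       # Kafka/SQS backlog, in messages
    cert_expiry_ts: float   # TLS cert expiry, Unix seconds

def failing_causes(s: CauseSignals, now: Optional[float] = None) -> List[str]:
    """Return the cause conditions that should page, using
    illustrative thresholds (0.9 pool utilization, 10,000 lag,
    7 days to certificate expiry)."""
    now = time.time() if now is None else now
    causes = []
    if s.pool_active / s.pool_max > 0.9:
        causes.append("connection pool near exhaustion")
    if s.consumer_lag > 10_000:
        causes.append("consumer lag above threshold")
    if (s.cert_expiry_ts - now) / 86_400 < 7:
        causes.append("TLS certificate expires within 7 days")
    return causes
```

In practice these signals would be exported as metrics (e.g. via a Prometheus client library) rather than checked in-process, but the point stands: each check names a cause, not a symptom.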
Map Symptoms to Causes
For each symptom alert, ask: "What are the top 3 causes when this fires?"
- Start with your noisiest alerts
- Identify alerts that fire frequently
- Focus on alerts requiring significant investigation
- These offer the highest ROI for conversion
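One way to pick conversion targets is to score each symptom alert by the investigation time it consumes per month. This prioritizer is a hypothetical sketch; the alert names and numbers are made up.

```python
from typing import Dict, List, Tuple

def conversion_roi(alerts: List[Dict]) -> List[Tuple[str, float]]:
    """Rank symptom alerts by monthly investigation cost
    (fires/month x investigation minutes per fire), highest first."""
    scored = [(a["name"], a["fires_per_month"] * a["investigation_minutes"])
              for a in alerts]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Hypothetical alert inventory.
inventory = [
    {"name": "P99 latency > 500ms", "fires_per_month": 12, "investigation_minutes": 25},
    {"name": "Pingdom check failed", "fires_per_month": 3, "investigation_minutes": 35},
    {"name": "CPU > 80%", "fires_per_month": 40, "investigation_minutes": 5},
]
print(conversion_roi(inventory)[0][0])  # the costliest alert to keep as-is
```

The top of the list is where a cause-based replacement pays back fastest.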
Layer Your Alerting
Cause-based alerts shouldn't replace symptom alerts entirely.
- Layer 1: Cause alerts (immediate, actionable)
- Layer 2: Symptom alerts (backstop, catch-all)
- Symptom alerts become your safety net
- If symptoms fire without causes, you've found a gap
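In Prometheus terms, the two layers can be distinguished with a label so that dashboards and routing treat them differently. The rule names, metric names, and thresholds here are illustrative:

```yaml
groups:
  - name: layered-alerts
    rules:
      # Layer 1: cause alert - immediate and actionable.
      - alert: DbConnectionPoolNearExhaustion
        expr: db_connection_pool_active / db_connection_pool_max > 0.9
        labels: {severity: critical, layer: cause}
      # Layer 2: symptom alert - the backstop. If this fires with no
      # cause alert active, you've found a gap in your cause coverage.
      - alert: HighP99Latency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        labels: {severity: warning, layer: symptom}
```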
The Cost of Inaction
Alert fatigue isn't just annoying; it's dangerous. When your alerting system cries wolf too often, the real wolves slip through.
- Drop in alert acceptance with each repeated reminder (Atlassian)
- Increase in error rates caused by interruptions as short as 5 seconds (APA)
- Share of enterprises estimating hourly downtime costs at $1M+ (ITIC, 2024)
The answer is not to have fewer alerts, but to have better ones. Good alerts should tell you exactly what is wrong, not just that there is a problem.
Conclusion
The purpose of alerting is not just to notify you, but to help you resolve issues quickly.
When you move from symptom-based alerts ("The site is slow") to cause-based alerts ("The connection pool is exhausted"), you remove the most time-consuming part of incident response: the investigation.
If your alerts show you exactly what is wrong, your mean time to resolution (MTTR) can drop from minutes or hours to just seconds. This is more than just an operational improvement; it changes how your team works with your infrastructure.
"If your alerting system only tells you that something is broken - but not why or what to do next - it's not an alerting system. It's just a very loud graph."
References
- EMA Research. (2024). "IT Outages: 2024 Costs and Containment." Enterprise Management Associates.
- Vectra. (2023). "State of Threat Detection." Security research report.
- International Data Corporation. (2021). Research on alert investigation rates across enterprise organizations.
- Mark, G. et al. "The Cost of Interrupted Work: More Speed and Stress." University of California, Irvine.
- ITIC. (2024). "Hourly Cost of Downtime Survey." Information Technology Intelligence Consulting.
- Atlassian. "Understanding and Fighting Alert Fatigue." Incident Management documentation.
- incident.io. (2024-2025). Research on coordination overhead and MTTR in incident response.
- American Psychological Association. Research on interruption impact on cognitive work performance.
This post is part of Sherlocks.ai's ongoing research into AI-powered site reliability engineering. We are developing AI SRE teammates that understand causes, not just symptoms, so they can cut investigation time from minutes to seconds.
Learn More About Sherlocks.ai