
Alert on Causes, Not Symptoms: The Fastest Way to Reduce MTTR

Tags: Alerting · SRE · DevOps · Incident Management · Observability

Your Pingdom alert fires at 3 AM. The site is down.

You fumble for your laptop, log into Slack, and begin the investigation. Check the deployment history. Pull up Grafana. Scroll through logs. SSH into the server. Query the database. Ten minutes later, you discover the problem: a deployment scaled your replicas to zero, and Kubernetes dutifully reported zero endpoints for your service.

The truth is, your infrastructure knew about the problem all along. It just didn't say so clearly.

This is the gap between symptom-based and cause-based alerting, and it can mean the difference between a 10-minute investigation and a 10-second one.

The Symptom-Cause Gap

Most alerting systems work backward. They detect the problem's effect and leave you to discover the cause.

Symptom-Based Alerting

  • Alert fires (0 min): "Site unreachable" (Pingdom)
  • Acknowledgment (~5 min): engineer wakes up, logs in
  • Context gathering (5-15 min): dashboards, logs, recent changes
  • Root cause identification (5-20 min): investigate multiple possibilities
  • Remediation (2-5 min): apply fix

Total time: 17-45 min

Cause-Based Alerting

  • Alert fires (0 min): "payment-service: replicas=0, endpoints=0"
  • Acknowledgment (~5 min): engineer wakes up, logs in
  • Context gathering (0 min): the alert IS the context
  • Root cause identification (0 min): you already know
  • Remediation (2-5 min): apply fix

Total time: 7-10 min

Time saved per incident: 10-35 minutes

When downtime costs $14,056 per minute on average - rising to $23,750 for large enterprises according to EMA Research's 2024 report - those minutes aren't academic. They're existential.

Why We Default to Symptom Alerts

Symptom-based alerts dominate because they're easy to set up. You don't need deep system knowledge to configure a Pingdom check. You don't need to understand Kubernetes internals to set a latency threshold.

This is a feature when you're getting started. It becomes a liability at scale.

The problem compounds as more symptom alerts are added. You might set up a Pingdom check, then add alerts for latency, error rates, CPU, and memory. Each alert responds to a different sign of the same underlying problem.

  • 4,484: average alerts per day that teams field (Vectra, "2023 State of Threat Detection")
  • 67%: share of alerts ignored due to false positives and alert fatigue (Vectra, 2023)

The Cause-Alert Catalog

Common symptom alerts and their cause-based alternatives

Site Down

  • Symptom alert: Pingdom/synthetic check fails
  • Cause alert: deployment.replicas == 0 && service.endpoints == 0
  • Payoff: eliminates 15-20 minutes of investigation per incident

High Latency

  • Symptom alert: P99 latency > 500ms
  • Cause alert: db.connection_pool.active / db.connection_pool.max > 0.9
  • Payoff: points directly to connection pool exhaustion

API Errors

  • Symptom alert: 5xx error rate > 1%
  • Cause alert: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
  • Payoff: identifies memory issues immediately

Request Timeouts

  • Symptom alert: API timeout rate > 0.5%
  • Cause alert: kafka_consumer_lag > 10000 or sqs_queue_depth > threshold
  • Payoff: leading indicator before user-visible impact

Health Check Failures

  • Symptom alert: health check failure count > 3
  • Cause alert: (cert_expiry_timestamp - now()) / 86400 < 7
  • Payoff: alert 7 days before expiry, not after the outage

Slow Page Loads

  • Symptom alert: page load time > 3s
  • Cause alert: cdn_cache_miss_rate > 50%
  • Payoff: points directly to CDN configuration

Service Unavailable

  • Symptom alert: service returning 503s
  • Cause alert: upstream_service_health == unhealthy or circuit_breaker_state == open
  • Payoff: identifies the actual failure point, not downstream effects
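As one concrete sketch, here are the first two catalog entries written as Prometheus alerting rules. The kube_deployment_spec_replicas metric comes from kube-state-metrics; the db_connection_pool_* names and the payment-service label are placeholders for whatever your own exporters emit:

```yaml
groups:
  - name: cause-alerts
    rules:
      # Site down, stated as its cause: the deployment was scaled to zero.
      - alert: PaymentServiceScaledToZero
        expr: kube_deployment_spec_replicas{deployment="payment-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "payment-service: replicas=0 - check the last deployment"

      # High latency, stated as its cause: connection pool near exhaustion.
      - alert: DBConnectionPoolNearExhaustion
        expr: db_connection_pool_active / db_connection_pool_max > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DB connection pool above 90% - latency will follow"
```

Note how the annotation carries the next step ("check the last deployment"), so the 3 AM responder starts with the answer, not a search.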

The Math of Investigation Time

Research from the University of California, Irvine (Dr. Gloria Mark) shows it takes an average of 23 minutes and 15 seconds to fully regain focus after an interruption.

A back-of-envelope example: at 15 incidents per month, symptom-based investigation (roughly 20 minutes per incident) adds up to about 5 hours a month, versus about 1.25 hours (roughly 5 minutes per incident) with cause-based alerts. The 3.75 hours saved each month come to roughly $6,750 per year at a $150/hr engineering rate.
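The arithmetic is simple enough to sketch; the per-incident times and the $150/hr rate below are this article's assumptions, not measurements:

```python
# Back-of-envelope model of investigation-time savings. All inputs are
# assumptions from the article: 15 incidents/month, ~20 min of
# investigation per symptom-based incident vs ~5 min per cause-based
# incident, and a $150/hr engineering rate.

def yearly_savings(incidents_per_month: int,
                   symptom_min: float,
                   cause_min: float,
                   hourly_rate: float) -> tuple[float, float]:
    """Return (engineer-hours saved per month, dollars saved per year)."""
    hours_per_month = incidents_per_month * (symptom_min - cause_min) / 60
    return hours_per_month, hours_per_month * 12 * hourly_rate

hours, dollars = yearly_savings(15, 20, 5, 150)
print(f"{hours:.2f} engineer-hours/month, ${dollars:,.0f}/year")
# → 3.75 engineer-hours/month, $6,750/year
```

Plug in your own incident count and rate; the conclusion usually survives the substitution.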

The real savings are even larger:

Context switching costs: Each incident pulls engineers out of flow state

After-hours impact: 3 AM investigations are disproportionately expensive

Cascade prevention: Faster identification prevents incidents from escalating

Knowledge capture: Cause-based alerts encode institutional knowledge

Implementing Cause-Based Alerts

The shift from symptom to cause alerting requires three things

1. Instrument the Causes

You can't alert on what you don't measure.

  • Resource utilization (connections, memory, CPU, file descriptors)
  • Queue depths and consumer lag
  • Certificate expiry dates
  • Dependency health states
  • Kubernetes pod states and termination reasons
  • Recent deployment status

2. Map Symptoms to Causes

For each symptom alert, ask: "What are the top 3 causes when this fires?"

  • Start with your noisiest alerts
  • Identify alerts that fire frequently
  • Focus on alerts requiring significant investigation
  • These offer the highest ROI for conversion
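A lightweight way to run this mapping exercise is a table of each noisy symptom alert's top suspected causes and whether each cause already has a metric. Every alert name and cause below is illustrative, not from a real system; the uninstrumented entries become your conversion backlog:

```python
# For each symptom alert: list of (suspected cause, already instrumented?).
symptom_to_causes = {
    "SiteUnreachable (Pingdom)": [
        ("deployment scaled to 0 replicas", True),
        ("TLS certificate expired", True),
        ("DNS misconfiguration", False),
    ],
    "P99LatencyHigh": [
        ("DB connection pool > 90% utilized", True),
        ("GC pause spike", False),
        ("slow downstream dependency", True),
    ],
}

def instrumentation_gaps(mapping: dict) -> dict:
    """Return {symptom: [causes with no metric yet]} - the work queue."""
    return {symptom: [cause for cause, instrumented in causes if not instrumented]
            for symptom, causes in mapping.items()}

for symptom, gaps in instrumentation_gaps(symptom_to_causes).items():
    print(f"{symptom}: {len(gaps)} uninstrumented cause(s): {gaps}")
```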

3. Layer Your Alerting

Cause-based alerts shouldn't replace symptom alerts entirely.

  • Layer 1: Cause alerts (immediate, actionable)
  • Layer 2: Symptom alerts (backstop, catch-all)
  • Symptom alerts become your safety net
  • If symptoms fire without causes, you've found a gap
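In Prometheus terms, the layering is simply two rule groups with different labels. A sketch, assuming kube-state-metrics is running and that your services expose a conventional http_requests_total counter:

```yaml
groups:
  # Layer 1 - cause alerts: immediate and actionable.
  - name: layer-1-causes
    rules:
      - alert: ContainerOOMKilled
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
        labels:
          layer: cause
          severity: critical
        annotations:
          summary: "Container killed for exceeding its memory limit"

  # Layer 2 - symptom alerts: the backstop. If one of these fires with no
  # cause alert firing, you've found an instrumentation gap.
  - name: layer-2-symptoms
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 10m
        labels:
          layer: symptom
          severity: page
```

The layer label makes the gap analysis a query: symptom alerts firing alone are exactly the incidents your cause instrumentation doesn't cover yet.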

The Cost of Inaction

Alert fatigue isn't just annoying; it's dangerous. When your alerting system cries wolf too often, the real wolves slip through.

  • 30%: drop in alert acceptance for each repeated reminder (Atlassian)
  • 3x: increase in error rates from even 5-second interruptions (APA)
  • 41%: of enterprises estimate hourly downtime costs at $1M+ (ITIC, 2024)

The answer is not to have fewer alerts, but to have better ones. Good alerts should tell you exactly what is wrong, not just that there is a problem.

Conclusion

The purpose of alerting is not just to notify you, but to help you resolve issues quickly.

When you move from symptom-based alerts ("The site is slow") to cause-based alerts ("The connection pool is exhausted"), you remove the most time-consuming part of incident response: the investigation.

If your alerts show you exactly what is wrong, your mean time to resolution (MTTR) can drop from minutes or hours to just seconds. This is more than just an operational improvement; it changes how your team works with your infrastructure.

"If your alerting system only tells you that something is broken - but not why or what to do next - it's not an alerting system. It's just a very loud graph."

References

  1. EMA Research. (2024). "IT Outages: 2024 Costs and Containment." Enterprise Management Associates.
  2. Vectra. (2023). "State of Threat Detection." Security research report.
  3. International Data Corporation. (2021). Research on alert investigation rates across enterprise organizations.
  4. Mark, G. et al. "The Cost of Interrupted Work: More Speed and Stress." University of California, Irvine.
  5. ITIC. (2024). "Hourly Cost of Downtime Survey." Information Technology Intelligence Consulting.
  6. Atlassian. "Understanding and Fighting Alert Fatigue." Incident Management documentation.
  7. incident.io. (2024-2025). Research on coordination overhead and MTTR in incident response.
  8. American Psychological Association. Research on interruption impact on cognitive work performance.

This post is part of Sherlocks.ai's ongoing research into AI-powered site reliability engineering. We are developing AI SRE teammates that understand causes, not just symptoms, so they can cut investigation time from minutes to seconds.
