TL;DR

Effective incident investigation rests on four pillars: metrics, logs, traces, and events. Most SRE teams have invested heavily in the first three. The fourth, events, covers deploys, config changes, feature flag flips, and infrastructure actions. This signal is typically scattered across Slack channels, CI/CD logs, and ticket comments. It is not queryable, not correlated to alerts, and not on the on-call engineer's primary screen.

That gap is where most modern incidents live. Almost every P1 has the same first question: “what changed in the last hour?” Events are the only pillar that answers it directly. According to LogicMonitor's SRE Report 2026, median toil still accounts for 34% of SRE time despite growing AI adoption, with the majority of that toil spent reconstructing change context that should have been captured as structured telemetry in the first place.

The Four Pillars of Telemetry is the complete framework for incident investigation. Building the fourth pillar is the highest-leverage observability investment most teams are not making.

Why do most SRE teams have observability gaps even with mature tooling?

Most SRE teams will tell you they have observability. Then a P1 fires at 3 AM, and the on-call engineer ends up grepping Kubernetes logs in one tab, staring at a metrics dashboard in another, opening a trace in Jaeger because the dashboard did not help, and finally messaging the deploy team because none of those three told them what changed thirty minutes before the spike.

That is not observability. That is three telemetry sources stitched together by a tired human under pressure.

The problem is not that teams have bad tools. It is that the standard observability conversation stops at three pillars: metrics, logs, and traces. The fourth pillar, events, exists in every engineering organisation. It is just not treated as telemetry. And it is the pillar most often correlated with incident causation.

As Grafana's 2026 Observability Survey of more than 1,300 practitioners found, teams are spending more on observability than ever, yet incidents still take 20 to 40 minutes to diagnose. The data is there. The causality is not. That distinction matters, and it is what this article addresses. For a deeper look at why this gap persists, see Observability in 2026: More Data, Fewer Answers.

What are the four pillars of telemetry?

The Four Pillars of Telemetry is a framework developed by Sherlocks.ai to describe the complete set of signals required for effective incident investigation. Most observability frameworks stop at three pillars. The fourth is what closes the gap between visibility and understanding.

The Four Pillars of Telemetry was developed by Sherlocks.ai based on analysis of incident investigation patterns across production environments. It is freely usable under CC BY-NC 4.0 with attribution.

Pillar	Signal	What it captures	Key question	Typical tools or sources
1 📊	Metrics	Numeric time series: counters, gauges, histograms	Is something wrong? How bad is it?	Datadog, Prometheus, Grafana
2 📄	Logs	Per-event textual records	What did the system do, in order, for this request?	Splunk, Elastic, Loki, CloudWatch
3 🔗	Traces	Causal records of a request across services	Where did the latency come from?	Jaeger, Honeycomb, Datadog APM
4 ⚡	Events	Deploys, config pushes, flag flips, infra actions	What changed in the last hour?	CI/CD logs, Slack, GitHub Actions, mostly unstructured

Four pillars, four questions; you need all of them to close the gap between symptom and cause

The first three pillars tell you the system is broken and where the symptoms are. The fourth tells you why. That asymmetry is the core of the modern MTTR problem.

Pillar 1

Metrics

Metrics are the default unit of “is something wrong.” They are numeric time series: counters, gauges, and histograms. They drive alerting in most organisations and are the first signal an on-call engineer sees.

What they answer well

Is the rate of X up or down? Is latency at the 99th percentile breaching SLO? Is queue depth growing?

What they do not answer

Why. Metrics are aggregations. By the time you see the spike, the per-request context that caused it has been collapsed into a number.

Where teams over-invest

Building more dashboards. There is a point at which more dashboards reduce the speed of triage rather than improve it. The on-call engineer does not need another panel. They need a hypothesis.

Where teams under-invest

Cardinality discipline. High-cardinality labels like user_id or request_id blow up storage costs, so they are often stripped by relabeling rules or rejected once a backend's series limit is hit, usually without anyone noticing. By the time you need that label during an incident, it is gone.
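One way to make cardinality discipline concrete is to audit label cardinality before samples reach the time-series database. The sketch below is illustrative only: the function names, the input shape (one dict of labels per series), and the threshold of 100 are assumptions, not part of any particular tool.

```python
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values seen for each label name.

    `series` is an iterable of dicts mapping label name -> label value,
    one dict per time series (hypothetical input shape).
    """
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}


def high_cardinality_labels(series, max_values=100):
    """Return the sorted label names whose distinct-value count exceeds max_values."""
    counts = label_cardinality(series)
    return sorted(name for name, count in counts.items() if count > max_values)


# Example: user_id is unique per series, while service has only two values.
sample = [
    {"service": "payment-api" if i % 2 else "checkout", "user_id": f"u{i}"}
    for i in range(500)
]
flagged = high_cardinality_labels(sample, max_values=100)
```

Running an audit like this in CI or against a metrics snapshot makes the decision to keep or drop a label explicit, rather than something discovered mid-incident.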

Pillar 2

Logs

Logs are the narrative of what happened: per-event textual records that capture system behaviour at a granular level.

What they answer well

What did the system actually do, in order, for this specific request or operation?

What they do not answer

Aggregate behaviour. “How often does this happen across the fleet?” requires counting, which logs are poor at. They are individual records, not summaries.

Where teams over-invest

Verbosity. Structured logs at five megabytes per request are not observability; they are a data exhaust pipe. Index costs compound fast, and the signal-to-noise ratio drops with it.

Where teams under-invest

Correlation IDs. Logs without a request_id or trace_id are noise. Logs with one are the spine of incident response. This is one of the highest-leverage, lowest-cost investments a team can make in observability hygiene.
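As a rough illustration of correlation IDs, the sketch below stamps a request ID onto every log line using Python's standard `logging` and `contextvars` modules. The logger name, log format, and the `handle_request` function are hypothetical; a real service would take the ID from an incoming header or generate one at the edge.

```python
import contextvars
import io
import logging

# Holds the correlation ID for the current request; "-" when none is set.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def build_logger(stream):
    """Create a logger whose lines always start with the request ID."""
    logger = logging.getLogger("payment-api")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("request_id=%(request_id)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, request_id):
    """Hypothetical request handler: set the ID, log, then restore the context."""
    token = request_id_var.set(request_id)
    try:
        logger.info("charge started")
        logger.info("charge completed")
    finally:
        request_id_var.reset(token)


buffer = io.StringIO()
log = build_logger(buffer)
handle_request(log, "req-123")
lines = buffer.getvalue().splitlines()
```

Because the ID lives in a context variable, call sites do not need to pass it explicitly, which is what keeps the cost of this investment low.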

Pillar 3

Traces

Traces capture the causal path of a request as it moves across services. Each trace is a directed graph of spans, and each span represents a unit of work in a specific service.

What they answer well

Where did the latency come from? Which downstream service is the bottleneck? What was the call shape for this specific request?

What they do not answer

Anything you did not instrument. Tracing is opt-in by definition. Coverage gaps are invisible until an incident exposes them at the worst possible moment.

Where teams over-invest

Sampling configuration. Most teams spend significant time tuning head-versus-tail sampling. The practical answer is: sample tail, accept the storage cost, and you will thank yourself during the next complex incident.
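To show what "sample tail" means in practice, here is a minimal sketch of a tail-sampling decision: the keep-or-drop choice is made only after all spans of a trace have been collected. The `Span` type, the 500 ms threshold, and the 1% baseline rate are assumptions for illustration; in production this logic usually runs in a collector rather than in application code.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    duration_ms: float
    error: bool = False


def keep_trace(spans, latency_threshold_ms=500.0, baseline_rate=0.01, rng=random.random):
    """Tail-sampling decision made after all spans of a trace are collected.

    Keeps every trace that contains an error or a slow span, plus a small
    random share of the rest so healthy traffic stays represented.
    """
    if any(span.error for span in spans):
        return True
    if any(span.duration_ms >= latency_threshold_ms for span in spans):
        return True
    return rng() < baseline_rate
```

Head sampling cannot make this choice, because it decides before the request has finished and therefore before anyone knows whether the trace is interesting.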

Where teams under-invest

Cross-team trace context propagation. If your team sets a trace header and the downstream team strips it at their gateway, your trace is half a story. The W3C Trace Context standard, which OpenTelemetry propagates by default, exists precisely to standardise this, but adoption requires agreement across teams, not just tooling.
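The sketch below illustrates what a gateway has to do to avoid breaking a trace, assuming the W3C Trace Context `traceparent` header (the format OpenTelemetry propagates by default). It handles version `00` only and assumes header names are already lower-cased; the function names are hypothetical, and a real service would normally rely on an OpenTelemetry SDK propagator instead of hand-rolled parsing.

```python
import re
import secrets

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return trace_id, parent_id, flags


def outgoing_headers(incoming_headers):
    """Build headers for a downstream call, continuing the incoming trace.

    Keeps the trace ID and replaces the parent ID with a new span ID for this
    hop. Starts a new sampled trace when the incoming header is missing or
    invalid (an assumption made for this sketch).
    """
    parsed = parse_traceparent(incoming_headers.get("traceparent", ""))
    span_id = secrets.token_hex(8)
    if parsed is None:
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, _, flags = parsed
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
```

The important property is that the trace ID survives the hop. A gateway that drops or regenerates it splits one request into two unrelated traces.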

Pillar 4

Events

Events are discrete, non-rate state changes. Deploys. Feature flag flips. Config pushes. Database schema migrations. Infrastructure autoscaling actions. Vendor status changes.

What they answer well

What changed. Almost every incident has the same first question: “what was different in the last hour for this service?” Events are the only pillar that answers that question directly, without requiring an engineer to manually reconstruct change history across CI/CD logs, Slack channels, and deployment dashboards.

What they do not answer

Per-request behaviour. Events are coarse by design. They tell you something changed; the other pillars tell you how the system responded.

Where teams under-invest

Treating events as first-class telemetry. Walk into a typical SRE organisation and the event stream looks like this: deploy notifications in a Slack channel, GitHub Actions logs behind an auth wall, Terraform Cloud notifications in a separate tool, PagerDuty incident streams not connected to anything, vendor status page updates that someone has to manually check. None of it is structured. None of it is queryable. None of it is correlated to the alert that fired.

The deploy that broke production is in your CI/CD logs. It is also almost certainly not correlated to your alert. That gap is where MTTR lives.

Why do most SRE teams have three pillars but not four?

The pattern is consistent across organisations of every size:

Pillar 1: Metrics (Mature)

Datadog or Prometheus plus Grafana. Mature, well-staffed, heavily invested.

Pillar 2: Logs (Mature)

Splunk, Elastic, Loki, or a cloud-native variant. Mature.

Pillar 3: Traces (Maturing)

Datadog APM, Honeycomb, or an OpenTelemetry-based stack. Maturing.

Pillar 4: Events (Missing)

Slack messages from a deploys channel, GitHub Actions logs, Terraform Cloud notifications, PagerDuty incident streams, vendor status pages. Not stitched together. Not queryable. Not on the on-call engineer's primary screen during an incident.

The fourth pillar exists in every organisation. It just is not treated as telemetry.

There are two reasons for this. First, events come from many different systems: CI/CD, feature flag management, infrastructure tooling, and external vendors. No single team owns all of them. Second, there is no natural home for events in the standard observability stack. Metrics have Prometheus. Logs have Elastic. Traces have Jaeger. Events have Slack, which is not a telemetry system.

The result is that the one signal most directly correlated with incident causation is the least structured, least queryable, and least accessible during active triage.

What does a well-built fourth pillar actually look like?

A mature events pillar has four characteristics.

1. A single event bus

Deploys, feature flag changes, config pushes, infrastructure changes, and vendor advisories all land in one place with structured payloads. The source does not matter. The schema does.
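"The source does not matter. The schema does" can be sketched as a thin normalisation layer in front of the bus. Everything below is hypothetical: the two incoming payload shapes, their field names, and the `reference` field are invented for illustration and do not match any specific vendor's webhook format.

```python
from datetime import datetime


def from_deploy(payload):
    """Map a hypothetical CI deploy payload onto the shared schema."""
    return {
        "timestamp": payload["finished_at"],
        "service": payload["app"],
        "actor": payload["pipeline"],
        "action": "deploy",
        "reference": payload["sha"],
    }


def from_flag_change(payload):
    """Map a hypothetical feature-flag payload onto the shared schema."""
    return {
        "timestamp": payload["changed_at"],
        "service": payload["project"],
        "actor": payload["user"],
        "action": "flag_flip",
        "reference": payload["flag_key"],
    }


# One normaliser per source; adding a source means adding one entry here.
NORMALISERS = {"ci": from_deploy, "flags": from_flag_change}


def normalise(source, payload):
    """Convert a source-specific payload into one event with a common shape."""
    event = NORMALISERS[source](payload)
    # Reject malformed timestamps early so they never reach the bus.
    datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
    return event
```

Once every source passes through a function like `normalise`, downstream queries can filter on `service` and `action` without knowing where an event came from.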

2. Structured metadata on every event

Every event carries a timestamp, a service identifier, an actor (who or what triggered it), and a reference such as a commit SHA, flag name, or ticket number. Free-text descriptions are not enough.

Here is what a structured event should look like:

{
  "timestamp": "2026-05-14T10:32:21Z",
  "service": "payment-api",
  "actor": "deploy-system",
  "action": "deploy",
  "version": "v2.3.1",
  "commit": "a7f3e9b",
  "environment": "production",
  "triggered_by": "github-actions",
  "ticket": "PROD-1423"
}

Every event in your bus should look roughly like this. If it does not have a service identifier and an actor, it is a notification, not telemetry.
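The "notification, not telemetry" rule is easy to enforce at ingestion. The following is a minimal sketch, assuming events arrive as JSON objects shaped like the example above; the function name and error strings are illustrative.

```python
import json
from datetime import datetime

# Fields without which an event is only a notification.
REQUIRED_FIELDS = ("timestamp", "service", "actor")


def validate_event(raw):
    """Parse a JSON event and return (event, problems).

    An empty problems list means the event carries the required fields and a
    parseable ISO 8601 timestamp.
    """
    event = json.loads(raw)
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not event.get(name)]
    if event.get("timestamp"):
        try:
            datetime.fromisoformat(str(event["timestamp"]).replace("Z", "+00:00"))
        except ValueError:
            problems.append("timestamp is not ISO 8601")
    return event, problems
```

Rejecting or quarantining events with a non-empty problems list keeps the event store queryable as more sources are added.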

3. Events queryable alongside metrics, logs, and traces

You should be able to overlay “deploys to service X” on the latency graph for service X without switching tabs. If your events live in a separate tool from your metrics, they will not be consulted during the first ten minutes of an incident, which is when they matter most.
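Whatever the storage backend, the query behind such an overlay is simple: events for one service inside a time window ending at the moment of interest. The sketch below uses an in-memory list purely for illustration; `events_before` and its one-hour default are assumptions, and a real implementation would push the filter down to the event store.

```python
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an ISO 8601 UTC timestamp such as 2026-05-14T10:32:21Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def events_before(events, service, at, lookback=timedelta(hours=1)):
    """Return events for `service` within `lookback` before `at`, newest first."""
    end = parse_ts(at)
    start = end - lookback
    matching = [
        e for e in events
        if e["service"] == service and start <= parse_ts(e["timestamp"]) <= end
    ]
    return sorted(matching, key=lambda e: parse_ts(e["timestamp"]), reverse=True)


events = [
    {"timestamp": "2026-05-14T10:32:21Z", "service": "payment-api", "action": "deploy"},
    {"timestamp": "2026-05-14T08:00:00Z", "service": "payment-api", "action": "flag_flip"},
    {"timestamp": "2026-05-14T10:40:00Z", "service": "checkout", "action": "deploy"},
    {"timestamp": "2026-05-14T10:50:00Z", "service": "payment-api", "action": "config_push"},
]
# What changed for payment-api in the hour before an 11:00 latency spike?
recent = events_before(events, "payment-api", "2026-05-14T11:00:00Z")
```

The result can be drawn as vertical markers on the latency graph for the same service and time range.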

4. Events included in alert payloads

Alerts should surface the most recent N events for the affected service as part of the notification, not as a link to click later. The most expensive minutes in any incident are between page-fired and first hypothesis. If those minutes can go from “open four tabs and remember the context” to “read the alert and know what changed,” MTTR drops before any new tooling is added.
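A minimal sketch of that enrichment step follows. The alert shape (`service`, `fired_at`, `description`) and the `enrich_alert` function are hypothetical, and the code assumes all timestamps are UTC strings in the same ISO 8601 format so that plain string comparison orders them correctly.

```python
def enrich_alert(alert, events, limit=5):
    """Return a copy of `alert` with the most recent events for its service attached.

    Only events at or before the alert's firing time are considered.
    """
    recent = sorted(
        (
            e for e in events
            if e["service"] == alert["service"] and e["timestamp"] <= alert["fired_at"]
        ),
        key=lambda e: e["timestamp"],
        reverse=True,
    )[:limit]
    lines = [f'{e["timestamp"]} {e["action"]} by {e["actor"]}' for e in recent]
    summary = "\n".join(lines) if lines else "no recent events recorded"
    return {
        **alert,
        "recent_events": recent,
        "description": f'{alert["description"]}\n\nRecent changes:\n{summary}',
    }
```

Appending the summary to the description means the change context is visible in the page itself, even in tools that only render plain text.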

How to start building your fourth pillar this week

You do not need a new vendor. You need a pattern.

1. Identify all sources of change events in your organisation: deploys, feature flags, config changes, infrastructure actions, vendor status updates. (1 day)

2. Pick one source. Deploys are the highest leverage. Send every deploy event to a structured destination: a BigQuery table, a ClickHouse database, or even a dedicated Slack channel with a webhook that parses JSON. (2-3 days)

3. Add three required fields to every event: timestamp, service, actor. Without these, it is noise. (1 day)

4. Surface recent events in your alert payloads. If your alerting system can pull from your event store, do it. If not, append the last five events to the alert description manually. (1-2 days)

5. Repeat for feature flags, then config changes, then infrastructure. Within a quarter, you will have a functioning fourth pillar. (Ongoing)
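For the deploy step, a first version could be a small script run at the end of the pipeline. The sketch below assumes GitHub Actions, which sets the `GITHUB_ACTOR` and `GITHUB_SHA` environment variables; other CI systems expose equivalents under different names. The `build_deploy_event` function and the idea of printing one JSON line for a later step to ship are illustrative choices, not a prescribed design.

```python
import json
import os
from datetime import datetime, timezone


def build_deploy_event(service, environment, env=os.environ):
    """Build a structured deploy event from CI environment variables.

    Falls back to "unknown" when a variable is absent, so the event still
    carries the required timestamp, service, and actor fields.
    """
    return {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "service": service,
        "actor": env.get("GITHUB_ACTOR", "unknown"),
        "action": "deploy",
        "commit": env.get("GITHUB_SHA", "unknown")[:7],
        "environment": environment,
        "triggered_by": "github-actions",
    }


# One JSON line per deploy; a later pipeline step can send it to the event store.
event_line = json.dumps(build_deploy_event("payment-api", "production"))
```

Keeping the event builder separate from the transport makes it straightforward to change the destination later without touching the schema.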

The goal is not perfection. The goal is to move from “events are scattered” to “events are structured and queryable.” Start small: one event type, one destination, and the three required fields.

How should your team prioritise telemetry investment?

Different teams are missing different pillars. The right investment depends on where investigation time is actually being lost.

By gap type

If your team is saying this…	The gap is probably here	Start with this
“We never have enough data when an incident fires”	Metrics or logs	Audit instrumentation gaps, add structured logging with correlation IDs
“We can see something is wrong but cannot tell where”	Traces	Add distributed tracing with proper context propagation across service boundaries
“We know there is a problem but cannot figure out what changed”	Events	Build a structured event bus; start with deploys and feature flag changes
“We have all the data but it takes forever to connect it”	Intelligence layer	Evaluate an AI SRE platform that correlates across all four pillars automatically

By team size

Team size	Which pillars to prioritise	Why
Under 50 engineers	Metrics, logs, basic events	Focus on logs with correlation IDs and a basic deploy event stream. A full event bus is not justified yet, but correlating deploys to alerts is.
50 to 200 engineers	Metrics, logs, traces, structured events	Distributed tracing and a structured event bus become high value. This is the stage where incidents start involving multiple teams and manual context reconstruction becomes the bottleneck.
200+ engineers	All four pillars and an AI correlation layer	At this scale, the cost of missing any pillar during an incident is measured in engineer-hours per week. An AI layer that reasons across all four pillars starts to show clear ROI.

How does an AI SRE platform use the four pillars differently?

An AI SRE platform that consumes only metrics and logs is a slightly faster grep. The platforms that work in production consume all four pillars, and they reason about them in a specific order.

Step 1 ⚡ Events first

“What changed in the last hour?”

Answers the majority of P1s before looking at anything else. Events are causally upstream of symptoms.

Step 2 📊 Metrics for scoping

“How bad is it?”

Confirms blast radius and when degradation started relative to the change.

Step 3 🔗 Traces for attribution

“Where is it failing?”

Identifies the specific service boundary where the failure is occurring. Walked structurally, not summarised.

Step 4 📄 Logs for verification

“Does the hypothesis hold?”

Scoped per-incident via correlation ID. Used to confirm or rule out a hypothesis, not fed in bulk.

AI investigation order: causal first, symptomatic last

An agent that starts with logs is prone to hallucinating, because logs are unstructured and lossy at scale. An agent that starts with events tends to finish faster because it is working causally rather than symptomatically. This ordering mirrors how experienced on-call engineers triage.

For a comparison of AI SRE platforms and how they handle signal correlation across all four pillars, see Top AI SRE Tools in 2026. For a foundational explanation of what AI SRE is and what it addresses, see What Is AI SRE in 2026.

Key takeaways

• The standard observability conversation stops at three pillars. The fourth, events, is the one most directly correlated with incident causation and the one most teams have not built as structured telemetry.
• The Four Pillars of Telemetry: Metrics (is something wrong?), Logs (what did the system do?), Traces (where did latency come from?), Events (what changed?). You need all four to close the gap between seeing a symptom and understanding a cause.
• The highest-leverage observability investment most teams are not making is building a structured, queryable event bus and surfacing recent events in alert payloads. This does not require a new vendor. It requires treating change data as first-class telemetry.
• If your team is consistently spending the first 20 to 30 minutes of an incident reconstructing what changed, the missing pillar is events, not better dashboards.
• An AI SRE platform that reasons in the correct order (events first, then metrics, then traces, then logs) finishes faster because it works causally rather than symptomatically.

Frequently Asked Questions

What are the four pillars of telemetry in SRE?

The four pillars are metrics, logs, traces, and events. Metrics capture numeric system health data. Logs record per-event system behaviour. Traces follow requests across service boundaries. Events capture discrete state changes like deploys, config pushes, and feature flag flips.

Why do most teams only have three telemetry pillars?

Events come from many different systems: CI/CD pipelines, feature flag tools, infrastructure platforms, and external vendors. No single team owns all of them. There is also no natural home for events in the standard observability stack the way metrics have Prometheus or logs have Elastic. As a result, events end up in Slack channels and CI/CD logs rather than a queryable telemetry system.

What is the most common cause of slow MTTR in SRE teams?

The investigation phase, specifically the time spent reconstructing what changed before a root cause is identified. According to DORA's State of DevOps research, incidents have increased as delivery velocity accelerated, but investigation tooling has not kept pace. A structured events pillar directly addresses this bottleneck.

What should a structured event in observability look like?

Every event should carry a timestamp, a service identifier, an actor (human or system that triggered it), and a reference such as a commit SHA, flag name, or ticket number. See the JSON example in this article for a concrete schema. If an event does not have a service identifier and an actor, it is a notification, not telemetry.

How does an AI SRE platform use telemetry differently from a human engineer?

An AI SRE platform reasons across all four pillars in a specific order: events first to identify what changed, metrics to scope the blast radius, traces to attribute the failure to a specific service boundary, and logs to verify the hypothesis. Human engineers follow a similar pattern when experienced, but AI platforms do it consistently and in seconds rather than minutes.

Do I need all four pillars before adopting an AI SRE tool?

Not necessarily. Agentic platforms like Sherlocks.ai can work with the pillars you have and surface gaps in coverage over time. That said, the more complete your telemetry, the faster and more accurate the investigation. Starting with a structured event stream alongside your existing metrics, logs, and traces gives an AI SRE platform the most to work with.

