Your team has 47 dashboards, 312 alert rules, and a Slack channel where every notification is immediately ignored. Someone gets paged at 3am for a CPU spike that resolved itself before they opened their laptop. The postmortem says "improve monitoring." Nobody knows what that means. This is alert fatigue, and it is the default state of most observability setups. More dashboards did not make you more observable. They made you numb.

The problem is not tooling. Prometheus, Grafana, Datadog, whatever you run, these tools are capable enough. The problem is orientation. Most teams instrument everything and understand nothing. They monitor infrastructure metrics because that is what the default exporters provide, not because those metrics answer real questions about whether the system is working for users.

SLO-driven observability inverts this. Instead of starting with what is easy to measure, you start with what matters to users. You define the contract, measure against it, budget the gap, and alert only when the budget is burning too fast. Everything else is context for debugging, not signal for action.

SLIs: Measuring What Users Experience

A Service Level Indicator is a metric that tells you whether users are having a good experience. Not whether your servers are busy. Not whether your pods are healthy. Whether the thing the user asked for actually worked, fast enough, correctly enough.

Most teams get this wrong by measuring from the wrong vantage point. Server-side CPU utilization tells you about infrastructure cost. It tells you nothing about whether the checkout page loaded in under two seconds. SLIs must be measured as close to the user as possible, ideally from the load balancer, the edge, or the client.

The Four SLI Categories That Matter

  • Availability -- the proportion of requests that succeed. Measured as successful responses divided by total responses. A 500 error counts. A timeout counts. A dropped connection counts.
  • Latency -- the proportion of requests served faster than a threshold. Not average latency, which hides tail performance. Use percentiles: "99% of requests complete in under 300ms."
  • Throughput -- the rate of valid requests processed. Useful for batch systems, data pipelines, and message queues where latency is less relevant than processing rate.
  • Correctness -- the proportion of responses that return the right data. Often overlooked. A 200 response with stale data from a broken cache is not a successful request from the user's perspective.
Start with two SLIs, not twelve

For most request-serving systems, availability and latency cover 90% of what users care about. Add throughput or correctness only when the service type demands it. Piling on SLIs recreates the same noise problem you are trying to solve.

The key discipline is specificity. "API latency" is not an SLI. "Proportion of /checkout POST requests that complete in under 500ms, measured at the load balancer" is an SLI. The more precise the definition, the more useful the measurement, and the harder it is to game or misinterpret.
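That level of specificity translates directly into code. A minimal sketch of the good-event proportion behind a latency SLI (the function name, endpoint, and threshold here are illustrative, not taken from any particular stack):

```python
def latency_sli(request_durations_ms, threshold_ms=500):
    """Proportion of requests completing under the threshold.

    `request_durations_ms` holds observed durations for the endpoint
    being measured (e.g. POST /checkout, measured at the load balancer).
    """
    if not request_durations_ms:
        return 1.0  # no traffic: nothing violated the objective
    good = sum(1 for d in request_durations_ms if d < threshold_ms)
    return good / len(request_durations_ms)

# Nine fast requests and one slow one -> SLI of 0.9
print(latency_sli([120, 80, 450, 300, 90, 200, 610, 150, 330, 240]))
```

The same shape works for availability: count successful responses as good events and divide by total responses.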

SLOs: Setting the Contract

An SLO is a target for your SLI over a rolling time window. It is the answer to "how reliable does this service need to be?" Not how reliable it can be. Not how reliable you wish it were. How reliable it needs to be for users to trust it and for the business to function.

This distinction matters because 100% is never the target. Every additional nine costs exponentially more in engineering effort, operational complexity, and deployment velocity. The right SLO is the lowest target your users will tolerate.

What the Nines Actually Mean

Teams throw around "five nines" without understanding what they are committing to. Here is what each level means in practical downtime over a 30-day window:

  • 99% (two nines) -- 7 hours 12 minutes of allowed downtime per month. Acceptable for internal tools and non-critical batch systems.
  • 99.9% (three nines) -- 43 minutes 12 seconds per month. The practical floor for user-facing production services.
  • 99.95% -- 21 minutes 36 seconds per month. Requires automated rollbacks and solid incident response.
  • 99.99% (four nines) -- 4 minutes 19 seconds per month. Requires redundancy across failure domains, zero-downtime deployments, and on-call engineering with sub-minute response.
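The arithmetic behind these figures is simple enough to sanity-check yourself. A short sketch (illustrative, not tied to any tool) that reproduces the table above:

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of full downtime an SLO permits over the window."""
    window_minutes = window_days * 24 * 60  # 43,200 for 30 days
    return window_minutes * (1 - slo_percent / 100)

for slo in (99.0, 99.9, 99.95, 99.99):
    m = allowed_downtime_minutes(slo)
    print(f"{slo}% -> {int(m)}m {round((m % 1) * 60)}s per 30 days")
```

Running it prints 7h12m for two nines down through 4m19s for four nines, matching the list above.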

"If you have not done the math on what your SLO costs to maintain, you do not have an SLO. You have a wish."

Internal SLOs vs External SLAs

An SLA is a contractual commitment with financial or legal consequences. An SLO is an internal engineering target. Your SLOs should always be tighter than your SLAs. If your SLA promises 99.9% and your internal SLO is also 99.9%, you have zero margin. When you breach, you breach the contract and the customer relationship simultaneously.

Set internal SLOs at least one step above the external commitment. If the SLA says 99.9%, target 99.95% internally. The gap between SLO and SLA is your safety margin and your error budget for deployment velocity.

Error Budgets: Making Trade-offs Explicit

An error budget is the inverse of your SLO, expressed as allowed unreliability. If your SLO is 99.9% availability over 30 days, your error budget is 0.1% of total requests in that window. For a service handling 10 million requests per day, that is 10,000 failed requests per day, or roughly 300,000 over the month.

This number is not a failure threshold. It is a spending account. Every deployment that causes errors, every maintenance window, every dependency outage, they all spend from the same budget. The budget makes the cost of unreliability visible and the trade-off between velocity and stability explicit.
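The spending-account view is easy to make concrete. A sketch of the arithmetic (the traffic figure matches the example above; the `spent` value is hypothetical):

```python
def error_budget(slo, total_requests):
    """Allowed failed requests for the window at the given SLO (fraction)."""
    return total_requests * (1 - slo)

budget = error_budget(0.999, 10_000_000 * 30)  # 30-day window at 10M req/day
spent = 120_000   # hypothetical: failures from deploys, outages, dependencies
remaining = budget - spent
print(f"budget={round(budget):,} spent={spent:,} remaining={round(remaining):,}")
```

Every source of failed requests draws from the same `budget`, which is what makes the velocity-versus-stability trade-off visible in one number.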

How Error Budget Policy Works

The power of error budgets is in the policy, not the number. A well-designed error budget policy creates automatic decision rules:

  • Budget healthy (more than 50% remaining) -- ship freely. Deploy daily. Run experiments. The system has headroom.
  • Budget cautious (20% to 50% remaining) -- review deployments more carefully. Prioritize reliability work alongside features. No risky experiments.
  • Budget critical (under 20% remaining) -- freeze non-essential changes. Focus engineering effort on reliability improvements. All deploys require explicit approval.
  • Budget exhausted (0% remaining) -- full feature freeze until the budget recovers. Only reliability fixes and rollbacks ship. This is not punishment. It is math.
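These tiers are mechanical enough to encode directly, which is the point: no judgment call at decision time. A minimal sketch using the thresholds from the policy above:

```python
def budget_policy(remaining_fraction):
    """Map remaining error budget (0.0 to 1.0) to the deployment policy tier."""
    if remaining_fraction > 0.50:
        return "healthy: ship freely, deploy daily, run experiments"
    if remaining_fraction >= 0.20:
        return "cautious: review deploys, prioritize reliability work"
    if remaining_fraction > 0.0:
        return "critical: freeze non-essential changes, approvals required"
    return "exhausted: full feature freeze, reliability fixes only"

print(budget_policy(0.65))
print(budget_policy(0.05))
```

Wiring a check like this into the deploy pipeline turns the policy from a document into a gate.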

"Error budgets turn 'we should invest in reliability' from an opinion into a policy. When the budget is gone, the argument is over."

The cultural shift matters as much as the mechanism. Without error budgets, reliability is always less urgent than the next feature. With error budgets, reliability has a measurable, defensible claim on engineering time. Product managers can see the trade-off: ship this risky feature now and potentially burn the budget, or invest in retry logic first and ship safely next week.

Alert Design: Fewer Alerts, Better Signals

Most alerting setups are built backwards. Someone provisions a new service, copies the default alert rules from the wiki, adjusts a few thresholds, and moves on. Six months later there are 40 alert rules per service, half of them fire weekly, and the on-call engineer has learned to sleep through notifications.

SLO-driven alerting replaces this with a single question: is the error budget burning too fast? If the answer is no, do not page anyone. If the answer is yes, page the owner.

Symptom-Based vs Cause-Based Alerts

Cause-based alerts fire on infrastructure conditions: CPU above 80%, disk above 90%, pod restarts above 3. These alerts are noisy because causes do not always produce symptoms. CPU can spike to 95% during a batch job and users notice nothing. A pod can restart cleanly with zero impact on availability.

Symptom-based alerts fire on user-visible degradation: error rate exceeding the SLO burn rate, latency exceeding the SLI threshold for a sustained window. These alerts are meaningful because they directly measure the contract you care about.

Multi-Window Burn Rate Alerting

The most effective SLO alerting model uses burn rate across two windows. Burn rate is how fast you are consuming your error budget relative to the window. A burn rate of 1 means you will exactly exhaust the budget at the end of the window. A burn rate of 10 means you will exhaust it in one-tenth of the time.

A multi-window alert fires only when both a short window (5 to 10 minutes) and a long window (1 to 6 hours) show elevated burn rates. This eliminates two failure modes: brief spikes that self-resolve (the long window filters them out before anyone is paged) and alerts that keep firing long after recovery (the short window clears as soon as the burn stops).

  • Page-level alert -- burn rate above 14x over 5 minutes AND above 14x over 1 hour. At that rate the full 30-day budget is exhausted in roughly two days (30 / 14). Wake someone up.
  • Ticket-level alert -- burn rate above 3x over 30 minutes AND above 3x over 6 hours. The budget is draining faster than planned but not imminently. File a ticket, investigate during business hours.
  • Low-priority alert -- burn rate above 1x over 6 hours AND above 1x over 3 days. Slow bleed. Review in the next SLO meeting.
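The page-level decision above can be sketched in a few lines. The SLO, error rates, and 14x threshold here are illustrative; burn rate is simply the observed error rate divided by the SLO's allowed error rate:

```python
def burn_rate(error_rate, slo=0.999):
    """How fast the budget burns: 1.0 exhausts it exactly at window end."""
    return error_rate / (1 - slo)

def should_page(short_error_rate, long_error_rate, slo=0.999, threshold=14):
    """Page only if BOTH the short and the long window exceed the threshold."""
    return (burn_rate(short_error_rate, slo) >= threshold
            and burn_rate(long_error_rate, slo) >= threshold)

# A brief spike: short window hot, long window calm -> no page
print(should_page(short_error_rate=0.05, long_error_rate=0.002))
# Sustained burn in both windows -> page
print(should_page(short_error_rate=0.02, long_error_rate=0.018))
```

The same function with lower thresholds and longer windows implements the ticket-level and low-priority tiers.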
The on-call litmus test

If an alert fires and the on-call engineer's first action is to check whether it matters, the alert is wrong. Every page-level alert should be unambiguous: the error budget is burning at a rate that will breach the SLO within hours. No interpretation needed.

Alert Routing and Ownership

An alert without a clear owner is an alert that gets ignored. Routing is not a technical configuration problem. It is an organizational design problem. The question is not "which PagerDuty schedule does this go to?" The question is "which team owns this service's reliability?"

Routing Principles

  • Service ownership equals alert ownership. The team that builds and deploys the service owns its SLOs and gets paged when the budget burns. No exceptions, no shared on-call pools for unowned services.
  • Escalation paths must be explicit and tested. If the primary responder does not acknowledge within 10 minutes, the alert escalates. If the secondary does not acknowledge within another 10, it goes to the engineering manager. Write this down. Test it quarterly.
  • Cross-service alerts route to the symptom owner, not the cause owner. If Service A depends on Service B and Service A's SLO is burning, the alert goes to Team A first. They triage, identify the dependency, and escalate to Team B with context. This prevents Team B from being paged for problems that Team A can mitigate with a fallback or circuit breaker.
  • Infrastructure alerts route to the platform team only when they affect platform SLOs. Node CPU is not a page. "Cluster-wide scheduling latency exceeding platform SLO" is a page.

The hardest part is the organizational commitment. Every service in production must have an owning team. Every owning team must have an on-call rotation. Services without owners do not get SLOs, and services without SLOs do not get reliable. This is where observability becomes an organizational problem, not a tooling problem.

Runbooks: Turning Alerts Into Actions

Every alert that can page a human must link to a runbook. Not a wiki page that was last updated two years ago. A living document that tells the responder exactly what to do when this specific alert fires at 3am when they are half awake and stressed.

Runbook Structure

A useful runbook answers four questions in order:

  1. What triggered this alert? -- One sentence. "The checkout service error budget burn rate exceeded 14x over the last hour, indicating SLO breach within 90 minutes at current rate."
  2. What should you check first? -- Three to five specific checks with exact commands or dashboard links. Not "check the logs." Instead: "Open this Grafana panel. Look at error rate by endpoint. If /checkout/submit shows elevated 500s, check the payment gateway dependency panel."
  3. What should you do? -- Decision tree with concrete actions. "If the payment gateway is returning 503s, enable the circuit breaker by setting PAYMENT_FALLBACK=true in the checkout service config map and rolling restart. If the error is in checkout logic, roll back to the last known good deployment using the runbook deploy command."
  4. When should you escalate? -- Clear criteria. "Escalate to the payments team if the gateway is down for more than 15 minutes. Escalate to the engineering manager if the SLO breach is confirmed and customer-facing impact exceeds 30 minutes."

Runbooks are not documentation. They are operational tooling. Treat them like code: version them, review them, update them after every incident where the runbook was insufficient. If a responder had to improvise, the runbook failed and needs a patch.

[Figure: SLO Target -> Error Budget -> Burn Rate Alert -> Runbook -> Resolution, with a feedback loop back to refine SLOs and runbooks]

Fig. 1: SLO-driven observability flow, from target definition through error budget, alerting, runbook execution, and resolution, with a continuous feedback loop.

Dashboard Hygiene

Dashboards multiply like bacteria in a warm environment. Someone creates one during an incident. Someone else clones it and adds panels. A vendor integration ships with 15 pre-built dashboards. Within a year, your Grafana instance has 200 dashboards, only 30 of which have been viewed in the last month and only 5 of which are actually useful during incidents.

The fix is a tiered model with strict scoping. Three tiers. No more.

The Three-Dashboard Model

  1. Executive dashboard (SLO status). One dashboard per product area. Shows each service's SLO compliance, error budget remaining, and burn rate trend. This is what leadership looks at in weekly reviews. Green means healthy, yellow means budget is cautious, red means budget is critical. No infrastructure metrics. No deployment markers. Just the contract and how you are tracking against it.
  2. Service dashboard (golden signals). One dashboard per service. Shows the four golden signals: latency (p50, p95, p99), traffic (requests per second), error rate (5xx / total), and saturation (CPU, memory, queue depth). This is what the owning team monitors daily and what on-call checks first when an alert fires. Include deployment markers to correlate changes with signal shifts.
  3. Debug dashboard (detailed metrics). Deep-dive panels used only during active incidents or post-incident investigation. Per-endpoint breakdown, dependency latency, connection pool states, garbage collection pauses, individual pod metrics. These dashboards can be messy and granular because they are not for routine use. They exist to answer "why" after the service dashboard shows "what."

Everything else gets archived or deleted. If a dashboard was not opened in the last 30 days and is not one of the three tiers, it is clutter. Delete it. If someone needs it again, they can recreate it from the metrics that still exist. The cost of dashboard sprawl is not disk space. It is attention fragmentation during incidents when seconds matter.

Operational Ownership

SLOs are worthless without ownership. A target written in a document that nobody reviews is just aspirational text. For SLO-driven observability to work, SLOs must be embedded in how teams operate week to week.

Embedding SLOs Into Team Cadence

  • Weekly SLO review (15 minutes, added to existing standup or sync). Review error budget status for each owned service. Did anything notable consume budget this week? Are there upcoming deployments that carry risk? Does the SLO itself still reflect user expectations?
  • Sprint planning considers budget state. If the error budget is healthy, prioritize feature work. If the budget is cautious or critical, pull reliability items into the sprint. This is not optional. The budget policy dictates the balance.
  • Quarterly SLO recalibration. SLOs are not permanent. User expectations change. Infrastructure changes. Review whether each SLO is too tight (causing unnecessary freezes), too loose (allowing degraded experience), or just right. Adjust the target and update the alerting thresholds.
  • Incident postmortems reference SLO impact. Every postmortem should quantify the error budget consumed by the incident. "This outage consumed 40% of our monthly error budget" is more concrete than "this outage lasted 23 minutes." It connects incidents to the reliability contract and justifies follow-up investment.

The organizational prerequisite is clear: every production service has an owning team, every owning team has defined SLOs, and every SLO has a review cadence. Services that fall outside this structure are unobserved by definition, regardless of how many metrics they emit. Metrics without ownership are just data. Metrics with ownership, targets, and review cadence are observability.

FAQ: SLO-Driven Observability

What is the difference between an SLI, an SLO, and an SLA?

An SLI (Service Level Indicator) is a metric that measures user-facing behavior, such as request latency or error rate. An SLO (Service Level Objective) is the target you set for that metric, for example 99.9% of requests complete in under 300ms. An SLA (Service Level Agreement) is a contractual commitment with consequences if the SLO is breached. SLIs feed SLOs, and SLOs underpin SLAs.

How do error budgets help teams ship faster?

An error budget is the allowed amount of unreliability derived from your SLO. If your SLO is 99.9% availability, your error budget is 0.1% of total requests per window. When the budget is healthy, teams have explicit permission to ship aggressively. When it is nearly spent, teams slow down and focus on reliability. This replaces subjective risk debates with data-driven decisions.

What is multi-window burn rate alerting?

Multi-window burn rate alerting compares error budget consumption across a short window (for example 5 minutes) and a longer window (for example 1 hour). An alert fires only when both windows show elevated burn rates. This eliminates false positives from brief spikes while still catching sustained degradation before the full budget is exhausted.

How many dashboards should a team maintain?

A practical model uses three tiers: an executive dashboard showing SLO status and error budget remaining, a service dashboard showing the four golden signals (latency, traffic, errors, saturation) per service, and a debug dashboard with detailed system metrics used only during active incidents. Most teams have too many dashboards with overlapping data. Consolidate ruthlessly and delete the rest.

Next step: if your team is drowning in alerts and wants to build observability that actually works, start with a 30-minute discovery call to map your current SLI coverage and define your first SLOs.