Stop Alerting on CPU: What to Monitor Instead

Your on-call engineer gets paged at 2 AM for high CPU on a batch processing node. They investigate for 20 minutes, find nothing wrong, and go back to sleep. An hour later, another page: disk space on a log aggregator. Another false alarm. By the time the third alert fires — this one signaling elevated API error rates — they’ve learned to assume it’s another false positive. They check Slack, see nothing, and go back to sleep.

In the morning, they discover the payment service was down for 45 minutes. Customers couldn’t complete purchases. The alert was real; the engineer had just been trained to ignore it.

This is alert fatigue in action. When everything alerts, nothing alerts. The fix isn’t better discipline — it’s better alert design.

The Symptom vs Cause Distinction

The most common alerting mistake is alerting on causes rather than symptoms. High CPU, low disk space, elevated connection counts — these are causes. They might affect users, or they might not. Request latency, error rates, failed transactions — these are symptoms. They tell you users are actually experiencing problems right now.

Consider three classic cause-based alerts and their symptom-based alternatives:

  1. 1
    High CPU usage
    Fires when cpu_usage > 90%. But CPU can spike during legitimate load, garbage collection, or batch jobs without any user impact. The symptom-based alternative-elevated request latency-only fires when users are actually affected. Many different causes lead to latency; one symptom alert catches them all.
  2. 2
    Low disk space
    Fires when disk_free < 10%. But this triggers on expected growth, archival systems, and cold storage that's supposed to be full. The symptom-based alternative-write failures increasing-only fires when disk space is causing actual failures. Keep disk space as a ticket for business hours, not a 2 AM page.
  3. 3
    Database connections high
    Fires when connections exceed 80% of max. But connection pools often run hot without issues. The symptom-based alternative-query timeouts-only fires when connection exhaustion impacts queries. Connection monitoring becomes a dashboard metric, not an alert.

The pattern is consistent: causes are potential problems; symptoms are actual problems. This leads to a simple rule: page on symptoms, ticket on causes. text: ‘If you can’t write down how to diagnose and remediate an alert, you don’t understand it well enough to wake someone up for it. The runbook doesn’t need to be exhaustive-a summary of user impact, links to relevant dashboards, common causes ranked by likelihood, and escalation criteria are enough. The point is that a 2 AM responder shouldn’t have to reverse-engineer what the alert means.’, This creates a natural hierarchy for categorizing alerts:

The alerting pyramid.
The alerting pyramid. description

Top-to-bottom flowchart divided into three grouped tiers. The top tier, Page Alerts User Impact, contains Error rate greater than SLO burn rate, Latency greater than SLO target, and Availability below threshold. This tier points to User impact NOW with the label Wake someone up. The middle tier, Ticket Alerts Degradation, contains CPU sustained high, Disk filling toward capacity, and Connection pool trending hot. This tier points to Potential future impact with the label Fix during business hours. The bottom tier, Dashboards Awareness, contains All metrics visible, Trends and rates, and Capacity projections. This tier points to No action required with the label Context for investigation. The diagram shows a hierarchy where user-facing symptoms warrant urgent pages, potential causes become tickets, and the rest of the operational data stays on dashboards for context.

SLO-Based Burn Rate Alerting

Symptom-based alerting answers what to alert on. Burn rates answer when.

The problem with raw thresholds is they lack context. A 1% error rate sounds scary, but is it? If your SLO allows 0.1% errors over a month, that 1% rate means you’re burning through your error budget 10x faster than sustainable. You have about 3 days before you exhaust your monthly budget. Urgent, but not a 2 AM emergency.

newsletter.subscribe

$ Stay Updated

> Subscribe to our newsletter for the latest insights on web development and digital solutions.

$

You'll receive a confirmation email. Click the link to complete your subscription.

Burn rate measures how fast you’re consuming your error budget relative to a sustainable pace. A 1x burn rate means you’ll exactly exhaust your budget by month’s end — sustainable but leaves no margin. A 14.4x burn rate means you’ll exhaust your entire monthly budget in just 2 hours. That’s an emergency.

The math: if your monthly budget is 0.1% errors and you’re currently seeing 1.44% errors (14.4 × 0.1%), you’re burning 14.4x faster than sustainable. At that rate, your 30-day budget disappears in 30 days ÷ 14.4 ≈ 2 hours.

Burn rate to response mapping.

Here’s what a 14.4x burn rate alert looks like in Prometheus:

groups:
  - name: slo-burn-rate-alerts
    rules:
      - alert: APIAvailabilityFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API error rate burning SLO budget rapidly"
          runbook_url: "https://runbooks.example.com/api-availability"
Prometheus burn rate alert.

One refinement makes burn rate alerts even more reliable: multi-window alerting. Single-window alerts have a problem — short windows catch spikes but also false-positive on brief blips; long windows miss fast-moving incidents. The solution is requiring both a short and long window to breach before alerting.

For example, a 14.4x burn rate alert might require both conditions to be true: the 5-minute error rate exceeds the threshold and the 1-hour error rate exceeds the threshold. If a 30-second traffic spike pushes errors to 2% but the hourly rate is still 0.05%, the alert doesn’t fire — the spike isn’t sustained. Conversely, if an incident resolved 20 minutes ago, the 1-hour window might still show elevated errors, but the 5-minute window is clean — no alert, because the problem is already over.

Sustaining Alert Quality

The symptom vs cause distinction and burn rate math are the technical foundation, but sustainable alerting requires organizational practices too.

Free PDF Guide

Alert Fatigue: Symptom-Based Alerting That Works

Designing alerts that wake people up for real problems and include runbooks for resolution.

What you'll get:

  • Symptom alert design checklist
  • Burn rate threshold calculator
  • Runbook template starter pack
  • Monthly alert hygiene rubric
PDF download

Free resource

Instant access

No credit card required.

Info callout:

The goal isn’t zero alerts — it’s ensuring every alert that fires represents a real problem that requires human intervention, and the human receiving it has everything they need to resolve it quickly. Start by auditing your current alerts: calculate your actionable rate over the past month. If it’s below 80%, you have work to do.

Alert fatigue isn’t a discipline problem you can train your way out of. It’s a design problem you can engineer your way out of. Page on symptoms, ticket on causes, and let burn rates determine urgency.

Share this article

Enjoyed the read? Share it with your network.

Other things I've written