SLO alerts

Overview

SLO alerts notify you when your service level objectives are at risk. Instead of alerting on raw metrics, SLO alerts track error budgets and burn rates to give you early warning when reliability is degrading, so you can respond before your users are affected.

SLO concepts

Service level indicator (SLI)

The SLI is the ratio of good events to total events, expressed as a percentage:

SLI = (good_events / total_events) * 100

For example, if your service handled 10,000 requests and 9,950 were successful, your SLI is 99.5%.
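As a minimal sketch of the ratio above, the calculation can be written as a small function. The name `compute_sli` and the choice to return `None` when there are no events are assumptions for illustration, not part of the product.

```python
from typing import Optional


def compute_sli(good_events: int, total_events: int) -> Optional[float]:
    """Return the SLI as a percentage, or None when there are no events."""
    if total_events <= 0:
        # Assumption: with no events there is nothing to measure.
        return None
    return 100 * good_events / total_events


print(compute_sli(9_950, 10_000))  # 99.5
```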

Target

The target is the desired SLI percentage for a given time window. For example, a target of 99.9% means you aim for no more than 0.1% of events to be failures.

Error budget

The error budget represents how much unreliability you can tolerate within the SLO window:

error_budget = (1 - target / 100) * window_minutes

For a 99.9% target over 30 days (43,200 minutes), the error budget is (1 - 0.999) * 43,200 = 43.2 minutes of allowed downtime.
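The same arithmetic can be checked with a short sketch. The function name `error_budget_minutes` and the range check on the target are illustrative assumptions.

```python
def error_budget_minutes(target_pct: float, window_minutes: int) -> float:
    """Allowed minutes of unreliability for a target (in percent) and a window."""
    if not 0 <= target_pct <= 100:
        raise ValueError("target_pct must be between 0 and 100")
    return (1 - target_pct / 100) * window_minutes


# 99.9% target over a 30-day window (43,200 minutes)
budget = error_budget_minutes(99.9, 43_200)
print(round(budget, 1))  # 43.2
```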

Burn rate

The burn rate measures how fast you are consuming your error budget relative to the elapsed time in the window:

burn_rate = (error_budget_consumed_pct / 100) / elapsed_fraction

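Here, elapsed_fraction is the fraction of the SLO window that has elapsed (for example, 0.5 halfway through the window). A burn rate of 1.0 means the budget is being consumed at exactly the rate that exhausts it when the window ends. A burn rate above 1.0 means the budget is being consumed faster than is sustainable: if the rate continues, the budget will run out, and the SLO will be breached, before the window ends.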
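A minimal sketch of the burn rate formula, assuming the elapsed fraction is derived from elapsed minutes and the window length. The function name and its parameters are hypothetical.

```python
def burn_rate(consumed_pct: float, elapsed_minutes: float, window_minutes: float) -> float:
    """Burn rate = fraction of budget consumed / fraction of window elapsed."""
    if not 0 < elapsed_minutes <= window_minutes:
        raise ValueError("elapsed_minutes must be in (0, window_minutes]")
    elapsed_fraction = elapsed_minutes / window_minutes
    return (consumed_pct / 100) / elapsed_fraction


# 50% of the budget consumed after 10 days of a 30-day window
rate = burn_rate(50, 10 * 1_440, 43_200)
print(round(rate, 2))  # 1.5
```

In this example one third of the window has elapsed but half of the budget is gone, so the budget is being consumed 1.5 times faster than is sustainable.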

SLO window types

Rolling windows

Rolling windows look back a fixed duration from the current time:

Window    Duration (minutes)
7 days    10,080
14 days   20,160
28 days   40,320
30 days   43,200
90 days   129,600

Calendar windows

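Calendar windows align to calendar boundaries. The monthly and quarterly durations listed are nominal (30 and 90 days); actual calendar months and quarters vary in length: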

Window     Duration (minutes)
Weekly     10,080
Monthly    43,200
Quarterly  129,600

SLO statuses

Each SLO is assigned one of the following statuses based on current performance:

Status    Description
HEALTHY   SLI meets the target
AT_RISK   SLI is within the warning threshold, or more than 75% of the error budget is consumed
BREACHED  SLI is below the target, or 100% of the error budget is consumed
NO_DATA   Insufficient data to evaluate the SLO
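The status rules can be sketched as a single function. This is an illustration under stated assumptions, not the product's implementation: the warning threshold is not defined here, so it is modeled as a hypothetical `warning_margin_pct` above the target, and the order in which the rules are checked (NO_DATA, then BREACHED, then AT_RISK) is also assumed.

```python
from typing import Optional


def slo_status(
    sli_pct: Optional[float],
    target_pct: float,
    budget_consumed_pct: float,
    warning_margin_pct: float = 0.05,  # hypothetical: how close to target counts as "at risk"
) -> str:
    """Classify an SLO using the rules in the status table."""
    if sli_pct is None:
        return "NO_DATA"
    if sli_pct < target_pct or budget_consumed_pct >= 100:
        return "BREACHED"
    if sli_pct < target_pct + warning_margin_pct or budget_consumed_pct > 75:
        return "AT_RISK"
    return "HEALTHY"


print(slo_status(99.99, 99.9, 10))  # HEALTHY
print(slo_status(99.99, 99.9, 80))  # AT_RISK
print(slo_status(99.50, 99.9, 10))  # BREACHED
```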

Alert types

Burn rate alerts

Burn rate alerts fire when the rate of error budget consumption exceeds a threshold. These are useful for detecting fast-moving incidents that will exhaust your budget quickly.

The alert fires when:

burn_rate >= burn_rate_threshold

Budget consumed alerts

Budget consumed alerts fire when the total percentage of error budget consumed exceeds a threshold. These provide a direct measure of how much budget remains, regardless of the rate.

The alert fires when:

error_budget_consumed_pct >= consumed_percentage
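The two firing conditions can be compared side by side in a short sketch. The function names are hypothetical; the parameter names follow the expressions above.

```python
def burn_rate_alert_fires(burn_rate: float, burn_rate_threshold: float) -> bool:
    """Fires when the budget is being consumed at or above the threshold rate."""
    return burn_rate >= burn_rate_threshold


def budget_consumed_alert_fires(
    error_budget_consumed_pct: float, consumed_percentage: float
) -> bool:
    """Fires when the consumed share of the budget reaches the threshold."""
    return error_budget_consumed_pct >= consumed_percentage


# A slow leak: low burn rate, but most of the budget is already gone.
print(burn_rate_alert_fires(1.2, 14))         # False
print(budget_consumed_alert_fires(80.0, 75))  # True
```

The example illustrates why the two alert types complement each other: a slow, steady leak never trips a high burn rate threshold, but it does eventually trip a budget consumed threshold.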

Alert configuration

Schema fields

Field       Type      Description
name        String    Alert name
sloId       ObjectId  Reference to the SLO definition
condition   String    Alert type: burn_rate, budget_remaining, compliance
operator    String    Comparison operator: greater_than, less_than
threshold   Number    Threshold value (default: 5)
timeWindow  String    Evaluation window: 1h, 6h, 12h, 24h, 7d, 30d
severity    String    Alert severity: info, warning, critical
enabled     Boolean   Whether the alert is active
channels    Array     Notification channel IDs to receive alerts
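Before submitting a configuration, the enumerated fields can be checked client-side. The sketch below is a hypothetical helper, not part of the product: the allowed values and the default threshold are taken from the table above, while the function name and error format are assumptions.

```python
from typing import Any, Dict, List

# Allowed values copied from the schema table.
ALLOWED_VALUES = {
    "condition": ("burn_rate", "budget_remaining", "compliance"),
    "operator": ("greater_than", "less_than"),
    "timeWindow": ("1h", "6h", "12h", "24h", "7d", "30d"),
    "severity": ("info", "warning", "critical"),
}
DEFAULT_THRESHOLD = 5


def validate_alert(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in an alert config (empty if none)."""
    errors: List[str] = []
    for field, allowed in ALLOWED_VALUES.items():
        value = config.get(field)
        if value not in allowed:
            errors.append(f"{field}: {value!r} is not one of {list(allowed)}")
    threshold = config.get("threshold", DEFAULT_THRESHOLD)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        errors.append("threshold: must be a number")
    return errors


config = {
    "condition": "burn_rate",
    "operator": "greater_than",
    "threshold": 14,
    "timeWindow": "1h",
    "severity": "critical",
}
print(validate_alert(config))  # []
```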

Evaluation

SLO alerts are evaluated every minute via a scheduled cron job. SLOs are processed in batches of 10 to manage system load.
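The batching behavior can be pictured with a short sketch. Only the batch size of 10 comes from the text above; the helper `in_batches` and the placeholder SLO IDs are hypothetical, and the actual scheduler is not shown.

```python
from typing import Iterator, List, Sequence

BATCH_SIZE = 10  # from the description above


def in_batches(slo_ids: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` SLO IDs."""
    for start in range(0, len(slo_ids), size):
        yield list(slo_ids[start:start + size])


# 25 placeholder SLO IDs are processed as three batches.
slo_ids = [f"slo-{n}" for n in range(25)]
print([len(batch) for batch in in_batches(slo_ids)])  # [10, 10, 5]
```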

Examples

Fast burn rate alert

Detect rapid error budget consumption that would exhaust the budget long before the SLO window ends:

{
  "name": "API availability - fast burn",
  "sloId": "648a1b2c3d4e5f6a7b8c9d0e",
  "condition": "burn_rate",
  "operator": "greater_than",
  "threshold": 14,
  "timeWindow": "1h",
  "severity": "critical",
  "enabled": true,
  "channels": ["648a1b2c3d4e5f6a7b8c9d0f"]
}

This fires a critical alert when the burn rate exceeds 14x over a one-hour window. At that rate, a 30-day error budget would be fully consumed in roughly two days (30 / 14 ≈ 2.1 days) if the trend continues.
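How long a sustained burn rate takes to exhaust a budget depends on the length of the SLO window. The sketch below follows from the burn rate definition and assumes the budget is untouched when the burn starts; the function name is hypothetical.

```python
def hours_to_exhaustion(burn_rate: float, window_days: float) -> float:
    """Hours until a full error budget is gone at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    # At burn rate 1.0 the budget lasts exactly the whole window.
    return window_days * 24 / burn_rate


print(round(hours_to_exhaustion(14, 30), 1))  # 51.4
print(round(hours_to_exhaustion(14, 7), 1))   # 12.0
```

The same 14x threshold therefore leaves about two days of budget on a 30-day window, but only 12 hours on a 7-day window.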

Error budget exhaustion warning

Get an early warning when a significant portion of the error budget has been consumed:

{
  "name": "Checkout flow - budget warning",
  "sloId": "648a1b2c3d4e5f6a7b8c9d1a",
  "condition": "budget_remaining",
  "operator": "less_than",
  "threshold": 25,
  "timeWindow": "30d",
  "severity": "warning",
  "enabled": true,
  "channels": ["648a1b2c3d4e5f6a7b8c9d1b"]
}

This fires a warning when less than 25% of the 30-day error budget remains, that is, when more than 75% has been consumed, giving the team time to investigate and course-correct before the SLO is breached.