Alerting best practices

This guide covers practical strategies for building an effective alerting setup that catches real problems without overwhelming your team.

Choosing incident preferences

Every alert policy has an incident preference that controls how violations are grouped into incidents. Pick the level that matches your operational needs.

By policy

One incident is created for the entire policy, no matter how many rules fire or how many entities are affected. Additional violations are added as issues to the existing incident.

Best for simple setups with a small number of rules. This is the least noisy option.

By rule

One incident per rule, regardless of how many entities violate that rule. If three hosts breach a CPU rule, they all appear as issues under a single incident.

This provides a good balance between signal and noise. Most teams start here.

By rule and target

One incident per rule per entity (application, host, or service). If three hosts breach a CPU rule, you get three separate incidents, each with its own notification lifecycle.

This is the most granular option. Best for multi-service environments where different teams own different hosts. Can be noisy if many entities share the same rules.
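
The difference between the three preferences comes down to how violations are keyed when they are grouped. A minimal sketch in Python, with hypothetical field names (policy_id, rule_id, entity_id) rather than any real API:

    # Sketch only: violations that produce the same key are grouped into
    # the same incident. Field names are assumptions for illustration.
    def incident_key(violation, preference):
        if preference == "by_policy":
            # One incident for the whole policy, however many rules fire.
            return (violation["policy_id"],)
        if preference == "by_rule":
            # One incident per rule, shared by every affected entity.
            return (violation["policy_id"], violation["rule_id"])
        if preference == "by_rule_and_target":
            # One incident per rule per entity.
            return (violation["policy_id"], violation["rule_id"], violation["entity_id"])
        raise ValueError(f"unknown incident preference: {preference}")

Three hosts breaching one CPU rule yield a single key under by_rule but three distinct keys under by_rule_and_target, which is why the incident counts described above differ.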

Setting effective thresholds

Start generous, tighten over time

Set initial thresholds wider than you think you need. After a week of baseline data, narrow them based on actual behavior. This avoids a flood of false positives on day one.

Use warning and critical together

  • Warning -- An early signal that something is trending in the wrong direction. Route to Slack or email.
  • Critical -- Requires immediate action. Route to PagerDuty, SMS, or on-call channels.

Using both levels gives your team time to respond before problems escalate.
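
As a sketch of the pattern (the 80/95 values and names here are placeholders to tune against your own baseline, not defaults from any product):

    # Placeholder thresholds: adjust to your own workload.
    WARNING_CPU = 80.0   # trending wrong; routed to Slack or email
    CRITICAL_CPU = 95.0  # immediate action; routed to PagerDuty or SMS

    def classify(cpu_percent):
        # Check the critical level first so the worse severity wins.
        if cpu_percent >= CRITICAL_CPU:
            return "critical"
        if cpu_percent >= WARNING_CPU:
            return "warning"
        return "ok"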

Duration matters

The duration setting controls how long a metric must remain in violation of its threshold before an incident opens.

  • Longer durations (15-30 minutes) filter out transient spikes and reduce false positives. Use these for metrics that naturally fluctuate, like CPU or memory.
  • Shorter durations (5 minutes) catch real issues faster. Use these for metrics where speed matters, like error rate or host availability.

Choosing a time function

  • All -- Every data point in the duration window must violate the threshold. Use this for sustained issues. It avoids false positives from momentary spikes.
  • Any -- A single data point in the window is enough to trigger. Use this for critical metrics where even one breach matters, such as security events or complete service outages.
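
Duration and the time function work together: the duration fixes the window size, and the time function decides how many points in that window must violate. A minimal sketch, assuming one data point per minute and a simple above-threshold rule:

    # Sketch: evaluate the last N minutes of data against a threshold.
    def should_open_incident(points, threshold, duration_minutes, time_function):
        window = points[-duration_minutes:]  # last N per-minute data points
        if len(window) < duration_minutes:
            return False  # not enough data to fill the window yet
        violations = [p > threshold for p in window]
        if time_function == "all":
            return all(violations)  # sustained violation required
        if time_function == "any":
            return any(violations)  # one breach in the window is enough
        raise ValueError(f"unknown time function: {time_function}")

    # A 3-minute spike does not open an incident under all/5 min,
    # but does under any/5 min:
    should_open_incident([50, 50, 90, 90, 90], 80, 5, "all")  # False
    should_open_incident([50, 50, 90, 90, 90], 80, 5, "any")  # True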

When to use baseline vs static thresholds

Static thresholds

Use static thresholds when you know the expected range of a metric:

  • CPU utilization should stay below 80%
  • Error rate should stay below 5%
  • Disk usage should stay below 85%

Static thresholds are straightforward and predictable. They work well for infrastructure metrics with well-understood limits.

Baseline thresholds

Use baseline thresholds when normal behavior varies by time of day or day of week:

  • Web traffic that peaks during business hours and drops overnight
  • Batch processing that runs heavier on weekends
  • Seasonal load patterns around holidays or events

Baseline alerting learns the pattern and triggers only when metrics deviate significantly from the expected range.

Note: Baseline thresholds are available for Browser and APM metrics only.
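
The product's actual baseline model is internal, but the core idea can be sketched as comparing each point to a learned value for the same slot in the weekly cycle, plus a tolerance band. Everything here (the slot granularity, the band width, the data layout) is an assumption for illustration:

    from datetime import datetime

    # Sketch of the idea only, not the real baseline algorithm.
    # baseline[(weekday, hour)] holds the learned (mean, stddev) for that slot.
    def deviates_from_baseline(value, timestamp, baseline, num_stddevs=3.0):
        mean, stddev = baseline[(timestamp.weekday(), timestamp.hour)]
        # Trigger only when the metric leaves the expected band for this slot.
        return abs(value - mean) > num_stddevs * stddev

    # Monday 9 AM normally sees ~1200 requests/min, so 400 is a real anomaly
    # here even though 400 would be perfectly normal at 3 AM.
    baseline = {(0, 9): (1200.0, 150.0)}
    deviates_from_baseline(400.0, datetime(2024, 1, 1, 9, 30), baseline)  # True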

Notification channel strategy

Map severity to channels

Route notifications by severity to avoid desensitizing your team to critical alerts:

Severity    Recommended channels
Critical    PagerDuty, SMS, phone call
Warning     Slack, Microsoft Teams
Info        Email, webhook
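
In code, that mapping is just a severity-to-channels lookup. A hypothetical sketch (channel names are illustrative):

    # Hypothetical routing table mirroring the mapping above.
    CHANNELS_BY_SEVERITY = {
        "critical": ["pagerduty", "sms", "phone"],
        "warning":  ["slack", "teams"],
        "info":     ["email", "webhook"],
    }

    def channels_for(severity):
        # Default to email rather than silently dropping unknown severities.
        return CHANNELS_BY_SEVERITY.get(severity, ["email"])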

Set up at least one channel per policy

A policy without a notification channel will create incidents but nobody will know about them. Always attach at least one channel.

Test before you rely on them

Use the test notification feature after creating or modifying a channel. This sends a sample message so you can verify that the integration works, the right people receive it, and the message format looks correct.

Rate limits

There is a limit of 200 incident notifications per day per account. Hitting this limit is a strong signal that your alert rules need tuning: raise thresholds, increase durations, or consolidate noisy rules.

Avoiding alert fatigue

Alert fatigue happens when teams receive so many notifications that they start ignoring all of them, including the important ones.

Focus on user-impacting metrics first

Do not alert on every metric you collect. Start with metrics that directly affect your users or your business, then expand only when you have capacity to act on new alerts.

These five alerts cover the most common failure modes and give you a strong foundation; a declarative sketch of the set follows the list:

  1. APM: Web response time > 2s for 5 min -- Catches slow responses before users complain.
  2. APM: Error rate > 5% for 5 min -- Catches error spikes from bad deploys or upstream failures.
  3. Infra: Host not reporting for 5 min -- Catches hosts that have gone offline.
  4. Infra: CPU > 90% for 15 min -- Catches sustained CPU saturation (longer duration filters out deploy spikes).
  5. Browser: Page load time > 5s for 10 min -- Catches front-end performance regressions.
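
Written down as data (the schema here is illustrative, not a real import format), the starter set is compact enough to keep in version control and review alongside code:

    # The five starter alerts as declarative data; field names are assumptions.
    STARTER_ALERTS = [
        {"product": "apm",     "metric": "web_response_time_s", "threshold": 2.0,  "duration_min": 5},
        {"product": "apm",     "metric": "error_rate_pct",      "threshold": 5.0,  "duration_min": 5},
        {"product": "infra",   "metric": "host_reporting",      "threshold": None, "duration_min": 5},   # absence of data
        {"product": "infra",   "metric": "cpu_pct",             "threshold": 90.0, "duration_min": 15},  # longer window filters deploy spikes
        {"product": "browser", "metric": "page_load_time_s",    "threshold": 5.0,  "duration_min": 10},
    ]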

Suppress known noise with maintenance windows

Use maintenance windows during deployments to prevent expected metric disruptions from triggering alerts. See Using maintenance windows below for details.

Review alerts monthly

Set aside time each month to review your alerting setup:

  • Disable rules that fire frequently but rarely lead to action.
  • Adjust thresholds on rules that produce too many false positives.
  • Remove rules for decommissioned services.
  • Add rules for new services that launched without coverage.

Using maintenance windows

Maintenance windows suppress notifications for selected projects during a defined time period. Incidents are still created but notifications are held until the window closes.

One-time windows

Create a one-time window before a planned deployment or infrastructure change. Set the start time to just before the change begins and the end time to when you expect the system to stabilize.

Recurring windows

Use recurring windows for predictable maintenance periods (a scheduling sketch follows the list), such as:

  • Sunday 2:00-4:00 AM for database backups
  • First Saturday of the month for OS patching
  • Nightly 3:00-3:30 AM for log rotation
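
Under the hood, a recurring window is a membership test on the current time; suppression applies while the test is true. A minimal sketch for the weekly case, with an assumed schedule format:

    from datetime import datetime

    # Sketch: is `now` inside a weekly recurring window? The schedule format
    # (weekday, start_hour, end_hour) is an assumption; Monday is 0, matching
    # datetime.weekday().
    def in_recurring_window(now, schedule):
        weekday, start_hour, end_hour = schedule
        return now.weekday() == weekday and start_hour <= now.hour < end_hour

    # Sunday 2:00-4:00 AM backup window (Sunday is weekday 6).
    backup_window = (6, 2, 4)
    in_recurring_window(datetime(2024, 1, 7, 2, 30), backup_window)  # True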

Keep windows short

The shorter the window, the smaller your monitoring blind spot. A 30-minute window for a deployment is better than a 4-hour window "just in case." If the deployment runs long, you can extend the window.

Runbook URLs

Every alert rule supports an optional runbook URL field. Use it to link to a document that helps the responder diagnose and fix the problem.

What to include in a runbook

  • Diagnosis steps -- What to check first (dashboards, logs, recent deploys).
  • Common causes -- The most frequent reasons this alert fires and how to confirm each one.
  • Remediation procedures -- Step-by-step instructions to resolve the issue.
  • Escalation contacts -- Who to contact if the on-call responder cannot resolve it alone.

Attaching runbooks to rules reduces mean time to resolution, especially for less experienced responders or during off-hours on-call rotations.