Overview

Infrastructure alerts let you monitor the health and performance of your servers, processes, network interfaces, and filesystems. Rules are evaluated on a cron schedule against data stored in ClickHouse. Each rule targets individual hosts, so alerts fire independently per hostname.

Available rule types

Rule Type | Description
System Metric | System-level metrics such as CPU usage, memory usage, disk I/O, and load average
Process Metric | Per-process metrics including CPU usage, memory usage, and thread count
Network Metric | Network interface metrics such as bytes in/out, packets, and errors
Storage Metric | Filesystem and disk metrics including usage percentage, free space, and inode usage
Host Not Reporting | Detects when a host stops sending data for a configured duration
Checks Failing | Detects when infrastructure health checks fail
Process Running | Detects when a specific process stops running on a host

Available metrics

System metrics

Metric | Summary Function | Unit
Load Average 1 Minute (infra.system.load.1) | average | -
Load Average 5 Minutes (infra.system.load.5) | average | -
Load Average 15 Minutes (infra.system.load.15) | average | -
CPU Used Percentage (infra.system.cpu.total) | average | %
CPU User Percentage (infra.system.cpu.user) | average | %
CPU Idle Percentage (infra.system.cpu.idle) | average | %
CPU IOwait Percentage (infra.system.cpu.iowait) | average | %
CPU Steal Percentage (infra.system.cpu.steal) | average | %
Memory Actual Used Percentage (infra.system.memory.actualUsed.pct) | average | %
Memory Actual Used (infra.system.memory.actualUsed.bytes) | average | GB
Memory Actual Free (infra.system.memory.actualFree.bytes) | average | GB
Memory Swap Used Percentage (infra.system.memory.swapUsed.pct) | average | %
Memory Swap Used (infra.system.memory.swapUsed) | average | GB
Memory Swap Free (infra.system.memory.swapFree) | average | GB

Storage metrics

Metric | Summary Function | Unit
Disk Used Percentage (infra.system.fs.used.pct) | average | %
Inodes Used Percentage (infra.system.fs.usedInodesPercentage) | average | %

Process metrics

Metric | Summary Function | Unit
Process Fd Open Count (infra.system.process.fdOpen) | average | count
Process CPU Used Percentage (infra.system.process.cpuTotal.pct) | average | %
Process Memory Virtual (infra.system.process.memoryVirtual) | average | GB
Process Memory Shared (infra.system.process.memoryShared) | average | MB
Process Memory RSS (infra.system.process.memoryRss) | average | GB
Process Memory RSS Percentage (infra.system.process.memoryRss.pct) | average | %

Network metrics

Metric | Summary Function | Unit
Network In Bytes (infra.system.network.in.bytes) | average | MB
Network In Packets (infra.system.network.in.packets) | average | Packets
Network In Errors (infra.system.network.in.errors) | average | Packets
Network In Dropped (infra.system.network.in.dropped) | average | Packets
Network Out Bytes (infra.system.network.out.bytes) | average | MB
Network Out Packets (infra.system.network.out.packets) | average | Packets
Network Out Errors (infra.system.network.out.errors) | average | Packets
Network Out Dropped (infra.system.network.out.dropped) | average | Packets

Infrastructure checks

Metric | Summary Function | Unit
Checks Health (infra.system.checks.up) | max | -

Host not reporting

Metric | Summary Function | Unit
Host Not Reporting (infra.host.not.reporting) | avg | -

Enter each threshold in the unit listed for its metric. For memory metrics, that is GB or MB, as shown in the tables above. For CPU and disk metrics, enter the percentage value directly (for example, 80 for 80%).

Targets

Infrastructure alerts target resources by hostname. Each alert rule fires independently per host, so a single rule can produce separate incidents for different servers.

Filters

Rules support filtering by hostname and by metric-specific dimensions through the filterConditions field.

Available filter operators

Operator | Description
is | Exact match
is not | Excludes exact match
contains | Substring match
not contains | Excludes substring match
less than | Numeric comparison (value is below the given number)
greater than | Numeric comparison (value is above the given number)
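
To make the operator semantics concrete, here is a minimal Python sketch of how a single filter condition could be applied to a row of data. It is an illustration only: the shape of filterConditions is not documented here, so the field, operator, and value keys are hypothetical, as are the assumptions that string matches are case-sensitive and that multiple conditions are combined with AND.

```python
from typing import Dict, List, Union

Value = Union[str, int, float]


def matches_filter(value: Value, operator: str, operand: Value) -> bool:
    """Return True if `value` satisfies one filter condition."""
    if operator == "is":
        return str(value) == str(operand)
    if operator == "is not":
        return str(value) != str(operand)
    if operator == "contains":
        return str(operand) in str(value)
    if operator == "not contains":
        return str(operand) not in str(value)
    if operator == "less than":
        return float(value) < float(operand)
    if operator == "greater than":
        return float(value) > float(operand)
    raise ValueError(f"unknown filter operator: {operator!r}")


def matches_all(row: Dict[str, Value], conditions: List[dict]) -> bool:
    """Assumption: every condition must hold (logical AND)."""
    return all(
        matches_filter(row[c["field"]], c["operator"], c["value"])
        for c in conditions
    )


# Hypothetical row and conditions, for illustration only.
row = {"hostname": "db-prod-2", "mount": "/data"}
conditions = [
    {"field": "hostname", "operator": "contains", "value": "prod"},
    {"field": "mount", "operator": "is not", "value": "/boot"},
]
print(matches_all(row, conditions))
```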

Evaluation logic

Cron-based evaluation

Alert rules are evaluated on a recurring schedule. At each evaluation cycle, the system queries ClickHouse for metric data within the configured duration window.

Duration

The evaluation window can be set to one of the following durations:

  • 5 minutes
  • 10 minutes
  • 15 minutes
  • 30 minutes
  • 60 minutes

Time functions

Function | Behavior
any | Alert fires if the threshold is breached at any point during the evaluation window
all | Alert fires only if the threshold is breached for the entire evaluation window
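
The difference between the two time functions can be sketched in a few lines of Python. The function name window_fires and the representation of the window as a list of per-point breach flags are assumptions made for illustration; treating an empty window as "does not fire" is also an assumption, not documented behavior.

```python
from typing import Iterable


def window_fires(breaches: Iterable[bool], time_function: str) -> bool:
    """Decide whether a window fires, given one breach flag per data point."""
    points = list(breaches)
    if not points:
        # Assumption: a window with no data points never fires.
        return False
    if time_function == "any":
        return any(points)
    if time_function == "all":
        return all(points)
    raise ValueError(f"unknown time function: {time_function!r}")


# One brief spike in an otherwise healthy window.
spike = [False, False, True, False, False]
print(window_fires(spike, "any"), window_fires(spike, "all"))
```

With a single breached point, any fires and all does not, which is why all suits sustained conditions and any suits short spikes.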

Threshold operators

Operator | Description
above | Fires when the metric value exceeds the threshold
below | Fires when the metric value drops below the threshold
equal | Fires when the metric value equals the threshold
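
As a rough illustration, the three threshold operators map onto simple comparisons. The sketch below assumes strict comparisons for above and below (a value exactly at the threshold does not fire) and exact equality for equal; the function name breaches is hypothetical.

```python
def breaches(value: float, operator: str, threshold: float) -> bool:
    """Return True if `value` breaches `threshold` under `operator`."""
    if operator == "above":
        return value > threshold   # assumption: strictly greater
    if operator == "below":
        return value < threshold   # assumption: strictly less
    if operator == "equal":
        # Exact comparison; averaged floating-point values rarely match exactly.
        return value == threshold
    raise ValueError(f"unknown threshold operator: {operator!r}")


print(breaches(91.5, "above", 90))
print(breaches(90.0, "above", 90))
```

Because metrics are summarized with an average, equal is most practical for values that are whole numbers by nature, such as a check status.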

Special rule: host not reporting

The Host Not Reporting rule type uses a different evaluation approach. It checks if the most recent event time for a host is older than the start of the evaluation window. If lastEventTime < startTime, the host is considered stale and the alert fires.

This means the alert triggers when a host has not reported any data for longer than the configured duration.
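
The staleness comparison can be sketched as follows. The function and parameter names are illustrative, and the sketch assumes the window start is simply the evaluation time minus the configured duration.

```python
from datetime import datetime, timedelta, timezone


def is_host_stale(last_event_time: datetime, now: datetime,
                  duration_minutes: int) -> bool:
    """Return True if the host's last event predates the evaluation window."""
    # Assumption: the window starts `duration_minutes` before the evaluation time.
    start_time = now - timedelta(minutes=duration_minutes)
    return last_event_time < start_time


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_host_stale(now - timedelta(minutes=12), now, 10))
print(is_host_stale(now - timedelta(minutes=3), now, 10))
```

A host last seen 12 minutes ago is stale under a 10-minute duration, while one seen 3 minutes ago is not.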

Special rule: checks failing

The Checks Failing rule type monitors infrastructure health checks. Data is grouped by hostname, check name, and IP address. Within each 1-minute bucket, the system evaluates the check status. If a check reports as down, it is considered failing.
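
The grouping described above can be approximated in Python. This is a sketch under stated assumptions: each sample carries an up value of 1 (up) or 0 (down), the per-bucket status is the maximum of those values (matching the max summary function listed for infra.system.checks.up), and a bucket counts as failing when that maximum is 0. The tuple layout and the name failing_checks are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def failing_checks(samples):
    """Group samples by (hostname, check name, IP, minute) and list failing buckets.

    `samples` is an iterable of (timestamp, hostname, check_name, ip, up) tuples.
    """
    buckets = defaultdict(int)  # missing keys start at 0, i.e. "down"
    for ts, hostname, check_name, ip, up in samples:
        minute = ts.replace(second=0, microsecond=0)  # 1-minute bucket
        key = (hostname, check_name, ip, minute)
        buckets[key] = max(buckets[key], up)
    return sorted(key for key, status in buckets.items() if status == 0)


def at(minute: int, second: int) -> datetime:
    return datetime(2024, 1, 1, 12, minute, second, tzinfo=timezone.utc)


samples = [
    (at(0, 10), "web-1", "http", "10.0.0.1", 0),
    (at(0, 40), "web-1", "http", "10.0.0.1", 1),  # recovered within the same minute
    (at(1, 5), "web-1", "http", "10.0.0.1", 0),   # down for the whole next minute
]
print(failing_checks(samples))
```

Under these assumptions only the 12:01 bucket is reported, because the 12:00 bucket contains at least one successful sample.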

Examples

Example 1: CPU usage alert

Alert when any host's CPU usage exceeds 90% for 5 minutes.

Setting | Value
Rule type | System Metric
Metric | CPU Used Percentage
Operator | above
Critical threshold | 90%
Warning threshold | 80%
Duration | 5 minutes
Time function | all

This rule evaluates CPU usage for each host independently. If a host sustains CPU usage above 90% for the full 5-minute window, a critical incident is opened. A warning is raised when usage stays above the 80% warning threshold for the same window.
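
The behavior of this rule can be walked through with sample data. The sketch below assumes one averaged data point per minute and assumes the critical threshold is checked before the warning threshold; the function names and host names are invented for illustration.

```python
from typing import Dict, List, Optional


def classify_cpu(samples: List[float], critical: float = 90.0,
                 warning: float = 80.0) -> Optional[str]:
    """Apply the 'above' operator with the 'all' time function to one host."""
    if not samples:
        return None  # assumption: no data, no incident
    if all(v > critical for v in samples):
        return "critical"
    if all(v > warning for v in samples):
        return "warning"
    return None


def evaluate_hosts(cpu_by_host: Dict[str, List[float]]) -> Dict[str, Optional[str]]:
    """Evaluate each host independently, as the rule does per hostname."""
    return {host: classify_cpu(samples) for host, samples in cpu_by_host.items()}


# Hypothetical 5-minute windows, one point per minute.
cpu_by_host = {
    "web-1": [92, 95, 91, 97, 93],  # above 90% throughout
    "web-2": [85, 91, 88, 82, 95],  # above 80% throughout, not always above 90%
    "web-3": [70, 95, 60, 50, 40],  # a single spike only
}
print(evaluate_hosts(cpu_by_host))
```

Here web-3 produces no incident despite its spike to 95%, because the all time function requires the breach to hold for every point in the window.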

Example 2: Host not reporting

Alert when a host stops sending data for more than 10 minutes.

Setting | Value
Rule type | Host Not Reporting
Duration | 10 minutes

At each evaluation cycle, the system checks the last event time for every known host. If a host's last event is more than 10 minutes ago, an incident is created for that host.

Example 3: Disk space alert

Alert when filesystem usage exceeds 85% on any host whose hostname contains "prod".

Setting | Value
Rule type | Storage Metric
Metric | Disk Used Percentage
Operator | above
Critical threshold | 85%
Warning threshold | 75%
Duration | 15 minutes
Time function | any
Filter | hostname contains "prod"

This rule monitors disk usage on hosts with "prod" in their hostname. If any evaluation point within the 15-minute window shows usage above 85%, a critical incident is opened. The any time function means even a brief spike triggers the alert.
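
A small Python sketch shows how the hostname filter and the any time function combine in this example. The host names and data points are invented, and the function name disk_alerts is hypothetical. With any, checking whether some point exceeds a threshold is equivalent to comparing the window's peak value against it.

```python
from typing import Dict, List


def disk_alerts(usage_by_host: Dict[str, List[float]], critical: float = 85.0,
                warning: float = 75.0, name_filter: str = "prod") -> Dict[str, str]:
    """Return the severity per matching host using 'above' with 'any'."""
    incidents = {}
    for hostname, samples in usage_by_host.items():
        if name_filter not in hostname:
            continue  # filter: hostname contains "prod"
        if not samples:
            continue  # assumption: no data, no incident
        peak = max(samples)  # 'any' fires if even one point breaches
        if peak > critical:
            incidents[hostname] = "critical"
        elif peak > warning:
            incidents[hostname] = "warning"
    return incidents


# Hypothetical disk usage percentages within a 15-minute window.
usage_by_host = {
    "db-prod-1": [60, 86, 70],  # one point above 85%
    "web-prod-2": [76, 74],     # one point above 75%
    "db-staging-1": [99],       # excluded by the hostname filter
    "cache-prod-3": [50],       # healthy
}
print(disk_alerts(usage_by_host))
```

The staging host is ignored even at 99% usage, and each matching host yields its own incident.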