Infrastructure Alerts

Overview

Infrastructure alerts let you monitor the health and performance of your servers, processes, network interfaces, and filesystems. Rules are evaluated on a cron schedule against data stored in ClickHouse. Each rule targets individual hosts, so alerts fire independently per hostname.

Available rule types

Rule Type	Description
System Metric	System-level metrics such as CPU usage, memory usage, disk I/O, and load average
Process Metric	Per-process metrics including CPU usage, memory usage, and thread count
Network Metric	Network interface metrics such as bytes in/out, packets, and errors
Storage Metric	Filesystem and disk metrics including usage percentage, free space, and inode usage
Host Not Reporting	Detects when a host stops sending data for a configured duration
Checks Failing	Detects when infrastructure health checks fail
Process Running	Detects when a specific process stops running on a host

Available metrics

System metrics

Metric	Summary Function	Unit
Load Average 1 Minute (`infra.system.load.1`)	average	—
Load Average 5 Minutes (`infra.system.load.5`)	average	—
Load Average 15 Minutes (`infra.system.load.15`)	average	—
CPU Used Percentage (`infra.system.cpu.total`)	average	%
CPU User Percentage (`infra.system.cpu.user`)	average	%
CPU Idle Percentage (`infra.system.cpu.idle`)	average	%
CPU IOwait Percentage (`infra.system.cpu.iowait`)	average	%
CPU Steal Percentage (`infra.system.cpu.steal`)	average	%
Memory Actual Used Percentage (`infra.system.memory.actualUsed.pct`)	average	%
Memory Actual Used (`infra.system.memory.actualUsed.bytes`)	average	GB
Memory Actual Free (`infra.system.memory.actualFree.bytes`)	average	GB
Memory Swap Used Percentage (`infra.system.memory.swapUsed.pct`)	average	%
Memory Swap Used (`infra.system.memory.swapUsed`)	average	GB
Memory Swap Free (`infra.system.memory.swapFree`)	average	GB

Storage metrics

Metric	Summary Function	Unit
Disk Used Percentage (`infra.system.fs.used.pct`)	average	%
Inodes Used Percentage (`infra.system.fs.usedInodesPercentage`)	average	%

Process metrics

Metric	Summary Function	Unit
Process Fd Open Count (`infra.system.process.fdOpen`)	average	count
Process CPU Used Percentage (`infra.system.process.cpuTotal.pct`)	average	%
Process Memory Virtual (`infra.system.process.memoryVirtual`)	average	GB
Process Memory Shared (`infra.system.process.memoryShared`)	average	MB
Process Memory RSS (`infra.system.process.memoryRss`)	average	GB
Process Memory RSS Percentage (`infra.system.process.memoryRss.pct`)	average	%

Network metrics

Metric	Summary Function	Unit
Network In Bytes (`infra.system.network.in.bytes`)	average	MB
Network In Packets (`infra.system.network.in.packets`)	average	Packets
Network In Errors (`infra.system.network.in.errors`)	average	Packets
Network In Dropped (`infra.system.network.in.dropped`)	average	Packets
Network Out Bytes (`infra.system.network.out.bytes`)	average	MB
Network Out Packets (`infra.system.network.out.packets`)	average	Packets
Network Out Errors (`infra.system.network.out.errors`)	average	Packets
Network Out Dropped (`infra.system.network.out.dropped`)	average	Packets

Infrastructure checks

Metric	Summary Function	Unit
Checks Health (`infra.system.checks.up`)	max	—

Host not reporting

Metric	Summary Function	Unit
Host Not Reporting (`infra.host.not.reporting`)	average	—

Enter thresholds in the unit shown in the tables above. For memory metrics, enter the value in GB or MB as listed, including for metrics whose identifier ends in `.bytes`. For CPU and disk metrics, enter the percentage value directly (e.g., 80 for 80%).

Targets

Infrastructure alerts target resources by hostname. Each alert rule fires independently per host, so a single rule can produce separate incidents for different servers.
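
As a rough illustration of per-host targeting, the sketch below groups samples by hostname and evaluates each host on its own. The function names and the `(hostname, value)` sample shape are hypothetical and are not part of the product's API.

```python
from collections import defaultdict
from statistics import mean

def average_per_host(samples):
    """Group (hostname, value) samples and average each host separately."""
    by_host = defaultdict(list)
    for hostname, value in samples:
        by_host[hostname].append(value)
    return {host: mean(values) for host, values in by_host.items()}

def hosts_in_breach(samples, threshold):
    """Return the sorted hostnames whose average exceeds the threshold."""
    averages = average_per_host(samples)
    return sorted(host for host, avg in averages.items() if avg > threshold)

# Hypothetical CPU samples for three hosts.
samples = [("web-1", 95.0), ("web-1", 97.0), ("web-2", 40.0), ("db-1", 92.0)]
print(hosts_in_breach(samples, 90))  # ['db-1', 'web-1']
```

Each hostname in the result would correspond to its own incident, which is why one rule can open several incidents at once.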

Filters

Rules support filtering by hostname and by metric-specific dimensions through the `filterConditions` field.

Available filter operators

Operator	Description
`is`	Exact match
`is not`	Excludes exact match
`contains`	Substring match
`not contains`	Excludes substring match
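`less than`	Numeric comparison: matches when the value is less than the given number
`greater than`	Numeric comparison: matches when the value is greater than the given number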
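
The operator semantics above can be sketched as a small matcher. This is an illustration only: the `(field, operator, value)` tuple shape and the rule that all conditions must hold (logical AND) are assumptions, not the documented structure of `filterConditions`.

```python
def matches(operator, actual, expected):
    """Apply one filter operator to a single field value."""
    if operator == "is":
        return actual == expected
    if operator == "is not":
        return actual != expected
    if operator == "contains":
        return str(expected) in str(actual)
    if operator == "not contains":
        return str(expected) not in str(actual)
    if operator == "less than":
        return float(actual) < float(expected)
    if operator == "greater than":
        return float(actual) > float(expected)
    raise ValueError(f"unknown filter operator: {operator!r}")

def passes_filters(record, conditions):
    """Assumption: a record passes only if every condition matches."""
    return all(matches(op, record[field], value) for field, op, value in conditions)

record = {"hostname": "prod-web-1"}
print(passes_filters(record, [("hostname", "contains", "prod")]))  # True
print(passes_filters(record, [("hostname", "is not", "prod-web-1")]))  # False
```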

Evaluation logic

Cron-based evaluation

Alert rules are evaluated on a recurring schedule. At each evaluation cycle, the system queries ClickHouse for metric data within the configured duration window.

Duration

The evaluation window can be set to one of the following durations:

5 minutes
10 minutes
15 minutes
30 minutes
60 minutes
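
To make the window concrete, the following sketch computes the start and end of an evaluation window for one of the allowed durations. It assumes the window ends at the moment of evaluation; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

ALLOWED_DURATIONS_MINUTES = (5, 10, 15, 30, 60)

def evaluation_window(now, duration_minutes):
    """Return (start, end) of the window, assuming it ends at evaluation time."""
    if duration_minutes not in ALLOWED_DURATIONS_MINUTES:
        raise ValueError(f"unsupported duration: {duration_minutes} minutes")
    return now - timedelta(minutes=duration_minutes), now

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
start, end = evaluation_window(now, 15)
print(start.isoformat())  # 2024-01-01T11:45:00+00:00
```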

Time functions

Function	Behavior
`any`	Alert fires if the threshold is breached at any point during the evaluation window
`all`	Alert fires only if the threshold is breached for the entire evaluation window
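
The difference between the two time functions can be sketched as follows. The input is a list of booleans, one per evaluation point in the window, where `True` means the threshold was breached at that point. Treating an empty window as "no alert" is an assumption.

```python
def window_breached(breached_flags, time_function):
    """Combine per-point breach flags according to the time function."""
    flags = list(breached_flags)
    if not flags:
        return False  # assumption: no data in the window means no alert
    if time_function == "any":
        return any(flags)
    if time_function == "all":
        return all(flags)
    raise ValueError(f"unknown time function: {time_function!r}")

spike = [False, False, True, False, False]  # one brief breach
print(window_breached(spike, "any"))  # True
print(window_breached(spike, "all"))  # False
```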

Threshold operators

Operator	Description
`above`	Fires when the metric value exceeds the threshold
`below`	Fires when the metric value drops below the threshold
`equal`	Fires when the metric value equals the threshold
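
A minimal sketch of the three threshold operators, mapping each name to a standard comparison. The `breaches` helper is hypothetical; note that `above` and `below` are strict, so a value exactly at the threshold does not fire.

```python
import operator

# Operator names from the table, mapped to Python comparisons.
COMPARATORS = {
    "above": operator.gt,
    "below": operator.lt,
    "equal": operator.eq,
}

def breaches(value, threshold_operator, threshold):
    """Return True when the value breaches the threshold under the operator."""
    return COMPARATORS[threshold_operator](value, threshold)

print(breaches(91, "above", 90))  # True
print(breaches(90, "above", 90))  # False
```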

Special rule: host not reporting

The Host Not Reporting rule type uses a different evaluation approach. It checks whether the most recent event time for a host is older than the start of the evaluation window. If `lastEventTime < startTime`, the host is considered stale and the alert fires.

This means the alert triggers when a host has not reported any data for longer than the configured duration.
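
The staleness check can be sketched in a few lines. The comparison mirrors `lastEventTime < startTime` from the description above; the function name and the assumption that the window ends at evaluation time are illustrative.

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_event_time, now, duration_minutes):
    """A host is stale when its last event predates the window start."""
    start_time = now - timedelta(minutes=duration_minutes)
    return last_event_time < start_time

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_stale(now - timedelta(minutes=12), now, 10))  # True
print(is_stale(now - timedelta(minutes=3), now, 10))   # False
```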

Special rule: checks failing

The Checks Failing rule type monitors infrastructure health checks. Data is grouped by hostname, check name, and IP address. Within each 1-minute bucket, the system evaluates the check status. If a check reports as down, it is considered failing.
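
The grouping described above might look roughly like the sketch below. It assumes each event carries an epoch timestamp and a status of 1 (up) or 0 (down), and that a bucket is summarized with `max`, following the summary function listed for Checks Health. The event shape and function name are hypothetical.

```python
from collections import defaultdict

def failing_buckets(events):
    """Return (hostname, check_name, ip, minute_start) for each down bucket.

    events: iterable of (epoch_seconds, hostname, check_name, ip, up),
    where up is 1 for a passing check and 0 for a failing one.
    """
    buckets = defaultdict(list)
    for ts, hostname, check_name, ip, up in events:
        minute_start = ts - ts % 60  # floor to the 1-minute bucket
        buckets[(hostname, check_name, ip, minute_start)].append(up)
    # Assumption: a bucket counts as down only if no report in it was up.
    return sorted(key for key, statuses in buckets.items() if max(statuses) == 0)

events = [
    (0, "web-1", "http", "10.0.0.1", 1),
    (30, "web-1", "http", "10.0.0.1", 0),
    (60, "web-1", "http", "10.0.0.1", 0),
]
print(failing_buckets(events))  # [('web-1', 'http', '10.0.0.1', 60)]
```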

Examples

Example 1: CPU usage alert

Alert when any host's CPU usage exceeds 90% for 5 minutes.

Setting	Value
Rule type	System Metric
Metric	CPU Used Percentage
Operator	above
Critical threshold	90 %
Warning threshold	80 %
Duration	5 minutes
Time function	all

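This rule evaluates CPU usage for every host independently. If a host sustains CPU usage above 90% for the full 5-minute window, a critical incident is opened for that host. If usage stays above 80% for the full window without meeting the critical threshold, a warning is raised instead.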
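
As a worked illustration of this rule, the sketch below applies the `all` time function with the `above` operator to one host's samples. It assumes one sample per minute and that the critical threshold is checked before the warning threshold; both are assumptions, and the sample values are made up.

```python
def severity(values, critical, warning):
    """Return 'critical', 'warning', or None for one host's window.

    Time function `all` with operator `above`: every sample in the
    window must exceed the threshold for that level to apply.
    """
    if not values:
        return None  # assumption: no data means no alert
    if all(v > critical for v in values):
        return "critical"
    if all(v > warning for v in values):
        return "warning"
    return None

print(severity([93, 95, 91, 97, 92], critical=90, warning=80))  # critical
print(severity([93, 85, 91, 97, 92], critical=90, warning=80))  # warning
print(severity([93, 75, 91, 97, 92], critical=90, warning=80))  # None
```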

Example 2: Host not reporting

Alert when a host stops sending data for more than 10 minutes.

Setting	Value
Rule type	Host Not Reporting
Duration	10 minutes

At each evaluation cycle, the system checks the last event time for every known host. If a host's last event is more than 10 minutes old, an incident is created for that host.

Example 3: Disk space alert

Alert when filesystem usage exceeds 85% on any host.

Setting	Value
Rule type	Storage Metric
Metric	Disk Used Percentage
Operator	above
Critical threshold	85 %
Warning threshold	75 %
Duration	15 minutes
Time function	any
Filter	hostname contains "prod"

This rule monitors disk usage on hosts with "prod" in their hostname. If any evaluation point within the 15-minute window shows usage above 85%, a critical incident is opened. The `any` time function means even a brief spike triggers the alert.
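
The combined effect of the hostname filter and the `any` time function can be sketched with made-up data. The function name and the dictionary of per-host samples are hypothetical.

```python
def critical_hosts(samples_by_host, name_filter, threshold):
    """Hosts whose name contains name_filter and that have at least
    one sample above the threshold (time function `any`)."""
    return sorted(
        host
        for host, values in samples_by_host.items()
        if name_filter in host and any(v > threshold for v in values)
    )

usage = {
    "prod-db-1": [70.0, 72.0, 86.5, 71.0],     # one brief spike above 85
    "prod-web-1": [60.0, 61.0, 62.0, 60.5],    # never above 85
    "staging-db-1": [90.0, 91.0, 92.0, 93.0],  # excluded by the hostname filter
}
print(critical_hosts(usage, "prod", 85))  # ['prod-db-1']
```

With the `all` time function instead, `prod-db-1` would not fire, because only one of its four samples exceeds the threshold.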