Overview
Infrastructure alerts let you monitor the health and performance of your servers, processes, network interfaces, and filesystems. Rules are evaluated on a cron schedule against data stored in ClickHouse. Each rule targets individual hosts, so alerts fire independently per hostname.
Available rule types
| Rule Type | Description |
|---|---|
| System Metric | System-level metrics such as CPU usage, memory usage, disk I/O, and load average |
| Process Metric | Per-process metrics including CPU usage, memory usage, and thread count |
| Network Metric | Network interface metrics such as bytes in/out, packets, and errors |
| Storage Metric | Filesystem and disk metrics including usage percentage, free space, and inode usage |
| Host Not Reporting | Detects when a host stops sending data for a configured duration |
| Checks Failing | Detects when infrastructure health checks fail |
| Process Running | Detects when a specific process stops running on a host |
Available metrics
System metrics
| Metric | Summary Function | Unit |
|---|---|---|
| Load Average 1 Minute (infra.system.load.1) | average | — |
| Load Average 5 Minutes (infra.system.load.5) | average | — |
| Load Average 15 Minutes (infra.system.load.15) | average | — |
| CPU Used Percentage (infra.system.cpu.total) | average | % |
| CPU User Percentage (infra.system.cpu.user) | average | % |
| CPU Idle Percentage (infra.system.cpu.idle) | average | % |
| CPU IOwait Percentage (infra.system.cpu.iowait) | average | % |
| CPU Steal Percentage (infra.system.cpu.steal) | average | % |
| Memory Actual Used Percentage (infra.system.memory.actualUsed.pct) | average | % |
| Memory Actual Used (infra.system.memory.actualUsed.bytes) | average | GB |
| Memory Actual Free (infra.system.memory.actualFree.bytes) | average | GB |
| Memory Swap Used Percentage (infra.system.memory.swapUsed.pct) | average | % |
| Memory Swap Used (infra.system.memory.swapUsed) | average | GB |
| Memory Swap Free (infra.system.memory.swapFree) | average | GB |
Storage metrics
| Metric | Summary Function | Unit |
|---|---|---|
| Disk Used Percentage (infra.system.fs.used.pct) | average | % |
| Inodes Used Percentage (infra.system.fs.usedInodesPercentage) | average | % |
Process metrics
| Metric | Summary Function | Unit |
|---|---|---|
| Process Fd Open Count (infra.system.process.fdOpen) | average | count |
| Process CPU Used Percentage (infra.system.process.cpuTotal.pct) | average | % |
| Process Memory Virtual (infra.system.process.memoryVirtual) | average | GB |
| Process Memory Shared (infra.system.process.memoryShared) | average | MB |
| Process Memory RSS (infra.system.process.memoryRss) | average | GB |
| Process Memory RSS Percentage (infra.system.process.memoryRss.pct) | average | % |
Network metrics
| Metric | Summary Function | Unit |
|---|---|---|
| Network In Bytes (infra.system.network.in.bytes) | average | MB |
| Network In Packets (infra.system.network.in.packets) | average | Packets |
| Network In Errors (infra.system.network.in.errors) | average | Packets |
| Network In Dropped (infra.system.network.in.dropped) | average | Packets |
| Network Out Bytes (infra.system.network.out.bytes) | average | MB |
| Network Out Packets (infra.system.network.out.packets) | average | Packets |
| Network Out Errors (infra.system.network.out.errors) | average | Packets |
| Network Out Dropped (infra.system.network.out.dropped) | average | Packets |
Infrastructure checks
| Metric | Summary Function | Unit |
|---|---|---|
| Checks Health (infra.system.checks.up) | max | — |
Host not reporting
| Metric | Summary Function | Unit |
|---|---|---|
| Host Not Reporting (infra.host.not.reporting) | avg | — |
Enter thresholds in the unit shown. For memory metrics, enter the value in GB or MB as shown. For CPU and disk metrics, enter the percentage value directly (e.g., 80 for 80%).
Targets
Infrastructure alerts target resources by hostname. Each alert rule fires independently per host, so a single rule can produce separate incidents for different servers.
Filters
Rules support filtering by hostname and by metric-specific dimensions through the filterConditions field.
Available filter operators
| Operator | Description |
|---|---|
| is | Exact match |
| is not | Excludes exact match |
| contains | Substring match |
| not contains | Excludes substring match |
| less than | Numeric comparison |
| greater than | Numeric comparison |
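The operators above can be read as simple predicates on a field value. The following is a minimal sketch of those semantics, not the product's implementation: the `OPERATORS` mapping and the `matches` function are hypothetical names, and treating `less than` and `greater than` as floating-point comparisons is an assumption.

```python
from typing import Callable, Dict

# Hypothetical mapping from operator name to predicate. The operator names
# come from the table above; everything else is illustrative.
OPERATORS: Dict[str, Callable[[str, str], bool]] = {
    "is": lambda actual, expected: actual == expected,
    "is not": lambda actual, expected: actual != expected,
    "contains": lambda actual, expected: expected in actual,
    "not contains": lambda actual, expected: expected not in actual,
    # Assumption: numeric operators compare both sides as floats.
    "less than": lambda actual, expected: float(actual) < float(expected),
    "greater than": lambda actual, expected: float(actual) > float(expected),
}


def matches(actual: str, operator: str, expected: str) -> bool:
    """Return True if `actual` satisfies the filter condition."""
    return OPERATORS[operator](actual, expected)
```

For instance, `matches("prod-web-01", "contains", "prod")` is true, while the same condition on `"staging-db"` is false.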
Evaluation logic
Cron-based evaluation
Alert rules are evaluated on a recurring schedule. At each evaluation cycle, the system queries ClickHouse for metric data within the configured duration window.
Duration
The evaluation window can be set to one of the following durations:
- 5 minutes
- 10 minutes
- 15 minutes
- 30 minutes
- 60 minutes
Time functions
| Function | Behavior |
|---|---|
| any | Alert fires if the threshold is breached at any point during the evaluation window |
| all | Alert fires only if the threshold is breached for the entire evaluation window |
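The difference between the two time functions can be sketched as a reduction over per-point breach flags. This is an illustrative sketch only: `breached_in_window` is a hypothetical helper, and returning `False` for a window with no data points is an assumption, not documented behavior.

```python
from typing import Iterable


def breached_in_window(flags: Iterable[bool], time_function: str) -> bool:
    """Combine per-point breach flags into one decision for the window."""
    points = list(flags)
    if not points:
        # Assumption: a window with no data points does not fire.
        return False
    if time_function == "any":
        return any(points)
    if time_function == "all":
        return all(points)
    raise ValueError(f"unknown time function: {time_function}")
```

With flags `[False, True, False]`, `any` fires and `all` does not; `all` fires only when every point in the window is a breach.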
Threshold operators
| Operator | Description |
|---|---|
| above | Fires when the metric value exceeds the threshold |
| below | Fires when the metric value drops below the threshold |
| equal | Fires when the metric value equals the threshold |
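As a rough illustration of the threshold operators, the comparison can be written as a single function. The name `breaches` is hypothetical, and the strict inequalities for `above` and `below` are inferred from the wording "exceeds" and "drops below".

```python
def breaches(value: float, operator: str, threshold: float) -> bool:
    """Return True if `value` breaches `threshold` under the given operator."""
    if operator == "above":
        return value > threshold  # strictly greater, per "exceeds"
    if operator == "below":
        return value < threshold  # strictly less, per "drops below"
    if operator == "equal":
        return value == threshold
    raise ValueError(f"unknown threshold operator: {operator}")
```

Under this reading, a value exactly at the threshold does not fire for `above` or `below`, only for `equal`.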
Special rule: host not reporting
The Host Not Reporting rule type uses a different evaluation approach. It checks if the most recent event time for a host is older than the start of the evaluation window. If lastEventTime < startTime, the host is considered stale and the alert fires.
This means the alert triggers when a host has not reported any data for longer than the configured duration.
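The staleness check can be sketched as follows. The function name `is_host_stale` and its parameters are hypothetical. Only the comparison `lastEventTime < startTime` comes from the description above, with `startTime` taken to be the evaluation time minus the configured duration.

```python
from datetime import datetime, timedelta, timezone


def is_host_stale(last_event_time: datetime, now: datetime, duration_minutes: int) -> bool:
    """Return True if the host's last event predates the evaluation window."""
    # Assumption: the window ends at `now` and spans `duration_minutes`.
    start_time = now - timedelta(minutes=duration_minutes)
    return last_event_time < start_time


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = now - timedelta(minutes=3)   # reported inside a 10-minute window
silent = now - timedelta(minutes=12)  # last reported before the window began
```

Here `is_host_stale(silent, now, 10)` is true and `is_host_stale(recent, now, 10)` is false.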
Special rule: checks failing
The Checks Failing rule type monitors infrastructure health checks. Data is grouped by hostname, check name, and IP address. Within each 1-minute bucket, the system evaluates the check status. If a check reports as down, it is considered failing.
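The grouping and bucketing can be sketched as below. This is a sketch under stated assumptions, not the actual query: the `Sample` tuple layout and the `failing_buckets` function are hypothetical, samples are assumed to carry `1` for up and `0` for down, and each bucket is summarized with `max` to match the summary function listed for `infra.system.checks.up`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical sample layout: hostname, check name, IP, epoch seconds, up (1 or 0).
Sample = Tuple[str, str, str, int, int]
GroupKey = Tuple[str, str, str]


def failing_buckets(samples: Iterable[Sample]) -> Dict[GroupKey, List[int]]:
    """Return, per (hostname, check name, IP), the 1-minute buckets that are down."""
    bucket_max: Dict[Tuple[GroupKey, int], int] = {}
    for hostname, check_name, ip, ts, up in samples:
        # Align the timestamp to the start of its 1-minute bucket.
        key = ((hostname, check_name, ip), ts - ts % 60)
        bucket_max[key] = max(bucket_max.get(key, 0), up)

    failing: Dict[GroupKey, List[int]] = defaultdict(list)
    for (group, bucket), value in sorted(bucket_max.items()):
        if value == 0:  # assumption: a bucket whose max is 0 counts as down
            failing[group].append(bucket)
    return dict(failing)


samples = [
    ("web-1", "http", "10.0.0.1", 0, 1),
    ("web-1", "http", "10.0.0.1", 65, 0),
    ("web-1", "http", "10.0.0.1", 70, 0),
    ("db-1", "tcp", "10.0.0.2", 10, 1),
]
result = failing_buckets(samples)
```

In this sample data only the `http` check on `web-1` has a failing bucket, the one starting at second 60.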
Examples
Example 1: CPU usage alert
Alert when any host's CPU usage exceeds 90% for 5 minutes.
| Setting | Value |
|---|---|
| Rule type | System Metric |
| Metric | CPU Used Percentage |
| Operator | above |
| Critical threshold | 90 % |
| Warning threshold | 80 % |
| Duration | 5 minutes |
| Time function | all |
This rule evaluates CPU usage across all hosts. If a host sustains CPU usage above 90% for the full 5-minute window, a critical incident is opened. A warning is raised at 80%.
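To make the rule concrete, the sketch below classifies one host's 5-minute window. It is illustrative only: `severity` is a hypothetical function, and it assumes that the warning threshold is evaluated with the same `all` time function as the critical threshold and that an empty window raises nothing.

```python
from typing import List, Optional


def severity(values: List[float], critical: float = 90.0, warning: float = 80.0) -> Optional[str]:
    """Classify a window using the `above` operator and the `all` time function."""
    if not values:
        return None  # assumption: no samples means no incident
    if all(v > critical for v in values):
        return "critical"
    # Assumption: warning uses the same time function as critical.
    if all(v > warning for v in values):
        return "warning"
    return None
```

A window of `[95, 92, 97, 91, 93]` is critical. One dip to 85 downgrades it to a warning, and a dip to 70 clears it entirely.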
Example 2: Host not reporting
Alert when a host stops sending data for more than 10 minutes.
| Setting | Value |
|---|---|
| Rule type | Host Not Reporting |
| Duration | 10 minutes |
At each evaluation cycle, the system checks the last event time for every known host. If a host's last event is more than 10 minutes ago, an incident is created for that host.
Example 3: Disk space alert
Alert when filesystem usage exceeds 85% on any host.
| Setting | Value |
|---|---|
| Rule type | Storage Metric |
| Metric | Disk Used Percentage |
| Operator | above |
| Critical threshold | 85 % |
| Warning threshold | 75 % |
| Duration | 15 minutes |
| Time function | any |
| Filter | hostname contains "prod" |
This rule monitors disk usage on hosts with "prod" in their hostname. If any evaluation point within the 15-minute window shows usage above 85%, a critical incident is opened. The any time function means even a brief spike triggers the alert.