Atatus Kubernetes alerts monitor the health and performance of your Kubernetes clusters. You can alert on CPU and memory usage for pods, containers, and nodes, track replica availability for deployments and statefulsets, detect pod restart loops, and monitor storage capacity.
Available metrics
Pod metrics
| Metric | Summary Function | Unit |
|---|---|---|
| CPU Usage | average | millicore |
| Memory Usage | average | MB |
| Network Received | average | MB |
| Network Transmitted | average | MB |
| Pod Restart Count | average | — |
Container metrics
| Metric | Summary Function | Unit |
|---|---|---|
| CPU Usage | average | millicore |
| Memory Usage | average | MB |
| Container Restart Count | average | — |
CronJob metrics
| Metric | Summary Function | Unit |
|---|---|---|
| Active Count | average | — |
DaemonSet metrics
| Metric | Summary Function | Unit |
|---|---|---|
| Replicas Available | average | replicas |
| Replicas Desired | average | replicas |
Deployment metrics
| Metric | Summary Function | Unit |
|---|---|---|
| Replicas Available | average | replicas |
| Replicas Desired | average | replicas |
Job metrics
| Metric | Summary Function | Unit |
|---|---|---|
| Pods Succeeded | average | pods |
| Pods Active | average | — |
| Pods Failed | average | pods |
Node metrics
| Metric | Summary Function | Unit |
|---|---|---|
| CPU Usage | average | millicore |
| Memory Usage | average | GB |
Storage metrics
| Metric | Summary Function | Unit |
|---|---|---|
| PersistentVolume Capacity | average | MB |
| PersistentVolumeClaim Request Storage | average | GB |
ReplicaSet metrics
| Metric | Summary Function | Unit |
|---|---|---|
| Replicas Available | average | replicas |
| Replicas Desired | average | replicas |
StatefulSet metrics
| Metric | Summary Function | Unit |
|---|---|---|
| Replicas Observed | average | replicas |
| Replicas Desired | average | replicas |
| Replicas Ready | average | replicas |
Enter CPU thresholds in millicores (1 core = 1000 millicores). Enter memory and network thresholds in MB or GB as shown. The alert engine converts units automatically.
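As a rough illustration of the conversion the alert engine performs, the sketch below turns user-facing thresholds into the raw units suggested by column names such as cpuUsageNanocores and memoryWorkingsetBytes. The function names are hypothetical, and the use of binary megabytes and gigabytes (1 MB = 1024 × 1024 bytes) is an assumption; the actual engine may use decimal units.

```python
# Hypothetical helpers; not part of any Atatus API.
NANOCORES_PER_MILLICORE = 1_000_000   # 1 core = 1000 millicores = 1e9 nanocores
BYTES_PER_MB = 1024 * 1024            # assumption: binary megabytes
BYTES_PER_GB = 1024 * BYTES_PER_MB    # assumption: binary gigabytes


def millicores_to_nanocores(millicores: int) -> int:
    """Convert a CPU threshold entered in millicores to nanocores."""
    return millicores * NANOCORES_PER_MILLICORE


def mb_to_bytes(megabytes: int) -> int:
    """Convert a memory or network threshold entered in MB to bytes."""
    return megabytes * BYTES_PER_MB


def gb_to_bytes(gigabytes: int) -> int:
    """Convert a memory or storage threshold entered in GB to bytes."""
    return gigabytes * BYTES_PER_GB
```

Under these assumptions, a 4000 millicore threshold corresponds to 4,000,000,000 nanocores, and a 1024 MB threshold is the same number of bytes as 1 GB.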
How queries are executed
Kubernetes alert queries evaluate metrics per resource name, not at the cluster level. Each distinct pod, container, node, or workload is evaluated independently, and each violating resource creates its own incident.
Pod and container queries
- A status filter subquery runs first to exclude completed pods (statusPhase != 'succeeded') or empty container names. Only active resources are evaluated.
- The main query computes the metric value (e.g., avg(cpuUsageNanocores)) per 1-minute bucket, grouped by resource name.
- The outer query counts how many buckets violated the threshold.
- Filters like cluster, namespace, node, and deployment can narrow the scope via the kubeFilters field.
Restart count queries work differently: they compute the delta of statusRestarts (max - min) within the time window from the kubernetes.state_container table, rather than using an absolute count. This captures restarts that occurred during the evaluation period.
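The delta calculation can be sketched as follows. This is a minimal illustration rather than the engine's implementation: the function name is hypothetical, and returning 0 for an empty window is an assumption.

```python
from typing import Iterable


def restart_delta(status_restarts: Iterable[int]) -> int:
    """Restarts observed inside the window: max(statusRestarts) - min(statusRestarts)."""
    samples = list(status_restarts)
    if not samples:
        return 0  # assumption: no samples in the window means no restarts
    return max(samples) - min(samples)
```

A container whose cumulative counter reads 7, 7, 8, 9 across the window yields a delta of 2, even though its lifetime restart count is 9.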
Workload queries (DaemonSet, Deployment, ReplicaSet, StatefulSet, CronJob, Job)
Workload queries evaluate metrics per workload name. For example, a deployment replicas alert computes avg(replicasAvailable) per 1-minute bucket for each deployment.
This is especially useful for detecting replica mismatches: alert when Replicas Available drops below a threshold to catch under-provisioned deployments.
Node queries
Node queries evaluate metrics per node name from the kubernetes.node table. Memory usage uses the memoryWorkingsetBytes column (working set memory, not total allocated memory), which reflects actual memory pressure.
Storage queries
PersistentVolume and PersistentVolumeClaim queries evaluate per volume name from their respective tables.
Common query pattern
All Kubernetes queries follow this structure:
SELECT name,
       countIf(value {operator} {threshold}) AS violationCount,
       count() AS totalCount
FROM (
    SELECT name, avg({column}) AS value
    FROM kubernetes.{resource_table}
    WHERE {time_and_account_filters}
      AND name IN ({status_filter})
    GROUP BY {time_bucket}, name
)
GROUP BY name
The inner query computes the metric value per 1-minute bucket per resource. The outer query counts how many buckets violated the threshold. An alert triggers based on the time function (all = every bucket breached, any = at least one bucket breached).
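To make the two-level aggregation easier to follow, here is a small in-memory sketch of the same logic. It is an illustration only: the function name, the (name, epoch_seconds, value) sample shape, and the bucketing by integer division of the timestamp are assumptions, not the engine's actual code.

```python
import operator
from collections import defaultdict
from statistics import mean

# Operator names from the alert rule mapped to comparisons.
OPERATORS = {"above": operator.gt, "below": operator.lt, "equal": operator.eq}


def count_violations(samples, op, threshold):
    """Return {name: (violationCount, totalCount)}.

    samples: iterable of (name, epoch_seconds, value) tuples.
    """
    # Inner query: collect values per (resource name, 1-minute bucket).
    buckets = defaultdict(list)
    for name, timestamp, value in samples:
        buckets[(name, timestamp // 60)].append(value)

    # Outer query: average each bucket and count the ones that breach.
    compare = OPERATORS[op]
    counts = defaultdict(lambda: [0, 0])
    for (name, _minute), values in buckets.items():
        counts[name][1] += 1
        if compare(mean(values), threshold):
            counts[name][0] += 1
    return {name: tuple(pair) for name, pair in counts.items()}
```

Because the comparison is applied to the per-bucket average, a single spike inside a minute does not count as a violation unless it lifts that minute's average past the threshold.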
Targets
Kubernetes alert rules target resources by name. Each violating resource creates its own incident.
Filters
Use the kubeFilters field to narrow the scope of evaluation:
- cluster — restrict to a specific cluster
- namespace — restrict to a specific namespace
- node / nodeName — restrict to a specific node
- deployment, daemonset, job, cronjob, replicaset — restrict to a specific workload
How alert evaluation works
Operators
| Operator | Triggers when |
|---|---|
| above | metric value > threshold |
| below | metric value < threshold |
| equal | metric value = threshold |
Evaluation windows
Available durations: 5, 10, 15, 30, or 60 minutes.
Time functions
| Function | Behavior |
|---|---|
| all | Triggers only if every 1-minute bucket in the window breaches the threshold. |
| any | Triggers if at least one 1-minute bucket breaches the threshold. |
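Expressed in terms of the violationCount and totalCount values returned by the query, the two time functions might reduce to the check below. This is a sketch under assumptions: the function name is hypothetical, and treating a window with no buckets as "do not trigger" is a guess about how missing data is handled.

```python
def should_trigger(violation_count: int, total_count: int, time_function: str) -> bool:
    """Decide whether a resource breaches the rule for the evaluation window."""
    if total_count == 0:
        return False  # assumption: no data in the window never triggers
    if time_function == "all":
        return violation_count == total_count
    if time_function == "any":
        return violation_count >= 1
    raise ValueError(f"unknown time function: {time_function!r}")
```

With a 5-minute window, `all` requires 5 breaching buckets out of 5, while `any` is satisfied by a single breaching bucket.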
Severity
Configure Warning and Critical thresholds independently.
Example configurations
Alert: Detect deployment replica mismatch when available replicas drop below 3.
| Setting | Value |
|---|---|
| Metric | Replicas Available (Deployment) |
| Operator | below |
| Critical threshold | 3 replicas |
| Duration | 5 minutes |
| Time function | all |
| Filter: namespace | production |
Alert: Pod memory usage exceeds 1024 MB.
| Setting | Value |
|---|---|
| Metric | Memory Usage (Pod) |
| Operator | above |
| Warning threshold | 800 MB |
| Critical threshold | 1024 MB |
| Duration | 10 minutes |
| Time function | all |
| Filter: namespace | production |
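The pod memory rule above can be walked through with a short sketch. It assumes the Critical threshold is checked before the Warning threshold and that each is evaluated with the same operator (above) and time function (all); the function name and the per-bucket input list are hypothetical.

```python
from typing import Optional, Sequence


def memory_alert_severity(
    bucket_values_mb: Sequence[float],
    warning_mb: float = 800,
    critical_mb: float = 1024,
) -> Optional[str]:
    """Return 'critical', 'warning', or None for an above/all rule."""
    if not bucket_values_mb:
        return None  # assumption: an empty window raises no incident
    # Assumption: the more severe threshold takes precedence.
    if all(value > critical_mb for value in bucket_values_mb):
        return "critical"
    if all(value > warning_mb for value in bucket_values_mb):
        return "warning"
    return None
```

Under these assumptions, a pod averaging 900 MB in every one of the ten buckets produces a warning, and a pod that dips to 700 MB in a single bucket produces nothing, because `all` requires every bucket to breach.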
Alert: Node CPU exceeds 4000 millicores (4 cores).
| Setting | Value |
|---|---|
| Metric | CPU Usage (Node) |
| Operator | above |
| Critical threshold | 4000 millicore |
| Duration | 15 minutes |
| Time function | all |