Understanding Alert Inhibition
Alertmanager's inhibition mechanism prevents notification storms by suppressing redundant alerts while an alert for their root cause is firing. For instance, if an entire server cluster becomes unreachable, an administrator typically needs only one high-level notification indicating the cluster failure. Without inhibition, they would be bombarded with individual alerts for every application, middleware service, and dependency that failed as a result of the outage.
Configuration Structure
Inhibition rules are defined under the inhibit_rules section in the Alertmanager configuration. The logic relies on matching labels between a "source" alert (the root cause) and a "target" alert (the symptom).
inhibit_rules:
  - source_match:                  # Criteria for the root cause alert
      severity: 'critical'
    target_match:                  # Criteria for the alert to be silenced
      alertname: 'HighLatency'
    equal: ['cluster', 'region']   # Labels that must match between source and target
While at least one alert matching the source_match criteria is firing, Alertmanager mutes notifications for every alert that matches the target_match criteria. The order in which the alerts arrive does not matter. Crucially, the labels listed in the equal array must have identical values in both alerts for the rule to apply.
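Alertmanager 0.22 and later also accept a matcher-based form of the same rule, and the upstream documentation marks source_match and target_match as deprecated in its favor. A minimal sketch of the equivalent rule, assuming a recent Alertmanager version:

```yaml
inhibit_rules:
  - source_matchers:               # root cause criteria, one matcher per list item
      - severity = "critical"
    target_matchers:               # criteria for the alerts to mute
      - alertname = "HighLatency"
    equal: ['cluster', 'region']   # unchanged: values must agree in both alerts
```

Matchers also support the !=, =~ and !~ operators, so the separate source_match_re and target_match_re keys are not needed in this form.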
Simulation Scenario
To demonstrate this, we will simulate a monitoring stack for a specific host. Let's assume we have a server named worker-node-01 (IP: 192.168.10.20) running an Nginx service. Our monitoring objectives include:
- Detecting if the server itself is down via Node Exporter.
- Checking if TCP port 80 (Nginx) is accessible.
- Verifying that the HTTP endpoint http://192.168.10.20/health returns a 200 status code.
If the server crashes, we want to suppress the TCP and HTTP alerts, as their failure is merely a consequence of the server being offline.
Deploying Blackbox Exporter
First, we deploy the Blackbox Exporter to perform the network probing.
apiVersion: v1
kind: Service
metadata:
  name: blackbox-prober
  namespace: observability
  labels:
    app: blackbox-prober
spec:
  selector:
    app: blackbox-prober
  type: ClusterIP
  ports:
    - name: metrics
      port: 9115
      protocol: TCP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: blackbox-prober
  namespace: observability
  labels:
    app: blackbox-prober
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blackbox-prober
  template:
    metadata:
      labels:
        app: blackbox-prober
    spec:
      containers:
        - image: prom/blackbox-exporter:latest
          name: blackbox-prober
          ports:
            - containerPort: 9115
              name: metrics
          resources:
            limits:
              cpu: "500m"
              memory: "1Gi"
            requests:
              cpu: "100m"
              memory: "256Mi"
      restartPolicy: Always
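The scrape jobs that follow request modules named http_2xx and tcp_connect. The stock prom/blackbox-exporter image ships a default configuration that defines modules with these names, so the Deployment above should work unmodified. If timeouts or probe options need tuning, a custom configuration could be supplied roughly as sketched here. The ConfigMap name is hypothetical, and the ConfigMap would still have to be mounted into the container and passed through the exporter's --config.file flag.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: blackbox-prober-config    # hypothetical name, not part of the original setup
  namespace: observability
data:
  blackbox.yml: |
    modules:
      http_2xx:
        prober: http
        timeout: 5s
        http:
          method: GET             # valid_status_codes defaults to any 2xx response
      tcp_connect:
        prober: tcp
        timeout: 5s
```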
Configuring Prometheus Scraping
We configure Prometheus to use the Blackbox Exporter for HTTP and TCP module probing.
scrape_configs:
  - job_name: 'blackbox_http_probe'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://192.168.10.20/health
    relabel_configs:
      # Pass the listed target to the exporter as the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed target as the instance label on the resulting series
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to the Blackbox Exporter itself
      - target_label: __address__
        replacement: blackbox-prober:9115
  - job_name: 'blackbox_tcp_probe'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - 192.168.10.20:80
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-prober:9115
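The ServerHardwareDown rule defined later relies on a job named node_exporter, which the scrape configuration above does not include. A minimal sketch of such a job, assuming Node Exporter runs on the host and listens on its default port 9100:

```yaml
scrape_configs:
  - job_name: 'node_exporter'        # must match the job label used in the alert expression
    static_configs:
      - targets:
          - 192.168.10.20:9100       # assumption: default Node Exporter port
```

With this job in place, the instance label of the up series is 192.168.10.20:9100, which differs from the instance values produced by the two probe jobs.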
Defining Alert Rules
We define three alert rules with a clear hierarchy of severity. The host status is the most critical, followed by the service port, and finally the HTTP endpoint status.
groups:
  - name: infrastructure_alerts
    rules:
      - alert: ServerHardwareDown
        expr: up{job="node_exporter"} == 0
        for: 2m
        labels:
          severity: critical
          scope: infrastructure
          host: worker-node-01   # static value for this single-host scenario
        annotations:
          summary: "Server {{ $labels.instance }} is unreachable"
          description: "Node exporter has been down for more than 2 minutes."
      - alert: NginxServiceUnreachable
        expr: probe_success{instance="192.168.10.20:80"} == 0
        for: 1m
        labels:
          severity: warning
          scope: application
          project: web-tier
          host: worker-node-01
        annotations:
          summary: "Nginx port unreachable on {{ $labels.instance }}"
      - alert: WebPageStatusCodeInvalid
        expr: probe_http_status_code{instance="http://192.168.10.20/health"} != 200
        for: 1m
        labels:
          severity: info
          scope: application
          project: web-tier
          host: worker-node-01
        annotations:
          summary: "Health check failed for {{ $labels.instance }}"
Configuring Inhibition Rules
To prevent the alert storm, we configure inhibition rules in Alertmanager based on the labels defined above. The rules compare the host label rather than instance: the instance value differs between the three alerts (the Node Exporter target, 192.168.10.20:80, and the health URL), so an equal check on instance would never succeed.
- Host Down Inhibition: If ServerHardwareDown fires, suppress alerts for the same host that belong to the web-tier project.
- Service Dependency Inhibition: If the Nginx TCP port is unreachable, suppress the HTTP status code check for the same host, as the TCP failure makes the HTTP check irrelevant.
inhibit_rules:
  # Rule 1: If the server is down, suppress application alerts
  - source_match:
      alertname: 'ServerHardwareDown'
      severity: 'critical'
    target_match:
      project: 'web-tier'
    equal: ['host']
  # Rule 2: If the Nginx port is down, suppress the HTTP page check
  - source_match:
      alertname: 'NginxServiceUnreachable'
    target_match:
      alertname: 'WebPageStatusCodeInvalid'
    equal: ['host']
Verification and Testing
Test Case 1: Service Failure
Stop the Nginx service on the node. Prometheus triggers two alerts: NginxServiceUnreachable and WebPageStatusCodeInvalid. However, in Alertmanager, only NginxServiceUnreachable is visible. The WebPageStatusCodeInvalid alert is successfully inhibited by the second rule because the TCP probe failure logically supersedes the HTTP content check.
Test Case 2: Total Outage
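Power off the worker-node-01 server. Prometheus triggers all three alerts. Once ServerHardwareDown is firing, it is the only alert that appears in Alertmanager: the NginxServiceUnreachable and WebPageStatusCodeInvalid alerts are both suppressed by the first inhibition rule. Because ServerHardwareDown waits 2 minutes before firing while the probe alerts wait only 1 minute, NginxServiceUnreachable can surface briefly before the host alert takes over. From that point on, the notification system sends only the critical email regarding the server outage.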
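The email mentioned above presupposes a route and a receiver, which the walkthrough does not show. A minimal sketch of what that part of the Alertmanager configuration might look like; the SMTP host, addresses, receiver name, and timing values are placeholders rather than values from the original setup, and a real SMTP server would normally need authentication settings as well:

```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'   # placeholder SMTP relay
  smtp_from: 'alertmanager@example.com'    # placeholder sender address
route:
  receiver: ops-email                      # hypothetical receiver name
  group_by: ['alertname']
  group_wait: 30s                          # delay before the first notification of a group
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: ops-email
    email_configs:
      - to: 'ops@example.com'              # placeholder recipient
```

Inhibition is evaluated before notifications are sent, so suppressed alerts never reach this receiver while their source alert is firing.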