Prometheus is my go-to tool for monitoring these days. At its core is a time-series database that can be queried with a powerful language – not only for graphing but also for alerting. Alerts generated by Prometheus are usually sent to Alertmanager, which delivers them via various media like email or Slack messages.
That’s all nice and dandy, but when I started to use it I struggled because Prometheus ships with no built-in alerts. Looking around the Internet, though, I found the following alert examples:
From my point of view, the lack of ready-to-use examples is a major pain for anyone who is starting to use Prometheus. Fortunately, the community is aware of that and is working on various proposals:
All of this seems great but we are not there yet, so here is my humble attempt to add more examples to the sources above. I hope it will help you get started with Prometheus and Alertmanager.
Before you start setting up alerts you must have metrics in the Prometheus time-series database. There are various exporters that expose metrics to Prometheus, but I will show you examples for the following: node_exporter for hardware metrics, redis_exporter for Redis, jmx-exporter for Kafka and Zookeeper, and consul_exporter (plus Consul’s own telemetry) for Consul.
All of these exporters are very easy to set up except the JMX one, which has to run as a Java agent inside the Kafka/Zookeeper JVM. Refer to my previous post on setting up jmx-exporter.
After setting up all the needed exporters and collecting metrics for some time, we can start crafting our alerts.
My philosophy for alerting is pretty simple – alert only when something is really broken, include maximum info and deliver via multiple media.
You describe the alerts in the alert.rules file (usually in /etc/prometheus) on the Prometheus server, not in Alertmanager, because the latter is responsible for formatting and delivering alerts.
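For completeness, the rules file also has to be referenced from the Prometheus config, and Prometheus needs to know where Alertmanager lives. Here is a minimal sketch – the file path and the Alertmanager address are assumptions, adjust them to your setup:

rule_files:
  - /etc/prometheus/alert.rules

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager:9093']   # 9093 is the default Alertmanager port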
The format of alert.rules is YAML and it goes like this:
groups:
- name: Hardware alerts
  rules:
  - alert: Node down
    expr: up{job="node_exporter"} == 0
    for: 3m
    labels:
      severity: warning
    annotations:
      title: Node {{ $labels.instance }} is down
      description: Failed to scrape {{ $labels.job }} on {{ $labels.instance }} for more than 3 minutes. Node seems down.
You have a top-level groups key that contains a list of groups. I usually create a group for each exporter, so I have Hardware alerts for node_exporter, Redis alerts for redis_exporter and so on.
Also, all of my alerts have two annotations – title and description – that will be used by Alertmanager.
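To give you an idea of how these annotations are consumed, here is a rough sketch of an Alertmanager Slack receiver that references them – the receiver name and channel are made up for the example:

receivers:
- name: alerts-slack                 # hypothetical receiver name
  slack_configs:
  - channel: '#alerts'               # assumed channel
    title: '{{ .CommonAnnotations.title }}'
    text: '{{ .CommonAnnotations.description }}'
    send_resolved: true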
Let’s start with a simple one – alert when the server is down.
- alert: Node down
  expr: up{job="node_exporter"} == 0
  for: 3m
  labels:
    severity: warning
  annotations:
    title: Node {{ $labels.instance }} is down
    description: Failed to scrape {{ $labels.job }} on {{ $labels.instance }} for more than 3 minutes. Node seems down.
The essence of this alert is the expression up{job="node_exporter"} == 0. I’ve seen a lot of examples that just use up == 0, which is odd because every exporter scraped by Prometheus has this metric, so you’ll be alerted on completely unwanted things like a restart of postgres_exporter, which is not the same as Postgres itself being down. So I set the job label to node_exporter to explicitly check node health.
Another key part of this alert is for: 3m, which tells Prometheus to fire the alert only when the expression holds true for 3 minutes. This is intended to avoid false positives when a few scrapes fail because of network hiccups. It basically adds robustness to your alerts.
Some people use blackbox_exporter with an ICMP probe for this.
Next is the Linux MD RAID alert:
- alert: MDRAID degraded
  expr: (node_md_disks - node_md_disks_active) != 0
  for: 1m
  labels:
    severity: warning
  annotations:
    title: MDRAID on node {{ $labels.instance }} is in degrade mode
    description: "Degraded RAID array {{ $labels.device }} on {{ $labels.instance }}: {{ $value }} disks failed"
In this one I check the difference between the total number of disks and the number of active disks, and use the diff value {{ $value }} in the description. You can also access metric labels via the $labels variable to put useful info into your alerts.
The next one is for bonding status:
- alert: Bond degraded
  expr: (node_bonding_active - node_bonding_slaves) != 0
  for: 1m
  labels:
    severity: warning
  annotations:
    title: Bond is degraded on {{ $labels.instance }}
    description: Bond {{ $labels.master }} is degraded on {{ $labels.instance }}
This one is similar to the mdraid one.
And the final one for hardware alerts is free space:
- alert: Low free space
  expr: (node_filesystem_free{mountpoint !~ "/mnt.*"} / node_filesystem_size{mountpoint !~ "/mnt.*"} * 100) < 15
  for: 1m
  labels:
    severity: warning
  annotations:
    title: Low free space on {{ $labels.instance }}
    description: Device {{ $labels.device }} mounted on {{ $labels.mountpoint }} on {{ $labels.instance }} has only {{ $value }}% free space left
For free space I calculate the free percentage and check whether it’s less than 15%. In the expression above I also exclude all mountpoints under /mnt because those are usually external to the node, like remote storage for backups, which may legitimately be close to full.
The final note here is the labels section, where I set severity: warning. Inspired by the Google SRE book, I have decided to use only two severity levels for alerting – warning and page. warning alerts should go to the ticketing system, and you should react to them during normal working days. page alerts are emergencies that can wake up the on-call engineer – this type of alert should be crafted carefully to avoid burnout. Alert routing based on severity is handled by Alertmanager.
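As a sketch of what that routing can look like on the Alertmanager side (the receiver names are made up, plug in your own integrations):

route:
  receiver: tickets                  # default route: warning alerts end up in the ticketing system
  group_by: ['alertname', 'instance']
  routes:
  - match:
      severity: page
    receiver: on-call                # page alerts go to the paging integration
receivers:
- name: tickets
- name: on-call

Anything that does not match severity: page falls through to the default receiver.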
The Redis alerts are pretty simple – we have a warning alert on Redis cluster instance availability and a page alert when the whole cluster is broken:
- alert: Redis instance is down
  expr: redis_up == 0
  for: 1m
  labels:
    severity: warning
  annotations:
    title: Redis instance is down
    description: Redis is down at {{ $labels.instance }} for 1 minute.
- alert: Redis cluster is down
  expr: min(redis_cluster_state) == 0
  labels:
    severity: page
  annotations:
    title: Redis cluster is down
    description: Redis cluster is down.
These metrics are reported by redis_exporter. I deploy it on all instances of the Redis cluster – that’s why there is a min function applied to redis_cluster_state.
I have a single Redis cluster, but if you have multiple you should include that in the alert description – possibly via labels.
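For example, if your redis_exporter targets carried a cluster label (attached via static_configs labels or relabeling – that label is an assumption on my part, it is not there by default), the page alert could be made cluster-aware along these lines:

- alert: Redis cluster is down
  expr: min by (cluster) (redis_cluster_state) == 0
  labels:
    severity: page
  annotations:
    title: Redis cluster is down
    description: Redis cluster {{ $labels.cluster }} is down.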
For Kafka we check for availability of brokers and health of the cluster.
- alert: KafkaDown
  expr: up{instance=~"kafka-.+", job="jmx-exporter"} == 0
  for: 3m
  labels:
    severity: warning
  annotations:
    title: Kafka broker is down
    description: Kafka broker is down on {{ $labels.instance }}. Could not scrape jmx-exporter for 3m.
To check whether Kafka is down we use the up metric from jmx-exporter. This is a sane way to check whether the Kafka process is alive, because jmx-exporter runs as a Java agent inside the Kafka process. We also filter by instance name because jmx-exporter is run for both Kafka and Zookeeper.
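For that instance filter to work, the jmx-exporter targets have to be named accordingly. A rough sketch of the scrape config I have in mind (hostnames and agent ports are assumptions):

scrape_configs:
- job_name: jmx-exporter
  static_configs:
  - targets: ['kafka-1:7071', 'kafka-2:7071', 'kafka-3:7071']
  - targets: ['zookeeper-1:7072', 'zookeeper-2:7072', 'zookeeper-3:7072']

With the default relabeling the instance label equals the target address, so kafka-1:7071 matches kafka-.+ while the Zookeeper targets do not.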
- alert: KafkaNoController
  expr: sum(kafka_controller_kafkacontroller_activecontrollercount) < 1
  for: 3m
  labels:
    severity: warning
  annotations:
    title: Kafka cluster has no controller
    description: Kafka controller count < 1, cluster is probably broken.
This one checks for the active controller. The controller is responsible for managing the states of partitions and replicas and for performing administrative tasks like reassigning partitions. Every broker reports the kafka_controller_kafkacontroller_activecontrollercount metric, but only the current controller reports 1 – that’s why we use sum.
If you use Kafka as an event bus or for any other real-time processing, you may choose severity page for this one. In my case I use it as a queue, and if it’s broken client requests are not affected. That’s why I have severity warning here.
- alert: KafkaOfflinePartitions
  expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
  for: 3m
  labels:
    severity: warning
  annotations:
    title: Kafka cluster has offline partitions
    description: "{{ $value }} partitions in Kafka went offline (have no leader), cluster is probably broken."
In this one we check for offline partitions. These partitions have no leader and thus can’t accept or deliver messages. We check for offline partitions on all nodes – that’s why we have sum in the alert expression.
Again, if you use Kafka for real-time processing, you may choose to assign page severity to this alert.
- alert: KafkaUnderreplicatedPartitions
  expr: sum(kafka_cluster_partition_underreplicated) > 10
  for: 3m
  labels:
    severity: warning
  annotations:
    title: Kafka cluster has underreplicated partitions
    description: "{{ $value }} partitions in Kafka are under-replicated."
Finally, we check for under-replicated partitions. This may happen when a Kafka node fails and a partition has nowhere to replicate. This does not prevent Kafka from serving the partition – producers and consumers will continue to work – but the data in this partition is at risk.
Zookeeper alerts are similar to Kafka – we check for instance availability and cluster health.
- alert: Zookeeper is down
  expr: up{instance=~"zookeeper-.+", job="jmx-exporter"} == 0
  for: 3m
  labels:
    severity: warning
  annotations:
    title: Zookeeper instance is down
    description: Zookeeper is down on {{ $labels.instance }}. Could not scrape jmx-exporter for 3 minutes.
Just like with Kafka, we check for Zookeeper instance availability via the up metric of jmx-exporter, because it runs inside the Zookeeper process.
- alert: Zookeeper is slow
  expr: max_over_time(zookeeper_MaxRequestLatency[1m]) > 10000
  for: 3m
  labels:
    severity: warning
  annotations:
    title: Zookeeper high latency
    description: Zookeeper latency is {{ $value }}ms (aggregated over 1m) on {{ $labels.instance }}.
You should really care about Zookeeper latency, because if it’s slow, dependent systems will fail miserably – leader election will fail, replication will fail and all other sorts of bad things will happen.
Zookeeper latency is reported via the zookeeper_MaxRequestLatency metric, but it’s a gauge, so you can’t apply the increase or rate functions to it. That’s why we use max_over_time over 1m intervals.
The alert checks whether the max latency is more than 10 seconds (10000 ms). This may seem extreme, but we have seen it in production.
- alert: Zookeeper ensemble is broken
  expr: sum(up{job="jmx-exporter", instance=~"zookeeper-.+"}) < 2
  for: 1m
  labels:
    severity: page
  annotations:
    title: Zookeeper ensemble is broken
    description: Zookeeper ensemble is broken, it has {{ $value }} nodes in it.
Finally, there is an alert for Zookeeper ensemble status where we sum the up metric values for jmx-exporter. Remember that it runs inside the Zookeeper JVM, so essentially we count how many Zookeeper instances are up and compare that to the quorum size (2 in the case of a 3-node cluster).
As with Zookeeper and any other clustered system, we check for Consul availability and cluster health.
There are two sources of metrics for Consul: 1) the official consul_exporter and 2) Consul itself via its telemetry configuration.
consul_exporter provides most of the metrics for monitoring the health of nodes and services registered in Consul, while Consul itself exposes internal metrics like the client RPC rate and other runtime metrics.
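A rough sketch of how both sources can be scraped (hostnames and ports are assumptions; for the second job, Consul has to have prometheus_retention_time set in its telemetry config, otherwise the endpoint returns nothing):

scrape_configs:
- job_name: consul-exporter
  static_configs:
  - targets: ['consul-1:9107', 'consul-2:9107', 'consul-3:9107']   # 9107 is the default consul_exporter port
- job_name: consul
  metrics_path: /v1/agent/metrics
  params:
    format: ['prometheus']
  static_configs:
  - targets: ['consul-1:8500', 'consul-2:8500', 'consul-3:8500']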
To check whether the Consul agent is healthy we use the consul_health_node_status metric with the label status="critical":
- alert: Consul agent is not healthy
  expr: consul_health_node_status{instance=~"consul-.+", status="critical"} == 1
  for: 1m
  labels:
    severity: warning
  annotations:
    title: Consul agent is down
    description: Consul agent is not healthy on {{ $labels.node }}.
Next, we check for cluster degradation via consul_raft_peers. This metric reports how many server nodes are in the cluster. The trick is to apply the min function to it so we can detect network partitions where one instance thinks it has 2 raft peers and another has 1.
- alert: Consul cluster is degraded
  expr: min(consul_raft_peers) < 3
  for: 1m
  labels:
    severity: page
  annotations:
    title: Consul cluster is degraded
    description: Consul cluster has {{ $value }} servers alive. This may lead to a broken cluster.
Finally, we check the autopilot status. Autopilot is a Consul feature where the leader constantly checks the stability of the other servers. This is an internal metric reported by Consul itself.
- alert: Consul cluster is not healthy
  expr: consul_autopilot_healthy == 0
  for: 1m
  labels:
    severity: page
  annotations:
    title: Consul cluster is not healthy
    description: Consul autopilot thinks that cluster is not healthy.
I hope you’ll find this useful and that these sample alerts will help you jump-start your Prometheus journey.
There are a lot of useful metrics you can use for alerts, and there is no magic here – research what metrics you have, think about how they can help you track the stability of your system, rinse and repeat.
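If you’re not sure what you have to work with, one simple trick (just a suggestion, not part of my setup above) is to list all metric names a job exposes right in the Prometheus expression browser:

count by (__name__) ({job="node_exporter"})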
That’s it, till the next time!