Prometheus alerts examples

October 29, 2019

Prometheus is my go-to tool for monitoring these days. At the core of Prometheus is a time-series database that can be queried with a powerful language, and that language is used for everything: not only graphing but also alerting. Alerts generated by Prometheus are usually sent to Alertmanager, which delivers them via various media like email or Slack messages.

That’s all nice and dandy, but when I started to use it I struggled because Prometheus ships with no built-in alerts. Searching the Internet, though, I did find a few collections of example alerts.

From my point of view, the lack of ready-to-use examples is a major pain point for anyone starting out with Prometheus. Fortunately, the community is aware of that and is working on various proposals to improve the situation.

All of this seems great, but we are not there yet, so here is my humble attempt to add more examples to those sources. I hope it will help you get started with Prometheus and Alertmanager.

Prerequisites

Before you start setting up alerts you must have metrics in the Prometheus time-series database. There are many exporters that expose metrics to Prometheus, but I will show you examples for the following:

  • node_exporter for hardware alerts
  • redis_exporter for Redis cluster alerts
  • jmx-exporter for Kafka and Zookeeper alerts
  • consul_exporter for alerting on Consul metrics

All of these exporters are very easy to set up except jmx-exporter, because it has to run as a Java agent within the Kafka/Zookeeper JVM. Refer to my previous post on setting up jmx-exporter.
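For reference, here is a minimal sketch of what the scrape configuration for these exporters might look like in prometheus.yml. The hostnames and ports are assumptions for illustration; the job names just need to match the ones used in the alert expressions later:

scrape_configs:
  - job_name: node_exporter
    static_configs:
      - targets: ['node-1:9100', 'node-2:9100']        # default node_exporter port
  - job_name: redis_exporter
    static_configs:
      - targets: ['redis-1:9121']                      # default redis_exporter port
  - job_name: jmx-exporter
    static_configs:
      - targets: ['kafka-1:7071', 'zookeeper-1:7071']  # whatever port you gave the Java agent
  - job_name: consul_exporter
    static_configs:
      - targets: ['consul-1:9107']                     # default consul_exporter port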

After setting up all the needed exporters and collecting metrics for some time, we can start crafting our alerts.

Alerts

My philosophy for alerting is pretty simple – alert only when something is really broken, include maximum info and deliver via multiple media.

You describe the alerts in an alert.rules file (usually in /etc/prometheus) on the Prometheus server, not in Alertmanager; the latter is only responsible for formatting and delivering the alerts.
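For context, Prometheus is pointed at the rules file and at Alertmanager in prometheus.yml, roughly like this (the path and the Alertmanager address are assumptions for illustration):

rule_files:
  - /etc/prometheus/alert.rules      # the alerting rules described below

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # default Alertmanager port

Before reloading Prometheus you can validate the rules file with promtool check rules /etc/prometheus/alert.rules.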

The format of alert.rules is YAML and it goes like this:

groups:
- name: Hardware alerts
  rules:
  - alert: Node down
    expr: up{job="node_exporter"} == 0
    for: 3m
    labels:
      severity: warning
    annotations:
      title: Node {{ $labels.instance }} is down
      description: Failed to scrape {{ $labels.job }} on {{ $labels.instance }} for more than 3 minutes. Node seems down.

You have a top-level groups key that contains a list of groups. I usually create a group for each exporter, so I have Hardware alerts for node_exporter, Redis alerts for redis_exporter and so on.

Also, all of my alerts have two annotations, title and description, that will be used by Alertmanager.
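To show where these annotations end up, here is a sketch of a Slack receiver in alertmanager.yml that builds the message from them. The webhook URL and channel are placeholders, and this template is just one possible way to format the notification:

receivers:
  - name: slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'   # placeholder webhook URL
        channel: '#alerts'
        title: '{{ .CommonAnnotations.title }}'
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"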

Hardware alerts with node_exporter

Let’s start with a simple one – alert when the server is down.

- alert: Node down
  expr: up{job="node_exporter"} == 0
  for: 3m
  labels:
    severity: warning
  annotations:
    title: Node {{ $labels.instance }} is down
    description: Failed to scrape {{ $labels.job }} on {{ $labels.instance }} for more than 3 minutes. Node seems down.

The essence of this alert is the expression up{job="node_exporter"} == 0. I’ve seen a lot of examples that just use up == 0, but that’s questionable because every target scraped by Prometheus has this metric, so you’ll be alerted on completely unwanted things like a restart of postgres_exporter, which is not the same as Postgres itself being down. So I set the job label to node_exporter to explicitly check node health.

Another key part of this alert is for: 3m, which tells Prometheus to fire the alert only when the expression holds true for 3 minutes. This is intended to avoid false positives when a few scrapes fail because of network hiccups. It basically adds robustness to your alerts.

Some people use blackbox_exporter with an ICMP probe for this.
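If you prefer that route, a minimal sketch could look like the following: an icmp module in blackbox.yml plus a scrape job that probes the nodes through the exporter. The hostnames and the blackbox_exporter address are assumptions.

# blackbox.yml: define an ICMP probe module
modules:
  icmp:
    prober: icmp

# prometheus.yml: a scrape job that probes hosts through blackbox_exporter
scrape_configs:
  - job_name: blackbox-icmp
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets: ['node-1', 'node-2']          # hosts to ping
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115    # address of blackbox_exporter itself

The alert would then check probe_success == 0 instead of the up metric.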

Next is the Linux MD RAID alert:

- alert: MDRAID degraded
  expr: (node_md_disks - node_md_disks_active) != 0
  for: 1m
  labels:
    severity: warning
  annotations:
    title: MDRAID on node {{ $labels.instance }} is in degrade mode
    description: "Degraded RAID array {{ $labels.device }} on {{ $labels.instance }}: {{ $value }} disks failed"

In this one I check the difference between the total disk count and the active disk count, and use that difference via {{ $value }} in the description.

You can also access metric labels via the $labels variable to put useful info into your alerts.

The next one is for bonding status:

- alert: Bond degraded
  expr: (node_bonding_active - node_bonding_slaves) != 0
  for: 1m
  labels:
    severity: warning
  annotations:
    title: Bond is degraded on {{ $labels.instance }}
    description: Bond {{ $labels.master }} is degraded on {{ $labels.instance }}

This one is similar to the MD RAID one.

And the final one for hardware alerts is free space:

- alert: Low free space
  expr: (node_filesystem_free{mountpoint !~ "/mnt.*"} / node_filesystem_size{mountpoint !~ "/mnt.*"} * 100) < 15
  for: 1m
  labels:
    severity: warning
  annotations:
    title: Low free space on {{ $labels.instance }}
    description: On {{ $labels.instance }} device {{ $labels.device }} mounted on {{ $labels.mountpoint }} has low free space of {{ $value }}%

I calculate free space as a percentage and check whether it is less than 15%. In the expression above I also exclude all mountpoints under /mnt because those are usually external to the node, like remote storage for backups, which may legitimately be close to full.
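One caveat: node_exporter 0.16 renamed the filesystem metrics to include units, so if you run a newer version the same check uses the _bytes metric names:

(node_filesystem_free_bytes{mountpoint !~ "/mnt.*"} / node_filesystem_size_bytes{mountpoint !~ "/mnt.*"} * 100) < 15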

The final note here is the labels section where I set severity: warning. Inspired by the Google SRE book, I have decided to use only two severity levels for alerting: warning and page. warning alerts should go to the ticketing system, and you should react to them during normal working days. page alerts are emergencies that can wake up the on-call engineer, so they should be crafted carefully to avoid burnout. Alert routing based on severity is handled by Alertmanager.
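A sketch of how that routing might look in alertmanager.yml; the receiver names are placeholders and their definitions are omitted:

route:
  receiver: tickets            # default route: warning alerts go to the ticketing system
  routes:
    - match:
        severity: page
      receiver: on-call        # page alerts go to the on-call engineer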

Redis alerts

These are pretty simple: a warning alert on Redis instance availability and a page alert when the whole cluster is broken:

- alert: Redis instance is down
  expr: redis_up == 0
  for: 1m
  labels:
    severity: warning
  annotations:
    title: Redis instance is down
    description: Redis is down at {{ $labels.instance }} for 1 minute.

- alert: Redis cluster is down
  expr: min(redis_cluster_state) == 0
  labels:
    severity: page
  annotations:
    title: Redis cluster is down
    description: Redis cluster is down.

These metrics are reported by redis_exporter. I deploy it on all instances of the Redis cluster, which is why the min function is applied to redis_cluster_state.

I have a single Redis cluster, but if you have multiple you should include the cluster name in the alert description, possibly via labels; see the sketch below.
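For example, assuming you attach a cluster label to each redis_exporter target in the scrape config (the label name here is my assumption), the cluster alert could aggregate per cluster and carry the name in its annotations:

- alert: Redis cluster is down
  expr: min by (cluster) (redis_cluster_state) == 0
  labels:
    severity: page
  annotations:
    title: Redis cluster {{ $labels.cluster }} is down
    description: Redis cluster {{ $labels.cluster }} is down.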

Kafka alerts

For Kafka we check for availability of brokers and health of the cluster.

- alert: KafkaDown
  expr: up{instance=~"kafka-.+", job="jmx-exporter"} == 0
  for: 3m
  labels:
    severity: warning
  annotations:
    title: Kafka broker is down
    description: Kafka broker is down on {{ $labels.instance }}. Could not scrape jmx-exporter for 3m.

To check whether Kafka is down we check the up metric from jmx-exporter. This is a sane way of checking whether the Kafka process is alive because jmx-exporter runs as a Java agent inside the Kafka process. We also filter by instance name because jmx-exporter is run for both Kafka and Zookeeper.

- alert: KafkaNoController
  expr: sum(kafka_controller_kafkacontroller_activecontrollercount) < 1
  for: 3m
  labels:
    severity: warning
  annotations:
    title: Kafka cluster has no controller
    description: Kafka controller count < 1, cluster is probably broken.

This one checks for an active controller. The controller is responsible for managing the states of partitions and replicas and for performing administrative tasks like reassigning partitions. Every broker reports the kafka_controller_kafkacontroller_activecontrollercount metric, but only the current controller reports 1, which is why we use sum.

If you use Kafka as an event bus or for any other real-time processing, you may choose severity page for this one. In my case, I use it as a queue, and if it’s broken client requests are not affected. That’s why I have severity warning here.

- alert: KafkaOfflinePartitions
  expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
  for: 3m
  labels:
    severity: warning
  annotations:
    title: Kafka cluster has offline partitions
    description: "{{ $value }} partitions in Kafka went offline (have no leader), cluster is probably broken.

In this one we check for offline partitions. These partitions have no leader and thus can’t accept or deliver messages. We check for offline partitions across all nodes, which is why we have sum in the alert expression.

Again, if you use Kafka for some real-time processing you may choose to assign the page severity to these alerts.

- alert: KafkaUnderreplicatedPartitions
  expr: sum(kafka_cluster_partition_underreplicated) > 10
  for: 3m
  labels:
    severity: warning
  annotations:
    title: Kafka cluster has underreplicated partitions
    description: "{{ $value }} partitions in Kafka are under replicated

Finally, we check for under-replicated partitions. This can happen when a Kafka node fails and the partition has nowhere to replicate to. It does not prevent Kafka from serving this partition; producers and consumers will continue to work, but the data in this partition is at risk.

Zookeeper alerts

Zookeeper alerts are similar to the Kafka ones: we check for instance availability and cluster health.

- alert: Zookeeper is down
  expr: up{instance=~"zookeeper-.+", job="jmx-exporter"} == 0
  for: 3m
  labels:
    severity: warning
  annotations:
    title: Zookeeper instance is down
    description: Zookeeper is down on {{ $labels.instance }}. Could not scrape jmx-exporter for 3 minutes.

Just like with Kafka, we check Zookeeper instance availability via the up metric of jmx-exporter because it runs inside the Zookeeper process.

- alert: Zookeeper is slow
  expr: max_over_time(zookeeper_MaxRequestLatency[1m]) > 10000
  for: 3m
  labels:
    severity: warning
  annotations:
    title: Zookeeper high latency
    description: Zookeeper latency is {{ $value }}ms (aggregated over 1m) on {{ $labels.instance }}.

You should really care about Zookeeper latency because if it’s slow, dependent systems will fail miserably: leader election will fail, replication will fail and all sorts of other bad things will happen.

Zookeeper latency is reported via the zookeeper_MaxRequestLatency metric, but it’s a gauge, so you can’t apply the increase or rate functions to it. That’s why we use max_over_time over a 1-minute window.

The alert checks whether the max latency is more than 10 seconds (10000 ms). This may seem extreme, but we saw it in production.

- alert: Zookeeper ensemble is broken
  expr: sum(up{job="jmx-exporter", instance=~"zookeeper-.+"}) < 2
  for: 1m
  labels:
    severity: page
  annotations:
    title: Zookeeper ensemble is broken
    description: Zookeeper ensemble is broken, it has {{ $value }} nodes in it.

Finally, there is an alert for the Zookeeper ensemble status where we sum the up metric values for jmx-exporter. Remember that it runs inside the Zookeeper JVM, so essentially we count how many Zookeeper instances are up and compare that to the quorum size of our cluster (2 in the case of a 3-node cluster).
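If you don’t want to hardcode the threshold, here is a sketch that derives the quorum size from the number of scraped Zookeeper targets instead; it fires when fewer than a majority of instances are up:

- alert: Zookeeper ensemble is broken
  expr: sum(up{job="jmx-exporter", instance=~"zookeeper-.+"}) < floor(count(up{job="jmx-exporter", instance=~"zookeeper-.+"}) / 2) + 1
  for: 1m
  labels:
    severity: page
  annotations:
    title: Zookeeper ensemble is broken
    description: Zookeeper ensemble has lost quorum, only {{ $value }} nodes are up.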

Consul alerts

Similar to Zookeeper and any other clustered system, we check for Consul availability and cluster health.

There are two sources of metrics for Consul: the official consul_exporter and Consul itself via its telemetry configuration.

consul_exporter provides most of the metrics for monitoring the health of nodes and services registered in Consul, while Consul itself exposes internal metrics like the client RPC rate and other runtime metrics.
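To collect those internal metrics you can scrape the Consul agent’s metrics endpoint directly, provided prometheus_retention_time is set in Consul’s telemetry configuration. A sketch of such a scrape job (the hostname is a placeholder):

scrape_configs:
  - job_name: consul
    metrics_path: /v1/agent/metrics
    params:
      format: [prometheus]
    static_configs:
      - targets: ['consul-1:8500']   # Consul HTTP API port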

To check whether a Consul agent is healthy we use the consul_health_node_status metric with the label status="critical":

- alert: Consul agent is not healthy
  expr: consul_health_node_status{instance=~"consul-.+", status="critical"} == 1
  for: 1m
  labels:
    severity: warning
  annotations:
    title: Consul agent is down
    description: Consul agent is not healthy on {{ $labels.node }}.

Next, we check for cluster degradation via consul_raft_peers. This metric reports how many server nodes are in the cluster. The trick is to apply the min function to it so we can detect network partitions where one instance thinks it has 2 raft peers and another thinks it has 1.

- alert: Consul cluster is degraded
  expr: min(consul_raft_peers) < 3
  for: 1m
  labels:
    severity: page
  annotations:
    title: Consul cluster is degraded
    description: Consul cluster has {{ $value }} servers alive. This may lead to a broken cluster.

Finally, we check the autopilot status. Autopilot is a Consul feature where the leader constantly checks the stability of the other servers. This is an internal metric and it’s reported by Consul itself via its telemetry.

- alert: Consul cluster is not healthy
  expr: consul_autopilot_healthy == 0
  for: 1m
  labels:
    severity: page
  annotations:
    title: Consul cluster is not healthy
    description: Consul autopilot thinks that cluster is not healthy.

Conclusion

I hope you’ll find this useful and that these sample alerts will help you jump-start your Prometheus journey.

There are a lot of useful metrics you can use for alerts, and there is no magic here: research what metrics you have, think about how they may help you track the stability of your system, rinse and repeat.

That’s it, till the next time!