Redis high availability

March 28, 2018

Recently, at the place where I work, we started using Redis to store session-like objects. Even though these objects are small and short-lived, our service would stop working without them, so the question of Redis high availability came up. It turns out there is no ready-made solution for Redis: there are multiple options with different tradeoffs, and the information is a bit scarce and scattered across documentation and blog posts. Hence I’m writing this in the shy hope of helping another poor soul like myself solve this problem. I’m by no means a Redis guru, but I wanted to share my experience anyway because, after all, it’s my personal blog.

I’m going to describe high availability in terms of node failure and not persistence.

Redis high availability options

Standalone Redis, the good old redis-server you launch after installation, is easy to set up and use, but it’s not resilient to the failure of the node it’s running on. It doesn’t matter whether you use RDB or AOF: as long as the node is unavailable, you are in trouble.

Over the years, the Redis community has come up with a few high availability options. Most of them are built into Redis itself, though there are some third-party tools as well. Let’s dive in.

Simple Redis replication

Redis has had replication support since, like, forever, and it works great: just put slaveof <addr> <port> in your config file and the instance will start receiving a stream of data from the master.

You can configure multiple slaves for a master, you can chain a slave off another slave, you can enable persistence on the slaves only, and you can make a client wait until a write is acknowledged by slaves with the WAIT command (replication itself is async by default). The list of what you can do with Redis seems bounded only by your imagination. Just read the docs on replication, they’re really great.
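To make the "async by default" part concrete, here is a toy model in plain Python. It is not how Redis is implemented: ToyMaster, ToyReplica and replicate are names I made up, and the "network" is just a queue. The only point is that the master answers the client before the replica has seen the write.

```python
from collections import deque


class ToyMaster:
    """Toy stand-in for a master: applies writes locally, then queues them."""

    def __init__(self):
        self.data = {}
        self.backlog = deque()  # writes not yet shipped to the replica

    def set(self, key, value):
        self.data[key] = value             # the client gets its reply here...
        self.backlog.append((key, value))  # ...replication happens later


class ToyReplica:
    """Toy replica: only sees a write once it has been streamed over."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)


def replicate(master, replica):
    """Drain the master's backlog into the replica (the stream catching up)."""
    while master.backlog:
        key, value = master.backlog.popleft()
        replica.data[key] = value


master, replica = ToyMaster(), ToyReplica()
master.set("session:42", "alice")
stale = replica.get("session:42")  # the replica has not caught up yet
replicate(master, replica)
fresh = replica.get("session:42")  # now it has
print(stale, fresh)
```

The first read comes back empty even though the write already succeeded on the master, which is exactly the stale read you can hit when you send reads to slaves.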

Pros:

Quick and simple to set up
Can be automated via configuration management tools
Continues to work as long as the single master instance is available: it can survive the failure of all of the slave instances.

Cons:

Writes must go to the master
Slaves may serve reads, but because replication is asynchronous you may get stale reads
It doesn’t shard data, so master and slaves will have unbalanced utilization
In case of master failure, you have to promote a new master manually

The last one is, IMHO, a major downside, and that’s where Redis Sentinel helps.

Redis replication with Sentinel

Nobody wants to wake up in the middle of the night just to issue SLAVEOF NO ONE and promote a new master. It’s pretty silly and should be automated, right? Right. That’s why Redis Sentinel exists.

Redis Sentinel is a tool that monitors Redis masters and slaves and automatically promotes one of the slaves to master when the master fails. That’s a really critical task, so you’re better off making Sentinel itself highly available. Luckily, it has built-in clustering, which makes it a distributed system.

Sentinel is a quorum system, meaning that a majority of Sentinel nodes must be alive to agree on the new master. This has a huge implication for how you deploy Sentinel. There are basically 2 options: colocate it with the Redis servers or deploy it on a separate cluster. Colocating makes sense because Sentinel is a very lightweight process, so why pay for additional nodes? But in this case we lose some resilience: if you colocate Redis server and Sentinel on, say, 3 nodes, you can only lose 1 node, because Sentinel needs 2 nodes to elect the new Redis master. Without Sentinel, we could lose both slave nodes and keep working. So maybe you should think about a dedicated Sentinel cluster. If you’re in the cloud you could deploy it on some sort of nano instances, but maybe that’s not your case. Tradeoffs, tradeoffs, I know.
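The majority arithmetic is simple enough to put in a few lines of Python. This is just the math from the paragraph above, not anything taken from Sentinel itself, and the function names are mine.

```python
def majority(total: int) -> int:
    """Smallest number of Sentinels that forms a majority of `total`."""
    return total // 2 + 1


def tolerable_failures(total: int) -> int:
    """How many Sentinels can be down while a majority is still alive."""
    return total - majority(total)


for n in (2, 3, 5):
    print(f"{n} sentinels: majority={majority(n)}, can lose {tolerable_failures(n)}")
```

With 3 Sentinels you can lose 1, with 5 you can lose 2, and with 2 you cannot lose any, which is why an odd number of at least 3 is the usual starting point.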

Besides having to maintain one more distributed system, with Sentinel you have to change the way your clients work with Redis, because now your master node can move. Your application should first go to Sentinel, ask it for the current master, and only then talk to that master. You can build a clever hack with HAProxy here: instead of going to Sentinel, you can put HAProxy in front of the Redis servers and let it detect the new master with TCP checks. See the example on the HAProxy blog.

Nevertheless, Sentinel colocated with Redis servers is a really common solution for Redis high availability. For example, GitLab recommends it in its admin guide.
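Here is roughly what "go to Sentinel first" means on the client side, sketched with nothing but the standard library. SENTINEL get-master-addr-by-name is a real Sentinel command; everything else (the function names, the host names, the master name "mymaster", the single recv call) is my simplification, and in real life you would use your client library's Sentinel support instead.

```python
import socket


def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def parse_master_addr(reply: bytes):
    """Parse a two-element array reply into (host, port); None if master unknown."""
    lines = reply.split(b"\r\n")
    if lines[0] != b"*2" or len(lines) < 5:
        return None  # null reply, error, or truncated read
    # lines: *2, $<len>, host, $<len>, port, ''
    return lines[2].decode(), int(lines[4])


def ask_sentinel(host, port, master_name, timeout=0.5):
    """Ask one Sentinel for the current master (simplified: one recv call)."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(
            encode_command("SENTINEL", "get-master-addr-by-name", master_name)
        )
        return parse_master_addr(conn.recv(4096))


def discover_master(sentinels, master_name, ask=ask_sentinel):
    """Try each Sentinel in turn and return the first master address we get."""
    for host, port in sentinels:
        try:
            addr = ask(host, port, master_name)
        except OSError:
            continue  # this Sentinel is unreachable, try the next one
        if addr is not None:
            return addr
    raise RuntimeError(f"no Sentinel could name a master for {master_name!r}")


# Demo with a fake `ask`, so it runs without any Sentinel around.
def fake_ask(host, port, master_name):
    if host == "sentinel-1":  # hypothetical host that is down
        raise ConnectionRefusedError("sentinel down")
    return ("10.0.0.7", 6379)  # hypothetical master address


found = discover_master(
    [("sentinel-1", 26379), ("sentinel-2", 26379)], "mymaster", ask=fake_ask
)
print(found)
```

The client also has to repeat this lookup whenever its connection to the master breaks, because that is exactly the moment the master may have moved.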

Pros:

Automatically selects a new master when the old one fails. Yay!
Easy to set up, (seems) easy to maintain.

Cons:

Yet another distributed system to maintain
May require a dedicated cluster if not colocated with Redis server
Still doesn’t shard data, so master will be overutilized in comparison to slaves

Redis cluster

All of the solutions above seem IMHO half-assed because they add more moving parts, and these parts are not obvious, at least at first sight. I don’t know of any other system that solves the availability problem by adding yet another cluster that must be available itself. It’s just annoying.

So with recent versions of Redis came the Cluster, a builtin feature that adds sharding, replication and high availability to the Redis we know and love. Within a cluster, you have multiple master instances that each serve a subset of the keyspace. Clients may send requests to any of the master instances, and if that instance doesn’t own the key it replies with a redirect to the instance that does. Each master may have as many replicas as you want, and one of them will be promoted automatically when its master fails. There is no separate Sentinel-like cluster to run: the masters themselves do the voting. Note, though, that a majority of the masters must be reachable both for the cluster as a whole to keep working and for a replica to win the election.
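How does a key end up on a particular master? As far as I understand the cluster spec, the keyspace is split into 16384 hash slots, and a key's slot is CRC16 of the key modulo 16384, with the twist that if the key contains a non-empty {tag}, only the tag is hashed. Below is my own Python sketch of that rule; the function names are mine, and in practice you would just ask Redis with CLUSTER KEYSLOT.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM variant): polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Map a key to one of the 16384 hash slots, honouring {hash tags}."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end != -1 and end != start + 1:  # only a non-empty tag counts
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


# Keys sharing a hash tag land in the same slot, hence on the same master.
same = key_slot("{user1000}.following") == key_slot("{user1000}.followers")
print(same)
```

Hash tags are the way to keep related keys together on one shard, which matters because multi-key operations only work when all the keys live in the same slot.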

Each instance in a Redis cluster (master or slave) should ideally be deployed on a dedicated node, but you can configure cross replication, where each node hosts multiple instances. There are sharp corners here, though, that I’ll illustrate in the next post, so stay tuned!
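To show what I mean by cross replication, here is a tiny planning sketch. It only computes a layout on paper: the node names, the shard names and the "shift replicas by one node" rule are my assumptions for illustration, and the actual cluster still has to be built with the Redis tooling.

```python
def cross_replication_layout(nodes):
    """Place one master and one replica per node, shifting replicas by one node."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to separate a master from its replica")
    layout = {node: [] for node in nodes}
    for i, node in enumerate(nodes):
        shard = f"shard-{i}"  # hypothetical shard name
        layout[node].append((shard, "master"))
        # The replica of this shard goes to the next node, wrapping around.
        neighbour = nodes[(i + 1) % len(nodes)]
        layout[neighbour].append((shard, "replica"))
    return layout


layout = cross_replication_layout(["node-a", "node-b", "node-c"])
for node, instances in layout.items():
    print(node, instances)
```

Every node ends up with a master for one shard and a replica for another, so losing a single node never takes out both copies of the same shard.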

Pros:

Shards data across multiple nodes
Has replication support
Has builtin failover of the master

Cons:

Not every library supports it
May not be as robust (yet) as standalone Redis or Sentinel
Tooling is wack: building and maintaining a cluster (e.g. replacing a node) is a manual process
Introduces an extra network hop when a request hits the wrong shard
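About that extra hop: when a request lands on the wrong master, the answer is an error of the form MOVED <slot> <host>:<port>, and a smart client remembers it so the next request for that slot goes straight to the right place. A minimal sketch of that bookkeeping, with class and function names of my own invention:

```python
def parse_moved(error: str):
    """Parse 'MOVED <slot> <host>:<port>' into (slot, host, port); None otherwise."""
    parts = error.split()
    if len(parts) != 3 or parts[0] != "MOVED":
        return None
    host, _, port = parts[2].rpartition(":")
    return int(parts[1]), host, int(port)


class SlotCache:
    """Client-side map of slot -> node, updated as redirects come in."""

    def __init__(self):
        self.slots = {}

    def node_for(self, slot, default):
        """Return the node we believe owns `slot`, or `default` if unknown."""
        return self.slots.get(slot, default)

    def handle_error(self, error):
        """Record a MOVED redirect; return True if the request should be retried."""
        moved = parse_moved(error)
        if moved is None:
            return False
        slot, host, port = moved
        self.slots[slot] = (host, port)
        return True


cache = SlotCache()
seed = ("127.0.0.1", 7000)  # hypothetical node we happened to connect to first
retry = cache.handle_error("MOVED 3999 127.0.0.1:6381")
print(retry, cache.node_for(3999, seed))
```

Libraries that support the cluster do this for you, which is one more reason to check your client before committing to Redis cluster.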

Twemproxy

Twemproxy is a special proxy for in-memory databases, namely memcached and Redis, that was built by Twitter. It adds sharding with consistent hashing, so resharding is not that painful, and it also maintains persistent connections and enables request/response pipelining.

I haven’t tried it because in the era of Redis cluster it doesn’t seem relevant to me anymore, so I can’t tell you its pros and cons, but YMMV.
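Why does consistent hashing make resharding less painful? Here is a toy hash ring in Python to show the idea. This is not Twemproxy's code: the HashRing class, the md5-based hash, the 100 virtual points per server and the redis-a..redis-d names are all my assumptions for the sake of the demo.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent hash ring with virtual points for each node."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value: str) -> int:
        # First 8 bytes of md5 as an integer: stable across runs and machines.
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def add(self, node):
        """Insert `vnodes` points for the node, keeping the ring sorted."""
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def node_for(self, key):
        """Walk clockwise from the key's hash to the next point on the ring."""
        idx = bisect.bisect(self.ring, (self._hash(key), ""))
        return self.ring[idx % len(self.ring)][1]


keys = [f"session:{i}" for i in range(1000)]
ring = HashRing(["redis-a", "redis-b", "redis-c"])
before = {k: ring.node_for(k) for k in keys}
ring.add("redis-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

The thing to notice is that every key that moves goes to the newly added server and the rest stay where they were, whereas a naive hash(key) % N scheme would reshuffle most of the keyspace.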

Redis Enterprise

After the initial post, quite a few people reached out to tell me that they have had great success with Redis Enterprise from Redis Labs. Check out this one from Reddit. The point is that if you have a really high workload, your data is more critical, and you can afford it, then you should consider their solution.

You may also check out their guide on Redis High Availability, which is also well written and illustrated.

Conclusion

Choosing the right solution for Redis high availability is full of tradeoffs. Nobody knows your situation better than you, so get to know how Redis works (there is no magic here), because in the end you’ll have to maintain the solution. In my case, we chose a Redis cluster with cross replication after lots of testing and after writing a doc with instructions on how to deal with failures.

That’s all for now, stay tuned for the dedicated Redis cluster post!