Redis experience

January 18, 2020

Intro

Redis is an indispensable tool for many software engineering problems because it provides great primitives and it’s fast and solid. Most of the time it’s used as some sort of cache, but if you stretch it to other use cases its behavior may surprise you.

Recently we tried to use it as persistent storage for a large dataset. We ran into a lot of problems, fixed many of them, and gained a lot of experience that I want to share. So here is my experience report.

Disclaimer – all of these problems arose from our use case and not because Redis is somehow flawed. Like any piece of software, it requires understanding and research before being deployed in any serious production environment.

Our use case

We have a data collecting pipeline with the following requirements:

  • We need aggregated counters to calculate various metrics during data collection
  • There are more than 800 million keys, where 97% of the keys hold a couple of integers
  • We need it to be available because our data pipeline is always running
  • We need to clean up outdated entries because our dataset changes every day
  • We want it to be persistent because loading that amount of data takes a lot of time and we really don’t want to stop our data pipeline

Cluster

Given our requirements, we went with Redis cluster from the start. We chose it over a single master/replica setup because we couldn’t fit our 800M+ keys on a single instance and because Redis cluster provides high availability kinda out of the box (you still need to create the cluster with redis-trib.rb or redis-cli --cluster create). Also, such beefy nodes are very hard to manage – loading the dataset would take about an hour and snapshotting would take a long time. Generally, I prefer many small nodes with a small dataset on each instead of a few huge nodes.
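
For reference, creating a six-node cluster (three masters with one replica each) looks roughly like this with modern redis-cli – the addresses here are illustrative, not ours:

```
redis-cli --cluster create \
    10.0.0.1:6379 10.0.0.2:6379 10.0.0.3:6379 \
    10.0.0.4:6379 10.0.0.5:6379 10.0.0.6:6379 \
    --cluster-replicas 1
```

On older Redis versions the equivalent is redis-trib.rb create --replicas 1 with the same list of addresses.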

So I set up a Redis cluster, and this time I did it without cross replication, because I used Google Cloud instances and because cross replication is very tedious to configure and painful to maintain.

Now, it’s time to load the data.

Loading data

The naive way of loading data – sending millions of SET commands one by one – is very inefficient because you’ll spend most of the time waiting on command round-trip time (RTT). Instead, you should use pipelining, or even generate a file in the Redis protocol for mass insert.

I have experience with pipelining and would recommend it because it lets you control the process, and it’s much more convenient than generating text files anyway.
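
A minimal sketch of pipelined loading with the redis-py client – the batch size, key names and driver snippet are illustrative:

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def load(client, pairs, batch_size=1000):
    """SET key/value pairs in batches, paying one network round trip
    per batch instead of one per command."""
    for batch in chunked(pairs, batch_size):
        pipe = client.pipeline(transaction=False)
        for key, value in batch:
            pipe.set(key, value)
        pipe.execute()

# With a server on localhost this would be driven like:
#   import redis  # pip install redis
#   load(redis.Redis(), ((f"key:{i}", i) for i in range(1_000_000)))
```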

With pipelining I saw more than 300K RPS on inserts (SET/HSET/SADD), so it’s very performant. But there is one crucial caveat in Redis cluster mode – all keys in a pipeline must hit the same node. That’s understandable because all commands in a pipeline are treated as one: to generate the response, the serving node shouldn’t have to gather data from other (potentially failing) nodes, but should be able to do everything in a single process context.

Nevertheless, it’s possible to use pipelining with Redis cluster – you just have to use hash tags. A hash tag is a substring in curly braces that Redis uses to calculate the hash slot and, consequently, to determine the cluster node. It looks like this:

SET {shard}:key value

{shard} is a hash tag.
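
To see why hash tags work, here is the cluster key-to-slot mapping reimplemented in Python: CRC16 (the XMODEM variant) of the key – or of the hash tag, if one is present – modulo 16384 slots:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XMODEM), the variant Redis cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hash_slot(key: str) -> int:
    """Return the cluster hash slot for a key, honoring {hash tags}."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # only a non-empty tag counts
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

# All keys sharing a tag land in the same slot:
#   hash_slot("{shard}:a") == hash_slot("{shard}:b")
```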

All keys in a pipeline must have the same hash tag for it to succeed. But the problem here is that all keys with the same hash tag will land on the same node, in the same hash slot. This leads to uneven data distribution and imbalanced memory consumption across Redis cluster nodes. In our use case the data partitions varied a lot in size, and after loading the data we had a 3x discrepancy in memory consumption between some nodes. This is a problem because the cluster nodes are utilized unevenly and sizing the cluster becomes guesswork.

It’s possible to rebalance your cluster by moving hash slots between nodes – it’s described in the cluster tutorial, and I’ve tried the process described in the CLUSTER SETSLOT doc. But I would recommend against it: it’s a manual, error-prone process, you will forget about it the next time you need to set up the cluster, and essentially it’s a dirty fix.

Going forward

So: we use Redis cluster, load the data with pipelining, and rely on hash tags to make pipelining work.

Memory consumption

Let’s talk about memory consumption, because Redis is an in-memory database, meaning that your dataset is bound by the amount of memory on the Redis server node. But you can’t just count the size of your data for capacity planning – you have to remember that storing any Redis key is not free. The main hash table (used for SET) and all Redis datatypes like sets and lists have overhead.

We can see that overhead with the MEMORY USAGE command.

127.0.0.1:6379> mget 0 1000 100000
1) "76876987"
2) "76184956"
3) "74602210"
127.0.0.1:6379> MEMORY USAGE 0
(integer) 43
127.0.0.1:6379> MEMORY USAGE 1000
(integer) 46
127.0.0.1:6379> MEMORY USAGE 100000
(integer) 48

127.0.0.1:6379> DEBUG OBJECT 0
Value at:0x7f21c8ab95e0 refcount:1 encoding:int serializedlength:5 lru:16680050 lru_seconds_idle:103

The serialized length of the value is 5 bytes while real memory usage is 43 bytes, so a simple key storing nothing but a single integer has an overhead of almost 40 bytes.

This overhead is needed not only to make the hash table work but also for various features that Redis provides, like efficient memory encoding and LRU key eviction.

Expires

If you want to store keys with expiration (i.e. TTL) prepare for a 50% increase in memory consumption.

Let’s conduct a simple experiment – load 1 million keys without TTL and then compare memory usage with 1 million keys with TTL.
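
The loader script itself isn’t shown in this post, but here is a sketch of what loader.py might look like – the key count, TTL and batch size are illustrative:

```python
import random


def records(n, ttl=None):
    """Yield (key, value, ttl) tuples: n keys, one random integer each."""
    for i in range(n):
        yield str(i), random.randint(0, 99_999_999), ttl


def load_keys(client, n, ttl=None, batch_size=1000):
    """Pipeline the SETs; `ex=ttl` attaches the expire when requested."""
    pipe = client.pipeline(transaction=False)
    for key, value, t in records(n, ttl):
        pipe.set(key, value, ex=t)
        if len(pipe) >= batch_size:
            pipe.execute()
    pipe.execute()

# Against a local server:
#   import redis  # pip install redis
#   load_keys(redis.Redis(), 1_000_000)            # no TTL
#   load_keys(redis.Redis(), 1_000_000, ttl=3600)  # the --expire variant
```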

Here is the initial state with an empty Redis:

$ redis-cli
127.0.0.1:6379> dbsize
(integer) 0
127.0.0.1:6379> INFO memory
# Memory
used_memory:853328
used_memory_human:833.33K
used_memory_rss:5955584
used_memory_rss_human:5.68M
used_memory_peak:853328
used_memory_peak_human:833.33K
used_memory_peak_perc:100.01%
used_memory_overhead:841102
used_memory_startup:791408
used_memory_dataset:12226
used_memory_dataset_perc:19.74%
...

Load 1 million keys each containing a single random integer:

$ python3 loader.py 
$ redis-cli
127.0.0.1:6379> dbsize
(integer) 1000000
127.0.0.1:6379> info memory
# Memory
used_memory:57240464
used_memory_human:54.59M
used_memory_rss:62619648
used_memory_rss_human:59.72M
used_memory_peak:57240464
used_memory_peak_human:54.59M
used_memory_peak_perc:100.00%
used_memory_overhead:49229710
used_memory_startup:791408
used_memory_dataset:8010754
used_memory_dataset_perc:14.19%
...

Memory usage is 59.72M.

Now let’s load 1 million keys with expire:

$ python3 loader.py --expire
$ redis-cli
127.0.0.1:6379> dbsize
(integer) 1000000
127.0.0.1:6379> info memory
# Memory
used_memory:89628800
used_memory_human:85.48M
used_memory_rss:95326208
used_memory_rss_human:90.91M
used_memory_peak:89628800
used_memory_peak_human:85.48M
used_memory_peak_perc:100.00%
used_memory_overhead:81618318
used_memory_startup:791408
used_memory_dataset:8010482
used_memory_dataset_perc:9.02%
...

Memory consumption grew 52% to 90.91M.

Redis expires add a lot of overhead because, as far as I can tell, they are stored as separate entries in an internal hash table (db->expires).

/* Set an expire to the specified key. If the expire is set in the context
 * of an user calling a command 'c' is the client, otherwise 'c' is set
 * to NULL. The 'when' parameter is the absolute unix time in milliseconds
 * after which the key will no longer be considered valid. */
void setExpire(client *c, redisDb *db, robj *key, long long when) {
    dictEntry *kde, *de;

    /* Reuse the sds from the main dict in the expire dict */
    kde = dictFind(db->dict,key->ptr);
    serverAssertWithInfo(NULL,key,kde != NULL);
    de = dictAddOrFind(db->expires,dictGetKey(kde));
    dictSetSignedIntegerVal(de,when);

    int writable_slave = server.masterhost && server.repl_slave_ro == 0;
    if (c && writable_slave && !(c->flags & CLIENT_MASTER))
        rememberSlaveKeyWithExpire(db,key);
}

By the way, this is the entire function. Redis code is very readable once you get used to the camel case in C.

Our memory consumption

Once we started to load the data into our Redis cluster, memory consumption was too damn high! With our imbalanced cluster we had to use n1-highmem-16 nodes to fit our largest shard, and those are quite expensive.

So we needed to reduce our memory consumption. And the only way to do this without (almost) any modification to the data is to use Redis hashes.

Hash

One of the nicest tricks to reduce memory consumption is to store values in small Redis hashes instead of the main hash table. This will work because of ziplist optimization in Redis.

In short, with this optimization Redis stores small hashes as flat arrays (ziplists) of configurable size. You avoid the hash table overhead but give up constant-time lookup – which is fine in practice, because the arrays are kept small so the scan cost is negligible.

Folks at Instagram used it and we also tried it and shaved off a considerable amount of memory.

But remember that you can’t just shove your values into one hash and call it done. To trigger the ziplist optimization you need to bucket your keys into many small hashes that fit the configured ziplist size. Also, with hashes you lose some features, most importantly expires – you can’t set an expire on a hash field, only on a key in the main table.
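
A sketch of the bucketing scheme for numeric keys – the 1000-fields-per-bucket size is illustrative and must stay at or below hash-max-ziplist-entries in redis.conf:

```python
def bucket(key: str, size: int = 1000):
    """Map a numeric key to a (hash_name, field) pair so that each
    hash holds at most `size` fields and stays ziplist-encoded."""
    n = int(key)
    return str(n // size), str(n % size)


def hset_bucketed(client, key, value):
    """Store into a small hash instead of the main table: SET 1234567
    becomes HSET 1234 567."""
    name, field = bucket(key)
    client.hset(name, field, value)


def hget_bucketed(client, key):
    name, field = bucket(key)
    return client.hget(name, field)
```

The design trade-off is that the bucket size and the ziplist limit have to be kept in sync by hand: if a hash outgrows the limit, Redis silently converts it to a real hash table and the memory savings evaporate.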

Going forward

So we started to store our dataset in Redis hashes to reduce memory consumption and use smaller instance types for our imbalanced cluster.

Persistence

Finally, we wanted to use persistence because our dataset was important – losing it would mean data pipeline downtime, and while we can regenerate all of the data, loading it takes a lot of time.

The key lesson here is that if you want to use persistence in Redis with a lot of data – you have a problem.

It all boils down, again, to memory consumption, which grows quickly during snapshotting. But first, let’s quickly recall how persistence works.

There are two persistence options in Redis – RDB snapshots and the AOF log. With RDB snapshots Redis periodically snapshots the in-memory data by forking the main process and writing the data from a child process. This works because of the Copy-on-Write feature in modern operating systems: parent and child processes share memory without doubling the data as long as that memory isn’t modified. When memory is written in the parent process, the operating system makes a copy of the affected pages for the child so it keeps seeing the old version – that’s why it’s called Copy-on-Write.

Thanks to CoW, RDB snapshotting should be free in terms of memory consumption, but it’s more subtle than that. If new data is being written during snapshotting, memory consumption grows by the size of that new data, because Copy-on-Write triggers the creation of new memory pages. The longer the snapshot takes, the more likely this will hit you; and the more data you write during the process, the more your memory consumption will grow.

With the default configuration a snapshot is taken after 10000 changes, which in our case means constantly during data upload. We were uploading data in huge batches, so our memory consumption almost doubled and eventually Redis was OOM killed.
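
For reference, these are the default RDB triggers in redis.conf – a snapshot fires when at least N changes happened within T seconds:

```
save 900 1
save 300 10
save 60 10000

# Disable automatic snapshots entirely:
save ""
```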

So we tried to use AOF instead of RDB. But when the AOF log is rewritten it uses the same Copy-on-Write trick as RDB snapshots, so we got OOM killed again.

There are a few possible fixes for this. First, you can simply disable persistence if it fits your case. For example, if you can lose or quickly recover your data.

You can also have 2x memory to accommodate extra writes during snapshotting.

And you can also control snapshotting by issuing manual BGSAVE or BGREWRITEAOF commands. But this won’t help you when a replica is syncing from the master. This is the most surprising thing I saw with Redis: when a replica crashes and restarts, it needs to sync with the master, and that sync is performed by triggering an RDB snapshot on the master and sending it over the network. So even if persistence is completely disabled, Redis may trigger RDB snapshotting for a replica sync, with all the consequences like increased memory consumption and the risk of being OOM killed. And as far as I know, you cannot disable it.

In our case we settled on a manual BGSAVE via cron, once a day, at a time when data is most likely not being uploaded.
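
Roughly like this – the hour is illustrative, pick whatever time your pipeline is idle:

```
# crontab on each master node: snapshot once a day at 04:00
0 4 * * * redis-cli BGSAVE
```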

Conclusion

At the end of this journey we have a Redis cluster for our simple aggregated data. We load the data via pipelined commands, so we use hash tags. To reduce memory consumption we use Redis hashes. And for persistence we have a cron job that triggers BGSAVE during idle time.

This is my third post on Redis – I’ve also written on high availability options and cross-replicated cluster.

Working through this use case taught me a lot about Redis – how it works, where it’s good and where it’s not – and I gained a much better understanding of it, which is the most important thing for a software engineer.

As always if you have any comments or suggestions feel free to send me an email. That’s it for now, subscribe via RSS/Atom feed to stay tuned for the next post. Till the next time!