versione stampabile

Redis Presharding

Friday, 25 February 11

Redis cluster is currently under development, and we hope it will be able to solve the problem of partitioning data among different Redis instances in a transparent, fast, and fault tolerant way. The cluster project is currently not ready: even if we have a pretty clear design, and many networking level code for gossip and failure detection, it will take some more month to be released, and more time to be released into a stable release.

While our work continue, people shard regardless of Redis cluster, using their own algorithms and systems. This post describes a strategy for partitioning your data among N instances that is very simple but that works great. We'll call this strategy "Redis Presharding".

Redis is lightweight

Redis is a very small program, currently consisting of just 30k lines of code. The only dependency is the libc. We use our libraries for everything, with just the minimal code and functionality needed to get the work done. This provides us with a very neat advantage: our memory footprint for a spare instance is minimal. Every Redis instance running consumes little more than one megabyte of RAM. This means that having 32, 64, or 128 instances of Redis all running against the same Linux box does not pose any problem.

This is very important for our use case. Let's see why.

Simple partitioning algorithms are cool because they are, well, simple. You get the key, hash it, and get K bits of the hash (or if you prefer perform a modulo operation). So you can map every given key to N different Redis nodes.

Node = Hash(key) MOD N

The limit with this kind of partitioning is that once you need to add or remove nodes from the system to adjust the capacity, it is a real mess. Hashing is easy, rehashing is hard, and involves moving keys form instances to other instances while the system is running. If you tried to design a system like that, you know what I'm talking about. Redis cluster will be able to do things like that, and if you check the design you'll discover is more complex than Hash MOD N :)

But wait, maybe we can mount a poor man's cluster. Since the Redis instances are so light to run, what about if we start considering we'll need a lot of capacity? So we start, from day zero, 128 different Redis Instances, using just two not too powerful virtual machines on EC2 (this is just an example, you can do your math to understand how much you want to grow before of changing design).

Of course your 128 instances will use a small amount of memory each. It is important that you use this design with objects that are more or less all of the same size, and that your hash functions does not have trivial biases (in other words, use SHA1). If your application contains things like long lists or alike, you can use a specialized Redis instance for this data structures.

Handling many instances

Handling 128 instances is not like handling a single one, so it is a good idea to write scripts to start all the instances, to stop all the instances, to collect all the .rdb or AOF files to create a single tar.gz with everything that you can use as a backup. It's a good idea to also have a script that takes this tar.gz as input and restore all the dumps into the different Redis instances.

Also some monitoring will not hurt at all.

Basically this set of tools could be a nice open source project, if done well, simply, and the spirit of Redis, and possibly coded in Ruby ;) Ok let's avoid language flame wars...

The bottom line is: be prepared to handle hundred of instances.

But running many instances also have some neat advantage. For instance, do you want to rewrite the AOF file? Do this one instance after the other, and the memory hint will be small. The same applies to saving with the normal .rdb persistence.

Also you are doing a very neat thing: you are working at scale from the start, even if you are still small. This means to be prepared to run a much larger site from day zero. Not a bad idea if you ask me.

Moving instances

Now the interesting part is, I need to scale. My 128 instances are using all the memory and resources in my small virtual machines. What to do? It is pretty easy, just fire a third virtual machine and move one third of your instances in this new machine.

You can do this without any kind of down time, using a common trick:

Start the spare Redis instances in the new virtual machine.
Set this instances are replicas for the old instances you want to move.
When the initial synchronization is done and all the slaves are working, change the configuration of your clients to use the new instances.
Elect the new instances as masters with SLAVEOF NO ONE.
Finally shut down all the old instances.
Upgrade your shell scripts configs with the new IP/PORT info for every instance.

This solution will ensure that the down time is zero as the slaves are able to accept writes, so once you change the configuration of your clients all the clients will start writing against the new instances. In a second the old instances will not get any new query at all, so the slaves can be elected to masters, and the old masters can be killed.

Hash tags

With this schema, and with Redis Cluster itself, commands taking multiple keys as arguments are harder to use since if the two keys will hash to two different instances the operation can not be performed.

Either do not use multi key operations but instead try to model this complex ops at application level, or use a technique called Hash Tags. Basically if a key contains the {} characters, instead of hashing the whole string to obtain the instance ID, you just hash the string inside {}.

So while the key "foo" will be hashed as SHA1("foo"), the key "bar{zap}" will be hashed just as SHA1("zap").

This way you can force keys to be stored in the same instance, so that if you want to perform intersections just against user data you can do it using hash tags that are different for every user, but the same for all the keys related to the same user. Sometimes this is enough, sometimes instead this is still not enough and you have application level help to model operations like intersections between sets in different instances, or to rethink your application logic at all.

Fault tolerance

The described solution can be made fault tolerant using Redis replication. Basically in our example instead of firing two virtual machines, you can instead fire four. Every instance is replicated in another instance in a different virtual machine (make sure they are in a different data center).

If something goes bad with a virtual machine it is possible to point the clients to the other virtual machine, changing all the occurrences of the IP address in the configuration table with another one.

Conclusion

A solution like Redis Cluster handling all this for you is obviously a better long term solution, but what was descried in this post is a simple way to work with what we have currently.

If a set of scripts and small programs are developed to help in this kind of setups, including monitoring programs, tools to start and stop many instances, tools to perform backups and to restore all the .rdb or AOF files, it can be much simpler to setup a system like the one described here.

108657 views^*

Posted at 16:16:37 | permalink | 5 comments | print