versione stampabile

Redis weekly update #4 - Fixes and Optimizations

Tuesday, 13 April 10

Welcome back to the Redis weekly update, we skipped one week as now that we are in feature freeze the news are a bit slower, even if... actually in this two weeks we managed to accumulate an interesting Changelog. While the development of new features stopped the 1th April, we are still not into a very static stage, and the following kind of changes are the new focus:

API changes to provide a more coherent behavior. Redis 2.0 can break a few corner cases at least, being it a major release, and it's time to fix some old bug.
Optimizations: we want to go faster, consuming less memory, before 2.0 will be released, at least when this is possible adding little code or changing modular code that is isolated from the system (for instance dict.c, the hash table implementation, is one of this components that is being modified in this stage).
Fixes, obviously.

Pattern matching Pub/Sub

Just before the feature freeze started I got a chance to add pattern matching to Pub/Sub. So now it's possible to listen to a channel named, for instance, "news.italy." (that incidentally will match a lot of sad stories about our "bizzarre" prime minister as usually). But my idea was that the ability to match binary safe channels, and to specifically match every possible key name, was a very important feature. For instance "" is a valid key, and with the new protocol every binary string is a valid key. So if I want to subscribe to just "*" verbatim, I should be able to do so. For this reasons two new commands were added to Redis, in order to subscribe and unsubscribe to patterns:

PSUBSCRIBE pattern1 pattern2 ... patternN
PUNSUBSCRIBE pattern1 pattern2 ... patternN

This commands are yet not documented, but the semantic is exactly the same as SUBSCRIBE/UNSUBSCRIBE (non pattern matching variants).

Hashes, new things

The feature freeze was entered with the promise to the users to add a few planned hash commands before 2.0 release, so thanks to Pieter Noordhuis we have now HMSET that is able to atomically set multiple fields of an hash with little latency and HINCRBY that is just INCRBY but against hash table fields. Other two commands are planned: HMGET and HSETNX, and we'll call the hashes feature complete, apart for a detail, that SORT is still not supporting BY/GET options targeting fields of hash tables, but that's in the pre-2.0 TODO list as well.

Also I and Pieter in one of our IRC design sessions managed to design a defragmentation strategy for hash tables that is both simple and works well. He promptly implemented our design in zipmap.c that is now simpler than before (because after considering many options we decided to either leave a bit of trailing space for future values or go ahead and defragment asap if the "hole" left is too big).

Non blocking Tcl client

Finally I found some time to write a non blocking Tcl-client (using the Tcl built-in event driven programming framework), so it will be possible to test for Pub/Sub stuff and BLPOP in test-redis.tcl.

VM hacking

Also some work was done in the VM subsystem in order to fix a race condition happening when multiple databases were used with keys having the same name swapped at the same time. Again this changes are a result of an hacking session together with Pieter Noordhuis (yes, he rocks).

Shared objects feature killed, but a better one added

Do you remember the sharedobjects stuff in redis.conf that was off by default? Well this implementation of object sharing was not working very well. Redis normally shares objects when that's possible, but sharedobjects tried to do a bit more than that, taking a pool with frequently used objects to share. Unfortunately it required a too big sharing pool size or a too strong bias towards a minority of common strings in the dataset in order to work well. So I removed that and instead added a much simpler but much better stuff: now strings that look like integers in the range 0-9999 are always shared! If you have many small counters, or in general many numbers in this range, this will be a huge win memory-wise. To enlarge the rage there is just to change a define. I'll try to push forward this kind of features when possible, compatibly with the fact that we are in freeze so this changes are only possible only when it's possible to be 99.999% sure the code is sane and will not lead to new bugs.

Consistent replies for commands returning multiple elements

The old behavior of commands returning more than one element (a multi bulk reply in Redis slang) was to return a nil element when called against non existing keys (translated as nil, NULL, Null, or whatever is the null object in a given language), but this is wrong as there is a rule in the way Redis commands behave:

If an operation targeting an aggregate data type (list,set,zset,hash) is performed against a non existing key, the behavior should be exactly the one obtained running the operation against an empty aggregate value of the same type. So for instance LLEN returns 0 if called against a non existing key, because we consider it holding an empty list.

Since Redis also removes keys belonging to aggregate data types if the result of an operation is an empty list, set, sorted set, or hash, then this behavior is completely consistent. You don't have to care about creating empty values before issuing operations. You don't have to care about the cleanup of empty values. So in order to be consistent with this behavior now operations like LRANGE mykey 0 -1 performed against non existing keys will return an empty list. All the operations are now ported to the new semantics, including ZRANGE, ZREVRANGE, and so forth.

Non blocking rehashing

Apparently Redis is being experimented in some very large environment! One of our users is performing tests against four 160 GB instances... and guess what? Rehashing when you have 200,000,000 keys can take 47 seconds... As you may already know the Redis key space is stored into a data structure called an "hash table". The hash table naive implementation (99% of the implementations are naive in this regard) will block when rehashing is needed, that is, the table has already too many elements compared to the number of buckets available, so a bigger table is created and all the keys moved from the old the new table.

This is a blocking operation, and if there are millions of keys this is going to be slow. How to fix this? There are many approaches:

Use a different data structure like a balanced tree or a skip list. But this will turn lookup into O(log(N)), and will make some operation like getting a random element harder to perform. Also this is going to use some more memory.
Use a mix between hash tables and trees, that is, a tree of hash tables. So you have many small hash tables and the rehash time is always little. But note that even with a table of just one million elements the rehashing takes several milliseconds, in some environment where Redis is used as a "mostly real-time" component this can be already too much.
Perform a non-blocking, incremental rehashing, using two hash tables. So you have one hash table (the old one) where you perform the lookups (and on misses you also try to lookup in the second hash table), but all the new keys are stored in the second table.

After talking with a few skilled guys at VMWare (a special thanks to Derek Collison for showing me how the incremental rehashing was a sound design) I started to like the latter approach more and more. Derek suggested migrating the hash table from the first to the second table incrementally just using a small amount of CPU time (for instance 1 millisecond) from time to time. I combined the idea with a trick of performing a single key migration at every hash table lookup (so you are guaranteed that if the old hash table contains N keys, after N operations the whole hash table is rehashed).

The only left problem was how to to pick a random element out of the two hash tables. Redis really need this operation for a number of reasons (for instance the RANDOMKEY command), and with a good distribution. In order to get a random element from an hash table, the only reasonable approach for a good distribution regardless of the quality of the hash function is sampling random buckets until a non empty bucket si found. If the hash table uses chaining, then, if the bucket has M elements, select a random number between 0 and M-1 and jump at that bucket.

How to run this algorithm when you have two tables? Well it's very simple, you have an hash table with N1 buckets and one with N2 buckets, just select a random number between 0 and (N1+N2)-1, and perform the sampling as if it were a unique table.

So now I'm working exactly at this problem, trying to bring a non blocking experience asap, stay tuned! More updates on twitter as usually.

24984 views^*

Posted at 07:24:28 | permalink | 4 comments | print