Redis weekly update #4 - Fixes and Optimizations

Tuesday, 13 April 10
Welcome back to the Redis weekly update, we skipped one week as now that we are in feature freeze the news are a bit slower, even if... actually in this two weeks we managed to accumulate an interesting Changelog. While the development of new features stopped the 1th April, we are still not into a very static stage, and the following kind of changes are the new focus:
  • API changes to provide a more coherent behavior. Redis 2.0 can break a few corner cases at least, being it a major release, and it's time to fix some old bug.
  • Optimizations: we want to go faster, consuming less memory, before 2.0 will be released, at least when this is possible adding little code or changing modular code that is isolated from the system (for instance dict.c, the hash table implementation, is one of this components that is being modified in this stage).
  • Fixes, obviously.

Pattern matching Pub/Sub

Just before the feature freeze started I got a chance to add pattern matching to Pub/Sub. So now it's possible to listen to a channel named, for instance, "news.italy." (that incidentally will match a lot of sad stories about our "bizzarre" prime minister as usually). But my idea was that the ability to match binary safe channels, and to specifically match every possible key name, was a very important feature. For instance "" is a valid key, and with the new protocol every binary string is a valid key. So if I want to subscribe to just "*" verbatim, I should be able to do so. For this reasons two new commands were added to Redis, in order to subscribe and unsubscribe to patterns:
  • PSUBSCRIBE pattern1 pattern2 ... patternN
  • PUNSUBSCRIBE pattern1 pattern2 ... patternN
This commands are yet not documented, but the semantic is exactly the same as SUBSCRIBE/UNSUBSCRIBE (non pattern matching variants).

Hashes, new things

The feature freeze was entered with the promise to the users to add a few planned hash commands before 2.0 release, so thanks to Pieter Noordhuis we have now HMSET that is able to atomically set multiple fields of an hash with little latency and HINCRBY that is just INCRBY but against hash table fields. Other two commands are planned: HMGET and HSETNX, and we'll call the hashes feature complete, apart for a detail, that SORT is still not supporting BY/GET options targeting fields of hash tables, but that's in the pre-2.0 TODO list as well.

Also I and Pieter in one of our IRC design sessions managed to design a defragmentation strategy for hash tables that is both simple and works well. He promptly implemented our design in zipmap.c that is now simpler than before (because after considering many options we decided to either leave a bit of trailing space for future values or go ahead and defragment asap if the "hole" left is too big).

Non blocking Tcl client

Finally I found some time to write a non blocking Tcl-client (using the Tcl built-in event driven programming framework), so it will be possible to test for Pub/Sub stuff and BLPOP in test-redis.tcl.

VM hacking

Also some work was done in the VM subsystem in order to fix a race condition happening when multiple databases were used with keys having the same name swapped at the same time. Again this changes are a result of an hacking session together with Pieter Noordhuis (yes, he rocks).

Shared objects feature killed, but a better one added

Do you remember the sharedobjects stuff in redis.conf that was off by default? Well this implementation of object sharing was not working very well. Redis normally shares objects when that's possible, but sharedobjects tried to do a bit more than that, taking a pool with frequently used objects to share. Unfortunately it required a too big sharing pool size or a too strong bias towards a minority of common strings in the dataset in order to work well. So I removed that and instead added a much simpler but much better stuff: now strings that look like integers in the range 0-9999 are always shared! If you have many small counters, or in general many numbers in this range, this will be a huge win memory-wise. To enlarge the rage there is just to change a define. I'll try to push forward this kind of features when possible, compatibly with the fact that we are in freeze so this changes are only possible only when it's possible to be 99.999% sure the code is sane and will not lead to new bugs.

Consistent replies for commands returning multiple elements

The old behavior of commands returning more than one element (a multi bulk reply in Redis slang) was to return a nil element when called against non existing keys (translated as nil, NULL, Null, or whatever is the null object in a given language), but this is wrong as there is a rule in the way Redis commands behave:
If an operation targeting an aggregate data type (list,set,zset,hash) is performed against a non existing key, the behavior should be exactly the one obtained running the operation against an empty aggregate value of the same type. So for instance LLEN returns 0 if called against a non existing key, because we consider it holding an empty list.
Since Redis also removes keys belonging to aggregate data types if the result of an operation is an empty list, set, sorted set, or hash, then this behavior is completely consistent. You don't have to care about creating empty values before issuing operations. You don't have to care about the cleanup of empty values. So in order to be consistent with this behavior now operations like LRANGE mykey 0 -1 performed against non existing keys will return an empty list. All the operations are now ported to the new semantics, including ZRANGE, ZREVRANGE, and so forth.

Non blocking rehashing

Apparently Redis is being experimented in some very large environment! One of our users is performing tests against four 160 GB instances... and guess what? Rehashing when you have 200,000,000 keys can take 47 seconds... As you may already know the Redis key space is stored into a data structure called an "hash table". The hash table naive implementation (99% of the implementations are naive in this regard) will block when rehashing is needed, that is, the table has already too many elements compared to the number of buckets available, so a bigger table is created and all the keys moved from the old the new table.

This is a blocking operation, and if there are millions of keys this is going to be slow. How to fix this? There are many approaches:
  • Use a different data structure like a balanced tree or a skip list. But this will turn lookup into O(log(N)), and will make some operation like getting a random element harder to perform. Also this is going to use some more memory.
  • Use a mix between hash tables and trees, that is, a tree of hash tables. So you have many small hash tables and the rehash time is always little. But note that even with a table of just one million elements the rehashing takes several milliseconds, in some environment where Redis is used as a "mostly real-time" component this can be already too much.
  • Perform a non-blocking, incremental rehashing, using two hash tables. So you have one hash table (the old one) where you perform the lookups (and on misses you also try to lookup in the second hash table), but all the new keys are stored in the second table.

After talking with a few skilled guys at VMWare (a special thanks to Derek Collison for showing me how the incremental rehashing was a sound design) I started to like the latter approach more and more. Derek suggested migrating the hash table from the first to the second table incrementally just using a small amount of CPU time (for instance 1 millisecond) from time to time. I combined the idea with a trick of performing a single key migration at every hash table lookup (so you are guaranteed that if the old hash table contains N keys, after N operations the whole hash table is rehashed).

The only left problem was how to to pick a random element out of the two hash tables. Redis really need this operation for a number of reasons (for instance the RANDOMKEY command), and with a good distribution. In order to get a random element from an hash table, the only reasonable approach for a good distribution regardless of the quality of the hash function is sampling random buckets until a non empty bucket si found. If the hash table uses chaining, then, if the bucket has M elements, select a random number between 0 and M-1 and jump at that bucket.

How to run this algorithm when you have two tables? Well it's very simple, you have an hash table with N1 buckets and one with N2 buckets, just select a random number between 0 and (N1+N2)-1, and perform the sampling as if it were a unique table.

So now I'm working exactly at this problem, trying to bring a non blocking experience asap, stay tuned! More updates on twitter as usually.
Posted at 07:24:28 | permalink | 4 comments | print
Do you like this article?
Subscribe to the RSS feed of this blog or use the newsletter service in order to receive a notification every time there is something of new to read here.

Note: you'll not see this box again if you are a usual reader.


13 Apr 10, 22:00:15
Have you considered using linear hashing?
It's nearly 4am here so please excuse my quoting from wikipedia:

Linear hashing allows for the expansion of the hash table one slot at a time.
The frequent single slot expansion can very effectively control the length of
the collision chain. The cost of hash table expansion is spread out across
each hash table insertion operation, as opposed to be incurred all at once.
Therefore linear hashing is well suited for interactive applications.

In a sense, it pushes the blocking problem to the underlying dynamic array,
but an hybrid between linear hashing and the approach you describe seems
possible (I'm exceedingly tired and probably not thinking clearly, though).
Also, in platforms with memory overcommit, you could just allocate a huge
array and let the OS assign physical memory to your process as the hash table

As for hash tables vs. e.g. BSTs: I can see your hash table implementation
uses around 4.33 words (for average load factor 0.75) per entry _plus_ the
malloc overhead (often 8 bytes IIRC from Doug Lea's description of his
malloc), that is 6.33 words on average on 32-bit platforms. If you use for
instance a RB tree, you can get it down to 5 words plus malloc overhead,
or even 4 if you encode the color creatively (pointer tagging). With an AVL
it'd be 5 words. BSTs are a bit of a PITA if you don't have a GC at hand, but
there must be zillions of readily usable, robust C implementations around
anyway. Which reminds me that Judy arrays could give you excellent memory
efficiency, performance comparable to hash tables, and, since they are similar
to 256-ary tries, they might have better worst-case bounds on insert (they
might be O(log N), some quick/lazy googling didn't return any complexity bounds). ENEEDSLEEP

antirez writes:
14 Apr 10, 03:56:34
@Mauricio: thanks for the insight! That's what I think about the two proposed solutions:

- Linear hashing: it's *very* similar to what I'm doing, I would consider my schema a variant actually.
- Trees: we need get-random-element operation so every kind of tree-alike data structure needs to be augmented with additional info per node in order to provide this operation. Also as from time to time we call this operation many times, the O(log(N)) time complexity can't be enough.

I'm implementing my simple schema to check how it works, but I'm confident ;)
14 Apr 10, 15:11:30
I'd forgotten about RANDOMKEY; if you say O(log N) is not good enough, that
clearly precludes balanced trees.

The method you describe is indeed similar to linear hashing and sounds quite
easy to implement. It seems to me that a linear hashing scheme with
a dynamic array copied incrementally would need less memory and would exhibit
a better access pattern when the underlying array is being expanded, though.

When you use two tables and start to migrate keys (because the load factor is
nearly 1), you'll have 3N buckets (N for the orig + 2N for the new table) for
the time it takes for the migration, whose lower bound is given by the cache
misses. That'd be some 3 per element, assuming you're scanning the buckets and
"forwarding" as you move towards the end of the bucket array (getting the
bucket address is free since the array is examined sequentially, then 1 miss
to get the key pointer, one to address the key, finally another cache miss to
write in the destination table's bucket array, and I'm ignoring collisions in
the 1st order analysis).

That seems consistent with the figures you gave: 200 M elements rehashed in
47s amounts to 235 ns per element (70/80ns mem latency sounds reasonable on
the kind of hardware your user must have been using to hold 160 GB sets in
memory, actual computations take no time compared to the cache misses :). In
other words, you need memory to hold 3N buckets for a a period of time of no
less than N * 235ns (47s in the example you gave).

Now, as for linear hashing: once the initial table with N buckets is full, we
have to expand the underlying dynamic array. Normally, we'd just allocate the
new one (so we have N + 2N buckets now) and copy the old buckets over new new
table. However, in this case we aren't accessing the destination table
randomly, so we're limited by bandwidth, not latency. Migrating a bucket takes
the time needed to copy 1 pointer, instead of several cache misses. For
instance, if the memory bandwidth is 6GB (pretty conservative), that could be
about 0.7 ns, vs. over 200 ns in the previous case, or in other words around
0.13s to copy a table with 200 million buckets, compared to 47s.
Now, we do not have to do it all at once, and need not block for 0.13s: we can
copy incrementally (with some care to prevent collisions, for instance by
adding elements to the old table unless the corresponding buckets in the new
table have already been populated; we can trigger resizing a bit earlier to
avoid high collision rates while we're expanding the hash table).

I believe this all amounts to the following: since the average memory usage is
lower in the latter case, you can use a larger constant to resize the array,
thus lowering the load factor and decreasing the amount of collisions. If we
move the average load from 0.75 to 0.665 (i.e. array size increased by a
factor of 3 on resize, for similar average mem usage of 3N buckets), the expected
number of probes per successful lookup goes from 1.52 to 1.43 (1 + (exp(2*x) -
1 - 2 * x)/(8*x) + (x / 4) where x is the load factor) and GET is nearly 6%
faster :) Also, in the two-table scheme you have to lookup in both when you
get a miss in the first one (for newly added elements), which wouldn't happen
with linear hashing, causing an additional (and potentially larger) difference
in speed. It's going to be very dependent on the usage pattern: if you add
lots of new elements and then access them while you're copying to the 2nd
table, suddenly GET gets twice slower... Then there's also the issue with
deletes (have to operate on both tables) and... oh this is getting tricky :-)

This is becoming quite a Gedanken experiment :) writes:
28 Mar 11, 07:53:50
The Virtual machine is difficult to hack if the code is well optimized. I fact,the security must be optimized in the correct way for a maximum of effect against hackers.
comments closed