Redis weekly update #5 - RC1 is near

Monday, 17 May 10
Welcome back to the Redis Weekly update :) This one actually took one month instead of one week, but this is not bad news after all, as the reason for the delay is that I was busy working at Redis in order to speedup the release of 2.0.0 RC1.

RC1 is very near now, and we have a date: it will be released 21th May 2010. The work will continue after RC1, but will focus on testing (including providing a better test suite) and bug fixing.

This weekly update could be huge if I would report one month of work, so I'll picka few interesting stuff from the change log, trying to explain the details, instead of presenting a long list of things without any detail. I'll start from one that I think will make many users happy (for sure it made me happy), that is, loading and saving times.

RDB files: ridiculous speed improvements

What's ridiculous about the speed improvement is not just the huge speedup factor, but that this was not done before. In the latest weeks I started doing something I never did in the past: checking the performance of selected pieces of code.

I was completely aware that Redis was (and is!) absolutely not optimized, but coding is a matter of priorities. Since Redis was already pretty fast, completeness of the API, stability, Virtual Memory, were the real priorities. With the 2.0.0 release we are approaching a status where Redis feels more or less feature complete from the point of view of functionality exported to the user.

Well, I want more control on the string type, including ability to set substrings and bit manipulation commands to do things like a bloom filter redis backed using the same memory that C code would require, I also want a MULTI/EXEC with CAS support, but apart from a few minor things and a good clustering solution, it's time to slow down a lot the process of adding new things, in order to focus the energies in making what we got much better: faster, less memory intensive, more reliable, better documented.

I'm dreaming of a 2.2.0 release that is a short cycle of development only focused on this aspects, adding virtually no important new features from the point of view of the command set and data types.

Condensed story of an optimization

Back to the rdb speedup, everything started on the Google Group, and later on IRC, when Andy McCurdy (aka the author of the Python Redis lib) reported too slow loading and saving times of the RDB dataset with VM switched on. After some time you start knowing people, even on IRC, and Andy is the kind of guy that is not going to notify something without a very good reason and test data at hand. So I decided this was about time to investigate the issue.

One thing that was especially odd was that while the VM was turned on, the saving and loading process seemed mostly CPU-bound, and not a problem with I/O speed. Basically some profiling showed that Andy was experiencing slow saving and loading time because with VM he was having the first experiences with very big datasets. But the problem seemed much more general.

Thanks to the test dataset provided by Andy and a few test datasets (real datasets from production environments) I had already at hand, I started to perform profiling and optimization of different parts of the code, and the results are worth of some attention I guess:
1.2.6 loading time: 21 seconds
1.2.6 saving time: 7.5 seconds

1.3.12 loading time: 3 seconds 1.3.12 saving time: 1.2 seconds
The times shown are about the same dataset, already cached in the OS buffers, so in real world conditions where the I/O is also involved the speedup will not be always so huge, but still very noticeable.

Basically it's worth upgrading just for this. I did not checked the numbers against the AOF file: optimization of AOF will be 2.2 business, but I bet it is going much faster already since many of the changes leading to the RDB files speedup are pretty general.

But there is a story inside the story, so let's look inside the code to check how a specific optimization was done, and why it took me two days to write 10 lines of code.

Programming casts

Sometimes programmers form a cast, and this is surely what's happening in the IRC channel "#C", as I noticed when I visited the channel in order to talk of other kinds of casts: C casting between double and long values.

But let's start from the begin. After some profiling fun I noticed that snprintf() (both glibc and the mac os x version that I guess is originally form BSD folks) is very slow when the task is translating a double number into a string. And this is how double numbers, used as scores for the sorted set type, are saved inside the .RDB file indeed: as strings, that is a totally arch-agnostic way to save a floating point number.

Still, very often in actual datasets, the used scores are numbers without decimal part: for instance unix times, or counters obtained using ZINCRBY. So I started thinking about that: if I can check if a double has not decimal part when represented in radix 10, and the number is inside a range that can be represented by a long long value, I could print it using the much faster ll2string() (long long to string) function instead of snprintf.

So I implemented the trick getting a huge performance gain in datasets containing many sorted sets elements, and joined #C on freenode to check with some guy if this was going to be good or not. They told me that I was stupid, as casting was undefined in the case of overflow. And undefined, they claimed, is not just that you'll get a bad value after the casting: the system may also crash, as, well, "undefined means undefined, everything can happen", "So, stupid, try writing sane code", this were the words of one guy particularly frustrated living inside that channel. What followed was a flame war that resulted in a ban of my username from the channel, but overall I had fun and tried to do my best to make this guy very... excited ;)

My trick was more or less the following:
double val = ... some double value ...
if (val == (double)((long long)val)) {
   ... then, there is no decimal part ...
Now the C standard states the following (more or less, not verbatim): "casting a floating point number into an integer will truncate it removing the decimal part, and the result will be just the integral part, assuming the number is in the range that the integer type can receive. Otherwise if there is an overflow the result is undefined".

Fortunately in order to avoid undefined behavior (but I think that in practical terms they are not a problem at all)! we can use the defines in limits.h and float.h in order to know the range of both the double precision and the long long min and max value. So this is the final code:
#if (DBL_MANT_DIG >= 52) && (LLONG_MAX == 0x7fffffffffffffffLL)
        double min = -4503599627370495; /* (2^52)-1 */
        double max = 4503599627370496; /* -(2^52) */
        if (val > min && val < max && val == ((double)((long long)val)))
            ll2string((char*)buf+1,sizeof(buf),(long long)val);
Basically we test that the number is in a given safe range, and we set the range so that when a number have no decimal part it can be represented just using the mantis part of the double. So we are sure that the representation is always "precise" without any lost of information (if I've a double with a N bits mantis I'm supposed to be able to represent integers up to 2^N without problems).

If the number is in that range, we perform the double casting trick, that is always in a range were no overflow can occur (see the #if conditions for LLONG_MAX), and finally we print the number.

I'm sure that #C folks will still think this is totally broken (and I'm stupid): never mind, your data is safe.

New test suite

Thanks to the work of Pieter Noordhuis we have a new tests suite that is definitely headed in the right direction: now the test is able to run a dedicated redis server instead of messing with the server that is already running on port 6379. Also multiple servers can be started, with different configurations, so we are now able to test for replication, auth command, and many other things.

We'll continue to work on this in the next weeks before 2.0.0 will be released as stable in order to add support for valgrind (I'm working on this right now).

Also some day ago I added the DEBUG DIGEST command that is able to take the SHA1 of the whole dataset: if two servers hold the same data, will have the same digest. So now the test suite is using this to check consistency of RDB / AOF files, but the new command can also be used with the integration tests provided by the new test suite in order to check that the replication is consistent in an automatic way, while usually this test was performed by hand every month or so (!).

There is an interesting trick used in DEBUG DIGEST in order to make the implementation much simpler. Basically imagine you want to take the SHA1 digest of a Redis Set: the set is an unsorted aggregate of elements, so the same elements but in a different order will change the digest result, that is not what we want. The obvious solution to this problem is sorting the elements of the set before of taking the digest, but this also sucks as it's slow, requires more memory, and so forth.

But fortunately there is a very simple way to handle this problem, that is, instead of performing the SHA1(the_whole_set), I perform the "SHA1(first_element) xor SHA1(second_element) xor ...". As the bitwise xor operation is commutative, the result is always the same. This has another hidden advantage. If you have a set of strings like "foo", "bar", "biz" and you take the SHA1 just calling the function to update the SHA1 state against all the elements, the result will be the same if the elements are "fo" "obar" "biz". Not good, as different datasets will have the same SHA1. Our approach also fixes this problem, but there is much more inside the code, as instead for data types such lists where the order is instead significative we use different tricks.

So the general idea is, when you are dealing with unordered set of elements, think about commutative operators.

AOF check tool

Another Pieter contribution (one among a number of others, if you check the Changelog is an endless stream of fixes, improvements, and other stuff provided by him) is a tool to fix the AOF file when it gets corrupted (for instance, truncated because the server was out of file.

The tool is called redis-check-aof and is able to check the AOF for consistency and repair it truncating the last non complete command if needed. This is very useful as I already assisted users with this problem. When you end with a very big AOF file corrupted and you need to truncate it at the end in a hurry in order to restart the server asap, you'll discover that everything is against you, and that vim is simply not good for the task this time ;) This tool is going to save us in this bad circumstances.

MONITOR and redis-cli revamped

When you got problems with your servers, MONITOR is very handy in order to understand what's happening. But now that everything is using the new protocol internally, the MONITOR output was basically impossible to parse by eyes, and hard to parse by tools.

Also it lacked timestamps, and timing is very important in order to discover bugs. So I fixed this in one of my "on the pool" sessions. This sessions are very nice: I bring my son to the pool two times every week. There is to stay there 1 hour, and there is no internet. I leave home with in mind a single TODO list item for Redis, bringing with me my laptop. My son swim, and I code without distractions at all. So we are both happy :)

This is the new MONITOR format:
Escape character is '^]'.
+1274050302.735903 "monitor"
+1274050304.695527 "set" "foo" "bar"
+1274050356.424762 "set" "key with spaces" "value with\n strange \x01\x02 chars inside"
As you can see it's more or less CSV. Also the time resolution provided by gettimeofday() is simply amazing, and very useful in this context.

Again in the pool, but in another day, I did exactly the same for redis-cli:
% ./redis-cli
redis> set "this is my key" foo
redis> get "this is my key"
redis> append "this is my key" "\r\n"
(integer) 5
redis> get "this is my key"
As you can see now it's possible to use the repl-mode of redis-cli like if it was a little programming language shell. We got line editing, history, and ability to write and read quoted strings.

There are many more improvements: Pieter did a wonderful work with hashes refactoring and made the API complete. The VM was optimized, and a lot of testing was done against it, The AOF received a big speedup thanks to a cool idea provided by Derek Collison that is working very well. Pub/Sub got pattern matching support, and so forth. It's just impossible to enumerate all the stuff done in one month on a fast evolving project.

I'm very happy about how things are going, and I look at the future with a lot of interest :) Redis is getting better but there is still a lot of work we can do to make it more interesting and useful. And while the work on Redis continues (and also in client libs and other related projects), there are every day more users testing and adopting it.

Some month ago I wrote about me joining VMware, now I can confirm that my hopes turned into reality: thanks to VMware, the great contributions of many users, and the great code and ideas provided by Pieter Noordhuis, the Redis development entered into a new stage.
Posted at 08:18:51 | permalink | 7 comments | print