Redis new test engine

Monday, 11 July 11

If you follow the advice that Redis prints after a build, that is, to run make test, then you know that our test was not the fastest in the world: it took several minutes to complete on a not-so-fast computer.

One reason the test used to be so slow is that we do a lot of fuzz testing. Almost all the bugs we discovered thanks to the test suite were found by fuzz tests, and very rarely by regression tests or unit tests. I guess this is pretty obvious: if you try to write correct code, you are unlikely to make trivial errors. Errors are more likely to arise from interactions you are not Vulcan enough to spot while coding.

Fuzz tests are slow, but another cause of the slowness is that the test has to go through the networking stack, since it is actually a specialized Redis client issuing commands against a live instance. This means that every time we issue a command we pay the round trip time.

Whatever the reason, a slow test is bad for developers, who need to run it constantly after significant changes (and the slower it is, the more significant a change has to be before you bother running it), and of course for users, who will be disappointed to wait several minutes after installation just because they decided to play it safe and test the build before deploying it. Also, our Continuous Integration environment can perform fewer test runs per hour if the test is slow, which prevents or delays the discovery of hard-to-catch bugs that only happen from time to time.

Since I want better testing, better CI, and better coverage before the 2.4 release candidate, this was the right time to fix the test, making it faster and more valgrind friendly. The result is already merged in the unstable branch. The following are a few notes about what I did in order to make the test faster.

Going parallel

For a faster test to have a serious impact on how developers and users use it, a 30% speedup is not enough: you need a test that is an order of magnitude faster, so this was the kind of improvement I was looking for.

Using a faster client, or optimizing some specific tests, would help a little, but not enough to reach the order-of-magnitude speedup I was looking for. However, there was a simple way to dramatically speed up the test execution: turning the test into a parallel one.

The Redis test was already organized in separate units, such as "list", "aof", "replication" and so forth. Often the tests inside a single unit need to run sequentially, because a test sometimes uses the data set created by the previous one, so turning every test into a separate unit was too complex and possibly not worth it. It is simpler to run the different units (composed of tens of tests each) in parallel.

This was much simpler, as different units already started different instances of Redis.

Server Client model

One of my goals was to reuse as much code as possible from the old test engine: Pieter Noordhuis and I have been working on the tests for two years, and that is not work to throw away without good reason. The previous engine was also perfectly able to run a single unit, which is part of what I wanted to accomplish. So what I did was to turn the old test into a test client.

This is the final design:

The test starts a test server, that is, a process that handles the execution of all the test units and reports back to the user.
The test server starts a number of test clients. A test client is basically the old test suite, but with a networking interface. Every test client connects to the test server on startup via a TCP socket, and waits for commands.
At this point the test server starts assigning tasks to the different clients, like "run the list test".
Clients running test units report back to the test server. The test server uses an event-driven design, so it can read the replies from all the clients with little work.
Every time a test client finishes executing a test unit, it sends a "done" message to the server, which reuses the test client to run the next test unit, if any.
Eventually all the test units are executed, and the test can exit with the appropriate exit code.
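
The dispatch logic in the steps above can be sketched without any networking. This is only a minimal sketch, not the actual engine: the proc names (scheduler_init, client_ready, all_done), the state array and the client names are hypothetical, and "sending" a unit to a client is reduced to recording the assignment.

```tcl
# Hypothetical scheduler state: pending units, what each client is
# running, and the units that have completed.
array set state {pending {} running {} finished {}}

proc scheduler_init {units} {
    global state
    set state(pending) $units
    set state(running) [dict create]
    set state(finished) {}
}

# Called when a client connects or reports "done".
# Returns the unit assigned to the client, or "" if nothing is left.
proc client_ready {client} {
    global state
    # If the client was running a unit, that unit is now finished.
    if {[dict exists $state(running) $client]} {
        lappend state(finished) [dict get $state(running) $client]
        dict unset state(running) $client
    }
    if {[llength $state(pending)] == 0} {
        return ""
    }
    # Pop the next unit and hand it to this client.
    set unit [lindex $state(pending) 0]
    set state(pending) [lrange $state(pending) 1 end]
    dict set state(running) $client $unit
    return $unit
}

# True when no unit is waiting and no client is still busy.
proc all_done {} {
    global state
    expr {[llength $state(pending)] == 0 && [dict size $state(running)] == 0}
}

scheduler_init {list aof replication}
puts [client_ready c1]   ;# c1 gets "list"
puts [client_ready c2]   ;# c2 gets "aof"
puts [client_ready c1]   ;# c1 finished "list" and gets "replication"
```

Because a client asks for work only when it is idle, a slow unit never blocks the others: faster clients simply come back more often.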

Currently I'm spawning 16 test clients to run a little more than 20 units. The new units are not exactly what they used to be, since tests that ran for too long are now split into N sub-units to improve parallelization.

Less fuzz, more speed

As I said at the beginning of this article, we have a lot of fuzz testing. However, these tests sometimes run for 10000 iterations and are very slow to execute. Running the full fuzz tests in the CI makes sense, as this helps discover rare bugs that need a few very particular events to be triggered; however, if something is broken it will usually be evident after far fewer iterations.

So in order to further speed up the test execution I introduced an --accurate option that the user can pass to the test suite. When this option is given the test runs (as it used to) with a lot of iterations in the fuzz tests; by default we use fewer iterations. The continuous integration tests use --accurate, and so will we developers before a release (but in any case the test runs continuously inside the CI with --accurate, so if something is broken we'll be informed soon).
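
As an illustration of how such a switch can drive the fuzz tests, here is a minimal sketch. Only the --accurate flag and the 10000 figure come from this article: the proc names and the reduced default of 100 iterations are invented placeholders, not what the test suite actually uses.

```tcl
# Hypothetical option parsing: returns 1 if --accurate was passed.
proc parse_options {argv} {
    set accurate 0
    foreach arg $argv {
        if {$arg eq "--accurate"} {
            set accurate 1
        }
    }
    return $accurate
}

# Pick the number of fuzz iterations. 10000 is the figure mentioned
# above; 100 is a made-up value for the fast default run.
proc fuzz_iterations {accurate} {
    expr {$accurate ? 10000 : 100}
}

# argv is set by tclsh; default to an empty list if it is missing.
if {![info exists argv]} { set argv {} }
set accurate [parse_options $argv]
puts "fuzz iterations: [fuzz_iterations $accurate]"
```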

This further reduced the execution time of a few of the more time consuming tests.

The actual speedup

How much faster can we run the test now? A lot faster!

On the same (fast) box, the tests that used to take 2 minutes and 54 seconds now run in just 13 seconds. This is a 13x speed improvement!

Even running with --accurate is significantly faster than it used to be, taking a total of 48 seconds to execute.
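
For the record, the arithmetic behind these figures is easy to check: 2 minutes and 54 seconds is 174 seconds, so the default run is about 13.4 times faster and the --accurate run roughly 3.6 times faster. The variable names below are just for illustration.

```tcl
set old_seconds [expr {2 * 60 + 54}]   ;# 2m54s = 174 seconds
set fast_seconds 13                     ;# new default run
set accurate_seconds 48                 ;# new --accurate run

# Force floating point division, otherwise expr divides integers.
set speedup_default  [expr {double($old_seconds) / $fast_seconds}]
set speedup_accurate [expr {double($old_seconds) / $accurate_seconds}]

puts [format "default run:    %.1fx faster" $speedup_default]
puts [format "--accurate run: %.1fx faster" $speedup_accurate]
```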

This is the kind of speedup I was looking for, since it is the kind of speedup that can completely change the interaction between testing and users. A 13-second run covering hundreds of tests with colorized output is fun to run, while a 3-minute test is frustrating, and many users will not run the tests or will be far from happy doing it.

Implementation details

This system is implemented in Tcl, using the built-in event-driven programming support that Tcl has had for... decades? Long before this paradigm started to gain popularity. You may remember the Why threads are a bad idea paper from Ousterhout (Tcl's father).

For instance, in Tcl you could write the following event-driven time server even 10 years ago:

# Listen on port 1234; handle_client is called for every new connection.
socket -server handle_client 1234
proc handle_client {fd host port} {
    fconfigure $fd -blocking 0
    puts $fd "Hello $host:$port! the current unix time is [clock seconds]"
    close $fd
}
# Enter the event loop.
vwait forever

If you telnet to port 1234 on 127.0.0.1 the result is: Hello 127.0.0.1:65236! the current unix time is 1310419689 as expected. That will handle 30k clients per second on your MacBook without issues.

Tcl's event-driven programming supports sending data in the background automagically, timers, and so forth. Another feature I used is that in Tcl everything is a string, so it was trivial to exchange data between the test clients and the test server. I used the following function:
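
The same exchange can be exercised from Tcl itself, with the client also driven by the event loop. This is a minimal sketch and not part of the test engine: the on_readable proc and the reply and done variables are hypothetical, and the server is bound to an ephemeral port on 127.0.0.1 instead of 1234 so that it cannot collide with anything already running.

```tcl
# Server side: same handler idea as the time server above.
proc handle_client {fd host port} {
    puts $fd "Hello $host:$port! the current unix time is [clock seconds]"
    close $fd
}
# Port 0 asks the OS for a free port; read it back from -sockname.
set srv [socket -server handle_client -myaddr 127.0.0.1 0]
set port [lindex [fconfigure $srv -sockname] 2]

# Client side: the event loop calls on_readable when data arrives.
proc on_readable {fd} {
    global reply done
    set reply [gets $fd]
    close $fd
    set done 1
}
set client [socket 127.0.0.1 $port]
fileevent $client readable [list on_readable $client]

# Safety net: give up after two seconds instead of waiting forever.
after 2000 {set done timeout}
vwait done
close $srv

if {$done eq "timeout"} {
    puts "no reply received"
} else {
    puts $reply
}
```

A single vwait serves both ends here: the accept callback and the client's readable callback are dispatched by the same event loop.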

proc send_data_packet {fd status data} {
    set payload [list $status $data]
    puts $fd [string length $payload]
    puts -nonewline $fd $payload
    flush $fd
}

The receiver of the data packet can easily decode the data, as it is a Tcl list, and is represented as a string like any other data type in Tcl.

So for instance if a test worked as expected the test client sends a data packet to the server with status "ok", using the data field to communicate the test name. Otherwise, if there was some problem, an "err" status is sent, along with the details of the error.

If instead a runtime error happens in a client for some reason, it gets trapped using Tcl exceptions and sent to the test server as an "exception" packet, which halts the execution of everything.
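
To make the decoding step concrete, here is a minimal sketch of what a receiver could look like. The read_data_packet proc is hypothetical, not necessarily what the test server does, and the round trip goes through an in-process pipe (chan pipe, which needs Tcl 8.6) instead of a TCP socket. It assumes plain text payloads, so that the character count written by the sender matches what the receiver reads.

```tcl
# Sender, same as the function shown above.
proc send_data_packet {fd status data} {
    set payload [list $status $data]
    puts $fd [string length $payload]
    puts -nonewline $fd $payload
    flush $fd
}

# Hypothetical receiver: read the length line, then exactly that many
# characters, then unpack the two-element Tcl list.
proc read_data_packet {fd} {
    set len [gets $fd]
    set payload [read $fd $len]
    lassign $payload status data
    return [list $status $data]
}

# Round trip through a pipe created inside this process.
lassign [chan pipe] rd wr
send_data_packet $wr ok "LPUSH against an empty list"
lassign [read_data_packet $rd] status data
puts "$status: $data"   ;# prints "ok: LPUSH against an empty list"
close $wr
close $rd
```

The length prefix is what lets the receiver know where a packet ends even when the data field itself contains newlines or spaces.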
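
The three outcomes can be sketched with Tcl's catch command. This is an illustration under assumptions: the run_test proc and the way a test body is compared against an expected value are hypothetical, and only the "ok", "err" and "exception" status names come from the description above. The returned pair is what a client could pass to send_data_packet.

```tcl
# Hypothetical wrapper around a single test body. It returns the
# {status data} pair describing what happened.
proc run_test {name body expected} {
    if {[catch {uplevel #0 $body} result]} {
        # The test code itself raised an error: report an exception.
        return [list exception "$name: $result"]
    }
    if {$result eq $expected} {
        # Success: the data field carries the test name.
        return [list ok $name]
    }
    # The test ran but produced the wrong value.
    return [list err "$name: expected '$expected', got '$result'"]
}

puts [run_test "addition" {expr {1 + 1}} 2]
puts [run_test "wrong answer" {expr {1 + 1}} 3]
puts [run_test "runtime error" {error "boom"} 0]
```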

Valgrind support

Another thing I improved was the valgrind support. When the test runs under valgrind, a few time-dependent tests may fail or produce slightly different outputs, resulting in false positives. I simply added a few sleeps where needed and everything is now fine. As a result the Redis test is now constantly running under valgrind (at least the unstable branch so far). This was pretty important, as in recent months we found at least one bug that was detected by valgrind just by running the existing test suite.

In short, testing is going to be more important in the Redis world, as I'm happier shipping a bit less, but rock solid, than a bit more but with potential issues. The release of 2.4 will probably be delayed a bit in order to add more tests and to test the 2.4 release better by hand, but I think this is in the best interest of our users.

Also, the 2.6 release will be based on the unstable branch, so that we limit the number of significantly different source trees we are managing.

I hope this overview of the test suite was interesting. If you have questions, please feel free to ask.

Posted at 21:39:51

antirez weblog