An update on the Memcached/Redis benchmark

Thursday, 23 September 10

A few days ago I published a blog post that included a benchmark between Redis and memcached.

My results showed Redis to be considerably faster than memcached when running a single instance of redis-benchmark (or the equivalent mc-benchmark) against an instance of Redis using a single core and an instance of memcached using four cores.

I was missing something in the test. Dormando published a counter-benchmark in which more than one mc-benchmark instance was run against memcached, showing that only under these conditions is memcached able to saturate all the CPU cores, leading to much higher numbers.

The test performed by @dormando was itself missing an interesting benchmark: given that Redis is single threaded, what happens if I run an instance of Redis per core? Also, now that we have these new results, why was memcached slower against a single instance of mc-benchmark? We have proof that the benchmark can perform more operations per second even with a single instance, since Redis was showing higher numbers. So given that memcached appears to do a great job with multiple instances of the benchmark, what was happening?

Redis and memcached trying to saturate two cores

My first attempt is to use memcached started with "-t 2" to use just two cores, and two instances of Redis, against two instances of the benchmark. I limited both memcached and Redis to two cores because my box is a quad core, so I can't run four instances of the server and four of the benchmark while making sure that every thread has a core to itself.
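To make the process layout concrete, here is a minimal sketch that only builds and prints the command lines for the two setups; nothing is executed. The port numbers are assumptions, and the flag spellings are those of current releases, so they may differ from the versions used for this test. The mc-benchmark invocations are left out because its options are not shown in this post.

```python
from typing import List

# Assumed ports: 11211 is memcached's default; the Redis ports are hypothetical.
MEMCACHED_PORT = 11211
REDIS_PORTS = [6379, 6380]  # one port per single-threaded Redis instance


def memcached_cmd(threads: int, port: int = MEMCACHED_PORT) -> List[str]:
    """One memcached process using `threads` worker threads."""
    return ["memcached", "-t", str(threads), "-p", str(port)]


def redis_cmd(port: int) -> List[str]:
    """One Redis process (always a single thread serving commands)."""
    return ["redis-server", "--port", str(port)]


def redis_benchmark_cmd(port: int, clients: int = 100) -> List[str]:
    """One benchmark process with `clients` parallel connections, quiet output."""
    return ["redis-benchmark", "-p", str(port), "-c", str(clients), "-q"]


# Setup A: a single threaded memcached limited to two threads.
print(" ".join(memcached_cmd(threads=2)))

# Setup B: two Redis instances, each with its own benchmark process.
for port in REDIS_PORTS:
    print(" ".join(redis_cmd(port)))
    print(" ".join(redis_benchmark_cmd(port)))
```

Either way there are four busy processes or threads in total (two serving, two benchmarking), which is what fits on a quad core box.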

What are the results? I won't show any graphs this time, just what happens with 100 clients, as the results are more or less consistent across different numbers of clients:

Memcached was serving 130k SETs per second and 150k GETs per second.
Redis was serving 200k SETs per second and 200k GETs per second.

Redis appears to scale horizontally per core without issues in both SET and GET operations, since every core is running an isolated process. Memcached obviously can't scale as well because it is a single instance using multiple cores. This is a tradeoff that will be discussed later.

As I got curious I launched memcached with different numbers of threads, and discovered something very interesting. When running with a single thread, memcached performs much better against a single benchmark instance... so what happens if we run two memcached instances instead of a threaded one?

Memcached was serving 200k SETs per second and 200k GETs per second.

Exactly like Redis.
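
As a quick check of the arithmetic, the figures above can be divided by the number of cores given to the server. This is only a sketch: it assumes that each setup really kept exactly two cores busy, which the numbers alone do not prove.

```python
# Operations per second quoted above, all measured with two cores' worth of server.
CORES = 2  # assumption: both cores were fully used in every setup

results = {
    "memcached -t 2 (one threaded instance)": {"SET": 130_000, "GET": 150_000},
    "redis (two instances)": {"SET": 200_000, "GET": 200_000},
    "memcached -t 1 (two instances)": {"SET": 200_000, "GET": 200_000},
}


def per_core(ops_per_second: int, cores: int = CORES) -> float:
    """Average throughput contributed by each core."""
    return ops_per_second / cores


for setup, ops in results.items():
    summary = ", ".join(
        f"{name} {per_core(value):,.0f}/s per core" for name, value in ops.items()
    )
    print(f"{setup}: {summary}")
```

With one process per core both servers reach about 100,000 operations per second per core, while the threaded memcached stays at 65,000 to 75,000.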

Threaded or not

Ok, a few observations... The first is that with many kinds of workloads you may be better off running memcached with "-t 1", even if you have more cores. Second, one single-threaded process per core is apparently the most scalable solution, and this should not be a big surprise at all.

So now the big question is: "is a multi threaded implementation worth it?".

This is a matter of design, tastes, and facts all mixed together. With a single instance using multiple threads you have a few advantages:

No sharding if you want to use a single server.
If your application performs a lot of GET operations with multiple keys at a time, a single instance does not force you to open multiple connections and send requests in parallel, which is less straightforward.

There are also disadvantages:

Slower development speed to achieve the same features. Multi-threaded programming is hard.
It's harder to fix bugs. The only place in the Redis code base where we experienced hard-to-fix bugs was the Virtual Memory, which is threaded (because that is the only way to do it well).
Not as scalable.
If you are dealing with complex atomic operations like Redis does, it can become a nightmare.

Once Redis 2.2 is stable we'll focus on Redis Cluster, which will mitigate the pain of running a cluster of instances. This is mostly a non-issue with memcached, as it's used for caching and client-side sharding is perfectly fine for that application.

We decided to go against a threaded approach for the following reasons:

Redis is much more complex than memcached, so a threaded implementation would be very hard to develop at our current speed and with our current stability goals.
If you need to scale, you eventually need to go beyond a single server anyway, and soon. Any application with non-trivial traffic is going to need many servers. And the people most concerned with performance, the ones running big sites, have tens of servers at least.
Because of the data structures exported by Redis, MGET is not such a heavily used primitive in Redis land: for instance, you can use Hashes to store objects and retrieve all the fields with a single HGETALL call. There is still value in having fewer instances for the same number of keys, as you may want to retrieve unrelated keys in parallel (example: ten different users).
Anyway, once you have more than a single server you need multiple connections to fully exploit the parallelism.

I really have zero doubts about this: in a future of cloud computing I want to consider every single core as a computer in itself. It's hard to justify a lot more complexity for something you can obtain with a slightly better client library (your client can open connections to all the instances and provide multi-get for you, via multiplexing). Add to this that the CPU/memory overhead of every added instance is close to zero, for both memcached and Redis.
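
To give an idea of what such a client library has to do, here is a minimal sketch of a multi-get spread over several instances. Everything in it is hypothetical: FakeInstance is an in-memory stand-in for a real server connection, ShardedClient is not an existing library, keys are routed with a plain CRC32 modulo, and a thread pool is used in place of real socket multiplexing to keep the example short.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Optional


class FakeInstance:
    """In-memory stand-in for a connection to one server instance (hypothetical)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def mget(self, keys: List[str]) -> List[Optional[str]]:
        return [self._data.get(key) for key in keys]


class ShardedClient:
    """Routes every key to one instance and fans a multi-get out to all of them."""

    def __init__(self, instances: List[FakeInstance]) -> None:
        self.instances = instances
        self._pool = ThreadPoolExecutor(max_workers=len(instances))

    def _shard(self, key: str) -> int:
        # CRC32 is stable across runs, unlike Python's built-in hash() for strings.
        return zlib.crc32(key.encode("utf-8")) % len(self.instances)

    def set(self, key: str, value: str) -> None:
        self.instances[self._shard(key)].set(key, value)

    def mget(self, keys: Iterable[str]) -> List[Optional[str]]:
        keys = list(keys)
        # Group the requested keys by the instance that owns them.
        by_shard: Dict[int, List[str]] = defaultdict(list)
        for key in keys:
            by_shard[self._shard(key)].append(key)
        # Send one request per instance, all of them concurrently.
        futures = {
            shard: self._pool.submit(self.instances[shard].mget, shard_keys)
            for shard, shard_keys in by_shard.items()
        }
        found: Dict[str, Optional[str]] = {}
        for shard, future in futures.items():
            found.update(zip(by_shard[shard], future.result()))
        # Give the values back in the order the caller asked for them.
        return [found[key] for key in keys]


client = ShardedClient([FakeInstance() for _ in range(4)])
for i in range(10):
    client.set(f"user:{i}", f"name-{i}")
print(client.mget(["user:3", "user:7", "missing"]))  # ['name-3', 'name-7', None]
```

The caller sees a single multi-get, while the grouping and the parallel requests stay inside the client. A real library would also have to handle connection errors and adding or removing instances, which a simple modulo does not do gracefully.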

The End

I hope the combined efforts of my benchmark and Dormando's have shed some light on the matter. With current entry-level hardware the ceiling appears to be 100,000 operations per second per core, and there is a tradeoff between threaded and non-threaded implementations.

I think we need more tests like this in general (and fewer busy loops...), not to say "mine is longer" but to show what the real-world performance of different systems is and why, since with well-designed systems you can be sure a difference is almost always a tradeoff and not some lame programming error. In cases where there are large discrepancies and there is indeed some programming problem, it's possible to investigate and fix it.

Posted at 05:24:41