An update on the Memcached/Redis benchmark

Thursday, 23 September 10
A few days ago I published a blog post that included a benchmark between Redis and memcached.

My results showed Redis to be considerably faster than memcached when running a single instance of redis-benchmark (or the equivalent mc-benchmark) against an instance of Redis using a single core and an instance of memcached using four cores.

I was missing something in the test: Dormando published a counter-benchmark where more than one mc-benchmark instance was used against memcached, showing that only under these conditions is memcached able to saturate all the CPU cores, leading to much higher numbers.

The test performed by @dormando was itself missing an interesting benchmark: given that Redis is single threaded, what happens if I run one instance of Redis per core? Also, now that we have these new results, why was memcached slower against a single instance of mc-benchmark? We have proof that the benchmark is able to perform more operations per second even with a single instance, since Redis was showing higher numbers. So given that memcached appears to do a great job with multiple instances of the benchmark, what was happening?

Redis and memcached trying to saturate two cores

My first attempt is to use memcached started with "-t 2" so that it uses just two cores, and two instances of Redis, each setup tested against two instances of the benchmark. I limited both memcached and Redis to two cores because my box is a quad core, so I can't run four instances of the server and four of the benchmark while making sure that every thread has a core to itself.

What are the results? I won't show any graphs this time, just what happens with 100 clients, as the results are more or less consistent across different numbers of clients:
  • Memcached was serving 130k SETs per second and 150k GETs per second.
  • Redis was serving 200k SETs per second and 200k GETs per second.
Redis appears to be able to scale horizontally per core without issues in both SET and GET operations, since every core is running an isolated process. Memcached obviously can't scale as well because it is a single instance using multiple cores. This is a tradeoff that will be discussed later.

As I started to get curious I launched memcached with different numbers of threads, and discovered something very interesting. When running with a single thread, memcached is able to perform much better against a single benchmark instance... so what happens if we run two memcached instances instead of a threaded one?
  • Memcached was serving 200k SETs per second and 200k GETs per second.
Exactly like Redis.
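
To make the comparison easier to read, the aggregate figures above can be divided by the two cores each setup was allowed to use. The snippet below is only a small arithmetic sketch: the numbers are the ones reported in this post, while the labels and the even split across cores are assumptions made for illustration.

```python
# Aggregate operations per second reported in this post (100 clients).
# Every setup was limited to two cores; the labels are illustrative only.
CORES = 2

results = {
    "memcached, one instance with -t 2": {"SET": 130_000, "GET": 150_000},
    "redis, two instances": {"SET": 200_000, "GET": 200_000},
    "memcached, two instances": {"SET": 200_000, "GET": 200_000},
}


def per_core(ops_per_second, cores=CORES):
    """Average throughput per core, assuming load is spread evenly."""
    return ops_per_second / cores


per_core_results = {
    name: {op: per_core(value) for op, value in ops.items()}
    for name, ops in results.items()
}

for name, ops in per_core_results.items():
    print(f"{name}: {ops['SET']:.0f} SET/s and {ops['GET']:.0f} GET/s per core")
```

Under these assumptions the process-per-core setups average 100,000 operations per second per core, while the threaded memcached averages 65,000 SETs and 75,000 GETs per second per core.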

Threaded or not

OK, a few observations... The first is that with many kinds of workloads you may be better off running memcached with "-t 1", even if you have more cores. The second is that one single-threaded process per core is apparently the most scalable solution, and this should not be a big surprise at all.

So now the big question is: "is a multi threaded implementation worth it?".

This is a matter of design, taste, and facts all mixed together. With a single instance using multiple threads you have a few advantages:
  • No sharding if you want to use a single server.
  • If your application performs a lot of GET operations with multiple keys at a time, a single instance does not force you to open multiple connections and send several requests in parallel, which is less straightforward.
There are also disadvantages:
  • Slower development speed to achieve the same features. Multithreaded programming is hard.
  • It's harder to fix bugs. The only place in the Redis code base where we experienced hard-to-fix bugs was the Virtual Memory, which is threaded (because that is the only way to do it well).
  • Not as scalable.
  • If you are dealing with complex atomic operations like Redis does, it can become a nightmare.
  • Once Redis 2.2 is stable we'll focus on Redis Cluster, which will mitigate the pain of running a cluster of instances. This is mostly a non-issue with memcached, as it's used for caching and client-side sharding is perfectly fine for that application.

We decided to go against a threaded approach for the following reasons:
  • Redis is much more complex than memcached, so a threaded implementation would be very hard to develop at our current speed and with our current stability goals.
  • Anyway, if you need to scale you eventually need to go beyond a single server, and soon. Any application with non-trivial traffic is going to need many servers. And the people most concerned with performance, the ones running big sites, have tens of servers at least.
  • Because of the data structures exported by Redis, MGET is not such a heavily used primitive in Redis land: for instance, you can use Hashes to store objects and retrieve all the fields with a single HGETALL call. There is still value in having fewer instances for the same number of keys, as you may want to retrieve unrelated keys in parallel (example: ten different users).
  • Anyway, once you have more than a single server you need multiple connections to fully exploit the parallelism.

I really have zero doubts about this: in a future of cloud computing I want to consider every single core as a computer in itself. It's hard to justify a lot more complexity for something you can obtain with a slightly better client library (your client can open connections to all the instances and provide multi-get for you, via multiplexing). Add to this that the CPU/memory overhead for every added instance is close to zero, for both memcached and Redis.
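
As a rough illustration of the "slightly better client library" idea, here is a minimal sketch of client-side sharding with a multi-get that is split per instance. Everything in it is hypothetical: FakeInstance is an in-memory stand-in rather than a real Redis or memcached connection, ShardedClient is not an existing library, and CRC32 modulo the number of instances is just one simple way to pick an instance. For brevity it queries the instances one after another, whereas a real client would send the per-instance requests in parallel over its open connections.

```python
import zlib
from collections import defaultdict


class FakeInstance:
    """In-memory stand-in for one server instance (hypothetical)."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def mget(self, keys):
        # One request returning values in key order, None for misses.
        return [self.data.get(key) for key in keys]


class ShardedClient:
    """Routes each key to one instance and splits multi-gets per instance."""

    def __init__(self, instances):
        self.instances = instances

    def _index_for(self, key):
        # Checksum of the key modulo the number of instances.
        return zlib.crc32(key.encode("utf-8")) % len(self.instances)

    def set(self, key, value):
        self.instances[self._index_for(key)].set(key, value)

    def mget(self, keys):
        # Group keys by owning instance so each instance gets one request.
        groups = defaultdict(list)
        for key in keys:
            groups[self._index_for(key)].append(key)

        found = {}
        for index, group in groups.items():
            values = self.instances[index].mget(group)
            found.update(zip(group, values))

        # Return values in the order the caller asked for them.
        return [found[key] for key in keys]


if __name__ == "__main__":
    client = ShardedClient([FakeInstance() for _ in range(4)])
    for i in range(10):
        client.set(f"user:{i}", f"name-{i}")
    print(client.mget(["user:3", "user:7", "user:0"]))
```

The application sees a single set/mget interface, and each key is stored in exactly one instance, so adding instances does not multiply the memory needed for the same data set. Note that a plain modulo scheme remaps most keys when the number of instances changes.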

The End

I hope the combined efforts of my benchmark and Dormando's had the effect of shedding some light on the matter. With current entry-level hardware the ceiling appears to be 100,000 operations per second per core, and there is a tradeoff between threaded and non-threaded implementations.

I think we need more tests like this in general (and fewer busy-loops...), not to say "mine is longer" but to show what the real-world performance of different systems is and why, as with well-designed systems you can be sure it's almost always a tradeoff and not some lame programming error. In case there are large discrepancies and there is indeed some programming problem, it's possible to investigate and fix it.
Posted at 05:24:41


anon writes:
23 Sep 10, 11:12:59
Can you point to a guide showing how to set up Redis most efficiently on a 16-core machine?
Also, do you know of any similar benchmarks against other key-value storage facilities?
Like Luxio, Tokyo Cabinet by itself or through kumofs, Tx, or even the new BlitzDB merged into Drizzle not long ago?

Also, the redis own benchmark is quite simple. Can it be updated with more complex tasks, inserting different random values into different random places, mixing PUTs and GETs, varying inserted data size, and basically checking both the average case and the worst case?

Links to projects:
Tx (a bit off-topic, but as you showed text search, can compare with that):
Tokyo cabinet:
Louie writes:
23 Sep 10, 11:23:58
Great work. I appreciate the quick responses to community discussions and benchmarks. It really demonstrates your commitment to the project which in turn makes it easier for all of us to adopt Redis.
dean writes:
23 Sep 10, 11:32:41
Good stuff. Can't wait for Redis Cluster!

Thanks for your devotion to this cool project!
Jeo writes:
23 Sep 10, 14:36:06
All this is especially true given virtual machines.
Adam Malter writes:
25 Sep 10, 15:54:06
Hi - thanks for the great data.

We are a shop contemplating Redis for persistent k/v stores and this causes me a bit of concern at the app server level.

For example, we currently have 4 memcache boxes, probably on dual quad core instances. If we were to supplement this with a similar Redis cluster; 8 instances per machine times 4 current machines - so 24 connections per app server.

Now, our app servers are java and using the Spy memcache client, we found that best performance was achieved when we had 2 or 3 connections per server.

Would this mean that with our starting cluster you might advise having 50-75 connections (which translate to threads on the Java side)?

This connection scaling worries me, even if the instance to app server ratio is 1:1. With a growing cluster, growing number of cores per box, well, things get exponential quickly...

I am very new to evaluating Redis and might have some stuff backwards or might be over-worrying about connection count. Could you speak to what configurations are recommended?
Kim writes:
26 Sep 10, 06:40:48
I have a question about synchronization. Are the two instances of redis synchronized, so that if you do a SET on one process, the other process is updated as well? If not, that is an argument for a threaded model where you ensure consistency.

Memory requirements are also another concern. If you want to use your 16-core CPU, you would need 16 times the memory with separate processes.
Vivek Munagala writes:
29 Jan 11, 03:36:23
Awesome post. It helped me understand many concepts of redis.

Kim, different instances of Redis do not share memory. So if you set a key in one instance, it will not be seen from the second instance. And I don't understand why you would need 16 times the memory if you have 16 instances. Obviously you need to distribute your keys among all the instances instead of replicating them. Something as simple as a checksum of the key can help you distribute them.