What's wrong with 2006 programming?
Tuesday, 05 October 10

Redis 2.0 introduced a new feature called Virtual Memory. The idea is that some applications using Redis may not access the whole dataset with the same frequency. In extreme cases only a small percentage of hot spot data is used often, while the rest is mostly idle and touched very rarely. For instance imagine a Redis instance holding User objects: the most active users will hit this subset of records continuously, while a large percentage of users will access the site a few times a month, and another large subset of users have forgotten about the web service entirely.

Since Redis is memory backed, the idea was to transfer rarely accessed data to disk, and to reload swapped data when needed, that is, when a client tries to access it. The actual implementation of Redis Virtual Memory is done completely in user space: we approximate an LRU algorithm, encode the data that should be swapped, write it to disk, reload and decode it when needed, manage the pages in the swap file, and so forth. It's a non-trivial piece of code, but it is working well.

Still, almost every week I receive a mail, a blog message, a tweet, or I happen to read an article pointing me to this article written by the Varnish guy (edit: that is, the well known developer Poul-Henning Kamp). The article tells you how silly it is to implement your caching layer on top of the one already provided by the operating system. The idea is that you should just write things into an mmap()ed file or the like, and let the OS swap and load things for you.
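
To make the idea concrete, here is a minimal sketch of that approach in C: back your data with a file-backed mapping and let the kernel decide what stays in RAM. The file name and size are arbitrary and most error handling is omitted.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        size_t size = 1UL << 30;                       /* 1 GB backing file */
        int fd = open("data.bin", O_RDWR | O_CREAT, 0644);
        if (fd == -1 || ftruncate(fd, size) == -1) return 1;
        char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) return 1;

        strcpy(mem, "a value stored 'in memory'");     /* the kernel decides */
        printf("%s\n", mem);                           /* when this hits disk */

        munmap(mem, size);
        close(fd);
        return 0;
    }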

If you know Redis you already know that we actually try hard to use the operating system's smartness to do complex things in a simpler way. For instance our persistence engine is completely based on the fork() copy-on-write semantics of modern kernels. But for Redis Virtual Memory using the OS is not a good solution, and it's time to explain in detail why it is not.
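
As a concrete example of what I mean, this is the essence of fork() based snapshotting, reduced to a toy: the int array stands in for the whole dataset, and the real persistence code is obviously more involved.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int dataset[1024];                  /* stand-in for the whole dataset */

    void save_snapshot(const char *path) {
        FILE *fp = fopen(path, "wb");
        fwrite(dataset, sizeof(dataset), 1, fp);
        fclose(fp);
    }

    int main(void) {
        dataset[0] = 42;
        pid_t pid = fork();             /* pages are shared, copy-on-write */
        if (pid == 0) {                 /* child: a consistent point-in-time */
            save_snapshot("dump.bin");  /* view of the data, no locks needed */
            _exit(0);
        }
        dataset[0] = 43;                /* parent keeps mutating: the write
                                           copies the page, the child still
                                           sees the old value */
        waitpid(pid, NULL, 0);
        return 0;
    }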

OS paging is blocking as hell

The first huge problem with this approach is how badly blocking it is. When you try to access a memory page that is swapped out to disk, the CPU raises a page fault, asking the kernel to retrieve the page from the swap file and transfer it into a physical memory page. In the meantime the process is completely blocked.

What does this mean? That if we have two clients, C1 and C2, and C1 asks for a key stored in a swapped-out page while C2 asks for a key that is fully in memory, the whole process stalls on the page fault triggered by C1's request: C2 pays the disk latency too, even if its data never left RAM.
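
You can get a feeling for the stall with a few lines of C. Assuming the data.bin mapping from the sketch above, with a cold page cache, the first access blocks the whole process inside a single memory load:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    static long long now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    int main(void) {
        size_t size = 1UL << 30;
        int fd = open("data.bin", O_RDONLY);
        volatile char *mem = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) return 1;

        size_t off = size / 2;          /* some page we never touched */
        long long t0 = now_ns();
        char c = mem[off];              /* cold: the whole process stalls */
        long long t1 = now_ns();        /* here until the disk answers */
        char d = mem[off];              /* warm: now it's a plain load */
        long long t2 = now_ns();

        printf("cold: %lld ns, warm: %lld ns\n", t1 - t0, t2 - t1);
        (void)c; (void)d;
        close(fd);
        return 0;
    }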

One very important goal in Redis VM (and I guess this should be a primary goal of every system with low latency semantics) is to be able to serve keys that are in memory as fast as usual. Clients performing a query against a rarely used page will instead pay the latency penalty, without affecting the other clients.

This alone is a show stopper, and just because of this it would not be worth continuing with the rest of the article. But well, while I'm at it, it's a good exercise I guess.

The granularity is 4k pages

The kernel swaps and loads 4k pages. For a page to be idle from the point of view of the kernel and its LRU algorithm, there must be no memory accesses anywhere in the whole page for some time.

Redis is an in-memory data structures server; this means that our values are often things like lists, hash tables, balanced trees, and so forth. These data structures are built incrementally with commands, often over a long time. For instance a Redis list may be composed of 10k elements storing the timeline of a twitter user, accumulated in the course of six months. Every element of the list is a Redis object, and Redis objects get shared, cached, and so forth: obviously there is no good locality in such a data structure.

Multiply this by all the keys you have in memory and try to visualize it in your mind: these are a lot of small objects. What happens is simple to explain: every single 4k page will contain a mix of many different values. For a page to be swapped out to disk by the OS, all the objects it contains must belong to rarely used keys. In practical terms the OS will not be able to swap a single page at all, even if just 10% of the dataset is hot.
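
A back-of-the-envelope calculation shows why. The numbers here are made up for illustration: assume 64-byte objects placed randomly in memory, with 10% of them hot.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double hot_fraction = 0.10;       /* assumed share of hot objects */
        int objects_per_page = 4096 / 64; /* 4k page, 64-byte objects */
        /* a page is swappable only if every object in it is cold */
        double all_cold = pow(1.0 - hot_fraction, objects_per_page);
        printf("fully cold pages: %.4f%%\n", all_cold * 100.0);
        return 0;
    }

Roughly one page in a thousand is ever swappable, and that's with a generous 90% of the dataset being cold.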

Oh but this is because you are lame! Store related objects nearby...

The whole Redis semantics of being single threaded, fast, and very versatile in the data structures provided, comes from the fact that we use the good old data structures, implemented on top of something that provides good performance even with bad locality (compared to a disk), that is: memory.

Handling these data structures with very good locality is as hard as implementing these data structures well on disk. If we could do this, it would be a much better strategy to use the inverse design: store everything on disk and use the kernel disk cache to keep the hot spot in memory. Persistence and VM solved in a single pass, a no brainer.

Actually in Redis 2.2 we try to "compact" our data in memory, and in this way we obtained huge space savings. Many datasets in Redis 2.2 take just 20% of the space that was required in 2.0: five times more space efficient than before. But where is the trick? We can do this only for small lists, sets, and hashes, where O(N) algorithms are as fast as O(1) algorithms because of cache locality.
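
To illustrate the idea (this is a simplified toy, not the actual encoding used by Redis 2.2), a small hash can be stored as one flat byte array and queried with a linear scan that touches just a couple of cache lines:

    #include <stdio.h>
    #include <string.h>

    /* Flat layout: key NUL value NUL key NUL value NUL ...,
     * an empty key terminates the blob. */
    const char *flat_get(const char *flat, const char *key) {
        const char *p = flat;
        while (*p) {
            const char *k = p; p += strlen(p) + 1;   /* skip key */
            const char *v = p; p += strlen(p) + 1;   /* skip value */
            if (strcmp(k, key) == 0) return v;
        }
        return NULL;                                 /* not found */
    }

    int main(void) {
        /* "field1" -> "orange", "field2" -> "apple", packed in one array */
        const char flat[] = "field1\0orange\0field2\0apple\0";
        printf("%s\n", flat_get(flat, "field2"));    /* apple */
        return 0;
    }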

I think I have already made my point, but there are more good reasons to implement paging at the application level, especially in the case of Redis.

Optimal representations on disk and in memory are very different

Many data structures are designed to provide specific time complexity guarantees. For instance a hash table provides an element lookup time of O(1) in the average case. In a similar way a balanced tree is designed so that it's possible to update a Redis sorted set score in O(log(N)).

For this to be possible, you have to waste memory, because you have metadata of many kinds: pointers, allocation overhead, per-node information for augmented data structures (like our skiplist implementation), and so forth. The representation of the data is optimized for interacting with it.
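
A quick way to visualize the metadata tax, using a generic doubly linked list node (not the exact Redis layout):

    #include <stddef.h>
    #include <stdio.h>

    typedef struct node {
        struct node *prev, *next;   /* 16 bytes of pointers on 64-bit */
        char *value;                /* 8 more for the payload pointer */
        size_t len;                 /* 8 more for the length */
    } node;

    int main(void) {
        /* a 3-character value costs sizeof(node) bytes of metadata,
         * plus malloc() bookkeeping for the node and the value itself */
        printf("overhead: %zu bytes per element\n", sizeof(node));
        return 0;
    }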

On the other hand, when values are swapped out they are idle. For storage, the best representation can be completely different. For instance a hash table holding names of fruits in memory can be represented on disk as a trivial comma separated string of values: "orange,apple,...".

The OS has zero knowledge of what's written in a page. Instead, with application level paging we know what we are doing, and can serialize the data in the VM in a smarter way. This means from 5 to 10 times less disk I/O compared to the work performed by the kernel in the same conditions!
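
For instance, a toy serializer for the fruits example above (the real encoder of course handles types, lengths and binary safe data):

    #include <stdio.h>
    #include <string.h>

    /* Flatten a set of strings into a comma separated buffer:
     * no pointers, no per-node overhead, just the payload bytes. */
    int serialize_set(const char **items, int n, char *out, size_t outlen) {
        size_t used = 0;
        for (int i = 0; i < n; i++) {
            size_t len = strlen(items[i]);
            if (used + len + 2 > outlen) return -1;  /* no room */
            if (i > 0) out[used++] = ',';
            memcpy(out + used, items[i], len);
            used += len;
        }
        out[used] = '\0';
        return 0;
    }

    int main(void) {
        const char *fruits[] = { "orange", "apple", "banana" };
        char buf[64];
        if (serialize_set(fruits, 3, buf, sizeof(buf)) == 0)
            printf("%s\n", buf);   /* orange,apple,banana */
        return 0;
    }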

Aging algorithm can't be changed

And finally... which values should be swapped to disk? Which should stay in memory?

Again, the kernel uses a simple LRU algorithm where the granularity is the page. Redis can do much better. For instance, LRU is not always the best algorithm when data is accessed in a "circular" way, one record after the other and then again from the start. Also, the current Redis algorithm takes into account the size of a given value: a small value is not worth transferring to disk if its age is exactly the same as that of a bigger value, and things like this. In Redis 2.2 we plan to provide different swapping algorithms so that people can pick what works better for a given dataset.
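
For example, an application-level policy can weight the idle time by the size of the value, something the kernel simply cannot do. The formula below is illustrative, not necessarily the exact one used by Redis:

    #include <math.h>
    #include <stdio.h>
    #include <time.h>

    /* Higher score = better candidate for swapping: a big idle value
     * goes to disk before a small one of the same age. */
    double swappability(time_t last_access, size_t size) {
        double age = difftime(time(NULL), last_access);  /* seconds idle */
        return age * log(1.0 + (double)size);
    }

    int main(void) {
        time_t now = time(NULL);
        printf("small value: %.1f\n", swappability(now - 3600, 100));
        printf("big value:   %.1f\n", swappability(now - 3600, 1000000));
        return 0;
    }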

I think the Varnish article is not bad at all; the real problem is that one article is not enough to provide a deep understanding of the specific implementation of a different system. I hope this article provided a counter-case to the Varnish approach, so that people can tell when it is sensible to use that approach, and when it is not.
