What's wrong with 2006 programming?

Tuesday, 05 October 10
Redis 2.0 introduced a new feature called Virtual Memory. The idea is that some applications using Redis may not access the whole dataset with the same frequency. In extreme cases only a little percentage of hot spot data is used often, while the rest is mostly idle and touched very rarely. For instance imagine a Redis instance holding User objects: the most active users will hit this subset of records continuously, while a large percentage of users will access the site a few times a month, and another large subset of users completely forgot about this web service at all.

Since Redis is memory backed the idea was to transfer rarely accessed data on disk, to reload swapped data when needed (that is when a client will try to access it). The actual implementation of Redis Virtual Memory is completely done in user space: we try to approximate an LRU algorithm, encode data that should be swapped, write it on disk, and reload if needed, decode, managing pages in the swap file, and so forth. It's a non trivial piece of code but it is working well.

Still almost every week I receive a mail, a blog message, a tweet, or I happen to read an article pointing me to this article written by the Varnish guy (edit: that is, the well known developer Poul-Henning Kamp). The article will tell you how silly is to implement your caching layer on top of the one already provided by the operating system. The idea is that you should just write things into an mmap()ed file or alike, and let the OS swap/load things for you.

If you know Redis you already know that we actually try hard to use the operating system smartness to do complex things in a simpler ways. For instance our persistence engine is completely based on fork() copy-on-write semantics of modern kernels, but for Redis Virtual Memory using the OS is not a good solution, and it's time to explain in details why it is not.

OS paging is blocking as hell

The first huge problem with this approach is how badly blocking it is. What happens is that when you try accessing a memory page that is swap on disk the CPU will raise an exception, asking the kernel to retrieve the page from the swap file and transfer it in a physical memory page. In the meantime the process is completely blocked.

What this means? That if we have two clients, C1 and C2, and...
  • C1 is trying to access a key that was stored into a page that the OS transfered on disk.
  • C2 is trying to access a key that is fully in memory. A recently used one.
  • C1 sends the query one millisecond before C2.
  • Because C1 will touch a page that is swapped on disk, the process will be halted, and will wait the disk I/O needed to bring the page back into memory.
  • In the meanwhile everything is stopped. Even if C2 was going to read something in memory it gets serialized and will be served after C1.

One very important goal in Redis VM (and I guess this should be a primary goal of every system with a low latency semantics) is to be able to serve keys that are in memory as fast as usually. Clients performing a query against a rarely used page will instead pay the latency penalty, without effects for other clients.

This is already a show stopper and just because of this it should not be worth continuing with the rest of the article, but well, while I'm at it it's a good exercise I guess.

The granularity is 4k pages

The kernel is able to swap/load 4k pages. For a page to be idle from the point of view of the kernel and its LRU algorithm, what is needed is that there are no memory accesses in the whole page for some time.

Redis is an in-memory data structures server, this means that our values are often things like lists, hash tables, balanced trees, and so forth. This data structures are created incrementally with commands, often in a long time. For instance a Redis list may be composed of 10k elements storing the timeline of a twitter user, accumulated in the course of six months. So every element of the list is a Redis object. Redis objects gets shared, cached, and so forth: there is no good locality in such a data structure obviously.

Multiply this for all the keys you have in memory and try visualizing it in your mind: These are a lot of small objects. What happens is simple to explain, every single page of 4k will have a mix of many different values. For a page to be swapped on disk by the OS it requires that all contained objects should belong to rarely used keys. In practical terms the OS will not be able to swap a single page at all even if just 10% of the dataset is used.

Oh but this is since you are lame! Store related objects nearby...

The whole Redis semantics of being single threaded, fast, and very versatile in the data structures provided, is up to the fact that we use the good and old data structures implemented with something that is able to provide good performances even with bad locality (compared to a disk) that is: memory.

Handling this data structures with very good locality is as hard as implementing well this data structures on disk. If we could do this, it would be a much better strategy to use the inverse design: store everything on disk and use the kernel disk cache to take the hot spot in memory. Persistence and VM solved in a single pass, a no brainer.

Actually in Redis 2.2 we try to "compact" our data in memory, and in this way we obtained huge space savings. Many datasets in Redis 2.2 takes just 20% of the space that was required in 2.0. This is five times more space efficient than before. But where is the trick? That we can do this only for small lists, sets, and hashes, where O(N) algorithms are as fast as O(1) algorithms because of cache locality.

I think I already showed my point, but there are more good reasons to implement paging at application level, especially in the case of Redis.

Optimal representation on disk and on memory are very different

Many data structures are designed to be able to provide specific time complexity performances. For instance an hash table provides an element lookup time of O(1) in the average case. In a similar way a balanced tree is designed so that it's possible to update a Redis sorted set score in O(log(N)).

For this to be possible, there is to waste memory because you have meta data of many kinds: pointers, allocations overheads, informations per every node for augmented data structures (like our skiplist implementation), and so forth. The representation of data is optimized for interacting with this data.

On the other side when values are swapped they are idle. For storage the best representation can be completely different. For instance an hash table holding name of fruits in memory can be represented on disk as a trivial comma separated string of values: "orange,apple,...".

The OS has zero knowledge of what's written in a page. Instead with application level paging we know what we are doing, and can serialize the data in the VM in the smarter way. This means from 5 to 10 times less disk I/O compared to the work performed by the kernel in the same conditions!

Aging algorithm can't be changed

And finally... what value to swap on disk? What value to take in memory?

Again, the kernel will use a simple LRU algorithm, where the granularity is the page. Redis can do much better, for instance LRU is not always the best algorithm when accessing data in a "circular" way, one record after the other and then again. Also the current Redis algorithm takes into account the size of a given value. If it's small it's not worth transferring if the age is exactly like another value that is bigger, and things like this. In Redis 2.2 we plan to provide different swapping algorithms so that people can pick what can work better for a given dataset.

I think the Varnish article is not bad at all, the real problem is that an article is not enough to provide a deep understanding of the specific implementation of a different system. I hope this article provided a counter-case for the Varnish approach that can be used when it is sensible to use it. And the other way around.
Posted at 17:15:25 | permalink | 35 comments | print
Do you like this article?
Subscribe to the RSS feed of this blog or use the newsletter service in order to receive a notification every time there is something of new to read here.

Note: you'll not see this box again if you are a usual reader.


05 Oct 10, 17:44:00
By the way, the Varnish approach is not revolutionary. It seems that the author didn't know it at the time, but it is well known as "cache oblivious algorithms" in the literature. There is a lot of good and interesting reading on that topic!
Andraz Tori writes:
05 Oct 10, 18:19:45
The bottom line is... "Show me the data". :)

Basically there's a lot of interesting theory in here, there's a lot of interesting theory in Varnish guy's article.

Where are the benchmarks?

[my personal experience is that doing your own caching hurts you as you at the end need more memory, because underlying operating system caches things again and you basically take twice as much memory as you would need. So if you really need real time performance you will have to have everything in memory anyway, so why not simply use mmap]

Andraz Tori
Tim Jarratt writes:
05 Oct 10, 18:26:19
Just wanted to say that I appreciate all the hard work that's gone into redis 2.0 and beyond. I've been accessing redis from a highly asynchronous environment (node.js) and knowing that you try to avoid blocking operations as much as possible makes me very happy and confident that the non-blocking code I write will run as fast as possible, and scale.
FunnyGuy writes:
05 Oct 10, 19:12:09
Biggest observation from me is that regardless of whether the OS does the caching to disk or custom app code does it, disk I/O always winds up being a bottleneck........
Poul-Henning Kamp writes:
05 Oct 10, 19:19:24
Let me say first that I have not seen your code, so these comments should be read at a very general and broad level.

As an operating system programmer, I find it strange that you would try to replicate so much of the kernels code in userland. In particular in light of the fact that try as you may, you cannot avoid the kernel code acting also.

About the only circumstances where I would even think about that, would be a single threaded program that could not possibly be multithreaded or written to work sensibly with the kernels VM system in any sane way.

With the kind of hardware we have seen the last 20 years, I would think more than thrice before deciding squeeze high performance out of such a single-threaded program: You will be sequestered into smaller and smaller fractions of the silicon in the CPU socket as progress marches past.

Just this week I saw benchmarks from a 24core "commodity server". A single threaded program will at best get access to 8% of that machine, one core for userland and one for kernel.

But I'm sure you know all that, and have good and valid reasons for what you do.

The problems about residence you mention in your example should have programmatic solutions (madvice/mincore) but due to lack of usage, this API has never been fully developed to the point where you can prefault pages in a non-blocking fashion. At least not portably.

I consider this a deficiency in POSIX standards, not in the concept of VM.

I will fully agree with you that the kernel may not always do a stellar job at VM management. Linux' problems with overcommit are legendary by now, but you can work around them if you spend a little time.

You are not correct that the aging algorithm cannot be manipulated, but again, I will be the first to point out that the madvice() API is far from optimal for the purpose.

So to sum things up: I think you misconstrue POSIX standardization incompetence as Virtual Memory shortcomings and I think that prevents you from perceiving the full potential of VM.

@Baron: I think you are confusing two different papers, I suggest you actually follow the link before comnmenting on its contents :-)

@Andraz: I think both benchmarks and real-life behaviour has validated Varnish' design. I don't have the numbers at hand, but I belive the current synthetic speed record is north of 150 kreq/s on a single machine using less than 25% of its CPU.

Andraz Tori writes:
05 Oct 10, 19:52:38
@Poul-Henning Kamp
Yes, as I said, my experience (and our general experience at Zemanta) agrees with your approach.

I also find this article should mention what happens when kernel/VM caches your disk access and thus spends additional memory for the same data. Does Redis use O_DIRECT to mitigate this and does that approach work?
Torbjorn writes:
05 Oct 10, 19:55:59
@PHK Very interesting reply. I will go to varnish.org and locate some of the benchmarks.
05 Oct 10, 20:14:56
Poul-Henning, how come the blocking problem mentioned bei Salvatore doesn't affect Varnish? Or does it?
Poul-Henning Kamp writes:
05 Oct 10, 20:23:36

Oh it absolutely does affect Varnish.

But it generally (p>>.9) only blocks the single thread that faults the page, so only the single client that needs those bits is affected. All the other threads, often thousand of them, continue unaffected.

antirez writes:
05 Oct 10, 22:44:18
Thank you everybody for all this great comments! It's 1 a.m. here and I want to be less asleep before replying, I'll comment tomorrow morning. Thanks again.
05 Oct 10, 23:08:35
One solution to the problem of related data being scattered all over the memory is to compact items as they are accessed. Related items then end together and no single element keeps a OS page in use. For example, in the COLA data structure recently inserted items are kept together, so you could reinsert elements just read to keep them fresh.

Another avenue of research is data structures with the working set property, such as http://www.cs.au.dk/~gerth/papers/isaac10implicit.... ("A Cache-Oblivious Implicit Dictionary with the Working Set Property")
Sam Watkins writes:
06 Oct 10, 00:38:59
I think syncing to the network is a good idea. Supposing you want to ensure that some transaction is committed to disk, you have to wait for fsync. Fsync is much slower than a LAN ping or even internet pings within a single country, if you are well connected. (e.g. ping from my pi.nipl.net -> hu.nipl.net appears to average about 0.001 ms).

If you log transactions on several other servers, your reliability is much greater. Even if the main server suddenly explodes or catches fire, you still have all the committed transactions. RAID provides no such guarantees. It's extremely unusual for 3 servers in different datacenters to explode simultaneously - if it's a concern, get more servers.
startupgrrl writes:
06 Oct 10, 04:01:43
The *thread* is completely blocked, not the entire process. I stopped reading right there.
Vinay Sajip writes:
06 Oct 10, 07:26:05
@startupgrrl: I think that since Redis is single-threaded, it amounts to the same thing in this case.
antirez writes:
06 Oct 10, 08:37:14
@Poul-Henning: first of all thanks for replying to this blog post. I appreciate your article, what I'm criticizing is mainly the fact that it can't be generalized as applicable in all the cases.

Let's start with Redis being single threaded, the path we are taking is to build "Redis Cluster", this is where most of my coding efforts are spent currently. The idea is that anyway a computer is not enough and we need multiple instances in different hosts, so there is anyway to care about sharding data, fault tolerance, and so forth, so why not considering every core as a computer per se?

This means that Redis will run 48 instances in a 48 core CPU, and every instance will use 1/48 of the memory and resources. This way there is no contention at all, if not at hardware level (like memory bus and so forth) between the processes and this provides very good scaling properties. Testing it in the practice showed how four instances of Redis in a four core box is able to reach 4 times the request/second of a single instance.

This way not only we have zero contention between our processes but also the system is much simpler, as every synchronization problem is gone. As Redis manipulates shared data structures in complex ways compared to, for instance, a web server, avoiding the locks is a really big win. Running 48 instances of Redis, when will be common, can be made trivial from the point of view of the user supporting this in a specific way. main() will just open 48 unrelated threads that will run a whole instance each, without shared data, binding a different TCP port, and so forth. For now we are just doing this by hand but the upgrade path is straight.

About the VM, I understand how a better VM implementation and API may kill many of my concerns, but Redis needs to be pragmatic for now, targeting mainly unix systems in their current incarnations, especially Linux and *BSD.

I think that in the specific case of Redis even with a perfect POSIX API that will let us to control the VM in a very specific way, there are still problems, because of the fragmentation issue. A Redis list is composed of a linked list of many redis objects structures, that are cached instead of freed, and in general subject to complex manipulations over the time, insertions in the middle and so forth. We have even more fancy data types like sorted sets implemented using a skiplist and an hash table at the same time in order to guarantee specific time complexities. Every page is full of redis object structures belonging to a number of actual keys, it will be nearly impossible for the kernel to do a good job with the current hardware at least, and with the current allocation technology.

Maybe in the future things will change and a different programming style will lead to better results but for now I think the Redis design makes sense, like a threaded server serving big documents (compared to the page size) can do a stellar job using the kernel VM implementation, as Varnish is doing.
Nathan Kurz writes:
06 Oct 10, 10:26:21
Hi Salvatore --

It does sound like you have a real solid handle on the issues and the tradeoffs. You're right: if each object is scattered across hundreds of pages, there's no way that a OS is going to be able to handle caching. But this seems like an oddly self-made constraint. Wouldn't even a single-threaded server get much better memory utilization and throughput if each object was more compact?

It doesn't seem that hard to keep the objects localized. You could do as Bruno suggests above and reinsert the objects when modified, or you could treat each object as a pool allocated by the system and manage the suballocations yourself. Yes, and it has the side-benefit that you might be able to let the OS do the lifting at some distant point in the future.

In general, I think you might that efforts at in-memory compaction (optimizing for space) might continue to yield benefits much farther than you think. You can burn a lot of cycles moving bits around in the time it would take to hit main memory time and time again: http://duartes.org/gustavo/blog/post/what-your-com...
Andraz Tori writes:
06 Oct 10, 10:31:27

I have a practical question... How do you prevent the OS from caching the data that you read from files? Do you use O_DIRECT all the time and do absolutely all caching by yourself?
antirez writes:
06 Oct 10, 10:32:16
@Nathan: I think the article already addresses this part very well. We already do this, but this is possible only for small lists, sets and hashes. Check the files ziplist.c, zipmap.c and intset.c in the Redis master branch at github for more details.

It is not possible to implement complex data structures with good locality and with the same time complexity. Also in many applications keys and values are simply small, there is nothing to compact in a 200 bytes string :) and still a 4k page will hold many 200 bytes values.

And as I already stated in the article, if you can implement the Redis semantic with good memory locality and without a ton of more complexity, then you can do it writing directly on disk.

If there will be to rethink the design the right thing to do will be to try to get completely disk-backed and use the memory as a cache for the working set, not the other way around, as this approach also solves the persistence problem.
antirez writes:
06 Oct 10, 10:36:27
@Andraz: Redis memory pages about not swapped out values will never get swapped because as stated even with minimal traffic all the pages are touched.

About the swap file being cached by the OS: this is what *we want* actually. If we are using too much memory already and are near the limit the OS will not have memory to do so, but when there is some memory, it's a good think that the OS will cache our swap file, because the same data in the swap file is 5 to 10 times smaller. So this turns the Redis VM into an in-memory compression system for rarely used data.

And actually it is suggested to use a vm-max-memory setting that will let the system to perform some OS-level caching on the swap file... the performances will be better.
Andraz Tori writes:
06 Oct 10, 10:51:52

That means that the same data is stored in memory twice: once in uncompressed form by Redis, second time in compressed form by page cache.

That might be fine for Redis and it might be exactly what you want. However this exact waste of resources has hurt me twice in my history. The application could hold almost the whole working set in memory if that was the only thing memory was used for. But because some part of the dataset was cached by OS level cache, there was duplication of data and application had to swap way more than necessary. Using mmap (along with multithreading) fixed this in one situation, and in the other situation this was not really possible since it was too big code base to be turned around (and written in Java).

Interesting exchange of experiences nevertheless! :) Thank you for sharing your approach, it's interesting.
antirez writes:
06 Oct 10, 10:54:01
@Andraz: no sorry, the swap file only contains data that is no longer in memory. This swap file can be cached in memory as well if you have some available space for buffer, but this means, a copy on disk, a copy on RAM, that is ok IMHO if there is free ram.

Instead the memory contains data that is not stored on disk as the OS paging can't swap this memory since the pages are continuously touched, and anyway a mprotectall() call is cheap (but I assure you, not needed).
Poul-Henning Kamp writes:
06 Oct 10, 15:54:57
@Antirez: As I pressumed: You had thought about it :-)

I'm not sure I would go the "one process per core" route, but it is certainly feasible.

You preclude in-memory communication between your instances, all else being equal that is two orders of magnitude faster than having the kernel pass packets between processes.

You also incur some, possibly marginal, cost in the VM hardware, because you switch between 48 different address-spaces rather than stay in the same one.

But of course, at the same time, you gain robustness to single failuers, provided your network protocols are well designed etc. etc.

kenn writes:
06 Oct 10, 20:39:05
@Antirez: It seems that you have a solid idea about the topic. I learned a lot about the background of the decision that you made, and now I can understand Redis better. Thank you.

However, this conversation left some questions lingering in my head.

As a database admin (and as a programmer who uses a laptop for development where the free memory is always near-zero out of 4GB. Thanks Apple, for making everything 64bit), I need full control of memory usage.

For instance, with RDBMS, "server sizing" is easy as pie - just calculate "global_buffer + (per_connection_buffer * number_of_clients)" and you're done. If it's estimated as 10GB, it's in fact capped at 10GB, and I can safely predict that it will remain at 10GB after a year of operation. Peace of mind.

With Redis VM, we have vm-max-memory. But it doesn't really solve the problem, because the memory usage of keys is still unpredictable. I think the controllability of memory usage is a critical feature of database products, if you think Redis as one. Otherwise, sooner or later we all end up with OS swap.

Let me explain why - as I said in the comment on the previous post, on Linux, OS swap occurs not because you run out of memory, but because the OS tries to cache pages as much as possible, up until it takes up all the physical memory, then it starts to think the page cache is more important than the database process. Unless you do something like mlock(). vm.swappiness=0 doesn't completely stop that behavior, so it happens no matter how much memory you have, in the long run. Sounds silly but it's the reality. You say all the keys are kept in-memory with VM, but that's not true - OS will push the less frequently touched keys to disk, and trying to read them will block Redis as hell.

By the way , the "server sizing" is even possible with memcached, with "-m 64m" option - memcached abandoned malloc/free for the slab allocator, because malloc caused hefty fragmentation and at some point as the dataset grow the heap manager took more CPU than memcached itself. The slab allocator has its own downsides, but at least you might see there are occasions where we can't just say that memory is fast enough to do anything.

I don't think the latency of memory (and the overhead of memory management) should be underestimated, it's for real with a large dataset. CPU is idling because the bottleneck is RAM, which is a good thing. On a cache server of a significant dataset, if the CPU is saturated, it usually is a bad sign.

Here's another question - while I think being single-threaded is great, Redis is already threaded - with "vm-max-threads". If the blocking OS paging is the problem with mmap, couldn't you approach the same way? That is, with a read/write thread that handles blocking IO and let the main thread know when it's ready. I think it would make Redis resilient and robust under any conditions, but I'm not sure if that works with Redis and would like to know what you think.
kenn writes:
07 Oct 10, 00:03:59
@Poul-Henning: The idea behind Varnish sounds awesome, as much as Redis does in a different way.

One question - what kind of situation is it where the CPU usage gets near 100% at the userland? I thought using worker threads + job queue was excellent to maximize concurrency, not necessarily (but as a fortunate side-effect) to utilize more CPU cores. I'm curious in what kind of CPU arithmetic occurs where the RAM latency / throughput is the dominant factor.
Poul-Henning Kamp writes:
07 Oct 10, 10:07:21
We don't know: I have yet to see a CPU saturated Varnish server.

Typically a cache-hit is delivered with 7 system calls: A read, four timestamps a write and a setsockopt()

It's not like it is a hard job, you just have to avoid the threads stomping on each other.

kenn writes:
07 Oct 10, 17:10:43
@Poul-Henning: That's what I thought - thanks for clarification.

Oh, and I also agree with you that madvise() or posix_fadvise() are not good enough as Linus hoped. It may be because of "a totally broken interface (O_DIRECT) resulting in better interfaces not getting used" as Linux said, but it may be because "people disliked adding complexity for no guarantee." But going back to O_DIRECT or raw partitions DOES sound like we're in 80's. I don't know what's the 2006 way - do you?
Nathan Kurz writes:
10 Oct 10, 05:02:14
"It is not possible to implement complex data structures with good locality and with the same time complexity. Also in many applications keys and values are simply small, there is nothing to compact in a 200 bytes string :) and still a 4k page will hold many 200 bytes values. "

Well, that's sort of my question. What constitutes too much complexity? What's the proper trade off between memory and processor. I'd say that if your memory is full and your processor is at partial capacity, you have room for more algorithmic complexity.

If you can compress that 200 byte string down to 100 bytes, might this be a win? Uncertain, but interestingly it's a win to store integers compressed when building large in-memory inverted indexes for search: http://www.ir.uwaterloo.ca/book/addenda-06-index-c...

Similarly, you are right to worry about how many things items are 'hot' per 4K page, but I'm not sure that it's impossible to keep this ratio high. Might it be a win to rewrite each object in its entirety on every write just to keep it localized, with the reallocation algorithm rigged to cluster together the hot and cold objects?

You're obviously already thinking about these matters. I'll stop bugging you so you can keep writing great software!
antirez writes:
10 Oct 10, 10:16:36
@kenn: I'll try to reply to your questions.

1) For the memory concerns, there is simply to size things accounting for the number of keys currently. With Redis cluster as you can guess this is going to be much simpler as if you have solid monitoring systems (and you need them, see what happened to Four Square) when we are near to the memory limits all it's needed is to add a new server and wait for the auto-sharding to complete.

Anyway I hope that very little users will use VM, it is an interim solution. 2.2 uses a lot less memory than 2.0 for many kind of datasets, and when there is a strong bias between the working set and the data set size I suggest using Redis as an "intelligent" cache and an SQL DB for storage of bulk data.

For intelligent cache I mean, that thanks to the data structures exported to Redis it is possible to perform writes against both the cache and the DB to take everything in sync.

When the application is write-heavy RAM is the way to go anyway.

2) About your OS swapping concern, as already stated, never seen an instance even remotely used by clients that will get pages swapped. This is for the memory layout, and is a proof of how badly our system could work with the OS swap and a 4k page size.

But again, I think I'll add an mprotectall() call in Redis 2.2 final.

3) About the latency of memory when there is cache misses, data structures of arbitrary size of elements are completely designed for this environment, so it's better to think at cache hits just as a gift you can get from time to time. In the memory model assumed in a CS book when talking about a binary tree there is this implicit idea most of the time.

When it's possible it is cool to have more memory locality, but there is a tention between fragmentation, locality, and predictable and guaranteed average case time complexity. This is why it's so hard to implement even a simple linked list on disk with good locality that will work against every kind of insertion/deletion pattern.

I bet there is even some proof in some CS book that it is practically impossible to implement N linked lists or hash tables on disk, with random updates, while ensuring a good locality with all the access patterns. It's something that is more related to physics than computer science.
01 Nov 10, 17:16:14

Actually, I do agree with phk here. In addition to that I did have a look at redis concepts and issues, and I think that what is at stake here is that the data model is wrong: hash-tables are very bad cache-wise, and also suffer from all the drawbacks you cite which force you to write your own layer of VM.

But actually there are cache oblivious data structures that you could use, that have very decent lookup times (not unlike hash-tables at all), and allow clustering of coherent data in the same VM pages. I'm just assuming that keys starting with very common prefixes point to data that is often accessed together (which sounds reasonnable, and which redis users can use in their data models also).

I'm naming, HAT-Tries[0]. As a bonus, it's way more compact that h-tables for many data-sets. I've sent you a mail about that with more details to your @gmail address, I hope you've received it.

Oh and a huge advantage of using the OS VM that phk didn't mentionned is that when you use paging, if pointing to a page is enough then pointers are only 32bits even on 64bits hosts, because the page uses 12 bits (4k) and that with a simple translation table you can number the pages on 32bits and be able to address 34bit worth of data which yields a very decent 4To of RAM, which should be way enough for everybody (err I did hear something like that once about 640ků famous last words) while not penalizing 64bits archs at *ALL*.

All in all, I think that what phk said in this article is just proven true here again: using the same old datatypes from CS-101 is just wrong on modern hardware. He's a tad hard with pretending that "modern hardware" starts with a vax, but clearly for huge data-sets, plain old hash-tables (even very fancy ones) aren't cache-concious, you can't really use divide-and-conquer on them, do partial updates and so forth. With hat-tries, since it's a *tree* as soon as you work in separate sub-trees you can parallelize, divide-and-conquer, and more importantly, data is clustered and/hence cache-efficient.

[0] http://crpit.com/confpapers/CRPITV62Askitis.pdf
01 Nov 10, 17:19:25
Okay, errata: I meant 44bits worth of RAM, not 34 of course (but 4To is correct). Note that early amd64 only used 40bits worth of addresses, and nowadays use 48, so we're not even spoiling a lot of space anyway ;)
Navigatore Anonimo writes:
01 Nov 10, 19:06:59
@Pierre: I fear you missed an important point about our reasoning. We don't have just dictionaries, but also lists and sorted sets. Also all the operations associated should run with the same time complexity on disk, and still be cache obvious.

As a plus (but an absolutely required one) when the dataset fits completely in memory, that is, 95% of Redis use cases, the operations (for instance ZRANK and ZADD when updating) in this new cache obvious, on-disk, data structures, should run using the same CPU and in the same time (or less) than they currently run in Redis with the in-memory implementation.

With all that, then we can switch to disk, and I could be very happy about this.

01 Nov 10, 20:39:36
I haven't missed that. I'm not saying Hat-tries (or B-trie as described in [0]) are a silver bullet. I'm well aware of the fact that you have lists and so forth.

What I'm saying is that unless I'm completely mistaken, the key space in redis is a huge h-table. I'm saying that using a hat/b-trie for it will bring:
- size improvements over h-tables;
- no significant performance loss;
- data clustering.

What I'm hinting wrt disk-based storage is this. Instead of malloc()ing or anonymous mmaps, you instead allocate half gig (a number that I've found beeing nice enough in various similar code I've written) files on disk and you mmap them. You just need a page allocator (I recommend tlsf-based[1] allocators) that dispatches those memory pages.

If you're smart enough, which I don't doubt, you'll easily have hat/b-trie dispatch nodes fit exactly in a few pages (probably 1, 2 or 4 depending on your fanout). For data leaves, well there's a few cases: either the data is a scalar, then you will encode it inside the data leaf. Note that integer/doubles updates is cheap (a simple overwrite works), but string data aren't because you may have to memmove stuff (but a malloc based solution isn't any better as for small stuff realloc() will anyway memmove stuff too for most of the cases). If your data isn't a scalar but one of your complex dataset then here is how I would write that:
- hashes, well, it's just a new hat/b-trie;
- sets: again, a hat/b-trie where the key is associated with an empty data;
- zsets are more complicated in the sense that it's easy to implement as two hat/b-trie: one mapping "elements" values to their priority, and another mapping the priorities to corresponding values, but it's likely to be a wasteful implementation. So let's say that for this one, when the data is a zset, then you associate the pointer to your current zset implementation if it's less wasteful;
- lists, well, lists are easy to represent inside a list of pages with list nodes allocated continuously in the pages. You can even maintain a small header per page (or page cluster when the values are large) that says how many values are in this page/page cluster which makes a nice skip-list, you can also know which size is occupied, and when you do stuff like removes or similar, you merge adjacents pages like a b-tree does: you consider your left and right neighbour, and see if the three of them can be merged in two pages or not, which ensures less than a 33% waste. Of course this is a tad slower than linked lists, but you gain data clustering and well, LREM is slow already anyway (compared to LPOP or similar).

Once the zset issues are resolved then well, you just need to forget about your VM code because the kernel will just do what you want for free.

So no I've not overlooked anything I think :)

WRT your remarks wrt ZRANK and ZADD, my naive implementation using two hat/b-trie will have the same complexity (O(n) for n being the number of elements in the set) but very likely with a smaller O, for ZADD, well, I don't know your current implementation at all, so I can't tell, but it won't probably be very different.

I think what you miss is that what I say is that you can gain an efficient disk-based implementation relying on the kernel solely using:
- the right data structures;
- an allocator that splits a file-based mmap into pages served to those data structures.

Actually the nice thing is that what I propose is to use data structures that are more cache oblivious than yours (what phk heaps are to standard heaps), and that such data structures naturally use nodes that are very suited to be stored in pages. And as soon as you have pages, then you fit in the standard kernel-VM model for free. It's not really the other way around. Being able to be disk-backed here is just a neat-cheap-easy side effect, nothing more.

FWIW, I sadly can't use redis for my needs because I need predicates on columns (meaning that I need stuff like: please return me the list of keys which are associated with the value "v" which redis isn't really made for), so well, I was just giving some ideas to improve redis because I've liked its simplicity, and having written/contributed code for in-memory databases too, I kind of have some experience to share. I won't be vexed if you say I'm stupid (maybe just a little sad okay) :)

And of course I'm aware that it's kind of a major rework of redis as basically it throws away most of your data structures, hence implies rewriting most of the operations and so forth.

[0] http://goanna.cs.rmit.edu.au/~naskitis/Nikolas_Ask...
[1] http://rtportal.upv.es/rtmalloc/
01 Nov 10, 20:47:31
okay, again errata:

complexit for ZRANK will obviously be O(log(n)) not O(n)

And also, of course this is a huge overhaul, but you can do it incrementally: replace the root key-space dictionnary with a hat/b-trie, then use it for sets and hashes where it's simple to do so, and just keep pointers to your current implementations for the other data types (zsets and lists) until you've worked out sane, fast, paginated implementations for them :)

In my experience, a simple tlsf disk-based page allocator (and mine has nice properties à la munmap that allows you to make holes in your allocated pages) is 460slocs (702 lines with comments). Hat/b-trie are more tricky and will likely require 2kslocs (i've only very specialized ones to look at right now where keys are of fixed size, which allows huge simplifications in the code and those are under a ksloc). So I'm not talking about anything unrealistic at all :)
Marcus writes:
27 Apr 11, 21:38:01
@antirez I'm familiar with how your data structures and VM usage are implemented, but if you're not doing this already, could you not have a hybrid solution, which could work something like this:

- have a large (or multiple) mmap()ped file(s) for persistent storage
- dedicate some of the pages of this(these) file(s) to *active* data and some to *inactive* data
- monitor the usage of objects in the cache
- for objects that are not frequently used, move them from an 'active page' to an 'inactive page'

Then just leave it up to the OS to swap these inactive pages to disk, which is much more likely to happen if all the data on an 'inactive page' is very rarely used.

You would obviously need to keep track of which objects (or perhaps parts of objects) are being accessed, and you might need to move some files from the inactive pages back to active ones if they start being used regularly again, but you'd do that anyway if you're doing your own swapping implementation.
Marcus writes:
27 Apr 11, 22:02:32
Oops, meant to say not familiar.
comments closed