A few key problems in Redis persistence

Saturday, 02 October 10
Today I read this article about Redis with interest. It's a legitimate point of view on databases: you may agree or not, but the author has a solid idea of what he is looking for and of the database scene. I don't agree with everything, but there is something I especially like about this article: it identifies very well what I think are the real strength and deficiency of Redis.

In short: the strength is the data model, and the deficiency is the persistence.

Fancy and Persistent

We want two things that are hard to play well together:
  • Programmers want complex data types, the ability to use 60 years of computer science not just in the language, but in the database too. I think that even if Redis becomes obsolete in six months and I quit programming to do molecular cuisine, one point will have been made: plain key-value is not a cool world to program inside. Having real data structures and atomic operations is a completely different level of abstraction.
  • We want good persistence.


Why is it so hard to make these things play well together? Well, first of all, the direct approach does not work well, that is, implementing complex data types and atomic operations directly on disk. It's hard, and slow at best.

We took the other direction, also for additional reasons, like the guess that memory is the disk of tomorrow. So our data is in memory. The problem is that we need the same data to be on disk at the same time, with point-in-time snapshot semantics. Our disk snapshot may or may not be realtime, but one property must be guaranteed anyway: the copy on disk should represent the exact dataset that was in memory at a given time. Without this, even the basic consistency requirements are completely broken.

A hard-to-escape rule

So, how do we take a snapshot on disk with this property, while at the same time doing it in a non-blocking way? There is... more or less a single way, conceptually:
  • Iterate the keys, and write every key on disk
  • But since you are doing this in a non-blocking way, that is, while the server is also accepting queries, you also need...
  • To track every time a key changes. That is, if we are snapshotting and there is a write against a key that has not already been transferred to disk, we need to make a copy of the old value (or remember that the key did not exist at all), and use this old copy when we write this key to disk.
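The steps above can be sketched in a few lines of Python. This is a toy illustration only, with all names invented: it tracks, by hand, the pre-snapshot value of every key written while the save is in progress (Redis itself delegates this bookkeeping to the kernel, as explained below). Deletions during the save are left out to keep the sketch short.

```python
class SnapshottingStore:
    """Toy key-value store with a non-blocking, point-in-time snapshot."""

    def __init__(self):
        self.data = {}
        self.saving = False
        self.saved = set()   # keys already transferred to the snapshot
        self.backup = {}     # pre-snapshot values of keys changed mid-save

    def start_snapshot(self):
        self.saving = True
        self.saved = set()
        self.backup = {}
        self.pending = list(self.data)   # the point-in-time key set

    def set(self, key, value):
        # First time a not-yet-saved key changes during the save,
        # copy its old value aside before overwriting it.
        if (self.saving and key in self.data
                and key not in self.saved and key not in self.backup):
            self.backup[key] = self.data[key]
        self.data[key] = value

    def snapshot_step(self, out):
        """Transfer one key into `out`; return False when done."""
        if not self.pending:
            self.saving = False
            return False
        key = self.pending.pop()
        # Prefer the saved old value, else the current (unchanged) one.
        out[key] = self.backup.pop(key, self.data[key])
        self.saved.add(key)
        return True
```

Note how `backup` is exactly the memory overhead described next: in the worst case it ends up holding a copy of every value in the dataset.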


If you think about this for five seconds you'll realize you need an amount of memory proportional to the changes happening to the dataset while the save is in progress. That is, in the worst case, up to two times the memory needed for the dataset (if all the keys change while saving, in the worst order).

Redis implements the above algorithm by letting the kernel do the work for us, that is, via copy-on-write of pages. What we do is fork the master process and start a save in the child process. Every time the dataset receives writes, a few pages change in memory, and the kernel duplicates them so that the child keeps seeing the old data. At least this was trivial to implement :) and very bug free, which is an important thing in a database system.
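The fork() trick can be demonstrated in a few lines, here as a toy Python sketch rather than the real C implementation (the file format and names are invented; Redis writes its RDB format, not JSON). The child's view of memory is frozen at fork() time, so writes the parent performs afterwards never reach the snapshot:

```python
import json
import os
import tempfile

def background_save(data, path):
    """Fork and let the child serialize a point-in-time copy of `data`.

    The kernel only duplicates the pages the parent dirties after the
    fork (copy-on-write), so the child keeps the pre-fork view for free.
    """
    pid = os.fork()
    if pid == 0:                      # child: sees the dataset as of fork()
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(data, f)
        os.rename(tmp, path)          # atomic swap, as Redis does
        os._exit(0)
    return pid                        # parent: keeps serving writes

if __name__ == "__main__":
    data = {"foo": "bar"}
    path = os.path.join(tempfile.mkdtemp(), "dump.rdb")
    pid = background_save(data, path)
    data["foo"] = "changed"           # parent write after the fork
    os.waitpid(pid, 0)
    with open(path) as f:
        print(json.load(f))           # {'foo': 'bar'} -- the pre-fork value
```

The price is the same as before, just paid lazily: every page the parent touches during the save costs one duplicated page.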

The AOF way

If taking a live point-in-time snapshot of the database is hard, can't we instead keep a journal of all the operations performed, and re-play such a journal in order to rebuild the state?

Sure, this works well, and it's pretty fast in practice too, even if it is disk bound, but we have a few problems even with this system:
  • The AOF file size is proportional to the number of writes. It will get bigger and bigger.
  • The bigger the AOF file is, the longer the server will take to restart.
  • So we need a way to compact the AOF from time to time. Again in a non-blocking way, while the server is running and receiving read and write queries.


What Redis does in order to compact the AOF is rewrite it from scratch. This means doing something very similar to writing the point-in-time snapshot, just in a format that happens to be a valid sequence of Redis commands. So Redis does not read the old AOF in order to rebuild it. It instead reads what we have in memory to write a perfect (as small as possible) AOF from scratch. When the new AOF is in place, we perform an atomic rename syscall, swapping the old with the new.

This is done in the child process, again. It's basically exactly the same problem as the point-in-time snapshot, but with the additional problem (that is easy to fix) of accumulating the new queries while the AOF rewrite is in progress, so that before swapping the old file with the new one, we can also append all the new operations accumulated in the meantime.
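Putting the last two paragraphs together, the rewrite looks roughly like this toy sketch (the command format and function names are invented for illustration; the real AOF contains the Redis protocol, and the buffer handling involves the parent/child split described above):

```python
import os

def rewrite_aof(dataset, rewrite_buffer, path):
    """Rewrite the AOF from the in-memory dataset, not from the old AOF.

    `rewrite_buffer` holds the commands that arrived while the rewrite
    was in progress; they are appended before the atomic rename, so the
    new file is complete the instant it replaces the old one.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        # A "perfect" AOF: one write command per key, nothing else.
        for key, value in dataset.items():
            f.write(f"SET {key} {value}\n")
        # The operations accumulated during the rewrite.
        for command in rewrite_buffer:
            f.write(command + "\n")
    os.rename(tmp, path)              # atomic swap old -> new
```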

Guess what: we have the same problems here. Up to 2 times the memory required while writing the AOF file. This seems like a waste. Aren't there better ways? Many other systems are doing smarter things... we should try harder.

It's not so trivial

Ok, let's start with an idea that is common to other systems with plain key-value semantics. For instance Bitcask is using it, and with good reason: it's a neat idea.

That is... instead of dealing with a single huge AOF file, we can segment it into small pieces, in different physical files: AOF.1, AOF.2, AOF.3, and so forth. Every time the AOF gets big enough, we open a new file and continue.

This way we can compact the old AOF files that are no longer actively used in the background, maybe even with a command line tool, so that we'll have just a single Redis process that never forks. It will just write to the latest AOF file, the active one, while the command line tool continues compacting the old AOF files into a single one. Cool, right?

How to compact old files?

If we have a plain KV system we have just two write operations: SET and DEL.

In order to compact the old AOF files I can do the following:
  • Read all the AOF file entries in sequence. Every time I encounter a SET, I write this entry into an on-disk index, like a B-tree or similar. What I need to do is map every key to the file and offset of the last time I saw a SET for this key.
  • If I encounter a DEL, I'll just remove the key from the temporary index I'm building.
  • At the end of this process I take my temporary index, and write a new AOF key by key, using the offsets stored in the index.
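The compaction pass above can be sketched like this. It's a toy in-memory model (a real tool would keep the index on disk, in a B-tree, and seek into the segment files by offset); segments are just lists of entries here, oldest first:

```python
def compact_segments(segments):
    """Compact AOF segments for a plain SET/DEL key-value model.

    `segments` is a list of segment files, each a list of
    (op, key, value) entries, oldest segment first. The index maps
    each live key to the (segment, offset) of the last SET that
    touched it; a DEL simply drops the key from the index.
    """
    index = {}
    for seg_no, segment in enumerate(segments):
        for offset, (op, key, value) in enumerate(segment):
            if op == "SET":
                index[key] = (seg_no, offset)
            elif op == "DEL":
                index.pop(key, None)
    # Emit the compacted log, one entry per surviving key, fetched
    # through the offsets stored in the index.
    return [segments[s][o] for s, o in index.values()]
```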


This works because SET and DEL are idempotent operations. What wins is the latest SET or DEL entry for a given key; all the old ones can be discarded! I can have an infinite number of operations against a key, but once I see "SET foo bar" I don't care about all the previous operations: the value of foo is bar.

Cool, but it does not work

This does not work when you have complex operations against aggregate data types.

To start, in order to even parse the AOF with the command line tool, this tool needs to be Redis-complete. All the operations would have to be implemented, for instance intersections between sets, otherwise what will I do when I encounter a SUNIONSTORE operation?

Also, in Redis most operations are not idempotent. Think about LPUSH for instance, the simplest of our list write operations. The only way to turn list operations into idempotent operations is to turn all of them into an MKLIST <key> ... all the elements ... operation. Not viable at all.
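A tiny replay sketch makes the difference concrete (the two-command log and the `replay` helper are invented for illustration). For SET, keeping only the last entry per key reproduces the state exactly; for LPUSH it silently loses data, because the final list depends on the whole history of pushes:

```python
def replay(log):
    """Replay a tiny command log; LPUSH prepends, SET overwrites."""
    state = {}
    for op, key, value in log:
        if op == "SET":
            state[key] = value
        elif op == "LPUSH":
            state.setdefault(key, []).insert(0, value)
    return state

full_log = [("LPUSH", "mylist", "a"), ("LPUSH", "mylist", "b")]
print(replay(full_log))          # {'mylist': ['b', 'a']}
# Keeping only the last entry per key (valid for SET) corrupts the list:
print(replay(full_log[-1:]))     # {'mylist': ['b']} -- 'a' is lost
```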

Our command line tool, at best, will be able to exploit SET and DEL operations in order to reduce the size of the file, but it will just lose against an always-updated sorted set.

Conclusions

Paying 2x the memory in the worst case may not be so bad at this stage, because it is not trivial to find a really better solution given the Redis data model. Does this mean we'll never improve this part of Redis? Absolutely not! We'll try hard to make it better, for instance possibly using a binary AOF format that is faster to write and to load, and more compact. We may write a command line tool that is able to process the AOF in multiple passes, dealing with only a subset of the keys at every pass in order to use less memory (but we have inter-key operations on aggregate data types, so this may only work well if such operations are not used). For sure we'll keep investigating.

Possibly new ideas will emerge. For instance, currently I'm working on Redis Cluster. With the cluster it is possible to run many instances of Redis on the same computer in a transparent way. Every instance will save just its subset of the data, which can be much smaller than a single giant instance. Both snapshotting and AOF log rewriting will be simpler to perform.

Redis is young and there are open problems, but this is an interesting challenge I want to take ;)
Posted at 12:02:59

Comments

Matt Siegel writes:
02 Oct 10, 15:05:33
Hi Salvatore :)

Re-writing the AOF by breaking it into multiple segments sounds like a good idea.

Since the main problem is dependencies between keys, then perhaps pre-scanning the original AOF to build a dependency graph will help determine which keys can be separated.

There will be some degenerate cases that prevent good separation, but I suspect those users have bigger problems to worry about ;)
Matt Siegel writes:
02 Oct 10, 15:24:45
oops: to be clearer, my comment is similar to your Cluster idea, in that it divides the key space... not dividing the operations by order in time. enjoy your weekend :)
02 Oct 10, 17:42:02
I think it is important to differentiate between worst case performance and average case. In the average case you will need far less than 2x space for COW data, especially if you switch to managing your own COW process. If you are able to maintain a cursor into your data as it is snapshotted you can always know what data needs to be backed up before being dirtied. The window of data that needs to be backed up rapidly shrinks as the snapshot progresses. A basic 7.5k SATA drive can write 80 megabytes a second so snapshots can progress quickly if you saturate the disk. I imagine it gets more complicated once you bring Redis VM into play.

I think you should find out what it would take to generate a pathological workload before deciding additional improvements are necessary.

Also, COW data may be more space efficient (possibly more than the original data) because you can allocate memory in large blocks and then allocate linearly from those without the gaps that would be introduced by allocating memory in odd quantities off the heap.
kenn writes:
02 Oct 10, 17:45:53
Hi Salvatore,

Thanks for your link!
I totally agree that the awesomeness of Redis comes from keeping the code base simple and clean. The bug-free-ness as a result is a "feature" of Redis, I think. :)

Do you have any comments/ideas on the Virtual Memory part, to handle larger dataset? That part was the real concern in my post.

It would be really hard to do those things well, especially while keeping things simple, but I guess moving from malloc to mmap/msync wouldn't be so bad? However you wouldn't be able to simply use cp command for backup... atomic rename of file would become atomic page swap (shadow paging) and yeah any ways from there would complicate things a lot. Just for food for thought.
Guy Naor writes:
02 Oct 10, 19:20:13
This and the blog post you refer to are probably among the most level-headed and smart discussions around persistence in databases. And it can help others understand the tradeoffs between persistence and performance. I also hope it will help people understand that there isn't just one single tool that can solve all problems. As developers and architects, what we need is a good understanding of the advantages and tradeoffs of the tools we are using.
Nathan Kurz writes:
03 Oct 10, 08:22:54
Hi Salvatore --

Great response to a good article. I appreciated your line about 'molecular cuisine' because it turns out that I have pretty much quit programming to do just that. But before that, I thought quite a bit about some similar problems. I think Kenn is right that you should spend a lot of time thinking about how you can use the OS to do the heavy lifting for you. Your reliance on COW for snapshots is brilliant --- I think that there are equally brilliant solutions with mmap() as he suggests.

It's possible you're ahead of us, and have already researched and eliminated this approach. But there are a couple things in your VM article that leave me wondering. You talk about saving memory by selectively swapping values to disk. How do you convince the OS not to cache the page you are trying to write, so you don't make your memory problems worse? It's possible you have a rock-solid grasp on this, and don't need the prompting, but I want to make sure you realize that this is at the crux of Kenn's argument.

Also, if it's true that your on-disk representation is 1/10th the size of your in-memory, I think you've got to be doing something wrong. Or viewed more positively, it looks like you still have room for a 10x improvement in how much data a Redis instance can handle! While you are justifiably proud of your low CPU usage, realize that one could also view that as an underutilized resource. Yes, memory is the new disk --- and hence you don't want to be pushing 10x the necessary data back and forth every time.

Anyway, once again, great response! But since I'm occupied with sorbet, I humbly encourage you to make sure you're not overlooking a way to work with the OS rather than against it. Yes, it would be tricky to get an mmap() solution snapshotting properly, but if you can make it over that hurdle everything else suddenly becomes a whole lot easier.
03 Oct 10, 11:22:15
Hi there!

Here at GenieDB, we faced the same design decision back in the early days, and as persistence was a bigger requirement for us, we designed for only idempotent operations. This has led us to a radically different architecture to Redis, with replication to a number of on-disk copies of the database. The biggest pain we feel is that we'd love to have FIFO queues, as they're very useful! Our indexing system is rich enough to give us the equivalent of hashes - but those queues would be lovely. We are jealous indeed of Redis' rich data structures!

However, don't forget that Redis is, after all, implementing its lovely data structures on top of RAM, which is just a bunch of fixed-sized words with idempotent write operations. The key thing being that Redis "transactionally" updates the multiple memory cells that comprise a data structure to perform atomic operations on them, to hide this fact.

So, with regards to the issues of persisting such a data model in real time, you could do it by logging your updates in terms of some simpler lower-level data model. Perhaps turn it all into hashes, and treat a list as a hash mapping an arbitrary index to a value (and handle appending to each end of the list by just inserting a hash entry with an index one lesser or one greater than the current min/max), treat sets as hashes mapping values to an empty value, etc (I've not gone through the entire Redis data structure/operation list to make sure this particular approach would work, but you get the idea). Then you could treat them as idempotent SET/DEL operations on hashes!

This could work for Redis, based on my somewhat shallow idea of how Redis works internally; sadly, it won't work for us at GenieDB, as we have multiple nodes working away independently so have nearly no scope for globally arguing as to what the "highest" or "lowest" index used in a FIFO is, and would still be presented with the problem of agreeing which client gets which entry from the head of the FIFO if everyone tries to pull from it at once :-( So our research is continuing, in a somewhat different direction...
kenn writes:
03 Oct 10, 21:04:38
Nathan,

Thanks for your clarification, I think you've added a great perspective to the heart of the discussion.

In particular, I totally overlooked the "double cache" aspect of VM. Yes, that's a real problem indeed: the swap file could be duplicated in the page cache of the OS too, unless you use O_DIRECT or similar to bypass it. And unfortunately, the page cache is not as transparent as it sounds or should be. Linux is notorious for that: the page cache can cause user processes to swap (no way!), even though plenty of real memory would be available if the page cache gave up some. And the worst part? vm.swappiness=0 doesn't really help.

For the record, I'm not suggesting using O_DIRECT, quite the contrary, I'm against it, not only because Linus said so (http://kerneltrap.org/node/7563), but because it makes the process a bad citizen. So bad that you don't want to put anything else on that machine.

As to the on-disk representation being 1/10th the size of the in-memory one, I think it is justifiable in that there is a lot of additional metadata in memory. For instance, if you query SCARD on a set, it doesn't scan all the elements of the value but reads the count from the metadata instead, achieving absolute O(1) performance. Consider it a smart, pre-calculated cache, makes sense? I love it, because we have suffered from InnoDB causing a full table scan on SELECT COUNT(*) FROM TABLE, unlike MyISAM. I know this seemingly easy thing is not easy at all for a database with full MVCC support. I appreciate that Redis has such nifty features in its metadata.

That said, I imagine that if Redis went with mmap approach, those metadata could also be simply persisted, too, as having the same byte representation both on memory and on disk.
Vinay Sajip writes:
05 Oct 10, 14:16:41
This post and the article which prompted it made me think of what Poul-Henning Kamp, architect of Varnish, had to say about similar issues:

http://www.varnish-cache.org/trac/wiki/ArchitectNo...

Does Redis not already use the approach described therein? I thought it did.