Comments for post A few key problems in Redis persistence
valentine flower UAE writes: Very Informative And Interesting Blog to read very nice And Happy Valentine
Vinay Sajip writes: This post and the article which prompted it made me think of what Poul-Henning Kamp, architect of Varnish, had to say about similar issues:
Does Redis not already use the approach described therein? I thought it did.
kenn writes: Nathan,
Thanks for your clarification, I think you've added a great perspective to the heart of the discussion.
In particular, I totally overlooked that the "double cache" aspect of VM. Yes, that's a real problem indeed, the swap file could be duplicate on the page cache of the OS too, unless you use O_DIRECT or alike to bypass that. And unfortunately, the page cache is not so transparent as it sounds or should be. Linux is notorious in that, the page cache could cause user processes swap (no way!), even though the real memory are plenty available if the page cache give up some. And the worst part? vm.swappiness=0 doesn't really help.
For the record, I'm not suggesting to use O_DIRECT, quite contrary, I'm against it, not only because Linus said so (http://kerneltrap.org/node/7563), but because it entitles a bad citizenship to the process. So bad that you don't want to put anything else on that machine.
As to on-disk representation being 1/10th the size of in-memory, I think it is justifiable in that they have lots of additional metadata. For instance, if you query SCARD on a set, it doesn't scan the entire elements in the value but read from the metadata for the count instead, achieving absolute O(1) performance on it. Consider it as a smart, pre-calculated cache, make sense? I love it, because we have suffered from InnoDB causing a full table scan at SELECT COUNT(*) FROM TABLE, unlike MyISAM. I know this seemingly easy thing is not easy at all for a database with full MVCC support. I appreciate Redis has such nifty features on metadata.
That said, I imagine that if Redis went with mmap approach, those metadata could also be simply persisted, too, as having the same byte representation both on memory and on disk.
Alaric Snell-Pym writes: Hi there!
Here at GenieDB, we faced the same design decision back in the early days, and as persistence was a bigger requirement for us, we designed for only idempotent operations. This has led us to a radically different architecture to Redis, with replication to a number of on-disk copies of the database. The biggest pain we feel is that we'd love to have FIFO queues, as they're very useful! Our indexing system is rich enough to give us the equivalent of hashes - but those queues would be lovely. We are jealous indeed of Redis' rich data structures!
However, don't forget that Redis is, after all, implementing its lovely data structures on top of RAM, which is just a bunch of fixed-sized words with idempotent write operations. The key thing being that Redis "transactionally" updates the multiple memory cells that comprise a data structure to perform atomic operations on them, to hide this fact.
So, with regards to the issues of persisting such a data model in real time, you could do it by logging your updates in terms of some simpler lower-level data model. Perhaps turn it all into hashes, and treat a list as a hash mapping an arbitrary index to a value (and handle appending to each end of the list by just inserting a hash entry with an index one lesser or one greater than the current min/max), treat sets as hashes mapping values to an empty value, etc (I've not gone through the entire Redis data structure/operation list to make sure this particular approach would work, but you get the idea). Then you could treat them as idempotent SET/DEL operations on hashes!
This could work for Redis, based on my somewhat shallow idea of how Redis works internally; sadly, it won't work for us at GenieDB, as we have multiple nodes working away independently so have nearly no scope for globally arguing as to what the "highest" or "lowest" index used in a FIFO is, and would still be presented with the problem of agreeing which client gets which entry from the head of the FIFO if everyone tries to pull from it at once :-( So our research is continuing, in a somewhat different direction...
Nathan Kurz writes: Hi Salvatore --
Great response to a good article. I appreciated your line about 'molecular cuisine' because it turns out that I have pretty much quit programming to do just that. But before that, I thought quite a bit about some similar problems. I think Kenn is right that you should spend a lot of time thinking about how you can use the OS to do the heavy lifting for you. Your reliance on COW for snapshots is brilliant --- I think that there are equally brilliant solutions with mmap() as he suggests.
It's possible you're ahead of us, and have already researched and eliminated this approach. But there are a couple things in your VM article that leave me wondering. You talk about saving memory by selectively swapping values to disk. How do you convince the OS not to cache the page you are trying to write, so you don't make your memory problems worse? It's possible you have a rock-solid grasp on this, and don't need the prompting, but I want to make sure you realize that this is at the crux of Kenn's argument.
Also, if it's true that your on-disk representation is 1/10th the size of your in-memory, I think you've got to be doing something wrong. Or viewed more positively, it looks like you still have room for a 10x improvement in how much data a Redis instance can handle! While you are justifiably proud of your low CPU usage, realize that one could also view that as an underutilized resource. Yes, memory is the new disk --- and hence you don't want to be pushing 10x the necessary data back and forth every time.
Anyway, once again, great response! But since I'm occupied with sorbet, I humbly encourage you to make sure you're not overlooking a way to work with the OS rather than against it. Yes, it would be tricky to get an mmap() solution snapshotting properly, but if you can make it over that hurdle everything else suddenly becomes a whole lot easier.
Guy Naor writes: This and the blog post you refer to, are probably one of the most level headed and smart discussions around persistence in databases. And it can help others understand the tradeoffs between persistence and performance. I also help it will help people understand that there isn't just one single tool that can solve all problems. As developers and architects what we need is a good understanding of the advantages and tradeoffs of the tools we are using.
kenn writes: Hi Salvatore,
Thanks for your link!
I totally agree that the awesomeness of Redis comes from keeping the code base simple and clean. The bug-free-ness as a result is a "feature" of Redis, I think. :)
Do you have any comments/ideas on the Virtual Memory part, to handle larger dataset? That part was the real concern in my post.
It would be really hard to do those things well, especially while keeping things simple, but I guess moving from malloc to mmap/msync wouldn't be so bad? However you wouldn't be able to simply use cp command for backup... atomic rename of file would become atomic page swap (shadow paging) and yeah any ways from there would complicate things a lot. Just for food for thought.
Ariel Weisberg writes: I think it is important to differentiate between worst case performance and average case. In the average case you will need far less than 2x space for COW data, especially if you switch to managing your own COW process. If you are able to maintain a cursor into your data as it is snapshotted you can always know what data needs to be backed up before being dirtied. The window of data that needs to be backed up rapidly shrinks as the snapshot progresses. A basic 7.5k SATA drive can write 80 megabytes a second so snapshots can progress quickly if you saturate the disk. I imagine it gets more complicated once you bring Redis VM into play.
I think you should find out what it would take to generate a pathological workload before deciding additional improvements are necessary.
Also, COW data may be more space efficient (possibly more than the original data) because you can allocate memory in large blocks and then allocate linearly from those without the gaps that would be introduced by allocating memory in odd quantities off the heap.
Matt Siegel writes: oops: to be clearer, my comment is similar to your Cluster idea, in that it divides the key space... not dividing the operations by order in time. enjoy your weekend :)
Matt Siegel writes: Hi Salvatore :)
Re-writing the AOF by breaking it into multiple segments sounds like a good idea.
Since the main problem is dependencies between keys, then perhaps pre-scanning the original AOF to build a dependency graph will help determine which keys can be separated.
There will be some degenerate cases that prevent good separation, but I suspect those users have bigger problems to worry about ;)