Today I read this article about Redis
with interest. It's a legitimate point of view on databases, you may agree or not, but the author has a solid idea about what he is looking for and about the database scene. I don't agree with everything, but there is something I especially like about this article: it actually identifies very well what I think are actually
the strength and deficiencies of Redis.
In short: the strength is the data model, and the deficiency is the persistence.
Fancy and Persistent
We want two things that are hard to play well together:
- Programmers want complex data types, the ability to use 60 years of computer science not just in the language, but in the database too. I think that even if Redis will be obsoleted in six months and I'll quit programming and do molecular cuisine, a point here is shown: plain key-value is not a cool world where programming inside. To have real data structures and atomic operations is a completely different level of abstraction.
- We want good persistence.
Why it's so hard to make this things play well together? Well, first of all the direct approach does not work well, that is, implementing complex data types and atomic operations directly on disk. It's hard and slow at best.
We took the other direction, and for additional reasons, like, the guess that memory is the disk of tomorrow. So our data is in memory. The problem is that we need the same data to be at the same time on disk, with a point-in-time snapshot semantics. Our disk snapshot may or may not be realtime, but anyway a property should be guaranteed: The copy on disk should represent the exact dataset that was in memory at a given time
. Without this even the basic consistency requirements are completely broken.
An hard to escape rule
So, how to take a snapshot on disk with this property, and at the same time doing it in a non blocking way? There is... more or less a single way, conceptually:
- Iterate the keys, and write every key on disk
- But since you are doing this in a non blocking way, that is, the server is also accepting queries you also need...
- To track every time a key changed. That is, if we are snapshotting and there is a write against a given key that was not already transfered on disk, we need to make a copy of the old value (or remember the key was not existing at all), and use this old copy when we'll write this key on disk.
If you think about this five seconds you'll realize you need an amount of memory proportional to the changes to the dataset happening while the saving is in progress. That is, in the worst case, up to two times the memory needed for the dataset (if all the keys changed while saving, in the worst order).
Redis implements the above algorithm letting the kernel do the work for us, that is, via copy on write of pages. What we do is forking the master process and start a save in the child process. Every time the dataset receives writes, a few pages will change on memory, and this will get duplicated in the child process. At least this was trivial to implement :) and very bug free, that is an important thing in a database system.
The AOF way
If taking a live point in time snapshot of the database is hard, can't we instead take a journal of all the operations performed and re-play such a journal in order to rebuild the state?
Sure, this works well, it's pretty fast in practice too even if it is disk bound, but we have a few problems even with this system:
- The AOF file size is proportional to the number of writes. It will get bigger and bigger.
- The bigger the AOF file is, the longer the server will take to restart.
- So we need a way to compact the AOF from time to time. Again in a non blocking way, while the server is running receiving read and write queries.
What Redis does in order to compact the AOF is rewriting it from scratch. This means to do something very similar to writing the point in time snapshot
but just into a format that happens to be a valid sequence of Redis commands. So Redis does not read the old AOF to rebuild it. It instead reads what we have in memory to write a perfect (as small as possible) AOF from scratch. When the new AOF is in place, we do an atomic rename syscall swapping the old with the new.
This is done in the child process, again, it's basically exactly the same problem of dealing with the point in time snapshot, but with the additional problem (that is easy to fix) of accumulating the new queries while the AOF rewrite is in progress, so that before to swap the old with the new, we will also append all the new operations accumulated in the meantime.
Guess what: we have the same problems here. Up to 2 times the memory required while writing the AOF file. This seems like a waste. There aren't better ways? Many other systems are doing smarter things... we should try harder.
It's not so trivial
Ok let's start with an idea that is common to other systems with a plain key-value semantics. For instance Bitcask
is using it, and with good reasons, it's a neat idea.
That is... instead of dealing with a single huge AOF file, we can segment it into small pieces, in different physical files: AOF.1, AOF.2, AOF.3 ... Every time the AOF is big enough we open a new file and continue.
This way we may compact the old AOF files that are no longer actively used in background, maybe even with a command line tool, so that we'll have just a single Redis process that will never fork. It will just write to the latest AOF file, the active one. While the command line tool will continue compacting the old AOF files into a single one. Cool right?
How to compact old files?
If we have a plain KV system we have just two write operations: SET and DEL.
In order to compact the old AOF files I can do the following:
- Read all the AOF files entries in sequence. Every time I encounter a SET I write this entry into an on-disk index, like a B-TREE or alike. What I need to do is mapping every key with the file and offset of the last time I saw a SET for this key.
- If I encounter a DEL, I'll just remove the key from the temporary index I'm taking.
- At the end of this process I take my temporary index, and write a new AOF key by key using the offset stored in the index.
This works because SET and DEL and idempotent operations. What will win is the latest SET or DEL entry for a given key, all the old ones can be discarded! I can have an infinite number of operations against a key, but once I see "SET foo bar" I don't care about all the previous operations, the value of foo
Cool, but does not work
This does not work when you have complex operations against aggregate data types.
To start, in order to even parse the AOF with the command line tool, this tool need to be Redis-complete. All the operations should be implemented, for instance intersections between sets, otherwise what I'll do when I encounter a SUNIONSTORE operation?
Also, in Redis most operations are not idempotent. Think about LPUSH for instance, the simplest of our list write operations. The only way to turn list operations into idempotent operations is to turn all of them into an MKLIST <key> ... all the elements ...
operation. Not viable at all.
Our command line tool at best will be able to exploit SET and DEL operations in order to reduce the size of the file, but it will just loose against an always updated sorted set.
To pay 2x of the memory in the worst case may not be so bad at this stage, because it is not trivial to find a really better solution with the Redis data model. Does this means we'll never improve this part of Redis? Absolutely not! We'll try hard to make it better, for instance possibly using a binary AOF format that is faster to write and to load, more compact. We may write a command line tool that is able to process the AOF multiple time only dealing with a subset of the keys at every pass in order to use less memory (but we have inter-key operations on aggregate data types, so this may only work well if such operations are not used). For sure we'll keep investigating.
Possibly new ideas will emerge. For instance currently I'm working at Redis Cluster. With the cluster it is possible to run many instances of Redis in the same computer in a transparent way. Every instance will save just his subset of data that can be much smaller than a single giant instance. Both snapshotting and AOF log rewrite will be simpler to perform.
Redis is young and there are open problems, but this is an interesting challenge I want to take ;)