there is something between append only and reusing blocks

Monday, 07 February 11
One of the most interesting properties of an append only btree is that it's impossible to corrupt. There are other interesting properties as well like concurrent accesses are trivial, since whatever is the root and all the nodes your reader is going to access, they are all valid, just they may be an old representation of the btree.

But there is a price to pay that I'm not happy with, that is, the append only btree requires a compaction process, otherwise it will get bigger and bigger. If your use case is mostly-reads this may be ok, but if you plan to use a lot of writes there is some math to do.

Think at this: if you have many writes that are near to the max disk I/O that your server is able to deliver, your append only btree file size is going to get big of course, and you need compaction. But you are already near to the I/O limit, what happens once you start writing a new file to rewrite the btree? The additional I/O can affect badly the performance of the btree. Can you slow down writes in order to reduce the impact? Unfortunately not, otherwise the rewrite will never be fast enough to compact the file.

I'm all for performance characteristics that are as predictable as possible, so I don't like this design. It is for sure a sensible one for many applications, but I can live better with the idea of a single file that is never rewritten (if not for crash recovery or alike) and that is able to reuse the no longer used blocks.

But guess what? such a btree is much simpler to corrupt, especially if you don't use fsync() in order to implement write barriers. Disks and operating systems can reorder writes, so even if you do something like: If there is not an fsync() between the two, it is well possible that writes are reordered, so that the link is created before the node is written on disk. If a crash happens between this two operations you end with a corrupted btree. But are the append only strategy and update in place (and reuse blocks via free lists) so incompatible? Possibly not.

Experimenting with my btree implementation I discovered what is probably the discovery of hot water, that is, likely common practice, but the kind of common practice that you'll never find in an algorithms book. That is, what about if my free list entries have a timeout so that they can't be used before 60 seconds?

This way the btree implementation can always rewrite the modified nodes instead of updating them in place. Even if there is a read currently in progress, we are sure that even if our writes will reuse nodes, no recently freed node will be used, so unless the read will take 60 seconds to complete, everything will go well.

Unfortunately we still need a fixed offset for our root node pointer, so an fsync is still needed on every write in order to make sure that reordering can't corrupt our btree on crash.

With my implementation I'm not still at this stage as I'm starting with the most basic things that can work well, but that's the idea for the next releases.
36296 views*
Posted at 06:02:28 | permalink | 7 comments | print