Diskstore and a new b-tree implementation

Tuesday, 18 January 11

A few weeks ago I presented a new development direction for Redis, called diskstore. The original announcement is in this Redis Group post.

Diskstore is working very well in our initial tests, and we are working to improve it. Diskstore is basically a smart caching layer on top of Redis backed by an on disk plain key value store. The current implementation of the disk store is file system based: every key is represented on disk by a single file. This setup is interesting in many ways as it is pretty hard to corrupt, and very transparent (you can see the individual files, rsync them into another server, and even joining two databases just merging all the files), but hardly the fastest.

So the idea is to provide two disk backends built-in into Redis for different needs. One is the current file system based implementation, and one is an ongoing effort, based on a b-tree. If you follow me on twitter you probably noticed that many of my recent tweets are about a b-tree implementation: it is the one that is going to join the current backend into diskstore.

So diskstore users can select via redis.conf if they want to use the diskstore-fs or diskstore-btree backend, at least initially. It is possible that in the long run one solution will show to be superior in almost every way to the other and we may deprecate it. But current indications and experiences with other NoSQL databases are suggesting that there is some space for a plain filesystem based backend, especially if you want to use Redis to store very very large values.

Unfortunately I was not able to find a BSD implementation of a btree that satisfied my requirements and that was self-contained enough, so I'm writing one from scratch. The good news is that the new implementation is written as a stand alone library and completely decoupled from Redis itself, so it will be possible to use it for other projects as well.

I'll take you posted on twitter about progresses. I hope to have something working in a few weeks at max.

42655 views^*

Posted at 07:32:04 | permalink | 12 comments | print

Do you like this article?
Subscribe to the RSS feed of this blog or use the newsletter service in order to receive a notification every time there is something of new to read here.

Note: you'll not see this box again if you are a usual reader.

Comments

Jay Freeman (saurik) writes:

18 Jan 11, 08:47:24

Why a B-tree? To be clear what I mean: as opposed to LSM-trees (like BigTable and Cassandra), B-tries (note the "i" in there), whatever else.

startupgrrl writes:

18 Jan 11, 09:53:24

You might as well implement 2-COLA (a cache oblivious streaming B-Tree) if you're going to go that route. B-Trees alone are archaic.

antirez writes:

18 Jan 11, 09:57:55

@saurik: good question. Basically during this implementation I'm also trying to learn on disk data structures. This is what I'm doing so far, in a sort of incremental implement/understand way:

I'm starting with a pretty general on disk allocator that is able to allocate and reclaim, without any compaction stage (using on disk free lists) space in power of two sizes.

On top of this as a first attempt I'm creating a tree with a number of nodes between 4 and 255, with the same basic properties of btrees but implemented in a different way as values are always stored outside the node, keys are fixed length (first 16 byte of SHA1) as I need to support keys of any length even if I can't access them ordered.

At this stage this is just experimenting... then I'll try to apply modifications to enhance performances and corruption resistance.

I'll check the data structures you mention to see what advantages they have, and I hope I'll find something that can better solve my problems, that are: not the fastest but decently fast disk KV store, good resistance to corruption, ability to update N keys transactionally so that either all the new values or neither will get in, no need for compaction, no need for parallel accesses.

tobi writes:

18 Jan 11, 11:19:03

why not use berkleydb?

Anonymous writes:

18 Jan 11, 11:22:57

You mentioned that your fs-backed implementation allows rsyncing the files to another server. If I use the B-tree backend instead, how should I keep the data backed up in a consistent state?

Benjamin Sergeant writes:

18 Jan 11, 11:47:03

For the on diskstore-fs, I think there's a log(N) access cost for accessing a file in a folder with N entries, so it's a good idea to nest the files (dont know what's the good value to choose for max_n though).

eric writes:

18 Jan 11, 12:52:43

please hurry..... my rewrite log is like 40 gigs and I can't get it to rewrite in a normal amount of time.

Tim writes:

18 Jan 11, 15:58:32

both diskstore-fs and diskstore-btree are certainly useful, but i think the most important feature would be a generic diskstore interface (pipes? sockets? http? zeromq?), so that any backend store can be used.

there are a lot of exciting options with different tradeoffs: berkeleydb, riak, cassandra, tokyo cabinet, membase, mogilefs ....

Matt writes:

19 Jan 11, 06:27:20

Any code, yet?

rch writes:

20 Jan 11, 14:25:53

disco:ddfs?

[url=http://www.test.ro]test[/url] writes:

28 Mar 11, 07:36:58

This is just a test !

http://www.test.ro writes:

28 Mar 11, 07:37:15

Another test

comments closed

antirez weblog

Diskstore and a new b-tree implementation

Comments

PROGRAMMING AND WEB

HOT ARTICLES

NEWSLETTER