Comments for post Redis Virtual Memory: the story and the code

Icy Tower writes: Good info, a cautionary tale for any readers I'm sure. Thanks
Danny Chin writes: I do agree with Rafal98. For a user base of 500M, each having 100k objects (say a Facebook clone), you would need 50,000 giga-keys, which translates into USD 300M for RAM alone, or billions of dollars once you include availability, scalability and the machine hardware around the RAM. It is probably not only about dollars but whether it is cost-effective at all, because not all objects are hotspots, e.g. probably not many people will want to read your long-outdated posts. I guess that, besides the values, swapping out or expiring the keys and later loading them back is inevitable in this scenario. Then this goes back to the memcached+datastore strategy. But Redis already has a persistent datastore; it could be put to use for the "key expiring/swapping/loading" operation rather than having a duplicate datastore. Anyway, no offense, just some ideas.
rafal98 writes: At the moment our product uses a commercial DB: "RDM" from Birdstep (formerly Raima). It's a kind of graph DB with an SQL layer if you need it. The index is B+tree only, no hash :( Redis appears to be very cool too
euhost writes: @rafal98: what other NoSQL DB do you use for your workload? Thanks for sharing
rafal98 writes: Redis appears to be amazing, with lots of high-level features. Anyway, concerning this topic on VM, I think that having the keys mandatorily in RAM is a real limitation. I usually use some other NoSQL DBs to store billions of small records. Each record is about 20 to 40 bytes, but in your demonstration 1 billion keys means 160 GB. It's very expensive to have a computer with 160 GB... (and my keys are 16 bytes, so it's probably even more RAM). Anyway, congratulations for this wonderful free DB. Regards
Nicolas writes: Great article; the "why" behind the design is very important to understand how things end up implemented. Keep up the good work.
NoDoubt writes: Josh-- No one cares about this nonsense. In 1978 when we upgraded from a PDP-10 to a 20, it was great to get Segmented Virtual Memory. But, all the Paged Virtual Memory systems were better and that is all there is now. When there is one choice, the words no longer matter.
josh writes: um, whatever dude. your blog is an exposition, and it failed to explain that paging is a common component of virtual memory systems, but virtual memory is not paging. just go fix your crap and carry on. cheers. :-)
antirez writes: @josh: this blog entry is for people who care about the implementation of Redis, but a big part of the audience has no idea at all about OS theory, CPUs, MMUs and so forth. Btw, if we want to be very specific, neither term is 100% correct, as in some ways it's virtual memory and in some ways it's paging: it's a high-level implementation of both. For instance, Redis keeps an in-memory page table, which is similar to VM (though the Redis PT only contains information about used/free pages). The translation between "virtual" addresses and physical ones is instead implemented using "VM pointer" objects that translate a pointer in memory into a pointer on disk, so it's as if this "virtual" space were mapped to a physical page of memory that lives somewhere else (on disk). This looks pretty similar to VM to me.
josh writes: oh yeah, you're proly right. i mean, the content of this blog covers memcached, threads, key-value pairs and all that - but whoa, paging vs. vm? that's WAY too complicated to explain. so yeah, you're right - it's probably better to completely misuse a well defined technical term so that we make sure there are no "misunderstandings". good call navigatore.
Navigatore Anonimo writes: @josh: as long as you know, we feel pretty relaxed, as the bulwark of knowledge is in your hands. While I agree that the Redis VM is actually an implementation of "paging", one of the main goals of virtual memory is to allow paging (even if there are many other goals). I bet that calling this feature of Redis "paging" would result in more misunderstandings than "Virtual Memory".
josh writes: um...dude, i don't think you know what virtual memory is. then again, most people don't. http://youdontknowwhatvirtualmemoryis.blogspot.com/
clara writes: Great work, Antirez. Thanks! I look forward to playing with Redis in the next few days. I have been waiting for a key/value store that supports set operations for quite some time! I have one confusion in my mind... The second reason you gave for using your own VM scheme is that objects take less space on disk with your scheme, due to the fact that Redis does a very good job at encoding them. But when swapping to disk you are swapping pages. Are you doing the encoding when swapping to disk? I don't fully understand the reasoning here.
antirez writes: ETA for 2.0: probably mid June, but maybe mostly stable pre-releases will be available in mid May.
Chris Marisic writes: Any ETA on the release of 2.0?
antirez writes: Joe: mimi: Yes, with 2.0 it is possible to use hashes to group different fields. This is 4 times more memory efficient, or even more. Even when a user does not need hashes, they can still be used to set up a much more memory-efficient layout. For instance, instead of setting "key:192" and "key:194", one can set "key:19" as a hash with fields "2" and "4", and so forth. I bet that once 2.0 is released we'll have many libs doing this automagically for you.
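(Editor's sketch: the hash-grouping trick described in the comment above, in a few lines of illustrative Python. A plain dict stands in for Redis hashes, and the "key:NNN" pattern with its split point is just the example from the comment, not a fixed Redis convention:)

```python
def bucket_and_field(key):
    """Map a flat key like 'key:194' to a bucket and field: ('key:19', '4')."""
    prefix, _, num = key.partition(":")
    return f"{prefix}:{num[:-1]}", num[-1]

store = {}  # stand-in for Redis hashes: bucket -> {field: value}

def hset_flat(key, value):
    """Store a flat key as a field inside its bucket hash (like HSET)."""
    bucket, field = bucket_and_field(key)
    store.setdefault(bucket, {})[field] = value

def hget_flat(key):
    """Read a flat key back from its bucket hash (like HGET)."""
    bucket, field = bucket_and_field(key)
    return store[bucket][field]

hset_flat("key:192", "a")
hset_flat("key:194", "b")
```

Ten flat keys collapse into one top-level key with ten fields, which is where the 4x-or-better saving on per-key overhead comes from.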
mimi holliday writes: I also want to know about what @Joe has asked. Can anyone answer this?
Joe writes: Is there a way to get the 160 bytes per key down? My actual keys are on average 15 bytes and the value is only 1 byte. I have over 100 million records, but using 16 GB just for the keys is a killer. Currently using multiple cdb files.
antirez writes: @Ashwin: a very big part of the article is about why this can't be used for Redis ;)
Ashwin Jayaprakash writes: I may have misunderstood your explanation (very likely) but I thought I saw some similarities between your "VM" implementation and the problems explained here - http://varnish-cache.org/wiki/ArchitectNotes#Sowhatswrongwith1975programming
Dhruv writes: @antirez: Nice, resque looks good :)
Demis Bellot writes: Awesome work again antirez! - You are quickly eliminating any reasons for not using Redis. It is already the most versatile key-value store around and the best technology I've seen yet to come out of the NO-SQL movement. Keep up the good work!
antirez writes: @Dhruv: Resque! :) http://github.com/blog/542-introducing-resque Also check this: http://github.com/gleicon/restmq It is one of the top use cases for Redis.
Dhruv writes: Actually, I am looking to use redis for a persistence store for a message queue. Do you know if there is anyone using it in that way?
antirez writes: @chh: Redis already does it for integers. With strings it's also trivial, but for now I'm not doing it, as most of the strings are small enough to be impossible to compress with standard algorithms (so I developed one, btw, called smaz; you can find more on my GitHub page). When writing to disk we can compress much more, for a reason: we lose the structure of the data. A list or a set becomes just a simple string of elements with a separator. In memory it's not possible to compress the same way, as you need the right connections and organization of the data in order to ensure the expected time complexity. @Dhruv: yes, I see how there are good and bad things about the two different approaches. I just think that, everything considered, the user-land VM is the best compromise, at least for Redis. For other kinds of applications the best can be exactly the reverse: build a vertical allocator, or even simpler, just use the OS paging, and I think this is the first approach to try. When there are no alternatives, one has to implement a user-land VM, which is indeed more complex but has its big advantages.
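(Editor's sketch: the point above about small strings being impossible to compress with standard algorithms is easy to verify with Python's zlib. Illustrative only; smaz itself works differently, using a codebook tuned for short English text:)

```python
import zlib

# zlib's header, checksum and block framing cost more bytes than a
# tiny payload can save, so the "compressed" form is actually bigger:
small = b"hello"
compressed = zlib.compress(small)
print(len(small), len(compressed))

# ...while a long repetitive string compresses very well:
big = b"a" * 1000
print(len(big), len(zlib.compress(big)))
```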
Dhruv writes: @antirez: I think that compression is a big win with your approach. However, if you switch to the mmap() approach, you just have to ensure that all memory comes from the mmapped region. No re-coding required. Yes, reads will block, and this might be a very bad thing as far as performance is concerned -- but only when memory runs out. Linked lists: starting from the head (or tail, considering a circular linked list), the boundary elements will be cached. Ordered sets: I am assuming ordered sets are implemented as some sort of binary tree; in this case, the nodes near the root will be in main memory. Hash sets: the most accessed nodes will be in memory, and the corresponding bucket pointers as well. Yes, a resize will touch all the bucket pointers, but not the data itself. Pros: easy to code (this is an understatement!!), partial data structures present in memory, hence memory is optimally used (modulo fragmented values in data structures), and easy on the CPU since there's no compression, so you can use sendfile() to stream the data over the socket. Cons: blocking on every I/O, no compression possible, more space taken up on disk, and hence more I/O time.
chh writes: How about compressing data in-memory?
antirez writes: @Dhruv: yes, I'm not supporting partial retrieval of objects, for a few reasons I'll expose. It was a design decision: I may even implement it for lists, and maybe the mmapped solution would more or less work with lists *assuming* there is a way to solve the locality problem, but what about sets, sorted sets, and hashes? These are currently implemented with an in-memory hash table, so a few key lookups will scan many memory pages (also think of the resizing of hash tables). But here are my arguments. Let me throw in some numbers: a 10k-element list where every element is "a" is, serialized on disk, 20k, and in memory 1 MB. Loading a whole list from disk to memory requires very little I/O, and the whole VM concept is that you try to keep the working set in memory, so cache misses, especially of large values, should be rare. So my first point is that for many aggregate objects, even ones composed of 10k elements, reading 20k from disk versus a small string value is going to be more or less the same speed, and it is performed in an I/O thread. My second point is that the OS VM *is blocking*! All clients are going to wait while Redis is swapping, and it swaps 10 to 50 times the amount of data, as you can see from my serialized/live 10k-element linked list. Finally, if you change the OS, you'll see different behaviors. This is going to be completely out of our control. Will the OS aging algorithm be as good as ours, which has all the domain-specific info?
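(Editor's sketch: the serialized-versus-live size gap quoted above can be approximated in Python. CPython object overheads stand in for Redis's C structures here, so the exact ratio differs from the 50x antirez measured, but the direction and the 20k serialized figure match:)

```python
import sys

elements = ["a"] * 10_000

# On-disk form, as described above: the elements joined by a separator.
serialized = "\x00".join(elements).encode()

# Rough live-memory estimate: the list's pointer array plus one string
# object per element. (CPython interns identical one-char strings, so
# this counts them as if they were distinct, matching the Redis case
# where every list node is a separate allocation.)
in_memory = sys.getsizeof(elements) + len(elements) * sys.getsizeof("a")

print(len(serialized), in_memory)  # ~20 KB serialized vs hundreds of KB live
```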
Dhruv writes: @antirez: Yes, the compression is a big plus in the manual approach. As far as locality is concerned, I agree with you. However, you can't now have a partial object in memory with the manual VM: a value is either completely in memory or completely on disk. If a (size 100) linked list's head is accessed many more times, then the whole linked list will be in memory, if I understand correctly, or am I mistaken?
antirez writes: @matt b: thanks, fixing
antirez writes: @Dhruv: many Redis objects are composed of sub-objects and change continuously. The allocator would have to be object-aware: I should be able to say, I need a new redisObject structure *near* this one. Also, I have object sharing at many levels (even when shareobjects is disabled). The Redis VM is able to account for all this, because it's in user space. It is able to transfer objects to disk specially encoded to save a lot of space (10x is common!). And it's a self-contained subsystem.
antirez writes: @Paul: sorry, I was not clear in my previous comment: what I mean is that Redis objects are built in stages. For instance, Redis lists are composed by different LPUSH commands. You don't know beforehand how many elements there will be in a list (and this changes over time), but you want elements from the same list to be allocated in the same page: so you need to pre-allocate *per object*. How much? At least a small multiple of 4096. Even if you pre-allocate 3 pages per object, you still have just a single guaranteed "whole" page, as you don't have alignment guarantees. And it's worse than this. What about fragmentation? In many applications there are sets composed of many elements changing over time: you end up with many pre-allocated blocks you can't free because there is a single element left inside. It's not viable at all in my opinion.
Dhruv writes: Why don't you use the OS VM and mmaped files as memory instead of implementing your own VM? For keys, you can have sticky memory maps (which will never be swapped). Create 2 custom allocators, one for the keys and one for the values, which allocate memory from the appropriate mmapped regions?
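(Editor's sketch: Dhruv's file-backed memory idea in its most minimal Python form. A real allocator would carve objects out of this region and keep a sticky region for keys; the file size here is arbitrary:)

```python
import mmap
import os
import tempfile

PAGE = 4096
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 4 * PAGE)        # reserve a few pages of backing file

region = mmap.mmap(fd, 4 * PAGE)  # file-backed, shared mapping: the OS pages it
region[0:5] = b"value"            # stores land in the page cache...
region.flush()                    # ...and can be pushed out to disk

data = bytes(region[0:5])         # loads may fault a page back in

region.close()
os.close(fd)
os.unlink(path)
```

The appeal is exactly what the comment says: the kernel does the eviction. The drawback, raised in the replies above, is that every page fault blocks the single Redis event loop.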
mark lucovsky writes: very nice work!
matt b writes: Hey antirez, the first link for redis in your article is pointing at http://code.google.com/ instead of http://code.google.com/p/redis.
Paul Betts writes: @Antirez You don't necessarily have to completely preallocate everything, but you make sure to call malloc on large chunks, then use your own smart logic to hand out the pointers (essentially writing your own Redis-aware malloc, which is smart enough to allocate objects close to each other). You could build off of jemalloc (http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf), which Firefox uses instead of the standard allocator to limit memory fragmentation. I'm not saying to ditch the VM approach though, they most likely would complement each other, and hopefully this might help you to make Redis even faster and more awesome than it already is!
antirez writes: @Paul: yes, in theory, but the problem with your solution is that you lose a number of big advantages when there is no need for swapping, mainly object sharing and the reuse of pre-allocated objects (object pools) instead of calling malloc again and again. Not only that: many times it is just not viable. Think of Redis lists: on every push I have to allocate another object. If I have a per-list pool preallocated, how much memory do I need to store millions of lists? And so forth.
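(Editor's sketch: the "millions of lists" objection is easy to put numbers on, assuming the 3-pages-per-object minimum pre-allocation mentioned elsewhere in the thread and a 4 KiB page:)

```python
PAGE = 4096                 # typical page size in bytes
lists = 1_000_000           # "millions of lists"
pages_per_list = 3          # minimal per-object pre-allocation from the thread

overhead = lists * pages_per_list * PAGE
print(overhead, overhead / 2**30)  # bytes reserved, and in GiB
```

That is over 11 GiB reserved before storing a single useful byte, which is the non-viability antirez is pointing at.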
Paul Betts writes: Your VM implementation might be good, but wouldn't a better solution to your problem #1 be just to improve the locality of your data by preallocating chunks of memory and managing it yourself, instead of just calling malloc and letting the heap manager decide where to put it? This would also help with #2, because your pages would be less likely to be paged out...
Justin writes: Sounds awesome and good work! I'm a relatively novice programmer and I enjoy reading the Redis codebase. What a great mentor you would be! :-)
Sunil Arora writes: Good work antirez! I can smell the possibilities with you unleashing the real potential of Redis Store. This weekend is all booked for Redis-VM. Thanks for the wonderful work once again!
Ericson Smith writes: Got goose bumps at the potential, just reading this! Good work antirez.
antirez writes: Thank you guys, I'll try to do my best.
Michelangelo Altamore writes: Thank you antirez, I am amazed at your mastery.
Jake McGraw writes: Keep up the awesome work, this is utterly amazing.
Guo Du writes: Redis is becoming one stop shop for storage: in memory cache, key/value store, queue, large data set. Good work and thanks!
Bernard writes: Thanks. I put a project on hold last year because I couldn't find a suitable data server that was both scalable and whose interface I liked. I think Redis is the one for me.
aprilchild writes: awesome! keep going:)
daniele writes: awesome! Keep on!