Backporting into Redis 2.4 and other news

Monday, 18 April 11
I think I should write more about Redis development... lately I was so focused on writing the code and the Redis Book that finding the time to blog about Redis was really hard, but I'll try to improve in the coming weeks. Today, however, I want to provide some fresh news to Redis users: having some insight into the near future of a project can be very interesting for developers planning to start a new project with Redis.

Currently we have three development branches: 2.2, 2.4, and Redis Cluster (the unstable branch).

2.2 is a bugfix-only development line, so we'll continue to ship 2.2.x versions only to fix bugs.

2.4 is our new branch, just a few days old. Our old development model with a stable branch and an unstable branch did not work well; we needed something in the middle, since there is simply a lot of stuff that can be backported from the unstable branch.

The unstable branch, where Redis Cluster development is happening, will take time to reach stability, as the cluster is a big project (our idea is to release a first stable version of Redis Cluster later this summer). 2.4 is a way to put something better than 2.2 in the hands of our users ASAP.

We hope to ship 2.4 in an estimated time frame of 6 weeks. It will include the following changes compared to 2.2:

  • Memory optimized sorted sets. This means that small sorted sets will take little memory, just as small hashes, lists, and sets composed of integers already do.
  • Variadic versions of [LR]PUSH, SADD, ZADD, and so forth, so you can, for instance, push multiple values into a list with a single command (see the example after this list). I measured the difference with a few benchmarks, and it is really dramatic compared to pipelining many LPUSH commands.
  • Big improvements in .rdb persistence. Specially encoded types are now saved directly as they are. Just to give you an example, if you have a dataset composed of lists with an average of 100 elements, you can expect 50x faster .rdb persistence.
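For instance, a single variadic push replaces many round trips (a hypothetical session; the key name and values are picked just for illustration):
redis> rpush mylist a b c
(integer) 3
The reply is the length of the list after the push, so one command does the work of three pipelined LPUSH calls.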


All the above stuff is already inside Redis unstable of course, but with 2.4 it will be readily available to all users in a short time. The current 2.4 branch only includes the first two changes; I'm working on merging the last one.

How to play with Redis Cluster

We also have some news about Redis Cluster: you can test with your own hands what we already have. The following is a howto about testing Redis Cluster. Note: Redis Cluster is not complete. It is currently an alpha with a lot of missing features, and it is not stable. Here the goal is just to provide a preview.

To play with Redis Cluster fire up three instances with the following configuration:
port 6379
cluster-enabled yes
cluster-config-file nodes-1.conf


Use port 6379 for the first instance, and 6380 and 6381 for the other two. Also make sure to use a different cluster-config-file name for each instance: nodes-1.conf, nodes-2.conf, nodes-3.conf. The cluster config file is not something you should change by hand; it is a file where a cluster node saves the current configuration in order to reload its state at restart.
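For example, assuming you saved the three configurations as redis-6379.conf, redis-6380.conf and redis-6381.conf (the file names here are just an example), you can start the instances with:
$ redis-server ./redis-6379.conf
$ redis-server ./redis-6380.conf
$ redis-server ./redis-6381.conf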

Now that you have three instances running you can start issuing some commands:
redis> connect 127.0.0.1 6379
redis> cluster info
cluster_state:fail
cluster_slots_assigned:0
cluster_slots_ok:0
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:1
redis> cluster nodes
0c2f029a52bec8d17c43b137c74205fade1b1921 :0 myself - 0 0 disconnected
As you can see this node only knows about a single node, that is, itself: you can tell from the "myself" flag in the cluster nodes output. The cluster info output instead shows that, out of the 4096 hash slots into which the key space is divided, none are assigned. This is why this node will not be happy to reply to queries:
redis> get foo
(error) ERR The cluster is down. Check with CLUSTER INFO for more information
So the first thing to do is to join the cluster, that is, to make the nodes aware that there are other nodes around, as this is a completely new cluster.

As a first step we join the instance running at 6379 with the instance running at 6380:
redis> connect 127.0.0.1 6379
redis> cluster meet 127.0.0.1 6380
OK
redis> cluster nodes
0c2f029a52bec8d17c43b137c74205fade1b1921 :0 myself - 0 0 disconnected
96fad8c3b4df5f86ac4abe6205a253c640c751ef 127.0.0.1:6380 master - 1303136527 1303136527 connected
As you can see, now 6379 knows about 6380, and this is true for 6380 as well of course, as the two nodes performed a handshake:
redis> connect 127.0.0.1 6380
redis> cluster nodes
0c2f029a52bec8d17c43b137c74205fade1b1921 127.0.0.1:6379 master - 1303136590 1303136590 connected
96fad8c3b4df5f86ac4abe6205a253c640c751ef :0 myself - 0 0 disconnected


I can already see the "WTF do these fields mean?" expression on your face... so, every line of the cluster nodes output is composed of the following fields, from left to right:
node_id
latest_known_ip_address_and_port
role_in_cluster
node_id_of_master_if_it_is_a_slave
last_ping_sent_time
last_pong_received_time
link_status
Every node has an ID that is used for the whole life of the node. All this information is saved in the nodes.conf file. The format of this file is exactly the same as the cluster nodes output, as I was too lazy to invent something new, but this actually turned out to be an advantage (less code, more descriptive output).
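This also means a line is trivial to consume from a script. Here is a minimal Ruby sketch (the variable names just mirror the field list above; this is not code from an actual client library):
# Split one line of the cluster nodes output into its fields.
# Slot ranges, when present, end up in the trailing array.
line = "96fad8c3b4df5f86ac4abe6205a253c640c751ef 127.0.0.1:6380 master - 1303136527 1303136527 connected"
node_id, addr, role, master_id, ping_sent, pong_received, link_status, *slots = line.split
puts "node #{node_id[0,8]}... at #{addr} is a #{role}, link #{link_status}"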

Now, Redis Cluster nodes are like bored old ladies: they gossip a lot about other nodes. But the good thing is that at least cluster nodes are very well informed, and only report information they are pretty sure about ;)

Every second each node sends a PING packet to some other node. The target is not actually selected at random: it is picked among the nodes that are believed to be OK but have the oldest pong_received field in their node structure, so we tend to ping the nodes we have not chatted with for the longest time.
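In Ruby-like pseudocode the selection looks more or less like this (the field names here are my own, not the actual node structure):
nodes = [
  { :id => "node-a", :ok => true,  :pong_received => 1303136520 },
  { :id => "node-b", :ok => true,  :pong_received => 1303136590 },
  { :id => "node-c", :ok => false, :pong_received => 1303136400 },
]
# Among the nodes believed to be OK, ping the one we heard from least recently.
target = nodes.select { |n| n[:ok] }.min_by { |n| n[:pong_received] }
puts "PING #{target[:id]}"   # => PING node-a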

Every PING packet, and every PONG reply, carries a gossip section where we inform the other node about the state of a few other nodes. Also, when a node pings or pongs another node, the packet includes a lot of detailed information about the sender itself.

For a node to be marked as failing, we need to both detect that it has not replied to our pings for some time, AND to hear, thanks to the gossip section, that another node is having troubles with it as well. When this happens, the node marks the other node as failing and sends a "mark-as-failed" message to all the other known nodes.
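In other words, the rule is something like the following sketch (hypothetical field names, not the actual implementation):
# A node is considered failing only if our own pings are timing out AND
# at least one other node reported trouble with it via gossip.
def failing?(node, now, node_timeout)
  not_replying = (now - node[:pong_received]) > node_timeout
  not_replying && node[:trouble_reported_by_gossip]
end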

Let's test gossip in practice. Now we have 6379 joined with 6380. If we join 6380 with 6381, then 6379 and 6381 will also meet. But Redis nodes are like girls from good families: they only trust and meet with nodes that are either already trusted (in their nodes table) or trusted by their friends. The only way to make a Redis node talk to a node that is neither in its known nodes list nor in the known nodes of another trusted node is via the CLUSTER MEET command.

redis> connect 127.0.0.1 6381
redis> cluster meet 127.0.0.1 6380
OK
redis> connect 127.0.0.1 6379
redis> cluster nodes
8f1e863160f2627108451d0a0155127e8b1b4597 127.0.0.1:6381 noflags - 1303137505 1303137505 connected
0c2f029a52bec8d17c43b137c74205fade1b1921 :0 myself - 0 1303137500 disconnected
96fad8c3b4df5f86ac4abe6205a253c640c751ef 127.0.0.1:6380 master - 1303137505 1303137505 connected


Now all three nodes are connected and aware of their friends... however they are still not able to reply to queries, as no hash slots are assigned at all. To assign hash slots we need to send CLUSTER ADDSLOTS commands. We assign a part of the 4096 slots to each node, so that all the slots are covered:
$ echo '(0..1000).each{|x| puts "CLUSTER ADDSLOTS "+x.to_s}' | ruby | redis-cli -p 6379 > /dev/null
$ echo '(1001..2500).each{|x| puts "CLUSTER ADDSLOTS "+x.to_s}' | ruby | redis-cli -p 6380 > /dev/null
$ echo '(2501..4095).each{|x| puts "CLUSTER ADDSLOTS "+x.to_s}' | ruby | redis-cli -p 6381 > /dev/null
(Note: actually CLUSTER ADDSLOTS can accept any number of hash slots as parameters, but redis-cli does not work well with huge command lines, so we send a command for every hash slot.)
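Just for reference, the variadic form would look like this (slot numbers picked only for illustration):
redis> cluster addslots 0 1 2
OK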

Ok, now we should have a much more interesting cluster. Let's ask some node how things are going:
redis> connect 127.0.0.1 6379
redis> cluster nodes
8f1e863160f2627108451d0a0155127e8b1b4597 127.0.0.1:6381 master - 1303138327 1303138327 connected 2501-4095
0c2f029a52bec8d17c43b137c74205fade1b1921 :0 myself - 0 1303138326 disconnected 0-1000
96fad8c3b4df5f86ac4abe6205a253c640c751ef 127.0.0.1:6380 master - 1303138326 1303138326 connected 1001-2500
redis> cluster info
cluster_state:ok
cluster_slots_assigned:4096
cluster_slots_ok:4096
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:3
Yes! Now our cluster state is OK. As you can see, next to every line of the cluster nodes output there is the list of assigned slots. This information is all propagated thanks to the gossip section of the PING/PONG packets. We are ready to try some actual query:
redis> get foo
(error) MOVED 3990 127.0.0.1:6381
redis> get bar
(nil)
Finally the nodes accept our requests. The first request was about hash slot 3990, as the key 'foo' hashes to that slot, so we got routed to the right node. A good client will remember this and directly hit the right node the next time.
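To make this concrete, here is a sketch in Ruby of the routing table such a client could keep. This is illustration code, not an actual client library: the class and method names are invented, and I'm assuming the CRC16(key) % 4096 mapping with the XMODEM CRC16 variant (the same one used in the comments below):

# CRC16/XMODEM (poly 0x1021, init 0), assumed to be the key -> slot hash.
def crc16(str)
  crc = 0
  str.each_byte do |byte|
    crc ^= (byte << 8)
    8.times do
      crc = ((crc & 0x8000) != 0) ? (((crc << 1) ^ 0x1021) & 0xFFFF) : ((crc << 1) & 0xFFFF)
    end
  end
  crc
end

class ClusterRoutingTable
  def initialize(default_node)
    @slot_to_node = {}            # slot number => "host:port"
    @default_node = default_node  # used for slots we know nothing about yet
  end

  def node_for(key)
    @slot_to_node[crc16(key) % 4096] || @default_node
  end

  # To be called after a "(error) MOVED <slot> <host:port>" reply.
  def moved(slot, node)
    @slot_to_node[slot] = node
  end
end

table = ClusterRoutingTable.new("127.0.0.1:6379")
table.moved(3990, "127.0.0.1:6381")   # learned from the MOVED error above
puts table.node_for("foo")            # => 127.0.0.1:6381, no redirect needed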

Ok, that's all for now. I hope that, even if I can't show a complete solution yet, this journey into the current status of Redis Cluster was more interesting than just reading my tweets saying "I'm working at cluster".

Also note that to operate on a cluster you'll actually never do this kind of stuff by hand: the redis-trib program will do all of this for you. But my thought was that it is a lot less instructive to just type 'redis-trib create ...'; I wanted to show a bit more of the inner workings.

Comments

Matthew Frazier writes:
18 Apr 11, 11:49:52
How is progress on Diskstore coming? Will that be in 2.4? I know there are a lot of people (myself included) who are interested in it.
antirez writes:
18 Apr 11, 11:51:42
@Matthew: not in 2.4, probably not even in 3.0 (redis cluster stable release number) as we consider cluster more high priority. Basically diskstore is just an experimental project, it will hit a stable release only if/when we think it rocks. I'm a bit skeptical about mixing Redis and disk as primary storage (not just for persistence) but we'll keep trying new solutions.
ariso writes:
19 Apr 11, 14:51:07
diskstore is really useful for embedding/desktop applications. Could you please consider giving it higher priority?
Dean writes:
19 Apr 11, 16:04:26
I am very excited to start testing with Cluster. Thank you for your time working on it, and for posting an update on your progress! Since Hiredis is the "official" client library for Redis, are there plans to evolve it from being a 'naïve' client to a 'full-featured' client as far as Cluster support is concerned? (Referencing previous Cluster terminology where a naïve client requires two round trips for a lookup (using the MOVED response to find the key) and a full-featured client will maintain (and update) a map of keys to hash slots.)
Willp writes:
19 Apr 11, 17:08:08
Salvatore, thank you! You are doing terrific work! Redis should have been born decades ago.
Willp writes:
19 Apr 11, 17:26:31
One small improvement in assigning hash slots would be to round-robin them across the nodes in your cluster. Using the sample contiguous division of hash slots from the post will end up with a pretty uneven distribution of keys.

The distribution is not very uniform for values of crc16() % 4096 on strings that differ by only one or two ASCII values: strings that are generated sequentially will tend to have similar values. If instead the hash slots are initially assigned in round-robin style (give the slots where hashslot mod TotalNodes == NodeNumber to the node numbered NodeNumber), there will be a better distribution of keys across nodes. If your keys end in the same string and differ more in the leftmost characters, the distribution is a lot better. An even better distribution would of course be a random shuffle, though harder to manage/recover. I know, it's still early days for Redis Cluster, and I am hoping for the best!

Python code to demonstrate:

>>> from crc16pure import * # from https://github.com/gennady/pycrc16/blob/master/python2x/crc16/crc16pure.py
>>> for x in range(10): print 'crc16(%s): %d' % ('abc:%d' % x, (crc16xmodem('abc:' + str(x)) % 4096))
...
crc16(abc:0): 2305
crc16(abc:1): 2336
crc16(abc:2): 2371
crc16(abc:3): 2402
crc16(abc:4): 2437
crc16(abc:5): 2468
crc16(abc:6): 2503
crc16(abc:7): 2534
crc16(abc:8): 2057
crc16(abc:9): 2088
>>>

(All but two of these consecutive strings would hash to slots assigned to the middle node on port 6380. CRC16 is cheap but doesn't perturb the bits much on sequential input.)

Round-robining the slots (mod 3) in this case would give a better (but not great) distribution:
>>> for x in range(10): print 'crc16("%s"): %d and mod 3 is: %d' % ( ('abc:%d' % x), (crc16xmodem('abc:' + str(x)) % 4096), (crc16xmodem('abc:' + str(x)) % 4096) % 3 )
...
crc16("abc:0"): 2305 and mod 3 is: 1
crc16("abc:1"): 2336 and mod 3 is: 2
crc16("abc:2"): 2371 and mod 3 is: 1
crc16("abc:3"): 2402 and mod 3 is: 2
crc16("abc:4"): 2437 and mod 3 is: 1
crc16("abc:5"): 2468 and mod 3 is: 2
crc16("abc:6"): 2503 and mod 3 is: 1
crc16("abc:7"): 2534 and mod 3 is: 2
crc16("abc:8"): 2057 and mod 3 is: 2
crc16("abc:9"): 2088 and mod 3 is: 0

So, 1 slot went to node 0, 4 slots to node 1, and 5 slots to node 2. Better to have (1,4,5) than (0, 8, 2).