Redis reliable queues with Lua scripting
Saturday, 17 March 12
Redis 2.6 support for Lua scripting opens a lot of possibilities, basically because you can do atomically a lot of things that before required to pay a big performance hit. Now the price is much cheaper, so why don't abuse our power?
Even more important is that, before scripting, in order to turn non atomic primitives into atomic primitives you required help of the Redis WATCH command, that is a check and set style primitive. Being it an optimistic locking when there is high contention, like in the example of a queue with multiple workers (with many clients accessing a single key with WATCH), performances may be pretty bad.
In this blog post I want to show a pattern based on the scripting capability that can be used to implement reliable queues.
In our system there are only two players: Producers and Consumers, and we only push IDs, it's up to the consumer to agree with the producer about what this IDs really mean, similarly to Michel Martens Ost library.
An item is in processing state if a client is already processing it but has not yet finished.
Everything is based on the idea that tasks are never removed from the list, unless they were actually processed. But instead of using a service list to put there tasks that are in the processing state, we use a single list for everything.
From the point of view of the producer, if there is the new object ID 123 that needs to be processed by a worker (consumer), only an operation is performed:
LPUSH queue 123
So we add the item on the top of the list. Items on the top will be processed the last by workers, so this queue is First In Last Out.
The interesting part is what the consumer does to access an item in the queue.
It runs a Lua script that does the following:
- Get the element on the tail of the list, for instance 45.
- Put the same element on the head of the list, but followed by a trailing asterisk to signal that the item is currently being processed, followed by the unix time (passed by the client to the scripting engine). So in the end we get "45" from the tail, and we put "45*<unixtime>" to the head.
- Return the element to the client (45 in this case).
If the element currently on the tail was already followed by an asterisk and unix time the script does not add an additional asterisk and unix time, it is simply moved on the head, and returned to the client, including the asterisk and the timestamp.
So the client calling this script will either receive 45 (or any other ID actaully), or an ID followed by an unix timestamp like 45*1332014784.
What the consumer does with the returned value
If the item is in processing state but is still young enough (no timeout) it is discarded and the script is called again to fetch the next ID.
Otherwise if the item timed out the consumer will check if the item was actually processed or not by the original client, in an application-specific way, and will remove it from the queue if needed, otherwise the client will call another script that atomically remove the old item and add a new one with the new timestamp. And of course it will start processing it.
When an item was processed successfully it gets removed from the queue using LREM.
The advantage of this system, that may actually be modeled in many different ways, is that you have a rotating list full of jobs to process or currently being processed. There is no way for a job to be lost. Also clients will receive jobs that are still being processed every time a full run of the list is performed, so this jobs will be activated again if needed, but will still remain in the list forever as long as no one is able to complete them.
If tasks take a lot of time to complete using LREM to delete the task may not be optimal. We may use an additional key with a Redis set where we store all the completed tasks, that the lua script will remove every time an item in the processing state is encountered and is also in the Set.
Another good use of an additional Set is to mark the items currently processed or waiting to be processed if we don't want to put the same ID multiple times (rarely useful).
Blocking VS polling
This system requires some form of polling from the point of view of the consumer. In order to avoid that a consumer will rotate the list as fast as possible without actually fetching interesting things. To avoid this problem is possible to use a sentinel to signal the end of the list (like a special task ID -1) so that clients will pause a bit when this element is encountered. Another solution is to simply sleep a bit if after N calls to the script no processable element was found.
Another alternative is to use a second list just to notify that new tasks are available, using blocking pop. and push.
An alternative implementation is to use a list and a sorted set: the list contains new elements to process, while the sorted set elements that are in the processing state, scored by unix time. Basically there are endless alternatives, the main point is that now with scripting we can fetch an element while adding it somewhere else, with even additional information (the unix time) without issues, so many new patterns are possible in the messaging area of Redis usage.