A missing feature in most dynamic languages: take a random key from an hash table.

Tuesday, 10 February 09

Dynamic languages changed programming in many ways. One if this ways is the raise of the associative arrays as mainstream data structure. Associative arrays, usually implemented via hash tables, and also known as hashes or dictionaries, are so convenient that it is hard to find any non trivial program written in Ruby or Python not using this data structure because of the great advantages and flexibility they can bring to programming.

In my opinion the only reason hashes are not used even in every non trivial C program is that ANSI-C does not export an implementation of this data structure. It is hard to understand why there was no effort to bring things such dynamic strings, lists and hash tables as default C components. But this is another story... ;)

The missing feature

Given that most of the times hashes are implemented using hash tables let's focus on this data structure. What's cool about hash tables is that in the average case operations like INSERT, EXISTS?, DELETE are O(1). You can count how many occurrences of every word there are in a large text in an efficient way for example, or implement a fast caching system.

Of course dynamic programming languages export all these great features of hash tables in the hash object implementation. However under decent assumptions hash tables are able to support another useful operation in O(1), that is, GET A RANDOM ELEMENT.

Why is it useful to get a random element?

Sometimes there are problems where you need too much memory or time. We can return to our old example of counting words occurrences in a text, this is simple since there are a number of words finite and small, even if the text is pretty large still the memory we use is proportional to the number of words in the language the text is written in.

What about if you want to extract the most common 10 bytes sequences of a huge binary file? In theory the problem is the same as word counting, but the possible sequences of 10 bytes are... 2^10*8 that is 1208925819614629174706176 different elements. Our program running against a very large file will use all the memory we have in little time.

Imperfect solutions

Still it is possible that we don't really need a prefect solution, but one that is good enough... we can trade precision for memory, for example our program can be modified so that when we reach 100000 elements in the hash table we remove a random element from the table... this will not bring a perfect solution and our "top 100 sequences" may change from a run to another run of the program, but it is much bettern than nothing. We can make our algorithm even better. Instead of removing a random element we get three random elements and remove the sequence with the lower count, this will make less probable that we lost some frequent sequence from our hash table.

Is 'get a random key' O(1)?

So this operation is useful to perform against an hash table, even if most dynamic languages don't export this feature in the interface of the hashes. But is it really an O(1) operation? It is only if the following is true: the ratio between the hash table size and the number of used buckets should always be in a given range (also we need a decent distribution of elements in the table, but this is an assumption of hash tables anyway).

This is always true with most hash tables implementations if you just add elements. The table starts small, once it gets too populated of elements the table is resized so that for example it is three times larger than the number of elements it contains, and so on. It is trivial to see that this way the number-of-buckets/number-of-elements will always stay under a given range.

But... hash tables support even the DELETE operation. This may affect our operation in a bad way. For example if I fill the hash table with 10000 elements and remove all the elements but one, the ratio between the table size and element will be big, and the 'get a random element' operation will be a O(N) operation where N is max number of elements that a given hash table hold in its life. Still note that with this usage pattern even the memory used by the hash table is going to not be proportional to the number of elements so in the real world many hash tables implementations will resize the table once the elements are too few compared to the size of the table.

The algorithm

In pseudocode the algorithm to get a random element is as simple as this:

(key) get_random_element(table)
    if table.size == 0:
        return nil
    while true:
        index = random(0,table.size)
        if table.buckets[index] != nil
            return table.buckets[index].key
end

It's useful, it's simple, it's O(1) in a lot of implementations, I hope to see this in Ruby and other dynamic languages ASAP!

Vote or comment this article on reddit!

27806 views^*

Posted at 15:37:24 | permalink | 13 comments | print

Commenti

web development writes:

10 Feb 09, 15:40:24

I would use the rand() function in mysql to get a random row from the table.

web development writes:

10 Feb 09, 15:47:10

@web development: I think you missed the point, but from the URL you put in the comment I think it's just spam ;) Sorry if I removed the URL.

Dave Hinton writes:

10 Feb 09, 16:27:41

If I was using this feature, I would expect each key to be equally likely to be returned. But this is not the case with your code.

e.g. If the hash has 3 buckets and 2 keys, "a" and "b", with "a" stored in bucket 0 and "b" stored in bucket 1, then "a" will be twice as likely to be returned as "b".

David Teller writes:

10 Feb 09, 16:42:34

Mmmmhhh.... with OCaml Batteries Included, if you have a hashtable called t, getting a random key would be

Random.choice (keys t)

or, if you want to get a random ordering of keys,

Random.shuffle (keys t)

Of course, this is O(n) rather than O(1). But the random choice is fair among keys (assuming that OCaml's PRNG is fair, of course).

Lorenzo Bolognini writes:

10 Feb 09, 16:43:43

Missing but easy enough in Python

import random

adict = dict(a=1,b=2,c=3)
keys = adict.keys()
item = keys.pop(random.randrange(0, len(keys)))
print adict[item]

Amit Patel writes:

10 Feb 09, 17:14:12

I agree with Dave Hinton that I'd expect random key to be fair, and the O(1) solution presented isn't. Perhaps it can be given a different name.

Lorenzo Bolognini and David Teller give O(N) solutions but both are fair.

Although I agree this operation may be useful in some situations, I didn't quite follow the example in the blog post. If you want to get the 10 most common sequences, and you pick a random item to drop, you might end up dropping one of the common ones. Instead of dropping them one at a time, drop them all at once. When the hash table size reaches M, copy the K top elements to a new hash table (this takes M*O(K) time).

I think the general point is not about *dynamic* languages but about implementations being exposed. Non-dynamic high level languages like OCaml, Haskell, Java, etc., also hide the implementations and make it hard to do what you want.

Inheritance is often the way to add operations but you might also look at the way C++'s STL is structured. For example with the priority queue, all the data is exposed, so you can perform additional operations on it. I've used this successfully in an application where I needed a priority queue but also needed a search/reprioritize operation. Had I been using a standard closed implementation that hides all the underlying data, I wouldn't be able to add the operation I needed, and thus I wouldn't be able to reuse the existing priority queue code.

antirez writes:

10 Feb 09, 17:14:15

@Dave, @Amit: I changed the algorithm in the pseudo code, now every key will be returned with the same probability.

noah healy writes:

10 Feb 09, 21:40:40

What about perl's each function?

jimrandomh writes:

10 Feb 09, 23:25:35

If you need a data structure with unusual extra operations added, you probably need to implement that data structure yourself. This should not be surprising. I don't think I've ever seen a hash table implementation which included get-random-element, nor have I ever had occasion to want one, so this qualifies as an unusual operation. So why are dynamic languages singled out? C++'s std::hash_map doesn't have this operation either.

wildwood writes:

11 Feb 09, 01:27:08

Is your code assuming at most one key per bucket? This doesn't seem to work if you have more keys than buckets in your hash.

Ricardo Bánffy writes:

11 Feb 09, 01:44:51

Lorenzo,

There is an easier way:

import random
d = {'a':1, 'b':2, 'c':3}
k = random.choice(d.keys())
print d.pop(k)

Paddy3118 writes:

28 Feb 09, 10:37:15

In Python you can iterate over the keys of a dict in hash order using .iterkeys() and .iteritems .

Why not pop the last three items off the hash into a list, examine then discard one item, then add the other two back into the dict. That way you don't have to collect all the keys to the dict, but you do rely on the hash order of the dict keys rather than a random number generator.

- Paddy

Willp writes:

17 Apr 11, 16:43:22

@Ricardo @Lorenzo
The python keys() method is O(N) not O(1), unfortunately.

@Paddy3118 python hash order is extremely non-random for certain keys, i.e. small integer values from 0 to 6 come out of a dict completely sorted, so it's not reliable.

Try:
d = dict()
for x in range(6): d[x]=1
print d.keys()
# same result here:
print list(d.iterkeys())

The output is: [0, 1, 2, 3, 4, 5]

This was an intentional design decision for making super-efficient python dictionaries when they are small (up to 8 elements I believe), becuase dicts promise O(1) access, but make no promises about key ordering.

To get an O(1) random key out of a dict would require the dict to provide access to elements by their hash bucket position. You could do this by subclassing and maintaining a list[] of keys (updated for all mutating operations) and use that extra list to randomly choose from.

Having this as a native method in Redis is pretty cool.

Comments closed

antirez weblog