Tuesday, 10 February 09

Dynamic languages changed programming in many ways. One if this ways is the raise of the associative arrays as mainstream data structure. Associative arrays, usually implemented via hash tables, and also known as hashes or dictionaries, are so convenient that it is hard to find any non trivial program written in Ruby or Python not using this data structure because of the great advantages and flexibility they can bring to programming.

In my opinion the only reason hashes are not used even in every non trivial C program is that ANSI-C does not export an implementation of this data structure. It is hard to understand why there was no effort to bring things such dynamic strings, lists and hash tables as default C components. But this is another story... ;)### The missing feature

Given that most of the times hashes are implemented using hash tables let's focus on this data structure. What's cool about hash tables is that in the average case operations like INSERT, EXISTS?, DELETE are O(1). You can count how many occurrences of every word there are in a large text in an efficient way for example, or implement a fast caching system.

Of course dynamic programming languages export all these great features of hash tables in the hash object implementation. However under decent assumptions hash tables are able to support another useful operation in O(1), that is, GET A RANDOM ELEMENT.### Why is it useful to get a random element?

Sometimes there are problems where you need too much memory or time. We can return to our old example of counting words occurrences in a text, this is simple since there are a number of words finite *and* small, even if the text is pretty large still the memory we use is proportional to the number of words in the language the text is written in.

What about if you want to extract the most common 10 bytes sequences of a huge binary file? In theory the problem is the same as word counting, but the possible sequences of 10 bytes are... 2^10*8 that is 1208925819614629174706176 different elements. Our program running against a very large file will use all the memory we have in little time.### Imperfect solutions

Still it is possible that we don't really need a prefect solution, but one that is good enough... we can trade precision for memory, for example our program can be modified so that when we reach 100000 elements in the hash table we remove a random element from the table... this will not bring a perfect solution and our "top 100 sequences" may change from a run to another run of the program, but it is much bettern than nothing. We can make our algorithm even better. Instead of removing a random element we get three random elements and remove the sequence with the lower count, this will make less probable that we lost some frequent sequence from our hash table.
### Is 'get a random key' O(1)?

So this operation is useful to perform against an hash table, even if most dynamic languages don't export this feature in the interface of the hashes. But is it really an O(1) operation? It is only if the following is true: the ratio between the hash table size and the number of used buckets should always be in a given range (also we need a decent distribution of elements in the table, but this is an assumption of hash tables anyway).

This is always true with most hash tables implementations if you just add elements. The table starts small, once it gets too populated of elements the table is resized so that for example it is three times larger than the number of elements it contains, and so on. It is trivial to see that this way the*number-of-buckets*/*number-of-elements* will always stay under a given range.

But... hash tables support even the DELETE operation. This may affect our operation in a bad way. For example if I fill the hash table with 10000 elements and remove all the elements but one, the ratio between the table size and element will be big, and the 'get a random element' operation will be a O(N) operation where N is max number of elements that a given hash table hold in its life. Still note that with this usage pattern even the memory used by the hash table is going to not be proportional to the number of elements*so in the real world many hash tables implementations will resize the table once the elements are too few compared to the size of the table*.
### The algorithm

In pseudocode the algorithm to get a random element is as simple as this:

Vote or comment this article on reddit!

In my opinion the only reason hashes are not used even in every non trivial C program is that ANSI-C does not export an implementation of this data structure. It is hard to understand why there was no effort to bring things such dynamic strings, lists and hash tables as default C components. But this is another story... ;)

Of course dynamic programming languages export all these great features of hash tables in the hash object implementation. However under decent assumptions hash tables are able to support another useful operation in O(1), that is, GET A RANDOM ELEMENT.

What about if you want to extract the most common 10 bytes sequences of a huge binary file? In theory the problem is the same as word counting, but the possible sequences of 10 bytes are... 2^10*8 that is 1208925819614629174706176 different elements. Our program running against a very large file will use all the memory we have in little time.

This is always true with most hash tables implementations if you just add elements. The table starts small, once it gets too populated of elements the table is resized so that for example it is three times larger than the number of elements it contains, and so on. It is trivial to see that this way the

But... hash tables support even the DELETE operation. This may affect our operation in a bad way. For example if I fill the hash table with 10000 elements and remove all the elements but one, the ratio between the table size and element will be big, and the 'get a random element' operation will be a O(N) operation where N is max number of elements that a given hash table hold in its life. Still note that with this usage pattern even the memory used by the hash table is going to not be proportional to the number of elements

(key) get_random_element(table) if table.size == 0: return nil while true: index = random(0,table.size) if table.buckets[index] != nil return table.buckets[index].key endIt's useful, it's simple, it's O(1) in a lot of implementations, I hope to see this in Ruby and other dynamic languages ASAP!

Vote or comment this article on reddit!