I did some math, and I think this is the probability of two random SHA1 hashes having a Hamming distance of D:

P(D) = 160!/(D!*(160-D)!) * 0.5^160
Here is why. Imagine we want to compute the probability of a Hamming distance of 2. I want the first bit to differ, which happens with probability .5; the second bit to differ too, again .5; and all the other bits to be equal, which is .5 each as well. These are independent events, so we multiply them together: any specific arrangement of equal/not-equal bits has a probability of .5^160.
But... a Hamming distance of 2 requires that 2 bits differ and 158 are equal, in any order. How many ways are there to arrange 160 items, 2 black and 158 white? 160!/(158!*2!), so I actually have to multiply the probability of a single arrangement by the number of possible arrangements.
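As a sanity check, the counting argument above can be written as a few lines of Python (the 160-bit length and the arrangement count are from the reasoning above; the function name is mine):

```python
from math import comb

def hd_probability(d, bits=160):
    """Probability that two random bits-long strings differ in exactly d bits:
    comb(bits, d) arrangements of differing bits, each with probability 0.5**bits."""
    return comb(bits, d) / 2**bits

# The distance-2 case worked out above: 160!/(158!*2!) arrangements,
# each occurring with probability .5^160.
print(hd_probability(2))
```

A quick sanity check is that summing `hd_probability(d)` for d from 0 to 160 gives 1, since every pair of hashes has exactly one Hamming distance.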
Ok, what are the practical implications of this? With an optimized C program you can compute around 3.25 million SHA1 hashes per second per core (this is what I got with my implementation). So if you have a server farm with 250 computers that happen to be quad cores, you need the following number of hours to reach a given Hamming distance:
6 hours for an h.d. of 33
26 hours for an h.d. of 32
108 hours for an h.d. of 31
It looks like university clusters can get us a solution with an HD of 32, or 31 with a bit of luck.
People with just ten quad-core boxes can expect to reach an HD of 35 in 12 hours.
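These timings can be reproduced with a short Python sketch, under the simple model that each hash is an independent trial succeeding with probability P(D), so the expected number of trials is 1/P(D) (this model, the function names, and the rounding are mine; the per-core rate and machine counts are the ones quoted above):

```python
from math import comb

def expected_hours(d, hashes_per_sec, bits=160):
    """Expected hours until one hash lands at Hamming distance exactly d,
    treating each hash as an independent trial with success
    probability comb(bits, d) / 2**bits."""
    p = comb(bits, d) / 2**bits
    return 1.0 / (p * hashes_per_sec) / 3600.0

farm = 250 * 4 * 3.25e6   # 250 quad-core boxes, ~3.25M SHA1 hashes/sec/core
small = 10 * 4 * 3.25e6   # ten quad-core boxes

for d in (33, 32, 31):
    print(d, round(expected_hours(d, farm)))
print(35, round(expected_hours(35, small)))
```

Running this gives numbers in the same ballpark as the hours listed above, which suggests the figures come from 1/P(D) divided by the aggregate hash rate.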
I would love to know if my math is correct, I'm not particularly strong in probability theory... but it looks correct to me.