I was asked this interesting question: suppose there are possible MACs (message authentication codes) and that each MAC is hashed to a bit hash. Suppose we have a hash obtained from a MAC, and want to find a MAC with the same hash.

The object of this post is to give a short proof that the average number of MACs we must hash is between and . This contrasts with the hashes that would on average be needed if there was a unique hash for each MAC.

**Update.** The ‘asymptotically correct’ answer is : see end of this post.

Generalize by having MACs, each of which is hashed to one of possible values. We model an idealised hash using balls and urns: take urns, representing the possible values of the hash, and suppose that each ball is thrown into an urn chosen uniformly at random. The probability that after throws, urn contains exactly balls is

For large we have . If is small compared to then and

.

It follows that the probability of seeing exactly balls in urn is approximately . (This is the Poisson distribution with parameter .) Hence, if is small compared to , then an excellent approximation for the probability that a given urn contains exactly balls is .

Now, doing subtly the wrong thing, suppose that urn is non-empty, and we draw balls at random until we find a ball in urn , throwing out balls that turn out to be in different urns. If urn has balls then on average balls must be drawn. (This is easily proved by induction, using that if is the expected number of drawings with balls then and, conditioning on the first ball drawn, .) Conditioned on being non-empty, the probability that urn has balls is . Hence the expected number of balls drawn is

If each urn has exactly one ball, then on average drawings are needed. So we find a ball in urn about faster than in the uniform case.

However, this does not capture the original problem. A hash from a MAC chosen uniformly at random is disproportionately likely to be the common hash of several different MACs. Correspondingly, in the ball-and-urns model, the probability of choosing an urn should be proportional to the number of balls in it. Choosing a non-empty urn at random gives too much weight to the urns with only one ball. So is only an upper bound on the amount of work required.

For a lower bound, suppose that the original ball is always chosen from the urn containing the most balls. Lemma 6.20 in these notes on hashing shows that the probability that there is a urn containing at least balls is at most . This probability is small for (provided is sufficiently large). So an asymptotic lower bound for the amount of work required is .

For the original problem we have and provided . So with probability at least , no urn contains more than balls. The expected numbers of balls drawn is therefore at least

**Update.** I posted the question to MathOverflow, and it got an unexpectedly elegant answer from Douglas Zare. When we choose a ball number uniformly at random, it is times more likely to be in an urn containing balls. In the Poisson model, this event has probability . So if is the number of balls in the urn of the chosen ball, not counting the chosen ball, then

. Hence is distributed as a Poisson random variable with mean , and the expected number of balls drawn is , which is shown to be in Zare’s post.

Since , the hash collisions mean an attacker saves about bits of work.

Thank you! This really clears it up for me. I feel a bit sorry I can’t give you lots of reputation points.