## How much do hash collisions help an attacker?

I was asked this interesting question: suppose there are $2^{256}$ possible MACs (message authentication codes) and that each MAC is hashed to a $256$ bit hash. Suppose we have a hash obtained from a MAC, and want to find a MAC with the same hash.

The object of this post is to give a short proof that the average number of MACs we must hash is between $0.0163 \times 2^{256} \approx 2^{250}$ and $0.418 \times 2^{256}$. This contrasts with the $0.5 \times 2^{256}$ hashes that would on average be needed if there was a unique hash for each MAC.

Update. The ‘asymptotically correct’ answer is $2^{256} / \mathrm{e} = 0.368 \times 2^{256}$: see end of this post.

Generalize by having $N$ MACs, each of which is hashed to one of $N$ possible values. We model an idealised hash using balls and urns: take $N$ urns, representing the possible values of the hash, and $N$ balls, representing the MACs, and suppose that each ball is thrown into an urn chosen uniformly at random. The probability that after $N$ throws, urn $1$ contains exactly $r$ balls is

$\binom{N}{r} \bigl( 1 - \frac{1}{N} \bigr)^{N-r} \frac{1}{N^r}.$

For large $N$ we have $(1-1/N)^N \approx \mathrm{e}^{-1}$. If $r$ is small compared to $N$ then $(1-1/N)^{-r} \approx 1$ and

$\binom{N}{r} \frac{1}{N^r} = \frac{N(N-1) \ldots (N-r+1)}{N^r} \frac{1}{r!} \approx \frac{1}{r!}$.

It follows that, for $r$ small compared to $N$, an excellent approximation for the probability that a given urn contains exactly $r$ balls is $\mathrm{e}^{-1}/r!$. (This is the Poisson distribution with parameter $1$.)
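The Poisson approximation is easy to check empirically. Here is a minimal simulation, with $N = 10^5$ as a small stand-in for $2^{256}$ (the urn counts and all constants below are illustrative choices, not from the argument itself):

```python
import math
import random

random.seed(0)

N = 100_000   # urns and balls; a small stand-in for 2^256
TRIALS = 20   # independent experiments to smooth the estimate

counts = {r: 0 for r in range(6)}
for _ in range(TRIALS):
    urns = [0] * N
    # Throw each of the N balls into a uniformly random urn.
    for _ in range(N):
        urns[random.randrange(N)] += 1
    # Tally how many urns ended up with r balls, for small r.
    for c in urns:
        if c < 6:
            counts[c] += 1

for r in range(6):
    empirical = counts[r] / (N * TRIALS)
    poisson = math.exp(-1) / math.factorial(r)
    print(f"r = {r}: empirical {empirical:.4f}, e^(-1)/r! = {poisson:.4f}")
```

The empirical frequencies match $\mathrm{e}^{-1}/r!$ to about three decimal places.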

Now, doing subtly the wrong thing, suppose that urn $1$ is non-empty, and we draw balls at random until we find a ball in urn $1$, throwing out balls that turn out to be in different urns. If urn $1$ has $r$ balls then on average $(N+1)/(r+1)$ balls must be drawn. (This is easily proved by induction: let $e_N$ be the expected number of drawings when $N$ balls remain, $r$ of them in urn $1$. Then $e_N = 1$ when $N = r$ and, conditioning on the first ball drawn, $e_N = r/N + (1-r/N)(1 + e_{N-1}) = 1 + (1-r/N) e_{N-1}$.) Conditioned on being non-empty, the probability that urn $1$ has exactly $r$ balls is $\frac{\mathrm{e}^{-1}/r!}{1-\mathrm{e}^{-1}} = \frac{1}{(\mathrm{e}-1)r!}$. Hence the expected number of balls drawn is

\begin{aligned} \sum_{r=1}^\infty \frac{1}{\mathrm{e}-1} \frac{1}{r!} \frac{N+1}{r+1} &= \frac{N+1}{\mathrm{e}-1} \sum_{s=2}^\infty \frac{1}{s!} \\ &= \frac{N+1}{\mathrm{e}-1}\bigl( \mathrm{e} - 1 - 1 \bigr) \\ &= (N+1)\frac{\mathrm{e}-2}{\mathrm{e}-1} \\ &\approx 0.418N. \end{aligned}
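Both the $(N+1)/(r+1)$ formula and the resulting $0.418N$ estimate can be checked by direct simulation. A sketch with a modest $N$ (the constant $(\mathrm{e}-2)/(\mathrm{e}-1)$ does not depend on $N$; only the Poisson approximation improves as $N$ grows):

```python
import math
import random

random.seed(1)

N = 2000
TRIALS = 3000
total_draws = 0
successes = 0

for _ in range(TRIALS):
    # Throw N balls into N urns; record which balls land in urn 0.
    in_urn0 = [random.randrange(N) == 0 for _ in range(N)]
    if sum(in_urn0) == 0:
        continue  # condition on urn 0 being non-empty
    # Draw balls in a random order until one from urn 0 appears.
    order = list(range(N))
    random.shuffle(order)
    draws = next(i + 1 for i, b in enumerate(order) if in_urn0[b])
    total_draws += draws
    successes += 1

average = total_draws / successes
predicted = (N + 1) * (math.e - 2) / (math.e - 1)
print(f"average draws {average:.1f}, (N+1)(e-2)/(e-1) = {predicted:.1f}")
```

With these parameters the simulated average lands close to the predicted $0.418(N+1) \approx 836$.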

If each urn has exactly one ball, then on average $(N+1)/2 \approx 0.5N$ drawings are needed. So we need about $16\%$ fewer drawings than in the uniform case.

However, this does not capture the original problem. A hash from a MAC chosen uniformly at random is disproportionately likely to be the common hash of several different MACs. Correspondingly, in the ball-and-urns model, the probability of choosing an urn should be proportional to the number of balls in it. Choosing a non-empty urn at random gives too much weight to the urns with only one ball. So $0.418N$ is only an upper bound on the amount of work required.

For a lower bound, suppose that the original ball is always chosen from the urn containing the most balls. Lemma 6.20 in these notes on hashing shows that the probability that there is an urn containing at least $t$ balls is at most $N\mathrm{e}^t/t^t$. This probability is small for $t$ a little larger than $\log N / \log \log N$ (provided $N$ is sufficiently large). So an asymptotic lower bound for the amount of work required is $N\log \log N/\log N$.

For the original problem we have $N = 2^{256}$ and $2^{256}\mathrm{e}^t/t^t < 1/1000$ provided $t \ge 60$. So with probability at least $999/1000$, no urn contains more than $60$ balls. The expected number of balls drawn is therefore at least

$\frac{999}{1000} \times \frac{2^{256} + 1}{61} \ge 0.0163 \times 2^{256}.$
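These numerical claims are quick to verify by working in logarithms (the threshold $t = 60$ and the divisor $61$ are from the argument above):

```python
import math

def log2_tail_bound(t, log2_N=256):
    """log2 of the bound N * e^t / t^t on the probability that
    some urn holds at least t balls."""
    return log2_N + t * math.log2(math.e) - t * math.log2(t)

# t = 60 pushes the bound below 1/1000; t = 59 does not.
print(log2_tail_bound(59), log2_tail_bound(60), -math.log2(1000))

# The resulting lower bound on the expected number of balls drawn.
lower = (999 / 1000) * (2**256 + 1) / 61
print(lower >= 0.0163 * 2**256)
```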

Update. I posted the question to MathOverflow, and it got an unexpectedly elegant answer from Douglas Zare. When we choose a ball uniformly at random, it is $k$ times more likely to lie in a given urn containing $k$ balls than in a given urn containing one ball. In the Poisson model, the probability that the chosen ball lies in an urn with exactly $k$ balls is therefore $k\mathrm{e}^{-1}/k! = \mathrm{e}^{-1}/(k-1)!$. So if $X$ is the number of balls in the urn of the chosen ball, not counting the chosen ball itself, then
$\mathbb{P}[X=j] = \mathrm{e}^{-1}/j!$. Hence $X$ is distributed as a Poisson random variable with mean $1$, and the expected number of balls drawn is $(N+1)\mathbb{E}[1/(X+2)]$, which is shown to be $N/\mathrm{e}$ (asymptotically) in Zare’s post.
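The identity behind the $N/\mathrm{e}$ answer is $\mathbb{E}[1/(X+2)] = 1/\mathrm{e}$ for $X$ Poisson with mean $1$, which combines with the $(N+1)/(r+1)$ formula (taking $r = X+1$). The defining series converges fast enough to check numerically:

```python
import math

# E[1/(X+2)] for X ~ Poisson(1): sum over j of e^(-1)/j! * 1/(j+2).
expectation = sum(math.exp(-1) / (math.factorial(j) * (j + 2))
                  for j in range(40))
print(expectation, 1 / math.e)  # the two agree to machine precision
```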

Since $\log_2 \mathrm{e} = 1/\log 2 \approx 1.443$, hash collisions save the attacker about $1.44$ bits of work, relative to the full $2^{256}$ hashes.