## Variance of an estimate for the second moment

Suppose there are $u$ urns, numbered from $1$ up to $u$, containing in total $b$ balls, numbered from $1$ up to $b$. Let $p_i$ be the proportion of balls in urn $i$. Let $s$ be the probability that if we pick two numbers in $\{1,\ldots,b\}$ uniformly at random, then the balls with these numbers lie in the same urn. Clearly $s = \sum_{i=1}^u p_i^2$.

The honest way to estimate $s$ follows the definition: repeatedly pick a pair of ball numbers uniformly at random, and count the number of ‘hits’ where the two chosen balls lie in the same urn. For example, suppose that each urn contains equally many balls, so $p_i = 1/u$ for each $i$. Then our estimate for $s = 1/u$ after $N$ samples is $H/N$, where the number of hits $H$ has distribution $\mathrm{Bin}(N,1/u)$. As expected, $\mathbb{E}[H/N] = 1/u = s$.
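As a concrete illustration, the pair-sampling estimator might be sketched as follows. The encoding of the urns as a list `urn_of_ball`, mapping each ball number to its urn number, and the function name are my own choices, not part of the setup above.

```python
import random

def pair_estimate(urn_of_ball, N, rng=random):
    """Estimate s by drawing N pairs of ball numbers uniformly at
    random (with replacement) and counting same-urn hits."""
    b = len(urn_of_ball)          # total number of balls
    hits = sum(
        urn_of_ball[rng.randrange(b)] == urn_of_ball[rng.randrange(b)]
        for _ in range(N)
    )
    return hits / N

# Example: u = 10 urns, each holding 50 balls, so s = 1/u = 0.1.
urn_of_ball = [i % 10 for i in range(500)]
est = pair_estimate(urn_of_ball, N=10_000)
```

With these parameters the standard deviation of the estimate is about $1/\sqrt{uN} \approx 0.003$, so `est` should land close to $0.1$.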

An alternative is to choose $N$ ball numbers uniformly at random, and for each ball, note down its urn number. For each $i \in \{1,\ldots, u\}$, let $B_i$ be the random variable counting the number of balls lying in urn $i$. We then estimate $s$ by the statistic

$S = \sum_{i=1}^u \bigl( \frac{B_i}{N} \bigr)^2$.
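In code, this alternative estimator might look as follows; as before, the list `urn_of_ball` mapping ball number to urn number and the function name are illustrative choices of mine.

```python
import random
from collections import Counter

def urn_estimate(urn_of_ball, N, rng=random):
    """Estimate s by drawing N ball numbers uniformly at random and
    computing S = sum_i (B_i / N)^2 from the observed urn counts B_i."""
    b = len(urn_of_ball)
    counts = Counter(urn_of_ball[rng.randrange(b)] for _ in range(N))
    return sum((B / N) ** 2 for B in counts.values())

# Example: u = 10 urns, each holding 50 balls, so s = 1/u = 0.1.
urn_of_ball = [i % 10 for i in range(500)]
est = urn_estimate(urn_of_ball, N=10_000)
```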

By the well-known formula

$\mathrm{Var} X = \mathbb{E}[(X -\mathbb{E}X)^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2,$

we get

\begin{aligned} \mathbb{E}[S] &= \frac{1}{N^2} \Bigl( \mathbb{E}\bigl[ \sum_{i=1}^u (B_i-\mathbb{E}[B_i])^2\bigr] + \sum_{i=1}^u \mathbb{E}[B_i]^2 \Bigr) \\ &= \frac{1}{N^2} \sum_{i=1}^u \mathbb{E}[(B_i - \mathbb{E}[B_i])^2] + s \\ &= \frac{1}{N^2} \sum_{i=1}^u \mathrm{Var}\, B_i + s, \end{aligned}

where the second line uses $\mathbb{E}[B_i] = Np_i$, so that $\frac{1}{N^2} \sum_{i=1}^u \mathbb{E}[B_i]^2 = \sum_{i=1}^u p_i^2 = s$.

Thus $S$ is a biased estimator for $s$. Since $B_i$ has distribution $\mathrm{Bin}(N, p_i)$, we have $\mathrm{Var}(B_i) = Np_i(1-p_i)$ and so

$\mathbb{E}[S] = s + \frac{1}{N^2} \sum_{i=1}^u Np_i(1-p_i) = s + \frac{1}{N} - \frac{s}{N},$

using $\sum_{i=1}^u p_i = 1$ and $\sum_{i=1}^u p_i^2 = s$.

Since $s$ is small compared to $1$, a good correction for the systematic error is to subtract $1/N$. All the same, it might well seem that sampling by pairs is the simpler and better approach. But this ignores the variability of the estimates.
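A quick simulation can check the bias formula and the $1/N$ correction. The parameters below are chosen only for speed; the key shortcut is that with equal urns, drawing a ball uniformly at random is the same as drawing an urn number uniformly at random.

```python
import random
from collections import Counter

u, N, reps = 20, 200, 2000
rng = random.Random(1)
s = 1 / u                         # equal urns: p_i = 1/u for every i

def one_S():
    # With equal urns a uniform ball lies in a uniform urn, so we can
    # sample urn numbers directly and tally the counts B_i.
    counts = Counter(rng.randrange(u) for _ in range(N))
    return sum((B / N) ** 2 for B in counts.values())

mean_S = sum(one_S() for _ in range(reps)) / reps
corrected = mean_S - 1 / N        # subtract the systematic error
# mean_S should be close to s + 1/N - s/N, and corrected close to s.
```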

Again consider the case where $p_i = 1/u$ for each $i$. Then, as already seen, $H$ has distribution $\mathrm{Bin}(N,1/u)$ and so $\mathrm{Var}[H/N] = \mathrm{Var}[H]/N^2 = N(1/u)(1-1/u)/N^2 \approx 1/(uN)$. The standard deviation of $H/N$ is therefore about $1/\sqrt{uN}$. In particular, to improve the accuracy of our estimate by a factor of $f$, we must increase the number of samples by a factor of $f^2$.

For $S$, we extend the ‘orthogonality trick’ used earlier to express $\mathbb{E}[S]$ in terms of the variances of the $B_i$, to get

\begin{aligned} S &= \sum_{i=1}^u \frac{B_i^2}{N^2} \\ &= \sum_{i=1}^u \Bigl( \bigl( \frac{B_i}{N} - \frac{1}{u} \bigr)^2 + \frac{2B_i}{Nu} - \frac{1}{u^2} \Bigr) \\ &= \sum_{i=1}^u \bigl( \frac{B_i}{N} - \frac{1}{u} \bigr)^2 + \frac{2}{u} - \frac{1}{u} \\ &= \sum_{i=1}^u \bigl( \frac{B_i}{N} - \frac{1}{u} \bigr)^2 + \frac{1}{u} \end{aligned}
where the penultimate line uses $\sum_{i=1}^u B_i = N$.
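The rearrangement is easy to sanity-check numerically: for any counts $B_1, \ldots, B_u$ summing to $N$, we should find $\sum_i (B_i/N)^2 = \sum_i (B_i/N - 1/u)^2 + 1/u$. The sketch below tests this on a random composition of $N$ into $u$ parts.

```python
import random

u, N = 7, 50
rng = random.Random(3)

# Build a random composition of N into u nonnegative parts B_1..B_u.
cuts = sorted(rng.randrange(N + 1) for _ in range(u - 1))
B = [b - a for a, b in zip([0] + cuts, cuts + [N])]
assert sum(B) == N

lhs = sum((Bi / N) ** 2 for Bi in B)
rhs = sum((Bi / N - 1 / u) ** 2 for Bi in B) + 1 / u
assert abs(lhs - rhs) < 1e-12   # the identity holds up to rounding
```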

By the Central Limit Theorem, $\sqrt{uN}\bigl( \frac{B_i}{N} - \frac{1}{u} \bigr)$ is approximately normally distributed with mean $0$ and variance $uN \mathrm{Var}[B_i/N] \approx 1$. While the $B_i$ are not independent, it still seems reasonable to approximate

$\hat{S} = uN \sum_{i=1}^u \bigl( \frac{B_i}{N} - \frac{1}{u} \bigr)^2$

as the sum of the squares of $u$ standard normal variables, i.e. as the $\chi^2$ distribution with $u$ degrees of freedom. This distribution has variance $2u$, so we have $\mathrm{Var}[\hat{S}] \approx 2u$. Undoing the scaling we get

$\mathrm{Var}[S] \approx \frac{2u}{u^2N^2} = \frac{2}{uN^2}$.

The standard deviation of $S$ is then $\frac{1}{N} \sqrt{\frac{2}{u}}$. Therefore increasing the number of samples by a factor of $f$ reduces the standard deviation by the same factor. Moreover, since each sample looks at one ball rather than two, we use $N$ balls rather than $2N$.

### Simulations

Data from simulations suggest these approximations work well in practice. For instance, with $u = 100$ urns and $N = 1000$ samples, the mean of $H/N$ over $1000$ repetitions of the sampling experiment was $0.01014$, with standard deviation $3.1354 \times 10^{-3}$; the mean of $S$ was $0.01099 \approx 1/u + 1/N$, with standard deviation $1.4101 \times 10^{-4}$. Increasing the number of samples to $10000$ reduced the standard deviations of $H/N$ and $S$ to $1.0114 \times 10^{-3}$ and $1.4176 \times 10^{-5}$, respectively, agreeing with the predictions above. For more skewed distributions of balls into urns the standard deviation of $S$ is still far less than that of $H/N$, but the improvement on increasing the number of samples is less dramatic.
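The experiment is easy to reproduce. Here is a sketch with the same $u$ and $N$ but fewer repetitions so that it runs quickly; it again exploits the fact that with equal urns a uniform ball determines a uniform urn number, and all names are illustrative.

```python
import random
import statistics
from collections import Counter

u, N, reps = 100, 1000, 300
rng = random.Random(0)

def H_over_N():
    # A pair 'hits' exactly when two independent uniform urn numbers coincide.
    return sum(rng.randrange(u) == rng.randrange(u) for _ in range(N)) / N

def S():
    counts = Counter(rng.randrange(u) for _ in range(N))
    return sum((B / N) ** 2 for B in counts.values())

sd_H = statistics.stdev(H_over_N() for _ in range(reps))
sd_S = statistics.stdev(S() for _ in range(reps))
# Compare with the predictions 1/sqrt(u*N) ~ 3.2e-3 and
# (1/N)*sqrt(2/u) ~ 1.4e-4.
```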