Variance of an estimate for the second moment

Suppose there are u urns, numbered from 1 up to u containing in total b balls, numbered from 1 up to b. Let p_i be the proportion of balls in i. Let s be the probability that if we pick two numbers in \{1,\ldots,b\} uniformly at random, then the balls with these numbers lie in the same urn. Clearly s = \sum_{i=1}^u p_i^2.

The honest way to estimate s follows the definition by repeatedly picking pairs (b,b') of balls, and counting the number of ‘hits’ where b and b' are in the same urn. For example, suppose that each urn contains equally many balls, so p_i = 1/u for each i. Then our estimate for s = 1/u after N samples is H/N, where the number of hits H has distribution \mathrm{Bin}(N,1/u). As expected, \mathbb{E}[H/N] = 1/u = s.

An alternative is to choose N ball numbers uniformly at random, and for each ball, note down its urn number. For each i \in \{1,\ldots, u\}, let B_i be the random variable counting the number of balls lying in urn i. We then estimate s by the statistic

S = \sum_{i=1}^u \bigl( \frac{B_i}{N} \bigr)^2.

By the well known formula

\mathrm{Var} X = \mathbb{E}[(X -\mathbb{E}X)^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2,

we get

\begin{aligned} \mathbb{E}[S] &= \frac{1}{N^2} \mathbb{E}\bigl[ \sum_{i=1}^u (B_i-\mathbb{E}[B_i])^2\bigr] + \sum_{i=1}^u \mathbb{E}[B_i]^2) \\ &= \frac{1}{N^2} \sum_{i=1}^u \mathbb{E}[(B_i - \mathbb{E}[B_i])^2]  + s\\ &= \frac{1}{N^2} \sum_{i=1}^u \mathrm{Var} B_i + s. \end{aligned}

Thus S is a biased estimator for s. Since B_i has distribution \mathrm{Bin}(N, p_i), we have \mathrm{Var}(B_i) = Np_i(1-p_i) and so

\mathbb{E}[S] = p + \frac{1}{N^2} \sum_{i=1}^U Np_i(1-p_i) = s + \frac{1}{N} - \frac{s}{N}.

Since s is small compared to 1, a good correction for the systematic error can be made by subtracting 1/N. All the same, it might well seem sampling by pairs is the simplest and best approach. But this ignores the variability in our estimates.

Again consider the case where p_i = 1/u for each i. Then, as already seen, H has distribution \mathrm{Bin}(N,1/u) and so \mathrm{Var}[H/N] = \mathrm{Var}[H]/N^2 = N(1/u)(1-1/u)/N^2 \approx 1/Nu. The standard deviation in H/N is therefore 1/\sqrt{Nu}. In particular, to improve the accuracy of our estimate by a factor f, we must increase the number of samples by f^2.

For S, we extend the ‘orthogonality trick’ used earlier to express \mathbb{E}[S] in terms of the variances of the B_i, to get

\begin{aligned} S &= \sum_{i=1}^u \frac{B_i^2}{N^2} \\ &= \sum_{i=1}^u \Bigl( \bigl( \frac{B_i}{N} - \frac{1}{u} \bigr)^2 + \frac{2B_i}{Nu} - \frac{1}{u^2} \Bigr) \\ &= \sum_{i=1}^u \bigl( \frac{B_i}{N} - \frac{1}{u} \bigr)^2 + \frac{2}{u} - \frac{1}{u} \\ &= \sum_{i=1}^u \bigl( \frac{B_i}{N} - \frac{1}{u} \bigr)^2 - \frac{1}{u} \end{aligned}
where the penultimate line uses \sum_{i=1}^u B_i = N.

By the Central Limit Theorem, \sqrt{uN}\bigl( \frac{B_i}{N} - \frac{1}{u} \bigr) is approximately normally distributed with mean 0 and variance uN \mathrm{Var}[B_i/N] \approx 1. While the B_i are not independent, it still seems reasonable to approximate

\hat{S} = uN \sum_{i=1}^u \bigl(  \frac{B_i}{N} - \frac{1}{u} \bigr)^2

as the sum of the squares of u standard normal variables, i.e. as the \chi^2 distribution with u degrees of freedom. This distribution has variance 2u, so we have \mathrm{Var}[\hat{S}] \approx 2u. Undoing the scaling we get

\mathrm{Var}[S] \approx \frac{2u}{u^2N^2} = \frac{2}{uN^2} .

The standard deviation in S is then \frac{1}{N} \sqrt{\frac{2}{u}}. Therefore increasing number of samples by a factor of f reduces the standard deviation by the same factor. Moreover, we use fewer samples: N rather than 2N.


Data from simulation suggest these approximations work well in practice. For instance, with u = 100 urns and N = 1000 samples, the mean of H/N over 1000 repetitions of the sampling experiment was 0.01014, with standard deviation 3.1354 \times 10^{-3}; the mean of S/N was 0.01099 \approx 1/u + 1/N, with standard deviation 1.4101 \times 10^{-4}. Increasing the number of samples to 10000 reduced the standard deviations for H/N and S/N to 1.0114 \times 10^{-3} and 1.4176 \times 10^{-5}, respectively, agreeing with the prediction above. For more skewed distributions of balls into urns the standard deviation of S/N is still far less than for H/N, but the improvement on increasing the number of samples is less dramatic.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: