Suppose there are urns, numbered from up to containing in total balls, numbered from up to . Let be the proportion of balls in . Let be the probability that if we pick two numbers in uniformly at random, then the balls with these numbers lie in the same urn. Clearly .
The honest way to estimate follows the definition by repeatedly picking pairs of balls, and counting the number of ‘hits’ where and are in the same urn. For example, suppose that each urn contains equally many balls, so for each . Then our estimate for after samples is , where the number of hits has distribution . As expected, .
An alternative is to choose ball numbers uniformly at random, and for each ball, note down its urn number. For each , let be the random variable counting the number of balls lying in urn . We then estimate by the statistic
By the well known formula
Thus is a biased estimator for . Since has distribution , we have and so
Since is small compared to , a good correction for the systematic error can be made by subtracting . All the same, it might well seem sampling by pairs is the simplest and best approach. But this ignores the variability in our estimates.
Again consider the case where for each . Then, as already seen, has distribution and so . The standard deviation in is therefore . In particular, to improve the accuracy of our estimate by a factor , we must increase the number of samples by .
For , we extend the ‘orthogonality trick’ used earlier to express in terms of the variances of the , to get
where the penultimate line uses .
By the Central Limit Theorem, is approximately normally distributed with mean and variance . While the are not independent, it still seems reasonable to approximate
as the sum of the squares of standard normal variables, i.e. as the distribution with degrees of freedom. This distribution has variance , so we have . Undoing the scaling we get
The standard deviation in is then . Therefore increasing number of samples by a factor of reduces the standard deviation by the same factor. Moreover, we use fewer samples: rather than .
Data from simulation suggest these approximations work well in practice. For instance, with urns and samples, the mean of over repetitions of the sampling experiment was , with standard deviation ; the mean of was , with standard deviation . Increasing the number of samples to reduced the standard deviations for and to and , respectively, agreeing with the prediction above. For more skewed distributions of balls into urns the standard deviation of is still far less than for , but the improvement on increasing the number of samples is less dramatic.