The purpose of this post is to give the simplest way I’ve found to get parallel execution to work in ghc 8.10.1.

#### The unhelpful cabal

I’ll begin by getting rid of the trivial difficulties. I’m obliged to use Stack (and the related cabal system), as it’s the only way I can find to get `ghc` onto recent versions of Mac OS X. I am unable to make `stack` work in any simple way. Here is the complicated way I’ve settled on. For each project:

1. Make a directory with all the `.hs` source files.
2. Create a `.cabal` file as in the example below.
3. Run `stack init`.
4. Edit `stack.yaml` so that the resolver field reads `resolver: lts-18.3`.
5. Run `stack ghci`.
6. Build with `stack build`.
7. Run the executable with `stack exec Main <arguments>`.
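After step 4, a minimal `stack.yaml` looks something like this (my sketch: `stack init` also writes a good deal of explanatory commentary that I have omitted):

```
resolver: lts-18.3
packages:
- .
```
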
For (1) use a separate `Main.hs` module declaring `main`. Do not expect the ghc option `-main-is` to work. A suitable `.cabal` file for (2) is

```
name:          ParTest
version:       0.0.0.0
cabal-version: >= 1.22
build-type:    Simple

executable Main
  main-is:          Main.hs
  other-modules:    ParTest
  build-depends:    base >= 4.14.1.0, array, deepseq, parallel
  default-language: Haskell98
  ghc-options:      -threaded -O2 -with-rtsopts=-N
```

Note that the `build-depends` field is separate from any import declarations in the `.hs` files and uses a different naming convention. For instance the package `parallel` above corresponds to `import Control.Parallel` and `import Control.Parallel.Strategies` in the module file. Fortunately if you omit a required package from `build-depends` then there is a very helpful error message that includes the package name you need. The purpose of (4) is to prevent `stack` from downloading a more recent version of `ghc` than it supports: ‘Stack has not been tested with GHC versions above 8.10, and using 9.0.2, this may fail’; indeed, fail it does. (Thanks to `stack`’s insistence on downloading an entirely fresh `ghc` for each project, I suspect I have now downloaded over 30GB of copies of `ghc` trying to chase down this glitch.) At least once I’ve found that (6) fails unless one first does (5): I have no idea why.

More positively, while I’ve had no success at all in using `--ghc-options="-threaded"` to pass options to `ghc`, the `ghc-options` field in the cabal file automates all of this, with the relevant run-time option included to boot. And I will concede that despite all the fuss of using `cabal`, it does in my experience avoid the dreaded ‘dependency hell’.

Note that `ghc` has to be explicitly told to compile threaded code. If one omits `-threaded` then `par` and the related constructs from `Control.Parallel` appear to have no effect. (When testing this, I discovered that it’s often essential to run `stack clean` after editing the `.cabal` file to make the changes ‘stick’.) This behaviour of `ghc` seems unnecessarily unhelpful to me: why can’t the compiler issue a warning when it sees `par` without threading turned on?

#### Benchmarking with lazy evaluation

A recurring issue with Haskell benchmarks is that lazy evaluation makes it very easy to write code that executes quickly, but only because it doesn’t do what strict imperative languages such as C might lead one to expect. For a simple example, suppose `expensive :: Integer -> Integer` is a function such that `expensive k` takes a long time to compute. We can build a list of the first `n` values of `expensive` using

```haskell
testExpensive n = length [expensive k | k <- [1..n]]
```

Irrespective of how `expensive` is defined, evaluating `testExpensive n` at the Haskell prompt takes essentially no time. The reason is that under lazy evaluation, only the existence of each entry of the list has to be established; each `expensive k` sits within the list as an unevaluated thunk. This can be seen interactively using the very useful `sprint` command in `ghci`. Try for instance

```haskell
let xs = [expensive k | k <- [1..n]]
```

for the `expensive` function of your choice (for instance `collatzSum` below), followed by `:sprint xs`, `length xs`, `:sprint xs`, `sum xs`, `:sprint xs`.
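
With `collatzSum` from below and a small bound, the session looks like this (a transcript of mine; the exact rendering of thunks can vary slightly between `ghci` versions):

```
ghci> let xs = [collatzSum k | k <- [1..5]] :: [Integer]
ghci> :sprint xs
xs = _
ghci> length xs
5
ghci> :sprint xs
xs = [_,_,_,_,_]
ghci> sum xs
91
ghci> :sprint xs
xs = [0,2,48,6,35]
```

Note how `length` evaluates only the spine of the list, leaving every element a thunk; only `sum` forces the elements themselves.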

Warning: if you do this interactively, you must avoid giving `xs` a polymorphic type, as will probably happen if you omit all type annotations. For instance if `xs` has type `Num a => [a]` then each evaluation will refer to a separate copy of `xs`. Thus with `let xs = map (\x -> x + 1) [1..10]` you’ll always get `_` from `:sprint xs`, whereas `let xs = map (\x -> x + 1) [1..10] :: [Int]` will work as expected.
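
The two behaviours side by side, in a transcript of mine (again, thunk display may differ slightly across `ghci` versions):

```
ghci> let ys = map (\x -> x + 1) [1..10]
ghci> length ys
10
ghci> :sprint ys
ys = _
ghci> let zs = map (\x -> x + 1) [1..10] :: [Int]
ghci> length zs
10
ghci> :sprint zs
zs = [_,_,_,_,_,_,_,_,_,_]
```

The polymorphic `ys` still shows as a single unevaluated `_` because `length` ran on a freshly defaulted copy, not on `ys` itself.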

#### An expensive function the compiler can’t magic away

Open mathematical problems are a rich source of such functions. Here I’ll go for the Collatz conjecture, on which Paul Erdős said “Mathematics is not yet ripe enough for such questions”; despite an astonishing partial result due to Terence Tao, his assessment still stands today. Let us define:

```haskell
collatzList :: Integer -> [Integer]
collatzList 0 = error "collatzList: 0"
collatzList 1 = []
collatzList n
  | n `mod` 2 == 0 = n : collatzList (n `div` 2)
  | otherwise      = n : collatzList (3 * n + 1)

collatzSum :: Integer -> Integer
collatzSum n = sum $ collatzList n
```
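
As a quick sanity check of these definitions: the trajectory of 6 is 6, 3, 10, 5, 16, 8, 4, 2, so `collatzSum 6` should be 54. A self-contained check (repeating the definitions above so the file compiles on its own):

```haskell
-- the post's definitions, repeated so this file stands alone
collatzList :: Integer -> [Integer]
collatzList 0 = error "collatzList: 0"
collatzList 1 = []
collatzList n
  | n `mod` 2 == 0 = n : collatzList (n `div` 2)
  | otherwise      = n : collatzList (3 * n + 1)

collatzSum :: Integer -> Integer
collatzSum n = sum (collatzList n)

main :: IO ()
main = do
  print (collatzList 6)  -- [6,3,10,5,16,8,4,2]
  print (collatzSum 6)   -- 54
```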

While `ghc` is a deeply impressive piece of software, I think we can safely assume that it does not have, deeply embedded in its internals, an effective proof of the Collatz conjecture that gives a short-circuited evaluation of `collatzSum`. We can therefore take `collatzSum` as the `expensive` function we require.

#### Parallelising the Collatz sum of sums

Evaluating `sum [collatzSum k | k <- [1..1000000]]`, using the module `Main.hs` at the end of this post, takes about 9.5 seconds on my computer. The code below parallelises this computation by splitting the list in two and evaluating the sum of each half in parallel. Its running time is about 5 seconds.

```haskell
collatzSumOfSumsP :: Integer -> Integer
collatzSumOfSumsP r = sA `par` (sB `par` sA + sB)
  where
    sA = sum [collatzSum n | n <- [1..r], n `mod` 2 == 0]
    sB = sum [collatzSum n | n <- [1..r], n `mod` 2 == 1]
```

Here `par` is a Haskell function with the type `a -> b -> b`. The unusual type is a clue that it breaks the normal rules of evaluation. (Indeed, the reasoning used in Wadler’s remarkable paper Theorems for free! shows that the only *strict* function with this type must discard the first argument and return the second unmodified.) The effect of `par` is to spark evaluation of its first argument to weak head normal form (WHNF) and then, in parallel, but otherwise following the usual rules of Haskell evaluation, evaluate the second argument. (Compare `seq :: a -> b -> b`, which does much the same, but in series.) In particular, since `sA` is defined in the `where` clause, it is evaluated exactly once, in parallel with `sB`.

In fact the simpler definition `collatzSumOfSumsP r = sA `par` sA + sB` also works: the runtime is smart enough to avoid the ‘not-really-parallel’ trap in which it computes `sA` and `sA + sB` in parallel, but hangs on `sA + sB` waiting for `sA` rather than getting on with `sB`.
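
The same shape can be seen in a compilable toy (my example, not from the post). Here `par` is imported from `GHC.Conc` in `base`, which is what `Control.Parallel` re-exports, so this compiles even without the `parallel` package; and without `-threaded` the spark simply never fires, with the answer unchanged:

```haskell
import GHC.Conc (par)  -- Control.Parallel re-exports this par

-- spark the first half-sum, then demand both
parSum :: Integer -> Integer
parSum r = sA `par` (sA + sB)
  where
    sA = sum [1 .. r]
    sB = sum [r + 1 .. 2 * r]

main :: IO ()
main = print (parSum 1000)  -- 2001000, i.e. sum [1..2000]
```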

The variation below evaluates the lists in parallel and *then* takes the sum. Its running time is (up to random noise) the same as `collatzSumOfSumsP`. For `rpar` import `Control.Parallel.Strategies` and for `force` import `Control.DeepSeq`.

```haskell
interleave :: [a] -> [a] -> [a]
interleave xs [] = xs
interleave [] ys = ys
interleave (x : xs) (y : ys) = x : y : interleave xs ys

parallelInterleave :: (NFData a) => [a] -> [a] -> [a]
parallelInterleave xs ys = runEval $ do
    xsP <- rpar (force xs)
    ysP <- rpar (force ys)
    return $ interleave xsP ysP

collatzSumOfSumsPL :: Integer -> Integer
collatzSumOfSumsPL r = sum $ parallelInterleave ssA ssB
  where
    ssA = [collatzSum n | n <- [1..r], n `mod` 2 == 0]
    ssB = [collatzSum n | n <- [1..r], n `mod` 2 == 1]
```

This version is more typical of the kind of problem I really want to parallelise, in which a huge collection of candidate solutions is generated, and then some further computation is done on this collection to decide which solutions work.
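
On the purely functional side, `interleave` on its own just merges its arguments alternately, which a quick self-contained check confirms:

```haskell
-- the post's interleave, repeated so this file stands alone
interleave :: [a] -> [a] -> [a]
interleave xs [] = xs
interleave [] ys = ys
interleave (x : xs) (y : ys) = x : y : interleave xs ys

main :: IO ()
main = print (interleave [0, 2, 4] [1, 3 :: Int])  -- [0,1,2,3,4]
```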

#### Parallel interleave

The engine driving `collatzSumOfSumsPL` is the function `parallelInterleave`, which uses `rpar` from `Control.Parallel.Strategies` and `force` from `Control.DeepSeq` to evaluate the lists in parallel.

The use of `force` is essential because of a very well-known Haskell ‘gotcha’. Consider for instance `let xs = collatzList 10` followed by `xs `seq` "Now xs has been evaluated?!"`. One can check using `:sprint` that `xs` is indeed evaluated, but only to WHNF: the output is `xs = 10 : _`. The correct implementation of `force` for lists is recursive:

```haskell
forceList :: [a] -> [a]
forceList []       = []
forceList (x : xs) = let xsF = forceList xs
                     -- force the element as well as the rest of the spine
                     in x `seq` xsF `seq` (x : xsF)
```

This can be used as a drop-in replacement for `force` in `parallelInterleave`. But it’s all fiddly enough that I think it’s best to leave it to the experts who wrote the `deepseq` package and use `force` as above. (The first version of this post even had a version that accidentally erased the list, while still failing to force non-head terms.)
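
The WHNF-versus-deep distinction can also be checked in a compiled program. This sketch of mine (using only `base`) plants `undefined` in a list’s tail: `seq` never reaches it, while a spine-forcing `forceList` does:

```haskell
import Control.Exception (SomeException, evaluate, try)

forceList :: [a] -> [a]
forceList []       = []
forceList (x : xs) = let xsF = forceList xs
                     in x `seq` xsF `seq` (x : xsF)

main :: IO ()
main = do
  let xs = 1 : undefined :: [Int]
  whnf <- try (evaluate (xs `seq` "ok")) :: IO (Either SomeException String)
  deep <- try (evaluate (forceList xs `seq` "ok")) :: IO (Either SomeException String)
  print whnf                -- Right "ok": seq stops at the first cons cell
  putStrLn $ case deep of   -- forceList walks the spine and hits undefined
    Left _  -> "forceList hit the undefined tail"
    Right _ -> "unexpected"
```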

The problem with `collatzSumOfSumsPL` is that since it evaluates both the lists `ssA = [collatzSum n | n <- [1..r], n `mod` 2 == 0]` and `ssB = [collatzSum n | n <- [1..r], n `mod` 2 == 1]` in full *before* taking the sum of each list (and then the sum of the sums), it uses memory proportional to the running time of the entire program, rather than memory bounded by the largest value of `collatzSum n` as in `collatzSumOfSumsP`. Our use of `par` and `seq` has forced us to confront the messy reality that semantically equivalent programs can have very different space requirements.

#### More on space

The code below computes `collatzSumOfSums`

by using an array, storing the value `collatzSum n `

for each relevant `n`

. Since Haskell arrays are strict, this means each call `collatzSumOfSums r`

has to fully allocate an array of size .

```haskell
hugeCollatz :: Integer -> Array Integer Integer
hugeCollatz r = array (1, r) [(i, collatzSum i) | i <- [1..r]]

sumArray :: Array i Integer -> Integer
sumArray arr = sum $ elems arr

collatzSumOfSumsA :: Integer -> Integer
collatzSumOfSumsA r = sumArray $ hugeCollatz r
```
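
As a sanity check that the array route agrees with the plain list sum (repeating the Collatz definitions so the snippet stands alone; `Data.Array` comes from the `array` package listed in the cabal file):

```haskell
import Data.Array (Array, array, elems)

collatzList :: Integer -> [Integer]
collatzList 0 = error "collatzList: 0"
collatzList 1 = []
collatzList n
  | n `mod` 2 == 0 = n : collatzList (n `div` 2)
  | otherwise      = n : collatzList (3 * n + 1)

collatzSum :: Integer -> Integer
collatzSum n = sum (collatzList n)

hugeCollatz :: Integer -> Array Integer Integer
hugeCollatz r = array (1, r) [(i, collatzSum i) | i <- [1..r]]

sumArray :: Array i Integer -> Integer
sumArray arr = sum (elems arr)

main :: IO ()
main = print (sumArray (hugeCollatz 6) == sum [collatzSum n | n <- [1..6]])  -- True
```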

Using `collatzSumOfSumsA` as the `expensive` function in our benchmarks, the extra space required to interleave the strictly evaluated lists before taking the sum can easily be seen using `top`.

```haskell
collatzSumOfSumOfSums :: Integer -> Integer
collatzSumOfSumOfSums t = sum [collatzSumOfSums r | r <- [1..t]]

collatzSumOfSumOfSumsAPHuge :: Integer -> Integer
collatzSumOfSumOfSumsAPHuge t = sum [sumArray arr | arr <- arrays]
  where
    arraysA = [hugeCollatz r | r <- [1..t], r `mod` 2 == 0]
    arraysB = [hugeCollatz r | r <- [1..t], r `mod` 2 == 1]
    arrays  = parallelInterleave arraysA arraysB

collatzSumOfSumOfSumsAPSmall :: Integer -> Integer
collatzSumOfSumOfSumsAPSmall t = sum ss
  where
    ssA = [sumArray $ hugeCollatz r | r <- [1..t], r `mod` 2 == 0]
    ssB = [sumArray $ hugeCollatz r | r <- [1..t], r `mod` 2 == 1]
    ss  = parallelInterleave ssA ssB
```

On my machine, evaluating these three functions for `t = 2500` takes respectively:

- 14 seconds, memory 20Mb
- 8 seconds, memory growing to 320Mb
- 8 seconds, memory 20Mb

The saving in `collatzSumOfSumOfSumsAPSmall` from summing the arrays as they are produced is shown in the diagram below.

#### Conclusion

Despite my initial struggles, I think these experiments show that it’s possible to get a substantial gain from parallelism in Haskell without any very expert knowledge or recondite tricks. But to get the full benefit one has to parallelise *all* the code, not just the first stage in the pipeline. This is of course the more complicated option: my hope for my intended applications is that functions such as `rdeepseq` can automate the plumbing.

#### Further reading

I discovered the `Eval` monad used by `rpar :: Strategy a` (equivalently, `rpar :: a -> Eval a`) from Chapter 2 of *Parallel and Concurrent Programming in Haskell* by Simon Marlow. Chapter 3 goes into much more detail. See also Chapter 24 of *Real World Haskell* by Bryan O’Sullivan, Don Stewart, and John Goerzen.

#### Appendix: Main module

For the record, here is the `Main.hs` module I used to get these timings.

```haskell
module Main where

import System.Environment
import ParTest

-- Tests with r = 1000000
testSumOfSums :: IO ()
testSumOfSums = do
    putStrLn "testSumOfSums"
    rStr : _ <- getArgs
    let r = read rStr :: Integer
    -- putStrLn $ show $ collatzSumOfSums r    -- 316% cpu, 9.582 total
    -- putStrLn $ show $ collatzSumOfSumsP r   -- 304% cpu, 4.962 total
    putStrLn $ show $ collatzSumOfSumsPL r     -- 342% cpu, 4.998 total

-- Tests with t = 2500
testSumOfSumOfSums :: IO ()
testSumOfSumOfSums = do
    tStr : _ <- getArgs
    let t = read tStr :: Integer
    -- putStrLn $ show $ collatzSumOfSumOfSums t        -- 312% cpu, 13.837 total, memory 20Mb
    -- putStrLn $ show $ collatzSumOfSumOfSumsAPHuge t  -- 435% cpu, 7.902 total, memory to 320Mb
    putStrLn $ show $ collatzSumOfSumOfSumsAPSmall t    -- 377% cpu, 7.077 total, memory 20Mb

main :: IO ()
main = testSumOfSums
-- main = testSumOfSumOfSums
```