This post is an extended version of a talk I last gave a few years ago on an extended summer visit to Bristol; four weeks into lockdown seems like a good time to write it up for a more varied audience. The overall thesis is that it’s hard to have a good intuition for really high-dimensional spaces, and that the reason is that, asked to picture such a space, most of us come far closer to something like than the more accurate . This is reflected in the rule of thumb that the size of a tower of exponents is determined by the number at the top: is tiny compared to and should seem only a tiny bit smaller than
when both are compared with .
Some details in the proofs below are left to the highly optional exercise sheet that accompanies this post. Sources are acknowledged at the end.
As a warm-up for my talk, I invited the audience to order the following cardinals:
Of course they are already correctly (and strictly) ordered by size. From the perspective of ‘effective computation’, I claim that and are ‘tiny’, and are ‘huge’ and and sit somewhere in the middle. To give some hint of the most surprising part of this, interpret as the Cantor set , so is the set of subsets of . Computationally definable subsets of are then computable predicates on (i.e. functions ), and since is compact, it is computable whether two such predicates are equal. In contrast, there is no algorithm that, given two predicates on , will run for a finite length of time and then return the correct answer.
Euclidean space and unit balls
Let be the solid -dimensional unit ball in Euclidean space and let be the -dimensional surface of .
House prices in Sphereland
Imagine a hypothetical ‘Sphereland’ whose inhabitants live uniformly distributed on the surface . For instance, is the one-point compactification of Flatland. With a god’s eye view you are at the origin and survey the world. At what values of the final coordinate are most of the inhabitants to be found?
For example, the diagram below shows the case with two regions of equal height shaded.
It is a remarkable fact, going all the way back to Archimedes, that the surface areas of the red and blue regions are equal: the reduction in cross-section as we go upwards is exactly compensated by the shallower slope. For a calculus proof, observe that in the usual Cartesian coordinate system, we can parametrize the part of the surface with and by . Choose . Then since is orthogonal to the gradient of the sphere at , to first order, if we increase the height from to then we must decrease the second coordinate from to . This is shown in the diagram below.
Hence the norm squared of the marked line segment tangent to the surface is
As in Archimedes’ argument, the area of the region between the lines of latitude of height between and is (to first order in ) the product of and the circumference of the line of latitude at height . It is therefore
which is independent of . As a small check, integrating over from to we get for the surface area of the sphere; as expected this is also the surface area of the enclosing cylinder of radius and height .
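In symbols, reconstructing the computation just described (writing z for the height, so the latitude circle at height z has radius the square root of 1 - z²):

```latex
% Tangent segment between heights z and z + \delta (to first order):
\sqrt{\delta^2 + \Bigl(\frac{z\,\delta}{\sqrt{1-z^2}}\Bigr)^{\!2}}
  \;=\; \frac{\delta}{\sqrt{1-z^2}}.
% Multiplying by the circumference of the latitude circle at height z:
2\pi\sqrt{1-z^2} \cdot \frac{\delta}{\sqrt{1-z^2}} \;=\; 2\pi\delta,
% which is independent of z; integrating z over [-1, 1] gives area 4\pi.
```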
This cancellation in the displayed formula above is special to the case . For instance, when , by the argument just given, the arclength at height is proportional to . Therefore in the one-point compactification of Lineland (a place one might feel is already insular enough), from the god’s perspective (considering slices with varying ), most of the inhabitants are near the north and south poles (), and almost no-one lives on the equator ().
More generally, the surface area of is given by integrating the product of the arclength and the surface area of the ‘latitude cross-section’ at height . The latter is the -dimensional sphere of radius . By dimensional analysis, its surface area is for some constant . Hence the density of Sphereland inhabitants at height is proportional to . (A complete proof of this using only elementary calculus is given in Exercise 1 of the problem sheet.) In particular, we see that when is large, almost all the density is at the equator . From the god’s eye perspective, in high-dimensional Sphereland, everyone lives on the equator, or a tiny distance from it. To emphasise this point, here are the probability density functions for using colours red then blue then black.
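To see this concentration numerically, here is a short Haskell sketch (my own addition, not from the talk; the names `density` and `massNear` are made up) that integrates the density with the trapezium rule.

```haskell
-- Density, up to a normalizing constant, of the final coordinate of a
-- uniformly random point on the (n-1)-dimensional unit sphere.
density :: Int -> Double -> Double
density n x = (1 - x * x) ** (fromIntegral (n - 3) / 2)

-- Trapezium rule approximation to the integral of f over [a, b].
integrate :: (Double -> Double) -> Double -> Double -> Double
integrate f a b = h * sum [ (f x + f (x + h)) / 2
                          | i <- [0 .. steps - 1]
                          , let x = a + fromIntegral i * h ]
  where
    steps = 20000 :: Int
    h     = (b - a) / fromIntegral steps

-- Fraction of Sphereland's population whose final coordinate lies
-- in (-eps, eps).
massNear :: Int -> Double -> Double
massNear n eps = integrate (density n) (-eps) eps
               / integrate (density n) (-1) 1
```

When n = 3 the density is constant, recovering Archimedes: `massNear 3 0.1` is 0.1. When n = 1000, `massNear 1000 0.1` is above 0.99: almost everyone lives within 0.1 of the equator.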
As a further example, the scatter points below show random samples from , taking (left) coordinates and (right) coordinates. Note that almost no points have a coordinate more than (in absolute value). Moreover, since the expected value of each is , we find most of the mass at .
In particular, if a random inhabitant of large -Spaceland is at then it is almost certain that most of the are very small.
One feature of this seems deeply unintuitive to me. There is, after all, nothing intrinsic about the -coordinate. Indeed, the god can pick any hyperplane in through the origin, and get a similar conclusion.
Concentration of measure
One could make the claim about the expected sizes of the coordinates more precise by continuing with differential geometry, but Henry Cohn’s answer to this Mathoverflow question on concentration of measure gives an elegant alternative approach. Let be independent identically distributed normal random variables each with mean and variance . Then normalized to have length is uniformly distributed on . (A nice way to see this is to use that a linear combination of normal random variables is normally distributed to show invariance under the orthogonal group .) Moreover, , and, using that , we get
Hence, by Chebychev’s Inequality, that if is a random variable then
the probability that is at most , which tends to as . Therefore we can, with small error, neglect the normalization and regard as a uniformly chosen point on . By Markov’s inequality, that if is a non-negative random variable then , the probability that (or equivalently, ) is at most , and the probability that (or equivalently, with ) is at most . Since
as , we see that with high probability, a random Sphereland inhabitant has all its coordinates in . I think this makes the counter-intuitive conclusion of the previous subsection even starker.
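The Gaussian sampling trick is easy to try out. The sketch below is deliberately self-contained (so it avoids the `random` package): a basic linear congruential generator supplies uniforms, and the classical sum-of-twelve-uniforms trick stands in for a standard normal. Both are illustrative shortcuts of mine, not a serious random number generator, and all the names are invented.

```haskell
-- Uniforms in (0, 1) from a basic linear congruential generator
-- (ANSI C constants); illustrative only, not a quality RNG.
uniforms :: Int -> [Double]
uniforms seed = map toUnit (tail (iterate step seed))
  where
    step x   = (1103515245 * x + 12345) `mod` 2147483648
    toUnit x = (fromIntegral x + 0.5) / 2147483648

-- Approximately standard normal variates: a sum of twelve uniforms,
-- shifted to have mean 0 and variance 1 (the Irwin-Hall shortcut).
normals :: Int -> [Double]
normals seed = go (uniforms seed)
  where go us = let (u, rest) = splitAt 12 us in (sum u - 6) : go rest

-- An (approximately) uniform random point on the (n-1)-sphere:
-- normalize a vector of n independent normals.
spherePoint :: Int -> Int -> [Double]
spherePoint n seed = map (/ r) zs
  where
    zs = take n (normals seed)
    r  = sqrt (sum (map (^ 2) zs))
```

For n = 1000, every coordinate of `spherePoint 1000 42` is small in absolute value, exactly as the Markov bound predicts.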
Volume of the unit ball
The ‘length of tangent line times area of cross-section’ argument says that if is the surface area of then
A quick trip to Mathematica to evaluate the integral shows that
It follows easily by induction that . Since and , the volume of is
In particular, from for we get . Hence (is there a quick way to see this?), and the proportion of the cube occupied by the ball is
Thus the proportion tends to , very quickly. I find this somewhat surprising, since my mental picture of a sphere is as a bulging convex object that should somehow ‘fill out’ an enclosing cube. Again my intuition is hopelessly wrong.
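Assuming the recursion just derived, that the volume of the n-dimensional ball is 2π/n times the volume of the (n-2)-dimensional ball, the numbers are easy to generate (the function names here are my own):

```haskell
-- Volume of the n-dimensional unit ball, via the recursion
-- vol n = (2 * pi / n) * vol (n - 2), with vol 0 = 1 and vol 1 = 2.
ballVolume :: Int -> Double
ballVolume 0 = 1
ballVolume 1 = 2
ballVolume n = (2 * pi / fromIntegral n) * ballVolume (n - 2)

-- Proportion of the enclosing cube [-1, 1]^n filled by the unit ball.
cubeProportion :: Int -> Double
cubeProportion n = ballVolume n / 2 ^ n
```

`ballVolume 2` is pi and `ballVolume 3` is 4pi/3, as expected, while already by n = 20 the ball fills less than one part in ten million of the cube.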
We now turn to discrete spaces. Our friends Alice and Bob must communicate using a noisy channel that can send the bits and . They agree (in advance) a binary code of length , and a bijection between codewords and messages. When Bob receives a binary word in he decodes it as the message in bijection with the nearest codeword in ; if there are several such codewords, then he chooses one at random, and fears the worst. Here ‘nearest’ of course means with respect to Hamming distance.
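In code, nearest neighbour decoding is just a `minimumBy`. Here is a minimal Haskell sketch of my own (binary words modelled as lists of `Int`s; for simplicity ties go to the earliest codeword in the list rather than being broken at random):

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Hamming distance: the number of positions where two words differ.
hamming :: [Int] -> [Int] -> Int
hamming u v = length (filter id (zipWith (/=) u v))

-- Nearest-neighbour decoding: the codeword closest to the received
-- word, with ties broken by list order rather than at random.
decode :: [[Int]] -> [Int] -> [Int]
decode code received = minimumBy (comparing (hamming received)) code
```

With the length-3 repetition code `[[0,0,0],[1,1,1]]`, `decode` corrects any single flipped bit: for instance `decode [[0,0,0],[1,1,1]] [0,1,0]` is `[0,0,0]`.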
Shannon’s probabilistic model
In this model, each bit flips independently with a fixed crossover probability . If then reliable communication is impossible and if then we can immediately reduce to the case by flipping all bits in a received word. We therefore assume that . In this case, Shannon’s Noisy Coding Theorem states that the capacity of the binary symmetric channel is , where is the binary entropy function. That is, given any , provided is sufficiently large, there is a binary code of size such that when is used with nearest neighbour decoding to communicate on the binary symmetric channel, the probability of a decoding error is (uniformly for every codeword). We outline a proof later.
For example, the theoretical maximum 4G data rate is bits per second. Since , provided is sufficiently large, even if one in four bits gets randomly flipped by the network, Shannon’s Noisy Coding Theorem promises that Alice and Bob can communicate reliably using a code of size . In other words, Alice and Bob can communicate reliably at a rate of up to million bits per second.
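For concreteness, the binary entropy function and the resulting capacity are one-liners (my own illustration; `binaryEntropy` and `capacity` are invented names):

```haskell
-- Binary entropy H(p) = -p log2 p - (1-p) log2 (1-p),
-- with the convention H(0) = H(1) = 0.
binaryEntropy :: Double -> Double
binaryEntropy p
  | p == 0 || p == 1 = 0
  | otherwise        = - p * logBase 2 p - (1 - p) * logBase 2 (1 - p)

-- Capacity of the binary symmetric channel with crossover probability p.
capacity :: Double -> Double
capacity p = 1 - binaryEntropy p
```

`capacity 0.25` is about 0.189, the figure behind the one-in-four-bits example, while `capacity 0.5` is 0: with half the bits flipped, the channel output is independent of the input.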
Hamming’s adversarial model
In Hamming’s model the received word differs from the sent word in at most positions, where these positions are chosen by an adversary to be as inconvenient as possible. Nearest neighbour decoding always succeeds if and only if the minimum distance of the code is at least .
In the usual binary symmetric channel with crossover probability , when is large, the chance that the number of flips is more than is negligible. Therefore a binary code with minimum distance can be used to communicate reliably on an adversarial binary symmetric channel with crossover probability , in which the number of flipped bits is always , and these bits are chosen adversarially.
So how big can a code with minimum distance be? Let be such a code and for each , let
Observe that each is a binary code of length and minimum distance at least . By the Plotkin bound, for all . (Equality is attained by the Hadamard codes .) Since each has size at most , we find that
The relative rate of a binary code of length is defined to be . By the bound above, if is -error correcting (and so its minimum distance satisfies ) we have
In particular, if then : the code forms a vanishingly small proportion of . We conclude that if the crossover probability in the channel is or more, fast communication in the Hamming model is impossible.
An apparent paradox
We have seen that Shannon’s Noisy Coding Theorem promises reliable communication at a rate beyond the Plotkin bound. I hope it is clear that there is no paradox: instead we have demonstrated that communication with random errors is far easier than communication with adversarial errors.
In my view, most textbooks on coding theory do not give enough emphasis to this distinction. They share this feature with the students in my old coding theory course: for many years I set an optional question asking them to resolve the apparent clash between Shannon’s Noisy Coding Theorem and the Plotkin Bound; despite several strong students attempting it, only one ever got close to the explanation above. In fact in most years, the modal answer was ‘mathematics is inconsistent’. Of course as a fully paid-up member of the mathematical establishment, I marked this as incorrect.
The structure of large-dimensional vector spaces
I now want to argue that the sharp difference between the Shannon and Hamming models illuminates the structure of when is large.
When lecturing in the Hamming model, one often draws pictures such as the one below, in which, like a homing missile, a sent codeword inexorably heads towards another codeword (two errors, red arrows), rather than heading in a random direction (two errors, blue arrows).
While accurate for adversarial errors, this picture is, Shannon’s Noisy Coding Theorem tells us, completely inaccurate for random errors. Instead, if is a large code with minimum distance and is sent over the binary symmetric channel with crossover probability , and is received, then while is at distance about from , it is not appreciably closer to any other codeword . I claim that this situation is possible because is ‘large’, in a sense not captured by the two-dimensional diagram above.
Shannon’s Noisy Coding Theorem for the binary symmetric channel
To make the previous paragraph more precise we outline a proof of Shannon’s Noisy Coding Theorem for the binary symmetric channel. The proof, which goes back to Shannon, is a beautiful application of the probabilistic method, made long before this was the standard term for such proofs.
We shall simplify things by replacing the binary symmetric channel with its ‘toy’ version, in which whenever a word is sent, exactly bits are chosen uniformly at random to flip. (So we are assuming .) By the Law of Large Numbers, this is a good approximation to binomially distributed errors, and it is routine using standard estimates (Chebychev’s Inequality is enough) to modify the proof below so it works for the original channel.
Proof. Fix and let , where will be chosen later. Choose codewords uniformly at random from . Let be the probability (in the probability space of the toy binary symmetric channel) that when is sent, the received word is decoded incorrectly by nearest neighbour decoding. In the toy binary symmetric channel, the received word is distance from , so an upper bound for is the probability that when is sent, there is a codeword with within distance of the received word. Note that are all themselves random variables, defined in the probability space of the random choice of code.
Now is in turn bounded above by the expected number (over the random choice of code) of codewords with within distance of the received word. Since these codewords were chosen independently of and uniformly from , it doesn’t matter what the received word is: the expected number of such codewords is simply
where is the volume (i.e. number of words) in the Hamming ball of radius about . We can model words in this ball as the output of a random source that emits the bits and with probabilities and , respectively. The entropy of this source is , so we expect to be able to compress its -bit outputs to words of length . Correspondingly,
(I find the argument by source coding is good motivation for the inequality, but there is of course a simpler proof using basic probability theory: see the problem sheet.) Hence
Since , the probability of decoding error when is sent becomes exponentially small as becomes large. In particular, the mean probability is smaller than , provided is sufficiently large. A Markov’s Inequality argument now shows that by throwing away at most half the codewords, we can assume that the probability of decoding error is less than for all remaining codewords.
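The volume bound used in the proof can be sanity-checked numerically. Here is a quick experiment of mine comparing the exact binomial sum for the Hamming ball with the entropy bound 2^(n H(r/n)) (the names are invented):

```haskell
-- Number of words in the Hamming ball of radius r in F_2^n:
-- the sum of the binomial coefficients (n choose k) for k = 0, ..., r.
ballSize :: Integer -> Integer -> Integer
ballSize n r = sum [choose n k | k <- [0 .. r]]
  where choose m k = product [m - k + 1 .. m] `div` product [1 .. k]

-- The entropy upper bound 2^(n * H(r/n)) on the ball size.
entropyBound :: Integer -> Integer -> Double
entropyBound n r = 2 ** (fromIntegral n * h (fromIntegral r / fromIntegral n))
  where h p = - p * logBase 2 p - (1 - p) * logBase 2 (1 - p)
```

For n = 20 and r = 5 (so p = 1/4), `ballSize 20 5` is 21700, comfortably below the entropy bound of roughly 76600.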
Varying the alphabet
Increasing the size of the alphabet does not change the situation in an important way. In fact, if is replaced with for large then the Singleton Bound, that becomes effective, giving another constraint that also apparently contradicts Shannon’s Noisy Coding Theorem. There is however one interesting difference: in , every binary word has a unique antipodal word, obtained by flipping all its bits, whereas in there are words at distance from any given word. This is the best qualitative sense I know in which is smaller than .
Cryptography and computation
Readers interested in cryptography probably recognised above as the key length of the block cipher DES. This cipher is no longer in common use because a determined adversary knowing (as is usually assumed) some plaintext/ciphertext pairs can easily try all possible keys and so discover the key. Even back in 2008, an FPGA-based special purpose device costing £10000 could test DES keys every second, giving days for an exhaustive search.
Modern block ciphers such as AES typically support keys of length and . In Applied Cryptography, Schneier estimates that a Dyson Sphere capturing all the Sun’s energy for 32 years would provide enough power to perform basic operations, strongly suggesting that bits should be enough for anyone. The truly paranoid (or readers of Douglas Adams) should note that in Computational capacity of the universe, Seth Lloyd, Phys. Rev. Lett., (2002) 88 237901-3, the author estimates that even if the universe is one vast computer, then it can have performed at most calculations. Thus in a computational sense, is tiny, and and are effectively infinite.
The final number above was . The Chinese supercomputer Sunway TaihuLight runs at 93 Petaflops, that is, operations per second. A modern Intel chip has AES encryption as a primitive instruction, and can encrypt at 1.76 cycles per byte for a 256-bit key, encrypting 1KB at a time. If, being very conservative, we assume the supercomputer can test a key by encrypting 16 bytes, then it can test keys every second, requiring days to exhaust over keys. Therefore is in the tricky middle ground, between the easily computable and the almost certainly impossible.
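The back-of-envelope arithmetic here is worth automating. A tiny helper of mine (the rates in the examples are illustrative placeholders, not the exact figures quoted above):

```haskell
-- Days needed to try every key in a keyspace of 2^bits keys,
-- testing the given number of keys per second.
daysToExhaust :: Int -> Double -> Double
daysToExhaust bits keysPerSecond =
  2 ** fromIntegral bits / keysPerSecond / 86400
```

At a (hypothetical) rate of 10^9 key tests per second, `daysToExhaust 56 1e9` is about 834 days, while `daysToExhaust 128 1e18` still exceeds 10^15 days: the gap between the two key lengths is a factor of 2^72, not something a faster machine can bridge.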
My surprising claim that also sits somewhere in the middle comes from my reading of a wonderful blog post by Martín Escardó (on Andrej Bauer’s blog). To conclude, I will give an introduction to his seemingly impossible Haskell programs.
As a warm-up, consider this Haskell function which computes the Fibonacci numbers , extending the definition in the natural way to all integers .
fib n | n == 0 = 0
| n == 1 = 1
| n >= 2 = fib (n-1) + fib (n-2)
| n <= -1 = fib (n+2) - fib (n+1)
Except for the minor notational change that the conditions appear before rather than after the equals sign, this Haskell code is a verbatim mathematical definition. The code below defines two predicates and on the integers such that is true if and only if and is true if and only if mod .
p n = fib n * fib n - fib (n+1) * fib (n-1) == (-1)^(n+1)
q n = fib n `mod` 3 == 0
The first equation is Cassini’s Identity, so is true for all ; is true if and only if is a multiple of . We check this in Haskell using its very helpful ‘list-comprehension’ syntax; again this is very close to the analogous mathematical notation for sets.
*Cantor> [p n | n <- [-5..5]]
*Cantor> [q n | n <- [-5..5]]
Therefore if we define to be true (for all ) and to be true if and only if , we have and .
An important feature of Haskell is that it is strongly typed. We haven’t seen this yet because Haskell is also type-inferring in a very powerful way that makes explicit type signatures usually unnecessary. Simplifying slightly, the type of the predicates above is Integer -> Bool, and the type of is Integer -> Integer. (Integer is a built-in Haskell type that supports arbitrary sized integers.) The family of predicates defined by is defined by
r m n = n `mod` m == 0
r has type Integer -> Integer -> Bool. It is helpful to think of this in terms of the currying isomorphism , ubiquitous in Haskell code. We now ask: is there a Haskell function
equal :: (Integer -> Bool) -> (Integer -> Bool) -> Bool
taking as its input two predicates on the integers and returning True if and only if they are equal? The examples of above show that such a function would, at a stroke, make most mathematicians unemployed. Fortunately for us, Turing’s solution to the Entscheidungsproblem tells us that no such function can exist.
Escardó’s post concerns predicates defined not on the integers , but instead on the -adic integers . We think of the -adics as infinite bitstreams. For example, the Haskell definitions of the bitstream representing and the bitstream representing
data Bit = Zero | One deriving (Eq, Show)
type Cantor = [Bit]
zero = Zero : zero
one = One : zero
bs = One : Zero : bs :: Cantor
(I’m deliberately simplifying by using the Haskell list type: this is not quite right, since lists can be, and usually are, finite.) Because of its lazy evaluation — nothing is evaluated in Haskell unless it is provably necessary for the computation to proceed — Haskell is ideal for manipulating such bitstreams. For instance, while evaluating bs at the Haskell prompt will print an infinite stream on the console, Haskell has no problem performing any computation that depends on only finitely many values of a bitstream. As proof, here is :
*Cantor> take 10 (bs + tail bs)
*Cantor> take 10 (bs + tail bs + one)
(Of course one has to tell Haskell how to define addition on the Cantor type: see Cantor.hs for this and everything else in this section.) As an example of a family of predicates on , consider
twoPowerDivisibleC :: Int -> Cantor -> Bool
twoPowerDivisibleC p bs = take p zero == take p bs
twoPowerDivisibleC p bs holds if and only if bs represents an element of . For example, the odd -adic integers are precisely those for which twoPowerDivisibleC 1 bs is false:
*Cantor> [twoPowerDivisibleC 1 (fromInteger n) | n <- [0..10]]
In the rooted binary tree representation of shown below, the truth-set of this predicate is exactly the left-hand subtree. The infinite path to is shown by thick lines.
Here is a somewhat similar-looking definition:
nonZeroC :: Cantor -> Bool
nonZeroC (One : _) = True
nonZeroC (Zero : bs) = nonZeroC bs
While a correct (and correctly typed) Haskell definition, this does not define a predicate on because the evaluation of nonZeroC zero never terminates. In fact, it is surprisingly difficult to define a predicate (i.e. a total function with boolean values) on the Cantor type. The beautiful reason behind this is that the truth-set of any such predicate is open in the -adic topology. Since this topology has as a basis of open sets the cosets of the subgroups , all predicates look something like one of the twoPowerDivisibleC predicates above.
This remark perhaps makes the main result in Escardó’s blog post somewhat less amazing, but it is still very striking: it is possible to define a Haskell function
equalC :: (Cantor -> Bool) -> (Cantor -> Bool) -> Bool
which, given two predicates on (as represented in Haskell using the Cantor type), returns True if they are equal (as mathematical functions on ) and False if they are unequal. Escardó’s ingenious definition of equalC needs only a few lines of Haskell code: it may look obvious when read on the screen, but I found it a real challenge to duplicate unseen, even after having read his post. I encourage you to read it: it is fascinating to see how the compactness of as a topological space corresponds to a ‘uniformity’ property of Haskell predicates.
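For readers who would rather see the idea in code first (spoiler warning), here is a sketch in the spirit of Escardó’s construction; the names and details are mine, so treat it as an approximation rather than his actual definitions. The heart of it is a search function that lazily constructs a satisfying bitstream whenever one exists:

```haskell
-- Repeated from above so this sketch compiles on its own.
data Bit = Zero | One deriving Eq
type Cantor = [Bit]

-- Lazily search for a bitstream satisfying p: if some stream satisfies
-- p then find p is such a stream, otherwise it is an arbitrary stream.
find :: (Cantor -> Bool) -> Cantor
find p = if forsome (\a -> p (Zero : a))
           then Zero : find (\a -> p (Zero : a))
           else One  : find (\a -> p (One : a))

-- Is p satisfied by some stream?  By every stream?
forsome, forevery :: (Cantor -> Bool) -> Bool
forsome p  = p (find p)
forevery p = not (forsome (\a -> not (p a)))

-- Decide equality of two total predicates on the Cantor space.
equalC :: (Cantor -> Bool) -> (Cantor -> Bool) -> Bool
equalC f g = forevery (\a -> f a == g a)
```

The mutual recursion between find and forsome terminates for total predicates precisely because of the uniformity property mentioned above: a total predicate can only inspect finitely many bits of its input.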
The results on volumes of spheres and balls are cobbled together from several MathOverflow questions and answers, in particular Joseph O’Rourke’s question asking for an intuitive explanation of the concentration of measure phenomenon, and S. Carnahan’s answer to a question on the volume of the -dimensional unit ball. The coding theory results go back to Shannon, and can be found in many textbooks, for example van Lint’s Introduction to Coding Theory. The use of the ‘toy’ binary symmetric channel is my innovation, used when I lectured our Channels course in 2019–20 to reduce the technicalities in the proof of Shannon’s Noisy Coding Theorem. The diagrams were drawn using TikZ; for the spheres I used these macros due to Tomasz M. Trzeciak. The material about computability is either very basic or comes from a blog post by Martín Escardó (on Andrej Bauer’s blog) giving an introduction to his paper Infinite sets that admit fast exhaustive search, published in the proceedings of Logic in Computer Science 2007.