## The Hamming code, quadratic residue codes and an exceptional isomorphism

July 4, 2021

As a pleasant way to pass a Sunday afternoon — or at least, more pleasant than flat-hunting in Bristol in a wildly inflated property market — let me offer yet another way to prove the exceptional isomorphism $\mathrm{GL}_3(\mathbb{F}_2) \cong \mathrm{PSL}_2(\mathbb{F}_7)$. An earlier post uses modular representation theory to do in about a screenful of WordPress. Here we do it using the well known result that, up to position permutation, there is a unique perfect $1$-error correcting linear binary code of length $7$.

### The Hamming code

We define the $[7,4,3]$Hamming code $C$ to be the linear code of length $7$ and dimension $4$ with parity check matrix

$H = \left( \begin{matrix} 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{matrix} \right).$

Since each non-zero binary word of length $3$ appears exactly once as a column of $H$, the syndromes for the $7$ possible one bit errors, namely

$H(1,0,0,0,0,0,0)^t, \ldots, H(0,0,0,0,0,0,1)^t$

are distinct. Hence $C$ is $1$-error correcting, and therefore has minimum distance $\ge 3$. Since the disjoint Hamming balls of radius $1$ about the codewords each contain $1 + \binom{7}{1} = 8$ binary words of length $7$, and $8 \times 2^4 = 2^7 = |\mathbb{F}_2^7|$, these balls partition $\mathbb{F}_2^7$. Therefore $C$ is a perfect $1$-error correcting linear binary code.

Lemma. The automorphism group of $C$ is a subgroup of $S_7$ containing $\mathrm{GL}_3(\mathbb{F}_2)$.

Proof. Each $g \in \mathrm{GL}_3(\mathbb{F}_2)$ permutes the $7$ non-zero elements of $\mathbb{F}_2^3$. Let $\sigma_g \in S_7$ be the position permutation of the columns of $H$ induced by $g$. For example,

$g = \left( \begin{matrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{matrix} \right) \implies gH = \left( \begin{matrix} 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 &0 & 1 & 1 \end{matrix} \right)$

and so in this case $\sigma_g = (1,4,2)(3,5,6)$. (Since I am hyper-paranoid about such things, I will verify that $g \mapsto \sigma_g$ is a homomorphism by comparing

$g(g'H) = g (H . \sigma_{g'}) = (gH) \cdot \sigma_{g'} = H \cdot \sigma_g \sigma_{g'}$

with $(gg') H = H \cdot \sigma_{gg'}$, where $\cdot$ denotes the column permutation action.) Now writing $P(\sigma)$ for the permutation matrix of $\sigma$ (acting on the right as usual when composing permutations left to right, so $P(\sigma)_{ij} = [i\sigma = j]$), we have

$v (gH)^t = v (HP(\sigma_g))^t = v P(\sigma_g)^t H^t = v P(\sigma_g^{-1}) H^t = (v \cdot \sigma_g^{-1}) H^t$

for any $v \in \mathbb{F}_2^7$. If $v \in C$ then the left-hand side is

$v H^t g^t = 0 g^t = 0.$

Therefore $v \cdot \sigma_g^{-1} \in C$, and indeed the image of the homomorphism $\sigma$ is a group of position permutations of $C$. For instance, in the example above, $(1,1,1,0,0,0,0)(gH)^t = 0$ and correspondingly, $(1,1,1,0,0,0,0)\sigma_g = (1,0,0,1,1,0,0)$ is a codeword in $C$. Since the group homomorphism $g \mapsto \sigma_g$ is injective, the lemma follows. $\Box$.

Another nice way to prove the lemma is to calculate the $7$ words in $C$ of weight $3$. By the definition above, these have non-zero entries in the positions

$\{1,2,3\}, \{1,4,5\}, \{2,4,6\}, \{1,6,7\}, \{2,5,7\}, \{3,4,7\}, \{3,5,6\},$

as shown below as the lines in the Fano plane $\mathbb{P}^2(\mathbb{F}_7^3)$.

### Quadratic residue codes

We may construct $\mathbb{F}_{2^3}$ as $\mathbb{F}_2(\alpha)$ where $\alpha$ has minimal polynomial $x^3 + x + 1$. The binary quadratic residue code of length $7$ is the cyclic code $D$ whose generator polynomial has as its roots the powers of $\alpha$ corresponding to the $3$ quadratic residues modulo $7$, namely $1$, $2$ and $4$. That is,

$g(x) = (x-\alpha)(x-\alpha^2)(x-\alpha^4).$

Since these roots are conjugate under the Frobenius automorphism $\beta \mapsto \beta^2$ of $\mathbb{F}_{2^3}$, the generator polynomial is the minimum polynomial of $\alpha$; that is $g(x) = x^3 + x + 1$. Thinking of $D$ as the ideal $\langle x^3+x+1 \rangle$ of $\mathbb{F}_2[x]/\langle x^7+1 \rangle$, we have $v(x) \in D$ if and only if $v(\alpha) = v(\alpha^2) = v(\alpha^4) = 0$. From this it is easy to see that $D$ has minimum weight $3$ and so, as before, $D$ is a perfect $1$-error correcting binary code. (More generally, in a quadratic residue code of odd prime length $n$, the minimum distance $d$ satisfies $d^2 \ge n$.) The same argument holds for the cyclic code $E = \langle x^3 + x^2 + 1 \rangle$ whose generator polynomial is $(x-\alpha^3)(x-\alpha^5)(x-\alpha^6)$ is defined using the $3$ non-residues modulo $7$.

By the uniqueness result mentioned earlier, the codes $C$, $D$ and $E$ are all equivalent. Fix a bijection $\star : E \rightarrow D$.

We now identify the positions of $D$ with $\mathbb{F}_7$. Since $D$ is cyclic, the position permutation $j \mapsto j + 1$ is an automorphism of $D$. This gives one element of $\mathrm{PSL}_2(\mathbb{F}_7)$. To obtain the rest, we need the following two claim.

Claim. If $r \in \{1,2,4\}$ then the position permutation $j \mapsto tj$ preserves $D$.

Proof. Thinking of $D$ as an ideal, we have $v(x) \in D$ if and only if $v(\alpha) = v(\alpha^2) = v(\alpha^4) = 0$, so if and only if $v(x^2) \in D$. This deals with the case $r=2$, and $r=4$ is obtained by squaring. $\Box$.

Claim. If $s \in \{3,5,6\}$ then the position permutation $j \mapsto tj$ sends $D$ to $E$. The position permutation $0 \mapsto 0$ and $t \mapsto t^{-1}$ sends $D$ to $E$.

Proof. For the first part, use that the generator polynomial for $E$ is defined using the $3$ non-residues modulo $7$. The second part follows similarly, since $\beta$ is a root of $x^3 + x + 1$ if and only if $\beta^{-1}$ is a root of $x^3 + x^2 + 1$. $\Box$

We may therefore define an action of $\mathrm{PSL}_2(\mathbb{F}_7)$ on $D$ by identifying the positions of codewords with $\mathbb{F}_7$ and then applying the corresponding position permutation, followed by $\star$ if (according to the second claim above), we end in $E$ rather than in $D$. Using that $\mathrm{PSL}_2(\mathbb{F}_7)$ is a simple group, we see this homomorphism is injective.

### Conclusion

We have shown that the automorphism group, $G$ say, of a perfect $1$-error correcting linear binary code contains subgroups $S$ and $T$ isomorphic to the finite simple groups $\mathrm{GL}_3(\mathbb{F}_2)$ and $\mathrm{PSL}_2(\mathbb{F}_7)$ of order $168$. Setting $H = S \cap T$ we have $|G| \ge 168^2 / |H|$. Since $G \le S_7$, and $168 = 2^3 . 3 .7$ and $7! = 2^4.3^2. 5.7$, we have $|H| \ge 2^2 . 7 = 28$. Hence $H$ has index at most $6$ in $S$. The coset action of $S$ on $H$ now gives a homomorphism $S \rightarrow S_6$; since $\mathrm{GL}_3(\mathbb{F}_2)$ is simple and has elements of order $7$, this homomorphism must be trivial. Therefore $H = S$. Similarly $H = T$. Hence $S = T$ and we have the required automorphism.

## Collisions in Greg Egan’s Orthogonal physics

April 12, 2021

In Greg Egan’s remarkable Orthogonal Trilogy, physics differs from ours in one crucial respect: the Lorentz metric $c^2\mathrm{d}t^2 - \mathrm{d}\mathbf{x}^2$ is replaced with the Riemannian metric $\mathrm{d}t^2 + \mathrm{d}\mathbf{x}^2$, putting time on the same footing as space. As one would expect from this author, the consequences are worked out in great detail, and are integral to the plot. This comes at some predictable expense to characterisation and versimilitude: still I’ve found the first two books sufficiently interesting that I can (almost) ignore that much of the exposition consists of the characters giving each other physics tutorials, conducting experiments (which remarkably always work, even if the results often come as a surprise) or listening to each other give lectures and seminars. That said, the physics is leavened by some set-piece ethical dilemmas, and there is also a well-developed biological theme, concentrated on the unique (but not completely implausible) morphology of the Orthogonal race.

The purpose of this post is to do one of the exercises left implicitly to the reader in the second book, The Eternal Flame, in which it is observed that when a photon is deflected by a stationary particle of ordinary matter (called a luxagen’ in the trilogy), it may glance off at two different angles. Moreover, there is a maximum deflection angle, which is independent of the energy of the photon.

Here I’ll show that this follows from the relation between energy and momentum in Orthogonal physics, and the two conservation laws, and implies that Orthogonal photons are heavier than luxagens. I’ll also show that the same behaviour holds in our universe, whenever the moving particle is heavier than the particle it strikes. My solution uses piles and piles of algebra to arrive at a surprisingly simple answer. Egan’s book hints at a simple trigonometric solution: made more plausible because his characters are of course brilliant at Euclidean geometry, thanks to the double motivation from their physics and our shared mathematics. Or maybe there is a better way to organize the algebra, perhaps starting We take the Lagrangian …’, or maybe By working in the reference frame in which the centre of momentum is fixed …’. Either way, I would like to know it.

Much of my interest arises from how Egan (and his characters) arrive at this setup: for more background see the supplementary material on Egan’s website, or read the first book in the trilogy, The Clockwork Rocket. At the time of writing I still have the third book to look forward to.

### Space-time and the energy-momentum relation

There is $4$-dimensional space-time. In our universe, when we measure a time interval $T$ by the distance $cT$ that light travels in this interval, space-time has the Lorentz metric $c^2 \mathrm{d}t^2 - \mathrm{d}x^2 - \mathrm{d}y^2 - \mathrm{d}z^2$. For example, a photon may move from $(0,0,0,0)$ to $(1,c,0,0)$ in one unit of time, and, corresponding to the fact that no time passes for the photon, the Lorentz distance between these space-time points is zero.

In the Orthogonal universe space-time has the Riemannian metric $k^2 \mathrm{d}t^2 + \mathrm{d}x^2 + \mathrm{d}y^2 + \mathrm{d}z^2$, where $k$ is a constant with dimension distance/time; by choosing units appropriately (thus sweeping under the carpet the first half of the first book on Yalda’s experiments on Mount Peerless, the beautiful dual Pythagorean Theorem, and the fact that in Orthogonal physics, the speed of light is not constant), we may assume that the numerical value of $k$ is $1$. We take the following (and only the following, for the time being) as further axioms:

1. Physical laws are invariant under the group of transformations preserving the Lorentz/Riemannian metric.
2. Every particle has a symmetry invariant tangent vector $(kv_t,v_x,v_y,v_z)$ to its world line called its $4$velocity with units distance/time; in our universe $k = c$ and in the Orthogonal universe $k = 1$. A stationary observer measures the particle’s velocity as the $3$-vector $(v_x,v_y,v_z)/v_t$, as illustrated in the diagram below (click for a larger pdf).
3. Multiplying the $4$-velocity, normalized to have length $c$ (our universe) or $1$ (the Orthogonal universe), by the mass $M$ of a particle gives its $4$-momentum $(E/c, \mathbf{p})$, where $\mathbf{p}$ is the $3$-vector momentum.

Note that (1) only makes sense because we measure time and distance in compatible units: thus $v_t$ is dimensionless and whenever a time appears, such as $t_1$ in the diagram above, it is multiplied by $c$. Therefore the symmetry group of space-time can reasonably mix them up.

The way I imagine (2), half-remembered from when I did special relativity 20 years ago (my reference frame), is that the $xt$-plane has clocks every metre, connected by rigid rods, which record the time of passing of every particle. The particle is at position $x_0$ (which conveniently enough happens to have a clock) at time $ct_0$, and at position $x_1$ (similarly conveniently placed) at time $ct_1$. The stationary observer can therefore collect the measurements from the clocks (simply by travelling to each one), and approximate the particle’s velocity as $(x_1-x_0)/(t_1-t_0)$ and so $v_x/v_t \approx (x_1-x_0)/(t_1-t_0)$. Everyday objects in our universe, such as cars, trains and coronavirus, move at a negligible fraction of the speed of light, so $c(t_1-t_0)$ is typically huge compared to $x_1-x_0$, and corresponding the velocity measured by the observer is a tiny fraction of the speed of light.

(Note this is not the velocity that an observer travelling with the particle would measure: this requires the notion of proper time; this is the usual way to define the $4$-vector velocity, but except for the informal diagram and the motivation in the previous paragraph, we’re not going to use it here.)

In general, writing $\mathbf{v}$ for the $3$-velocity, and as usual in physics, $\mathbf{v}^2$ for the square of its $3$-vector Euclidean norm

$\displaystyle ||v|| = \sqrt{v_x^2+ v_y^2+v_z^2},$

axiom (2) implies that $(kv_t,v_x,v_y,v_z) = (kv_t, \mathbf{v}v_t)$. By symmetry invariance, the Lorentzian norm squared of the tangent vector is $(c^2 - \mathbf{v}^2) v_t^2$, and the Euclidean norm squared is $(1+\mathbf{v}^2)v_t^2$ (using the assumption that $k=1$). Now since $(c^2 - \mathbf{v}^2)v_t^2$ has units distance/time, and we care only about the direction of the tangent vector, not its magnitude, it is most convenient (and usual in physics, although I’ve never seen it clearly explained why) to require, as in axiom (3), that the Lorentzian length is $c$.

Thus in our universe $(c^2-\mathbf{v}^2)v_t^2 = c^2$, and so $v_t = c/\sqrt{c^2-\mathbf{v}^2} = 1/\sqrt{1-\mathbf{v}^2/c} = \gamma(\mathbf{v})$, where $\gamma(\mathbf{v})$ is the usual dimensionless Lorentz factor. In the Orthogonal universe $v_t = 1/\sqrt{1+v^2}$; this is again dimensionless because the $1$ in the numerator implicitly has units distance/time.

Therefore the $4$-vector velocity is

$\displaystyle (cv_t,v_x,v_y,v_z) = (c\gamma(\mathbf{v}), \mathbf{v}\gamma(\mathbf{v}) ) = \bigl(\frac{c}{\sqrt{1-\mathbf{v}^2/c^2}}, \frac{\mathbf{v}}{\sqrt{1-\mathbf{v}^2/c^2}}\bigr)$

in our universe and

$\displaystyle (v_t,v_x,v_y,v_z) = \bigl(\frac{1}{\sqrt{1+\mathbf{v}^2}}, \frac{\mathbf{v}}{\sqrt{1+\mathbf{v}^2}} \bigr)$

in the Orthogonal universe.

According to (3) the familiar $3$-vector momentum is obtained by multiplying the spatial component of each of these $4$-vector velocities by the rest mass $M$. In our universe, this agrees with special relativity: a particle moving at $3$-vector velocity $\mathbf{v}$ is heavier according to a stationary observer by the factor $\gamma(\mathbf{v})$. In the Orthogonal universe, we find instead that the $3$-vector momentum is $M\mathbf{v}/\sqrt{1+\mathbf{v}^2}$, so fast moving particles are lighter according to a stationary observer.

Again by (3), the energy $E$ of the particle is given by $E/c = Mc \gamma(\mathbf{v})$ in our universe, and $E = M/\sqrt{1-\mathbf{v}^2}$ in the Orthogonal universe. Substituting in the $4$-vectors, we find that in our universe $(E/c, M\gamma(\mathbf{v})\mathbf{v})$ has Lorentzian length $Mc$, and in the Orthogonal universe, $(E, M\mathbf{v}/\sqrt{1+\mathbf{v}^2})$ has Euclidean length $M$. Equivalently, energy and momentum are related in our universe by

$(E/c)^2 - \mathbf{p}^2 = M^2c^2$

and in the Orthogonal universe by

$E^2 + \mathbf{p}^2 = M^2.$

The former is one statement of the energy-momentum relation between the energy $E$ and the $3$-momentum $\mathbf{p}$. Observe that the Orthogonal energy-momentum relation can be stated very neatly as $E = M \cos \Gamma$ and $|| \mathbf{p} || = M \sin \Gamma$, where $\Gamma \in [0, \pi/2]$. The analogue for our universe, using the identity $\cosh^2 \Gamma - \sinh^2\Gamma = 1$, is $E = Mc^2 \sinh \Gamma$ and $||\mathbf{p} || = Mc \cosh \Gamma$. We show both by the triangles below, noting that only the Orthogonal triangle is a genuine geometric representation, rather than a helpful aide-memoire.

### Collision setup

We now need one final pair of axioms

1. In any system of particles, the sum of all $3$-vector momenta is conserved;
2. In any system of particles, the sum of all energies (as recorded in the time component of $4$-vector momenta) is conserved.

The diagrams below show a stationary particle with mass $m$ hit by a heavy particle with mass $M$ moving in the $x$ direction (and also in time, of course). Again click on the diagram for a larger pdf.

We can suppose that after the collision motion is contained in $txy$-space (of which the diagram above shows only the $xy$-plane), so all $z$-coordinates are zero: let $\phi$ (oriented clockwise) and $\theta$ (oriented anticlockwise) be the deflection angles of the lighter and heavier particle. The $4$-momenta conservation equations are, in our universe,

\begin{aligned} mc + Mc \cosh \Gamma &= mc \cos \delta + Mc \cos \Delta \\ Mc \sinh \Gamma &= mc \sinh \delta \cos \phi + Mc \sinh \Delta \cos \theta \\ 0 &= -mc \sinh \delta \sin \phi + mc \sinh \Delta \sin \theta. \end{aligned}

Notice that these are homogeneous in $c$. We can therefore divide through by $c$ and obtain the equations for the Orthogonal universe by replacing hyperbolic functions with trigonometric functions throughout. As a further small simplification we introduce $T = m + M\, \mathrm{co}\ \Gamma$ and observe that for our universe

$(T-m)^2/M^2 - 1 = \cosh^2 \Gamma - 1 = \sinh^2 \Gamma$

and for the Orthogonal universe,

$-(T-m)^2/M^2 + 1 = -\cos^2 \Gamma + 1 = \sin^2 \Gamma.$

Therefore a unified system of equations is

\begin{aligned} T &= m\, \mathrm{co}\; \delta + M\, \mathrm{co}\; \Delta \\ \sqrt{|(T-m)^2 - M^2|} &= m\, \mathrm{si}\; \delta \cos \phi + M\, \mathrm{si}\; \Delta \cos \theta \\ 0 &= -m\, \mathrm{si}\; \delta \sin \phi + M\, \mathrm{si}\; \Delta \sin \theta \end{aligned}

where $\mathrm{co}, \mathrm{si}$ are $\cosh, \sinh$ in our universe and $\cos, \sin$ in the Orthogonal universe. The expression $(T-m)^2 - M^2$ is positive in our universe and negative for the Orthogonal universe; it has the three constants $m$, $M$ and $T$ (all with units of mass) which we regard as determined by the experimental setup. The unknowns are $\delta$, $\Delta$, $\phi$ and $\theta$, appearing only on the right-hand sides.

### Numerical solutions

It is routine to solve these equations numerically using NSolve in Mathematica. As an example, we suppose that the heavier particle has three times the mass of the lighter, so $m/M = 1/3$. The graph below show the post-collision velocity $V$ of the heavier particle for each of the two possible deflection angles $\theta$, as the energy of the heavier particle increases from least (red) to greatest (violet).

This agrees with the experiment conducted in Orthogonal where the heavier particle is a photon, and the characters observe the deflection angle by visual observation of a system of mirrors. The graph below the has the same axes, comparing the post-collision velocity in the Orthogonal universe (solid) and our universe (dashed), choosing the energies to get comparable shapes. Notice that the maximum deflection angle does not depend on the energy of the heavier particle, or even on which universe we are in: we show below that its sine is the ratio of the masses. Thus in these graphs it is $\sin^{-1} 1/3 \approx 0.33984$.

The graph below is the equivalent of the first graph showing the final energy of the heavier particle. If you are surprised that red is now at the top, note that in Orthogonal physics, the energy is $M/\sqrt{1+V^2}$, so is lower for faster moving particles.

#### Tangent space equation

Differentiating the momentum conservation equations using that $\frac{\mathrm{d}}{\mathrm{d}\alpha} \, \mathrm{si}\ \alpha = \mathrm{co}\ \alpha$, we obtain

\begin{aligned} 0 &= m\, \mathrm{co}\; \delta \cos \phi\, \mathrm{d}\delta - m\, \mathrm{si}\; \delta \sin \phi \,\mathrm{d}\phi \\ &\qquad\qquad + M\, \mathrm{co}\; \Delta \cos \theta\, \mathrm{d}\Delta - M\, \mathrm{si}\; \Delta \sin \theta\, \mathrm{d} \theta, \\ 0 &= -m\, \mathrm{co}\; \delta \sin \phi \mathrm{d}\delta - m\, \mathrm{si}\; \delta \cos \phi\, \mathrm{d}\phi \\ &\qquad\qquad - M\, \mathrm{co}\; \Delta \sin \theta\, \mathrm{d}\Delta - M\, \mathrm{si}\; \Delta \cos \theta\, \mathrm{d} \theta. \end{aligned}

At an extremum for $\theta$, we have $d \theta = 0$. Multiplying the top equation by $\cos \phi$ and the bottom equation by $\sin \phi$ and subtracting, we obtain

\begin{aligned} 0 &= m\, \mathrm{co}\; \delta \cos^2 \phi\, \mathrm{d}\phi - m \, \mathrm{si}\; \delta \sin \phi \cos \phi\, \mathrm{d}\phi \\ & \qquad\qquad\qquad\qquad\qquad\qquad + M\, \mathrm{co}\; \Delta \cos \theta \cos \phi\, \mathrm{d}\Delta \\ &\ + m\, \mathrm{co}\; \delta \sin^2 \phi\, \mathrm{d}\phi + m \, \mathrm{si}\; \delta \cos \phi \sin \phi\, \mathrm{d}\phi \\ &\qquad\qquad\qquad\qquad\qquad\qquad -M\, \mathrm{co}\; \Delta \sin \theta \sin \phi\, \mathrm{d}\Delta \\ & = m\, \mathrm{co}\; \delta (\cos^2 \phi + \sin^2 \phi)\, \mathrm{d}\delta \\ &\qquad\qquad + M\, \mathrm{co}\; \Delta (\cos \theta \sin \phi - \sin \theta \cos \phi)\, \mathrm{d}\Delta \\ &= m\, \mathrm{co}\; \delta\, \mathrm{d}\delta + M \, \mathrm{co}\;\Delta \cos (\theta +\phi) \,\mathrm{d}\Delta. \end{aligned}

Differentiating the equation from energy conservation, using that $\frac{\mathrm{d}}{\mathrm{d}\alpha} \cosh \alpha = \sinh \alpha$ and $\frac{\mathrm{d}}{\mathrm{d}\alpha} \cos \alpha = -\sin \alpha$, so the same sign appears both times, we get

$0 = m \, \mathrm{si} \; \delta \, \mathrm{d}\delta + M \, \mathrm{si}\; \Delta \, \mathrm{d}\Delta.$

Comparing these relations between the independent differentials $\mathrm{d}\delta$ and $\mathrm{d}\Delta$, we see that the coefficients must be in the same ratio: that is

$\displaystyle \frac{\mathrm{si}\;\delta}{\mathrm{co}\; \delta} = \frac{\mathrm{si}\;\Delta}{\mathrm{co}\;\Delta \cos (\theta + \phi)},$

or equivalently,

$\displaystyle \frac{\mathrm{ta}\; \Delta}{\mathrm{ta}\; \delta} = \cos (\theta + \phi)$

where $\mathrm{ta}$ is $\mathrm{tanh}$ in our universe and $\mathrm{tan}$ in the Orthogonal universe.

### Solution for maximum deflection

#### First step: solving for $\delta$ and $\Delta$

The sum of the squares of the two equations for momentum is

\begin{aligned} |(T&-m)^2 - M^2| \\ &= \sqrt{|(T-m)^2 - M^2|}^2 + 0^2 \\ &= (m\, \mathrm{si}\; \delta \cos \phi + M\, \mathrm{si}\; \Delta \cos \theta)^2 \\ &\qquad + (-m\, \mathrm{si}\; \delta \sin \phi + M\, \mathrm{si}\; \Delta \sin \theta)^2 \\ &= m^2\, \mathrm{si}^2\; \delta (\cos^2 \phi + \sin^2\phi) + M^2\, \mathrm{si}^2\; \Delta (\cos^2 \theta + \sin^2\theta) \\ &\qquad + 2mM\, \mathrm{si}\; \delta\, \mathrm{si}\; \Delta (\cos \phi \cos \theta + \sin \phi \sin \theta) \\ &= m^2\, \mathrm{si}^2 \; \delta + M^2 \, \mathrm{si}^2\; \Delta + 2mM\, \mathrm{si}\;\delta \, \mathrm{si}\; \Delta \, \cos(\theta+\phi). \end{aligned}

By the equation from the tangent space, at maximum deflection $\cos(\theta + \phi) = \frac{\mathrm{ta}\; \Delta}{\mathrm{ta}\; \delta}$. Substituting and using

$\mathrm{si}\;\delta \, \mathrm{si}\;\Delta\, \frac{\mathrm{ta}\; \Delta}{\mathrm{ta}\; \delta} = \mathrm{co}\; \delta \, \mathrm{si}\;\Delta\, \mathrm{ta}\; \Delta$

we get

$|(T-m)^2 - M^2| = m^2\, \mathrm{si}^2 \: \delta + M^2 \, \mathrm{si}^2\: \Delta + 2mM\, \mathrm{co}\; \delta \, \mathrm{si}\;\Delta\, \mathrm{ta}\; \Delta$

in which the only unknowns are $\delta$ and $\Delta$. The only equation we have not yet used is the energy conservation equation

$T = m \, \mathrm{co} \; \delta + M \, \mathrm{co}\;\Delta.$

It implies that

$m \, \mathrm{co}\;\delta = T - M\, \mathrm{co}\; \Delta$

and, by squaring this relation and subtracting $m^2$, that

$\pm m^2 \, \mathrm{si}^2 \; \delta = (T - M\, \mathrm{co}\; \Delta)^2 - m^2$

where the sign is $+$ for our universe and $-$ for the Orthogonal universe. Using these two relations to eliminate $m\, \mathrm{co}\; \delta$ and $m^2\, \mathrm{si}^2 \; \delta$ and recalling that $(T-m)^2 - M^2$ is positive for our universe and negative for the Orthogonal universe, we obtain

\begin{aligned} \pm \bigl((T-m)^2 - M^2\bigr) &= \pm \bigl( (T - M\, \mathrm{co}\; \Delta)^2 - m^2\bigr) + M^2 \, \mathrm{si}^2\: \Delta \\ & \qquad\qquad\qquad + 2M (T - M\, \mathrm{co}\; \Delta) \, \mathrm{si}\;\Delta\, \mathrm{ta}\; \Delta \end{aligned}

with the same conventions for the universe dependent signs. Now only $\mathrm{\Delta}$ appears. After multiplying through by $\pm$ we can simplify the right-hand side as follows

\begin{aligned} \bigl((T&-m)^2 - M^2\bigr) \\ &= (T - M\, \mathrm{co}\; \Delta)^2 - m^2 \pm M^2 \, \mathrm{si}^2 \; \Delta \\ &= T^2 + M^2 \, \mathrm{co}^2\; \Delta - 2MT\, \mathrm{co}\; \Delta \\ &\qquad - m^2 \pm M^2\, \mathrm{si}^2 \Delta \pm 2MT\, \frac{\mathrm{si}^2\; \Delta}{\mathrm{co}\; \Delta} \mp 2M^2 \, \mathrm{si}^2 \Delta \\ &= T^2 + M^2 (\mathrm{co}^2\;\Delta \mp \mathrm{si}^2\;\Delta) -m^2 - \frac{2MT}{\mathrm{co}\;\Delta} (\mathrm{co}^2\;\Delta \mp \mathrm{si}^2\;\Delta) \\ &= T^2 + M^2 - m^2- \frac{2MT}{\mathrm{co}\; \Delta}. \end{aligned}

It seems somewhat remarkable to me that all the signs cancel, and we are left with such a simple equation. Perhaps this is a sign that there is some more direct derivation that I am missing. Anyway, cancelling the factors of $T^2$ on each side and rearranging, we get the final equation

$\displaystyle M^2 -m^2 + mT = \frac{MT}{\mathrm{co} \; \Delta}$

which clearly has unique solution

$\mathrm{co}\; \Delta \displaystyle = \frac{MT}{M^2-m^2+mT}.$

Using the relation from energy conservation $m \, \mathrm{co}\; \delta = T - M\, \mathrm{co}\;\Delta$ we immediately get

$\displaystyle m\,\mathrm{co}\; \delta = T - \frac{M^2T}{M^2-m^2+mT} = \frac{-m^2T + mT^2}{M^2-m^2+ mT}$

hence

$\mathrm{co}\; \delta = \displaystyle \frac{T(T-m)}{M^2-m^2+mT}.$

#### Final step: finding $\phi$ and $\theta$

Substituting these expressions for $\mathrm{co}\; \delta$ and $\mathrm{co}\; \Delta$ into the equation from the tangent space equation $\cos (\theta+\phi) = \frac{\mathrm{ta}\; \Delta}{\mathrm{ta}\; \delta}$ we obtain

\begin{aligned} \cos^2 (\theta+\phi) &= \frac{\mathrm{ta}^2\; \Delta}{\mathrm{ta}^2 \; \delta} \\ &= \frac{\mathrm{co}^2 \; \Delta - 1}{\mathrm{co}^2 \; \delta - 1} \frac{\mathrm{co}^2 \; \delta}{\mathrm{co}^2\; \Delta} \\ &= \frac{M^2T^2 - (M^2-m^2+mT)^2}{T^2(T-m)^2 - (M^2-m^2+mT)^2} \frac{T^2(T-m)^2}{M^2T^2}. \end{aligned}

Again there is a surprising simplification: the numerator $M^2T^2 - (M^2-m^2+mT)^2$ of the left-hand fraction becomes $M^2T^2 - (MT)^2 = 0$ if $m = M$, and so vanishes if $m = \pm M$. (Of course only the specialization $m = M$ is physically meaningful.) The quotient by $M^2-m^2$ is $T^2-M^2+m^2-2mT$. Hence we have

\begin{aligned} M^2&T^2 - (M^2- m^2+mT)^2 \\ & \quad = (M-m)(M+m)(T+M-m)(T-M-m).\end{aligned}

Similarly one can show that the denominator factorizes as

\begin{aligned} T^2(T-m)^2 & - (M^2-m^2+mT)^2 \\ &\quad= (T^2+M^2-m^2)(T+M-m)(T-M-m). \end{aligned}

Therefore the expression above for $\cos^2 (\theta + \phi)$ simplifies to

$\cos^2 (\theta + \phi) = \displaystyle \frac{(M-m)(M+m)}{T^2+M^2-m^2} \frac{(T-m)^2}{M^2}.$

Using the momentum conservation equation

$m\,\mathrm{si}\;\delta \sin \phi = M\, \mathrm{si}\; \Delta \sin \theta$

and the same factorizations to simplify $\frac{\mathrm{co}^2\;\Delta -1}{\mathrm{co}^2 \; \delta -1}$ we get

\begin{aligned} 1 &= \frac{M^2\, \mathrm{si}^2 \; \Delta \sin^2 \theta}{m^2\,\mathrm{si}^2 \; \delta \sin^2 \phi} \\ &= \frac{M^2(\mathrm{co}^2 \;\Delta - 1)\sin^2 \theta}{m^2(\mathrm{co}^2 \;\delta - 1)\sin^2 \phi} \\ &= \frac{M^2(M-m)(M+m)}{m^2(T^2+M^2-m^2)} \frac{\sin^2\theta}{\sin^2\phi}.\end{aligned}

Hence

$\displaystyle \sin^2 \phi = \frac{M^2(M^2-m^2)}{m^2(T^2+M^2-m^2)} \sin^2 \theta.$

Rewriting the equation for $\cos^2 (\theta + \phi)$ so it uses only sines and then substituting for $\sin \phi$ using the equation above gives

\begin{aligned} &\frac{T-m}{M}\sqrt{\frac{M^2-m^2}{T^2+M^2-m^2}} \\ &= \sqrt{1-\sin^2 \theta}\sqrt{1-\sin^2\phi} - \sin \theta \sin \phi \\ &= \sqrt{1-\sin^2 \theta}\sqrt{1 - \frac{M^2(M^2-m^2)}{m^2(T+M^2-m^2)} \sin^2 \theta} \\ & \qquad\qquad\qquad\qquad\qquad - \sin^2 \theta \frac{M}{m} \sqrt{\frac{M^2-m^2}{T^2+M^2-m^2}}. \end{aligned}

Hence writing $S$ for $\sin^2 \theta$ we get by a rearrangement and squaring the following quadratic equation in $S$:

\begin{aligned} (1-S)\bigl(1 - & \frac{M^2(M^2-m^2)}{m^2(T^2+M^2-m^2)} S \bigr) \\ &\quad = \bigl( \frac{T-m}{M} + \frac{SM}{m}\bigr)^2 \frac{M^2-m^2}{T^2+M^2-m^2}. \end{aligned}

It is routine to check that the difference between the two sides is

$\displaystyle \frac{(SM^2 - m^2)(M^2-m^2+mT)^2}{M^2m^2(T^2+M^2-m^2)}.$

and so the unique solution is $S = m^2/M^2$. (This is one of several places where we can see that there is a maximum deflection angle if and only if $m \le M$.) Since $\theta$ is positive, we have $\sin \theta = m/M$, and, as claimed some time ago, the sine of the maximum deflection angle is simply the ratio of the masses.

The equation for $\sin^2 \phi$ above now implies that

$\displaystyle \sin^2 \phi = \frac{M^2(M^2-m^2)}{m^2(T^2+M^2-m^2)} \frac{m^2}{M^2} = \frac{M^2-m^2}{T^2+M^2-m^2}$

and hence

$\displaystyle \sin \phi = \sqrt{\frac{M^2-m^2}{T^2+M^2-m^2}}.$

#### Summary

A somewhat remarkable feature of the unique solution for the maximum deflection

\begin{aligned} \sin \phi &=\sqrt{\frac{M^2-m^2}{T^2+M^2-m^2}} \\ \sin \theta &= \frac{m}{M} \\ \mathrm{co}\; \delta &= \frac{T(T-m)}{M^2-m^2+mT} \\ \mathrm{co}\; \Delta &= \frac{TM}{M^2-m^2+mT} \end{aligned}

where $T = m + M \, \mathrm{co}\; \Gamma$ is the total energy (up to the factor $c^2$ in our universe), is that it depends on which universe we are working in only by a change from the hyperbolic function $\cosh$ (our universe) to the trigonometric function $\cos$ (Orthogonal universe). Moreover, the maximum deflection angle $\theta$ does not depend on the universe at all.

## Signed binomial matrices of small order

December 28, 2020

Let $B(n)$ denote the $n \times n$ Pascal’s Triangle matrix defined by $B(n)_{xy} = \binom{x}{y}$, where, as in the motivating paper, we number rows and columns of matrices from 0. More generally, let $B_a(n)$ be the shifted Pascal’s Triangle matrix defined by $B_a(n)_{xy} = \binom{x+a}{y+a}$. Let $J(n)_{xy} = [x+y=n-1]$ be the involution reversing the order of rows (or columns) of a matrix, and let

$D^\pm(n) = \mathrm{Diag}(1,-1,\ldots, (-1)^{n-1}).$

Since $n$ is fixed throughout, we shall write $B$, $B_a$, $J$ and $D^\pm$ for readability in proofs.

Claim 1. For any $a \in \mathbb{N}$, and any $n \ge 2$, the matrix $D^\pm(n)B_a(n)$ is an involution.

Proof. Since $n\ge 2$, the matrix is not the identity. Since $\bigl( D^\pm B_a \bigr)_{xy} = (-1)^x \binom{x+a}{y+a}$ we have

\begin{aligned} \bigl( D^\pm B_a D^\pm B_a \bigr)_{xz} &= (-1)^x \sum_y (-1)^y \binom{x+a}{y+a} \binom{y+a}{z+a} \\ &= (-1)^x \sum_y \binom{x+a}{z+a}\binom{x-z}{y-z}(-1)^y \\ &= (-1)^{x+z} \binom{x+a}{z+a} \sum_r \binom{x-z}{r}(-1)^r \\ &= (-1)^{x+z} \binom{x+a}{z+a} [x=z] (-1)^{x+z} \\ &= [x=z] \end{aligned}

as required. $\quad\Box$

Claim 2. We have $B(n)^{-1} J(n) B(n) = D^\pm(n) J(n) B(n) J(n)$ and $B(n) J(n) B(n)^{-1} = J(n) B(n) J(n) D^\pm(n)$.

Proof. The alternating sum used to prove Claim 1 can be used to show that $B(n)^{-1}_{xy} = (-1)^{x+y} \binom{x}{y}$. Hence

\begin{aligned} \bigl(B(n)&J(n)B(n)^{-1} \bigr)_{xz} \\ &= \sum_{y} (-1)^{x+y} \binom{x}{y} \binom{n-1-y}{z} \\ &= (-1)^x\sum_y (-1)^y \binom{x}{y} (-1)^{n-1-y-z} \binom{-z-1}{n-1-y-z} \\ &= (-1)^{x+n-1-z} \binom{x-z-1}{n-1-z} \\ &= (-1)^{x+n-1-z} \binom{n-1-x}{n-1-z} (-1)^{n-1-z} \\ &= (-1)^x \binom{n-1-x}{n-1-z} \\ &= \bigl( D^\pm(n)J(n)B(n)J(n) \bigr)_{xz}\end{aligned}

as required. The second identity is proved very similarly. $\quad\Box$

Claim 3. For any $n \ge 2$, the matrix $D^\pm(n)J(n)B(n)$ has order $3$ if $n$ is odd and order $6$ if $n$ is even. $\quad\Box$

Proof. Since $\bigl( D^\pm J B \bigr)_{xy} = (-1)^x \binom{n-1-x}{y}$ we have

\begin{aligned} \bigl( D^\pm & J B D^\pm J B D^\pm J B \bigr)_{xw} \\ &= (-1)^x \sum_y \sum_z (-1)^{y+z} \binom{n-1-x}{y}\binom{n-1-y}{z} \\ & \hspace*{3in} \binom{n-1-z}{w} \\ &= (-1)^{x+n-1} \sum_z \Bigl( \sum_y \binom{n-1-x}{y} \binom{-z-1}{n-1-y-z} \Bigr) \\& \hspace*{3in} \binom{n-1-z}{w} \\ &= (-1)^{x+n-1} \sum_z \binom{n-2-x-z}{n-1-z}\binom{n-1-z}{w} \\ &= (-1)^{x} \sum_z (-1)^{z}\binom{x}{n-1-z}\binom{n-1-z}{w} \\ &= (-1)^{x+n-1} \sum_r (-1)^r \binom{x}{r} \binom{r}{w} \\&= (-1)^{x+n-1} (-1)^x [x=w] = (-1)^{n-1}. \end{aligned}

It is easily seen that the matrix is not a scalar multiple of the identity, therefore it has the claimed order. $\quad\Box$

Alternative proof. Another proof uses Claims 1 and 2, as follows

\begin{aligned} (JBD^\pm)^3 &= JBD^\pm JBD^\pm (D^\pm J)B D^\pm \\ &= JB\bigl( D^\pm JBJ\bigr) D^\pm (-1)^{n-1}BD^\pm \\ &= (-1)^{n-1} JB((B^{-1}JB ) JD^\pm BD^\pm \\ &= (-1)^{n-1}BD^\pm BD^\pm \\ &= (-1)^{n-1}I\end{aligned}

with the end as before. $\quad\Box$

Let $X^\circ$ denote the half-turn rotation of an $n \times n$ matrix $X$, as defined by $X^\circ = J(n)XJ(n)$. By Claim 3,

$B(n)D^\pm(n)B(n)^\circ D^\pm = BD^\pm J B J D^\pm$

is conjugate to $JD^\pm BD^\pm JB$ and so to $(-1)^{n-1} (D^\pm JB)(D^\pm JB)$. Hence this matrix has order $3$ when $n$ is odd and order $6$ when $n$ is even. We state without proof some further identities along these lines.

Claim 4. The matrix $B(n)D^\pm(n)B_n(n)^\circ D^\pm(n)$ has order $3$. If $n$ is even then $B(n)D^\pm(n)B_1(n)^\circ D^\pm(n)$ has order $12$. If $n$ is odd then $B_1(n)D^\pm(n)B_1(n)^\circ D^\pm(n)$ has order $3$.

The second case seems particularly remarkable. There are some obstructions (related to root systems) to integer matrices having finite order. These identities were discovered as a spin-off of a somewhat involved construction with linear algebra; I have no idea how to motivate them in any other way. For instance, how looking at

$B(6)D^\pm(6)B_1(6)D^\pm(6) = \left( \begin{matrix} 1 & -6 & 15 & -20 & 15 & -6 \\ 1 & -5 & 10 & -10 & 5 & -1 \\ 1 & -4 & 6 & -4 & 1 & 0 \\ 1 & -3 & 3 & -1 & 0 & 0 \\ 1 & -2 & 1 & 0 & 0 & 0 \\ 1 & -1 & 0 & 0 & 0 & 0 \end{matrix} \right)$

would one guess that it has order $12$? A computer search found many more examples involving more lengthy products of the signed binomial matrices and their rotations, for instance

$B(5)D^\pm B_4(5)^\circ D^\pm B(5)D^\pm B_8(5)^\circ D^\pm$

has order $12$.

## Completely monotone sequences

December 20, 2020

A sequence $\lambda_0, \lambda_1, \ldots, \lambda_n$ is monotone decreasing if $\lambda_0 \ge \lambda_1 \ge \lambda_2 \ge \ldots$, or equivalently, if $\lambda_j - \lambda_{j+1} \ge 0$ for all $j \in \mathbb{N}_0$. We will decide once and for all to deal with decreasing sequences only, and say that such a sequence is completely monotone if all its iterated difference, including the zero differences, are non-negative. That is $\lambda_j \ge 0$, $\lambda_j - \lambda_{j+1} \ge 0$, $\lambda_j - 2\lambda_{j+1} + \lambda_{j+2} \ge 0$, $\lambda_j - 3\lambda_{j+1} + 3\lambda_{j+2} - \lambda_{j+3} \ge 0$ for all $j \in \mathbb{N}_0$, and so on. Equivalently,

$\displaystyle \sum_{i=0}^k (-1)^i \binom{k}{i}\lambda_{j+i} \ge 0$

for all $j, k \in \mathbb{N}_0$. One family of examples I stumbled across, in what seemed like a completely unrelated context of random walks defined on posets with involution, is the two-parameter family

$\displaystyle \lambda^{(a,b)}_j = \frac{\binom{a+b}{a}(a+b+1)}{\binom{a+b+j}{b}(a+b+j+1)}.$

For instance, $\lambda^{(0,0)}_j = \frac{1}{j+1}$ for each $j \in \mathbb{N}_0$. For a direct proof that this sequence is completely monotone, we use the $\beta$-integral $\int_0^1 x^a(1-x)^b \mathrm{d}x = \frac{1}{\binom{a+b}{a}(a+b+1)}$ to write

\displaystyle\begin{aligned}\sum_{i=0}^k& (-1)^i \binom{k}{i} \frac{\lambda^{(a,b)}_j}{\binom{a+b}{a}(a+b+1)} \\ &= \sum_{i=0}^k (-1)^i \binom{k}{i} \int_0^1 x^{a+j+i}(1-x)^b \mathrm{d}x \\ &= \sum_{i=0}^k \int_0^1 x^{a+j}(1-x)^{b+k} \\ &= \frac{1}{\binom{a+b+j+k}{b+k}(a+b+j+k+1)} \\ &= \frac{\lambda^{(a,b+k)}_j}{\binom{a+b+k}{a}(a+b+k+1)}.\end{aligned}

In the same context, I came across some multivariable polynomials that I conjecture are always positive when evaluated at a completely monotone sequence. These polynomials include

\displaystyle \begin{aligned}f_1(x_0,x_1) &= x_0-x_1 \\ f_2(x_0,x_1,x_2) &= x_0^2 - 2x_0x_1 + 2x_0x_2 - x_1x_2 \\ f_3(x_0,x_1,x_2,x_3) &= x_0^3 - 3 x_0^2 x_1 + 5 x_0^2 x_2 - 3 x_0 x_1 x_2\\ & \quad - 3 x_0^2 x_3 + 5 x_0 x_1 x_3 - 3 x_0 x_2 x_3 + x_1 x_2 x_3. \end{aligned}

Their positivity is obvious for $f_1$, because $f_1$ is equal to a difference. For $f_2$ positivity follows from

$\displaystyle f_2(x_0,x_1,x_2) = x_0(x_0-2x_1+x_2) + (x_0-x_1)x_2.$

Despite much struggling, I was unable to find a similar expression for $f_3$ as a sum of products of positive differences. The main purpose of this post is to show that such expressions exist for linear functions, but not in general. I therefore may well have been on a wild-goose chase, hardly for the first time.

#### Linear polynomials

Fix $n \in \mathbb{N}$. Let $\mathcal{C} \subseteq \mathbb{R}^n$ be the cone of all completely monotone sequences, as defined by the inequalities at the top of this post. Let $\mathcal{D} \subseteq \mathbb{R}^n$ be the cone spanned by the coefficients in these inequalities.

To make these cones more explicit, the following notation is helpful. Let $\Delta^k \lambda_j = \sum_{i=0}^k (-1)^i \binom{k}{i}\lambda_{j+i}$. Then $\Delta^{k-1} \lambda_j - \Delta^{k-1} \lambda_{j+1} = \Delta^{k} \lambda_j$, and so $\Delta^k \lambda_{n-1-k} + \Delta^{k-1} \lambda_{n-k} = \Delta^{k-1}\lambda_{n-1-k}$. It follows inductively that $\mathcal{D}$ is the set of non-negative linear combinations of the vectors $v^{(k)}$ defined by

$v^{(k)}_{n-1-i} = (-1)^i\binom{k}{i}$

For instance, if $n=5$ then $v^{(0)}, v^{(1)}, v^{(2)}, v^{(3)}, v^{(4)}$ are the rows of the $5 \times 5$ matrix below

$\left( \begin{matrix} \cdot & \cdot & \cdot & \cdot & 1 \\ \cdot & \cdot & \cdot & 1 & -1 \\ \cdot & \cdot & 1 & -2 & 1 \\ \cdot & 1 & -3 & 3 & -1 \\ 1 & -4 & 6 & -4 & 1 \end{matrix} \right)$

Suppose that $\sum_{i=0}^{n-1} a_i \lambda_i \ge 0$ for all completely monotone sequences. That is, $a \cdot \lambda \ge 0$ for all $\lambda \in \mathcal{C}$; we write this as $a \cdot \mathcal{C} \ge 0$. By Farkas’ Lemma applied to the cone $\mathcal{D}$, either

• $a \in \mathcal{D}$, or
• there exists $\lambda \in \mathbb{R}^n$ such that $\lambda \cdot \mathcal{D} \ge 0$ and $a \cdot \lambda < 0$.

In the either’ case, since $\lambda \cdot \mathcal{D} \ge 0$, the sequence $\lambda$ is completely monotone. But then $a \cdot \lambda < 0$ contradicts the hypothesis on $a$. Therefore $a \in \mathcal{D}$. So we have a simple necessary and sufficient condition for a linear polynomial to be non-negative on all completely monotone sequences: this is the case if and only if the coefficients are a linear coefficients of the vectors $v^{(0)}, v^{(1)}, \ldots, v^{(n-1)}$.

#### Quadratic example

I had hoped that something similar might work for quadratic and higher degree polynomials based on analogously defined cones coming from the coefficients of products in the differences. Here is an counterexample to this hope.

Take $n=3$ and order monomials lexicographically $x_0^2, x_0x_1, x_0x_2, x_1^2, x_1x_2, x_2^2$. The homogeneous quadratics in the difference polynomials $x_0, x_1,x_2, x_0-x_1,x_1-x_2,x_0-2x_1+x_2$ span the same subspace as the rows of the matrix below.

$\left( \begin{matrix} 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 1 & -2 & 1 \\ 0 & 0 & 1 & 0 & -2 & 1 \\ 0 & 1 & -1 & -2 & 3 & -1 \\ 1 & -4 & 2 & 4 & -4 & 1 \end{matrix}\right)$

For instance, the penultimate row has the coefficients for $(\lambda_0-2\lambda_1+\lambda_2)(\lambda_1-\lambda_2)$. Define
$g(x_0,x_1,x_2) = (x_0-x_1)(x_0-x_1-x_2)+x_1^2$.

Claim. $g(\lambda_0, \lambda_1, \lambda_2) \ge 0$ for all completely monotonic sequences $(\lambda_0,\lambda_1,\lambda_2)$.

Proof. Since the polynomial is homogeneous, we may assume that $\lambda_0 = 1$. Since $g$ is linear in $x_2$, when evaluated at $1,\lambda_1,\lambda_2$, it is negative if and only if

$\displaystyle \lambda_2 \ge \frac{(1-\lambda_1)^2+\lambda_1^2}{1-\lambda_1} = \frac{1-2\lambda_1 + 2\lambda_1^2}{1-\lambda_1}$

This inequality implies that

$\displaystyle \lambda_1 - \lambda_2 \le \frac{-1 + 3\lambda_1 - 3\lambda_1^2}{1-\lambda_1}$

but the quadratic $1 - 3y + 3y^2 - 3y = 3(y-1/2)^2 + \frac{1}{4}$ is always at least $\frac{1}{4}$. Hence $\lambda_1 -\lambda_2 \le -\frac{1}{4}$, contradicting complete monotonicity. $\qquad\Box$

Claim. $g$ is not a positive linear combination of homogeneous quadratics in the differences $x_0,x_1,x_2$, $x_0-x_1,x_1-x_2$, $x_0-2x_1+x_2$.

Proof. It is equivalent to show that the coefficients of $g$, namely $(1, -2, -1, 2, 1, 0)$ are not in the cone of positive linear combinations of the $6 \times 6$ matrix above. Using the computer algebra Magma, one can compute its dual cone, which turns out to be the set of positive linear combinations of the rows of the matrix below.

$\left(\begin{matrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 4 & 1 & 0 & 0 & 0 & 0 \\ 2 & 1 & 1 & 0 & 0 & 0 \\ 4 & 2 & 0 & 1 & 0 & 0 \\ 4& 3 & 2 & 2 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 \end{matrix}\right)$

In particular, the dual cone contains $(2,1,1,0,0,0)$, and since $(1, -2, -1, 2, 1, 0) \cdot (2,1,1,0,0,0) = -1$, $(1,-2,-1,2,1,0)$ is not in the dual of the dual cone, as required.$\qquad\Box$

I suspect that there is no easy way (or maybe even no algorithmic way?) to decide if a homogeneous polynomial in a completely monotone sequence is non-negative, but would be very happy to be proved wrong on this.

## Immanants and representations of the infinite symmetric group

October 12, 2020

This is a reminder to the author of a wild question. (I won’t even call it an idea.) Given a $n \times n$ matrix $X$ and a symmetric group character $\chi$, the immanant $d_\chi(X)$ is defined by

$\displaystyle \sum_{\sigma \in \mathrm{Sym}_n} \chi(\sigma) \prod_{i=1}^n X_{(i,\sigma(i))}.$

Thus when $\chi$ is the sign character the immanant is the familiar determinant, and when $\chi$ is the trivial character, it is the permanent, of interest partly because of the Permanent dominance conjecture and its starring role in Valiant’s programme to prove the algebraic analogue of the P $\not=$ NP conjecture. Very roughly stated, a more general conjecture is that all immanants, except for the determinant (or multiples of it), are hard to compute.

My wild question is: can one generalize immanants to representations of infinite symmetric groups, and does this give any extra freedom to in some sense interpolate’ from the trivial to the sign character?

## A game on tournaments with multiple edges

August 21, 2020

Here is a quick post to record a combinatorial game that I think might be of interest, but I don’t have time to think about seriously now.

When Person A plays Person B on the online chess server on which I spend far too much time, the server looks to see who has the longest run of consecutive games as black, and chooses that person to be white. Ties are broken randomly. This is a simple but effective rule: for instance it guarantees that in a series of games, the players alternate colours.

Now suppose a team of people want to manipulate the server into giving one of their members a long run of consecutive games as white. (This would of course be strictly against the server policies.) How big must the team be to guarantee that some member gets $n$ consecutive games as white?

A quick look at the small cases reveals the answer. With three people, having arbitrary past histories (this matters below), A plays B and the black player, B say, then plays C. If B is white, then when A plays B again, either A or B establishes a run of two games as white. If B is black (for instance this always happens if C has had a run of two black games), then when A plays C, either A or C gets the same run. Note that in the first case, A never plays C, but instead plays B twice.

With four people, we use the strategy above to give A a run of two white games, and then using B, C, D, give B say, another run of two white games. Now when A and B play, a run of three white games is established. Continuing inductively we see that $n$ people can force $n-1$ consecutive white games.

One might also ask for the analogous result where multiple games between the same people are not allowed. If the aim is instead to maximize the number of games as white, then we obtain the combinatorial game where the ‘Picker’ picks an edge of the complete graph $K_n$ and the ‘Chooser’ chooses its orientation. Since in the end all edges are picked exactly once, the role of the ‘Picker’ is defunct: the Chooser can choose an $n$ vertex tournament before play begins, and follow its recommendations. By Theorem 1.2 in this article of Alspach and Gavlas, when $n$ is odd, $K_n$ can be decomposed into $n(n-1)/2m$ edge cycles of length $m$ if and only if $m$ divides $n$. Directing each cycle in an arbitrary way then gives a tournament where every vertex has the same in- and out-degree. A simple solution when $n$ is prime is to take $m=n$ and use the cycles $(0,d,2d,\ldots,(n-1)d)$ for $1 \le d \le (n-1)/2$ on the vertex set $\{0,1,\ldots, n-1\}$, working modulo $n$. When $n$ is even, a discrepancy of $1$ between the in- and out-degrees of every vertex is inevitable, and it’s not hard to use the construction for $n$ odd to see that this can be achieved.

Returning to the original setting where we care about runs of white games, this shows that the Chooser can guarantee the longest run is at most the maximum out-degree, namely $\lfloor n/2 \rfloor$.

I’ll end with two natural questions suggested by this quick analysis.

#### Question 1

Let $G$ be a finite graph. In each step the Picker chooses an edge of $G$, and the Chooser directs it; multiple picks are allowed, and the Chooser may choose a different orientation on different picks. Say that vertex $v$ has a run of length $m$ if in the most recent $m$ picks involving this vertex, it was always the source. What is the maximum run length that the Picker can force somewhere in this graph?

#### Question 2

As Question 1, but now each edge may only be picked once.

For the first in the case of the complete graph, the analysis above shows that the Picker can force a run of length $n-1$, and since the chooser can make the next directed edge involving this vertex an in-edge, this is best possible. For the second in the case of the complete graph, the Picker cannot do better than $\lfloor n/2 \rfloor$, but I do not know if this is always possible.

## Flipped classroom and online learning: Personal FAQ

June 7, 2020

I spent most of June 2 and 3 attending two meetings about online learning: Flexible education (organized by Royal Holloway), at which a College wide teaching model was unveiled, and Teaching and Learning Mathematics Online (organized by Michael Grove, Rachel Hilliam and Kevin Houston). A recurring theme was ‘Active Blended Learning’ or the ‘Flipped Classroom’.  The purpose of this post is to record some observations made by speakers at these talks and my reflections on the research literature on these pedagogic models. Please may I emphasis that this post has absolutely no official status.

#### Does the College model require the flipped classroom?

This isn’t said explicitly, but I think it is implicit throughout. For example in the seven step programme that we are supposed to follow for each ‘teaching week’, we read

• 3. Engage with learning material. Provides time for students to independently or in groups to engage with learning materials.
• 4. Learning activity. Enables the practical and critical application of learning through individual or group activities.
• 5. Learning Check. Allows the student and lecturer(s) to check that the student learned the content and achieved the learning outcome(s) for the week through an activity or formative/summative assessment.

All this seems to sit fairly happily in a framework where students are expected to work off-line and then come together for synchronous problem solving sessions, maybe checking their understanding before/after by an online quiz. It does not, as as I see it, fit with the traditional model of three live lectures per week.

#### What is the research evidence for the flipped classroom and active blended learning?

Freeman et al Active learning increases student performance in science, engineering, and mathematics, PNAS 111 (2014) 410–8415 is a meta-analysis of 225 studies that compared student performance in courses lectured in the traditional style and student performance courses with ‘at least some active learning’. The conclusion is clear: active learning improved student performance by about 1/2 of a standard deviation.

More strikingly, the average failure rate was 21.8% under active learning compared to 33.8% under traditional lecturing. Measures of student engagement and satisfaction were not considered in this meta-analysis; my own guess is that the reduction in failure rate is correlated with greater engagement by the weaker students.

A survey talk by Robert Talbot, updating the conclusions of his 2017 book Flipped Learning: A Guide for Higher Education, also has clear conclusions. From the linked slides (emphasis preserved):

• Students in FL courses typically show either greater gains in measures of learning than students in traditional courses or else the differences are not statistically significant (slide 16).
• Students show higher satisfaction with FL and the active learning techniques once FL is in place (slide 18).
• Students often highly negative about FL when first introduced, ‘even while acknowledging benefits of increased group work, more instructor attention, and better grades’ (slide 18).

In Jacqueline O’Flaherty and Craig Philips The use of flipped classrooms in higher education, Internet and Higher Education 25 (2015) 85–95 the authors conclude in Section 4.4:

Our review indicates a number of positive student learning outcomes from this pedagogical approach,

but add:

This review found very few studies that actually demonstrated robust evidence to support that the flipped learning approach is more effective than conventional teaching methods. Only one study used empirical validation to show that a structured flipped classroom in comparison to the traditional one could effectively engage students in deep learning [Hung, Flipping the classroom for English language learners to foster active learning]. Whilst some studies referred to a modest improvement of academic performance, through outcomes of increased examination scores or improved student satisfaction, further research is required in this area …

The cited study was a randomised controlled trial comparing two flipped models with the traditional lecture model on the same content on 75 students. Taking this as the minimum standard for a reliable study is clearly a high (but not unreasonable) barrier when collecting evidence.

Finally I’ll mention a study by James McEvoy, Interactive problem-solving sessions in an introductory bioscience course engaged students and gave them feedback, but did not increase their exam scores. James ‘partially-flipped’ a Royal Holloway biology course by replacing one of two weekly lectures with an interactive problem solving class (see page 5 for details of how this was run). After the change there was a significant improvement in student responses to the two questions ‘The teaching style was engaging‘ and ‘I received feedback on my progress during the course‘. However other measures of student engagement, and (as the title makes clear) exam performance were not significantly affected; there was however a reduction in failure rate.

McEvoy’s findings are consistent with the Freeman et al study and my belief that the flipped classroom may benefit weakest students the most, by helping them to get somewhere, rather than nowhere (as alas, is too often the case in mathematics courses).

#### What is ‘flipped learning’ anyway?

Talbot’s suggested definition (see about half way down) is

Flipped Learning is a pedagogical approach in which first contact with new concepts moves from the group learning space to the individual learning space in the form of structured activity, and the resulting group space is transformed into a dynamic, interactive learning environment where the educator guides students as they apply concepts and engage creatively in the subject matter.

He is clear that flipped learning does not necessarily require videos, and that a methodological flaw in some metastudies is that they exclude teaching models where videos were not used. Neither does flipped learning necessarily require lecturing. Instead a key feature is that the first contact with new concepts comes from a structured activity, done on one’s own. On my reading, the instructors’s duty is to plan this activity, and then enable students to apply the ideas creatively in individual and group work. Reassuringly (see slide 20), he reports

Implementation matters but not as much as simply offloading direct instruction into structured student activity and using the class time for active learning.

#### How does one make the flipped classroom work?

Here are some points that I jotted down from the various talks. Sue Pawley’s talk at TALMO Is there anyone out there? A guide to interactive activities in the online environment was especially useful.

• Make all live sessions worth attending. Don’t deliver the lecture (again), but instead get students to work in groups, learning from each other and you. Make live sessions motivating. Plan your teaching. Think context not content.
• Getting feedback from students is difficult: do not expect many students to speak in front of their peers. They will never speak unless recording is switched off. Quizzes are good and technologically easy. I want a system where students can collaborate in small groups on a ‘digital whiteboard’; at the moment this seems to require high-end tablets.
• Give students a ‘scaffold’. This was stressed by speakers at both events. Do not just say ‘read this’. Do not bombard students with content. Instead say ‘read this, watch that, do this quiz to check your (basic) understanding’.
• Do not use peer review since this just adds to the fear of being wrong. But critiquing anonymous wrong answers (we could even fabricate them) can be useful.
• As an example of what can be done, when everyone involved has a good internet connection and a latest-model iPad Pro, this video has three of the Royal Holloway mathematics staff (two of them pretending to be students) collaborating on a simple geometry problem.
• One thought from me: in all the online talks, the convener collated questions in the chat and then selected the most useful (or most upvoted) to ask. This seems a clear improvement on the traditional model where the most assertive person gets to set the initial subject and tone of the questions.

#### What ideas are there for quiz questions?

George Kinnear’s talk at TALMO Using quizzes to deliver a course online had many useful ideas, going beyond the usual multiple choice model.

• Faded worked examples: leave gaps in a model proof or solution for students to fill in.
• Invite students to create examples (I think they will need a lot of hand-holding).
• The ‘test effect’: recalling information from memory is beneficial (in the context of the always-connected internet generation, I think this deserves saying). For instance, before a section on integration, he got students to make a table of important functions (left column) and their derivatives (right column).

I very much hope we will be able to use the STACK computer algebra system (which integrates with Moodle) to set questions. It can validate quite complicated algebraic answers and give immediate feedback. For instance, it was used to mark the integration quiz above, and even suggest important functions the students had not used.

My own view is that quiz questions used to review material should be very basic, while questions in interactive workshops should be much tougher and focus on potential misconceptions and common errors.

#### What about flipping more advanced courses?

All the talks I’ve been to were on fairly basic courses: this includes Kevin Houston’s talk at an LMS Education Day in 2016. There is little research on flipping advanced courses. I believe it can work, e.g. Kinnear’s idea of the faded worked example can be adapted to the faded proof. And advanced courses offer many opportunities for non-trivial (for students) questions about relaxing hypotheses, or strengthening conclusion, as well as reversing implications, chaining implications, spotting contrapositive restatements that are helpful, and so on.

#### Is this consistent with academic freedom?

I hope so, but at a recent EPMS meeting, the Head of School admitted some freedom would be lost. Personally I’m keen on moving to a flipped model, but I will defend to the (academic) death the freedom of colleagues to teach as they see fit, even when I firmly believe what they are proposing will be ineffective. As the slide below from James McEvoy’s talk shows, the lecturer’s preference for the flipped/traditional model is correlated with student performance.

I’m also concerned about the lack of consultation over the College’s model: I dislike having it patiently explained to me that what they’ve decided is for the best. (Of course consultation is not the same as listening, as we all know well.) By contrast, the Maths model has been extensively consulted on and we are still considering how some parts, e.g. group work, will work in the light of concerns of colleagues.

## Athena SWAN: Personal FAQ

May 5, 2020

Yesterday I finished writing the draft Athena SWAN submission for the Royal Holloway Mathematics and Information Security Group and sent it to members of the two departments for comments.

In this post I’ll collect the gist of those replies I’ve made to comments that might be of general interest (and can be made public), and a few extra ‘non-FAQS’, that I hope will still be of interest. The post ends with the references from the bid with clickable hyperlinks.

#### Why is the bid so focused on women and gender? What about disability or LGBT+ people?

The primary purpose of Athena SWAN is to address gender inequalities. In Mathematics and Information Security this means tackling the under-representation of women in almost everything we do.

Royal Holloway was one of the first HEIs to get a Race Equality Charter Mark and some of the proposed actions are aimed at BAME (Black, Asian and Minority Ethnic) students. Another action will promote the College workshop ‘How to be an LGBT ally’ and the (excellent) external SafeZone training. Yes, we should do more, but this is a Bronze bid so only the start of a long process to address inequalities.

#### Are women really under-represented?

I think yes. Mathematics is ahead of many sector norms: for example 40.6% of our undergraduate intake is female, compared to the sector mean of 35.7%; 38.8% of A-level Mathematics students are women. Of our staff 25.0% are women, compared to a sector mean of 20.4%; of professors, 23.1% are women, compared to an appalling sector mean of 12.6%. But all that said, women are half the population, and about 40% of new mathematics graduates are women, so we have very far to go.

#### Isn’t it discrimination to focus so many actions on women?

I argue very firmly no. The question is a ‘non-FAQ’ that I’ve deliberately worded in a pejorative way. (It is impossible to use the word ‘discrimination’ in this context in the positive sense ‘X has a fine discriminating palate for wine’.) By improving our policies and procedures and thinking particularly about women, we very often make life better for everyone. It is not a zero-sum game. It is legitimate to target actions at women to address under-representation. This does not imply that critical decisions, such as recruitment and promotion, will then be biased in favour of women.

#### Why the focus on unconscious bias?

Unconscious bias training was the most frequently requested form of training when we surveyed all staff and Ph.D. students. There is strong evidence that unconscious bias exists and prevents women from achieving their potential. An important early study is Science faculty’s subtle gender biases favor male students by Moss-Racusin et al, which asked scientists to evaluate two CVs for a job as a lab manager. The CVs were identical except in one the candidate’s first name was ‘John’, and in the other ‘Jennifer’. Both men and women rated Jennifer as less competent than John, and recommended a lower starting salary.

There is evidence that unconscious bias training can be effective for reducing unconscious bias (see pages 6 and 16: the overall picture is mixed, but the conclusion is clear). My own experience suggests that high-quality training and reading around the issue has made me more aware of the issues, and at least slightly less likely to rush to (probably poor) conclusions.

#### What can I read about unconscious bias?

I highly recommend the third part of Cordelia Fine’s book Delusions of gender. The first two parts make a very convincing case that many stereotypical gender traits are not hard-wired, but instead products of culture and upbringing, or even (on closer inspection) non-existent. The final part examines how our remorselessly gendered society creates these biases and misconceptions.

#### What is unconscious bias?

First of all, I prefer the term ‘implicit bias’, since one can wrongly interpret ‘unconscious bias’ as referring to something that is independent of our thought processes and beyond our control.

Let me introduce my personal answer with an object that should be emotionally neutral and familiar to readers, namely ‘a vector space’. What comes into your head? Is it a finite dimensional vector space over $\mathbb{R}$, such as the familiar Euclidean space $\mathbb{R}^3$, or (my answer) an indeterminately large space over a finite field: $\mathbb{F}_q^n$? Or perhaps the most important vector spaces you meet are function spaces, in which case you might be imagining Hilbert space, with or without a preferred orthonormal basis. Yet other answers are possible: someone working in cryptography might think of $\mathbb{F}_2^{256}$. Quite possibly, I’ve missed your preferred example completely. Or maybe your brain just doesn’t work this way, and all you think about is the abstract definition of vector spaces and your immediate associations are to the main theorems you use when working with them. Anyway, my point is that we don’t think about vector spaces ‘in isolation’: instead they come with a bundle of implicit associations that are deeply shaped by our education and day-to-day experience.

Now instead think about ‘Mathematics professor’. Without claiming the thought processes are completely analogous, I hope you will agree that something similar goes on, with a bunch of implicit associations coming into our heads. For instance, I immediately start thinking about some of my professorial colleagues in Mathematics and ISG. In this respect I’m lucky: because I have personal examples to draw on, my immediate mental image is not the stereotypical old white man.

Taking this as a roughly accurate portrait of human cognition, we now see a mechanism in which bias can enter our decisions. For instance, in the Stanford lab manager study, the implicit associations around the word ‘manager’ bring to mind men, and so the male candidate is favoured. I suspect either you will readily accept this point, or feel it is completely unwarranted, so I won’t argue it any further, but instead refer you to the literature.

#### What about the Implicit Association Test?

My reading suggests that the Implicit Association Test is valuable as a way of raising awarenss of implicit bias. But is has been much criticised, and it is not clear the biases it identifies translate into unfair discrimination.

#### Isn’t this a huge piece of bureaucracy?

A ‘non-FAQ’, although the question has occurred to others. The answer is ‘Yes’. A recent report has many criticism of the Athena SWAN process. For instance, from the summary on page 3:

The application process must be streamlined and the administrative burden on staff, particularly female staff, reduced.

For what it’s worth, I think I could have written a major grant application or completed a substantial research project in the time it took just to draft the submission. Even this rough measure takes no account of the hours of time (not just mine) spent consulting over draft actions and the many weeks of work that the College’s Equality and Diversity Coordinator put into the bid.

#### What’s the point of doing all this when it clearly wouldn’t address Y (where Y is the manifest injustice of your choice)?

Just because it (probably) wouldn’t have prevented Y, doesn’t mean it isn’t worth doing for other reasons.

#### No seriously, what are the consequences if we don’t have an Athena SWAN award?

RCUK (the main funder for mathematics research) recommends ‘participation in schemes such as Athena SWAN’. There is no requirement to have an award, and my impression (contrary to what I half-expected) is that it is not likely to become a requirement in the near future.

Royal Holloway expects its departments to apply for awards. So if we don’t get it, we will either have to change this policy or work towards a reapplication in a few years. In short, we will be back to where we were two years ago. We could implement the Action Plan anyway, but without the motivation of holiding an award, progress might slip.

The Action Plan was formulated after long discussions within the E&D Committee and consulted on widely with department members. All actions are owned by the member of staff most closely involved: this is typically not me (the E&D Champion). I believe it will drive substantial culture improvements in Mathematics and ISG.

#### Do all Athena SWAN applications have references to the research literature on gender equality and feminism?

(A blatant ‘non-FAQ’.) No. In fact ours is the first I’ve seen. Probably it’s also the first Athena SWAN bid in which the Action Plan is generated by a customised database written from scratch in a functional programming language and outputting to LaTeX.

### References

1. Pragya Agarwal, SWAY: Unravelling unconscious bias, Bloomsbury, 2020.
2. Robert W. Aldritch et al, Black, Asian and Minority Ethnic groups in England are at increased risk of death from COVID-19: indirect standardisation of NHS mortality data [version 1; peer review: awaiting peer review], Wellcome Open Research, Coronavirus (COVID-19) collection, 6 May 2020.
3. Doyin Atewologun, Tinu Cornish and Fatima Tresh, Unconscious bias training: An assessment of the evidence for effectiveness, Equality and Human Rights Commission research report 113, March 2018.
4. Athena SWAN Charter Review Independent Steering Group for Advance HE, The Future of Athena SWAN, March 2020.
5. Anne Boring, Kellie Ottoboni and Philip B. Start, Student evaluations of teaching (mostly) do not measure teaching effectiveness, ScienceOpen Research (2016)
6. Caroline Criado-Perez, Invisible Women: Exposing Data Bias in a World Designed for Men, Chatto & Windus, 2019.
7. Cordelia Fine, Delusions of Gender, Icon Books Ltd, 2005.
8. Cordelia Fine, Testosterone Rex Icon Books Ltd, 2017.
9. Uta Frith, Understanding unconscious bias. Royal Society, 2015.
10. Cassandra M. Guarino and Victor M. H. Borden, Faculty service loads and gender: are women taking care of the academic family?, Research in Higher Education (2017) 58 672–694.
11. Nancy Hopkins, Diversification of a university faculty:
Observations on hiring women faculty in the Schools of Science and Engineering at MIT
, MIT Faculty Newsletter 2006 XVIII.
12. Corinne A. Moss-Racusin, John F. Dovidio, Victoria L. Brescoll, Mark J. Grahama, and Jo Handelsmana, Science faculty’s subtle gender biases favor male students, PNAS (2012) 109 16474–16479.
13. Ruth Pearce, Certifying equality? Critical reflections on Athena SWAN and equality accreditation, report for Centre for Women and Gender, University of Warwick, July 2017.
14. Research Excellence Framework, Guidance on submissions 2021, January 2019.
15. Safezone training.
16. UK Trendence Research, How are Gen Z responding to the coronavirus pandemic?, March 2020.
17. Trades Union Congress, Women and casualization: Women’s experience of job insecurity, January 2015.
18. Sandra Tzvetkova and Esteban Ortiz-Ospina, Working women: What determines female labor force participation?, Our World in Data (2017).
19. Liz Whitelegg, Jennifer Dyer and Eugenie Hunsicker, Work allocation models, Athena Forum, January 2018.

## The counter-intuitive behaviour of high-dimensional spaces

April 18, 2020

This post is an extended version of a talk I last gave a few years ago on an extended Summer visit to Bristol; four weeks into lockdown seems like a good time to write it up for a more varied audience. The overall thesis is that it’s hard to have a good intuition for really high-dimensional spaces, and that the reason is that, asked to picture such a space, most of us come far closer to something like $\mathbb{R}^4$ then the more accurate $\mathbb{F}_2^{1000}$. This is reflected in the rule of thumb that the size of a tower of exponents is determined by the number at the top: $1000^2$ is tiny compared to $2^{1000}$ and $2^{2^{2^{100}}}$ should seem only a tiny bit smaller than

$\displaystyle 4^{2^{2^{100}}} = 2^{2^{2^{100}+1}}$

when both are compared with $2^{2^{2^{101}}}$.

Some details in the proofs below are left to the highly optional exercise sheet that accompanies this post. Sources are acknowledged at the end.

As a warm up for my talk, I invited the audience to order the following cardinals:

$2, 2^{56}, 120^{10}, 10^{120}, \mathbb{N}, 2^\mathbb{N}, 2^{2^\mathbb{N}}.$

Of course they are already correctly (and strictly) ordered by size. From the perspective of ‘effective computation’, I claim that $2$ and $2^{56}$ are ‘tiny’, $10^{120}, \mathbb{N}$ and $2^\mathbb{N}$ are ‘huge’ and $120^{10}$ and $2^{2^\mathbb{N}}$ sit somewhere in the middle. To give some hint of the most surprising part of this, interpret $2^\mathbb{N}$ as the Cantor set $C$, so $2^{2^\mathbb{N}}$ is the set of subsets of $C$. Computationally definable subsets of $C$ are then computable predicates on $C$ (i.e. functions $C \rightarrow \{T, F\}$), and since $C$ is compact, it is computable whether two such predicates are equal. In contrast, there is no algorithm that, given two predicates on $\mathbb{N}$, will run for a finite length of time and then return the correct answer.

### Euclidean space and unit balls

Let $B^n = \{x \in \mathbb{R}^n : ||x|| \le 1\}$ be the solid $n$-dimensional unit ball in Euclidean space and let $S^n = \{x \in \mathbb{R}^{n+1} : ||x|| = 1\}$ be the $n$-dimensional surface of $B^{n+1}$.

#### House prices in Sphereland

Imagine a hypothetical ‘Sphereland’ whose inhabitants live uniformly distributed on the surface $S^n$. For instance, $S^2$ is the one-point compactification of Flatland. With a god’s eye view you are at the origin, and survey the world. At what values of the final coordinate are most of the inhabitants to be found?

For example, the diagram below shows the case $n=2$ with two regions of equal height shaded.

It is a remarkable fact, going all the way back to Archimedes, that the surface areas of the red and blue regions are equal: the reduction in cross-section as we go upwards is exactly compensated by the shallower slope. For a calculus proof, observe that in the usual Cartesian coordinate system, we can parametrize the part of the surface with $x=0$ and $z > 0$ by $(0,y,\sqrt{1-y^2})$. Choose $k > 0$. Then since $(0,-z,y)$ is orthogonal to the gradient of the sphere at $(0,y,z)$, to first order, if we increase the height $z$ from $z$ to $z+k$ then we must decrease the second coordinate $y$ from $y$ to $y-z/y k$. This is shown in the diagram below.

Hence the norm squared of the marked line segment tangent to the surface is

$\displaystyle \bigl|\bigl| \bigl(0, -\frac{z}{y}k, k\bigr) \bigr|\bigr|^2 = \frac{z^2k^2}{y^2} + k^2 = k^2 \frac{z^2+y^2}{y^2} = k^2 \frac{1}{1-z^2}.$

As in Archimedes’ argument, the area of the region $R$ between the lines of latitude of height between $z$ and $z+k$ is (to first order in $k$) the product of $k/\sqrt{1-z^2}$ and the circumference of the line of latitude at height $z$. It is therefore

$\displaystyle \frac{k}{\sqrt{1-z^2}} \times 2 \pi \sqrt{1-z^2} = 2\pi k$

which is independent of $z$. As a small check, integrating over $z$ from $-1$ to $1$ we get $4\pi$ for the surface area of the sphere; as expected this is also the surface area of the enclosing cylinder of radius $1$ and height $2$.

This cancellation in the displayed formula above is special to the case $n=2$. For instance, when $n=1$, by the argument just given, the arclength at height $z$ is proportional to $1/\sqrt{1-z^2}$. Therefore in the one-point compactification of Lineland (a place one might feel is already insular enough), from the god’s perspective (considering slices with varying $z$), most of the inhabitants are near the north and south poles ($z = \pm 1$), and almost no-one lives on the equation ($z=0$).

More generally, the surface area of $S^n$ is given by integrating the product of the arclength $1/\sqrt{1-z^2}$ and the surface area of the ‘latitude cross-section’ at height $z$. The latter is the $(n-1)$-dimensional sphere of radius $\sqrt{1-z^2}$. By dimensional analysis, its surface area is $C\bigl( \sqrt{1-z^2}\bigr)^{n-1}$ for some constant $C$. Hence the density of Sphereland inhabitants at height $z$ is proportional to $\bigl( \sqrt{1-z^2} \bigr)^{n-2}$. (A complete proof of this using only elementary calculus is given in Exercise 1 of the problem sheet.) In particular, we see that when $n$ is large, almost all the density is at the equator $z=0$. From the god’s eye perspective, in high-dimensional Sphereland, everyone lives on the equator, or a tiny distance from it. To emphasise this point, here are the probability density functions for $n \in \{1,2,3,5,10,25\}$ using colours red then blue then black.

As a further example, the scatter points below show $1000$ random samples from $S^{20}$, taking (left) $(x_1,x_2)$ coordinates and (right) $(x_{20}, x_{21})$ coordinates. Note that almost no points have a coordinate more than $1/2$ (in absolute value). Moreover, since the expected value of each $x_i^2$ is $1/20$, we find most of the mass at $x_i \approx 1/4$.

In particular, if a random inhabitant of large $n$-Spaceland is at $(x_1,\ldots, x_{n+1})$ then it is almost certain that most of the $x_i$ are very small.

One feature of this seems deeply unintuitive to me. There is, after all, nothing intrinsic about the $z$-coordinate. Indeed, the god can pick any hyperplane in $\mathbb{R}^{n+1}$ through the origin, and get a similar conclusion.

#### Concentration of measure

One could make the claim about the expected sizes of the coordinates more precise by continuing with differential geometry, but Henry Cohn’s answer to this Mathoverflow question on concentration of measure gives an elegant alternative approach. Let $X_1, \ldots, X_n$ be independent identically distributed normal random variables each with mean $0$ and variance $1/n$. Then $(X_1,\ldots,X_n)$ normalized to have length $1$ is uniformly distributed on $S^n$. (A nice way to see this is to use that a linear combination of normal random variables is normally distributed to show invariance under the orthogonal group $\mathrm{SO}_n(\mathbb{R})$.) Moreover, $\mathbb{E}[X_1^2 + \cdots + X_n^2] = 1$, and, using that $\mathrm{Var}[X_i^4] = 3/n^2$, we get

$\displaystyle \mathrm{Var}[X_1^2 + \cdots + X_n^2] = n \frac{3}{n^2} + n(n-1) \frac{1}{n^2} - 1 = \frac{2}{n}.$

Hence, by Chebychev’s Inequality that if $Z$ is a random variable then

$\displaystyle \mathbb{P}\bigl[|Z-\mathbb{E}Z| \ge c\bigr] \le \frac{\mathrm{Var} Z}{c^2},$

the probability that $X_1^2 + \cdots + X_n^2 \not\in (1-\epsilon,1+\epsilon)$ is at most $2/n\epsilon^2$ which tends to $0$ as $n \rightarrow \infty$. Therefore we can, with small error, neglect the normalization and regard $(X_1,\ldots, X_n)$ as a uniformly chosen point on $S^{n-1}$. By Markov’s inequality, that if $Z$ is a non-negative random variable then $\mathbb{P}[Z > a\mathbb{E}Z] < 1/a$, the probability that $|X_1| \ge a/\sqrt{n}$ (or equivalently, $X_1^2 \ge a^2/n$) is at most $1/a^2$, and the probability that $|X_1| \ge 1/n^{1/3}$ (or equivalently, $X_1^2 \ge 1/n^{2/3} = a/n$ with $a = n^{1/3}$) is at most $1/n^{1/3}$. Since

$\displaystyle \bigl(1 - \frac{1}{n^{1/3}}\bigr)^n \approx e^{-n/n^{1/3}} \rightarrow 0$

as $n \rightarrow \infty$, we see that with high probability, a random Sphereland inhabitant has all its coordinates in $(-1/n^{1/3}, 1/n^{1/3})$. I think this makes the counter-intuitive conclusion of the previous subsection even starker.

#### Volume of the unit ball

The ‘length of tangent line times area of cross-section’ argument says that if $A(S^n)$ is the surface area of $S^n$ then

\begin{aligned} \displaystyle A(S^n) &= \int_{-1}^1 \frac{A(S^{n-1}) \sqrt{1-z^2}^{n-1}}{\sqrt{1-z^2}} \mathrm{d} z \\ &= A(S^{n-1}) \int_{-1}^1 \sqrt{1-z^2}^{n-2} \mathrm{d} z. \end{aligned}

A quick trip to Mathematica to evaluate the integral shows that

$\displaystyle A(S^n) = A(S^{n-1}) \frac{\sqrt{\pi}\Gamma(\frac{n}{2})}{\Gamma(\frac{n+1}{2})}.$

It follows easily by induction that $A(S^n) = 2\sqrt{\pi}^{n+1} / \Gamma(\frac{n+1}{2})$. Since $B^n = \bigcup_{r=0}^1 rS^{n-1}$ and $A(rS^{n-1}) = r^{n-1} A(S^{n-1})$, the volume $V_n$ of $B^n$ is

$V_n = \displaystyle A(S^{n-1}) \int_0^1 r^{n-1} \mathrm{d}r = \frac{2\sqrt{\pi}^n}{n\Gamma(\frac{n}{2} )}.$

In particular, from $\Gamma(t) = (t-1)!$ for $t \in \mathbb{N}$ we get $V_{2m} = \pi^m / m!$. Hence $V_{2(m+1)} = \frac{\pi}{(m+1)} V_{2m}$ (is there a quick way to see this?), and the proportion of the cube $[-1,1]^{2m}$ occupied by the ball $B^{2m}$ is

$\displaystyle \frac{(\pi/4)^m}{m!}.$

Thus the proportion tends to $0$, very quickly. I find this somewhat surprising, since my mental picture of a sphere is as a bulging convex object that should somehow fill out’ an enclosing cube. Again my intuition is hopelessly wrong.

### Coding Theory

We now turn to discrete spaces. Our friends Alice and Bob must communicate using a noisy channel that can send the bits $0$ and $1$. They agree (in advance) a binary code $C \subseteq \mathbb{F}_2^n$ of length $n$, and a bijection between codewords and messages. When Bob receives a binary word in $\mathbb{F}_2^n$ he decodes it as the message in bijection with the nearest codeword in $C$; if there are several such codewords, then he chooses one at random, and fears the worst. Here ‘nearest’ of course means with respect to Hamming distance.

#### Shannon’s probabilistic model

In this model, each bit flips independently with a fixed crossover probability $p$. If $p = 1/2$ then reliable communication is impossible and if $p > 1/2$ then we can immediately reduce to the case $p < 1/2$ by flipping all bits in a received word. We therefore assume that $p < 1/2$. In this case, Shannon's Noisy Coding Theorem states that the capacity of the binary symmetric channel is $1 - h(p)$, where $h(p) = -p\log_2 p - (1-p) \log_2 (1-p)$ is the binary entropy function. That is, given any $\epsilon > 0$, provided $n$ is sufficiently large, there is a binary code $C \subseteq \mathbb{F}_2^n$ and size $|C| \ge 2^{(1-h(p)-\epsilon)n}$ such that when $C$ is used with nearest neighbour decoding to communicate on the binary symmetric channel, the probability of a decoding error is $\le \epsilon$ (uniformly for every codeword). We outline a proof later.

For example, the theoretical maximum 4G data rate is $10^8$ bits per second. Since $h(1/4) \approx 0.8113$, provided $n$ is sufficiently large, even if one in four bits gets randomly flipped by the network, Shannon’s Noisy Coding Theorem promises that Alice and Bob can communicate reliably using a code of size $2^{0.188n}$. In other words, Alice and Bob can communicate reliably at a rate of up to $18.8$ million bits per second.

#### Hamming’s adversarial model

In Hamming’s model the received word differs from the sent word in at most $e$ positions, where these positions are chosen by an adversary to be as inconvenient as possible. Nearest neighbour decoding always succeeds if and only if the minimum distance of the code is at least $2e+1$.

In the usual binary symmetric channel with crossover probability $p$, when $n$ is large, the chance that the number of flips in the normal binary symmetric channel is more than $(p+\epsilon)n$ is negligible. Therefore a binary code with minimum distance $2e+1$ can be used to communicate reliably on an adversarial binary symmetric channel with crossover probability $p < e/n$, in which the number of flipped bits is always $pn$, and these bits are chosen adversarially.

So how big can a code with minimum distance $d$ be? Let $C$ be such a code and for each $w \in \mathbb{F}_2^{n-2d}$, let

$C_w = \bigl\{(u_1,\ldots, u_{2d}) : u \in C, (u_{2d+1}, \ldots, u_n) = w\bigr\}.$

Observe that each $C_w$ is a binary code of length $2d$ and minimum distance at least $d$. By the Plotkin bound, $|C_w| \le 4d$ for all $d$. (Equality is attained by the Hadamard codes $(2d,4d,d)$.) Since each $C_w$ has size at most $2^{n-2d}$, we find that

$|C| \le 4d\times 2^{n-2d}.$

The relative rate $\rho(C)$ of a binary code $C$ of length $n$ is defined to be $(\log_2 |C|)/n$. By the bound above, if $C$ is $e$-error correcting (and so its minimum distance satisfies $d \ge 2e+1$) we have

$\displaystyle \rho(C) \le \frac{\log_2 (8e+4)2^{n-4e-2}}{n} \le 1 - 4\frac{e}{n}.$

In particular, if $e/n \ge \frac{1}{4}$ then $\rho(C) = 0$: the code consists of a vanishing small proportion of $\mathbb{F}_2^n$. We conclude that if the crossover probability in the channel is $\frac{1}{4}$ or more, fast communication in the Hamming model is impossible.

#### An apparent paradox

We have seen that Shannon’s Noisy Coding Theorem promises reliable communication at a rate beyond the Plotkin bound. I hope it is clear that there is no paradox: instead we have demonstrated that communication with random errors is far easy than communication with adversarial errors.

In my view, most textbooks on coding theory do not give enough emphasis to this distinction. They share this feature with the students in my old coding theory course: for many years I set an optional question asking them to resolve the apparent clash between Shannon’s Noisy Coding Theorem and the Plotkin Bound; despite several strong students attempting it, only one ever got close to the explanation above. In fact in most years, the modal answer was ‘mathematics is inconsistent’. Of course as a fully paid-up member of the mathematical establishment, I marked this as incorrect.

#### The structure of large-dimensional vector spaces

I now want to argue that the sharp difference between the Shannon and Hamming models illuminates the structure of $\mathbb{F}_2^n$ for large $n$ is large.

When lecturing in the Hamming model, one often draws pictures such as the one below, in which, like a homing missile, a sent codeword inexorably heads towards another codeword (two errors, red arrows), rather than heading in a random direction (two errors, blue arrows).

While accurate for adversarial errors, Shannon’s Noisy Coding Theorem tells us that for random errors, this picture is completely inaccurate. Instead, if $C$ is a large code with minimum distance $pn$ and $u \in C$ is sent over the binary symmetric channel with crossover probability $p$, and $v \in \mathbb{F}_2^n$ is received, while $v$ is at distance about $pn$ from $u$, it is not appreciably closer to any other codeword $u' \in C$. I claim that this situation is possible because $\mathbb{F}_2^n$ is ‘large’, in a sense not captured by the two-dimensional diagram above.

#### Shannon’s Noisy Coding Theorem for the binary symmetric channel

To make the previous paragraph more precise we outline a proof of Shannon’s Noisy Coding Theorem for the binary symmetric channel. The proof, which goes back to Shannon, is a beautiful application of the probabilistic method, made long before this was the standard term for such proofs.

We shall simplify things by replacing the binary symmetric channel with its ‘toy’ version, in which whenever a word $u \in \mathbb{F}_2^n$ is sent, exactly $pn$ bits are chosen uniformly at random to flip. (So we are assuming $pn \in \mathbb{N}$.) By the Law of Large Numbers, this is a good approximation to binomially distributed errors, and it is routine using standard estimates (Chebychev’s Inequality is enough) to modify the proof below so it works for the original channel.

Proof. Fix $\rho < 1 - h(p)$ and let $M = 2^{n \rho}$, where $n$ will be chosen later. Choose $2M$ codewords $U(1), \ldots, U(2M)$ uniformly at random from $\mathbb{F}_2^n$. Let $P_i$ be the probability (in the probability space of the toy binary symmetric channel) that when $U(i)$ is sent, the received word is decoded incorrectly by nearest neighbour decoding. In the toy binary symmetric channel, the received word $v$ is distance $pn$ from $U(i)$, so an upper bound for $P_i$ is the probability $Q_i$ that when $U(i)$ is sent, there is a codeword $U(j)$ with $j\not=i$ within distance $pn$ of the received word. Note that $U(i), U(j), P_i, P_j$ are all themselves random variables, defined in the probability space of the random choice of code.

Now $Q_i$ is in turn bounded above by the expected number (over the random choice of code) of codewords $U(j)$ with $j\not=i$ within distance $pn$ of the received word. Since these codewords were chosen independently of $U(i)$ and uniformly from $\mathbb{F}_2^n$, it doesn't matter what the received word is: the expected number of such codewords is simply

$\displaystyle \frac{(2M-1) V_n(pn)}{2^n}$

where $V_n(pn)$ is the volume (i.e. number of words) in the Hamming ball of radius $pn$ about $\mathbf{0} \in \mathbb{F}_2^n$. We can model words in this ball as the output of a random source that emits the bits $0$ and $1$ with probabilities $1-p$ and $p$, respectively. The entropy of this source is $h(p)$, so we expect to be able to compress its $n$-bit outputs to words of length $h(p)n$. Correspondingly,

$V_n(pn) \le 2^{h(p)n}.$

(I find the argument by source coding is good motivation for the inequality, but there is of course a simpler proof using basic probability theory: see the problem sheet.) Hence

$\displaystyle P_i \le \frac{(2M-1)2^{h(p)n}}{2^n} \le 2 \times 2^{(\rho + h(p) - 1)n}.$

Since $\rho < 1-h(p)$, the probability $P_i$ of decoding error when $U(i)$ is sent becomes exponentially small as $n$ becomes large. In particular, the mean probability $P = \frac{1}{2M} \sum_{i=1}^{2M} P_i$ is smaller than $\epsilon / 2$, provided $n$ is sufficiently large. A Markov’s Inequality argument now shows that by throwing away at most half the codewords, we can assume that the probability of decoding error is less than $\epsilon$ for all $M$ remaining codewords. $\Box$

#### Varying the alphabet

Increasing the size of the alphabet does not change the situation in an important way. In fact, if $\mathbb{F}_2$ is replaced with $\mathbb{F}_p$ for $p$ large then the Singleton Bound, that $|C| \le p^{n-d}$ becomes effective, giving another constraint that also apparently contradicts Shannon’s Noisy Coding Theorem. There is however one interesting difference: in $\mathbb{F}_2^n$, every binary word has a unique antipodal word, obtained by flipping all its bits, whereas in $\mathbb{F}_p^n$ there are $(p-1)^n$ words at distance $n$ from any given word. This is the best qualitative sense I know in which $\mathbb{F}_2^n$ is smaller than $\mathbb{F}_p^n$.

### Cryptography and computation

Readers interested in cryptography probably recognised $56$ above as the key length of the block cipher DES. This cipher is no longer in common use because a determined adversary knowing (as is usually assumed) some plaintext/cipher pairs can easily try all $\mathbb{F}_2^{56}$ possible keys and so discover the key. Even back in 2008, an FPGA-based special purpose device costing £10000 could test $65.2 \times 10^9 \approx 2^{35.9}$ DES keys every second, giving $12.79$ days for an exhaustive search.

Modern block ciphers such as AES typically support keys of length $128$ and $256$. In Applied Cryptography, Schneier estimates that a Dyson Sphere capturing all the Sun’s energy for 32 years would provide enough power to perform $2^{192}$ basic operations, strongly suggesting that $256$ bits should be enough for anyone. The truly paranoid (or readers of Douglas Adams) should note that in Computational capacity of the universe, Seth Lloyd, Phys. Rev. Lett., (2002) 88 237901-3, the author estimates that even if the universe is one vast computer, then it can have performed at most $10^{120} \approx 2^{398.6}$ calculations. Thus in a computational sense, $2^{56}$ is tiny, and $2^{256}$ and $10^{120}$ are effectively infinite.

The final number above was $120^{10} \approx 2^{69.1}$. The Chinese supercomputer Sunway TaihuLight runs at 93 Petaflops, that is $93 \times 10^{15} \approx 2^{56.4}$ operations per second. A modern Intel chip has AES encryption as a primitive instruction, and can encrypt at 1.76 cycles per byte for a 256-bit key, encrypting 1KB at a time. If being very conservative, we assume the supercomputer can test a key by encrypting 16 bytes, then it can test $2^{56.4}/2^4/1.76 = 2^{51.6}$ keys every second, requiring $2.12$ days to exhaust over $2^{69.1}$ keys. Therefore $120^{10}$ is in the tricky middle ground, between the easily computable and the almost certainly impossible.

My surprising claim that $2^{2^\mathbb{N}}$ also sits somewhere in the middle comes from my reading of a wonderful blog post by Martín Escardó (on Andrej Bauer’s blog). To conclude, I will give an introduction to his seemingly-impossible Haskell programs.

As a warm-up, consider this Haskell function which computes the Fibonacci numbers $F_n$, extending the definition in the natural way to all integers $n$.

fib n | n == 0  = 0
| n == 1  = 1
| n >= 2  = fib (n-1) + fib (n-2)
| n <= -1 = fib (n+2) - fib (n+1)


Except from the minor notation change that the conditions appear before rather than after the equals sign, this Haskell code is a verbatim mathematical definition. The code below defines two predicates $p$ and $q$ on the integers such that $p(n)$ is true if and only if $F_n^2 - F_{n-1}F_{n+1} = (-1)^{n-1}$ and $q(n)$ is true if and only if $F_n \equiv 0$ mod $3$.

p n = fib n * fib n - fib (n+1) * fib (n-1) == (-1)^(n+1)
q n = fib n mod 3 == 0


The first equation is Cassini’s Identity, so $p(n)$ is true for all $n \in \mathbb{Z}$; $q(n)$ is true if and only if $n$ is a multiple of $4$. We check this in Haskell using its very helpful ‘list-comprehension’ syntax; again this is very close to the analogous mathematical notation for sets.

*Cantor> [p n | n <- [-5..5]]
[True,True,True,True,True,True,True,True,True,True,True]

*Cantor> [q n | n <- [-5..5]]
[False,True,False,False,False,True,False,False,False,True,False]


Therefore if we define $p'(n)$ to be true (for all $n \in \mathbb{Z}$) and $q'(n)$ to be true if and only if $n \ \mathrm{mod} \ 4 = 0$, we have $p = p'$ and $q = q'$.

An important feature of Haskell is that it is strongly typed. We haven’t seen this yet because Haskell is also type-inferring in a very powerful way, that makes explicit type signatures usually unnecessary. Simplifying slightly, the type of the predicates above is Integer -> Bool, and the type of fib is Integer -> Integer. (Integer is a built in Haskell type that supports arbitrary sized integers.) The family of predicates $r_m$ defined by

$r_m(n) \iff n \ \mathrm{mod}\ m = 0$

is defined by

r m n = n mod m == 0


Here r has type Integer -> Integer -> Bool. It is helpful to think of this in terms of the currying isomorphism $C^{A \times B} \cong (C^A)^B$, ubiquitous in Haskell code. We now ask: is there a Haskell function

equal :: (Integer -> Bool) -> (Integer -> Bool) -> Bool


taking as its input two predicates on the integers and returning True if and only if they are equal? The examples of $p, p', q, q'$ above show that such a function would, at a stroke, make most mathematicians unemployed. Fortunately for us, Turing’s solution to the Entscheidungsproblem tells us that no such function can exist.

Escardó’s post concerns predicates defined not on the integers $\mathbb{Z}$, but instead on the $2$-adic integers $\mathbb{Z}_2$. We think of the $2$-adics as infinite bitstreams. For example, the Haskell definitions of the bitstream $1000 \ldots$ representing $1 \in \mathbb{Z}_2$ and the bitstream $101010 \ldots$ representing

$2^0 + 2^2 + 2^4 + 2^6 + \cdots = -\frac{1}{3} \in \mathbb{Z}_2$

are:

data Bit = Zero | One
type Cantor = [Bit]
zero = Zero : zero
one  = One : zero
bs   = One : Zero : bs :: Cantor


(I’m deliberately simplifying by using the Haskell list type []: this is not quite right, since lists can be, and usually are, finite.) Because of its lazy evaluation — nothing is evaluated in Haskell unless it is provably necessary for the computation to proceed — Haskell is ideal for manipulating such bitstreams. For instance, while evaluating bs at the Haskell prompt will print an infinite stream on the console, Haskell has no problem performing any computation that depends on only finitely many values of a bitstream. As proof, here is $-\frac{1}{3} - 2 \times \frac{1}{3} = -1$:

*Cantor> take 10 (bs + tail bs)
[One,One,One,One,One,One,One,One,One,One]
*Cantor> take 10 (bs + tail bs + one)
[Zero,Zero,Zero,Zero,Zero,Zero,Zero,Zero,Zero,Zero]


(Of course one has to tell Haskell how to define addition on the Cantor type: see Cantor.hs for this and everything else in this section.) As an example of a family of predicates on $\mathbb{Z}_2$, consider

twoPowerDivisibleC :: Int -> Cantor -> Bool
twoPowerDivisibleC p bs = take p zero == take p bs


Thus twoPowerDivisibleC p bs holds if and only if bs represents an element of $2^p \mathbb{Z}_2$. For example, the odd $2$-adic integers are precisely those for which twoPowerDivisibleC 1 bs is false:

*Cantor> [twoPowerDivisibleC 1 (fromInteger n) | n <- [0..10]]
[True,False,True,False,True,False,True,False,True,False,True]


In the rooted binary tree representation of $\mathbb{Z}_2$ shown below, the truth-set of this predicate is exactly the left-hand subtree. The infinite path to $-1/3$ is shown by thick lines.

Here is a somewhat similar looking definition

nonZeroC :: Cantor -> Bool
nonZeroC (One : _)   = True
nonZeroC (Zero : bs) = nonZeroC bs


While a correct (and correctly typed) Haskell definition, this does not define a predicate on $\mathbb{Z}_2$ because the evaluation of nonZeroC zero never terminates. In fact, it is surprisingly difficult to define a predicate (i.e. a total function with boolean values) on the Cantor type. The beautiful reason behind this is that the truth-set of any such predicate is open in the $2$-adic topology. Since this topology has as a basis of open sets the cosets of the subgroups $2^p \mathbb{Z}_2$, all predicates look something like one of the twoPowerDivisibleC predicates above.

This remark maybe makes the main result in Escardó’s blog post somewhat less amazing, but it is still very striking: it is possible to define a Haskell function
 equalC :: (Cantor -> Bool) -> (Cantor -> Bool) -> Bool 
which given two predicates on $\mathbb{Z}_2$ (as represented in Haskell using the Cantor type) returns True if and only if they are equal (as mathematical functions on $\mathbb{Z}_2$) and false if and only if they are unequal. Escardó’s ingenious definition of equalC needs only a few lines of Haskell code: it may look obvious when read on the screen, but I found it a real challenge to duplicate unseen, even after having read his post. I encourage you to read it: it is fascinating to see how the compactness of $\mathbb{Z}_2$ as a topological space corresponds to a ‘uniformity’ property of Haskell predicates.

### Sources

The results on volumes of spheres and balls are cobbled together from several MathOverflow questions and answers, in particular Joseph O’Rourke’s question asking for an intuitive explanation of the concentration of measure phenomenon, and S. Carnahan’s answer to a question on the volume of the $n$-dimensional unit ball. The coding theory results go back to Shannon, and can be found in many textbooks, for example Van Lint’s Introduction to coding theory. The use of the ‘toy’ binary symmetric channel is my innovation, used when I lectured our Channels course in 2019–20 to reduce the technicalities in the proof of Shannon’s Noisy Coding Theorem. The diagrams were drawn using TikZ; for the spheres I used these macros due to Tomasz M. Trzeciak. The material about computability is either very basic or comes from a blog post by Martín Escardó (on Andrej Bauer’s blog) giving an introduction to his paper Infinite sets that admit fast exhaustive search, published in the proceedings of Logic in Computer Science 2007.

## A dismal outlook: why does Microsoft Teams make it so hard to join meetings?

April 8, 2020

The purpose of this post is to document two of the more blatant bugs I’ve encountered in my forced exposure to Microsoft Teams during the Coronavirus crisis. To add insult to injury, the only way to work around them is to use the Microsoft Outlook, another piece of software riddled with deficiencies. And don’t even think of using Microsoft Outlook (the application): instead one must use the still clunkier web interface.

Let’s get started. Here is a screenshot of me scheduling a meeting at 15:22 for 16:00 today.

Notice anything odd. Probably not: what sane person checks the timezone? But, look closely: Microsoft Teams firmly believes that London is on UTC+00:00 (same as GMT), not BST. This belief is not shared by my system, or any other software on it that I can find.

Now let’s try to join the meeting. Okay it’s early, but we are keen. Here is a screenshot of me hovering with my mouse over the meeting.

There is no way to join. Double clicking on the meeting just gives a chance to reschedule it (maybe to a later date, when Microsoft has fixed this glaring deficiency). The ‘Meet now’ button starts an unrelated meeting.

Okay, maybe our mistake was to join an MS Teams meeting using MS Teams. Let’s try using the Outlook web calendar. Here is a screenshot.

Here is a close-up of the right hand of the right-hand side

On the one hand, the times say that the meeting started 46 minutes ago; on the other, it is ‘in 14 min’. Perhaps because of this temporal confusion, there is no way to join the meeting.

Finally, here is MS Teams at 16:00.

Nothing has changed, there is still no way to join the meeting.

Update. Apparently one of my errors was to schedule a meeting with no invitees. Under Microsoft’s interpretation, such meetings may be scheduled, but never attended (even by gate-crashing). On the time-zone front, both the Outlook web calendar and MS Teams continue to insist that London is on UTC+00:00, but, bizarrely, choosing London as my location (it already was) fixed the scheduling bug. In the Outlook example I invited myself, but still there was no link.

Many thanks to Remi from the Royal Holloway IT support team for steering me on an expert course around the shark-invested waters of Microsoft software.

Further update. The other day I accidentally scheduled a meeting using Microsoft Outlook (the application), taking care to include myself on the guestlist, and hit essentially the same bug. Here is me scheduling a test meeting.

and here are three screen shots as I frantically view the meeting in all the available applications, trying to find a way to join it. None is available.

The prosection rests its case. How can Microsoft justify releasing such inept pieces of software? Whatever top-secret protocol they are using, they can’t even speak it fluently between each other!