Athena SWAN personal FAQ

May 5, 2020

Yesterday I finished writing the draft Athena SWAN submission for the Royal Holloway Mathematics and Information Security Group and sent it to members of the two departments for comments.

In this post I’ll collect the gist of my replies to comments that might be of general interest (and can be made public), and a few extra ‘non-FAQs’ that I hope will still be of interest. The post ends with the references from the bid with clickable hyperlinks.

Why is the bid so focused on women and gender? What about disability or LGBT+ people?

The primary purpose of Athena SWAN is to address gender inequalities. In Mathematics and Information Security this means tackling the under-representation of women in almost everything we do.

Royal Holloway was one of the first HEIs to get a Race Equality Charter Mark and some of the proposed actions are aimed at BAME (Black, Asian and Minority Ethnic) students. Another action will promote the College workshop ‘How to be an LGBT ally’ and the (excellent) external SafeZone training. Yes, we should do more, but this is a Bronze bid so only the start of a long process to address inequalities.

Are women really under-represented?

I think yes. Mathematics is ahead of many sector norms: for example 40.6% of our undergraduate intake is female, compared to the sector mean of 35.7%; 38.8% of A-level Mathematics students are women. Of our staff 25.0% are women, compared to a sector mean of 20.4%; of professors, 23.1% are women, compared to an appalling sector mean of 12.6%. But all that said, women are half the population, and about 40% of new mathematics graduates are women, so we have very far to go.

Isn’t it discrimination to focus so many actions on women?

I argue very firmly no. The question is a ‘non-FAQ’ that I’ve deliberately worded in a pejorative way. (It is impossible to use the word ‘discrimination’ in this context in the positive sense ‘X has a fine discriminating palate for wine’.) By improving our policies and procedures and thinking particularly about women, we very often make life better for everyone. It is not a zero-sum game. It is legitimate to target actions at women to address under-representation. This does not imply that critical decisions, such as recruitment and promotion, will then be biased in favour of women.

Why the focus on unconscious bias?

Unconscious bias training was the most frequently requested form of training when we surveyed all staff and Ph.D. students. There is strong evidence that unconscious bias exists and prevents women from achieving their potential. An important early study is Science faculty’s subtle gender biases favor male students by Moss-Racusin et al, which asked scientists to evaluate two CVs for a job as a lab manager. The CVs were identical except in one the candidate’s first name was ‘John’, and in the other ‘Jennifer’. Both men and women rated Jennifer as less competent than John, and recommended a lower starting salary.

There is evidence that unconscious bias training can be effective for reducing unconscious bias (see pages 6 and 16: the overall picture is mixed, but the conclusion is clear). My own experience suggests that high-quality training and reading around the issue has made me more aware of the issues, and at least slightly less likely to rush to (probably poor) conclusions.

What can I read about unconscious bias?

I highly recommend the third part of Cordelia Fine’s book Delusions of gender. The first two parts make a very convincing case that many stereotypical gender traits are not hard-wired, but instead products of culture and upbringing, or even (on closer inspection) non-existent. The final part examines how our remorselessly gendered society creates these biases and misconceptions.

What is unconscious bias?

First of all, I prefer the term ‘implicit bias’, since one can wrongly interpret ‘unconscious bias’ as referring to something that is independent of our thought processes and beyond our control.

Let me introduce my personal answer with an object that should be emotionally neutral and familiar to readers, namely ‘a vector space’. What comes into your head? Is it a finite dimensional vector space over \mathbb{R}, such as the familiar Euclidean space \mathbb{R}^3, or (my answer) an indeterminately large space over a finite field: \mathbb{F}_q^n? Or perhaps the most important vector spaces you meet are function spaces, in which case you might be imagining Hilbert space, with or without a preferred orthonormal basis. Yet other answers are possible: someone working in cryptography might think of \mathbb{F}_2^{256}. Quite possibly, I’ve missed your preferred example completely. Or maybe your brain just doesn’t work this way, and all you think about is the abstract definition of vector spaces and your immediate associations are to the main theorems you use when working with them. Anyway, my point is that we don’t think about vector spaces ‘in isolation’: instead they come with a bundle of implicit associations that are deeply shaped by our education and day-to-day experience.

Now instead think about ‘Mathematics professor’. Without claiming the thought processes are completely analogous, I hope you will agree that something similar goes on, with a bunch of implicit associations coming into our heads. For instance, I immediately start thinking about some of my professorial colleagues in Mathematics and ISG. In this respect I’m lucky: because I have personal examples to draw on, my immediate mental image is not the stereotypical old white man.

Taking this as a roughly accurate portrait of human cognition, we now see a mechanism in which bias can enter our decisions. For instance, in the Stanford lab manager study, the implicit associations around the word ‘manager’ bring to mind men, and so the male candidate is favoured. I suspect either you will readily accept this point, or feel it is completely unwarranted, so I won’t argue it any further, but instead refer you to the literature.

What about the Implicit Association Test?

My reading suggests that the Implicit Association Test is valuable as a way of raising awareness of implicit bias. But it has been much criticised, and it is not clear that the biases it identifies translate into unfair discrimination.

Isn’t this a huge piece of bureaucracy?

A ‘non-FAQ’, although the question has occurred to others. The answer is ‘Yes’. A recent report has many criticisms of the Athena SWAN process. For instance, from the summary on page 3:

The application process must be streamlined and the administrative burden on staff, particularly female staff, reduced.

For what it’s worth, I think I could have written a major grant application or completed a substantial research project in the time it took just to draft the submission. Even this rough measure takes no account of the hours of time (not just mine) spent consulting over draft actions and the many weeks of work that the College’s Equality and Diversity Coordinator put into the bid.

What’s the point of doing all this when it clearly wouldn’t address Y (where Y is the manifest injustice of your choice)?

Just because it (probably) wouldn’t have prevented Y, doesn’t mean it isn’t worth doing for other reasons.

No seriously, what are the consequences if we don’t have an Athena SWAN award?

RCUK (the main funder for mathematics research) recommends ‘participation in schemes such as Athena SWAN’. There is no requirement to have an award, and my impression (contrary to what I half-expected) is that it is not likely to become a requirement in the near future.

Royal Holloway expects its departments to apply for awards. So if we don’t get it, we will either have to change this policy or work towards a reapplication in a few years. In short, we will be back to where we were two years ago. We could implement the Action Plan anyway, but without the motivation of holding an award, progress might slip.

The Action Plan was formulated after long discussions within the E&D Committee and consulted on widely with department members. All actions are owned by the member of staff most closely involved: this is typically not me (the E&D Champion). I believe it will drive substantial culture improvements in Mathematics and ISG.

Do all Athena SWAN applications have references to the research literature on gender equality and feminism?

(A blatant ‘non-FAQ’.) No. In fact ours is the first I’ve seen. Probably it’s also the first Athena SWAN bid in which the Action Plan is generated by a customised database written from scratch in a functional programming language and outputting to LaTeX.

References

  1. Pragya Agarwal, SWAY: Unravelling unconscious bias, Bloomsbury, 2020.
  2. Robert W. Aldridge et al., Black, Asian and Minority Ethnic groups in England are at increased risk of death from COVID-19: indirect standardisation of NHS mortality data [version 1; peer review: awaiting peer review], Wellcome Open Research, Coronavirus (COVID-19) collection, 6 May 2020.
  3. Doyin Atewologun, Tinu Cornish and Fatima Tresh, Unconscious bias training: An assessment of the evidence for effectiveness, Equality and Human Rights Commission research report 113, March 2018.
  4. Athena SWAN Charter Review Independent Steering Group for Advance HE, The Future of Athena SWAN, March 2020.
  5. Anne Boring, Kellie Ottoboni and Philip B. Stark, Student evaluations of teaching (mostly) do not measure teaching effectiveness, ScienceOpen Research (2016).
  6. Caroline Criado-Perez, Invisible Women: Exposing Data Bias in a World Designed for Men, Chatto & Windus, 2019.
  7. Cordelia Fine, Delusions of Gender, Icon Books Ltd, 2010.
  8. Cordelia Fine, Testosterone Rex, Icon Books Ltd, 2017.
  9. Uta Frith, Understanding unconscious bias, Royal Society, 2015.
  10. Cassandra M. Guarino and Victor M. H. Borden, Faculty service loads and gender: are women taking care of the academic family?, Research in Higher Education (2017) 58 672–694.
  11. Nancy Hopkins, Diversification of a university faculty: Observations on hiring women faculty in the Schools of Science and Engineering at MIT, MIT Faculty Newsletter 2006 XVIII.
  12. Corinne A. Moss-Racusin, John F. Dovidio, Victoria L. Brescoll, Mark J. Graham, and Jo Handelsman, Science faculty’s subtle gender biases favor male students, PNAS (2012) 109 16474–16479.
  13. Ruth Pearce, Certifying equality? Critical reflections on Athena SWAN and equality accreditation, report for Centre for Women and Gender, University of Warwick, July 2017.
  14. Research Excellence Framework, Guidance on submissions 2021, January 2019.
  15. SafeZone training.
  16. UK Trendence Research, How are Gen Z responding to the coronavirus pandemic?, March 2020.
  17. Trades Union Congress, Women and casualization: Women’s experience of job insecurity, January 2015.
  18. Sandra Tzvetkova and Esteban Ortiz-Ospina, Working women: What determines female labor force participation?, Our World in Data (2017).
  19. Liz Whitelegg, Jennifer Dyer and Eugenie Hunsicker, Work allocation models, Athena Forum, January 2018.

The counter-intuitive behaviour of high-dimensional spaces

April 18, 2020

This post is an extended version of a talk I last gave a few years ago on an extended Summer visit to Bristol; four weeks into lockdown seems like a good time to write it up for a more varied audience. The overall thesis is that it’s hard to have a good intuition for really high-dimensional spaces, and that the reason is that, asked to picture such a space, most of us come far closer to something like \mathbb{R}^4 than to the more accurate \mathbb{F}_2^{1000}. This is reflected in the rule of thumb that the size of a tower of exponents is determined by the number at the top: 1000^2 is tiny compared to 2^{1000} and 2^{2^{2^{100}}} should seem only a tiny bit smaller than

\displaystyle 4^{2^{2^{100}}} = 2^{2^{2^{100}+1}}

when both are compared with 2^{2^{2^{101}}}.
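To justify the rule of thumb here, compare the towers one level down: passing from 2^{2^{2^{100}}} to 4^{2^{2^{100}}} = 2^{2^{2^{100}+1}} nudges the second-level exponent from 2^{100} to 2^{100}+1, whereas in 2^{2^{2^{101}}} it becomes

\displaystyle 2^{101} = 2 \cdot 2^{100},

doubling rather than incrementing, and so squaring the top-level exponent 2^{2^{100}}.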

Some details in the proofs below are left to the highly optional exercise sheet that accompanies this post. Sources are acknowledged at the end.

As a warm up for my talk, I invited the audience to order the following cardinals:

2, 2^{56}, 120^{10}, 10^{120}, \mathbb{N}, 2^\mathbb{N}, 2^{2^\mathbb{N}}.

Of course they are already correctly (and strictly) ordered by size. From the perspective of ‘effective computation’, I claim that 2 and 2^{56} are ‘tiny’, 10^{120}, \mathbb{N} and 2^\mathbb{N} are ‘huge’ and 120^{10} and 2^{2^\mathbb{N}} sit somewhere in the middle. To give some hint of the most surprising part of this, interpret 2^\mathbb{N} as the Cantor set C, so 2^{2^\mathbb{N}} is the set of subsets of C. Computationally definable subsets of C are then computable predicates on C (i.e. functions C \rightarrow \{T, F\}), and since C is compact, it is computable whether two such predicates are equal. In contrast, there is no algorithm that, given two predicates on \mathbb{N}, will run for a finite length of time and then correctly report whether they are equal.

Euclidean space and unit balls

Let B^n = \{x \in \mathbb{R}^n : ||x|| \le 1\} be the solid n-dimensional unit ball in Euclidean space and let S^n = \{x \in \mathbb{R}^{n+1} : ||x|| = 1\} be the n-dimensional surface of B^{n+1}.

House prices in Sphereland

Imagine a hypothetical ‘Sphereland’ whose inhabitants live uniformly distributed on the surface S^n. For instance, S^2 is the one-point compactification of Flatland. With a god’s eye view you are at the origin, and survey the world. At what values of the last coordinate are most of the inhabitants to be found?

For example, the diagram below shows the case n=2 with two regions of equal height shaded.

It is a remarkable fact, going all the way back to Archimedes, that the surface areas of the red and blue regions are equal: the reduction in cross-section as we go upwards is exactly compensated by the shallower slope. For a calculus proof, observe that in the usual Cartesian coordinate system, we can parametrize the part of the surface with x=0 and z > 0 by (0,y,\sqrt{1-y^2}). Choose k > 0. Then since (0,-z,y) is orthogonal to the gradient of the sphere at (0,y,z), to first order, if we increase the height z from z to z+k then we must decrease the second coordinate y from y to y - (z/y)k. This is shown in the diagram below.

Hence the norm squared of the marked line segment tangent to the surface is

\displaystyle \bigl|\bigl| \bigl(0, -\frac{z}{y}k, k\bigr) \bigr|\bigr|^2 = \frac{z^2k^2}{y^2} + k^2 = k^2 \frac{z^2+y^2}{y^2}  = k^2 \frac{1}{1-z^2}.

As in Archimedes’ argument, the area of the region R between the lines of latitude at heights z and z+k is (to first order in k) the product of k/\sqrt{1-z^2} and the circumference of the line of latitude at height z. It is therefore

\displaystyle \frac{k}{\sqrt{1-z^2}} \times 2 \pi \sqrt{1-z^2} = 2\pi k

which is independent of z. As a small check, integrating over z from -1 to 1 we get 4\pi for the surface area of the sphere; as expected this is also the surface area of the enclosing cylinder of radius 1 and height 2.

This cancellation in the displayed formula above is special to the case n=2. For instance, when n=1, by the argument just given, the arclength at height z is proportional to 1/\sqrt{1-z^2}. Therefore in the one-point compactification of Lineland (a place one might feel is already insular enough), from the god’s perspective (considering slices with varying z), most of the inhabitants are near the north and south poles (z = \pm 1), and almost no-one lives on the equator (z=0).

More generally, the surface area of S^n is given by integrating the product of the arclength 1/\sqrt{1-z^2} and the surface area of the ‘latitude cross-section’ at height z. The latter is the (n-1)-dimensional sphere of radius \sqrt{1-z^2}. By dimensional analysis, its surface area is C\bigl( \sqrt{1-z^2}\bigr)^{n-1} for some constant C. Hence the density of Sphereland inhabitants at height z is proportional to \bigl( \sqrt{1-z^2} \bigr)^{n-2}. (A complete proof of this using only elementary calculus is given in Exercise 1 of the problem sheet.) In particular, we see that when n is large, almost all the density is at the equator z=0. From the god’s eye perspective, in high-dimensional Sphereland, everyone lives on the equator, or a tiny distance from it. To emphasise this point, here are the probability density functions for n \in \{1,2,3,5,10,25\} using colours red then blue then black.

As a further example, the scatter points below show 1000 random samples from S^{20}, taking (left) (x_1,x_2) coordinates and (right) (x_{20}, x_{21}) coordinates. Note that almost no points have a coordinate more than 1/2 (in absolute value). Moreover, since the expected value of each x_i^2 is 1/21, we find most of the mass at |x_i| \approx 1/\sqrt{21} \approx 0.22.

In particular, if a random inhabitant of high-dimensional Sphereland is at (x_1,\ldots, x_{n+1}) then it is almost certain that most of the x_i are very small.

One feature of this seems deeply unintuitive to me. There is, after all, nothing intrinsic about the z-coordinate. Indeed, the god can pick any hyperplane in \mathbb{R}^{n+1} through the origin, and get a similar conclusion.

Concentration of measure

One could make the claim about the expected sizes of the coordinates more precise by continuing with differential geometry, but Henry Cohn’s answer to this Mathoverflow question on concentration of measure gives an elegant alternative approach. Let X_1, \ldots, X_n be independent identically distributed normal random variables each with mean 0 and variance 1/n. Then (X_1,\ldots,X_n) normalized to have length 1 is uniformly distributed on S^{n-1}. (A nice way to see this is to use that a linear combination of normal random variables is normally distributed to show invariance under the orthogonal group \mathrm{O}_n(\mathbb{R}).) Moreover, \mathbb{E}[X_1^2 + \cdots + X_n^2] = 1, and, using that \mathbb{E}[X_i^4] = 3/n^2, we get

\displaystyle \mathrm{Var}[X_1^2 + \cdots + X_n^2] = n \frac{3}{n^2} + n(n-1) \frac{1}{n^2} - 1 = \frac{2}{n}.

Hence, by Chebychev’s Inequality that if Z is a random variable then

\displaystyle \mathbb{P}\bigl[|Z-\mathbb{E}Z| \ge c\bigr] \le \frac{\mathrm{Var} Z}{c^2},

the probability that X_1^2 + \cdots + X_n^2 \not\in (1-\epsilon,1+\epsilon) is at most 2/n\epsilon^2, which tends to 0 as n \rightarrow \infty. Therefore we can, with small error, neglect the normalization and regard (X_1,\ldots, X_n) as a uniformly chosen point on S^{n-1}. By Markov’s inequality, that if Z is a non-negative random variable then \mathbb{P}[Z \ge a\mathbb{E}Z] \le 1/a, the probability that |X_1| \ge a/\sqrt{n} (or equivalently, X_1^2 \ge a^2/n) is at most 1/a^2; taking a = n^{1/6}, the probability that |X_1| \ge 1/n^{1/3} is at most 1/n^{1/3}. This per-coordinate bound is too weak to survive a union bound over all n coordinates, but the Gaussian tail bound \mathbb{P}[|X_1| \ge 1/n^{1/3}] = \mathbb{P}[\sqrt{n}|X_1| \ge n^{1/6}] \le 2e^{-n^{1/3}/2} is strong enough. Since

\displaystyle 2n e^{-n^{1/3}/2} \rightarrow 0

as n \rightarrow \infty, the union bound shows that with high probability, a random Sphereland inhabitant has all its coordinates in (-1/n^{1/3}, 1/n^{1/3}). I think this makes the counter-intuitive conclusion of the previous subsection even starker.
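This concentration is easy to observe experimentally. Here is a minimal Haskell sketch of my own (it assumes the random package for System.Random; the names normal and randomSpherePoint are mine): it samples a uniform point on S^{n-1} by normalizing independent Gaussians, generated by the Box–Muller transform, and prints the largest coordinate in absolute value.

import System.Random (randomRIO)

-- One sample from the standard normal distribution, via Box-Muller.
normal :: IO Double
normal = do
  u1 <- randomRIO (1e-12, 1.0)
  u2 <- randomRIO (0.0, 1.0)
  return (sqrt (-2 * log u1) * cos (2 * pi * u2))

-- A uniform random point on S^{n-1}: n independent normals, normalized.
randomSpherePoint :: Int -> IO [Double]
randomSpherePoint n = do
  xs <- mapM (const normal) [1 .. n]
  let r = sqrt (sum (map (^ 2) xs))
  return (map (/ r) xs)

main :: IO ()
main = do
  x <- randomSpherePoint 1000
  putStrLn ("largest |coordinate|: " ++ show (maximum (map abs x)))
  putStrLn ("compare 1/1000^(1/3): " ++ show (1 / 1000 ** (1 / 3) :: Double))

On typical runs with n = 1000 the maximum is around 0.12, in line with the 1/n^{1/3} = 0.1 heuristic.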

Volume of the unit ball

The ‘length of tangent line times area of cross-section’ argument says that if A(S^n) is the surface area of S^n then

\begin{aligned} \displaystyle A(S^n) &= \int_{-1}^1 \frac{A(S^{n-1}) \sqrt{1-z^2}^{n-1}}{\sqrt{1-z^2}} \mathrm{d} z \\ &= A(S^{n-1}) \int_{-1}^1 \sqrt{1-z^2}^{n-2} \mathrm{d} z. \end{aligned}

A quick trip to Mathematica to evaluate the integral shows that

\displaystyle A(S^n) = A(S^{n-1}) \frac{\sqrt{\pi}\Gamma(\frac{n}{2})}{\Gamma(\frac{n+1}{2})}.

It follows easily by induction that A(S^n) = 2\sqrt{\pi}^{n+1} / \Gamma(\frac{n+1}{2}). Since B^n = \bigcup_{r=0}^1 rS^{n-1} and A(rS^{n-1}) = r^{n-1} A(S^{n-1}), the volume V_n of B^n is

V_n = \displaystyle A(S^{n-1}) \int_0^1 r^{n-1} \mathrm{d}r = \frac{2\sqrt{\pi}^n}{n\Gamma(\frac{n}{2} )}.

In particular, from \Gamma(t) = (t-1)! for t \in \mathbb{N} we get V_{2m} = \pi^m / m!. Hence V_{2(m+1)} = \frac{\pi}{m+1} V_{2m} (is there a quick way to see this?), and the proportion of the cube [-1,1]^{2m} occupied by the ball B^{2m} is

\displaystyle \frac{(\pi/4)^m}{m!}.

Thus the proportion tends to 0, very quickly. I find this somewhat surprising, since my mental picture of a sphere is as a bulging convex object that should somehow ‘fill out’ an enclosing cube. Again my intuition is hopelessly wrong.
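For example, taking m = 5, the unit ball B^{10} occupies only (\pi/4)^5/5! \approx 0.0025 of the cube [-1,1]^{10}.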

Coding Theory

We now turn to discrete spaces. Our friends Alice and Bob must communicate using a noisy channel that can send the bits 0 and 1. They agree (in advance) a binary code C \subseteq \mathbb{F}_2^n of length n, and a bijection between codewords and messages. When Bob receives a binary word in \mathbb{F}_2^n he decodes it as the message in bijection with the nearest codeword in C; if there are several such codewords, then he chooses one at random, and fears the worst. Here ‘nearest’ of course means with respect to Hamming distance.
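As a concrete illustration of nearest neighbour decoding, here is a sketch of my own (the names Word2, hamming and decode are not from any library, and ties are broken by taking the first nearest codeword rather than at random, so Bob need not fear the worst):

import Data.List (minimumBy)
import Data.Ord (comparing)

-- Binary words, represented as lists of bits.
type Word2 = [Int]

-- The Hamming distance: the number of positions in which u and v differ.
hamming :: Word2 -> Word2 -> Int
hamming u v = length (filter id (zipWith (/=) u v))

-- Decode a received word v to the nearest codeword in the code c.
decode :: [Word2] -> Word2 -> Word2
decode c v = minimumBy (comparing (hamming v)) c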

Shannon’s probabilistic model

In this model, each bit flips independently with a fixed crossover probability p. If p = 1/2 then reliable communication is impossible and if p > 1/2 then we can immediately reduce to the case p < 1/2 by flipping all bits in a received word. We therefore assume that p < 1/2. In this case, Shannon’s Noisy Coding Theorem states that the capacity of the binary symmetric channel is 1 - h(p), where h(p) = -p\log_2 p - (1-p) \log_2 (1-p) is the binary entropy function. That is, given any \epsilon > 0, provided n is sufficiently large, there is a binary code C \subseteq \mathbb{F}_2^n of size |C| \ge 2^{(1-h(p)-\epsilon)n} such that when C is used with nearest neighbour decoding to communicate on the binary symmetric channel, the probability of a decoding error is \le \epsilon (uniformly for every codeword). We outline a proof later.

For example, the theoretical maximum 4G data rate is 10^8 bits per second. Since h(1/4) \approx 0.8113, provided n is sufficiently large, even if one in four bits gets randomly flipped by the network, Shannon’s Noisy Coding Theorem promises that Alice and Bob can communicate reliably using a code of size 2^{0.188n}. In other words, Alice and Bob can communicate reliably at a rate of up to 18.8 million bits per second.
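For a quick numerical check of these figures, here is a two-line Haskell sketch (the names h and capacity are mine):

-- The binary entropy function (valid for 0 < p < 1) and the
-- capacity 1 - h(p) of the binary symmetric channel.
h :: Double -> Double
h p = -p * logBase 2 p - (1 - p) * logBase 2 (1 - p)

capacity :: Double -> Double
capacity p = 1 - h p

-- capacity 0.25 ≈ 0.1887, giving the 18.8 million bits per second above.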

Hamming’s adversarial model

In Hamming’s model the received word differs from the sent word in at most e positions, where these positions are chosen by an adversary to be as inconvenient as possible. Nearest neighbour decoding always succeeds if and only if the minimum distance of the code is at least 2e+1.

In the usual binary symmetric channel with crossover probability p, when n is large, the chance that the number of flips is more than (p+\epsilon)n is negligible. Therefore a binary code with minimum distance 2e+1 can be used to communicate reliably on an adversarial binary symmetric channel with crossover probability p < e/n, in which pn bits of each sent word, chosen adversarially, are flipped.

So how big can a code with minimum distance d be? Let C be such a code and for each w \in \mathbb{F}_2^{n-2d}, let

C_w = \bigl\{(u_1,\ldots, u_{2d}) : u \in C, (u_{2d+1}, \ldots, u_n) = w\bigr\}.

Observe that each C_w is a binary code of length 2d and minimum distance at least d. By the Plotkin bound, |C_w| \le 4d for all w. (Equality is attained by the Hadamard codes (2d,4d,d).) Since there are 2^{n-2d} choices of w, we find that

|C| \le 4d\times 2^{n-2d}.

The relative rate \rho(C) of a binary code C of length n is defined to be (\log_2 |C|)/n. By the bound above, if C is e-error correcting (and so its minimum distance satisfies d \ge 2e+1) we have

\displaystyle \rho(C) \le \frac{\log_2 \bigl( (8e+4)2^{n-4e-2} \bigr)}{n} = 1 - \frac{4e}{n} + \frac{\log_2 (8e+4) - 2}{n}.

In particular, if e/n \ge \frac{1}{4} then \rho(C) \rightarrow 0 as n \rightarrow \infty: the code consists of a vanishingly small proportion of \mathbb{F}_2^n. We conclude that if the crossover probability in the channel is \frac{1}{4} or more, fast communication in the Hamming model is impossible.

An apparent paradox

We have seen that Shannon’s Noisy Coding Theorem promises reliable communication at a rate beyond the Plotkin bound. I hope it is clear that there is no paradox: instead we have demonstrated that communication with random errors is far easier than communication with adversarial errors.

In my view, most textbooks on coding theory do not give enough emphasis to this distinction. They share this feature with the students in my old coding theory course: for many years I set an optional question asking them to resolve the apparent clash between Shannon’s Noisy Coding Theorem and the Plotkin Bound; despite several strong students attempting it, only one ever got close to the explanation above. In fact in most years, the modal answer was ‘mathematics is inconsistent’. Of course as a fully paid-up member of the mathematical establishment, I marked this as incorrect.

The structure of large-dimensional vector spaces

I now want to argue that the sharp difference between the Shannon and Hamming models illuminates the structure of \mathbb{F}_2^n when n is large.

When lecturing in the Hamming model, one often draws pictures such as the one below, in which, like a homing missile, a sent codeword inexorably heads towards another codeword (two errors, red arrows), rather than heading in a random direction (two errors, blue arrows).

While accurate for adversarial errors, Shannon’s Noisy Coding Theorem tells us that for random errors, this picture is completely inaccurate. Instead, if C is a large code with minimum distance pn and u \in C is sent over the binary symmetric channel with crossover probability p, and v \in \mathbb{F}_2^n is received, while v is at distance about pn from u, it is not appreciably closer to any other codeword u' \in C. I claim that this situation is possible because \mathbb{F}_2^n is ‘large’, in a sense not captured by the two-dimensional diagram above.

Shannon’s Noisy Coding Theorem for the binary symmetric channel

To make the previous paragraph more precise we outline a proof of Shannon’s Noisy Coding Theorem for the binary symmetric channel. The proof, which goes back to Shannon, is a beautiful application of the probabilistic method, made long before this was the standard term for such proofs.

We shall simplify things by replacing the binary symmetric channel with its ‘toy’ version, in which whenever a word u \in \mathbb{F}_2^n is sent, exactly pn bits are chosen uniformly at random to flip. (So we are assuming pn \in \mathbb{N}.) By the Law of Large Numbers, this is a good approximation to binomially distributed errors, and it is routine using standard estimates (Chebychev’s Inequality is enough) to modify the proof below so it works for the original channel.

Proof. Fix \rho < 1 - h(p) and let M = 2^{n \rho}, where n will be chosen later. Choose 2M codewords U(1), \ldots, U(2M) uniformly at random from \mathbb{F}_2^n. Let P_i be the probability (in the probability space of the toy binary symmetric channel) that when U(i) is sent, the received word is decoded incorrectly by nearest neighbour decoding. In the toy binary symmetric channel, the received word v is distance pn from U(i), so an upper bound for P_i is the probability Q_i that when U(i) is sent, there is a codeword U(j) with j\not=i within distance pn of the received word. Note that U(i), U(j), P_i, P_j are all themselves random variables, defined in the probability space of the random choice of code.

Now Q_i is in turn bounded above by the expected number (over the random choice of code) of codewords U(j) with j\not=i within distance pn of the received word. Since these codewords were chosen independently of U(i) and uniformly from \mathbb{F}_2^n, it doesn't matter what the received word is: the expected number of such codewords is simply

\displaystyle \frac{(2M-1) V_n(pn)}{2^n}

where V_n(pn) is the volume (i.e. number of words) in the Hamming ball of radius pn about \mathbf{0} \in \mathbb{F}_2^n. We can model words in this ball as the output of a random source that emits the bits 0 and 1 with probabilities 1-p and p, respectively. The entropy of this source is h(p), so we expect to be able to compress its n-bit outputs to words of length h(p)n. Correspondingly,

V_n(pn) \le 2^{h(p)n}.

(I find the argument by source coding is good motivation for the inequality, but there is of course a simpler proof using basic probability theory: see the problem sheet.) Hence

\displaystyle P_i \le \frac{(2M-1)2^{h(p)n}}{2^n} \le 2 \times 2^{(\rho + h(p) - 1)n}.

Since \rho < 1-h(p), the probability P_i of decoding error when U(i) is sent becomes exponentially small as n becomes large. In particular, the mean probability P = \frac{1}{2M} \sum_{i=1}^{2M} P_i is smaller than \epsilon / 2, provided n is sufficiently large. A Markov’s Inequality argument now shows that by throwing away at most half the codewords, we can assume that the probability of decoding error is less than \epsilon for all M remaining codewords. \Box
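The inequality V_n(pn) \le 2^{h(p)n} used in the proof is also easy to check numerically; here is a small Haskell sketch (names mine):

-- The number of binary words of length n within Hamming distance r of 0.
choose :: Integer -> Integer -> Integer
choose n k = product [n - k + 1 .. n] `div` product [1 .. k]

volume :: Integer -> Integer -> Integer
volume n r = sum [choose n k | k <- [0 .. r]]

-- For example logBase 2 (fromInteger (volume 100 25)) ≈ 78.2,
-- comfortably below h(1/4) × 100 ≈ 81.1.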

Varying the alphabet

Increasing the size of the alphabet does not change the situation in an important way. In fact, if \mathbb{F}_2 is replaced with \mathbb{F}_p for p large then the Singleton Bound, that |C| \le p^{n-d+1}, becomes effective, giving another constraint that apparently contradicts Shannon’s Noisy Coding Theorem. There is however one interesting difference: in \mathbb{F}_2^n, every binary word has a unique antipodal word, obtained by flipping all its bits, whereas in \mathbb{F}_p^n there are (p-1)^n words at distance n from any given word. This is the best qualitative sense I know in which \mathbb{F}_2^n is smaller than \mathbb{F}_p^n.

Cryptography and computation

Readers interested in cryptography probably recognised 56 above as the key length of the block cipher DES. This cipher is no longer in common use because a determined adversary knowing (as is usually assumed) some plaintext/cipher pairs can easily try all 2^{56} possible keys and so discover the key. Even back in 2008, an FPGA-based special purpose device costing £10000 could test 65.2 \times 10^9 \approx 2^{35.9} DES keys every second, giving 12.79 days for an exhaustive search.

Modern block ciphers such as AES typically support keys of length 128 and 256. In Applied Cryptography, Schneier estimates that a Dyson Sphere capturing all the Sun’s energy for 32 years would provide enough power to perform 2^{192} basic operations, strongly suggesting that 256 bits should be enough for anyone. The truly paranoid (or readers of Douglas Adams) should note that in Computational capacity of the universe, Seth Lloyd, Phys. Rev. Lett., (2002) 88 237901-3, the author estimates that even if the universe is one vast computer, then it can have performed at most 10^{120} \approx 2^{398.6} calculations. Thus in a computational sense, 2^{56} is tiny, and 2^{256} and 10^{120} are effectively infinite.

The final number above was 120^{10} \approx 2^{69.1}. The Chinese supercomputer Sunway TaihuLight runs at 93 Petaflops, that is 93 \times 10^{15} \approx 2^{56.4} operations per second. A modern Intel chip has AES encryption as a primitive instruction, and can encrypt at 1.76 cycles per byte for a 256-bit key, encrypting 1KB at a time. If, being very conservative, we assume the supercomputer can test a key by encrypting 16 bytes, then it can test 2^{56.4}/2^4/1.76 = 2^{51.6} keys every second, requiring 2.12 days to search all 120^{10} \approx 2^{69.1} keys. Therefore 120^{10} is in the tricky middle ground, between the easily computable and the almost certainly impossible.

My surprising claim that 2^{2^\mathbb{N}} also sits somewhere in the middle comes from my reading of a wonderful blog post by Martín Escardó (on Andrej Bauer’s blog). To conclude, I will give an introduction to his seemingly-impossible Haskell programs.

As a warm-up, consider this Haskell function which computes the Fibonacci numbers F_n, extending the definition in the natural way to all integer n.

fib n | n == 0  = 0
      | n == 1  = 1
      | n >= 2  = fib (n-1) + fib (n-2)
      | n <= -1 = fib (n+2) - fib (n+1)

Apart from the conditions appearing before rather than after the equals sign, the Haskell code is a verbatim mathematical definition. The code below defines two predicates p and q on the integers such that p(n) is true if and only if F_n^2 - F_{n-1}F_{n+1} = (-1)^{n-1} and q(n) is true if and only if F_n \equiv 0 mod 3.

p n = fib n * fib n - fib (n+1) * fib (n-1) == (-1)^(abs (n+1))
  -- abs avoids ^'s error on negative exponents; the parity, and so the sign, is unchanged
q n = fib n `mod` 3 == 0

The first equation is Cassini’s Identity, so p(n) is true for all n \in \mathbb{Z}; q(n) is true if and only if n is a multiple of 4. We check this in Haskell using its very helpful ‘list-comprehension’ syntax; again this is very close to the analogous mathematical notation for sets.

*Cantor> [p n | n <- [-5..5]]
[True,True,True,True,True,True,True,True,True,True,True]
*Cantor> [q n | n <- [-5..5]]
[False,True,False,False,False,True,False,False,False,True,False]

Therefore if we define p'(n) to be true (for all n \in \mathbb{Z}) and q'(n) to be true if and only if n \ \mathrm{mod} \ 4 = 0, we have p = p' and q = q'.

An important feature of Haskell is that it is strongly typed. We haven’t seen this yet because Haskell is also type-inferring in a very powerful way that makes explicit type signatures usually unnecessary. Simplifying slightly, the type of the predicates above is Integer -> Bool, and the type of fib is Integer -> Integer. (Integer is a built-in Haskell type that supports arbitrarily large integers.) The family of predicates r_m defined by

r_m(n) \iff n \ \mathrm{mod}\ m = 0

is implemented in Haskell by

r m n = n `mod` m == 0
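For example (a transcript of my own), partially applying r gives each divisibility predicate in turn:

*Cantor> [r 3 n | n <- [1..6]]
[False,False,True,False,False,True]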

Here r has type Integer -> Integer -> Bool. It is helpful to think of this in terms of the currying isomorphism C^{A \times B} \cong (C^A)^B, ubiquitous in Haskell code. We now ask: is there a Haskell function

equal :: (Integer -> Bool) -> (Integer -> Bool) -> Bool

taking as its input two predicates on the integers and returning True if and only if they are equal? The examples of p, p', q, q' above show that such a function would, at a stroke, make most mathematicians unemployed. Fortunately for us, Turing’s solution to the Entscheidungsproblem tells us that no such function can exist.

Escardó’s post concerns predicates defined not on the integers \mathbb{Z}, but instead on the 2-adic integers \mathbb{Z}_2. We think of the 2-adics as infinite bitstreams. For example, the Haskell definitions of the bitstream 1000 \ldots representing 1 \in \mathbb{Z}_2 and the bitstream 101010 \ldots representing

2^0 + 2^2 + 2^4 + 2^6 + \cdots = -\frac{1}{3} \in \mathbb{Z}_2

are:

data Bit = Zero | One deriving (Eq, Show)  -- Eq to compare bits, Show to print them
type Cantor = [Bit]
zero = Zero : zero
one  = One : zero
bs   = One : Zero : bs :: Cantor

(I’m deliberately simplifying by using the Haskell list type []: this is not quite right, since lists can be, and usually are, finite.) Because of its lazy evaluation — nothing is evaluated in Haskell unless it is provably necessary for the computation to proceed — Haskell is ideal for manipulating such bitstreams. For instance, while evaluating bs at the Haskell prompt will print an infinite stream on the console, Haskell has no problem performing any computation that depends on only finitely many values of a bitstream. As proof, here is a check that -\frac{1}{3} - \frac{2}{3} = -1:

*Cantor> take 10 (bs + tail bs)
[One,One,One,One,One,One,One,One,One,One]
*Cantor> take 10 (bs + tail bs + one)
[Zero,Zero,Zero,Zero,Zero,Zero,Zero,Zero,Zero,Zero]
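As a taster for what Cantor.hs must contain, here is one possible definition of the addition (my sketch, not necessarily the blog’s): a lazy full-adder that threads a carry bit down the two streams.

-- 2-adic addition of bitstreams: add bit by bit, threading a carry.
addC :: Cantor -> Cantor -> Cantor
addC = go Zero
  where
    go c (x : xs) (y : ys) = s : go c' xs ys
      where (s, c') = fullAdder c x y
    go _ _ _ = []  -- unreachable for genuinely infinite streams

-- A full adder on bits: the sum bit and the carry-out bit.
fullAdder :: Bit -> Bit -> Bit -> (Bit, Bit)
fullAdder a b c = (parity, carry)
  where
    ones   = length [u | u <- [a, b, c], u == One]
    parity = if odd ones then One else Zero
    carry  = if ones >= 2 then One else Zero

Laziness means addC produces each output bit after inspecting only finitely many bits of its arguments, which is exactly what computations such as take 10 (bs `addC` tail bs) require.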

(Of course one has to tell Haskell how to define addition on the Cantor type: see Cantor.hs for this and everything else in this section.) As an example of a family of predicates on \mathbb{Z}_2, consider

twoPowerDivisibleC :: Int -> Cantor -> Bool
twoPowerDivisibleC p bs = take p zero == take p bs

Thus twoPowerDivisibleC p bs holds if and only if bs represents an element of 2^p \mathbb{Z}_2. For example, the odd 2-adic integers are precisely those for which twoPowerDivisibleC 1 bs is false:

*Cantor> [twoPowerDivisibleC 1 (fromInteger n) | n <- [0..10]]
[True,False,True,False,True,False,True,False,True,False,True]

In the rooted binary tree representation of \mathbb{Z}_2 shown below, the truth-set of this predicate is exactly the left-hand subtree. The infinite path to -1/3 is shown by thick lines.

Here is a somewhat similar-looking definition.

nonZeroC :: Cantor -> Bool
nonZeroC (One : _)   = True
nonZeroC (Zero : bs) = nonZeroC bs

While a correct (and correctly typed) Haskell definition, this does not define a predicate on \mathbb{Z}_2 because the evaluation of nonZeroC zero never terminates. In fact, it is surprisingly difficult to define a predicate (i.e. a total function with boolean values) on the Cantor type. The beautiful reason behind this is that the truth-set of any such predicate is open in the 2-adic topology. Since this topology has as a basis of open sets the cosets of the subgroups 2^p \mathbb{Z}_2, all predicates look something like one of the twoPowerDivisibleC predicates above.

This remark maybe makes the main result in Escardó’s blog post somewhat less amazing, but it is still very striking: it is possible to define a Haskell function

equalC :: (Cantor -> Bool) -> (Cantor -> Bool) -> Bool

which given two predicates on \mathbb{Z}_2 (as represented in Haskell using the Cantor type) returns True if and only if they are equal (as mathematical functions on \mathbb{Z}_2) and False if and only if they are unequal. Escardó’s ingenious definition of equalC needs only a few lines of Haskell code: it may look obvious when read on the screen, but I found it a real challenge to duplicate unseen, even after having read his post. I encourage you to read it: it is fascinating to see how the compactness of \mathbb{Z}_2 as a topological space corresponds to a ‘uniformity’ property of Haskell predicates.
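For the curious, here is a condensed paraphrase of Escardó’s construction, adapted to the list representation of Cantor used above (his post, which uses functions from the naturals to bits, has the real thing): find p constructs a bitstream satisfying the total predicate p whenever one exists, by deciding the first bit with a recursive search.

-- find p returns a stream satisfying p, if any stream does.
find :: (Cantor -> Bool) -> Cantor
find p = if p (Zero : l) then Zero : l else One : r
  where
    l = find (\xs -> p (Zero : xs))
    r = find (\xs -> p (One : xs))

forsome, forevery :: (Cantor -> Bool) -> Bool
forsome p = p (find p)
forevery p = not (forsome (not . p))

-- Two predicates on Z_2 are equal if and only if they agree everywhere.
equalC :: (Cantor -> Bool) -> (Cantor -> Bool) -> Bool
equalC f g = forevery (\xs -> f xs == g xs)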

Sources

The results on volumes of spheres and balls are cobbled together from several MathOverflow questions and answers, in particular Joseph O’Rourke’s question asking for an intuitive explanation of the concentration of measure phenomenon, and S. Carnahan’s answer to a question on the volume of the n-dimensional unit ball. The coding theory results go back to Shannon, and can be found in many textbooks, for example van Lint’s Introduction to coding theory. The use of the ‘toy’ binary symmetric channel is my innovation, used when I lectured our Channels course in 2019–20 to reduce the technicalities in the proof of Shannon’s Noisy Coding Theorem. The diagrams were drawn using TikZ; for the spheres I used these macros due to Tomasz M. Trzeciak. The material about computability is either very basic or comes from a blog post by Martín Escardó (on Andrej Bauer’s blog) giving an introduction to his paper Infinite sets that admit fast exhaustive search, published in the proceedings of Logic in Computer Science 2007.


Microsoft Teams doesn’t want you to attend meetings

April 8, 2020

The purpose of this post is to document two of the more blatant bugs I’ve encountered in my forced exposure to Microsoft Teams during the Coronavirus crisis. To add insult to injury, the only way to work around them is to use Microsoft Outlook, another piece of software riddled with deficiencies.

Let’s get started. Here is a screenshot of me scheduling a meeting at 15:22 for 16:00 today.


Notice anything odd? Probably not: what sane person checks the timezone? But, look closely: Microsoft Teams firmly believes that London is on UTC+00:00 (same as GMT), not BST. This belief is not shared by my system, or any other software on it that I can find.

Now let’s try to join the meeting. Okay it’s early, but we are keen. Here is a screenshot of me hovering with my mouse over the meeting.

There is no way to join. Double clicking on the meeting just gives a chance to reschedule it (maybe to a later date, when Microsoft has fixed this glaring deficiency). The ‘Meet now’ button starts an unrelated meeting.

Okay, maybe our mistake was to join an MS Teams meeting using MS Teams. Let’s try using the Outlook web calendar. Here is a screenshot.

Here is a close-up of the right-hand side.

On the one hand, the times say that the meeting started 46 minutes ago; on the other, it is ‘in 14 min’. Perhaps because of this temporal confusion, there is no way to join the meeting.

Finally, here is MS Teams at 16:00.

Nothing has changed, there is still no way to join the meeting.

The prosecution rests its case. How can Microsoft justify releasing such an inept piece of software?

Update. Apparently my error was to schedule a meeting with no invitees. Under Microsoft’s interpretation, such meetings may be scheduled, but never attended (even by gate-crashing). On the time-zone front, both the Outlook web calendar and MS Teams continue to insist that London is on UTC+00:00, but, bizarrely, choosing London as my location (it already was) fixed the scheduling bug. Many thanks to Remi from the Royal Holloway IT support team for steering me on an expert course around the shark-infested waters of Microsoft software.


Stanley’s theory of P-partitions and the Hook Formula

March 22, 2020

The aim of this post is to introduce Stanley’s theory of labelled \mathcal{P}_\preceq-partitions and, as an application, give a short motivated proof of the Hook Formula for the number of standard Young tableaux of a given shape. In fact we prove the stronger q-analogue, in which hook lengths are replaced with quantum integers. All the ideas may be found in Chapter 7 of Stanley’s book Enumerative Combinatorics II so, in a particularly strong sense, no originality is claimed.

The division into four parts below and the recaps at the start of each part have the aim of reducing notational overload (the main difficulty in the early parts), while also giving convenient places to take a break.

Part 1: Background on arithmetic partitions

Recall that an arithmetic partition of size n is a weakly decreasing sequence of natural numbers whose sum is n. For example, there are 8 partitions of 7 with at most three parts, namely

(7), (6,1), (5,2), (5,1,1), (4,3), (4,2,1), (3,3,1), (3,2,2),

as represented by the Young diagrams shown below.

Since the Young diagram of a partition into at most 3 parts is uniquely determined by its number of columns of lengths 1, 2 and 3, such partitions are enumerated by the generating function

\displaystyle \frac{1}{(1-q)(1-q^2)(1-q^3)} = 1 + q + 2q^2 + 3q^3 + 4q^4 + 5q^5 + 7q^6 + 8q^7 + \cdots

For example (4,2,1) has two columns of length 1, and one each of lengths 2 and 3. It is counted in the coefficient of q^7, obtained when we expand the geometric series by choosing q^{1 \times 2} (two columns of length 1) from

\displaystyle \frac{1}{1-q} = 1 + q + q^2 + \cdots,

then q^{2 \times 1} (one column of length 2) from

\displaystyle \frac{1}{1-q^2} = 1+q^2 + q^4 + \cdots

and finally q^{3 \times 1} (one column of length 3) from

\displaystyle \frac{1}{1-q^3} = 1+q^3+q^6 + \cdots .

Now suppose that we are only interested in partitions where the first part is strictly bigger than the second. Then the Young diagram must have a column of size 1, and so we replace 1/(1-q) in the generating function with q/(1-q). Since the coefficient of q^6 above is 7, it follows (without direct enumeration) that there are 7 such partitions. What if the second part must also be strictly bigger than the third? Then the Young diagram must have a column of size 2, and so we also replace 1/(1-q^2) with q^2/(1-q^2). (I like to see this by mentally removing the first row: the remaining diagram then has a column of size 1, by the case just seen.) By a routine generalization we get the following result: partitions with at most k parts such that the jth largest part is strictly more than the (j+1)th largest part for all j \in J \subseteq \{1,\ldots, k-1\} are enumerated by

\displaystyle \frac{q^{\sum_{j \in J} j}}{(1-q)(1-q^2) \ldots (1-q^k)}.
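This formula is easy to sanity-check by brute force. Here is a short Haskell fragment of my own for the case k = 3 and J = \{1\}:

-- Partitions of n with at most 3 parts whose first part strictly exceeds
-- the second; by the generating function there should be 7 of size 7.
strictAtOne :: Int -> Int
strictAtOne n =
  length [(a, b, c) | a <- [0 .. n], b <- [0 .. a], c <- [0 .. b],
                      a + b + c == n, a > b]

-- strictAtOne 7 == 7, the coefficient of q^7 in q/(1-q)(1-q^2)(1-q^3).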

Part 2: \mathcal{P}_\preceq-partitions

Let \mathcal{P} be a poset with partial order \preceq. A \mathcal{P}_\preceq-partition is an order-preserving function p : \mathcal{P} \rightarrow \mathbb{N}_0.

For example, if \mathcal{P} = \{1,\ldots, k\} with the usual order then \bigl( p(1), \ldots, p(k)\bigr) is the sequence of values of a \mathcal{P}-partition if and only if p(1) \le \ldots \le p(k). Thus, by removing any initial zeros and reversing the sequence, \mathcal{P}-partitions are in bijection with arithmetic partitions having at most k parts.

I should mention that it is more usual to write \mathcal{P} rather than \mathcal{P}_\preceq. More importantly, Stanley’s original definition has ‘order-reversing’ rather than ‘order-preserving’. This fits better with arithmetic partitions and plane partitions, but since our intended application is to reverse plane partitions and semistandard Young tableaux, the definition as given (and used for instance in Stembridge’s generalization) is most convenient.

Reversed plane partitions

Formally the Young diagram of the partition (4,2,1) is the set

[(4,2,1)] = \{(1,1),(1,2),(1,3),(1,4),(2,1),(2,2),(3,1)\}

of boxes. We partially order the boxes by (the transitive closure of) (a,b) \preceq (a+1,b) and (a,b) \preceq (a,b+1). This is shown diagrammatically below.

For this partial order, \mathcal{P}_\preceq-partitions correspond to assignments of non-negative integers to [(4,2,1)] such that the rows and columns are weakly increasing when read left to right and top to bottom. For example, three of the 72 \mathcal{P}_\preceq-partitions of size 6 are shown below.

Such assignments are known as reverse plane partitions. The proof of the Hook Formula given below depends on finding the generating function for reverse plane partitions in two different ways: first using the general theory of \mathcal{P}-partitions, and then in a more direct way, for instance using the Hillman–Grassl bijection.

Enumerating \mathrm{RPP}(2,1)

As a warm-up, we replace (4,2,1) with the smaller partition (2,1), so now \mathcal{P}_\preceq = \{(1,1),(1,2),(2,1)\}, ordered by (1,1) \preceq (1,2), (1,1) \preceq (2,1). Comparing p(1,2) and p(2,1), we divide the \mathcal{P}_\preceq-partitions into two disjoint classes: those with p(1,2) \le p(2,1), and those with p(1,2) > p(2,1). The first class satisfy

p(1,1) \le p(1,2) \le p(2,1)

and so are in bijection with arithmetic partitions with at most 3 parts. The second class satisfy

p(1,1) \le p(2,1) < p(1,2)

so are in bijection with arithmetic partitions with at most 3 parts, whose largest part is strictly greater than the second largest. By the first section we deduce that

\displaystyle \begin{aligned}\sum_{n=0}^\infty |\mathrm{RPP}_{(2,1)}(n)|q^n &= \frac{1+q}{(1-q)(1-q^2)(1-q^3)} \\&= \frac{1}{(1-q)^2(1-q^3)}.\end{aligned}

The cancellation to leave a unit numerator shows one feature of the remarkably nice generating function for reverse plane partitions revealed in the final part below.

Labelled \mathcal{P}-partitions

A surprisingly helpful way to keep track of the weak/strong inequalities seen above is to label the poset elements by natural numbers. We define a labelling of a poset \mathcal{P}_\preceq of size k to be a bijective function L : \mathcal{P} \rightarrow \{1,\ldots, k\}. Suppose that y covers x in the order on \mathcal{P}_\preceq. Either L(x) < L(y), in which case we say the labelling is natural for (x,y), or L(x) > L(y), in which case we say the labelling is strict for (x,y). A (\mathcal{P}_\preceq,L)-partition is then an order-preserving function p : \mathcal{P} \rightarrow \mathbb{N}_0 such that if x \prec y is a covering relation and

  • L(x) < L(y) (natural) then p(x) \le p(y);
  • L(x) > L(y) (strict) then p(x) < p(y).

Note that the role of the labelling is only to distinguish weak/strong inequalities: the poset itself determines whether p(u) \le p(v) or p(v) \le p(u) for each comparable pair u, v \in \mathcal{P}. If we drop the restriction that x \prec y is a covering relation, and just require x \prec y, then we clearly define a subset of the labelled (\mathcal{P}_\preceq, L)-partitions, and it is not hard to see that in fact the definitions are equivalent. It feels most intuitive to me to state the definition as above.

Let \mathrm{Par}(\mathcal{P}_\preceq, L) denote the set of (\mathcal{P}_\preceq,L)-partitions. For example, if \mathcal{P} = \{(1,1),(1,2),(2,1)\} as in the example above, then the \mathcal{P}_\preceq-partitions are precisely the (\mathcal{P}_\preceq, L)-partitions for any all-natural labelling; the two choices are

L(1,1) = 1, L(1,2) = 2, L(2,1) = 3,

and

L'(1,1) = 1, L'(1,2) = 3, L'(2,1) = 2.

Working with L, the partitions with p(1,1) \le p(1,2) \le p(2,1) form the set \mathrm{Par}(\mathcal{P}_\unlhd, L) where \unlhd is the total order refining \preceq such that

(1,1) \unlhd (1,2) \unlhd (2,1)

and the partitions with p(1,1) \le p(2,1) < p(1,2) form the set \mathrm{Par}(\mathcal{P}_{\unlhd'}, L) where

(1,1) \unlhd' (2,1) \unlhd' (1,2).

The division of \mathcal{P}-partitions above is an instance of the following result.

Fundamental Lemma. Let \mathcal{P} be a poset with partial order \preceq and let L : \mathcal{P} \rightarrow \{1,\ldots, k\} be a labelling. Then

\mathrm{Par}(\mathcal{P}_\preceq, L) = \bigcup \mathrm{Par}(\mathcal{P}_\unlhd, L)

where the union over all total orders \unlhd refining \preceq is disjoint.

Proof. Every (\mathcal{P}_\preceq, L)-partition appears in the right-hand side for some \unlhd: just choose \unlhd so that if p(x) < p(y) then x \lhd y and if p(x)=p(y) and x \prec y then x \lhd y, breaking any remaining ties (between incomparable elements with equal p-values) using the labelling L. On the other hand, suppose that p \in \mathrm{Par}(\mathcal{P}_\preceq,L) is in both \mathrm{Par}(\mathcal{P}_{\unlhd},L) and \mathrm{Par}(\mathcal{P}_{\unlhd'},L) for distinct total orders \unlhd and \unlhd'. Choose x and y \in \mathcal{P} incomparable under \preceq and such that x \lhd y and y \lhd' x. From \mathcal{P}_\lhd we get p(x) \le p(y) and from \mathcal{P}_{\lhd'} we get p(y) \le p(x). Therefore p(x) = p(y). Now using the labelling for the first time, we may suppose without loss of generality that L(x) < L(y); since y \lhd' x the labelling is strict for x and y and so we have p(y) < p(x), a contradiction. \Box

Suggested exercise. Show that the generating functions enumerating \mathrm{RPP}(2,2) and \mathrm{RPP}(3,1) are 1/(1-q)(1-q^2)^2(1-q^3) and

\displaystyle\frac{1}{(1-q)^2(1-q^2)(1-q^4)};

there are respectively 2 and 3 linear extensions that must be considered.

Part 3: Permutations

Recall that \mathcal{P}_{\preceq} is a poset, L : \mathcal{P} \rightarrow \{1,\ldots, k\} is a bijective labelling and that a (\mathcal{P}_{\preceq},L)-partition is a function p : \mathcal{P} \rightarrow \mathbb{N}_0 such that if x \preceq y then p(x) \le p(y), with strict inequality whenever L(x) > L(y).

Connection with permutations

We write permutations of \{1,\ldots, k\} in one-line form as \pi_1\ldots \pi_k. Recall that \pi has a descent in position i if \pi_i > \pi_{i+1}.

Example. Let (1,1) \preceq (1,2), (1,1) \preceq (2,1) be the partial order used above to enumerate \mathrm{RPP}(2,1), as labelled by L(1,1) = 1, L(1,2) = 2, L(2,1) = 3. The total order (1,1) \unlhd (1,2) \unlhd (2,1) corresponds under L to the identity permutation 123 of \{1,2,3\}, with no descents. The total order (1,1) \unlhd' (2,1) \unlhd' (1,2) corresponds under L to the permutation 132 swapping 2 and 3, with descent set \{2\}.

In general, let x_1 \unlhd \ldots \unlhd x_k be a total order refining \preceq. Let i \le k-1 and consider the elements x_i and x_{i+1} of \mathcal{P} labelled L(x_i) and L(x_{i+1}). In any \mathcal{P}_\unlhd-partition p we have p(x_i) \le p(x_{i+1}), with strict inequality required if and only if L(x_i) > L(x_{i+1}). Therefore, using the total order \unlhd to identify p with a function on \{1,\ldots, k\}, i.e. i \mapsto p(x_i), we require p(i) \le p(i+1) for all i, with strict inequality if and only if L(x_i) > L(x_{i+1}). Equivalently,

p(1) \le \ldots \le p(k)

with strict inequality whenever there is a descent L(x_i)L(x_{i+1}) in the permutation L(x_1) \ldots L(x_k) corresponding under L to \unlhd. Conversely, a permutation \pi_1\ldots \pi_k of \{1,\ldots, k\} corresponds to a total order refining \preceq if and only if L(x) appears left of L(y) whenever x \preceq y. Therefore Stanley’s Fundamental Lemma may be restated as follows.

Fundamental Lemma restated. Let \mathcal{P} be a poset with partial order \preceq and let L : \mathcal{P} \rightarrow \{1,\ldots, k\} be a labelling. Then, using the labelling L to identify elements of \mathrm{Par}(\mathcal{P}_\preceq, L) with functions on \{1,\ldots, k\},

\mathrm{Par}(\mathcal{P}_\preceq, L) = \bigcup P_\pi

where P_\pi is the set of all functions p : \{1,\ldots, k\} \rightarrow \mathbb{N}_0 such that p(1) \le \ldots \le p(k), with p(i) < p(i+1) whenever \pi_i > \pi_{i+1}, and the union is over all permutations \pi such that L(x) appears to the left of L(y) in the one-line form of \pi whenever x \preceq y. Moreover the union is disjoint. \Box

The sequences \bigl( p(1), \ldots, p(k) \bigr) above correspond, on reversal, to partitions whose (k-i)th largest part is strictly more than their (k+1-i)th largest part (which might be zero) for each descent position i of \pi, and so are enumerated by

\displaystyle \frac{q^{\sum_{i : \pi_i > \pi_{i+1}} (k-i)}}{(1-q)\ldots (1-q^k)}.

The power of q in the numerator is, by the standard definition, the comajor index of the permutation \pi. We conclude that

\displaystyle \sum_{p \in \mathrm{Par}(\mathcal{P}_\preceq,L)} q^{|p|} =  \frac{\sum_\pi q^{\mathrm{comaj}(\pi)}}{(1-q)\ldots (1-q^k)}

where |p| = \sum_{x \in \mathcal{P}} p(x) and the sum in the numerator is over all permutations \pi as in the restated Fundamental Lemma.

Example: \mathrm{RPP}(3,2)

We enumerate \mathrm{RPP}(3,2). There are 5 linear extensions of the partial order on [(3,2)],

  • (1,1) \unlhd (1,2) \unlhd (1,3) \unlhd (2,1) \unlhd (2,2)
  • (1,1) \unlhd (1,2) \unlhd (2,1) \unlhd (1,3) \unlhd (2,2)
  • (1,1) \unlhd (1,2) \unlhd (2,1) \unlhd (2,2) \unlhd (1,3)
  • (1,1) \unlhd (2,1) \unlhd (1,2) \unlhd (1,3) \unlhd (2,2)
  • (1,1) \unlhd (2,1) \unlhd (1,2) \unlhd (2,2) \unlhd (1,3)

corresponding under the labelling

to the permutations 12345, 12435, 12453, 14235, 14253 with descent sets \varnothing, \{3\}, \{4\}, \{2\}, \{2,4\} and comajor indices 0, 2, 1, 3, 4, respectively. By the restatement of the fundamental lemma and the following remark,

\begin{aligned}\sum_{n=0}^\infty |\mathrm{RPP}_{(3,2)}(n)|q^n &= \frac{1 + q^2 + q + q^3 + q^4}{(1-q)(1-q^2)(1-q^3)(1-q^4)(1-q^5)} \\ &= \frac{1}{(1-q)^2(1-q^2)(1-q^3)(1-q^4)}. \end{aligned}
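These small computations are easily mechanised. The Haskell fragment below (names mine) computes the comajor index of a permutation given in one-line form, and reproduces the numerator just seen.

-- The comajor index: the sum of k - i over all descents i of the permutation.
comaj :: [Int] -> Int
comaj xs = sum [k - i | (i, (a, b)) <- zip [1 ..] (zip xs (tail xs)), a > b]
  where k = length xs

-- map comaj [[1,2,3,4,5],[1,2,4,3,5],[1,2,4,5,3],[1,4,2,3,5],[1,4,2,5,3]]
--   == [0,2,1,3,4]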

We end this part by digressing to outline a remarkably short proof of an identity due to MacMahon enumerating permutations by major index and descent count.

Exercise. If \mathcal{P} = \{1,\ldots, k\} and all elements are incomparable under \preceq, then a \mathcal{P}_\preceq-partition is simply a function p : \{1,\ldots, k\} \rightarrow \mathbb{N}_0.

  • How are such partitions enumerated by the restated Fundamental Lemma?
  • Deduce that

    \displaystyle \frac{1}{(1-q)^k} = \frac{\sum_{\pi} q^{\mathrm{comaj}(\pi)}}{(1-q)\ldots (1-q^k)}.

    where the sum is over all permutations of \{1,\ldots, k\}.

  • Give an involution on permutations that preserves the number of descents and swaps the comajor and major indices. Deduce that \mathrm{comaj}(\pi) can be replaced with \mathrm{maj}(\pi) above.

Exercise. The argument at the start of this post shows that 1/(1-qt) \ldots (1-q^k t) enumerates partitions with at most k parts by their size (power of q) and largest part (power of t).

  • Show that partitions whose jth largest part is strictly larger than their (j+1)th largest part for all j \in J \subseteq \{1,\ldots,k-1\} are enumerated in this sense by

    \displaystyle \frac{q^{\sum_{j \in J}j}t^{|J|}}{(1-qt)\ldots (1-q^kt)}.

  • Let c_k(m) be the polynomial in q that enumerates, by size, the compositions with k non-negative parts having m as their largest part. Show that

    \displaystyle  \sum_{m=0}^\infty c_k(m) t^m = \frac{\sum_{\pi} q^{\mathrm{comaj}(\pi)} t^{\mathrm{des}(\pi)}}{(1-qt)\ldots (1-q^k t)}

    where \mathrm{des}(\pi) is the number of descents of \pi.

  • Deduce that

    \displaystyle \sum_{m=0}^\infty \Bigl(\frac{q^{m+1}-1}{q-1}\Bigr)^k t^m = \frac{\sum_{\pi} q^{\mathrm{comaj}(\pi)} t^{\mathrm{des}(\pi)}}{(1-t)(1-qt)\ldots (1-q^k t)}.

  • Hence prove MacMahon’s identity
    \displaystyle \sum_{r=0}^\infty [r]_q^k t^r = \frac{\sum_\pi q^{\mathrm{maj}(\pi)} t^{\mathrm{des}(\pi)+1}}{(1-t)(1-qt)\ldots (1-q^k t)}

    where [r]_q = (q^r-1)/(q-1) = 1 + q + \cdots + q^{r-1}.

Part 4: Proof of the Hook Formula

The restated Fundamental Lemma for reverse plane partitions

Fix a partition \lambda of size k and let \mathcal{P}_{\preceq} be the poset whose elements are the boxes of the Young diagram [\lambda] ordered by (the transitive closure of) (a,b) \preceq (a+1,b) and (a,b) \preceq (a,b+1). Let \gamma_a be the sum of the a largest parts of \lambda (so \gamma_0 = 0), and define a labelling L : [\lambda] \rightarrow \{1,\ldots, k\} so that the boxes in row a of the Young diagram [\lambda] are labelled \gamma_{a-1}+1,\ldots, \gamma_a. (This labelling was seen for \lambda=(3,2) in the example at the end of the previous part.) Since this labelling is all-natural, the (\mathcal{P}_\preceq, L)-partitions are precisely the reverse plane partitions of shape \lambda. By the restated Fundamental Lemma and the following remark we get

\displaystyle \sum_{n=0}^\infty |\mathrm{RPP}_\lambda(n)|q^n = \frac{\sum_{\pi} q^{\mathrm{comaj}(\pi)}}{(1-q) \ldots (1-q^k)}

where the sum is over all permutations \pi such that the linear order defined by \pi refines \preceq. Equivalently, as seen in Part 3 and the final example, the label L(a,b) appears to the left of both L(a+1,b) and L(a,b+1) in the one-line form of \pi (when the boxes exist). This says that \pi^{-1}_{L(a,b)} < \pi^{-1}_{L(a+1,b)} and \pi^{-1}_{L(a,b)} < \pi^{-1}_{L(a,b+1)}. Therefore, when we put \pi^{-1}_i in the box of [\lambda] with label i, we get a standard tableau. Moreover, \pi has a descent in position i if and only if i appears to the right of i+1 in the one-line form of \pi^{-1}. Therefore, defining \mathrm{comaj}(t) = \sum_i (k-i), where the sum is over all i appearing strictly below i+1 in t, we have \mathrm{comaj}(\pi) = \mathrm{comaj}(t). (Warning: this is the definition from page 4 of this paper of Krattenthaler, and is clearly convenient here; however it does not agree with the definition on page 364 of Stanley Enumerative Combinatorics II.) We conclude that

\displaystyle \sum_{n=0}^\infty |\mathrm{RPP}_\lambda(n)|q^n = \frac{\sum_{t \in \mathrm{SYT}(\lambda)} q^{\mathrm{comaj}(t)}}{(1-q)\ldots (1-q^k)}

where \mathrm{SYT}(\lambda) is the set of standard tableaux of shape \lambda.

For example we saw at the end of the previous part that the refinements of \preceq when \lambda = (3,2) correspond to the permutations

12345, 12435, 12453, 14235, 14253.

Their comajor indices are 0, 2, 1, 3, 4, their inverses are

12345, 12435, 12534, 13425, 13524

and the corresponding standard tableaux, obtained by putting \pi^{-1}_i in the box of [(3,2)] labelled i, are

\begin{matrix} 1 & 2 & 3 \\ 4 & 5 \end{matrix} \quad \begin{matrix} 1 & 2 & 4 \\ 3 & 5 \end{matrix} \quad \begin{matrix} 1 & 2 & 5 \\ 3 & 4 \end{matrix} \quad \begin{matrix} 1 & 3 & 4 \\ 2 & 5 \end{matrix} \quad \begin{matrix} 1 & 3 & 5 \\ 2 & 4 \end{matrix}

Here the final tableau has 3 strictly above 2 and 5 strictly above 4, so its comajor index is (5-2) + (5-4) = 4.

Hillman–Grassl algorithm

Since this post is already rather long, I will refer to Section 7.22 of Stanley Enumerative Combinatorics II or, for a more detailed account that gives the details of how to invert the algorithm, Section 4.2 of Sagan The symmetric group: representations, combinatorial algorithms and symmetric functions. For our purposes, all we need is the remarkable corollary that

\displaystyle \sum_{n=0}^\infty |\mathrm{RPP}_\lambda(n)|q^n = \frac{1}{\prod_{(a,b)\in[\lambda]} (1-q^{h_{(a,b)}})}

where h_{(a,b)} is the hook-length of the box (a,b) \in [\lambda]. We saw the special cases for the partitions (2,1) and (3,2) above.

This formula can also be derived by letting the number of variables tend to infinity in Stanley’s Hook Content Formula: see Theorem 7.21.2 in Stanley’s book, or Theorem 2.6 in this joint paper with Rowena Paget, where we give the representation theoretic context.

Proof of the Hook Formula

Combining the results of the two previous subsections we get

\displaystyle \frac{1}{\prod_{(a,b)\in[\lambda]} (1-q^{h_{(a,b)}})} = \frac{\sum_{t \in \mathrm{SYT}(\lambda)} q^{\mathrm{comaj}(t)}}{(1-q)\ldots (1-q^k)}.

Equivalently, using the quantum integer notation [r]_q = (1-q^r)/(1-q) and [k]!_q = [k]_q \ldots [1]_q, we have

\displaystyle \sum_{t \in \mathrm{SYT}(\lambda)} q^{\mathrm{comaj}(t)} = \frac{[k]!_q}{\prod_{(a,b) \in [\lambda]} [h_{(a,b)}]_q}.

This is the q-analogue of the Hook Formula; the usual weaker version is obtained by setting q=1 to get

\displaystyle |\mathrm{SYT}(\lambda)| = \frac{k!}{\prod_{(a,b) \in [\lambda]} h_{(a,b)}}.
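
Both versions can be checked by brute force for small shapes. In the Python sketch below the test shapes are arbitrary; comaj is the tableau statistic defined above, and the q-analogue is checked by cross-multiplying so that only integer polynomial arithmetic is needed.

from itertools import permutations

def poly_mul(f, g):
    h = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            h[i + j] += a * b
    return h

def q_int(r):
    # [r]_q = 1 + q + ... + q^{r-1} as a coefficient list
    return [1] * r

for shape in [(3, 2), (2, 2), (4, 1), (3, 3, 1)]:
    k = sum(shape)
    boxes = [(a, b) for a, row in enumerate(shape) for b in range(row)]
    # sum of q^comaj(t) over standard tableaux t of this shape
    lhs = [0] * (k * (k - 1) // 2 + 1)
    for perm in permutations(range(1, k + 1)):
        t = dict(zip(boxes, perm))
        if all(t[a, b] < t[a, b + 1] for a, b in boxes if (a, b + 1) in t) and \
           all(t[a, b] < t[a + 1, b] for a, b in boxes if (a + 1, b) in t):
            row_of = {t[box]: box[0] for box in boxes}
            lhs[sum(k - i for i in range(1, k) if row_of[i] > row_of[i + 1])] += 1
    # [k]!_q and the product of [hook length]_q over the boxes
    conj = [sum(1 for r in shape if r > b) for b in range(shape[0])]
    numer = [1]
    for i in range(1, k + 1):
        numer = poly_mul(numer, q_int(i))
    denom = [1]
    for a, b in boxes:
        denom = poly_mul(denom, q_int((shape[a] - b) + (conj[b] - a) - 1))
    check = poly_mul(lhs, denom)
    while check and check[-1] == 0:  # trim trailing zeros before comparing
        check.pop()
    assert check == numer, shape
print("q-analogue of the Hook Formula verified for the test shapes")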


What makes a good/bad lecture?

December 30, 2019

In the opinion of 45 Royal Holloway 2nd year students the answers, taking only those mentioned at least three times, are:

What makes a good lecture?

  • Engaging/enthusiastic lecturer (22)
  • Interactive (8)
  • Clear voice (7)
  • Eye contact with audience (5)
  • Checking for understanding (4)
  • Clear (4)
  • Clear handwriting (4)
  • Jokes/humour (4)
  • Seems interested in what they’re saying (4)
  • Well prepared (4)
  • Examples (3)
  • Sound excited/animated (3)
  • High quality notes (3)

What makes a bad lecture?

  • Too quiet (18)
  • Reading off board/book/page/slide (17)
  • Facing the board (9)
  • Not engaging with audience (8)
  • Monotone (7)
  • Poor handwriting (7)
  • Boring (5)
  • Not enough explanation (5)
  • Slow (3)
  • Too much talking (3)

Putnam 2018: determinant of an adjacency matrix

October 20, 2019

I took in the October issue of the American Mathematical Monthly to entertain me during the lulls while manning the mathematics desk at our open day on Saturday. As is traditional, the first article was the questions in this year’s Putnam. Question A2 asked for the determinant of the adjacency matrix of the graph on the non-empty subsets of \{1,2,\ldots, n\} in which two subsets are connected if and only if they intersect.

Three hours later, hardly overwhelmed by student visits (although I did get to wheel out infinitely many primes and 0.\dot{9} = 1, to the evident interest of several potential students and their parents), I had at most the outline of a fiddly solution that piled row operations on half-remembered determinantal identities.

Here is the far simpler solution I found when I came to write it up. The real purpose of this post is to make the connection with the Schur algebra and permutation representations of the symmetric group, and to record some related matrices that also have surprisingly simple determinants or spectral behaviour. I also include an explicit formula for the inverse of the adjacency matrix; this can be expressed using the Möbius function of the poset of subsets of \{1,2,\ldots, n\}, for reasons I do not yet fully understand.

A2. Let S_1, S_2, \ldots, S_{2^n-1} be the nonempty subsets of \{1,2,\ldots, n\} in some order and let M be the (2^n-1) \times (2^n-1) matrix whose (i,j) entry is

\displaystyle m_{ij} = \begin{cases} 0 & S_i \cap S_j = \varnothing \\ 1 & S_i \cap S_j \neq \varnothing \end{cases}

Calculate the determinant of M.

Solution. As is implicit in the question, we can order the sets as we like, since swapping two rows and then swapping the two corresponding columns does not change the determinant of M. We put the 2^{n-1} - 1 sets not having 1 first, then the 2^{n-1} sets having 1. With this order, continued recursively, the matrix M_n for the subsets of \{1,2,\ldots, n\} has the block form

\left( \begin{matrix} M_{n-1} & R_{n-1} \\ R_{n-1}^t & K_{n-1} \end{matrix} \right)

where R_k is the (2^k-1) \times 2^k matrix with first column all zero and then M_{k} to its right, and K_k is the 2^k \times 2^k all-ones matrix. For example,

M_3 = \left( \begin{matrix} 1 & 0 & 1 & {\bf 0} & 1 & 0 & 1 \\ 0 & 1 & 1 & {\bf 0} & 0 & 1 & 1 \\ 1 & 1 & 1 & {\bf 0} & 1 & 1 & 1 \\ {\bf 0} & {\bf 0} & {\bf 0} & 1 & 1 & 1 & 1 \\ 1& 0 & 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 & 1 &1 &1 \\ 1 & 1 & 1 & 1 & 1 &1 &1 \end{matrix} \right)

where the exceptional zero column in R_{2} and zero row in R_{2}^t are shown in bold.

Clearly \det M_1 = 1. We shall show, by induction on n, that \det M_n = -1 if n \ge 2. The 3 \times 3 matrix M_2 can be seen as a submatrix of M_3 above; computing the determinant as a sum over the 6 permutations of \{1,2,3\} gives 1 - 1 -1 = -1, as required.

For the inductive step, observe that row 2^{n-1} of M_n has 2^{n-1}-1 zeros (from the top row of R_{n-1}^t) followed by 2^{n-1} ones (from the top row of K_{n-1}). By row operations subtracting this row from each of the rows below, we obtain

\left( \begin{matrix} M_{n-1} & 0_{(2^{n-1}-1) \times 1} & M_{n-1} \\ 0_{1 \times (2^{n-1}-1)} & 1 & 1_{1 \times (2^{n-1}-1)} \\ M_{n-1} & 0_{(2^{n-1}-1) \times 1} & 0_{(2^{n-1}-1) \times (2^{n-1}-1)} \end{matrix} \right).

After these operations the unique non-zero entry in column 2^{n-1} is the one in position (2^{n-1},2^{n-1}). Therefore, expanding the determinant along this column and deleting the corresponding row and column, we find that

\det M_n = \det \left( \begin{matrix} M_{n-1} & M_{n-1} \\ M_{n-1} & 0_{(2^{n-1}-1) \times (2^{n-1}-1)} \end{matrix} \right).

Given the zero matrix in the bottom right, any permutation making a non-zero contribution to the determinant must match the bottom rows with the left-hand columns, and hence the top rows with the right-hand columns; in particular no such permutation picks an entry in the top-left copy of M_{n-1}. Since the blocks are square of odd size 2^{n-1}-1, this block swap contributes a sign of -1. Therefore

\det M_n = -(\det M_{n-1})^2 = -1

as claimed.
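
The result is easily confirmed by computer. A Python sketch, encoding the non-empty subsets as the bitmasks 1, \ldots, 2^n - 1 (so that X \cap Y \neq \varnothing becomes a non-zero bitwise and) and using exact arithmetic:

from sympy import Matrix

for n in range(1, 6):
    sets = range(1, 2**n)
    M = Matrix([[1 if x & y else 0 for y in sets] for x in sets])
    print(n, M.det())  # 1, then -1 for every n >= 2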

Inverse

The inverse matrix M_n^{-1} can be computed explicitly: labelling rows and columns by subsets of \{1,2,\ldots, n\} we have

\displaystyle -(M_n^{-1})_{XY} = \begin{cases} (-1)^{|Y \backslash X'|} & Y \supseteq X' \\ 0 & Y \not\supseteq X' \end{cases}

where X' is the complement of X in \{1,2,\ldots, n\}. Can the appearance of the Möbius function \mu(Z,Y) = (-1)^{|Y \backslash Z|} of the poset of subsets of \{1,2,\ldots, n\} be explained as an instance of Möbius inversion?
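
Whatever the explanation, the formula itself is easy to verify by machine; a Python sketch for n = 4 (an arbitrary small choice), with subsets again encoded as bitmasks:

from sympy import Matrix, eye

n = 4
full = 2**n - 1
sets = list(range(1, 2**n))
M = Matrix([[1 if x & y else 0 for y in sets] for x in sets])

def entry(x, y):
    xc = full ^ x                           # the complement X'
    if xc & ~y:                             # Y does not contain X'
        return 0
    return -(-1)**bin(y & ~xc).count("1")   # -(-1)^{|Y \ X'|}

Minv = Matrix([[entry(x, y) for y in sets] for x in sets])
assert M * Minv == eye(2**n - 1)
print("inverse formula verified for n =", n)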

Symmetric group

Let \Omega be the set of all subsets of \{1,2,\ldots, n\} and let \mathbb{F} be an infinite field. The vector space \mathbb{F}\Omega with basis \Omega is a permutation module for the symmetric group S_n on \{1,2,\ldots, n\}. The Putnam matrix, extended by an extra zero row and column corresponding to the empty set, is the endomorphism of this permutation module that sends a subset to the sum of all the subsets that it intersects. This interpretation of the Putnam matrix suggests a rich family of variations on the theme.

Variations on the Putnam matrix

Diagonal blocks

Fix k \in \{1,\ldots, n\} with k \le n/2. Let \Omega^{(k)} be the set of k-subsets of \{1,2,\ldots, n\} and let M^{(k)}_n be the block of M_n corresponding to k-subsets of \{1,2,\ldots, n\}. Thus M^{(k)}_n is the endomorphism of the permutation module \mathbb{F}\Omega^{(k)} sending a k-subset to the sum of all the k-subsets that it intersects. The permutation module \mathbb{F}\Omega^{(k)} has a multiplicity-free direct sum decomposition

\mathbb{F}\Omega^{(k)} = U_0 \oplus U_1 \oplus \cdots \oplus U_k

where U_j has irreducible character canonically labelled by the partition (n-j,j). (Since k \le n/2 this is a partition.) It therefore follows from Schur’s Lemma that M_n^{(k)} has k+1 integer eigenvalues whose multiplicities are the dimensions of the U_j. Varying j and n, these dimensions form a triangle whose diagonal entries, i.e. those for which j = \lfloor n/2 \rfloor, are the Catalan Numbers.

For example, by working with the filtration of \mathbb{F}\Omega^{(2)} by the Specht modules S^{(n-2,2)} \cong U_2, S^{(n-1,1)} \cong U_1 and S^{(n)} \cong U_0 one can show that the eigenvalues are 2n-3, n-3 and -1 with multiplicities 1, n-1 and n(n-3)/2, respectively. Computational data suggests that in general the eigenvalue associated to the partition (n-j,j) for j \ge 1 is (-1)^{j-1} \binom{n-k-j}{k-j}; probably this can be proved by a generalization of the filtration argument. (But maybe there is a better approach just using character theory.) Since the unique trivial submodule is spanned by the sum of all k-subsets, the eigenvalue for the partition (n) is simply the number of k-subsets meeting \{1,2,\ldots, k\}, namely \binom{n}{k} - \binom{n-k}{k}.
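
The conjecture is at least easy to test numerically. A Python sketch comparing the distinct eigenvalues of M^{(k)}_n with the conjectured values, for a few arbitrary pairs (n, k) chosen so that the predicted eigenvalues happen to be distinct:

from itertools import combinations
from math import comb
import numpy as np

for n, k in [(7, 2), (8, 3), (9, 4)]:
    subsets = list(combinations(range(n), k))
    A = np.array([[1 if set(x) & set(y) else 0 for y in subsets]
                  for x in subsets])
    eigs = set(np.linalg.eigvalsh(A).round(6))
    predicted = {float(comb(n, k) - comb(n - k, k))} | \
                {float((-1)**(j - 1) * comb(n - k - j, k - j))
                 for j in range(1, k + 1)}
    # compare distinct eigenvalues only, not their multiplicities
    assert eigs == predicted, (n, k)
print("conjectured eigenvalues confirmed for the test cases")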

Weighting additions and removals by indeterminates

Let P_n(\alpha,\beta, \theta) be the generalization of the Putnam matrix where the entry remains zero for disjoint subsets, and if X meets Y then the entry is \alpha^{|Y \backslash X|}\beta^{|X\backslash Y|}\theta^{2|X\cap Y|}. For example, P_2(\alpha,\beta,\theta) is

\left( \begin{matrix} \theta^2 & 0 & \alpha\theta^2 \\ 0 & \theta^2 & \alpha \theta^2 \\ \beta\theta^2 & \beta\theta^2 & \theta^4 \end{matrix} \right)

where rows and columns are labelled \{1\}, \{2\}, \{1,2\}, and P_3(\alpha,\beta,1) is shown below, with rows and columns labelled \{1\}, \{2\}, \{1,2\}, \{3\}, \{1,3\}, \{2,3\}, \{1,2,3\}

\left( \begin{matrix} 1 & 0 & \alpha & 0 & \alpha & 0 & \alpha^2 \\ 0 & 1 & \alpha & 0 & 0 & \alpha & \alpha^2 \\ \beta & \beta & 1 & 0 & \alpha\beta & \alpha\beta & \alpha \\ 0 & 0 & 0 & 1 & \alpha & \alpha & \alpha^2 \\ \beta & 0 & \alpha\beta & \beta & 1 & \alpha\beta & \alpha \\ 0 & \beta & \alpha\beta & \beta & \alpha\beta & 1 & \alpha \\ \beta^2 & \beta^2 & \beta & \beta^2 & \beta & \beta & 1 \end{matrix} \right).

Specializing \alpha, \beta and \theta to 1 we get the Putnam matrix for n = 3 shown above. When P_n(\alpha,\beta,\theta) is written as a sum over the (2^n-1)! permutations of the collection of non-empty subsets of \{1,\ldots, n\}, each summand factorizes as a product over the disjoint cycles in the corresponding permutation. If (X_0, \ldots, X_{\ell-1}) is such a cycle then the power of \alpha is \sum_{i=1}^{\ell}|X_i \backslash X_{i-1}| (taking X_{\ell} = X_0), and similarly the power of \beta is \sum_{i=1}^{\ell} |X_{i-1}\backslash X_i|. Clearly these are the same. Moreover, by the identity

|X\backslash Y| + |Y\backslash X| + 2|X\cap Y| = |X| + |Y|,

after taking the product over all disjoint cycles, the total degree of each summand in \alpha, \beta and \theta is 2\sum_{k=1}^n k \binom{n}{k} = n2^n. Therefore the determinant of the matrix is a homogeneous polynomial in \alpha, \beta, \theta of degree n2^n. Since swapping \alpha and \beta corresponds to transposing the matrix, the determinant is a function of \alpha + \beta, \alpha\beta and \theta. We have seen that each summand of the determinant has the same power of \alpha as \beta, and so the determinant depends on \alpha and \beta only through \gamma = \alpha\beta. Therefore the determinant is a polynomial in \alpha\beta (degree 2 in the original variables) and \theta^2 (also degree 2), homogeneous of degree n2^{n-1} in these. Hence we may specialize so that \alpha\beta = 1 and replace \theta^2 with \tau without losing any information. (This turns out to be the most convenient form for computation because, using MAGMA at least, it is fast to compute determinants of matrices whose coefficients lie in the single variable polynomial ring \mathbb{Q}[\tau] but much slower if the coefficients lie in \mathbb{Q}[\alpha,\beta,\tau].) The determinants for small n are shown below.

\begin{array}{ll} 1 & \tau \\ 2 & \tau^3(\tau-2) \\ 3 & \tau^7(\tau-2)^3(\tau^2-3\tau+3)\\ 4 & \tau^{15}(\tau-2)^7 (\tau^2-3\tau+3)^4 (\tau^2-2\tau+2)\end{array}

To continue, it is useful to define f_2 = \tau - 2 and for odd primes p, a polynomial f_p \in \mathbb{Z}[\tau] of degree p-1 by \tau f_p = (\tau-1)^p + 1. (The definition becomes uniform on changing 1 to (-1)^{p-1}.) For instance f_3 = \tau^2 - 3\tau + 3 and

f_5 = \tau^4 - 5\tau^3 + 10 \tau^2 - 10 \tau + 5.

We also define the following further polynomials:

\begin{aligned} f_4 &= \tau^2 - 2\tau + 2 \\ f_6 &= \tau^2 - \tau + 1 \\ f_8 &= \tau^4 - 4\tau^3 + 6\tau^2 -4\tau + 2 \\ f_{9} &= \tau^6 - 6\tau^5 + 15\tau^4 - 21\tau^3 + 18\tau^2 - 9\tau + 3 \\ f_{10} &= \tau^4 - 3\tau^3 + 4\tau^2 - 2\tau + 1 \\ f_{12} &= \tau^4 -4\tau^3 + 5\tau^2 - 2\tau + 1 \end{aligned}

Using this notation, the determinants for n \le 13 are

\begin{array}{ll} 1 & \tau \\ 2 & \tau^3 f_2 \\ 3 & \tau^7 f_2^3 f_3 \\ 4 & \tau^{15} f_2^7 f_3^4 f_4 \\ 5 & \tau^{31} f_2^{15} f_3^{10} f_4^{5} f_5 \\ 6 & \tau^{63} f_2^{31}f_3^{21}f_4^{15}f_5^6 f_6 \\ 7 & \tau^{127} f_2^{63} f_3^{42} f_4^{35} f_5^{21} f_6^{7} f_7 \\ 8 & \tau^{255} f_2^{127} f_3^{84} f_4^{71} f_5^{56}f_6^{28} f_7^8 f_8 \\ 9 & \tau^{511} f_2^{255} f_3^{169} f_4^{135} f_5^{126} f_6^{84} f_7^{36} f_8^{9} f_9 \\ 10 & \tau^{1023} f_2^{511} f_3^{340} f_4^{255} f_5^{253}f_6^{210}f_7^{120}f_8^{45} f_9^{10}f_{10}\\ 11 &\tau^{2047}f_2^{1023}f_3^{682}f_4^{495}f_5^{473}f_6^{462}f_7^{330}f_8^{165}f_9^{55}f_{10}^{11}f_{11} \\ 12 & \tau^{4095} f_2^{2047} f_3^{1365}f_4^{991}f_5^{858} f_6^{925}f_7^{792} f_8^{495} f_9^{220}f_{10}^{66} f_{11}^{12}f_{12} \\ 13 & \tau^{8191}f_2^{4095}f_3^{2730}f_4^{2015}f_5^{1573} f_6^{1729}f_7^{1716}f_8^{1287}f_9^{715}f_{10}^{286}f_{11}^{78}f_{12}^{13}f_{13} \end{array}

Note that f_n has degree \phi(n) in every case where it is defined, and that f_n(1) = 1 for all n except n =2, for which f_2(1) = -1. Assuming that this behaviour continues, and that f_2 always has exponent 2^{n-1}-1, as suggested by the data above, this gives another solution to the Putnam problem.
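
For anyone wanting to reproduce the data without MAGMA, here is the n = 3 computation as a Python sketch (sympy is markedly slower, so only a small case is shown): taking \alpha = \beta = 1 and \tau = \theta^2, the entry for intersecting subsets becomes \tau^{|X \cap Y|}.

from sympy import symbols, Matrix, factor

tau = symbols('tau')
n = 3
sets = list(range(1, 2**n))
P = Matrix([[tau**bin(x & y).count("1") if x & y else 0 for y in sets]
            for x in sets])
print(factor(P.det()))  # tau**7*(tau - 2)**3*(tau**2 - 3*tau + 3)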

Weighting by integers

Given (w_0,w_1,\ldots, w_n) \in \mathbb{Z}^{n+1}, let Q_n(w) be the generalization of the Putnam matrix where if X meets Y then the entry is w_{|X\cap Y|}. When w_0 = 0 and w_i = (-1)^{i-1} for 1 \le i \le n, the determinants are given (up to the overall sign (-1)^{2^n-1} = -1, since w_{|X \cap Y|} = -(-1)^{|X \cap Y|}) by specializing the matrix above by \alpha = 1, \beta = 1 and \theta = \sqrt{-1}, and so they are obtained from the factorizations in \mathbb{Q}[\tau] presented above by specializing \tau to -1.

The evaluations at -1 of the polynomials defined above are as follows: f_2(-1) = -3, f_p(-1) = 2^p - 1 for p an odd prime, and f_4(-1) = 5, f_6(-1) = 3, f_8(-1) = 17, f_9(-1) = 73, f_{10}(-1) = 11, f_{12}(-1) = 13.

Now suppose that w_i = (-1)^{i} for each i, so now w_0 = 1. For instance the matrix for n =3 is

\left( \begin{matrix} - & + & - & + & - & + & - \\ + & - & - & + & + & - & - \\ - & - & + & + & - & - & + \\ + & + & + & - & - & - & - \\ - & + & - & - & + & - & + \\ + & - & - & - & - & + & + \\ - & - & + & - & + & + & - \end{matrix}\right)

where + denotes 1 and - denotes -1. The determinants for n \in \{1,\ldots, 10\} are

1, 2^2, 2^9, 2^{28}, 2^{75}, 2^{186}, 2^{441}, 2^{1016}, 2^{2295}, 2^{5110}.

The sequence of exponents is in OEIS A05887, as the number of labelled acyclic digraphs with n vertices containing exactly n-1 points of in-degree zero. The determinant itself enumerates unions of directed cycles, with sign and a weighting, in the complete graph on n vertices; does this give the connection, or is it a new interpretation of sequence A05887?
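
These determinants, too, are quick to reproduce by machine (a Python sketch; with this convention the n = 1 determinant comes out as -1, so the values above are best read up to sign):

from sympy import Matrix

for n in range(1, 6):
    sets = range(1, 2**n)
    # entry (-1)^{|X cap Y|}, which is w_{|X cap Y|} for w_i = (-1)^i
    Q = Matrix([[(-1)**bin(x & y).count("1") for y in sets] for x in sets])
    print(n, Q.det())  # up to sign: 1, 2**2, 2**9, 2**28, 2**75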

Schur algebra

Let me finish by making the connection with an apparently completely unrelated object. Consider representations of the general linear group \mathrm{GL}_2(\mathbb{F}) in which the entries of the representing matrices are polynomials of a fixed degree in the four entries of each \mathrm{GL}_2(\mathbb{F}) matrix. For example, let E = \langle e_1, e_2 \rangle be the natural representation of \mathrm{GL}_2(\mathbb{F}). Then E itself is a polynomial representation of degree 1, and its symmetric square \mathrm{Sym}^2 \!E with basis e_1^2, e_1e_2, e_2^2 is a polynomial representation of degree 2, in which

\left( \begin{matrix} \alpha & \beta \\ \gamma & \delta \end{matrix} \right) \mapsto \left( \begin{matrix} \alpha^2 & 2\alpha\beta & \beta^2 \\ \alpha\gamma & \alpha\delta + \beta\gamma & \beta\delta \\ \gamma^2 & 2\gamma\delta & \delta^2 \end{matrix} \right).

By definition, \mathrm{Sym}^2 \! E is a quotient of E \otimes E. More generally, any polynomial representation of degree n is a subquotient of a direct sum of copies of E^{\otimes n}. Observe that there is a linear isomorphism E^{\otimes n} \rightarrow \mathbb{F}\Omega defined by

e_{i_1} \otimes e_{i_2} \otimes \cdots \otimes e_{i_n} \mapsto \bigl\{j \in \{1,2,\ldots, n\} : i_j = 2 \bigr\}.

The action of S_n on \mathbb{F}\Omega induces the place permutation action on E^{\otimes n}, defined (using right actions) by

e_{i_1} \otimes e_{i_2} \otimes \cdots \otimes e_{i_n} \sigma = e_{i_{1\sigma^{-1}}} \otimes e_{i_{2\sigma^{-1}}} \otimes \cdots \otimes e_{i_{n\sigma^{-1}}}.

For example, the 3-cycle (1,2,3) sends e_{a} \otimes e_{b} \otimes e_{c} to e_{c} \otimes e_{a} \otimes e_{b}, moving each factor one to the right (with wrap around). This gives some motivation for introducing the Schur algebra S(2,n), defined to be the algebra of linear endomorphisms of the polynomial representation E^{\otimes n} of \mathrm{GL}_2(\mathbb{F}) that commute with the place permutation action of S_n. In symbols,

S(2,n) = \mathrm{End}_{\mathbb{F}S_n}(E^{\otimes n}).

Somewhat remarkably one can show that the category of polynomial representations of \mathrm{GL}_2(\mathbb{F}) of degree n is equivalent to the module category of the Schur algebra. (See for instance the introduction to Green’s lecture notes.) The extended Putnam matrix corresponds to the element of the Schur algebra sending the pure tensor e_{i_1} \otimes \cdots \otimes e_{i_n} to the sum of all the pure tensors e_{j_1} \otimes \cdots \otimes e_{j_n} such that i_k = j_k = 2 for at least one k. For example, if n=2 then

\left( \begin{matrix} \alpha & \beta \\ \gamma & \delta \end{matrix} \right) \mapsto \left( \begin{matrix} \alpha^2 & \alpha\beta & \alpha\beta & \beta^2 \\ \alpha\gamma & \alpha\delta & \beta\gamma & \beta\delta \\ \alpha\gamma & \beta\gamma & \alpha\delta & \beta\delta \\ \gamma^2 & \gamma\delta & \gamma\delta & \delta^2 \end{matrix} \right), \  M_2 \mapsto \left( \begin{matrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 1 \end{matrix} \right).

We see that S(2,2) is 10-dimensional, and, since the image of M_2 has equal coefficients in the required positions, it lies in S(2,2) as expected. For another example,

\left( \begin{matrix} 1 & 0 \\ \delta & 1 \end{matrix} \right) \mapsto \left( \begin{matrix} 1 & 0 & 0 & 0 \\ \delta & 1 & 0 & 0 \\ \delta & 0 & 1 & 0 \\ \delta^2 & \delta & \delta & 1 \end{matrix} \right)

corresponds to the endomorphism of \mathbb{F}\Omega sending a subset of \{1,2\} to the sum of all its subsets, weighted by a power of \delta recording how many elements were removed. This case is unusual in that the Schur algebra element comes from a single matrix in \mathrm{GL}_2(\mathbb{F}), rather than a linear combination of matrices lying in the group algebra \mathbb{F}\mathrm{GL}_2(\mathbb{F}), as would be required for the Putnam matrix. Passing to the Schur algebra replaces the infinite dimensional algebra \mathbb{F}\mathrm{GL}_2(\mathbb{F}) with a finite dimensional quotient.


Two proofs of Shannon’s Source Coding Theorem and an extension to lossy coding

September 29, 2019

The first purpose of this post is to record two approaches to Shannon’s Source Coding Theorem for memoryless sources, first by explicit codes and secondly using statistical properties of the source. After that I try, with mixed success, to give a motivated proof of a more general result on lossy coding.

Setup and statement.

Suppose that a memoryless source produces the symbols 1, \ldots, m with probabilities p(1), \ldots, p(m), respectively. Let

h = H(p(1),\ldots, p(m))

be the entropy of this probability distribution. Since the source is memoryless, the entropy of an r-tuple of symbols is then rh. Shannon’s Source Coding Theorem (or First Main Theorem, or Noiseless Coding Theorem) states that, given \epsilon > 0, provided r is sufficiently large, there is a variable length binary code of average length less than r(h + \epsilon) that can be used to encode all r-tuples of symbols.

Proof by Kraft–McMillan inequality

Choose r sufficiently large so that r > 1/\epsilon. Let q(1), \ldots, q(m^r) be the probabilities of the r-tuples and let \ell(j) \in \mathbb{N} be least such that \ell(j) \ge \log_2 1/q(j). Equivalently, \ell(j) is least such that

2^{-\ell(j)} \le q(j).

This equivalent characterisation shows that

\sum_{j=1}^{m^r} 2^{-\ell(j)} \le \sum_{j=1}^{m^r} q(j) =1

so by the Kraft–McMillan inequality, there exists a prefix free binary code with these codeword lengths. Moreover, since

\ell(j) = \lceil \log_2 1/q(j) \rceil \le 1 + \log_2 \frac{1}{q(j)}

we have

\begin{aligned}\sum_{j=1}^{m^r} q(j)\ell(j) &\le \sum_{j=1}^{m^r} q(j) \bigl( 1 + \log_2 \frac{1}{q(j)} \bigr) \\ &= \sum_{j=1}^{m^r} q(j) + \sum_{j=1}^{m^r} q(j) \log_2 \frac{1}{q(j)} \\ &= 1 + hr \end{aligned}

as required, since 1 + hr < r(h + \epsilon) by our choice of r > 1/\epsilon.

The required prefix-free code is the Shannon code for the probabilities q(1), \ldots, q(m^r). Another algorithmic construction is given by the proof of the sufficiency of the Kraft–McMillan inequality on page 17 of Welsh Codes and cryptography. (The same proof is indicated, rather less clearly in my view, on the Wikipedia page.)
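
Here is a sketch in Python of the Shannon code just mentioned: lengths \lceil \log_2 1/q(j) \rceil, with the codeword for the jth tuple read off from the binary expansion of the cumulative probability of its predecessors. (The example distribution is an arbitrary dyadic one; exact arithmetic via Fraction avoids any rounding worries.)

from fractions import Fraction
from math import ceil, log2

def shannon_code(probs):
    # lengths ceil(log2(1/q)) satisfy the Kraft-McMillan inequality; taking
    # the first `length` bits of the cumulative probability is prefix-free
    probs = sorted(probs, reverse=True)
    codewords, cumulative = [], Fraction(0)
    for q in probs:
        length = max(1, ceil(log2(1 / q)))
        bits, x = "", cumulative
        for _ in range(length):  # first `length` binary digits of `cumulative`
            x *= 2
            bits += "1" if x >= 1 else "0"
            x -= int(x)
        codewords.append(bits)
        cumulative += q
    return codewords

code = shannon_code([Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)])
print(code)  # ['0', '10', '110', '111']
assert not any(c != d and d.startswith(c) for c in code for d in code)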

Proof using Asymptotic Equipartition Property (AEP).

Despite its forbidding name, for memoryless sources the AEP is little more than the Weak Law of Large Numbers, albeit applied in a slightly subtle way to random variables whose values are certain (functions of) probabilities. Let \mathcal{M} = \{1,\ldots, m\} be the probability space of symbols produced by the source. Let X \in \mathcal{M} be a random symbol produced by the source. Then \log p(X) is a random variable. Summing over the m outcomes in the probability space, we find that

\mathbb{E}[\log p(X)] = \sum_{i=1}^m \mathbb{P}[X=i] \log p(i) =  \sum_{i=1}^m p(i) \log p(i) = -h

where h = H(p(1), \ldots, p(m)) is the entropy already seen. While straightforward, and giving some motivation for considering \log p(X), the calculation hides some subtleties. For example, is it true that

\mathbb{P}[\log p(X) = \log p(i)] = p(i)?

In general the answer is ‘no’: in fact equality holds if and only if i is the unique symbol with probability p(i). (One further thing, that I certainly will not even dare to mention in my course, is that, by the formal definition of a random variable, X is the identity function on \mathcal{M}, and so p(X) is a composition of two functions; in Example 2.3.1 of Goldie and Pinch Communication Theory the authors remark that forming such compositions is a ‘characteristic feature of information theory, not found elsewhere in probability theory’.)

Anyway, returning to the AEP, let (X_1, \ldots, X_r) \in \mathcal{M}^r be a random r-tuple. For brevity we shall now call these r-tuples messages. Let P_r : \mathcal{M}^r \rightarrow [0,1] be defined by

P_r\bigl( (x_1, \ldots, x_r) \bigr) = p(x_1) \ldots p(x_r).

Since the source is memoryless, P_r\bigl( (x_1,\ldots, x_r)\bigr) is the probability that the first r symbols produced by the source are x_1, \ldots, x_r. Consider the random variable \log P_r\bigl( (X_1, \ldots, X_r) \bigr). By the displayed equation above,

\log P_r\bigl( (X_1,\ldots, X_r) \bigr) = \sum_{i=1}^r \log p(X_i).

This is a sum of r independent identically distributed random variables. We saw above that each has expected value -h. Hence, by the weak law, given any \eta > 0,

\displaystyle \mathbb{P}\Bigl[ -\frac{\log P_r\bigl( (X_1, \ldots, X_r) \bigr)}{r} \not\in (h-\eta, h+\eta) \Bigr] \rightarrow 0

as r \rightarrow \infty. By taking r sufficiently large so that the probability is less than \eta, we obtain a subset \mathcal{T} of \mathcal{M}^r such that (a)

\mathbb{P}[(X_1,\ldots, X_r) \in \mathcal{T}] > 1 - \eta

and (b) if (x_1,\ldots, x_r) \in \mathcal{T} then

P_r\bigl( (x_1, \ldots, x_r) \bigr) \in (2^{-r(h+\eta)}, 2^{-r(h-\eta)}).

The properties (a) and (b) say that memoryless sources satisfy the AEP. It is usual to call the elements of \mathcal{T} typical messages.

For example, if p(1) \ge \ldots \ge p(m) then the most likely message is (1,\ldots, 1) and the least likely is (m,\ldots, m); unless the distribution is uniform, both are atypical when \eta is small. A typical message will instead have each symbol i with frequency about rp(i). Indeed, the proof of the AEP for memoryless sources on page 80 of Welsh Codes and cryptography takes this as the definition of ‘typical’ and uses the Central Limit Theorem to replace the weak law so that ‘about’ can be made precise.

We can now prove the Source Coding Theorem in another way: if (x_1, \ldots, x_r) is typical then

P_r\bigl( (x_1,\ldots, x_r)\bigr) \ge 2^{-r(h+\eta)}

and so

\begin{aligned} 1 &\ge \mathbb{P}[(X_1,\ldots, X_r) \in \mathcal{T}] \\ &= \sum_{(x_1,\ldots, x_r) \in \mathcal{T}} P_r\bigl( (x_1, \ldots, x_r) \bigr) \\ &\ge |\mathcal{T}| 2^{-r(h+ \eta)}.\end{aligned}

Hence |\mathcal{T}| < 2^{r(h+\eta)}. We may therefore encode all the typical messages using codewords of length r(h + \eta) + 1. For the atypical messages we make no attempt at efficiency, and instead encode them using binary words of length N for some N > r \log_2 m, avoiding the prefixes already used for the typical words. Thus the code has (at most) two distinct lengths of codewords, and the expected length of a codeword is at most

r(h + \eta) + 1 + \eta N.

For r sufficiently large the expected length is at most

r(h + 2\eta) + \eta r \log_2 m;

when \eta is sufficiently small this is less than r(h+ \epsilon).

In Part C of my course I plan to give this proof before defining the AEP, and then take (a) and (b) as its definition for general sources. As a preliminary, I plan to prove the Weak Law of Large Numbers from Chebyshev’s Inequality.
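
The concentration that makes all of this work is easy to see empirically. A Python sketch (the source distribution, \eta and the number of trials are arbitrary choices); note that -\log_2 p(X) depends only on the value p(X), so we may sample the probabilities directly:

import random
from math import log2

p = [0.5, 0.3, 0.2]               # symbol probabilities
h = -sum(x * log2(x) for x in p)  # entropy, about 1.485
eta = 0.05

for r in [10, 100, 1000, 5000]:
    trials, atypical = 500, 0
    for _ in range(trials):
        # sample the r values p(X_1), ..., p(X_r) of a random message
        sample = random.choices(p, weights=p, k=r)
        surprise = -sum(log2(x) for x in sample)
        if abs(surprise / r - h) >= eta:
            atypical += 1
    print(r, atypical / trials)   # proportion of atypical messages: tends to 0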

AEP for general sources

Going beyond this, things quickly get difficult. Recall that a source producing symbols X_1, X_2, \ldots is stationary if (X_{i_1},\ldots,X_{i_t}) and (X_{{i_1}+s},\ldots, X_{i_t+s}) have the same distribution for all i_1, \ldots, i_t and s, and ergodic if with probability 1 the observed frequency of every t-tuple of symbols converges to its expected value. There is a nice proof using Fekete’s Lemma in Welsh Codes and cryptography (see page 78) that any stationary source has an entropy, defined by

\displaystyle \lim_{r \rightarrow \infty} \frac{H(X_1, X_2,\ldots, X_r)}{r}.

However the proof that a stationary ergodic source satisfies the AEP is too hard for my course. (For instance the sketch on Wikipedia blithely uses Lévy’s Martingale Convergence Theorem.) And while fairly intuitive, it is hard to prove that a memoryless source is ergodic (in fact this is essentially the Strong Law of Large Numbers), so there are difficulties even in connecting the memoryless motivation to the general result. All in all this seems to be a place to give examples and state results without proof.

Lossy source coding

In §10 of Cover and Thomas, Elements of information theory, Wiley (2006) there is a very general but highly technical result on encoding sources satisfying the AEP when a non-zero error probability is permitted. After much staring at the proof in §10.5 and disentangling of the definitions from rate-distortion theory, I was able to present a very simple version for the unbiased memoryless binary source in my course. In a rare question, someone asked about the (potentially) biased memoryless binary source that emits 1 with probability p \le \frac{1}{2}. Here is (one version of) the relevant result. Throughout let h(p) = H(p,1-p).

Theorem. Let 0 < D < p and let \epsilon > 0. Provided r is sufficiently large, there exists a code C \subseteq \{0,1\}^r such that

|C| = \lceil 2^{(h(p)-h(D)+\epsilon)r} \rceil

and an encoder f : \{0,1\}^r \rightarrow C such that

\mathbb{P}\bigl[d\bigl(f(S_1,\ldots, S_r), (S_1,\ldots, S_r)\bigr) > Dr\bigr] < \epsilon.

Here d denotes Hamming distance. So the conclusion is that with probability \ge 1- \epsilon, a random word emitted by the source can be encoded by a codeword in C obtained by flipping at most Dr bits. Therefore each bit is correct with probability at least 1-D. By sending the number of a codeword we may therefore compress the source by a factor of h(p)-h(D)+\epsilon, for every \epsilon > 0. Note that we assume D < p, since if a bit error probability exceeding p is acceptable then there is no need for any encoding: the receiver simply guesses that every bit is 0.

In the lecture, I could not see any good way to motivate the appearance of h(p) - h(D), so instead I claimed that one could compress by a factor of h(p)\bigl(1-h(D)\bigr). The argument I had in mind was that by the AEP (or Shannon's Noiseless Coding Theorem) one can represent the typical words emitted by the source using all binary words of length h(p)r. Then compress to h(p)(1-h(D))r bits by thinking of these binary words as emitted by an unbiased source, using the special case of the theorem above when p = \frac{1}{2}. The problem is that decoding gives a bitwise error probability of D on the binary words of length h(p)r used to encode the source, and this does not give good control of the bitwise error probability on the original binary words of length r emitted by the source.

In an attempt to motivate the outline proof below, we take one key idea in the proof of the more general result in Cover and Thomas: the code C should be obtained by taking a random word X emitted by the source and flipping bits to obtain a codeword Y such that I(X;Y) is minimized, subject to the constraint that the expectation of d(X,Y) is \le Dr.

In the special case I presented in the lecture all words are equally likely to be emitted by the source, and so flipping bits at random gives codewords distributed uniformly at random on \{0,1\}^r. As a warm up, we use such codewords to prove the theorem in this special case.

Proof when p = \frac{1}{2}. Fix w \in \{0,1\}^r and let P be the probability that a codeword U, chosen uniformly at random from \{0,1\}^r, is within distance Dr of w. Since U is uniformly distributed, P is the size of the Hamming ball of radius Dr about w divided by 2^r. Standard bounds show that

\displaystyle \frac{1}{(r+1)} \frac{2^{h(D)r}}{2^r} \le P \le \frac{2^{h(D)r}}{2^r}.

By the lower bound, if we pick slightly more than 2^{(1-h(D))r} codewords uniformly at random, then the expected number of codewords within distance Dr of w is at least 1. More precisely, let M = \lceil 2^{(1-h(D)+\epsilon)r} \rceil be as in the statement of the theorem above. Let C be a random code constructed by choosing M codewords independently and uniformly at random from \{0,1\}^r. The probability that no codeword in C is within distance Dr of w is at most

\displaystyle \bigl( 1 - \frac{1}{(r+1)2^{(1-h(D))r}} \bigr)^{2^{(1-h(D)+\epsilon)r}} \le \exp \bigl( -\frac{2^{\epsilon r}}{r+1} \bigr)

which tends to 0 as r \rightarrow \infty. Since w was arbitrary, the above also bounds the probability that no codeword is within distance Dr of a word in \{0,1\}^r chosen uniformly at random. Therefore this average probability, \overline{P} say, can also be made less than \epsilon. Hence there exists a particular code with codewords u(1), \ldots, u(M) such that, for this code, \overline{P} \le \epsilon. Thus at least 1-\epsilon of all words in \{0,1\}^r are within distance Dr of some codeword in C. \Box
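
The proof is concrete enough to simulate. The Python sketch below picks a random code of the stated size for one illustrative choice of parameters, and estimates by sampling the proportion of words within distance Dr of the code; the proportion tends to 1 as r grows.

import random
from math import ceil, log2

def h(x):
    return -x * log2(x) - (1 - x) * log2(1 - x)

r, D, eps = 20, 0.25, 0.15
M = ceil(2 ** ((1 - h(D) + eps) * r))   # about 110 codewords
code = [random.getrandbits(r) for _ in range(M)]

trials = 2000
covered = sum(
    any(bin(w ^ u).count("1") <= D * r for u in code)
    for w in (random.getrandbits(r) for _ in range(trials))
)
print(M, covered / trials)              # typically around 0.9 for these parameters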

It seemed highly intuitive to me that since, by the AEP, we need only consider those w of weight about pr, the right construction in general would be to take such words and construct random codewords by randomly flipping Dr of their bits. But some thought shows this cannot work: for example, take p = \frac{1}{3} and D = \frac{1}{6}. Then flipping bits at random gives codewords of average weight

\frac{1}{3}(1-\frac{1}{6})r + \frac{2}{3}\frac{1}{6}r = \frac{7}{18}r.

Let w = 1\ldots 10\ldots 0 be of weight \frac{1}{3}r. Then w is within distance Dr = \frac{1}{6}r of

\displaystyle \binom{\frac{1}{3}r}{\frac{1}{18}r} \binom{\frac{2}{3}r}{\frac{1}{9}r}  \approx 2^{\frac{1}{3} h(\frac{1}{6})r + \frac{2}{3} h(\frac{1}{6})r} = 2^{h(\frac{1}{6})r}

words of weight \frac{7}{18}r, obtained by flipping \frac{1}{18}r of the 1s and \frac{1}{9}r of the 0s. To pick a random code from words of this weight so that the expected number of codewords close to w is at least 1, we must choose M codewords where

\displaystyle M \times \frac{2^{h(\frac{1}{6})r} }{2^{h(\frac{7}{18})r}} \ge 1.

If the denominator was 2^{h(\frac{1}{3})r}, we could take M = 2^{(h(\frac{1}{3})-h(\frac{1}{6}))r} as required. But because we are stuck in the bigger space of words of weight (\frac{1}{3} + \frac{1}{18})r, we require a larger M.

Some of my confusion disappeared after doing the minimization of I(X;Y) more carefully. Since I(X;Y) = H(X) - H(X|Y) and H(X) = h(p) is constant, a good (although maybe not very intuitive) way to do this is to choose the probabilities \mathbb{P}[X=\alpha|Y=\beta] to maximize H(X|Y), consistent with the requirements that if \mathbb{P}[Y=1] = q then

(1-q)\mathbb{P}[X=0|Y=0] + q\mathbb{P}[X=0|Y=1] = 1-p

and

(1-q)\mathbb{P}[X=1|Y=0] + q\mathbb{P}[X=0|Y=1] \le D.

Replacing the inequality with equality (it is a safe assumption that the expected distortion is maximized when the mutual information is minimized), and solving the linear equations, one finds that the flip probabilities are

\mathbb{P}[X=1|Y=0] = \frac{D+p-q}{2(1-q)},\ \mathbb{P}[X=0|Y=1] = \frac{D+q-p}{2q}.

Thus one may pick any q such that p-D \le q \le p+D. Since H(X|Y) = (1-q) h(\mathbb{P}[X=1|Y=0]) + qh(\mathbb{P}[X=0|Y=1]), substituting from the equations above gives a formula for H(X|Y) in terms of p, D and q. Differentiating with respect to q (this is fiddly enough to warrant computer algebra), one finds that

\begin{aligned} \displaystyle 2\frac{\mathrm{d}H(X|Y)}{\mathrm{d}q} = \log_2& \frac{(D+p-q)(2-D-p-q)}{(1-q)^2} \\ &- \log_2 \frac{(D-p+q)(-D+p+q)}{q^2}\end{aligned}

and so the extrema occur when

\displaystyle \frac{(D+p-q)(2-D-p-q)}{(1-q)^2} = \frac{(D-p+q)(-D+p+q)}{q^2}.

For no reason that I can see, on multiplying out, all powers of q^3 and q^4 cancel, and one is left with the factoring quadratic

\bigl(p-D-q(1-2D)\bigr)\bigl(p-D+q(1-2p)\bigr).

Hence H(X|Y) is extremized only when q = (p-D)/(1-2D), when both flip probabilities are D, and when q = (p-D)/(1-2p), when \mathbb{P}[X=1|Y=0]=(D-Dp-p^2)/(1+D-3p) and \mathbb{P}[X=0|Y=1]=p. In the first solution the second derivative satisfies

\displaystyle \frac{\mathrm{d}^2H(X|Y)}{\mathrm{d}q^2} = -\frac{(1-2D)^2}{2(1-D)D(p-D)(1-D-p)}

so it is a maximum and the choice we make. Consistent with the failed attempt above, we have q < p, so the space of codewords is smaller than the space of source words. The extreme value is H(X|Y) = h(D). Again this is somewhat intuitive: the minimum mutual information consistent with a distortion of D should be the uncertainty in the (D,1-D) probability distribution. But note that the distortion is applied to codewords, rather than words emitted by the source. (A related remark is given at the end of this post.)

To emphasise the difference, the left diagram below shows the probabilities going from Y to X, relevant to H(X|Y). The right diagram shows the probabilities going from X to Y, relevant to H(Y|X) and needed in the proof below.

Flip probabilities

Note that the flip probabilities going from X to Y are not, in general, equal. In the example above with p=\frac{1}{3} and D = \frac{1}{6} we have q = \frac{1}{4} and the flip probabilities from X to Y are \gamma = \frac{1}{16} and \delta = \frac{3}{8}. The extreme value is H(X|Y) = h(\frac{1}{6}) \approx 0.650.

The second solution appears to be spurious, i.e. not corresponding to a zero of the derivative. But since two maxima are separated by another zero of the derivative, it cannot be a maximum. This solution exists only when D-Dp-p^2 \ge 0, a condition equivalent to D \ge \frac{p^2}{1-p}. In the example above q = \frac{1}{2} and (by complete chance) \mathbb{P}[X=1|Y=0] = 0 and \mathbb{P}[X=0|Y=1] = \frac{1}{3}. The extreme value is H(X|Y) = \frac{1}{2}\log_2 3 - \frac{1}{3} \approx 0.459.

Outline Proof. Let q = (p-D)/(1-2D). Let M = \lceil 2^{(h(p)-h(D)+\epsilon)r} \rceil as in the statement of the theorem. Choose codewords U(1), \ldots, U(M) independently and uniformly at random from the binary words of length r of weight qr. Let w be a fixed word of weight pr. When we flip \gamma(1-p)r of the 0s and \delta pr of the 1s in w we obtain a binary word of weight

\begin{aligned} pr + (1-p)\gamma r - p\delta r &= \Bigl( p + \frac{D(p-D)}{1-2D} - \frac{D(1-p-D)}{1-2D}\Bigr)r \\ &= \frac{p(1-2D) + D(p-D) - D(1-p-D)}{1-2D}\, r \\ &= \frac{(p - 2Dp) + (Dp - D^2) + (-D + Dp + D^2)}{1-2D}\, r \\ &= \frac{p-D}{1-2D}\, r \\ &= qr \end{aligned}

as expected. Moreover, the number of binary words of weight qr we obtain in this way is \binom{pr}{\delta pr} \binom{(1-p)r}{\gamma(1-p)r}. By the bounds above, the probability that a codeword of weight qr, chosen uniformly at random, is within distance Dr of w is about

2^{(p h(\delta) + (1-p) h(\gamma) - h(q))r}.

Considering all M codewords, the probability that no codeword is within distance Dr of w is, by the same exponential approximation used earlier, about

\exp (- 2^{(h(p)-h(D) + \epsilon + ph(\delta) + (1-p)h(\gamma) - h(q))r}).

Now, as one would hope, and as can be shown using Mathematica, with FullSimplify and the assumptions that 0 < p < 1 and 0 < D < p, we have h(p)-h(D) + ph(\delta) + (1-p)h(\gamma) - h(q) = 0. The end of the proof is therefore as before. \Box
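
Sign slips are easy to make here, so the identity and the flip probabilities can be spot-checked numerically. A Python sketch, assuming \gamma = qD/(1-p) and \delta = (1-q)D/p, which follow from Bayes’ rule applied to the two D-flips in the Y to X direction and agree with the values \frac{1}{16} and \frac{3}{8} in the example above:

from math import log2

def h(x):
    return -x * log2(x) - (1 - x) * log2(1 - x)

for p, D in [(1/3, 1/6), (0.4, 0.1), (0.25, 0.05)]:
    q = (p - D) / (1 - 2 * D)
    gamma = q * D / (1 - p)   # 0 -> 1 flip probability, going from X to Y
    delta = (1 - q) * D / p   # 1 -> 0 flip probability
    # flipping takes weight pr to qr
    assert abs(p + (1 - p) * gamma - p * delta - q) < 1e-12
    # the exponent at the end of the outline proof vanishes
    assert abs(h(p) - h(D) + p * h(delta) + (1 - p) * h(gamma) - h(q)) < 1e-12
print("exponent identity verified for the test parameters")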

Remark. It is very tempting to start on the other side, and argue that a codeword U chosen uniformly at random from the set of binary words of weight qr is within distance Dr of \binom{qr}{qrD}\binom{(1-q)r}{(1-q)rD} \approx 2^{qrh(D) + (1-q)r h(D)} = 2^{h(D)r} source words of weight pr, obtained by flipping exactly qrD of the 1s and (1-q)rD of the 0s. This avoids the horrendous algebraic simplification required in the proof above. But now it is the source word that is varying, not the codeword. One can argue this way by varying both, which is permitted by the AEP since typical source words have about pr 1s. This combines the two sources of randomness, the random code and the random source word, in a slightly subtle way that is typical of how the AEP is used in practice.