## Microsoft Teams doesn’t want you to attend meetings

April 8, 2020

The purpose of this post is to document two of the more blatant bugs I’ve encountered in my forced exposure to Microsoft Teams during the Coronavirus crisis. To add insult to injury, the only way to work around them is to use Microsoft Outlook, another piece of software riddled with deficiencies.

Let’s get started. Here is a screenshot of me scheduling a meeting at 15:22 for 16:00 today.

Notice anything odd? Probably not: what sane person checks the timezone? But, look closely: Microsoft Teams firmly believes that London is on UTC+00:00 (same as GMT), not BST. This belief is not shared by my system, or any other software on it that I can find.

Now let’s try to join the meeting. Okay it’s early, but we are keen. Here is a screenshot of me hovering with my mouse over the meeting.

There is no way to join. Double clicking on the meeting just gives a chance to reschedule it (maybe to a later date, when Microsoft has fixed this glaring deficiency). The ‘Meet now’ button starts an unrelated meeting.

Okay, maybe our mistake was to join an MS Teams meeting using MS Teams. Let’s try using the Outlook web calendar. Here is a screenshot.

Here is a close-up of the right-hand side.

On the one hand, the times say that the meeting started 46 minutes ago; on the other, it is ‘in 14 min’. Perhaps because of this temporal confusion, there is no way to join the meeting.

Finally, here is MS Teams at 16:00.

Nothing has changed, there is still no way to join the meeting.

The prosecution rests its case. How can Microsoft justify releasing such an inept piece of software?

## Stanley’s theory of P-partitions and the Hook Formula

March 22, 2020

The aim of this post is to introduce Stanley’s theory of labelled $\mathcal{P}_\preceq$-partitions and, as an application, give a short motivated proof of the Hook Formula for the number of standard Young tableaux of a given shape. In fact we prove the stronger $q$-analogue, in which hook lengths are replaced with quantum integers. All the ideas may be found in Chapter 7 of Stanley’s book Enumerative Combinatorics II so, in a particularly strong sense, no originality is claimed.

The division into four parts below and the recaps at the start of each part have the aim of reducing notational overload (the main difficulty in the early parts), while also giving convenient places to take a break.

### Part 1: Background on arithmetic partitions

Recall that an arithmetic partition of size $n$ is a weakly decreasing sequence of natural numbers whose sum is $n$. For example, there are $8$ partitions of $7$ with at most three parts, namely

$(7), (6,1), (5,2), (5,1,1), (4,3), (4,2,1), (3,3,1), (3,2,2),$

as represented by the Young diagrams shown below.

Since the Young diagram of a partition into at most $3$ parts is uniquely determined by its number of columns of lengths $1$, $2$ and $3$, such partitions are enumerated by the generating function

$\frac{1}{(1-q)(1-q^2)(1-q^3)} = 1 \!+\! q \!+ \!2q^2 \!+\! 3q^3 \!+\! 4q^4 \!+\! 5q^5 \!+\! 7q^6 \!+\! 8q^7 \!+\! \cdots$

For example $(4,2,1)$ has two columns of length 1, and one each of lengths 2 and 3. It is counted in the coefficient of $q^7$, obtained when we expand the geometric series by choosing $q^{1 \times 2}$ (two columns of length 1) from

$\displaystyle \frac{1}{1-q} = 1 + q + q^2 + \cdots,$

then $q^{2 \times 1}$ (one column of length 2) from

$\displaystyle \frac{1}{1-q^2} = 1+q^2 + q^4 + \cdots$

and finally $q^{3 \times 1}$ (one column of length 3) from

$\displaystyle \frac{1}{1-q^3} = 1+q^3+q^6 + \cdots .$

Now suppose that we are only interested in partitions where the first part is strictly bigger than the second. Then the Young diagram must have a column of size $1$, and so we replace $1/(1-q)$ in the generating function with $q/(1-q)$. Since the coefficient of $q^6$ above is $7$, it follows (without direct enumeration) that there are $7$ such partitions. What if the second part must also be strictly bigger than the third? Then the Young diagram must have a column of size $2$, and so we also replace $1/(1-q^2)$ with $q^2/(1-q^2)$. (I like to see this by mentally removing the first row: the remaining diagram then has a column of size $1$, by the case just seen.) By a routine generalization we get the following result: partitions with at most $k$ parts such that the $j$th largest part is strictly more than the $(j+1)$th largest part for all $j \in J \subseteq \{1,\ldots, k-1\}$ are enumerated by

$\displaystyle \frac{q^{\sum_{j \in J} j}}{(1-q)(1-q^2) \ldots (1-q^k)}.$
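As a sanity check, the counts and the generating function can be compared by brute force. The following Python sketch is mine, not part of the original post; the helper names are arbitrary.

```python
def partitions_at_most(k, n):
    """Weakly decreasing k-tuples of non-negative integers with sum n."""
    def gen(rem, prefix, max_part):
        if len(prefix) == k:
            if rem == 0:
                yield tuple(prefix)
            return
        for part in range(min(rem, max_part), -1, -1):
            yield from gen(rem - part, prefix + [part], part)
    yield from gen(n, [], n)

def gf_coeffs(k, strict, n_max):
    """Coefficients up to q^n_max of q^(sum of strict) / ((1-q)...(1-q^k))."""
    c = [1] + [0] * n_max
    for j in range(1, k + 1):
        for n in range(j, n_max + 1):   # multiply by 1/(1 - q^j)
            c[n] += c[n - j]
    shift = sum(strict)                  # multiply by q^(sum of J)
    return [0] * shift + c[:n_max + 1 - shift]

# 8 partitions of 7 with at most 3 parts, matching the coefficient of q^7
assert sum(1 for _ in partitions_at_most(3, 7)) == 8 == gf_coeffs(3, set(), 7)[7]

# 7 partitions of 7 whose first part is strictly bigger than the second (J = {1})
strict_count = sum(1 for p in partitions_at_most(3, 7) if p[0] > p[1])
assert strict_count == 7 == gf_coeffs(3, {1}, 7)[7]
```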

### Part 2: $\mathcal{P}_\preceq$-partitions

Let $\mathcal{P}$ be a poset with partial order $\preceq$. A $\mathcal{P}_\preceq$-partition is an order-preserving function $p : \mathcal{P} \rightarrow \mathbb{N}_0$.

For example, if $\mathcal{P} = \{1,\ldots, k\}$ with the usual order then $\bigl( p(1), \ldots, p(k)\bigr)$ is the sequence of values of a $\mathcal{P}$-partition if and only if $p(1) \le \ldots \le p(k)$. Thus, by removing any initial zeros and reversing the sequence, $\mathcal{P}$-partitions are in bijection with arithmetic partitions having at most $k$ parts.

I should mention that it is more usual to write $\mathcal{P}$ rather than $\mathcal{P}_\preceq$. More importantly, Stanley’s original definition has ‘order-reversing’ rather than ‘order-preserving’. This fits better with arithmetic partitions, and plane-partitions, but since our intended application is to reverse plane-partitions and semistandard Young tableaux, the definition as given (and used for instance in Stembridge’s generalization) is most convenient.

#### Reversed plane partitions

Formally the Young diagram of the partition $(4,2,1)$ is the set

$[(4,2,1)] = \{(1,1),(1,2),(1,3),(1,4),(2,1),(2,2),(3,1)\}$

of boxes. We partially order the boxes by (the transitive closure of) $(a,b) \preceq (a+1,b)$ and $(a,b) \preceq (a,b+1)$. This is shown diagrammatically below.

For this partial order, $\mathcal{P}_\preceq$-partitions correspond to assignments of non-negative integers to $[(4,2,1)]$ such that the rows and columns are weakly increasing when read left to right and top to bottom. For example, three of the $72$ $\mathcal{P}_\preceq$-partitions of size $6$ are shown below

Such assignments are known as reverse plane partitions. The proof of the Hook Formula given below depends on finding the generating function for reverse plane partitions in two different ways: first using the general theory of $\mathcal{P}$-partitions, and then in a more direct way, for instance using the Hillman–Grassl bijection.

#### Enumerating $\mathrm{RPP}(2,1)$

As a warm-up, we replace $(4,2,1)$ with the smaller partition $(2,1)$, so now $\mathcal{P}_\preceq = \{(1,1),(1,2),(2,1)\}$, ordered by $(1,1) \preceq (1,2)$, $(1,1) \preceq (2,1)$. Comparing $p(1,2)$ and $p(2,1)$, we divide the $\mathcal{P}_\preceq$-partitions into two disjoint classes: those with $p(1,2) \le p(2,1)$, and those with $p(1,2) > p(2,1)$. The first class satisfy

$p(1,1) \le p(1,2) \le p(2,1)$

and so are in bijection with arithmetic partitions with at most $3$ parts. The second class satisfy

$p(1,1) \le p(2,1) < p(1,2)$

so are in bijection with arithmetic partitions with at most $3$ parts whose largest part is strictly bigger than their second largest part. By the first section we deduce that

$\displaystyle \begin{aligned}\sum_{n=0}^\infty |\mathrm{RPP}_{(2,1)}(n)|q^n &= \frac{1+q}{(1-q)(1-q^2)(1-q^3)} \\&= \frac{1}{(1-q)^2(1-q^3)}.\end{aligned}$

The cancellation to leave a unit numerator shows one feature of the remarkably nice generating function for reverse plane partitions revealed in the final part below.
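As a quick check (again not in the original post), the coefficients of $1/(1-q)^2(1-q^3)$ can be compared with a direct enumeration of the triples $\bigl(p(1,1), p(1,2), p(2,1)\bigr)$:

```python
def rpp_21_count(n):
    """Triples (a, b, c) = (p(1,1), p(1,2), p(2,1)) with a <= b and a <= c."""
    count = 0
    for a in range(n + 1):
        for b in range(a, n + 1):
            c = n - a - b
            if c >= a:
                count += 1
    return count

def series(factors, n_max):
    """Coefficients up to q^n_max of 1 / prod over h in factors of (1 - q^h)."""
    coeffs = [1] + [0] * n_max
    for h in factors:
        for n in range(h, n_max + 1):
            coeffs[n] += coeffs[n - h]
    return coeffs

# 1/((1-q)^2 (1-q^3)) really does count RPPs of shape (2,1)
assert [rpp_21_count(n) for n in range(10)] == series([1, 1, 3], 9)
```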

#### Labelled $\mathcal{P}$-partitions

A surprisingly helpful way to keep track of the weak/strong inequalities seen above is to label the poset elements by natural numbers. We define a labelling of a poset $\mathcal{P}_\preceq$ of size $k$ to be a bijective function $L : \mathcal{P} \rightarrow \{1,\ldots, k\}$. Suppose that $y$ covers $x$ in the order on $\mathcal{P}_\preceq$. Either $L(x) < L(y)$, in which case we say the labelling is natural for $(x,y)$, or $L(x) > L(y)$, in which case we say the labelling is strict for $(x,y)$. A $(\mathcal{P}_\preceq,L)$-partition is then an order preserving function $p : \mathcal{P} \rightarrow \mathbb{N}_0$ such that if $x \prec y$ is a covering relation and

• $L(x) < L(y)$ (natural) then $p(x) \le p(y)$;
• $L(x) > L(y)$ (strict) then $p(x) < p(y)$.

Note that the role of the labelling is only to distinguish weak/strong inequalities: the poset itself determines whether $p(u) \le p(v)$ or $p(v) \le p(u)$ for each comparable pair $u$, $v \in \mathcal{P}$. If we drop the restriction that $x \prec y$ is a covering relation, and just require $x \prec y$, then we clearly define a subset of the labelled $(\mathcal{P}_\preceq, L)$-partitions, and it is not hard to see that in fact the definitions are equivalent. It feels most intuitive to me to state the definition as above.

Let $\mathrm{Par}(\mathcal{P}_\preceq, L)$ denote the set of $(\mathcal{P}_\preceq,L)$-partitions. For example, if $\mathcal{P} = \{(1,1),(1,2),(2,1)\}$ as in the example above, then the $\mathcal{P}_\preceq$-partitions are precisely the $(\mathcal{P}_\preceq, L)$-partitions for any all-natural labelling; the two choices are

$L(1,1) = 1, L(1,2) = 2, L(2,1) = 3,$

and

$L'(1,1) = 1, L'(1,2) = 3, L'(2,1) = 2.$

Working with $L$, the partitions with $p(1,1) \le p(1,2) \le p(2,1)$ form the set $\mathrm{Par}(\mathcal{P}_\unlhd, L)$ where $\unlhd$ is the total order refining $\preceq$ such that

$(1,1) \unlhd (1,2) \unlhd (2,1)$

and the partitions with $p(1,1) \le p(2,1) < p(1,2)$ form the set $\mathrm{Par}(\mathcal{P}_{\unlhd'}, L)$ where

$(1,1) \unlhd' (2,1) \unlhd' (1,2).$

The division of $\mathcal{P}$-partitions above is an instance of the following result.

Fundamental Lemma. Let $\mathcal{P}$ be a poset with partial order $\preceq$ and let $L : \mathcal{P} \rightarrow \{1,\ldots, k\}$ be a labelling. Then

$\mathrm{Par}(\mathcal{P}_\preceq, L) = \bigcup \mathrm{Par}(\mathcal{P}_\unlhd, L)$

where the union over all total orders $\unlhd$ refining $\preceq$ is disjoint.

Proof. Every $(\mathcal{P}_\preceq, L)$-partition appears in the right-hand side for some $\unlhd$: just choose $\unlhd$ so that if $p(x) < p(y)$ then $x \lhd y$, and if $p(x)=p(y)$ and $x \prec y$ then $x \lhd y$. On the other hand, suppose that $p \in \mathrm{Par}(\mathcal{P}_\preceq,L)$ is in both $\mathrm{Par}(\mathcal{P}_{\unlhd},L)$ and $\mathrm{Par}(\mathcal{P}_{\unlhd'},L)$ for distinct total orders $\unlhd$ and $\unlhd'$ refining $\preceq$. Then there exist $x$ and $y \in \mathcal{P}$, necessarily incomparable under $\preceq$, such that $x \lhd y$ and $y \lhd' x$. From $\mathcal{P}_\lhd$ we get $p(x) \le p(y)$ and from $\mathcal{P}_{\lhd'}$ we get $p(y) \le p(x)$. Therefore $p(x) = p(y)$. Now using the labelling for the first time, we may suppose without loss of generality that $L(x) < L(y)$; since $y \lhd' x$ the labelling is strict for the pair $(y,x)$ and so we have $p(y) < p(x)$, a contradiction. $\Box$

Suggested exercise. Show that the generating functions enumerating $\mathrm{RPP}(2,2)$ and $\mathrm{RPP}(3,1)$ are $1/(1-q)(1-q^2)^2(1-q^3)$ and

$\displaystyle\frac{1}{(1-q)^2(1-q^2)(1-q^4)};$

there are respectively $2$ and $3$ linear extensions that must be considered.
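If you want to check your answers to the exercise, the following Python sketch (mine, with arbitrary helper names) enumerates the linear extensions for a small shape, builds the generating function with numerator exponents given by the strict-inequality rule of Part 1, and compares with a brute-force count of reverse plane partitions.

```python
from itertools import permutations, product

def extensions(shape):
    """Linear extensions of the box order, as label words (row labelling)."""
    boxes = [(r, c) for r, l in enumerate(shape) for c in range(l)]
    label = {b: i + 1 for i, b in enumerate(boxes)}
    exts = []
    for word in permutations(boxes):
        pos = {b: i for i, b in enumerate(word)}
        if all(pos[b] < pos[nb] for b in boxes
               for nb in ((b[0] + 1, b[1]), (b[0], b[1] + 1)) if nb in pos):
            exts.append([label[b] for b in word])
    return exts

def gf(shape, n_max):
    """Sum over extensions of the Part 1 numerator, over (1-q)...(1-q^k)."""
    exts = extensions(shape)
    k = len(exts[0])
    c = [0] * (n_max + 1)
    for pi in exts:
        # strict step at position i (larger label first) forces exponent k - i
        e = sum(k - i for i in range(1, k) if pi[i - 1] > pi[i])
        if e <= n_max:
            c[e] += 1
    for j in range(1, k + 1):
        for n in range(j, n_max + 1):
            c[n] += c[n - j]
    return c

def rpp_counts(shape, n_max):
    """Brute-force count of reverse plane partitions of the shape, by size."""
    boxes = [(r, c) for r, l in enumerate(shape) for c in range(l)]
    counts = [0] * (n_max + 1)
    for vals in product(range(n_max + 1), repeat=len(boxes)):
        if sum(vals) <= n_max:
            p = dict(zip(boxes, vals))
            if all(p[b] <= p[nb] for b in boxes
                   for nb in ((b[0] + 1, b[1]), (b[0], b[1] + 1)) if nb in p):
                counts[sum(vals)] += 1
    return counts

assert len(extensions((2, 2))) == 2 and len(extensions((3, 1))) == 3
for shape in [(2, 2), (3, 1)]:
    assert gf(shape, 5) == rpp_counts(shape, 5)
```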

### Part 3: Permutations

Recall that $\mathcal{P}_{\preceq}$ is a poset, $L : \mathcal{P} \rightarrow \{1,\ldots, k\}$ is a bijective labelling and that a $(\mathcal{P}_{\preceq},L)$-partition is a function $p : \mathcal{P} \rightarrow \mathbb{N}_0$ such that if $x \preceq y$ then $p(x) \le p(y)$, with strict inequality whenever $L(x) > L(y)$.

#### Connection with permutations

We write permutations of $\{1,\ldots, k\}$ in one-line form as $\pi_1\ldots \pi_k$. Recall that $\pi$ has a descent in position $i$ if $\pi_i > \pi_{i+1}$.

Example. Let $(1,1) \preceq (1,2)$, $(1,1) \preceq (2,1)$ be the partial order used above to enumerate $\mathrm{RPP}(2,1)$, as labelled by $L(1,1) = 1$, $L(1,2) = 2$, $L(2,1) = 3$. The total order $(1,1) \unlhd (1,2) \unlhd (2,1)$ corresponds under $L$ to the identity permutation $123$ of $\{1,2,3\}$, with no descents. The total order $(1,1) \unlhd' (2,1) \unlhd' (1,2)$ corresponds under $L$ to the permutation $132$ swapping $2$ and $3$, with descent set $\{2\}$.

In general, let $x_1 \unlhd \ldots \unlhd x_k$ be a total order refining $\preceq$. Let $i \le k-1$ and consider the elements $x_i$ and $x_{i+1}$ of $\mathcal{P}$ labelled $L(x_i)$ and $L(x_{i+1})$. In any $\mathcal{P}_\unlhd$-partition $p$ we have $p(x_i) \le p(x_{i+1})$, with strict inequality required if and only if $L(x_i) > L(x_{i+1})$. Therefore, using the total order $\unlhd$ to identify $p$ with a function on $\{1,\ldots, k\}$, i.e. $i \mapsto p(x_i)$, we require $p(i) \le p(i+1)$ for all $i$, with strict inequality if and only if $L(x_i) > L(x_{i+1})$. Equivalently,

$p(1) \le \ldots \le p(k)$

with strict inequality whenever there is a descent $L(x_i)L(x_{i+1})$ in the permutation $L(x_1) \ldots L(x_k)$ corresponding under $L$ to $\unlhd$. Conversely, a permutation $\pi_1\ldots \pi_k$ of $\{1,\ldots, k\}$ corresponds to a total order refining $\preceq$ if and only if $L(x)$ appears left of $L(y)$ whenever $x \preceq y$. Therefore Stanley’s Fundamental Lemma may be restated as follows.

Fundamental Lemma restated. Let $\mathcal{P}$ be a poset with partial order $\preceq$ and let $L : \mathcal{P} \rightarrow \{1,\ldots, k\}$ be a labelling. Then, using the labelling $L$ to identify elements of $\mathrm{Par}(\mathcal{P}_\preceq, L)$ with functions on $\{1,\ldots, k\}$,

$\mathrm{Par}(\mathcal{P}_\preceq, L) = \bigcup P_\pi$

where $P_\pi$ is the set of all functions $p : \{1,\ldots, k\} \rightarrow \mathbb{N}_0$ such that $p(1) \le \ldots \le p(k)$ with strict inequality whenever $\pi_i > \pi_{i+1}$, and the union is over all permutations $\pi$ such that $L(x)$ appears to the left of $L(y)$ in the one-line form of $\pi$ whenever $x \preceq y$. Moreover the union is disjoint. $\Box$

The sequences $\bigl( p(1), \ldots, p(k) \bigr)$ above correspond to partitions whose $\bigl((k+1)-(i+1)\bigr)$th largest part is strictly more than their $((k+1)-i)$th largest part (which might be zero), and so are enumerated by

$\displaystyle \frac{q^{\sum_{i : \pi_i > \pi_{i+1}} (k-i)}}{(1-q)\ldots (1-q^k)}.$

The power of $q$ in the numerator is, by the standard definition, the comajor index of the permutation $\pi$. We conclude that

$\displaystyle \sum_{p \in \mathrm{Par}(\mathcal{P}_\preceq,L)} q^{|p|} = \frac{\sum_\pi q^{\mathrm{comaj}(\pi)}}{(1-q)\ldots (1-q^k)}$

where $|p| = \sum_{x \in \mathcal{P}} p(x)$ and the sum in the numerator is over all permutations $\pi$ as in the restated Fundamental Lemma.

#### Example: $\mathrm{RPP}(3,2)$

We enumerate $\mathrm{RPP}(3,2)$. There are $5$ linear extensions of the partial order on $[(3,2)]$,

• $(1,1) \unlhd (1,2) \unlhd (1,3) \unlhd (2,1) \unlhd (2,2)$
• $(1,1) \unlhd (1,2) \unlhd (2,1) \unlhd (1,3) \unlhd (2,2)$
• $(1,1) \unlhd (1,2) \unlhd (2,1) \unlhd (2,2) \unlhd (1,3)$
• $(1,1) \unlhd (2,1) \unlhd (1,2) \unlhd (1,3) \unlhd (2,2)$
• $(1,1) \unlhd (2,1) \unlhd (1,2) \unlhd (2,2) \unlhd (1,3)$

corresponding under the labelling in which row $1$ has labels $1$, $2$, $3$ and row $2$ has labels $4$, $5$,

to the permutations $12345, 12435, 12453, 14235, 14253$ with descent sets $\varnothing$, $\{3\}, \{4\}, \{2\}, \{2,4\}$ and comajor indices $0$, $2$, $1$, $3$, $4$, respectively. By the restatement of the fundamental lemma and the following remark,

$\begin{aligned}\sum_{n=0}^\infty |\mathrm{RPP}_{(3,2)}(n)|q^n &= \frac{1 + q^2 + q + q^3 + q^4}{(1-q)(1-q^2)(1-q^3)(1-q^4)(1-q^5)} \\ &= \frac{1}{(1-q)^2(1-q^2)(1-q^3)(1-q^4)}. \end{aligned}$
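The five permutations and their comajor indices can be recovered mechanically; here is a Python sketch of mine that does so directly from the box order on $[(3,2)]$.

```python
from itertools import permutations

# Row-by-row labelling of the boxes of [(3,2)]: 1, 2, 3 in row 1; 4, 5 in row 2
box = {1: (1, 1), 2: (1, 2), 3: (1, 3), 4: (2, 1), 5: (2, 2)}

def refines(pi):
    """Does the label word pi come from a linear extension of the box order?"""
    pos = {lab: i for i, lab in enumerate(pi)}
    return all(pos[x] < pos[y]
               for x in box for y in box
               if box[y] in ((box[x][0] + 1, box[x][1]),
                             (box[x][0], box[x][1] + 1)))

exts = sorted(pi for pi in permutations((1, 2, 3, 4, 5)) if refines(pi))
assert exts == [(1, 2, 3, 4, 5), (1, 2, 4, 3, 5), (1, 2, 4, 5, 3),
                (1, 4, 2, 3, 5), (1, 4, 2, 5, 3)]

def comaj(pi):
    """Comajor index: sum of k - i over descent positions i."""
    return sum(len(pi) - i for i in range(1, len(pi)) if pi[i - 1] > pi[i])

# the numerator is 1 + q + q^2 + q^3 + q^4, as in the display above
assert sorted(comaj(pi) for pi in exts) == [0, 1, 2, 3, 4]
```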

We end this part by digressing to outline a remarkably short proof of an identity due to MacMahon enumerating permutations by major index and descent count.

Exercise. If $\mathcal{P} = \{1,\ldots, k\}$ and all elements are incomparable under $\preceq$, then a $\mathcal{P}_\preceq$-partition is simply a function $p : \{1,\ldots, k\} \rightarrow \mathbb{N}_0$.

• How are such partitions enumerated by the restated Fundamental Lemma?
• Deduce that

$\displaystyle \frac{1}{(1-q)^k} = \frac{\sum_{\pi} q^{\mathrm{comaj}(\pi)}}{(1-q)\ldots (1-q^k)}$

where the sum is over all permutations of $\{1,\ldots, k\}$.

• Give an involution on permutations that preserves the number of descents and swaps the comajor and major indices. Deduce that $\mathrm{comaj}(\pi)$ can be replaced with $\mathrm{maj}(\pi)$ above.
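The first and last bullet points can be checked by machine. After clearing denominators the displayed identity says that $\sum_\pi q^{\mathrm{comaj}(\pi)} = [k]!_q$; the following Python sketch (not in the original post) computes the distributions of comaj and maj over all permutations of $\{1,\ldots,k\}$ and compares them with $[k]!_q$.

```python
from itertools import permutations

def comaj(pi):
    """Comajor index: sum of k - i over descent positions i (1-based)."""
    k = len(pi)
    return sum(k - i for i in range(1, k) if pi[i - 1] > pi[i])

def maj(pi):
    """Major index: sum of the descent positions i (1-based)."""
    return sum(i for i in range(1, len(pi)) if pi[i - 1] > pi[i])

def q_factorial(k):
    """Coefficient list of [k]!_q = [1]_q [2]_q ... [k]_q."""
    poly = [1]
    for r in range(2, k + 1):   # multiply by [r]_q = 1 + q + ... + q^(r-1)
        poly = [sum(poly[n - j] for j in range(r) if 0 <= n - j < len(poly))
                for n in range(len(poly) + r - 1)]
    return poly

def dist(stat, k):
    """Coefficient list of the sum over permutations of q^stat(pi)."""
    c = [0] * (k * (k - 1) // 2 + 1)
    for pi in permutations(range(1, k + 1)):
        c[stat(pi)] += 1
    return c

for k in range(1, 6):
    assert dist(comaj, k) == dist(maj, k) == q_factorial(k)
```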

Exercise. The argument at the start of this post shows that $1/(1-qt) \ldots (1-q^k t)$ enumerates partitions with at most $k$ parts by their size (power of $q$) and largest part (power of $t$).

• Show that partitions whose $j$th largest part is strictly larger than their $(j+1)$th largest part for all $j \in J \subseteq \{1,\ldots,k-1\}$ are enumerated in this sense by

$\displaystyle \frac{q^{\sum_{j \in J}j}t^{|J|}}{(1-qt)\ldots (1-q^kt)}.$

• Let $c_k(m)$ be the generating function, enumerating by size (power of $q$), for compositions with $k$ non-negative parts having $m$ as their largest part. Show that

$\displaystyle \sum_{m=0}^\infty c_k(m) t^m = \frac{\sum_{\pi} q^{\mathrm{comaj}(\pi)} t^{\mathrm{des}(\pi)}}{(1-qt)\ldots (1-q^k t)}$

where $\mathrm{des}(\pi)$ is the number of descents of $\pi$.

• Deduce that

$\displaystyle \sum_{m=0}^\infty \Bigl(\frac{q^{m+1}-1}{q-1}\Bigr)^k t^m = \frac{\sum_{\pi} q^{\mathrm{comaj}(\pi)} t^{\mathrm{des}(\pi)}}{(1-t)(1-qt)\ldots (1-q^k t)}.$

• Hence prove MacMahon’s identity
$\displaystyle \sum_{r=0}^\infty [r]_q^k t^r = \frac{\sum_\pi q^{\mathrm{maj}(\pi)} t^{\mathrm{des}(\pi)+1}}{(1-t)(1-qt)\ldots (1-q^k t)}$

where $[r]_q = (q^r-1)/(q-1) = 1 + q + \cdots + q^{r-1}$.

### Part 4: Proof of the Hook Formula

#### The restated Fundamental Lemma for reverse plane partitions

Fix a partition $\lambda$ of size $k$ and let $\mathcal{P}_{\preceq}$ be the poset whose elements are the boxes of the Young diagram $[\lambda]$ ordered by (the transitive closure of) $(a,b) \preceq (a+1,b)$ and $(a,b) \preceq (a,b+1)$. Let $\gamma_a$ be the sum of the $a$ largest parts of $\lambda$ (with $\gamma_0 = 0$), and define a labelling $L : [\lambda] \rightarrow \{1,\ldots, k\}$ so that the boxes in row $a$ of the Young diagram $[\lambda]$ are labelled $\gamma_{a-1}+1,\ldots, \gamma_a$. (This labelling was seen for $\lambda=(3,2)$ in the example at the end of the previous part.) Since this labelling is all-natural, the $(\mathcal{P}_\preceq, L)$-partitions are precisely the reversed plane partitions of shape $\lambda$. By the restated Fundamental Lemma and the following remark we get

$\displaystyle \sum_{n=0}^\infty |\mathrm{RPP}_\lambda(n)|q^n = \frac{\sum_{\pi} q^{\mathrm{comaj}(\pi)}}{(1-q) \ldots (1-q^k)}$

where the sum is over all permutations $\pi$ such that the linear order defined by $\pi$ refines $\preceq$. Equivalently, as seen in Part 3 and the final example, the label $L(a,b)$ appears to the left of both $L(a+1,b)$ and $L(a,b+1)$ in the one-line form of $\pi$ (when the boxes exist). This says that $\pi^{-1}_{L(a,b)} < \pi^{-1}_{L(a+1,b)}$ and $\pi^{-1}_{L(a,b)} < \pi^{-1}_{L(a,b+1)}$. Therefore, when we put $\pi^{-1}_i$ in the box of $[\lambda]$ with label $i$, we get a standard tableau. Moreover, $\pi$ has a descent in position $i$ if and only if $i$ appears to the right of $i+1$ in the one-line form of $\pi^{-1}$. Therefore, defining $\mathrm{comaj}(t) = \sum_i (k-i)$, where the sum is over all $i$ appearing strictly below $i+1$ in $t$, we have $\mathrm{comaj}(\pi) = \mathrm{comaj}(t)$. (Warning: this is the definition from page 4 of this paper of Krattenthaler, and is clearly convenient here; however it does not agree with the definition on page 364 of Stanley Enumerative Combinatorics II.) We conclude that

$\displaystyle \sum_{n=0}^\infty |\mathrm{RPP}_\lambda(n)|q^n = \frac{\sum_{t \in \mathrm{SYT}(\lambda)} q^{\mathrm{comaj}(t)}}{(1-q)\ldots (1-q^k)}$

where $\mathrm{SYT}(\lambda)$ is the set of standard tableaux of shape $\lambda$.

For example we saw at the end of the previous part that the refinements of $\preceq$ when $\lambda = (3,2)$ correspond to the permutations

$12345, 12435, 12453, 14235, 14253.$

Their comajor indices are $0$, $2$, $1$, $3$, $4$, their inverses are

$12345, 12435, 12534, 13425, 13524$

and the corresponding standard tableaux, obtained by putting $\pi^{-1}_i$ in the box of $[(3,2)]$ labelled $i$ are

Here the final tableau has $3$ strictly above $2$ and $5$ strictly above $4$, so its comajor index is $(5-2) + (5-4) = 4$.

#### Hillman–Grassl algorithm

Since this post is already rather long, I will refer to Section 7.22 of Stanley Enumerative Combinatorics II or, for a more detailed account that gives the details of how to invert the algorithm, Section 4.2 of Sagan The symmetric group: representations, combinatorial algorithms and symmetric functions. For our purposes, all we need is the remarkable corollary that

$\displaystyle \sum_{n=0}^\infty |\mathrm{RPP}_\lambda(n)|q^n = \frac{1}{\prod_{(a,b)\in[\lambda]} (1-q^{h_{(a,b)}})}$

where $h_{(a,b)}$ is the hook-length of the box $(a,b) \in [\lambda]$. We saw the special cases for the partitions $(2,1)$ and $(3,2)$ above.
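Although the Hillman–Grassl bijection itself is not reproduced here, its corollary is easy to test for small shapes. The following Python sketch (mine, not from either reference) computes hook lengths, expands the hook product as a power series, and compares with a brute-force count of reverse plane partitions.

```python
from itertools import product

def hooks(shape):
    """Hook lengths of the Young diagram of a partition (arm + leg + 1)."""
    col_len = [sum(1 for l in shape if l > c) for c in range(shape[0])]
    return [(l - c - 1) + (col_len[c] - r - 1) + 1
            for r, l in enumerate(shape) for c in range(l)]

def hook_series(shape, n_max):
    """Coefficients up to q^n_max of 1 / prod over boxes of (1 - q^hook)."""
    coeffs = [1] + [0] * n_max
    for h in hooks(shape):
        for n in range(h, n_max + 1):
            coeffs[n] += coeffs[n - h]
    return coeffs

def rpp_counts(shape, n_max):
    """Brute-force count of reverse plane partitions of the shape, by size."""
    boxes = [(r, c) for r, l in enumerate(shape) for c in range(l)]
    counts = [0] * (n_max + 1)
    for vals in product(range(n_max + 1), repeat=len(boxes)):
        if sum(vals) <= n_max:
            p = dict(zip(boxes, vals))
            if all(p[b] <= p[nb] for b in boxes
                   for nb in ((b[0] + 1, b[1]), (b[0], b[1] + 1)) if nb in p):
                counts[sum(vals)] += 1
    return counts

for shape in [(2, 1), (2, 2), (3, 1), (3, 2)]:
    assert rpp_counts(shape, 5) == hook_series(shape, 5)
```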

This formula can also be derived by letting the number of variables tend to infinity in Stanley’s Hook Content Formula: see Theorem 7.21.2 in Stanley’s book, or Theorem 2.6 in this joint paper with Rowena Paget, where we give the representation theoretic context.

#### Proof of the Hook Formula

Combining the results of the two previous subsections we get

$\displaystyle \frac{1}{\prod_{(a,b)\in[\lambda]} (1-q^{h_{(a,b)}})} = \frac{\sum_{t \in \mathrm{SYT}(\lambda)} q^{\mathrm{comaj}(t)}}{(1-q)\ldots (1-q^k)}.$

Equivalently, using the quantum integer notation $[r]_q = (1-q^r)/(1-q)$ and $[k]!_q = [k]_q \ldots [1]_q$, we have

$\displaystyle \sum_{t \in \mathrm{SYT}(\lambda)} q^{\mathrm{comaj}(t)} = \frac{[k]!_q}{\prod_{(a,b) \in [\lambda]} [h_{(a,b)}]_q}.$

This is the $q$-analogue of the Hook Formula; the weaker normal version is obtained by setting $q=1$ to get

$\displaystyle |\mathrm{SYT}(\lambda)| = \frac{k!}{\prod_{(a,b) \in [\lambda]} h_{(a,b)}}.$
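To see the $q$-analogue in action, here is a Python sketch of mine that generates $\mathrm{SYT}(3,2)$, computes $\mathrm{comaj}$, and checks the formula in the cleared-denominator form $\bigl(\sum_t q^{\mathrm{comaj}(t)}\bigr)\prod_{(a,b)} [h_{(a,b)}]_q = [5]!_q$.

```python
from itertools import permutations

def polymul(f, g):
    """Multiply two polynomials given as coefficient lists."""
    h = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            h[i + j] += a * b
    return h

def qint(r):
    """Coefficient list of the quantum integer [r]_q = 1 + q + ... + q^(r-1)."""
    return [1] * r

# All standard Young tableaux of shape (3,2), boxes listed row by row
boxes = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]
syt = []
for entries in permutations(range(1, 6)):
    t = dict(zip(boxes, entries))
    if all(t[b] < t[nb] for b in boxes
           for nb in ((b[0] + 1, b[1]), (b[0], b[1] + 1)) if nb in t):
        syt.append(t)

def comaj(t, k=5):
    """Sum of k - i over all i lying in a strictly lower row than i + 1."""
    row = {entry: b[0] for b, entry in t.items()}
    return sum(k - i for i in range(1, k) if row[i] > row[i + 1])

lhs = [0] * 5
for t in syt:
    lhs[comaj(t)] += 1            # coefficient list of sum of q^comaj(t)

left = lhs                         # multiply by [h]_q for hooks 4, 3, 1, 2, 1
for h in (4, 3, 1, 2, 1):
    left = polymul(left, qint(h))
right = [1]                        # [5]!_q
for r in (1, 2, 3, 4, 5):
    right = polymul(right, qint(r))
assert len(syt) == 5 and left == right
```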

## What makes a good/bad lecture?

December 30, 2019

In the opinion of 45 Royal Holloway 2nd year students the answers, taking only those mentioned at least three times, are:

#### What makes a good lecture?

• Engaging/enthusiastic lecturer (22)
• Interactive (8)
• Clear voice (7)
• Eye contact with audience (5)
• Checking for understanding (4)
• Clear (4)
• Clear handwriting (4)
• Jokes/humour (4)
• Seems interested in what they’re saying (4)
• Well prepared (4)
• Examples (3)
• Sound excited/animated (3)
• High quality notes (3)

#### What makes a bad lecture?

• Too quiet (18)
• Facing the board (9)
• Not engaging with audience (8)
• Monotone (7)
• Poor handwriting (7)
• Boring (5)
• Not enough explanation (5)
• Slow (3)
• Too much talking (3)

## Putnam 2018: determinant of an adjacency matrix

October 20, 2019

I took in the October issue of the American Mathematical Monthly to entertain me during the lulls while manning the mathematics desk at our open day on Saturday. As is traditional, the first article gave the questions in this year’s Putnam. Question A2 asked for the determinant of the adjacency matrix of the graph on the non-empty subsets of $\{1,2,\ldots, n\}$ in which two subsets are connected if and only if they intersect.

Three hours later, hardly overwhelmed by student visits (although I did get to wheel out infinitely many primes and $0.\dot{9} = 1$, to the evident interest of several potential students, and their parents), I had at most the outline of a fiddly solution that piled row operations on half-remembered determinantal identities.

Here is the far simpler solution I found when it came to write it up. The real purpose of this post is to make the connection with the Schur algebra and permutation representations of the symmetric group, and to record some related matrices that also have surprisingly simple determinants or spectral behaviour. I also include an explicit formula for the inverse of the adjacency matrix; this can be expressed using the Möbius function of the poset of subsets of $\{1,2,\ldots, n\}$, for reasons I do not yet fully understand.

A2. Let $S_1, S_2, \ldots, S_{2^n-1}$ be the nonempty subsets of $\{1,2,\ldots, n\}$ in some order and let $M$ be the $(2^n-1) \times (2^n-1)$ matrix whose $(i,j)$ entry is

$\displaystyle m_{ij} = \begin{cases} 0 & S_i \cap S_j = \varnothing \\ 1 & S_i \cap S_j \neq \varnothing \end{cases}$

Calculate the determinant of $M$.

Solution. As is implicit in the question, we can order the sets as we like, since swapping two rows and then swapping the two corresponding columns does not change the determinant of $M$. We put the $2^{n-1} - 1$ sets not having $1$ first, then the $2^{n-1}$ sets having $1$. With this order, continued recursively, the matrix $M_n$ for $n$-subsets has the block form

$\left( \begin{matrix} M_{n-1} & R_{n-1} \\ R_{n-1}^t & K_{n-1} \end{matrix} \right)$

where $R_k$ is the $(2^k-1) \times 2^k$ matrix with first column all zero and then $M_{k}$ to its right, and $K_k$ is the $2^k \times 2^k$ all-ones matrix. For example,

$M_3 = \left( \begin{matrix} 1 & 0 & 1 & {\bf 0} & 1 & 0 & 1 \\ 0 & 1 & 1 & {\bf 0} & 0 & 1 & 1 \\ 1 & 1 & 1 & {\bf 0} & 1 & 1 & 1 \\ {\bf 0} & {\bf 0} & {\bf 0} & 1 & 1 & 1 & 1 \\ 1& 0 & 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 & 1 &1 &1 \\ 1 & 1 & 1 & 1 & 1 &1 &1 \end{matrix} \right)$

where the exceptional zero column in $R_{2}$ and zero row in $R_{2}^t$ are shown in bold.

Clearly $\det M_1 = 1$. We shall show, by induction on $n$, that $\det M_n = -1$ if $n \ge 2$. The $3 \times 3$ matrix $M_2$ can be seen as a submatrix of $M_3$ above; computing the determinant as a sum over the $6$ permutations of $\{1,2,3\}$ gives $1 - 1 -1 = -1$, as required.

For the inductive step, observe that row $2^{n-1}$ of $M_n$ has $2^{n-1}-1$ zeros (from the top row of $R_{n-1}^t$) followed by $2^{n-1}$ ones (from the top row of $K_{n-1}$). By row operations subtracting this row from each of the rows below, we obtain

$\left( \begin{matrix} M_{n-1} & 0_{(2^{n-1}-1) \times 1} & M_{n-1} \\ 0_{1 \times (2^{n-1}-1)} & 1 & 1_{1 \times (2^{n-1}-1)} \\ M_{n-1} & 0_{(2^{n-1}-1) \times 1} & 0_{(2^{n-1}-1) \times (2^{n-1}-1)} \end{matrix} \right).$

The unique non-zero entry in column $2^{n-1}$ is the $1$ in position $(2^{n-1},2^{n-1})$. Therefore, expanding the determinant along this column and then deleting row and column $2^{n-1}$, we find that

$\det M_n = \det \left( \begin{matrix} M_{n-1} & M_{n-1} \\ M_{n-1} & 0_{(2^{n-1}-1) \times (2^{n-1}-1)} \end{matrix} \right).$

Given the zero matrix in the bottom right, a permutation making a non-zero contribution to the determinant must match the $2^{n-1}-1$ right-hand columns with the $2^{n-1}-1$ top rows; this uses up all the top rows, so no entry can be picked from the top-left copy of $M_{n-1}$. Factoring each such permutation through the swap of the top and bottom blocks of rows, which has sign $(-1)^{2^{n-1}-1} = -1$, we get

$\det M_n = -(\det M_{n-1})^2 = -1$

as claimed.
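The recursion can be confirmed numerically. Here is a Python sketch (not part of the original solution; `putnam_matrix` and `det` are my own helper names) that builds $M_n$ and computes its determinant exactly over the rationals.

```python
from fractions import Fraction
from itertools import combinations

def putnam_matrix(n):
    """0/1 intersection matrix on the non-empty subsets of {1,...,n}."""
    sets = [frozenset(s) for k in range(1, n + 1)
            for s in combinations(range(1, n + 1), k)]
    return [[1 if x & y else 0 for y in sets] for x in sets]

def det(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    a = [[Fraction(v) for v in row] for row in matrix]
    m, sign = len(a), 1
    for i in range(m):
        pivot = next((r for r in range(i, m) if a[r][i] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            sign = -sign
        for r in range(i + 1, m):
            factor = a[r][i] / a[i][i]
            for c in range(i, m):
                a[r][c] -= factor * a[i][c]
    result = Fraction(sign)
    for i in range(m):
        result *= a[i][i]
    return result

assert det(putnam_matrix(1)) == 1
for n in range(2, 6):
    assert det(putnam_matrix(n)) == -1
```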

#### Inverse

The inverse matrix $M_n^{-1}$ can be computed explicitly: labelling rows and columns by subsets of $\{1,2,\ldots, n\}$ we have

$\displaystyle -(M_n^{-1})_{XY} = \begin{cases} (-1)^{|Y \backslash X'|} & Y \supseteq X' \\ 0 & Y \not\supseteq X' \end{cases}$

where $X'$ is the complement of $X$ in $\{1,2,\ldots, n\}$. Can the appearance of the Möbius function $\mu(Z,Y) = (-1)^{|Y \backslash Z|}$ of the poset of subsets of $\{1,2,\ldots, n\}$ be explained as an instance of Möbius inversion?
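The claimed formula for the inverse is easy to test by matrix multiplication; the following Python sketch (mine) builds both matrices and checks that their product is the identity for small $n$.

```python
from itertools import combinations

def check_inverse(n):
    """Does the claimed formula give the inverse of the Putnam matrix?"""
    universe = frozenset(range(1, n + 1))
    sets = [frozenset(s) for k in range(1, n + 1)
            for s in combinations(range(1, n + 1), k)]
    M = [[1 if x & y else 0 for y in sets] for x in sets]
    # Claimed: (M^-1)_{XY} = -(-1)^|Y \ X'| if Y contains X', and 0 otherwise
    N = [[-((-1) ** len(y - (universe - x))) if y >= universe - x else 0
          for y in sets] for x in sets]
    m = len(sets)
    prod = [[sum(M[i][l] * N[l][j] for l in range(m)) for j in range(m)]
            for i in range(m)]
    return all(prod[i][j] == (1 if i == j else 0)
               for i in range(m) for j in range(m))

assert check_inverse(2) and check_inverse(3) and check_inverse(4)
```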

#### Symmetric group

Let $\Omega$ be the set of all subsets of $\{1,2,\ldots, n\}$ and let $\mathbb{F}$ be an infinite field. The vector space $\mathbb{F}\Omega$ with basis $\Omega$ is a permutation module for the symmetric group $S_n$ on $\{1,2,\ldots, n\}$. The Putnam matrix, extended by an extra zero row and column corresponding to the empty set, is the endomorphism of this permutation module that sends a subset to the sum of all the subsets that it intersects. This interpretation of the Putnam matrix suggests a rich family of variations on the theme.

#### Variations on the Putnam matrix

##### Diagonal blocks

Fix $k \in \{1,\ldots, n\}$ with $k \le n/2$. Let $\Omega^{(k)}$ be the set of $k$-subsets of $\{1,2,\ldots, n\}$ and let $M^{(k)}_n$ be the block of $M_n$ corresponding to $k$-subsets of $\{1,2,\ldots, n\}$. Thus $M^{(k)}_n$ is the endomorphism of the permutation module $\mathbb{F}\Omega^{(k)}$ sending a $k$-subset to the sum of all the $k$-subsets that it intersects. The permutation module $\mathbb{F}\Omega^{(k)}$ has a multiplicity-free direct sum decomposition

$\mathbb{F}\Omega^{(k)} = U_0 \oplus U_1 \oplus \cdots \oplus U_k$

where $U_j$ has irreducible character canonically labelled by the partition $(n-j,j)$. (Since $k \le n/2$ this is a partition.) It therefore follows from Schur’s Lemma that $M_n^{(k)}$ has $k+1$ integer eigenvalues whose multiplicities are the dimensions of the $U_i$. Varying $j$ and $n$, these numbers form a triangle whose diagonal entries, i.e. those for which $j = \lfloor n/2 \rfloor$, are the Catalan Numbers.

For example, by working with the filtration of $\mathbb{F}\Omega^{(2)}$ by the Specht modules $S^{(n-2,2)} \cong U_2$, $S^{(n-1,1)} \cong U_1$ and $S^{(n)} \cong U_0$ one can show that the eigenvalues are $2n-3$, $n-3$ and $-1$ with multiplicities $1$, $n-1$ and $n(n-3)/2$, respectively. Computational data suggests that in general the eigenvalue associated to the partition $(n-j,j)$ for $j \ge 1$ is $(-1)^{j-1} \binom{n-k-j}{k-j}$; probably this can be proved by a generalization of the filtration argument. (But maybe there is a better approach just using character theory.) Since the unique trivial submodule is spanned by the sum of all $k$-subsets, the eigenvalue for the partition $(n)$ is simply the number of $k$-subsets meeting $\{1,2,\ldots, k\}$, namely $\binom{n}{k} - \binom{n-k}{k}$.
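One cheap check on these eigenvalues (here for $k = 2$, using the values $2n-3$, $n-3$ and $-1$): since $M^{(2)}_n$ is symmetric, it suffices to see that $(M - (2n-3)I)(M - (n-3)I)(M + I)$ vanishes. This Python sketch of mine does that for $n = 6$, together with a trace check on the stated multiplicities.

```python
from itertools import combinations

n, k = 6, 2
sets = [frozenset(s) for s in combinations(range(1, n + 1), k)]
dim = len(sets)
M = [[1 if x & y else 0 for y in sets] for x in sets]

def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(dim))
             for j in range(dim)] for i in range(dim)]

def plus_scalar(A, c):
    """A + c * identity."""
    return [[A[i][j] + (c if i == j else 0) for j in range(dim)]
            for i in range(dim)]

# (M - (2n-3)I)(M - (n-3)I)(M + I) = 0, so eigenvalues lie in {2n-3, n-3, -1}
product_matrix = matmul(matmul(plus_scalar(M, -(2 * n - 3)),
                               plus_scalar(M, -(n - 3))),
                        plus_scalar(M, 1))
assert all(v == 0 for row in product_matrix for v in row)

# trace check on the multiplicities 1, n-1 and n(n-3)/2
assert sum(M[i][i] for i in range(dim)) == \
       (2 * n - 3) + (n - 3) * (n - 1) + (-1) * (n * (n - 3) // 2)
```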

##### Weighting additions and removals by indeterminates

Let $P_n(\alpha,\beta, \theta)$ be the generalization of the Putnam matrix where the entry remains zero for disjoint subsets, and if $X$ meets $Y$ then the entry is $\alpha^{|Y\backslash X|}\beta^{|X\backslash Y|}\theta^{2|X\cap Y|}$. For example, $P_2(\alpha,\beta,\theta)$ is

$\left( \begin{matrix} \theta^2 & 0 & \alpha\theta^2 \\ 0 & \theta^2 & \alpha \theta^2 \\ \beta\theta^2 & \beta\theta^2 & \theta^4 \end{matrix} \right)$

where rows and columns are labelled $\{1\}, \{2\}, \{1,2\}$, and $P_3(\alpha,\beta,1)$ is shown below, with rows and columns labelled $\{1\}$, $\{2\}$, $\{1,2\}$, $\{3\}$, $\{1,3\}$, $\{2,3\}$, $\{1,2,3\}$

$\left( \begin{matrix} 1 & 0 & \alpha & 0 & \alpha & 0 & \alpha^2 \\ 0 & 1 & \alpha & 0 & 0 & \alpha & \alpha^2 \\ \beta & \beta & 1 & 0 & \alpha\beta & \alpha\beta & \alpha \\ 0 & 0 & 0 & 1 & \alpha & \alpha & \alpha^2 \\ \beta & 0 & \alpha\beta & \beta & 1 & \alpha\beta & \alpha \\ 0 & \beta & \alpha\beta & \beta & \alpha\beta & 1 & \alpha \\ \beta^2 & \beta^2 & \beta & \beta^2 & \beta & \beta & 1 \end{matrix} \right).$

Specializing $\alpha, \beta$ and $\theta$ to $1$ we get the Putnam matrix for $n = 3$ shown above. When the determinant of $P_n(\alpha,\beta,\theta)$ is written as a sum over the $(2^n-1)!$ permutations of the collection of non-empty subsets of $\{1,\ldots, n\}$, each summand factorizes as a product over the disjoint cycles in the corresponding permutation. If $(X_0, \ldots, X_{\ell-1})$ is such a cycle then the power of $\alpha$ is $\sum_{i=1}^{\ell}|X_i \backslash X_{i-1}|$ (taking $X_{\ell} = X_0$), and similarly the power of $\beta$ is $\sum_{i=1}^{\ell} |X_{i-1}\backslash X_i|$. Clearly these are the same. Moreover, by the identity

$|X\backslash Y| + |Y\backslash X| + 2|X\cap Y| = |X| + |Y|,$

after taking the product over all disjoint cycles, each summand has total degree $2\sum_{k=1}^n k \binom{n}{k} = n2^n$ in $\alpha$, $\beta$ and $\theta$. Therefore the determinant of the matrix is a homogeneous polynomial in $\alpha, \beta, \theta$ of degree $n2^n$. Since swapping $\alpha$ and $\beta$ corresponds to transposing the matrix, the determinant is a function of $\alpha + \beta$, $\alpha\beta$ and $\theta$. We have seen that each summand of the determinant has the same power of $\alpha$ as $\beta$, and so the determinant depends on $\alpha$ and $\beta$ only through $\gamma = \alpha\beta$. Therefore the determinant is a polynomial in $\gamma = \alpha\beta$ and $\tau = \theta^2$, homogeneous of degree $n2^{n-1}$ when $\gamma$ and $\tau$ are each given degree $1$. Hence we may specialize so that $\alpha\beta = 1$ and replace $\theta^2$ with $\tau$ without losing any information. (This turns out to be the most convenient form for computation because, using MAGMA at least, it is fast to compute determinants of matrices whose coefficients lie in the single variable polynomial ring $\mathbb{Q}[\tau]$ but much slower if the coefficients lie in $\mathbb{Q}[\alpha,\beta,\tau]$.) The determinants for small $n$ are shown below.

$\begin{array}{ll} 1 & \tau \\ 2 & \tau^3(\tau-2) \\ 3 & \tau^7(\tau-2)^3(\tau^2-3\tau+3)\\ 4 & \tau^{15}(\tau-2)^7 (\tau^2-3\tau+3)^4 (\tau^2-2\tau+2)\end{array}$
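For small $n$ this table is easy to check by machine. The sketch below (in Python rather than MAGMA; the helper names `subsets`, `det` and `putnam_det` are mine) assumes, as in the specialization above, that the matrix has $(X,Y)$ entry $\tau^{|X \cap Y|}$ when $X$ meets $Y$ and $0$ otherwise, and compares exact determinants with the claimed factorizations at several integer values of $\tau$.

```python
from fractions import Fraction
from itertools import combinations

def subsets(n):
    """Non-empty subsets of {1, ..., n}, as frozensets."""
    items = range(1, n + 1)
    return [frozenset(c) for k in range(1, n + 1) for c in combinations(items, k)]

def det(matrix):
    """Exact determinant by fraction-valued Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in matrix]
    size, sign, result = len(m), 1, Fraction(1)
    for i in range(size):
        pivot = next((r for r in range(i, size) if m[r][i] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != i:
            m[i], m[pivot] = m[pivot], m[i]
            sign = -sign
        result *= m[i][i]
        for r in range(i + 1, size):
            factor = m[r][i] / m[i][i]
            for c in range(i, size):
                m[r][c] -= factor * m[i][c]
    return sign * result

def putnam_det(n, tau):
    """Determinant of the specialized matrix: entry tau^|X cap Y| if X meets Y."""
    S = subsets(n)
    return det([[tau ** len(X & Y) if X & Y else 0 for Y in S] for X in S])

f2 = lambda t: t - 2
f3 = lambda t: t * t - 3 * t + 3
for tau in [-1, 2, 3, 5]:
    assert putnam_det(1, tau) == tau
    assert putnam_det(2, tau) == tau ** 3 * f2(tau)
    assert putnam_det(3, tau) == tau ** 7 * f2(tau) ** 3 * f3(tau)
```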

To continue, it is useful to define $f_2 = \tau - 2$ and for odd primes $p$, a polynomial $f_p \in \mathbb{Z}[\tau]$ of degree $p-1$ by $\tau f_p = (\tau-1)^p + 1$. (The definition becomes uniform on changing $1$ to $(-1)^{p-1}$.) For instance $f_3 = \tau^2 - 3\tau + 3$ and

$f_5 = \tau^4 - 5\tau^3 + 10 \tau^2 - 10 \tau + 5.$

We also define the following further polynomials:

\begin{aligned} f_4 &= \tau^2 - 2\tau + 2 \\ f_6 &= \tau^2 - \tau + 1 \\ f_8 &= \tau^4 - 4\tau^3 + 6\tau^2 -4\tau + 2 \\ f_{9} &= \tau^6 - 6\tau^5 + 15\tau^4 - 21\tau^3 + 18\tau^2 - 9\tau + 3 \\ f_{10} &= \tau^4 - 3\tau^3 + 4\tau^2 - 2\tau + 1 \\ f_{12} &= \tau^4 -4\tau^3 + 5\tau^2 - 2\tau + 1 \end{aligned}

Using this notation, the determinants for $n \le 13$ are

$\begin{array}{ll} 1 & \tau \\ 2 & \tau^3 f_2 \\ 3 & \tau^7 f_2^3 f_3 \\ 4 & \tau^{15} f_2^7 f_3^4 f_4 \\ 5 & \tau^{31} f_2^{15} f_3^{10} f_4^{5} f_5 \\ 6 & \tau^{63} f_2^{31}f_3^{21}f_4^{15}f_5^6 f_6 \\ 7 & \tau^{127} f_2^{63} f_3^{42} f_4^{35} f_5^{21} f_6^{7} f_7 \\ 8 & \tau^{255} f_2^{127} f_3^{84} f_4^{71} f_5^{56}f_6^{28} f_7^8 f_8 \\ 9 & \tau^{511} f_2^{255} f_3^{169} f_4^{135} f_5^{126} f_6^{84} f_7^{36} f_8^{9} f_9 \\ 10 & \tau^{1023} f_2^{511} f_3^{340} f_4^{255} f_5^{253}f_6^{210}f_7^{120}f_8^{45} f_9^{10}f_{10}\\ 11 &\tau^{2047}f_2^{1023}f_3^{682}f_4^{495}f_5^{473}f_6^{462}f_7^{330}f_8^{165}f_9^{55}f_{10}^{11}f_{11} \\ 12 & \tau^{4095} f_2^{2047} f_3^{1365}f_4^{991}f_5^{858} f_6^{925}f_7^{792} f_8^{495} f_9^{220}f_{10}^{66} f_{11}^{12}f_{12} \\ 13 & \tau^{8191}f_2^{4095}f_3^{2730}f_4^{2015}f_5^{1573} f_6^{1729}f_7^{1716}f_8^{1287}f_9^{715}f_{10}^{286}f_{11}^{78}f_{12}^{13}f_{13} \end{array}$

Note that $f_n$ has degree $\phi(n)$ in every case where it is defined, and that $f_n(1) = 1$ for all $n$ except $n =2$, for which $f_2(1) = -1$. Assuming that this behaviour continues, and that $f_2$ always has exponent $2^{n-1}-1$, as suggested by the data above, this gives another solution to the Putnam problem.

##### Weighting by integers

Given $(w_0,w_1,\ldots, w_n) \in \mathbb{Z}^{n+1}$, let $Q_n(w)$ be the generalization of the Putnam matrix where if $X$ meets $Y$ then the entry is $w_{|X\cap Y|}$. When $w_0 = 0$ and $w_i = (-1)^{i-1}$ for $1 \le i \le n$, the determinants are given by specializing the matrix above by $\alpha = 1$, $\beta = 1$ and $\theta = \sqrt{-1}$, and so they are obtained from the factorizations in $\mathbb{Q}[\tau]$ presented above by specializing $\tau$ to $-1$.

The evaluations at $-1$ of the polynomials defined above are as follows: $f_2(-1) = -3$, $f_p(-1) = 2^p - 1$ for $p$ an odd prime, and $f_4(-1) = 5$, $f_6(-1) = 3$, $f_8(-1) = 17$, $f_9(-1) = 73$, $f_{10}(-1) = 11$, $f_{12}(-1) = 13$.

Now suppose that $w_i = (-1)^{i}$ for each $i$, so now $w_0 = 1$. For instance the matrix for $n =3$ is

$\left( \begin{matrix} - & + & - & + & - & + & - \\ + & - & - & + & + & - & - \\ - & - & + & + & - & - & + \\ + & + & + & - & - & - & - \\ - & + & - & - & + & - & + \\ + & - & - & - & - & + & + \\ - & - & + & - & + & + & - \end{matrix}\right)$

where $+$ denotes $1$ and $-$ denotes $-1$. The determinants for $n \in \{1,\ldots, 10\}$ are

$1, 2^2, 2^9, 2^{28}, 2^{75}, 2^{186}, 2^{441}, 2^{1016}, 2^{2295}, 2^{5110}.$

The sequence of exponents is in OEIS A058877, as the number of labelled acyclic digraphs with $n$ vertices containing exactly $n-1$ points of in-degree zero. The determinant itself enumerates unions of directed cycles, with sign and a weighting, in the complete graph on $n$ vertices; does this give the connection, or is it a new interpretation of sequence A058877?
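This is easy to verify computationally for small $n$: the sketch below (Python; the helper names `subsets`, `det` and `sign_det` are mine) builds the matrix with $(X,Y)$ entry $(-1)^{|X \cap Y|}$ and checks that the absolute value of its determinant is $2^{n(2^{n-1}-1)}$, matching the exponents $0, 2, 9, 28$ above.

```python
from fractions import Fraction
from itertools import combinations

def subsets(n):
    """Non-empty subsets of {1, ..., n}, as frozensets."""
    items = range(1, n + 1)
    return [frozenset(c) for k in range(1, n + 1) for c in combinations(items, k)]

def det(matrix):
    """Exact determinant by fraction-valued Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in matrix]
    size, sign, result = len(m), 1, Fraction(1)
    for i in range(size):
        pivot = next((r for r in range(i, size) if m[r][i] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != i:
            m[i], m[pivot] = m[pivot], m[i]
            sign = -sign
        result *= m[i][i]
        for r in range(i + 1, size):
            factor = m[r][i] / m[i][i]
            for c in range(i, size):
                m[r][c] -= factor * m[i][c]
    return sign * result

def sign_det(n):
    """Determinant of the matrix with (X, Y) entry (-1)^|X cap Y|."""
    S = subsets(n)
    return det([[(-1) ** len(X & Y) for Y in S] for X in S])

for n in range(1, 5):
    assert abs(sign_det(n)) == 2 ** (n * (2 ** (n - 1) - 1))
```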

#### Schur algebra

Let me finish by making the connection with an apparently completely unrelated object. Consider representations of the general linear group $\mathrm{GL}_2(\mathbb{F})$ in which the entries of the representing matrices are polynomials of a fixed degree in the four entries of each $\mathrm{GL}_2(\mathbb{F})$ matrix. For example, let $E = \langle e_1, e_2 \rangle$ be the natural representation of $\mathrm{GL}_2(\mathbb{F})$. Then $E$ itself is a polynomial representation of degree $1$, and its symmetric square $\mathrm{Sym}^2 \!E$ with basis $e_1^2, e_1e_2, e_2^2$ is a polynomial representation of degree $2$, in which

$\left( \begin{matrix} \alpha & \beta \\ \gamma & \delta \end{matrix} \right) \mapsto \left( \begin{matrix} \alpha^2 & 2\alpha\beta & \beta^2 \\ \alpha\gamma & \alpha\delta + \beta\gamma & \beta\delta \\ \gamma^2 & 2\gamma\delta & \delta^2 \end{matrix} \right).$
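That this really is a representation, i.e. that the $3\times 3$ matrices multiply compatibly with the $2\times 2$ matrices, is easily checked by machine. A minimal Python sketch (the function names `sym2` and `matmul` are mine):

```python
def sym2(m):
    """Matrix of Sym^2 E on the basis e1^2, e1e2, e2^2, rows giving images."""
    (a, b), (c, d) = m
    return [[a * a, 2 * a * b, b * b],
            [a * c, a * d + b * c, b * d],
            [c * c, 2 * c * d, d * d]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

g, k = [[1, 2], [3, 4]], [[2, 0], [1, 1]]
assert sym2(matmul(g, k)) == matmul(sym2(g), sym2(k))   # homomorphism property
assert sym2([[1, 0], [0, 1]]) == [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```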

By definition, $\mathrm{Sym}^2 \! E$ is a quotient of $E \otimes E$. More generally, any polynomial representation of degree $n$ is a subquotient of a direct sum of copies of $E^{\otimes n}$. Observe that there is a linear isomorphism $E^{\otimes n} \rightarrow \mathbb{F}\Omega$ defined by

$e_{i_1} \otimes e_{i_2} \otimes \cdots \otimes e_{i_n} \mapsto \bigl\{j \in \{1,2,\ldots, n\} : i_j = 2 \bigr\}.$

The action of $S_n$ on $\mathbb{F}\Omega$ induces the place permutation action on $E^{\otimes n}$, defined (using right actions) by

$e_{i_1} \otimes e_{i_2} \otimes \cdots \otimes e_{i_n} \sigma = e_{i_{1\sigma^{-1}}} \otimes e_{i_{2\sigma^{-1}}} \otimes \cdots \otimes e_{i_{n\sigma^{-1}}}.$

For example, the $3$-cycle $(1,2,3)$ sends $e_{a} \otimes e_{b} \otimes e_{c}$ to $e_{c} \otimes e_{a} \otimes e_{b}$, moving each factor one to the right (with wrap around). This gives some motivation for introducing the Schur algebra $S(2,n)$, defined to be the algebra of linear endomorphisms of the polynomial representation $E^{\otimes n}$ of $\mathrm{GL}_2(\mathbb{F})$ that commute with the place permutation action of $S_n$. In symbols,

$S(2,n) = \mathrm{End}_{\mathbb{F}S_n}(E^{\otimes n}).$

Somewhat remarkably one can show that the category of polynomial representations of $\mathrm{GL}_2(\mathbb{F})$ of degree $n$ is equivalent to the module category of the Schur algebra. (See for instance the introduction to Green’s lecture notes.) The extended Putnam matrix corresponds to the element of the Schur algebra sending the pure tensor $e_{i_1} \otimes \cdots \otimes e_{i_n}$ to the sum of all the pure tensors $e_{j_1} \otimes \cdots \otimes e_{j_n}$ such that $i_k = j_k = 2$ for at least one $k$. For example, if $n=2$ then

$\left( \begin{matrix} \alpha & \beta \\ \gamma & \delta \end{matrix} \right) \mapsto \left( \begin{matrix} \alpha^2 & \alpha\beta & \alpha\beta & \beta^2 \\ \alpha\gamma & \alpha\delta & \beta\gamma & \beta\delta \\ \alpha\gamma & \beta\gamma & \alpha\delta & \beta\delta \\ \gamma^2 & \gamma\delta & \gamma\delta & \delta^2 \end{matrix} \right), \ M_2 \mapsto \left( \begin{matrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 1 \end{matrix} \right)$

We see that $S(2,2)$ is $10$-dimensional, and, since the image of $M_2$ has equal coefficients in the required positions, it lies in $S(2,2)$ as expected. For another example,

$\left( \begin{matrix} 1 & 0 \\ \delta & 1 \end{matrix} \right) \mapsto \left( \begin{matrix} 1 & 0 & 0 & 0 \\ \delta & 1 & 0 & 0 \\ \delta & 0 & 1 & 0 \\ \delta^2 & \delta & \delta & 1 \end{matrix} \right)$

corresponds to the endomorphism of $\mathbb{F}\Omega$ sending a subset of $\{1,2\}$ to the sum of all its subsets, weighted by a power of $\delta$ recording how many elements were removed. This case is unusual in that the Schur algebra element comes from a single matrix in $\mathrm{GL}_2(\mathbb{F})$, rather than a linear combination of matrices lying in the group algebra $\mathbb{F}\mathrm{GL}_2(\mathbb{F})$, as would be required for the Putnam matrix. Passing to the Schur algebra replaces the infinite dimensional algebra $\mathbb{F}\mathrm{GL}_2(\mathbb{F})$ with a finite dimensional quotient.
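For the $n = 2$ examples above, membership of the Schur algebra is a small matrix computation: an endomorphism lies in it if and only if it commutes with the permutation matrix swapping the middle two basis vectors $e_1 \otimes e_2$ and $e_2 \otimes e_1$. A minimal Python check (the names are mine):

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

# Place permutation matrix of the transposition in S_2, on the basis
# e11, e12, e21, e22 of E tensor E.
P = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

# Element of the Schur algebra corresponding to the extended Putnam matrix.
M2 = [[0, 0, 0, 0], [0, 1, 0, 1], [0, 0, 1, 1], [0, 1, 1, 1]]

# P is an involution, so commuting with P is the same as P * M2 * P == M2.
assert matmul(P, matmul(M2, P)) == M2
```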

## Two proofs of Shannon’s Source Coding Theorem and an extension to lossy coding

September 29, 2019

The first purpose of this post is to record two approaches to Shannon’s Source Coding Theorem for memoryless sources, first by explicit codes and secondly using statistical properties of the source. After that I try, with mixed success, to give a motivated proof of a more general result on lossy coding.

#### Setup and statement.

Suppose that a memoryless source produces the symbols $1, \ldots, m$ with probabilities $p(1), \ldots, p(m)$, respectively. Let

$h = H(p(1),\ldots, p(m))$

be the entropy of this probability distribution. Since the source is memoryless, the entropy of an $r$-tuple of symbols is then $rh$. Shannon’s Source Coding Theorem (or First Main Theorem, or Noiseless Coding Theorem) states that, given $\epsilon > 0$, provided $r$ is sufficiently large, there is a variable length binary code of average length less than $r(h + \epsilon)$ that can be used to encode all $r$-tuples of symbols.

#### Proof by Kraft–McMillan inequality

Choose $r$ sufficiently large so that $r > 1/\epsilon$. Let $q(1), \ldots, q(m^r)$ be the probabilities of the $r$-tuples and let $\ell(j) \in \mathbb{N}$ be least such that $\ell(j) \ge \log_2 1/q(j)$. Equivalently, $\ell(j)$ is least such that

$2^{-\ell(j)} \le q(j).$

This equivalent characterisation shows that

$\sum_{j=1}^{m^r} 2^{-\ell(j)} \le \sum_{j=1}^{m^r} q(j) =1$

so by the Kraft–McMillan inequality, there exists a prefix free binary code with these codeword lengths. Moreover, since

$\ell(j) = \lceil \log_2 1/q(j) \rceil \le 1 + \log_2 \frac{1}{q(j)}$

we have

\begin{aligned}\sum_{j=1}^{m^r} q(j)\ell(j) &\le \sum_{j=1}^{m^r} q(j) \bigl( 1 + \log_2 \frac{1}{q(j)} \bigr) \\ &= \sum_{j=1}^{m^r} q(j) + \sum_{j=1}^{m^r} q(j) \log_2 \frac{1}{q(j)} \\ &= 1 + hr \end{aligned}

as required.

The required prefix-free code is the Shannon code for the probabilities $q(1), \ldots, q(m^r)$. Another algorithmic construction is given by the proof of the sufficiency of the Kraft–McMillan inequality on page 17 of Welsh Codes and cryptography. (The same proof is indicated, rather less clearly in my view, on the Wikipedia page.)
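The whole construction is short enough to simulate. The sketch below (Python; the example source with probabilities $\frac12, \frac14, \frac14$ is my own choice) computes the lengths $\ell(j)$ for all $r$-tuples, then verifies the Kraft–McMillan sum and the bound on the average length.

```python
import math
from itertools import product

# A memoryless source: symbols and their probabilities (dyadic, for neatness).
p = {'a': 0.5, 'b': 0.25, 'c': 0.25}
r = 3
h = -sum(q * math.log2(q) for q in p.values())   # entropy of one symbol

kraft = avg = 0.0
for tup in product(p, repeat=r):
    q = math.prod(p[s] for s in tup)             # probability of the r-tuple
    ell = math.ceil(math.log2(1 / q))            # least ell with 2^-ell <= q
    kraft += 2.0 ** -ell
    avg += q * ell

assert kraft <= 1 + 1e-9      # Kraft-McMillan: a prefix-free code exists
assert avg <= 1 + h * r       # average length at most 1 + hr
```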

#### Proof using Asymptotic Equipartition Property (AEP).

Despite its forbidding name, for memoryless sources the AEP is little more than the Weak Law of Large Numbers, albeit applied in a slightly subtle way to random variables whose values are certain (functions of) probabilities. Let $\mathcal{M} = \{1,\ldots, m\}$ be the probability space of symbols produced by the source. Let $X \in \mathcal{M}$ be a random symbol produced by the source. Then $\log p(X)$ is a random variable. Summing over the $m$ outcomes in the probability space, we find that

$\mathbb{E}[\log p(X)] = \sum_{i=1}^m \mathbb{P}[X=i] \log p(i) = \sum_{i=1}^m p(i) \log p(i) = -h$

where $h = H(p(1), \ldots, p(m))$ is the entropy already seen. While straightforward, and giving some motivation for considering $\log p(X)$, the calculation hides some subtleties. For example, is it true that

$\mathbb{P}[\log p(X) = \log p(i)] = p(i)?$

In general the answer is ‘no’: in fact equality holds if and only if $i$ is the unique symbol with probability $p(i)$. (One further thing, that I certainly will not even dare to mention in my course, is that, by the formal definition of a random variable, $X$ is the identity function on $\mathcal{M}$, and so $p(X)$ is a composition of two functions; in Example 2.3.1 of Goldie and Pinch Communication Theory the authors remark that forming such compositions is a ‘characteristic feature of information theory, not found elsewhere in probability theory’.)

Anyway, returning to the AEP, let $(X_1, \ldots, X_r) \in \mathcal{M}^r$ be a random $r$-tuple. For brevity we shall now call these $r$-tuples messages. Let $P_r : \mathcal{M}^r \rightarrow [0,1]$ be defined by

$P_r\bigl( (x_1, \ldots, x_r) \bigr) = p(x_1) \ldots p(x_r)$

Since the source is memoryless, $P_r\bigl( (x_1,\ldots, x_r)\bigr)$ is the probability that the first $r$ symbols produced by the source are $x_1, \ldots, x_r$. Consider the random variable $\log P_r\bigl( (X_1, \ldots, X_r) \bigr)$. By the displayed equation above,

$\log P_r\bigl( (X_1,\ldots, X_r) \bigr) = \sum_{i=1}^r \log p(X_i).$

This is a sum of $r$ independent identically distributed random variables. We saw above that each has expected value $-h$. Hence, by the weak law, given any $\eta > 0$,

$\displaystyle \mathbb{P}\Bigl[ -\frac{\log P_r\bigl( (X_1, \ldots, X_r) \bigr)}{r} \not\in (h-\eta, h+\eta) \Bigr] \rightarrow 0$

as $r \rightarrow \infty$. By taking $r$ sufficiently large so that the probability is less than $\eta$, we obtain a subset $\mathcal{T}$ of $\mathcal{M}^r$ such that (a)

$\mathbb{P}[(X_1,\ldots, X_r) \in \mathcal{T}] > 1 - \eta$

and (b) if $(x_1,\ldots, x_r) \in \mathcal{T}$ then

$P_r\bigl( (x_1, \ldots, x_r) \bigr) \in (2^{-r(h+\eta)}, 2^{-r(h-\eta)}).$

The properties (a) and (b) say that memoryless sources satisfy the AEP. It is usual to call the elements of $\mathcal{T}$ typical messages.

For example, if $p(1) \ge \ldots \ge p(m)$ then the most likely message is $(1,\ldots, 1)$ and the least likely is $(m,\ldots, m)$; unless the distribution is uniform, both are atypical when $\eta$ is small. A typical message will instead have each symbol $i$ with frequency about $rp(i)$. Indeed, the proof of the AEP for memoryless sources on page 80 of Welsh Codes and cryptography takes this as the definition of typical, using the Central Limit Theorem in place of the weak law so that ‘about’ can be defined precisely.

We can now prove the Source Coding Theorem in another way: if $(x_1, \ldots, x_r)$ is typical then

$P_r\bigl( (x_1,\ldots, x_r)\bigr) \ge 2^{-r(h+\eta)}$

and so

\begin{aligned} 1 &\ge \mathbb{P}[(X_1,\ldots, X_r) \in \mathcal{T}] \\ &= \sum_{(x_1,\ldots, x_r) \in \mathcal{T}} P_r\bigl( (x_1, \ldots, x_r) \bigr) \\ &\ge |\mathcal{T}| 2^{-r(h+ \eta)}.\end{aligned}

Hence $|\mathcal{T}| \le 2^{r(h+\eta)}$, and so the typical messages can be encoded using binary words of some fixed length $L$ with $2^L \ge 2^{r(h+\eta)}$. For the atypical messages we make no attempt at efficiency, and instead encode them using binary words of length $M$ for some $M > r \log_2 m$. Thus the code has (at most) two distinct lengths of codewords, and the expected length of a codeword is at most

$L + \eta M.$

By taking $r$ sufficiently large and $L = \lfloor r(h+ \eta)\rfloor + 1$, we may assume that $L < r(h + 2\eta)$. Therefore the expected length is at most

$r(h + 2\eta) + \eta r \log_2 m$;

when $\eta$ is sufficiently small this is less than $r(h+ \epsilon)$.
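The concentration that drives this proof is easy to see empirically. The following Python sketch (the source probabilities and parameters are my own choices) samples messages from a memoryless source and checks that $-\frac{1}{r}\log_2 P_r$ is almost always within $\eta$ of $h$:

```python
import math
import random

random.seed(0)
p = [0.6, 0.3, 0.1]                              # p(1), p(2), p(3)
h = -sum(q * math.log2(q) for q in p)            # entropy, about 1.295
r, eta, trials = 2000, 0.05, 500

def sample_rate():
    """-log2 P_r(X_1, ..., X_r) / r for one random message."""
    xs = random.choices(range(3), weights=p, k=r)
    return -sum(math.log2(p[x]) for x in xs) / r

atypical = sum(1 for _ in range(trials)
               if abs(sample_rate() - h) >= eta)
assert atypical / trials < 0.05                  # almost all messages are typical
```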

In Part C of my course I plan to give this proof before defining the AEP, and then take (a) and (b) as its definition for general sources. As a preliminary, I plan to prove the Weak Law of Large Numbers from Chebyshev’s Inequality.

#### AEP for general sources

Going beyond this, things quickly get difficult. Recall that a source producing symbols $X_1, X_2, \ldots$ is stationary if $(X_{i_1},\ldots,X_{i_t})$ and $(X_{i_1+s},\ldots, X_{i_t+s})$ have the same distribution for all $i_1, \ldots, i_t$ and $s$, and ergodic if with probability $1$ the observed frequency of every $t$-tuple of symbols converges to its expected value. There is a nice proof using Fekete’s Lemma in Welsh Codes and cryptography (see page 78) that any stationary source has an entropy, defined by

$\displaystyle \lim_{r \rightarrow \infty} \frac{H(X_1, X_2,\ldots, X_r)}{r}.$

However the proof that a stationary ergodic source satisfies the AEP is too hard for my course. (For instance the sketch on Wikipedia blithely uses Lévy’s Martingale Convergence Theorem.) And while fairly intuitive, it is hard to prove that a memoryless source is ergodic (in fact this is essentially the Strong Law of Large Numbers), so there are difficulties even in connecting the memoryless motivation to the general result. All in all this seems to be a place to give examples and state results without proof.

#### Lossy source coding

In §10 of Cover and Thomas, Elements of information theory, Wiley (2006) there is a very general but highly technical result on encoding sources satisfying the AEP when a non-zero error probability is permitted. After much staring at the proof in §10.5 and disentangling of the definitions from rate-distortion theory, I was able to present a very simple version for the unbiased memoryless binary source in my course. As a rare question, someone asked about the biased memoryless binary source. Here is (one version of) the relevant result. Throughout let $h(p) = H(p,1-p)$.

Theorem. Let $\epsilon > 0$ and let $0 < D < p$. Provided $r$ is sufficiently large, there exists a code $C \subseteq \{0,1\}^r$ such that

$|C| = \lceil 2^{(h(p)-h(D)+\epsilon)r} \rceil$

and an encoder $f : \{0,1\}^r \rightarrow C$ such that

$\mathbb{P}\bigl[d\bigl(S_1\ldots S_r, f(S_1\ldots S_r)\bigr) > Dr\bigr] < \epsilon$.

Here $d$ denotes Hamming distance. So the conclusion is that with probability $\ge 1- \epsilon$, a random word emitted by the source can be encoded by a codeword in $C$ obtained by flipping at most $Dr$ bits. Therefore each bit is correct with probability at least $1-D$. By sending the number of a codeword we may therefore compress the source by a factor of $h(p)-h(D)+\epsilon$, for every $\epsilon > 0$. Note that we assume $D < p$, since if a bit error probability exceeding $p$ is acceptable then there is no need for any encoding: the receiver simply guesses that every bit is $0$.

In the lecture, I could not see any good way to motivate the appearance of $h(p) - h(D)$, so instead I claimed that one could compress by a factor of $h(p)\bigl(1-h(D)\bigr)$. The argument I had in mind was that by the AEP (or Shannon's Noiseless Coding Theorem) one can represent the typical words emitted by the source using all binary words of length $h(p)r$. Then compress to $h(p)(1-h(D))r$ bits by thinking of these binary words as emitted by an unbiased source, using the special case of the theorem above when $p = \frac{1}{2}$. The problem is that decoding gives a bitwise error probability of $D$ on the binary words of length $h(p)r$ used to encode the source, and this does not give good control of the bitwise error probability on the original binary words of length $r$ emitted by the source.

In an attempt to motivate the outline proof below, we take one key idea in the proof of the more general result in Cover and Thomas: the code $C$ should be obtained by taking a random word $X$ emitted by the source and flipping bits to obtain a codeword $Y$ such that $I(X;Y)$ is minimized, subject to the constraint that the expectation of $d(X,Y)$ is $\le Dr$.

In the special case I presented in the lecture all words are equally likely to be emitted by the source, and so flipping bits at random gives codewords distributed uniformly at random on $\{0,1\}^r$. As a warm up, we use such codewords to prove the theorem in this special case.

Proof when $p = \frac{1}{2}$. Fix $w \in \{0,1\}^r$ and let $P$ be the probability that a codeword $U$, chosen uniformly at random from $\{0,1\}^r$, is within distance $Dr$ of $w$. Since $U$ is uniformly distributed, $P$ is the size of the Hamming ball of radius $Dr$ about $w$ divided by $2^r$. Standard bounds show that

$\displaystyle \frac{1}{(r+1)} \frac{2^{h(D)r}}{2^r} \le P \le \frac{2^{h(D)r}}{2^r}.$

By the lower bound, if we pick slightly more than $2^{(1-h(D))r}$ codewords uniformly at random, then the expected number of codewords within distance $Dr$ of $w$ is at least $1$. More precisely, let $M = \lceil 2^{(1-h(D)+\epsilon)r} \rceil$ be as in the statement of the theorem above. Let $C$ be a random code constructed by choosing $M$ codewords independently and uniformly at random from $\{0,1\}^r$. The probability that no codeword in $C$ is within distance $Dr$ of $w$ is at most

$\displaystyle \bigl( 1 - \frac{1}{2^{(1-h(D))r}} \bigr)^{2^{(1-h(D)+\epsilon)r}} \le \exp \bigl( -\frac{2^{(1-h(D)+\epsilon)r}}{2^{(1-h(D))r}} \bigr) = \exp (-2^{\epsilon r})$

which tends to $0$ as $r \rightarrow \infty$. Since $w$ was arbitrary, the above is also the probability that no codeword is within distance $Dr$ of a word in $\{0,1\}^r$ chosen uniformly at random. Therefore this average probability, $\overline{P}$ say, can also be made less than $\epsilon$. Hence there exists a particular code with codewords $u(1), \ldots, u(M)$ such that, for this code, $\overline{P} \le \epsilon$. Thus at least $1-\epsilon$ of all words in $\{0,1\}^r$ are within distance $Dr$ of some codeword in $C$. $\Box$
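A Monte Carlo version of this argument runs comfortably for small $r$. The sketch below (Python; the parameters $r = 24$, $D = \frac14$, $\epsilon = 0.2$ are my own choices) draws a random code of the size in the theorem and estimates the proportion of words within distance $Dr$ of some codeword:

```python
import math
import random

random.seed(1)
r, D, eps = 24, 0.25, 0.2

def h(x):
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

M = math.ceil(2 ** ((1 - h(D) + eps) * r))       # code size from the theorem
code = [random.getrandbits(r) for _ in range(M)]

samples = 200
covered = 0
for _ in range(samples):
    w = random.getrandbits(r)
    # is some codeword within Hamming distance Dr of w?
    if any(bin(w ^ u).count('1') <= D * r for u in code):
        covered += 1
assert covered / samples > 1 - eps
```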

It seemed highly intuitive to me that, since by the AEP, we need only consider those $w$ of weight about $pr$, the right construction in general would be to take such words and construct random codewords by randomly flipping $Dr$ of their bits. But some thought shows this cannot work: for example, take $p = \frac{1}{3}$ and $D = \frac{1}{6}$. Then flipping bits at random gives codewords of average weight

$\frac{1}{3}(1-\frac{1}{6})r + \frac{2}{3}\frac{1}{6}r = \frac{7}{18}r.$

Let $w = 1\ldots 10\ldots 0$ be of weight $\frac{1}{3}r$. Then $w$ is at distance $\frac{1}{6}r$ from

$\displaystyle \binom{\frac{1}{3}r}{\frac{1}{18}r} \binom{\frac{2}{3}r}{\frac{1}{9}r} \approx 2^{\frac{1}{3} h(\frac{1}{6})r + \frac{2}{3} h(\frac{1}{6})r} = 2^{h(\frac{1}{6})r}$

words of weight $\frac{7}{18}r$, and to pick a random code from words of this weight so that the expected number of codewords close to $w$ is at least $1$, we must choose $M$ codewords where

$\displaystyle M \times \frac{2^{h(\frac{1}{6})r} }{2^{h(\frac{7}{18})r}} \ge 1$

If the denominator was $2^{h(\frac{1}{3})r}$, we could take $M = 2^{(h(\frac{1}{3})-h(\frac{1}{6}))r}$ as required. But because we are stuck in the bigger space of words of weight $(\frac{1}{3} + \frac{1}{18})r$, we require a larger $M$.
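The numbers bear this out: a quick check (Python; the names are mine) confirms that the codewords produced by naive random flipping live in a strictly bigger space, so more of them are needed than the theorem allows.

```python
import math

def h(x):
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

p, D = 1 / 3, 1 / 6

# Average weight of a codeword obtained by flipping bits of w at random.
avg_weight = p * (1 - D) + (1 - p) * D
assert abs(avg_weight - 7 / 18) < 1e-12

# Exponent actually needed versus the exponent the theorem promises.
needed = h(7 / 18) - h(1 / 6)      # about 0.314
promised = h(1 / 3) - h(1 / 6)     # about 0.268
assert needed > promised
```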

Some of my confusion disappeared after doing the minimization of $I(X;Y)$ more carefully. Since $I(X;Y) = H(X) - H(X|Y)$ and $H(X) = h(p)$ is constant, a good, although maybe not very intuitive, way to do this is to choose the probabilities $\mathbb{P}[X=\alpha|Y=\beta]$ to maximize $H(X|Y)$, consistent with the requirements that if $\mathbb{P}[Y=1] = q$ then

$(1-q)\mathbb{P}[X=0|Y=0] + q\mathbb{P}[X=0|Y=1] = 1-p$

and

$(1-q)\mathbb{P}[X=1|Y=0] + q\mathbb{P}[X=0|Y=1] \le D.$

Replacing the inequality with equality (it is a safe assumption that the mutual information is minimized when the expected distortion is as large as permitted), and solving the linear equations, one finds that the flip probabilities are

$\mathbb{P}[X=1|Y=0] = \frac{D+p-q}{2(1-q)},\ \mathbb{P}[X=0|Y=1] = \frac{D+q-p}{2q}$.
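It is worth checking that these formulas really do satisfy the two constraints; a few lines of Python (with my own sample values of $q$) confirm that $\mathbb{P}[X=1] = p$ and that the expected distortion is exactly $D$:

```python
p, D = 1 / 3, 1 / 6
for q in [0.2, 0.25, 0.3, 0.4]:            # any q with p - D <= q <= p + D
    a = (D + p - q) / (2 * (1 - q))        # P[X=1 | Y=0]
    b = (D + q - p) / (2 * q)              # P[X=0 | Y=1]
    assert abs((1 - q) * a + q * (1 - b) - p) < 1e-12   # P[X=1] = p
    assert abs((1 - q) * a + q * b - D) < 1e-12         # expected distortion D
```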

Thus one may pick any $q$ such that $p-D \le q \le p+D$. Since $H(X|Y) = (1-q) h(\mathbb{P}[X=1|Y=0]) + qh(\mathbb{P}[X=0|Y=1])$, substituting from the equations above gives a formula for $H(X|Y)$ in terms of $p$, $D$ and $q$. Differentiating with respect to $q$ (this is fiddly enough to warrant computer algebra), one finds that

\begin{aligned} \displaystyle 2\frac{\mathrm{d}H(X|Y)}{\mathrm{d}q} = \log_2& \frac{(D+p-q)(2-D-p-q)}{(1-q)^2} \\ &- \log_2 \frac{(D-p+q)(-D+p+q)}{q^2}\end{aligned}

and so the extrema occur when

$\displaystyle \frac{(D+p-q)(2-D-p-q)}{(1-q)^2} = \frac{(D-p+q)(-D+p+q)}{q^2}$.

For no reason that I can see, on multiplying out, the terms in $q^3$ and $q^4$ cancel, and one is left with the quadratic

$\bigl(p-D-q(1-2D)\bigr)\bigl(p-D+q(1-2p)\bigr)$

Hence $H(X|Y)$ is extremized only when $q = (p-D)/(1-2D)$, in which case both flip probabilities are $D$, and when $q = (p-D)/(1-2p)$, in which case $\mathbb{P}[X=1|Y=0]=(D-Dp-p^2)/(1+D-3p)$ and $\mathbb{P}[X=0|Y=1]=p$. In the first solution the second derivative satisfies

$\displaystyle \frac{\mathrm{d}^2H(X|Y)}{\mathrm{d}q^2} = -\frac{(1-2D)^2}{2(1-D)D(p-D)(1-D-p)}$

so it is a maximum. Consistent with the failed attempt above, we have $q < p$, so the space of codewords is smaller than the space of source words. The extreme value is $H(X|Y) = h(D)$. Again this is somewhat intuitive: the minimum mutual information consistent with a distortion of $D$ should be the uncertainty in the $(D,1-D)$ probability distribution. But note that the distortion is applied to codewords, rather than words emitted by the source. (A related remark is given at the end of this post.)
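The extremum can also be found numerically, which makes a reassuring check on the fiddly calculus. A Python sketch (a simple grid search; the parameter choices are mine) maximizing $H(X|Y)$ over $q$ for $p = \frac13$, $D = \frac16$:

```python
import math

def h(x):
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

p, D = 1 / 3, 1 / 6

def HXY(q):
    a = (D + p - q) / (2 * (1 - q))   # P[X=1 | Y=0]
    b = (D + q - p) / (2 * q)         # P[X=0 | Y=1]
    return (1 - q) * h(a) + q * h(b)

qs = [p - D + k * 2 * D / 1000 for k in range(1, 1000)]
best = max(qs, key=HXY)
assert abs(best - (p - D) / (1 - 2 * D)) < 0.01   # maximum at q = 1/4
assert abs(HXY(best) - h(D)) < 1e-4               # extreme value h(1/6)
```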

To emphasise the difference, the left diagram below shows the probabilities going from $Y$ to $X$, relevant to $H(X|Y)$. The right diagram shows the probabilities going from $X$ to $Y$, relevant to $H(Y|X)$ and needed in the proof below.

Note that the flip probabilities going from $X$ to $Y$ are not, in general equal. In the example above with $p=\frac{1}{3}$ and $D = \frac{1}{6}$ we have $q = \frac{1}{4}$ and the flip probabilities from $X$ to $Y$ are $\gamma = \frac{1}{16}$ and $\delta = \frac{3}{8}$. The extreme value is $H(X|Y) = h(\frac{1}{6}) \approx 0.650$.

The second solution appears to be spurious, i.e. not corresponding to a zero of the derivative. But since two maxima are separated by another zero of the derivative, it cannot be a maximum. This solution exists only when $D-Dp-p^2 \ge 0$, a condition equivalent to $D \ge \frac{p^2}{1-p}$. In the example above $q = \frac{1}{2}$ and (by complete chance) $\mathbb{P}[X=1|Y=0] = 0$ and $\mathbb{P}[X=0|Y=1] = \frac{1}{3}$. The extreme value is $H(X|Y) = \frac{1}{2}\log_2 3 - \frac{1}{3} \approx 0.459$.

Outline Proof. Let $q = (p-D)/(1-2D)$. Let $M = \lceil 2^{(h(p)-h(D)+\epsilon)r} \rceil$ as in the statement of the theorem. Choose codewords $U(1), \ldots, U(M)$ independently and uniformly at random from the binary words of length $r$ of weight $qr$. Let $w$ be a fixed word of weight $pr$. When we flip a proportion $\gamma$ of the $0$s and a proportion $\delta$ of the $1$s in $w$ we obtain a binary word of weight

\begin{aligned} pr + (1-p) \gamma r - p \delta r &= \Bigl( p + \frac{D(p-D)}{1-2D} - \frac{D(1-p-D)}{1-2D}\Bigr)r \\ &= \frac{p(1-2D) + D(p-D) - D(1-p-D)}{1-2D}\, r \\ &= \frac{p-2Dp + Dp - D^2 - D + Dp +D^2}{1-2D}\, r \\ &= \frac{p-D}{1-2D}\, r \\ &= qr \end{aligned}

as expected. Moreover, the number of binary words of weight $qr$ we obtain in this way is $\binom{pr}{\delta p r} \binom{(1-p)r}{\gamma (1-p) r}$. By the bounds above, the probability that a codeword of weight $qr$, chosen uniformly at random, is within distance $Dr$ of $w$ is about

$2^{(p h(\delta) + (1-p) h(\gamma) - h(q))r}$.

Considering all $M$ codewords, the probability that no codeword is within distance $Dr$ of $w$ is, by the same exponential approximation used earlier, about

$\exp (- 2^{(h(p)-h(D) + \epsilon + ph(\delta) + (1-p)h(\gamma) - h(q))r}).$

Now, as one would hope, and as can be shown using Mathematica, with FullSimplify and the assumptions that $0 < p < 1$ and $0 < D < p$, we have $h(p)-h(D) + ph(\delta) + (1-p)h(\gamma) - h(q) = 0$. The end of the proof is therefore as before. $\Box$
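Lacking Mathematica one can at least check the simplification numerically. In the Python sketch below, $\gamma = \mathbb{P}[Y=1|X=0]$ and $\delta = \mathbb{P}[Y=0|X=1]$ are computed from $q$ by Bayes’ rule (these formulas reproduce $\gamma = \frac{1}{16}$ and $\delta = \frac{3}{8}$ in the example above), and the combination $h(p)-h(D)+ph(\delta)+(1-p)h(\gamma)-h(q)$ vanishes to machine precision:

```python
import math

def h(x):
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

for p, D in [(1 / 3, 1 / 6), (0.4, 0.1), (0.25, 0.05)]:
    q = (p - D) / (1 - 2 * D)
    gamma = q * D / (1 - p)           # P[Y=1 | X=0], by Bayes' rule
    delta = (1 - q) * D / p           # P[Y=0 | X=1]
    lhs = h(p) - h(D) + p * h(delta) + (1 - p) * h(gamma) - h(q)
    assert abs(lhs) < 1e-9
```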

Remark. It is very tempting to start on the other side, and argue that a codeword $U$ chosen uniformly at random from the set of binary words of weight $qr$ is within distance $Dr$ of $2^{qrh(D) + (1-q)r h(D)} = 2^{h(D)r}$ source words of weight $pr$, obtained by flipping exactly $qrD$ of its $1$s and $(1-q)rD$ of its $0$s. This avoids the horrendous algebraic simplification required in the proof above. But now it is the source word that is varying, not the codeword. One can argue this way by varying both, which is permitted by the AEP since typical source words have about $pr$ $1$s. This combines both sources of randomness, the random code and the random source word, in a slightly subtle way, typical of the way the AEP is used in practice.

## Linearly recurrent sequences and LFSRs

September 17, 2019

Part B of my Cipher Systems course is on stream ciphers. Modern stream ciphers, such as Trivium, are typically constructed by judiciously introducing non-linearity into a long-period linear stream cipher. The linear case is therefore still important.

By one mathematical definition, a Linear Feedback Shift Register (LFSR) is a function $F : \mathbb{F}_2^\ell \rightarrow \mathbb{F}_2^\ell$ of the form

$(x_0,\ldots, x_{\ell-2}, x_{\ell-1}) \mapsto \bigl(x_1,\ldots,x_{\ell-1}, f(x_0,\ldots, x_{\ell-1})\bigr)$

where the feedback function $f$ is linear. If

$f(x_0,x_1,\ldots, x_{\ell-1}) = \sum_{t \in T} x_{\ell -t}$

then we say that $f$ has taps $T$ and that $\ell$ is the width of $F$. The keystream generated by $F$ for a key $(k_0,\ldots,k_{\ell-1}) \in \mathbb{F}_2^\ell$ is the sequence $k_0,\ldots, k_{\ell-1},k_{\ell}, k_{\ell+1}, \ldots$, where each $k_s$ for $s \ge \ell$ is defined recursively by

$k_s = f(k_{s-\ell}, \ldots, k_{s-1}) = \sum_{t \in T} k_{s-t}.$

Equivalently, for $s \ge \ell$,

$F\bigl( k_{s-\ell}, \ldots, k_{s-2}, k_{s-1} \bigr) = (k_{s-\ell+1}, \ldots, k_{s-1}, k_s) \quad(\star).$

The equation $(\star)$ is close to the hardware implementation of an LFSR: fill $\ell$ registers with the key $k_0k_1\ldots k_{\ell-1}$ and then repeatedly step by putting the sum of the bits $k_{\ell-t}$ for each $t \in T$ into the rightmost register, shifting the other registers to the left. (In hardware, these bits would be ‘tapped’ by physical wires.) The keystream can be read off from the rightmost register. This interpretation makes it clear that an LFSR of width $\ell$ is invertible if and only if it taps the bit about to leave the register for good: that is $\ell$ must be a tap.
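The hardware description translates directly into code. A minimal Python sketch (the function name `keystream` is mine), checked against the width $4$, taps $\{3,4\}$ example considered below:

```python
def keystream(taps, key, length):
    """Bits k_0, k_1, ...; for s >= len(key), k_s is the sum, modulo 2,
    of k_{s-t} over t in taps."""
    bits = list(key)
    while len(bits) < length:
        bits.append(sum(bits[-t] for t in taps) % 2)
    return bits

ks = keystream([3, 4], [0, 0, 0, 1], 30)
assert ks[:15] == [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1]
assert ks[15:] == ks[:15]          # period 15
```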

The period of an invertible LFSR $F$ is the least $P \in \mathbb{N}$ such that $F^P$ is the identity. One way to find $P$ uses that in the canonical basis of $\mathbb{F}_2^\ell$, the matrix (acting on column vectors) representing $F$ is the companion matrix of the polynomial $X^\ell + \sum_{t \in T}X^{\ell - t}$. Thus $F^m$ is the identity if and only if $X^\ell + \sum_{t \in T}X^{\ell -t}$ divides $X^m + 1$. This may seem concise written here, but in practice it takes well over a lecture to remind students of matrices, find the matrix representing $F$ and determine its minimal polynomial. All this adds up to a substantial diversion from cryptography; I fear it lost most students in my Cipher Systems course.

Instead, this year I’m going to give up on linear algebra and use the theory of linearly recurrent sequences and infinite power series. My plan is to subsume all of the ring-theoretic background into the following definition.

Definition. A polynomial $f(X)$ annihilates a power series $K(X)$ if $f(X)K(X)$ is a polynomial.

For example, the LFSR of width $4$ with taps $\{3,4\}$ generates the keystream ${\bf 0001}00110101111{\bf 0001} \ldots$ of period $15$. The corresponding power series is

\begin{aligned}(X^3 + X^6 + &X^7 + X^9 + X^{11} + X^{12} + X^{13} + X^{14})(1+X^{15} + \cdots ) \\ &= \displaystyle \frac{X^3 + X^6 + X^7 + X^9 + X^{11} + X^{12} + X^{13} + X^{14}}{1+X^{15}}.\end{aligned}

Clearly the power series is annihilated by $1+X^{15}$. But since $k_s = k_{s-3} + k_{s-4}$ for $s \ge 4$, it is also annihilated by $1+X^3+X^4$. Correspondingly,

$X^3 + X^6 + X^7 + X^9 + X^{11} + X^{12} + X^{13} + X^{14} + \cdots = \frac{X^3}{1+X^3+X^4}.$

More generally, the keystream generated by an LFSR with taps $T$ is annihilated by the feedback polynomial $f_T(X) = 1 + \sum_{t \in T} X^t$. Indeed, the coefficient of $X^s$ in the product is $k_s + \sum_{t \in T} k_{s-t} = 0$, and so $f_T(X)K(X)$ is a polynomial of degree at most $\ell-1$. Hence the period of $F$ is the minimum $m \in \mathbb{N}$ such that $X^m + 1$ is divisible by $1 + \sum_{t \in T} X^t$. (This is equivalent to the condition found before by taking reciprocal polynomials.)
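Computing the period from this characterization needs only polynomial division over $\mathbb{F}_2$, which is pleasant to implement with polynomials stored as sets of exponents. A Python sketch (the names `poly_mod` and `period` are mine):

```python
def poly_mod(dividend, divisor):
    """Remainder on division of polynomials over F_2, stored as sets of
    exponents; adding terms is symmetric difference of sets."""
    dividend, d = set(dividend), max(divisor)
    while dividend and max(dividend) >= d:
        shift = max(dividend) - d
        dividend ^= {e + shift for e in divisor}
    return dividend

def period(taps):
    f = {0} | set(taps)                  # feedback polynomial 1 + sum of X^t
    m = 1
    while poly_mod({0, m}, f):           # is X^m + 1 divisible by f?
        m += 1
    return m

assert period([3, 4]) == 15
assert period([2, 3]) == 7
```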

One quickly gets further non-trivial results.

#### Keystream of maximum period

Given $h(X) = h_0 + h_1X + \cdots + h_{\ell-1}X^{\ell-1}$, we claim that the equation $K(X) = h(X)/f_T(X)$ defines a corresponding keystream. To avoid the need to define inverses in the ring of power series, let’s agree to interpret this equation as equivalent to

$(1+\sum_{t \in T}X^t)(k_0 + k_1X + k_2X^2 + \cdots + k_sX^s + \cdots ) = h(X)$.

In turn, this is equivalent to the linear equations

$h_s = k_s + \sum_{t \in T : t \le s} k_{s-t}$

that must be satisfied by $k_0, k_1, k_2, \ldots$; clearly these equations have a unique solution. In particular, there is a keystream such that $K(X) = 1/f_T(X)$. (This holds even if the LFSR is not invertible.) It’s obvious that $K(X)$ is annihilated only by the multiples of $f_T(X)$. In the invertible case we have seen that $f_T(X)$ divides $X^m + 1$ if and only if $m$ is a multiple of the period $P$ of the LFSR. Therefore there is a keystream of period $P$.

(For comparison, using the linear algebra method, one argues that the matrix representing an LFSR is cyclic and that, as is clear from the left-shift in the definition, the corresponding $\mathbb{F}_2[X]$-module is generated by $(0,\ldots,0,1)$. The period of the keystream for the key $(0,\ldots, 0,1)$ is therefore the period of $F$. Again this argument becomes far more long-winded when enough detail is included to make it accessible to students.)

#### Sum of two keystreams

Suppose, in an attempt to increase the linear complexity, we add the keystream ${\bf 001}0111{\bf 001} \ldots$ of the LFSR of width $3$ with taps $\{2,3\}$ to the keystream with period $15$ in the example above. The new keystream has period $105$ and is annihilated by

$(1+X^3+X^4)(1+X^2+X^3) = 1+X^2+X^4+X^5+X^7.$

It is therefore a keystream of the LFSR with taps $\{2,4,5,7\}$. I plan to get students to find these taps by the slow method of solving the linear equations $(\star)$, and then, I hope, impress them with how much faster things are with a bit more algebra.
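To check this numerically, here is a Python sketch (helper of my own) verifying both the new taps and the period:

```python
def keystream(taps, key, n):
    """First n bits of the LFSR keystream with k_s = sum over t in taps of k_{s-t} mod 2."""
    bits = list(key)
    while len(bits) < n:
        bits.append(sum(bits[-t] for t in taps) % 2)
    return bits

n = 300
a = keystream({3, 4}, [0, 0, 0, 1], n)  # period 15
b = keystream({2, 3}, [0, 0, 1], n)     # period 7
c = [x ^ y for x, y in zip(a, b)]       # sum of the two keystreams

# c satisfies the width-7 recurrence with taps {2, 4, 5, 7} ...
assert all(c[s] == c[s-2] ^ c[s-4] ^ c[s-5] ^ c[s-7] for s in range(7, n))
# ... and has period 105 = lcm(15, 7), but no smaller period dividing 105
assert c[:n-105] == c[105:]
assert all(c[:n-d] != c[d:] for d in (3, 5, 7, 15, 21, 35))
```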

(Incidentally one can have endless fun with the two competing conventions for taps. Though less standard, our convention, in which $k_s = \sum_{t \in T}k_{s-t}$, is a natural fit for the annihilator method, as this example shows. It is also clearly the best fit for the Berlekamp–Massey algorithm.)

#### Length of all keystreams

One computer demonstration that I think worked quite well last year used a Mathematica notebook to compute the lengths of all keystreams for all invertible LFSRs of width $\ell \le 5$. I then showed that the keystreams of maximum possible period $2^{\ell}-1$ came from the LFSRs whose minimal polynomial is primitive. Of course most students, quite reasonably, didn’t know what ‘primitive’ meant, but they could see that Mathematica did, and that there was some underlying pattern related to the factorization of the minimal polynomial.

Since the matrix representing an LFSR with taps $T$ in the canonical basis of $\mathbb{F}_2^\ell$ is the companion matrix of $X^\ell + \sum_{t \in T}X^{\ell -t}$, the corresponding $\mathbb{F}_2[X]$-module is

$\mathbb{F}_2[X]/\langle X^\ell + \sum_{t \in T} X^{\ell - t}\rangle.$

Switching to the reciprocal polynomial, the lengths of the keystreams are the orbit sizes of $X$ in its action on $\mathbb{F}_2[X]/\langle f_T(X) \rangle$. When $f_T(X)$ is irreducible and primitive, we get a single non-zero orbit. The general case is quite intricate, and is stated as Theorem 8.63 in Lidl and Niederreiter, Finite Fields.

For this retelling, we need the following function. Given $r \in \mathbb{N}$ with $2^{m-1} < r \le 2^m$, let $b(r) = m$. Thus the sequence $b(r)$ starts

$0, 1, 2, 2, 3, 3, 3, 3, 4, \ldots$

and $b(r)$ is the number of bits in the binary form of $r-1$.
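In Python (a one-line sketch), $b(r)$ is exactly the bit length of $r-1$:

```python
def b(r):
    """b(r) = m where 2^(m-1) < r <= 2^m: the number of bits of r - 1."""
    return (r - 1).bit_length()

print([b(r) for r in range(1, 10)])  # [0, 1, 2, 2, 3, 3, 3, 3, 4]
```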

In the next two results, let $g(X) \in \mathbb{F}_2[X]$ be an irreducible polynomial of degree $d$ and let $t$ be the order of $X$ in the finite field $\mathbb{F}_2[X]/g(X)$.

Proposition. Let $r \ge 2$. The orbits of $X$ on $\mathbb{F}_2[X]/g(X)^r$ not contained in the unique maximal ideal $g(X)\mathbb{F}_2[X]/g(X)^r$ have size $2^{b(r)}t$ and there are $\bigl(2^{rd} - 2^{(r-1)d}\bigr)/ 2^{b(r)} t$ of them.

Proof. Let $J$ be the complement of the maximal ideal. By hypothesis, $X^t + 1$ is divisible by $g(X)$. Since $t$ is odd,

$X^{2^m t} + 1 = (X^t+1)^{2^m}$

is divisible by $g(X)^r$ if and only if $2^m \ge r$. Therefore $X$ has order $2^{b(r)} t$ in its action on $J$. Since $r \ge 2$, the elements of $J$ are invertible. Hence if $p(X) \in J$ then $p(X)X^m = p(X)$ mod $g(X)^r$ if and only if $X^m = 1$ mod $g(X)^r$. It follows that all the orbits of $X$ on $J$ have size $2^{b(r)}t$. Since $J$ has size $2^{dr} - 2^{d(r-1)}$, the number of orbits is as claimed. $\Box$

Theorem. The non-zero orbits of $X$ on $\mathbb{F}_2[X]/g(X)^e$ have sizes $2^b t$ for $0 \le b \le b(e)$; there are $(2^d-1)/t$ orbits of size $t$. If $b \ge 1$ and $2^b \le e$ then the number of orbits of size $2^b t$ is

$\displaystyle \frac{2^{2^b d} - 2^{2^{b-1}d}}{2^b t}.$

The number of orbits of size $2^{b(e)} t$ is

$\displaystyle \frac{2^{e d} - 2^{2^{b(e)-1}d}}{2^{b(e)} t}.$

Proof. Let $b \in \mathbb{N}$. Suppose that $2^b \le e$. For each $r$ such that $2^{b-1} < r \le 2^{b}$ we have $b(r) = b$ and the count of orbits of size $2^b t$ in the previous proposition telescopes as required. A similar telescoping works to count the orbits of size $2^{b(e)} t$. Finally we count the zero orbit $\{0\}$ and $(2^d-1)/t$ orbits of size $t$, which together form the minimal ideal $g(X)^{e-1} \mathbb{F}_2[X]/g(X)^e$, isomorphic to the finite field $\mathbb{F}_2[X]/g(X)$. $\Box$

For example, take $g(X) = 1+X+X^2$ and $e = 5$. Then $b(e) = 3$, $t = 3$ and the orbits of $X$ on $\mathbb{F}_2[X]/g(X)^5$ have sizes $1, 3$; $6^2$; $12^{20}$; $24^{32}$, where the semicolons indicate the split according to $b$ used in the theorem, and the exponents indicate multiplicities. These orbit sizes form the multiset $\{1,3,6^2, 12^{20}, 24^{32} \}$.
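This example can be verified directly in Python, encoding polynomials in $\mathbb{F}_2[X]$ as bitmasks; all helper names below are my own:

```python
from collections import Counter

def clmul(a, b):
    """Product in F_2[X] of two polynomials encoded as bitmasks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

modulus = 0b111
for _ in range(4):
    modulus = clmul(modulus, 0b111)  # (1 + X + X^2)^5, degree 10

def mul_x(p):
    """p(X) * X modulo (1 + X + X^2)^5."""
    p <<= 1
    return p ^ modulus if p >> 10 & 1 else p

# Collect the orbit sizes of multiplication by X on the 2^10 residues.
seen, sizes = set(), Counter()
for start in range(1 << 10):
    if start in seen:
        continue
    orbit, p = {start}, mul_x(start)
    while p != start:
        orbit.add(p)
        p = mul_x(p)
    seen |= orbit
    sizes[len(orbit)] += 1

print(sorted(sizes.items()))  # [(1, 1), (3, 1), (6, 2), (12, 20), (24, 32)]
```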

Just for the next corollary, we define the product of two multisets of natural numbers to be the multiset in which each pair of elements $s$, $s'$, taken with multiplicity, contributes $\gcd(s,s')$ copies of $\mathrm{lcm}(s,s')$. When the paired elements are coprime this is just pairwise multiplication; for example

$\{1,3,6^2\} \times \{1,7\} = \{1,3,6^2,7,21,42^2\}.$

Corollary. Let $F$ be an invertible LFSR with taps $T$. Let $f_T(X) = g_1^{e_1} \ldots g_c^{e_c}$ be the factorization of its feedback polynomial into powers of distinct irreducibles. Then the multiset of orbit sizes of $F$ is the product of the multisets of orbit sizes for each factor $g_i^{e_i}$, as given by the previous theorem.

Proof. By the Chinese Remainder Theorem, $\mathbb{F}_2[X]/\langle f_T(X) \rangle \cong \mathbb{F}_2[X]/\langle g_1^{e_1} \rangle \oplus \cdots \oplus \mathbb{F}_2[X]/\langle g_c^{e_c} \rangle$. An element whose components lie in orbits of sizes $s_1, \ldots, s_c$ lies in an orbit of $X$ of size $\mathrm{lcm}(s_1,\ldots,s_c)$, and each such tuple of orbits splits into $s_1 \cdots s_c/\mathrm{lcm}(s_1,\ldots,s_c)$ orbits. $\Box$

#### Product of two keystreams

Again using that the matrix representing an invertible LFSR with taps $T$ in the canonical basis of $\mathbb{F}_2^\ell$ is the companion matrix of $X^\ell + \sum_{t \in T}X^{\ell -t}$, say $A_T$, we see that the bitwise product of the keystreams of invertible LFSRs with taps $T$ and $T'$ on keys $k$ and $k'$ is obtained by applying a fixed linear functional (induced by the bilinear product of the two output bits) to the iterates $(A_T \otimes A_{T'})^s (k \otimes k')$. Define the set $T \otimes T' \subseteq \{1,\ldots, \ell\ell'\}$ so that $A_T \otimes A_{T'}$ has characteristic polynomial

$X^{\ell\ell'} + \sum_{t \in T \otimes T'} X^{\ell\ell' - t}.$

Then $T \otimes T'$ is the set of taps for an LFSR generating the product keystream.

We now identify $T \otimes T'$ more concretely. Let $\alpha \in \mathbb{F}_{2^{\ell}}$ and $\alpha' \in \mathbb{F}_{2^{\ell'}}$ be roots of the feedback polynomials for $T$ and $T'$, assumed irreducible. The characteristic polynomial of $A_T$ then has roots

$\alpha^{-1}, \alpha^{-2}, \ldots, \alpha^{-2^{\ell-1}}$

and similarly for $A_{T'}$. Hence the characteristic polynomial of $A_T \otimes A_{T'}$ has roots $\alpha^{-2^i}{\alpha'}^{-2^{i'}}$ for $0 \le i < \ell$ and $0 \le i' < \ell'$. It follows that the feedback polynomial for the LFSR generating the product keystream has roots $\alpha^{2^i}{\alpha'}^{2^{i'}}$ for the same $i$ and $i'$.

In the special case where $\ell$ and $\ell'$ are coprime, the fields $\mathbb{F}_2(\alpha) \cong \mathbb{F}_{2^\ell}$ and $\mathbb{F}_2(\alpha') \cong \mathbb{F}_{2^{\ell'}}$ together generate $\mathbb{F}_2(\alpha\alpha') \cong \mathbb{F}_{2^{\ell\ell'}}$ and so

$\prod_{i,i'} (X - \alpha^{2^i}{\alpha'}^{2^{i'}}) = 1 + \sum_{t \in T \otimes T'} X^t = f_{T \otimes T'}(X)$.

For example, the bitwise product of the keystreams, starting with the standard keys $0001$ and $00001$ for the invertible full-period LFSRs with taps $\{1,4\}$ and $\{2,5\}$ has period $(2^4-1)(2^5-1) = 465$. A computer algebra calculation shows that the minimal polynomial of $\alpha\alpha'$ is

$X^{20} + X^{17} + X^{14} + X^{11} + X^{10} + X^9 + X^8 + X^2 + 1$,

hence $\{1,4\} \otimes \{2,5\} = \{2,8,9,10,11,14,17,20\}$. In this case the LFSR has the minimal possible width, so the same set of taps is given by applying the Berlekamp–Massey algorithm to the product keystream.
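The period claim is easy to check numerically. The Python sketch below (helper of my own) verifies only the period, not the tap set:

```python
def keystream(taps, key, n):
    """First n bits of the LFSR keystream with k_s = sum over t in taps of k_{s-t} mod 2."""
    bits = list(key)
    while len(bits) < n:
        bits.append(sum(bits[-t] for t in taps) % 2)
    return bits

n = 1000
a = keystream({1, 4}, [0, 0, 0, 1], n)     # m-sequence of period 15
b = keystream({2, 5}, [0, 0, 0, 0, 1], n)  # m-sequence of period 31
c = [x & y for x, y in zip(a, b)]          # bitwise product

# The minimal period of the product is 465 = 15 * 31:
assert c[:n-465] == c[465:]
assert all(c[:n-d] != c[d:] for d in (3, 5, 15, 31, 93, 155))
```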

One nice feature of the $\otimes$ product is that $T \otimes \{m\} = mT$ for any $m \in \mathbb{N}$. A more interesting product is perhaps $T \circ T'$, defined to be the set of taps of the LFSR of minimal width producing the product keystream, starting with the standard keys as above. In general this width is smaller than $\ell\ell'$: for example, since bits are idempotent, the product of a keystream with itself is the keystream itself, and so $T \circ T = T$ for any $T$. The $\circ$ product maybe deserves closer investigation, even though its general behaviour appears quite wild.

## A toy model for the proof of Shannon’s Noisy Coding Theorem for the BSC

September 6, 2019

In the Binary Symmetric Channel (BSC) with symbol error probability $p$, an input bit flips with probability $p < 1/2$. A binary code of length $n$ can be used to communicate on the BSC by sending each bit of a codeword $u$, collecting the $n$ received bits as a word $v$ still of length $n$, and then decoding $v$ as the nearest codeword $u'$ to $v$ with respect to Hamming distance. For instance, using the repetition code of length $n = 2e+1$, a decoding error occurs if and only if $e+1$ or more bits get flipped in the channel. This event has probability

\begin{aligned}\sum_{k=e+1}^n \binom{n}{k} p^k(1-p)^{n-k} &\le p^{e+1}(1-p)^e \sum_{k=e+1}^n \binom{n}{k} \\ &\le p^e (1-p)^e 2^{n-1} \\ &\le \rho^e\end{aligned}

where $\rho = 4p(1-p) < 1$. Hence we can communicate with decoding error probability $\le \epsilon$ provided that $e \ge \log \epsilon / \log \rho$. For example, if $p = 1/4$ then $\rho = 3/4$ and $\rho^e < 5\%$ provided $n \ge 23$. With this solution we send $23$ bits through the channel for each bit of the message. The information rate is therefore $1/23$. Can we do better?
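Here is a Python check (a sketch, using the exact binomial tail) that the bound $\rho^e$ is already below $5\%$ when $n = 23$:

```python
from math import comb

def repetition_error(e, p):
    """Exact decoding error probability for the repetition code of length 2e + 1."""
    n = 2 * e + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(e + 1, n + 1))

p = 1/4
rho = 4 * p * (1 - p)          # 3/4
err = repetition_error(11, p)  # n = 23
print(err <= rho**11 < 0.05)   # True: the bound rho^e is already below 5%
```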

#### Capacity and mutual information

The capacity of the BSC is $1-h(p)$, where

$h(p) = -p \log_2 p -(1-p) \log_2(1-p)$

is Shannon’s entropy function. Capacity has a slightly technical definition in terms of mutual information. Recall that if $X$ and $Y$ are random variables then the conditional entropy $H(X|Y)$ is the expected value

$H(X|Y) = \sum_{y} \mathbb{P}[Y=y]H(X|Y=y),$

where $H(X|Y=y)$ is the entropy of the random variable $X$, conditioned on $Y=y$. Informally, $H(X|Y)$ is the uncertainty in $X$ that remains after learning $Y$. The mutual information $I(X;Y)$ is then $H(X) - H(X|Y)$; informally this is the amount of information about $X$ we learn from $Y$. For example, if $p =0$ then $Y$ determines $X$, so all uncertainty is removed, and $I(X;Y) = H(X)$. By

$I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y|X).$

mutual information is symmetric: the amount we learn about $X$ from $Y$ is also the amount we learn about $Y$ from $X$. Is this intuitive?

Returning to the BSC, consider the most ‘informative’ source $X^\star$ for which $\mathbb{P}[X^\star=0] = \mathbb{P}[X^\star=1] = \frac{1}{2}$. Let $Y$ be the output bit. The capacity is then $I(X^\star;Y)$. We saw a moment ago that, as expected, if $p=0$ then the capacity is $H(X^\star) = 1$. The most convenient way to compute the capacity for general $p$ uses the right-hand side of the identity above. Since $X^\star$ is uniform, so is $Y$, and hence $H(Y) = 1$. Moreover

\begin{aligned} H(Y|X^\star) &= \frac{1}{2} H(Y | X^\star = 0) + \frac{1}{2}H(Y|X^\star=1) \\ &= 2 \times \frac{1}{2} (-p \log_2 p -(1-p) \log_2(1-p)) \\ &= h(p)\end{aligned}.

Therefore the capacity is $1-h(p)$, as claimed.
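A quick Python computation (sketch) of the capacity for $p = 1/4$:

```python
from math import log2

def h(p):
    """Shannon's binary entropy function."""
    if p in (0, 1):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

capacity = 1 - h(1/4)
print(round(capacity, 3))  # 0.189
```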

#### Shannon’s Noisy Coding Theorem

By Shannon’s Noisy Coding Theorem capacity has the following interpretation: given any $\epsilon > 0$ and any $\eta > 0$, for all sufficiently large $n$ there exists a binary code $C$ of length $n$ of size $> 2^{n (1- h(p) - \eta)}$ such that the probability of a decoding error when $C$ is used to communicate on the BSC, with each codeword equally likely to be sent, is $< \epsilon$. (To be precise, a decoding error means that the received word is not decoded as the sent codeword.) For example, if $p = 1/4$ then

$1-h(p) \approx 0.189.$

Therefore Shannon's Noisy Coding Theorem promises that, by taking $n$ sufficiently large, we can communicate with negligible decoding error probability and information rate $0.188$.

As far as I know, all known proofs of Shannon's Noisy Coding Theorem use the probabilistic method.

#### Outline proof

Let $C = \{u(1),\ldots, u(N)\}$ be a binary code of length $n$ and size about $2^{n (1- h(p) - \eta/2)}$, obtained by choosing each codeword $u(i)$ uniformly and independently at random from $\mathbb{F}_2^n$. Let $p_{i}(C)$ be the probability (in the BSC model) that if $u(i)$ is sent then the received word is wrongly decoded as some $u(j)$ with $j\not=i$. An expectation argument shows that, provided $n$ is sufficiently large,

$\mathbb{E}_{\mathrm{code}}[p_{i}(C)] < \epsilon/2.$

Note that here the expectation is taken with respect to the random choice of $C$ and $p_{i}$ is a random variable (that just happens to be a probability in the BSC model). We'll see shortly why $\epsilon/2$ is the right upper bound to take. It follows by averaging over all codewords that

$\frac{1}{N}\sum_{i=1}^N\mathbb{E}_\mathrm{code}[p_i(C)] < \epsilon/2.$

By the probabilistic method, there exists a particular code $C$ such that

$\frac{1}{N}\sum_{i=1}^N p_i(C) < \epsilon/2.$

Does this mean that we can use $C$ to communicate with decoding error probability $< \epsilon/2$? Unfortunately not, because the previous equation only says that on average, we meet this bound. There could be some codewords $u(i)$, perhaps those unusually near to many other codewords, for which the decoding error probability is higher. But since the average is $\epsilon/2$, at most half the codewords have $p_i(C) \ge \epsilon$. Therefore by removing at most half the codewords, we get a code $C'$ of size $N' \ge N/2$ with decoding error probability $< \epsilon$. It might seem bad that, at worst, $|C'| = |C|/2 = N/2$, but as $N = 2^{(1-h(p)-\eta/2)n}$, we have

$\frac{N}{2} = 2^{(1-h(p) -\eta/2 - 1/n)n}$

and provided $n$ is sufficiently large, $\eta/2 + 1/n < \eta$, so $N' \ge 2^{(1-h(p)-\eta)n}$. This randomly constructed $C'$ proves Shannon's Noisy Coding Theorem for the parameters $\epsilon$ and $\eta$.

#### Remarks

1. The final expurgation step, and the need to work with $\epsilon/2$ rather than $\epsilon$ can be avoided by choosing a random linear code. Then the multiset of Hamming distances $\{ d(u(i),u(j)) : j\not=i \}$ which determines the probability $p_i(C)$ of a decoding error on sending $u(i)$ is the same for all $i$. Therefore all the $p_i(C)$ are equal. Some linear codes will be uniformly terrible, but by the probabilistic method, some linear codes are uniformly good.
2. Care is needed to turn the outline above into a complete proof. For example, in the first step bounding the $p_i(C)$, one needs Chernoff’s bound or the Central Limit Theorem to justify that when a codeword $u$ is sent, the received word $v$ is, with probability tending to $1$ as $n \to \infty$, at distance between $(p-\delta)n$ and $(p+\delta)n$ from $u$, for any fixed $\delta > 0$.
3. One subtle feature of the proof is that there are two different probability models in play. I tried to emphasise this above, for example by using $\epsilon$ for the error bounds in the channel probability space, and $\eta$ for the error bounds in the code probability space. A related feature is that we have to work with expectations, rather than probabilities, in the code’s probability space. For instance, it would be nice to have

$\mathbb{P}_\mathrm{code}\bigl[\mathbb{P}_\mathrm{channel}[A_i] > \epsilon\bigr] < \eta$

for sufficiently large $n$, where $A_i$ is the event that a sent $u(i)$ is wrongly decoded as some $u(j)$ with $j\not= i$. But these probabilities seem hard to reason about directly.

#### A toy model

Here is a toy model that has the same split into two probability spaces, but is, I think, much easier to understand. To play Match! using a supplied deck of $c$ cards each with some number between $1$ and $n$, the two players each draw a card from the deck. If they find they have matching numbers, they both eat some cake. Let $M$ be this ‘in game’ event. Otherwise they go to bed hungry.

Fix a deck and let $f_r$ be the frequency of card $r$, for each $r \in \{1,\ldots, n\}$. We have

$\mathbb{P}_\mathrm{in game}[M] = \sum_{r=1}^n \frac{f_r(f_r-1)}{c(c-1)}.$

In the random process used by the manufacturer to construct decks, the cards are picked independently and uniformly. Hence $(f_1,\ldots,f_n)$ has the multinomial distribution $B(c,1/n,\ldots, 1/n)$. Therefore $\mathbb{E}_\mathrm{deck}[f_r] = c/n$ and

$\mathbb{E}_\mathrm{deck}[f_r^2] = \mathrm{Var}_\mathrm{deck}[f_r] + \mathbb{E}_\mathrm{deck}[f_r]^2 = \frac{c(n-1)}{n^2} + \frac{c^2}{n^2}.$

It follows that

\begin{aligned}\mathbb{E}_\mathrm{deck}\bigl[\mathbb{P}_\mathrm{in game}[M]\bigr] &= \frac{1}{c(c-1)}\sum_{r=1}^n \bigl( \frac{c(n-1)}{n^2} + \frac{c^2}{n^2} - \frac{c}{n} \bigr) \\ &= \frac{n}{c(c-1)n^2} \bigl( cn- c + c^2 - cn \bigr) \\ &= \frac{1}{n}. \end{aligned}

This can also be seen without calculation: mixing up the two probability models, we can imagine that the manufacturer builds a two card deck, and issues the cards to the players. The probability they are equal is clearly $1/n$.

We can also reason about probabilities in the deck model. For example $\mathbb{P}_\mathrm{deck}\bigl[ \mathbb{P}_\mathrm{in game}[M] = 0 \bigr]$ is the probability that all cards are different, namely

$\frac{n(n-1) \ldots (n-c+1)}{n^c} = \frac{c!}{n^c} \binom{n}{c}.$
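Both the expectation and this probability can be confirmed exactly for small parameters by enumerating all $n^c$ decks in Python (a brute-force sketch; here $n = 4$, $c = 3$):

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb, factorial

def match_probability(deck):
    """In-game probability that the two drawn cards match."""
    c = len(deck)
    freqs = Counter(deck)
    return sum(Fraction(f * (f - 1), c * (c - 1)) for f in freqs.values())

n, c = 4, 3
decks = list(product(range(n), repeat=c))

# E_deck[P_in game[M]] = 1/n
expectation = sum(match_probability(d) for d in decks) / len(decks)
assert expectation == Fraction(1, n)

# P_deck[P_in game[M] = 0] = (c!/n^c) * binom(n, c)
all_distinct = sum(1 for d in decks if len(set(d)) == c)
assert Fraction(all_distinct, len(decks)) == Fraction(factorial(c), n**c) * comb(n, c)
```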