## Two proofs of Shannon’s Source Coding Theorem

September 29, 2019

The purpose of this post is to record two approaches to Shannon’s Source Coding Theorem for memoryless sources, first by explicit codes and secondly using statistical properties of the source.

#### Setup and statement.

Suppose that a memoryless source produces the symbols $1, \ldots, m$ with probabilities $p(1), \ldots, p(m)$, respectively. Let

$h = H(p(1),\ldots, p(m))$

be the entropy of this probability distribution. Since the source is memoryless, the entropy of an $r$-tuple of symbols is then $rh$. Shannon’s Source Coding Theorem (or First Main Theorem, or Noiseless Coding Theorem) states that, given $\epsilon > 0$, provided $r$ is sufficiently large, there is a variable-length binary code of average length less than $r(h + \epsilon)$ that can be used to encode all $r$-tuples of symbols.

#### Proof by Kraft–McMillan inequality

Choose $r$ sufficiently large so that $r > 1/\epsilon$. Let $q(1), \ldots, q(m^r)$ be the probabilities of the $r$-tuples and let $\ell(j) \in \mathbb{N}$ be least such that $\ell(j) \ge \log_2 1/q(j)$. Equivalently, $\ell(j)$ is least such that

$2^{-\ell(j)} \le q(j).$

This equivalent characterisation shows that

$\sum_{j=1}^{m^r} 2^{-\ell(j)} \le \sum_{j=1}^{m^r} q(j) =1$

so by the Kraft–McMillan inequality, there exists a prefix free binary code with these codeword lengths. Moreover, since

$\ell(j) = \lceil \log_2 1/q(j) \rceil \le 1 + \log_2 \frac{1}{q(j)}$

we have

\begin{aligned}\sum_{j=1}^{m^r} q(j)\ell(j) &\le \sum_{j=1}^{m^r} q(j) \bigl( 1 + \log_2 \frac{1}{q(j)} \bigr) \\ &= \sum_{j=1}^{m^r} q(j) + \sum_{j=1}^{m^r} q(j) \log_2 \frac{1}{q(j)} \\ &= 1 + hr \end{aligned}

and since $r > 1/\epsilon$, the right-hand side $1 + hr$ is less than $r(h + \epsilon)$, as required.

The required prefix-free code is the Shannon code for the probabilities $q(1), \ldots, q(m^r)$. Another algorithmic construction is given by the proof of the sufficiency of the Kraft–McMillan inequality on page 17 of Welsh Codes and cryptography. (The same proof is indicated, rather less clearly in my view, on the Wikipedia page.)
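Here is a quick Python check of the argument, in the spirit of the computer demonstrations I use elsewhere; the three-symbol distribution is my own toy example. It computes the Shannon codeword lengths $\ell(j)$ for all $r$-tuples and verifies both the Kraft–McMillan condition and the bound $1 + hr$ on the average length.

```python
import math
from itertools import product

p = {1: 0.5, 2: 0.25, 3: 0.25}                  # toy symbol probabilities
h = -sum(q * math.log2(q) for q in p.values())  # entropy of one symbol
r = 4                                           # block length

kraft_sum = 0.0
avg_len = 0.0
for tup in product(p, repeat=r):
    q = math.prod(p[x] for x in tup)            # probability of the r-tuple
    l = math.ceil(math.log2(1 / q))             # Shannon codeword length
    kraft_sum += 2.0 ** -l
    avg_len += q * l

print(kraft_sum <= 1)                           # True: Kraft-McMillan holds
print(avg_len, 1 + r * h)                       # average length is at most 1 + hr
```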

#### Proof using Asymptotic Equipartition Property (AEP).

Despite its forbidding name, for memoryless sources the AEP is little more than the Weak Law of Large Numbers, albeit applied in a slightly subtle way to random variables whose values are certain (functions of) probabilities. Let $\mathcal{M} = \{1,\ldots, m\}$ be the probability space of symbols produced by the source. Let $X \in \mathcal{M}$ be a random symbol produced by the source. Then $\log p(X)$ is a random variable. Summing over the $m$ outcomes in the probability space, we find that

$\mathbb{E}[\log p(X)] = \sum_{i=1}^m \mathbb{P}[X=i] \log p(i) = \sum_{i=1}^m p(i) \log p(i) = -h$

where $h = H(p(1), \ldots, p(m))$ is the entropy already seen. While straightforward, and giving some motivation for considering $\log p(X)$, the calculation hides some subtleties. For example, is it true that

$\mathbb{P}[\log p(X) = \log p(i)] = p(i)?$

In general the answer is ‘no’: in fact equality holds if and only if $i$ is the unique symbol with probability $p(i)$. (One further thing, that I certainly will not even dare to mention in my course, is that, by the formal definition of a random variable, $X$ is the identity function on $\mathcal{M}$, and so $p(X)$ is a composition of two functions; in Example 2.3.1 of Goldie and Pinch Communication Theory the authors remark that forming such compositions is a ‘characteristic feature of information theory, not found elsewhere in probability theory’.)

Anyway, returning to the AEP, let $(X_1, \ldots, X_r) \in \mathcal{M}^r$ be a random $r$-tuple. For brevity we shall now call these $r$-tuples messages. Let $P_r : \mathcal{M}^r \rightarrow [0,1]$ be defined by

$P_r\bigl( (x_1, \ldots, x_r) \bigr) = p(x_1) \ldots p(x_r)$

Since the source is memoryless, $P_r\bigl( (x_1,\ldots, x_r)\bigr)$ is the probability that the first $r$ symbols produced by the source are $x_1, \ldots, x_r$. Consider the random variable $\log P_r\bigl( (X_1, \ldots, X_r) \bigr)$. By the displayed equation above,

$\log P_r\bigl( (X_1,\ldots, X_r) \bigr) = \sum_{i=1}^r \log p(X_i).$

This is a sum of $r$ independent identically distributed random variables. We saw above that each has expected value $-h$. Hence, by the weak law, given any $\eta > 0$,

$\displaystyle \mathbb{P}\Bigl[ -\frac{\log P_r\bigl( (X_1, \ldots, X_r) \bigr)}{r} \not\in (h-\eta, h+\eta) \Bigr] \rightarrow 0$

as $r \rightarrow \infty$. By taking $r$ sufficiently large so that the probability is less than $\eta$, we obtain a subset $\mathcal{T}$ of $\mathcal{M}^r$ such that (a)

$\mathbb{P}[(X_1,\ldots, X_r) \in \mathcal{T}] > 1 - \eta$

and (b) if $(x_1,\ldots, x_r) \in \mathcal{T}$ then

$P_r\bigl( (x_1, \ldots, x_r) \bigr) \in (2^{-r(h+\eta)}, 2^{-r(h-\eta)}).$

The properties (a) and (b) say that memoryless sources satisfy the AEP. It is usual to call the elements of $\mathcal{T}$ typical messages.

For example, if $p(1) \ge \ldots \ge p(m)$ then the most likely message is $(1,\ldots, 1)$ and the least likely is $(m,\ldots, m)$; unless the distribution is uniform, both are atypical when $\eta$ is small. A typical message will instead have each symbol $i$ appearing with frequency about $rp(i)$. Indeed, the proof of the AEP for memoryless sources on page 80 of Welsh Codes and cryptography takes this as the definition of typical, using the Central Limit Theorem in place of the weak law so that ‘about’ can be defined precisely.
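Here is a small simulation (my own illustration, not part of the proof) showing the weak law at work: as $r$ grows, the proportion of sampled messages with $-\frac{1}{r}\log_2 P_r$ outside $(h - \eta, h + \eta)$ tends to $0$.

```python
import math
import random

p = [0.5, 0.25, 0.25]                       # p(1), p(2), p(3): a toy source
h = -sum(q * math.log2(q) for q in p)
eta = 0.05

for r in [10, 100, 1000]:
    atypical = 0
    for _ in range(2000):
        xs = random.choices(range(3), weights=p, k=r)
        log_prob = sum(math.log2(p[x]) for x in xs)
        if abs(-log_prob / r - h) >= eta:
            atypical += 1
    print(r, atypical / 2000)               # proportion of atypical messages
```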

We can now prove the Source Coding Theorem in another way: if $(x_1, \ldots, x_r)$ is typical then

$P_r\bigl( (x_1,\ldots, x_r)\bigr) \ge 2^{-r(h+\eta)}$

and so

$1 \ge \mathbb{P}[(X_1,\ldots, X_r) \in \mathcal{T}] = \sum_{(x_1,\ldots, x_r) \in \mathcal{T}} P_r\bigl( (x_1, \ldots, x_r) \bigr) \ge |\mathcal{T}| 2^{-r(h+ \eta)}.$

Hence $|\mathcal{T}| \le 2^{r(h+\eta)}$, so the typical messages can be encoded using binary words of a single length $L$ with $2^L \ge 2^{r(h+\eta)}$. For the atypical messages we make no attempt at efficiency, and instead encode them using binary words of length $N$ for some $N > r \log_2 m$. Thus the code has (at most) two distinct lengths of codewords, and the expected length of a codeword is at most

$L + \eta N.$

By taking $r$ sufficiently large, $L = \lfloor r(h+ \eta)\rfloor + 1$ and $N = \lceil r \log_2 m \rceil + 1$, we may assume that $L < r(h + 2\eta)$. Therefore the expected length is at most

$r(h + 2\eta) + \eta (r \log_2 m + 2)$;

when $\eta$ is sufficiently small this is less than $r(h+ \epsilon)$.

In Part C of my course I plan to give this proof before defining the AEP, and then take (a) and (b) as its definition for general sources. As a preliminary, I plan to prove the Weak Law of Large Numbers from Chebyshev’s Inequality.

#### AEP for general sources

Going beyond this, things quickly get difficult. Recall that a source producing symbols $X_1, X_2, \ldots$ is stationary if $(X_1,\ldots,X_t)$ and $(X_{1+s},\ldots, X_{t+s})$ have the same distribution for all $s$ and $t$, and ergodic if, with probability $1$, the observed frequency of every $t$-tuple of symbols converges to its expected value. There is a nice proof using Fekete’s Lemma in Welsh Codes and cryptography (see page 78) that any stationary source has an entropy, defined by

$\displaystyle \lim_{r \rightarrow \infty} \frac{H(X_1, X_2,\ldots, X_r)}{r}.$

However the proof that a stationary ergodic source satisfies the AEP is too hard for my course. (For instance the sketch on Wikipedia blithely uses Lévy’s Martingale Convergence Theorem.) And while fairly intuitive, it is hard to prove that a memoryless source is ergodic — in fact this is essentially the Strong Law of Large Numbers — so there are difficulties even in connecting the memoryless motivation to the general result. All in all this seems to be a place to give examples and state results without proof.

## Linearly recurrent sequences and LFSRs

September 17, 2019

Part B of my Cipher Systems course is on stream ciphers. Modern stream ciphers, such as Trivium, are typically constructed by judiciously introducing non-linearity into a long-period linear stream cipher. The linear case is therefore still important.

By one mathematical definition, a Linear Feedback Shift Register (LFSR) is a function $F : \mathbb{F}_2^\ell \rightarrow \mathbb{F}_2^\ell$ of the form

$(x_0,\ldots, x_{\ell-2}, x_{\ell-1}) \mapsto \bigl(x_1,\ldots,x_{\ell-1}, f(x_0,\ldots, x_{\ell-1})\bigr)$

where the feedback function $f$ is linear. If

$f(x_0,x_1,\ldots, x_{\ell-1}) = \sum_{t \in T} x_{\ell -t}$

then we say that $f$ has taps $T$ and that $\ell$ is the width of $F$. The keystream generated by $F$ for a key $(k_0,\ldots,k_{\ell-1}) \in \mathbb{F}_2^\ell$ is the sequence $k_0,\ldots, k_{\ell-1},k_{\ell}, k_{\ell+1}, \ldots$, where each $k_s$ for $s \ge \ell$ is defined recursively by

$k_s = f(k_{s-\ell}, \ldots, k_{s-1}) = \sum_{t \in T} k_{s-t}.$

Equivalently, for $s \ge \ell$,

$F\bigl( k_{s-\ell}, \ldots, k_{s-2}, k_{s-1} \bigr) = (k_{s-\ell+1}, \ldots, k_{s-1}, k_s) \quad(\star).$

The equation $(\star)$ is close to the hardware implementation of an LFSR: fill $\ell$ registers with the key $k_0k_1\ldots k_{\ell-1}$ and then repeatedly step by putting the sum of the bits $k_{\ell-t}$ for each $t \in T$ into the rightmost register, shifting the contents of the other registers to the left. (In hardware, these bits would be ‘tapped’ by physical wires.) The keystream can be read off from the rightmost register. This interpretation makes it clear that an LFSR of width $\ell$ is invertible if and only if it taps the bit about to leave the register for good: that is, $\ell$ must be a tap.
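Here is a minimal software model of such an LFSR, a sketch of my own using the convention $k_s = \sum_{t \in T} k_{s-t}$ fixed above; it generates the width-$4$ example used in the next section.

```python
def keystream(taps, key, n):
    """First n bits of the keystream of the LFSR with the given taps,
    starting from the key, a list of bits of length ell = max(taps)."""
    bits = list(key)
    while len(bits) < n:
        s = len(bits)
        bits.append(sum(bits[s - t] for t in taps) % 2)
    return bits[:n]

# The width-4 LFSR with taps {3, 4}: period 15, as in the example below.
print(keystream({3, 4}, [0, 0, 0, 1], 20))
# [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0]
```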

The period of an invertible LFSR $F$ is the least $P \in \mathbb{N}$ such that $F^P = \mathrm{id}$. One way to find $P$ uses that in the canonical basis of $\mathbb{F}_2^\ell$, the matrix (acting on column vectors) representing $F$ is the companion matrix of the polynomial $X^\ell + \sum_{t \in T}X^{\ell - t}$. Thus $F^m = \mathrm{id}$ if and only if $X^\ell + \sum_{t \in T}X^{\ell -t}$ divides $X^m + 1$. This may seem concise written here, but in practice it takes well over a lecture to remind students of matrices, find the matrix representing $F$ and determine its minimal polynomial. All this adds up to a substantial diversion from cryptography; I fear it lost most students in my Cipher Systems course.

Instead, this year I’m going to give up on linear algebra and use the theory of linearly recurrent sequences and infinite power series. My plan is to subsume all of the ring-theoretic background into the following definition.

Definition. A polynomial $f(X)$ annihilates a power series $K(X)$ if $f(X)K(X)$ is a polynomial.

For example, the LFSR of width $4$ with taps $\{3,4\}$ generates the keystream ${\bf 0001}00110101111{\bf 0001} \ldots$ of period $15$. The corresponding power series is

\begin{aligned}(X^3 + X^6 + &X^7 + X^9 + X^{11} + X^{12} + X^{13} + X^{14})(1+X^{15} + \cdots ) \\ &= \displaystyle \frac{X^3 + X^6 + X^7 + X^9 + X^{11} + X^{12} + X^{13} + X^{14}}{1+X^{15}}.\end{aligned}

Clearly the power series is annihilated by $1+X^{15}$. But since $k_s = k_{s-3} + k_{s-4}$ for $s \ge 4$, it is also annihilated by $1+X^3+X^4$. Correspondingly,

$X^3 + X^6 + X^7 + X^9 + X^{11} + X^{12} + X^{13} + X^{14} + \cdots = \frac{X^3}{1+X^3+X^4}.$

More generally, the keystream generated by an LFSR with taps $T$ is annihilated by the feedback polynomial $f_T(X) = 1 + \sum_{t \in T} X^t$. Indeed, the coefficient of $X^s$ in the product is $k_s + \sum_{t \in T} k_{s-t} = 0$, and so $f_T(X)K(X)$ is a polynomial of degree at most $\ell-1$. Hence the period of $F$ is the minimum $m \in \mathbb{N}$ such that $X^m + 1$ is divisible by $1 + \sum_{t \in T} X^t$. (This is equivalent to the condition found before by taking reciprocal polynomials.)
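These claims are easy to check by machine. The following sketch (mine, with polynomials over $\mathbb{F}_2$ stored as integer bitmasks) multiplies the truncated keystream power series by $f_T(X)$, checks that all coefficients from degree $\ell$ onwards vanish, and finds the period as the least $m$ with $f_T(X)$ dividing $X^m + 1$.

```python
def keystream(taps, key, n):
    bits = list(key)
    while len(bits) < n:
        s = len(bits)
        bits.append(sum(bits[s - t] for t in taps) % 2)
    return bits[:n]

def poly_mod(a, b):
    """Remainder of a modulo b, polynomials over F_2 as integer bitmasks."""
    while a.bit_length() >= b.bit_length():
        a ^= b << (a.bit_length() - b.bit_length())
    return a

taps, ell = {3, 4}, 4
ks = keystream(taps, [0, 0, 0, 1], 200)

# coefficient of X^s in f_T(X)K(X), truncated; f_T has exponent set {0} | taps
prod = [sum(ks[s - t] for t in {0} | taps if t <= s) % 2 for s in range(200)]
print(all(c == 0 for c in prod[ell:]))   # True: a polynomial of degree < 4

def period(taps):
    """Least m >= 1 such that f_T(X) divides X^m + 1."""
    f = 1 + sum(1 << t for t in taps)    # bitmask of f_T(X)
    m = 1
    while poly_mod((1 << m) ^ 1, f) != 0:
        m += 1
    return m

print(period(taps))                      # 15
```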

One quickly gets further non-trivial results.

#### Keystream of maximum period

Given $h(X) = h_0 + h_1X + \cdots + h_{\ell-1}X^{\ell-1}$, we claim that the equation $K(X) = h(X)/f_T(X)$ defines a corresponding keystream. To avoid the need to define inverses in the ring of power series, let’s agree to interpret this equation as equivalent to

$(1+\sum_{t \in T}X^t)(k_0 + k_1X + k_2X^2 + \cdots + k_sX^s + \cdots ) = h(X)$.

In turn, this is equivalent to the linear equations

$h_s = k_s + \sum_{t \in T : t \le s} k_{s-t}$

that must be satisfied by $k_0, k_1, k_2, \ldots$; clearly these equations have a unique solution. In particular, there is a keystream such that $K(X) = 1/f_T(X)$. (This holds even if the LFSR is not invertible.) It’s obvious that $K(X)$ is annihilated only by the multiples of $f_T(X)$. In the invertible case we have seen that $f_T(X)$ divides $X^m + 1$ if and only if $m$ is a multiple of the period $P$ of the LFSR. Therefore there is a keystream of period $P$.

(For comparison, using the linear algebra method, one argues that the matrix representing an LFSR is cyclic and that, as is clear from the left-shift in the definition, the corresponding $\mathbb{F}_2[X]$-module is generated by $(0,\ldots,0,1)$. The period of the keystream for $(0,\ldots, 0,1)$ is therefore the period of $F$. Again this is an argument that becomes far more long-winded when enough detail is included to make it accessible to students.)

#### Sum of two keystreams

Suppose, in an attempt to increase the linear complexity, we add the keystream ${\bf 001}0111{\bf 001} \ldots$ of the LFSR of width $3$ with taps $\{2,3\}$ to the keystream with period $15$ in the example above. The new keystream has period $105$ and is annihilated by

$(1+X^3+X^4)(1+X^2+X^3) = 1+X^2+X^4+X^5+X^7.$

It is therefore a keystream of the LFSR with taps $\{2,4,5,7\}$. I plan to get students to find these taps by the slow method of solving the linear equations $(\star)$, and then, I hope, impress them with how much faster things are with a bit more algebra.
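Here is a machine check (my own, reusing the keystream generator sketched earlier) that the summed keystream really does satisfy the recurrence with taps $\{2,4,5,7\}$ and has period $105$.

```python
def keystream(taps, key, n):
    bits = list(key)
    while len(bits) < n:
        s = len(bits)
        bits.append(sum(bits[s - t] for t in taps) % 2)
    return bits[:n]

n = 420
a = keystream({3, 4}, [0, 0, 0, 1], n)    # period 15
b = keystream({2, 3}, [0, 0, 1], n)       # period 7
c = [(x + y) % 2 for x, y in zip(a, b)]   # the summed keystream

# c satisfies k_s = k_{s-2} + k_{s-4} + k_{s-5} + k_{s-7} ...
print(all(c[s] == (c[s-2] + c[s-4] + c[s-5] + c[s-7]) % 2 for s in range(7, n)))

# ... and has period 105 (and not, say, 35)
print(all(c[s] == c[s + 105] for s in range(n - 105)))
print(any(c[s] != c[s + 35] for s in range(n - 35)))
```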

(Incidentally one can have endless fun with the two competing conventions for taps. While less standard, this example shows that our choice in which $k_s = \sum_{t \in T}k_{s-t}$ is a natural fit for the annihilator method. It is also clearly best for the Berlekamp–Massey algorithm.)

#### Length of all keystreams

One computer demonstration that I think worked quite well last year used a Mathematica notebook to compute the lengths of all keystreams for all invertible LFSRs of width $\ell \le 5$. I then showed that the keystreams of maximum possible period $2^{\ell}-1$ come from the LFSRs whose minimal polynomial is primitive. Of course most students, quite reasonably, didn’t know what ‘primitive’ meant, but they could see that Mathematica did, and that there was some underlying pattern related to the factorization of the minimal polynomial.

Since the matrix representing an LFSR with taps $T$ in the canonical basis of $\mathbb{F}_2^\ell$ is the companion matrix of $X^\ell + \sum_{t \in T}X^{\ell -t}$, the corresponding $\mathbb{F}_2[X]$-module is

$\mathbb{F}_2[X]/\langle X^\ell + \sum_{t \in T} X^{\ell - t}\rangle.$

Switching to the reciprocal polynomial, the lengths of the keystreams are the orbit sizes of $X$ in its action on $\mathbb{F}_2[X]/\langle f_T(X) \rangle$. When $f_T(X)$ is irreducible and primitive, we get a single non-zero orbit. The general case is quite intricate, and stated in Theorem 8.63 in Lidl and Niederreiter Finite fields.

For this retelling, we need the following function. Given $r \in \mathbb{N}$ with $2^{m-1} < r \le 2^m$, let $b(r) = m$. Thus the sequence $b(r)$ starts

$0, 1, 2, 2, 3, 3, 3, 3, 4, \ldots$

and $b(r)$ is the number of bits in the binary form of $r-1$.

In the next two results, let $g(X) \in \mathbb{F}_2[X]$ be an irreducible polynomial of degree $d$ and let $t$ be the order of $X$ in the finite field $\mathbb{F}_2[X]/g(X)$.

Proposition. Let $r \ge 2$. The orbits of $X$ on $\mathbb{F}_2[X]/g(X)^r$ not contained in the unique maximal ideal $g(X)\mathbb{F}_2[X]/g(X)^r$ have size $2^{b(r)}t$ and there are $\bigl(2^{rd} - 2^{(r-1)d}\bigr)/ 2^{b(r)} t$ of them.

Proof. Let $J$ be the complement of the maximal ideal. By hypothesis, $X^t + 1$ is divisible by $g(X)$. Since $t$ is odd,

$X^{2^m t} + 1 = (X^t+1)^{2^m}$

is divisible by $g(X)^r$ if and only if $2^m \ge r$. Therefore $X$ has order $2^{b(r)} t$ in its action on $J$. The elements of $J$, lying outside the unique maximal ideal, are invertible. Hence if $p(X) \in J$ then $p(X)X^m = p(X)$ mod $g(X)^r$ if and only if $X^m = 1$ mod $g(X)^r$. It follows that all the orbits of $X$ on $J$ have size $2^{b(r)}t$. Since $J$ has size $2^{dr} - 2^{d(r-1)}$, the number of orbits is as claimed. $\Box$

Theorem. The non-zero orbits of $X$ on $\mathbb{F}_2[X]/g(X)^e$ have sizes $2^b t$ for $0 \le b \le b(e)$; there are $(2^d-1)/t$ orbits of size $t$. If $b \ge 1$ and $2^b \le e$ then the number of orbits of size $2^b t$ is

$\displaystyle \frac{2^{2^b d} - 2^{2^{b-1}d}}{2^b t}.$

The number of orbits of size $2^{b(e)} t$ is

$\displaystyle \frac{2^{e d} - 2^{2^{b(e)-1}d}}{2^{b(e)} t}.$

Proof. Let $b \in \mathbb{N}$. Suppose that $2^b \le e$. For each $r$ such that $2^{b-1} < r \le 2^{b}$ we have $b(r) = b$ and the count of orbits of size $2^b t$ in the previous proposition telescopes as required. A similar telescoping works to count the orbits of size $2^{b(e)} t$. Finally we count the zero orbit $\{0\}$ and $(2^d-1)/t$ orbits of size $t$, which together form the minimal ideal $g(X)^{e-1} \mathbb{F}_2[X]/g(X)^e$, isomorphic to the finite field $\mathbb{F}_2[X]/g(X)$. $\Box$

For example, take $g(X) = 1+X+X^2$ and $e = 5$. Then $d = 2$, $t = 3$, $b(e) = 3$ and the orbits of $X$ have sizes $1, 3$; $6^2$; $12^{20}$; $24^{32}$, where the semicolons indicate the split according to $b \in \mathbb{N}$ used in the theorem, and the exponents indicate multiplicities. These orbit sizes form the multiset $\{1,3,6^2, 12^{20}, 24^{32} \}$.
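This is easy to confirm by brute force; here is a sketch of my own, with bitmask polynomials as in the earlier sketches.

```python
from collections import Counter

def poly_mul(a, b):
    """Product of polynomials over F_2 stored as integer bitmasks."""
    res = 0
    while b:
        if b & 1:
            res ^= a
        a <<= 1
        b >>= 1
    return res

def poly_mod(a, b):
    while a.bit_length() >= b.bit_length():
        a ^= b << (a.bit_length() - b.bit_length())
    return a

g = 0b111                                  # 1 + X + X^2
modulus = g
for _ in range(4):
    modulus = poly_mul(modulus, g)         # (1 + X + X^2)^5, of degree 10

seen, sizes = set(), Counter()
for a in range(1 << 10):                   # all residues modulo g(X)^5
    if a in seen:
        continue
    x, size = a, 0
    while x not in seen:                   # X is invertible, so this is a cycle
        seen.add(x)
        size += 1
        x = poly_mod(poly_mul(x, 0b10), modulus)   # multiply by X
    sizes[size] += 1

print(sorted(sizes.items()))   # [(1, 1), (3, 1), (6, 2), (12, 20), (24, 32)]
```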

Just for the next corollary, we define the product of two multisets of orbit sizes to be the multiset obtained by replacing each pair of elements $s_1$ and $s_2$, respecting multiplicities, with $\gcd(s_1,s_2)$ copies of $\mathrm{lcm}(s_1,s_2)$. For example

$\{1,3,6^2\} \times \{1,7\} = \{1,3,6^2,7,21,42^2\}.$

Corollary. Let $F$ be an invertible LFSR with taps $T$. Let $f_T(X) = g_1^{e_1} \ldots g_c^{e_c}$ be the irreducible factorization of its feedback polynomial. Then the multiset of orbit sizes of $F$ is the product of the multisets of orbit sizes for each factor $g_i^{e_i}$, as given by the previous theorem.

Proof. This is routine from the Chinese Remainder Theorem: the orbit of $(a_1,\ldots,a_c)$ under multiplication by $X$ has size the lowest common multiple of the sizes of the orbits of the $a_i$, and a pair of orbits of sizes $s_1$ and $s_2$ therefore contributes $s_1 s_2/\mathrm{lcm}(s_1,s_2) = \gcd(s_1,s_2)$ orbits of this size. $\Box$

## A toy model for the proof of Shannon’s Noisy Coding Theorem for the BSC

September 6, 2019

In the Binary Symmetric Channel (BSC) with symbol error probability $p$, an input bit flips with probability $p < 1/2$. A binary code of length $n$ can be used to communicate on the BSC by sending each bit of a codeword $u$, collecting the $n$ received bits as a word $v$ still of length $n$, and then decoding $v$ as the nearest codeword $u'$ to $v$ with respect to Hamming distance. For instance, using the repetition code of length $2e+1$, a decoding error occurs if and only if $e+1$ or more bits get flipped in the channel. This event has probability

\begin{aligned}\sum_{k=e+1}^n \binom{n}{k} p^k(1-p)^{n-k} &\le p^{e+1}(1-p)^e \sum_{k=e+1}^n \binom{n}{k} \\ &\le p^e (1-p)^e 2^{n-1} \\ &\le \rho^e\end{aligned}

where $\rho = 4p(1-p) < 1$. Hence we can communicate with decoding error probability less than $\epsilon$ provided $e \ge \log \epsilon / \log \rho$. For example, if $p = 1/4$ then $\rho = 3/4$ and $\rho^e < 5\%$ provided $n \ge 23$. With this solution we send $23$ bits through the channel for each bit of the message. The information rate is therefore $1/23$. Can we do better?
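We can compare the crude bound $\rho^e$ with the exact error probability; here is a quick sketch (the parameter choices are mine).

```python
from math import comb

def rep_error(p, e):
    """Exact decoding error probability of the repetition code of length
    n = 2e+1 on the BSC: the probability of e+1 or more bit flips."""
    n = 2 * e + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(e + 1, n + 1))

p = 0.25
rho = 4 * p * (1 - p)
for e in [5, 11, 20]:
    print(2 * e + 1, rep_error(p, e), rho**e)
# For e = 11 (n = 23) the exact error probability is well under the 5%
# promised by the cruder bound rho^11 = 0.75^11, about 4.2%.
```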

#### Capacity and mutual information

The capacity of the BSC is $1-h(p)$, where

$h(p) = -p \log_2 p -(1-p) \log_2(1-p)$

is Shannon’s entropy function. Capacity has a slightly technical definition in terms of mutual information. Recall that if $X$ and $Y$ are random variables then the conditional entropy $H(X|Y)$ is the expected value

$H(X|Y) = \sum_{y} \mathbb{P}[Y=y]H(X|Y=y),$

where $H(X|Y=y)$ is the entropy of the random variable $X$, conditioned on $Y=y$. Informally, $H(X|Y)$ is the uncertainty in $X$ that remains after learning $Y$. The mutual information $I(X;Y)$ is then $H(X) - H(X|Y)$; informally this is the amount of information about $X$ we learn from $Y$. For example, if $p =0$ then $Y$ determines $X$, so all uncertainty is removed, and $I(X;Y) = H(X)$. By

$I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y|X),$

mutual information is symmetric: the amount we learn about $X$ from $Y$ is also the amount we learn about $Y$ from $X$. Is this intuitive?

Returning to the BSC, consider the most ‘informative’ source $X^\star$ for which $\mathbb{P}[X^\star=0] = \mathbb{P}[X^\star=1] = \frac{1}{2}$. Let $Y$ be the output bit. The capacity is then $I(X^\star;Y)$. We saw a moment ago that, as expected, if $p=0$ then the capacity is $H(X^\star) = 1$. The most convenient way to compute the capacity for general $p$ uses the right-hand side of the identity above. Since $X^\star$ is uniform, so is $Y$, and hence $H(Y) = 1$. Moreover

\begin{aligned} H(Y|X^\star) &= \frac{1}{2} H(Y | X^\star = 0) + \frac{1}{2}H(Y|X^\star=1) \\ &= 2 \times \frac{1}{2} (-p \log_2 p -(1-p) \log_2(1-p)) \\ &= h(p)\end{aligned}.

Therefore the capacity is $1-h(p)$, as claimed.
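As a sanity check, here is the same computation done numerically, straight from the joint distribution of $(X^\star, Y)$; the sketch is mine, and uses the standard form $I(X;Y) = \sum q \log_2 \bigl(q/(p_X p_Y)\bigr)$, which is the identity above rewritten.

```python
from math import log2

def h(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

def capacity(p):
    """I(X*;Y) for the BSC with uniform input, from the joint distribution."""
    joint = {(x, y): 0.5 * (1 - p if x == y else p)
             for x in (0, 1) for y in (0, 1)}
    px = {x: joint[x, 0] + joint[x, 1] for x in (0, 1)}
    py = {y: joint[0, y] + joint[1, y] for y in (0, 1)}
    return sum(q * log2(q / (px[x] * py[y])) for (x, y), q in joint.items())

print(capacity(0.25), 1 - h(0.25))   # both approximately 0.1887
```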

#### Shannon’s Noisy Coding Theorem

By Shannon’s Noisy Coding Theorem capacity has the following interpretation: given any $\epsilon > 0$ and any $\eta > 0$, for all sufficiently large $n$ there exists a binary code $C$ of length $n$ of size $> 2^{n (1- h(p) - \eta)}$ such that the probability of a decoding error when $C$ is used to communicate on the BSC, with each codeword equally likely to be sent, is $< \epsilon$. (To be precise, a decoding error means that the received word is not decoded as the sent codeword.) For example, if $p = 1/4$ then

$1-h(p) \approx 0.189.$

Therefore Shannon's Noisy Coding Theorem promises that, by taking $n$ sufficiently large, we can communicate with negligible decoding error probability and information rate $0.188$.

As far as I know, all known proofs of Shannon's Noisy Coding Theorem use the probabilistic method.

#### Outline proof

Let $C = \{u(1),\ldots, u(N)\}$ be a binary code of length $n$ and size about $2^{n (1- h(p) - \eta/2)}$, obtained by choosing each codeword $u(i)$ uniformly and independently at random from $\mathbb{F}_2^n$. Let $p_{i}(C)$ be the probability (in the BSC model) that if $u(i)$ is sent then the received word is wrongly decoded as some $u(j)$ with $j\not=i$. An expectation argument shows that, provided $n$ is sufficiently large,

$\mathbb{E}_{\mathrm{code}}[p_{i}(C)] < \epsilon/2.$

Note that here the expectation is taken with respect to the random choice of $C$ and $p_{i}$ is a random variable (that just happens to be a probability in the BSC model). We'll see shortly why $\epsilon/2$ is the right upper bound to take. It follows by averaging over all codewords that

$\frac{1}{N}\sum_{i=1}^N\mathbb{E}_\mathrm{code}[p_i(C)] < \epsilon/2.$

By the probabilistic method, there exists a particular code $C$ such that

$\frac{1}{N}\sum_{i=1}^N p_i(C) < \epsilon/2.$

Does this mean that we can use $C$ to communicate with decoding error probability $< \epsilon/2$? Unfortunately not, because the previous equation only says that on average, we meet this bound. There could be some codewords $u(i)$, perhaps those unusually near to many other codewords, for which the decoding error probability is higher. But since the average is $\epsilon/2$, at most half the codewords have $p_i(C) \ge \epsilon$. Therefore by removing at most half the codewords, we get a code $C'$ of size $N' \ge N/2$ with decoding error probability $< \epsilon$. It might seem bad that, at worst, $|C'| = |C|/2 = N/2$, but as $N = 2^{(1-h(p)-\eta/2)n}$, we have

$\frac{N}{2} = 2^{(1-h(p) -\eta/2 - 1/n)n}$

and provided $n$ is sufficiently large, $\eta/2 + 1/n < \eta$. This randomly constructed $C'$ proves Shannon's Noisy Coding Theorem for the parameters $\epsilon$ and $\eta$.
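The following toy simulation (mine, purely for intuition; the parameters are arbitrary choices) mimics the argument at a tiny scale: random codes of rate $0.1 < 1 - h(1/4)$, brute-force nearest-codeword decoding, and an estimate of the average decoding error probability, which falls as $n$ grows.

```python
import random

def hamming(u, v):
    return bin(u ^ v).count("1")

p, rate, trials = 0.25, 0.1, 2000
for n in [10, 20, 30]:
    N = 2 ** max(1, round(rate * n))            # code size about 2^(rate n)
    code = random.sample(range(2 ** n), N)      # random distinct codewords
    errors = 0
    for _ in range(trials):
        u = random.choice(code)                 # each codeword equally likely
        noise = sum(1 << i for i in range(n) if random.random() < p)
        v = u ^ noise                           # BSC output
        decoded = min(code, key=lambda c: hamming(c, v))
        errors += decoded != u
    print(n, errors / trials)                   # estimated average error rate
```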

#### Remarks

1. The final expurgation step, and the need to work with $\epsilon/2$ rather than $\epsilon$, can be avoided by choosing a random linear code. Then the multiset of Hamming distances $\{ d(u(i),u(j)) : j\not=i \}$, which determines the probability $p_i(C)$ of a decoding error on sending $u(i)$, is the same for all $i$. Therefore all the $p_i(C)$ are equal. Some linear codes will be uniformly terrible, but by the probabilistic method, some linear codes are uniformly good.
2. Care is needed to turn the outline above into a complete proof. For example, in the first step bounding the $p_i(C)$, one needs Chernoff’s bound or the Central Limit Theorem to justify that when a codeword $u$ is sent, the received word $v$ is, with probability tending to $1$ with $n$, at distance between $(p-\delta)n$ and $(p+\delta)n$ from $u$, for any fixed $\delta > 0$.
3. One subtle feature of the proof is that there are two different probability models in play. I tried to emphasise this above, for example by using $\epsilon$ for the error bounds in the channel probability space, and $\eta$ for the error bounds in the code probability space. A related feature is that we have to work with expectations, rather than probabilities, in the code’s probability space. For instance, it would be nice to have

$\mathbb{P}_\mathrm{code}\bigl[\mathbb{P}_\mathrm{channel}[A_i] > \epsilon\bigr] < \eta$

for sufficiently large $n$, where $A_i$ is the event that a sent $u(i)$ is wrongly decoded as some $u(j)$ with $j\not= i$. But these probabilities seem hard to reason about directly.

#### A toy model

Here is a toy model that has the same split into two probability spaces, but is, I think, much easier to understand. To play Match! using a supplied deck of $N$ cards, each numbered with one of $1, \ldots, n$, the two players each draw a card from the deck (without replacement). If they find they have matching numbers, they both eat some cake. Let $M$ be this ‘in game’ event. Otherwise they go to bed hungry.

We suppose that the manufacturer constructs decks by choosing each card uniformly and independently at random. Let $f_r$ be the frequency of card $r$, for each $r \in \{1,\ldots, n\}$. We have

$\mathbb{P}_\mathrm{in game}[M] = \sum_{r=1}^n \frac{f_r(f_r-1)}{N(N-1)}.$

In the random process used by the manufacturer to construct decks, $f_r$ has the binomial distribution $B(N, 1/n)$. Hence $\mathbb{E}_\mathrm{deck}[f_r] = N/n$ and

$\mathbb{E}_\mathrm{deck}[f_r^2] = \mathrm{Var}_\mathrm{deck}[f_r] + \mathbb{E}_\mathrm{deck}[f_r]^2 = \frac{N(n-1)}{n^2} + \frac{N^2}{n^2}.$

It follows that

\begin{aligned}\mathbb{E}_\mathrm{deck}\bigl[\mathbb{P}_\mathrm{in game}[M]\bigr] &= \frac{1}{N(N-1)}\sum_{r=1}^n \bigl( \frac{N(n-1)}{n^2} + \frac{N^2}{n^2} - \frac{N}{n} \bigr) \\ &= \frac{n}{N(N-1)n^2} \bigl( Nn- N + N^2 - Nn \bigr) \\ &= \frac{1}{n}. \end{aligned}

This can also be seen without calculation: mixing up the two probability models, we can imagine that the manufacturer builds a two card deck, and issues the cards to the players. The probability they are equal is clearly $1/n$.

We can also reason about probabilities in the deck model. For example $\mathbb{P}_\mathrm{deck}\bigl[ \mathbb{P}_\mathrm{in game}[M] = 0 \bigr]$ is the probability that all cards are different, namely

$\frac{n(n-1) \ldots (n-N+1)}{n^N} = \frac{N!}{n^N} \binom{n}{N}.$
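Both calculations are easy to simulate; here is a sketch with $n = 10$ and $N = 4$ (my own parameter choices).

```python
import random
from collections import Counter

n, N, decks = 10, 4, 100000

total, never_match = 0.0, 0
for _ in range(decks):
    freqs = Counter(random.randrange(n) for _ in range(N))   # a random deck
    # exact in-game probability of a match for this particular deck:
    p_match = sum(f * (f - 1) for f in freqs.values()) / (N * (N - 1))
    total += p_match
    never_match += (p_match == 0)

print(total / decks)          # E_deck[P_ingame[M]]: approximately 1/n = 0.1
print(never_match / decks)    # P_deck[P_ingame[M] = 0]: 10*9*8*7/10^4 = 0.504
```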

## Appropriate: Donmar Warehouse

August 28, 2019

Appropriate, directed by Ola Ince at the Donmar Warehouse, is a new production of a 2014 play by Branden Jacobs-Jenkins. The three central characters Toni, Bo and Franz meet in their deceased father’s plantation house to sort out his estate. Their tangled personal and family history, revealed in a long series of explosive dialogues, spans an impressive range of themes: most prominently, parent/child relationships and racism.

Here I’ll concentrate on Toni. Pathologically unable to accept help from anyone, but appointed as sole executor of the estate, she is first seen surrounded by the accumulated detritus of 20 years of her father’s hoarding. Toni is exhausted by caring for her father through a long illness, unaided by her younger brothers, not to mention dealing with the fallout of her son Reese’s brief career as an utterly unsuccessful drug-dealer. Her engrained negativity is excusable. Monica Dolan’s superb performance is nuanced and brings out Toni’s capacity for love, and her need for it. She is however hard to like: hard for the audience, and impossible for her family members. The moment when she snaps is the climax of the play.

The reviews I read beforehand suggested that the album of appalling photographs of lynchings found part-way through the first act would dominate the play. While important, and central to the impressively developed racism theme, it is only one of many elements. When it turns out the photos are highly valuable (unlike the house, which, blighted by a graveyard, is worthless), we quickly see that none of the Lafayette children would hesitate to sell them. The moment when they reflexively reach for the polite formulations of the auction house and ‘private dealer’ is most revealing. Of them all, Toni is the most eager to excuse her father’s racism, even as the evidence for it becomes overwhelming. She is obliged to confront her own attitudes in several clever set pieces. The most impressive, however, concerns her youngest brother Franz, a diligent 12-stepper, and his fiancée River, who is firmly committed to helping him on his journey with whatever resources her relentless spirituality (neatly shading into meaningless psychobabble) can provide. When Bo’s wife Rachel wrongly assumes she is a Native American, rather than a New Yorker née Tracy, there is a brilliant moment where we see both characters expose their darkest and most instinctive reactions.

The first act of the play, while highly entertaining (the witty touches begin even with the title of the play), did at times reduce to a sequence of monologues-as-dialogues, all delivered at a surprisingly high volume for normal conversation. The febrile atmosphere was convincing, but maybe less oppressive than the director intended. Still, it was effective as a scene-setting prologue to the longer second act. I’m not sure if it was the actors settling down, or a deliberate directorial decision, but anyway, the vocal and dramatic range vastly improved between the acts. The second act mixed comedy and tragic irony in as good a piece of theatre as I’ve seen in recent years.

The play seems to deliberately avoid tying up its loose ends. I suspect the lack of any obvious character development, except perhaps in two of the teenage children, is also deliberate. At the end, we are left with the characters, all ruthlessly exposed, heading back to their normal lives. And instead of any resolution, we are left to think about them, their problems, and the many themes of the play. Overall, an excellent production of a remarkable play, that leaves one with much to chew over.

## Why is Adobe Acrobat so innumerate?

July 12, 2019

The image below shows Adobe Acrobat’s conversion to Microsoft Powerpoint of a typical slide produced by LaTeX and Beamer. The original is shown beneath it.

For the £70 annual fee Adobe charge for turning on this misfeature, I think one might expect something better. The execrable kerning is just one final kick in the teeth. Fortunately for my credit card, I was able to brandish the Consumer Contracts Regulations in an online chat and have been promised a full refund. I’ve seen similar glitches in journal papers: the culprit now seems clear. As an alternative on the Mac, I suggest using pdf-to-keynote and then exporting from Keynote. Or better still, have nothing to do with either Adobe or Microsoft — if only that were feasible. In fairness, I should note that the output from Adobe’s conversion is editable in Microsoft Powerpoint; unfortunately the only reasonable edit one could make is to delete the lot and start over.

## Difference attacks on toy block ciphers

May 22, 2019

After setting an examination for my Cipher Systems course that turned me from hero to zero in the eyes of the students (you can guess why), I’m somewhat disillusioned with the subject. One unavoidable issue is that it’s hard to isolate the interesting mathematical ideas from the forest of notation and definitions needed to define a modern block cipher such as AES. With a few honorable exceptions, it seems most textbooks don’t even try.

As block ciphers go, AES is a relatively elegant design: its cryptographic strength comes ultimately from the pseudo-inverse function $P : \mathbb{F}_2^8 \rightarrow \mathbb{F}_2^8$ defined by identifying $\mathbb{F}_2^8$ with the finite field $\mathbb{F}_{2^8}$ and then inverting non-zero elements. (The zero word is the unique fixed point of $P$.) In a typical round the input in $\mathbb{F}_2^{128}$ is split into $16$ bytes and the non-linear function $P$ is applied to each byte; the bytes are then combined and mixed up by two $\mathbb{F}_2$-affine permutations, and these are followed by a final key addition in $\mathbb{F}_2^{128}$.

Stream ciphers seem to be even messier: I asked without great success on crypto.stackexchange.com for a simply defined but not-trivially-breakable stream cipher I could use in my course.

As a toy model for a difference attack in my course, I used the block cipher with plaintext and ciphertexts $\mathbb{F}_2^8$, key space $\mathbb{F}_2^8 \times \mathbb{F}_2^8$ and encryption functions defined by

$e_{(k,k')}(x) = P(x+k) + k'.$

Suppose you know a single plaintext/ciphertext pair $(x,z)$. For each guessed first key $k_\star$ there is a unique $k'_\star$ such that $e_{(k_\star,k'_\star)}(x) = z$, namely

$k'_\star = P(x+k_\star) + z.$

Thus a single known plaintext/ciphertext pair does not reveal either component of the key. To motivate the attack in this post, let us consider what two pairs $(x,z)$ and $(x+\Delta, z+\Gamma)$ reveal. To save space, write $x_\Delta$ for $x + \Delta$. Suppose that $(k_\star, k'_\star)$ is a possible key:

\begin{aligned} x &\mapsto x + k_\star\hskip7.5pt \mapsto P(x+k_\star) \hskip7pt \mapsto P(x+k_\star) + k'_\star \hskip8pt = z \\ x_\Delta &\mapsto x_\Delta + k_\star \mapsto P(x_\Delta + k_\star) \mapsto P(x_\Delta+k_\star) +k'_\star = z + \Gamma.\end{aligned}

A little thought shows that another possible key is $(k_\star + \Delta, k'_\star + \Gamma)$:

\begin{aligned} x &\mapsto x_\Delta + k_\star \mapsto P(x_\Delta+k_\star) \mapsto P(x_\Delta+k_\star) + k'_\star + \Gamma = z \\ x_\Delta &\mapsto x + k_\star\hskip7.5pt \mapsto P(x + k_\star)\hskip7pt \mapsto P(x+k_\star) +k'_\star +\Gamma\hskip8pt = z + \Gamma.\end{aligned}

Observe that the differences of top and bottom are in both cases

$\Delta \mapsto \Delta \mapsto \Gamma \mapsto \Gamma.$

This illustrates the critical point that differences are invariant under key addition, but typically changed by non-linear functions such as $P$. By the lemma below, for any non-zero input difference $\Delta \in \mathbb{F}_2^8$, the output difference

$P(x+k_\star) + P(x + k_\star + \Delta)$

can be any of $127$ of the $256$ elements of $\mathbb{F}_{2}^8$. This is usually taken as a sign of the cryptographic strength of $P$. But here it means that one can (apart from in one case) determine the pair $\{x+k_\star,x+k_\star+\Delta\}$ from $\Delta$ and $\Gamma$. For the toy cipher, this pair is $\{x+k, x+\Delta+k\}$, and so we determine $\{k, k+\Delta\}$, leaving exactly two possible keys.
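Here is the whole attack in a few lines of Python (my own sketch: the post leaves the field unspecified, so I take the AES modulus $X^8 + X^4 + X^3 + X + 1$ to realise $\mathbb{F}_{2^8}$, and the key and plaintext are arbitrary choices).

```python
def gf_mul(a, b):
    """Multiplication in F_256, assuming the AES modulus X^8+X^4+X^3+X+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def P(x):
    """Pseudo-inverse: 0 -> 0, otherwise the inverse in F_256."""
    return next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)

def encrypt(k, k2, x):
    return P(x ^ k) ^ k2

# secret key, and two plaintext/ciphertext pairs with input difference Delta
k, k2, x, Delta = 0x5A, 0xC3, 0x17, 0x01
z, zD = encrypt(k, k2, x), encrypt(k, k2, x ^ Delta)
Gamma = z ^ zD

# solve P(alpha) + P(alpha + Delta) = Gamma by brute force; each solution
# alpha = x + k_star gives the key guess k_star = alpha ^ x
candidates = {a ^ x for a in range(256) if P(a) ^ P(a ^ Delta) == Gamma}
print(sorted(candidates), k in candidates)
# two candidates {k, k + Delta}, or four in the exceptional case Gamma = 1/Delta
```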

#### Differences through pseudo-inverse.

Lemma. Let $\delta \in \mathbb{F}_{2^8} \backslash \{0\}$. If $\gamma \in \mathbb{F}_{2^8} \backslash \{0,1/\delta\}$ then the equation $p(\alpha) + p(\alpha+\delta) = \gamma$ has exactly zero or two solutions.

Proof. If $\alpha \in \{0,\delta\}$ then

$p(\alpha)+p(\alpha+\delta) = p(0) + p(\delta) = 0 + 1/\delta = 1/\delta.$

We may therefore assume that $\alpha\not\in \{0,\delta\}$ and so $p(\alpha) = \alpha^{-1}$ and $p(\alpha+\delta) = (\alpha+\delta)^{-1}$. Now $\alpha^{-1} + (\alpha+\delta)^{-1} = \gamma$ if and only if $\delta = \gamma \alpha (\alpha + \delta)$. This is a quadratic equation in $\alpha$; since its linear coefficient $\gamma\delta$ is non-zero, it does not have a repeated root. $\Box$

(In the exceptional case $\gamma = 1/\delta$, there are four solutions, namely $0$, $\delta$, and the two roots of $(\alpha/\delta)^2 + (\alpha/\delta) + 1$ lying in $\delta \mathbb{F}_{2^2}^\times \subseteq \mathbb{F}_{2^8}$. In fact, since $(\alpha/\delta)^2 + (\alpha/\delta) + 1/\gamma\delta$ has $2$ roots if and only if $1/\gamma\delta \in \{\beta^2 + \beta : \beta \in \mathbb{F}_{2^8}\}$, which is a co-dimension $1$ subspace of $\mathbb{F}_{2^8}$, there are exactly $127$ output differences $\gamma$ for each non-zero input difference $\delta$, each except $1/\delta$ arising from exactly two inputs $\alpha$.)
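The counts in the lemma and in this paragraph can be confirmed by brute force, with the same assumed AES modulus as in the sketch above.

```python
from collections import Counter

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

# table of pseudo-inverses
P = [0] * 256
for x in range(1, 256):
    P[x] = next(y for y in range(1, 256) if gf_mul(x, y) == 1)

delta = 1
counts = Counter(P[a] ^ P[a ^ delta] for a in range(256))
print(len(counts))                   # 127 distinct output differences gamma
print(sorted(set(counts.values())))  # [2, 4]
print(counts[P[delta]])              # 4: the exceptional gamma = 1/delta
```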

In my course I spent an entire lecture on the difference attack, simplifying the lemma by only considering $\Delta = 0000\,0001$, corresponding to $\delta = 1 \in \mathbb{F}_{2^8}$, and omitting the parenthetical paragraph entirely. Still I’m fairly sure no-one except the keenest M.Sc. students got the idea. Probably I will scrap it next year, and instead give an example from a different toy block cipher with a more easily defined $S$-box. Or maybe, since there is no chance of universal comprehension, I should just leave it to a problem sheet, with a hint that the lemma above follows from Hilbert’s Theorem 90. (Joke.)

#### Variations on pseudo-inverse

• The exceptional case $\gamma=1/\delta$ is annoying. It arises because $X^2+X+1$ has a root in $\mathbb{F}_{2^8}$. Over any $\mathbb{F}_{2^c}$ with $c$ odd, there are $2^{c-1}$ output differences for any input difference $\delta$. For instance, over $\mathbb{F}_{2^2}$ we only see the exceptional case (and even more embarrassingly, the pseudo-inverse function $p$ is linear), while over $\mathbb{F}_{2^3}$, defined by $\beta^3 + \beta = 1$, the output differences for input difference $1$ are $\{1,1+\beta, 1+\beta^2, 1+\beta+\beta^2\}$; these are the (no-longer-so) exceptional $\gamma = 1$, and the reciprocals of the non-zero elements in the subspace

$\{\alpha^2 + \alpha : \alpha \in \mathbb{F}_{2^3} \} = \langle \beta, \beta^2\rangle.$

• It is mathematically tempting to replace the tricky finite field $\mathbb{F}_{2^8}$ with $\mathbb{F}_p$ for $p$ an odd prime. Taking the input difference $1$, now applied (and extracted) using arithmetic in $\mathbb{F}_p$, we have

$p(x+1) - p(x) = 1/(x+1) - 1/x = -1/x(x+1)$

for $x\not\in \{-1,0\}$. Quadratic equations are now more easily understood: we can just complete the square to see that roughly half of the elements of $\mathbb{F}_p^\times$ are output differences; the exact count, and whether the output difference $1$ again has double the probability of the others, depends on $p$ modulo $3$ (see the sketch after this list).

• I hoped that $\mathbb{Z}/2^n\mathbb{Z}$ might give a nice example, using the function $q$ such that $q(2x) = 1/(2x+1) -1$ and $q(2x+1) = 1/(2x+1)$. This function is weak with respect to the input difference of $1$, since, by definition, $q(2x+1)-q(2x) = 1$ for any $x$. Perhaps surprisingly, this makes $1$ less suitable for the attack. The situation for even input differences is somewhat messy. Clearly this is a non-starter for my course.
• It’s tempting to try $s(\alpha) = \alpha^2$. But because of the identity

$s(\alpha + \delta) = (\alpha + \delta)^2 = \alpha^2 + \delta^2 = s(\alpha) + \delta^2,$

an input difference of $\delta$ transitions deterministically (i.e. the input $\alpha$ is irrelevant) to an output difference of $\delta^2$. One could replace the one-time-pad key addition in my toy block cipher with $s$ and run essentially the same attack, but as a direct replacement for pseudo-inverse, squaring has completely the wrong properties.

• Given this, one might try $r(\alpha) = \alpha^3$, working in $\mathbb{F}_{2^c}$. We have

$(\alpha+1)^3 + \alpha^3 = \alpha^2 + \alpha + 1,$

so by the argument already seen, there is an affine codimension $1$ subspace of output differences. Unfortunately cubing is only invertible if $c$ is odd; this is the case when the subspace is properly affine, and it rules out $\mathbb{F}_{2^8}$.
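Here is the promised sketch for the $\mathbb{F}_p$ bullet (my own check): tabulating the output differences of pseudo-inverse for input difference $1$ shows the two behaviours according to $p$ modulo $3$.

```python
from collections import Counter

def diff_counts(p):
    """Output differences of pseudo-inverse on F_p for input difference 1."""
    inv = [0] + [pow(x, -1, p) for x in range(1, p)]
    return Counter((inv[(x + 1) % p] - inv[x]) % p for x in range(p))

print(diff_counts(7))    # p = 1 mod 3: Counter({1: 4, 3: 2, 4: 1})
print(diff_counts(11))   # p = 2 mod 3: Counter({1: 2, 5: 2, 9: 2, 10: 2, 6: 2, 4: 1})
```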

## The sum of all hook characters

April 2, 2019

Given a partition $\lambda$ of $n$, let $\chi^\lambda$ denote the irreducible character of the symmetric group $S_n$ canonically labelled by the partition $\lambda$. For example, $\chi^{(n)} = 1$ is the trivial character, $\chi^{(1^n)}$ is the sign character, and $\chi^{(n-1,1)}$ is the irreducible constituent of degree $n-1$ of the natural permutation character $\pi$. Thus $\pi(g) = |\mathrm{Fix}(g)|$ and $\chi^{(n-1,1)}(g) = |\mathrm{Fix}(g)| - 1$ for $g \in S_n$. Let $\Gamma = \sum_{a=0}^{n-1} \chi^{(n-a,1^a)}$ be the sum of all the characters labelled by hook partitions. Let $\mathcal{O}_n$ be the set of permutations in $S_n$ having only cycles of odd length.

A nice short paper by Jay Taylor uses characters of the symmetric group labelled by skew partitions to prove that

$\Gamma(g) = \begin{cases} 2^{\mathrm{cyc}(g) - 1} & g \in \mathcal{O}_n \\ 0 & g \not\in \mathcal{O}_n \end{cases}$

where $\mathrm{cyc}(g)$ is the number of cycles in $g$.

As Taylor notes, this result is a special case of a more general theorem of Regev, proved by him using Lie superalgebras. Taylor’s paper gives a short proof using skew characters of the symmetric group. The purpose of this post is to give an alternative proof, using only a basic result on exterior powers of the natural permutation character and to suggest some generalizations not contained in Regev’s original results.

#### Background: a generating function for exterior powers.

Let $\bigwedge^k$ denote the $k$th exterior power of a character or representation. Let $V$ be a complex representation of a finite group $G$ with character $\chi$ and let $g \in G$ have eigenvalues $\lambda_1, \ldots, \lambda_d$ in its action on $V$. Thus $\chi(g) = \lambda_1 + \cdots + \lambda_d$. Let $v_1,\ldots, v_d$ be a basis for $V$ in which $g$ acts diagonally. Then

$\bigwedge^r V = \langle v_{i_1} \wedge\cdots \wedge v_{i_r} : i_1 < \ldots < i_r \rangle_\mathbb{C}$

and we see that

$(\bigwedge^r \chi) (g) = \sum_{i_1 < \ldots < i_r} \lambda_{i_1} \ldots \lambda_{i_r}.$

This is the $r$-th elementary symmetric polynomial evaluated at the eigenvalues $\lambda_1, \ldots, \lambda_d$. Hence

$(1)\quad \sum_{r=0}^d (\bigwedge^r \chi)(g) x^r = (1+\lambda_1 x) \cdots (1+\lambda_d x).$

This generating function is the counterpart for exterior powers of the Molien series, defined using symmetric powers, that is fundamental to invariant theory.

#### Background: exterior powers of the natural module.

Let $V = \langle e_1, \ldots, e_n\rangle_\mathbb{C}$ be the natural representation of $S_n$ over the complex numbers. Now $\bigwedge^n V = \langle e_1 \wedge \cdots \wedge e_n \rangle$ is the sign representation of $S_n$, and correspondingly, $\bigwedge^n \pi = \chi^{(1^n)}$. More generally, $\bigwedge^r V = \langle e_{i_1} \wedge \cdots \wedge e_{i_r} : i_1 < \ldots < i_r\rangle_\mathbb{C}$ is the monomial representation of $S_n$ induced from $\mathrm{sgn}_{S_r} \times \mathbb{C}_{S_{n-r}}$. Its character is known to be

$(2)\quad\bigwedge^r \pi = \chi^{(n-r,1^r)} + \chi^{(n-r+1,1^{r-1})}$

for $1 \le r < n$. This decomposition is an immediate corollary of either the Young or Pieri rules. It may also be proved using Specht modules and an explicit isomorphism $\bigwedge^r S^{(n-1,1)} \cong S^{(n-r,1^r)}$: see for instance Section 5 of this paper with Giannelli and Lim.

#### Proof.

By (2) we have

$\chi^{(n-r,1^r)} = \bigwedge^r \pi - \bigwedge^{r-1} \pi + \cdots + (-1)^{r-1} \pi + (-1)^r 1.$

Hence the coefficient of $\bigwedge^k \pi$ in $\Gamma$ is $1$ if $n-k$ is odd, and $0$ if $n-k$ is even. For example,

\begin{aligned} \Gamma_4 &= \chi^{(4)} + \chi^{(3,1)} + \chi^{(2,1,1)} + \chi^{(1,1,1,1)} \\ &= 1+ (\pi - 1) + (\bigwedge^2 \pi - \pi + 1) + (\bigwedge^3 \pi - \bigwedge^2 \pi + \pi - 1) \\ &= \pi + \bigwedge^3 \pi \end{aligned}

Thus $\Gamma = \sum_{k} \bigwedge^k \pi$, where the sum is over all $k$ such that $n-k$ is odd.

This sum is related to (1). Let $g \in S_n$ have precisely $a_k$ cycles of length $k$, for each $k \in \{1,\ldots, n\}$. The eigenvalues of the $n \times n$ permutation matrix representing $g$ can be found by considering the cycles separately: each $k$-cycle contributes eigenvalues $1, \zeta_k, \ldots, \zeta_{k}^{k-1}$, where $\zeta_k$ is a primitive $k$th root of unity. Since $(1+x)(1+\zeta x) \cdots (1+\zeta^{k-1}x ) = 1 + (-1)^{k-1} x^k$, the generating function in (1) gives

$\sum_{k=0}^n (\bigwedge^k \pi)(g)x^k = (1+x)^{a_1}(1-x^2)^{a_2} \ldots (1+(-1)^{n-1} x^n)^{a_n}.$

Denote this polynomial by $F_g(x)$. We now have

$\Gamma(g) = \begin{cases} \frac{F_g(1) + F_g(-1)}{2} & n \equiv 1 \bmod 2 \\ \frac{F_g(1) - F_g(-1)}{2} & n \equiv 0 \bmod 2. \end{cases}$

Since

$F_g(1) = \begin{cases} 2^{\mathrm{cyc}(g)} & g \in \mathcal{O}_n \\ 0 & g \not\in \mathcal{O}_n \end{cases}$

and $F_g(-1) = 0$ for all $g$, the result follows.
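The identity, and the generating function argument behind it, are easy to verify by machine for small $n$; here is a brute-force sketch of my own over all of $S_5$.

```python
from itertools import permutations

def cycle_type(perm):
    """Cycle lengths of a permutation of {0, ..., n-1} given as a tuple."""
    seen, lengths = set(), []
    for i in range(len(perm)):
        if i not in seen:
            j, c = i, 0
            while j not in seen:
                seen.add(j)
                j, c = perm[j], c + 1
            lengths.append(c)
    return lengths

def poly_mul(f, g):
    res = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            res[i + j] += a * b
    return res

n = 5
for perm in permutations(range(n)):
    cyc = cycle_type(perm)
    F = [1]
    for k in cyc:                      # factor 1 + (-1)^(k-1) x^k per k-cycle
        F = poly_mul(F, [1] + [0] * (k - 1) + [(-1) ** (k - 1)])
    # Gamma(g) is the sum of the coefficients of x^k with n - k odd
    gamma = sum(c for k, c in enumerate(F) if (n - k) % 2 == 1)
    expected = 2 ** (len(cyc) - 1) if all(k % 2 == 1 for k in cyc) else 0
    assert gamma == expected
print("Gamma(g) = 2^(cyc(g)-1) on odd-cycle g verified for n =", n)
```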

#### Possible generalizations.

Let $\Delta = \sum_{a} \chi^{(n-a,1^a)}$ where the sum is over all $a$ not congruent to $1$ modulo $3$. Then by (2), $\Delta = \sum_{b \ge 0} \bigwedge^{3b} \pi$, and the proof above suggests that $\Delta(g)$ may also have a fairly simple form. One might also look for analogous identities replacing $\bigwedge^k \pi$ with the permutation character of $S_n$ acting on the $k$-subsets of $\{1,\ldots, n\}$.