## Lagrange Inversion

November 18, 2017

In this post I’ll give two short proofs of the Lagrange Inversion formula, both based on the ideas in these notes. The first uses nothing more than Cauchy’s Integral Formula, and Cauchy’s Theorem to justify a change of variable. Rather than use Rouché’s Theorem, or the Open Mapping Theorem, as in the linked notes, I’ve added the necessary assumption on bijectivity into the hypotheses. In the standard combinatorial applications this is often obvious. The inverse is assumed to be analytic; this is easily proved, by trivial changes to the usual real variable argument, once it is known to exist. The second is even more elementary, but a little harder to motivate: some version of it will probably be the one I use for my planned planned textbook (expected, at the current rate, some time in 2022).

Theorem. Let $f : \mathbb{C} \rightarrow \mathbb{C}$ be an analytic function such that $f(0) = 0$ and $f$ restricts to a bijective function on a domain containing $0$. Then

$\displaystyle [w^n] f^{-1}(w) = \frac{1}{n} [z^{n-1}] \bigl( \frac{z}{f(z)}\bigr)^n.$

Proof 1. We use Cauchy’s Integral Formula to find the coefficient of $w^n$ in the Taylor Series of $f^{-1}(w)$, change variables by $s = f(z)$ (so $\mathrm{d}s = f'(z)\mathrm{d}z$), and then integrate by parts. The change of variables is justified by Cauchy’s Theorem since the approximation $f(z) \approx z$ for small $z$ shows that the contour $t \mapsto f(r\mathrm{e}^{it})$ for $t \in [0,2\pi]$ winds once around the origin provided $r$ is sufficiently small. Thus

\displaystyle \begin{aligned} \ [w^n] f^{-1}(w) &= \frac{1}{2\pi \mathrm{i}} \oint \frac{f^{-1}(s)}{s^{n+1}} \mathrm{d} s \\ &= \frac{1}{2\pi \mathrm{i}} \oint \frac{z}{f(z)^{n+1}} f'(z) \mathrm{d} z \\ &= \frac{1}{2\pi \mathrm{i}} \oint - \frac{z}{n} \frac{\mathrm{d}}{\mathrm{d}z} \frac{1}{(f(z))^{n}} \\ &= \frac{1}{n} \frac{1}{2\pi \mathrm{i}} \oint \frac{1}{f(z)^n} \mathrm{d} z \\ &= \frac{1}{n} [z^{-1}] \frac{1}{f(z)^n} \\ &= \frac{1}{n} [z^{n-1}] \bigl( \frac{z}{f(z)}\Bigr)^n \end{aligned}

as required. $\Box$

Since $[z^n]f(z) = \frac{1}{n!}\frac{\mathrm{d}^n}{\mathrm{d}z^n} f(z)$, evaluated at $0$, an equivalent statement of the result is

$\displaystyle f^{-1}(w) = \sum_{n=0}^\infty \frac{w^n}{n!} \frac{\mathrm{d}^{n-1}}{\mathrm{d}z^{n-1}} \bigl( \frac{z}{f(z)} \bigr)^n \bigr|_{z=0}$.

Proof 2. Let $c_n = [w^n] f^{-1}(w)$. Since $z = f^{-1}\bigl( f(z) \bigr)$, the Taylor series for $f^{-1}$ gives

$\displaystyle z = \sum_{m=1}^\infty c_m f(z)^m.$

Differentiate and then divide through by $f(z)^n$ to get

$\displaystyle \frac{1}{f(z)^n} = n c_n \frac{f'(z)}{f(z)} + \sum_{m \not= n} m c_m f(z)^{m-1-n} f'(z).$

The formal residue of $f'(z)/f(z)$ is $1$ (this is the count of zeros and poles). The summand for $m$ where $m\not= n$ is, up to scalars, the derivative of $f(z)^{m-n}$, so has zero residue. Therefore, taking residues, we obtain $[z^{-1}] \frac{1}{f(z)^n} = n c_n$, which is equivalent to $c_n = [z^{n-1}] (z/f(z))^n$, as required. $\Box$

Not a proof. It seems impossible to replace the use of residues in Proof 2 with the more elementary derivative. From

$\displaystyle \frac{z}{f(z)^n} = c_n + \sum_{m\not= n} c_m f(z)^{m-n}$

we can multiply through by $z^{n-1}$ to get

$\displaystyle c_n z^{n-1} + \sum_{m=1}^{n-1} c_m \bigl(\frac{z}{f(z)} \bigr)^n \frac{f(z)^m}{z} + \sum_{r=1}^\infty c_{n+r} z^{n-1} f(z)^r.$

It is now tempting to differentiate $n-1$ times to extract $c_n$: the third summand is no obstacle, since it is $\mathrm{O}(z^n)$. The second summand is now holomorphic (this is clear from the rewriting above, since the singularity at $0$ is removable). But of course it contributes erroneous terms to the derivative. Indeed, this procedure gives $(n-1)!c_n$, whereas the theorem predicts that $n! c_n = \frac{\mathrm{d}^{n-1}}{\mathrm{d}z^{n-1}} \bigl( \frac{z}{f(z)} \bigr)^n \bigr|_{z=0}$, so something is surely wrong.

## Shannon’s 1949 paper

September 20, 2017

In 1949 Claude Shannon published Communication theory of secrecy systems, Bell Systems Technical Journal 28 656–715. The members of Bell Labs around this time included Richard Hamming, the three inventors of the transistor and Harry Nyquist, of the Nyquist bound. Its technical journal published many seminal papers, including Shannon’s 1948 paper A mathematical theory of communication defining entropy and Hamming’s 1950 paper Error detecting and error correcting codes, defining Hamming distance and essentially inventing the modern adversarial’ setting for coding theory.

Incidentally, the story goes that Shannon asked Von Neumann what he should call his new measure of information content and Von Neumann replied

You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.

Like Hamming’s paper, Shannon’s two papers still amply repay reading today. One idea introduced in the 1949 paper is ‘perfect secrecy’.

### Perfect secrecy

Consider a cryptosystem with plaintexts $\mathcal{P}$ and ciphertexts $\mathcal{C}$, with encryption functions $e_k : \mathcal{P} \rightarrow \mathcal{C}$ parametrised by a set $\mathcal{K}$ of keys.

Shannon supposes that there is a probability distribution on the plaintexts, assigning an a priori probability $p_x$ to each $x \in \mathcal{P}$. There is also an independent probability distribution on the keys. These interact in a rich way. For instance, $\mathbb{P}[Y=y|X=x] = \mathbb{P}[e_K(x) = y]$ is a sum of key probabilities, while $\mathbb{P}[Y=y|K=k] = \mathbb{P}[e_k(X) = y]$ is either zero, or the probability attached to the unique plaintext $p_{e_k^{-1}(y)}$. Other probabilities, such as $\mathbb{P}[X=x|Y=y]$ can be computed from these using Bayes’ Theorem type arguments.

Suppose we observe a ciphertext $y \in \mathcal{C}$: what, if anything do we learn about the corresponding plaintext $x \in \mathcal{P}$? Shannon defines the a posteriori probability $\mathbb{P}[X = x | Y = y]$ to be the conditional probability of the plaintext $x \in \mathcal{P}$ given we observed the ciphertext $y \in \mathcal{C})$; the system then has perfect secrecy if, for any a priori probability distribution $p_x$, we have $\mathbb{P}[X = x | Y = y] = p_x$ for all $x \in \mathcal{P}$, and all $y \in \mathcal{C}$. (This assumes implicitly that $\mathbb{P}[Y = y] > 0$ for every $y \in \mathcal{C}$.)

Shannon proves in his Theorem 6 that a necessary and sufficient condition for perfect secrecy is that $\mathbb{P}[Y = y|X = x] = \mathbb{P}[Y = y]$ for all $x \in \mathcal{P}$ and $y \in \mathcal{C}$.

The proof is a short application of Bayes’ Theorem: since $\mathbb{P}[X = x | Y = y] = \mathbb{P}[Y = y|X = x]p_x/\mathbb{P}[Y = y]$, and since we may choose $p_x \not=0$ (which is necessary anyway for the conditional probability to be well-defined), we have $p_x = \mathbb{P}[X = x | Y = y]$ if and only if $\mathbb{P}[Y = y|X = x] = \mathbb{P}[Y = y]$.

Corollary. In a system with perfect secrecy, $|\mathcal{K}| \ge |\mathcal{C}|$. Moreover, if equality holds then every key must be used with equal probability $1/|\mathcal{K}|$ and for each $x \in \mathcal{P}$ and $y \in \mathcal{C}$ there exist a unique $k \in \mathcal{K}$ such that $e_k(x) = y$.

Proof Fix a plaintext $x \in \mathcal{P}$. We claim that for each $y \in \mathcal{C}$ there exists a key $k$ such that $e_k(x) = y$. Indeed, if $y_{\mathrm{bad}} \in \mathcal{C}$ is never an encryption of $x$ then, for any choice of a priori probabilities that gives some probability to $x$, we have $\mathbb{P}[Y = y_{\mathrm{bad}}|X = x] = 0$, so by Shannon’s Theorem 6, $\mathbb{P}[Y = y_\mathrm{bad}] = 0$. But, by the implicit assumption, there is a non-zero chance of observing $y_\mathrm{bad}$, a contradiction.

Since $x$ has at most $|\mathcal{K}|$ different encryptions, the claim implies that $|\mathcal{K}| \ge |\mathcal{C}|$. Moreover if equality holds then for every $y \in \mathcal{C}$ there exists a unique $k \in \mathcal{K}$ such that $e_k(x) = y$. Since $\mathbb{P}[Y=y | X =x ] = \mathbb{P}[K = k]$, where $k$ is this unique key, the conclusion of Theorem 6, that $\mathbb{P}[Y = y|X = x]$ is constant for $y \in \mathcal{C}$, then implies that each key is equiprobable. $\Box$

The ‘only if’ direction of Theorem 2.4 in Cryptography: theory and practice (3rd edition) by Douglas Stinson, is the corollary above, but, according to my reading of pages 48 and 50, interpreted with a different definition of perfect secrecy, in which the a priori distribution is fixed, as part of the cryptosystem. (This makes some sense, for example, in a cryptosystem encrypting English plaintexts nonsensical strings should always have zero probability.) Unfortunately, Stinson’s change makes the result false. The diagram below shows a toy cryptosystem with two keys and two plaintexts.

Take $p_w = 0$ and $p_x = 1$. Then

$\mathbb{P}[X = x | Y = y] = \mathbb{P}[X = x | Y = y'] = p_x = 1$

and

$\mathbb{P}[X = w | Y = y] = \mathbb{P}[X = w | Y = y'] = p_w = 0,$

so the system has perfect secrecy, no matter what probabilities we assign to the keys. (Incidentally, setting $p_{x} = 1$ and $\mathbb{P}[K = k] = 1$ gives a cryptosystem where we always send $x$ and observe the ciphertext $y$; the a posteriori probability of $x$ is therefore the same as the a priori probability, so the system has perfect secrecy, even though the keys are not equiprobable. This shows that Shannon’s implicit assumption is not just a technicality required to make the conditional probabilities well-defined.)

The error in the proof of Theorem 2.4 comes in the application of Bayes’ Law, where it is implicitly assumed that $p_x \not= 0$. This shows that the extra layer of quantification in Shannon’s paper is not a mere technicality. Given the difficulties students have with nested quantifiers, I’m inclined to keep Stinson’s definition, and fix the problem by assuming $p_x\not =0$ for each $x \in \mathcal{P}$. One can get a more general, but somewhat technical result, by taking an arbitrary cryptosystem and applying Shannon’s Theorem to the subsystem where $\mathcal{P}$ is the set of plaintexts with positive probability, and $\mathcal{C}$ and $\mathcal{K}$ are unchanged. (To be fair to Stinson, he observes before the theorem that plaintexts $x$ such that $p_x = 0$ are never an obstacle to perfect secrecy, so clearly he was aware of the issue. He also assumes, as we have done, that $\mathbb{P}[Y = y] \not=0$ for each $y \in \mathcal{C}$, but this is something else.)

Incidentally, there is a subtle English trap in Shannon’s paper: he says, quite correctly, that when $|\mathcal{K}| = |\mathcal{C}|$, ‘it is possible to obtain perfect secrecy with only this number of keys’. Here with only’ does not mean the same as ‘only with’.

#### Example from permutation groups

Given a finite group $G$ acting transitively on a set $\Omega$ we obtain a cryptosystem with plaintexts and ciphertexts $\Omega$ and encryption maps $\alpha \mapsto \alpha g$ indexed by the elements of $G$. For which probability distributions on $G$ does this cryptoscheme have perfect secrecy?

Let $H_\beta$ denote the point stabiliser of $\beta \in \Omega$. Since

$\mathbb{P}[X = \alpha | Y = \beta] = \mathbb{P}[k \in g_{\alpha \beta} H_\beta],$

where $g_{\alpha \beta}$ is any element such that $\alpha g_{\alpha\beta}= \beta$, the cryptosystem has perfect secrecy if and only if, for each $\beta \in \Omega$,

$\mathbb{P}[K \in g_{\alpha\beta} H_\beta]$

is constant as $\alpha$ varies over $\Omega$. Call this condition $(\star)$.

Lemma 2. Suppose that $G$ has a regular normal subgroup $N$. Any probability distribution constant on the cosets of $N$ satisfies $(\star)$.

Proof. Since $G = N \rtimes H_\beta$, the subgroup $H_\beta$ meets each coset of $N$ in a unique element. The same holds for each coset $t H_\beta, \ldots, t^{p-1} H_\beta$. Therefore

$\sum_{k \in g_{\alpha \beta} H_\beta} \mathbb{P}[k] = \sum_{k \in H_\beta} \mathbb{P}[k]$

is independent of $\alpha$. $\Box$

In the case of the affine cipher, the converse also holds.

Theorem 3. Let $p$ be a prime and let $G$ be the affine group on $\mathbb{F}_p$ with normal translation subgroup $N$. A probability distribution on $G$ satisfies $(\star)$ if and only if it is constant on each of the cosets of $N$.

Proof. Let $r$ generate the multiplicative group of integers modulo $p$. Then $G$ is a semidirect product $N \rtimes H$ where $N = \langle t \rangle$ is the regular subgroup generated by translation by $+1$ and the point stabiliser $H = \langle h \rangle$ is generated by multiplication by $r$.

A probability distribution on $G$ can be regarded as an element of the group algebra $\mathbb{C}G$. The probability distributions satisfying $(\star)$ correspond to those $\sum_{k \in G} c_k k \in \mathbb{C}G$ such that, for each $\beta \in \mathbb{F}_p$,

$\sum_{k \in t^{-\alpha + \beta} H^{t^\beta}} c_k$

is constant as $\alpha$ varies over $\mathbb{F}_p$. Thus if $U$ is the corresponding subspace of $\mathbb{C}G$, then $U$ is invariant under left-multiplication by $N$ and right-conjugation by $N$. But since $kg = gk^t$ for $k, g \in G$, taken together, these conditions are equivalent to invariance under left- and right- multiplication by $N$. It is therefore natural to ask for the decomposition of $\mathbb{C}G$ as a representation of $N \times N$, where $N \times N$ acts by

$k \cdot (g,g') = g^{-1}k g'.$

Since $G/N$ is abelian, $(Nh^\beta)t = (Nh^\beta)^t = Nh^\beta$ for all $\beta$. Hence $\mathbb{C}G = \bigoplus_{\beta = 0}^{p-1} Nh^\beta$. The calculation behind this observation is

$h^\beta t = h^\beta t h^{-\beta} h^\beta = t^{s^\beta} h^\beta$

where $s = r^{-1}$ in $\mathbb{F}_p$. This shows that $h^\beta \in Nh^\beta$ is stabilised by $(t^{-s^\beta}, t) \in N \times N$ and so $Nh^\beta$, regarded as a representation of $N \times N$, is induced from the subgroup of $N \times N$ generated by $(t^{-s^\beta}, t)$. By Frobenius Reciprocity, $Nh^\beta$ decomposes as a direct sum of a trivial representation of $N \times N$, spanned by

$h^\beta + th^\beta + \cdots + t^{p-1}h^\beta,$

and a further $p-1$ non-trivial representations, each with kernel $\langle (t^{-s^\beta}, t) \rangle$. Critically all $(p-1)^2$ of these representations are non-isomorphic.

By Lemma 2, $U$ contains all $p-1$ trivial representations. Writing $T$ for their direct sum, we have $\mathbb{C}G = T \oplus C$ for a unique complement $C$ that decomposes uniquely as a sum of the non-trivial representations in the previous paragraph. Therefore if $U$ properly contains $T$ then $U$ contains a non-trivial summand of some $Nh^\beta$, spanned by some $u \in \mathbb{C}G$ of the form

$b_0 h^\beta + b_1 t h^\beta + \cdots + b_{p-1} t^{p-1} h^\beta$.

The support of $u$ is $h^\beta N$, which meets each point stabiliser in a unique element. Hence all the $b_i$ must be equal, a contradiction. $\Box$

This should generalize to an arbitrary Frobenius group with abelian kernel, and probably to other permutation groups having a regular normal subgroup.

### The key equivocation given a ciphertext

In practice, we might care more about what an observed ciphertext tells us about the key. This question is considered in Part B of Shannon’s paper. Let $X$, $Y$ and $K$ be random variables representing the plaintext, ciphertext and key, respectively. We assume that $X$ and $K$ are independent. (This is easily motivated: for instance, it holds if the key is chosen before the plaintext, and the plaintext is not a message about the key.) The conditional entropy $H(K | Y)$, defined by

\begin{aligned}H(K{}&{} | Y) \\ &= \sum_{y \in \mathcal{C}} \mathbb{P}[Y=y] H(K | Y = y) \\ &= \sum_{y \in \mathcal{C}} \mathbb{P}[Y=y] \sum_{k \in \mathcal{K}} \mathbb{P}[K = k | Y = y] \log \frac{1}{\mathbb{P}[K = k | Y = y]} \end{aligned}

represents our expected uncertainty, or equivocation’ to use Shannon’s term, in the key, after we have observed a ciphertext. (Throughout $\log$ denotes logarithm in base $2$.)

It is intuitive that $H(K | Y) \ge H(X | Y)$: since a given plaintext $x$ can be encrypted to a given ciphertext $y$ by several different keys, given a ciphertext, we are at least as uncertain about the key as we are about the plaintext. The proof is yet another exercise in Bayes’ Theorem, which I’ll include because it sheds some light on the probability model.

Lemma 4. $H(K | Y) \ge H(X | Y)$.

Proof. Fix $y \in \mathcal{C}$. Let $\mathcal{P}_y$ be the set of plaintext that encrypt, under some key, to $y$. (Thus $\mathcal{P}_y$ is the union of the inverse images of the singleton set $\{y\}$ under all encryption maps.) For $x \in \mathcal{P}$, let $\mathcal{K}_{xy}$ be the set (possibly empty) of keys $k$ such that $e_k(x) = y$. By Bayes’ Theorem,

$\mathbb{P}[X=x|Y=y] = \displaystyle \frac{\mathbb{P}[K \in \mathcal{K}_{xy}] p_x}{\mathbb{P}[Y=y]}.$

Similarly

$\mathbb{P}[K=k|Y=y] = \displaystyle \frac{\mathbb{P}[K=k] p_x}{\mathbb{P}[Y=y]}$

if $y = e_k(x)$ for some (necessarily unique) $x \in \mathcal{P}$ and otherwise this probability is $0$. Thus

$\mathbb{P}[X=x|Y=y] = \sum_{k \in \mathcal{K}_{xy}} \mathbb{P}[K=k|Y=y]$

and so the probability distribution on $X$ (conditioned on $Y=y$) is a coarsening of the probability distribution on $K$ (conditioned on $Y=y$). Therefore $H(K | Y=y) \ge H(X | Y = y)$ and the lemma now follows by summing over $y$, using the definition of conditional entropy above. $\Box$

A related formula is now a textbook staple. If we know the key $K$ then we know $X$ if and only if we know $Y$. Therefore the joint entropies $H(K, X)$ and $H(K, Y)$ agree and

$H(K | Y) + H(Y) = H(K,Y) = H(K,X) = H(K) + H(X)$

by the independence assumption. Hence

$H(K | Y) = H(K) + H(X) - H(Y),$

Shannon considers a toy cryptosystem with keys $K = \{k,k'\}$ and
$\mathcal{P} = \mathcal{C} = \{0,1\}^n$, in which a plaintext is encrypted bit-by-bit, as itself if $K = k$ and flipping each bit if $K = k'$. Suppose that the keys are equiprobable, and that each plaintext bit is $0$ independently with probability $p$, so the probability of a plaintext having exactly $s$ zeros is $\binom{n}{s} p^s (1-p)^{n-s})$. Denoting this quantity by $p_s$ we have $H(X) = -\sum_{s=0}^n p_s \log p_s$. The probability that the ciphertext $Y$ has exactly $s$ zeros is $(p_s + p_{n-s})/2$. Therefore

\begin{aligned} H(Y) &= -\sum_{s=0}^n \frac{p_s + p_{n-s}}{2} \log \frac{p_s + p_{n-s}}{2} \\ &= -\sum_{s=0}^n p_s \log \frac{p_s + p_{n-s}}{2}. \end{aligned}

and so

\begin{aligned} H(K{}&{} |Y) \\ &= 1 - \sum_{s=0}^n p_s \log \frac{p_s}{(p_s+p_{n-s})/2} \\ &= -\sum_{s=0}^n p_s \log \frac{p_s}{p_s + p_{n-s}} \\ &= -\sum_{s=0}^n \binom{n}{s} p^s (1-p)^{n-s} \log \frac{p^s (1-p)^{n-s}}{p^s (1-p)^{n-s} + p^{n-s}(1-p)^s} \\ &= \sum_{s=0}^n \binom{n}{s} p^s (1-p)^{n-s} \log \bigl(1 + p^{n-2s}(1-p)^{2s-n} \bigr) \\ &= \sum_{s=0}^n \binom{n}{s} p^s (1-p)^{n-s} \log \bigl(1 + q^{2s-n} \bigr). \end{aligned}

where $q = (1-p)/p$.

The graph on page 690 of Shannon’s paper shows $H(K | Y)$ against $n$ for two values of $p$. He does not attempt to analyse the formula any further, remarking yet already the formulas are so involved as to be nearly useless’. I’m not convinced this is the case, but the analysis is certainly fiddly.

When $p = \frac{1}{2}$ Shannon’s formula gives $H(K | Y) = 1$, as expected. We may therefore reduce to the case when $p \textgreater 1/2$ and so $q \textless 1$). Take $n = 2m$. The summand for $m+u$ is

$\binom{2m}{m+u} p^{m+u}(1-p)^{m-u} \log (1 + q^{2u}).$

Since $\binom{2m}{m+u}(m-u) = \binom{2m}{m+u+1}(m+u+1)$, the ratio of the summands for $m+u$ and $m+u+1$ is

$\displaystyle \frac{m+u+1}{m-u} \frac{1-p}{p} \frac{\log (1 + q^{2u})}{\log (1 + q^{2u+2})}$.

When $u \ge 0$ we can use the inequalities

$x \ge \log (1+x) \ge x - x^2/2$

for $x \in [0,1]$ to bound the ratio below by

$\displaystyle \frac{m+u+1}{m-u} \Bigl( \frac{1}{q} - q^{2u-1}/2 \Bigr).$

Therefore the summands in Shannon’s formula get exponentially small, at a rate approximately $1/q > 1$ as $s$ increases from $n/2$. Their contribution can be bounded by summing a geometric series and the same order as the middle term. (This does not require the contribution from the binomial coefficient, which is initially small, but eventually dominates.)

Going the other way, the summands for $m-t$ and $m-t-1$ is

$\displaystyle \frac{m+t+1}{m-t} \frac{p}{1-p} \frac{\log (1 + q^{-2t})}{\log (1 + q^{-2t+2})}$

which can be rewritten as

$\displaystyle \frac{m+t+1}{m-t} \frac{1}{q} \frac{2t \log \frac{1}{q} + \log (1+q^{2t})}{(2t+2)\log \frac{1}{q} + \log (1+q^{2t+2})}.$

It is useful to set $p = 1/2 + \rho$. If the final fraction were $t/(t+1)$ then the maximum would occur when $t/(t+1) = q$, i.e. when $t = q/(1-q) = (1-p)/(2p-1) = (1/2 -\rho)/\rho$. Numerical tests suggest this gives about the right location of the maximum when $m$ large (and so the first fraction can be neglected) and $p$ near to $1/2$.
For $t$ of the same order of $m$ the first fraction is dominant and gives exponential decay. It is therefore not so surprising that the middle term gives a reasonable lower bound for the entire series, namely

$H(K | Y) \ge \frac{A}{\sqrt{n}} \exp (-2 \rho^2 n )$

for some constant $A$. The graph below shows $H(K|Y)$ and the middle summand for $p = 1/2 + 1/20, 3/5, 2/3, 4/5$ with colours red, green, blue, black.

It seems possible that the lower bound is, up to a multiplicative constant, also an upper bound. However the argument above will show at most that $H(K | Y) \le A'\sqrt{n} \exp(-2 \rho^2 n)$.

#### Upper bound

A slightly weaker upper bound follows from tail estimates for binomial probabilities. (Update. There might be a stronger result using the Central Limit Theorem.) Let $S$ be distributed as $\mathrm{Bin}(p,2m)$, so $\mathbb{P}[S= s] = \binom{2m}{s}p^s (1-p)^{2m-s}$. By Hoeffding’s inequality,

$\mathbb{P}[S - 2pm \le -2\epsilon m] \le \mathrm{e}^{-4 \epsilon^2 m}.$

The argument by exponential decay shows that the contribution to $H(K | Y)$ from the summands for $s > m$ is of the same order as the middle term. Using $\log(1+ q^{2s-n}) \le 1 + (n-2s) \log (1/q)$ we get

$\mathbb{P}[S \le n/2] + 2\log (1/q) \sum_{s=0}^m (m-s) \mathbb{P}[S = s]$

as an upper bound for the remaining terms. By a standard trick, related to the formula $\mathbb{E}[X] = \mathbb{P}[X \ge 0] + \mathbb{P}[X \ge 1] + \cdots$ for the expectation of a random variable taking values in $\mathbb{N}_0$, we have

$\sum_{s=0}^m (m-s)\mathbb{P}[S=s] = \sum_{s=0}^{m-1} \mathbb{P}[S\le s].$

Take $\epsilon = p - 1/2 + \alpha$ in the version of Hoeffding’s inequality to get

\begin{aligned} \mathbb{P}[S \le m(1-\alpha)] &\le \mathrm{e}^{-4 (p-1/2+\alpha)^2 m} \\ &= \mathrm{e}^{-4 (p-1/2)^2 m} \mathrm{e}^{-8 (p-1/2) \alpha m - 4 \alpha^2 m} \\ &\le \mathrm{e}^{-4 \rho^2 m} \mathrm{e}^{-8 \rho \alpha m}. \end{aligned}

Thus the upper bounds for $\mathbb{P}[S \le s]$ become exponentially smaller as we decrease $s$ from $m$ by as much as $\alpha m$, for any $\alpha > 0$. By summing a geometric series, as before, we get

$\sum_{s=0}^{m-1} P[S\le s] \le Bm \mathrm{e}^{-4 \rho^2 m}$

for some constant $B$. The neglected contributions are of the order of the middle term, so bounded above by $\mathrm{e}^{-4 \rho^2 m}$. Therefore

$H(K | Y) \le Cm \mathrm{e}^{-4 \rho^2 m}$

for some further constant $C$.

### Random ciphers and unicity distance

#### A selection effect

Imagine a hypothetical world where families have either no children, an only child, or two twins, with probabilities $1/4$, $1/2$, $1/4$. The mean number of children per family is therefore $1$. In an attempt to confirm this empirically, a researcher goes to a primary school and asks each child to state his or her number of siblings. Twin-families are half as frequent as only-families, but send two representatives to the school rather than one: these effects cancel, so the researcher observes equal numbers of children reporting $0$ and $1$ siblings. (Families with no children are, of course, never sampled.) The estimate for the mean number of children is therefore the inflated $3/2$.

#### The random cipher

Shannon’s proof of the Noisy Coding Theorem for the memoryless binary channel is a mathematical gem; his chief insight was that a random binary code of suitable size can (with high probability) be used as part of a coding scheme that achieves the channel capacity. In his 1949 paper he considers the analogous random cipher.

Let $\mathcal{A} = \{\mathrm{a}, \ldots, \mathrm{z}\}$ be the Roman alphabet. Let $\mathcal{P}_n = \mathcal{C}_n = \mathcal{A}^n$ and let $X_n$ and $Y_n$ be the random variables recording the plaintext and ciphertext, respectively. We suppose that the message is chosen uniformly at random from those plaintexts that make good sense in English (once spaces are inserted). In yet another fascinating paper Shannon estimated that the per-character redundancy of English, $R$ say, is between $3.2$ and $3.7$ bits. Thus $H(X_n) = (\log_2 26 - R)n$ and the number of plausible plaintexts is $S_n = 2^{(\log_2 26 - R)n}$. Let $T_n = |\mathcal{P}_n| = 26^n$.

Fix $t \in \mathbb{N}$. We aim to construct a random cipher with exactly $t$ keys. Each key will be chosen with equal probability. A quick reading of (3) on page 691 of Shannon’s paper may suggest a too naive construction: for each ciphertext, we chose $t$ plaintexts uniformly at random from $\mathcal{P}_n$ to be its decryptions. (Some plaintexts may be chosen multiple times, meaning two or more keys decrypt the ciphertext to the same plaintext.) However if we pick the same $x \in \mathcal{P}_n$ first for different ciphertexts $y$ and $y'$, then it is ambiguous whether to encrypt $x$ as $y$ or as $y'$ under the first key. Moreover some plaintexts may never be chosen, giving them no encryption under the first key. Instead, it seems to work best to choose, for each of $t$ keys, a random bijection $\mathcal{P}_n \rightarrow \mathcal{C}_n$ to be the encryption function for that key. This also arrives at the situation required by Shannon in (3).

Let $g(y)$ be the number of keys giving plausible decryptions of the ciphertext $y \in \mathcal{C}_n$. (The number of keys may be more than the number of plausible decryptions if there are two keys decrypting $y$ to the same plaintext.) Thus $g : \mathrm{C}_n \rightarrow \mathbb{N}_0$ is a random quantity, where the randomness comes from the choices made in the construction of the random cipher. If $Z_n$ is chosen uniformly at random from $\mathcal{C}_n$ then, for each key, there is a probability $|S_n|/|T_n|$ that the decrypt of $Z_n$ under this key is plausible. Therefore

$\displaystyle \mathbb{P}[g(Z_n) = m] = \binom{t}{m} \bigl( \frac{S_n}{T_n} \bigr)^m \bigl( 1 - \frac{S_n}{T_n} \bigr)^{t-m}$

and $g(Z_n)$ is distributed binomially as $\mathrm{Bin}(S_n/T_n, t)$.

However this is not the distribution of $g(Y_n)$: since $Y_n$ is the encryption of a plausible plaintext, ciphertexts $y$ with a high $g(y)$ are more frequent, while, as in the family example, ciphertexts $y$ such that $g(y) = 0$ are never seen at all.

Lemma 5. $\mathbb{P}[g(Y_n) = m] = \displaystyle \frac{T_n m}{S_n t} \mathbb{P}[g(Z_n) = m]$.

Proof. Let $\mathcal{P}_n(y)$ be the multiset of the $g(y)$ plausible plaintexts that are decryptions of $y \in \mathcal{C}_n$. (Alternatively, and maybe more clearly, we could define $\mathcal{P }_n(y)$ to be the set of pairs $(k,x)$ where $k$ is a key decrypting $y$ to a plausible plaintext $x$.) Conditioning on the event that $x \in \mathcal{P}_n(y)$ we have

\begin{aligned} \mathbb{P}[g(Y_n) = m] &= \sum_{y \in \mathcal{C}_n \atop g(y) = m} \mathbb{P}[Y=y] \\ &= \sum_{y \in \mathcal{C}_n \atop g(y) = m} \sum_{x \in \mathcal{P}_n(y)} \mathbb{P}[Y=y|X=x]\mathbb{P}[X=x] \\ &= \sum_{y \in \mathcal{C}_n \atop g(y) = m} \sum_{x \in \mathcal{P}_n(y)} \frac{1}{t} \frac{1}{S_n} \\ &= \sum_{y \in \mathcal{C}_n \atop g(y) = m} \frac{m}{kS_n} \\ &= T_n \mathbb{P}[Z_n = m] \frac{m}{S_n t} \\ &= \frac{T_n m}{S_n t} \mathbb{P}[Z_n = m] \end{aligned}

as required. $\Box$

Corollary 6. The random variable $g(Y_n)$ is distributed as $1+ \mathrm{Bin}(S_n/T_n, t-1)$.

Proof. By Lemma 5 and the identity $m \binom{t}{m} = t \binom{t-1}{m}$ we have

\begin{aligned} \mathbb{P}[g(Y_n) = m] &= \frac{T_n m}{S_n t}\binom{t}{m} \bigl( \frac{S_n}{T_n} \bigr)^m \bigl( 1 - \frac{S_n}{T_n} \bigr)^{t-m}\\ &= \binom{t-1}{m-1}\bigl( \frac{S_n}{T_n} \bigr)^{m-1} \bigl( 1 - \frac{S_n}{T_n} \bigr)^{t-m}. \end{aligned}

Hence $g(Y_n)-1$ is distributed as $\mathrm{Bin}(S_n/T_n, t-1)$. $\Box$

It feels like there should be a quick direct proof of the corollary, along the lines we know $Y$ has one plausible decrypt; each of the remaining $t-1$ is plausible with probability $S_n/T_n$, hence … ‘. But this seems dangerously close to we know two fair coin flips gave at least one head; the other flip is a head with probability $1/2$, hence …’, which gives the wrong answer. The difference is that the plausible decrypt of $Y$ comes with a known key, whereas the at least one head’ could be either flip. Given the subtle nature of selection effects and conditional probability, I prefer the calculation in the lemma.

Shannon’s paper replaces the lemma with the comment [with one notation change] The probability of such a cryptogram [our event $g(Y) = m$] is $mT/St$, since it can be produced by $m$ keys from high probability messages [our plausible plaintexts] each with probability $T/S$.’ I cannot follow this: in particular $T/S$ cannot be a probability, since it is far greater than $1$.

Given that $Y = y \in \mathcal{C}_n$, the entropy in the key is $H(K|Y = y) = \log g(y)$. Therefore, going back to the lemma, we have

\begin{aligned} H(K|Y) &= \sum_{m=1}^t \mathbb{P}[g(Y) = m] \log m \\ &= \sum_{m=1}^k \frac{T_n}{S_n t} \mathbb{P}[Z_n = m] m \log m.\end{aligned}

Shannon argues that if $t$ is large compared to $n$ then $\log m$ is almost constant for $m$ near the mean $t S_n/T_n$ of $Z$, and so the expected value can be approximated by

$\frac{T_n}{S_n t} \frac{t S_n}{T_n} \log \frac{t S_n}{T_n} = \log t + \log S_n - \log T_n.$

Observe that $\log S_n - \log T_n = nR$, where $R$ is the per-character redundancy of English. Therefore Shannon’s approximation becomes

$H(K|Y_n) = \log t - nR.$

When $n$ is large compared to $t$, Shannon uses a Poisson approximation. As an alternative, we argue from Corollary 5. Let $Z^-$ be distributed as $\mathrm{Bin}(S_n/T_n, t-1)$. We have

\begin{aligned} H(K|Y_n) &= \mathbb{E}[\log (1 + Z^-)] \\ &\approx \log \bigl( 1+\mathbb{E} [Z^-] \bigr) \\ &= \log \bigl( 1 + (t-1)S_n/T_n \bigr) \\ &\approx (t-1)S_n/T_n \\ &\approx t2^{-nR}.\end{aligned}

The graph of $H(K|Y_n)$ is therefore as sketched below.

The quantity $H(K) / R = \log t / R$ is known as the unicity distance of the cipher. Roughly one expects that, after observing $n$ characters of the ciphertext, the key will be substantially known.

#### A more general random cipher?

It would be interesting to generalize the random cipher to the case when $|\mathcal{P}_n| < |\mathcal{C}_n|$ and we pick random injections.

## Cipher Systems: possible extras

September 12, 2017

This year I’m lecturing our course Cipher Systems for the first time. The purpose of this post is to record some of the ideas that I hope to at least touch on in the course.

### Cryptocurrency (1) RSA Scheme

Of course the Bitcoin blockchain is a splendid advertisement for hash functions. But here I want to give a much simpler example of a toy cryptocurrency in the RSA setup.

TTTT (Totally Trusted Transmission Technology) is going into the cryptocurrency game. In readiness, its RSA modulus $N$ and encryption exponent $e$ are prominently posted on the corporate website. TTTT issues currency units by signing the string ‘TTTT promises faithfully to redeem BitFlop $x$ for £100′, where $x$ is the serial number of the relevant BitFlop. Let $f(x) \in \{0,1,\ldots, N-1\}$ be the number representing this string for BitFlop $x$. (The function $f$ is public knowledge and is injective.) We say $f(x)$ is an unsigned BitFlop.

In the proposed protocol, a customer wishing to buy a BitFlop sends TTTT a self-addressed envelope stuffed with $20$ used fivers. TTTT removes the fivers and inserts a signed Bitflop $f(x)^d$ mod $N$, neatly typed out on letterhead paper. The serial number $x$ is inscribed in red ink in the company’s ledgers. Alice, as well as anyone she gives the signed BitFlop to, can calculate $(f(x)^d)^e = f(x)$, read the string containing the serial number, and know they have a legitimate BitFlop signed by TTTT. To redeem a BitFlop the postal process is reversed, and $x$ is crossed off from the ledger.

One of the many problems with this scheme is that there is nothing to stop a nefarious Alice (with access to a photocopier) from passing the same signed BitFlop onto both Bob and Charlie. To get round this, TTTT decides to publish its ledger of issued BitFlop serial numbers on the web, reasoning that Bob can then check that he has received an unredeemed BitFlop.

Another drawback is that if Alice buys a BitFlop from TTTT and then gives it to Bob, who then redeems it, TTTT can connect Alice and Bob. (Admittedly not necessarily as parties in the same transaction because of the possibility of a chain Alice to Charlie to Bob.)

#### Blind signing

To get around the second drawback, TTTT decides to use blind signing.

In the new protocol, if Alice wishes to transfer money to Bob, Bob (not Alice) gets TTTT to send him $f(x)$, representing a candidate-BitFlop’ with serial number $x$. TTTT enters $x$ in a new ledger. Bob passes $f(x)$ to Alice, who calculates the product $a^e f(x)$ mod $N$ for some privately chosen $a$. Alice then sends TTTT this number, along with the usual envelope of used fivers, and receives by return

$(a^e f(x))^d = a^{ed}f(x)^d = a f(x)^d \text{ mod } N.$

To transfer the BitFlop, Alice divides by $a$ and gives the signed BitFlop $f(x)^d$ to Bob. Bob can check that $x$ appears in the public ledger and later redeem the BitFlop. Over the course of the protocol:

• Alice learns: $a$, $f(x)$, $f(x)^d$
• Bob learns: $f(x)$, $f(x)^d$
• TTTT learns $f(x)$, that Alice has paid for a signature of $a^e f(x)$, that Bob has redeemed the signed BitFlop $f(x)^d$.

Assumingly (generously enough) that TTTT has many customers, there is no way for TTTT to associate $f(x)$ with $a^e f(x)^e$, or Alice’s transaction with Bob’s.

#### Question

1. What protection does BitFlop have against double spending?
2. TTTT decides it would be a nice touch to allow customers to choose their preferred serial number and encryption exponent. (TTTT knows the factorization $N = pq$, so can easily compute the decryption exponent $d'$ for any invertible $e'$. What could possibly go wrong?
3. While still depending on the postal system to receive used fivers, TTTT decides that email could be used for all other customer communications. Emails have to be encrypted, but no problem, TTTT already has $N$ and $e$ published on its website. What could possibly go wrong, under (a) the blind-signing scheme, (b) the original scheme?

1. TTTT is protected by its ledger system. Using a private ledger there is nothing to stop Alice passing the same BitFlop to both Bob and Charlie; if Bob redeems the BitFlop first, he gets its full value, while Charlie gets a nasty surprise later. Using a public ledger Bob and Charlie can verify the BitFlop is unspent at the time they accept it, but are then in a race to redeem it. (Switching from post to email doesn’t really help: it just means the race is run faster.)

This might be acceptable: for instance if Alice is in debt to Bob, and pays Bob by BitFlop, then Bob can simply wait for the envelope of used fivers to arrive before agreeing the debt is settled. Alice cannot settle two debts in this way, while Bob cannot plausibly claim to have had the transaction refused by TTTT when $x$ is removed from the public ledger.

In other cases, for instance if Alice wishes to purchase a physical object from Bob, some escrow system or degree of trust is needed even for less dubious schemes than BitFlop.

2. When Alice’s request for an encryption exponent $e$ such that $(e, \phi(N)) \not= 1$ is politely refused by TTTT, she learns a prime factor of $p-1$ or $q-1$. This could be useful in a Pollard $\rho$ attack.
3. (a) Eve who is snooping on communications between Bob and TTTT, intercepts an encrypted message $M^e$ mod $N$ from Bob to TTTT. Eve sends $M^e$ mod $N$ to TTTT, along with the usual envelope of used fivers. TTTT, believing that $M^e$ mod $N$ is an obfuscated unsigned BitFlop $a^e f(x)$ mod $N$, happily sends back $(M^e)^d = M$ to Eve. So for the price of one BitFlop, Eve has obtained the supposedly confidential message $M$.

(b) One day Alice asks TTTT to start signing its messages. Clearly authentication is good practice, so TTTT agrees. A typical message from TTTT to Alice, signed using the hash function $h$, is a pair $(M^{e_A} \text{ mod } N_A,h(M)^d \text{ mod } N)$. Alice notes that any email to TTTT is immediately bounced back with response Dear Ms. Alice, your custom is important to us. TTTT will reply within four months …’. She carefully crafts a message from a Ms. TYH!ubN(CZ…, such that TTTT’s automatic response $M$ has hash $h(M) = f(x)$ for some serial number $x$. (Again this is rather easy if $f$ is bijective.) The signed message is then $(?, f(x)^d \text{ mod } N)$, giving Alice (and anyone else snooping) the signed BitFlop with serial number $x$. Of course this $x$ does not appear in the company’s ledger, so this is most obviously a problem for the first scheme. But even with the public ledger, a malicious Alice can bombard TTTT with emails and so acquire a large number of ‘protoBitFlops’. Publishing each $f(x)^d$ destroys an unledgered BitFlop $x$, and creates a race to redeem if $x$ is already in the ledger.

### The Boomerang Attack

For the moment this is just a link to Wagner’s original paper.

### Cryptocurrency (2) Block chain

Linked lists of type Int may be defined in Haskell by Data List = Empty | Cons Int List. For example Cons 2 (Cons 1 Empty) represents the list [2,1]. We define an analogous data type HashList in which each a cons-cell with data $x$ is argumented by a hash value $h$. At the level of types $h$ could be any Int, but in a well-formed hash list, we will suppose that $h$ is the sum of the hash of $x$ and the hash of the tail of the list. Assume that hashInt :: Int -> Int has already been defined.

   type Hash = Int
data HashList = Empty | Cons Int Hash HashList

hash :: HashList -> Hash
hash Empty = 0
hash (Cons _ hashValue _) = hashValue

cons :: Int -> HashList -> HashList
cons x tail = Cons x h tail
where h = hashInt x + hash tail
 

For example, if (to make a simple illustration) hashInt x = x*x then the hash list with data 3, 2, 1 is

   Cons 3 14 (Cons 2 5 (Cons 1 1 Empty))

the hash values being $1^2 = 1$, $2^2 + 1 = 5$ and $3^2 + 5 = 14$. Note that data values contribute to the hash (albeit at the head only) as soon as they enter a hash list.

As defined above, hash lists provide warning of accidental corruption. But they offer no protection against malicious corruption. Indeed, the function cons allows anyone to create well-formed hash lists having arbitrary values.

In the Bitcoin blockchain, malicious corruption is prevented—or at least, made as hard as inverting a one-way function—by digital signatures. Again I find the Haskell code the clearest way to present the idea. We assume there is a basic public key infrastructure in place, including functions with the types below such that (unsign m) . (sign m) is the identity for each person $m$.

   type Person = Int

sign   :: Person -> Int -> Int
unsign :: Person -> Int -> Int

type SignedHash = (Person, Int)
data BlockChain = Empty
| Cons Int SignedHash BlockChain

signedHash :: BlockChain -> SignedHash
signedHash Empty = 0
signedHash (Cons _ signedHash _) = signedHash

cons :: Person -> Int -> BlockChain -> BlockChain
cons m x tail = Cons x (m, (sign m h)) tail
where h = let (_, h') = signedHash tail
in  hashInt m + hashInt x + h'



It is a simple exercise to write a function verify :: BlockChain -> Bool that checks a block chain is valid.

    verify Empty = True
verify (Cons x (m, h) tail) =
let (_, h') = signedHash tail
in  unsign m h ==    hashInt m + hashInt x + h'
&& verify tail


Note that verify only calls the publically known function unsign m. For example, if (thanks to an appalling programming error swapping encryption and decryption functions in Diffie–Hellman), Person $0$'s signature function turns out to be $x \mapsto x^3$ (modulo some large RSA modulus) then the block chain with data 3, 2, 1 is

   Cons 3 (0, 134*134*134) (Cons 2 (0, 5*5*5)
(Cons 1 (1, 1*1*1) Empty))


which passes verification. (Person $0$ is particularly unfortunate in that his identity enters the hash by adding $0^2$.) A well-formed length one block chain Cons x (m, sign m (hash x)) Empty is equivalent to a signed $x$ from Person $m$.

Using BlockChain one could implement a simple cryptocurrency, along the lines of the 'ScroogeCoin' considered in the recent book Bitcoin and cryptocurrency technologies (draft available online). Here Scrooge is a trusted party who must sign every transaction, and has sole responsibility for the integrity of the currency. The brilliance of Bitcoin lies in how it achieves decentralization by rewarding (with Bitcoins) anyone willing to play the role of an honest Scrooge, while still thwarting double spending.

A minimal working implementation of BlockChain is available here.

## Generalizations of Nim

June 18, 2017

Probably everyone knows that the P-positions (previous player has won) in Nim are exactly those $(x_1,\ldots,x_r)$ such that $x_1 \oplus \cdots \oplus x_r = 0$, where $\oplus$ denotes bitwise XOR. For example, $(3,2,1)$ is a P-position, because $3 \oplus 2 \oplus 1 = 11_2 \oplus 10_2 \oplus 01_2 = 00_2 = 0$, whereas one glance at $(9,7,5,1)$ shows it is an N-position (next player wins), because only the pile of size $9$ contains $2^3$ in binary. Any winning move must take at least $2$ counters from this pile: doing so leaves a pile of size $7 = 4+2+1$, which is ideal for balancing the contribution $7 \oplus 5 \oplus 1 = 3$ from the other piles. Therefore we leave $3$ counters, taking $6$, reaching the P-position $(3,7,5,1)$.

Every finite impartial game is equal to a nimber. Nim is unusual in that knowing the P-positions immediately determines all its nimbers: $(x_1,\ldots,x_r) = m \star$ if and only if $(x_1,\ldots,x_r) + m\star = 0$ (disjoint sum of games) if and only if $(x_1,\ldots,x_r,m)$ is a P-position.

Theorem. The nimber of $(x_1,\ldots, x_r)$ is $(x_1 \oplus \cdots \oplus x_r)\star$.

Proof. By the previous paragraph, it suffices to find a winning move when $x_1 \oplus \cdots \oplus x_r = 2^s + M$ with $M < 2^s$. By induction, we may assume that all options of $(x_1,\ldots,x_r)$ have their expected nimbers. By reordering piles we may assume without loss of generality that $x_1 = 2^{s+1}A + 2^s + B$ where $A \in \mathbb{N}_0$ and $B < 2^s$. We take all but $2^{s+1}A + 2^s-1$ counters from the first pile, pause to admire the situation, and then take further counters to leave exactly $2^{s+1}A + (2^{s+1}A \oplus x_2 \oplus \cdots \oplus x_r)$. The parenthetical quantity is

$2^s \oplus B \oplus x_1 \oplus \cdots \oplus x_r = 2^s \oplus B \oplus 2^s \oplus M = B \oplus M < 2^s,$

so this is always possible. $\Box$

A small generalization of this argument finds explicit moves giving options of $(x_1,\ldots,x_r)$ with all nimbers $m' \star$ with $m' < x_1 \oplus \cdots \oplus x_r$.

Example. Before I learned about Combinatorial Game Theory I got thrashed at Nim by someone whose main tactic was to reduce to the zero position $(3,2,1)$. (Of course my opponent could also win any non-balanced position with two piles.) In memory of this occasion, consider the position $(19,13,3,2,1) = 19\star + 13\star = 30 \star$. To reach $(16 + c) \star$ with $0 \le c \le 7$, we observe that $30$ and $16 + c$ both have $16$ in their binary expansion, but only $30$ has $8$ (present in the pile $13$). We mentally cancel out the $16$s, leaving $(3,13,3,2,1)$ with target $s \star$, and play, as a first step, to $(3,7,3,2,1)$. We then calculate that

$3 \oplus 3 \oplus 2 \oplus 1 \oplus c = 3 \oplus c.$

We must leave $3 \oplus c$ counters in the half-eaten pile of size $7$; since $0 \le c \le 7$, this is always possible. For instance, if $c = 5$ then since $3 \oplus 5 = 6$, we leave $6$ counters. Since $3 \oplus 6 \oplus 3 \oplus 2 \oplus 1 = 6$, on restoring the cancelled subpiles, we reach a position of nimber $(19 \oplus 6 \oplus 3 \oplus 2 \oplus 1)\star = 21 \star$, as required.

Strong players of games instinctively maximize their mobility. This principle gives a decent strategy for games ranging from Settlers of Catan (a resource acquisition game in which early mobility is essential) and Othello (in which weak players often minimize their mobility, by playing to maximizing their counters in the early game; typically this creates multiple good moves for the opponent) and even a reasonable heuristic for chess, albeit one that has been criticized by Turing. It can be seen at work in the proof and example above, where the half-eaten pile of size

\begin{aligned} 2^{s+1}A + 2^s-1 &= 2^{s+1}A + 1 + 2 + \cdots + 2^{s-1} \\ &= 2^{s+1}A \oplus 1 \oplus 2 \oplus \cdots \oplus 2^{s-1} \end{aligned}

gives us ample options.

### $p$-Nim

Nim boils down to a fight over the nimber $x_1 \oplus \cdots \oplus x_r$. With the end firmly in mind, let us ask for a generalization in which $\oplus$ is replaced with $\oplus_p$, the analogue of $\oplus$ defined with addition modulo $p$. For instance,

$12 \oplus_3 3 \oplus_3 2 \oplus_3 1 = 110_3 \oplus_3 010_3 \oplus_3 002_3 \oplus_3 001_3 = 120_3 = 15,$

and in the hoped for generalization, $(12,3,2,1)$ will have nimber $15{}\star$ and (this is just a restatement) options with nimbers $0,\star,\ldots, 14{}\star$, but not $15{}\star$. Consider the option $9\star$. This can be reached only by removing $3$ counters from both the piles of sizes $12$ and $3$. So it seems we have to permit moves in multiple piles. The position $(1,1,\ldots,1)$ with $p-1$ singleton piles shows that moves in up to $p-1$ piles may be required. So, as a first try, we make the following definition.

Definition. Naive $p$-Nim is the impartial game in which each move consists of up to $p-1$ moves in Nim.

Thus, writing $d$ for Hamming distance, a position $x \in \mathbb{N}_0^r$ in naive $p$-Nim has $y \in \mathbb{N}_0^r$ as an option if and only if $y_i \le x_i$ for each $i$ and $1 \le d(x,y) \le p-1$.

Suppose, inductively, that the options of $(x_1,\ldots, x_r)$ have the required nimbers. Then $x$ has as options all $m' \star$ with $m' < m$.

Sketch proof. Take $s$ maximal such that $p^s$ appears more often in $m$ than $m'$. Cancelling as in the example, we may assume that $p^s$ is the greatest power of $p$ appearing in the base $p$-forms of $x_1, \ldots, x_r$. So $m = ap^s + \cdots$ and $m' = a'p^s + \cdots$. Take $a-a'-1$ subpiles $p^s$ from piles containing $p^s$ in $p$-ary, and in a remaining pile of size $\alpha p^s + B$ take all but $(\alpha-1)p^s + p^s-1$ of its counters; then use its subpile of size $(p-1) + (p-1)p + \cdots + (p-1)p^{s-1}$ to reach $m' \star$. $\Box$

However, $(x_1,\ldots, x_r)$ may well have further options. By the mex rule, the problem occurs when there is an option $(y_1,\ldots,y_r)$ with $y_1 \oplus_p \cdots \oplus_p y_r = x_1 \oplus_p \cdots \oplus_p x_r$. For instance, $(3,2,1)$ has the option $(3,0,0)$, which has nimber $3 \star$. This destroys the inductive foundations for the sketch proof above. In fact $(3,2,1) = 6\star$ in naive $3$-Nim.

The obvious fix is to bar all moves taking counters $[c_1,\ldots,c_r]$ in which $c_1 \oplus_p \cdots \oplus_p c_r = 0$. (The square brackets are used to distinguish moves from positions.) But still there are too many options: for instance $(6,3)$ should be a P-position with nimber $0$ but, according to the current rules, we can move by $[4,2]$ or $[5,1]$ to the P-positions $(2,1)$ and $(1,2)$ (which really do have nimber $0$). A computer search finds the following illustrative examples, listed with their intended nimber and the additional moves that must be made illegal:

• $(3,6), 0, \{[2,4],[1,5]\}$,
• $(3,7), 1, \{[2,7],[1,5]\}$,
• $(4,8), 0, \{[2,7]\}$,
• $(5,6), 2, \{[5,4],[4,5]\}$,
• $(5,7), 0, \{[4,5]\}$,
• $(6,6), 3, \{[5,1],[4,2],[2,4],[1,5]\}$
• $(1,5,7), 1, \{[0,4,5]\}$,
• $(2,3,7), 0, \{[2,1,0],[2,2,2]],[0,2,7],[0,1,5]\}$.

Let $v_p(x)$ denote the highest power of $p$ dividing $x \in \mathbb{N}$, and let $v_p(0) = \infty$. Some inspection of these examples may suggest the following definition.

Definition. $p$-Nim is naive $p$-Nim barring any move taking counters $[c_1,\ldots,c_r]$ such that

$v_p(c_1 \oplus_p \cdots \oplus_p \cdots \oplus_p c_r) > \mathrm{min} \{v_p(c_1), \ldots, v_p(c_r) \}.$

By the definition of $v_p(0)$, this generalizes our first attempted rule. Moreover, it bars any move not changing the purported nimber $x_1 \oplus_p \cdots \oplus_p x_r$ of $(x_1,\ldots,x_r)$. The following result is a corollary of Lemmas 3.1 and 3.2 in this paper of Irie.

Theorem [Irie] The nimber of the $p$-Nim position $(x_1,\ldots,x_r)$ is $(x_1 \oplus_p \cdots \oplus_p x_r)\star$.

Proof. Let $m = x_1 \oplus_p \cdots \oplus_p x_r$ and let $m' < m$. Let $p^b = \nu_p(m-m')$ and let $p^s$ be the greatest power of $p$ appearing in $m-m'$ in $p$-ary. Cancelling as before, we may assume that $p^s$ is the greatest power of $p$ appearing in the base $p$-forms of $x_1, \ldots, x_r$. Similarly, we may cancel the powers $p^a$ with $a < b$ between $m$ and $m'$. So the only powers of $p$ that appear are between $p^b$ and $p^s$. (This will guarantee that our move satisfies the valuation condition.) Now play as in the sketch proof, replacing $p^s - 1$ with

$p^s - p^b = (p-1)p^b + (p-1)p^{b+1} + \cdots + (p-1)p^{s-1}.$

Since $m \star$ is not an option of $(x_1,\ldots,x_r)$, the mex rule implies that the nimber of $(x_1,\ldots,x_r)$ is $m \star$, as required. $\Box$

In fact Irie’s result is more general: a surprising effect of the valuation condition is that allowing moves in $p$ or more piles does not create options with new nimbers. This leads to the notion of the $p$-saturation of a game. The main focus of Irie’s paper is the $p$-saturation of Welter’s game, which is shown to have a remarkable connection with the representation theory of the symmetric group.

### Back to naive $p$-Nim

It seems that in many cases the nimbers in naive $p$-Nim are given by ordinary (naive?) addition. For example, this is true for all $k$-pile positions whenever $p > k$. When $p =3$, the exceptions for $3$ pile positions (in increasing order) with a small first pile are listed below.

• $(1,x,y)$: $(1,1,1) = 0$,
• $(2,x,y)$: $(2,2,2) = 0, (2,2,3) = \star$,
• $(3,x,y)$: $(2,2,3) = \star, (3,3,3) = 0, (3,3,4) = 2\star$,
• $(4,x,y)$: $(3,3,4) = 2\star, (4,4,4) = 0, (4,4,5) = \star, (4,4,6) = 3\star$, $(4,5,5) = 3\star$.

For instance, the matrix below shows nimbers for $(2,x,y)$ with $0\le x,y \le 4$.

$\left(\begin{matrix} 2 & 3 & 4 & 5 & 6 \\ 3 & 4 & 5 & 6 & 7 \\ 4 & 5 & 0 & 1 & 8 \\ 5 & 6 & 1 & 8 & 9 \\ 6 & 7 & 8 & 9 & 10 \end{matrix} \right)$

Once the pattern $(2,x,y) = (2+x+y)\star$ is established (this happens by $(2,4,4)$), the mex rule implies that it continues.

## Burnside’s method

April 24, 2017

Burnside proved in 1901 that if $p$ is an odd prime then a permutation group containing a regular subgroup isomorphic to $C_{p^2}$ is either imprimitive or 2-transitive. His proof was an early application of character theory to permutation groups. Groups with this property are now called B-groups.

Burnside attempted to generalize his 1901 result in two later papers: in 1911, he claimed a proof that $C_{p^n}$ is a B-group for any prime $p$ and any $n \ge 2$, and in 1921, he claimed a proof that all abelian groups, except for elementary abelian groups, are B-groups. The first claim is correct, but his proof has a serious gap. This error appears to have been unobserved (or, just possibly, observed but ignored, since the result was soon proved in another way using Schur’s theory of $S$-rings) until 1994 when it was noted by Peter Neumann, whose explication may be found in his introduction to Burnside’s collected works. In 1995, Knapp extended Burnside’s argument to give a correct proof. Burnside’s second claim is simply false: for example, $S_4 \wr S_2$ acts primitively on $\{1,2,3,4\}^2$, and has a regular subgroup isomorphic to $C_4 \times C_4$. In one of my current projects, I’ve simplified Knapp’s proof and adapted Burnside’s character-theoretic methods to show, more generally, that any cyclic group of composite order is a B-group.

The purpose of this post is to record some proofs omitted for reasons of space from the draft paper. This companion post has some notes on B-groups that may be of more general interest.

### Sums over roots of unity

Let $\xi$ be a primitive $n$th root of unity. Define

$R(r) = \{ r, r + p^{n-1}, \ldots, r+ (p-1)p^{n-1} \}$

for $0 < r < p^{n-1}$. Define a subset $Z$ of $\{1,\ldots,p^n-1\}$ to be null if there exists $s \in \mathbb{N}_0$ and distinct $r_{ij} \in \{1,\ldots, p^{n-1}-1\}$ for $0 \le i \le p-1$ and $1 \le j \le s$ such that $r_{ij} \equiv i$ mod $p$ for each $i$ and $j$ and

$Z = \bigcup_{i=0}^{p-1} \bigcup_{j=1}^s R(r_{ij}).$

Proposition 6.2 Let $\omega = \zeta^{p^{n-1} c}$ where $c$ is not divisible by $p$. Let $\mathcal{O} \subseteq \{1,\ldots,p^n-1\}$. Then

$\sum_{i \in \mathcal{O}} \zeta^i = \sum_{i \in \mathcal{O}} \omega^i$

if and only if either

1. $\mathcal{O}$ is null; or
2. $\mathcal{O} = \{p^{n-1}, \ldots, (p-1)p^{n-1} \} \; \cup \; \bigcup_{i=1}^{p-1} R(r_i) \; \cup \; Z$ where $Z$ is a null set, the $r_i$ are distinct elements of $\{1,\ldots,p^{n-1}-1\}\backslash Z$ and $r_i \equiv i$ mod $p$ for each $i$.

Proof. Since the minimum polynomial of $\zeta$ is

$1+X^{p^{n-1}} + \cdots + X^{(p-1)p^{n-1}}$

we have $\sum_{i \in R(r)} \zeta^i = 0$. Since $\xi^{p^{n-1}} = \omega$, we have $\sum_{i \in R(r)} \omega^i = p\omega^r$. It follows that $\sum_{i \in Z} \xi^i = \sum_{i \in Z} \omega^i = 0$ if $Z$ is a null set. (For the second equality, note the contributions from the $R(r_{ij})$ for fixed $j$ combine to give $p + p\omega + \cdots + p\omega^{p-1} = 0$.) For (2) we have $\sum_{i=1}^{p-1} \xi^{i p^{n-1}} = \omega + \cdots + \omega^{p-1} = -1$, $\sum_{i=1}^{p-1} \omega^{i p^{n-1}} = (p-1)$ and $\sum_{i=1}^{p-1} p \omega^i = - p$. This proves the ‘if’ direction.

Conversely, by Lemma 2.1 in the paper, $\mathcal{O} \backslash \{p^{n-1},\ldots,(p-1)p^{n-1} \}$ is a union of some of some of the sets $R(r)$. There exists a unique subset $A$ of $\{1,\ldots,p-1\}$ and unique $j_i \in \mathbb{N}$ for $0 \le i \le p-1$ and unique $r_{ij} \in \{1,\ldots, p^{n-1}-1\}$ for $1 \le j \le s$ and $0 \le i \le j_i$ such that

$\mathcal{O} = \{ p^{n-1} i : i \in A \} \cup \bigcup_{i=0}^{p-1}\bigcup_{j=1}^{i_j} R(r_{ij}).$

We have $\sum_{i \in \mathcal{O}} \zeta^i = \sum_{i \in A} \omega^i$ and $\sum_{i \in \mathcal{O}} \omega^i = |A| + \sum_{i=0}^{p-1} pj_i \omega^i$. Therefore

$|A| + \sum_{i=0}^{p-1} (pj_i - [i \in A]) X^i$

has $\omega$ as a root. Since this polynomial has degree at most $p-1$ and the minimal polynomial of $\omega$ is $1+X +\cdots + X^{p-1}$, it follows that the coefficients are constant. Hence

$|A| + pj_0 = pj_i - [i \in A]$

for $1\le i \le p-1$. If $A = \varnothing$ then $j_0 = j_1 = \ldots = j_{p-1}$ and $\mathcal{O}$ is null. Otherwise, taking the previous displayed equation mod $p$ we see that $A = \{1,\ldots,p-1\}$, and, moreover, the $j_i$ are constant for $1 \le i \le p-1$. (This holds even if $p=2$.) Set

$s = j_1 = \ldots = j_{p-1}.$

We have $j_0 = s-1$. Hence, $s \in \mathbb{N}$ and choosing in any way $r_i \in \{1,\ldots,p^{n-1}-1\}$ for $1 \le i \le p-1$ such that $r_i \in \mathcal{O}$, we see that $\mathcal{O}$ is the union of $\{p^{n-1}, \ldots, (p-1)p^{n-1}\}$, the sets $R(r_i)$ for $1 \le i \le p-1$ and a null set. $\Box$.

### Ramanujan matrices.

For $p$ prime and $n \in \mathbb{N}$ we define $R(p^n)$ to be the matrix

$\left( \begin{matrix} \scriptstyle 1 & \scriptstyle1 & \scriptstyle 1 & \scriptstyle\ldots & \scriptstyle 1 & \scriptstyle1 & \scriptstyle1 \\ \scriptstyle-1 & \scriptstyle p-1 & \scriptstyle p-1 & \scriptstyle \ldots & \scriptstyle p-1 & \scriptstyle p-1 & \scriptstyle p-1 \\ \scriptstyle0 & \scriptstyle -p & \scriptstyle p(p-1) & \scriptstyle \ldots & \scriptstyle p(p-1) & \scriptstyle p(p-1) & \scriptstyle p(p-1) \\ \scriptstyle 0 & \scriptstyle 0 & \scriptstyle -p^2 & \scriptstyle \ldots & \scriptstyle p^2(p-1) & \scriptstyle p^2(p-1) & \scriptstyle p^2(p-1) \\ \scriptstyle\vdots & \scriptstyle \vdots & \scriptstyle \vdots & \scriptstyle \ddots & \scriptstyle \vdots & \scriptstyle \vdots & \scriptstyle \vdots \\ \scriptstyle 0 & \scriptstyle 0 & \scriptstyle 0 & \scriptstyle \ldots & \scriptstyle -p^{n-2} & \scriptstyle p^{n-2}(p-1) & \scriptstyle p^{n-2}(p-1) \\ \scriptstyle 0 & \scriptstyle 0 & \scriptstyle 0 & \scriptstyle \ldots & \scriptstyle 0 & \scriptstyle -p^{n-1} & \scriptstyle p^{n-1}(p-1) \end{matrix} \right).$

More generally, if $d$ has prime factorization $p_1^{a_1} \ldots p_s^{a_s}$, we define $R(d) = R(p_1^{a_1}) \otimes \cdots \otimes R(p_s^{a_s})$. The rows and columns of $R(d)$ are labelled by the divisors $D$ of $2^m p^n$, as indicated below for the case $d = 2^3 p$, with $p$ and odd prime.

$R(2^3 p) = \begin{matrix} 1 \\ 2 \\ 2^2 \\ 2^3 \\ p \\ 2p \\ 2^2p \\ 2^3p \end{matrix} \left( \begin{array}{cccc|cccc} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ -1 & 1 & 1 & 1 & -1 & 1 & 1 & 1 \\ 0 & -2 & 2 &2 & 0 & -2 & 2 & 2 \\ 0 & 0 & -2^2 & 2 & 0 & 0 & -2^2 & 2^2\\ \hline -1 & -1 & -1 & -1 & 1 & 1 & 1 & 1 \\ 1 & -1 & -1 & -1 & -1 & 1 & 1 & 1 \\ 0 & 2 & -2 & -2 & 0 & -2 & 2 & 2 \\ 0 & 0 & 2^2 & -2^2 & 0 & 0 & -2^2 & 2^2 \end{array} \right).$

Say that a partition of $D \backslash \{ d\}$ is coprime if the highest common factor of the numbers in each part is $1$. The aim of the game is to find a non-empty set of rows, $X$ say, of $R(d)$ such that the subsets of columns (excluding column $d$) on which the sum of the rows in $X$ are equal form a coprime partition of the divisors.

There is an application to B-groups only when $d$ is even, in which case we may assume that $2 \in X$. Proposition 6.7 in the paper states that in this case, the only way to win this game when $d = 2^n$, $d = 2^n p$ or $d = 2 p^n$, where $p$ is an odd prime, is to put every single row in $X$. This implies that $C_{2^m p^n}$ is a B-group when $n \le 1$ or $m \le 1$. However the result on the game may well hold more generally.

#### Constant sums for $p^n$

The only coprime partition of $D \backslash \{p^n\}$ has a singleton part, so the row sums are all equal. By adding $p^{e-1}$ to each entry in row $p^e$ for $e \in \{1,\ldots, n\}$, we obtain the matrix below.

$\left( \begin{matrix} \scriptstyle 1 & \scriptstyle1 & \scriptstyle 1 & \scriptstyle\ldots & \scriptstyle 1 & \scriptstyle1 & \scriptstyle1 \\ \scriptstyle 0 & \scriptstyle p & \scriptstyle p & \scriptstyle \ldots & \scriptstyle p & \scriptstyle p & \scriptstyle p \\ \scriptstyle p & \scriptstyle 0 & \scriptstyle p^2 & \scriptstyle \ldots & \scriptstyle p^2 & \scriptstyle p^2 & \scriptstyle p^2 \\ \scriptstyle p^2 & \scriptstyle p^2 & \scriptstyle 0 & \scriptstyle \ldots & \scriptstyle p^3 & \scriptstyle p^3 & \scriptstyle p^3 \\ \scriptstyle\vdots & \scriptstyle \vdots & \scriptstyle \vdots & \scriptstyle \ddots & \scriptstyle \vdots & \scriptstyle \vdots & \scriptstyle \vdots \\ \scriptstyle p^{n-2} & \scriptstyle p^{n-2} & \scriptstyle p^{n-2} & \scriptstyle \ldots & \scriptstyle 0 & \scriptstyle p^{n-1} & \scriptstyle p^{n-1} \\ \scriptstyle p^{n-1} & \scriptstyle p^{n-1} & \scriptstyle p^{n-1} & \scriptstyle \ldots & \scriptstyle p^{n-1} & \scriptstyle 0 & \scriptstyle p^n \end{matrix} \right)$

which still has constant sums over the rows in $X$. Let $x_i = 1$ if $p^i \in X$ and let $x_i = 0$ otherwise. Writing numbers in base $p$, the row sums over $X$ are, for each column,

\begin{aligned} 1 \quad &\quad x_nx_{n-1} \ldots x_3x_2x_0 \\ p \quad &\quad x_n x_{n-1} \ldots x_3x_1x_0 \\ p^2\quad &\quad x_n x_{n-1} \ldots x_2x_1x_0 \\ \vdots \; \quad &\quad \qquad \vdots \\ p^{n-2} \quad &\quad x_nx_{n-2} \ldots x_2x_1x_0 \\ p^{n-1} \quad &\quad x_{n-1}x_{n-2} \ldots x_2x_1x_2. \end{aligned}

(Note in each case there are $n$ digits, the most significant corresponding to $p^{n-1}$.) The numbers in the right column are equal, hence all the $x_i$ are equal and $X = D$, as required. Taking $p=2$ this gives an alternative proof of Proposition 6.7(i).

#### Proofs of further claims on the Ramanujan matrices

Let $R(d)^{\circ}$ denote $R(d)$ rotated by a half turn. The following result was stated without proof in the paper; it implies that each $R(d)$ is invertible, with inverse $\frac{1}{d}R(d)^\circ$. (Update: I later found a much better proof, now outlined in the paper, using a lower-upper decomposition of $R(d)$.)

Proposition. We have $R(p^n)^{-1} = \frac{1}{p^n} R(p^n)^\circ$ and $\det R(p^n) = p^{n(n+1)}/2$.

Proof. Since $R(p^n)^\circ_{p^sp^t} = R(p^n)_{p^{n-s}p^{n-t}}$, we have $R(p^n)^\circ_{p^np^t} = 1$ for all $t \in \{0,\ldots, n\}$ and

\begin{aligned} R(p^n)^\circ_{p^sp^t} &= \begin{cases} (p-1)p^{n-s-1} & n-t \ge n-s \\ -p^{n-s-1} & n-t=n-s-1 \\ 0 & n-t \le n-s-2 \end{cases} \\ &= \begin{cases} (p-1)p^{n-s-1} & s \ge t \\ -p^{n-s-1} & s = t-1 \\ 0 & s \le t-2 \end{cases} \end{aligned}

for $s \in \{0,\ldots, n-1\}$ and $t \in \{0,\ldots,n\}$. We use this to show that

$\sum_{c=0}^n R(p^n)_{p^rp^c} R(p^n)^{\circ}_{p^c p^{r'}} = p^n [r=r']$

for $r,r' \in \{0,\ldots, n\}$. When $r=0$ the first term in each product is $1$, so the left-hand side is the column sum of the $n-r'$-th of $R(p^n)$; this is $p^n$ if $r'=0$ and $0$ otherwise, as required. Now suppose that $r \ge 1$. Since $R(p^n)_{p^rp^c}$ vanishes when $c \le r-2$, the left-hand side is

$-p^{r-1}R(p^n)^\circ_{p^{r-1}p^{r'}} + \sum_{c=r}^n p^{r-1}(p-1) R(p^n)_{p^cp^{r'}}.$

Take out a factor $p^{r-1}$ to define $L$. Substitute for $R(p^n)^\circ_{p^cp^{r'}}$, and split off the summand from $R(p^n)^\circ_{p^np^{r'}} = 1$ to get

\begin{aligned} L &= \begin{cases} (p-1)p^{n-r} & r' \le r-1 \\ -p^{n-r} & r' = r \\ 0 & r' \ge r+1 \end{cases} \\ & \qquad + (p-1) \sum_{c=r}^{n-1} (p-1) \begin{cases} (p-1)p^{n-c-1} & c \ge r' \\ -p^{n-c-1} & c = r'-1 \\ 0 & c \le r'-2 \end{cases} + p-1.\end{aligned}

We must now consider three cases. When $r=r'$ we get

\begin{aligned} L &= p^{n-r} + (p-1)\sum_{c=r}^{n-1} (p-1)p^{n-c-1} + p-1 \\ &= p^{n-r} + (p-1)(p^{n-r}-1) + p-1 \\ &= p^{n-r+1} \end{aligned}

as required. When $r \ge r'+1$ we have $c > r'$ in all summands so

\begin{aligned} L &= -(p-1)p^{n-r} + (p-1)\sum_{c=r}^{n-1} (p-1)p^{n-c-1} + p^{r-1}(p-1) \\ &= (p-1) \bigl( -p^{n-r} + (p^{n-r}-1) + 1 \bigr) \\ &= 0. \end{aligned}

When $r \le r'-1$ the first non-zero summand occurs for $c=r'-1$ so we have

\begin{aligned} L &= - (p-1)p^{n-r'} + (p-1) \sum_{c=r'}^{n-1} (p-1)p^{n-c-1} + p^{r-1}(p-1) \\ &= (p-1) \bigl( -p^{n-r'} + (p^{n-r'}-1) + 1 \bigr) \\ &=0. \end{aligned}

Now taking determinants, using that $\det R(p^n)^\circ = \det R(p^n)$ (the matrices are conjugate by the matrix with $1$s on its antidiagonal and $0$s elsewhere), we get $1 / \det R(p^n) = (p^n)^{n+1} \det R(p^n)$. Hence $\det R(p^n) = p^{n(n+1)/2}$, as required.

### Jordan block matrices.

Let $J$ be the $m \times m$ unipotent upper-triangular Jordan block matrix over $\mathbb{F}_p$. We have $(J-I)^m = 0$, hence $(J-I)^{p^r} = 0$ whenever $p^r \ge m$. On the other hand if $k \le m-1$ then $J^k$ has a $1$ in position $(1,k)$. Therefore $J^{p^r} = I$ if and only if $p^r \ge m$. We also need a result on the relative trace matrix $I + J + \cdots + J^{p^{r}-1}$. Note that $\binom{p}{k}$ is divisible by $p$ for $k \in \{1,\ldots, p-1\}$. (For instance, a $p$-cycle acts freely on the set of $k$-subsets of $\{1,\ldots, p\}$.) An easy inductive argument using $\binom{p-1}{k-1} + \binom{p-1}{k} = \binom{p}{k}$ shows that $\binom{p-1}{k} \equiv (-1)^k$ mod $p$ for all $k$. Now Lucas’ Theorem implies that $\binom{p^r-1}{k} \equiv (-1)^k$ mod $p$. Hence

$(J-I)^{p^r-1} = I + J + \cdots + J^{p^r-1}.$

It follows that the right-hand side is $0$ if and only if $p^r > m$.

## Model characters for wreath products with symmetric groups

April 24, 2017

Let $G$ be a finite group. The model character for $G$ is $\sum_{\chi \in \mathrm{Irr}(G)} \chi$. A nice short paper by Inglis, Richardson and Saxl gives a self-contained inductive proof that if $\pi_{2r}$ is the permutation character of $\mathrm{Sym}_{2r}$ acting by conjugacy on its set of fixed-point-free involutions then

$\sum_r (\pi_{2r} \times \mathrm{sgn}_{n-2r}) \bigl\uparrow^{\mathrm{Sym}_n}$

is the model character for $\mathrm{Sym}_n$. Assuming Pieri’s rule, that if $\alpha$ is a partition of $r$ then $\chi^\alpha \times \mathrm{sgn}_{\mathrm{Sym}_t} \uparrow^{\mathrm{Sym}_{r+t}} = \sum \chi^\lambda$, where the sum is over all partitions $\lambda$ obtained from $\alpha$ by adding $t$ boxes, no two in the same row, this follows from the well-known fact (proved inductively in the paper) that $\pi_{2r} = \sum_{\gamma \in \mathrm{Par}(r)} \chi^{2\gamma}$.

Note that $\pi_{2r} \times \mathrm{sgn}_{n-2r}\bigl\uparrow^{\mathrm{Sym}_n}$ is the induction of a linear character from the centralizer of the involution $(1,2)\ldots (2r-1,2r) \in \mathrm{Sym}_n$. (When $r=0$ we count the identity as an involution as an honorary involution.) Up to conjugacy, each involution is used exactly once to define the model character.

In an interesting paper, Baddeley generalizes the Inglis–Richardson–Saxl result to a larger class of groups. He makes the following definition.

Definition. An involution model for a finite group $G$ is a collection $\{ (\tau_1, \rho_1), \ldots, (\tau_c, \rho_c) \}$ such that $\{\tau_1,\ldots, \tau_c\}$ is a set of conjugacy-class representatives for the involutions of $G$ and $\rho_i : C_G(u_i) \rightarrow \mathbb{C}$ is a linear character for each $i \in \{1,\ldots, c\}$, chosen so that

$\sum_{\chi \in \mathrm{Irr}(G)} \chi = \sum_{i=1}^c \rho_i \bigl\uparrow_{\mathrm{Cent}_G(\tau_i)}^G$

For example, if an abelian group $A$ has an involution model then, since each centralizer is $A$ itself, comparing degrees shows that $c = |A|$, and so $A$ is an elementary abelian 2-group. Conversely, any such group clearly has an involution model. By the Frobenius–Schur count of involutions, a necessary condition for a group to have an involution model is that all its irreducible representations are defined over the reals.

Baddeley’s main theorem is as follows.

Theorem. [Baddeley] If a finite group $H$ has an involution model then so does $H \wr \mathrm{Sym}_n$.

The aim of this post is to sketch my version of Baddeley’s proof of his theorem in the special case when $H = C_2$. Some familiarity with the theory of conjugacy classes and representations of wreath products in Chapter 4 of James & Kerber, Representation theory of the symmetric group is assumed. The characters $\rho_i$ defined below differ from Baddeley’s by a factor of $\mathrm{Inf}_{\mathrm{Sym}_n}^{H \wr \mathrm{Sym}_n} \mathrm{sgn}_{\mathrm{Sym}_n}$; this is done to make $\phi_r$ (defined below) a permutation character, in analogy with the Inglis–Richardson–Saxl character $\pi_r$.

#### Aside: the Hyperoctahedral group

The group of all $n \times n$ matrices with entries $\pm 1$ that become permutation matrices when all $-1$ entries are changed to $1$ is isomorphic to $H \wr S_n$. Thus $C_2 \wr \mathrm{Sym}_n$ is the hyperoctahedral group of symmetries of the $n$-hypercube. It is a nice exercise to identify $\mathrm{Sym}_4$, the rotational symmetry group of the cube, as an explicit index $2$ subgroup of $H \wr \mathrm{Sym}_3$.

### Preliminaries

From now on let $H = \langle h \rangle \cong C_2$. The group $\mathrm{Sym}_n$ acts on $H^{\times n}$ by

$(b_1,\ldots,b_n)^\sigma = (b_{1\sigma^{-1}},\ldots,b_{n\sigma^{-1}}).$

This is a place permutation: the element $b_i$, in position $i$ on the left-hand side, occupies position $b_{i\sigma}$ on the right-hand side. Let $G = H^{\times n} \rtimes \mathrm{Sym}_n \cong H \wr \mathrm{Sym}_n.$

We write elements of $G$ as $(b_1,\ldots,b_n;t)$ where each $b_i \in \{1,h\}$.

#### Imprimitive action of $G$

For each $i \in \{1,\ldots, n\}$ introduce a formal symbol $\overline{i}$. (This could be thought of as $-i$, but I find that bar makes for a more convenient notation.) Let $\Omega = \{1, \ldots, n, \overline{1}, \ldots, \overline{n} \}$. Given $\sigma \in \mathrm{Sym}_{\{1,\ldots,n\}}$, we define $\overline{\sigma} \in \mathrm{Sym}_{\{\overline{1},\ldots,\overline{n}\}}$ by $i \overline{\sigma} = i$ for all $i \in \{1,\ldots, n\}$ and $\overline{i} \overline{\sigma} = \overline{i\sigma}$ for all $\overline{i} \in \{\overline{1},\ldots,\overline{n} \}$. Then $G$ is isomorphic to the subgroup $G_n \le \mathrm{Sym}_\Omega$ defined by $G_n = B_n \rtimes T_n$ where

$B_n = \langle (1, \overline{1}), \ldots, (n, \overline{n}) \rangle$

and

$T_n = \langle \sigma \overline{\sigma} : \sigma \in \mathrm{Sym}_n \rangle.$

So $G_n$ acts imprimitively on $\Omega$ with blocks $\{1, \overline{1}\}$, $\ldots$, $\{n,\overline{n}\}$.

#### Irreducible representations of $G$

Let $\epsilon : \{1,h\} \rightarrow \{\pm 1\}$ be the faithful character of $H \cong C_2$ and let $\widetilde{\epsilon}^{\times s}$ denote the linear character of $H \wr \mathrm{Sym}_s$ on which each of the $s$ factors of $H$ in the base group factor acts as $\epsilon$. Given a bipartition $(\lambda|\mu) \in \mathrm{BPar}(n)$ with $\lambda \in \mathrm{Par}(t)$ and $\mu \in \mathrm{Par}(n-t)$, we define

$\chi^{(\lambda|\mu)} = \bigl( \mathrm{Inf}_{\mathrm{Sym}_t}^{H \wr \mathrm{Sym}_t} \chi^\lambda \times \widetilde{\epsilon}^{\times (n-t)} \mathrm{Inf}_{\mathrm{Sym}_{n-t}}^{H \wr \mathrm{Sym}_{n-t}} \chi^\mu \bigr) \bigl\uparrow^{H \wr \mathrm{Sym}_n}.$

Basic Clifford theory shows that the characters $\chi^{(\lambda|\mu)}$ for $(\lambda|\mu) \in \mathrm{BPar}(n)$ form a complete irredundant set of irreducible characters of $G$. For example, the $n$-dimensional representation of $G$ as the symmetry group of the $n$-hypercube has character labelled by the bipartition $\bigl((n-1)|(1)\bigr)$.

#### Conjugacy classes of involutions in $G$

Since $G_n/B_n \cong T_n \cong \mathrm{Sym}_n$, any involution in $G$ is of the form $(b_1,\ldots,b_n ; \sigma)$ where $\sigma \in \mathrm{Sym}_n$ is an involution. Moreover, as the calculation $(h,1;(1,2))^2 = (h,h)$ suggests, the place permutation action of $\sigma$ on $(b_1,\ldots,b_n)$ permutes amongst themselves the indices $i$ such that $b_i = h$. By applying a suitable place permutation we may assume that $b_1,\ldots b_{n-s} = 1$ and $b_{n-s+1} = \ldots = b_n = h$, for some $s \in \{0,\ldots, n\}$. Now using that $(h,h ; (1,2))$ is conjugate, by $(h,1)$, to $(1,1;(1,2))$, we see that a set of conjugacy class representatives for the involutions in $G$ is

$\{ (\stackrel{n-s}{\overbrace{1, \ldots, 1}} , \stackrel{s}{\overbrace{h, \ldots, h}} ; (1,2) \ldots (2r-1,2r) \}$

for $r$ and $s$ such that $2r+s \le n$. The generalized cycle-type invariant defined in James–Kerber can be used to show no two of these representatives are conjugate.

#### Centralizers of involutions in $G$

As an element of $G_n$, the involution defined above is

$\tau_s^{(r)} = (n-s+1,\overline{n-s+1}) \ldots (n,\overline{n}) \theta_r\overline{\theta_r}$

where, by definition, $\theta_r = (1,2)\ldots (2r-1,2r) \in T_{2r} \le \mathrm{Sym}_r$. If $(b_1,\ldots,b_n;\sigma)$ commutes with $\tau_s^{(r)}$ then, passing to the quotient, $\sigma$ commutes with $\theta_r$. Therefore the non-singleton orbits

$\{1,2\}, \{\overline{1},\overline{2}\}, \ldots, \{2r-1,2r\}, \{\overline{2r-1},\overline{2r}\}$

of $\tau_s^{(r)}$ are permuted by $(b_1,\ldots,b_n;\sigma)$, as are the remaining non-singleton orbits $\{n-s+1, \overline{n-s+1}\}, \ldots, \{n,\overline{n} \}$.

Therefore

$\mathrm{Cent}_{G_n}(\tau_s^{(r)}) = \mathrm{Cent}_{G_{2r}}(\theta_r\overline{\theta_r}) \times G_{n-(2r+s)} \times G_s$

where $G_{n-(2r+s)}$ acts on $\{2r+1, \ldots, n-s, \overline{2r+1}, \ldots, \overline{n-s}\}$ and $G_s$ acts on $\{n-s+1,\ldots,n,\overline{n-s+1},\ldots,\overline{n}\}$. Clearly $(b_1,b_2,\ldots, b_{2r-1},b_{2r}) \in G_r$ commutes with

$(1,2)\ldots (2r-1,2r) (\overline{1},\overline{2})\ldots(\overline{2r-1},\overline{2r})$

if and only if $b_1 = b_2, \ldots, b_{2r-1} = b_{2r}$. Therefore the first factor is permutation isomorphic to

$D_r \rtimes \mathrm{Cent}_{T_{2r}}(\theta_r\overline{\theta_r})$

where

$D_r = \{(b_1,b_1,\ldots,b_r,b_r) : b_1, \ldots, b_r \in H \}$.

Set $E_r = \mathrm{Cent}_{T_{2r}}(\theta_r\overline{\theta_r})$. Note that $E_r$ is permutation isomorphic to $C_2 \wr \mathrm{Sym}_r$, acting with one orbit on $\{1,\ldots,r\}$ and another on $\{\overline{1},\ldots,\overline{r}\}$. (One has to get used to the two different ways in which the group $C_2 \wr \mathrm{Sym}_r$ arises; in this post I’ve used $H$ when the $C_2$ comes from the base group.)

For example, if $n=7$ then

$\tau_2^{(2)} = (1,2)(\overline{1},\overline{2})(3,4)(\overline{3},\overline{4})(6,\overline{6})(7,\overline{7}) \in G_7$

and the centralizer of $\tau_2^{(2)}$ is generated by $(1, \overline{1})(2, \overline{2})$, $(3,\overline{3})(4,\overline{4})$, $(5, \overline{5})$, $(6,\overline{6})$, $(7,\overline{7})$ in the base group $B_7$ and $(1,2)(\overline{1},\overline{2})$, $(1,3)(2,4)(\overline{1},\overline{3})(\overline{2},\overline{4})$ and $(6,7)(\overline{6},\overline{7})$ in the top group $T_7$. The first two top group generators generate $E_2$.

#### Definition of the linear representations and reduction

The second and third factors of the centralizer are both complete wreath product, so, by analogy with the Inglis–Richardson–Saxl paper, it is natural guess to define $\rho_s^{(r)}: \mathrm{Cent}_{G_n}(\tau_s^{(r)}) \rightarrow \mathbb{C}$ so that $\rho_s^{(r)}$ restricts to:

• The trivial character on $D_r \rtimes \mathrm{Cent}_{T_{2r}}(\theta_r\overline{\theta_r}) = D_r \rtimes E_r$;
• $\mathrm{Inf}_{\mathrm{Sym}_{n-(2r+s)}}^{H \wr \mathrm{Sym}_{n-(2r+s)}} \mathrm{sgn}_{\mathrm{Sym}_{n-(2r+s)}}$ on $G_{n-(2r+s)}$;
• $\widetilde{\epsilon}^{\times s} \mathrm{Inf}_{\mathrm{Sym}_s}^{H \wr \mathrm{Sym}_s} \mathrm{sgn}_{\mathrm{Sym}_s}$ on $G_s$.

That is (omitting the details of the inflations for brevity),

$\rho_s^{(r)} = 1_{D_r \rtimes E_r} \times \mathrm{Inf} \;\mathrm{sgn}_{\mathrm{Sym}_{n-(2r+s)}} \times \widetilde{\epsilon}^{\times s} \mathrm{Inf}\; \mathrm{sgn}_{\mathrm{Sym}_s}.$

Define

$\phi_r = 1_{D_r \rtimes E_r}\bigl\uparrow^{G_{2r}}.$

Since $\widetilde{\epsilon}^{\times 2r}$ restricts to the trivial character of $D_r \rtimes E_r$, we have $\phi_r = \widetilde{\epsilon}^{\times r}\phi_r$. The definition of $\rho_s^{(r)}$ above is therefore symmetric with respect to $1_H$ and $\epsilon$. Moreover, if

$\phi_r = \sum_{(\alpha|\beta) \in \mathrm{BPar}(2r)} m_{\alpha\beta} \chi^{(\alpha|\beta)}$

then, by Pieri’s rule for the hyperoctahedral group (this follows from Pieri’s rule for the symmetric group in the same way as the hyperoctahedral branching rule follows from the branching rule for the symmetric group — for the latter see Lemma 4.2 in this paper),

$\rho_s^{(r)} \bigl\uparrow^{G_n} = \sum_{(\alpha|\beta) \in \mathrm{BPar}(2r)} \sum_{(\lambda|\mu)} m_{\alpha\beta} \chi^{(\lambda|\mu)}$

where the second sum is over all bipartitions $(\lambda|\mu)$ of $n$ such that $\lambda$ is obtained by adding $n-(2r+s)$ boxes, no two in the same row, to $\alpha$ and $\mu$ is obtained by adding $s$ boxes, again no two in the same row, to $\beta$. Therefore Baddeley’s theorem holds if and only if $\phi_r$ is multiplicity-free, with precisely the right constituents for the Pieri inductions as $s$ varies to give us every character of $H \wr \mathrm{Sym}_n$ exactly once. This is the content of the following proposition.

Proposition. $\phi_r = \sum_{(\gamma|\delta) \in \mathrm{BPar}(r)} \chi^{(2\gamma|2\delta)}$

### Proof of the proposition

To avoid some messy notation I offer a ‘proof by example’. I believe it shows all the essential ideas of the general case.

Proof by example. Take $r=3$. We have

$D_3 = \langle (1,\overline{1})(2, \overline{2}), (3,\overline{3})(4,\overline{4}), (5, \overline{5})(6, \overline{6}) \rangle \le B$

and so $\phi_3$ is induced from the trivial character of

$\mathrm{Cent}_{G_3}(\tau_3^{(0)}) = D_3 \rtimes E_3$

(Recall that $\tau_3^{(0)} = \theta_3 \overline{\theta_3}$ where $\theta_3 = (1,2)(3,4)(5,6)$ and $E_3 = \mathrm{Cent}_{T_6}(\theta_3\overline{\theta_3})$.) To apply Clifford theory, it would be much more convenient if we induced from a subgroup of $G_6 = B_6 \rtimes T_6$ containing the full base group $B_6$. We arrange this by first inducing up to $B_6 \rtimes E_3$. (For the action of $E_3$, it is best to think of $B_6$ as $(H \times H)^{\times 3}$.) The calculation

$1_{D_1}\bigl\uparrow^{G_2} = \widetilde{1_H}^{\times 2} + \widetilde{\epsilon}^{\times 2}$

shows that, on restriction to $B_6$, the induced character

$1_{D_3 \rtimes E_3} \bigl\uparrow^{B_6 \rtimes E_3}$

is the sum of all products $\psi_1 \times \psi_2 \times \psi_3$ where each $\psi_i$ is one of the irreducible characters $1_{H \wr C_2} = \widetilde{1_H}^{\times 2}$ or $\eta_{H \wr C_2} = \widetilde{\epsilon}^{\times 2}$ on the right-hand side above. The centralizer $\mathrm{Cent}_{T_6}(\theta_3 \overline{\theta_r})$ acts transitively on the 3 factors: glueing together the products in the same orbits into induced characters we get that $1_{D_3 \rtimes E_3} \uparrow^{B_6 \rtimes E_3}$ has the following irreducible constituents:

• $\widetilde{1_{H \wr C_2}}^{\times 3} 1_{S_3}$
• $\bigl( \widetilde{1_{H \wr C_2}}^{\times 2} 1_{S_2} \times \eta_{H \wr C_2} \bigr) \bigl\uparrow_{(B_4 \rtimes E_2) \times (B_2 \rtimes E_1)}^{B_6 \rtimes E_3}$
• $1_{H \wr C_2} \times \widetilde{\eta_{H \wr C_2}}^{\times 2} 1_{S_2} \bigl\uparrow_{(B_2 \rtimes E_1) \times (B_4 \rtimes E_2)}^{B_6 \rtimes E_3}$
• $\widetilde{\eta_{H \wr C_2}}^{\times 3} 1_{S_3}$.

Note that the ’tilde-construction’ enters in two ways: once to combine characters of each two $H$-factors in the same orbit of $\tau_r$, and then again to combine the characters obtained in this way. As a small check, observe that the sum of degrees is $1 + 3 + 3 + 1 = 8$, which is the index of $D_3 \rtimes E_3$ in $B_6 \rtimes E_3$.

Reflecting the isomorphisms

$(H \wr C_2) \wr S_3 \cong H \wr (C_2 \wr S_3) \cong H \wr E_3 \cong B_6 \rtimes E_3$

we rewrite these characters as follows:

• $\widetilde{1_H}^{\times 6} 1_{E_3}$
• $\widetilde{1_H}^{\times 4} 1_{E_2} \times \widetilde{\epsilon}^{\times 2} 1_{E_1} \uparrow^{B_6 \rtimes E_3}$
• $\widetilde{1_H}^{\times 2} 1_{E_1} \times \widetilde{\epsilon}^{\times 4} 1_{E_2} \uparrow^{B_6 \rtimes E_3}$
• $\widetilde{\epsilon}^{\times 6} 1_{E_3}$.

It is now routine to induce ‘in the top group’ up to

$G_6 = C_2 \wr S_6 = B_6 \rtimes T_6$

using the decomposition of $\pi_r$ into characters labelled by even partitions. For the second summand we use transitivity of induction, starting at the subgroup $(B_4 \rtimes E_2) \times (B_2 \rtimes E_1)$ and going via $(B_4 \rtimes T_4) \times (B_2 \rtimes T_2)$. The third summand is dealt with similarly. Thus $\phi_3$ is the sum of the $\chi^{(\lambda|\mu)}$ for the following bipartitions $(\lambda|\mu)$:

• $\bigl((6)|\varnothing\bigr), \bigl((4,2)|\varnothing\bigr), \bigl((2,2,2)|\varnothing\bigr)$
• $\bigl((4)|(2)\bigr), \bigl((2,2)|(2)\bigr)$
• $\bigl((2)|(4)\bigr), \bigl((2)|(2,2)\bigr)$
• $\bigl(\varnothing|(6)\bigr), \bigl(\varnothing|(4,2)\bigr), \bigl(\varnothing|(2,2,2)\bigr)$.

as required. $\Box$

## Some Stirling Number identities by differentiation

March 20, 2017

Let $G(z)$ be the exponential generating function enumerating a family of combinatorial objects. For example, $-\log(1-z) = \sum_{n=1}^\infty z^n/n$ is the e.g.f. enumerating cycles (there are $(n-1)!$ cycles on $\{1,\ldots,n\}$) and $\mathrm{e}^z -1 = \sum_{n=1}^\infty z^n/n!$ is the e.g.f enumerating non-empty sets. Then $\exp G(z)$ is the e.g.f. enumerating set partitions where each part carries a $G$-structure. For example,

$\exp (\mathrm{e}^z - 1) = \sum_{n=0}^\infty B_n \frac{z^n}{n!}$

where the Bell Number $B_n$ is the number of set partitions of $\{1,\ldots, n\}$. We can keep track of the number of parts with a further variable. For example

$\exp \bigl( -x \log (1-z) \bigr) = \sum_{n=0}^\infty \sum_{m=0}^\infty \genfrac{[}{]}{0pt}{}{n}{m} x^m \frac{z^n}{n!}$

where the (unsigned) Stirling Number of the First Kind $\genfrac{[}{]}{0pt}{}{n}{m}$ is the number of permutations of $\{1, \ldots, n\}$ having exactly $m$ disjoint cycles. Similarly

$\exp \bigl( x(\mathrm{e}^z - 1) \bigr) = \sum_{n=0}^\infty \sum_{m=0}^\infty \genfrac{\{}{\}}{0pt}{}{n}{m} x^m \frac{z^n}{n!}$

where the Stirling Number of the Second Kind $\genfrac{\{}{\}}{0pt}{}{n}{m}$ is the number of set partitions of $\{1, \ldots, n\}$ into exactly $m$ parts.

All this is explained beautifully in Chapter 3 of Wilf’s book generating functionology, in a way that leads readily into the high-brow modern take on these ideas using combinatorial species. For my planned combinatorics textbook I expect to deal with products of exponential generating functions in an ad-hoc way, and probably not do much more, since it’s impossible to compete with Wilf’s exposition.

#### Two part structures where one part is boring

A special case of the multiplication rule is that $\exp(z) G(z)$ is the e.g.f enumerating two-part set compositions $(A,B)$ where $A$ carries no extra structure and $B$ carries a $G$-structure. For example, if $d_n$ is the number of derangements of $\{1,\ldots, n\}$ then, since any permutation is uniquely determined by its set of fixed points and the derangement it induces on the remaining points, we have $\exp(z) G(z) = \sum_{n=0}^\infty n! z^n/n! = 1/(1-z)$, giving

$G(z) = \exp(-z)/(1-z).$

For another example, observe that the e.g.f enumerating the $r^n$ functions $f : \{1,\ldots, n\} \rightarrow \{1,\ldots, r\}$ is

$\sum_{r=0}^\infty r^n \frac{w^r}{r!} \frac{z^n}{n!} = \sum_{r=0}^\infty \frac{w^r}{r!} \sum_{n=0}^\infty \frac{(rz)^n}{n!} = \sum_{r=0}^\infty \frac{w^r}{r!} \mathrm{e}^{rz} = \exp (w \mathrm{e}^z).$

Each such function is uniquely determined by a pair $(A,B)$ where $A = \{1,\ldots,r\} \backslash \mathrm{im} f$ carries no extra structure, and $B = \mathrm{im} f$ carries a set composition of $\{1,\ldots, n\}$ into $|B|$ parts. Therefore the e.g.f. for set compositions is

$\exp(-w) \exp (w\mathrm{e}^z) = \exp(w (\mathrm{e}^z-1)).$

(Note this series is doubly exponential: the number of set compositions of $\{1,\ldots,n\}$ into exactly $m$ parts is the coefficient of $w^m/m! z^n/n!$.) Since there are $m!$ set compositions associated to each set partition into $m$ parts, we get

$\exp(w (\mathrm{e}^z-1) = \sum_{s=0}^\infty \sum_{n=0}^\infty \genfrac{\{}{\}}{0pt}{}{n}{m} w^m \frac{z^n}{n!}.$

as claimed earlier.

#### Binomial inversion

Multiplication by an exponential series is closely related to binomial inversion. Let $G(z) = \sum_{n=0}^\infty a_n z^n/n!$. Then

$\exp(z) G(z) = \sum_{n=0}^\infty b_n z^n/n!$

if and only if $b_n = \sum_{m=0}^n \binom{n}{m} a_m$.

This gives a very quick and elementary proof of the derangements formula: enumerating permutations by their number of fixed points we have $n! = \sum_{m=0}^n \binom{n}{m} d_m$, so

$\sum_{n=0}^\infty d_n z^n/n! = \exp(-z) \sum_{n=0}^\infty n! z^n/n! = \exp(-z)/(1-z);$

now take the coefficient of $z^n$ to get

$d_n = n!\bigl( 1-\frac{1}{1!} + \frac{1}{2!} - \cdots + \frac{(-1)^n}{n!} \bigr).$

#### Differentiation

Let $G(z) = \sum_{n=1}^\infty a_n z^n/n!$ be the exponential generating function for $G$-structures. We have seen that the weighted exponential generating function for $G$-structured set partitions is

$\sum_{m=0}^\infty \sum_{n=0}^\infty a^{(m)}_n x^m \frac{z^n}{n!} = \exp x G(z)$.

Differentiating $k$ times with respect to $x$, dividing by $k!$, and then setting $x=1$ we get

$\sum_{n=0}^\infty \sum_{m=0}^n a^{(m)}_n \binom{m}{k} \frac{z^n}{n!} = G(z)^k/k! \exp G(z).$

Now $G(z)^k/k!$ is the exponential generating function for $G$-structured set partitions into exactly $k$ parts, so, taking coefficients of $z^n/n!$, we get

\begin{aligned} \sum_{m=0}^n a^{(m)}_n \binom{m}{k} &= \Bigl[ \frac{z^n}{n!}\Bigr] \sum_{r=0}^\infty a^{(k)}_r \frac{z^r}{r!} \exp G(z) \\ &= \sum_{r=0}^n a^{(k)}_r \frac{n!}{r!} [z^{n-r}] \exp G(z) \\ &= \sum_{r=0}^n \binom{n}{r} a^{(k)}_r \bigl[\frac{z^{n-r}}{(n-r)!}\Bigr] \exp G(z) \\ &= \sum_{r=0}^n \binom{n}{r} a^{(k)}_r b_{n-r} \end{aligned}

where $b_n$ is the number of $G$-structured set partitions of $\{1,\ldots,n\}.$

This identity gives uniform proofs of two Stirling Number identities.

Taking $G(z) = -\log (1-z)$ to enumerate Stirling Numbers of the First Kind we have $\exp G(z) = 1/(1-z)$ and $b_n = n!$ so

$\sum_{m=0}^n \genfrac{[}{]}{0pt}{}{n}{m} \binom{m}{k} = \sum_{r=0}^n \binom{n}{r} \genfrac{[}{]}{0pt}{}{r}{k} (n-r)! = \genfrac{[}{]}{0pt}{}{n+1}{k+1}$

where the final equality holds because the middle sum counts triples $(X, \sigma, \tau)$ where $X$ is an $r$-subset of $\{1,\ldots, n\}$, $\sigma$ is a permutation of $X$ having exactly $k$ disjoint cycles and $\tau$ is a cycle on $\{1,\ldots,n,n+1\}\backslash X$; such triples are in obvious bijection with permutations of $\{1,\ldots,n ,n+1\}$ having exactly $k+1$ disjoint cycles.

Taking $G(z) = \mathrm{e}^z-1$ to enumerate Stirling Numbers of the Second Kind we have $\exp G(z) = \sum_{n=0}^\infty B_n \frac{z^n}{n!}$ and $b_n = B_n$, so

$\sum_{m=0}^n \genfrac{\{}{\}}{0pt}{}{n}{m} \binom{m}{k} = \sum_{r=0}^n \binom{n}{r} \genfrac{\{}{\}}{0pt}{}{r}{k} B_{n-r}.$

This, and the analogous identity for the Stirling Numbers of the First Kind, may be compared with

$\sum_{m=0}^n \binom{n}{m} \genfrac{\{}{\}}{0pt}{}{m}{k} = \genfrac{\{}{\}}{0pt}{}{n+1}{k+1}$

which follows from counting set partitions according to the size of the part containing $n+1$.

#### Bijective proofs

The left-hand side of the general identity above counts $G$-structured set partitions of $\{1,\ldots,n\}$ having exactly $k$ distinguished parts, and maybe some further undistinguished parts. (Typically differentiation transforms generating functions in this way). The right-hand side counts the same objects, by enumerating triples consisting of an $r$-subset $X$ of $\{1,\ldots,n\}$, a $G$-structured set partition of $X$ into $k$ distinguished parts, and an (undistinguished) $G$-structured set partition of $\{1,\ldots,n\}\backslash X$. This gives fully bijective proofs of both identities. So maybe they are not so deep: that said, even the special case $k=1$ of the first, namely

$\sum_{m=0}^n \genfrac{[}{]}{0pt}{}{n}{m} m = \genfrac{[}{]}{0pt}{}{n+1}{2}$

seemed non-obvious to me.