Permutation modules and the simple group of order 168

June 26, 2018

In an earlier post I decomposed the permutation module for \mathrm{GL}_3(\mathbb{F}_2) acting on the 7 non-zero lines in \mathbb{F}_2^3, with coefficients in a field of characteristic 2. Some related ideas give the shortest proof I know of the exceptional isomorphism between \mathrm{PSL}_2(\mathbb{F}_7) and \mathrm{GL}_3(\mathbb{F}_2). The motivation section below explains why what we do ‘must work’, but is not logically necessary.


Let h \in \mathrm{GL}_3(\mathbb{F}_2) have order 7 and let

H  = \mathrm{N}_{\mathrm{GL}_3(\mathbb{F}_2)}(\langle h\rangle) \cong C_7 \rtimes C_3.

The simple modules for \mathrm{GL}_3(\mathbb{F}_2) in characteristic 2 are the trivial module, the natural module E, its dual E^\star, and the reduction modulo 2 of a module affording the 8-dimensional character induced from either of the two non-trivial linear characters of H. (A short calculation with Mackey’s Formula and Frobenius Reciprocity shows the module is irreducible in characteristic zero; then since 168/8 is odd, it reduces modulo 2 to a simple projective. Alternatively one can work over \mathbb{F}_4 and perform the analogue of the characteristic zero construction directly.) The permutation module for \mathrm{GL}_3(\mathbb{F}_2) acting on the cosets of H, defined over \mathbb{F}_2, is projective. Since it has the trivial module in its socle and top, the only possible Loewy structure is

\begin{matrix} \mathbb{F}_2 \\ E \oplus E^\star \\ \mathbb{F}_2 \end{matrix}.

In particular, it contains a 4-dimensional \mathbb{F}_2\mathrm{GL}_3(\mathbb{F}_2)-submodule U having E as a top composition factor. Below we construct U as a module for \mathrm{PSL}_2(\mathbb{F}_7) and hence obtain the exceptional isomorphism in the form \mathrm{PSL}_2(\mathbb{F}_7) \cong \mathrm{GL}(E).


Let G = \mathrm{PSL}_2(\mathbb{F}_7) regarded as a permutation group acting on the 8 non-zero lines in \mathbb{F}_7^2. For j \in \{0,1,\ldots, 6\}, let j denote the line through (j,1), and let \infty denote the line through (1,0). Let M = \langle e_j : j \in \{0,1,\ldots, 6\} \cup \{\infty\} \rangle_{\mathbb{F}_2} be the corresponding \mathbb{F}_2G-permutation module.

Let h = (0, 1, \dots, 6) \in G, corresponding to the Möbius map x \mapsto x + 1 and to the matrix \left( \begin{matrix} 1 & 1 \\ 0 & 1 \end{matrix} \right) in \mathrm{SL}_2(\mathbb{F}_7). The simple modules for \mathbb{F}_2\langle h \rangle correspond to the factors of

x^7 + 1 = (x+1)(x^3+x+1)(x^3+x^2+1).
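This factorization is quick to verify by machine; here is a sketch in Haskell (my code, not part of the original post), representing a polynomial over \mathbb{F}_2 by its list of coefficients, lowest degree first.

```haskell
-- Multiply two polynomials over F_2, given as coefficient lists
-- (lowest degree first), reducing each coefficient mod 2.
multF2 :: [Int] -> [Int] -> [Int]
multF2 p q =
  [ sum [ p !! i * q !! (k - i) | i <- [0 .. k], i < length p, k - i < length q ] `mod` 2
  | k <- [0 .. length p + length q - 2] ]

-- (x + 1)(x^3 + x + 1)(x^3 + x^2 + 1)
product7 :: [Int]
product7 = multF2 (multF2 [1, 1] [1, 1, 0, 1]) [1, 0, 1, 1]
```

Evaluating product7 gives [1,0,0,0,0,0,0,1], the coefficient list of x^7 + 1.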

Since \{1,2,4\} \le \mathbb{F}_7^\times is permuted by squaring, an idempotent killing the simple modules with minimal polynomial x^3 + x + 1 is h^4 + h^2 + h. Therefore the module U we require has codimension 1 in the 8-3 = 5-dimensional module

M(h + h^2 + h^4) = \langle e_\infty \rangle_{\mathbb{F}_2} \oplus \left\langle \begin{matrix} e_0 + e_1 + e_3 \\ e_1 + e_2 + e_4 \\ e_2 + e_3 + e_5 \\ e_3 + e_4 + e_6 \end{matrix} \right\rangle_{\mathbb{F}_2}

(For example, applying h to the final basis vector gives e_4 + e_5 + e_0, which is the sum of the first three basis vectors.) Since e_\infty must appear, it can only be that U is generated by u_0 where

u_0 = e_\infty + e_0 + e_1 + e_3

and so U has basis u_0, u_1, u_2, u_3 where u_i = u_0 h^i for each i.

Let t = (1,2,4)(3,5,6) \in \mathrm{N}_G(\langle h \rangle) and let

j = (0,\infty)(1,3)(2,5)(4,6) \in G,

chosen so that h^t = h^2 and t^j = t^2. The corresponding Möbius maps are x \mapsto 2x and x \mapsto 3/x and one choice of corresponding matrices in \mathrm{SL}_2(\mathbb{F}_7) is \left( \begin{matrix} 4 & 0 \\ 0 & 2 \end{matrix} \right) and \left(\begin{matrix} 0 & 2 \\ 3 & 0 \end{matrix}\right). Computation shows that in their (right) action on U,

h \mapsto \left( \begin{matrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{matrix}\right), t \mapsto \left( \begin{matrix} 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 1 \end{matrix}\right), j \mapsto \left( \begin{matrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{matrix}\right).

The unique trivial submodule of U is spanned by

u_0 + u_2 + u_3 = e_0 + e_1 + \cdots + e_6 + e_\infty.
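Indeed, over \mathbb{F}_2,

```latex
\begin{aligned}
u_0 + u_2 + u_3 &= (e_\infty + e_0 + e_1 + e_3) + (e_\infty + e_2 + e_3 + e_5) + (e_\infty + e_3 + e_4 + e_6) \\
&= e_\infty + e_0 + e_1 + e_2 + e_3 + e_4 + e_5 + e_6,
\end{aligned}
```

since e_\infty and e_3 each appear three times and 3 \equiv 1 modulo 2.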

Quotienting out, we obtain the 3-dimensional representation of G where

h \mapsto \left( \begin{matrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 1 \end{matrix} \right), \; t \mapsto \left( \begin{matrix} 0 & 1 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{matrix}\right), \; j \mapsto \left( \begin{matrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{matrix} \right)

Since \mathrm{PSL}_2(\mathbb{F}_7) has cyclic Sylow 3-subgroups and cyclic Sylow 7-subgroups, and a unique conjugacy class of elements of order 2 (which contains j), one can see just from the matrices above that the representation is faithful. Comparing orders, we get \mathrm{PSL}_2(\mathbb{F}_7) \cong \mathrm{GL}_3(\mathbb{F}_2).
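The order comparison in the final step is:

```latex
|\mathrm{GL}_3(\mathbb{F}_2)| = (2^3-1)(2^3-2)(2^3-2^2) = 7 \cdot 6 \cdot 4 = 168, \qquad
|\mathrm{PSL}_2(\mathbb{F}_7)| = \frac{(7^2-1)(7^2-7)}{(7-1) \cdot 2} = \frac{48 \cdot 42}{12} = 168.
```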


A weighted Chebyshev’s order inequality

June 20, 2018

Let w_i \in \mathbb{R}^{\ge 0} be non-negative weights and let a_1, \ldots, a_n \in \mathbb{R}^{\ge 0}, b_1, \ldots, b_n \in \mathbb{R}^{\ge 0} be non-negative numbers such that

\begin{aligned} &a_1 \ge a_2 \ge \ldots \ge a_n \\ & b_1 \le b_2 \le \ldots \le b_n \end{aligned}.


Then

\begin{aligned} \bigl(  \sum_{i} w_i \bigr) \bigl(\sum_{j}w_ja_jb_j \bigr)  &= \frac{1}{2} \sum_{ij} w_iw_j (a_jb_j + a_ib_i) \\ &\le \frac{1}{2} \sum_{ij} w_iw_j (a_jb_i + a_ib_j) \\ &= \bigl( \sum_i w_ia_i\bigr) \bigl( \sum_j w_jb_j \bigr)\end{aligned}

where the middle step follows from the inequality (a_jb_i+a_ib_j) - (a_jb_j + a_ib_i) = (a_j-a_i)(b_i-b_j) \ge 0. In particular, if \sum_i w_i = 1 then we obtain

\sum_i w_i a_i b_i \le \bigl( \sum_i w_i a_i \bigr)\bigl(\sum_i w_i b_i\bigr).
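A quick numerical sanity check in Haskell (a sketch with my own names, not part of the post): exact rational arithmetic on one choice of decreasing a_i, increasing b_i, and weights summing to 1.

```haskell
-- One instance of the weighted Chebyshev order inequality, checked exactly.
wvals, avals, bvals :: [Rational]
wvals = [1/2, 1/3, 1/6]   -- non-negative weights summing to 1
avals = [5, 3, 2]         -- decreasing
bvals = [1, 4, 6]         -- increasing

lhs, rhs :: Rational
lhs = sum (zipWith3 (\w a b -> w * a * b) wvals avals bvals)
rhs = sum (zipWith (*) wvals avals) * sum (zipWith (*) wvals bvals)
-- the inequality asserts lhs <= rhs
```

Here lhs is 17/2 and rhs is 391/36, so the inequality holds with room to spare.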

This is a weighted version of Chebyshev’s order inequality. It appears (with a different proof) in an interesting article in MAA Mathematics Magazine, where it is used to deduce Jensen’s inequality.

Permutation bases and an unexpected duality

June 10, 2018

Let G be a finite group acting transitively on a set \Omega. An important object in representation theory, much used by Burnside and his successors to study the action of G, is the permutation module \mathbb{K}\Omega, defined to be the vector space of all formal linear combinations of the elements of \Omega, over a chosen field \mathbb{K}. Thus \mathbb{K} \Omega has \Omega as a canonical basis.

It’s easy to fall into the trap of thinking that the canonical basis of a permutation module is in some sense special, or even unique. The latter is certainly wrong in all but the smallest cases. For example, whenever \mathbb{K} has characteristic zero, or the characteristic of \mathbb{K} does not divide |\Omega|- 1, the linear map defined on the canonical basis by

\omega \mapsto \sum_{\omega' \in \Omega \setminus \{\omega\}} \omega'

is an isomorphism of \mathbb{K} G-modules. Generally, if G has rank r in its action on \Omega then \mathrm{dim}\ \mathrm{Hom}_{\mathbb{K}G}(\mathbb{K}\Omega,\mathbb{K}\Omega) = r, and a generic linear combination of these r linearly independent homomorphisms is an automorphism of \mathbb{K}\Omega sending the canonical basis to a basis of \mathbb{K}\Omega permuted by G.

The motivation for this post is an example where \mathbb{K}\Omega has a permutation basis not obtained in this way. Take \mathrm{GL}_3(\mathbb{F}_2) acting on E = \langle e_1,e_2,e_3 \rangle_{\mathbb{F}_2}. Let \Omega = \bigl\{ \langle e \rangle : e \in E \backslash \{0 \} \bigr\} be the set of lines in E. (Each line has a unique non-zero point; but we use the line notation to make clear the distinction between e \in E and \langle e \rangle \in \mathbb{K}\Omega.) The canonical basis for \mathbb{K}\Omega is therefore

\bigl\{ \langle e_1 \rangle, \langle e_2 \rangle, \langle e_3 \rangle, \langle e_2 + e_3 \rangle, \langle e_1 + e_3 \rangle, \langle e_1 + e_2 \rangle, \langle e_1 + e_2 + e_3 \rangle \bigr\}.

For each plane \Pi contained in E, let v_\Pi = \sum_{e \in \Pi \backslash \{0\}} \langle e \rangle. Clearly v_\Pi g = v_{\Pi g}, for g \in \mathrm{GL}_3(\mathbb{F}_2), so the v_\Pi are permuted by \mathrm{GL}_3(\mathbb{F}_2). The duality \langle e \rangle \mapsto \langle e \rangle^\bot gives a bijection between the 7 lines and 7 planes. Ordering basis vectors as above, the v_{\langle e \rangle^\bot} for non-zero e \in E have the coefficients shown in the matrix P below.

\left(\begin{matrix} 0&1&1&1&0&0&0 \\ 1&0&1&0&1&0&0 \\  1&1&0&0&0&1&0 \\ 1&0&0&1&0&0&1 \\ 0&1&0&0&1&0&1 \\ 0&0&1&0&0&1&1 \\ 0&0&0&1&1&1&0 \end{matrix} \right)

The determinant of P is -24, so whenever \mathbb{K} has characteristic 0 or \mathbb{K} has prime characteristic \ge 5, the v_\Pi form a permutation basis for \mathbb{K} \Omega. Since \mathrm{GL}_3(\mathbb{F}_2) acts 2-transitively on \Omega, the endomorphism algebra of \mathbb{K}\Omega is 2-dimensional, spanned by the identity and the ‘complementing’ map defined above. Therefore this alternative basis is not obtained by an endomorphism of \mathbb{K}\Omega.
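The determinant claim is easy to check by machine; a minimal sketch in Haskell (names mine, not from the post), using Laplace expansion, which is perfectly adequate for a 7 \times 7 matrix:

```haskell
-- Laplace expansion along the first row.
det :: [[Integer]] -> Integer
det [] = 1
det m  = sum [ (-1) ^ j * (head m !! j) * det (minor j) | j <- [0 .. length m - 1] ]
  where minor j = [ [ x | (k, x) <- zip [0 ..] row, k /= j ] | row <- tail m ]

-- The matrix P of plane sums v_{<e>^perp}, rows and columns ordered as above.
p :: [[Integer]]
p = [ [0,1,1,1,0,0,0]
    , [1,0,1,0,1,0,0]
    , [1,1,0,0,0,1,0]
    , [1,0,0,1,0,0,1]
    , [0,1,0,0,1,0,1]
    , [0,0,1,0,0,1,1]
    , [0,0,0,1,1,1,0] ]
```

Evaluating det p returns -24, as claimed.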

Now for the surprising duality. When \mathbb{K} has characteristic 2, the three linear relations, the first of which is

v_{\langle e_1 \rangle^\perp} + v_{\langle e_1 + e_2 \rangle^\perp} + v_{\langle e_1 + e_3 \rangle^\perp} + v_{\langle e_1 + e_2 + e_3 \rangle^\perp} = 0

show that \mathrm{rank}\ P \le 4. It is clear from the final four rows of P that v_{\langle e_2 + e_3\rangle^\perp}, v_{\langle e_1 + e_3\rangle^\perp}, v_{\langle e_1 + e_2 \rangle^\perp}, v_{\langle e_1 + e_2 + e_3 \rangle^\perp} are linearly independent and span a 4-dimensional submodule of \mathbb{K} \Omega. The sum of these vectors, namely \langle e_1 \rangle + \cdots + \langle e_1 + e_2+e_3\rangle, spans the unique trivial submodule of \mathbb{K}\Omega. Since 7 is odd, it splits off as a direct summand. Observe from rows 4 and 7 of the matrix P that

\begin{aligned} \langle e_1 \rangle + \langle e_1 + e_2 \rangle + \langle e_1 + e_3 \rangle & + \langle e_1 + e_2 + e_3 \rangle \\ &\quad = v_{\langle e_2+e_3\rangle^\perp} + v_{\langle e_1+e_2+e_3 \rangle^\perp}. \end{aligned}

A convenient basis for a complement to the trivial submodule is therefore

\begin{aligned} w_1 &= \langle e_1 \rangle + \langle e_1 + e_2 \rangle + \langle e_1 + e_3 \rangle + \langle e_1 + e_2 + e_3 \rangle \\ w_2 &= \langle e_2 \rangle + \langle e_1 + e_2 \rangle + \langle e_2 + e_3 \rangle + \langle e_1 + e_2 + e_3 \rangle \\ w_3 &= \langle e_3 \rangle + \langle e_1 + e_3 \rangle + \langle e_2 + e_3 \rangle + \langle e_1 + e_2 + e_3 \rangle. \end{aligned}

We have shown that W = \langle w_1, w_2, w_3 \rangle_\mathbb{K} is a 3-dimensional \mathbb{K}\mathrm{GL}_3(\mathbb{F}_2)-submodule of \mathbb{K}\Omega. In the special case \mathbb{K} = \mathbb{F}_2, the representation homomorphism \rho : \mathrm{GL}_3(\mathbb{F}_2) \rightarrow \mathrm{GL}(W) is even an isomorphism. Looking at the definitions of w_1, w_2, w_3, it might seem almost obvious that W is isomorphic to the natural representation of \mathrm{GL}_3(\mathbb{F}_2).

This is not the case. There are two 3-dimensional irreducible \mathbb{F}_2\mathrm{GL}_3(\mathbb{F}_2)-modules, namely the natural module E and its dual E^\star. These are distinguished by the action of the matrices in \mathrm{GL}_3(\mathbb{F}_2) of order 7. Let

g = \left( \begin{matrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{matrix} \right)

be the companion matrix for the irreducible cubic 1 + X + X^3. Computing the action of g on W we find that

\begin{aligned} w_1 g &= \langle e_1 g \rangle + \langle (e_1 + e_2)g \rangle + \langle (e_1 + e_3) g \rangle + \langle (e_1 + e_2 + e_3)g \rangle \\ &= \langle e_2 \rangle + \langle e_2 + e_3 \rangle + \langle e_1 \rangle + \langle e_1 + e_3 \rangle \\ &= w_1 + w_2. \end{aligned}

Similar calculations show that w_2 g = w_3 and w_3 g = w_1. Reordering basis vectors as w_2, w_3, w_1, we see that in its action on W, g is represented by the matrix

\left( \begin{matrix} 0 &1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 1 \end{matrix} \right).

This is the companion matrix of the irreducible cubic 1 + X^2 + X^3, which lies in the other conjugacy class of matrices of order 7 to g. Therefore W \cong E^\star.

Since \mathrm{dim}\ \mathrm{End}_{\mathbb{F}_2\mathrm{GL}_3(\mathbb{F}_2)}(\mathbb{F}_2\Omega) = 2, it follows that \mathbb{F}_2\Omega \cong  \mathbb{F}_2 \oplus U where U is an indecomposable 6-dimensional \mathbb{F}_2\mathrm{GL}_3(\mathbb{F}_2)-module with \mathrm{rad}\ U = \mathrm{soc}\ U \cong E^\star and U / \mathrm{rad}\ U \cong E. As expected, the summand U is self-dual, but I think it is a mild surprise that it has E^\star and not E in its socle.

Hill climbing on substitution ciphers

May 22, 2018

On Saturday the BSHM held a fascinating meeting The history of cryptography and codes. In the second talk, Solving Historical Ciphers with Modern Means, Klaus Schmeh mentioned an attack on the old-school Turning Grille cipher by the thoroughly new-school method of hill climbing. This got me thinking about whether hill climbing could be an effective attack on the monoalphabetic substitution cipher.

As a running example, we use the ciphertext below; it is the encryption, by a randomly chosen substitution cipher, of the first two sentences in Section 1.1 of Stinson’s highly recommended book Cryptography: Theory and Practice, punctuation and spaces deleted.


Frequency analysis shows that the most common letters are xzugk with 30, 23, 23, 21 and 19 occurrences, respectively. We therefore guess that x is the encryption of e, that z and u are the encryptions of t and a (in some order), and so on. Decrypting by the key zpxyjcofvuidsqkrlmnwabgeht implied by the frequency counts gives


From here I think it is feasible—but certainly non-trivial—to complete the decryption by hand. A good place to start is the final seven letters exafche, on the assumption that the (common) e and a are correct, and the (rare) x is also correct.

For a hill climb, we need

  • a place to start;
  • a way to step;
  • a scoring function.

The obvious starting point is the key guessed by frequency analysis.
Repeated steps must allow the climb to reach all 26! permutations of the Roman alphabet. A natural generating set for the symmetric group on the alphabet is the 26 \times 25 /2 = 325 transpositions. Thus in each step we interchange two letters of the decryption key. For instance, in the first step in the example below, zpxyj… becomes zyxpj…, swapping the decrypts of b and d.
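In code, the step set can be generated as follows (a sketch with my own names, not the post's implementation):

```haskell
-- All keys obtained from a key by swapping two of its letters:
-- for a 26-letter key this gives the 26 * 25 / 2 = 325 neighbours.
neighbours :: String -> [String]
neighbours key = [ swapAt i j key | i <- [0 .. n - 2], j <- [i + 1 .. n - 1] ]
  where
    n = length key
    swapAt i j s = [ pick m c | (m, c) <- zip [0 ..] s ]
      where pick m c | m == i    = s !! j
                     | m == j    = s !! i
                     | otherwise = c
```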

A good scoring function is far less obvious. In my first try, I scored a putative plaintext of length n by \sum_{i=1}^n s_i, where s_i is the length of the longest English word starting in position i. For example, exampletext gives scores (7,0,1,0,0,3,0,4,0,0,0), taking words from a small dictionary containing let but not ample. After 25 steps, the score increases from 115 to 187, and the hill climb then aborts, since no step improves the score. Unfortunately the putative plaintext is by this point


which, while a local maximum of the hill climb, seems barely more ‘Englishy’ than the first guess. Permitting two transpositions to be applied in each single step, the hill climb continued for 40 steps (the final 15 steps each taking a minute or so), and aborted with


of score 219. While closer to the truth, this seems a poor return on so much computational time. Increasing the dictionary size, and trying variant statistics, for example the sum of the squares of the subword lengths, gave equally discouraging results.

At this point I searched for prior art, and found Cryptanalysis of the Simple Substitution Cipher, which recommends a score based on quadgram frequencies. For instance the most common quadgram is tion, with probability p_{\mathrm{tion}} = 0.00312, followed by nthe and ther. The suggested score is \sum_q \log p_q where the sum is over all quadgrams q in the putative plaintext. Using transpositions as steps, the hill climb took 31 steps (and about 10 seconds, most of which is spent building a look-up table of quadgram frequencies) to improve the score from -3188.91 to -2278.94. Half-way through the putative plaintext is


and the final putative plaintext is correct in every character:


For the record, the decryption key is zyxwkpomvutsrqjihdcbagfeln. Similarly impressive results were obtained using trigrams or quintgrams. (Some online code I wrote for the trigram attack is linked below.) Bigrams with transposition steps aborted after 23 steps at the local maximum


but allowing double transpositions the hill climb reaches (after 37 steps and a few minutes)


which is correct up to a single transposition. This transposition is a possible step, but it is rejected by the algorithm since making it worsens the bigram score, from -1308.83 to -1309.56. Thus for bigrams with double transposition steps, this plaintext is not even a local maximum of the hill climb. (Despite this theoretical obstruction, there is obviously no problem completing the decryption.)

Some final thoughts.

  • My first choice of scoring function suffers on the example text by getting stuck at local maxima that have some plausible chunks (for example, consider the extract … operaluldesireshallebuldish), while other parts look completely wrong. The less ‘brittle’ quadgram statistic appears to lead to a smoother, steady improvement.
  • The score obtained by summing the logs of the bigram probabilities is suggested by maximum likelihood estimation. Thus the more successful score has the more solid theoretical basis.
  • These experiments suggest that, at least for the substitution cipher, the choice of scoring function is far more important than the choice of steps. My attempt to salvage a poor scoring function by increasing the number of possible steps was somewhat successful, at the cost of vastly increased running time. Given the exponential growth in the search space, it seems clear that a small menu of steps, directed by a good scoring function, and possibly iterated many times, is the ideal to aim for.
  • A simplified version of the Haskell code I wrote is online here. Because of limitations of the online environment, it only works with bigrams and a reduced list of 2500 trigrams. Some trials suggest that the reduced trigrams suffice on most ciphertexts longer than a few sentences.
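The greedy strategy used throughout can be sketched in a few lines of Haskell (a simplified sketch with my own names, not the linked code): repeatedly move to the best-scoring neighbour, stopping at a local maximum.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Greedy hill climb: move to the best-scoring neighbour until no
-- neighbour improves the score.
hillClimb :: Ord s => (k -> s) -> (k -> [k]) -> k -> k
hillClimb score steps k
  | score best > score k = hillClimb score steps best
  | otherwise            = k
  where best = maximumBy (comparing score) (steps k)
```

For instance, with score \x -> negate (abs (x - 42)) and steps \x -> [x - 1, x + 1], the climb started at 30 reaches 42.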

In my Cipher Systems course I had a computer demonstration in most lectures. I used Mathematica at first, in the hope (partially realised) that students would be able to re-use the notebooks for problem sheets, but switched to Haskell for the more computationally expensive attacks later in the course. My worry was that, since I never had time to explain the code beyond the bare mathematical details of the underlying algorithm, this would just deepen the students’ mystification. Feedback suggests, however, that these demonstrations were found surprisingly helpful.

The Gini coefficient

April 10, 2018

We define the Gini coefficient of a probability measure p on [0,\infty) with mean \mu by

\displaystyle G = \frac{\mathbb{E}[\hskip0.5pt|X-Y|\hskip0.5pt]}{2 \mu}

where X and Y are independently distributed according to p. Thus G = \frac{1}{2\mu} \int_0^\infty \int_0^\infty |x - y| \mathrm{d}p(x) \mathrm{d}p(y). Dividing by \mu makes G a dimensionless quantity. The reason for normalizing by 2 will be seen shortly.

The Gini coefficient is a measure of inequality. For instance, if p_x is the probability that a citizen’s wealth is x \in [0,\infty), then we can sample G by picking two random citizens, and taking the (absolute value of) the difference of their wealths. In

  • Utopia, where everyone happily owns the same ample amount (right up to the point when they are beheaded for treason), the Gini coefficient is 0;
  • Dystopia, where the ruler has a modest fortune of N units, and the other N-1 `citizens’ have nothing at all, \mu = 1 and the Gini coefficient is \frac{1}{2} \frac{2(N-1)}{N^2} N = 1 - \frac{1}{N};
  • Subtopia, where each serf has 1 unit, and each land-owner has 2 units (plus two serfs to boss around), \mu = 4/3. Picking two people at random, the probability that they come from different classes is 2 \cdot \frac{2}{3} \cdot \frac{1}{3} = \frac{4}{9}, hence the Gini coefficient is \frac{1}{2} \cdot \frac{3}{4} \cdot \frac{4}{9} = \frac{1}{6}.
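The three examples can be checked by exact computation; a sketch in Haskell (my names, not from the post), sampling uniformly from a finite population:

```haskell
-- Gini coefficient of a finite population, each member equally likely:
-- G = E|X - Y| / (2 mu), which simplifies to the formula below.
gini :: [Rational] -> Rational
gini xs = total / (2 * fromIntegral (length xs) * sum xs)
  where total = sum [ abs (x - y) | x <- xs, y <- xs ]
```

For instance gini [1,1,2] (Subtopia in miniature: two serfs and one land-owner) evaluates to 1/6, and gini [4,0,0,0] (Dystopia with N = 4) to 3/4 = 1 - 1/4.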

A striking interpretation of G uses the Lorenz curve. Staying with the wealth interpretation, define L(\theta) to be the proportion of all wealth owned by the poorest \theta of the population. Thus L(0) = 0, L(1) = 1 and, since the poorest third (for example) cannot have more than a third of the wealth, L(\theta) \le \theta for all \theta. When N is large we can approximate L by the following functions \ell: in

  • Utopia \ell(\theta) = \theta for all \theta;

  • Dystopia \ell(\theta) = 0 if \theta < 1 and \ell(1) = 1;
  • Subtopia \ell(\theta) = \frac{3\theta}{4} if 0 \le  \theta < 2/3 and \ell(\theta) = -\frac{1}{2} + \frac{3\theta}{2} if 2/3 \le \theta \le 1.

Since the area below the Subtopian Lorenz curve is 1/6 + 1/6 + 1/12 = 5/12, the area between the curve and the diagonal is 1/2 - 5/12 = 1/12. This is half the Gini coefficient. Clearly the same result holds in Utopia and Dystopia.

Theorem. G is twice the area between the Lorenz curve for X and the diagonal.

Proof. Let P be the cumulative distribution function for X, so \mathbb{P}[X \le x] = P(x). If you are person \omega, and your wealth is x then X(\omega) = x. Thus P(X(\omega)) \le \theta if and only if person \omega is in the poorest \theta of the population. Therefore the poorest \theta of the population form the event \{ P(X) \le \theta \}. Their proportion of the wealth is the expectation of X, taken over this event, scaled by \mu. That is

\displaystyle L(\theta) = \frac{\mathbb{E}[X 1_{P(X) \le \theta}]}{\mu}.

The area, I say, under the Lorenz curve is therefore

\begin{aligned} I &= \int_0^1 L(\theta) \mathrm{d}\theta \\  &= \frac{1}{\mu} \int_0^1 \mathbb{E}[X 1_{P(X) \le \theta}] \mathrm{d}\theta \\ &= \frac{1}{\mu} \int_0^\infty \mathbb{E}_X[X 1_{P(X) \le P(y)}] \mathrm{d}P(y)  \\ &= \frac{1}{\mu} \int_0^\infty \mathbb{E}_X[X 1_{X \le y}] \mathrm{d}P(y) \\ &= \frac{1}{\mu} \mathbb{E}_Y \mathbb{E}_X [X 1_{X \le Y}] \end{aligned}

where we made the substitution \theta = P(y) to go from line 2 to line 3. Now since \mathrm{min}(X,Y) = X 1_{X \le Y} + Y1_{Y \le X} - X 1_{X = Y}, where the event X=Y is assumed to be negligible, it follows from linearity of expectation that \mathbb{E}[\mathrm{min}(X,Y)] = 2 \mathbb{E}[X 1_{X \le Y}]. (Here \mathbb{E} denotes the expectation taken over both random variables.) Substituting we obtain

\displaystyle I = \frac{1}{\mu} \frac{\mathbb{E}[\mathrm{min}(X,Y)]}{2}.

From the identity |X-Y| = X+Y - 2\mathrm{min}(X,Y), another application of linearity of expectation, and finally the definition of G, we get

\begin{aligned}I &= \frac{1}{2\mu} \frac{\mathbb{E}[X+Y - |X-Y|]}{2} \\ &= \frac{1}{2\mu} \frac{2\mu - \mathbb{E}[\hskip0.5pt|X-Y|\hskip0.5pt]}{2} \\ &= \frac{1}{2} - \frac{G}{2}. \end{aligned}

Therefore G = 2(\frac{1}{2}-I), as claimed.

Some related integrals. My interest in this mainly comes from a way to compute G from the formula

\begin{aligned} \int_0^\infty (1-P(x))^2 \mathrm{d}x &= \int_0^\infty \mathbb{P}[X > x]^2 \mathrm{d}x \\&= \int_0^\infty \mathbb{P}[Y  > x] \mathbb{P}[Z > x] \mathrm{d}x \\ &= \int_0^\infty \mathbb{P}[\mathrm{min}(Y,Z) > x] \mathrm{d}x \\ &= \mathbb{E}[\mathrm{min}(Y,Z)] \end{aligned}

where we introduced independent random variables Y and Z distributed according to p, and then used the standard result \mathbb{E}[T] = \int_0^\infty \mathbb{P}[T \ge t] \mathrm{d}t which holds for any non-negative random variable T. Very similarly, since P(x)^2 = \mathbb{P}[Y \le x]\mathbb{P}[Z \le x] we have 1-P(x)^2 = \mathbb{P}[\mathrm{max}(Y,Z) > x] and

\int_0^\infty \bigl(1-P(x)^2\bigr) \mathrm{d}x = \mathbb{E}[\mathrm{max}(Y,Z)].

Both these formulae generalize to give formulae for the minimum and maximum of k independent identically distributed random variables. For instance, a corollary of these formulae for k = 2 is that

\begin{aligned} G &= \frac{1}{2\mu} \mathbb{E}[\mathrm{max}(X,Y) - \mathrm{min}(X,Y)]  \\ &= \frac{1}{2\mu} \int_0^\infty (1-P(x)^2) - (1-P(x))^2 \mathrm{d} x \\ &= \frac{1}{\mu} \int_0^\infty P(x)(1-P(x)) \mathrm{d}x \end{aligned}

and for k = 3, since (1-q^3)-(1-q)^3 = 3q(1-q),

\mathbb{E}[\mathrm{max}(X,Y,Z)-\mathrm{min}(X,Y,Z)] = 3\mu G.

It’s perhaps surprising that sampling three people rather than two (and then taking max minus min) does not give an essentially different inequality statistic. But simply taking the middle does give something new:

\begin{aligned} \mathbb{E}&[\mathrm{middle}(X,Y,Z)] \\ &= 3\mu - \mathbb{E}[\mathrm{max}(X,Y,Z)] - \mathbb{E}[\mathrm{min}(X,Y,Z)] \\ &= 3 \mu - \int_0^\infty (1-P(x)^3) \mathrm{d}x - \int_0^\infty (1-P(x))^3 \mathrm{d}x \\ &= 3 \mu - 3\int_0^\infty (1-P(x)) \mathrm{d}x + 3 \int_0^\infty (1-P(x))^2 \mathrm{d}x  \\ & \hskip2in - 2 \int_0^\infty (1-P(x))^3 \mathrm{d}x \\ &= 3 \mu - 3 \mathbb{E}[X] + 3 \mathbb{E}[\mathrm{min}(X,Y)] - 2 \mathbb{E}[\mathrm{min}(X,Y,Z)] \\ &= 3 \mathbb{E}[\mathrm{min}(X,Y)] - 2 \mathbb{E}[\mathrm{min}(X,Y,Z)]. \end{aligned}

A similar argument shows that

\mathbb{E}[\mathrm{middle}(X,Y,Z)] = 3\mathbb{E}[\mathrm{max}(X,Y)] - 2\mathbb{E}[\mathrm{max}(X,Y,Z)].

(Alternatively this can be proved from the previous remark.) Since the mean and median are not, in general, the same, it should not be a surprise that in a sample of three people, the middle wealth is not an unbiased estimator of the mean wealth. But still I find it slightly striking that it therefore serves as a measure of inequality.
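Since both identities hold for any independent identically distributed samples, they can be confirmed exactly by enumeration; a sketch in Haskell (my names, not from the post), using the uniform distribution on a small population:

```haskell
-- Exact check of E[middle(X,Y,Z)] = 3 E[min(X,Y)] - 2 E[min(X,Y,Z)]
-- and its max analogue, for the uniform distribution on a small population.
pop :: [Rational]
pop = [0, 1, 1, 3, 5]

mean :: [Rational] -> Rational
mean l = sum l / fromIntegral (length l)

middle3 :: Rational -> Rational -> Rational -> Rational
middle3 x y z = x + y + z - maximum [x, y, z] - minimum [x, y, z]

eMiddle, eMin2, eMin3, eMax2, eMax3 :: Rational
eMiddle = mean [ middle3 x y z     | x <- pop, y <- pop, z <- pop ]
eMin2   = mean [ min x y           | x <- pop, y <- pop ]
eMin3   = mean [ minimum [x, y, z] | x <- pop, y <- pop, z <- pop ]
eMax2   = mean [ max x y           | x <- pop, y <- pop ]
eMax3   = mean [ maximum [x, y, z] | x <- pop, y <- pop, z <- pop ]
```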

Another interesting fact is that, generalizing the formulae for the cumulative distribution functions of the maximum and minimum above, the rank statistics for any number of independent samples have nice c.d.f.s. For example, since \mathrm{middle}(X,Y,Z) \le x if and only if either all three random variables are \le x or exactly two random variables are \le x, we have

P_\mathrm{middle}(x) = P(x)^3 + 3(1-P(x))P(x)^2

and in general, the j-th largest of k independent random variables has c.d.f.

P_j(x) = \sum_{i=0}^{j-1} \binom{k}{i} P(x)^{k-i}(1-P(x))^i.

A motivated proof of the MacWilliams Identity

March 11, 2018

Fix n \in \mathbb{N} and let V = \mathbb{F}_2^n. Let \mathrm{wt}(u) denote the weight of u \in V, defined by \mathrm{wt}(u) = |\{i : u_i = 1\}|. Given a subset S of V, we define its weight enumerator W_S(x,y) by

W_S(x,y) = \sum_{u \in S} x^{\mathrm{wt}(u)} y^{n-\mathrm{wt}(u)}.

The MacWilliams Identity states that if C is a linear subspace of V, and C^\perp is its orthogonal complement with respect to the dot product u \cdot v = \sum_{i=1}^n u_i v_i then

\displaystyle W_{C^\perp}(x,y) = \frac{1}{|C|} W_C(-x+y,x+y).

The MacWilliams Identity is important in coding theory. It was first proved by the coding theorist Jessie MacWilliams, yet another alumna of Bell Labs. To my mind, the best motivation for its proof comes from certain related identities in binomial coefficients and the generating function techniques used to prove them, which, in turn, are best understood in terms of the characters of cyclic groups.

Binomial coefficients

The Binomial Theorem states that (x+y)^n = \sum_{k=0}^n \binom{n}{k} x^k y^{n-k}. It has a one-line combinatorial proof: expand (x+y)(x+y) \cdots (x+y), by choosing x from k of the n brackets and y from the other n-k brackets; there are \binom{n}{k} ways to do this, so \binom{n}{k} is the coefficient of x^k y^{n-k}.

Now consider the sum s = \binom{n}{0} + \binom{n}{2} + \binom{n}{4} + \cdots of binomial coefficients with even `denominator’. This can be evaluated using the binomial theorem, using the two sums below to cancel out the unwanted binomial coefficients with odd denominator:

\displaystyle s = \frac{1}{2} \sum_{k=0}^n \binom{n}{k} + \frac{1}{2} \sum_{k=0}^n (-1)^k \binom{n}{k}.

For n \in \mathbb{N} the sums are (1+1)^n = 2^n and (-1+1)^n = 0, respectively. Therefore s = 2^{n-1}. More generally, this trick shows that

\displaystyle \sum_{j \in \mathbb{N}_0} \binom{n}{2j} x^{2j} y^{n-2j} = \frac{1}{2} (x+y)^n + \frac{1}{2}(-x+y)^n

which already has some of the appearance of the MacWilliams Identity.
How about s' = \binom{n}{0} + \binom{n}{4} + \binom{n}{8} + \cdots ? Since powers of -1 worked for the two-step identity, it is reasonable to try powers of \mathrm{i}. This gives

\begin{aligned} s' = \frac{1}{4} \sum_{k=0}^n \binom{n}{k} + \frac{1}{4} &\sum_{k=0}^n \binom{n}{k} \mathrm{i}^k \\& + \frac{1}{4} \sum_{k=0}^n \binom{n}{k} (-1)^k + \frac{1}{4} \sum_{k=0}^n \binom{n}{k} (\mathrm{-i})^k.\end{aligned}

By the Binomial Theorem, the sums are 2^n, (1+i)^n, 0 and (1-i)^n. For example, if n = 4m where m \in \mathbb{N} then (1+i)^{4m} = (1-i)^{4m} = 2^{2m} (-1)^m, so we have s' = 2^{4m-2} + 2^{2m-1} (-1)^m. Similar formulae can be obtained for the other cases for n modulo 4.

Character theory

Consider the cyclic group \mathbb{Z} / 4\mathbb{Z}. It has four irreducible complex characters, taking values (1,1,1,1), (1,\mathrm{i},-1,\mathrm{-i}), (1,-1,1,-1) and (1, \mathrm{-i},-1,\mathrm{i}). The previous displayed equation comes from expressing the indicator function (1,0,0,0) as a linear combination of irreducible characters

\begin{aligned} (1,0,0,0) = \frac{1}{4}  (1,1,1,1) + \frac{1}{4} &(1,i,-1,-i) \\ &+ \frac{1}{4} (1,-1,1,-1) + \frac{1}{4}(1,-i,-1,i).\end{aligned}

As an exercise, the reader might use the characters of \mathbb{Z} / 3\mathbb{Z} to decompose the indicator function (1,0,0) of 0 \in \mathbb{Z}/3\mathbb{Z} and so prove the related identity \binom{n}{0} + \binom{n}{3} + \binom{n}{6} + \cdots = \frac{2^{n}}{3} + \frac{2}{3} \cos \frac{n\pi}{3} for n \in \mathbb{N}.
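For the exercise, the relevant decomposition, with \omega = e^{2\pi \mathrm{i}/3}, is

```latex
(1,0,0) = \tfrac{1}{3}(1,1,1) + \tfrac{1}{3}(1,\omega,\omega^2) + \tfrac{1}{3}(1,\omega^2,\omega),
```

so the sum is \tfrac{1}{3}\bigl(2^n + (1+\omega)^n + (1+\omega^2)^n\bigr); since 1+\omega = e^{\mathrm{i}\pi/3} and 1+\omega^2 = e^{-\mathrm{i}\pi/3}, the last two terms contribute 2\cos \frac{n\pi}{3}.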

The MacWilliams Identity for the repetition code

Let C = \langle 1\ldots 1 \rangle. In this case the left-hand side of the MacWilliams Identity is the sum of all x^{\mathrm{wt}(u)} y^{n - \mathrm{wt}(u)} over u \in V of even weight. By analogy with the binomial sum, we write

W_{C^\perp}(x,y) = \frac{1}{2} \sum_{v \in V} x^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)} + \frac{1}{2} \sum_{v \in V} (-x)^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)}.

The first summand is the generating function enumerating all v \in V by their weight. Since there are \binom{n}{k} elements of V of weight k, it is \sum_{k =0}^n \binom{n}{k} x^k y^{n-k}, which is (x+y)^n by the Binomial Theorem. Replacing x with -x we see that the second summand is (-x+y)^n. Therefore

W_{C^\perp}(x,y) = \frac{1}{2} (x+y)^n + \frac{1}{2} (-x+y)^n.

Since W_C(x,y) = x^n + y^n, this agrees with the MacWilliams Identity. Note that the second sum could be written as

\displaystyle \sum_{v \in V} (-1)^{1\ldots 1 \cdot v} x^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)}.

MacWilliams Identity for a larger code

Take n=4 and C = \langle 1100, 0011 \rangle. Then

W_C(x,y) = x^4 + 2x^2y^2 + y^4.

We can describe C^\perp as the intersection of the two codimension 1 hyperplanes U, U', with ‘normals’ 1100 and 0011. Thus

\displaystyle W_{C^\perp}(x,y) = \sum_{v} [v \in U \cap U'] x^{\mathrm{wt}(v)}y^{n-\mathrm{wt}(v)}.

For each b \in V, we define \chi_b(v) = (-1)^{b \cdot v}; these are the irreducible complex characters of V. Using them to decompose the indicator function [v \in U \cap U'] : V \rightarrow \mathbb{C} we get

[v \in U \cap U'] = \frac{1}{4} \chi_{0}(v) + \frac{1}{4} \chi_{1100}(v) + \frac{1}{4} \chi_{0011}(v) + \frac{1}{4} \chi_{1111}(v).

It now suffices to find \sum_{v \in V} \chi_{b}(v) x^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)}. If \mathrm{wt}(b) = t then, by symmetry, we may assume that b = 1\ldots 10\ldots 0, where there are exactly t ones. Now (-1)^{b \cdot v} records the parity of the number of ones in the first t positions of v, so writing v \in V as the concatenated vector (u | w) where u \in \mathbb{F}_2^t and w \in \mathbb{F}_2^{n-t}, we have

\begin{aligned} \sum_{v \in V} &\chi_{b}(v) x^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)}\\ &= \sum_{u \in \mathbb{F}_2^t} (-1)^{\mathrm{wt}(u)} x^{\mathrm{wt}(u)} y^{t-\mathrm{wt}(u)} \sum_{w \in \mathbb{F}_2^{n-t}} x^{\mathrm{wt}(w)} y^{n-t-\mathrm{wt}(w)} \\ &= (-x+y)^t (x+y)^{n-t}. \end{aligned}


Therefore

\begin{aligned} W_{C^\perp}(x,y) &= \frac{1}{4} (x+y)^4 + \frac{2}{4} (-x+y)^2 (x+y)^2 + \frac{1}{4} (-x+y)^4 \\ &= \frac{1}{4} W_C(-x+y,x+y) \end{aligned}

as required. (In this case, since C is self-dual, we even have W_{C^\perp}(x,y) = W_C(x,y).)
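This example can also be checked by brute force. The following Haskell sketch (the names spanOf, dual and wEnum are mine, not from any library) computes the span of a generator list, the dual code, and the weight enumerator evaluated at an integer point:

```haskell
import Data.List (subsequences)

-- vectors over F_2 represented as 0/1 lists
type Vec = [Int]

dot :: Vec -> Vec -> Int
dot u v = sum (zipWith (*) u v) `mod` 2

wt :: Vec -> Int
wt = sum

-- all 2^n vectors of F_2^n
space :: Int -> [Vec]
space 0 = [[]]
space n = [b : v | b <- [0, 1], v <- space (n - 1)]

-- the span of a non-empty list of generators: all subset sums
spanOf :: [Vec] -> [Vec]
spanOf gens = [foldr (zipWith plus) zero s | s <- subsequences gens]
  where plus a b = (a + b) `mod` 2
        zero     = map (const 0) (head gens)

-- the dual code: all vectors orthogonal to every codeword
dual :: [Vec] -> Int -> [Vec]
dual code n = [v | v <- space n, all (\c -> dot c v == 0) code]

-- the weight enumerator W_C evaluated at the point (x, y)
wEnum :: [Vec] -> Int -> (Int, Int) -> Int
wEnum code n (x, y) = sum [x ^ wt v * y ^ (n - wt v) | v <- code]
```

With c = spanOf [[1,1,0,0],[0,0,1,1]], the list dual c 4 contains the same four codewords as c, confirming self-duality, and 4 * wEnum (dual c 4) 4 (2, 3) == wEnum c 4 (1, 5) instantiates the MacWilliams Identity at the point (x, y) = (2, 3).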

MacWilliams Identity in general

The general proof needs no more ideas than those seen in the previous example. The indicator function decomposition is

\displaystyle [v \in C^\perp] = \frac{1}{|C|} \sum_{b \in C} \chi_b(v),

which holds because, for fixed v, the map b \mapsto \chi_b(v) is a character of the group C, trivial if and only if v \in C^\perp; the sum is therefore |C| if v \in C^\perp and 0 otherwise. Hence

\begin{aligned} \sum_{v \in C^\perp} x^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)} &= \sum_{v \in V} \frac{1}{|C|} \sum_{b \in C} \chi_b(v) x^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)} \\ &= \frac{1}{|C|} \sum_{b \in C} \sum_{v \in V} \chi_b(v) x^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)} \\ &= \frac{1}{|C|} \sum_{t=0}^n \sum_{b \in C : \mathrm{wt}(b) = t} (-x+y)^t (x+y)^{n-t} \\ &= \frac{1}{|C|} W_C(-x+y,x+y) \end{aligned}

as required.

Runups and cycles

March 3, 2018

Let xs be a stream of values that can be compared for equality. Let ++ denote concatenation of lists. Assuming that xs is of the form runUp ++ cyc ++ cyc ++ ..., what is a good way to find the decomposition (runUp, cyc)?

The problem is more subtle than it might seem. If xs is produced by iterating a function, so the entries are [x, f x, f f x, ...] then there are several cycle finding algorithms that can be used. In general, an algorithmic solution is impossible, because the stream may be of the form cyc ++ rest, where cyc has arbitrary length n+1 and agrees with rest for n positions. In practice, we might know that any cycle has some bounded length, or be prepared to accept a small chance of misidentification.
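For streams of the iterated form [x, f x, f (f x), ...] mentioned above, one such algorithm is Floyd's tortoise-and-hare. Here is a minimal sketch (floyd is my name for it, not from the post), returning the run-up length and the cycle length:

```haskell
-- Floyd's tortoise-and-hare for the stream x0, f x0, f (f x0), ...
-- Returns (mu, lambda): the run-up length and the cycle length.
-- Assumes the orbit of x0 under f is eventually periodic.
floyd :: Eq a => (a -> a) -> a -> (Int, Int)
floyd f x0 = (mu, lam)
  where
    -- phase 1: advance the hare twice as fast until the two meet
    meet t h | t == h    = t
             | otherwise = meet (f t) (f (f h))
    m = meet (f x0) (f (f x0))
    -- phase 2: restart the tortoise; the next meeting is at position mu
    findMu t h n | t == h    = n
                 | otherwise = findMu (f t) (f h) (n + 1)
    mu = findMu x0 m 0
    -- phase 3: walk once around the cycle to find lambda
    start = iterate f x0 !! mu
    findLam h n | h == start = n
                | otherwise  = findLam (f h) (n + 1)
    lam = findLam (f start) 1
```

For instance floyd (\n -> if n < 10 then n + 1 else 5) 0 evaluates to (5, 6): a run-up 0,1,2,3,4 followed by the 6-cycle 5,6,7,8,9,10.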

In this case, the following Haskell code finds the first decomposition (runUp, guessedCycle) such that the stream is runUp ++ rest and rest, which has length at least t, is of the form guessedCycle ++ guessedCycle ++ rest'.

import Data.List (tails)

runupAndCycle t xs = 
         runupAndCycleAux t xs (zip [1..] (tail xs))

runupAndCycleAux t xs [] = error $ show (t, xs)
runupAndCycleAux t xs qxs@((q, x) : _)
    = case lookup (map snd $ take t qxs) (take q tailsPositions) of
        Nothing -> runupAndCycleAux t xs (tail qxs)
        Just p  -> -- the t entries starting in positions q and p agree;
                   -- test if taking guessedCycle to be all entries
                   -- between positions q and p-1 gives a decomposition 
                   -- of the required form
                   let xsFirst  = take (q-p) $ drop p xs
                       xsSecond = take (q-p) $ map snd qxs
                   in  if and (zipWith (==) xsFirst xsSecond) 
                          then (take p xs, xsSecond) 
                          else runupAndCycleAux t xs (tail qxs)
    where tailsPositions 
              = [(take t ys, p) | (ys, p) <- zip (tails xs) [0..]]

For example

  • runupAndCycle 4 [1,2,3,1,2,3,4,5,1,2,3,4,5] evaluates to ([1,2,3],[1,2,3,4,5]);
  • runupAndCycle 4 [1,2,3,1,2,3,4,6,1,2,3,4,5] raises an exception, because while there is agreement for four positions, the if ... then ... else ... test shows this does not correspond to a cycle (and the stream is then exhausted without finding anything better);
  • runupAndCycle 4 [1,2,3,1,2,3,4,5,1,2,3,4,6] raises an exception for the same reason;
  • runupAndCycle 4 [1,2,3,1,2,3,4,5,1,2,3,4,5,6] evaluates to ([1,2,3],[1,2,3,4,5]), as in the first example, wrongly suggesting a 5-cycle. Replacing the first argument 4 with 6 we get an exception instead.

The algorithm may use O(tn^2) work to exhaust a stream of n entries without finding a suitable decomposition. For example, with t = 3 on the stream [0,0,0,1,0,0,0,2,0,0,0,3,0,0,0,4, ...], a possible cycle start is found at every fourth position, and by position 4m each takes about 4m steps to investigate and reject.
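Concretely, this adversarial stream can be generated lazily as follows (badStream is my name for it):

```haskell
-- with t = 3, every fourth position begins with [0,0,0] and so looks
-- like a possible cycle start, but no decomposition is ever found
badStream :: [Int]
badStream = concat [[0, 0, 0, k] | k <- [1 ..]]
```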

I’d like to know if there is a faster or more elegant algorithm—ideally both of course—but some searching only found hits for the deterministic version of the problem.