Meet-in-the-middle attacks on block ciphers

November 25, 2018

A mathematical model for a block cipher is a collection \mathcal{E} of invertible functions e_k : \mathbb{F}_2^n \rightarrow \mathbb{F}_2^n indexed by keys k \in \mathbb{F}_2^\ell. We say that the block cipher \mathcal{E} has block size n and key length \ell. For example, DES has n=64 and \ell = 56, while its successor AES has n=128 and \ell = 128 (or 192 or 256 for the super-paranoid).

Let \mathcal{E} and \mathcal{E}' be block ciphers of block size n and key lengths \ell and \ell', respectively. An easy way to boost the key length to \ell + \ell', while keeping essentially the same cryptologics, is to compose the encryption functions from the two ciphers. Thus for (k,k') \in \mathbb{F}_2^\ell \times \mathbb{F}_2^{\ell'} we define

e_{(k,k')}(x) = e'_{k'}\bigl(e_k(x)\bigr).

Perhaps surprisingly, when \ell, \ell' \le n, this cryptoscheme offers only \mathrm{max}(\ell,\ell')+1 bits of security, rather than the expected \ell + \ell'. The purpose of this post is to give a fairly precise costing of the meet-in-the-middle attack behind this.

As a notational convention, \star denotes quantities unknown to the cryptanalyst and \dagger quantities chosen or observed by the cryptanalyst. We cost by the number of encryption/decryption operations, or cryptops, justifying this assumption in the course of the argument.

Let (k_\star,k'_\star) be the (unknown) key. Let x_\dagger be a chosen plaintext and let z_\dagger = e_{(k_\star,k'_\star)}(x_\dagger) be its observed encryption. (We write e_{(k_\star,k'_\star)}(x_\dagger) rather than e'_{k'_\star}\bigl(e_{k_\star}(x_\dagger)\bigr) to emphasise that in the chosen plaintext model, we can perform two-stage encryption using the unknown key pair (k_\star,k'_\star), but not each stage separately.) We denote by d'_{k'} the inverse of the encryption function e'_{k'}. Let

\begin{aligned} E &= \{ \bigl(k, e_k(x_\dagger)\bigr) : k \in \mathbb{F}_2^\ell \} \\ D & = \{ \bigl(k', d'_{k'}(z_\dagger) \bigr) : k' \in \mathbb{F}_2^{\ell'} \} \end{aligned}

Thus |E| = 2^\ell, |D| = 2^{\ell'} and 2^\ell + 2^{\ell'} cryptops are required to get this far.

Using any reasonable algorithm, sorting E by the second entry in each pair takes at most 2^\ell \ell s primitive operations for some smallish s. Similarly D is sorted using 2^{\ell'} \ell' s primitive operations. Working through the sorted lists we pair up the (k,y) \in E with the (k',y) \in D, for each y \in \mathbb{F}_2^n. In one possible implementation this makes a list of pointers to the relevant members of D and E for each y \in \mathbb{F}_2^n, so the cost in primitive operations is

(|E|+|D|)b = (2^\ell + 2^{\ell'}) b

for some small b. Since each cryptop must produce an n-bit block, and \ell, \ell' \le n, it seems reasonable to assume that the cost of 2^\ell + 2^{\ell'} cryptops is significantly more than the cost of this sorting and pairing up.

We now have the sets

\begin{aligned} \mathcal{K}_y &= \{ k \in \mathbb{F}_2^\ell : e_{k}(x_\dagger) = y \} \\ \mathcal{K}'_y &= \{ k' \in \mathbb{F}_2^{\ell'} : d'_{k'}(z_\dagger) = y \} \end{aligned}

for each y \in \mathbb{F}_2^n. Observe that \mathcal{K}_y \times \mathcal{K}'_y is those key pairs (for the composed cryptosystem) that meet-in-the-middle at y. Fix p distinct plaintexts

X_\dagger^{(1)}, \ldots, X_\dagger^{(p)} \in \mathbb{F}_2^n \backslash \{x_\dagger\}.

For each y \in \mathbb{F}_2^n we encrypt X_\dagger^{(1)}, \ldots, X_\dagger^{(p)} using the guessed key pair (k,k') \in \mathcal{K}_y \times \mathcal{K}'_{y}, and check if

e'_{k'}\bigl(e_{k}(X_\dagger^{(i)})\bigr) = (e'_{k'_\star} \circ e_{k_\star})(X^{(i)}_\dagger)

for each i. If all p encryptions agree, we assume that (k_\star,k'_\star) =  (k,k'). (But for simplicity, let’s say we carry on checking all the other keys.) The number of pairs to be checked in the final step is

M = \sum_{y \in \mathbb{F}_2^n} |\mathcal{K}_y| |\mathcal{K}'_y|

and each pair requires two encryptions for each of the p plaintexts, making the cost of the final step 2Mp cryptops. The total cost in cryptops is therefore 2^\ell + 2^{\ell'} + 2Mp.
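The attack can be demonstrated end to end with a toy example. The sketch below (illustrative code, not the author's) takes both ciphers to be uniformly random keyed permutations of n-bit blocks, with tiny hypothetical parameters so that everything runs instantly; the two phases match the costing above.

```python
# Toy meet-in-the-middle attack on a composed cipher (illustrative only).
import random
from collections import defaultdict

random.seed(1)
n, l = 8, 6                      # hypothetical toy block size and key length
N, L = 1 << n, 1 << l

def random_cipher():
    # One uniformly random permutation of {0,...,N-1} per key.
    perms = []
    for _ in range(L):
        perm = list(range(N))
        random.shuffle(perm)
        perms.append(perm)
    return perms

e1, e2 = random_cipher(), random_cipher()
d2 = [[0] * N for _ in range(L)]     # d2[k'] inverts e2[k']
for k in range(L):
    for x in range(N):
        d2[k][e2[k][x]] = x

def double_encrypt(key_pair, x):
    k, k2 = key_pair
    return e2[k2][e1[k][x]]

secret = (random.randrange(L), random.randrange(L))
x0 = 0                               # the chosen plaintext x_dagger
z0 = double_encrypt(secret, x0)      # its observed encryption z_dagger

# Phase 1: build the tables E and D and match on the middle value y.
middles = defaultdict(list)
for k in range(L):
    middles[e1[k][x0]].append(k)
candidates = [(k, k2) for k2 in range(L) for k in middles[d2[k2][z0]]]

# Phase 2: filter the candidates with p = 2 further chosen plaintexts.
survivors = [kp for kp in candidates
             if all(double_encrypt(kp, x) == double_encrypt(secret, x)
                    for x in (1, 2))]
assert secret in survivors
```

With these parameters the candidate list has expected size about 2^{2l-n} + 1 = 17, and the two checking plaintexts almost always leave only the true key.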

We now estimate 2^\ell + 2^{\ell'} + 2Mp for the random cipher, and use this as an approximation to DES and to AES.

Random cipher

Suppose that each e_k : \mathbb{F}_2^n \rightarrow \mathbb{F}_2^n and e'_{k'} : \mathbb{F}_2^n \rightarrow \mathbb{F}_2^n is chosen uniformly and independently at random from the (2^n)! permutations of \mathbb{F}_2^n. Let y_\dagger = e_{k_\star}(x_\dagger) be the (unknown) middle value, so also d'_{k'_\star}(z_\dagger) = y_\dagger. For each other key k, the probability is \frac{1}{2^n} that e_k(x_\dagger) = y_\dagger, and dually, for each other key k', the probability is \frac{1}{2^n} that d'_{k'}(z_\dagger) = y_\dagger. Therefore |\mathcal{K}_{y_\dagger}| and |\mathcal{K}'_{y_\dagger}| are independently distributed as the shifted binomial distributions

1 + \mathrm{Bin}(2^\ell-1, \tfrac{1}{2^n}) \quad\text{and}\quad 1 + \mathrm{Bin}(2^{\ell'}-1, \tfrac{1}{2^n}).

Similarly, if y \not= y_\dagger then |\mathcal{K}_y| and |\mathcal{K}'_y| are independently distributed as \mathrm{Bin}(2^\ell-1, \frac{1}{2^n}) and \mathrm{Bin}(2^{\ell'}-1, \frac{1}{2^n}), respectively. (It is worth noting that \mathcal{K}_{y} and \mathcal{K}_{y'} are not independent: for example we have \bigcup_{y \in \mathbb{F}_2^n} \mathcal{K}_y = \mathbb{F}_2^\ell where the union is disjoint, therefore \sum_{y \in \mathbb{F}_2^n} |\mathcal{K}_y|  = 2^\ell.) It follows by linearity of expectation that

\begin{aligned} \mathbb{E}[M] &= \mathbb{E}\bigl[\sum_{y \in \mathbb{F}_2^n} |\mathcal{K}_y| |\mathcal{K}'_y|\bigr] \\ &= \bigl( 1 + \frac{2^\ell-1}{2^n} \bigr)\bigl( 1 + \frac{2^{\ell'}-1}{2^n} \bigr) + (2^n-1) \bigl( \frac{2^\ell-1}{2^n} \bigr)\bigl( \frac{2^{\ell'}-1}{2^n} \bigr)  \\ &= \frac{2^{\ell+\ell'}}{2^n} + 1 -  \frac{1}{2^{n}}. \end{aligned}

This is very well approximated by \frac{2^{\ell + \ell'}}{2^n} + 1. Hence the total cost is estimated as

2^{\ell} + 2^{\ell'} + 2p \bigl( \frac{2^{\ell + \ell'}}{2^n} + 1\bigr).

It remains to choose a suitable value for p. By our randomness assumption, e'_{k'_\star} \circ e_{k_\star} and e'_{k'} \circ e_k are each permutations f of \mathbb{F}_2^n chosen uniformly at random subject to the constraint f(x_\dagger) = z_\dagger. Thus for each chosen plaintext X_\dagger \not= x_\dagger, the ciphertexts (e'_{k'_\star} \circ e_{k_\star})(X_\dagger) and (e'_{k'} \circ e_k)(X_\dagger) are uniformly distributed on \mathbb{F}_2^n \backslash \{z_\dagger\}. The probability they agree is therefore 1/(2^n-1). More generally, the probability that e'_{k'}\bigl(e_{k}(X_\dagger^{(i)})\bigr) = (e'_{k'_\star} \circ e_{k_\star})(X^{(i)}_\dagger) for all of the p chosen X_\dagger^{(i)} is

\displaystyle \frac{1}{2^n-1} \times \ldots \times \frac{1}{2^n-p} \approx \frac{1}{2^{np}}.

We have to check M key pairs, so the expected number of ‘false’ keys is very nearly \mathbb{E}\bigl[\frac{M}{2^{np}}\bigr]; by the good approximation for M above, this is very nearly

\displaystyle \frac{2^{\ell+\ell'}}{2^{(1+p)n}} + \frac{1}{2^{np}}.

Suppose for simplicity that \ell = \ell'. Setting n = \nu \ell, our estimate for the total work is then

2^{\ell}+2^{\ell'} + 2p\bigl(  2^{(2-\nu)\ell} + 1 \bigr)

cryptops, where p should be chosen so that \nu(1+p) is large compared to 2. In particular, if \nu is very large then, as one would expect, one can take p=0: no checking cryptops are needed.

DES as a random cipher

For DES, \ell = 56, n = 64, so \nu = \frac{8}{7}, (2-\nu)\ell = \frac{6}{7} \times 56 = 48 and we need \frac{8}{7}(1+p) large compared to 2. Taking p = 2, the total work is 2^{57} + 2^{50}. Clearly for any reasonable choice of p, the cost of the attack is dominated by the first phase, and is about 2^{57} cryptops. Even in 2009, using a special purpose device, this took at most 2 days.

AES as a random cipher

For AES with the shortest key length, \ell = n = 128, so \nu = 1 and we need 1+p large compared to 2. Again taking p=2, the total work is 2^{129} + 2^{130}, roughly balanced between the first and second phases. For AES with the longest key length, \ell = 256 and n=128, so \nu = \frac{1}{2} and we need (1+p)/2 large compared to 2. Taking p=5, the total work is 2^{257} + 10 \times 2^{384}. The second phase dominates. Of course neither attack is practical.
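The arithmetic behind these three estimates can be checked with exact integers. The snippet below (a sanity check, not from the post) evaluates the total-work formula 2^\ell + 2^{\ell'} + 2p(2^{(2-\nu)\ell}+1) under the assumption \ell = \ell', so that (2-\nu)\ell = 2\ell - n.

```python
# Exact evaluation of the estimated total work for the three parameter sets.
def total_work(l, n, p):
    # Assumes l = l', so nu = n / l and (2 - nu) * l = 2*l - n.
    return 2 ** l + 2 ** l + 2 * p * (2 ** (2 * l - n) + 1)

des = total_work(56, 64, 2)        # 2^57 + 2^50 + 4
aes128 = total_work(128, 128, 2)   # 2^130 + 2^129 + 4
aes256 = total_work(256, 128, 5)   # 2^257 + 10 * 2^384 + 10

# floor(log2(total work)) for each attack, via bit_length
print(des.bit_length() - 1, aes128.bit_length() - 1, aes256.bit_length() - 1)
# → 57 130 387
```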


An algebraic view of secret sharing

October 17, 2018

As last year I’m lecturing our 3rd year / 4th year / M.Sc. course on cipher systems. As part of the M.Sc. course I cover the Shamir Secret Sharing Scheme. Here is Shamir’s idea generalized to arbitrary commutative rings, with an application to secret sharing in a hierarchical organization. The idea is very obvious, so no originality is expected or claimed.

For a practical example, imagine that I have a critically important 20GB file. Using a secret-sharing scheme with threshold 3, I issue shares of the file (each most probably still of size 20GB) to Amazon, Degoo (who promise a ‘top secret cloud drive’), Dropbox and Microsoft. If any single provider goes out of business, or maliciously corrupts its share, the file can still be reconstructed from the remaining three shares. Depending on how much I trust them (and the computational resources available on the various platforms), the reconstruction could be done in the cloud, or on my own machine. In the latter case, a provider can learn the secret only if three of them collude by pooling their shares: this would presumably be a serious breach of all their terms of service.

Mathematical formulation

Fix n \in \mathbb{N} and let 1 \le t \le n. Fix a commutative ring R and distinct ideals I, J_1, \ldots, J_n of R. The secret is an element of R/I. We suppose that R has a subset S such that

  1. if A \subseteq \{1, \ldots, n\} is a t-set then the composition S \hookrightarrow R \rightarrow \prod_{i \in A} R/J_i is injective;
  2. if A \subseteq \{1, \ldots, n\} is a (t-1)-set then the composition S \hookrightarrow R \rightarrow R/I \times \prod_{i \in A} R/J_i is surjective.

In particular (2) implies that S \hookrightarrow R \twoheadrightarrow R/I is surjective.

To share a secret s + I \in R/I, pick r \in S such that r \equiv s \bmod I; the share for algebraist i is then r + J_i \in R/J_i. By hypothesis (1), for each A \subseteq \{1,\ldots, n\} such that |A| = t, the unique element of S congruent to r modulo J_i for each i \in A is r itself. Therefore any t (or more) algebraists can pool their shares and determine r, and hence the secret s + I = r + I.

Shamir’s Scheme

Fix a prime p and distinct non-zero c_1, \ldots, c_n \in \mathbb{F}_p. Take R = \mathbb{F}_p[X], I = \langle X \rangle, J_i = \langle X - c_i \rangle and S = \{ r \in R : \mathrm{deg}\ r < t \}. (Thus the secret is r(0), and the share for algebraist i is r(c_i).) Property (1) holds because a polynomial of degree \le t-1 is uniquely determined by its values at the t distinct points c_i for i \in A. Moreover, since the map

\displaystyle S \rightarrow R/I \times \prod_{i \in A} R/J_i \cong \mathbb{F}_p \times \mathbb{F}_p^{t-1}

is a linear isomorphism whenever |A| = t-1, we have (2) in the strongest possible form.
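Shamir's scheme is short enough to implement in a few lines. Here is a minimal sketch over \mathbb{F}_p (illustrative code with hypothetical parameters, not the author's): shares are values r(c_i) of a random polynomial of degree less than t, and reconstruction is Lagrange interpolation at 0.

```python
# Minimal Shamir secret sharing over F_p (illustrative sketch).
import random

def make_shares(secret, t, cs, p):
    # r is a random polynomial of degree < t with constant term the secret.
    coeffs = [secret] + [random.randrange(p) for _ in range(t - 1)]
    def r(x):
        return sum(a * x ** j for j, a in enumerate(coeffs)) % p
    return [r(c) for c in cs]

def reconstruct(points, p):
    # Lagrange interpolation at 0 from t pairs (c_i, share_i).
    secret = 0
    for i, (ci, si) in enumerate(points):
        term = si
        for j, (cj, _) in enumerate(points):
            if j != i:
                term = term * (-cj) * pow(ci - cj, -1, p) % p
        secret = (secret + term) % p
    return secret

p, t = 101, 3                    # hypothetical small prime and threshold
cs = [1, 2, 3, 4, 5]             # distinct non-zero evaluation points c_i
shares = make_shares(42, t, cs, p)
print(reconstruct(list(zip(cs, shares))[:t], p))   # → 42
```

Any t of the five shares recover the secret; fewer than t reveal nothing, by the linear isomorphism above. (The three-argument `pow` with exponent -1 computes a modular inverse, available from Python 3.8.)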

Integer version

For another special case, fix a prime p, a parameter c \ge 2 (of which more later) and consecutive primes q_1 < \ldots < q_n with cp \le q_1. Let Q = q_1 \ldots q_t. Take R = \mathbb{Z}, I = p\mathbb{Z}, J_i = q_i \mathbb{Z} and

S = \{0,1,\ldots, Q-1\}.

Since each r \in S is uniquely determined by its residues modulo any t of the q_i, we have (1). Suppose the t-1 algebraists numbered a_1, \ldots, a_{t-1} pool their shares. They can learn r modulo Q' = q_{a_1} \ldots q_{a_{t-1}}. Now S contains \lfloor Q/Q' \rfloor ‘candidate secrets’ of the form r+ b Q' where b \in \mathbb{Z}. Provided c is sufficiently large,

\displaystyle \frac{Q}{Q'} = \frac{q_1 \ldots q_t}{q_{a_1} \ldots q_{a_{t-1}}} \ge cp\frac{q_2 \ldots q_t}{q_{n-t+2} \ldots q_n} \ge p.

Therefore the ‘candidate secrets’ cover all residue classes modulo p, and so (2) holds. Moreover, the sizes of the fibres over each (r^\star + I, r + J_{a_1}, \ldots, r+J_{a_{t-1}}) vary by at most 1 with r^\star, and have mean size Q / (p\, q_{a_1} \ldots q_{a_{t-1}}). Provided c is large, all the primes q_i have about the same size, and the mean fibre size is approximately c.


Provided c is large, this variation in the fibre sizes leaks very little information. For small c it is a problem. For example, take n=4, t=3, c = 2, p = 11 and q_1 = 23, q_2 = 29, q_3 = 31, q_4 = 37. Thus Q = q_1q_2q_3 = 20677. For the secret s we choose 7, lifting it to

r = 11161 \in \{0,\ldots, Q-1\}.

The shares are then 6, 25, 1, 24. If algebraists 3 and 4 cooperate they can learn r modulo 31 \times 37. This leaves

\lfloor \frac{Q}{31 \times 37} \rfloor  = 18

possibilities for r. The mean fibre size is 18 / 11 and since 18 = 2 \times 7 + 1 \times 4, there are 7 fibres of size 2 and 4 of size 1 (namely those for 1 + 11\mathbb{Z}, 4 + 11\mathbb{Z}, 7 + 11\mathbb{Z} and 10 + 11\mathbb{Z}). Choosing one of these at random gives them a 1/4 chance of guessing the secret.
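The worked example can be verified directly. The quick check below (illustrative code; every number in it is taken from the example above) recomputes the shares, the candidate secrets, and the fibre sizes.

```python
# Verifying the worked example of the integer (CRT) scheme.
from collections import Counter

p, qs = 11, [23, 29, 31, 37]
Q = 23 * 29 * 31                       # = 20677
r = 11161
assert r % p == 7                      # the secret
shares = [r % q for q in qs]
assert shares == [6, 25, 1, 24]

# Algebraists 3 and 4 learn r modulo 31 * 37 = 1147.
Qp = 31 * 37
known = r % Qp
candidates = [x for x in range(Q) if x % Qp == known]
assert len(candidates) == 18           # the 18 candidate secrets

# Fibre sizes over the residue classes modulo 11.
fibres = Counter(x % p for x in candidates)
assert sorted(fibres.values()) == [1] * 4 + [2] * 7
assert sorted(c for c in fibres if fibres[c] == 1) == [1, 4, 7, 10]
```

As claimed, the secret 7 sits in one of the four fibres of size 1.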

Choosing a small fibre is a good heuristic, but not infallible: it is possible that the secret corresponds to a large fibre—for example if algebraists 1 and 2 pool their shares then the fibres for 2 + 11 \mathbb{Z} and 9 + 11\mathbb{Z} have size 2, and the rest have size 3.


What other rings are amenable to the generalized Shamir scheme?

Secret sharing in a hierarchy

One possible answer to this question is ‘multivariable polynomial rings’. Here is an idea along these lines.

Let R = \mathbb{F}_p[X,Y]. Fix m, n \in \mathbb{N} and thresholds t, u with t \le m and u \le n. Let S be the set of polynomials f \in R with \mathrm{deg}_X f < t and \mathrm{deg}_Y f < u. Fix distinct non-zero c_1, \ldots, c_m \in \mathbb{F}_p and distinct non-zero d_1, \ldots, d_n \in \mathbb{F}_p. The aim is to share a secret s \in \mathbb{F}_p between teams T_1, \ldots, T_n where team T_j consists of people P_{1j}, \ldots, P_{mj}. For this, choose f \in S such that f(0,0) = s. For each j \in \{1,\ldots, n\}, let f_j(X) = f(X, d_j). For each j \in \{1,\ldots, n\} and each i \in \{1,\ldots, m\}, let s_{ij} = f_j(c_i) = f(c_i,d_j). Issue s_{ij} as the share to person P_{ij}.

The shares s_{1j}, \ldots, s_{mj} are the shares in a normal Shamir Secret Sharing Scheme for f_j(0). By the linear isomorphism in the first example above, when t people in team T_j cooperate to learn f_j(0), they in fact learn f_j itself. Suppose that u teams learn f_{b_1}, \ldots, f_{b_u}. Then since \mathrm{deg}_Y f < u, they can cooperate to learn f, and hence f(0,0). Given a (u-1)-subset B of \{1,\ldots, n\}, when the teams numbered in B pool their shares they learn nothing useful, since for each h^\star \in \mathbb{F}_p[X] with \mathrm{deg}\ h^\star < t, there is a unique polynomial f^\star with \mathrm{deg}_X f^\star < t and \mathrm{deg}_Y f^\star < u such that f^\star(X, 0) = h^\star(X) and f^\star(X, d_j) = f_j(X) for each j \in B. Explicitly,

\displaystyle f^\star = h^\star(X) \prod_{k \in B} \frac{Y-d_k}{-d_k} + \sum_{j \in B} \frac{Y}{d_j} \prod_{k \in B \atop k \not=j}\frac{Y-d_k}{d_j-d_k}\, f_j(X).

This polynomial is consistent with the secret being h^\star(0); since h^\star(0) takes each value in \mathbb{F}_p equally often as h^\star varies over the polynomials in \mathbb{F}_p[X] of degree at most t-1, no useful information is learned.
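The two-level reconstruction can be sketched in a few lines. The code below (illustrative, with hypothetical small parameters, not the author's) interpolates within each team in X to learn f(0, d_j), and then across teams in Y to learn f(0,0).

```python
# Sketch of the hierarchical scheme: within-team then between-team recovery.
import random

p = 101                          # hypothetical small prime
t, u = 2, 2                      # within-team and between-team thresholds
cs, ds = [1, 2, 3], [4, 5]       # distinct non-zero evaluation points
secret = 42

# Random f in F_p[X,Y] with deg_X f < t, deg_Y f < u and f(0,0) = secret.
coeffs = [[random.randrange(p) for _ in range(u)] for _ in range(t)]
coeffs[0][0] = secret

def f(x, y):
    return sum(coeffs[a][b] * x ** a * y ** b
               for a in range(t) for b in range(u)) % p

def interp_at_zero(points):
    # Lagrange interpolation at 0 over F_p from pairs (x_i, value_i).
    total = 0
    for i, (xi, vi) in enumerate(points):
        term = vi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term = term * (-xj) * pow(xi - xj, -1, p) % p
        total = (total + term) % p
    return total

# Any t people in team j recover f(0, d_j) from their shares f(c_i, d_j);
# then any u teams recover f(0, 0) by interpolating in Y.
team_values = [interp_at_zero([(c, f(c, d)) for c in cs[:t]]) for d in ds]
print(interp_at_zero(list(zip(ds, team_values))))   # → 42
```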

It should be admitted that one can arrive at essentially the same situation, more simply, by sharing the secret between n teams, and then sharing each 'team-share' between the m people in each team, each time using the normal Shamir Scheme.

Permutation modules and the simple group of order 168

June 26, 2018

In an earlier post I decomposed the permutation module for \mathrm{GL}_3(\mathbb{F}_2) acting on the 7 non-zero lines in \mathbb{F}_2^3, with coefficients in a field of characteristic 2. Some related ideas give the shortest proof I know of the exceptional isomorphism between \mathrm{PSL}_2(\mathbb{F}_7) and \mathrm{GL}_3(\mathbb{F}_2). The motivation section below explains why what we do ‘must work’, but is not logically necessary.


Let h \in \mathrm{GL}_3(\mathbb{F}_2) have order 7 and let

H  = \mathrm{N}_{\mathrm{GL}_3(\mathbb{F}_2)}(\langle h\rangle) \cong C_7 \rtimes C_3.

The simple modules for \mathrm{GL}_3(\mathbb{F}_2) in characteristic 2 are the trivial module, the natural module E, its dual E^\star, and the reduction modulo 2 of a module affording the 8-dimensional character induced from either of the two non-trivial linear characters of H. (A short calculation with Mackey’s Formula and Frobenius Reciprocity shows the module is irreducible in characteristic zero; then since 168/8 is odd, it reduces modulo 2 to a simple projective. Alternatively one can work over \mathbb{F}_4 and perform the analogue of the characteristic zero construction directly.) The permutation module for \mathrm{GL}_3(\mathbb{F}_2) acting on the cosets of H, defined over \mathbb{F}_2, is projective. Since it has the trivial module in its socle and top, the only possible Loewy structures are

\begin{matrix} \mathbb{F}_2 \\ E \oplus E^\star \\ \mathbb{F}_2 \end{matrix}, \quad \begin{matrix} \mathbb{F}_2 \\ E \\ E^\star \\ \mathbb{F}_2 \end{matrix}, \quad \begin{matrix} \mathbb{F}_2 \\ E^\star \\ E \\ \mathbb{F}_2 \end{matrix}.

In every case, the permutation module contains a 4-dimensional \mathbb{F}_2\mathrm{GL}_3(\mathbb{F}_2)-submodule U having either E or E^\star as a top composition factor. Below we construct U as a module for \mathrm{PSL}_2(\mathbb{F}_7) and hence obtain the exceptional isomorphism in the form \mathrm{PSL}_2(\mathbb{F}_7) \cong \mathrm{GL}(E).

In fact the final two Loewy structures can be ruled out because they are asymmetric with respect to E and E^\star, and so paired under the outer automorphism of \mathrm{GL}_3(\mathbb{F}_2). Thus if one exists then so does the other, and there would be two different projective covers of the trivial module (impossible).


Let G = \mathrm{PSL}_2(\mathbb{F}_7) regarded as a permutation group acting on the 8 non-zero lines in \mathbb{F}_7^2. For j \in \{0,1,\ldots, 6\}, let j denote the line through (j,1), and let \infty denote the line through (1,0). Let M = \langle e_j : j \in \{0,1,\ldots, 6\} \cup \{\infty\} \rangle_{\mathbb{F}_2} be the corresponding \mathbb{F}_2G-permutation module.

Let h = (0, 1, \dots, 6) \in G, corresponding to the Möbius map x \mapsto x + 1 and to the matrix \left( \begin{matrix} 1 & 1 \\ 0 & 1 \end{matrix} \right) in \mathrm{SL}_2(\mathbb{F}_7). The simple modules for \mathbb{F}_2\langle h \rangle correspond to the factors of

x^7 + 1 = (x+1)(x^3+x+1)(x^3+x^2+1).

Since \{1,2,4\} \le \mathbb{F}_7^\times is permuted by squaring, an idempotent killing the simple modules with minimal polynomial x^3 + x + 1 is h^4 + h^2 + h = (h^3+h+1)h. Therefore the module U we require has codimension 1 in the 8-3 = 5-dimensional module

M(h + h^2 + h^4) = \langle e_\infty \rangle_{\mathbb{F}_2} \oplus \left\langle \begin{matrix} e_0 + e_1 + e_3 \\ e_1 + e_2 + e_4 \\ e_2 + e_3 + e_5 \\ e_3 + e_4 + e_6 \end{matrix} \right\rangle_{\mathbb{F}_2}

Here (e_3+e_4+e_6)h = e_4+e_5 + e_0 is the sum of the first three basis vectors, so in its right action, h is represented by

\left( \begin{matrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{matrix}\right).

By symmetry e_\infty must appear in some vector in the \mathbb{F}_2G-submodule U we require. It therefore seems a plausible guess that the U we require is generated by u_0 where

u_0 = e_\infty + e_0 + e_1 + e_3

and so U has basis u_0, u_1, u_2, u_3 where u_i = u_0 h^i for each i.

Let t = (1,2,4)(3,6,5) \in \mathrm{N}_G(\langle h \rangle) and let

s = (0,\infty)(1,3)(2,5)(4,6) \in G,

chosen so that h^t = h^2 and t^s = t^2. The corresponding Möbius maps are x \mapsto 2x and x \mapsto 3/x and one choice of corresponding matrices in \mathrm{SL}_2(\mathbb{F}_7) is \left( \begin{matrix} 4 & 0 \\ 0 & 2 \end{matrix} \right) and \left(\begin{matrix} 0 & 2 \\ 3 & 0 \end{matrix}\right). Extending the calculation for h above shows that in their (right) action on U,

h \mapsto \left( \begin{matrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{matrix}\right), t \mapsto \left( \begin{matrix} 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 1 \end{matrix}\right), s \mapsto \left( \begin{matrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{matrix}\right).
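These matrices can be checked mechanically. The sketch below (illustrative code, not part of the post) multiplies them modulo 2, verifies the relations h^7 = t^3 = s^2 = 1 and h^t = h^2, and confirms that the three matrices generate a group of order 168 = |\mathrm{PSL}_2(\mathbb{F}_7)|.

```python
# Check the 4x4 matrices over F_2 representing h, t, s on U.
H = ((0,1,0,0), (0,0,1,0), (0,0,0,1), (1,1,1,0))
T = ((1,1,0,1), (0,1,0,0), (0,0,0,1), (0,1,1,1))
S = ((1,0,0,0), (1,1,1,1), (1,0,1,0), (1,0,0,1))
I = ((1,0,0,0), (0,1,0,0), (0,0,1,0), (0,0,0,1))

def mul(A, B):
    return tuple(tuple(sum(A[i][k] * B[k][j] for k in range(4)) % 2
                       for j in range(4)) for i in range(4))

def power(A, m):
    R = I
    for _ in range(m):
        R = mul(R, A)
    return R

assert power(H, 7) == I and power(T, 3) == I and power(S, 2) == I
assert mul(mul(power(T, 2), H), T) == mul(H, H)    # h^t = h^2, using t^-1 = t^2

# Closure of {H, T, S} under multiplication: the generated matrix group.
group, frontier = {I}, [I]
while frontier:
    A = frontier.pop()
    for B in (H, T, S):
        C = mul(A, B)
        if C not in group:
            group.add(C)
            frontier.append(C)
print(len(group))   # → 168
```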

The unique trivial submodule of U is spanned by

u_0 + u_2 + u_3 = e_0 + e_1 + \cdots + e_6 + e_\infty.

Quotienting out, we obtain the 3-dimensional representation of G where

h \mapsto \left( \begin{matrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 1 \end{matrix} \right), \; t \mapsto \left( \begin{matrix} 0 & 1 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{matrix}\right), \; s \mapsto \left( \begin{matrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{matrix} \right)

Since \mathrm{PSL}_2(\mathbb{F}_7) has cyclic Sylow 3-subgroups and cyclic Sylow 7-subgroups, and a unique conjugacy class of elements of order 2 (which contains s), one can see just from the matrices above that the representation is faithful. Comparing orders, we get \mathrm{PSL}_2(\mathbb{F}_7) \cong \mathrm{GL}_3(\mathbb{F}_2).

A weighted Chebyshev’s order inequality

June 20, 2018

Let w_i \in \mathbb{R}^{\ge 0} be non-negative weights and let a_1, \ldots, a_n \in \mathbb{R}^{\ge 0}, b_1, \ldots, b_n \in \mathbb{R}^{\ge 0} be non-negative numbers such that

\begin{aligned} &a_1 \ge a_2 \ge \ldots \ge a_n \\ & b_1 \le b_2 \le \ldots \le b_n. \end{aligned}

Then


\begin{aligned} \bigl(  \sum_{i} w_i \bigr) \bigl(\sum_{j}w_ja_jb_j \bigr)  &= \frac{1}{2} \sum_{ij} w_iw_j (a_jb_j + a_ib_i) \\ &\le \frac{1}{2} \sum_{ij} w_iw_j (a_jb_i + a_ib_j) \\ &= \bigl( \sum_i w_ia_i\bigr) \bigl( \sum_j w_jb_j \bigr)\end{aligned}

where the middle step follows from the inequality (a_jb_i+a_ib_j) - (a_jb_j + a_ib_i) = (a_j-a_i)(b_i-b_j) \ge 0. In particular, if \sum_i w_i = 1 then we obtain

\sum_i w_i a_i b_i \le \bigl( \sum_i w_i a_i \bigr)\bigl(\sum_i w_i b_i\bigr).

This is a weighted version of Chebyshev’s order inequality. It appears (with a different proof) in an interesting article in MAA Mathematics Magazine, where it is used to deduce Jensen’s inequality.
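The inequality is easy to stress-test numerically. The snippet below (illustrative only) draws random weights and random monotone sequences and checks the displayed inequality each time.

```python
# Numerical check of the weighted Chebyshev order inequality.
import random

random.seed(0)
for _ in range(1000):
    n = random.randrange(1, 8)
    w = [random.random() for _ in range(n)]                        # weights
    a = sorted((random.random() for _ in range(n)), reverse=True)  # decreasing
    b = sorted(random.random() for _ in range(n))                  # increasing
    lhs = sum(w) * sum(wi * ai * bi for wi, ai, bi in zip(w, a, b))
    rhs = (sum(wi * ai for wi, ai in zip(w, a))
           * sum(wi * bi for wi, bi in zip(w, b)))
    assert lhs <= rhs + 1e-12   # tolerance for floating-point rounding
```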

Permutation bases and an unexpected duality

June 10, 2018

Let G be a finite group acting transitively on a set \Omega. An important object in representation theory, much used by Burnside and his successors to study the action of G, is the permutation module \mathbb{K}\Omega, defined to be the vector space of all formal linear combinations of the elements of \Omega, over a chosen field \mathbb{K}. Thus \mathbb{K} \Omega has \Omega as a canonical basis.

It’s easy to fall into the trap of thinking that the canonical basis of a permutation module is in some sense special, or even unique. The latter is certainly wrong in all but the smallest cases. For example, whenever \mathbb{K} has characteristic zero, or the characteristic of \mathbb{K} does not divide |\Omega|- 1, the linear map defined on the canonical basis by

\displaystyle \omega \mapsto \sum_{\omega' \not= \omega} \omega'

is an isomorphism of \mathbb{K} G-modules. Generally, if G has rank r in its action on \Omega then \mathrm{dim}\ \mathrm{End}_{\mathbb{K}G}(\mathbb{K}\Omega) = r, and a generic linear combination of r linearly independent endomorphisms is an automorphism of \mathbb{K}\Omega sending the canonical basis to a basis of \mathbb{K}\Omega permuted by G.

The motivation for this post is an example where \mathbb{K}\Omega has a permutation basis not obtained in this way. Take \mathrm{GL}_3(\mathbb{F}_2) acting on E = \langle e_1,e_2,e_3 \rangle_{\mathbb{F}_2}. Let \Omega = \bigl\{ \langle e \rangle : e \in E \backslash \{0 \} \bigr\} be the set of lines in E. (Each line has a unique non-zero point; but we use the line notation to make clear the distinction between e \in E and \langle e \rangle \in \mathbb{K}\Omega.) The canonical basis for \mathbb{K}\Omega is therefore

\bigl\{ \langle e_1 \rangle, \langle e_2 \rangle, \langle e_3 \rangle, \langle e_2 + e_3 \rangle, \langle e_1 + e_3 \rangle, \langle e_1 + e_2 \rangle, \langle e_1 + e_2 + e_3 \rangle \bigr\}.

For each plane \Pi contained in E, let v_\Pi = \sum_{e \in \Pi \backslash \{0\}} \langle e \rangle. Clearly v_\Pi g = v_{\Pi g}, for g \in \mathrm{GL}_3(\mathbb{F}_2), so the v_\Pi are permuted by \mathrm{GL}_3(\mathbb{F}_2). The duality \langle e \rangle \mapsto \langle e \rangle^\bot gives a bijection between the 7 lines and 7 planes. Ordering basis vectors as above, the v_{\langle e \rangle^\bot} for non-zero e \in E have the coefficients shown in the matrix P below.

\left(\begin{matrix} 0&1&1&1&0&0&0 \\ 1&0&1&0&1&0&0 \\  1&1&0&0&0&1&0 \\ 1&0&0&1&0&0&1 \\ 0&1&0&0&1&0&1 \\ 0&0&1&0&0&1&1 \\ 0&0&0&1&1&1&0 \end{matrix} \right)

The determinant of P is -24, so whenever \mathbb{K} has characteristic 0 or \mathbb{K} has prime characteristic \ge 5, the v_\Pi form a permutation basis for \mathbb{K} \Omega. Since \mathrm{GL}_3(\mathbb{F}_2) acts 2-transitively on \Omega, the endomorphism algebra of \mathbb{K}\Omega is 2-dimensional, spanned by the identity and the ‘complementing’ map defined above. Therefore this alternative basis is not obtained by an endomorphism of \mathbb{K}\Omega.
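Both the determinant and the rank of P modulo 2 (used in the next paragraph) can be checked mechanically. In the sketch below (illustrative code, not from the post), non-zero vectors of \mathbb{F}_2^3 are encoded as the integers 1 to 7, and a point e lies on the plane with normal f exactly when their dot product over \mathbb{F}_2 vanishes.

```python
# Determinant and F_2-rank of the incidence matrix P.
from itertools import permutations
from math import prod

def dot(e, f):
    return bin(e & f).count("1") % 2

# Basis order from the post: e1, e2, e3, e2+e3, e1+e3, e1+e2, e1+e2+e3,
# encoded with e1 = 4, e2 = 2, e3 = 1.
order = [4, 2, 1, 3, 5, 6, 7]
P = [[1 - dot(e, f) for e in order] for f in order]

def sign(p):
    inversions = sum(p[i] > p[j] for i in range(7) for j in range(i + 1, 7))
    return -1 if inversions % 2 else 1

# Leibniz expansion: 7! = 5040 terms, instant at this size.
det = sum(sign(p) * prod(P[i][p[i]] for i in range(7))
          for p in permutations(range(7)))
assert abs(det) == 24

def rank_mod2(mat):
    # Gaussian elimination over F_2, with rows packed into integers.
    rows = [int("".join(map(str, row)), 2) for row in mat]
    rank = 0
    while rows and max(rows):
        pivot = max(rows)                       # row with highest leading bit
        top = pivot.bit_length() - 1
        rows = [r ^ pivot if (r >> top) & 1 else r
                for r in rows if r != pivot]
        rank += 1
    return rank

rk = rank_mod2(P)
print(abs(det), rk)
```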

Now for the surprising duality. When \mathbb{K} has characteristic 2, the three linear relations whose first is

v_{\langle e_1 \rangle^\perp} + v_{\langle e_1 + e_2 \rangle^\perp} + v_{\langle e_1 + e_3 \rangle^\perp} + v_{\langle e_1 + e_2 + e_3 \rangle^\perp} = 0

show that \mathrm{rank}\ P \le 4. It is clear from the final four rows of P that v_{\langle e_2 + e_3\rangle^\perp}, v_{\langle e_1 + e_3\rangle^\perp}, v_{\langle e_1 + e_2 \rangle^\perp}, v_{\langle e_1 + e_2 + e_3 \rangle^\perp} are linearly independent and span a 4-dimensional submodule of \mathbb{K} \Omega. The sum of the first three of these vectors, namely \langle e_1 \rangle + \cdots + \langle e_1 + e_2+e_3\rangle, spans the unique trivial submodule of \mathbb{K}\Omega. Since 7 is odd, it splits off as a direct summand. Observe from rows 4 and 7 of the matrix P that

\begin{aligned} \langle e_1 \rangle + \langle e_1 + e_2 \rangle + \langle e_1 + e_3 \rangle & + \langle e_1 + e_2 + e_3 \rangle \\ &\quad = v_{\langle e_2+e_3\rangle^\perp} + v_{\langle e_1+e_2+e_3 \rangle^\perp}. \end{aligned}

A convenient basis for a complement to the trivial submodule is therefore

\begin{aligned} w_1 &= \langle e_1 \rangle + \langle e_1 + e_2 \rangle + \langle e_1 + e_3 \rangle + \langle e_1 + e_2 + e_3 \rangle \\ w_2 &= \langle e_2 \rangle + \langle e_1 + e_2 \rangle + \langle e_2 + e_3 \rangle + \langle e_1 + e_2 + e_3 \rangle \\ w_3 &= \langle e_3 \rangle + \langle e_1 + e_3 \rangle + \langle e_2 + e_3 \rangle + \langle e_1 + e_2 + e_3 \rangle. \end{aligned}

We have shown that W = \langle w_1, w_2, w_3 \rangle_\mathbb{K} is a 3-dimensional \mathbb{K}\mathrm{GL}_3(\mathbb{F}_2)-submodule of \mathbb{K}\Omega. In the special case \mathbb{K} = \mathbb{F}_2, the representation homomorphism \rho : \mathrm{GL}_3(\mathbb{F}_2) \rightarrow \mathrm{GL}(W) is even an isomorphism. Looking at the definitions of w_1, w_2, w_3, it might seem almost obvious that W is isomorphic to the natural representation of \mathrm{GL}_3(\mathbb{F}_2).

This is not the case. There are two 3-dimensional irreducible \mathbb{F}_2\mathrm{GL}_3(\mathbb{F}_2)-modules, namely the natural module E and its dual E^\star. These are distinguished by the action of the matrices in \mathrm{GL}_3(\mathbb{F}_2) of order 7. Let

g = \left( \begin{matrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{matrix} \right)

be the companion matrix for the irreducible cubic 1 + X + X^3. Computing the action of g on W we find that

\begin{aligned} w_1 g &= \langle e_1 g \rangle + \langle (e_1 + e_2)g \rangle + \langle (e_1 + e_3) g \rangle + \langle (e_1 + e_2 + e_3)g \rangle \\ &= \langle e_2 \rangle + \langle e_2 + e_3 \rangle + \langle e_1 \rangle + \langle e_1 + e_3 \rangle \\ &= w_1 + w_2. \end{aligned}

Similar calculations show that w_2 g = w_3 and w_3 g = w_1. Reordering basis vectors as w_2, w_3, w_1, we see that in its action on W, g is represented by the matrix

\left( \begin{matrix} 0 &1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 1 \end{matrix} \right).

This is the companion matrix of the irreducible cubic 1 + X^2 + X^3, and so it lies in the other conjugacy class of order 7 matrices, not the class of g. Therefore W \cong E^\star.

Since \mathrm{dim}\ \mathrm{End}_{\mathbb{F}_2\mathrm{GL}_3(\mathbb{F}_2)}(\mathbb{F}_2\Omega) = 2, it follows that \mathbb{F}_2\Omega \cong  \mathbb{F}_2 \oplus U where U is an indecomposable 6-dimensional \mathbb{F}_2\mathrm{GL}_3(\mathbb{F}_2)-module with \mathrm{rad}\ U = \mathrm{soc}\ U \cong E^\star and U / \mathrm{rad}\ U \cong E. As expected, the summand U is self-dual, but I think it is a mild surprise that it has E^\star and not E in its socle.

Hill climbing on substitution ciphers

May 22, 2018

On Saturday the BSHM held a fascinating meeting The history of cryptography and codes. In the second talk, Solving Historical Ciphers with Modern Means, Klaus Schmeh mentioned an attack on the old-school Turning Grille cipher by the thoroughly new-school method of hill climbing. This got me thinking about whether hill climbing could be an effective attack on the monoalphabetic substitution cipher.

As a running example, we use the ciphertext below; it is the encryption, by a randomly chosen substitution cipher, of the first two sentences in Section 1.1 of Stinson’s highly recommended book Cryptography: Theory and Practice, with punctuation and spaces deleted.


Frequency analysis shows that the most common letters are xzugk with 30, 23, 23, 21 and 19 occurrences, respectively. We therefore guess that x is the encryption of e, that z and u are the encryptions of t and a (in some order), and so on. Decrypting by the key zpxyjcofvuidsqkrlmnwabgeht implied by the frequency counts gives


From here I think it is feasible—but certainly non-trivial—to complete the decryption by hand. A good place to start is the final seven letters exafche, on the assumption that the (common) e and a are correct, and the (rare) x is also correct.

For a hill climb, we need

  • a place to start;
  • a way to step;
  • a scoring function.

The obvious starting point is the key guessed by frequency analysis.
Repeated steps must allow the climb to reach all 26! permutations of the Roman alphabet. A natural generating set for the symmetric group on the alphabet is the 26 \times 25 /2 = 325 transpositions. Thus in each step we interchange two letters of the decryption key. For instance, in the first step in the example below, zpxyj … becomes zyxpj… swapping the decrypts of b and d.
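The three ingredients fit together in a short generic skeleton. The sketch below (illustrative Python; the author's implementation is in Haskell) represents a key as a 26-letter decryption alphabet and, at each step, takes the best of the 325 transpositions, stopping at a local maximum. The toy usage example cheats by scoring against a known plaintext; a real attack would plug in one of the statistical scores discussed below.

```python
# Generic hill climb on substitution ciphers with transposition steps.
from itertools import combinations
from string import ascii_lowercase

def decrypt(key, ciphertext):
    # key[i] is the plaintext letter for ciphertext letter chr(ord('a') + i)
    table = dict(zip(ascii_lowercase, key))
    return "".join(table[c] for c in ciphertext)

def hill_climb(key, ciphertext, score):
    best = score(decrypt(key, ciphertext))
    while True:
        step = None
        for i, j in combinations(range(26), 2):   # all 325 transpositions
            k = list(key)
            k[i], k[j] = k[j], k[i]
            candidate = "".join(k)
            s = score(decrypt(candidate, ciphertext))
            if s > best:
                best, step = s, candidate
        if step is None:
            return key, best      # local maximum: no transposition improves
        key = step

# Toy usage: a Caesar-shifted ciphertext and a 'cheating' score that counts
# letters agreeing with the known plaintext.
plain = "thequickbrownfoxjumpsoverthelazydog"
cipher = "".join(ascii_lowercase[(ord(c) - ord("a") + 3) % 26] for c in plain)
start = ascii_lowercase
final_key, final_score = hill_climb(
    start, cipher, lambda s: sum(a == b for a, b in zip(s, plain)))
```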

A good scoring function is far less obvious. In my first try, I scored a putative plaintext of length n by \sum_{i=1}^n s_i, where s_i is the length of the longest English word starting in position i. For example, exampletext gives scores (7,0,1,0,0,3,0,4,0,0,0), taking words from a small dictionary containing let but not ample. After 25 steps, the score increases from 115 to 187, and the hill climb then aborts, since no step improves the score. Unfortunately the putative plaintext is by this point


which, while a local maximum of the hill climb, seems barely more ‘Englishy’ than the first guess. Permitting two transpositions to be applied in each single step, the hill climb continued for 40 steps (the final 15 steps each taking a minute or so), and aborted with


of score 219. While closer to the truth, this seems a poor return on so much computational time. Increasing the dictionary size, and trying variant statistics, for example the sum of the squared subword lengths, gave equally discouraging results.
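For concreteness, the first scoring function can be reconstructed as follows (an illustrative sketch with a toy four-word dictionary, chosen so that it contains let but not ample, matching the example above).

```python
# The longest-dictionary-word-at-each-position score.
words = {"a", "example", "let", "text"}   # toy dictionary

def subword_scores(text):
    # s_i = length of the longest dictionary word starting at position i
    n = len(text)
    return [max((j - i for j in range(i + 1, n + 1) if text[i:j] in words),
                default=0) for i in range(n)]

print(subword_scores("exampletext"))   # → [7, 0, 1, 0, 0, 3, 0, 4, 0, 0, 0]
```

The score of a putative plaintext is then simply `sum(subword_scores(text))`.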

At this point I searched for prior art, and found Cryptanalysis of the Simple Substitution Cipher, which recommends a score based on quadgram frequencies. For instance the most common quadgram is tion, with probability p_{\mathrm{tion}} = 0.00312, followed by nthe and ther. The suggested score is \sum_q \log p_q where the sum is over all quadgrams q in the putative plaintext. Using transpositions as steps, the hill climb took 31 steps (and about 10 seconds, most of which is spent building a look-up table of quadgram frequencies) to improve the score from -3188.91 to -2278.94. Half-way through the putative plaintext is


and the final putative plaintext is correct in every character:


For the record, the decryption key is zyxwkpomvutsrqjihdcbagfeln. Similarly impressive results were obtained using trigrams or quintgrams. (Some online code I wrote for the trigram attack is linked below.) Bigrams with transposition steps aborted after 23 steps at the local maximum


but allowing double transpositions the hill climb reaches (after 37 steps and a few minutes)


which is correct up to a single transposition. This transposition is a possible step, but it is rejected by the algorithm since making it worsens the bigram score, from -1308.83 to -1309.56. Thus for bigrams with double transposition steps, this plaintext is not even a local maximum of the hill climb. (Despite this theoretical obstruction, there is obviously no problem completing the decryption.)
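For comparison, the quadgram score itself is a one-liner. A minimal sketch (the floor probability for quadgrams missing from the frequency table is my own choice, so that the log is always defined):

```haskell
import qualified Data.Map as Map
import Data.List (tails)

-- Quadgram log-probability score: sum of log p_q over all quadgrams q
-- of the putative plaintext. Unseen quadgrams get a small floor probability.
quadgramScore :: Map.Map String Double -> String -> Double
quadgramScore table text =
  sum [ log (Map.findWithDefault 1e-9 q table) | q <- quadgrams text ]
  where
    quadgrams s = [take 4 t | t <- tails s, length (take 4 t) == 4]
```

The same function with 2- or 3-letter windows gives the bigram and trigram scores.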

Some final thoughts.

  • My first choice of scoring function suffers on the example text by getting stuck at local maxima that have some plausible chunks (for example, the extract …operaluldesireshallebuldish…), while other parts look completely wrong. The less ‘brittle’ quadgram statistic appears to lead to a smoother, steadier improvement.
  • The score obtained by summing the logs of the quadgram probabilities is suggested by maximum likelihood estimation. Thus the more successful score also has the more solid theoretical basis.
  • These experiments suggest that, at least for the substitution cipher, the choice of scoring function is far more important than the choice of steps. My attempt to salvage a poor scoring function by increasing the number of possible steps was somewhat successful, at the cost of vastly increased running time. Given the exponential growth in the search space, it seems clear that a small menu of steps, directed by a good scoring function, and possibly iterated many times, is the ideal to aim for.
  • A simplified version of the Haskell code I wrote is online here. Because of limitations of the online environment, it only works with bigrams and a reduced list of 2500 trigrams. Some trials suggest that the reduced trigrams suffice on most ciphertexts longer than a few sentences.

In my Cipher Systems course I had a computer demonstration in most lectures. I used Mathematica at first, in the hope (partially realised) that students would be able to re-use the notebooks for problem sheets, but switched to Haskell for the more computationally expensive attacks later in the course. My worry was that, since I never had time to explain the code beyond the bare mathematical details of the underlying algorithm, this would just deepen the students’ mystification. Feedback suggests, however, that these demonstrations were found surprisingly helpful.

The Gini coefficient

April 10, 2018

We define the Gini coefficient of a probability measure p on [0,\infty) with mean \mu by

\displaystyle G = \frac{\mathbb{E}[\hskip0.5pt|X-Y|\hskip0.5pt]}{2 \mu}

where X and Y are independently distributed according to p. Thus G = \frac{1}{2\mu} \int_0^\infty \int_0^\infty |x-y| \,\mathrm{d}p(x) \,\mathrm{d}p(y). Dividing by \mu makes G a dimensionless quantity. The reason for the normalizing factor 2 will be seen shortly.

The Gini coefficient is a measure of inequality. For instance, if p_x is the probability that a citizen’s wealth is x \in [0,\infty), then we can sample |X-Y| by picking two random citizens and taking the absolute value of the difference of their wealths. In

  • Utopia, where everyone happily owns the same ample amount (right up to the point when they are beheaded for treason), the Gini coefficient is 0;
  • Dystopia, where the ruler has a modest fortune of N units, and the other N-1 ‘citizens’ have nothing at all, \mu = 1 and the Gini coefficient is \frac{1}{2} \cdot \frac{2(N-1)}{N^2} \cdot N = 1 - \frac{1}{N};
  • Subtopia, where each serf has 1 unit, and each land-owner has 2 units (plus two serfs to boss around), \mu = 4/3. Picking two people at random, the probability that they come from different classes is 2 \cdot \frac{2}{3} \cdot \frac{1}{3} = \frac{4}{9}, hence the Gini coefficient is \frac{1}{2} \cdot \frac{3}{4} \cdot \frac{4}{9} = \frac{1}{6}.
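These toy computations can be checked directly from the definition. A minimal sketch of my own, computing \mathbb{E}[\hskip0.5pt|X-Y|\hskip0.5pt]/2\mu exactly for a finite population, with the two citizens sampled uniformly and with replacement:

```haskell
-- Gini coefficient of a finite population of wealths, straight from the
-- definition G = E|X - Y| / (2 mu), with X and Y uniform on the population.
gini :: [Double] -> Double
gini ws = sum [abs (x - y) | x <- ws, y <- ws] / (2 * mu * n * n)
  where
    n  = fromIntegral (length ws)
    mu = sum ws / n
```

Then gini (replicate 3 1) is 0 (Utopia), gini (10 : replicate 9 0) is 0.9 = 1 - 1/10 (Dystopia with N = 10), and gini [1,1,2] is 1/6 (one Subtopian land-owner with two serfs).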

A striking interpretation of G uses the Lorenz curve. Staying with the wealth interpretation, define L(\theta) to be the proportion of all wealth owned by the poorest \theta of the population. Thus L(0) = 0, L(1) = 1 and, since the poorest third (for example) cannot have more than a third of the wealth, L(\theta) \le \theta for all \theta. When N is large we can approximate L by the following functions \ell: in

  • Utopia \ell(\theta) = \theta for all \theta;

  • Dystopia \ell(\theta) = 0 if \theta < 1 and \ell(1) = 1;
  • Subtopia \ell(\theta) = \frac{3\theta}{4} if 0 \le  \theta < 2/3 and \ell(\theta) = -\frac{1}{2} + \frac{3\theta}{2} if 2/3 \le \theta \le 1.

Since the area below the Subtopian Lorenz curve is 1/6 + 1/6 + 1/12 = 5/12, the area between the curve and the diagonal is 1/2 - 5/12 = 1/12. This is half the Gini coefficient. Clearly the same result holds in Utopia and Dystopia.
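As a quick numerical check (my own), integrating the Subtopian \ell by the trapezium rule recovers the area 5/12:

```haskell
-- Subtopian Lorenz curve approximation, as defined above.
ellSubtopia :: Double -> Double
ellSubtopia t
  | t < 2/3   = 3 * t / 4
  | otherwise = -1/2 + 3 * t / 2

-- Trapezium rule on [0,1]; exact on linear pieces, tiny error at the kink.
areaUnder :: (Double -> Double) -> Double
areaUnder f = sum [h * (f x + f (x + h)) / 2 | x <- [0, h .. 1 - h]]
  where h = 1 / 3000
```

This gives 5/12 to six decimal places, hence a Gini coefficient of 2(1/2 - 5/12) = 1/6, agreeing with the direct computation.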

Theorem. G is twice the area between the Lorenz curve for X and the diagonal.

Proof. Let P be the cumulative distribution function for X, so \mathbb{P}[X \le x] = P(x). If you are person \omega, and your wealth is x, then X(\omega) = x. Thus P(X(\omega)) \le \theta if and only if person \omega is in the poorest \theta of the population. Therefore the poorest \theta of the population form the event \{ P(X) \le \theta \}. Their proportion of the wealth is the expectation of X, taken over this event, divided by \mu. That is

\displaystyle L(\theta) = \frac{\mathbb{E}[X 1_{P(X) \le \theta}]}{\mu}.

The area, I say, under the Lorenz curve is therefore

\begin{aligned} I &= \int_0^1 L(\theta) \mathrm{d}\theta \\  &= \frac{1}{\mu} \int_0^1 \mathbb{E}[X 1_{P(X) \le \theta}] \mathrm{d}\theta \\ &= \frac{1}{\mu} \int_0^\infty \mathbb{E}_X[X 1_{P(X) \le P(y)}] \mathrm{d}P(y)  \\ &= \frac{1}{\mu} \int_0^\infty \mathbb{E}_X[X 1_{X \le y}] \mathrm{d}P(y) \\ &= \frac{1}{\mu} \mathbb{E}_Y \mathbb{E}_X [X 1_{X \le Y}] \end{aligned}

where we made the substitution \theta = P(y) to go from line 2 to line 3. Now since \mathrm{min}(X,Y) = X 1_{X \le Y} + Y1_{Y \le X} - X 1_{X = Y}, where the event X=Y is assumed to be negligible, it follows from linearity of expectation and the symmetry between X and Y that \mathbb{E}[\mathrm{min}(X,Y)] = 2 \mathbb{E}[X 1_{X \le Y}]. (Here \mathbb{E} denotes the expectation taken over both random variables.) Substituting we obtain

\displaystyle I = \frac{1}{\mu} \frac{\mathbb{E}[\mathrm{min}(X,Y)]}{2}.

From the identity |X-Y| = X+Y - 2\,\mathrm{min}(X,Y), followed by another application of linearity of expectation, and finally the definition of G, we get

\begin{aligned} I &= \frac{1}{2\mu} \frac{\mathbb{E}[X+Y - |X-Y|]}{2} \\ &= \frac{1}{2\mu} \frac{2\mu - \mathbb{E}[\hskip0.5pt|X-Y|\hskip0.5pt]}{2} \\ &= \frac{1}{2} - \frac{G} {2}. \end{aligned}

Therefore G = 2(\frac{1}{2}-I), as claimed.

Some related integrals. My interest in this mainly comes from a way to compute G from the formula

\begin{aligned} \int_0^\infty (1-P(x))^2 \mathrm{d}x &= \int_0^\infty \mathbb{P}[X > x]^2 \mathrm{d}x \\&= \int_0^\infty \mathbb{P}[Y  > x] \mathbb{P}[Z > x] \mathrm{d}x \\ &= \int_0^\infty \mathbb{P}[\mathrm{min}(Y,Z) > x] \mathrm{d}x \\ &= \mathbb{E}[\mathrm{min}(Y,Z)] \end{aligned}

where we introduced independent random variables Y and Z distributed according to p, and then used the standard result \mathbb{E}[T] = \int_0^\infty \mathbb{P}[T \ge t] \mathrm{d}t, which holds for any non-negative random variable T. Very similarly, since P(x)^2 = \mathbb{P}[Y \le x]\mathbb{P}[Z \le x] we have 1-P(x)^2 = \mathbb{P}[\mathrm{max}(Y,Z) > x] and

\int_0^\infty \bigl(1-P(x)^2\bigr) \mathrm{d}x = \mathbb{E}[\mathrm{max}(Y,Z)].

Both these formulae generalize to give formulae for the minimum and maximum of k independent identically distributed random variables. For instance, a corollary of these formulae for k = 2 is that

\begin{aligned} G &= \frac{1}{2\mu} \mathbb{E}[\mathrm{max}(X,Y) - \mathrm{min}(X,Y)]  \\ &= \frac{1}{2\mu} \int_0^\infty \bigl(1-P(x)^2\bigr) - \bigl(1-P(x)\bigr)^2 \mathrm{d} x \\ &= \frac{1}{\mu} \int_0^\infty P(x)\bigl(1-P(x)\bigr) \mathrm{d}x \end{aligned}

and for k = 3, since (1-q^3)-(1-q)^3 = 3q(1-q),

\mathbb{E}[\mathrm{max}(X,Y,Z)-\mathrm{min}(X,Y,Z)] = 3\mu G.
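The k = 2 formula is also convenient for computing G from a c.d.f. As a sanity check of my own, for the exponential distribution with P(x) = 1 - e^{-x} and \mu = 1, the integral is \int_0^\infty (e^{-x} - e^{-2x}) \mathrm{d}x = 1/2, so G = 1/2:

```haskell
-- G = (1/mu) * integral of P(1-P), approximated by the trapezium rule on
-- [0, 40]; the neglected tail is of order exp(-40).
giniFromCdf :: (Double -> Double) -> Double -> Double
giniFromCdf p mu = (1 / mu) * sum [h * (g x + g (x + h)) / 2 | x <- [0, h .. 40 - h]]
  where
    g x = p x * (1 - p x)
    h   = 1 / 1000
```

Evaluating giniFromCdf (\x -> 1 - exp (-x)) 1 indeed gives 1/2 to several decimal places.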

It’s perhaps surprising that sampling three people rather than two (and then taking max minus min) does not give an essentially different inequality statistic. But simply taking the middle does give something new:

\begin{aligned} \mathbb{E}&[\mathrm{middle}(X,Y,Z)] \\ &= 3\mu - \mathbb{E}[\mathrm{max}(X,Y,Z)] - \mathbb{E}[\mathrm{min}(X,Y,Z)] \\ &= 3 \mu - \int_0^\infty (1-P(x)^3) \mathrm{d}x - \int_0^\infty (1-P(x))^3 \mathrm{d}x \\ &= 3 \mu - 3\int_0^\infty (1-P(x)) \mathrm{d}x + 3 \int_0^\infty (1-P(x))^2 \mathrm{d}x  \\ & \hskip2in - 2 \int_0^\infty (1-P(x))^3 \mathrm{d}x \\ &= 3 \mu - 3 \mathbb{E}[X] + 3 \mathbb{E}[\mathrm{min}(X,Y)] - 2 \mathbb{E} [\mathrm{min}(X,Y,Z)] \\ &= 3 \mathbb{E}[\mathrm{min}(X,Y)] - 2 \mathbb{E} [\mathrm{min}(X,Y,Z)]. \end{aligned}
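This identity is easy to check by brute force on a small population (a sketch of my own, sampling uniformly with replacement):

```haskell
import Data.List (sort)

-- Expectation of a statistic over all ordered k-samples, drawn with
-- replacement from a finite population.
expectBy :: ([Double] -> Double) -> Int -> [Double] -> Double
expectBy stat k ws = sum (map stat samples) / fromIntegral (length samples)
  where samples = sequence (replicate k ws)

-- Middle of a sample of three.
middle3 :: [Double] -> Double
middle3 xs = sort xs !! 1
```

For the Subtopian population [1,1,2], expectBy middle3 3 gives 34/27, agreeing with 3\,\mathbb{E}[\mathrm{min}(X,Y)] - 2\,\mathbb{E}[\mathrm{min}(X,Y,Z)] = 3 \cdot \frac{10}{9} - 2 \cdot \frac{28}{27}; replacing minimum with maximum checks the companion identity below.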

A similar argument shows that

\mathbb{E}[\mathrm{middle}(X,Y,Z)] = 3\mathbb{E}[\mathrm{max}(X,Y)] - 2\mathbb{E}[\mathrm{max}(X,Y,Z)].

(Alternatively this can be proved from the previous remark.) Since the mean and median are not, in general, the same, it should not be a surprise that in a sample of three people, the middle wealth is not an unbiased estimator of the mean wealth. But still I find it slightly striking that it therefore serves as a measure of inequality.

Another interesting fact is that, generalizing the formulae for the cumulative distribution functions of the maximum and minimum above, the rank statistics for any number of independent samples have nice c.d.f.s. For example, since \mathrm{middle}(X,Y,Z) \le x if and only if either all three random variables are \le x or exactly two random variables are \le x, we have

P_\mathrm{middle}(x) = P(x)^3 + 3(1-P(x))P(x)^2

and in general, the j-th largest of k independent random variables has c.d.f.

P_j(x) = \sum_{i=0}^{j-1} \binom{k}{i} P(x)^{k-i}(1-P(x))^i.
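This formula can be verified by enumeration. A sketch of my own, for k i.i.d. picks uniform on \{1,\ldots,m\}, whose base c.d.f. is P(x) = x/m:

```haskell
import Data.List (sortBy)
import Data.Ord (comparing, Down(..))

-- The formula: c.d.f. of the j-th largest of k i.i.d. samples, as a
-- function of the base c.d.f. value p = P(x).
rankCdf :: Int -> Int -> Double -> Double
rankCdf j k p = sum [ fromIntegral (choose k i) * p ^ (k - i) * (1 - p) ^ i
                    | i <- [0 .. j - 1] ]
  where choose n r = product [n - r + 1 .. n] `div` product [1 .. max 1 r]

-- Brute force: fraction of tuples in {1..m}^k whose j-th largest is <= x.
rankCdfBrute :: Int -> Int -> Int -> Int -> Double
rankCdfBrute j k m x = fromIntegral hits / fromIntegral (m ^ k)
  where
    jthLargest t = sortBy (comparing Down) t !! (j - 1)
    hits = length [t | t <- sequence (replicate k [1 .. m]), jthLargest t <= x]
```

With j = 2 and k = 3 the formula reduces to P(x)^3 + 3(1-P(x))P(x)^2, the c.d.f. of the middle displayed above.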