Hill climbing on substitution ciphers

May 22, 2018

On Saturday the BSHM held a fascinating meeting, The History of Cryptography and Codes. In the second talk, Solving Historical Ciphers with Modern Means, Klaus Schmeh mentioned an attack on the old-school Turning Grille cipher by the thoroughly new-school method of hill climbing. This got me thinking about whether hill climbing could be an effective attack on the monoalphabetic substitution cipher.

As a running example, we use the ciphertext below; it is the encryption, by a randomly chosen substitution cipher, of the first two sentences in Section 1.1 of Stinson’s highly recommended book Cryptography: Theory and Practice, punctuation and spaces deleted.


Frequency analysis shows that the most common letters are xzugk with 30, 23, 23, 21 and 19 occurrences, respectively. We therefore guess that x is the encryption of e, that z and u are the encryptions of t and a (in some order), and so on. Decrypting by the key zpxyjcofvuidsqkrlmnwabgeht implied by the frequency counts gives


From here I think it is feasible—but certainly non-trivial—to complete the decryption by hand. A good place to start is the final seven letters exafche, on the assumption that the (common) e and a are correct, and the (rare) x is also correct.

For a hill climb, we need

  • a place to start;
  • a way to step;
  • a scoring function.

The obvious starting point is the key guessed by frequency analysis.
Repeated steps must allow the climb to reach all 26! permutations of the Roman alphabet. A natural generating set for the symmetric group on the alphabet is the set of 26 \times 25 / 2 = 325 transpositions. Thus in each step we interchange two letters of the decryption key. For instance, in the first step in the example below, zpxyj… becomes zyxpj…, swapping the decrypts of b and d.

A good scoring function is far less obvious. In my first try, I scored a putative plaintext of length n by \sum_{i=1}^n s_i, where s_i is the length of the longest English word starting in position i. For example, exampletext gives scores (7,0,1,0,0,3,0,4,0,0,0), taking words from a small dictionary containing let but not ample. After 25 steps, the score increases from 115 to 187, and the hill climb then aborts, since no step improves the score. Unfortunately the putative plaintext is by this point
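This first scoring function is easy to sketch in Python. This is a sketch only: word_score and toy_dictionary are illustrative names, and the dictionary actually used was of course much larger.

```python
def word_score(text, dictionary):
    """Score a putative plaintext by summing, over each position i,
    the length of the longest dictionary word starting at position i."""
    total = 0
    for i in range(len(text)):
        best = 0
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in dictionary:
                best = j - i
        total += best
    return total

# A toy dictionary containing 'let' but not 'ample', as in the text.
toy_dictionary = {'example', 'a', 'let', 'text'}
print(word_score('exampletext', toy_dictionary))  # 15 = 7 + 1 + 3 + 4
```

With this dictionary the per-position scores on exampletext are (7,0,1,0,0,3,0,4,0,0,0), as in the text.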


which seems barely more ‘Englishy’ than the first guess. Allowing two transpositions to be applied in each step dramatically increases the number of possible steps; this allowed the hill climb to escape the local maximum above. After 40 steps (the final 15 steps each taking a minute or so), the hill climb aborted with


of score 219. While closer to the truth, this seems a poor return on so much computational time. Increasing the dictionary size, and trying some variant statistics, for example the sum of the squared subword lengths, gave equally discouraging results.

At this point I searched for prior art, and found Cryptanalysis of the Simple Substitution Cipher, which recommends a score based on quadgram frequencies. For instance, the most common quadgram is tion, with probability p_{\mathrm{tion}} = 0.00312, followed by nthe and ther. The suggested score is \sum_q \log p_q where the sum is over all quadgrams q in the putative plaintext. Using transpositions as steps, the hill climb took 31 steps (and about 10 seconds, most of which is spent building a look-up table of quadgram frequencies) to improve the score from -3188.91 to -2278.94. Half-way through, the putative plaintext is
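A sketch of the quadgram score and a basic hill climb in Python. This is illustrative only: the quadgram table below is a stand-in for real frequency data, and the greedy best-swap loop is just one simple way to organise transposition steps, not the code from the linked article.

```python
from itertools import combinations

def quadgram_score(text, logprob, floor):
    """Sum of log-probabilities over all quadgrams in the text;
    unseen quadgrams get a fixed floor penalty."""
    return sum(logprob.get(text[i:i + 4], floor) for i in range(len(text) - 3))

def decrypt(ciphertext, key):
    """key[i] is the plaintext letter for the i-th ciphertext letter."""
    table = {chr(ord('a') + i): key[i] for i in range(26)}
    return ''.join(table[c] for c in ciphertext)

def hill_climb(ciphertext, key, logprob, floor):
    """Repeatedly apply the best single transposition of two key letters;
    stop when no swap improves the score."""
    best = quadgram_score(decrypt(ciphertext, key), logprob, floor)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(26), 2):
            k = list(key)
            k[i], k[j] = k[j], k[i]
            trial = ''.join(k)
            s = quadgram_score(decrypt(ciphertext, trial), logprob, floor)
            if s > best:
                key, best, improved = trial, s, True
    return key, best
```

In practice the look-up table would be built once from a large corpus of English text, and the climb seeded with the frequency-analysis key.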


and the final putative plaintext is correct in every character:


For the record, the decryption key is zyxwkpomvutsrqjihdcbagfeln. Similarly impressive results were obtained using trigrams or quintgrams. (There is a link at the end to some online code I wrote that can do the trigram attack.) Bigrams with transposition steps aborted after 23 steps at the local maximum


but allowing double transpositions the hill climb reaches (after 37 steps and a few minutes)


which is correct up to a single transposition. This transposition is a possible step, but it is rejected by the algorithm since making it worsens the bigram score, from -1308.83 to -1309.56. Thus for bigrams with double transposition steps, the true plaintext is not even a local maximum of the hill climb. (Despite this theoretical obstruction, there is in practice no problem completing the decryption.)

Some final thoughts.

  • My first choice of scoring function suffers on the example text by getting stuck at local maxima that have some plausible chunks (for example, consider the extract … operaluldesireshallebuldish), while other parts look completely wrong. The less ‘brittle’ quadgram statistic appears to lead to a smooth, steady improvement.
  • The score obtained by summing the logs of the n-gram probabilities is suggested by maximum likelihood estimation. Thus the more successful score has the more solid theoretical basis.
  • These experiments suggest that, at least for the substitution cipher, the choice of scoring function is far more important than the choice of steps. My attempt to salvage a poor scoring function by increasing the number of possible steps was somewhat successful, at the cost of vastly increased running time. Given the exponential growth in the search space, it seems clear that a small menu of steps, directed by a good scoring function, and possibly iterated many times, is the ideal to aim for.
  • A simplified version of the Haskell code I wrote is online here. Because of limitations of the online environment, it only works with bigrams and a reduced list of 2500 trigrams. Some trials suggest that the reduced trigrams suffice on most ciphertexts longer than a few sentences.

In my Cipher Systems course I had a computer demonstration in most lectures. I used Mathematica at first, in the hope (partially realised) that students would be able to re-use the notebooks for problem sheets, but switched to Haskell for the more computationally expensive attacks later in the course. My worry was that, since I never had time to explain the code beyond the bare mathematical details of the underlying algorithm, this would just deepen the students’ mystification. Feedback suggests, however, that these demonstrations were found surprisingly helpful.


The Gini coefficient

April 10, 2018

We define the Gini coefficient of a probability measure p on [0,\infty) with mean \mu by

\displaystyle G = \frac{\mathbb{E}[\hskip0.5pt|X-Y|\hskip0.5pt]}{2 \mu}

where X and Y are independently distributed according to p. Thus G = \frac{1}{2\mu} \int_0^\infty \int_0^\infty |x-y| \,\mathrm{d}p(x) \,\mathrm{d}p(y). Dividing by \mu makes G a dimensionless quantity. The reason for normalizing by 2 will be seen shortly.

The Gini coefficient is a measure of inequality. For instance, if p_x is the probability that a citizen’s wealth is x \in [0,\infty), then we can sample G by picking two random citizens, and taking the absolute value of the difference of their wealths. In

  • Utopia, where everyone happily owns the same ample amount (right up to the point when they are beheaded for treason), the Gini coefficient is 0;
  • Dystopia, where the ruler has a modest fortune of N units, and the other N-1 ‘citizens’ have nothing at all, \mu = 1 and the Gini coefficient is \frac{1}{2} \frac{2(N-1)}{N^2} N = 1 - \frac{1}{N};
  • Subtopia, where each serf has 1 unit, and each land-owner has 2 units (plus two serfs to boss around), \mu = 4/3. Picking two people at random, the probability that they come from different classes is 2 \cdot \frac{2}{3} \cdot \frac{1}{3} = \frac{4}{9}, hence the Gini coefficient is \frac{1}{2} \cdot \frac{3}{4} \cdot \frac{4}{9} = \frac{1}{6}.
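The three toy societies can be checked with a few lines of Python. This is a sketch: gini computes the mean absolute difference over all ordered pairs of citizens in a finite population, divided by twice the mean wealth.

```python
from itertools import product

def gini(wealths):
    """Gini coefficient of a finite population: the mean absolute
    difference of the wealths of two uniformly random citizens,
    divided by twice the mean wealth."""
    n = len(wealths)
    mu = sum(wealths) / n
    mad = sum(abs(x - y) for x, y in product(wealths, repeat=2)) / n ** 2
    return mad / (2 * mu)

N = 100
print(gini([1] * N))              # Utopia: 0.0
print(gini([N] + [0] * (N - 1)))  # Dystopia: 1 - 1/N
print(gini([1, 1, 2]))            # Subtopia: 1/6
```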

A striking interpretation of G uses the Lorenz curve. Staying with the wealth interpretation, define L(\theta) to be the proportion of all wealth owned by the poorest \theta of the population. Thus L(0) = 0, L(1) = 1 and, since the poorest third (for example) cannot have more than a third of the wealth, L(\theta) \le \theta for all \theta. When N is large we can approximate L by the following functions \ell: in

  • Utopia \ell(\theta) = \theta for all \theta;

  • Dystopia \ell(\theta) = 0 if \theta < 1 and \ell(1) = 1;
  • Subtopia \ell(\theta) = \frac{3\theta}{4} if 0 \le  \theta < 2/3 and \ell(\theta) = -\frac{1}{2} + \frac{3\theta}{2} if 2/3 \le \theta \le 1.

Since the area below the Subtopian Lorenz curve is 1/6 + 1/6 + 1/12 = 5/12, the orange area is 1/12. This is half the Gini coefficient. Clearly the same result holds in Utopia and Dystopia.

Theorem. G is twice the area between the Lorenz curve for X and the diagonal.

Proof. Let P be the cumulative distribution function for X, so \mathbb{P}[X \le x] = P(x). If you are person \omega, and your wealth is x, so X(\omega) = x, then P(X(\omega)) \le \theta if and only if you are in the poorest \theta of the population. Therefore the poorest \theta of the population form the event \{ P(X) \le \theta \}. Their proportion of the wealth is the expectation of X, taken over this event, divided by \mu. That is

\displaystyle L(\theta) = \frac{\mathbb{E}[X 1_{P(X) \le \theta}]}{\mu}.

The area, I say, under the Lorenz curve is therefore

\begin{aligned} I &= \int_0^1 L(\theta) \mathrm{d}\theta \\  &= \frac{1}{\mu} \int_0^1 \mathbb{E}[X 1_{P(X) \le \theta}] \mathrm{d}\theta \\ &= \frac{1}{\mu} \int_0^\infty \mathbb{E}_X[X 1_{P(X) \le P(y)}] \mathrm{d}P(y)  \\ &= \frac{1}{\mu} \int_0^\infty \mathbb{E}_X[X 1_{X \le y}] \mathrm{d}P(y) \\ &= \frac{1}{\mu} \mathbb{E}_Y \mathbb{E}_X [X 1_{X \le Y}] \end{aligned}

where we made the substitution \theta = P(y) to go from line 2 to line 3. Now since \mathrm{min}(X,Y) = X 1_{X \le Y} + Y1_{Y \le X} - X 1_{X = Y}, where the event X=Y is assumed to be negligible, it follows from linearity of expectation and the symmetry between X and Y that \mathbb{E}[\mathrm{min}(X,Y)] = 2 \mathbb{E}[X 1_{X \le Y}]. (Here \mathbb{E} denotes the expectation taken over both random variables.) Substituting we obtain

\displaystyle I = \frac{1}{\mu} \frac{\mathbb{E}[\mathrm{min}(X,Y)]}{2}.

From the identity |X-Y| = X+Y - 2\mathrm{min}(X,Y), another application of linearity of expectation, and finally the definition of G, we get

\begin{aligned}I &= \frac{1}{2\mu} \frac{\mathbb{E}[X+Y - |X-Y|]}{2} \\ &= \frac{1}{2\mu} \frac{2\mu - \mathbb{E}[\hskip0.5pt|X-Y|\hskip0.5pt]}{2} \\ &= \frac{1}{2} - \frac{G}{2}. \end{aligned}

Therefore G = 2(\frac{1}{2}-I), as claimed.
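For a finite population the theorem can be verified directly: the empirical Lorenz curve is piecewise linear, so the area under it is a sum of trapezoids. A quick Python check, a sketch assuming equally weighted citizens:

```python
from itertools import product

def gini_direct(wealths):
    """G = E|X - Y| / (2 mu) for two uniform samples from the population."""
    n = len(wealths)
    mu = sum(wealths) / n
    return sum(abs(x - y) for x, y in product(wealths, repeat=2)) / (2 * mu * n ** 2)

def gini_from_lorenz(wealths):
    """Twice the area between the diagonal and the piecewise linear
    Lorenz curve, computed by summing trapezoids."""
    ws = sorted(wealths)
    n, total = len(ws), sum(ws)
    area, cum = 0.0, 0
    for w in ws:
        # trapezoid of width 1/n with heights cum/total and (cum + w)/total
        area += (2 * cum + w) / (2 * total * n)
        cum += w
    return 2 * (0.5 - area)

for pop in [[1, 1, 2], [5, 5, 5], [100] + [0] * 99]:
    assert abs(gini_direct(pop) - gini_from_lorenz(pop)) < 1e-9
```

On the Subtopian population [1, 1, 2] both functions return 1/6, as computed above.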

Some related integrals. My interest in this mainly comes from a way to compute G from the formula

\begin{aligned} \int_0^\infty (1-P(x))^2 \mathrm{d}x &= \int_0^\infty \mathbb{P}[X > x]^2 \mathrm{d}x \\&= \int_0^\infty \mathbb{P}[Y  > x] \mathbb{P}[Z > x] \mathrm{d}x \\ &= \int_0^\infty \mathbb{P}[\mathrm{min}(Y,Z) > x] \mathrm{d}x \\ &= \mathbb{E}[\mathrm{min}(Y,Z)] \end{aligned}

where we introduced independent random variables Y and Z distributed according to p, and then used the standard result \mathbb{E}[T] = \int_0^\infty \mathbb{P}[T \ge t] \mathrm{d}t, which holds for any non-negative random variable T. Very similarly, since P(x)^2 = \mathbb{P}[Y \le x]\mathbb{P}[Z \le x] we have 1-P(x)^2 = \mathbb{P}[\mathrm{max}(Y,Z) > x] and

\int_0^\infty \bigl(1-P(x)^2\bigr) \mathrm{d}x = \mathbb{E}[\mathrm{max}(Y,Z)].

Both these formulae generalize to give formulae for the minimum and maximum of k independent identically distributed random variables. For instance, a corollary of these formulae for k = 2 is that

\begin{aligned} G &= \frac{1}{2\mu} \mathbb{E}[\mathrm{max}(X,Y) - \mathrm{min}(X,Y)]  \\ &= \frac{1}{2\mu} \int_0^\infty \bigl((1-P(x)^2) - (1-P(x))^2\bigr) \mathrm{d} x \\ &= \frac{1}{\mu} \int_0^\infty P(x)(1-P(x)) \mathrm{d}x \end{aligned}

and for k = 3, since (1-q^3)-(1-q)^3 = 3q(1-q),

\mathbb{E}[\mathrm{max}(X,Y,Z)-\mathrm{min}(X,Y,Z)] = 3\mu G.

It’s perhaps surprising that sampling three people rather than two (and then taking max minus min) does not give an essentially different inequality statistic. But simply taking the middle does give something new:

\begin{aligned} \mathbb{E}&[\mathrm{middle}(X,Y,Z)] \\ &= 3\mu - \mathbb{E}[\mathrm{max}(X,Y,Z)] - \mathbb{E}[\mathrm{min}(X,Y,Z)] \\ &= 3 \mu - \int_0^\infty (1-P(x)^3) \mathrm{d}x - \int_0^\infty (1-P(x))^3 \mathrm{d}x \\ &= 3 \mu - 3\int_0^\infty (1-P(x)) \mathrm{d}x + 3 \int_0^\infty (1-P(x))^2 \mathrm{d}x  \\ & \hskip2in - 2 \int_0^\infty (1-P(x))^3 \mathrm{d}x \\ &= 3 \mu - 3 \mathbb{E}[X] + 3 \mathbb{E}[\mathrm{min}(X,Y)] - 2 \mathbb{E}[\mathrm{min}(X,Y,Z)] \\ &= 3 \mathbb{E}[\mathrm{min}(X,Y)] - 2 \mathbb{E}[\mathrm{min}(X,Y,Z)]. \end{aligned}

A similar argument shows that

\mathbb{E}[\mathrm{middle}(X,Y,Z)] = 3\mathbb{E}[\mathrm{max}(X,Y)] - 2\mathbb{E}[\mathrm{max}(X,Y,Z)].

(Alternatively this can be proved from the previous remark.) Since the mean and median are not, in general, the same, it should not be a surprise that in a sample of three people, the middle wealth is not an unbiased estimator of the mean wealth. But still I find it slightly striking that it therefore serves as a measure of inequality.

Another interesting fact is that, generalizing the formulae for the cumulative distribution functions of the maximum and minimum above, the rank statistics for any number of independent samples have nice c.d.f.s. For example, since \mathrm{middle}(X,Y,Z) \le x if and only if either all three random variables are \le x or exactly two random variables are \le x, we have

P_\mathrm{middle}(x) = P(x)^3 + 3(1-P(x))P(x)^2

and in general, the j-th largest of k independent random variables has c.d.f.

P_j(x) = \sum_{i=0}^{j-1} \binom{k}{i} P(x)^{k-i}(1-P(x))^i.
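These rank-statistic formulae are easy to test against brute-force enumeration for a small discrete distribution. A Python sketch for the middle of three samples; the distribution on {0, 1, 2} below is arbitrary:

```python
from itertools import product

# An arbitrary small discrete distribution on {0, 1, 2}.
probs = {0: 0.5, 1: 0.3, 2: 0.2}

def cdf(x):
    return sum(p for v, p in probs.items() if v <= x)

def middle_cdf_formula(x):
    """P(x)^3 + 3(1 - P(x))P(x)^2, as in the display above."""
    q = cdf(x)
    return q ** 3 + 3 * (1 - q) * q ** 2

def middle_cdf_brute(x):
    """Enumerate all triples of independent samples directly."""
    return sum(pa * pb * pc
               for (a, pa), (b, pb), (c, pc) in product(probs.items(), repeat=3)
               if sorted([a, b, c])[1] <= x)

for x in [0, 1, 2]:
    assert abs(middle_cdf_formula(x) - middle_cdf_brute(x)) < 1e-12
```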

A motivated proof of the MacWilliams Identity

March 11, 2018

Fix n \in \mathbb{N} and let V = \mathbb{F}_2^n. Let \mathrm{wt}(u) denote the weight of u \in V, defined by \mathrm{wt}(u) = |\{i : u_i = 1\}|. Given a subset S of V, we define its weight enumerator W_S(x,y) by

W_S(x,y) = \sum_{u \in S} x^{\mathrm{wt}(u)} y^{n-\mathrm{wt}(u)}.

The MacWilliams Identity states that if C is a linear subspace of V, and C^\perp is its orthogonal complement with respect to the dot product u \cdot v = \sum_{i=1}^n u_i v_i then

\displaystyle W_{C^\perp}(x,y) = \frac{1}{|C|} W_C(-x+y,x+y).

The MacWilliams Identity is important in coding theory. It was first proved by the coding theorist Jessie MacWilliams, yet another alumna of Bell Labs. To my mind, the best motivation for its proof comes from certain related identities in binomial coefficients and the generating function techniques used to prove them, which, in turn, are best understood in terms of the characters of cyclic groups.

Binomial coefficients

The Binomial Theorem states that (x+y)^n = \sum_{k=0}^n \binom{n}{k} x^k y^{n-k}. It has a one-line combinatorial proof: expand (x+y)(x+y) \cdots (x+y), by choosing x from k of the n brackets and y from the other n-k brackets; there are \binom{n}{k} ways to do this, so \binom{n}{k} is the coefficient of x^k y^{n-k}.

Now consider the sum s = \binom{n}{0} + \binom{n}{2} + \binom{n}{4} + \cdots of binomial coefficients with even ‘denominator’. This can be evaluated using the Binomial Theorem, combining the two sums below to cancel the unwanted binomial coefficients with odd denominator:

\displaystyle s = \frac{1}{2} \sum_{k=0}^n \binom{n}{k} + \frac{1}{2} \sum_{k=0}^n (-1)^k \binom{n}{k}.

For n \in \mathbb{N} the sums are (1+1)^n = 2^n and (-1+1)^n = 0, respectively. Therefore s = 2^{n-1}. More generally, this trick shows that

\displaystyle \sum_{j \in \mathbb{N}_0} \binom{n}{2j} x^{2j} y^{n-2j} = \frac{1}{2} (x+y)^n + \frac{1}{2}(-x+y)^n

which already has some of the appearance of the MacWilliams Identity.
How about s' = \binom{n}{0} + \binom{n}{4} + \binom{n}{8} + \cdots ? Since powers of -1 worked for the two-step identity, it is reasonable to try powers of \mathrm{i}. This gives

\begin{aligned} s' = \frac{1}{4} \sum_{k=0}^n \binom{n}{k} + \frac{1}{4} &\sum_{k=0}^n \binom{n}{k} \mathrm{i}^k \\& + \frac{1}{4} \sum_{k=0}^n \binom{n}{k} (-1)^k + \frac{1}{4} \sum_{k=0}^n \binom{n}{k} (\mathrm{-i})^k.\end{aligned}

By the Binomial Theorem, the sums are 2^n, (1+\mathrm{i})^n, 0 and (1-\mathrm{i})^n. For example, if n = 4m where m \in \mathbb{N} then (1+\mathrm{i})^{4m} = (1-\mathrm{i})^{4m} = 2^{2m} (-1)^m, so we have s' = 2^{4m-2} + 2^{2m-1} (-1)^m. Similar formulae can be obtained for the other cases for n modulo 4.
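The closed form for n = 4m can be checked numerically; a short Python sketch:

```python
from math import comb

def s_prime(n):
    """Sum of binomial(n, k) over k divisible by 4."""
    return sum(comb(n, k) for k in range(0, n + 1, 4))

# Closed form derived above for n = 4m: s' = 2^(4m-2) + 2^(2m-1) (-1)^m.
for m in range(1, 10):
    n = 4 * m
    assert s_prime(n) == 2 ** (n - 2) + 2 ** (2 * m - 1) * (-1) ** m

print(s_prime(8))  # 1 + 70 + 1 = 72
```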

Character theory

Consider the cyclic group \mathbb{Z} / 4\mathbb{Z}. It has four irreducible complex characters, taking values (1,1,1,1), (1,\mathrm{i},-1,\mathrm{-i}), (1,-1,1,-1) and (1, \mathrm{-i},-1,\mathrm{i}). The previous displayed equation comes from expressing the indicator function (1,0,0,0) as a linear combination of irreducible characters

\begin{aligned} (1,0,0,0) = \frac{1}{4}  (1,1,1,1) + \frac{1}{4} &(1,i,-1,-i) \\ &+ \frac{1}{4} (1,-1,1,-1) + \frac{1}{4}(1,-i,-1,i).\end{aligned}

As an exercise, the reader might use the characters of \mathbb{Z} / 3\mathbb{Z} to decompose the indicator function (1,0,0) of 0 \in \mathbb{Z}/3\mathbb{Z} and so prove the related identity \binom{n}{0} + \binom{n}{3} + \binom{n}{6} + \cdots = \frac{2^{n}}{3} + \frac{2}{3} \cos \frac{n\pi}{3} for n \in \mathbb{N}.

MacWilliams Identity for the repetition code

Let C = \langle 1\ldots 1 \rangle. In this case the left-hand side of the MacWilliams Identity is the sum of x^{\mathrm{wt}(u)} y^{n - \mathrm{wt}(u)} over all u \in V of even weight. By analogy with the binomial sum, we write

W_{C^\perp}(x,y) = \frac{1}{2} \sum_{v \in V} x^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)} + \frac{1}{2} \sum_{v \in V} (-x)^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)}.

The first summand is the generating function enumerating all v \in V by their weight. Since there are \binom{n}{k} elements of V of weight k, it is \sum_{k =0}^n \binom{n}{k} x^k y^{n-k}, which is (x+y)^n by the Binomial Theorem. Replacing x with -x we see that the second summand is (-x+y)^n. Therefore

W_{C^\perp}(x,y) = \frac{1}{2} (x+y)^n + \frac{1}{2} (-x+y)^n.

Since W_C(x,y) = x^n + y^n, this agrees with the MacWilliams Identity. Note that the second sum could be written as

\displaystyle \sum_{v \in V} (-1)^{1\ldots 1 \cdot v} x^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)}.

MacWilliams Identity for a larger code

Take n=4 and C = \langle 1100, 0011 \rangle. Then

W_C(x,y) = x^4 + 2x^2y^2 + y^4.

We can describe C^\perp as the intersection of the two codimension 1 hyperplanes U, U', with ‘normals’ 1100 and 0011. Thus

\displaystyle W_{C^\perp}(x,y) = \sum_{v} [v \in U \cap U'] x^{\mathrm{wt}(v)}y^{n-\mathrm{wt}(v)}.

For each b \in V, we define \chi_b(v) = (-1)^{b \cdot v}; these are the irreducible complex characters of V. Using them to decompose the indicator function [v \in U \cap U'] : V \rightarrow \mathbb{C} we get

[v \in U \cap U'] = \frac{1}{4} \chi_{0}(v) + \frac{1}{4} \chi_{1100}(v) + \frac{1}{4} \chi_{0011}(v) + \frac{1}{4} \chi_{1111}(v).

It now suffices to find \sum_{v \in V} \chi_{b}(v) x^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)}. If \mathrm{wt}(b) = t then, by symmetry, we may assume that b = 1\ldots 10\ldots 0, where there are exactly t ones. Now (-1)^{b \cdot v} records the parity of the number of ones in the first t positions of v, so writing v \in V as the concatenated vector (u | w) where u \in \mathbb{F}_2^t and w \in \mathbb{F}_2^{n-t}, we have

\begin{aligned} \sum_{v \in V} &\chi_{b}(v) x^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)}\\ &= \sum_{u \in \mathbb{F}_2^t} (-1)^{\mathrm{wt}(u)} x^{\mathrm{wt}(u)} y^{t-\mathrm{wt}(u)} \sum_{w \in \mathbb{F}_2^{n-t}} x^{\mathrm{wt}(w)} y^{n-t-\mathrm{wt}(w)} \\ &= (-x+y)^t (x+y)^{n-t}. \end{aligned}


Therefore

\begin{aligned} W_{C^\perp}(x,y) &= \frac{1}{4} (x+y)^4 + \frac{2}{4} (-x+y)^2 (x+y)^2 + \frac{1}{4} (-x+y)^4 \\ &= \frac{1}{4} W_C(-x+y,x+y) \end{aligned}

as required. (In this case, since C is self-dual, we even have W_{C^\perp}(x,y) = W_C(x,y).)

MacWilliams Identity in general

The general proof needs no more ideas than those seen in the previous example. The indicator function decomposition is

\displaystyle [v \in C^\perp] = \frac{1}{|C|} \sum_{b \in C} \chi_b(v)

and so

\begin{aligned} \sum_{v \in C^\perp} x^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)} &= \sum_{v \in V} \frac{1}{|C|} \sum_{b \in C} \chi_b(v) x^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)} \\ &= \frac{1}{|C|} \sum_{b \in C} \sum_{v \in V} \chi_b(v) x^{\mathrm{wt}(v)} y^{n-\mathrm{wt}(v)} \\ &= \frac{1}{|C|} \sum_{t=0}^n \sum_{b \in C : \mathrm{wt}(b) = t} (-x+y)^t (x+y)^{n-t} \\ &= \frac{1}{|C|} W_C(-x+y,x+y) \end{aligned}

as required.
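For small codes the identity can be verified by brute force; a Python sketch, using the example code C = \langle 1100, 0011 \rangle from above and comparing both sides at a few integer points:

```python
from itertools import product

n = 4
code = {(0, 0, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)}  # <1100, 0011>

def dual(code, n):
    """All v in F_2^n orthogonal to every codeword."""
    return [v for v in product((0, 1), repeat=n)
            if all(sum(a * b for a, b in zip(u, v)) % 2 == 0 for u in code)]

def W(S, x, y, n):
    """Weight enumerator of S evaluated at (x, y)."""
    return sum(x ** sum(u) * y ** (n - sum(u)) for u in S)

Cperp = dual(code, n)
for x, y in [(1, 2), (3, 5), (2, 7)]:
    # MacWilliams: W_{C^perp}(x, y) = W_C(-x + y, x + y) / |C|
    assert W(Cperp, x, y, n) * len(code) == W(code, -x + y, x + y, n)
```

Since this C is self-dual, dual(code, n) returns the elements of C itself.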

Runups and cycles

March 3, 2018

Let xs be a stream of values that can be compared for equality. Let ++ denote concatenation of lists. Assuming that xs is of the form runUp ++ cyc ++ cyc ++ ..., what is a good way to find the decomposition (runUp, cyc)?

The problem is more subtle than it might seem. If xs is produced by iterating a function, so the entries are [x, f x, f f x, ...] then there are several cycle finding algorithms that can be used. In general, an algorithmic solution is impossible, because the stream may be of the form cyc ++ rest, where cyc has arbitrary length n+1 and agrees with rest for n positions. In practice, we might know that any cycle has some bounded length, or be prepared to accept a small chance of misidentification.

In this case, the following Haskell code finds the first decomposition (runUp, guessedCycle) such that the stream is runUp ++ rest and rest, which has length at least t, is of the form guessedCycle ++ guessedCycle ++ rest'.

import Data.List (tails)

runupAndCycle t xs = 
         runupAndCycleAux t xs (zip [1..] (tail xs))

runupAndCycleAux t xs [] = error $ show (t, xs)
runupAndCycleAux t xs qxs@((q, x) : _)
    = case lookup (map snd $ take t qxs) (take q tailsPositions) of
        Nothing -> runupAndCycleAux t xs (tail qxs)
        Just p  -> -- the t entries starting in positions q and p agree;
                   -- test if taking guessedCycle to be all entries
                   -- between positions q and p-1 gives a decomposition 
                   -- of the required form
                   let xsFirst  = take (q-p) $ drop p xs
                       xsSecond = take (q-p) $ map snd qxs
                   in  if and (zipWith (==) xsFirst xsSecond) 
                          then (take p xs, xsSecond) 
                          else runupAndCycleAux t xs (tail qxs)
    where tailsPositions 
              = [(take t ys, p) | (ys, p) <- zip (tails xs) [0..]]

For example

  • runupAndCycle 4 [1,2,3,1,2,3,4,5,1,2,3,4,5] evaluates to ([1,2,3],[1,2,3,4,5]);
  • runupAndCycle 4 [1,2,3,1,2,3,4,6,1,2,3,4,5] raises an exception, because while there is agreement for four positions, the if ... then ... else ... test shows this does not correspond to a cycle (and the stream is then exhausted without finding anything better);
  • runupAndCycle 4 [1,2,3,1,2,3,4,5,1,2,3,4,6] raises an exception for the same reason;
  • runupAndCycle 4 [1,2,3,1,2,3,4,5,1,2,3,4,5,6] evaluates to ([1,2,3],[1,2,3,4,5]), as in the first example, wrongly suggesting a 5-cycle. Replacing the first argument 4 with 6 we get an exception instead.

The algorithm may use O(tn^2) work to exhaust n entries of the stream without finding a suitable decomposition. For example, with t = 3, on the stream [0,0,0,1,0,0,0,2,0,0,0,3,0,0,0,4, ...] the algorithm finds possible cycle starts in every fourth position; by position 4m, these take about 4m steps of work to investigate.

I’d like to know if there is a faster or more elegant algorithm—ideally both of course—but some searching only found hits for the deterministic version of the problem.
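For comparison, here is a straightforward Python version of the same search on a finite list. It is a sketch written for clarity rather than speed, and it follows the Haskell semantics above, except that it demands two full copies of the guessed cycle within the list.

```python
def runup_and_cycle(t, xs):
    """Find the first (runup, cycle) such that xs begins
    runup + cycle + cycle + ..., where a candidate is detected by
    t agreeing entries and then checked in full, as in the Haskell code."""
    for q in range(1, len(xs)):
        for p in range(q):
            if xs[p:p + t] == xs[q:q + t]:
                cycle = xs[p:q]
                if xs[q:q + len(cycle)] == cycle:
                    return xs[:p], cycle
    raise ValueError("stream exhausted without finding a decomposition")

print(runup_and_cycle(4, [1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]))
# ([1, 2, 3], [1, 2, 3, 4, 5])
```

It behaves as in the four bullet points above: the second and third examples raise an exception, and the fourth wrongly suggests a 5-cycle.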

Lagrange Inversion

November 18, 2017

In this post I’ll give two short proofs of the Lagrange Inversion formula, both based on the ideas in these notes. The first uses nothing more than Cauchy’s Integral Formula, and Cauchy’s Theorem to justify a change of variable. Rather than use Rouché’s Theorem, or the Open Mapping Theorem, as in the linked notes, I’ve added the necessary assumption on bijectivity into the hypotheses. In the standard combinatorial applications this is often obvious. The inverse is assumed to be analytic; this is easily proved, by trivial changes to the usual real variable argument, once it is known to exist. The second is even more elementary, but a little harder to motivate: some version of it will probably be the one I use for my planned textbook (expected, at the current rate, some time in 2022).

Theorem. Let f : \mathbb{C} \rightarrow \mathbb{C} be an analytic function such that f(0) = 0 and f restricts to a bijective function on a domain containing 0. Then

\displaystyle [w^n]  f^{-1}(w) = \frac{1}{n} [z^{n-1}] \bigl( \frac{z}{f(z)}\bigr)^n.

Proof 1. We use Cauchy’s Integral Formula to find the coefficient of w^n in the Taylor Series of f^{-1}(w), change variables by s = f(z) (so \mathrm{d}s = f'(z)\mathrm{d}z), and then integrate by parts. The change of variables is justified by Cauchy’s Theorem since the approximation f(z) \approx z for small z shows that the contour t \mapsto f(r\mathrm{e}^{it}) for t \in [0,2\pi] winds once around the origin provided r is sufficiently small. Thus

\displaystyle \begin{aligned} \ [w^n] f^{-1}(w) &= \frac{1}{2\pi \mathrm{i}} \oint \frac{f^{-1}(s)}{s^{n+1}} \mathrm{d} s \\ &= \frac{1}{2\pi \mathrm{i}} \oint \frac{z}{f(z)^{n+1}} f'(z) \mathrm{d} z \\ &= \frac{1}{2\pi \mathrm{i}} \oint - \frac{z}{n} \Bigl( \frac{\mathrm{d}}{\mathrm{d}z} \frac{1}{f(z)^{n}} \Bigr) \mathrm{d}z \\ &= \frac{1}{n} \frac{1}{2\pi \mathrm{i}} \oint  \frac{1}{f(z)^n} \mathrm{d} z \\ &= \frac{1}{n} [z^{-1}] \frac{1}{f(z)^n} \\ &= \frac{1}{n} [z^{n-1}] \bigl( \frac{z}{f(z)}\bigr)^n  \end{aligned}

as required. \Box

Since [z^n]f(z) = \frac{1}{n!}\frac{\mathrm{d}^n}{\mathrm{d}z^n} f(z), evaluated at 0, an equivalent statement of the result is

\displaystyle f^{-1}(w) = \sum_{n=0}^\infty  \frac{w^n}{n!} \frac{\mathrm{d}^{n-1}}{\mathrm{d}z^{n-1}} \bigl( \frac{z}{f(z)} \bigr)^n \bigr|_{z=0}.

Proof 2. Let c_n = [w^n] f^{-1}(w). Since z = f^{-1}\bigl( f(z) \bigr), the Taylor series for f^{-1} gives

\displaystyle z = \sum_{m=1}^\infty c_m f(z)^m.

Differentiate and then divide through by f(z)^n to get

\displaystyle \frac{1}{f(z)^n} = n c_n \frac{f'(z)}{f(z)} + \sum_{m \not= n} m c_m f(z)^{m-1-n} f'(z).

The formal residue of f'(z)/f(z) is 1, since f has a simple zero at 0 (this is the logarithmic derivative counting zeros and poles). The summand for m where m\not= n is, up to scalars, the derivative of f(z)^{m-n}, so has zero residue. Therefore, taking residues, we obtain [z^{-1}] \frac{1}{f(z)^n} = n c_n, which is equivalent to c_n = \frac{1}{n} [z^{n-1}] (z/f(z))^n, as required. \Box
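To illustrate the theorem, take f(z) = z - z^2, so z/f(z) = 1/(1-z) and f^{-1}(w) = (1 - \sqrt{1-4w})/2, whose coefficient of w^n is the Catalan number C_{n-1} = \frac{1}{n}\binom{2n-2}{n-1}. A Python sketch that checks [z^{n-1}] (z/f(z))^n = n c_n by truncated power series multiplication:

```python
from math import comb

def coeff_of_power(g, n):
    """[z^(n-1)] of g(z)^n, for a series g given by coefficients g[0], g[1], ..."""
    p = [1] + [0] * (n - 1)  # work with truncated series modulo z^n
    for _ in range(n):
        p = [sum(p[i] * g[j - i] for i in range(j + 1)) for j in range(n)]
    return p[n - 1]

# f(z) = z - z^2, so z/f(z) = 1/(1 - z) has coefficients 1, 1, 1, ...
for n in range(1, 10):
    g = [1] * n
    catalan = comb(2 * n - 2, n - 1) // n  # [w^n] of (1 - sqrt(1 - 4w))/2
    assert coeff_of_power(g, n) == n * catalan
```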

Not a proof. It seems impossible to replace the use of residues in Proof 2 with the more elementary derivative. From

\displaystyle \frac{z}{f(z)^n} = c_n + \sum_{m\not= n} c_m f(z)^{m-n}

we can multiply through by z^{n-1} to get

\displaystyle \Bigl(\frac{z}{f(z)}\Bigr)^n = c_n z^{n-1} + \sum_{m=1}^{n-1} c_m \bigl(\frac{z}{f(z)} \bigr)^n \frac{f(z)^m}{z} + \sum_{r=1}^\infty c_{n+r} z^{n-1} f(z)^r.

It is now tempting to differentiate n-1 times to extract c_n: the third summand is no obstacle, since it is \mathrm{O}(z^n). The second summand is now holomorphic (this is clear from the rewriting above, since the singularity at 0 is removable). But of course it contributes erroneous terms to the derivative. Indeed, this procedure gives (n-1)!c_n, whereas the theorem predicts that n! c_n = \frac{\mathrm{d}^{n-1}}{\mathrm{d}z^{n-1}} \bigl( \frac{z}{f(z)} \bigr)^n \bigr|_{z=0}, so something is surely wrong.

Shannon’s 1949 paper

September 20, 2017

In 1949 Claude Shannon published Communication theory of secrecy systems, Bell System Technical Journal 28 656–715. The members of Bell Labs around this time included Richard Hamming, the three inventors of the transistor and Harry Nyquist, of the Nyquist bound. Its technical journal published many seminal papers, including Shannon’s 1948 paper A mathematical theory of communication defining entropy and Hamming’s 1950 paper Error detecting and error correcting codes, defining Hamming distance and essentially inventing the modern ‘adversarial’ setting for coding theory.

Incidentally, the story goes that Shannon asked Von Neumann what he should call his new measure of information content and Von Neumann replied

You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.

Like Hamming’s paper, Shannon’s two papers still amply repay reading today. One idea introduced in the 1949 paper is ‘perfect secrecy’.

Perfect secrecy

Consider a cryptosystem with plaintexts \mathcal{P} and ciphertexts \mathcal{C}, with encryption functions e_k : \mathcal{P} \rightarrow \mathcal{C} parametrised by a set \mathcal{K} of keys.

Shannon supposes that there is a probability distribution on the plaintexts, assigning an a priori probability p_x to each x \in \mathcal{P}. There is also an independent probability distribution on the keys. These interact in a rich way. For instance, \mathbb{P}[Y=y|X=x] = \mathbb{P}[e_K(x) = y] is a sum of key probabilities, while \mathbb{P}[Y=y|K=k] = \mathbb{P}[e_k(X) = y] is either zero, or the probability attached to the unique plaintext p_{e_k^{-1}(y)}. Other probabilities, such as \mathbb{P}[X=x|Y=y] can be computed from these using Bayes’ Theorem type arguments.

Suppose we observe a ciphertext y \in \mathcal{C}: what, if anything do we learn about the corresponding plaintext x \in \mathcal{P}? Shannon defines the a posteriori probability of the plaintext x \in \mathcal{P} given we observed the ciphertext y \in \mathcal{C} to be the conditional probability \mathbb{P}[X = x | Y = y]; the system then has perfect secrecy if, for any a priori probability distribution p_x, we have \mathbb{P}[X = x | Y = y] = p_x for all x \in \mathcal{P}, and all y \in \mathcal{C}. (This assumes implicitly that \mathbb{P}[Y = y] > 0 for every y \in \mathcal{C}.)

Shannon proves in his Theorem 6 that a necessary and sufficient condition for perfect secrecy is that \mathbb{P}[Y = y|X = x] = \mathbb{P}[Y = y] for all x \in \mathcal{P} and y \in \mathcal{C}.

The proof is a short application of Bayes’ Theorem: since \mathbb{P}[X = x | Y = y] = \mathbb{P}[Y = y|X = x]p_x/\mathbb{P}[Y = y], and since we may choose p_x \not=0 (which is necessary anyway for the conditional probability to be well-defined), we have p_x = \mathbb{P}[X = x | Y = y] if and only if \mathbb{P}[Y = y|X = x] = \mathbb{P}[Y = y].

Corollary. In a system with perfect secrecy, |\mathcal{K}| \ge |\mathcal{C}|. Moreover, if equality holds then every key must be used with equal probability 1/|\mathcal{K}| and for each x \in \mathcal{P} and y \in \mathcal{C} there exists a unique k \in \mathcal{K} such that e_k(x) = y.

Proof. Fix a plaintext x \in \mathcal{P}. We claim that for each y \in \mathcal{C} there exists a key k such that e_k(x) = y. Indeed, if y_{\mathrm{bad}} \in \mathcal{C} is never an encryption of x then, for any choice of a priori probabilities that gives some probability to x, we have \mathbb{P}[Y = y_{\mathrm{bad}}|X = x] = 0, so by Shannon’s Theorem 6, \mathbb{P}[Y = y_\mathrm{bad}] = 0. But, by the implicit assumption, there is a non-zero chance of observing y_\mathrm{bad}, a contradiction.

Since x has at most |\mathcal{K}| different encryptions, the claim implies that |\mathcal{K}| \ge |\mathcal{C}|. Moreover if equality holds then for every y \in \mathcal{C} there exists a unique k \in \mathcal{K} such that e_k(x) = y. Since \mathbb{P}[Y=y | X =x ] = \mathbb{P}[K = k], where k is this unique key, the conclusion of Theorem 6, that \mathbb{P}[Y = y|X = x] is constant for y \in \mathcal{C}, then implies that each key is equiprobable. \Box
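Shannon's criterion is easy to verify exactly for a toy system; a Python sketch for the shift cipher on \mathbb{Z}/5\mathbb{Z} with a uniformly random key, where |\mathcal{K}| = |\mathcal{C}| and the corollary's bound is met with equality (the a priori distribution below is arbitrary):

```python
from fractions import Fraction

# Shift cipher on Z/5Z: e_k(x) = x + k mod 5, with a uniformly random key.
m = 5
key_prob = {k: Fraction(1, m) for k in range(m)}
p = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8),
     3: Fraction(1, 16), 4: Fraction(1, 16)}  # an arbitrary a priori distribution

def joint(x, y):
    """P[X = x, Y = y]: sum the probabilities of keys k with x + k = y mod m."""
    return p[x] * sum(kp for k, kp in key_prob.items() if (x + k) % m == y)

for y in range(m):
    p_y = sum(joint(x, y) for x in range(m))
    for x in range(m):
        assert joint(x, y) / p_y == p[x]  # a posteriori equals a priori
```

Using Fraction makes the check exact, so the assertions confirm perfect secrecy with no floating-point caveats.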

The ‘only if’ direction of Theorem 2.4 in Cryptography: theory and practice (3rd edition) by Douglas Stinson, is the corollary above, but, according to my reading of pages 48 and 50, interpreted with a different definition of perfect secrecy, in which the a priori distribution is fixed, as part of the cryptosystem. (This makes some sense, for example, in a cryptosystem encrypting English plaintexts nonsensical strings should always have zero probability.) Unfortunately, Stinson’s change makes the result false. The diagram below shows a toy cryptosystem with two keys and two plaintexts.

Take p_w = 0 and p_x = 1. Then

\mathbb{P}[X = x | Y = y] = \mathbb{P}[X = x | Y = y'] = p_x = 1


\mathbb{P}[X = w | Y = y] = \mathbb{P}[X = w | Y = y'] = p_w = 0,

so the system has perfect secrecy (according to my interpretation of Stinson’s definition), no matter what probabilities we assign to the keys. In particular, we can assign different probabilities to the keys, or give one key zero probability, in which case only one ciphertext is ever observed.
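This counterexample can be checked by brute force. Below is a minimal Python sketch (the encryption maps are my assumed reconstruction of the missing diagram: the two keys are the two bijections between \{w, x\} and \{y, y'\}); it confirms that with p_w = 0 and p_x = 1 the posterior \mathbb{P}[X = x | Y = y] equals the prior, whatever probabilities the keys are given.

```python
from fractions import Fraction

# Toy cryptosystem: two keys, two plaintexts {w, x}, two ciphertexts {y, y'}.
# The maps below are an assumed reconstruction of the missing diagram:
# k1 sends w -> y', x -> y; k2 sends w -> y, x -> y'.
enc = {('k1', 'w'): "y'", ('k1', 'x'): 'y',
       ('k2', 'w'): 'y',  ('k2', 'x'): "y'"}

def posterior(p_plain, p_key):
    """Return P[X = x | Y = y] for every plaintext x and observable y."""
    p_cipher, joint = {}, {}
    for (k, x), y in enc.items():
        pr = p_key[k] * p_plain[x]
        p_cipher[y] = p_cipher.get(y, Fraction(0)) + pr
        joint[(x, y)] = joint.get((x, y), Fraction(0)) + pr
    return {(x, y): joint.get((x, y), Fraction(0)) / p_cipher[y]
            for x in p_plain for y in p_cipher if p_cipher[y] != 0}

p_plain = {'w': Fraction(0), 'x': Fraction(1)}
# Wildly unequal key probabilities...
p_key = {'k1': Fraction(9, 10), 'k2': Fraction(1, 10)}
post = posterior(p_plain, p_key)
# ...yet P[X = x | Y = y] = p_x = 1 for every observable ciphertext:
# "perfect secrecy" under the fixed-distribution definition.
assert all(post[('x', y)] == 1 for y in ('y', "y'"))
```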

The error in the proof of Theorem 2.4 comes in the application of Bayes’ Law, where it is implicitly assumed that p_x \not= 0. This shows that the extra layer of quantification in Shannon’s paper is not a mere technicality. Given the difficulties students have with nested quantifiers, I’m inclined to keep Stinson’s definition, and fix the problem by assuming p_x\not =0 for each x \in \mathcal{P}. One can get a more general, but somewhat technical result, by taking an arbitrary cryptosystem and applying Shannon’s Theorem to the subsystem where \mathcal{P} is the set of plaintexts with positive probability, and \mathcal{C} and \mathcal{K} are unchanged. (To be fair to Stinson, he observes before the theorem that plaintexts x such that p_x = 0 are never an obstacle to perfect secrecy, so clearly he was aware of the issue. He also assumes, as we have done, that \mathbb{P}[Y = y] \not=0 for each y \in \mathcal{C}, but this is something else.)

Incidentally, there is a subtle English trap in Shannon’s paper: he says, quite correctly, that when |\mathcal{K}| = |\mathcal{C}|, ‘it is possible to obtain perfect secrecy with only this number of keys’. Here ‘with only’ does not mean the same as ‘only with’.

Example from permutation groups

Given a finite group G acting transitively on a set \Omega, we obtain a cryptosystem with plaintexts and ciphertexts \Omega and encryption maps \alpha \mapsto \alpha g indexed by the elements g \in G. For which probability distributions on G does this cryptosystem have perfect secrecy?

Let H_\beta denote the point stabiliser of \beta \in \Omega. Since

\mathbb{P}[Y = \beta | X = \alpha] = \mathbb{P}[K \in g_{\alpha \beta} H_\beta],

where g_{\alpha \beta} is any element such that \alpha g_{\alpha\beta}= \beta, the cryptosystem has perfect secrecy if and only if, for each \beta \in \Omega,

\mathbb{P}[K \in g_{\alpha\beta} H_\beta]

is constant as \alpha varies over \Omega. Call this condition (\star).

Lemma 2. Suppose that G has a regular normal subgroup N. Any probability distribution such that \mathbb{P}[K = k] depends only on the coset of N containing k satisfies (\star).

Proof. Since N acts regularly, we may suppose that \{g_{\alpha\beta} : \alpha \in \Omega\} = N. Since G = N \rtimes H_\beta, the subgroup H_\beta meets each coset of N in a unique element, and the same holds for each coset t H_\beta with t \in N. Therefore \sum_{k \in t H_\beta} \mathbb{P}[K = k] picks up exactly one probability from each of the |H_\beta| cosets of N. Using H_\beta as a set of coset representatives we get \sum_{k \in t H_\beta} \mathbb{P}[K = k] = \sum_{k \in H_\beta} \mathbb{P}[K = k]; clearly this is independent of t. \Box
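Lemma 2 is easy to test computationally. The sketch below (my own illustration, not from the post) takes the affine group on \mathbb{F}_5, puts a probability distribution on it that is constant on each coset of the translation subgroup N, and verifies condition (\star) by brute force.

```python
from fractions import Fraction

# Affine group on F_p (here p = 5): elements (a, b) act by x |-> a*x + b mod p.
p = 5
G = [(a, b) for a in range(1, p) for b in range(p)]

def act(x, g):
    a, b = g
    return (a * x + b) % p

# A distribution constant on each coset of the translation subgroup N,
# i.e. depending only on the multiplier a.
weight = {a: Fraction(a, sum(range(1, p))) for a in range(1, p)}
prob = {(a, b): weight[a] / p for (a, b) in G}
assert sum(prob.values()) == 1

# Condition (*): for each beta, P[K in {g : alpha.g = beta}] is constant
# as alpha varies over F_p.
for beta in range(p):
    vals = {sum(prob[g] for g in G if act(alpha, g) == beta)
            for alpha in range(p)}
    assert len(vals) == 1
```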

In the case of the affine cipher, the converse also holds.

Theorem 3. Let p be a prime and let G be the affine group on \mathbb{F}_p with normal translation subgroup N. A probability distribution on G satisfies (\star) if and only if it is constant on each of the cosets of N.

Proof. Let r generate the multiplicative group of integers modulo p. Then G is a semidirect product N \rtimes H where N = \langle t \rangle is the regular subgroup generated by translation by +1 and the point stabiliser H = \langle h \rangle is generated by multiplication by r.

A probability distribution on G can be regarded as an element of the group algebra \mathbb{C}G. The probability distributions satisfying (\star) correspond to those \sum_{k \in G} c_k k \in \mathbb{C}G such that, for each \beta \in \mathbb{F}_p,

\sum_{k \in t^{-\alpha + \beta} H^{t^\beta}} c_k

is constant as \alpha varies over \mathbb{F}_p. Thus if U is the corresponding subspace of \mathbb{C}G, then U is invariant under left-multiplication by N and right-conjugation by N. But since ut = tu^t for u \in G and t \in N, right-multiplication by t is left-multiplication by t composed with conjugation by t; taken together, these conditions are equivalent to invariance under left- and right-multiplication by N. It is therefore natural to ask for the decomposition of \mathbb{C}G as a representation of N \times N, where N \times N acts by

k \cdot (g,g') = g^{-1}k g'.

Since N is normal, each coset Nh^\beta is invariant under the action of N \times N: for g, g' \in N we have g^{-1}Nh^\beta g' = Nh^\beta g' = N(h^\beta g' h^{-\beta})h^\beta = Nh^\beta. Hence \mathbb{C}G = \bigoplus_{\beta = 0}^{p-2} \mathbb{C}Nh^\beta as a representation of N \times N. The calculation behind this observation is

h^\beta t = h^\beta t h^{-\beta} h^\beta = t^{s^\beta} h^\beta

where s = r^{-1} in \mathbb{F}_p. This shows that h^\beta \in Nh^\beta is stabilised by (t^{-s^\beta}, t) \in N \times N and so Nh^\beta, regarded as a representation of N \times N, is induced from the subgroup of N \times N generated by (t^{-s^\beta}, t). By Frobenius Reciprocity, Nh^\beta decomposes as a direct sum of a trivial representation of N \times N, spanned by

h^\beta + th^\beta + \cdots + t^{p-1}h^\beta,

and a further p-1 non-trivial representations, each with kernel \langle (t^{-s^\beta}, t) \rangle. Critically all (p-1)^2 of these representations are non-isomorphic.

By Lemma 2, U contains all p-1 trivial representations. Writing T for their direct sum, we have \mathbb{C}G = T \oplus C for a unique complement C that decomposes uniquely as a sum of the non-trivial representations in the previous paragraph. Therefore if U properly contains T then U contains a non-trivial summand of some Nh^\beta, spanned by some u \in \mathbb{C}G of the form

b_0 h^\beta + b_1 t h^\beta + \cdots + b_{p-1} t^{p-1} h^\beta .

The support of u is h^\beta N, which meets each point stabiliser in a unique element. Hence all the b_i must be equal, a contradiction. \Box

This should generalize to an arbitrary Frobenius group with abelian kernel, and probably to other permutation groups having a regular normal subgroup.

The key equivocation given a ciphertext

In practice, we might care more about what an observed ciphertext tells us about the key. This question is considered in Part B of Shannon’s paper. Let X, Y and K be random variables representing the plaintext, ciphertext and key, respectively. We assume that X and K are independent. (This is easily motivated: for instance, it holds if the key is chosen before the plaintext, and the plaintext is not a message about the key.) The conditional entropy H(K | Y), defined by

\begin{aligned}H(K{}&{} | Y) \\ &= \sum_{y \in \mathcal{C}} \mathbb{P}[Y=y] H(K | Y = y) \\ &= \sum_{y \in \mathcal{C}}  \mathbb{P}[Y=y] \sum_{k \in \mathcal{K}} \mathbb{P}[K = k | Y = y] \log \frac{1}{\mathbb{P}[K = k | Y = y]} \end{aligned}

represents our expected uncertainty, or ‘equivocation’ to use Shannon’s term, in the key, after we have observed a ciphertext. (Throughout \log denotes logarithm in base 2.)

It is intuitive that H(K | Y) \ge H(X | Y): since a given plaintext x can be encrypted to a given ciphertext y by several different keys, given a ciphertext, we are at least as uncertain about the key as we are about the plaintext. The proof is yet another exercise in Bayes’ Theorem, which I’ll include because it sheds some light on the probability model.

Lemma 4. H(K | Y) \ge H(X | Y).

Proof. Fix y \in \mathcal{C}. Let \mathcal{P}_y be the set of plaintexts that encrypt, under some key, to y. (Thus \mathcal{P}_y is the union of the inverse images of the singleton set \{y\} under all encryption maps.) For x \in \mathcal{P}, let \mathcal{K}_{xy} be the set (possibly empty) of keys k such that e_k(x) = y. By Bayes’ Theorem,

\mathbb{P}[X=x|Y=y] = \displaystyle \frac{\mathbb{P}[K \in \mathcal{K}_{xy}] p_x}{\mathbb{P}[Y=y]}.

Similarly if y = e_k(x) for some (necessarily unique) x \in \mathcal{P} then

\mathbb{P}[K=k|Y=y] = \displaystyle \frac{\mathbb{P}[K=k] p_x}{\mathbb{P}[Y=y]}

and otherwise this probability is 0. Thus

\mathbb{P}[X=x|Y=y] = \sum_{k \in \mathcal{K}_{xy}} \mathbb{P}[K=k|Y=y]

and so the probability distribution on X (conditioned on Y=y) is a coarsening of the probability distribution on K (conditioned on Y=y). Therefore H(K | Y=y) \ge H(X | Y = y) and the lemma now follows by summing over y, using the definition of conditional entropy above. \Box

A related formula is now a textbook staple. If we know the key K then we know X if and only if we know Y. Therefore the joint entropies H(K, X) and H(K, Y) agree and

H(K | Y) + H(Y) = H(K,Y) = H(K,X) = H(K) + H(X)

by the independence assumption. Hence

H(K | Y) = H(K) + H(X) - H(Y).

Shannon considers a toy cryptosystem with keys \mathcal{K} = \{k,k'\} and
\mathcal{P} = \mathcal{C} = \{0,1\}^n, in which a plaintext is encrypted bit-by-bit, as itself if K = k and flipping each bit if K = k'. Suppose that the keys are equiprobable, and that each plaintext bit is 0 independently with probability p, so the probability of a plaintext having exactly s zeros is \binom{n}{s} p^s (1-p)^{n-s}. Denoting this quantity by p_s we have H(X) = -\sum_{s=0}^n p_s \log p_s. (Strictly, this is the entropy of the number of zeros, not of X itself; but the binomial-coefficient corrections to H(X) and to H(Y) below agree, so the difference H(X) - H(Y) is unaffected.) The probability that the ciphertext Y has exactly s zeros is (p_s + p_{n-s})/2. Therefore

\begin{aligned} H(Y) &= -\sum_{s=0}^n \frac{p_s + p_{n-s}}{2} \log \frac{p_s + p_{n-s}}{2} \\ &= -\sum_{s=0}^n p_s \log \frac{p_s + p_{n-s}}{2}. \end{aligned}

and so

\begin{aligned} H(K{}&{} |Y) \\ &= 1 - \sum_{s=0}^n p_s \log \frac{p_s}{(p_s+p_{n-s})/2} \\ &= -\sum_{s=0}^n p_s \log \frac{p_s}{p_s + p_{n-s}} \\ &= -\sum_{s=0}^n \binom{n}{s} p^s (1-p)^{n-s} \log \frac{p^s (1-p)^{n-s}}{p^s (1-p)^{n-s} + p^{n-s}(1-p)^s} \\ &= \sum_{s=0}^n \binom{n}{s} p^s (1-p)^{n-s} \log \bigl(1 + p^{n-2s}(1-p)^{2s-n} \bigr) \\ &= \sum_{s=0}^n \binom{n}{s} p^s (1-p)^{n-s} \log \bigl(1 + q^{2s-n} \bigr). \end{aligned}

where q = (1-p)/p.
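As a sanity check on the algebra, the following Python sketch (my own, with arbitrary test values of n and p) compares the closed formula with H(K|Y) = H(K) + H(X) - H(Y) computed directly from the count distributions.

```python
from math import comb, log2

def entropy(ps):
    return -sum(p * log2(p) for p in ps if p > 0)

def key_equivocation_direct(n, p):
    # H(K|Y) = H(K) + H(X) - H(Y) with H(K) = 1 for two equiprobable keys.
    ps = [comb(n, s) * p**s * (1 - p)**(n - s) for s in range(n + 1)]
    HX = entropy(ps)
    HY = entropy([(ps[s] + ps[n - s]) / 2 for s in range(n + 1)])
    return 1 + HX - HY

def key_equivocation_formula(n, p):
    # Shannon's closed formula with q = (1-p)/p.
    q = (1 - p) / p
    return sum(comb(n, s) * p**s * (1 - p)**(n - s) * log2(1 + q**(2*s - n))
               for s in range(n + 1))

for n in (4, 10, 21):
    for p in (0.55, 2/3, 0.8):
        assert abs(key_equivocation_direct(n, p)
                   - key_equivocation_formula(n, p)) < 1e-9
```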

The graph on page 690 of Shannon’s paper shows H(K | Y) against n for two values of p. He does not attempt to analyse the formula any further, remarking ‘yet already the formulas are so involved as to be nearly useless’. I’m not convinced this is the case, but the analysis is certainly fiddly.

We may reduce to the case when p \ge 1/2 and so q \le 1. (At the extremes, Shannon’s formula gives H(K | Y) = 1 when p = 1/2 and H(K | Y) = 0 when p = 1, as expected.) Take n = 2m. The summand for m+u is

\binom{2m}{m+u} p^{m+u}(1-p)^{m-u} \log (1 + q^{2u}).

Since \binom{2m}{m+u}(m-u) = \binom{2m}{m+u+1}(m+u+1), the ratio of the summands for m+u and m+u+1 is

\displaystyle \frac{m+u+1}{m-u} \frac{1-p}{p} \frac{\log (1 + q^{2u})}{\log (1 + q^{2u+2})}.

When u \ge 0 we can use the inequalities

x \ge \log (1+x) \ge x - x^2/2

for x \in [0,1] (valid for the natural logarithm; the base cancels in the ratio of logarithms) to bound the ratio below by

\displaystyle \frac{m+u+1}{m-u} \Bigl( \frac{1}{q} -  q^{2u-1}/2 \Bigr).

Therefore the summands in Shannon’s formula get exponentially small, at a rate approximately 1/q > 1 as s increases from n/2. Their contribution can be bounded by summing a geometric series, and has the same order as the middle term.

Going the other way, the ratio of the summands for m-t and m-t-1 is

\displaystyle \frac{m+t+1}{m-t} \frac{p}{1-p} \frac{\log (1 + q^{-2t})}{\log (1 + q^{-2t-2})}

which can be rewritten as

\displaystyle \frac{m+t+1}{m-t} \frac{1}{q} \frac{2t \log \frac{1}{q} + \log (1+q^{2t})}{(2t+2)\log \frac{1}{q} + \log (1+q^{2t+2})}.

It is useful to set p = 1/2 + \rho. If the final fraction were t/(t+1) then the maximum would occur when t/(t+1) = q, i.e. when t = q/(1-q) = (1-p)/(2p-1) = (1/2 -\rho)/2\rho. Numerical tests suggest this gives about the right location of the maximum when m is large (and so the first fraction can be neglected) and p is near to 1/2.
For t of the same order as m the first fraction is dominant and gives exponential decay. It is therefore not so surprising that the middle term gives a reasonable lower bound for the entire series, namely

\displaystyle \begin{aligned} H(K | Y) &\ge \binom{2m}{m} p^m (1-p)^m \log (1+q^0) \\ &\ge \frac{2^{2m}}{\sqrt{4m}} (\frac{1}{2} + \rho)^m (\frac{1}{2}-\rho)^m \\ &=   \frac{1}{\sqrt{2n}} (1-4\rho^2)^m \\ &\approx \frac{1}{\sqrt{2n}} \exp(-2 \rho^2 n). \end{aligned}

The graph below shows H(K|Y) and the middle summand for p = 1/2 + 1/20, 3/5, 2/3, 4/5 with colours red, green, blue, black.
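The comparison can be reproduced numerically. The sketch below is my own check, with arbitrary sample parameters and a generous constant, that the middle summand s = m is a genuine lower bound capturing the order of magnitude of the whole sum.

```python
from math import comb, log2

# Summand of Shannon's formula for H(K|Y), with q = (1-p)/p.
def summand(n, p, s):
    q = (1 - p) / p
    return comb(n, s) * p**s * (1 - p)**(n - s) * log2(1 + q**(2*s - n))

for m in (5, 20, 80):
    for p in (0.55, 2/3, 0.8):
        n = 2 * m
        total = sum(summand(n, p, s) for s in range(n + 1))
        middle = summand(n, p, m)  # = C(2m,m) p^m (1-p)^m since log2(2) = 1
        # Lower bound is exact; the upper constant 30*sqrt(m) is empirical.
        assert middle <= total <= 30 * middle * m**0.5
```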

Upper bound

A slightly weaker upper bound follows from tail estimates for binomial probabilities. (Update. There might be a stronger result using the Central Limit Theorem.) Let S be distributed as \mathrm{Bin}(p,2m), so \mathbb{P}[S= s] = \binom{2m}{s}p^s (1-p)^{2m-s}. By Hoeffding’s inequality,

\mathbb{P}[S - 2pm \le -2\epsilon m] \le \mathrm{e}^{-4 \epsilon^2 m}.

The argument by exponential decay shows that the contribution to H(K | Y) from the summands for s > m is of the same order as the middle term. Using \log(1+ q^{2s-n}) \le 1 + (n-2s) \log (1/q) we get

\mathbb{P}[S \le n/2] + 2\log (1/q) \sum_{s=0}^m (m-s) \mathbb{P}[S = s]

as an upper bound for the remaining terms. By a standard trick, related to the formula \mathbb{E}[X] = \mathbb{P}[X \ge 1] + \mathbb{P}[X \ge 2] + \cdots for the expectation of a random variable taking values in \mathbb{N}_0, we have

\sum_{s=0}^m (m-s)\mathbb{P}[S=s] = \sum_{s=0}^{m-1} \mathbb{P}[S\le s].

Take \epsilon = p - 1/2 + \alpha in the version of Hoeffding’s inequality to get

\begin{aligned} \mathbb{P}[S \le m(1-2\alpha)] &\le \mathrm{e}^{-4 (p-1/2+\alpha)^2 m} \\ &= \mathrm{e}^{-4 (p-1/2)^2 m} \mathrm{e}^{-8 (p-1/2) \alpha m - 4 \alpha^2 m} \\ &\le \mathrm{e}^{-4 \rho^2 m} \mathrm{e}^{-8 \rho \alpha m}. \end{aligned}

Thus the upper bounds for \mathbb{P}[S \le s] become exponentially smaller as we decrease s from m by as much as \alpha m, for any \alpha > 0. By summing a geometric series, as before, we get

\sum_{s=0}^{m-1} \mathbb{P}[S\le s] \le Bm \mathrm{e}^{-4 \rho^2 m}

for some constant B. The neglected contributions are of the order of the middle term, so bounded above by \mathrm{e}^{-4 \rho^2 m}. Therefore

H(K | Y) \le Cn \mathrm{e}^{-2 \rho^2 n}

for some further constant C. We conclude that

\log H(K|Y) \sim -2 \rho^2 n

as n \rightarrow \infty.

Random ciphers and unicity distance

A selection effect

Imagine a hypothetical world where families have either no children, an only child, or two twins, with probabilities 1/4, 1/2, 1/4. The mean number of children per family is therefore 1. In an attempt to confirm this empirically, a researcher goes to a primary school and asks each child to state his or her number of siblings. Twin-families are half as frequent as only-families, but send two representatives to the school rather than one: these effects cancel, so the researcher observes equal numbers of children reporting 0 and 1 siblings. (Families with no children are, of course, never sampled.) The estimate for the mean number of children is therefore the inflated 3/2.
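The arithmetic can be simulated directly; here is a small Monte Carlo sketch (sample sizes and tolerances are my own choices).

```python
import random

# Families have 0, 1 or 2 children with probabilities 1/4, 1/2, 1/4,
# so the mean number of children per family is 1.
random.seed(0)
families = random.choices([0, 1, 2], weights=[1, 2, 1], k=100_000)
true_mean = sum(families) / len(families)

# The researcher samples children, not families: a 2-child family sends
# two representatives to school, and childless families are never seen.
sampled_sizes = [size for size in families for _ in range(size)]
observed_mean = sum(sampled_sizes) / len(sampled_sizes)

assert abs(true_mean - 1.0) < 0.01      # the honest mean
assert abs(observed_mean - 1.5) < 0.01  # the inflated estimate 3/2
```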

The random cipher

Shannon’s proof of the Noisy Coding Theorem for the memoryless binary channel is a mathematical gem; his chief insight was that a random binary code of suitable size can (with high probability) be used as part of a coding scheme that achieves the channel capacity. In his 1949 paper he considers the analogous random cipher.

Let \mathcal{A} = \{\mathrm{a}, \ldots, \mathrm{z}\} be the Roman alphabet. Let \mathcal{P}_n = \mathcal{C}_n = \mathcal{A}^n and let X_n and Y_n be the random variables recording the plaintext and ciphertext, respectively. We suppose that the message is chosen uniformly at random from those plaintexts that make good sense in English (once spaces are inserted). In yet another fascinating paper Shannon estimated that the per-character redundancy of English, R say, is between 3.2 and 3.7 bits. Thus H(X_n) = (\log_2 26 - R)n and the number of plausible plaintexts is S_n = 2^{(\log_2 26 - R)n}. Let T_n = |\mathcal{P}_n| = 26^n.

Fix t \in \mathbb{N}. We aim to construct a random cipher with exactly t keys, each used with equal probability. A quick reading of (3) on page 691 of Shannon’s paper may suggest a too-naive construction: for each ciphertext, we choose t plaintexts uniformly at random from \mathcal{P}_n to be its decryptions. (Some plaintexts may be chosen multiple times, meaning two or more keys decrypt the ciphertext to the same plaintext.) However if we pick the same x \in \mathcal{P}_n first for different ciphertexts y and y', then it is ambiguous whether to encrypt x as y or as y' under the first key. Moreover some plaintexts may never be chosen, giving them no encryption under the first key. Instead, it seems to work best to choose, for each of the t keys, a random bijection \mathcal{P}_n \rightarrow \mathcal{C}_n to be the encryption function for that key. This also arrives at the situation required by Shannon in (3).

Let g(y) be the number of keys giving plausible decryptions of the ciphertext y \in \mathcal{C}_n. (The number of keys may be more than the number of plausible decryptions if there are two keys decrypting y to the same plaintext.) Thus g : \mathcal{C}_n \rightarrow \mathbb{N}_0 is a random quantity, where the randomness comes from the choices made in the construction of the random cipher. If Z_n is chosen uniformly at random from \mathcal{C}_n then, for each key, there is a probability S_n/T_n that the decrypt of Z_n under this key is plausible. Therefore

\displaystyle \mathbb{P}[g(Z_n) = m] = \binom{t}{m} \bigl( \frac{S_n}{T_n} \bigr)^m \bigl( 1 - \frac{S_n}{T_n} \bigr)^{t-m}

and g(Z_n) is distributed binomially as \mathrm{Bin}(S_n/T_n, t).

However this is not the distribution of g(Y_n): since Y_n is the encryption of a plausible plaintext, ciphertexts y with a high g(y) are more frequent, while, as in the family example, ciphertexts y such that g(y) = 0 are never seen at all.

Lemma 5. \mathbb{P}[g(Y_n) = m] = \displaystyle \frac{T_n m}{S_n t} \mathbb{P}[g(Z_n) = m].

Proof. Let \mathcal{P}_n(y) be the multiset of the g(y) plausible plaintexts that are decryptions of y \in \mathcal{C}_n. (Alternatively, and maybe more clearly, we could define \mathcal{P}_n(y) to be the set of pairs (k,x) where k is a key decrypting y to a plausible plaintext x.) Conditioning on the event that x \in \mathcal{P}_n(y) we have

\begin{aligned} \mathbb{P}[g(Y_n) = m] &= \sum_{y \in \mathcal{C}_n \atop g(y) = m} \mathbb{P}[Y=y] \\ &= \sum_{y \in \mathcal{C}_n \atop g(y) = m} \sum_{x \in \mathcal{P}_n(y)} \mathbb{P}[Y=y|X=x]\mathbb{P}[X=x] \\  &= \sum_{y \in \mathcal{C}_n \atop g(y) = m} \sum_{x \in \mathcal{P}_n(y)} \frac{1}{t} \frac{1}{S_n} \\ &= \sum_{y \in \mathcal{C}_n \atop g(y) = m} \frac{m}{t S_n} \\ &= T_n \mathbb{P}[g(Z_n) = m] \frac{m}{S_n t} \\ &= \frac{T_n m}{S_n t} \mathbb{P}[g(Z_n) = m] \end{aligned}

as required. \Box

Corollary 6. The random variable g(Y_n) is distributed as 1+ \mathrm{Bin}(S_n/T_n, t-1).

Proof. By Lemma 5 and the identity m \binom{t}{m} = t \binom{t-1}{m-1} we have

\begin{aligned} \mathbb{P}[g(Y_n) = m] &= \frac{T_n m}{S_n t}\binom{t}{m} \bigl( \frac{S_n}{T_n} \bigr)^m \bigl( 1 - \frac{S_n}{T_n} \bigr)^{t-m}\\ &= \binom{t-1}{m-1}\bigl( \frac{S_n}{T_n} \bigr)^{m-1} \bigl( 1 - \frac{S_n}{T_n} \bigr)^{t-m}. \end{aligned}

Hence g(Y_n)-1 is distributed as \mathrm{Bin}(S_n/T_n, t-1). \Box
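Corollary 6 invites a simulation. The Python sketch below (my own; the parameters T, S and t are illustrative, not from the post) builds a tiny random cipher, samples the ciphertext Y of a random plausible plaintext under a random key, and compares the empirical distribution of g(Y) - 1 with \mathrm{Bin}(S/T, t-1).

```python
import random
from math import comb

# Tiny random cipher: t random bijections encrypt a message space of
# size T containing S "plausible" plaintexts (0 .. S-1).
random.seed(1)
T, S, t = 30, 6, 5
space = list(range(T))
plausible = set(range(S))

trials = 20_000
counts = [0] * t                  # histogram of g(Y) - 1
for _ in range(trials):
    keys = [random.sample(space, T) for _ in range(t)]  # random bijections
    x = random.randrange(S)       # uniform plausible plaintext
    k = random.randrange(t)       # uniform key
    y = keys[k][x]                # observed ciphertext Y
    g = sum(keys[j].index(y) in plausible for j in range(t))
    counts[g - 1] += 1            # g >= 1 always: key k decrypts y to x

# Corollary 6 predicts g(Y) - 1 ~ Bin(S/T, t-1).
expected = [comb(t - 1, m) * (S/T)**m * (1 - S/T)**(t - 1 - m)
            for m in range(t)]
observed = [c / trials for c in counts]
assert all(abs(o - e) < 0.025 for o, e in zip(observed, expected))
```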

It feels like there should be a quick direct proof of the corollary, along the lines ‘we know Y has one plausible decrypt; each of the remaining t-1 is plausible with probability S_n/T_n, hence …’. But this seems dangerously close to ‘we know two fair coin flips gave at least one head; the other flip is a head with probability 1/2, hence …’, which gives the wrong answer. The difference is that the plausible decrypt of Y comes with a known key, whereas the ‘at least one head’ could be either flip. Given the subtle nature of selection effects and conditional probability, I prefer the calculation in the lemma.

Shannon’s paper replaces the lemma with the comment [with one notation change] ‘The probability of such a cryptogram [our event g(Y) = m] is mT/St, since it can be produced by m keys from high probability messages [our plausible plaintexts] each with probability T/S.’ I cannot follow this: in particular T/S cannot be a probability, since it is far greater than 1.

Given that Y = y \in \mathcal{C}_n, the entropy in the key is H(K|Y = y) = \log g(y). Therefore, going back to the lemma, we have

\begin{aligned} H(K|Y_n) &= \sum_{m=1}^t \mathbb{P}[g(Y_n) = m] \log m  \\ &= \sum_{m=1}^t \frac{T_n}{S_n t} \mathbb{P}[g(Z_n) = m]\, m \log m.\end{aligned}

Shannon argues that if t is large compared to n then \log m is almost constant for m near the mean t S_n/T_n of g(Z_n), and so the expected value can be approximated by

\frac{T_n}{S_n t} \frac{t S_n}{T_n} \log \frac{t S_n}{T_n} = \log t + \log S_n - \log T_n.

Observe that \log S_n - \log T_n = -nR, where R is the per-character redundancy of English. Therefore Shannon’s approximation becomes

H(K|Y_n) \approx \log t - nR.

When n is large compared to t, Shannon uses a Poisson approximation. As an alternative, we argue from Corollary 6. Let Z^- be distributed as \mathrm{Bin}(S_n/T_n, t-1). We have

\begin{aligned} H(K|Y_n) &= \mathbb{E}[\log (1 + Z^-)]  \\ &\approx \log \bigl( 1+\mathbb{E} [Z^-] \bigr) \\ &= \log \bigl( 1 + (t-1)S_n/T_n \bigr) \\ &\approx (t-1)S_n/T_n  \\ &\approx t2^{-nR}.\end{aligned}

The graph of H(K|Y_n) is therefore as sketched below.

The quantity H(K) / R = \log t / R is known as the unicity distance of the cipher. Roughly, one expects that after observing this many characters of the ciphertext, the key will be substantially known.
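For the substitution cipher, with t = 26! keys, this gives the classic estimate of roughly 25 ciphertext characters; a quick computation using Shannon’s range of 3.2 to 3.7 bits for R:

```python
from math import log2, factorial

# Unicity distance log2(t)/R for the simple substitution cipher, using
# Shannon's range for the per-character redundancy R of English.
t = factorial(26)              # number of substitution keys
H_K = log2(t)                  # key entropy, about 88.4 bits
unicity = {R: H_K / R for R in (3.2, 3.7)}
# Between roughly 24 and 28 ciphertext characters determine the key.
assert 23 < unicity[3.7] < unicity[3.2] < 29
```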

A more general random cipher?

It would be interesting to generalize the random cipher to the case when |\mathcal{P}_n| < |\mathcal{C}_n| and we pick random injections.

Cipher Systems: possible extras

September 12, 2017

This year I’m lecturing our course Cipher Systems for the first time. The purpose of this post is to record some of the ideas that I hope to at least touch on in the course.

Cryptocurrency (1) RSA Scheme

Of course the Bitcoin blockchain is a splendid advertisement for hash functions. But here I want to give a much simpler example of a toy cryptocurrency in the RSA setup.

TTTT (Totally Trusted Transmission Technology) is going into the cryptocurrency game. In readiness, its RSA modulus N and encryption exponent e are prominently posted on the corporate website. TTTT issues currency units by signing the string ‘TTTT promises faithfully to redeem BitFlop x for £100’, where x is the serial number of the relevant BitFlop. Let f(x) \in \{0,1,\ldots, N-1\} be the number representing this string for BitFlop x. (The function f is public knowledge and is injective.) We say f(x) is an unsigned BitFlop.

In the proposed protocol, a customer wishing to buy a BitFlop sends TTTT a self-addressed envelope stuffed with 20 used fivers. TTTT removes the fivers and inserts a signed Bitflop f(x)^d mod N, neatly typed out on letterhead paper. The serial number x is inscribed in red ink in the company’s ledgers. Alice, as well as anyone she gives the signed BitFlop to, can calculate (f(x)^d)^e = f(x), read the string containing the serial number, and know they have a legitimate BitFlop signed by TTTT. To redeem a BitFlop the postal process is reversed, and x is crossed off from the ledger.
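The arithmetic of signing and verifying can be illustrated with toy numbers. In the sketch below (my own example: tiny primes, a made-up encoding function f, no padding, utterly insecure), TTTT signs a BitFlop and anyone can verify it with the public key (N, e).

```python
# Toy RSA signing of a BitFlop (illustrative parameters only).
p, q = 101, 113
N = p * q                      # 11413, published by TTTT
phi = (p - 1) * (q - 1)
e = 3533                       # published encryption exponent
d = pow(e, -1, phi)            # TTTT's private signing exponent

def f(x):
    # Hypothetical stand-in for the public encoding of the string
    # 'TTTT promises faithfully to redeem BitFlop x for £100'.
    return (1000 + x) % N

serial = 7
signed = pow(f(serial), d, N)  # the signed BitFlop
# Anyone can verify with the public key alone:
assert pow(signed, e, N) == f(serial)
```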

One of the many problems with this scheme is that there is nothing to stop a nefarious Alice (with access to a photocopier) from passing the same signed BitFlop onto both Bob and Charlie. To get round this, TTTT decides to publish its ledger of issued BitFlop serial numbers on the web, reasoning that Bob can then check that he has received an unredeemed BitFlop.

Another drawback is that if Alice buys a BitFlop from TTTT and then gives it to Bob, who then redeems it, TTTT can connect Alice and Bob. (Admittedly not necessarily as parties in the same transaction because of the possibility of a chain Alice to Charlie to Bob.)

Blind signing

To get around the second drawback, TTTT decides to use blind signing.

In the new protocol, if Alice wishes to transfer money to Bob, Bob (not Alice) gets TTTT to send him f(x), representing a `candidate-BitFlop’ with serial number x. TTTT enters x in a new ledger. Bob passes f(x) to Alice, who calculates the product a^e f(x) mod N for some privately chosen a invertible modulo N. Alice then sends TTTT this number, along with the usual envelope of used fivers, and receives by return

(a^e f(x))^d = a^{ed}f(x)^d = a f(x)^d \text{ mod } N.

To transfer the BitFlop, Alice divides by a and gives the signed BitFlop f(x)^d to Bob. Bob can check that x appears in the public ledger and later redeem the BitFlop. Over the course of the protocol:

  • Alice learns: a, f(x), f(x)^d
  • Bob learns: f(x), f(x)^d
  • TTTT learns f(x), that Alice has paid for a signature of a^e f(x), that Bob has redeemed the signed BitFlop f(x)^d.

Assuming (generously) that TTTT has many customers, there is no way for TTTT to associate f(x) with a^e f(x), or Alice’s transaction with Bob’s.
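The blinding identity (a^e f(x))^d = a f(x)^d can likewise be checked with toy numbers (again a hypothetical, insecure illustration; a must be invertible modulo N so that Alice can divide by it).

```python
from math import gcd

# Same toy RSA parameters as before (illustrative only).
p, q = 101, 113
N, phi = p * q, (p - 1) * (q - 1)
e = 3533
d = pow(e, -1, phi)

fx = 4321                      # the candidate-BitFlop f(x) Bob passed on
a = 1234                       # Alice's private blinding factor
assert gcd(a, N) == 1          # a must be invertible mod N

blinded = (pow(a, e, N) * fx) % N      # what Alice sends to TTTT
signed_blinded = pow(blinded, d, N)    # TTTT signs blindly: a * f(x)^d
# Alice divides by a to recover the ordinary signature on f(x):
signature = (signed_blinded * pow(a, -1, N)) % N
assert signature == pow(fx, d, N)      # a valid signed BitFlop
assert pow(signature, e, N) == fx      # public verification
```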


  1. What protection does BitFlop have against double spending?
  2. TTTT decides it would be a nice touch to allow customers to choose their preferred serial number and encryption exponent. (TTTT knows the factorization N = pq, so can easily compute the decryption exponent d' for any invertible e'.) What could possibly go wrong?
  3. While still depending on the postal system to receive used fivers, TTTT decides that email could be used for all other customer communications. Emails have to be encrypted, but no problem, TTTT already has N and e published on its website. What could possibly go wrong, under (a) the blind-signing scheme, (b) the original scheme?

Answers and discussion

  1. TTTT is protected by its ledger system. Using a private ledger there is nothing to stop Alice passing the same BitFlop to both Bob and Charlie; if Bob redeems the BitFlop first, he gets its full value, while Charlie gets a nasty surprise later. Using a public ledger Bob and Charlie can verify the BitFlop is unspent at the time they accept it, but are then in a race to redeem it. (Switching from post to email doesn’t really help: it just means the race is run faster.)

    This might be acceptable: for instance if Alice is in debt to Bob, and pays Bob by BitFlop, then Bob can simply wait for the envelope of used fivers to arrive before agreeing the debt is settled. Alice cannot settle two debts in this way, while Bob cannot plausibly claim to have had the transaction refused by TTTT when x is removed from the public ledger.

    In other cases, for instance if Alice wishes to purchase a physical object from Bob, some escrow system or degree of trust is needed even for less dubious schemes than BitFlop.

  2. When Alice’s request for an encryption exponent e such that (e, \phi(N)) \not= 1 is politely refused by TTTT, she learns a prime factor of p-1 or q-1. This could be useful in a Pollard p-1 attack.
  3. (a) Eve, who is snooping on communications between Bob and TTTT, intercepts an encrypted message M^e mod N from Bob to TTTT. Eve sends M^e mod N to TTTT, along with the usual envelope of used fivers. TTTT, believing that M^e mod N is an obfuscated unsigned BitFlop a^e f(x) mod N, happily sends back (M^e)^d = M to Eve. So for the price of one BitFlop, Eve has obtained the supposedly confidential message M.

    (b) One day Alice asks TTTT to start signing its messages. Clearly authentication is good practice, so TTTT agrees. A typical message from TTTT to Alice, signed using the hash function h, is a pair (M^{e_A} \text{ mod } N_A,h(M)^d \text{ mod } N). Alice notes that any email to TTTT is immediately bounced back with response `Dear Ms. Alice, your custom is important to us. TTTT will reply within four months …’. She carefully crafts a message from a Ms. TYH!ubN(CZ…, such that TTTT’s automatic response M has hash h(M) = f(x) for some serial number x. (Again this is rather easy if f is bijective.) The signed message is then (?, f(x)^d \text{ mod } N), giving Alice (and anyone else snooping) the signed BitFlop with serial number x. Of course this x does not appear in the company’s ledger, so this is most obviously a problem for the first scheme. But even with the public ledger, a malicious Alice can bombard TTTT with emails and so acquire a large number of ‘protoBitFlops’. Publishing each f(x)^d destroys an unledgered BitFlop x, and creates a race to redeem if x is already in the ledger.

The Boomerang Attack

For the moment this is just a link to Wagner’s original paper.

Cryptocurrency (2) Block chain

Linked lists of type Int may be defined in Haskell by data List = Empty | Cons Int List. For example Cons 2 (Cons 1 Empty) represents the list [2,1]. We define an analogous data type HashList in which each cons-cell with data x is augmented by a hash value h. At the level of types h could be any Int, but in a well-formed hash list we will suppose that h is the sum of the hash of x and the hash of the tail of the list. Assume that hashInt :: Int -> Int has already been defined.

   type Hash = Int
   data HashList = Empty | Cons Int Hash HashList

   hash :: HashList -> Hash
   hash Empty = 0
   hash (Cons _ hashValue _) = hashValue

   cons :: Int -> HashList -> HashList
   cons x tail = Cons x h tail 
      where h = hashInt x + hash tail

For example, if (to make a simple illustration) hashInt x = x*x then the hash list with data 3, 2, 1 is

   Cons 3 14 (Cons 2 5 (Cons 1 1 Empty))

the hash values being 1^2 = 1, 2^2 + 1 = 5 and 3^2 + 5 = 14. Note that data values contribute to the hash (albeit at the head only) as soon as they enter a hash list.

As defined above, hash lists provide warning of accidental corruption. But they offer no protection against malicious corruption. Indeed, the function cons allows anyone to create well-formed hash lists having arbitrary values.

In the Bitcoin blockchain, malicious corruption is prevented—or at least, made as hard as inverting a one-way function—by digital signatures. Again I find the Haskell code the clearest way to present the idea. We assume there is a basic public key infrastructure in place, including functions with the types below such that (unsign m) . (sign m) is the identity for each person m.

   type Person = Int

   sign   :: Person -> Int -> Int
   unsign :: Person -> Int -> Int

   type SignedHash = (Person, Int)
   data BlockChain = Empty 
                   | Cons Int SignedHash BlockChain

   signedHash :: BlockChain -> SignedHash
   signedHash Empty = (0, 0)
   signedHash (Cons _ signedHash _) = signedHash

   cons :: Person -> Int -> BlockChain -> BlockChain
   cons m x tail = Cons x (m, (sign m h)) tail 
      where h = let (_, h') = signedHash tail
                in  hashInt m + hashInt x + h'

It is a simple exercise to write a function verify :: BlockChain -> Bool that checks a block chain is valid.

    verify Empty = True
    verify (Cons x (m, h) tail) =    
      let (_, h') = signedHash tail
      in  unsign m h ==    hashInt m + hashInt x + h' 
                        && verify tail 

Note that verify only calls the publicly known function unsign m. For example, if (thanks to an appalling programming error swapping the encryption and decryption functions) Person 0’s signature function turns out to be x \mapsto x^3 (modulo some large RSA modulus) then the block chain with data 3, 2, 1 is

   Cons 3 (0, 134*134*134) (Cons 2 (0, 5*5*5) 
                          (Cons 1 (0, 1*1*1) Empty))

which passes verification. (Person 0 is particularly unfortunate in that his identity enters the hash by adding 0^2.) A well-formed length-one block chain Cons x (m, sign m (hashInt m + hashInt x)) Empty is equivalent to a signed x from Person m.

Using BlockChain one could implement a simple cryptocurrency, along the lines of the 'ScroogeCoin' considered in the recent book Bitcoin and cryptocurrency technologies (draft available online). Here Scrooge is a trusted party who must sign every transaction, and has sole responsibility for the integrity of the currency. The brilliance of Bitcoin lies in how it achieves decentralization by rewarding (with Bitcoins) anyone willing to play the role of an honest Scrooge, while still thwarting double spending.

A minimal working implementation of BlockChain is available here.