The first purpose of this post is to record two approaches to Shannon’s Source Coding Theorem for memoryless sources, first by explicit codes and secondly using statistical properties of the source. After that I try, with mixed success, to give a motivated proof of a more general result on lossy coding.
Setup and statement.
Suppose that a memoryless source produces the symbols with probabilities , respectively. Let
be the entropy of this probability dis tribution. Since the source is memoryless, the entropy of an -tuple of symbols is then . Shannon’s Source Coding Theorem (or First Main Theorem, or Noiseless Coding Theorem) states that, given , provided is sufficiently large, there is a variable length binary code of average length less than that can be used to encode all -ruples of symbols.
Proof by Kraft–McMillan inequality
Choose sufficiently large so that . Let be the probabilities of the -tuples and let be least such that . Equivalently, is least such that
This equivalent characterisation shows that
so by the Kraft–McMillan inequality, there exists a prefix free binary code with these codeword lengths. Moreover, since
The required prefix-free code is the Shannon code for the probabilities . Another algorithmic construction is given by the proof of the sufficiency of the Kraft–McMillan inequality on page 17 of Welsh Codes and cryptography. (The same proof is indicated, rather less clearly in my view, on the Wikipedia page.)
Proof using Asymptotic Equipartition Property (AEP).
Despite its forbidding name, for memoryless sources the AEP is little more than the Weak Law of Large Numbers, albeit applied in a slightly subtle way to random variables whose values are certain (functions of) probabilities. Let be the probability space of symbols produced by the source. Let be a random symbol produced by the source. Then is a random variable. Summing over the outcomes in the probability space, we find that
where is the entropy already seen. While straightforward, and giving some motivation for considering , the calculation hides some subtleties. For example, is it true that
In general the answer is ‘no’: in fact equality holds if and only if is the unique symbol with probability . (One further thing, that I certainly will not even dare to mention in my course, is that, by the formal definition of a random variable, is the identity function on , and so is a composition of two functions; in Example 2.3.1 of Goldie and Pinch Communication Theory the authors remark that forming such compositions is a ‘characteristic feature of information theory, not found elsewhere in probability theory’.)
Anyway, returning to the AEP, let be a random -tuple. For brevity we shall now call these -tuples messages. Let be defined by
Since the source is memoryless, is the probability that the first symbols produced by the source are . Consider the random variable . By the displayed equation above,
This is a sum of independent identically distributed random variables. We saw above that each has expected value . Hence, by the weak law, given any ,
as . By taking sufficiently large so that the probability is less than , we obtain a subset of such that (a)
and (b) if then
The properties (a) and (b) say that memoryless sources satisfy the AEP. It is usual to call the elements of typical messages.
For example, if then the most likely message is and the least likely is ; unless the distribution is uniform, both are atypical when is small. A typical message will instead have each symbol with frequency about . Indeed, the proof of the AEP for memoryless sources on page 80 of Welsh Codes and cryptography takes this as the definition as typical, using the Central Limit Theorem to replace the weak law so that ‘about’ can be defined precisely.
We can now prove the Source Coding Theorem in another way: if is typical then
Hence . For the atypical messages we make no attempt at efficiency, and instead encode them using binary words of length for some . Thus the code has (at most) two distinct lengths of codewords, and the expected length of a codeword is at most
By taking sufficiently large and , we may assume that . Therefore the expected length is at most
when is sufficiently small this is less than .
In Part C of my course I plan to give this proof before defining the AEP, and then take (a) and (b) as its definition for general sources. As a preliminary, I plan to prove the Weak Law of Large Numbers from Chebyshev’s Inequality.
AEP for general sources
Going beyond this, things quickly get difficult. Recall that a source producing symbols is stationary if and have the same distribution for all and , and ergodic if with probability the observed frequencies of every -tuple of symbols converges to its expected value. There is a nice proof using Fekete’s Lemma in Welsh Codes and cryptography (see page 78) that any stationary source has an entropy, defined by
However the proof that a stationary ergodic source satisfies the AEP is too hard for my course. (For instance the sketch on Wikipedia blithely uses Lévy’s Martingale Convergence Theory.) And while fairly intuitive, it is hard to prove that a memoryless source is ergodic — in fact this is essentially the Strong Law of Large Numbers — so there are difficulties even in connecting the memoryless motivation to the general result. All in all this seems to be a place to give examples and state results without proof.
Lossy source coding
In §10 of Cover and Thomas, Elements of information theory, Wiley (2006) there is a very general but highly technical result on source encoding sources satisfying the AEP when a non-zero error probability is permitted. After much staring at the proof in §10.5 and disentangling of the definitions from rate-distortion theory, I was able to present a very simple version for the unbiased memoryless binary source in my course. As a rare question, someone asked about the biased memoryless binary source. Here is (one version of) the relevant result. Throughout let .
Theorem. Let . Provided is sufficiently large, there exists a code such that
and an encoder such that
Here denotes Hamming distance. So the conclusion is that with probability , a random word emitted by the source can be encoded by a codeword in obtained by flipping at most bits. Therefore each bit is correct with probability at least . By sending the number of a codeword we may therefore compress the source by a factor of , for every . Note that we assume , since if a bit error probability exceeding is acceptable then there is no need for any encoding: the receiver simply guesses that every bit is .
In the lecture, I could not see any good way to motivate the appearance of , so instead I claimed that one could compress by a factor of . The argument I had in mind was that by the AEP (or Shannon's Noiseless Coding Theorem) one can represent the typical words emitted by the source using all binary words of length . Then compress to bits by thinking of these binary words as emitted by an unbiased source, using the special case of the theorem above when . The problem is that decoding gives a bitwise error probability of on the binary words of length used to encode the source, and this does not give good control of the bitwise error probability on the original binary words of length emitted by the source.
In an attempt to motivate the outline proof below, we take one key idea in the proof of the more general result in Cover and Thomas: the code should be obtained by taking a random word emitted by the source and flipping bits to obtain a codeword such that is minimized, subject to the constraint that the expectation of is .
In the special case I presented in the lecture all words are equally likely to be emitted by the source, and so flipping bits at random gives codewords distributed uniformly at random on . As a warm up, we use such codewords to prove the theorem in this special case.
Proof when . Fix and let be the probability that a codeword , chosen uniformly at random from , is within distance of . Since is uniformly distributed, is the size of the Hamming ball of radius about divided by . Standard bounds show that
By the lower bound, if we pick slightly more than codewords uniformly at random, then the expected number of codewords within distance of is at least . More precisely, let be as in the statement of the theorem above. Let be a random code constructed by chosing codewords independently and uniformly at random from . The probability that no codeword in is within distance of is at most
which tends to as . Since was arbitrary, the above is also the probability that no codeword is within distance of a word in chosen uniformly at random. Therefore this average probability, say, can also be made less than . Hence there exists a particular code with codewords such that, for this code, . Thus at least of all words in are within distance of some codeword in .
It seemed highly intuitive to me that, since by the AEP, we need only consider those of weight about , the right construction in general would be to take such words and construct random codewords by randomly flipping of their bits. But some thought shows this cannot work: for example, take and . Then flipping bits at random gives codewords of average weight
Let be of weight . Then is at distance
words of weight , and to pick a random code from words of this weight so that the expectated number of codewords close to is at least , we must choose codewords where
If the denominator was , we could take as required. But because we are stuck in the bigger space of words of weight , we require a larger .
Some of my confusion disappeared after doing the minimization of more carefully. Since and is constant, a good — although maybe not very intuitive — way to do this is to choose the probabilities to maximize , consistent with the requirements that if then
Replacing the inequality with equality (it is a safe assumption that the expected distortion is maximized when the mutual information is minimized), and solving the linear equations, one finds that the flip probabilities are
Thus one may pick any such that . Since , substituting from the equations above gives a formula for in terms of , and . Differentiating with respect to (this is fiddly enough to warrant computer algebra), one finds that
and so the extrema occur when
For no reason that I can see, on multiplying out, all powers of and cancel, and one is left with the factoring quadratic
Hence is extremized only when and both flip probabilities are and when , when and . In the first solution the second derivative satisfies
so it is a maximum. Consistent with the failed attempt above, we have , so the space of codewords is smaller than the space of source words. The extreme value is . Again this is somewhat intuitive: the minimum mutual information consistent with a distortion of should be the uncertainty in the probability distribution. But note that the distortion is applied to codewords, rather than words emitted by the source. (A related remark is given at the end of this post.)
To emphasise the difference, the left diagrams below show the probabilities going from to , relevant to . The right diagram shows the probabilities going from to , relevant to and need in the proof below.
Note that the flip probabilities going from to are not, in general equal. In the example above with and we have and the flip probabilities from to are and . The extreme value is .
The second solution appears to be spurious, i.e. not corresponding to a zero of the derivative. But since two maxima are separated by another zero of the derivative, it cannot be a maximum. This solution exists only when , a condition equivalent to . In the example above and (by complete chance) and . The extreme value is .
Outline Proof. Let . Let as in the statement of the theorem. Choose codewords independently and uniformly at random from the binary words of length of weight . Let be a fixed word of weight . When we flip s and s in we obtain a binary word of weight
as expected. Moreover, the number of binary words of weight we obtain in this way is . By the bounds above, the probability that a codeword of weight , chosen uniformly at random, is within distance of is about
Considering all codewords, the probability that no codeword is within distance of is, by the same exponential approximation used earlier, about
Now, as one would hope, and as can be shown using Mathematica, with
FullSimplify and the assumptions that and , we have . The end of the proof is therefore as before.
Remark. It is very tempting to start on the other side, and argue that a codeword chosen uniformly at random from the set of binary words of weight is within distance of source words of weight by flipping exactly s and s. This avoids the horrendous algebraic simplification required in the proof above. But now it is the source word that is varying, not the codeword. One can argue this way by varying both, which is permitted by the AEP since typically source words have about s. This combines both sources of randomness: the random code and the random source word in a slightly subtle way, typical of the way the AEP is used in a practice.