The purpose of this post is to record two approaches to Shannon’s Source Coding Theorem for memoryless sources, first by explicit codes and secondly using statistical properties of the source.
Setup and statement
Suppose that a memoryless source produces the symbols $a_1, \ldots, a_k$ with probabilities $p_1, \ldots, p_k$, respectively. Let
$$H = -\sum_{i=1}^k p_i \log_2 p_i$$
be the entropy of this probability distribution. Since the source is memoryless, the entropy of an $n$-tuple of symbols is then $nH$. Shannon's Source Coding Theorem (or First Main Theorem, or Noiseless Coding Theorem) states that, given $\epsilon > 0$, provided $n$ is sufficiently large, there is a variable length binary code of average length less than $n(H + \epsilon)$ that can be used to encode all $n$-tuples of symbols.
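The definitions can be checked numerically. The following sketch (the two-symbol distribution is made up for illustration) computes the entropy of a distribution and verifies that the entropy of a pair of independent symbols is $2H$.

```python
from math import log2

def entropy(ps):
    """Shannon entropy H = -sum_i p_i log2 p_i of a probability distribution."""
    assert abs(sum(ps) - 1) < 1e-9, "probabilities must sum to 1"
    return -sum(p * log2(p) for p in ps if p > 0)

ps = [0.75, 0.25]          # an illustrative two-symbol source
H = entropy(ps)            # about 0.811 bits per symbol

# Memorylessness: a pair of symbols has the product distribution,
# whose entropy is 2H.
pairs = [p * q for p in ps for q in ps]
print(H, entropy(pairs))
```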
Proof by Kraft–McMillan inequality
Choose $n$ sufficiently large so that $1/n \le \epsilon$. Let $q_1, \ldots, q_{k^n}$ be the probabilities of the $k^n$ different $n$-tuples and let $\ell_j$ be least such that $\ell_j \ge \log_2 (1/q_j)$. Equivalently, $\ell_j$ is least such that
$$2^{-\ell_j} \le q_j.$$
This equivalent characterisation shows that
$$\sum_{j=1}^{k^n} 2^{-\ell_j} \le \sum_{j=1}^{k^n} q_j = 1,$$
so by the Kraft–McMillan inequality, there exists a prefix-free binary code with these codeword lengths. Moreover, since $\ell_j < \log_2 (1/q_j) + 1$ for each $j$, the average codeword length satisfies
$$\sum_{j=1}^{k^n} q_j \ell_j < \sum_{j=1}^{k^n} q_j \log_2 \frac{1}{q_j} + 1 = nH + 1 \le n(H + \epsilon).$$
The required prefix-free code is the Shannon code for the probabilities $q_1, \ldots, q_{k^n}$. Another algorithmic construction is given by the proof of the sufficiency of the Kraft–McMillan inequality on page 17 of Welsh Codes and cryptography. (The same proof is indicated, rather less clearly in my view, on the Wikipedia page.)
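The Shannon code is easy to implement. The sketch below (the four probabilities are illustrative) computes the lengths $\ell_j = \lceil \log_2(1/q_j) \rceil$, takes each codeword from the binary expansion of the cumulative probability, and checks the Kraft–McMillan sum and the average length bound.

```python
from math import ceil, log2

def shannon_code(qs):
    """Shannon code: sort the probabilities into decreasing order; the j-th
    codeword is the first l_j = ceil(log2(1/q_j)) bits of the binary
    expansion of the cumulative probability F_j = q_1 + ... + q_{j-1}."""
    qs = sorted(qs, reverse=True)
    code, F = [], 0.0
    for q in qs:
        l = ceil(log2(1 / q))
        code.append(''.join(str(int(F * 2 ** (i + 1)) % 2) for i in range(l)))
        F += q
    return qs, code

qs, code = shannon_code([0.4, 0.3, 0.2, 0.1])
print(code)                                      # a prefix-free code
print(sum(2 ** -len(w) for w in code))           # Kraft-McMillan sum, at most 1
H = sum(q * log2(1 / q) for q in qs)
print(sum(q * len(w) for q, w in zip(qs, code)), H + 1)  # average length < H + 1
```

Sorting into decreasing order is what makes the cumulative-probability construction prefix-free: consecutive $F_j$ then differ by at least $2^{-\ell_j}$, so they disagree within the first $\ell_j$ bits.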
Proof using the Asymptotic Equipartition Property (AEP)
Despite its forbidding name, for memoryless sources the AEP is little more than the Weak Law of Large Numbers, albeit applied in a slightly subtle way to random variables whose values are certain (functions of) probabilities. Let $\Omega = \{a_1, \ldots, a_k\}$ be the probability space of symbols produced by the source. Let $X$ be a random symbol produced by the source, and write $p_x$ for the probability of the symbol $x$, so that $p_{a_i} = p_i$. Then $\log_2 (1/p_X)$ is a random variable. Summing over the outcomes in the probability space, we find that
$$\mathbb{E}\Bigl[ \log_2 \frac{1}{p_X} \Bigr] = \sum_{i=1}^k p_i \log_2 \frac{1}{p_i} = H,$$
where $H$ is the entropy already seen. While straightforward, and giving some motivation for considering $\log_2 (1/p_X)$, the calculation hides some subtleties. For example, is it true that
$$\mathbb{P}[p_X = p_i] = p_i?$$
In general the answer is 'no': in fact equality holds if and only if $a_i$ is the unique symbol with probability $p_i$. (One further thing, that I certainly will not even dare to mention in my course, is that, by the formal definition of a random variable, $X$ is the identity function on $\Omega$, and so $p_X$ is a composition of two functions; in Example 2.3.1 of Goldie and Pinch Communication Theory the authors remark that forming such compositions is a 'characteristic feature of information theory, not found elsewhere in probability theory'.)
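A tiny computation, with a made-up distribution in which two symbols share a probability, illustrates the point:

```python
from fractions import Fraction

# Illustrative distribution: 'b' and 'c' share the probability 1/4.
p = {'a': Fraction(1, 2), 'b': Fraction(1, 4), 'c': Fraction(1, 4)}

def prob_pX_equals(q):
    """P[p_X = q]: the total probability of all symbols of probability q."""
    return sum(pi for pi in p.values() if pi == q)

print(prob_pX_equals(Fraction(1, 2)))  # 1/2: 'a' is unique with its probability
print(prob_pX_equals(Fraction(1, 4)))  # 1/2, not 1/4: the event contains 'b' and 'c'
```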
Anyway, returning to the AEP, let $(X_1, \ldots, X_n)$ be a random $n$-tuple. For brevity we shall now call these $n$-tuples messages. Let $p : \Omega^n \to [0,1]$ be defined by
$$p(u_1, \ldots, u_n) = p_{u_1} p_{u_2} \cdots p_{u_n}.$$
Since the source is memoryless, $p(u_1, \ldots, u_n)$ is the probability that the first $n$ symbols produced by the source are $u_1, \ldots, u_n$. Consider the random variable $\log_2 \bigl( 1/p(X_1, \ldots, X_n) \bigr)$. By the displayed equation above,
$$\log_2 \frac{1}{p(X_1, \ldots, X_n)} = \sum_{t=1}^n \log_2 \frac{1}{p_{X_t}}.$$
This is a sum of independent identically distributed random variables. We saw above that each has expected value $H$. Hence, by the weak law, given any $\delta > 0$,
$$\mathbb{P}\biggl[ \Bigl| \frac{1}{n} \log_2 \frac{1}{p(X_1, \ldots, X_n)} - H \Bigr| > \delta \biggr] \to 0$$
as $n \to \infty$. By taking $n$ sufficiently large so that the probability is less than $\delta$, we obtain a subset $T$ of $\Omega^n$ such that (a)
$$\mathbb{P}\bigl[ (X_1, \ldots, X_n) \in T \bigr] \ge 1 - \delta$$
and (b) if $u \in T$ then
$$2^{-n(H + \delta)} \le p(u) \le 2^{-n(H - \delta)}.$$
The properties (a) and (b) say that memoryless sources satisfy the AEP. It is usual to call the elements of $T$ typical messages.
For example, if $p_1 > p_2 > \cdots > p_k$ then the most likely message is $a_1 a_1 \ldots a_1$ and the least likely is $a_k a_k \ldots a_k$; unless the distribution is uniform, both are atypical when $\delta$ is small. A typical message will instead have each symbol $a_i$ appearing with frequency about $n p_i$. Indeed, the proof of the AEP for memoryless sources on page 80 of Welsh Codes and cryptography takes this as the definition of typical, using the Central Limit Theorem to replace the weak law so that 'about' can be defined precisely.
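The concentration behind (a) and (b) shows up readily in simulation. The sketch below (the source probabilities are made up for illustration) estimates the proportion of typical messages for increasing $n$:

```python
import random
from math import log2

random.seed(0)
ps = {'a': 0.75, 'b': 0.25}                    # illustrative memoryless source
H = -sum(p * log2(p) for p in ps.values())     # about 0.811

def rate(n):
    """-(1/n) log2 p(u) for one random message u of length n."""
    u = random.choices(list(ps), weights=list(ps.values()), k=n)
    return sum(log2(1 / ps[x]) for x in u) / n

delta = 0.05
for n in (10, 100, 10000):
    typical = sum(abs(rate(n) - H) <= delta for _ in range(500))
    print(n, typical / 500)   # proportion of typical messages; tends to 1 as n grows
```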
We can now prove the Source Coding Theorem in another way: if $u$ is typical then
$$p(u) \ge 2^{-n(H + \delta)}.$$
Hence $|T| \le 2^{n(H + \delta)}$, and so the typical messages can be encoded using binary words of length $\lceil n(H + \delta) \rceil$. For the atypical messages we make no attempt at efficiency, and instead encode them using binary words of length $\lceil n \log_2 k \rceil$, which is enough for all $k^n$ messages. Prefixing every codeword with a flag bit recording whether the message is typical, we obtain a decodable code with (at most) two distinct lengths of codewords, and the expected length of a codeword is at most
$$(1 - \delta)\bigl( n(H + \delta) + 2 \bigr) + \delta \bigl( n \log_2 k + 2 \bigr) \le n(H + \delta) + 2 + \delta n \log_2 k.$$
By taking $n$ sufficiently large so that $2/n \le \delta$, we may assume that $n(H + \delta) + 2 \le n(H + 2\delta)$. Therefore the expected length is at most
$$n \bigl( H + \delta (2 + \log_2 k) \bigr);$$
when $\delta$ is sufficiently small this is less than $n(H + \epsilon)$.
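For a small case the bound can be checked exactly, by grouping messages according to their composition. The sketch below (probabilities, $n$ and $\delta$ all chosen for illustration) computes the probability of the typical set and the expected length per symbol of the two-length code:

```python
from math import ceil, comb, log2

ps = (0.75, 0.25)                              # illustrative two-symbol source
k, n, delta = 2, 200, 0.1
H = -sum(p * log2(p) for p in ps)

# Group the k^n messages by their number j of second symbols: each such
# message has the same probability q, and there are comb(n, j) of them.
typ_prob, typ_count = 0.0, 0
for j in range(n + 1):
    q = ps[0] ** (n - j) * ps[1] ** j
    if abs(log2(1 / q) / n - H) <= delta:      # condition (b): typical
        typ_prob += comb(n, j) * q
        typ_count += comb(n, j)

print(typ_prob)                                # condition (a): close to 1
print(typ_count <= 2 ** (n * (H + delta)))     # at most 2^{n(H+delta)} typical messages

# Flag bit plus ceil(n(H + delta)) bits for typical messages; flag bit plus
# ceil(n log2 k) bits for the rest.
avg = typ_prob * (1 + ceil(n * (H + delta))) + (1 - typ_prob) * (1 + ceil(n * log2(k)))
print(avg / n, H + delta * (2 + log2(k)))      # bits per symbol, within the bound
```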
In Part C of my course I plan to give this proof before defining the AEP, and then take (a) and (b) as its definition for general sources. As a preliminary, I plan to prove the Weak Law of Large Numbers from Chebyshev’s Inequality.
AEP for general sources
Going beyond this, things quickly get difficult. Recall that a source producing symbols $X_1, X_2, \ldots$ is stationary if $(X_1, \ldots, X_n)$ and $(X_{1+s}, \ldots, X_{n+s})$ have the same distribution for all $n$ and $s$, and ergodic if with probability $1$ the observed frequency of every $n$-tuple of symbols converges to its expected value. There is a nice proof using Fekete's Lemma in Welsh Codes and cryptography (see page 78) that any stationary source has an entropy, defined by
$$H = \lim_{n \to \infty} \frac{H(X_1, \ldots, X_n)}{n}.$$
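The limit can be watched converging for a concrete non-memoryless example, a stationary two-state Markov source; the transition probabilities below are made up for illustration. For such a source the chain rule gives $H(X_1, \ldots, X_n) = H(X_1) + (n-1) H(X_2 \mid X_1)$, so $H(X_1, \ldots, X_n)/n$ decreases to the conditional entropy $H(X_2 \mid X_1)$.

```python
from itertools import product
from math import log2

P = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # illustrative transitions
pi = (2 / 3, 1 / 3)                                       # stationary distribution

def H_n(n):
    """Joint entropy H(X_1, ..., X_n) of the stationary Markov source,
    computed by brute force over all 2^n messages."""
    h = 0.0
    for u in product((0, 1), repeat=n):
        q = pi[u[0]]
        for a, b in zip(u, u[1:]):
            q *= P[a, b]
        h += q * log2(1 / q)
    return h

# Entropy rate: the conditional entropy H(X_2 | X_1).
rate = sum(pi[a] * P[a, b] * log2(1 / P[a, b]) for a in (0, 1) for b in (0, 1))
for n in (1, 2, 4, 8, 12):
    print(n, H_n(n) / n)       # decreases towards the rate, about 0.553
```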
However the proof that a stationary ergodic source satisfies the AEP is too hard for my course. (For instance the sketch on Wikipedia blithely uses Lévy's Martingale Convergence Theorem.) And while the result is fairly intuitive, it is hard to prove that a memoryless source is ergodic — in fact this is essentially the Strong Law of Large Numbers — so there are difficulties even in connecting the memoryless motivation to the general result. All in all this seems to be a place to give examples and state results without proof.