6 What Happens When the Noise is Stochastic: Shannon’s Theorem
A (memoryless) stochastic channel consists of a finite input alphabet \(\mathcal{X}\), a finite output alphabet \(\mathcal{Y}\), and a transition matrix \(\mathbf{M}\) indexed by \(\mathcal{X}\times \mathcal{Y}\), where for every \((x,y)\in \mathcal{X}\times \mathcal{Y}\), \(\mathbf{M}(x,y)=\Pr (y\mid x)\) denotes the probability that \(y\) is output by the channel when \(x\) is input to the channel; in particular \(\sum _{y\in \mathcal{Y}}\mathbf{M}(x,y)=1\) for every \(x\in \mathcal{X}\). The channel is memoryless in the sense that the noise acts independently on each symbol of a transmitted string.
Let \(0\le p\le 1\). The Binary Symmetric Channel with crossover probability \(p\), denoted \(\mathrm{BSC}_p\), is the stochastic channel with \(\mathcal{X}=\mathcal{Y}=\{ 0,1\} \) and transition matrix
That is, every transmitted bit is flipped independently with probability \(p\). (One may restrict attention to \(0\le p\le \tfrac 12\), since reliable communication over \(\mathrm{BSC}_p\) for \(p\le \tfrac 12\) implies reliable communication over \(\mathrm{BSC}_{p'}\) for \(p'{\gt}\tfrac 12\).) For the rest of the chapter, \(\mathbf{e}\sim \mathrm{BSC}_p\) denotes an error pattern \(\mathbf{e}\in \{ 0,1\} ^n\) drawn according to the error distribution induced by \(\mathrm{BSC}_p\).
Let \(q\ge 2\) and \(0\le p\le 1-\tfrac 1q\). The \(q\)-ary Symmetric Channel with crossover probability \(p\), denoted \(q\mathrm{SC}_p\), is the stochastic channel with \(\mathcal{X}=\mathcal{Y}=[q]\) and transition matrix
That is, every symbol is retained with probability \(1-p\) and distorted to each of the \(q-1\) other symbols with equal probability \(\tfrac {p}{q-1}\).
Let \(0\le \alpha \le 1\). The Binary Erasure Channel with erasure probability \(\alpha \), denoted \(\mathrm{BEC}_\alpha \), is the stochastic channel with \(\mathcal{X}=\{ 0,1\} \), \(\mathcal{Y}=\{ 0,1,{?}\} \) (where \({?}\) denotes an erasure), and transition matrix given by
and \(\mathbf{M}(x,y)=0\) for every other pair \((x,y)\). That is, every transmitted bit is erased with probability \(\alpha \) and left unchanged with probability \(1-\alpha \).
Let \(\sigma \ge 0\). The Binary Input Additive Gaussian White Noise Channel with standard deviation \(\sigma \), denoted \(\mathrm{BIAGWN}_\sigma \), is the (continuous) channel with input alphabet \(\mathcal{X}=\{ -1,1\} \) and output alphabet \(\mathcal{Y}=\mathbb {R}\), in which for \((x,y)\in \{ -1,1\} \times \mathbb {R}\) the noise \(y-x\) is distributed according to the Gaussian distribution with mean \(0\) and standard deviation \(\sigma \), i.e.
Fix a stochastic channel and a pair of encoding and decoding functions \((E,D)\) for a code of block length \(n\). The decoding error probability of \((E,D)\) is a function \(f(n)\) such that for every message \(\mathbf{m}\), the decoding algorithm recovers \(\mathbf{m}\) with probability at least \(1-f(n)\) (the probability taken over the randomness of the channel noise), where \(\lim _{n\to \infty }f(n)=0\), i.e. \(f(n)=o(1)\). Ideally one would like \(f(n)=2^{-\Omega (n)}\).
Given a stochastic channel, its capacity is a real number \(C\ge 0\) such that reliable communication is possible over the channel if and only if the rate is less than \(C\). That is: for every rate \(R{\lt}C\), there is a family of codes (with associated encoding and decoding functions) of rate \(R\) whose decoding error probability is negligible (indeed \(2^{-\Omega (n)}\)); and for every rate \(R{\gt}C\), for every choice of encoding and decoding functions of rate \(R\), there is some message for which the decoding error probability is bounded below by a constant.
For real numbers \(p,\varepsilon \) such that \(0\le p{\lt}\tfrac 12\) and \(0\le \varepsilon \le \tfrac 12-p\), the following statements are true for large enough \(n\):
There exists a real \(\delta {\gt}0\), an encoding function \(E:\{ 0,1\} ^k\to \{ 0,1\} ^n\) and a decoding function \(D:\{ 0,1\} ^n\to \{ 0,1\} ^k\) where \(k\le \lfloor (1-H(p+\varepsilon ))n\rfloor \), such that the following holds for every \(\mathbf{m}\in \{ 0,1\} ^k\):
\[ \Pr _{\mathbf{e}\sim \mathrm{BSC}_p}[D(E(\mathbf{m})+\mathbf{e})\ne \mathbf{m}]\le 2^{-\delta n}. \]If \(k\ge \lceil (1-H(p)+\varepsilon )n\rceil \) then for every pair of encoding and decoding functions \(E:\{ 0,1\} ^k\to \{ 0,1\} ^n\) and \(D:\{ 0,1\} ^n\to \{ 0,1\} ^k\), there exists \(\mathbf{m}\in \{ 0,1\} ^k\) such that
\[ \Pr _{\mathbf{e}\sim \mathrm{BSC}_p}[D(E(\mathbf{m})+\mathbf{e})\ne \mathbf{m}]{\gt}\frac12. \]
For \(0\le p{\lt}\tfrac 12\), the capacity of \(\mathrm{BSC}_p\) is \(1-H(p)\).
Fix a rate \(R{\lt}1-H(p)\) and choose \(\varepsilon {\gt}0\) small enough that \(R\le 1-H(p+\varepsilon )\). By part (1) of Theorem 144, for every large enough \(n\) there is a code of dimension \(k\le \lfloor (1-H(p+\varepsilon ))n\rfloor \), hence of rate at least (a value tending to) \(R\), and decoding error probability \(2^{-\Omega (n)}\); this shows every rate \(R{\lt}1-H(p)\) is achievable. Conversely, fix \(R{\gt}1-H(p)\) and let \(\varepsilon =R-(1-H(p)){\gt}0\); then \(k=Rn\ge \lceil (1-H(p)+\varepsilon )n\rceil \) for large enough \(n\), so by part (2) of Theorem 144, for every choice of encoding and decoding functions of rate \(R\) there is a message with decoding error probability greater than \(1/2\), so the decoding error probability is bounded below by a constant. Hence \(1-H(p)\) satisfies the requirements of Definition 143 for the capacity of \(\mathrm{BSC}_p\).
Let \(0\le p{\lt}\tfrac 12\) and \(0\le \varepsilon \le \tfrac 12-p\). For large enough \(n\), if \(k\ge \lceil (1-H(p)+\varepsilon )n\rceil \) then for every pair of encoding and decoding functions \(E:\{ 0,1\} ^k\to \{ 0,1\} ^n\) and \(D:\{ 0,1\} ^n\to \{ 0,1\} ^k\), there exists \(\mathbf{m}\in \{ 0,1\} ^k\) such that
We may assume \(p{\gt}0\): if \(p=0\) then \(1-H(p)+\varepsilon =1+\varepsilon {\gt}1\), so the hypothesis \(k\ge \lceil (1-H(p)+\varepsilon )n\rceil \) forces \(k{\gt}n\), in which case \(E\) cannot be injective and no decoding function can be correct on every message even with zero noise; the claim is then immediate.
For the sake of contradiction, assume that for every \(\mathbf{m}\in \{ 0,1\} ^k\),
Define \(D_{\mathbf{m}}=\{ \mathbf{y}\mid D(\mathbf{y})=\mathbf{m}\} \). Since \(D\) is a function, the sets \(D_{\mathbf{m}}\), \(\mathbf{m}\in \{ 0,1\} ^k\), partition \(\{ 0,1\} ^n\).
Fix \(\gamma {\gt}0\) (to be chosen at the end of the proof) and let \(S_{\mathbf{m}}\) be the shell of inner radius \((1-\gamma )pn\) and outer radius \((1+\gamma )pn\) around \(E(\mathbf{m})\), i.e.
Fix \(\mathbf{m}\in \{ 0,1\} ^k\). By assumption,
By the multiplicative Chernoff bound applied to \(wt(\mathbf{e})\), a sum of \(n\) i.i.d. Bernoulli\((p)\) random variables of mean \(pn\),
By the union bound,
so, for large enough \(n\),
On the other hand, since all error patterns of the same Hamming weight occur with the same probability under \(\mathrm{BSC}_p\),
Since \(p\le \tfrac 12\), the function \(p^d(1-p)^{n-d}\) is decreasing in \(d\), so the maximum is attained at \(d=(1-\gamma )pn\), giving
Combining the last three displays,
Summing over all \(2^k\) messages \(\mathbf{m}\), using disjointness of the \(D_{\mathbf{m}}\),
Choosing \(\gamma =\dfrac {\varepsilon }{2p\log (1/p-1)}\) (so \(\gamma =\Theta (\varepsilon )\), using \(0{\lt}p{\lt}\tfrac 12\)), for large enough \(n\) we get
This implies \(k{\lt}(1-H(p)+\varepsilon )n\), contradicting \(k\ge \lceil (1-H(p)+\varepsilon )n\rceil \). This proves the lemma.
The same proof, with a suitable choice of \(\gamma \), shows that the conclusion of Lemma 146 still holds if the hypothesis is weakened to decoding error probability at most \(1-2^{-\beta n}\) (instead of \(1/2\)) for small enough \(\beta =\beta (\varepsilon ){\gt}0\).
Let \(0\le p{\lt}\tfrac 12\), \(0{\lt}\varepsilon \le \tfrac 12-p\), and let \(\varepsilon '=\varepsilon /2\). Let \(k\le \lfloor (1-H(p+\varepsilon ))n\rfloor \). Suppose \(E:\{ 0,1\} ^k\to \{ 0,1\} ^n\) is chosen at random by picking \(E(\mathbf{m}')\) independently and uniformly at random from \(\{ 0,1\} ^n\) for every \(\mathbf{m}'\in \{ 0,1\} ^k\), and let \(D\) be the maximum likelihood decoding (MLD) function for \(E\). Then there is \(\delta '=\delta '(\varepsilon ,p){\gt}0\) such that for every fixed \(\mathbf{m}\in \{ 0,1\} ^k\) and large enough \(n\),
Write \(\mathbf{y}=E(\mathbf{m})+\mathbf{e}\) for the received word. Splitting the sum defining the decoding error probability according to whether \(\mathbf{y}\) lies in the Hamming ball \(B(E(\mathbf{m}),(p+\varepsilon ')n)\),
The second sum is at most
by the additive Chernoff bound applied to \(wt(\mathbf{e})\), a sum of \(n\) i.i.d. Bernoulli\((p)\) random variables. Hence
Taking the expectation of both sides over the random choice of \(E\) and using linearity of expectation,
Fix a received word \(\mathbf{y}\) with \(\Delta (\mathbf{y},E(\mathbf{m}))\le (p+\varepsilon ')n\). Since \(D\) is MLD, \(D(\mathbf{y})\ne \mathbf{m}\) can happen only if there is \(\mathbf{m}'\ne \mathbf{m}\) with \(\Delta (E(\mathbf{m}'),\mathbf{y})\le \Delta (E(\mathbf{m}),\mathbf{y})\le (p+\varepsilon ')n\). By the union bound, and using that each \(E(\mathbf{m}')\), \(\mathbf{m}'\ne \mathbf{m}\), is independent of \(E(\mathbf{m})\) and uniform on \(\{ 0,1\} ^n\),
By the upper bound on the volume of a Hamming ball, \(|B(\mathbf{y},(p+\varepsilon ')n)|\le 2^{H(p+\varepsilon ')n}\), so, using \(k\le (1-H(p+\varepsilon ))n\),
Substituting back, and using \(\sum _{\mathbf{y}\in B(E(\mathbf{m}),(p+\varepsilon ')n)}\Pr [\mathbf{y}\mid E(\mathbf{m})]\le 1\),
Taking \(\varepsilon '=\varepsilon /2\), since \(\varepsilon {\gt}0\), both \(e^{-(\varepsilon ')^2n/2}\) and \(2^{-n(H(p+\varepsilon )-H(p+\varepsilon '))}\) are \(2^{-\Omega (n)}\), so for a suitable \(\delta '=\delta '(\varepsilon ,p){\gt}0\) and large enough \(n\) the right-hand side is at most \(2^{-\delta 'n}\).
Let \(E^*:\{ 0,1\} ^k\to \{ 0,1\} ^n\) and \(D^*:\{ 0,1\} ^n\to \{ 0,1\} ^k\) satisfy
where \(\mathbf{m}\) is uniform on \(\{ 0,1\} ^k\). List the messages as \(\mathbf{m}_1,\ldots ,\mathbf{m}_{2^k}\) and let \(P_i=\Pr _{\mathbf{e}\sim \mathrm{BSC}_p}[D^*(E^*(\mathbf{m}_i)+\mathbf{e})\ne \mathbf{m}_i]\); assume \(P_1\le P_2\le \cdots \le P_{2^k}\). Then \(P_{2^{k-1}}\le 2\cdot 2^{-\delta 'n}\).
By definition of \(P_i\) and the hypothesis on \(E^*,D^*\),
Suppose, for the sake of contradiction, that \(P_{2^{k-1}}{\gt}2\cdot 2^{-\delta 'n}\). Since \(P_1\le \cdots \le P_{2^k}\), we have \(P_i\ge P_{2^{k-1}}\) for every \(i{\gt}2^{k-1}\), so, dropping half the summands from the sum,
This contradicts \(\frac{1}{2^k}\sum _{i=1}^{2^k}P_i\le 2^{-\delta 'n}\). Hence \(P_{2^{k-1}}\le 2\cdot 2^{-\delta 'n}\).
Let \(0\le p{\lt}\tfrac 12\) and \(0{\lt}\varepsilon \le \tfrac 12-p\). For large enough \(n\), there exists a real \(\delta {\gt}0\), an encoding function \(E:\{ 0,1\} ^k\to \{ 0,1\} ^n\) and a decoding function \(D:\{ 0,1\} ^n\to \{ 0,1\} ^k\) where \(k\le \lfloor (1-H(p+\varepsilon ))n\rfloor \), such that the following holds for every \(\mathbf{m}\in \{ 0,1\} ^k\):
We use the probabilistic method. Let \(k_0=\lfloor (1-H(p+\varepsilon ))n\rfloor +1\). Choose \(E:\{ 0,1\} ^{k_0}\to \{ 0,1\} ^n\) at random by picking \(E(\mathbf{m})\) independently and uniformly at random from \(\{ 0,1\} ^n\) for every \(\mathbf{m}\in \{ 0,1\} ^{k_0}\), and let \(D\) be the MLD function for \(E\).
By Lemma 147 (Step 1), there is \(\delta '=\delta '(\varepsilon ,p){\gt}0\) such that for every fixed \(\mathbf{m}\in \{ 0,1\} ^{k_0}\),
Taking the expectation of both sides over \(\mathbf{m}\) uniform on \(\{ 0,1\} ^{k_0}\) and switching the (finite, independent) order of the expectations over \(E\) and \(\mathbf{m}\),
By the probabilistic method, there is a choice \(E^*\) of \(E\) (and corresponding \(D^*\)) such that
Apply Lemma 148 to \((*)\): order the messages \(\mathbf{m}_1,\ldots ,\mathbf{m}_{2^{k_0}}\) so that \(P_1\le \cdots \le P_{2^{k_0}}\), where \(P_i=\Pr _{\mathbf{e}\sim \mathrm{BSC}_p}[D^*(E^*(\mathbf{m}_i)+\mathbf{e})\ne \mathbf{m}_i]\). Then \(P_{2^{k_0-1}}\le 2\cdot 2^{-\delta 'n}\), and hence \(P_i\le 2\cdot 2^{-\delta 'n}\) for every \(i\le 2^{k_0-1}\).
Define the new code with message set \(\{ \mathbf{m}_1,\ldots ,\mathbf{m}_{2^{k_0-1}}\} \): fixing an arbitrary bijection of this set with \(\{ 0,1\} ^{k_0-1}\), let \(E\) be the restriction of \(E^*\) to this message set (composed with the bijection), and let \(D\) be \(D^*\) composed with the same bijection on its output (and mapping any output outside the restricted message set to an arbitrary fixed message, say). This new code has dimension \(k=k_0-1=\lfloor (1-H(p+\varepsilon ))n\rfloor \), and every message has decoding error probability at most \(2\cdot 2^{-\delta 'n}=2^{-(\delta '-1/n)n}\). Setting \(\delta =\delta '-1/n\), which is positive for large enough \(n\), completes the proof.
A code \(C\) of block length \(n\) is called explicit if there exists a \(\mathrm{poly}(n)\)-time algorithm that computes a succinct description of \(C\) given \(n\). For linear codes, such a succinct description could be a generator matrix or a parity check matrix.
A linear \([n,k]\) code \(C\) is called strongly explicit if, given any index pair \((i,j)\in [k]\times [n]\), there is a \(\mathrm{poly}(\log n)\)-time algorithm that outputs \(G_{i,j}\), where \(G\) is a generator matrix of \(C\).
Let \(0\le p{\lt}\tfrac 12\) and \(0{\lt}\varepsilon \le \tfrac 12-p\). If an algorithm \(A\) can handle \(p+\varepsilon \) fraction of worst case errors (i.e. correctly recovers the transmitted message from up to \((p+\varepsilon )n\) adversarially chosen errors, for codewords of block length \(n\)), then \(A\) can be used for reliable communication over \(\mathrm{BSC}_p\).
By the additive Chernoff bound, with probability at least \(1-e^{-\varepsilon ^2n/2}\), the fraction of errors introduced by \(\mathrm{BSC}_p\) on a transmitted codeword is at most \(p+\varepsilon \). By the assumption on \(A\), whenever this event occurs \(A\) correctly recovers the transmitted message. Hence \(A\), used as the decoding algorithm, achieves decoding error probability at most \(e^{-\varepsilon ^2n/2}=2^{-\Omega (n)}\) over \(\mathrm{BSC}_p\), i.e. it can be used for reliable communication over \(\mathrm{BSC}_p\).