2 Main Result: codes for square-integrable repeat channels
Fix a square-integrable repeat channel \(\mathrm{RC}_\mathcal{D}\). For every \(\varepsilon {\gt}0\) there exists a sound code \(\mathcal{C}\) with rate \(R\) for the \(\mathrm{RC}_\mathcal{D}\) such that \(R \ge \mathrm{Cap}(\mathrm{RC}_\mathcal{D})-\varepsilon \), with linear-time encoding and quasi-linear-time decoding. Moreover the decoder has probability of failure \(e^{-\Omega (n)}\).
Use the construction \((\mathsf{Enc},\mathsf{Dec})\) of Definition 24, whose rate is \(R = km/(\mathrm{Cap}(\mathrm{RC}_\mathcal{D})-\psi (\varepsilon ,\delta ,\eta ,k,m))\) with \(\psi \to 0\) as \(\varepsilon ,\delta ,\eta \to 0\) and \(k,m\to \infty \); by Lemma 25 the information rate of \(\mathrm{TRC}_\mathcal{D}\) equals \(\mathrm{Cap}(\mathrm{RC}_\mathcal{D})\), and by Proposition 28 an approximately balanced inner code of rate \(\ge \mathrm{Cap}(\mathrm{RC}_\mathcal{D})-\varepsilon \) exists, so choosing \(\varepsilon ,\delta ,\eta \) small and \(m\) large makes \(R \ge \mathrm{Cap}(\mathrm{RC}_\mathcal{D})-\varepsilon \).
It remains to show the decoder \(\mathsf{Dec}\) of Definition 24 succeeds with high probability for a suitable inner blocklength \(m\). There are four sources of error; the first three concern identifying the all-zero buffers \(0^b\) (\(b=\eta m\)), the fourth is inner-code failure.
For a sent buffer \(0^b\), fewer than \(\tfrac {b}{2}=\tfrac {\eta m}{2}\) zeros survive, so the buffer is not identified as such.
All ones in an inner codeword are deleted, merging two adjacent buffers.
A received substring longer than \(\tfrac {n}{2}\eta m\) arrives all-zero, creating a spurious buffer.
For a correctly identified received inner word, the inner code decoding fails.
Translated to the outer alphabet \(\Sigma \), error (1) (merging two codewords into one) costs edit distance \(3\) (a deletion plus an insertion of a third letter), error (2) costs \(1\) (a deletion), error (3) costs \(3\) (one deletion, two insertions), and error (4) costs \(2\) (a substitution). If each error type occurs at most \(k\delta /9\) times, the total edit distance is at most \(\tfrac {k\delta }{9}(3+1+3+2)=k\delta \), which the outer code of Theorem 22 corrects. So it suffices to show each error type occurs more than \(k\delta /9\) times only with vanishing probability.
Error (1). By Chebyshev’s inequality (Theorem 5), \(\mathbb {P}(|\mathrm{RC}_\mathcal{D}\, 0^{m\eta }| {\lt} \tfrac {\eta }{2}m) = O(m^{-1})\), so for \(m\) a large constant this is \({\lt}\delta /10\); since these events are independent across the \(k-1\) buffers, the number of failing buffers is \(\mathrm{Binomial}(k-1,p)\) with \(p\le \delta /10\) (Definition 7), and by concentration (Theorem 8) the probability of more than \(k\delta /9\) errors vanishes as \(k\to \infty \).
Error (2). By Proposition 28 each inner codeword has \(\ge \gamma m\) ones for some \(\gamma {\gt}0\) independent of \(m\), so error (2) occurs with probability \(d^{\gamma m}\) (where \(d=\mathbb {P}(\mathcal{D}=0)\)); taking \(m\) large so that \(d^{\gamma m}{\lt}\delta /10\) and concentrating (Theorem 8) gives \({\gt}k\delta /9\) errors with vanishing probability.
Error (3). Suppose a received all-zero string \(s\) with \(|s|\ge \tfrac {n}{2}\eta m\) arises from a codeword \(x\in \mathcal{C}_{in}\). Then either (a) some substring \(\widetilde s\) of length \({\gt}\tfrac 14\eta m\) of the input had all its ones deleted, or (b) some substring \(\widetilde s\) of length \(\le \tfrac 14\eta m\) at the input gave rise to an output of length \(\ge \tfrac 12\mu \eta m\). For (a): by Proposition 28 with \(\zeta =\tfrac 12\mu \eta \) we have \(w(\widetilde s)\ge \gamma \zeta m\), so the probability such an \(s\) exists is \({\lt}d^{\gamma \zeta m}\), and by a union bound (Theorem 6) over the \(O(1)\) choices of initial substring \(s\) this is \({\lt}\delta /20\) for \(m\) large. For (b): if \(Z=X_1+\cdots +X_t\) with \(t=\tfrac 14\eta m\) and \(X_j\sim \mathcal{D}\), then by Chebyshev’s inequality (Theorem 5)
and a union bound (Theorem 6) over \(O(1)\) initial substrings makes this \(\le \delta /20\) for \(m\) large. Hence error (3) occurs with probability \(\le \delta /10\) per word, and by concentration (Theorem 8) more than \(k\delta /9\) such errors occur with vanishing probability.
Error (4). The probability of an inner decoding failure goes to \(0\) as \(m\to \infty \) by soundness of the inner code for the \(\mathrm{TRC}_\mathcal{D}\) (Proposition 28); for \(m\) large this is \({\lt}\delta /10\), and as above \({\gt}k\delta /9\) failures occur with vanishing probability.
Finally, the failure probability is \(e^{-\Omega (n)}\): the frequency of each error type is a \(\mathrm{Binomial}(t,p)\) variable with \(t=k-1\) or \(t=k\) and \(p\le \delta /10\), so by the Chernoff bound (Theorem 8) and a union bound (Theorem 6) over the four types, the probability of edit distance \({\gt}k\delta /9\) is \(e^{-\Omega (n)}\). The encoding of Definition 24 runs in linear time (the outer code runs in linear time and each inner block in \(O(1)\) time) and the decoding in quasi-linear time (buffer identification is linear; the outer decoder of Theorem 22 is quasi-linear).
Fix the inner blocklength \(m=O(1)\), set \(|\Sigma |=2^m\), and let \(\mathcal{C}_{in}\subseteq \{ 0,1\} ^m\) be an approximately balanced inner code for the \(\mathrm{TRC}_\mathcal{D}\) of rate \(R-\varepsilon \) with (not necessarily efficient) encoder \(\mathsf{Enc}_{\mathrm{in}}\) and decoder \(\mathsf{Dec}_{\mathrm{in}}\) (existence by Proposition 28; the \(O(1)\)-size code is found by exhaustive search in \(O(1)\) time). The encoder \(\mathsf{Enc}:\{ 0,1\} ^{km}\to \{ 0,1\} ^n\) acts on \(x\in \{ 0,1\} ^{km}\) as follows.
Split \(x\) into \(x_1,\dots ,x_k\) with \(|x_j|=m\), viewing each \(x_j\) as a letter of \(\Sigma \), so \(x\in \Sigma ^k\). Apply the outer encoder \(\mathsf{Enc}_{\mathrm{out}}\) of Theorem 22 to get \(\widetilde x=\mathsf{Enc}_{\mathrm{out}}(x)\in \Sigma ^{k/(1-\delta -\varepsilon )}\).
Split \(\widetilde x\) into \(\widetilde x_1,\dots ,\widetilde x_{k'}\) with \(k'=k/(1-\delta -\varepsilon )\), view each \(\widetilde x_j\in \{ 0,1\} ^m\), and encode it with the inner code to get \(\widehat x_j=\mathsf{Enc}_{\mathrm{in}}(\widetilde x_j)\in \{ 0,1\} ^{m/(R-\varepsilon )}\) (the inner rate is \(R-\varepsilon \) for \(m\) large).
Concatenate the \(\widehat x_j\) with all-zero buffers \(0^b\), \(b=b(m)=\eta m\):
\[ \mathsf{Enc}(x) = \widehat x_1\, 0^b\, \widehat x_2\, 0^b\cdots 0^b\, \widehat x_{k'}, \]where \(b\) is a constant independent of \(n=k'\cdot (\tfrac {m}{R-\varepsilon }+b) = km/(\mathrm{Cap}(\mathrm{RC}_\mathcal{D})-\psi (\varepsilon ,\delta ,\eta ,k,m))\) with \(\psi \to 0\).
The decoder \(\mathsf{Dec}\) on a received string \(y\in \{ 0,1\} ^*\) reverses the steps:
Identify buffers: interpret any maximal contiguous block of \(\ge \tfrac {n}{2}\eta m\) zeros as a buffer; remove buffers to get the received inner strings \(y_1,\dots ,y_\ell \).
Decode each \(y_j\) with the inner code: \(\widetilde y_j=\mathsf{Dec}_{\mathrm{in}}(y_j)\in \{ 0,1\} ^m\) for \(j\le \ell \).
Interpret each \(\widetilde y_j\) as a letter of \(\Sigma \) and decode the concatenation \(\widetilde y=\widetilde y_1\cdots \widetilde y_\ell \in \Sigma ^\ell \) with the outer decoder: \(\mathsf{Dec}(y)=\mathsf{Dec}_{\mathrm{out}}(\widetilde y)\in \{ 0,1\} ^{km}\).
Let \(\mathrm{RC}_\mathcal{D}\) be a square-integrable repeat channel. Then the information rate of the \(\mathrm{TRC}_\mathcal{D}\) is \(\mathrm{Cap}(\mathrm{RC}_\mathcal{D})\).
Let \(X\) be supported on \(\{ 0,1\} ^n\) and \(Y=\mathrm{RC}_\mathcal{D}\, X\). Let \(L=\ell (Y)\), \(R=r(Y)\) be the (random) indices that mark the all-zero substrings TRIM would delete from \(Y\), and let \(\widetilde L,\widetilde R\) be the indices of the bits in \(X\) that pass through to indices \(L,R\) in \(Y\). By the Generalized Shannon’s Theorem (Theorem 21) it suffices to prove Claim 26, which gives both \(|I(X;Y)-I(X;Y\mid \widetilde L,\widetilde R)|=o(n)\) and \(\lim _n\tfrac 1n\sup _X I(X;Y\mid \widetilde L,\widetilde R) = \lim _n\tfrac 1n\sup _X I(X;Y')\) where \(Y'=\mathrm{TRC}_\mathcal{D}\, X\).
We may assume \(\mathcal{D}\) has support bounded linearly in \(n\): there is a deterministic \(B{\gt}0\) with \(R\le Bn\) a.s. for \(R\sim \mathcal{D}\). If \(\mathcal{D}\) has unbounded support, replace it by its truncation \(\mathcal{D}_n\) at \(Bn\) (renormalize the mass on \(\{ 0,\dots ,Bn\} \)). For each \(x\in \{ 0,1\} ^n\), \(\mathrm{TRC}_{\mathcal{D}_n}\, x \overset {\mathcal{D}}{=}\mathrm{TRC}_\mathcal{D}\, x\) and \(\mathrm{RC}_{\mathcal{D}_n}\, x\overset {\mathcal{D}}{=} \mathrm{RC}_\mathcal{D}\, x\) outside the event \(\bigcup _{i=1}^n\{ R_i{\gt}Bn\} \), which by Chebyshev’s inequality (Theorem 5) and a union bound (Theorem 6) has probability \(O(n^{-1})\to 0\). By Lemma 27 the information rates of \(\mathrm{TRC}_{\mathcal{D}_n}\) and \(\mathrm{TRC}_\mathcal{D}\) (and of \(\mathrm{RC}_{\mathcal{D}_n}\) and \(\mathrm{RC}_\mathcal{D}\)) coincide. This reduces the lemma to the bounded case, which is Claim 26; combining with Theorem 21 finishes the proof.
Assume \(\mathcal{D}\) is bounded by \(Bn\). Let \(Y=\mathrm{RC}_\mathcal{D}\, X\) and \(Y'=\mathrm{TRC}_\mathcal{D}\, X\), and let \(\widetilde L,\widetilde R\) be as in Lemma 25. Then
First equation. Write \(I(X;Y\mid \widetilde L,\widetilde R)=H(X\mid \widetilde L,\widetilde R) -H(X\mid Y,\widetilde L,\widetilde R)\). Since \(H(X\mid \widetilde L,\widetilde R) \le H(X)\) and, by the chain rule (Lemma 12), \(H(X\mid \widetilde L,\widetilde R)=H(X,\widetilde L,\widetilde R) -H(\widetilde L,\widetilde R)\ge H(X)-H(\widetilde L,\widetilde R)\), we get \(|H(X)-H(X\mid \widetilde L,\widetilde R)|\le H(\widetilde L,\widetilde R)\), and similarly \(|H(X\mid Y)-H(X\mid Y,\widetilde L,\widetilde R)|\le H(\widetilde L,\widetilde R)\). By the triangle inequality \(|I(X;Y)-I(X;Y\mid \widetilde L,\widetilde R)|\le 2H(\widetilde L,\widetilde R) \le 4\log n=o(n)\), since \((\widetilde L,\widetilde R)\) is supported on \([n]^2\).
Second equation. By the chain rule (Lemma 12), splitting \(Y\) into its prefix \(Y_1^L\), middle \(Y_{L+1}^{R-1}\) and suffix \(Y_R^n\),
where the penultimate inequality holds because \(Y_1^L\) and \(Y_R^n\) are all-zero strings, uniquely determined once \(|Y|,L,R\) are given, and the last because \((L,R,|Y|)\) is supported on \([Bn]^3\) so its entropy is \(O(\log n)\). By the same argument as for the first equation, \(|I(X;Y_{L+1}^{R-1}\mid \widetilde L,\widetilde R) -I(X;Y_{L+1}^{R-1})|=o(n)\), and since \(Y_{L+1}^{R-1}=\mathrm{TRIM}\, Y\) we have \(Y_{L+1}^{R-1}=Y'\), giving \(|I(X;Y\mid \widetilde L,\widetilde R)-I(X;Y')|=o(n)\), hence the claimed equality of the normalized suprema.
Let \(\mathsf{Ch}_1,\mathsf{Ch}_2:\Omega \times \{ 0,1\} ^*\to \{ 0,1\} ^*\) be two channels. Suppose there is a sequence \(p_n\ge 0\) with \(p_n\to 1\) such that for every \(x\in \{ 0,1\} ^n\) there is a set \(A_x\subseteq \Omega \) with: conditioned on \(A_x\), the distributions of \(\mathsf{Ch}_1\, x\) and \(\mathsf{Ch}_2\, x\) are the same, and \(\mathbb {P}(A_x)\ge p_n\). Then the information rates of \(\mathsf{Ch}_1\) and \(\mathsf{Ch}_2\) are equal.
Let \(Y_1=\mathsf{Ch}_1\, X\), \(Y_2=\mathsf{Ch}_2\, X\) and let \(\mathbf{1}_{A_X}\) be the indicator of \(A_X\). By the chain rule (Lemma 12),
and likewise \(I(X,\mathbf{1}_{A_X};Y_i)=I(X;Y_i)+I(\mathbf{1}_{A_X};Y_i\mid X)\) gives \(I(X;Y_i)=I(X;Y_i\mid \mathbf{1}_{A_X})+I(\mathbf{1}_{A_X};Y_i) -I(\mathbf{1}_{A_X};Y_i\mid X)\ge I(X;Y_i\mid \mathbf{1}_{A_X})-2\log 2\). So \(|I(X;Y_i)-I(X;Y_i\mid \mathbf{1}_{A_X})|\le 2\log 2=o(n)\), and it suffices to show \(\lim _n\tfrac 1n I(X;Y_1\mid \mathbf{1}_{A_X})=\lim _n\tfrac 1n I(X;Y_2\mid \mathbf{1}_{A_X})\). But
where the dropped terms vanish because \(0\le I(X;Y_i\mid A_X^c)\le n\) and \(\mathbb {P}(A_X^c)\to 0\), and the middle equality is the hypothesis (conditioned on \(A_X\), \(Y_1\) and \(Y_2\) have the same distribution). This proves equality of the information rates.
Fix a square-integrable \(\mathcal{D}\)-repeat channel \(\mathsf{Ch}=\mathrm{RC}_\mathcal{D}\), or its trimming version \(\mathsf{Ch}=\mathrm{TRC}_\mathcal{D}\), with information rate \(\mathcal{I}\). For every \(\zeta ,\varepsilon \in (0,1)\) there exists \(\gamma \in (0,\tfrac 12)\) and a family of codes \(\mathcal{C}_n\subseteq \{ 0,1\} ^n\) for \(\mathsf{Ch}\) with rate \(R\ge \mathcal{I}-\varepsilon \) such that for every codeword \(c\in \mathcal{C}_n\) and every \(i\in [n-\zeta n]\),
i.e. every length-\(\zeta n\) substring of every codeword is approximately balanced.
By the Generalized Shannon’s Theorem (Theorem 21) the information rate \(\mathcal{I}\) is achieved by a stationary ergodic input process \(\{ X_j\} _{j\ge 0}\) (for the trimming channel, Lemma 25 shows the same statement holds), i.e. \(\mathcal{I}=\lim _n\tfrac 1n I(X_1^n;Y^n)\) with \(Y^n=\mathsf{Ch}\, X_1^n\). Let \(P:=\mathbb {P}(X_1=1)\). By stationarity \(P\in (0,1)\) (else \(\{ X_j\} \) is trivial and does not achieve the rate). By Birkhoff’s pointwise ergodic theorem (Theorem 13, via Definition 20), almost surely \(P=\lim _{t}\tfrac 1t\sum _{j=1}^t\mathbf{1}\{ X_j=1\} =\lim _t\tfrac 1tw(X_1^t)\).
Setting \(t=\zeta n\): for any \(\delta {\gt}0\), with probability \(p_n\to 1\) we have \((P-\delta )\zeta n\le w(X_1^{\zeta n})\le (P+\delta )\zeta n\). Picking \(\delta ,\gamma \) small enough ensures \(\gamma \zeta n\le w(X_1^{\zeta n}) \le (1-\gamma )\zeta n\) with probability \(p_n\). To extend to all substrings: by stationarity \(X_{i\zeta n+1}^{(i+1)\zeta n}\overset {\mathcal{D}}{=}X_1^{\zeta n}\) for all \(i\in [1/\zeta ]\), so by a union bound (Theorem 6) over the constant \(1/\zeta \) disjoint blocks, with probability \(\widetilde p_n\to 1\) all blocks are balanced simultaneously. Each substring \(x_i^{i+3\zeta n}\) contains at least one block substring of the form \(x_{j\zeta n+1}^{(j+1)\zeta n}\), so \(\tfrac {\gamma }{3}\cdot 3\zeta n\le w(x_i^{i+3\zeta n})\le (1-\tfrac {\gamma }{3}) \cdot 3\zeta n\) for all \(i\in [n-\zeta n]\) with probability \(\widetilde p_n\); re-setting \(\overline\zeta =3\zeta \) and \(\overline\gamma =\gamma /3\) yields the property of the lemma with probability \(\widetilde p_n\).
Finally, we may extract a family of codes \(\widehat\mathcal{C}_n\) of rate \(R\ge \lim _n\tfrac 1n I(X_1^n;Y^n)-\varepsilon =\mathcal{I}-\varepsilon \) from \(\{ X_j\} \) by sampling, as in the standard proof of Shannon’s theorem (Dobrushin; also Cover–Thomas Thm. 7.7.1). Since with high probability the sampled process satisfies the balance condition, we discard from \(\widehat\mathcal{C}_n\) any codewords that fail it, obtaining \(\mathcal{C}_n\) of the same rate.