28 Appendix E. Some Information Theory Essentials

Definition 663 Entropy, Def E.1.2

Let

\(\)X\( be a discrete random variable. The entropy of \)X\(, denoted \)H(X)\(, is the average surprise for an outcome of \)X\(: \[ H(X) = \mathbb {E}_X\left(\log _2 \frac{1}{\Pr (X=x)}\right) = \sum _{x\in \mathrm{supp}(X)} p(x)\log _2\frac{1}{p(x)}, \] where \)p(x) := (X=x)\(, we define \)0_2=0\(, and \)supp(X)\( is the support of \)X\(, i.e. the set of values \)x\( such that \)(X=x)≠0\(. \)

Definition

Proposition 664 Basic properties of entropy

Let

\(\)X\( be a discrete random variable. \begin{enumerate} \item \end{enumerate}\)H(X)≥0\(, and \)H(X)=0\( if and only if \)X\( is deterministic. \item If \)X{0,1}\( with \)(X=1)=p\( and \)(X=0)=1-p\(, then \[ H(X) = p\log _2\frac{1}{p} + (1-p)\log _2\frac{1}{1-p} = H_2(p), \] the binary entropy function. \item If \)X\( is uniform on a set \){a_1,…,a_n}\( of size \)n\(, then \[ H(X) = \sum _{i=1}^n \frac1n \log _2\left(\frac{1}{1/n}\right) = \log _2 n. \] \)

Proposition

Proof ▶

Each term

\(\)p(x)_2(1/p(x))\( in the defining sum is non-negative, since \)p(x)(0,1]\( implies \)_2(1/p(x))≥0\(; a sum of non-negative terms is non-negative. This sum is \)0\( exactly when every term is \)0\(, i.e. when \)p(x){0,1}\( for every \)x\( in the support, i.e. when a single value has probability \)1\(, i.e. when \)X\( is deterministic. \item Substitute the two-point distribution \)p(1)=p\(, \)p(0)=1-p\( directly into the definition of \)H(X)\(. \item Substitute the uniform distribution \)p(a_i)=1/n\( for \)i=1,…,n\( directly into the definition of \)H(X)\(; every one of the \)n\( terms equals \)_2 n\(. \)

Proof ▶

Lemma 665 Jensen’s inequality

✓

\(\)f:RR\( is concave and \)Z\( is a real-valued random variable, then \)E[f(Z)] ≤f(E[Z])\(. \)

Lemma

Lemma 666 Lemma E.1.3

\(\)|supp(X)| = n\( then \)H(X) ≤_2 n\(. \)

Lemma

Proof ▶

Let

\(\)Z\( be the random variable that takes value \)1/p(x)\( with probability \)p(x)\(, for \)x\( ranging over \)supp(X)\(. Then \[ H(X) = \mathbb {E}_{x\leftarrow X}\left[\log _2\frac{1}{p(x)}\right] = \mathbb {E}_Z[\log _2 Z]. \] Applying Jensen's inequality (Lemma~ \ref{lem:appE-jensen}) to the concave function \)f(x)=_2 x\( gives \[ H(X) = \mathbb {E}_Z[\log _2 Z] \le \log _2\left(\mathbb {E}_Z[Z]\right) = \log _2\left(\sum _{x\in \mathrm{supp}(X)} p(x)\cdot \frac{1}{p(x)}\right) = \log _2 |\mathrm{supp}(X)| = \log _2 n. \qedhere \] \)

Proof ▶

Definition 667 Joint entropy, Def E.2.1

Let

\(\)X\( and \)Y\( be two (possibly correlated) discrete random variables. The joint entropy of \)X\( and \)Y\(, denoted \)H(X,Y)\(, is \[ H(X,Y) = \sum _{x,y} p(x,y)\log _2\frac{1}{p(x,y)}, \] where \)p(x,y) := (X=x ∧Y=y)\(. \)

Definition

Proposition 668 Joint entropy of independent variables

\(\)X\( and \)Y\( are independent, then \)H(X,Y) = H(X)+H(Y)\(. \)

Proposition

Proof ▶

\(\)X\( and \)Y\( are independent then \)p(x,y)=p(x)p(y)\( for all \)x,y\(, so \[ H(X,Y) = \sum _{x,y} p(x)p(y)\left(\log _2\frac{1}{p(x)} + \log _2\frac{1}{p(y)}\right) = \sum _x p(x)\log _2\frac1{p(x)} \sum _y p(y) + \sum _y p(y)\log _2\frac1{p(y)} \sum _x p(x) = H(X)+H(Y), \] using that \)∑_x p(x) = ∑_y p(y) = 1\(. \)

Proof ▶

Definition 669 Conditional entropy, Def E.2.2

For random variables

\(\)X\( and \)Y\(, write \)p(y|x) := (Y=y|X=x)\(, and let \)H(Y|X=x)\( denote the entropy of \)Y\( under the conditional distribution \)p(·|x)\(. The conditional entropy of \)Y\( given \)X\( is \[ H(Y|X) = \mathbb {E}_x\big[H(Y|X=x)\big]. \] \)

Definition

Lemma 670 Lemma E.2.3

\(\)H(X,Y) = H(X) + H(Y|X)\(. Symmetrically, \)H(X,Y) = H(Y) + H(X|Y)\(. \)

Lemma

Proof ▶

Using

\(\)p(x,y) = p(x)p(y|x)\(, \begin{align*} H(X,Y) & = \sum _{x,y} p(x,y)\log _2\frac{1}{p(x,y)} = \sum _{x,y} p(x,y)\log _2\frac{1}{p(x)p(y|x)}\\ & = \sum _{x,y} p(x,y)\log _2\frac{1}{p(x)} + \sum _{x,y} p(x,y)\log _2\frac{1}{p(y|x)}\\ & = \sum _x p(x)\log _2\frac{1}{p(x)} + \sum _x p(x) \sum _y p(y|x)\log _2\frac1{p(y|x)}\\ & = H(X) + \sum _x p(x)\, H(Y|X=x) = H(X) + \mathbb {E}_x[H(Y|X=x)] = H(X)+H(Y|X). \end{align*} The symmetric statement follows by exchanging the roles of \)X\( and \)Y\( throughout, since \)H(X,Y)\( is symmetric in \)X,Y\( by definition. \)

Proof ▶

Lemma 671 Self-conditional entropy

\(\)H(X|X) = 0\(. \)

Lemma

Proof ▶

For every

\(\)x\( with \)p(x)>0\(, conditioned on \)X=x\( the random variable \)X\( is deterministic (it equals \)x\( with probability \)1\(), so \)H(X|X=x) = 0\( for every \)x\(. Hence \)H(X|X) = E_x[H(X|X=x)] = 0\(. \)

Proof ▶

Lemma 672 Chain rule for probability

✓

For random variables

\(\)X_1,…,X_n\(, \[ p(x_1,\dots ,x_n) = p(x_1)\, p(x_2|x_1)\cdots p(x_n|x_1,\dots ,x_{n-1}). \] \)

Lemma

Theorem 673 Chain rule, Thm E.2.4

For random variables

\(\)X,Y,Z\(, \[ H(X,Y,Z) = H(X) + H(Y|X) + H(Z|X,Y). \] More generally, for \)n\( random variables \)X_1,…,X_n\(, \[ H(X_1,X_2,\dots ,X_n) = H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_1,\dots ,X_{n-1}). \] \)

Theorem

Proof ▶

Viewing the pair

\(\)(Y,Z)\( as a single random variable, Lemma \ref{lem:appE-chain-rule-2} gives \)H(X,Y,Z) = H(X,(Y,Z)) = H(X) + H(Y,Z|X)\(. Applying the same two-variable chain rule to the joint entropy \)H(Y,Z|X)\(, working throughout inside the conditional distribution given \)X\(, gives \)H(Y,Z|X) = H(Y|X) + H(Z|X,Y)\(. Combining the two displays proves the three-variable case. The general \)n\(-variable case follows by induction on \)n\(: the base case \)n=1\( is trivial, and the case \)n=2\( is Lemma \ref{lem:appE-chain-rule-2}. For the inductive step, view \)(X_1,…,X_n-1)\( as a single random variable \)W\(; Lemma \ref{lem:appE-chain-rule-2} gives \)H(X_1,…,X_n) = H(W,X_n) = H(W) + H(X_n|W) = H(X_1,…,X_n-1) + H(X_n|X_1,…,X_n-1)\(, and the inductive hypothesis expands \)H(X_1,…,X_n-1)\( into the desired sum of the first \)n-1\( conditional entropy terms. This mirrors, term by term, the chain rule for probability (Lemma~ \ref{lem:appE-prob-chain-rule}), since the log turns the product of conditional probabilities into a sum of conditional entropies. \)

Proof ▶

Example 674 Example E.2.5

Let

\(\)X\( be a random variable uniform on \){0,1,2,3}\(, and let \)Y = X 2\(. Then: \begin{itemize} \item \end{itemize}\)H(X) = 2\(, since \)X\( is uniform on \)4\( values. \item \)H(Y) = 1\(, since \)Y\( is uniform on \){0,1}\(. \item \)H(X|Y) = 1\(: knowing \)Y\( tells us whether \)X\( is odd or even, and conditioned on this, \)X\( is uniform over the \)2\( remaining possible values. \item \)H(Y|X) = 0\(: knowing \)X\( tells us the exact value of \)Y\(. \item \)H(X,Y) = 2\(: the pair \)(X,Y)\( carries exactly the information \)X\( carries, since \)Y\( is a deterministic function of \)X\(. \)

Example

Lemma 675 Conditioning cannot increase entropy, Lemma E.2.6

\(\)H(Y|X) ≤H(Y)\(. \)

Lemma

Proof ▶

By Lemma 670,

\(\)H(X,Y) = H(X) + H(Y|X)\(, so \[ H(Y|X) - H(Y) = H(X,Y) - H(X) - H(Y) = \sum _{x,y} p(x,y)\log _2\frac1{p(x,y)} - \sum _x p(x)\log _2\frac1{p(x)} - \sum _y p(y)\log _2\frac1{p(y)}. \] Using \)p(x) = ∑_y p(x,y)\( and \)p(y) = ∑_x p(x,y)\( to rewrite the last two sums as sums over \)x,y\(, this becomes \[ H(Y|X) - H(Y) = \sum _{x,y} p(x,y)\log _2\frac{p(x)p(y)}{p(x,y)}. \] Let \)Z\( be the random variable taking value \)p(x)p(y)/p(x,y)\( with probability \)p(x,y)\(. Then \)H(Y|X)-H(Y) = E_x,y[_2 Z]\(, and by Jensen's inequality (Lemma~ \ref{lem:appE-jensen}) applied to the concave function \)_2\(, \[ \mathbb {E}_{x,y}[\log _2 Z] \le \log _2\big(\mathbb {E}[Z]\big) = \log _2\left(\sum _{x,y} p(x,y)\cdot \frac{p(x)p(y)}{p(x,y)}\right) = \log _2\left(\Big(\sum _x p(x)\Big)\Big(\sum _y p(y)\Big)\right) = \log _2 1 = 0. \] Hence \)H(Y|X) - H(Y) ≤0\(. \)

Proof ▶

Corollary 676 Corollary E.2.7

For random variables

\(\)X\( and \)Y\(, \)H(X,Y) ≤H(X)+H(Y)\(. More generally, for random variables \)X_1,…,X_n\(, \[ H(X_1,\dots ,X_n) \le \sum _{i=1}^n H(X_i). \] \)

Corollary

Proof ▶

By Lemma 670,

\(\)H(X,Y) = H(X)+H(Y|X)\(, and by Lemma \ref{lem:appE-cond-no-increase}, \)H(Y|X)≤H(Y)\(; combining gives \)H(X,Y)≤H(X)+H(Y)\(. For the general case, the chain rule (Theorem \ref{thm:appE-chain-rule}) gives \)H(X_1,…,X_n) = ∑_i=1^n H(X_i|X_1,…,X_i-1)\(. The same Jensen's-inequality argument used to prove Lemma \ref{lem:appE-cond-no-increase}, with \)X\( replaced by the tuple \)(X_1,…,X_i-1)\( and \)Y\( replaced by \)X_i\(, shows \)H(X_i|X_1,…,X_i-1) ≤H(X_i)\( for every \)i\(. Summing over \)i\( gives \)H(X_1,…,X_n) ≤∑_i=1^n H(X_i)\(. \)

Proof ▶

Definition 677 Mutual information, Def E.3.1

The mutual information between random variables

\(\)X\( and \)Y\(, denoted \)I(X;Y)\(, is \[ I(X;Y) = H(X) - H(X|Y). \] \)

Definition

Proposition 678 Equivalent forms of mutual information

\[ I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y|X) = I(Y;X). \]

Proof ▶

By Lemma 670 applied to

\(\)(X,Y)\(, \)H(X,Y) = H(Y) + H(X|Y)\(, so \)H(X|Y) = H(X,Y) - H(Y)\(, and hence \[ I(X;Y) = H(X) - H(X|Y) = H(X) - \big(H(X,Y)-H(Y)\big) = H(X)+H(Y)-H(X,Y). \] This last expression is manifestly symmetric in \)X\( and \)Y\(. Also, by Lemma \ref{lem:appE-chain-rule-2} applied the other way, \)H(X,Y) = H(X) + H(Y|X)\(, so \)H(X)+H(Y)-H(X,Y) = H(Y) - H(Y|X)\(. Hence \)I(X;Y) = H(Y)-H(Y|X) = I(Y;X)\(. \)

Proof ▶

Lemma 679 Lemma E.3.2

\(\)I(X;Y) ≥0\(. Moreover, if \)X\( and \)Y\( are independent, then \)I(X;Y) = 0\(. \)

Lemma

Proof ▶

Since

\(\)I(X;Y) = H(X) - H(X|Y)\(, and by Lemma \ref{lem:appE-cond-no-increase} (with the roles of \)X,Y\( exchanged), \)H(X|Y) ≤H(X)\(, we get \)I(X;Y) ≥0\(. If \)X\( and \)Y\( are independent, then by Proposition \ref{prop:appE-indep-joint-entropy}, \)H(X,Y) = H(X)+H(Y)\(, so by Proposition \ref{prop:appE-mutual-info-equiv}, \)I(X;Y) = H(X)+H(Y)-H(X,Y) = 0\(. \)

Proof ▶

Example 680 Example E.3.3

For

\(\)X,Y\( as in Example \ref{ex:appE-XY-mod2}, \[ I(X;Y) = H(X) - H(X|Y) = 2 - 1 = 1. \] \)

Example

Example 681 Example E.3.4

Let

\(\)Z_1,…,Z_7\( be i.i.d.\ random variables uniform over \){0,1}\(. Let \)X = Z_1Z_2Z_3Z_4Z_5\( and \)Y = Z_4Z_5Z_6Z_7\(. Then \)I(X;Y) = 2\(, since \)X\( and \)Y\( share exactly the \)2\( bits \)Z_4, Z_5\( in common. \)

Example

Definition 682 Conditional mutual information, Def E.3.5

The conditional mutual information between

\(\)X\( and \)Y\( given \)Z\( is \[ I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = H(Y|Z) - H(Y|X,Z). \] \)

Definition