# CSE5313 Coding and information theory for data science (Lecture 15)

## Information theory

Transmission, processing, extraction, and utilization of information.

- Information: "Resolution" of uncertainty.
- Question in the 1940's: How to quantify the complexity of information?
- Questions:
  - How many bits are required to **describe** an information source?
  - How many bits are required to **transmit** an information source?
  - How much information does one source **reveal** about another?
- Applications:
  - Data Compression.
  - Channel Coding.
  - Privacy.
- Claude Shannon 1948: Information Entropy.

### Entropy

The "information value" of a message depends on how surprising it is.

- The more unlikely the event, the more informative the message.

The **Shannon information** of an event $E$:

$$
I(E)=\log_2 \frac{1}{Pr(E)}=-\log_2 Pr(E)
$$

Entropy is the expected amount of information in a random trial.

- Rolling a die has more entropy than flipping a coin (6 equally likely outcomes vs. 2).

#### Information entropy

Let $X$ be a random variable with values in some finite set $\mathcal{X}$. The entropy $H(X)$ of $X$ is defined as:

$$
H(X)=\sum_{x\in \mathcal{X}} Pr(X=x)\log_2 \frac{1}{Pr(X=x)}=-\mathbb{E}_{x\sim X}[\log_2 Pr(X=x)]
$$

Idea: How many bits are required, on average, to describe $X$?

- The more unlikely the event, the more informative the message.
- Use "few" bits for common $x$ (i.e., $Pr(x)$ large).
- Use "many" bits for rare $x$ (i.e., $Pr(x)$ small).

Notes:

- $H(X)=\mathbb{E}_{x\sim X}[I(X=x)]$
- $H(X)\geq 0$
- $H(X)=0$ if and only if $Pr(X=x)=1$ for some $x\in \mathcal{X}$.
- Does not depend on the values in $\mathcal{X}$, only on the probability distribution $Pr(X=x)$.
- Maximum entropy is achieved by the uniform distribution: $H(Uniform(n))=\log_2 n$, where $n$ is the number of outcomes ($a$ bits are required to describe each outcome of a uniform distribution over $2^a$ outcomes).
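A small numerical sketch (my own illustration, not from the lecture) of the first note above: averaging the Shannon information $I(X=x)$ over the distribution of $X$ gives the entropy $H(X)$. The distribution below is an arbitrary choice.

```python
import math

# An arbitrary distribution over 4 outcomes (sums to 1).
probs = [1/2, 1/4, 1/8, 1/8]

# Shannon information of each outcome: I(X=x) = -log2 Pr(X=x).
info = [-math.log2(p) for p in probs]
print("I(X=x) per outcome:", [round(i, 3) for i in info])   # [1.0, 2.0, 3.0, 3.0]

# Entropy is the expected Shannon information: H(X) = E[I(X=x)].
H = sum(p * i for p, i in zip(probs, info))
print("H(X) =", H)   # 1.75 bits
```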
**Example**

For the uniform distribution $X\sim Uniform\{0,1\}$, we have $H(X)=\log_2 2=1$.

For the Bernoulli distribution $X\sim Bernoulli(p)$, we have $H(X)=-p\log_2 p-(1-p)\log_2 (1-p)=H(p)$, the binary entropy function.
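To make the example concrete, here is a minimal sketch (my own code; the helper name `entropy` is an arbitrary choice) that evaluates $H$ for uniform and Bernoulli distributions:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum p*log2(p); zero-probability outcomes contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform distributions: H(Uniform(n)) = log2(n).
for n in (2, 6):
    print(f"H(Uniform({n})) = {entropy([1 / n] * n):.4f}   (log2 {n} = {math.log2(n):.4f})")

# Binary entropy function H(p) for Bernoulli(p); maximal at p = 1/2.
for p in (0.5, 0.1, 0.01):
    print(f"H(Bernoulli({p})) = {entropy([p, 1 - p]):.4f}")
```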
### Motivation

Optimal compression: Consider $X$ with the distribution given below:

|Value|Probability|Encoding|
|-----|-----------|--------|
|1    | 1/8       |000     |
|2    | 1/8       |001     |
|3    | 1/4       |01      |
|4    | 1/2       |1       |

The average length of the encoding is $1/8\times 3+1/8\times 3+1/4\times 2+1/2\times 1=7/4$.

And the entropy of $X$ is $H(X)=1/8\times \log_2 8+1/8\times \log_2 8+1/4\times \log_2 4+1/2\times \log_2 2=7/4$.

So the average length of the encoding is equal to the entropy of $X$: this is the optimal compression, and it is achieved by Huffman coding.

#### A few extra theorems that will not be proved in this course

- **Theorem**: The average number of bits in any prefix-free compression of $X$ is ≥ $H(X)$.
- **Theorem**: Huffman coding is optimal.
  - I.e., no prefix-free code achieves a smaller average number of bits (and when all probabilities are powers of $2$, as above, the average equals $H(X)$).
- **Disadvantage of Huffman coding**: the distribution must be known.
  - Generally not the case.
- **[Lempel-Ziv 1978]**: Universal lossless data compression.

### Conditional and joint entropy

How does the entropy of different random variables interact?

- As a function of their dependence.
- Needed in order to relate:
  - A variable and its compression.
  - A variable and its transmission over a noisy channel.
  - A variable and its encryption.

#### Definition for joint entropy

Let $X$ and $Y$ be discrete random variables. The joint entropy $H(X,Y)$ of $X$ and $Y$ is defined as:

$$
H(X,Y)=-\sum_{x\in \mathcal{X}, y\in \mathcal{Y}} Pr(X=x, Y=y) \log_2 Pr(X=x, Y=y)
$$

Notes:

- $H(X,Y)\geq 0$
- $H(X,Y)=H(Y,X)$
- $H(X,Y)\geq \max\{H(X),H(Y)\}$
- $H(X,Y)\leq H(X)+H(Y)$ with equality if and only if $X$ and $Y$ are independent.

#### Conditional entropy

> [!NOTE]
>
> Recall that the conditional probability $P(Y|X)$ is defined as: $P(Y|X)=\frac{P(Y,X)}{P(X)}$.

What is the average amount of information revealed by $Y$ given the distribution of $X$?

For each given $x\in \mathcal{X}$, the conditional entropy $H(Y|X=x)$ is defined as:

$$
\begin{aligned}
H(Y|X=x)&=\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 \frac{1}{Pr(Y=y|X=x)} \\
&=-\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 Pr(Y=y|X=x)
\end{aligned}
$$

The conditional entropy $H(Y|X)$ is defined as:

$$
\begin{aligned}
H(Y|X)&=\mathbb{E}_{x\sim X}[H(Y|X=x)] \\
&=\sum_{x\in \mathcal{X}} Pr(X=x)H(Y|X=x) \\
&=-\sum_{x\in \mathcal{X}} Pr(X=x)\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 Pr(Y=y|X=x) \\
&=-\sum_{x\in \mathcal{X}, y\in \mathcal{Y}} Pr(X=x, Y=y) \log_2 Pr(Y=y|X=x)
\end{aligned}
$$

Notes:

- $H(X|X)=0$
- $H(Y|X)=0$ when $Y$ is a function of $X$.
- Conditional entropy is not necessarily symmetric: in general, $H(X|Y)\neq H(Y|X)$.
- Conditioning does not increase entropy: $H(X|Y)\leq H(X)$.

#### Chain rule of entropy

Joint entropy: $H(X,Y)=\mathbb{E}_{(x,y)\sim (X,Y)}\left[\log_2 \frac{1}{Pr(X=x, Y=y)}\right]$.

Conditional entropy: $H(Y|X)=\mathbb{E}_{(x,y)\sim (X,Y)}\left[\log_2 \frac{1}{Pr(Y=y|X=x)}\right]$.

Chain rule: $H(X,Y)=H(X)+H(Y|X)$.
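A brief sketch (my own example; the joint distribution and helper names are arbitrary choices) that checks the chain rule $H(X,Y)=H(X)+H(Y|X)$ numerically on a small dependent pair:

```python
import math

def H(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary joint distribution Pr(X=x, Y=y) with dependent X and Y.
joint = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 0, (1, 1): 1/4}

# Marginal of X.
xs = sorted({x for (x, _) in joint})
px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in xs}

# H(Y|X) = sum_x Pr(x) * H(Y | X=x), using the conditional distributions Pr(y|x).
H_Y_given_X = sum(
    px[x] * H([joint[(x, y)] / px[x] for (xx, y) in joint if xx == x])
    for x in xs if px[x] > 0
)

print(f"H(X,Y)        = {H(list(joint.values())):.4f}")
print(f"H(X) + H(Y|X) = {H(list(px.values())) + H_Y_given_X:.4f}")  # same value
```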
**Proof (for the case of statistically independent $X$ and $Y$)**

For **statistically independent** $X$ and $Y$, we have $Pr(x,y)=Pr(x)Pr(y)$, so

$$
Pr(y|x)=\frac{Pr(x,y)}{Pr(x)}=\frac{Pr(x)Pr(y)}{Pr(x)}=Pr(y).
$$

Hence $H(Y|X)=H(Y)$, and the chain rule gives $H(X,Y)=H(X)+H(Y)$. By the symmetry of the joint distribution, the chain rule can equally be written as $H(X,Y)=H(Y)+H(X|Y)$.
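To illustrate this equality case, a minimal sketch (distributions chosen arbitrarily by me) that builds a product distribution and checks $H(X,Y)=H(X)+H(Y)$:

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

px = [1/2, 1/4, 1/4]   # distribution of X
py = [1/3, 2/3]        # distribution of Y

# Independence: the joint distribution is the product of the marginals.
joint = [p * q for p in px for q in py]

print(f"H(X,Y)      = {H(joint):.4f}")
print(f"H(X) + H(Y) = {H(px) + H(py):.4f}")  # equal, since H(Y|X) = H(Y) here
```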
**Example: computing the conditional entropy**

For the given joint distribution:

|Y\X|1|2|3|4|Y marginal|
|-----|-----|-----|-----|-----|-----|
|1 |1/8 |1/16 |1/32 |1/32 |1/4 |
|2 |1/16 |1/8 |1/32 |1/32 |1/4 |
|3 |1/16 |1/16 |1/16 |1/16 |1/4 |
|4 |1/4 |0 |0 |0 |1/4 |
|X marginal|1/2 |1/4 |1/8 |1/8 |1.0 |

Here $H(X)=H((1/2,1/4,1/8,1/8))=-(1/2\log_2 1/2+1/4\log_2 1/4+1/8\log_2 1/8+1/8\log_2 1/8)=7/4$.

$H(Y)=H((1/4,1/4,1/4,1/4))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/4\log_2 1/4+1/4\log_2 1/4)=2$.

$H(Y|X=1)=H((1/4,1/8,1/8,1/2))=-(1/4\log_2 1/4+1/8\log_2 1/8+1/8\log_2 1/8+1/2\log_2 1/2)=7/4$.

$H(Y|X=2)=H((1/4,1/2,1/4))=-(1/4\log_2 1/4+1/2\log_2 1/2+1/4\log_2 1/4)=1.5$.

$H(Y|X=3)=H((1/4,1/4,1/2))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/2\log_2 1/2)=1.5$.

$H(Y|X=4)=H((1/4,1/4,1/2))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/2\log_2 1/2)=1.5$.

So $H(Y|X)=\sum_{x\in \mathcal{X}} Pr(x)H(Y|X=x)=1/2\times 7/4+1/4\times 1.5+1/8\times 1.5+1/8\times 1.5=13/8$.
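The arithmetic above can be checked mechanically; here is a minimal sketch (my own code, mirroring the table row by row) that recomputes $H(X)$, $H(Y)$, and $H(Y|X)$:

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# joint[y][x] = Pr(X = x+1, Y = y+1), matching the table (rows: Y, columns: X).
joint = [
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0],
]

px = [sum(row[j] for row in joint) for j in range(4)]  # X marginal: 1/2, 1/4, 1/8, 1/8
py = [sum(row) for row in joint]                       # Y marginal: 1/4 each

# H(Y|X) = sum_x Pr(x) * H(Y | X=x).
H_Y_given_X = sum(px[j] * H([row[j] / px[j] for row in joint]) for j in range(4))

print(f"H(X)   = {H(px):.4f}   (7/4)")
print(f"H(Y)   = {H(py):.4f}   (2)")
print(f"H(Y|X) = {H_Y_given_X:.4f}   (13/8)")
```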
### Mutual information

#### Definition for mutual information

The mutual information $I(X;Y)$ of $X$ and $Y$ is defined as:

$$
I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)
$$

#### Properties of mutual information

- $I(X;Y)\geq 0$
  - Proved via Jensen's inequality.
  - Equivalently, conditioning does not increase entropy.
- $I(X;Y)=I(Y;X)$ (symmetry).
- $I(X;X)=H(X)-H(X|X)=H(X)$
- $I(X;Y)=H(X)+H(Y)-H(X,Y)$
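A short sketch (my own toy distributions and helper names) illustrating two of these properties, symmetry and $I(X;X)=H(X)$, via the identity $I(X;Y)=H(X)+H(Y)-H(X,Y)$:

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint distribution given as a 2-D list."""
    px = [sum(row[j] for row in joint) for j in range(len(joint[0]))]
    py = [sum(row) for row in joint]
    return H(px) + H(py) - H([p for row in joint for p in row])

# An arbitrary dependent joint distribution of (X, Y).
joint = [[1/2, 1/4],
         [0,   1/4]]
transposed = [list(col) for col in zip(*joint)]

print(mutual_information(joint))        # I(X;Y)
print(mutual_information(transposed))   # I(Y;X): same value (symmetry)

# I(X;X) = H(X): the joint of (X, X) is diagonal, with the marginal on the diagonal.
px = [3/4, 1/4]
diag = [[px[0], 0], [0, px[1]]]
print(mutual_information(diag), H(px))  # equal
```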
**Example: computing the mutual information**

For the joint distribution of the previous example:

|Y\X|1|2|3|4|Y marginal|
|-----|-----|-----|-----|-----|-----|
|1 |1/8 |1/16 |1/32 |1/32 |1/4 |
|2 |1/16 |1/8 |1/32 |1/32 |1/4 |
|3 |1/16 |1/16 |1/16 |1/16 |1/4 |
|4 |1/4 |0 |0 |0 |1/4 |
|X marginal|1/2 |1/4 |1/8 |1/8 |1.0 |

Recall from the previous example:

$H(Y|X)=\frac{13}{8}$

$H(Y)=2$

Then the mutual information is:

$$
I(X;Y)=H(Y)-H(Y|X)=2-\frac{13}{8}=\frac{3}{8}
$$

So the uncertainty about $Y$ is reduced by $\frac{3}{8}$ bits on average when $X$ is known.
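As a sanity check (my own sketch, reusing the table above), $I(X;Y)$ can be computed both as $H(Y)-H(Y|X)$ and as $H(X)+H(Y)-H(X,Y)$:

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# joint[y][x] = Pr(X = x+1, Y = y+1), the table above.
joint = [
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0],
]
px = [sum(row[j] for row in joint) for j in range(4)]
py = [sum(row) for row in joint]
flat = [p for row in joint for p in row]

H_Y_given_X = sum(px[j] * H([row[j] / px[j] for row in joint]) for j in range(4))

print(f"H(Y) - H(Y|X)        = {H(py) - H_Y_given_X:.4f}")      # 3/8
print(f"H(X) + H(Y) - H(X,Y) = {H(px) + H(py) - H(flat):.4f}")  # 3/8
```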
## Applications

### Channel capacity

Recall that a channel is a tuple $(F,\Phi,\operatorname{Pr})$.

#### Definition of discrete channel

A discrete channel is a system consisting of a discrete input alphabet $\mathcal{X}$, a discrete output alphabet $\mathcal{Y}$, and a probability transition matrix $\operatorname{Pr}(y|x)$. The channel is _memoryless_ if the output at any time depends only on the input at that time and not on the inputs and outputs at other times.

#### Definition of channel capacity

The channel capacity $C$ of a channel is defined as:

$$
C=\max_{Pr(x)} I(X;Y)
$$

where the maximum is taken over all input distributions $Pr(x)$, and $I(X;Y)$ is the mutual information between the input $X$ and the output $Y$.

#### Shannon's noisy coding theorem

Recall that the rate of a code $\mathcal{C}\subseteq F^n$ is $R=\frac{\log_{|F|}|\mathcal{C}|}{n}$.

For a discrete memoryless channel with capacity $C$, every rate $R<C$ is achievable: there exist codes of rate $R$ whose probability of decoding error tends to $0$ as the block length $n$ grows. Conversely, no rate $R>C$ is achievable with vanishing error probability.

**Example: binary symmetric channel.** For a binary symmetric channel with crossover probability $p$ (each input bit is flipped with probability $p$), the capacity is $C=1-H(p)$.

**Proof**

$$
\begin{aligned}
H(Y|X=x)&=-\operatorname{Pr}(y=0|x)\log_2 \operatorname{Pr}(y=0|x)-\operatorname{Pr}(y=1|x)\log_2 \operatorname{Pr}(y=1|x) \\
&=-p\log_2 p-(1-p)\log_2 (1-p)\\
&=H(p)
\end{aligned}
$$

So,

$$
\begin{aligned}
H(Y|X)&=\sum_{x\in \mathcal{X}} \operatorname{Pr}(x)H(Y|X=x)\\
&=H(p)\sum_{x\in \mathcal{X}} \operatorname{Pr}(x) \\
&=H(p)
\end{aligned}
$$

So,

$$
\begin{aligned}
I(X;Y)&=H(Y)-H(Y|X)\\
&=H(Y)-H(p)\\
&\leq 1-H(p)
\end{aligned}
$$

since $Y$ is binary and hence $H(Y)\leq 1$, with equality when the input (and therefore the output) is uniform. Thus $C=1-H(p)$.

When $p=0$ the capacity is $1$ (noiseless channel). When $p=\frac{1}{2}$ the capacity is $0$ (completely noisy channel).
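A minimal sketch (my own code; the grid search over input distributions is only for illustration) confirming numerically that a uniform input maximizes $I(X;Y)$ for the binary symmetric channel, so that $C=1-H(p)$:

```python
import math

def Hb(p):
    """Binary entropy function H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mutual_information(q, p):
    """I(X;Y) for a BSC with crossover probability p and input Pr(X=1) = q."""
    pr_y1 = q * (1 - p) + (1 - q) * p   # output distribution Pr(Y=1)
    return Hb(pr_y1) - Hb(p)            # I(X;Y) = H(Y) - H(Y|X), with H(Y|X) = H(p)

for p in (0.0, 0.11, 0.5):
    # Brute-force search over input distributions q = Pr(X=1).
    capacity = max(bsc_mutual_information(q / 1000, p) for q in range(1001))
    print(f"p = {p}: max_q I(X;Y) ≈ {capacity:.4f},   1 - H(p) = {1 - Hb(p):.4f}")
```

The maximum is attained at $q=\tfrac{1}{2}$, matching the equality condition in the proof above.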