CSE5313 Coding and information theory for data science (Lecture 15)
Information theory
Transmission, processing, extraction, and utilization of information.
- Information: "Resolution" of uncertainty.
- Question in the 1940's: How to quantify the complexity of information?
- Questions:
- How many bits are required to describe an information source?
- How many bits are required to transmit an information source?
- How much information does one source reveal about another?
- Applications:
- Data Compression.
- Channel Coding.
- Privacy.
- Answer (to all of the above): Claude Shannon, 1948 — information entropy.
Entropy
The "information value" of a message depends on how surprising it is.
- The more unlikely the event, the more informative the message.
The Shannon information of an event E:
I(E)=\log_2 \frac{1}{Pr(E)}=-\log_2 Pr(E)
Entropy is the expected amount of information in a random trial.
- Rolling a die has more entropy than flipping a coin (6 equally likely outcomes vs. 2).
Information entropy
Let X be a random variable with values in some finite set \mathcal{X}.
The entropy H(X) of X is defined as:
H(X)=\sum_{x\in \mathcal{X}} Pr(X=x)\log_2 \frac{1}{Pr(X=x)}=-\sum_{x\in \mathcal{X}} Pr(X=x)\log_2 Pr(X=x)=\mathbb{E}_{x\sim X}\left[\log_2 \frac{1}{Pr(X=x)}\right]
Idea: How many bits are required, on average, to describe X?
- The more unlikely the event, the more informative the message.
- Use "few" bits for common
x(i.e.,Pr(x)large). - Use "many" bits for rare
x(i.e.,Pr(x)small).
Notes:
H(X)=\mathbb{E}_{x\sim X}[I(X=x)]H(X)\geq 0H(X)=0if and only ifPr(X=x)=1for somex\in \mathcal{X}.- Does not depend on
\mathcal{X}but on the probability distributionPr(X=x). - Maximum entropy is achieved when the distribution is uniform.
H(Uniform(n))=\log_2 n. Wherenis the number of events. (abits required to describe each event of size2^a).
Example
For a uniform distribution X\sim Uniform\{0,1\}, we have H(X)=\log_2 2=1.
For a Bernoulli distribution X\sim Bernoulli(p), we have H(X)=-p\log_2 p-(1-p)\log_2 (1-p)=H(p), the binary entropy function.
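These values are easy to check numerically. Below is a minimal Python sketch (the helper entropy is ours, not from any particular library):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with probability 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/2]))   # uniform over {0,1}: 1.0 bit
print(entropy([1/6] * 6))    # fair die: log2(6) ~ 2.585 bits
print(entropy([0.1, 0.9]))   # Bernoulli(0.1): H(0.1) ~ 0.469 bits
```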
Motivation
Optimal compression:
Consider a random variable X with the distribution given below:
| Value | Probability | Encoding |
|---|---|---|
| 1 | 1/8 | 000 |
| 2 | 1/8 | 001 |
| 3 | 1/4 | 01 |
| 4 | 1/2 | 1 |
The average length of the encoding is 1/8\times 3+1/8\times 3+1/4\times 2+1/2\times 1=7/4.
And the entropy of X is H(X)=1/8\times \log_2 8+1/8\times \log_2 8+1/4\times \log_2 4+1/2\times \log_2 2=7/4.
So the average length of the encoding is equal to the entropy of X.
This is the optimal compression.
This encoding is a Huffman code.
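The claim can also be reproduced programmatically. The sketch below (the helper huffman_lengths is ours) runs Huffman's merging procedure, tracking only the codeword lengths, and checks that the average length matches H(X) for this dyadic distribution:

```python
import heapq, math

def huffman_lengths(probs):
    """Codeword length of each symbol under Huffman coding: repeatedly merge
    the two least likely nodes; each merge adds one bit to every symbol inside."""
    heap = [(p, [i]) for i, p in enumerate(probs)]   # (probability, symbols in subtree)
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, s1 = heapq.heappop(heap)
        p2, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, s1 + s2))
    return lengths

probs = [1/2, 1/4, 1/8, 1/8]
lengths = huffman_lengths(probs)                      # [1, 2, 3, 3]
avg_len = sum(p * l for p, l in zip(probs, lengths))
H = -sum(p * math.log2(p) for p in probs)
print(lengths, avg_len, H)                            # average length = H(X) = 1.75
```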
A few extra theorems that will not be proved in this course
- Theorem: The average number of bits in any prefix-free compression of X is at least H(X).
- Theorem: Huffman coding is optimal.
  - I.e., its average number of bits is minimal among prefix-free codes (and equals H(X) when all probabilities are powers of 2, as in the example above).
- Disadvantage of Huffman coding: the distribution must be known.
  - Generally not the case.
- [Lempel-Ziv 1978]: Universal lossless data compression.
Conditional and joint entropy
How do the entropies of different random variables interact?
- As a function of their dependence.
- Needed in order to relate:
- A variable and its compression.
- A variable and its transmission over a noisy channel.
- A variable and its encryption.
Definition for joint entropy
Let X and Y be discrete random variables.
The joint entropy H(X,Y) of X and Y is defined as:
H(X,Y)=-\sum_{x\in \mathcal{X}, y\in \mathcal{Y}} Pr(X=x, Y=y) \log_2 Pr(X=x, Y=y)
Notes:
- H(X,Y)\geq 0.
- H(X,Y)=H(Y,X).
- H(X,Y)\geq \max\{H(X),H(Y)\}.
- H(X,Y)\leq H(X)+H(Y), with equality if and only if X and Y are independent (see the numeric check below).
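A quick numeric check of the last two properties (a sketch; the entropy helper is ours, and the joint distribution is flattened into a single probability vector):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Independent X and Y: the joint distribution is the product of the marginals,
# so H(X,Y) = H(X) + H(Y).
px, py = [1/2, 1/2], [1/4, 3/4]
joint_indep = [p * q for p in px for q in py]
print(entropy(joint_indep), entropy(px) + entropy(py))   # both ~1.811

# Fully dependent case Y = X, both uniform on {0,1}:
# H(X,Y) = 1 = max(H(X), H(Y)), strictly less than H(X) + H(Y) = 2.
joint_dep = [1/2, 0, 0, 1/2]   # Pr(0,0) = Pr(1,1) = 1/2
print(entropy(joint_dep))
```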
Conditional entropy
Note
Recall that the conditional probability P(Y|X) is defined as P(Y|X)=\frac{P(Y,X)}{P(X)}.
On average, how much uncertainty about Y remains once X is known?
For each given x\in \mathcal{X}, the conditional entropy H(Y|X=x) is defined as:
\begin{aligned}
H(Y|X=x)&=\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 \frac{1}{Pr(Y=y|X=x)} \\
&=-\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 Pr(Y=y|X=x) \\
\end{aligned}
The conditional entropy H(Y|X) is defined as:
\begin{aligned}
H(Y|X)&=\mathbb{E}_{x\sim X}[H(Y|X=x)] \\
&=\sum_{x\in \mathcal{X}} Pr(X=x)H(Y|X=x) \\
&=-\sum_{x\in \mathcal{X}} Pr(X=x)\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 Pr(Y=y|X=x) \\
&=-\sum_{x\in \mathcal{X}, y\in \mathcal{Y}} Pr(X=x, Y=y) \log_2 Pr(Y=y|X=x) \\
\end{aligned}
Notes:
- H(X|X)=0; more generally, H(Y|X)=0 when Y is a function of X.
- Conditional entropy is not necessarily symmetric: in general H(X|Y)\neq H(Y|X).
- Conditioning does not increase entropy: H(X|Y)\leq H(X).
Chain rules of entropy
Joint entropy: H(X,Y)=\mathbb{E}_{(x,y)\sim (X,Y)}\left[\log_2 \frac{1}{Pr(X=x, Y=y)}\right].
Conditional entropy: H(Y|X)=\mathbb{E}_{(x,y)\sim (X,Y)}\left[\log_2 \frac{1}{Pr(Y=y|X=x)}\right].
Chain rule: H(X,Y)=H(X)+H(Y|X).
Proof
Since Pr(x,y)=Pr(x)Pr(y|x), we have \log_2 \frac{1}{Pr(x,y)}=\log_2 \frac{1}{Pr(x)}+\log_2 \frac{1}{Pr(y|x)}; taking expectations over (x,y) gives H(X,Y)=H(X)+H(Y|X).
For statistically independent X and Y, we have Pr(x,y)=Pr(x)Pr(y), so Pr(y|x)=\frac{Pr(x,y)}{Pr(x)}=\frac{Pr(x)Pr(y)}{Pr(x)}=Pr(y), hence H(Y|X)=H(Y) and the chain rule gives H(X,Y)=H(X)+H(Y).
By the symmetry of the joint distribution, H(X,Y)=H(Y)+H(X|Y) as well.
Example of computing the conditional entropy
For the given distribution:
| Y\X | 1 | 2 | 3 | 4 | Y marginal |
|---|---|---|---|---|---|
| 1 | 1/8 | 1/16 | 1/32 | 1/32 | 1/4 |
| 2 | 1/16 | 1/8 | 1/32 | 1/32 | 1/4 |
| 3 | 1/16 | 1/16 | 1/16 | 1/16 | 1/4 |
| 4 | 1/4 | 0 | 0 | 0 | 1/4 |
| X marginal | 1/2 | 1/4 | 1/8 | 1/8 | 1.0 |
Here H(X)=H((1/2,1/4,1/8,1/8))=-(1/2\log_2 1/2+1/4\log_2 1/4+1/8\log_2 1/8+1/8\log_2 1/8)=7/4.
H(Y)=H((1/4,1/4,1/4,1/4))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/4\log_2 1/4+1/4\log_2 1/4)=2.
H(Y|X=1)=H((1/4,1/8,1/8,1/2))=-(1/4\log_2 1/4+1/8\log_2 1/8+1/8\log_2 1/8+1/2\log_2 1/2)=7/4.
H(Y|X=2)=H((1/4,1/2,1/4))=-(1/4\log_2 1/4+1/2\log_2 1/2+1/4\log_2 1/4)=1.5.
H(Y|X=3)=H((1/4,1/4,1/2))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/2\log_2 1/2)=1.5.
H(Y|X=4)=H((1/4,1/4,1/2))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/2\log_2 1/2)=1.5.
So H(Y|X)=\sum_{x\in \mathcal{X}} Pr(x)H(Y|X=x)=1/2\times 7/4+1/4\times 1.5+1/8\times 1.5+1/8\times 1.5=13/8
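The whole computation can be checked mechanically from the joint table. A minimal sketch (helper names are ours) that also computes H(X|Y), illustrating the asymmetry of conditional entropy, and verifies the chain rule H(X,Y)=H(X)+H(Y|X) from above:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution Pr(X=x, Y=y): rows are y = 1..4, columns are x = 1..4.
joint = [
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0   ],
]
px = [sum(row[j] for row in joint) for j in range(4)]   # X marginal: 1/2, 1/4, 1/8, 1/8
py = [sum(row) for row in joint]                        # Y marginal: 1/4 each

# H(Y|X) = sum_x Pr(x) H(Y|X=x); the conditional column is joint[:, x] / Pr(x).
H_Y_given_X = sum(px[j] * entropy([row[j] / px[j] for row in joint]) for j in range(4))
# H(X|Y) = sum_y Pr(y) H(X|Y=y); note it differs from H(Y|X).
H_X_given_Y = sum(py[i] * entropy([p / py[i] for p in joint[i]]) for i in range(4))

H_XY = entropy([p for row in joint for p in row])
print(H_Y_given_X, H_X_given_Y)          # 1.625 = 13/8 and 1.375 = 11/8
print(H_XY, entropy(px) + H_Y_given_X)   # chain rule: both 3.375 = 27/8
```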
Mutual information
Definition for mutual information
The mutual information I(X;Y) of X and Y is defined as:
I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)
Properties of mutual information
- I(X;Y)\geq 0.
  - Prove via Jensen's inequality.
  - Equivalently: conditioning does not increase entropy.
- I(X;Y)=I(Y;X) (symmetry).
- I(X;X)=H(X)-H(X|X)=H(X).
- I(X;Y)=H(X)+H(Y)-H(X,Y).
Example of computing the mutual information
Consider the same joint distribution as in the previous example. Recall that:
H(Y|X)=\frac{13}{8}
H(Y)=2
then the mutual information is:
I(X;Y)=H(Y)-H(Y|X)=2-\frac{13}{8}=\frac{3}{8}
So the entropy of Y is reduced by \frac{3}{8} bits on average when X is known.
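The same value also falls out of the equivalent expansion I(X;Y)=\sum_{x,y} Pr(x,y)\log_2 \frac{Pr(x,y)}{Pr(x)Pr(y)} (which follows from I(X;Y)=H(X)+H(Y)-H(X,Y)). A minimal sketch over the joint table from the conditional-entropy example:

```python
import math

# Joint distribution from the conditional-entropy example: rows are y, columns are x.
joint = [
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0   ],
]
px = [sum(row[j] for row in joint) for j in range(4)]
py = [sum(row) for row in joint]

# I(X;Y) = sum_{x,y} Pr(x,y) * log2( Pr(x,y) / (Pr(x) Pr(y)) ), skipping zero entries.
I = sum(joint[i][j] * math.log2(joint[i][j] / (px[j] * py[i]))
        for i in range(4) for j in range(4) if joint[i][j] > 0)
print(I)   # 0.375 = 3/8, matching H(Y) - H(Y|X) = 2 - 13/8
```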
Applications
Channel capacity
Recall that a channel is a tuple (F,\Phi,\operatorname{Pr}).
Definition of discrete channel
A discrete channel is a system consisting of a discrete input alphabet \mathcal{X}, a discrete output alphabet \mathcal{Y}, and a transition probability \operatorname{Pr}(y|x).
The channel is memoryless if the output at any time depends only on the input at that time and not on the inputs and outputs at other times.
Definition of channel capacity
The channel capacity C of a channel is defined as:
C=\max_{Pr(X)} I(X;Y)
where the maximum is over all input distributions Pr(X), and I(X;Y) is the mutual information between the input X and the output Y.
Shannon's noisy coding theorem
Recall the rate of the code is R=\frac{\log_{|F|}|\mathcal{C}|}{n}.
For a discrete memoryless channel with capacity C, every rate R<C is achievable.
- There exists a sequence of codes of length n and size |F|^{nR} with vanishing probability of error as n\to \infty.
- In 1 channel use, it is impossible to beat the noise.
- In n channel uses, as n\to \infty, it is possible to beat all the noise.
Corollary for the binary symmetric channel
The capacity of the binary symmetric channel with crossover probability p is 1-H(p).
Proof
\begin{aligned}
H(Y|X=x)&=-\operatorname{Pr}(y=0|x)\log_2 \operatorname{Pr}(y=0|x)-\operatorname{Pr}(y=1|x)\log_2 \operatorname{Pr}(y=1|x) \\
&=-p\log_2 p-(1-p)\log_2 (1-p)\\
&=H(p)
\end{aligned}
So,
\begin{aligned}
H(Y|X)&=\sum_{x\in \mathcal{X}} \operatorname{Pr}(x)H(Y|X=x)\\
&=H(p)\sum_{x\in \mathcal{X}} \operatorname{Pr}(x) \\
&=H(p)
\end{aligned}
So,
\begin{aligned}
I(X;Y)&=H(Y)-H(Y|X)\\
&\leq 1-H(p)
\end{aligned}
since Y is binary and hence H(Y)\leq 1, with equality when the input distribution is uniform. Therefore C=1-H(p).
When p=0 the capacity is 1.
When p=\frac{1}{2} the capacity is 0 (completely noisy channel).
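A minimal numeric sketch (ours, not part of the lecture) that brute-forces the maximization over input distributions and confirms C=1-H(p), with the optimum at the uniform input:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

def bsc_mutual_information(a, p):
    """I(X;Y) for a BSC with crossover probability p and input Pr(X=0) = a."""
    py0 = a * (1 - p) + (1 - a) * p                       # output distribution Pr(Y=0)
    return entropy([py0, 1 - py0]) - entropy([p, 1 - p])  # H(Y) - H(Y|X)

p = 0.11
# Brute-force the maximization over input distributions Pr(X=0) = a in [0, 1].
capacity = max(bsc_mutual_information(a / 1000, p) for a in range(1001))
print(capacity, 1 - entropy([p, 1 - p]))   # both ~0.500; the maximum is attained at a = 1/2
```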