# CSE5313 Coding and information theory for data science (Lecture 17)
## Shannon's coding theorem

**Shannon's coding theorem**: For a discrete memoryless channel with capacity $C$, every rate $R < C = \max_{p(x)} I(X; Y)$ is achievable, where the maximum is taken over all input distributions $p(x)$.

### Computing Channel Capacity

$X$: channel input (per channel use), $Y$: channel output (per channel use).

Let the rate of the code be $R=\frac{\log_{|F|} |C|}{n}$ (or $\frac{k}{n}$ if the code is linear).

The Binary Erasure Channel (BEC): the analog of the BSC, but bits are erased (not corrupted).

Let $\alpha$ be the erasure probability.

### Corollary: The capacity of the BEC is $C = 1 - \alpha$.

<details>

<summary>Proof</summary>

For the BEC, $H(Y|X)=H(\alpha)$: given the input, the output is erased with probability $\alpha$ and equals the input otherwise. So

$$
\begin{aligned}
C&=\max_{p(x)} I(X;Y)\\
&=\max_{p(x)} (H(Y)-H(Y|X))\\
&=\max_{p(x)} H(Y)-H(\alpha)
\end{aligned}
$$

Suppose we denote $p\coloneqq \operatorname{Pr}(X=1)$.

$\operatorname{Pr}(Y=0)=\operatorname{Pr}(X=0)\operatorname{Pr}(\text{no erasure})=(1-p)(1-\alpha)$

$\operatorname{Pr}(Y=1)=\operatorname{Pr}(X=1)\operatorname{Pr}(\text{no erasure})=p(1-\alpha)$

$\operatorname{Pr}(Y=*)=\alpha$

So,

$$
\begin{aligned}
H(Y)&=H((1-p)(1-\alpha),\,p(1-\alpha),\,\alpha)\\
&=-(1-p)(1-\alpha)\log_2 ((1-p)(1-\alpha))-p(1-\alpha)\log_2 (p(1-\alpha))-\alpha\log_2 \alpha\\
&=H(\alpha)+(1-\alpha)H(p)
\end{aligned}
$$

So $I(X;Y)=H(Y)-H(Y|X)=H(\alpha)+(1-\alpha)H(p)-H(\alpha)=(1-\alpha)H(p)$.

So $C=\max_{p(x)} I(X;Y)=\max_{p\in [0,1]} (1-\alpha)H(p)=1-\alpha$, attained at $p=\frac{1}{2}$.

So the capacity of the BEC is $C = 1 - \alpha$.

</details>

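As a quick numerical sanity check (our own sketch, not part of the lecture; the helper `binary_entropy` is ours), the code below evaluates $I(X;Y)=(1-\alpha)H(p)$ over a grid of input distributions and confirms that the maximum, $1-\alpha$, is attained at $p=1/2$:

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits, with 0*log2(0) treated as 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

alpha = 0.3                                # erasure probability of the BEC
ps = np.linspace(0, 1, 1001)               # candidate input distributions Pr(X=1) = p
mi = (1 - alpha) * binary_entropy(ps)      # I(X;Y) = (1 - alpha) * H(p)

print(f"max I(X;Y) = {mi.max():.4f}   (1 - alpha = {1 - alpha})")
print(f"maximizing p = {ps[np.argmax(mi)]:.3f}   (expected 0.5)")
```
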
### General interpretation of capacity

Recall $I(X;Y)=H(Y)-H(Y|X)=H(X)-H(X|Y)$.

Edge cases:

- If $H(X|Y)=0$, then the output $Y$ reveals all information about the input $X$.
  - A rate of $R=I(X;Y)=H(X)$ is possible (same as in information compression).
- If $H(X|Y)=H(X)$, then $Y$ reveals no information about $X$.
  - The rate is $R=I(X;Y)=0$; no information is transferred.

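To make the two edge cases concrete, here is a small check (our own illustration; `mutual_information` is a hypothetical helper, not course code) that computes $I(X;Y)$ from an input distribution and a channel transition matrix, once for a noiseless channel and once for a useless one:

```python
import numpy as np

def mutual_information(p_x, channel):
    """I(X;Y) in bits, where channel[x, y] = Pr(Y=y | X=x)."""
    p_xy = p_x[:, None] * channel                       # joint distribution Pr(X=x, Y=y)
    p_y = p_xy.sum(axis=0)                              # marginal of Y
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(p_xy > 0, p_xy / (p_x[:, None] * p_y[None, :]), 1.0)
    return float((p_xy * np.log2(ratio)).sum())

p_x = np.array([0.3, 0.7])
noiseless = np.eye(2)                # Y = X, so H(X|Y) = 0 and I(X;Y) = H(X)
useless = np.full((2, 2), 0.5)       # Y independent of X, so I(X;Y) = 0

print(mutual_information(p_x, noiseless))   # ~0.881 = H(0.3)
print(mutual_information(p_x, useless))     # ~0.0
```
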
> [!NOTE]
>
> Compression is transmission without noise.

## Side notes on cryptography

Goal: quantify the amount of information that is leaked to the eavesdropper.

- Let:
  - $M$ be the message distribution.
  - $Z$ be the ciphertext distribution.
- How much information is leaked about $m$ to the eavesdropper (who sees $\operatorname{Enc}(m)$)?
- Idea: the one-time pad.

### One-time pad

The message space is $\mathcal{M}=\{0,1\}^n$, and the message $M$ is distributed over $\mathcal{M}$.

Suppose the sender and receiver agree on a shared key $k\sim U=\operatorname{Uniform}(\{0,1\}^n)$.

Let $\operatorname{Enc}(m)=m\oplus k$.

Measure the information leaked to the eavesdropper (who sees $\operatorname{Enc}(m)$).

That is, compute $I(M;Z)=H(Z)-H(Z|M)$.

<details>
<summary>Proof</summary>

Recall that $Z=M\oplus U$, where the key $U$ is uniform and independent of $M$.

So

$$
\begin{aligned}
\operatorname{Pr}(Z=z)&=\operatorname{Pr}(M\oplus U=z)\\
&=\sum_{m\in \mathcal{M}} \operatorname{Pr}(M=m)\operatorname{Pr}(M\oplus U=z\mid M=m) \quad\text{(law of total probability)}\\
&=\sum_{m\in \mathcal{M}} \operatorname{Pr}(M=m) \operatorname{Pr}(U=m\oplus z)\\
&=\frac{1}{2^n} \sum_{m\in \mathcal{M}} \operatorname{Pr}(M=m) \quad\text{(the key is uniformly distributed)}\\
&=\frac{1}{2^n}
\end{aligned}
$$

So $Z$ is uniformly distributed over $\{0,1\}^n$, and $H(Z)=\log_2 2^n=n$.

$H(Z|M)=H(U)=n$ because, given $M=m$, the ciphertext $Z=m\oplus U$ is a bijection of the uniform key $U$:

- Notice $m\oplus\{0,1\}^n=\{0,1\}^n$ for every $m\in \mathcal{M}$ (since $(\{0,1\}^n,\oplus)$ is a group).
- For every $z\in \{0,1\}^n$, $\operatorname{Pr}(m\oplus U=z)=\operatorname{Pr}(U=z\oplus m)=2^{-n}$.

So $I(M;Z)=H(Z)-H(Z|M)=n-n=0$.

</details>

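To see this numerically, here is a small simulation (our own sketch, not from the lecture) that one-time-pads short messages drawn from a deliberately non-uniform distribution and estimates $I(M;Z)$ empirically; the estimate is close to $0$ up to sampling noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 4, 200_000                                   # 4-bit messages

probs = rng.dirichlet(np.ones(2**n))                     # a non-uniform message distribution
messages = rng.choice(2**n, size=trials, p=probs)
keys = rng.integers(0, 2**n, size=trials)                # uniform one-time pads
ciphertexts = messages ^ keys                            # Enc(m) = m XOR k

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Empirical I(M;Z) = H(M) + H(Z) - H(M,Z)
h_m = entropy(np.bincount(messages, minlength=2**n))
h_z = entropy(np.bincount(ciphertexts, minlength=2**n))
h_mz = entropy(np.bincount(messages * 2**n + ciphertexts, minlength=4**n))
print(f"estimated I(M;Z) = {h_m + h_z - h_mz:.4f} bits   (ideal: 0)")
```
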
### Discussion of information-theoretic privacy

What does $I(M;Z)=0$ mean?

- No information is leaked to the eavesdropper.
- This holds regardless of the distribution of the messages $M$.
- This holds regardless of the eavesdropper's computational power.

Information-theoretic privacy:

- Guarantees are given in terms of mutual information.
- Remains private in the face of any computational power.
- Remains private forever.

A very strong form of privacy.

> [!NOTE]
>
> The mutual information is an average metric for privacy; it gives no guarantee for any individual message.
>
> The one-time pad is, so far, the only known perfect privacy scheme.

## The asymptotic equipartition property (AEP) and data compression
> [!NOTE]
>
> This section will help us understand the limits of data compression.

### The asymptotic equipartition property

Idea: consider the space of all possible sequences produced by a random source, and focus on the "typical" ones.

- Asymptotic in the sense that many of the results focus on the regime of large source sequences.
- Fundamental to the concept of typical sets used in data compression.
- The analog of the law of large numbers (LLN) in information theory.
- Direct consequence of the weak law.

#### The law of large numbers

The average of outcomes obtained from a large number of trials is close to the expected value of the random variable.

For independent and identically distributed (i.i.d.) random variables $X_1,X_2,\cdots,X_n$, with expected value $\mathbb{E}[X_i]=\mu$, the sample average

$$
\overline{X}_n=\frac{1}{n}\sum_{i=1}^n X_i
$$

converges to the expected value $\mu$ as $n$ goes to infinity.

#### The weak law of large numbers

The sample average converges in probability to the expected value $\mu$:

$$
\overline{X}_n \xrightarrow{p} \mu\text{ as }n\to\infty
$$

That is,

$$
\lim_{n\to\infty} \operatorname{Pr}\left(\left|\overline{X}_n-\mu\right|<\epsilon\right)=1
$$

for any positive $\epsilon$.

> Intuitively, for any nonzero margin $\epsilon$, no matter how small, with a sufficiently large sample there will be a very high probability that the average of the observations will be close to the expected value (within the margin).

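A tiny simulation (ours, not from the lecture) of fair coin flips illustrates the statement: the sample average approaches $\mu = 0.5$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
for n in [10, 1_000, 100_000, 10_000_000]:
    flips = rng.integers(0, 2, size=n)        # i.i.d. Bernoulli(1/2) outcomes
    print(f"n = {n:>10}: sample average = {flips.mean():.5f}   (mu = 0.5)")
```
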
#### The AEP

Let $X$ be an i.i.d. source that takes values in an alphabet $\mathcal{X}$ and produces a sequence of i.i.d. random variables $X_1, \cdots, X_n$.

Let $p(X_1, \cdots, X_n)$ be the probability of observing the sequence $X_1, \cdots, X_n$.

Theorem (AEP):

$$
-\frac{1}{n} \log p(X_1, \cdots, X_n) \xrightarrow{p} H(X)\text{ as }n\to \infty
$$

<details>
<summary>Proof</summary>

By independence, $-\frac{1}{n} \log p(X_1, \cdots, X_n) = -\frac{1}{n} \sum_{i=1}^n \log p(X_i)$.

The $-\log p(X_i)$ terms are also i.i.d. random variables.

So by the weak law of large numbers,

$$
-\frac{1}{n} \sum_{i=1}^n \log p(X_i) \xrightarrow{p} \mathbb{E}[-\log p(X_i)]=\mathbb{E}\left[\log \frac{1}{p(X_i)}\right]=H(X)
$$

</details>

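A quick simulation (our own sketch) of the theorem for a Bernoulli($0.3$) source: the normalized negative log-probability of a random length-$n$ sequence concentrates around $H(0.3)\approx 0.881$ bits as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.3
h = -p * np.log2(p) - (1 - p) * np.log2(1 - p)       # H(X) ~ 0.881 bits

for n in [10, 1_000, 100_000]:
    x = rng.random(n) < p                            # i.i.d. Bernoulli(p) sequence
    k = int(x.sum())                                 # number of ones
    log_prob = k * np.log2(p) + (n - k) * np.log2(1 - p)
    print(f"n = {n:>7}: -log2 p(X_1..X_n) / n = {-log_prob / n:.4f}   (H(X) = {h:.4f})")
```
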
### Typical sets

#### Definition of typical set

The typical set (denoted by $A_\epsilon^{(n)}$) with respect to $p(x)$ is the set of sequences $(x_1, \cdots, x_n)\in \mathcal{X}^n$ that satisfy

$$
2^{-n(H(X)+\epsilon)}\leq p(x_1, \cdots, x_n)\leq 2^{-n(H(X)-\epsilon)}
$$

In other words, the typical set contains all $n$-length sequences with probability close to $2^{-nH(X)}$.

- This notion of typicality only concerns the probability of a sequence and not the actual sequence itself.
- It has great use in compression theory.

#### Properties of the typical set

- If $(x_1, \cdots, x_n)\in A_\epsilon^{(n)}$, then $H(X)-\epsilon\leq -\frac{1}{n} \log p(x_1, \cdots, x_n)\leq H(X)+\epsilon$.
- $\operatorname{Pr}(A_\epsilon^{(n)})\geq 1-\epsilon$ for sufficiently large $n$.

<details>
<summary>Sketch of proof</summary>

Use the AEP to show that $\operatorname{Pr}(A_\epsilon^{(n)})\geq 1-\epsilon$ for sufficiently large $n$.

By the AEP, for any $\delta>0$ there exists $n_0$ such that for all $n\geq n_0$,

$$
\operatorname{Pr}(A_\epsilon^{(n)})\geq 1-\delta
$$

In particular, taking $\delta=\epsilon$ gives the claim.

</details>

- $|A_\epsilon^{(n)}|\leq 2^{n(H(X)+\epsilon)}$.
- $|A_\epsilon^{(n)}|\geq (1-\epsilon)2^{n(H(X)-\epsilon)}$ for sufficiently large $n$.

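A brute-force check of these properties (our own sketch) for a Bernoulli($0.3$) source: since all length-$n$ sequences with the same number of ones are equiprobable, we can tally the typical set exactly and compare with the bounds above.

```python
import math

p, eps = 0.3, 0.1
h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)      # H(X) ~ 0.881 bits

for n in [16, 64, 256]:
    prob, size = 0.0, 0
    for k in range(n + 1):                              # k = number of ones in the sequence
        seq_prob = p**k * (1 - p) ** (n - k)            # probability of any one such sequence
        if 2 ** (-n * (h + eps)) <= seq_prob <= 2 ** (-n * (h - eps)):
            size += math.comb(n, k)
            prob += math.comb(n, k) * seq_prob
    print(f"n = {n:>4}: Pr(A_eps) = {prob:.3f}, log2|A_eps| = {math.log2(size):.1f}, "
          f"n(H+eps) = {n * (h + eps):.1f}")
```
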
#### Smallest probable set

The typical set $A_\epsilon^{(n)}$ is a fairly small set with most of the probability.

Q: Is it the smallest such set?

A: Not quite, but pretty close.

Notation: Let $X_1, \cdots, X_n$ be i.i.d. random variables drawn from $p(x)$. For some $\delta < \frac{1}{2}$, we denote by $B_\delta^{(n)}\subset \mathcal{X}^n$ the smallest set such that $\operatorname{Pr}(B_\delta^{(n)})\geq 1-\delta$.

Notation: We write $a_n \doteq b_n$ (equality to first order in the exponent) if

$$
\lim_{n\to\infty} \frac{1}{n}\log\frac{a_n}{b_n}=0
$$

Theorem: $|B_\delta^{(n)}|\doteq |A_\epsilon^{(n)}|\doteq 2^{nH(X)}$.

Check the book for the detailed proof.

What is the difference between $B_\delta^{(n)}$ and $A_\epsilon^{(n)}$?

Consider a Bernoulli sequence $X_1, \cdots, X_n$ with $p = 0.9$.

The typical sequences are those in which the proportion of 1's is close to 0.9.

- However, they do not include the most likely sequence, i.e., the sequence of all 1's!
- $H(X) = 0.469$.
- $-\frac{1}{n} \log p(1, \cdots, 1) = -\frac{1}{n} \log 0.9^n = 0.152$. The all-1's sequence's average logarithmic probability cannot come close to the entropy of $X$, no matter how large $n$ is.
- The set $B_\delta^{(n)}$ contains all the most probable sequences… and includes the sequence of all 1's.

### Consequences of the AEP: data compression schemes

Let $X_1, \cdots, X_n$ be i.i.d. random variables drawn from $p(x)$.

We want to find the shortest description of the sequence $X_1, \cdots, X_n$.

First, divide all sequences in $\mathcal{X}^n$ into two sets, namely the typical set $A_\epsilon^{(n)}$ and its complement $A_\epsilon^{(n)c}$.

- $A_\epsilon^{(n)}$ contains most of the probability, while $A_\epsilon^{(n)c}$ contains most of the elements.
- The typical set has probability close to 1 and contains approximately $2^{nH(X)}$ elements.

Order the sequences in each set in lexicographic order.

- This means we can represent each sequence of $A_\epsilon^{(n)}$ by giving its index in the set.
- There are at most $2^{n(H(X)+\epsilon)}$ sequences in $A_\epsilon^{(n)}$.
- Indexing requires no more than $n(H(X)+\epsilon)+1$ bits.
- Prefix all of these by a "0" bit ⇒ at most $n(H(X)+\epsilon)+2$ bits to represent each.
- Similarly, index each sequence in $A_\epsilon^{(n)c}$ with no more than $n\log |\mathcal{X}|+1$ bits.
- Prefix it by "1" ⇒ at most $n\log |\mathcal{X}|+2$ bits to represent each.
- Voilà! We have a code for all sequences in $\mathcal{X}^n$.
- The typical sequences have short descriptions of length $\approx nH(X)$ bits.

#### Algorithm for data compression

1. Divide all $n$-length sequences in $\mathcal{X}^n$ into the typical set $A_\epsilon^{(n)}$ and its complement $A_\epsilon^{(n)c}$.
2. Order the sequences in each set in lexicographic order.
3. Index each sequence in $A_\epsilon^{(n)}$ using $\leq n(H(X)+\epsilon)+1$ bits and each sequence in $A_\epsilon^{(n)c}$ using $\leq n\log |\mathcal{X}|+1$ bits.
4. Prefix the sequence by "0" if it is in $A_\epsilon^{(n)}$ and by "1" if it is in $A_\epsilon^{(n)c}$.

Notes:

- The code is one-to-one and can be decoded easily. The initial bit acts as a "flag".
- $A_\epsilon^{(n)c}$ has fewer elements than $\mathcal{X}^n$, so indexing it with $n\log |\mathcal{X}|$ bits is wasteful, but it turns out this does not matter.

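A toy implementation of the scheme above (our own sketch; the helpers are ours) for a Bernoulli($0.3$) source with $n = 12$: typical sequences get a "0"-prefixed index, all other sequences a "1"-prefixed raw index. With such a small $n$ the "sufficiently large $n$" regime is not really reached, but the expected length already lands below the $n(H(X)+\epsilon')$ bound derived in the next subsection.

```python
import itertools
import math

p, n, eps = 0.3, 12, 0.1
h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def prob(seq):
    k = sum(seq)
    return p**k * (1 - p) ** (len(seq) - k)

def is_typical(seq):
    return 2 ** (-n * (h + eps)) <= prob(seq) <= 2 ** (-n * (h - eps))

all_seqs = list(itertools.product([0, 1], repeat=n))     # lexicographic order over {0,1}^n
typical = [s for s in all_seqs if is_typical(s)]
atypical = [s for s in all_seqs if not is_typical(s)]

bits_typ = math.ceil(math.log2(len(typical)))            # index width inside the typical set
bits_atyp = n                                            # n * log2|X| bits for the binary alphabet

def encode(seq):
    if is_typical(seq):
        return "0" + format(typical.index(seq), f"0{bits_typ}b")
    return "1" + format(atypical.index(seq), f"0{bits_atyp}b")

avg_len = sum(prob(s) * len(encode(s)) for s in all_seqs)
eps_prime = eps + eps * 1 + 2 / n                        # log2|X| = 1 here
print(f"expected length = {avg_len:.2f} bits, n*H(X) = {n * h:.2f}, "
      f"bound n(H+eps') = {n * (h + eps_prime):.2f}")
```
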
#### Expected length of codewords

Use $x_n$ to denote a sequence $x_1, \cdots, x_n$ and $\ell_{x_n}$ to denote the length of the corresponding codeword.

Suppose $n$ is sufficiently large, so that $\operatorname{Pr}(A_\epsilon^{(n)})\geq 1-\epsilon$.

Lemma: The expected length of the codeword, $\mathbb{E}[\ell_{x_n}]$, is upper bounded by $n(H(X)+\epsilon')$, where $\epsilon'=\epsilon+\epsilon\log |\mathcal{X}|+\frac{2}{n}$.

$\epsilon'$ can be made arbitrarily small by choosing $\epsilon$ and $n$ appropriately.

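For completeness, here is a short derivation of the lemma (the standard argument, using the codeword lengths from the scheme above together with $\operatorname{Pr}(A_\epsilon^{(n)})\leq 1$ and $\operatorname{Pr}(A_\epsilon^{(n)c})\leq\epsilon$):

$$
\begin{aligned}
\mathbb{E}[\ell_{x_n}] &= \sum_{x_n\in A_\epsilon^{(n)}} p(x_n)\,\ell_{x_n} + \sum_{x_n\in A_\epsilon^{(n)c}} p(x_n)\,\ell_{x_n}\\
&\leq \operatorname{Pr}(A_\epsilon^{(n)})\left(n(H(X)+\epsilon)+2\right) + \operatorname{Pr}(A_\epsilon^{(n)c})\left(n\log|\mathcal{X}|+2\right)\\
&\leq n(H(X)+\epsilon) + \epsilon\, n\log|\mathcal{X}| + 2\\
&= n\left(H(X)+\epsilon+\epsilon\log|\mathcal{X}|+\tfrac{2}{n}\right) = n(H(X)+\epsilon').
\end{aligned}
$$
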
#### Efficient data compression guarantee

The expected length of the codeword $\mathbb{E}[\ell_{x_n}]$ is upper bounded by $n(H(X)+\epsilon')$, where $\epsilon'=\epsilon+\epsilon\log |\mathcal{X}|+\frac{2}{n}$.

Theorem: Let $X_n\triangleq X_1, \cdots, X_n$ be i.i.d. with $p(x)$ and $\epsilon > 0$. Then there exists a code that maps sequences $x_n$ to binary strings (codewords) such that the mapping is one-to-one and $\mathbb{E}[\ell_{x_n}] \leq n(H(X)+\epsilon)$ for $n$ sufficiently large.

- Thus, we can represent sequences $X_n$ using $\approx nH(X)$ bits on average.

## Shannon's source coding theorem
There exists an algorithm that can compress $n$ i.i.d. random variables $X_1, \cdots, X_n$, each with entropy $H(X)$, into slightly more than $nH(X)$ bits with negligible risk of information loss. Conversely, if they are compressed into fewer than $nH(X)$ bits, the risk of information loss is very high.

- We have essentially proved the first half!

Proof of the converse: show that any set of size (exponentially) smaller than that of $A_\epsilon^{(n)}$ covers a set of probability bounded away from 1.
|