# CSE5313 Coding and information theory for data science (Lecture 17)
## Shannon's coding theorem
**Shannon's coding theorem**: For a discrete memoryless channel with capacity $C$,
every rate $R < C = \max_{p(x)} I(X; Y)$ is achievable, where the maximum is taken over all input distributions $p(x)$.
### Computing Channel Capacity
$X$: channel input (per channel use), $Y$: channel output (per channel use).
Let the rate of the code be $\frac{\log_F |C|}{n}$ (or $\frac{k}{n}$ if the code is linear).
The Binary Erasure Channel (BEC) is the analog of the BSC, except that bits are erased (lost) rather than corrupted.
Let $\alpha$ be the erasure probability, i.e., the expected fraction of erased bits.
### Corollary: The capacity of the BEC is $C = 1 - \alpha$.
<details>
<summary>Proof</summary>
$$
\begin{aligned}
C&=\max_{p(x)} I(X;Y)\\
&=\max_{p(x)} (H(Y)-H(Y|X))\\
&=\max_{p(x)} H(Y)-H(\alpha)
\end{aligned}
$$
Denote $\operatorname{Pr}(X=1)\coloneqq p$. Then
$\operatorname{Pr}(Y=0)=\operatorname{Pr}(X=0)\operatorname{Pr}(\text{no erasure})=(1-p)(1-\alpha)$
$\operatorname{Pr}(Y=1)=\operatorname{Pr}(X=1)\operatorname{Pr}(\text{no erasure})=p(1-\alpha)$
$\operatorname{Pr}(Y=*)=\alpha$
So,
$$
\begin{aligned}
H(Y)&=H((1-p)(1-\alpha),p(1-\alpha),\alpha)\\
&=-(1-p)(1-\alpha)\log_2 ((1-p)(1-\alpha))-p(1-\alpha)\log_2 (p(1-\alpha))-\alpha\log_2 (\alpha)\\
&=H(\alpha)+(1-\alpha)H(p)
\end{aligned}
$$
So $I(X;Y)=H(Y)-H(Y|X)=H(\alpha)+(1-\alpha)H(p)-H(\alpha)=(1-\alpha)H(p)$
So $C=\max_{p(x)} I(X;Y)=\max_{p\in [0,1]} (1-\alpha)H(p)=1-\alpha$, since $H(p)$ is maximized at $p=\frac{1}{2}$, where $H\left(\frac{1}{2}\right)=1$.
So the capacity of the BEC is $C = 1 - \alpha$.
</details>
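The capacity expression can also be checked numerically. Below is a minimal sketch (not from the lecture; the variable names and the value $\alpha=0.3$ are my own choices) that evaluates $I(X;Y)=(1-\alpha)H(p)$ over a grid of input distributions and confirms that the maximum $1-\alpha$ is attained at $p=\tfrac{1}{2}$.

```python
import numpy as np

def binary_entropy(p):
    """Binary entropy H(p) in bits, with the convention 0*log(0) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

alpha = 0.3                                      # erasure probability (example value)
ps = np.linspace(0, 1, 1001)                     # candidate input distributions Pr(X=1) = p
mutual_info = (1 - alpha) * binary_entropy(ps)   # I(X;Y) for the BEC, as derived above

p_star = ps[np.argmax(mutual_info)]
print(f"maximizer p = {p_star:.3f}, max I(X;Y) = {mutual_info.max():.4f}, 1 - alpha = {1 - alpha:.4f}")
```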
### General interpretation of capacity
Recall $I(X;Y)=H(Y)-H(Y|X)=H(X)-H(X|Y)$.
Edge cases:
- If $H(X|Y)=0$, the output $Y$ reveals all information about the input $X$.
  - A rate of $R=I(X;Y)=H(X)$ is possible (same as lossless compression).
- If $H(X|Y)=H(X)$, then $Y$ reveals no information about $X$.
  - $R=I(X;Y)=0$: no information is transferred.
> [!NOTE]
>
> Compression is transmission without noise.
## Side notes for Cryptography
Goal: Quantify the amount of information that is leaked to the eavesdropper.
- Let:
  - $M$ be the message distribution.
  - $Z$ be the ciphertext distribution.
- How much information is leaked about $m$ to the eavesdropper (who sees $\operatorname{Enc}(m)$)?
- Idea: the one-time pad.
### One-time pad
The message $M$ is distributed over $\mathcal{M}=\{0,1\}^n$.
Suppose the sender and receiver agree on a key $k\sim U=\operatorname{Uniform}(\{0,1\}^n)$.
Let $\operatorname{Enc}(m)=m\oplus k$.
Measure the information leaked to the eavesdropper (who sees $\operatorname{Enc}(m)$).
That is to compute $I(M;Z)=H(Z)-H(Z|M)$.
<details>
<summary>Proof</summary>
Recall that $Z=M\oplus U$.
So
$$
\begin{aligned}
\operatorname{Pr}(Z=z)&=\operatorname{Pr}(M\oplus U=z)\\
&=\sum_{m\in \mathcal{M}} \operatorname{Pr}(M=m)\operatorname{Pr}(M\oplus U=z\mid M=m) \qquad\text{(law of total probability)}\\
&=\sum_{m\in \mathcal{M}} \operatorname{Pr}(M=m)\operatorname{Pr}(U=m\oplus z)\\
&=\frac{1}{2^n} \sum_{m\in \mathcal{M}} \operatorname{Pr}(M=m) \qquad\text{(the key $U$ is uniform)}\\
&=\frac{1}{2^n}
\end{aligned}
$$
$Z$ is uniformly distributed over $\{0,1\}^n$.
So $H(Z)=\log_2 2^n=n$.
$H(Z|M)=H(U)=n$ because $U$ is uniformly distributed over $\{0,1\}^n$.
- Notice $m\oplus\{0,1\}^n=\{0,1\}^n$ for every $m\in \mathcal{M}$. (since $(\{0,1\}^n,\oplus)$ is a group)
- For every $z\in \{0,1\}^n$, $\operatorname{Pr}(m\oplus U=z)=\operatorname{Pr}(U=z\oplus m)=2^{-n}$
So $I(M;Z)=H(Z)-H(Z|M)=n-n=0$.
</details>
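For a small block length, the conclusion $I(M;Z)=0$ can also be verified by direct computation. The sketch below is my own illustration (the block length `n` and the randomly chosen message distribution `p_m` are arbitrary); it builds the exact joint distribution of $(M,Z)$ for $Z=M\oplus U$ with a uniform key and evaluates the mutual information.

```python
import math, random

n = 4                                   # small block length so {0,1}^n can be enumerated
messages = list(range(2 ** n))          # identify {0,1}^n with the integers 0, ..., 2^n - 1

# A deliberately non-uniform message distribution (arbitrary choice for this demo).
random.seed(0)
weights = [random.random() for _ in messages]
p_m = [w / sum(weights) for w in weights]

# Joint distribution of (M, Z) where Z = M xor K and the key K is uniform on {0,1}^n.
p_mz = {(m, m ^ k): p_m[m] / 2 ** n for m in messages for k in messages}
p_z = {z: sum(p_mz[(m, z)] for m in messages) for z in messages}

# I(M;Z) = sum over (m,z) of p(m,z) * log2( p(m,z) / (p(m) * p(z)) )
mi = sum(p * math.log2(p / (p_m[m] * p_z[z])) for (m, z), p in p_mz.items())
print(f"I(M;Z) = {mi:.12f} bits")       # numerically zero, matching the proof
```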
### Discussion of information theoretical privacy
What does $I(Z;M)=0$ mean?
- No information is leaked to the eavesdropper.
- Regardless of the distribution of $M$.
- Regardless of any computational power.
Information-theoretic privacy:
- Guarantees are given in terms of mutual information.
- Remains private in the face of any computational power.
- Remains private forever.
This is a very strong form of privacy.
> [!NOTE]
>
> The mutual information is an average metric for privacy; it gives no guarantee for any individual message.
>
> The one-time pad is, so far, the only known perfect-privacy scheme.
## The asymptotic equipartition property (AEP) and data compression
> [!NOTE]
>
> This section will help us understand the limits of data compression.
### The asymptotic equipartition property
Idea: consider the space of all possible sequences produced by a random source, and focus on the "typical" ones.
- Asymptotic in the sense that many of the results focus on the regime of large source sequences.
- Fundamental to the concept of typical sets used in data compression.
- The analog of the law of large numbers (LLN) in information theory.
- Direct consequence of the weak law.
#### The law of large numbers
The average of outcomes obtained from a large number of trials is close to the expected value of the random variable.
For independent and identically distributed (i.i.d.) random variables $X_1,X_2,\cdots,X_n$, with expected value $\mathbb{E}[X_i]=\mu$, the sample average
$$
\overline{X}_n=\frac{1}{n}\sum_{i=1}^n X_i
$$
converges to the expected value $\mu$ as $n$ goes to infinity.
#### The weak law of large numbers
The sample average converges in probability to the expected value $\mu$:
$$
\overline{X}_n \xrightarrow{p} \mu\text{ as }n\to\infty
$$
That is,
$$
\lim_{n\to\infty} P\left(|\overline{X}_n-\mu|<\epsilon\right)=1
$$
for any positive $\epsilon$.
> Intuitively, for any nonzero margin $\epsilon$, no matter how small, with a sufficiently large sample there will be a very high probability that the average of the observations will be close to the expected value (within the margin)
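As a quick illustration (my own, not from the lecture), sample averages of fair-die rolls concentrate around the expected value $3.5$ as $n$ grows:

```python
import random

random.seed(0)
for n in (10, 100, 10_000, 1_000_000):
    avg = sum(random.randint(1, 6) for _ in range(n)) / n    # sample average of n die rolls
    print(f"n = {n:>9}: sample average = {avg:.4f}")         # approaches E[X] = 3.5
```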
#### The AEP
Let $X$ be an i.i.d. source that takes values in an alphabet $\mathcal{X}$ and produces a sequence of i.i.d. random variables $X_1, \cdots, X_n$.
Let $p(X_1, \cdots, X_n)$ be the probability of observing the sequence $X_1, \cdots, X_n$.
Theorem (AEP):
$$
-\frac{1}{n} \log p(X_1, \cdots, X_n) \xrightarrow{p} H(X)\text{ as }n\to \infty
$$
<details>
<summary>Proof</summary>
Since the $X_i$ are i.i.d., $p(X_1, \cdots, X_n)=\prod_{i=1}^n p(X_i)$, so
$$
-\frac{1}{n} \log p(X_1, \cdots, X_n) = -\frac{1}{n} \sum_{i=1}^n \log p(X_i)
$$
The $\log p(X_i)$ terms are themselves i.i.d. random variables, so by the weak law of large numbers,
$$
-\frac{1}{n} \sum_{i=1}^n \log p(X_i) \xrightarrow{p} -\mathbb{E}[\log p(X_i)]
$$
and
$$
-\mathbb{E}[\log p(X_i)]=\mathbb{E}\left[\log \frac{1}{p(X_i)}\right]=H(X)
$$
</details>
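The convergence can be observed numerically. The sketch below is my own (the Bernoulli parameter $p=0.3$ is an arbitrary choice); it draws sequences from a Bernoulli source and compares $-\frac{1}{n}\log p(X_1,\cdots,X_n)$ with $H(X)$.

```python
import math, random

p = 0.3                                                        # Bernoulli(p) source
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))           # entropy of the source, in bits

random.seed(1)
for n in (10, 100, 10_000, 1_000_000):
    xs = [1 if random.random() < p else 0 for _ in range(n)]   # one length-n source sequence
    log_p = sum(math.log2(p) if x else math.log2(1 - p) for x in xs)
    print(f"n = {n:>9}: -1/n log p(X^n) = {-log_p / n:.4f}   (H(X) = {H:.4f})")
```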
### Typical sets
#### Definition of typical set
The typical set (denoted $A_\epsilon^{(n)}$) with respect to $p(x)$ is the set of sequences $(x_1, \cdots, x_n)\in \mathcal{X}^n$ that satisfy
$$
2^{-n(H(X)+\epsilon)}\leq p(x_1, \cdots, x_n)\leq 2^{-n(H(X)-\epsilon)}
$$
In other words, the typical set contains all $n$-length sequences with probability close to $2^{-nH(X)}$.
- This notion of typicality only concerns the probability of a sequence and not the actual sequence itself.
- It has great use in compression theory.
#### Properties of the typical set
- If $(x_1, \cdots, x_n)\in A_\epsilon^{(n)}$, then $H(X)-\epsilon\leq -\frac{1}{n} \log p(x_1, \cdots, x_n)\leq H(X)+\epsilon$.
- $\operatorname{Pr}(A_\epsilon^{(n)})\geq 1-\epsilon$ for sufficiently large $n$.
<details>
<summary>Sketch of proof</summary>
By the AEP, $-\frac{1}{n}\log p(X_1, \cdots, X_n)\xrightarrow{p} H(X)$, so for any $\delta>0$ there exists $n_0$ such that for all $n\geq n_0$,
$$
\operatorname{Pr}(A_\epsilon^{(n)})\geq 1-\delta
$$
Taking $\delta=\epsilon$ gives the claim.
</details>
- $|A_\epsilon^{(n)}|\leq 2^{n(H(X)+\epsilon)}$.
- $|A_\epsilon^{(n)}|\geq (1-\epsilon)2^{n(H(X)-\epsilon)}$ for sufficiently large $n$.
#### Smallest probable set
The typical set $A_\epsilon^{(n)}$ is a fairly small set with most of the probability.
Q: Is it the smallest such set?
A: Not quite, but pretty close.
Notation: Let $X_1, \cdots, X_n$ be i.i.d. random variables drawn from $p(x)$. For some $\delta < \frac{1}{2}$,
we denote $B_\delta^{(n)}\subset \mathcal{X}^n$ as the smallest set such that $\operatorname{Pr}(B_\delta^{(n)})\geq 1-\delta$.
Notation: We write $a_n \doteq b_n$ (equality to first order in the exponent) if
$$
\lim_{n\to\infty} \frac{1}{n}\log \frac{a_n}{b_n}=0
$$
Theorem: $|B_\delta^{(n)}|\doteq |A_\epsilon^{(n)}|\doteq 2^{nH(X)}$.
See the textbook for a detailed proof.
What is the difference between $B_\delta^{(n)}$ and $A_\epsilon^{(n)}$?
Consider a Bernoulli sequence $X_1, \cdots, X_n$ with $p = 0.9$.
The typical sequences are those in which the proportion of 1's is close to 0.9.
- However, they do not include the most likely sequence, i.e., the sequence of all 1's!
- $H(X) = H(0.9) \approx 0.469$.
- $-\frac{1}{n} \log p(1, \cdots, 1) = -\frac{1}{n} \log 0.9^n = -\log 0.9 \approx 0.152$. Its average log-probability cannot come close to the entropy of $X$, no matter how large $n$ is.
- The set $B_\delta^{(n)}$ contains all the most probable sequences and includes the sequence of all 1's.
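The numbers above can be checked directly for a moderate block length. The sketch below is my own illustration ($n=500$ and $\epsilon=0.1$ are arbitrary choices); it groups sequences by their number of 1's, computes $|A_\epsilon^{(n)}|$ and $\operatorname{Pr}(A_\epsilon^{(n)})$, and confirms that the all-ones sequence is not typical.

```python
import math

p, eps, n = 0.9, 0.1, 500
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))        # H(X), roughly 0.469 bits

lo, hi = 2 ** (-n * (H + eps)), 2 ** (-n * (H - eps))        # typicality thresholds
A_size, A_prob = 0, 0.0
for k in range(n + 1):                                       # k = number of 1's in the sequence
    prob_seq = p ** k * (1 - p) ** (n - k)                   # probability of any one such sequence
    if lo <= prob_seq <= hi:
        A_size += math.comb(n, k)
        A_prob += math.comb(n, k) * prob_seq

print(f"log2 |A| = {math.log2(A_size):.1f}   vs   n(H + eps) = {n * (H + eps):.1f}")
print(f"Pr(A) = {A_prob:.3f}   (>= 1 - eps = {1 - eps})")
print(f"all-ones sequence typical? {lo <= p ** n <= hi}")    # False: -log2(0.9) is about 0.152, far from H(X)
```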
### Consequences of the AEP: data compression schemes
Let $X_1, \cdots, X_n$ be i.i.d. random variables drawn from $p(x)$.
We want to find the shortest description of the sequence $X_1, \cdots, X_n$.
First, divide all sequences in $\mathcal{X}^n$
into two sets, namely the typical set $A_\epsilon^{(n)}$ and its complement $A_\epsilon^{(n)c}$.
- $A_\epsilon^{(n)}$ contains most of the probability while $A_\epsilon^{(n)c}$ contains most elements.
- The typical set has probability close to 1 and contains approximately $2^{nH(X)}$ elements.
Order the sequences in each set in lexicographic order.
- This means we can represent each sequence of $A_\epsilon^{(n)}$ by giving its index in the set.
- There are at most $2^{n(H(X)+\epsilon)}$ sequences in $A_\epsilon^{(n)}$.
- Indexing requires no more than $n(H(X)+\epsilon)+1$ bits.
- Prefix each of these with a "0": at most $n(H(X)+\epsilon)+2$ bits to represent each.
- Similarly, index each sequence in $A_\epsilon^{(n)c}$ with no more than $n\log |\mathcal{X}|+1$ bits.
- Prefix it with a "1": at most $n\log |\mathcal{X}|+2$ bits to represent each.
- Voilà! We have a code for all sequences in $\mathcal{X}^n$.
- The typical sequences have short descriptions of length $\approx nH(X)$ bits.
#### Algorithm for data compression
1. Divide all $n$-length sequences in $\mathcal{X}^n$ into the typical set $A_\epsilon^{(n)}$ and its complement $A_\epsilon^{(n)c}$.
2. Order the sequences in each set in lexicographic order.
3. Index each sequence in $A_\epsilon^{(n)}$ using $\leq n(H(X)+\epsilon)+1$ bits and each sequence in $A_\epsilon^{(n)c}$ using $\leq n\log |\mathcal{X}|+1$ bits.
4. Prefix the sequence by "0" if it is in $A_\epsilon^{(n)}$ and "1" if it is in $A_\epsilon^{(n)c}$.
Notes:
- The code is one-to-one and can be decoded easily. The initial bit acts as a "flag".
- We index $A_\epsilon^{(n)c}$ as if it were all of $\mathcal{X}^n$, even though it has fewer elements; this wastefulness turns out not to matter, since $A_\epsilon^{(n)c}$ has small probability.
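As a toy illustration of this scheme (my own sketch; names such as `typical_set_code` are not from the lecture), the code below enumerates $\{0,1\}^n$ for a small Bernoulli source, splits it into $A_\epsilon^{(n)}$ and $A_\epsilon^{(n)c}$, indexes each set in lexicographic order, and prepends the flag bit.

```python
import itertools, math

def binary_entropy(p):
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def typical_set_code(n, p, eps):
    """Two-part code: split {0,1}^n into typical/atypical sequences,
    index each part in lexicographic order, and prefix with a flag bit."""
    H = binary_entropy(p)
    lo, hi = 2 ** (-n * (H + eps)), 2 ** (-n * (H - eps))
    typical, atypical = [], []
    for x in itertools.product((0, 1), repeat=n):        # generated in lexicographic order
        prob = p ** sum(x) * (1 - p) ** (n - sum(x))
        (typical if lo <= prob <= hi else atypical).append(x)

    def index_width(m):                                  # bits needed to index a set of size m
        return max(1, math.ceil(math.log2(m)))

    enc = {}
    for flag, group in (("0", typical), ("1", atypical)):
        if not group:                                    # skip an empty part
            continue
        w = index_width(len(group))
        for idx, x in enumerate(group):
            enc[x] = flag + format(idx, f"0{w}b")
    return enc

enc = typical_set_code(n=12, p=0.9, eps=0.2)
dec = {codeword: x for x, codeword in enc.items()}       # the code is one-to-one
x = (1,) * 11 + (0,)                                     # a typical sequence (one zero)
print(x, "->", enc[x], "->", dec[enc[x]])                # short codeword with flag bit '0'
```

For the binary alphabet, typical sequences get codewords of length roughly $n(H(X)+\epsilon)+2$, while atypical ones fall back to roughly $n+2$ bits.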
#### Expected length of codewords
Use $x_n$ to denote a sequence $x_1, \cdots, x_n$ and $\ell_{x_n}$ to denote the length of the corresponding codeword.
Suppose $n$ is sufficiently large.
This means $\operatorname{Pr}(A_\epsilon^{(n)})\geq 1-\epsilon$.
Lemma: The expected length of the codeword $\mathbb{E}[\ell_{x_n}]$ is upper bounded by
$n(H(X)+\epsilon')$, where $\epsilon'=\epsilon+\epsilon\log |\mathcal{X}|+\frac{2}{n}$.
$\epsilon'$ can be made arbitrarily small by choosing $\epsilon$ and $n$ appropriately.
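A short sketch of where $\epsilon'$ comes from, using the codeword lengths above and $\operatorname{Pr}(A_\epsilon^{(n)c})\leq \epsilon$ for sufficiently large $n$:
$$
\begin{aligned}
\mathbb{E}[\ell_{x_n}]
&= \sum_{x_n\in A_\epsilon^{(n)}} p(x_n)\,\ell_{x_n} + \sum_{x_n\in A_\epsilon^{(n)c}} p(x_n)\,\ell_{x_n}\\
&\leq \operatorname{Pr}(A_\epsilon^{(n)})\bigl(n(H(X)+\epsilon)+2\bigr) + \operatorname{Pr}(A_\epsilon^{(n)c})\bigl(n\log |\mathcal{X}|+2\bigr)\\
&\leq n(H(X)+\epsilon) + \epsilon\, n\log |\mathcal{X}| + 2\\
&= n\left(H(X)+\epsilon+\epsilon\log |\mathcal{X}|+\tfrac{2}{n}\right) = n(H(X)+\epsilon').
\end{aligned}
$$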
#### Efficient data compression guarantee
The expected length of the codeword $\mathbb{E}[\ell_{x_n}]$ is upper bounded by $n(H(X)+\epsilon')$, where $\epsilon'=\epsilon+\epsilon\log |\mathcal{X}|+\frac{2}{n}$.
Theorem: Let $X_n\triangleq X_1, \cdots, X_n$ be i.i.d. with $p(x)$ and $\epsilon > 0$. Then there exists a code that maps sequences $x_n$ to binary strings (codewords) such that the mapping is one-to-one and
$\mathbb{E}[\ell_{x_n}] \leq n(H(X)+\epsilon)$ for $n$ sufficiently large.
- Thus, we can represent sequences $X_n$ using $\approx nH(X)$ bits on average
## Shannon's source coding theorem
There exists an algorithm that can compress $n$ i.i.d. random variables $X_1, \cdots, X_n$, each with entropy $H(X)$, into slightly more than $nH(X)$ bits with negligible risk of information loss. Conversely, if they are compressed into fewer than $nH(X)$ bits, the risk of information loss is very high.
- We have essentially proved the first half!
Proof of the converse: show that any set of size smaller than that of $A_\epsilon^{(n)}$ covers a set whose probability is bounded away from 1.