From 4b77a1e8e4b910127d1a4a64973b9f4ce1f31cad Mon Sep 17 00:00:00 2001
From: Zheyuan Wu <60459821+Trance-0@users.noreply.github.com>
Date: Tue, 28 Oct 2025 12:44:26 -0500
Subject: [PATCH] updates

---
 content/CSE5313/CSE5313_L17.md | 270 +++++++++++++++++++++++++++++++++
 content/index.md               |   2 +-
 2 files changed, 271 insertions(+), 1 deletion(-)

diff --git a/content/CSE5313/CSE5313_L17.md b/content/CSE5313/CSE5313_L17.md
index e98aeed..5a2af17 100644
--- a/content/CSE5313/CSE5313_L17.md
+++ b/content/CSE5313/CSE5313_L17.md
@@ -82,3 +82,273 @@ Goal: Quantify the amount of information that is leaked to the eavesdropper.
 
 ### One-time pad
 
+Let the message space be $\mathcal{M}=\{0,1\}^n$ and let $M$ denote the message.
+
+Suppose the Sender and Receiver agree on a key $k\sim U=\operatorname{Uniform}(\{0,1\}^n)$, chosen independently of $M$.
+
+Let $\operatorname{Enc}(m)=m\oplus k$.
+
+Measure the information leaked to the eavesdropper, who sees $Z=\operatorname{Enc}(M)=M\oplus U$.
+
+That is, we compute $I(M;Z)=H(Z)-H(Z|M)$.
+
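+A minimal sketch of the scheme in Python (the helper names `keygen`, `otp_encrypt`, and `otp_decrypt` are illustrative, and the standard `secrets` module is just one way to sample a uniform key):
+
+```python
+import secrets
+
+def keygen(n: int) -> int:
+    """Sample a key k uniformly from {0,1}^n, represented as an n-bit integer."""
+    return secrets.randbits(n)
+
+def otp_encrypt(m: int, k: int) -> int:
+    """Enc(m) = m XOR k."""
+    return m ^ k
+
+def otp_decrypt(z: int, k: int) -> int:
+    """Dec(z) = z XOR k; correctness follows from (m XOR k) XOR k = m."""
+    return z ^ k
+
+n = 8
+k = keygen(n)
+m = 0b10110010
+z = otp_encrypt(m, k)
+assert otp_decrypt(z, k) == m  # the receiver recovers the message exactly
+```
+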
+Proof
+
+Recall that $Z=M\oplus U$.
+
+So
+
+$$
+\begin{aligned}
+\operatorname{Pr}(Z=z)&=\operatorname{Pr}(M\oplus U=z)\\
+&=\sum_{m\in \mathcal{M}} \operatorname{Pr}(M=m)\operatorname{Pr}(M\oplus U=z|M=m) &&\text{by the law of total probability}\\
+&=\sum_{m\in \mathcal{M}} \operatorname{Pr}(M=m) \operatorname{Pr}(U=m\oplus z)\\
+&=\frac{1}{2^n} \sum_{m\in \mathcal{M}} \operatorname{Pr}(M=m) &&\text{since the key }U\text{ is uniform, }\operatorname{Pr}(U=m\oplus z)=2^{-n}\\
+&=\frac{1}{2^n}
+\end{aligned}
+$$
+
+$Z$ is uniformly distributed over $\{0,1\}^n$.
+
+So $H(Z)=\log_2 2^n=n$.
+
+$H(Z|M)=H(U)=n$ because, given $M=m$, the ciphertext $Z=m\oplus U$ is uniformly distributed over $\{0,1\}^n$.
+
+- Notice $m\oplus\{0,1\}^n=\{0,1\}^n$ for every $m\in \mathcal{M}$ (since $(\{0,1\}^n,\oplus)$ is a group).
+- For every $z\in \{0,1\}^n$, $\operatorname{Pr}(m\oplus U=z)=\operatorname{Pr}(U=z\oplus m)=2^{-n}$.
+
+So $I(M;Z)=H(Z)-H(Z|M)=n-n=0$.
+
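+To see the same computation concretely, the following sketch (an illustration for the small case $n=3$) enumerates all keys: for every fixed message $m$, the ciphertext $Z=m\oplus U$ hits each value of $\{0,1\}^3$ exactly once, so $Z\mid M=m$ is uniform, hence $H(Z|M)=H(Z)=n$ and $I(M;Z)=0$.
+
+```python
+from collections import Counter
+
+n = 3
+keys = range(2 ** n)  # U is uniform over {0,1}^n
+
+for m in range(2 ** n):  # for every possible message m
+    dist = Counter(m ^ u for u in keys)  # distribution of Z = m XOR U
+    # every ciphertext value appears exactly once, so Z | M=m is uniform
+    assert len(dist) == 2 ** n
+    assert all(count == 1 for count in dist.values())
+
+# Since Z | M=m is uniform for every m, H(Z|M) = n = H(Z), so I(M;Z) = 0.
+```
+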
+### Discussion of information-theoretic privacy
+
+What does $I(Z;M)=0$ mean?
+
+- No information is leaked to the eavesdropper.
+- Regardless of the values of $M$ and $U$.
+- Regardless of the eavesdropper's computational power.
+
+Information-theoretic privacy:
+
+- Guarantees are given in terms of mutual information.
+- Remains private in the face of any computational power.
+- Remains private forever.
+
+A very strong form of privacy.
+
+> [!NOTE]
+>
+> Mutual information is an average metric for privacy; it gives no guarantee for any individual message.
+>
+> The one-time pad is, so far, the only known perfect-privacy scheme.
+
+## The asymptotic equipartition property (AEP) and data compression
+
+> [!NOTE]
+>
+> This section will help us understand the limits of data compression.
+
+### The asymptotic equipartition property
+
+Idea: consider the space of all possible sequences produced by a random source, and focus on the "typical" ones.
+
+- Asymptotic in the sense that many of the results focus on the regime of large source sequences.
+- Fundamental to the concept of typical sets used in data compression.
+- The analog of the law of large numbers (LLN) in information theory.
+  - Direct consequence of the weak law.
+
+#### The law of large numbers
+
+The average of the outcomes obtained from a large number of trials is close to the expected value of the random variable.
+
+For independent and identically distributed (i.i.d.) random variables $X_1,X_2,\cdots,X_n$ with expected value $\mathbb{E}[X_i]=\mu$, the sample average
+
+$$
+\overline{X}_n=\frac{1}{n}\sum_{i=1}^n X_i
+$$
+
+converges to the expected value $\mu$ as $n$ goes to infinity.
+
+#### The weak law of large numbers
+
+The sample average converges in probability to the expected value $\mu$:
+
+$$
+\overline{X}_n\xrightarrow{p} \mu\text{ as }n\to\infty
+$$
+
+That is,
+
+$$
+\lim_{n\to\infty} \operatorname{Pr}\left(|\overline{X}_n-\mu|<\epsilon\right)=1
+$$
+
+for any positive $\epsilon$.
+
+> Intuitively, for any nonzero margin $\epsilon$, no matter how small, with a sufficiently large sample there will be a very high probability that the average of the observations is close to the expected value (within the margin).
+
+#### The AEP
+
+Let $X$ be an i.i.d. source that takes values in alphabet $\mathcal{X}$ and produces a sequence of i.i.d. random variables $X_1, \cdots, X_n$.
+
+Let $p(X_1, \cdots, X_n)$ be the probability of observing the sequence $X_1, \cdots, X_n$.
+
+Theorem (AEP):
+
+$$
+-\frac{1}{n} \log p(X_1, \cdots, X_n) \xrightarrow{p} H(X)\text{ as }n\to \infty
+$$
+
+Proof
+
+Since the $X_i$ are independent, $p(X_1, \cdots, X_n)=\prod_{i=1}^n p(X_i)$, so
+
+$$
+-\frac{1}{n} \log p(X_1, \cdots, X_n) = -\frac{1}{n} \sum_{i=1}^n \log p(X_i)
+$$
+
+The $-\log p(X_i)$ terms are also i.i.d. random variables.
+
+So by the weak law of large numbers,
+
+$$
+-\frac{1}{n} \sum_{i=1}^n \log p(X_i) \xrightarrow{p} \mathbb{E}[-\log p(X_i)]
+$$
+
+and
+
+$$
+-\mathbb{E}[\log p(X_i)]=\mathbb{E}\left[\log \frac{1}{p(X_i)}\right]=H(X)
+$$
+
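+A quick simulation illustrates the concentration in the theorem. The sketch below assumes a Bernoulli source with $\operatorname{Pr}(X_i=1)=0.3$, for which $H(X)\approx 0.881$ bits; as $n$ grows, the empirical value of $-\frac{1}{n}\log_2 p(X_1, \cdots, X_n)$ settles near $H(X)$.
+
+```python
+import math
+import random
+
+q = 0.3  # assumed Bernoulli parameter: Pr(X_i = 1)
+H = -q * math.log2(q) - (1 - q) * math.log2(1 - q)  # H(X), about 0.881 bits
+
+random.seed(0)
+
+def per_symbol_log_prob(n: int) -> float:
+    """Draw X_1..X_n i.i.d. Bernoulli(q) and return -(1/n) log2 p(X_1..X_n)."""
+    xs = [1 if random.random() < q else 0 for _ in range(n)]
+    log_p = sum(math.log2(q) if x == 1 else math.log2(1 - q) for x in xs)
+    return -log_p / n
+
+for n in (10, 100, 1000, 10000):
+    print(n, round(per_symbol_log_prob(n), 3), "vs H(X) =", round(H, 3))
+```
+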
+### Typical sets
+
+#### Definition of typical set
+
+The typical set (denoted by $A_\epsilon^{(n)}$) with respect to $p(x)$ is the set of sequences $(x_1, \cdots, x_n)\in \mathcal{X}^n$ that satisfy
+
+$$
+2^{-n(H(X)+\epsilon)}\leq p(x_1, \cdots, x_n)\leq 2^{-n(H(X)-\epsilon)}
+$$
+
+In other words, the typical set contains all $n$-length sequences with probability close to $2^{-nH(X)}$.
+
+- This notion of typicality only concerns the probability of a sequence and not the actual sequence itself.
+- It is of great use in compression theory.
+
+#### Properties of the typical set
+
+- If $(x_1, \cdots, x_n)\in A_\epsilon^{(n)}$, then $H(X)-\epsilon\leq -\frac{1}{n} \log p(x_1, \cdots, x_n)\leq H(X)+\epsilon$.
+- $\operatorname{Pr}(A_\epsilon^{(n)})\geq 1-\epsilon$ for sufficiently large $n$.
+
+Sketch of proof
+
+By definition, $(X_1, \cdots, X_n)\in A_\epsilon^{(n)}$ exactly when $\left|-\frac{1}{n}\log p(X_1, \cdots, X_n)-H(X)\right|\leq \epsilon$, and by the AEP the probability of this event tends to $1$.
+
+So for any $\delta>0$, there exists $n_0$ such that for all $n\geq n_0$,
+
+$$
+\operatorname{Pr}(A_\epsilon^{(n)})\geq 1-\delta
+$$
+
+Taking $\delta=\epsilon$ gives $\operatorname{Pr}(A_\epsilon^{(n)})\geq 1-\epsilon$ for sufficiently large $n$.
+
+- $|A_\epsilon^{(n)}|\leq 2^{n(H(X)+\epsilon)}$.
+- $|A_\epsilon^{(n)}|\geq (1-\epsilon)2^{n(H(X)-\epsilon)}$ for sufficiently large $n$.
+
+#### Smallest probable set
+
+The typical set $A_\epsilon^{(n)}$ is a fairly small set that carries most of the probability.
+
+Q: Is it the smallest such set?
+
+A: Not quite, but pretty close.
+
+Notation: Let $X_1, \cdots, X_n$ be i.i.d. random variables drawn from $p(x)$. For some $\delta < \frac{1}{2}$, we denote by $B_\delta^{(n)}\subset \mathcal{X}^n$ the smallest set such that $\operatorname{Pr}(B_\delta^{(n)})\geq 1-\delta$.
+
+Notation: We write $a_n \doteq b_n$ (equality to first order in the exponent) if
+
+$$
+\lim_{n\to\infty} \frac{1}{n}\log\frac{a_n}{b_n}=0
+$$
+
+Theorem: $|B_\delta^{(n)}|\doteq |A_\epsilon^{(n)}|\doteq 2^{nH(X)}$.
+
+See the book for a detailed proof.
+
+What is the difference between $B_\delta^{(n)}$ and $A_\epsilon^{(n)}$?
+
+Consider a Bernoulli sequence $X_1, \cdots, X_n$ with $p = 0.9$.
+
+The typical sequences are those in which the proportion of 1's is close to 0.9.
+
+- However, they do not include the most likely sequence, i.e., the sequence of all 1's!
+- $H(X) = 0.469$.
+- $-\frac{1}{n} \log p(1, \cdots, 1) = -\frac{1}{n} \log 0.9^n = 0.152$. Its per-symbol log-probability cannot come close to the entropy of $X$, no matter how large $n$ is.
+- The set $B_\delta^{(n)}$ contains all the most probable sequences, and in particular includes the sequence of all 1's.
+
+### Consequences of the AEP: data compression schemes
+
+Let $X_1, \cdots, X_n$ be i.i.d. random variables drawn from $p(x)$.
+
+We want to find the shortest description of the sequence $X_1, \cdots, X_n$.
+
+First, divide all sequences in $\mathcal{X}^n$ into two sets, namely the typical set $A_\epsilon^{(n)}$ and its complement $A_\epsilon^{(n)c}$.
+
+- $A_\epsilon^{(n)}$ contains most of the probability, while $A_\epsilon^{(n)c}$ contains most of the elements.
+- The typical set has probability close to 1 and contains approximately $2^{nH(X)}$ elements.
+
+Order the sequences in each set in lexicographic order.
+
+- This means we can represent each sequence of $A_\epsilon^{(n)}$ by giving its index in the set.
+- There are at most $2^{n(H(X)+\epsilon)}$ sequences in $A_\epsilon^{(n)}$.
+- Indexing requires no more than $n(H(X)+\epsilon)+1$ bits.
+- Prefix all of these by a "0" bit ⇒ at most $n(H(X)+\epsilon)+2$ bits to represent each.
+- Similarly, index each sequence in $A_\epsilon^{(n)c}$ with no more than $n\log |\mathcal{X}|+1$ bits.
+- Prefix it by "1" ⇒ at most $n\log |\mathcal{X}|+2$ bits to represent each.
+- Voilà! We have a code for all sequences in $\mathcal{X}^n$.
+- The typical sequences have short descriptions of length $\approx nH(X)$ bits.
+
+#### Algorithm for data compression
+
+1. Divide all $n$-length sequences in $\mathcal{X}^n$ into the typical set $A_\epsilon^{(n)}$ and its complement $A_\epsilon^{(n)c}$.
+2. Order the sequences in each set in lexicographic order.
+3. Index each sequence in $A_\epsilon^{(n)}$ using $\leq n(H(X)+\epsilon)+1$ bits and each sequence in $A_\epsilon^{(n)c}$ using $\leq n\log |\mathcal{X}|+1$ bits.
+4. Prefix the sequence by "0" if it is in $A_\epsilon^{(n)}$ and "1" if it is in $A_\epsilon^{(n)c}$.
+
+Notes:
+
+- The code is one-to-one and can be decoded easily. The initial bit acts as a "flag".
+- The number of elements in $A_\epsilon^{(n)c}$ is less than the number of elements in $\mathcal{X}^n$, so the index for the non-typical set uses more bits than strictly necessary, but it turns out this does not matter.
+
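+The following sketch implements the scheme above for a small block length, assuming a Bernoulli source with $\operatorname{Pr}(X_i=1)=0.3$ so that the typical set can be enumerated exhaustively (the parameter choices and helper names are illustrative):
+
+```python
+import itertools
+import math
+
+q, n, eps = 0.3, 12, 0.1
+H = -q * math.log2(q) - (1 - q) * math.log2(1 - q)  # entropy of the source
+
+def log2_prob(seq):
+    """log2 of the probability of an n-length 0/1 sequence under Bernoulli(q)."""
+    return sum(math.log2(q) if x == 1 else math.log2(1 - q) for x in seq)
+
+# Split X^n into the typical set and its complement (lexicographic order
+# is already guaranteed by itertools.product).
+all_seqs = list(itertools.product((0, 1), repeat=n))
+typical = [s for s in all_seqs if H - eps <= -log2_prob(s) / n <= H + eps]
+typical_set = set(typical)
+atypical = [s for s in all_seqs if s not in typical_set]
+
+typ_index = {s: i for i, s in enumerate(typical)}
+atyp_index = {s: i for i, s in enumerate(atypical)}
+typ_bits = math.ceil(math.log2(max(len(typical), 2)))    # <= n(H(X)+eps) + 1
+atyp_bits = math.ceil(math.log2(max(len(atypical), 2)))  # <= n log|X| + 1
+
+def encode(seq) -> str:
+    """Flag bit ('0' = typical, '1' = non-typical) followed by the index."""
+    if seq in typ_index:
+        return "0" + format(typ_index[seq], f"0{typ_bits}b")
+    return "1" + format(atyp_index[seq], f"0{atyp_bits}b")
+
+# Expected codeword length under the source distribution, versus nH(X).
+expected_len = sum(2 ** log2_prob(s) * len(encode(s)) for s in all_seqs)
+print(round(expected_len, 2), "bits on average; nH(X) =", round(n * H, 2))
+```
+
+For a block length this small, the typical set does not yet carry most of the probability, so the average length is still above $nH(X)$; the bound below on the expected codeword length is an asymptotic guarantee.
+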
+#### Expected length of codewords
+
+Use $x^n$ to denote a sequence $x_1, \cdots, x_n$ and $\ell_{x^n}$ to denote the length of the corresponding codeword.
+
+Suppose $n$ is sufficiently large.
+
+This means $\operatorname{Pr}(A_\epsilon^{(n)})\geq 1-\epsilon$.
+
+Lemma: The expected length of the codeword $\mathbb{E}[\ell_{X^n}]$ is upper bounded by
+$n(H(X)+\epsilon')$, where $\epsilon'=\epsilon+\epsilon\log |\mathcal{X}|+\frac{2}{n}$.
+
+$\epsilon'$ can be made arbitrarily small by choosing $\epsilon$ and $n$ appropriately.
+
+#### Efficient data compression guarantee
+
+The expected length of the codeword $\mathbb{E}[\ell_{X^n}]$ is upper bounded by $n(H(X)+\epsilon')$, where $\epsilon'=\epsilon+\epsilon\log |\mathcal{X}|+\frac{2}{n}$.
+
+Theorem: Let $X^n\triangleq (X_1, \cdots, X_n)$ be i.i.d. with $p(x)$ and $\epsilon > 0$. Then there exists a code that maps sequences $x^n$ to binary strings (codewords) such that the mapping is one-to-one and
+$\mathbb{E}[\ell_{X^n}] \leq n(H(X)+\epsilon)$ for $n$ sufficiently large.
+
+- Thus, we can represent sequences $X^n$ using $\approx nH(X)$ bits on average.
+
+## Shannon's source coding theorem
+
+There exists an algorithm that can compress $n$ i.i.d. random variables $X_1, \cdots, X_n$, each with entropy $H(X)$, into slightly more than $nH(X)$ bits with negligible risk of information loss. Conversely, if they are compressed into fewer than $nH(X)$ bits, the risk of information loss is very high.
+
+- We have essentially proved the first half!
+
+Proof of converse: Show that any set of size smaller than that of $A_\epsilon^{(n)}$ covers a set of probability bounded away from 1.
diff --git a/content/index.md b/content/index.md
index 45ec37c..c5dbcd3 100644
--- a/content/index.md
+++ b/content/index.md
@@ -3,7 +3,7 @@
 > [!WARNING]
 >
 >
-> This site updates in a daily basis. But the sidebar is not. **If you find some notes are not shown on sidebar but the class already ends more than 24 hours**, please try to access the page directly via the URL. (for example, change the URL to `.../Math4201/Math4201_L{number}` to access the note of the lecture `Math4201_L{number}`)
+> This site uses SSG to generate static pages, and the cached copy in your browser may not show the latest updates. **If you find that some notes are not shown on the sidebar even though the class ended more than 24 hours ago**, please try to access the page directly via the URL, or force-reload to refresh the cache (for example, change the URL to `.../Math4201/Math4201_L{number}` to access the note of the lecture `Math4201_L{number}`, and then refresh the page).
 
 This was originated from another project [NoteChondria](https://github.com/Trance-0/Notechondria) that I've been working on for a long time but don't have a stable release yet.