diff --git a/content/CSE510/CSE510_L16.md b/content/CSE510/CSE510_L16.md
index b920462..2584b61 100644
--- a/content/CSE510/CSE510_L16.md
+++ b/content/CSE510/CSE510_L16.md
@@ -148,3 +148,30 @@
 $$
 y_1 = r + \gamma \min_{i=1,2} Q^{\theta_i'}(s', \pi_{\phi_i}(s'))
 $$
+High-variance estimates provide a noisy gradient.
+
+Techniques used in TD3 to reduce the variance:
+
+- Update the policy at a lower frequency than the value network (delayed policy updates).
+- Smooth the value estimate (target policy smoothing):
+  $$
+  y=r+\gamma \mathbb{E}_{\epsilon}[Q^{\theta'}(s', \pi_{\phi'}(s')+\epsilon)]
+  $$
+
+Update target:
+
+$$
+y=r+\gamma \mathbb{E}_{\epsilon}[Q^{\theta'}(s', \pi_{\phi'}(s')+\epsilon)]
+$$
+
+where $\epsilon\sim \operatorname{clip}(\mathcal{N}(0, \sigma), -c, c)$ is clipped Gaussian noise.
+
+#### Other methods
+
+- Generalizable Episode Memory for Deep Reinforcement Learning
+- Distributed Distributional Deep Deterministic Policy Gradient (D4PG)
+  - Distributional critic
+  - N-step returns are used to update the critic
+  - Multiple distributed parallel actors
+  - Prioritized experience replay
+- 
\ No newline at end of file
diff --git a/content/CSE5313/CSE5313_L15.md b/content/CSE5313/CSE5313_L15.md
index 20d67af..cf050e2 100644
--- a/content/CSE5313/CSE5313_L15.md
+++ b/content/CSE5313/CSE5313_L15.md
@@ -1 +1,327 @@
-# CSE5313 Coding and information theory for data science (Lecture 15)
\ No newline at end of file
+# CSE5313 Coding and information theory for data science (Lecture 15)
+
+## Information theory
+
+Information theory concerns the transmission, processing, extraction, and utilization of information.
+
+- Information: "Resolution" of uncertainty.
+- Question in the 1940s: How to quantify the complexity of information?
+  - Questions:
+    - How many bits are required to **describe** an information source?
+    - How many bits are required to **transmit** an information source?
+    - How much information does one source **reveal** about another?
+  - Applications:
+    - Data Compression.
+    - Channel Coding.
+    - Privacy.
+- Claude Shannon, 1948: Information Entropy.
+
+### Entropy
+
+The "information value" of a message depends on how surprising it is.
+
+- The more unlikely the event, the more informative the message.
+
+The **Shannon information** of an event $E$ is:
+
+$$
+I(E)=\log_2 \frac{1}{Pr(E)}=-\log_2 Pr(E)
+$$
+
+Entropy is the expected amount of information in a random trial.
+
+- Rolling a die has more entropy than flipping a coin (6 states vs. 2 states).
+
+#### Information entropy
+
+Let $X$ be a random variable with values in some finite set $\mathcal{X}$.
+
+The entropy $H(X)$ of $X$ is defined as:
+
+$$
+H(X)=\sum_{x\in \mathcal{X}} Pr(X=x)\log_2 \frac{1}{Pr(X=x)}=-\mathbb{E}_{x\sim X}[\log_2 Pr(X=x)]
+$$
+
+Idea: How many bits are required, on average, to describe $X$?
+
+- The more unlikely the event, the more informative the message.
+- Use "few" bits for common $x$ (i.e., $Pr(x)$ large).
+- Use "many" bits for rare $x$ (i.e., $Pr(x)$ small).
+
+Notes:
+
+- $H(X)=\mathbb{E}_{x\sim X}[I(X=x)]$
+- $H(X)\geq 0$
+- $H(X)=0$ if and only if $Pr(X=x)=1$ for some $x\in \mathcal{X}$.
+- $H(X)$ does not depend on the labels in $\mathcal{X}$, only on the probability distribution $Pr(X=x)$.
+- Maximum entropy is achieved by the uniform distribution: $H(Uniform(n))=\log_2 n$, where $n$ is the number of outcomes ($a$ bits are required to describe each outcome of a uniform source with $2^a$ outcomes).
+
+<details>
+<summary>Example</summary>
+
+For the uniform distribution $X\sim Uniform\{0,1\}$, we have $H(X)=\log_2 2=1$.
+
+---
+
+For the Bernoulli distribution $X\sim Bernoulli(p)$, we have $H(X)=-p\log_2 p-(1-p)\log_2 (1-p)=H(p)$.
+
+</details>
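+
+As a quick numerical sanity check of the definitions above, here is a minimal sketch (the `entropy` helper and the sample values are illustrative, not from the lecture):
+
+```python
+from math import log2
+
+def entropy(probs):
+    """Shannon entropy H(X) in bits of a finite distribution."""
+    return -sum(p * log2(p) for p in probs if p > 0)
+
+print(entropy([1/2, 1/2]))   # 1.0 bit: uniform over {0, 1}
+print(entropy([0.9, 0.1]))   # ~0.469 bits: Bernoulli(0.9), i.e. H(0.9)
+print(entropy([1/6] * 6))    # ~2.585 bits: a fair die, log2(6), more than a coin
+```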
+
+### Motivation
+
+Optimal compression:
+
+Consider a random variable $X$ with the distribution and encoding given below:
+
+|Value|Probability|Encoding|
+|-----|-----------|--------|
+|1    |1/8        |000     |
+|2    |1/8        |001     |
+|3    |1/4        |01      |
+|4    |1/2        |1       |
+
+The average length of the encoding is $1/8\times 3+1/8\times 3+1/4\times 2+1/2\times 1=7/4$.
+
+And the entropy of $X$ is $H(X)=1/8\times \log_2 8+1/8\times \log_2 8+1/4\times \log_2 4+1/2\times \log_2 2=7/4$.
+
+So the average length of the encoding is equal to the entropy of $X$.
+
+This is the optimal compression, and it is exactly the encoding produced by Huffman coding.
+
+#### A few extra theorems that will not be proved in this course
+
+- **Theorem**: Avg. # of bits in any prefix-free compression of $X$ is ≥ $H(X)$.
+- **Theorem**: Huffman coding is optimal.
+  - I.e., its avg. # of bits is minimal among prefix-free codes (and equals $H(X)$ when all probabilities are powers of $2$, as above).
+- **Disadvantage of Huffman coding**: the distribution must be known.
+  - This is generally not the case.
+- **[Lempel-Ziv 1978]**: Universal lossless data compression.
+
+### Conditional and joint entropy
+
+How does the entropy of different random variables interact?
+
+- As a function of their dependence.
+- Needed in order to relate:
+  - A variable and its compression.
+  - A variable and its transmission over a noisy channel.
+  - A variable and its encryption.
+
+#### Definition of joint entropy
+
+Let $X$ and $Y$ be discrete random variables.
+
+The joint entropy $H(X,Y)$ of $X$ and $Y$ is defined as:
+
+$$
+H(X,Y)=-\sum_{x\in \mathcal{X}, y\in \mathcal{Y}} Pr(X=x, Y=y) \log_2 Pr(X=x, Y=y)
+$$
+
+Notes:
+
+- $H(X,Y)\geq 0$
+- $H(X,Y)=H(Y,X)$
+- $H(X,Y)\geq \max\{H(X),H(Y)\}$
+- $H(X,Y)\leq H(X)+H(Y)$, with equality if and only if $X$ and $Y$ are independent.
+
+#### Conditional entropy
+
+> [!NOTE]
+>
+> Recall that the conditional probability $P(Y|X)$ is defined as: $P(Y|X)=\frac{P(Y,X)}{P(X)}$.
+
+What is the average amount of uncertainty remaining in $Y$ when $X$ is known?
+
+For each given $x\in \mathcal{X}$, the conditional entropy $H(Y|X=x)$ is defined as:
+
+$$
+\begin{aligned}
+H(Y|X=x)&=\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 \frac{1}{Pr(Y=y|X=x)} \\
+&=-\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 Pr(Y=y|X=x)
+\end{aligned}
+$$
+
+The conditional entropy $H(Y|X)$ is defined as:
+
+$$
+\begin{aligned}
+H(Y|X)&=\mathbb{E}_{x\sim X}[H(Y|X=x)] \\
+&=\sum_{x\in \mathcal{X}} Pr(X=x)H(Y|X=x) \\
+&=-\sum_{x\in \mathcal{X}} Pr(X=x)\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 Pr(Y=y|X=x) \\
+&=-\sum_{x\in \mathcal{X}, y\in \mathcal{Y}} Pr(X=x, Y=y) \log_2 Pr(Y=y|X=x)
+\end{aligned}
+$$
+
+Notes:
+
+- $H(X|X)=0$
+- $H(Y|X)=0$ when $Y$ is a function of $X$.
+- Conditional entropy is not necessarily symmetric: in general, $H(X|Y)\neq H(Y|X)$.
+- Conditioning never increases entropy: $H(X|Y)\leq H(X)$.
+
+#### Chain rules of entropy
+
+Joint entropy: $H(X,Y)=\mathbb{E}_{(x,y)\sim (X,Y)}\left[\log_2 \frac{1}{Pr(X=x, Y=y)}\right]$.
+
+Conditional entropy: $H(Y|X)=\mathbb{E}_{(x,y)\sim (X,Y)}\left[\log_2 \frac{1}{Pr(Y=y|X=x)}\right]$.
+
+Chain rule: $H(X,Y)=H(X)+H(Y|X)$.
+
+<details>
+<summary>Proof</summary>
+
+Write $\log_2 \frac{1}{Pr(x,y)}=\log_2 \frac{1}{Pr(x)}+\log_2 \frac{1}{Pr(y|x)}$ and take expectations over the joint distribution; this gives $H(X,Y)=H(X)+H(Y|X)$.
+
+For **statistically independent** $X$ and $Y$, we have $Pr(x,y)=Pr(x)Pr(y)$, and therefore
+
+$Pr(y|x)=\frac{Pr(x,y)}{Pr(x)}=\frac{Pr(x)Pr(y)}{Pr(x)}=Pr(y)$,
+
+so $H(Y|X)=H(Y)$ and the chain rule reduces to $H(X,Y)=H(X)+H(Y)$.
+
+_Applying the symmetry of the joint distribution gives the other form, $H(X,Y)=H(Y)+H(X|Y)$._
+
+</details>
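+
+A short numerical check of the chain rule (a sketch; the joint distribution below is a made-up example, not from the lecture):
+
+```python
+from math import log2
+
+def H(probs):
+    """Entropy in bits; ignores zero-probability entries."""
+    return -sum(p * log2(p) for p in probs if p > 0)
+
+# Hypothetical joint distribution Pr(X=x, Y=y); rows are x, columns are y.
+joint = [[1/4, 1/4],
+         [1/2, 0.0]]
+
+px = [sum(row) for row in joint]                      # marginal of X
+H_XY = H([p for row in joint for p in row])           # joint entropy H(X,Y)
+H_Y_given_X = sum(px[i] * H([p / px[i] for p in joint[i]])
+                  for i in range(len(joint)))         # H(Y|X) = E_x[H(Y|X=x)]
+
+print(H_XY)                  # 1.5
+print(H(px) + H_Y_given_X)   # 1.5, i.e. H(X,Y) = H(X) + H(Y|X)
+```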
+
+<details>
+<summary>Example of computing the conditional entropy</summary>
+
+For the given distribution:
+
+|Y\X|1|2|3|4|Y marginal|
+|-----|-----|-----|-----|-----|-----|
+|1 |1/8 |1/16 |1/32 |1/32 |1/4 |
+|2 |1/16 |1/8 |1/32 |1/32 |1/4 |
+|3 |1/16 |1/16 |1/16 |1/16 |1/4 |
+|4 |1/4 |0 |0 |0 |1/4 |
+|X marginal|1/2 |1/4 |1/8 |1/8 |1.0 |
+
+Here $H(X)=H((1/2,1/4,1/8,1/8))=-(1/2\log_2 1/2+1/4\log_2 1/4+1/8\log_2 1/8+1/8\log_2 1/8)=7/4$.
+
+$H(Y)=H((1/4,1/4,1/4,1/4))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/4\log_2 1/4+1/4\log_2 1/4)=2$.
+
+$H(Y|X=1)=H((1/4,1/8,1/8,1/2))=-(1/4\log_2 1/4+1/8\log_2 1/8+1/8\log_2 1/8+1/2\log_2 1/2)=7/4$.
+
+$H(Y|X=2)=H((1/4,1/2,1/4))=-(1/4\log_2 1/4+1/2\log_2 1/2+1/4\log_2 1/4)=1.5$.
+
+$H(Y|X=3)=H((1/4,1/4,1/2))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/2\log_2 1/2)=1.5$.
+
+$H(Y|X=4)=H((1/4,1/4,1/2))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/2\log_2 1/2)=1.5$.
+
+So $H(Y|X)=\sum_{x\in \mathcal{X}} Pr(x)H(Y|X=x)=1/2\times 7/4+1/4\times 1.5+1/8\times 1.5+1/8\times 1.5=13/8$.
+
+</details>
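+
+The same computation as a short sketch (helper names are mine), reproducing $H(Y|X)=13/8$ from the joint table:
+
+```python
+from math import log2
+
+def H(probs):
+    """Entropy in bits; ignores zero-probability entries."""
+    return -sum(p * log2(p) for p in probs if p > 0)
+
+# Joint distribution Pr(X=x, Y=y) from the table above; rows are Y, columns are X.
+joint = [
+    [1/8,  1/16, 1/32, 1/32],
+    [1/16, 1/8,  1/32, 1/32],
+    [1/16, 1/16, 1/16, 1/16],
+    [1/4,  0,    0,    0],
+]
+
+px = [sum(joint[y][x] for y in range(4)) for x in range(4)]  # (1/2, 1/4, 1/8, 1/8)
+
+# H(Y|X) = sum_x Pr(X=x) * H(Y | X=x)
+H_Y_given_X = sum(px[x] * H([joint[y][x] / px[x] for y in range(4)])
+                  for x in range(4))
+
+print(H(px))          # 1.75  = H(X) = 7/4
+print(H_Y_given_X)    # 1.625 = H(Y|X) = 13/8
+```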
+
+### Mutual information
+
+#### Definition of mutual information
+
+The mutual information $I(X;Y)$ of $X$ and $Y$ is defined as:
+
+$$
+I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)
+$$
+
+#### Properties of mutual information
+
+- $I(X;Y)\geq 0$
+  - Proved via Jensen's inequality.
+  - Equivalently: conditioning never increases entropy.
+- $I(X;Y)=I(Y;X)$ (symmetry).
+- $I(X;X)=H(X)-H(X|X)=H(X)$
+- $I(X;Y)=H(X)+H(Y)-H(X,Y)$ (see the short derivation below).
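+
+The last identity follows from the definition together with the chain rule $H(X,Y)=H(Y)+H(X|Y)$:
+
+$$
+\begin{aligned}
+I(X;Y)&=H(X)-H(X|Y)\\
+&=H(X)-\left(H(X,Y)-H(Y)\right)\\
+&=H(X)+H(Y)-H(X,Y)
+\end{aligned}
+$$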
+
+<details>
+<summary>Example of computing the mutual information</summary>
+
+For the given distribution:
+
+|Y\X|1|2|3|4|Y marginal|
+|-----|-----|-----|-----|-----|-----|
+|1 |1/8 |1/16 |1/32 |1/32 |1/4 |
+|2 |1/16 |1/8 |1/32 |1/32 |1/4 |
+|3 |1/16 |1/16 |1/16 |1/16 |1/4 |
+|4 |1/4 |0 |0 |0 |1/4 |
+|X marginal|1/2 |1/4 |1/8 |1/8 |1.0 |
+
+Recall from the previous example:
+
+- $H(Y|X)=\frac{13}{8}$
+- $H(Y)=2$
+
+Then the mutual information is:
+
+$$
+I(X;Y)=H(Y)-H(Y|X)=2-\frac{13}{8}=\frac{3}{8}
+$$
+
+So the entropy of $Y$ is reduced by $\frac{3}{8}$ bits on average when $X$ is known.
+
+</details>
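+
+A sketch checking the same value through the identity $I(X;Y)=H(X)+H(Y)-H(X,Y)$ (helper names are mine):
+
+```python
+from math import log2
+
+def H(probs):
+    """Entropy in bits; ignores zero-probability entries."""
+    return -sum(p * log2(p) for p in probs if p > 0)
+
+# Joint distribution Pr(X=x, Y=y) from the table above; rows are Y, columns are X.
+joint = [
+    [1/8,  1/16, 1/32, 1/32],
+    [1/16, 1/8,  1/32, 1/32],
+    [1/16, 1/16, 1/16, 1/16],
+    [1/4,  0,    0,    0],
+]
+
+px = [sum(joint[y][x] for y in range(4)) for x in range(4)]  # marginal of X
+py = [sum(row) for row in joint]                             # marginal of Y
+H_XY = H([p for row in joint for p in row])                  # joint entropy
+
+print(H(px) + H(py) - H_XY)   # 0.375 = 3/8
+```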
+
+## Applications
+
+### Channel capacity
+
+Recall that a channel is a tuple $(F,\Phi,\operatorname{Pr})$.
+
+#### Definition of discrete channel
+
+A discrete channel is a system consisting of a discrete input alphabet $\mathcal{X}$, a discrete output alphabet $\mathcal{Y}$, and a probability transition function $\operatorname{Pr}(y|x)$.
+
+The channel is _memoryless_ if the output at any time depends only on the input at that time and not on the inputs and outputs at other times.
+
+#### Definition of channel capacity
+
+The channel capacity $C$ of a channel is defined as:
+
+$$
+C=\max_{p(x)} I(X;Y)
+$$
+
+where $I(X;Y)$ is the mutual information between the input $X$ and the output $Y$, and the maximum is taken over all input distributions $p(x)$.
+
+#### Shannon's noisy coding theorem
+
+Recall that the rate of a code $\mathcal{C}$ of length $n$ is $R=\frac{\log_{|F|}|\mathcal{C}|}{n}$.
+
+For a discrete memoryless channel with capacity $C$, every rate $R < C$ is achievable: there exist codes of rate $R$ whose probability of decoding error tends to $0$ as $n\to\infty$. Conversely, no rate above $C$ is achievable.
+
+For example, for the binary symmetric channel, which flips each transmitted bit independently with crossover probability $p$, the capacity is $C=1-H(p)$.
+
+<details>
+<summary>Proof</summary>
+
+$$
+\begin{aligned}
+H(Y|X=x)&=-\operatorname{Pr}(y=0|x)\log_2 \operatorname{Pr}(y=0|x)-\operatorname{Pr}(y=1|x)\log_2 \operatorname{Pr}(y=1|x) \\
+&=-p\log_2 p-(1-p)\log_2 (1-p)\\
+&=H(p)
+\end{aligned}
+$$
+
+So,
+
+$$
+\begin{aligned}
+H(Y|X)&=\sum_{x\in \mathcal{X}} \operatorname{Pr}(x)H(Y|X=x)\\
+&=H(p)\sum_{x\in \mathcal{X}} \operatorname{Pr}(x) \\
+&=H(p)
+\end{aligned}
+$$
+
+So,
+
+$$
+\begin{aligned}
+I(X;Y)&=H(Y)-H(Y|X)\\
+&=H(Y)-H(p)\\
+&\leq 1-H(p)
+\end{aligned}
+$$
+
+since $Y$ is binary and therefore $H(Y)\leq 1$, with equality for a uniform input distribution.
+
+When $p=0$ the capacity is $1$ (noiseless channel).
+
+When $p=\frac{1}{2}$ the capacity is $0$ (completely noisy channel).
+
+</details>
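+
+A minimal sketch of the resulting capacity $C=1-H(p)$ at a few crossover probabilities (the helper name `binary_entropy` is mine):
+
+```python
+from math import log2
+
+def binary_entropy(p):
+    """Binary entropy function H(p) in bits."""
+    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)
+
+# Capacity of a binary symmetric channel with crossover probability p.
+for p in (0.0, 0.1, 0.5):
+    print(p, 1 - binary_entropy(p))   # 1.0, ~0.531, 0.0
+```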