diff --git a/content/CSE510/CSE510_L16.md b/content/CSE510/CSE510_L16.md
index b920462..2584b61 100644
--- a/content/CSE510/CSE510_L16.md
+++ b/content/CSE510/CSE510_L16.md
@@ -148,3 +148,30 @@
 $$
 y_1 = r + \gamma \min_{i=1,2} Q^{\theta_i'}(s', \pi_{\phi_i}(s'))
 $$
+High-variance estimates provide a noisy gradient.
+
+Techniques used in TD3 to reduce the variance:
+
+- Update the policy at a lower frequency than the value network (delayed policy updates).
+- Smooth the value estimate (target policy smoothing):
+  $$
+  y=r+\gamma \mathbb{E}_{\epsilon}[Q^{\theta'}(s', \pi_{\phi'}(s')+\epsilon)]
+  $$
+
+Update target:
+
+$$
+y=r+\gamma \mathbb{E}_{\epsilon}[Q^{\theta'}(s', \pi_{\phi'}(s')+\epsilon)]
+$$
+
+where $\epsilon\sim \operatorname{clip}(\mathcal{N}(0, \sigma), -c, c)$ is clipped Gaussian noise.
+
+#### Other methods
+
+- Generalizable Episode Memory for Deep Reinforcement Learning
+- Distributed Distributional Deep Deterministic Policy Gradient (D4PG)
+  - Distributional critic
+  - N-step returns are used to update the critic
+  - Multiple distributed parallel actors
+  - Prioritized experience replay
+- 
\ No newline at end of file
diff --git a/content/CSE5313/CSE5313_L15.md b/content/CSE5313/CSE5313_L15.md
index 20d67af..cf050e2 100644
--- a/content/CSE5313/CSE5313_L15.md
+++ b/content/CSE5313/CSE5313_L15.md
@@ -1 +1,327 @@
-# CSE5313 Coding and information theory for data science (Lecture 15)
\ No newline at end of file
+# CSE5313 Coding and information theory for data science (Lecture 15)
+
+## Information theory
+
+Information theory concerns the transmission, processing, extraction, and utilization of information.
+
+- Information: "Resolution" of uncertainty.
+- Question in the 1940s: How to quantify the complexity of information?
+  - Questions:
+    - How many bits are required to **describe** an information source?
+    - How many bits are required to **transmit** an information source?
+    - How much information does one source **reveal** about another?
+  - Applications:
+    - Data Compression.
+    - Channel Coding.
+    - Privacy.
+- Claude Shannon, 1948: Information Entropy.
+
+### Entropy
+
+The "information value" of a message depends on how surprising it is.
+
+- The more unlikely the event, the more informative the message.
+
+The **Shannon information** of an event $E$ is:
+
+$$
+I(E)=\log_2 \frac{1}{Pr(E)}=-\log_2 Pr(E)
+$$
+
+Entropy is the expected amount of information in a random trial.
+
+- Rolling a die has more entropy than flipping a coin (6 states vs. 2 states).
+
+#### Information entropy
+
+Let $X$ be a random variable with values in some finite set $\mathcal{X}$.
+
+The entropy $H(X)$ of $X$ is defined as:
+
+$$
+H(X)=\sum_{x\in \mathcal{X}} Pr(X=x)\log_2 \frac{1}{Pr(X=x)}=-\mathbb{E}_{x\sim X}[\log_2 Pr(X=x)]
+$$
+
+Idea: How many bits are required, on average, to describe $X$?
+
+- The more unlikely the event, the more informative the message.
+- Use "few" bits for common $x$ (i.e., $Pr(x)$ large).
+- Use "many" bits for rare $x$ (i.e., $Pr(x)$ small).
+
+Notes:
+
+- $H(X)=\mathbb{E}_{x\sim X}[I(X=x)]$
+- $H(X)\geq 0$
+- $H(X)=0$ if and only if $Pr(X=x)=1$ for some $x\in \mathcal{X}$.
+- $H(X)$ does not depend on the labels in $\mathcal{X}$, only on the probability distribution $Pr(X=x)$.
+- Maximum entropy is achieved by the uniform distribution: $H(Uniform(n))=\log_2 n$, where $n$ is the number of outcomes ($a$ bits are required to describe each outcome of a uniform source with $2^a$ outcomes).
+
+<details>
+<summary>Example</summary>
+
+For the uniform distribution $X\sim Uniform\{0,1\}$, we have $H(X)=\log_2 2=1$.
+
+---
+
+For the Bernoulli distribution $X\sim Bernoulli(p)$, we have $H(X)=-p\log_2 p-(1-p)\log_2 (1-p)=H(p)$.
+
+</details>
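+
+As a quick numerical sanity check of the definitions above, here is a minimal sketch (the `entropy` helper and the sample values are illustrative, not from the lecture):
+
+```python
+from math import log2
+
+def entropy(probs):
+    """Shannon entropy H(X) in bits of a finite distribution."""
+    return -sum(p * log2(p) for p in probs if p > 0)
+
+print(entropy([1/2, 1/2]))   # 1.0 bit: uniform over {0, 1}
+print(entropy([0.9, 0.1]))   # ~0.469 bits: Bernoulli(0.9), i.e. H(0.9)
+print(entropy([1/6] * 6))    # ~2.585 bits: a fair die, log2(6), more than a coin
+```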
+
+### Motivation
+
+Optimal compression:
+
+Consider a random variable $X$ with the distribution and encoding given below:
+
+|Value|Probability|Encoding|
+|-----|-----------|--------|
+|1    |1/8        |000     |
+|2    |1/8        |001     |
+|3    |1/4        |01      |
+|4    |1/2        |1       |
+
+The average length of the encoding is $1/8\times 3+1/8\times 3+1/4\times 2+1/2\times 1=7/4$.
+
+And the entropy of $X$ is $H(X)=1/8\times \log_2 8+1/8\times \log_2 8+1/4\times \log_2 4+1/2\times \log_2 2=7/4$.
+
+So the average length of the encoding is equal to the entropy of $X$.
+
+This is the optimal compression, and it is exactly the encoding produced by Huffman coding.
+
+#### A few extra theorems that will not be proved in this course
+
+- **Theorem**: Avg. # of bits in any prefix-free compression of $X$ is ≥ $H(X)$.
+- **Theorem**: Huffman coding is optimal.
+  - I.e., its avg. # of bits is minimal among prefix-free codes (and equals $H(X)$ when all probabilities are powers of $2$, as above).
+- **Disadvantage of Huffman coding**: the distribution must be known.
+  - This is generally not the case.
+- **[Lempel-Ziv 1978]**: Universal lossless data compression.
+
+### Conditional and joint entropy
+
+How does the entropy of different random variables interact?
+
+- As a function of their dependence.
+- Needed in order to relate:
+  - A variable and its compression.
+  - A variable and its transmission over a noisy channel.
+  - A variable and its encryption.
+
+#### Definition of joint entropy
+
+Let $X$ and $Y$ be discrete random variables.
+
+The joint entropy $H(X,Y)$ of $X$ and $Y$ is defined as:
+
+$$
+H(X,Y)=-\sum_{x\in \mathcal{X}, y\in \mathcal{Y}} Pr(X=x, Y=y) \log_2 Pr(X=x, Y=y)
+$$
+
+Notes:
+
+- $H(X,Y)\geq 0$
+- $H(X,Y)=H(Y,X)$
+- $H(X,Y)\geq \max\{H(X),H(Y)\}$
+- $H(X,Y)\leq H(X)+H(Y)$, with equality if and only if $X$ and $Y$ are independent.
+
+#### Conditional entropy
+
+> [!NOTE]
+>
+> Recall that the conditional probability $P(Y|X)$ is defined as: $P(Y|X)=\frac{P(Y,X)}{P(X)}$.
+
+What is the average amount of uncertainty remaining in $Y$ when $X$ is known?
+
+For each given $x\in \mathcal{X}$, the conditional entropy $H(Y|X=x)$ is defined as:
+
+$$
+\begin{aligned}
+H(Y|X=x)&=\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 \frac{1}{Pr(Y=y|X=x)} \\
+&=-\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 Pr(Y=y|X=x)
+\end{aligned}
+$$
+
+The conditional entropy $H(Y|X)$ is defined as:
+
+$$
+\begin{aligned}
+H(Y|X)&=\mathbb{E}_{x\sim X}[H(Y|X=x)] \\
+&=\sum_{x\in \mathcal{X}} Pr(X=x)H(Y|X=x) \\
+&=-\sum_{x\in \mathcal{X}} Pr(X=x)\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 Pr(Y=y|X=x) \\
+&=-\sum_{x\in \mathcal{X}, y\in \mathcal{Y}} Pr(X=x, Y=y) \log_2 Pr(Y=y|X=x)
+\end{aligned}
+$$
+
+Notes:
+
+- $H(X|X)=0$
+- $H(Y|X)=0$ when $Y$ is a function of $X$.
+- Conditional entropy is not necessarily symmetric: in general, $H(X|Y)\neq H(Y|X)$.
+- Conditioning never increases entropy: $H(X|Y)\leq H(X)$.
+
+#### Chain rules of entropy
+
+Joint entropy: $H(X,Y)=\mathbb{E}_{(x,y)\sim (X,Y)}\left[\log_2 \frac{1}{Pr(X=x, Y=y)}\right]$.
+
+Conditional entropy: $H(Y|X)=\mathbb{E}_{(x,y)\sim (X,Y)}\left[\log_2 \frac{1}{Pr(Y=y|X=x)}\right]$.
+
+Chain rule: $H(X,Y)=H(X)+H(Y|X)$.
+
+<details>
+<summary>Proof</summary>
+
+Write $\log_2 \frac{1}{Pr(x,y)}=\log_2 \frac{1}{Pr(x)}+\log_2 \frac{1}{Pr(y|x)}$ and take expectations over the joint distribution; this gives $H(X,Y)=H(X)+H(Y|X)$.
+
+For **statistically independent** $X$ and $Y$, we have $Pr(x,y)=Pr(x)Pr(y)$, and therefore
+
+$Pr(y|x)=\frac{Pr(x,y)}{Pr(x)}=\frac{Pr(x)Pr(y)}{Pr(x)}=Pr(y)$,
+
+so $H(Y|X)=H(Y)$ and the chain rule reduces to $H(X,Y)=H(X)+H(Y)$.
+
+_Applying the symmetry of the joint distribution gives the other form, $H(X,Y)=H(Y)+H(X|Y)$._
+
+</details>
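+
+A short numerical check of the chain rule (a sketch; the joint distribution below is a made-up example, not from the lecture):
+
+```python
+from math import log2
+
+def H(probs):
+    """Entropy in bits; ignores zero-probability entries."""
+    return -sum(p * log2(p) for p in probs if p > 0)
+
+# Hypothetical joint distribution Pr(X=x, Y=y); rows are x, columns are y.
+joint = [[1/4, 1/4],
+         [1/2, 0.0]]
+
+px = [sum(row) for row in joint]                      # marginal of X
+H_XY = H([p for row in joint for p in row])           # joint entropy H(X,Y)
+H_Y_given_X = sum(px[i] * H([p / px[i] for p in joint[i]])
+                  for i in range(len(joint)))         # H(Y|X) = E_x[H(Y|X=x)]
+
+print(H_XY)                  # 1.5
+print(H(px) + H_Y_given_X)   # 1.5, i.e. H(X,Y) = H(X) + H(Y|X)
+```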
+
+<details>
+<summary>Example of computing the conditional entropy</summary>
+
+For the given distribution:
+
+|Y\X|1|2|3|4|Y marginal|
+|-----|-----|-----|-----|-----|-----|
+|1 |1/8 |1/16 |1/32 |1/32 |1/4 |
+|2 |1/16 |1/8 |1/32 |1/32 |1/4 |
+|3 |1/16 |1/16 |1/16 |1/16 |1/4 |
+|4 |1/4 |0 |0 |0 |1/4 |
+|X marginal|1/2 |1/4 |1/8 |1/8 |1.0 |
+
+Here $H(X)=H((1/2,1/4,1/8,1/8))=-(1/2\log_2 1/2+1/4\log_2 1/4+1/8\log_2 1/8+1/8\log_2 1/8)=7/4$.
+
+$H(Y)=H((1/4,1/4,1/4,1/4))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/4\log_2 1/4+1/4\log_2 1/4)=2$.
+
+$H(Y|X=1)=H((1/4,1/8,1/8,1/2))=-(1/4\log_2 1/4+1/8\log_2 1/8+1/8\log_2 1/8+1/2\log_2 1/2)=7/4$.
+
+$H(Y|X=2)=H((1/4,1/2,1/4))=-(1/4\log_2 1/4+1/2\log_2 1/2+1/4\log_2 1/4)=1.5$.
+
+$H(Y|X=3)=H((1/4,1/4,1/2))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/2\log_2 1/2)=1.5$.
+
+$H(Y|X=4)=H((1/4,1/4,1/2))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/2\log_2 1/2)=1.5$.
+
+So $H(Y|X)=\sum_{x\in \mathcal{X}} Pr(x)H(Y|X=x)=1/2\times 7/4+1/4\times 1.5+1/8\times 1.5+1/8\times 1.5=13/8$.
+
+</details>
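+
+The same computation as a short sketch (helper names are mine), reproducing $H(Y|X)=13/8$ from the joint table:
+
+```python
+from math import log2
+
+def H(probs):
+    """Entropy in bits; ignores zero-probability entries."""
+    return -sum(p * log2(p) for p in probs if p > 0)
+
+# Joint distribution Pr(X=x, Y=y) from the table above; rows are Y, columns are X.
+joint = [
+    [1/8,  1/16, 1/32, 1/32],
+    [1/16, 1/8,  1/32, 1/32],
+    [1/16, 1/16, 1/16, 1/16],
+    [1/4,  0,    0,    0],
+]
+
+px = [sum(joint[y][x] for y in range(4)) for x in range(4)]  # (1/2, 1/4, 1/8, 1/8)
+
+# H(Y|X) = sum_x Pr(X=x) * H(Y | X=x)
+H_Y_given_X = sum(px[x] * H([joint[y][x] / px[x] for y in range(4)])
+                  for x in range(4))
+
+print(H(px))          # 1.75  = H(X) = 7/4
+print(H_Y_given_X)    # 1.625 = H(Y|X) = 13/8
+```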
+
+### Mutual information
+
+#### Definition of mutual information
+
+The mutual information $I(X;Y)$ of $X$ and $Y$ is defined as:
+
+$$
+I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)
+$$
+
+#### Properties of mutual information
+
+- $I(X;Y)\geq 0$
+  - Proved via Jensen's inequality.
+  - Equivalently: conditioning never increases entropy.
+- $I(X;Y)=I(Y;X)$ (symmetry).
+- $I(X;X)=H(X)-H(X|X)=H(X)$
+- $I(X;Y)=H(X)+H(Y)-H(X,Y)$ (see the short derivation below).
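+
+The last identity follows from the definition together with the chain rule $H(X,Y)=H(Y)+H(X|Y)$:
+
+$$
+\begin{aligned}
+I(X;Y)&=H(X)-H(X|Y)\\
+&=H(X)-\left(H(X,Y)-H(Y)\right)\\
+&=H(X)+H(Y)-H(X,Y)
+\end{aligned}
+$$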
+
+<details>
+<summary>Example of computing the mutual information</summary>
+
+For the given distribution:
+
+|Y\X|1|2|3|4|Y marginal|
+|-----|-----|-----|-----|-----|-----|
+|1 |1/8 |1/16 |1/32 |1/32 |1/4 |
+|2 |1/16 |1/8 |1/32 |1/32 |1/4 |
+|3 |1/16 |1/16 |1/16 |1/16 |1/4 |
+|4 |1/4 |0 |0 |0 |1/4 |
+|X marginal|1/2 |1/4 |1/8 |1/8 |1.0 |
+
+Recall from the previous example:
+
+- $H(Y|X)=\frac{13}{8}$
+- $H(Y)=2$
+
+Then the mutual information is:
+
+$$
+I(X;Y)=H(Y)-H(Y|X)=2-\frac{13}{8}=\frac{3}{8}
+$$
+
+So the entropy of $Y$ is reduced by $\frac{3}{8}$ bits on average when $X$ is known.
+
+</details>
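+
+A sketch checking the same value through the identity $I(X;Y)=H(X)+H(Y)-H(X,Y)$ (helper names are mine):
+
+```python
+from math import log2
+
+def H(probs):
+    """Entropy in bits; ignores zero-probability entries."""
+    return -sum(p * log2(p) for p in probs if p > 0)
+
+# Joint distribution Pr(X=x, Y=y) from the table above; rows are Y, columns are X.
+joint = [
+    [1/8,  1/16, 1/32, 1/32],
+    [1/16, 1/8,  1/32, 1/32],
+    [1/16, 1/16, 1/16, 1/16],
+    [1/4,  0,    0,    0],
+]
+
+px = [sum(joint[y][x] for y in range(4)) for x in range(4)]  # marginal of X
+py = [sum(row) for row in joint]                             # marginal of Y
+H_XY = H([p for row in joint for p in row])                  # joint entropy
+
+print(H(px) + H(py) - H_XY)   # 0.375 = 3/8
+```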
+
+## Applications
+
+### Channel capacity
+
+Recall that a channel is a tuple $(F,\Phi,\operatorname{Pr})$.
+
+#### Definition of discrete channel
+
+A discrete channel is a system consisting of a discrete input alphabet $\mathcal{X}$, a discrete output alphabet $\mathcal{Y}$, and a probability transition function $\operatorname{Pr}(y|x)$.
+
+The channel is _memoryless_ if the output at any time depends only on the input at that time and not on the inputs and outputs at other times.
+
+#### Definition of channel capacity
+
+The channel capacity $C$ of a channel is defined as:
+
+$$
+C=\max_{p(x)} I(X;Y)
+$$
+
+where $I(X;Y)$ is the mutual information between the input $X$ and the output $Y$, and the maximum is taken over all input distributions $p(x)$.
+
+#### Shannon's noisy coding theorem
+
+Recall that the rate of a code $\mathcal{C}$ of length $n$ is $R=\frac{\log_{|F|}|\mathcal{C}|}{n}$.
+
+For a discrete memoryless channel with capacity $C$, every rate $R < C$ is achievable: there exist codes of rate $R$ whose probability of decoding error tends to $0$ as $n\to\infty$. Conversely, no rate above $C$ is achievable.
+
+For example, for the binary symmetric channel, which flips each transmitted bit independently with crossover probability $p$, the capacity is $C=1-H(p)$.
+
+<details>
+<summary>Proof</summary>
+
+$$
+\begin{aligned}
+H(Y|X=x)&=-\operatorname{Pr}(y=0|x)\log_2 \operatorname{Pr}(y=0|x)-\operatorname{Pr}(y=1|x)\log_2 \operatorname{Pr}(y=1|x) \\
+&=-p\log_2 p-(1-p)\log_2 (1-p)\\
+&=H(p)
+\end{aligned}
+$$
+
+So,
+
+$$
+\begin{aligned}
+H(Y|X)&=\sum_{x\in \mathcal{X}} \operatorname{Pr}(x)H(Y|X=x)\\
+&=H(p)\sum_{x\in \mathcal{X}} \operatorname{Pr}(x) \\
+&=H(p)
+\end{aligned}
+$$
+
+So,
+
+$$
+\begin{aligned}
+I(X;Y)&=H(Y)-H(Y|X)\\
+&=H(Y)-H(p)\\
+&\leq 1-H(p)
+\end{aligned}
+$$
+
+since $Y$ is binary and therefore $H(Y)\leq 1$, with equality for a uniform input distribution.
+
+When $p=0$ the capacity is $1$ (noiseless channel).
+
+When $p=\frac{1}{2}$ the capacity is $0$ (completely noisy channel).
+
+</details>
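+
+A minimal sketch of the resulting capacity $C=1-H(p)$ at a few crossover probabilities (the helper name `binary_entropy` is mine):
+
+```python
+from math import log2
+
+def binary_entropy(p):
+    """Binary entropy function H(p) in bits."""
+    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)
+
+# Capacity of a binary symmetric channel with crossover probability p.
+for p in (0.0, 0.1, 0.5):
+    print(p, 1 - binary_entropy(p))   # 1.0, ~0.531, 0.0
+```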