diff --git a/content/CSE510/CSE510_L11.md b/content/CSE510/CSE510_L11.md
new file mode 100644
index 0000000..dcf08ea
--- /dev/null
+++ b/content/CSE510/CSE510_L11.md
@@ -0,0 +1,300 @@
# CSE510 Deep Reinforcement Learning (Lecture 11)

> Materials Used
>
> - Much of the material and slides for this lecture were taken from Chapter 13 of the Barto & Sutton textbook.
>
> - Some slides are borrowed from Rich Sutton's RL class and David Silver's Deep RL tutorial.

Problem: often the feature-based policies that work well (win games, maximize utilities) aren't the ones that approximate $V$ / $Q$ best.

- Q-learning's priority: get Q-values close (modeling)
- Action selection's priority: get the ordering of Q-values right (prediction)

Value functions can often be much more complex to represent than the corresponding policy.

- Do we really care about knowing $Q(s, \text{left}) = 0.3554$ and $Q(s, \text{right}) = 0.533$? Or just that "right is better than left in state $s$"?

This motivates searching directly in a parameterized policy space:

- bypass learning a value function and "directly" optimize the value of a policy.
Examples

Rock-Paper-Scissors

- Two-player game of rock-paper-scissors
  - Scissors beats paper
  - Rock beats scissors
  - Paper beats rock
- Consider policies for iterated rock-paper-scissors
  - A deterministic policy is easily exploited
  - A uniform random policy is optimal (i.e., a Nash equilibrium)

---

Partially Observable GridWorld

![Partially Observable GridWorld](https://notenextra.trance-0.com/CSE510/Partial_Observable_GridWorld.png)

The agent cannot differentiate between the grey states.

Consider features of the following form (for all $N,E,S,W$ actions):

$$
\phi(s,a)=\mathbf{1}(\text{wall to }N,\ a=\text{move }E)
$$

Compare value-based RL, using an approximate value function

$$
Q_\theta(s,a) = f(\phi(s,a),\theta)
$$

to policy-based RL, using a parameterized policy

$$
\pi_\theta(s,a) = g(\phi(s,a),\theta)
$$

Under aliasing, an optimal deterministic policy will either

- move $W$ in both grey states (shown by red arrows), or
- move $E$ in both grey states

Either way, it can get stuck and _never_ reach the money.

- Value-based RL learns a near-deterministic policy
  - e.g. greedy or $\epsilon$-greedy

So it will traverse the corridor for a long time.

An optimal **stochastic** policy will randomly move $E$ or $W$ in the grey cells:

$$
\pi_\theta(\text{wall to }N\text{ and }S, \text{move }E) = 0.5\\
\pi_\theta(\text{wall to }N\text{ and }S, \text{move }W) = 0.5
$$

It will reach the goal state in a few steps with high probability.

Policy-based RL can learn the optimal stochastic policy (a toy simulation of this corridor follows below).
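
To make the aliasing argument concrete, here is a minimal Python sketch. It is my own toy abstraction of the corridor above, not the exact gridworld in the figure: the cell indices, the greedy behaviour in the corner cells, and the 100-step cap are assumptions for illustration. It compares a deterministic feature-based policy, which must choose the same direction in both aliased grey cells, against the 50/50 stochastic policy.

```python
import random

# Toy abstraction of the aliased corridor (an assumption for illustration):
# top-row cells 0..4, the money lies below cell 2, and cells 1 and 3 are
# aliased (their features are identical, so the policy cannot tell them apart).
ALIASED = {1, 3}
GOAL = 2
MAX_STEPS = 100


def run_episode(start, grey_action):
    """grey_action() returns 'E' or 'W' for an aliased cell.
    Returns the number of steps taken, or None if the step cap is hit."""
    pos = start
    for t in range(MAX_STEPS):
        if pos == GOAL:
            return t            # move S from here to collect the money
        if pos == 0:
            pos += 1            # corner cells are distinguishable: head towards the centre
        elif pos == 4:
            pos -= 1
        else:                   # aliased grey cell: same decision in cells 1 and 3
            pos += 1 if grey_action() == 'E' else -1
    return None


deterministic = lambda: 'W'                       # one fixed action for both grey cells
stochastic = lambda: random.choice(['E', 'W'])    # optimal 50/50 stochastic policy

for name, policy in [("deterministic W", deterministic), ("stochastic 50/50", stochastic)]:
    results = [run_episode(start, policy) for start in (0, 1, 3, 4) for _ in range(1000)]
    fails = sum(r is None for r in results)
    solved = [r for r in results if r is not None]
    avg = sum(solved) / len(solved) if solved else float('inf')
    print(f"{name:>16}: failed {fails}/{len(results)} episodes, avg steps when solved = {avg:.1f}")
```

Starting on the left half, the deterministic policy bounces between the corner and the adjacent grey cell forever, while the stochastic policy reaches the money in a handful of steps on average.
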
## RL via Policy Gradient Ascent

The policy gradient approach has the following schema:

1. Select a space of parameterized policies (i.e., a function class)
2. Compute the gradient of the value of the current policy with respect to the parameters
3. Move the parameters in the direction of the gradient
4. Repeat these steps until we reach a local maximum

So we must answer the following questions:

- How should we represent and evaluate parameterized policies?
- How can we compute the gradient?

### Policy learning objective

Goal: given a policy $\pi_\theta(s,a)$ with parameters $\theta$, find the best $\theta$.

In episodic environments we can use the start value:

$$
J_1(\theta) = V^{\pi_\theta}(s_1)=\mathbb{E}_{\pi_\theta}[v_1]
$$

In continuing environments we can use the average value:

$$
J_{avV}(\theta) = \sum_{s\in S} d^{\pi_\theta}(s) V^{\pi_\theta}(s)
$$

Or the average reward per time-step:

$$
J_{avR}(\theta) = \sum_{s\in S} d^{\pi_\theta}(s) \sum_{a\in A} \pi_\theta(s,a) \mathcal{R}(s,a)
$$

Here $d^{\pi_\theta}(s)$ is the **stationary distribution** of the Markov chain induced by policy $\pi_\theta$.

### Policy optimization

Policy-based reinforcement learning is an **optimization** problem:

Find the $\theta$ that maximises $J(\theta)$.

Some approaches do not use the gradient:

- Hill climbing
- Simplex / amoeba / Nelder-Mead
- Genetic algorithms

Greater efficiency is often possible using the gradient:

- Gradient descent
- Conjugate gradient
- Quasi-Newton

We focus on gradient descent (applied here as gradient ascent on $J(\theta)$); many extensions are possible.

We also focus on methods that exploit the sequential structure of the problem.

### Policy gradient

Let $J(\theta)$ be any policy objective function.

Policy gradient algorithms search for a _local_ maximum of $J(\theta)$ by ascending the gradient of $J(\theta)$ with respect to $\theta$:

$$
\Delta \theta = \alpha \nabla_\theta J(\theta)
$$

where $\nabla_\theta J(\theta)$ is the policy gradient

$$
\nabla_\theta J(\theta) = \begin{pmatrix}
\frac{\partial J(\theta)}{\partial \theta_1} \\
\frac{\partial J(\theta)}{\partial \theta_2} \\
\vdots \\
\frac{\partial J(\theta)}{\partial \theta_n}
\end{pmatrix}
$$

and $\alpha$ is the step-size parameter.

### Policy gradient methods

The main method we will introduce is the Monte-Carlo policy gradient.

#### Score Function

Assume the policy $\pi_\theta$ is differentiable wherever it is non-zero, and that we know the gradient $\nabla_\theta \pi_\theta(s,a)$ for all $s\in S$ and $a\in A$.

We can then compute the policy gradient analytically.

We use the **likelihood ratio** trick:

$$
\begin{aligned}
\nabla_\theta \pi_\theta(s,a) &= \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} \\
&= \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a)
\end{aligned}
$$

The **score function** is:

$$
\nabla_\theta \log \pi_\theta(s,a)
$$
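
As a quick sanity check of this identity, here is a minimal sketch. The two-action policy $\pi_\theta(a=1)=\mathrm{sigmoid}(\theta)$ with a single scalar parameter is my own hypothetical example, not from the lecture; it compares both sides of the likelihood-ratio identity using finite differences.

```python
import math

# Numerical check of the likelihood-ratio identity
#   d/dθ π_θ(a) = π_θ(a) · d/dθ log π_θ(a)
# for a hypothetical two-action policy π_θ(a=1) = sigmoid(θ) (toy assumption).

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pi(theta, a):
    p1 = sigmoid(theta)
    return p1 if a == 1 else 1.0 - p1

theta, eps = 0.7, 1e-6
for a in (0, 1):
    # left-hand side: finite-difference derivative of π itself
    dpi = (pi(theta + eps, a) - pi(theta - eps, a)) / (2 * eps)
    # right-hand side: π times the finite-difference derivative of log π
    dlogpi = (math.log(pi(theta + eps, a)) - math.log(pi(theta - eps, a))) / (2 * eps)
    print(f"a={a}:  dπ/dθ = {dpi:.6f}   π·dlogπ/dθ = {pi(theta, a) * dlogpi:.6f}")
```

Both columns agree up to numerical error, which is all the likelihood-ratio trick claims: the gradient of the probability equals the probability times the gradient of its log.
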
Example

Take the softmax policy as an example.

Weight actions using a linear combination of features $\phi(s,a)^T\theta$.

The probability of an action is proportional to the exponentiated weight:

$$
\pi_\theta(s,a) \propto \exp(\phi(s,a)^T\theta)
$$

The score function is

$$
\begin{aligned}
\nabla_\theta \log \pi_\theta(s,a) &= \nabla_\theta \log\left[\frac{\exp(\phi(s,a)^T\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^T\theta)}\right] \\
&= \nabla_\theta\left(\phi(s,a)^T\theta - \log \sum_{a'\in A}\exp(\phi(s,a')^T\theta)\right) \\
&= \phi(s,a) - \frac{\sum_{a'\in A}\exp(\phi(s,a')^T\theta)\,\phi(s,a')}{\sum_{a'\in A}\exp(\phi(s,a')^T\theta)} \\
&= \phi(s,a) - \sum_{a'\in A} \pi_\theta(s,a')\,\phi(s,a') \\
&= \phi(s,a) - \mathbb{E}_{a'\sim \pi_\theta(s,\cdot)}[\phi(s,a')]
\end{aligned}
$$

---

In continuous action spaces, a Gaussian policy is natural.

The mean is a linear combination of state features: $\mu(s) = \phi(s)^T\theta$.

The variance may be fixed at $\sigma^2$, or it can also be parameterized.

The policy is Gaussian: $a \sim N(\mu(s), \sigma^2)$.

The score function is

$$
\nabla_\theta \log \pi_\theta(s,a) = \frac{(a - \mu(s)) \phi(s)}{\sigma^2}
$$
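
The softmax derivation above is easy to verify numerically. Below is a minimal sketch; the 3-action, 4-feature setup, the random seed, and the feature matrix are my own arbitrary assumptions. It compares the closed-form score $\phi(s,a) - \mathbb{E}_{a'\sim\pi_\theta}[\phi(s,a')]$ against a finite-difference gradient of $\log \pi_\theta(s,a)$.

```python
import numpy as np

# Check the softmax score function derived above:
#   ∇_θ log π_θ(s,a) = φ(s,a) − E_{a'~π_θ}[φ(s,a')]
# The feature matrix is an arbitrary assumption: 3 actions, 4 features.
rng = np.random.default_rng(0)
phi = rng.normal(size=(3, 4))      # phi[a] = φ(s,a) for a fixed state s
theta = rng.normal(size=4)

def log_pi(theta, a):
    logits = phi @ theta
    return logits[a] - np.log(np.sum(np.exp(logits)))

def analytic_score(theta, a):
    probs = np.exp(phi @ theta)
    probs /= probs.sum()
    return phi[a] - probs @ phi     # φ(s,a) − Σ_a' π(a') φ(s,a')

def numeric_score(theta, a, eps=1e-6):
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (log_pi(theta + e, a) - log_pi(theta - e, a)) / (2 * eps)
    return grad

a = 1
print("analytic:", analytic_score(theta, a))
print("numeric :", numeric_score(theta, a))
```

The two printed rows should agree to numerical precision; the Gaussian score above can be checked the same way by differentiating $\log N(a;\mu(s),\sigma^2)$ with respect to $\theta$.
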
#### Policy Gradient Theorem

For any _differentiable_ policy $\pi_\theta(s,a)$,

and for any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma}J_{avV}$, the policy gradient is:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)]
$$
Proof

Take $\phi(s)=\sum_{a\in A} \nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a)$ to simplify the notation.

$$
\begin{aligned}
\nabla_\theta V^{\pi}(s)&=\nabla_\theta \left(\sum_{a\in A} \pi_\theta(a|s)Q^{\pi}(s,a)\right) \\
&=\sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) + \pi_\theta(a|s) \nabla_\theta Q^{\pi}(s,a)\right) \quad\text{by the product rule}\\
&=\sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) + \pi_\theta(a|s) \nabla_\theta \sum_{s',r\in S\times R} P(s',r|s,a) \left(r+V^{\pi}(s')\right)\right) \quad\text{rewrite the Q-function as the expected reward plus next-state value} \\
&=\sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) + \pi_\theta(a|s) \sum_{s',r\in S\times R} P(s',r|s,a) \nabla_\theta V^{\pi}(s')\right) \quad\text{$r$ and $P$ do not depend on $\theta$}\\
&=\phi(s)+\sum_{a\in A} \pi_\theta(a|s) \sum_{s'\in S} P(s'|s,a) \nabla_\theta V^{\pi}(s') \\
&=\phi(s)+\sum_{s'\in S} \sum_{a\in A} \pi_\theta(a|s) P(s'|s,a) \nabla_\theta V^{\pi}(s') \\
&=\phi(s)+\sum_{s'\in S} \rho(s\to s',1)\nabla_\theta V^{\pi}(s') \quad\text{notice the recurrence relation}\\
&=\phi(s)+\sum_{s'\in S} \rho(s\to s',1)\left[\phi(s')+\sum_{s''\in S} \rho(s'\to s'',1)\nabla_\theta V^{\pi}(s'')\right] \\
&=\phi(s)+\left[\sum_{s'\in S} \rho(s\to s',1)\phi(s')\right]+\left[\sum_{s''\in S} \rho(s\to s'',2)\nabla_\theta V^{\pi}(s'')\right] \\
&=\cdots\\
&=\sum_{x\in S}\sum_{k=0}^\infty \rho(s\to x,k)\phi(x)
\end{aligned}
$$

Here $\rho(s\to x,k)$ denotes the probability of reaching state $x$ from state $s$ in exactly $k$ steps while following $\pi_\theta$; for example, $\rho(s\to s,0)=1$ and $\rho(s\to s',1)=\sum_{a\in A} \pi_\theta(a|s) P(s'|s,a)$.

Let $\eta(s)=\sum_{k=0}^\infty \rho(s_0\to s,k)$ be the expected number of visits to state $s$ when starting from state $s_0$.

Note that $\sum_{s\in S} \eta(s)$ is a constant that depends only on the initial state $s_0$ and the policy $\pi_\theta$.

So $d^{\pi_\theta}(s)=\frac{\eta(s)}{\sum_{s'\in S} \eta(s')}$ is the normalized on-policy (stationary) state distribution of the Markov chain for policy $\pi_\theta$.

$$
\begin{aligned}
\nabla_\theta J(\theta)&=\nabla_\theta V^{\pi}(s_0)\\
&=\sum_{s\in S} \sum_{k=0}^\infty \rho(s_0\to s,k)\phi(s)\\
&=\sum_{s\in S} \eta(s)\phi(s)\\
&=\left(\sum_{s'\in S} \eta(s')\right)\sum_{s\in S} \frac{\eta(s)}{\sum_{s'\in S} \eta(s')}\phi(s)\\
&\propto \sum_{s\in S} \frac{\eta(s)}{\sum_{s'\in S} \eta(s')}\phi(s)\\
&=\sum_{s\in S} d^{\pi_\theta}(s)\sum_{a\in A} \nabla_\theta \pi_\theta(a|s)Q^{\pi_\theta}(s,a)\\
&=\sum_{s\in S} d^{\pi_\theta}(s)\sum_{a\in A} \pi_\theta(a|s)\,\nabla_\theta \log \pi_\theta(a|s)\,Q^{\pi_\theta}(s,a) \quad\text{by the likelihood ratio trick}\\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)]
\end{aligned}
$$
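
Before moving on, here is a minimal sketch of how the theorem is used in practice: the expectation is replaced by samples of $\nabla_\theta \log \pi_\theta(s,a)$ times an estimate of $Q^{\pi_\theta}(s,a)$, and the parameters follow that stochastic gradient. The setting (a two-armed bandit with Bernoulli rewards, the reward probabilities, learning rate, and iteration count) is my own toy assumption, not from the lecture; it anticipates the Monte-Carlo policy gradient of the next section.

```python
import numpy as np

# Toy two-armed bandit (a single-state MDP) with a softmax policy over logits.
# Arm reward probabilities are arbitrary assumptions for illustration.
rng = np.random.default_rng(0)
p_reward = np.array([0.2, 0.8])   # arm 1 is the better arm
theta = np.zeros(2)               # one logit per action
alpha = 0.1

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

for t in range(2000):
    probs = policy(theta)
    a = rng.choice(2, p=probs)
    r = float(rng.random() < p_reward[a])   # sampled reward, a noisy estimate of Q(s,a)
    score = -probs                          # ∇_θ log π_θ(a) for a tabular softmax ...
    score[a] += 1.0                         # ... equals e_a − π_θ
    theta += alpha * score * r              # stochastic gradient ascent on J(θ)

print("final action probabilities:", policy(theta))
```

The probability of the better arm approaches 1: this is exactly the expectation from the theorem, estimated one sample at a time.
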
#### Monte-Carlo Policy Gradient

We can use the score function to compute the policy gradient: using the sampled return $v_t$ as an unbiased estimate of $Q^{\pi_\theta}(s_t,a_t)$, update the parameters by stochastic gradient ascent, $\Delta\theta_t = \alpha \nabla_\theta \log \pi_\theta(s_t,a_t)\, v_t$.

## Actor-Critic methods

### Q Actor-Critic

### Advantage Actor-Critic
\ No newline at end of file
diff --git a/public/CSE510/Partial_Observable_Gridworld.png b/public/CSE510/Partial_Observable_Gridworld.png
new file mode 100644
index 0000000..5c1c425
Binary files /dev/null and b/public/CSE510/Partial_Observable_Gridworld.png differ