diff --git a/content/CSE510/CSE510_L11.md b/content/CSE510/CSE510_L11.md
new file mode 100644
index 0000000..dcf08ea
--- /dev/null
+++ b/content/CSE510/CSE510_L11.md
@@ -0,0 +1,300 @@
# CSE510 Deep Reinforcement Learning (Lecture 11)

> Materials Used
>
> - Much of the material and slides for this lecture were taken from Chapter 13 of the Barto & Sutton textbook.
>
> - Some slides are borrowed from Rich Sutton's RL class and David Silver's Deep RL tutorial.

Problem: often the feature-based policies that work well (win games, maximize utilities) aren't the ones that approximate $V$ / $Q$ best.

- Q-learning's priority: get Q-values close (modeling)
- Action selection's priority: get the ordering of Q-values right (prediction)

Value functions can often be much more complex to represent than the corresponding policy.

- Do we really care about knowing $Q(s, \text{left}) = 0.3554$ and $Q(s, \text{right}) = 0.533$? Or just that "right is better than left in state $s$"?

This motivates searching directly in a parameterized policy space:

- bypass learning a value function and "directly" optimize the value of a policy.
Examples

Rock-Paper-Scissors

- Two-player game of rock-paper-scissors
  - Scissors beats paper
  - Rock beats scissors
  - Paper beats rock
- Consider policies for iterated rock-paper-scissors
  - A deterministic policy is easily exploited
  - A uniform random policy is optimal (i.e., a Nash equilibrium)

---

Partially Observable GridWorld

![Partially Observable GridWorld](https://notenextra.trance-0.com/CSE510/Partial_Observable_GridWorld.png)

The agent cannot differentiate between the grey states.

Consider features of the following form (for all $N,E,S,W$ actions):

$$
\phi(s,a)=\mathbf{1}(\text{wall to }N,\ a=\text{move }E)
$$

Compare value-based RL, using an approximate value function

$$
Q_\theta(s,a) = f(\phi(s,a),\theta)
$$

to policy-based RL, using a parameterized policy

$$
\pi_\theta(s,a) = g(\phi(s,a),\theta)
$$

Under aliasing, an optimal deterministic policy will either

- move $W$ in both grey states (shown by red arrows), or
- move $E$ in both grey states

Either way, it can get stuck and _never_ reach the money.

- Value-based RL learns a near-deterministic policy
  - e.g. greedy or $\epsilon$-greedy

So it will traverse the corridor for a long time.

An optimal **stochastic** policy will randomly move $E$ or $W$ in the grey cells:

$$
\pi_\theta(\text{wall to }N\text{ and }S, \text{move }E) = 0.5\\
\pi_\theta(\text{wall to }N\text{ and }S, \text{move }W) = 0.5
$$

It will reach the goal state in a few steps with high probability.

Policy-based RL can learn the optimal stochastic policy (a toy simulation of this corridor follows below).
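
To make the aliasing argument concrete, here is a minimal Python sketch. It is my own toy abstraction of the corridor above, not the exact gridworld in the figure: the cell indices, the greedy behaviour in the corner cells, and the 100-step cap are assumptions for illustration. It compares a deterministic feature-based policy, which must choose the same direction in both aliased grey cells, against the 50/50 stochastic policy.

```python
import random

# Toy abstraction of the aliased corridor (an assumption for illustration):
# top-row cells 0..4, the money lies below cell 2, and cells 1 and 3 are
# aliased (their features are identical, so the policy cannot tell them apart).
ALIASED = {1, 3}
GOAL = 2
MAX_STEPS = 100


def run_episode(start, grey_action):
    """grey_action() returns 'E' or 'W' for an aliased cell.
    Returns the number of steps taken, or None if the step cap is hit."""
    pos = start
    for t in range(MAX_STEPS):
        if pos == GOAL:
            return t            # move S from here to collect the money
        if pos == 0:
            pos += 1            # corner cells are distinguishable: head towards the centre
        elif pos == 4:
            pos -= 1
        else:                   # aliased grey cell: same decision in cells 1 and 3
            pos += 1 if grey_action() == 'E' else -1
    return None


deterministic = lambda: 'W'                       # one fixed action for both grey cells
stochastic = lambda: random.choice(['E', 'W'])    # optimal 50/50 stochastic policy

for name, policy in [("deterministic W", deterministic), ("stochastic 50/50", stochastic)]:
    results = [run_episode(start, policy) for start in (0, 1, 3, 4) for _ in range(1000)]
    fails = sum(r is None for r in results)
    solved = [r for r in results if r is not None]
    avg = sum(solved) / len(solved) if solved else float('inf')
    print(f"{name:>16}: failed {fails}/{len(results)} episodes, avg steps when solved = {avg:.1f}")
```

Starting on the left half, the deterministic policy bounces between the corner and the adjacent grey cell forever, while the stochastic policy reaches the money in a handful of steps on average.
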
## RL via Policy Gradient Ascent

The policy gradient approach has the following schema:

1. Select a space of parameterized policies (i.e., a function class)
2. Compute the gradient of the value of the current policy with respect to the parameters
3. Move the parameters in the direction of the gradient
4. Repeat these steps until we reach a local maximum

So we must answer the following questions:

- How should we represent and evaluate parameterized policies?
- How can we compute the gradient?

### Policy learning objective

Goal: given a policy $\pi_\theta(s,a)$ with parameters $\theta$, find the best $\theta$.

In episodic environments we can use the start value:

$$
J_1(\theta) = V^{\pi_\theta}(s_1)=\mathbb{E}_{\pi_\theta}[v_1]
$$

In continuing environments we can use the average value:

$$
J_{avV}(\theta) = \sum_{s\in S} d^{\pi_\theta}(s) V^{\pi_\theta}(s)
$$

Or the average reward per time-step:

$$
J_{avR}(\theta) = \sum_{s\in S} d^{\pi_\theta}(s) \sum_{a\in A} \pi_\theta(s,a) \mathcal{R}(s,a)
$$

Here $d^{\pi_\theta}(s)$ is the **stationary distribution** of the Markov chain induced by policy $\pi_\theta$.

### Policy optimization

Policy-based reinforcement learning is an **optimization** problem:

Find the $\theta$ that maximises $J(\theta)$.

Some approaches do not use the gradient:

- Hill climbing
- Simplex / amoeba / Nelder-Mead
- Genetic algorithms

Greater efficiency is often possible using the gradient:

- Gradient descent
- Conjugate gradient
- Quasi-Newton

We focus on gradient descent (applied here as gradient ascent on $J(\theta)$); many extensions are possible.

We also focus on methods that exploit the sequential structure of the problem.

### Policy gradient

Let $J(\theta)$ be any policy objective function.

Policy gradient algorithms search for a _local_ maximum of $J(\theta)$ by ascending the gradient of $J(\theta)$ with respect to $\theta$:

$$
\Delta \theta = \alpha \nabla_\theta J(\theta)
$$

where $\nabla_\theta J(\theta)$ is the policy gradient

$$
\nabla_\theta J(\theta) = \begin{pmatrix}
\frac{\partial J(\theta)}{\partial \theta_1} \\
\frac{\partial J(\theta)}{\partial \theta_2} \\
\vdots \\
\frac{\partial J(\theta)}{\partial \theta_n}
\end{pmatrix}
$$

and $\alpha$ is the step-size parameter.

### Policy gradient methods

The main method we will introduce is the Monte-Carlo policy gradient.

#### Score Function

Assume the policy $\pi_\theta$ is differentiable wherever it is non-zero, and that we know the gradient $\nabla_\theta \pi_\theta(s,a)$ for all $s\in S$ and $a\in A$.

We can then compute the policy gradient analytically.

We use the **likelihood ratio** trick:

$$
\begin{aligned}
\nabla_\theta \pi_\theta(s,a) &= \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} \\
&= \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a)
\end{aligned}
$$

The **score function** is:

$$
\nabla_\theta \log \pi_\theta(s,a)
$$
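
As a quick sanity check of this identity, here is a minimal sketch. The two-action policy $\pi_\theta(a=1)=\mathrm{sigmoid}(\theta)$ with a single scalar parameter is my own hypothetical example, not from the lecture; it compares both sides of the likelihood-ratio identity using finite differences.

```python
import math

# Numerical check of the likelihood-ratio identity
#   d/dθ π_θ(a) = π_θ(a) · d/dθ log π_θ(a)
# for a hypothetical two-action policy π_θ(a=1) = sigmoid(θ) (toy assumption).

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pi(theta, a):
    p1 = sigmoid(theta)
    return p1 if a == 1 else 1.0 - p1

theta, eps = 0.7, 1e-6
for a in (0, 1):
    # left-hand side: finite-difference derivative of π itself
    dpi = (pi(theta + eps, a) - pi(theta - eps, a)) / (2 * eps)
    # right-hand side: π times the finite-difference derivative of log π
    dlogpi = (math.log(pi(theta + eps, a)) - math.log(pi(theta - eps, a))) / (2 * eps)
    print(f"a={a}:  dπ/dθ = {dpi:.6f}   π·dlogπ/dθ = {pi(theta, a) * dlogpi:.6f}")
```

Both columns agree up to numerical error, which is all the likelihood-ratio trick claims: the gradient of the probability equals the probability times the gradient of its log.
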
Example

Take the softmax policy as an example.

Weight actions using a linear combination of features $\phi(s,a)^T\theta$.

The probability of an action is proportional to the exponentiated weight:

$$
\pi_\theta(s,a) \propto \exp(\phi(s,a)^T\theta)
$$

The score function is

$$
\begin{aligned}
\nabla_\theta \log \pi_\theta(s,a) &= \nabla_\theta \log\left[\frac{\exp(\phi(s,a)^T\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^T\theta)}\right] \\
&= \nabla_\theta\left(\phi(s,a)^T\theta - \log \sum_{a'\in A}\exp(\phi(s,a')^T\theta)\right) \\
&= \phi(s,a) - \frac{\sum_{a'\in A}\exp(\phi(s,a')^T\theta)\,\phi(s,a')}{\sum_{a'\in A}\exp(\phi(s,a')^T\theta)} \\
&= \phi(s,a) - \sum_{a'\in A} \pi_\theta(s,a')\,\phi(s,a') \\
&= \phi(s,a) - \mathbb{E}_{a'\sim \pi_\theta(s,\cdot)}[\phi(s,a')]
\end{aligned}
$$

---

In continuous action spaces, a Gaussian policy is natural.

The mean is a linear combination of state features: $\mu(s) = \phi(s)^T\theta$.

The variance may be fixed at $\sigma^2$, or it can also be parameterized.

The policy is Gaussian: $a \sim N(\mu(s), \sigma^2)$.

The score function is

$$
\nabla_\theta \log \pi_\theta(s,a) = \frac{(a - \mu(s)) \phi(s)}{\sigma^2}
$$
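
The softmax derivation above is easy to verify numerically. Below is a minimal sketch; the 3-action, 4-feature setup, the random seed, and the feature matrix are my own arbitrary assumptions. It compares the closed-form score $\phi(s,a) - \mathbb{E}_{a'\sim\pi_\theta}[\phi(s,a')]$ against a finite-difference gradient of $\log \pi_\theta(s,a)$.

```python
import numpy as np

# Check the softmax score function derived above:
#   ∇_θ log π_θ(s,a) = φ(s,a) − E_{a'~π_θ}[φ(s,a')]
# The feature matrix is an arbitrary assumption: 3 actions, 4 features.
rng = np.random.default_rng(0)
phi = rng.normal(size=(3, 4))      # phi[a] = φ(s,a) for a fixed state s
theta = rng.normal(size=4)

def log_pi(theta, a):
    logits = phi @ theta
    return logits[a] - np.log(np.sum(np.exp(logits)))

def analytic_score(theta, a):
    probs = np.exp(phi @ theta)
    probs /= probs.sum()
    return phi[a] - probs @ phi     # φ(s,a) − Σ_a' π(a') φ(s,a')

def numeric_score(theta, a, eps=1e-6):
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (log_pi(theta + e, a) - log_pi(theta - e, a)) / (2 * eps)
    return grad

a = 1
print("analytic:", analytic_score(theta, a))
print("numeric :", numeric_score(theta, a))
```

The two printed rows should agree to numerical precision; the Gaussian score above can be checked the same way by differentiating $\log N(a;\mu(s),\sigma^2)$ with respect to $\theta$.
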
#### Policy Gradient Theorem

For any _differentiable_ policy $\pi_\theta(s,a)$,

and for any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma}J_{avV}$, the policy gradient is:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)]
$$
Proof

Take $\phi(s)=\sum_{a\in A} \nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a)$ to simplify the notation.

$$
\begin{aligned}
\nabla_\theta V^{\pi}(s)&=\nabla_\theta \left(\sum_{a\in A} \pi_\theta(a|s)Q^{\pi}(s,a)\right) \\
&=\sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) + \pi_\theta(a|s) \nabla_\theta Q^{\pi}(s,a)\right) \quad\text{by the product rule}\\
&=\sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) + \pi_\theta(a|s) \nabla_\theta \sum_{s',r\in S\times R} P(s',r|s,a) \left(r+V^{\pi}(s')\right)\right) \quad\text{rewrite the Q-function as the expected reward plus next-state value} \\
&=\sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) + \pi_\theta(a|s) \sum_{s',r\in S\times R} P(s',r|s,a) \nabla_\theta V^{\pi}(s')\right) \quad\text{$r$ and $P$ do not depend on $\theta$}\\
&=\phi(s)+\sum_{a\in A} \pi_\theta(a|s) \sum_{s'\in S} P(s'|s,a) \nabla_\theta V^{\pi}(s') \\
&=\phi(s)+\sum_{s'\in S} \sum_{a\in A} \pi_\theta(a|s) P(s'|s,a) \nabla_\theta V^{\pi}(s') \\
&=\phi(s)+\sum_{s'\in S} \rho(s\to s',1)\nabla_\theta V^{\pi}(s') \quad\text{notice the recurrence relation}\\
&=\phi(s)+\sum_{s'\in S} \rho(s\to s',1)\left[\phi(s')+\sum_{s''\in S} \rho(s'\to s'',1)\nabla_\theta V^{\pi}(s'')\right] \\
&=\phi(s)+\left[\sum_{s'\in S} \rho(s\to s',1)\phi(s')\right]+\left[\sum_{s''\in S} \rho(s\to s'',2)\nabla_\theta V^{\pi}(s'')\right] \\
&=\cdots\\
&=\sum_{x\in S}\sum_{k=0}^\infty \rho(s\to x,k)\phi(x)
\end{aligned}
$$

Here $\rho(s\to x,k)$ denotes the probability of reaching state $x$ from state $s$ in exactly $k$ steps while following $\pi_\theta$; for example, $\rho(s\to s,0)=1$ and $\rho(s\to s',1)=\sum_{a\in A} \pi_\theta(a|s) P(s'|s,a)$.

Let $\eta(s)=\sum_{k=0}^\infty \rho(s_0\to s,k)$ be the expected number of visits to state $s$ when starting from state $s_0$.

Note that $\sum_{s\in S} \eta(s)$ is a constant that depends only on the initial state $s_0$ and the policy $\pi_\theta$.

So $d^{\pi_\theta}(s)=\frac{\eta(s)}{\sum_{s'\in S} \eta(s')}$ is the normalized on-policy (stationary) state distribution of the Markov chain for policy $\pi_\theta$.

$$
\begin{aligned}
\nabla_\theta J(\theta)&=\nabla_\theta V^{\pi}(s_0)\\
&=\sum_{s\in S} \sum_{k=0}^\infty \rho(s_0\to s,k)\phi(s)\\
&=\sum_{s\in S} \eta(s)\phi(s)\\
&=\left(\sum_{s'\in S} \eta(s')\right)\sum_{s\in S} \frac{\eta(s)}{\sum_{s'\in S} \eta(s')}\phi(s)\\
&\propto \sum_{s\in S} \frac{\eta(s)}{\sum_{s'\in S} \eta(s')}\phi(s)\\
&=\sum_{s\in S} d^{\pi_\theta}(s)\sum_{a\in A} \nabla_\theta \pi_\theta(a|s)Q^{\pi_\theta}(s,a)\\
&=\sum_{s\in S} d^{\pi_\theta}(s)\sum_{a\in A} \pi_\theta(a|s)\,\nabla_\theta \log \pi_\theta(a|s)\,Q^{\pi_\theta}(s,a) \quad\text{by the likelihood ratio trick}\\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)]
\end{aligned}
$$
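
Before moving on, here is a minimal sketch of how the theorem is used in practice: the expectation is replaced by samples of $\nabla_\theta \log \pi_\theta(s,a)$ times an estimate of $Q^{\pi_\theta}(s,a)$, and the parameters follow that stochastic gradient. The setting (a two-armed bandit with Bernoulli rewards, the reward probabilities, learning rate, and iteration count) is my own toy assumption, not from the lecture; it anticipates the Monte-Carlo policy gradient of the next section.

```python
import numpy as np

# Toy two-armed bandit (a single-state MDP) with a softmax policy over logits.
# Arm reward probabilities are arbitrary assumptions for illustration.
rng = np.random.default_rng(0)
p_reward = np.array([0.2, 0.8])   # arm 1 is the better arm
theta = np.zeros(2)               # one logit per action
alpha = 0.1

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

for t in range(2000):
    probs = policy(theta)
    a = rng.choice(2, p=probs)
    r = float(rng.random() < p_reward[a])   # sampled reward, a noisy estimate of Q(s,a)
    score = -probs                          # ∇_θ log π_θ(a) for a tabular softmax ...
    score[a] += 1.0                         # ... equals e_a − π_θ
    theta += alpha * score * r              # stochastic gradient ascent on J(θ)

print("final action probabilities:", policy(theta))
```

The probability of the better arm approaches 1: this is exactly the expectation from the theorem, estimated one sample at a time.
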
#### Monte-Carlo Policy Gradient

We can use the score function to compute the policy gradient: using the sampled return $v_t$ as an unbiased estimate of $Q^{\pi_\theta}(s_t,a_t)$, update the parameters by stochastic gradient ascent, $\Delta\theta_t = \alpha \nabla_\theta \log \pi_\theta(s_t,a_t)\, v_t$.

## Actor-Critic methods

### Q Actor-Critic

### Advantage Actor-Critic
\ No newline at end of file
diff --git a/public/CSE510/Partial_Observable_Gridworld.png b/public/CSE510/Partial_Observable_Gridworld.png
new file mode 100644
index 0000000..5c1c425
Binary files /dev/null and b/public/CSE510/Partial_Observable_Gridworld.png differ