# CSE510 Deep Reinforcement Learning (Lecture 11)

> Materials Used
>
> - Much of the material and slides for this lecture were taken from Chapter 13 of the Barto & Sutton textbook.
> - Some slides are borrowed from Rich Sutton's RL class and David Silver's Deep RL tutorial.

Problem: often the feature-based policies that work well (win games, maximize utilities) aren't the ones that approximate $V$ / $Q$ best.

- Q-learning's priority: get the Q-values close (modeling)
- Action selection's priority: get the ordering of the Q-values right (prediction)

Value functions can often be much more complex to represent than the corresponding policy.

- Do we really care about knowing $Q(s, \text{left}) = 0.3554$ and $Q(s, \text{right}) = 0.533$, or just that "right is better than left in state $s$"?

This motivates searching directly in a parameterized policy space.

- Bypass learning a value function and "directly" optimize the value of a policy.

<details>
<summary>Examples</summary>

Rock-Paper-Scissors

- Two-player game of rock-paper-scissors
  - Scissors beats paper
  - Rock beats scissors
  - Paper beats rock
- Consider policies for iterated rock-paper-scissors
  - A deterministic policy is easily exploited
  - A uniform random policy is optimal (i.e., a Nash equilibrium)

---

Partially Observable GridWorld



The agent cannot differentiate the grey states.

Consider features of the following form (for all $N, E, S, W$ actions):

$$
\phi(s,a)=\mathbf{1}(\text{wall to } N,\ a=\text{move } E)
$$

Compare value-based RL, using an approximate value function

$$
Q_\theta(s,a) = f(\phi(s,a),\theta)
$$

to policy-based RL, using a parameterized policy

$$
\pi_\theta(s,a) = g(\phi(s,a),\theta)
$$

Under aliasing, an optimal deterministic policy will either

- move $W$ in both grey states (shown by red arrows), or
- move $E$ in both grey states.

Either way, it can get stuck and _never_ reach the money.

- Value-based RL learns a near-deterministic policy
  - e.g. greedy or $\epsilon$-greedy

So it will traverse the corridor for a long time.

An optimal **stochastic** policy will randomly move $E$ or $W$ in the grey cells:

$$
\pi_\theta(\text{wall to }N\text{ and }S,\ \text{move }E) = 0.5\\
\pi_\theta(\text{wall to }N\text{ and }S,\ \text{move }W) = 0.5
$$

It will reach the goal state in a few steps with high probability.

Policy-based RL can learn the optimal stochastic policy.

</details>

## RL via Policy Gradient Ascent

The policy gradient approach has the following schema (a minimal sketch of this loop follows the list):

1. Select a space of parameterized policies (i.e., a function class)
2. Compute the gradient of the value of the current policy with respect to the parameters
3. Move the parameters in the direction of the gradient
4. Repeat these steps until we reach a local maximum
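
Taken literally, these four steps are plain gradient ascent on $J(\theta)$. Below is a minimal sketch of the loop, assuming a hypothetical `estimate_policy_gradient(theta)` routine (how to obtain such an estimate is the subject of the rest of this lecture) and a fixed step size; the names are illustrative, not part of the lecture.

```python
import numpy as np

def policy_gradient_ascent(estimate_policy_gradient, theta_init,
                           step_size=0.01, num_iterations=1_000):
    """Schema only: repeatedly estimate grad_theta J(theta) and step along it."""
    theta = np.array(theta_init, dtype=float)     # step 1: a point in the chosen policy class
    for _ in range(num_iterations):
        grad = estimate_policy_gradient(theta)    # step 2: estimate grad_theta J(theta)
        theta = theta + step_size * grad          # step 3: move along the gradient
    return theta                                  # step 4: at best a local maximum
```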

So we must answer the following questions:

- How should we represent and evaluate parameterized policies?
- How can we compute the gradient?

### Policy learning objective

Goal: given a policy $\pi_\theta(s,a)$ with parameter $\theta$, find the best $\theta$.

In episodic environments we can use the start value:

$$
J_1(\theta) = V^{\pi_\theta}(s_1)=\mathbb{E}_{\pi_\theta}[v_1]
$$

In continuing environments we can use the average value:

$$
J_{avV}(\theta) = \sum_{s\in S} d^{\pi_\theta}(s) V^{\pi_\theta}(s)
$$

or the average reward per time-step:

$$
J_{avR}(\theta) = \sum_{s\in S} d^{\pi_\theta}(s) \sum_{a\in A} \pi_\theta(s,a) \mathcal{R}(s,a)
$$

Here $d^{\pi_\theta}(s)$ is the **stationary distribution** of the Markov chain induced by policy $\pi_\theta$.
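
For intuition only: in a small tabular MDP whose transition probabilities are known (which an RL agent normally does not have), $d^{\pi_\theta}$ and $J_{avR}$ can be computed directly. The sketch below is a hedged illustration; the array shapes and helper names are assumptions, not part of the lecture.

```python
import numpy as np

def stationary_distribution(P, pi, iters=1_000):
    """Estimate d^pi by power iteration.

    P  : array of shape (A, S, S), P[a, s, s2] = Pr(s2 | s, a)
    pi : array of shape (S, A),    pi[s, a]    = pi(a | s)
    """
    # State-to-state transition matrix induced by the policy:
    # P_pi[s, s2] = sum_a pi(a | s) * P(s2 | s, a)
    P_pi = np.einsum('sa,asx->sx', pi, P)
    num_states = P_pi.shape[0]
    d = np.full(num_states, 1.0 / num_states)   # start from the uniform distribution
    for _ in range(iters):
        d = d @ P_pi                            # one step of the chain
    return d / d.sum()                          # d^pi, assuming the chain mixes

def average_reward_objective(P, pi, R):
    """J_avR = sum_s d^pi(s) sum_a pi(a|s) R(s, a), with R of shape (S, A)."""
    d = stationary_distribution(P, pi)
    return float(np.sum(d[:, None] * pi * R))
```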

### Policy optimization

Policy-based reinforcement learning is an **optimization** problem:

Find the $\theta$ that maximises $J(\theta)$.

Some approaches do not use the gradient:

- Hill climbing
- Simplex / amoeba / Nelder-Mead
- Genetic algorithms

Greater efficiency is often possible using the gradient:

- Gradient descent
- Conjugate gradient
- Quasi-Newton

We focus on gradient ascent (many extensions are possible) and on methods that exploit the sequential structure of the problem.

### Policy gradient

Let $J(\theta)$ be any policy objective function.

Policy gradient algorithms search for a _local_ maximum of $J(\theta)$ by ascending the gradient of the objective with respect to $\theta$:

$$
\Delta \theta = \alpha \nabla_\theta J(\theta)
$$

where $\nabla_\theta J(\theta)$ is the policy gradient

$$
\nabla_\theta J(\theta) = \begin{pmatrix}
\frac{\partial J(\theta)}{\partial \theta_1} \\
\frac{\partial J(\theta)}{\partial \theta_2} \\
\vdots \\
\frac{\partial J(\theta)}{\partial \theta_n}
\end{pmatrix}
$$

and $\alpha$ is the step-size parameter.

### Policy gradient methods

The main method we will introduce is Monte-Carlo policy gradient (REINFORCE).

#### Score Function

Assume the policy $\pi_\theta$ is differentiable and non-zero, and that we know the gradient $\nabla_\theta \pi_\theta(s,a)$ for all $s\in S$ and $a\in A$.

Then we can compute the policy gradient analytically.

We use the **likelihood ratio** trick:

$$
\begin{aligned}
\nabla_\theta \pi_\theta(s,a) &= \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} \\
&= \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a)
\end{aligned}
$$

The **score function** is:

$$
\nabla_\theta \log \pi_\theta(s,a)
$$

<details>
<summary>Example</summary>

Take the softmax policy as an example.

Weight actions using a linear combination of features $\phi(s,a)^T\theta$.

The probability of an action is proportional to its exponentiated weight:

$$
\pi_\theta(s,a) \propto \exp(\phi(s,a)^T\theta)
$$

The score function is

$$
\begin{aligned}
\nabla_\theta \ln\left[\frac{\exp(\phi(s,a)^T\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^T\theta)}\right] &= \nabla_\theta\left(\phi(s,a)^T\theta - \ln \sum_{a'\in A}\exp(\phi(s,a')^T\theta)\right) \\
&= \phi(s,a) - \frac{\sum_{a'\in A}\phi(s,a')\exp(\phi(s,a')^T\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^T\theta)} \\
&= \phi(s,a) - \sum_{a'\in A} \pi_\theta(s,a')\, \phi(s,a') \\
&= \phi(s,a) - \mathbb{E}_{a'\sim \pi_\theta(s,\cdot)}[\phi(s,a')]
\end{aligned}
$$
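
A small numerical sketch of this result (feature shapes and variable names are illustrative assumptions): the score is the feature vector of the chosen action minus the policy-weighted average feature vector.

```python
import numpy as np

def softmax_policy(phi_s, theta):
    """phi_s[a] = feature vector phi(s, a); returns pi_theta(s, .) over actions."""
    logits = phi_s @ theta              # phi(s, a)^T theta for every action a
    logits -= logits.max()              # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def softmax_score(phi_s, theta, a):
    """Score function: phi(s, a) - E_{a' ~ pi_theta}[phi(s, a')]."""
    pi = softmax_policy(phi_s, theta)
    return phi_s[a] - pi @ phi_s

# Illustrative toy numbers: 3 actions, 4 features.
rng = np.random.default_rng(0)
phi_s = rng.normal(size=(3, 4))
theta = rng.normal(size=4)
print(softmax_score(phi_s, theta, a=1))
```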

---

In continuous action spaces, a Gaussian policy is natural.

The mean is a linear combination of state features: $\mu(s) = \phi(s)^T\theta$.

The variance may be fixed at $\sigma^2$, or it can also be parametrized.

The policy is Gaussian, $a \sim \mathcal{N}(\mu(s), \sigma^2)$.

The score function is

$$
\nabla_\theta \log \pi_\theta(s,a) = \frac{(a - \mu(s))\, \phi(s)}{\sigma^2}
$$
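
A minimal numerical check, assuming a fixed $\sigma$ and toy numbers (all names here are illustrative): the analytic score should agree with a finite-difference derivative of $\log \pi_\theta$.

```python
import numpy as np

def gaussian_score(phi_s, theta, a, sigma=1.0):
    """Score of a Gaussian policy a ~ N(phi(s)^T theta, sigma^2) w.r.t. theta."""
    mu = phi_s @ theta
    return (a - mu) * phi_s / sigma**2

def log_pi(phi_s, theta, a, sigma=1.0):
    """log density of a under N(phi(s)^T theta, sigma^2)."""
    mu = phi_s @ theta
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

phi_s = np.array([0.5, -1.0, 2.0])
theta = np.array([0.1, 0.2, -0.3])
a, eps = 0.7, 1e-6
numeric = np.array([(log_pi(phi_s, theta + eps * e, a) - log_pi(phi_s, theta - eps * e, a)) / (2 * eps)
                    for e in np.eye(3)])
print(gaussian_score(phi_s, theta, a), numeric)   # the two should agree closely
```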

</details>

#### Policy Gradient Theorem

For any _differentiable_ policy $\pi_\theta(s,a)$ and for any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma}J_{avV}$, the policy gradient is:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)]
$$

<details>
<summary>Proof</summary>

To simplify notation, write $\phi(s)=\sum_{a\in A} \nabla_\theta \pi_\theta(a|s)\,Q^{\pi}(s,a)$ (not to be confused with the feature vector used earlier).

$$
\begin{aligned}
\nabla_\theta V^{\pi}(s)&=\nabla_\theta \left(\sum_{a\in A} \pi_\theta(a|s)Q^{\pi}(s,a)\right) \\
&=\sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) + \pi_\theta(a|s) \nabla_\theta Q^{\pi}(s,a)\right) &&\text{product rule}\\
&=\sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) + \pi_\theta(a|s) \nabla_\theta \sum_{s',r\in S\times R} P(s',r|s,a) \left(r+V^{\pi}(s')\right)\right) &&\text{expand } Q^{\pi} \text{ as expected reward plus next-state value}\\
&=\sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) + \pi_\theta(a|s) \sum_{s',r\in S\times R} P(s',r|s,a) \nabla_\theta \left(r+V^{\pi}(s')\right)\right) \\
&=\phi(s)+\sum_{a\in A} \pi_\theta(a|s) \sum_{s'\in S} P(s'|s,a) \nabla_\theta V^{\pi}(s') \\
&=\phi(s)+\sum_{s'\in S} \sum_{a\in A} \pi_\theta(a|s) P(s'|s,a) \nabla_\theta V^{\pi}(s') \\
&=\phi(s)+\sum_{s'\in S} \rho(s\to s',1)\nabla_\theta V^{\pi}(s') &&\text{note the recurrence}\\
&=\phi(s)+\sum_{s'\in S} \rho(s\to s',1)\left[\phi(s')+\sum_{s''\in S} \rho(s'\to s'',1)\nabla_\theta V^{\pi}(s'')\right] \\
&=\phi(s)+\left[\sum_{s'\in S} \rho(s\to s',1)\phi(s')\right]+\left[\sum_{s''\in S} \rho(s\to s'',2)\nabla_\theta V^{\pi}(s'')\right] \\
&=\cdots\\
&=\sum_{x\in S}\sum_{k=0}^\infty \rho(s\to x,k)\phi(x)
\end{aligned}
$$

Here $\rho(s\to x,k)$ is the probability of going from state $s$ to state $x$ in $k$ steps under $\pi_\theta$: $\rho(s\to s',1)=\sum_{a\in A} \pi_\theta(a|s) P(s'|s,a)$ and $\rho(s\to x,k+1)=\sum_{s'\in S}\rho(s\to s',k)\,\rho(s'\to x,1)$.

Let $\eta(s)=\sum_{k=0}^\infty \rho(s_0\to s,k)$ be the expected number of visits to state $s$ starting from state $s_0$.

Note that $\sum_{s\in S} \eta(s)$ is a constant that depends only on the initial state $s_0$ and the policy $\pi_\theta$.

So $d^{\pi_\theta}(s)=\frac{\eta(s)}{\sum_{s'\in S} \eta(s')}$ is the stationary distribution of the Markov chain for policy $\pi_\theta$.

$$
\begin{aligned}
\nabla_\theta J(\theta)&=\nabla_\theta V^{\pi}(s_0)\\
&=\sum_{s\in S} \sum_{k=0}^\infty \rho(s_0\to s,k)\phi(s)\\
&=\sum_{s\in S} \eta(s)\phi(s)\\
&=\left(\sum_{s'\in S} \eta(s')\right)\sum_{s\in S} \frac{\eta(s)}{\sum_{s'\in S} \eta(s')}\phi(s)\\
&\propto \sum_{s\in S} \frac{\eta(s)}{\sum_{s'\in S} \eta(s')}\phi(s)\\
&=\sum_{s\in S} d^{\pi_\theta}(s)\sum_{a\in A} \nabla_\theta \pi_\theta(a|s)Q^{\pi_\theta}(s,a)\\
&=\sum_{s\in S} d^{\pi_\theta}(s)\sum_{a\in A} \pi_\theta(a|s)\,\nabla_\theta \log \pi_\theta(a|s)\,Q^{\pi_\theta}(s,a)\\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)]
\end{aligned}
$$

</details>

#### Monte-Carlo Policy Gradient

We can use the score function to compute the policy gradient: the return $v_t$ from time $t$ onward is an unbiased sample of $Q^{\pi_\theta}(s_t,a_t)$, so we can update the parameters by stochastic gradient ascent, $\Delta\theta_t = \alpha\, \nabla_\theta \log \pi_\theta(s_t,a_t)\, v_t$ (the REINFORCE algorithm).
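
A hedged sketch of this Monte-Carlo policy gradient (REINFORCE) loop, assuming a minimal environment interface where `env.reset()` returns a state and `env.step(a)` returns `(next_state, reward, done)`, and reusing the softmax policy from the earlier example; all names and hyperparameters are illustrative.

```python
import numpy as np

def softmax_policy(phi_s, theta):
    logits = phi_s @ theta
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

def reinforce(env, features, theta, alpha=0.01, gamma=0.99, episodes=1_000):
    """Monte-Carlo policy gradient: sample an episode, then for every step
    move theta along score(s_t, a_t) * return_t."""
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        traj, s, done = [], env.reset(), False
        while not done:
            phi_s = features(s)                       # phi(s, a) for every action, shape (A, d)
            pi = softmax_policy(phi_s, theta)
            a = rng.choice(len(pi), p=pi)
            s_next, r, done = env.step(a)
            traj.append((phi_s, a, r))
            s = s_next
        g = 0.0
        for phi_s, a, r in reversed(traj):            # accumulate discounted returns backwards
            g = r + gamma * g
            score = phi_s[a] - softmax_policy(phi_s, theta) @ phi_s
            theta = theta + alpha * g * score         # gradient ascent step
    return theta
```

Because the returns are full Monte-Carlo samples, this gradient estimate is unbiased but high-variance; the actor-critic methods below replace $v_t$ with a learned critic to reduce that variance.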

## Actor-Critic methods

### Q Actor-Critic

### Advantage Actor-Critic