# CSE510 Deep Reinforcement Learning (Lecture 11)

> Materials Used
>
> - Much of the material and slides for this lecture were taken from Chapter 13 of the Barto & Sutton textbook.
> - Some slides are borrowed from Rich Sutton's RL class and David Silver's Deep RL tutorial.

Problem: often the feature-based policies that work well (win games, maximize utilities) aren't the ones that approximate $V$ / $Q$ best.

- Q-learning's priority: get the Q-values close (modeling)
- Action selection's priority: get the ordering of the Q-values right (prediction)

Value functions can often be much more complex to represent than the corresponding policy.

- Do we really care about knowing $Q(s, \text{left}) = 0.3554$ and $Q(s, \text{right}) = 0.533$, or just that "right is better than left in state $s$"?

This motivates searching directly in a parameterized policy space.

- Bypass learning a value function and "directly" optimize the value of a policy.

<details>
<summary>Examples</summary>

Rock-Paper-Scissors

- Two-player game of rock-paper-scissors
  - Scissors beats paper
  - Rock beats scissors
  - Paper beats rock
- Consider policies for iterated rock-paper-scissors
  - A deterministic policy is easily exploited
  - A uniform random policy is optimal (i.e., a Nash equilibrium)

---

Partially Observable GridWorld



The agent cannot differentiate the grey states.

Consider features of the following form (for all $N, E, S, W$ actions):

$$
\phi(s,a)=\mathbf{1}(\text{wall to } N,\ a=\text{move } E)
$$

Compare value-based RL, using an approximate value function

$$
Q_\theta(s,a) = f(\phi(s,a),\theta)
$$

to policy-based RL, using a parameterized policy

$$
\pi_\theta(s,a) = g(\phi(s,a),\theta)
$$

Under aliasing, an optimal deterministic policy will either

- move $W$ in both grey states (shown by red arrows), or
- move $E$ in both grey states.

Either way, it can get stuck and _never_ reach the money.

- Value-based RL learns a near-deterministic policy
  - e.g. greedy or $\epsilon$-greedy

So it will traverse the corridor for a long time.

An optimal **stochastic** policy will randomly move $E$ or $W$ in the grey cells:

$$
\pi_\theta(\text{wall to }N\text{ and }S,\ \text{move }E) = 0.5\\
\pi_\theta(\text{wall to }N\text{ and }S,\ \text{move }W) = 0.5
$$

It will reach the goal state in a few steps with high probability.

Policy-based RL can learn the optimal stochastic policy.

</details>

## RL via Policy Gradient Ascent

The policy gradient approach has the following schema (a minimal sketch of this loop follows the list):

1. Select a space of parameterized policies (i.e., a function class)
2. Compute the gradient of the value of the current policy with respect to the parameters
3. Move the parameters in the direction of the gradient
4. Repeat these steps until we reach a local maximum
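
Taken literally, these four steps are plain gradient ascent on $J(\theta)$. Below is a minimal sketch of the loop, assuming a hypothetical `estimate_policy_gradient(theta)` routine (how to obtain such an estimate is the subject of the rest of this lecture) and a fixed step size; the names are illustrative, not part of the lecture.

```python
import numpy as np

def policy_gradient_ascent(estimate_policy_gradient, theta_init,
                           step_size=0.01, num_iterations=1_000):
    """Schema only: repeatedly estimate grad_theta J(theta) and step along it."""
    theta = np.array(theta_init, dtype=float)     # step 1: a point in the chosen policy class
    for _ in range(num_iterations):
        grad = estimate_policy_gradient(theta)    # step 2: estimate grad_theta J(theta)
        theta = theta + step_size * grad          # step 3: move along the gradient
    return theta                                  # step 4: at best a local maximum
```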

So we must answer the following questions:

- How should we represent and evaluate parameterized policies?
- How can we compute the gradient?

### Policy learning objective

Goal: given a policy $\pi_\theta(s,a)$ with parameter $\theta$, find the best $\theta$.

In episodic environments we can use the start value:

$$
J_1(\theta) = V^{\pi_\theta}(s_1)=\mathbb{E}_{\pi_\theta}[v_1]
$$

In continuing environments we can use the average value:

$$
J_{avV}(\theta) = \sum_{s\in S} d^{\pi_\theta}(s) V^{\pi_\theta}(s)
$$

or the average reward per time-step:

$$
J_{avR}(\theta) = \sum_{s\in S} d^{\pi_\theta}(s) \sum_{a\in A} \pi_\theta(s,a) \mathcal{R}(s,a)
$$

Here $d^{\pi_\theta}(s)$ is the **stationary distribution** of the Markov chain induced by policy $\pi_\theta$.
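
For intuition only: in a small tabular MDP whose transition probabilities are known (which an RL agent normally does not have), $d^{\pi_\theta}$ and $J_{avR}$ can be computed directly. The sketch below is a hedged illustration; the array shapes and helper names are assumptions, not part of the lecture.

```python
import numpy as np

def stationary_distribution(P, pi, iters=1_000):
    """Estimate d^pi by power iteration.

    P  : array of shape (A, S, S), P[a, s, s2] = Pr(s2 | s, a)
    pi : array of shape (S, A),    pi[s, a]    = pi(a | s)
    """
    # State-to-state transition matrix induced by the policy:
    # P_pi[s, s2] = sum_a pi(a | s) * P(s2 | s, a)
    P_pi = np.einsum('sa,asx->sx', pi, P)
    num_states = P_pi.shape[0]
    d = np.full(num_states, 1.0 / num_states)   # start from the uniform distribution
    for _ in range(iters):
        d = d @ P_pi                            # one step of the chain
    return d / d.sum()                          # d^pi, assuming the chain mixes

def average_reward_objective(P, pi, R):
    """J_avR = sum_s d^pi(s) sum_a pi(a|s) R(s, a), with R of shape (S, A)."""
    d = stationary_distribution(P, pi)
    return float(np.sum(d[:, None] * pi * R))
```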

### Policy optimization

Policy-based reinforcement learning is an **optimization** problem:

Find the $\theta$ that maximises $J(\theta)$.

Some approaches do not use the gradient:

- Hill climbing
- Simplex / amoeba / Nelder-Mead
- Genetic algorithms

Greater efficiency is often possible using the gradient:

- Gradient descent
- Conjugate gradient
- Quasi-Newton

We focus on gradient ascent (many extensions are possible) and on methods that exploit the sequential structure of the problem.

### Policy gradient

Let $J(\theta)$ be any policy objective function.

Policy gradient algorithms search for a _local_ maximum of $J(\theta)$ by ascending the gradient of the objective with respect to $\theta$:

$$
\Delta \theta = \alpha \nabla_\theta J(\theta)
$$

where $\nabla_\theta J(\theta)$ is the policy gradient

$$
\nabla_\theta J(\theta) = \begin{pmatrix}
\frac{\partial J(\theta)}{\partial \theta_1} \\
\frac{\partial J(\theta)}{\partial \theta_2} \\
\vdots \\
\frac{\partial J(\theta)}{\partial \theta_n}
\end{pmatrix}
$$

and $\alpha$ is the step-size parameter.

### Policy gradient methods

The main method we will introduce is Monte-Carlo policy gradient (REINFORCE).

#### Score Function

Assume the policy $\pi_\theta$ is differentiable and non-zero, and that we know the gradient $\nabla_\theta \pi_\theta(s,a)$ for all $s\in S$ and $a\in A$.

Then we can compute the policy gradient analytically.

We use the **likelihood ratio** trick:

$$
\begin{aligned}
\nabla_\theta \pi_\theta(s,a) &= \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} \\
&= \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a)
\end{aligned}
$$

The **score function** is:

$$
\nabla_\theta \log \pi_\theta(s,a)
$$

<details>
<summary>Example</summary>

Take the softmax policy as an example.

Weight actions using a linear combination of features $\phi(s,a)^T\theta$.

The probability of an action is proportional to its exponentiated weight:

$$
\pi_\theta(s,a) \propto \exp(\phi(s,a)^T\theta)
$$

The score function is

$$
\begin{aligned}
\nabla_\theta \ln\left[\frac{\exp(\phi(s,a)^T\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^T\theta)}\right] &= \nabla_\theta\left(\phi(s,a)^T\theta - \ln \sum_{a'\in A}\exp(\phi(s,a')^T\theta)\right) \\
&= \phi(s,a) - \frac{\sum_{a'\in A}\phi(s,a')\exp(\phi(s,a')^T\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^T\theta)} \\
&= \phi(s,a) - \sum_{a'\in A} \pi_\theta(s,a')\, \phi(s,a') \\
&= \phi(s,a) - \mathbb{E}_{a'\sim \pi_\theta(s,\cdot)}[\phi(s,a')]
\end{aligned}
$$
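
A small numerical sketch of this result (feature shapes and variable names are illustrative assumptions): the score is the feature vector of the chosen action minus the policy-weighted average feature vector.

```python
import numpy as np

def softmax_policy(phi_s, theta):
    """phi_s[a] = feature vector phi(s, a); returns pi_theta(s, .) over actions."""
    logits = phi_s @ theta              # phi(s, a)^T theta for every action a
    logits -= logits.max()              # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def softmax_score(phi_s, theta, a):
    """Score function: phi(s, a) - E_{a' ~ pi_theta}[phi(s, a')]."""
    pi = softmax_policy(phi_s, theta)
    return phi_s[a] - pi @ phi_s

# Illustrative toy numbers: 3 actions, 4 features.
rng = np.random.default_rng(0)
phi_s = rng.normal(size=(3, 4))
theta = rng.normal(size=4)
print(softmax_score(phi_s, theta, a=1))
```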

---

In continuous action spaces, a Gaussian policy is natural.

The mean is a linear combination of state features: $\mu(s) = \phi(s)^T\theta$.

The variance may be fixed at $\sigma^2$, or it can also be parametrized.

The policy is Gaussian, $a \sim \mathcal{N}(\mu(s), \sigma^2)$.

The score function is

$$
\nabla_\theta \log \pi_\theta(s,a) = \frac{(a - \mu(s))\, \phi(s)}{\sigma^2}
$$
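
A minimal numerical check, assuming a fixed $\sigma$ and toy numbers (all names here are illustrative): the analytic score should agree with a finite-difference derivative of $\log \pi_\theta$.

```python
import numpy as np

def gaussian_score(phi_s, theta, a, sigma=1.0):
    """Score of a Gaussian policy a ~ N(phi(s)^T theta, sigma^2) w.r.t. theta."""
    mu = phi_s @ theta
    return (a - mu) * phi_s / sigma**2

def log_pi(phi_s, theta, a, sigma=1.0):
    """log density of a under N(phi(s)^T theta, sigma^2)."""
    mu = phi_s @ theta
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

phi_s = np.array([0.5, -1.0, 2.0])
theta = np.array([0.1, 0.2, -0.3])
a, eps = 0.7, 1e-6
numeric = np.array([(log_pi(phi_s, theta + eps * e, a) - log_pi(phi_s, theta - eps * e, a)) / (2 * eps)
                    for e in np.eye(3)])
print(gaussian_score(phi_s, theta, a), numeric)   # the two should agree closely
```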

</details>

#### Policy Gradient Theorem

For any _differentiable_ policy $\pi_\theta(s,a)$ and for any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma}J_{avV}$, the policy gradient is:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)]
$$

<details>
<summary>Proof</summary>

To simplify notation, write $\phi(s)=\sum_{a\in A} \nabla_\theta \pi_\theta(a|s)\,Q^{\pi}(s,a)$ (not to be confused with the feature vector used earlier).

$$
\begin{aligned}
\nabla_\theta V^{\pi}(s)&=\nabla_\theta \left(\sum_{a\in A} \pi_\theta(a|s)Q^{\pi}(s,a)\right) \\
&=\sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) + \pi_\theta(a|s) \nabla_\theta Q^{\pi}(s,a)\right) &&\text{product rule}\\
&=\sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) + \pi_\theta(a|s) \nabla_\theta \sum_{s',r\in S\times R} P(s',r|s,a) \left(r+V^{\pi}(s')\right)\right) &&\text{expand } Q^{\pi} \text{ as expected reward plus next-state value}\\
&=\sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) + \pi_\theta(a|s) \sum_{s',r\in S\times R} P(s',r|s,a) \nabla_\theta \left(r+V^{\pi}(s')\right)\right) \\
&=\phi(s)+\sum_{a\in A} \pi_\theta(a|s) \sum_{s'\in S} P(s'|s,a) \nabla_\theta V^{\pi}(s') \\
&=\phi(s)+\sum_{s'\in S} \sum_{a\in A} \pi_\theta(a|s) P(s'|s,a) \nabla_\theta V^{\pi}(s') \\
&=\phi(s)+\sum_{s'\in S} \rho(s\to s',1)\nabla_\theta V^{\pi}(s') &&\text{note the recurrence}\\
&=\phi(s)+\sum_{s'\in S} \rho(s\to s',1)\left[\phi(s')+\sum_{s''\in S} \rho(s'\to s'',1)\nabla_\theta V^{\pi}(s'')\right] \\
&=\phi(s)+\left[\sum_{s'\in S} \rho(s\to s',1)\phi(s')\right]+\left[\sum_{s''\in S} \rho(s\to s'',2)\nabla_\theta V^{\pi}(s'')\right] \\
&=\cdots\\
&=\sum_{x\in S}\sum_{k=0}^\infty \rho(s\to x,k)\phi(x)
\end{aligned}
$$

Here $\rho(s\to x,k)$ is the probability of going from state $s$ to state $x$ in $k$ steps under $\pi_\theta$: $\rho(s\to s',1)=\sum_{a\in A} \pi_\theta(a|s) P(s'|s,a)$ and $\rho(s\to x,k+1)=\sum_{s'\in S}\rho(s\to s',k)\,\rho(s'\to x,1)$.

Let $\eta(s)=\sum_{k=0}^\infty \rho(s_0\to s,k)$ be the expected number of visits to state $s$ starting from state $s_0$.

Note that $\sum_{s\in S} \eta(s)$ is a constant that depends only on the initial state $s_0$ and the policy $\pi_\theta$.

So $d^{\pi_\theta}(s)=\frac{\eta(s)}{\sum_{s'\in S} \eta(s')}$ is the stationary distribution of the Markov chain for policy $\pi_\theta$.

$$
\begin{aligned}
\nabla_\theta J(\theta)&=\nabla_\theta V^{\pi}(s_0)\\
&=\sum_{s\in S} \sum_{k=0}^\infty \rho(s_0\to s,k)\phi(s)\\
&=\sum_{s\in S} \eta(s)\phi(s)\\
&=\left(\sum_{s'\in S} \eta(s')\right)\sum_{s\in S} \frac{\eta(s)}{\sum_{s'\in S} \eta(s')}\phi(s)\\
&\propto \sum_{s\in S} \frac{\eta(s)}{\sum_{s'\in S} \eta(s')}\phi(s)\\
&=\sum_{s\in S} d^{\pi_\theta}(s)\sum_{a\in A} \nabla_\theta \pi_\theta(a|s)Q^{\pi_\theta}(s,a)\\
&=\sum_{s\in S} d^{\pi_\theta}(s)\sum_{a\in A} \pi_\theta(a|s)\,\nabla_\theta \log \pi_\theta(a|s)\,Q^{\pi_\theta}(s,a)\\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)]
\end{aligned}
$$

</details>

#### Monte-Carlo Policy Gradient

We can use the score function to compute the policy gradient: the return $v_t$ from time $t$ onward is an unbiased sample of $Q^{\pi_\theta}(s_t,a_t)$, so we can update the parameters by stochastic gradient ascent, $\Delta\theta_t = \alpha\, \nabla_\theta \log \pi_\theta(s_t,a_t)\, v_t$ (the REINFORCE algorithm).
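
A hedged sketch of this Monte-Carlo policy gradient (REINFORCE) loop, assuming a minimal environment interface where `env.reset()` returns a state and `env.step(a)` returns `(next_state, reward, done)`, and reusing the softmax policy from the earlier example; all names and hyperparameters are illustrative.

```python
import numpy as np

def softmax_policy(phi_s, theta):
    logits = phi_s @ theta
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

def reinforce(env, features, theta, alpha=0.01, gamma=0.99, episodes=1_000):
    """Monte-Carlo policy gradient: sample an episode, then for every step
    move theta along score(s_t, a_t) * return_t."""
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        traj, s, done = [], env.reset(), False
        while not done:
            phi_s = features(s)                       # phi(s, a) for every action, shape (A, d)
            pi = softmax_policy(phi_s, theta)
            a = rng.choice(len(pi), p=pi)
            s_next, r, done = env.step(a)
            traj.append((phi_s, a, r))
            s = s_next
        g = 0.0
        for phi_s, a, r in reversed(traj):            # accumulate discounted returns backwards
            g = r + gamma * g
            score = phi_s[a] - softmax_policy(phi_s, theta) @ phi_s
            theta = theta + alpha * g * score         # gradient ascent step
    return theta
```

Because the returns are full Monte-Carlo samples, this gradient estimate is unbiased but high-variance; the actor-critic methods below replace $v_t$ with a learned critic to reduce that variance.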

## Actor-Critic methods

### Q Actor-Critic

### Advantage Actor-Critic