CSE510 Deep Reinforcement Learning (Lecture 11)
Materials Used
Much of the material and slides for this lecture were taken from Chapter 13 of the Sutton & Barto textbook.
Some slides are borrowed from Rich Sutton's RL class and David Silver's Deep RL tutorial.
Problem: often the feature-based policies that work well (win games, maximize utilities) aren't the ones that approximate V / Q best
- Q-learning's priority: get Q-values close (modeling)
- Action selection priority: get ordering of Q-values right (prediction)
Value functions can often be much more complex to represent than the corresponding policy
- Do we really care about knowing Q(s, \text{left}) = 0.3554 and Q(s, \text{right}) = 0.533, or just that "right is better than left in state s"?
Motivates searching directly in a parameterized policy space
- Bypass learning value function and "directly" optimize the value of a policy
Examples
Rock-Paper-Scissors
- Two-player game of rock-paper-scissors
- Scissors beats paper
- Rock beats scissors
- Paper beats rock
- Consider policies for iterated rock-paper-scissors
- A deterministic policy is easily exploited
- A uniform random policy is optimal (i.e., the Nash equilibrium); the sketch below checks both claims numerically
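Both claims can be verified with a few lines of NumPy. This is a minimal illustrative sketch, not from the lecture; the payoff encoding and helper names are my own.

```python
import numpy as np

# Payoff for player 1 (rows: our action, cols: opponent action).
# Order: 0 = rock, 1 = paper, 2 = scissors.  +1 win, -1 loss, 0 draw.
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

def expected_payoff(our_policy, opp_policy):
    """Expected per-round payoff when both players mix independently."""
    return our_policy @ PAYOFF @ opp_policy

# A deterministic policy (always rock) is exploited: a best-responding
# opponent simply plays the counter-action (paper) every round.
always_rock = np.array([1.0, 0.0, 0.0])
best_response = np.zeros(3)
best_response[np.argmin(always_rock @ PAYOFF)] = 1.0
print(expected_payoff(always_rock, best_response))            # -1.0: we lose every round

# The uniform random policy cannot be exploited: every opponent strategy
# yields expected payoff 0 against it.
uniform = np.ones(3) / 3
print([expected_payoff(uniform, opp) for opp in np.eye(3)])   # [0.0, 0.0, 0.0]
```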
Partially Observable GridWorld
The agent cannot differentiate between the two grey states
Consider features of the following form (for all N, E, S, W actions):
\phi(s,a)=\mathbf{1}(\text{wall to } N,\ a=\text{move } E)
Compare value-based RL, using an approximate value function
Q_\theta(s,a) = f(\phi(s,a),\theta)
To policy-based RL, using a parameterized policy
\pi_\theta(s,a) = g(\phi(s,a),\theta)
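To make the contrast concrete, here is a minimal sketch of how the same binary features can parameterize either a linear Q-function or a softmax policy. The feature values and \theta below are made-up numbers; the point is that the two aliased grey states share identical rows of \phi.

```python
import numpy as np

# Made-up binary features phi(s, a): one row per action (N, E, S, W).
# The two aliased grey states share exactly this feature matrix.
phi_grey = np.array([[1, 0, 0, 1],    # a = move N
                     [0, 1, 0, 1],    # a = move E
                     [0, 0, 1, 1],    # a = move S
                     [1, 1, 0, 0]],   # a = move W
                    dtype=float)
theta = np.array([0.2, -0.1, 0.4, 0.3])

# Value-based: Q_theta(s, a) = f(phi(s,a), theta), here a linear function.
q_values = phi_grey @ theta
greedy_action = int(np.argmax(q_values))   # identical in both grey states

# Policy-based: pi_theta(s, a) = g(phi(s,a), theta), here a softmax over preferences.
prefs = phi_grey @ theta
probs = np.exp(prefs - prefs.max())        # subtract max for numerical stability
probs /= probs.sum()

print(q_values, greedy_action)
print(probs)                               # a full distribution over N, E, S, W
```

Because the aliased states have identical features, any greedy (or \epsilon-greedy) policy derived from Q_\theta must behave the same way in both of them, whereas the softmax policy can spread probability over E and W, which is exactly the stochastic behaviour the example below requires.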
Under aliasing, an optimal deterministic policy will either
- move W in both grey states (shown by red arrows), or
- move E in both grey states
Either way, it can get stuck and never reach the money
- Value-based RL learns a near-deterministic policy
- e.g. greedy or $\epsilon$-greedy
So it will traverse the corridor for a long time
An optimal stochastic policy will randomly move E or W in grey cells.
\pi_\theta(\text{wall to }N\text{ and }S, \text{move }E) = 0.5\\
\pi_\theta(\text{wall to }N\text{ and }S, \text{move }W) = 0.5
It will reach the goal state in a few steps with high probability
Policy-based RL can learn the optimal stochastic policy
RL via Policy Gradient Ascent
The policy gradient approach has the following schema:
- Select a space of parameterized policies (i.e., function class)
- Compute the gradient of the value of current policy wrt parameters
- Move parameters in the direction of the gradient
- Repeat these steps until we reach a local maximum (a minimal loop sketch follows the questions below)
So we must answer the following questions:
- How should we represent and evaluate parameterized policies?
- How can we compute the gradient?
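A minimal sketch of the loop, assuming only a black-box evaluator of J(\theta). The toy objective and the finite-difference gradient here are stand-ins (the analytic gradient is derived later in the lecture), and all names are illustrative.

```python
import numpy as np

def evaluate_policy(theta):
    """Placeholder for J(theta); in practice, the (estimated) value of pi_theta.
    A smooth toy objective stands in here so the loop is runnable on its own."""
    return -np.sum((theta - 1.0) ** 2)

def finite_difference_gradient(J, theta, eps=1e-4):
    """Numerical estimate of grad J(theta), one perturbation per parameter."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

theta = np.zeros(3)                    # 1. pick a parameterized policy class
alpha = 0.1                            # step-size parameter
for _ in range(200):
    grad = finite_difference_gradient(evaluate_policy, theta)   # 2. gradient of J w.r.t. theta
    theta += alpha * grad                                        # 3. move along the gradient
    if np.linalg.norm(grad) < 1e-6:                              # 4. stop at a local maximum
        break
print(theta)                           # ends near the maximizer of the toy objective
```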
Policy learning objective
Goal: given policy \pi_\theta(s,a) with parameter \theta, find best \theta
In episodic environments we can use the start value:
J_1(\theta) = V^{\pi_\theta}(s_1)=\mathbb{E}_{\pi_\theta}[v_1]
In continuing environments we can use the average value:
J_{avV}(\theta) = \sum_{s\in S} d^{\pi_\theta}(s) V^{\pi_\theta}(s)
Or the average reward per time-step
J_{avR}(\theta) = \sum_{s\in S} d^{\pi_\theta}(s) \sum_{a\in A} \pi_\theta(s,a) \mathcal{R}(s,a)
where d^{\pi_\theta}(s) is the stationary distribution of the Markov chain induced by policy \pi_\theta.
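For the average-reward objective, the formula is just a weighted sum and can be evaluated directly when d^{\pi_\theta}, \pi_\theta and \mathcal{R} are known in tabular form. The numbers below are made up purely for illustration.

```python
import numpy as np

# Illustrative 2-state, 2-action tables (all numbers made up).
d = np.array([0.6, 0.4])             # stationary distribution d^{pi_theta}(s)
pi = np.array([[0.7, 0.3],           # pi_theta(s, a); each row sums to 1
               [0.2, 0.8]])
R = np.array([[1.0, 0.0],            # R(s, a)
              [0.5, 2.0]])

# J_avR(theta) = sum_s d(s) sum_a pi(s, a) R(s, a)
J_avR = np.sum(d[:, None] * pi * R)
print(J_avR)                         # 0.6*0.7*1.0 + 0.4*(0.2*0.5 + 0.8*2.0) = 1.1
```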
Policy optimization
Policy-based reinforcement learning is an optimization problem
Find \theta that maximises J(\theta)
Some approaches do not use the gradient (a minimal hill-climbing sketch follows this list)
- Hill climbing
- Simplex / amoeba / Nelder Mead
- Genetic algorithms
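As a contrast to the gradient-based loop sketched earlier, here is a minimal gradient-free hill-climbing sketch on the policy parameters, again with a toy stand-in for J(\theta).

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate_policy(theta):
    """Placeholder for J(theta); toy objective so the sketch runs on its own."""
    return -np.sum((theta - 1.0) ** 2)

# Hill climbing: perturb theta at random, keep the perturbation only if J improves.
theta = np.zeros(3)
best = evaluate_policy(theta)
for _ in range(2000):
    candidate = theta + 0.1 * rng.normal(size=theta.shape)
    value = evaluate_policy(candidate)
    if value > best:
        theta, best = candidate, value
print(theta, best)   # near the maximizer, found without any gradient information
```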
Greater efficiency often possible using gradient
- Gradient descent
- Conjugate gradient
- Quasi-Newton
We focus on gradient-based methods, for which many extensions are possible,
and on methods that exploit the sequential structure of the RL problem
Policy gradient
Let J(\theta) be any policy objective function
Policy gradient algorithms search for a local maximum of J(\theta) by ascending the gradient of the policy objective with respect to \theta
\Delta \theta = \alpha \nabla_\theta J(\theta)
Where \nabla_\theta J(\theta) is the policy gradient.
\nabla_\theta J(\theta) = \begin{pmatrix}
\frac{\partial J(\theta)}{\partial \theta_1} \\
\frac{\partial J(\theta)}{\partial \theta_2} \\
\vdots \\
\frac{\partial J(\theta)}{\partial \theta_n}
\end{pmatrix}
and \alpha is the step-size parameter.
Policy gradient methods
The main method we will introduce is the Monte-Carlo policy gradient (REINFORCE).
Score Function
Assume the policy \pi_\theta is differentiable and non-zero, and that we know the gradient \nabla_\theta \pi_\theta(s,a) for all s\in S and a\in A.
We can compute the policy gradient analytically
The likelihood-ratio trick exploits the identity:
\begin{aligned}
\nabla_\theta \pi_\theta(s,a) &= \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} \\
&= \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)
\end{aligned}
The Score Function is:
\nabla_\theta \log \pi_\theta(s,a)
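The identity (and the score function it defines) is easy to check numerically. The sketch below uses a two-action softmax policy with one parameter per action; the parameter values are arbitrary.

```python
import numpy as np

def pi(theta):
    """Softmax policy over two actions, one preference parameter per action (illustrative)."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def score(theta, a):
    """Score function for this softmax: grad_theta log pi(a) = one_hot(a) - pi(theta)."""
    one_hot = np.zeros_like(theta)
    one_hot[a] = 1.0
    return one_hot - pi(theta)

theta, a, eps = np.array([0.3, -0.7]), 0, 1e-6

# Check: grad_theta pi(a) should equal pi(a) * grad_theta log pi(a).
numeric = np.array([(pi(theta + eps * e)[a] - pi(theta - eps * e)[a]) / (2 * eps)
                    for e in np.eye(2)])
analytic = pi(theta)[a] * score(theta, a)
print(numeric, analytic)   # the two agree up to finite-difference error
```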
Example
Take the softmax policy as an example:
Weight actions using a linear combination of features \phi(s,a)^\top\theta:
The probability of an action is proportional to its exponentiated weight:
\pi_\theta(s,a) \propto \exp(\phi(s,a)^\top\theta)
The score function is
\begin{aligned}
\nabla_\theta \log\left[\frac{\exp(\phi(s,a)^\top\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^\top\theta)}\right] &= \nabla_\theta\left(\phi(s,a)^\top\theta - \log \sum_{a'\in A}\exp(\phi(s,a')^\top\theta)\right) \\
&= \phi(s,a) - \frac{\sum_{a'\in A}\exp(\phi(s,a')^\top\theta)\,\phi(s,a')}{\sum_{a'\in A}\exp(\phi(s,a')^\top\theta)} \\
&= \phi(s,a) - \sum_{a'\in A} \pi_\theta(s,a')\, \phi(s,a') \\
&= \phi(s,a) - \mathbb{E}_{a'\sim \pi_\theta(s,\cdot)}[\phi(s,a')]
\end{aligned}
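The final line, \phi(s,a) - \mathbb{E}_{a'\sim\pi_\theta}[\phi(s,a')], translates directly into code. The feature matrix and \theta below are arbitrary illustrative values.

```python
import numpy as np

def softmax_policy(phi_s, theta):
    """pi_theta(s, .) for preferences phi(s,a)^T theta; phi_s has shape (n_actions, n_features)."""
    prefs = phi_s @ theta
    p = np.exp(prefs - prefs.max())        # subtract max for numerical stability
    return p / p.sum()

def softmax_score(phi_s, theta, a):
    """Score function: phi(s,a) - E_{a' ~ pi_theta}[phi(s,a')]."""
    probs = softmax_policy(phi_s, theta)
    return phi_s[a] - probs @ phi_s

# Illustrative numbers: 3 actions, 2 features.
phi_s = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
theta = np.array([0.5, -0.2])
print(softmax_score(phi_s, theta, a=0))
```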
In continuous action spaces, a Gaussian policy is natural
The mean is a linear combination of state features: \mu(s) = \phi(s)^\top\theta
The variance may be fixed at \sigma^2, or it can also be parametrized
The policy is Gaussian: a \sim \mathcal{N}(\mu(s), \sigma^2)
The score function is
\nabla_\theta \log \pi_\theta(s,a) = \frac{(a - \mu(s)) \phi(s)}{\sigma^2}
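A short sketch of the Gaussian score, with a finite-difference check against the log-density; the feature and parameter values are arbitrary.

```python
import numpy as np

sigma = 0.5
phi_s = np.array([1.0, 0.5, -0.3])     # illustrative state features phi(s)
theta = np.array([0.2, -0.1, 0.4])
a = 0.7                                 # a sampled continuous action

def gaussian_score(theta):
    """Score of a ~ N(phi(s)^T theta, sigma^2) w.r.t. theta: (a - mu(s)) phi(s) / sigma^2."""
    mu = phi_s @ theta
    return (a - mu) * phi_s / sigma**2

def log_pdf(theta):
    """log N(a; mu(s), sigma^2), used only to verify the score numerically."""
    mu = phi_s @ theta
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

eps = 1e-6
numeric = np.array([(log_pdf(theta + eps * e) - log_pdf(theta - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(gaussian_score(theta), numeric)   # analytic and numerical gradients agree
```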
Policy Gradient Theorem
For any differentiable policy \pi_\theta(s,a),
for any of the policy objective functions J = J_1, J_{avR}, or \frac{1}{1-\gamma}J_{avV}, the policy gradient is:
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)]
Proof
To simplify the proof, write \phi(s)=\sum_{a\in A} \nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) (a shorthand used only in this proof, not the feature vector from earlier) and take \gamma = 1.
\begin{aligned}
\nabla_\theta V^{\pi}(s)&=\nabla_\theta \left(\sum_{a\in A} \pi_\theta(a|s)Q^{\pi}(s,a)\right) \\
&=\sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) + \pi_\theta(a|s) \nabla_\theta Q^{\pi}(s,a)\right) \text{by the product rule}\\
&=\sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) + \pi_\theta(a|s) \nabla_\theta \sum_{s',r\in S\times R} P(s',r|s,a) \left(r+V^{\pi}(s')\right)\right)\text{rewrite the Q-function as sum of expected rewards from actions} \\
&=\sum_{a\in A} \left(\nabla_\theta \pi_\theta(a|s)Q^{\pi}(s,a) + \pi_\theta(a|s) \sum_{s',r\in S\times R} P(s',r|s,a) \nabla_\theta \left(r+V^{\pi}(s')\right)\right) \\
&=\phi(s)+\sum_{a\in A} \left(\pi_\theta(a|s) \sum_{s'\in S} P(s'|s,a) \nabla_\theta V^{\pi}(s')\right) \\
&=\phi(s)+\sum_{s'\in S} \sum_{a\in A} \pi_\theta(a|s) P(s'|s,a) \nabla_\theta V^{\pi}(s') \\
&=\phi(s)+\sum_{s'\in S} \rho(s\to s',1)\nabla_\theta V^{\pi}(s') \text{ notice the recurrence relation}\\
&=\phi(s)+\sum_{s'\in S} \rho(s\to s',1)\left[\phi(s')+\sum_{s''\in S} \rho(s'\to s'',1)\nabla_\theta V^{\pi}(s'')\right] \\
&=\phi(s)+\left[\sum_{s'\in S} \rho(s\to s',1)\phi(s')\right]+\left[\sum_{s''\in S} \rho(s\to s'',2)\nabla_\theta V^{\pi}(s'')\right] \\
&=\cdots\\
&=\sum_{x\in S}\sum_{k=0}^\infty \rho(s\to x,k)\phi(x)
\end{aligned}
Here \rho(s\to x,k) is the probability of going from state s to state x in k steps under \pi_\theta, defined by \rho(s\to s',1)=\sum_{a\in A} \pi_\theta(a|s) P(s'|s,a) and \rho(s\to x,k+1)=\sum_{s'\in S}\rho(s\to s',k)\,\rho(s'\to x,1), with \rho(s\to s,0)=1.
Let \eta(s)=\sum_{k=0}^\infty \rho(s_0\to s,k) be the expected number of visits to state s when starting from state s_0 and following \pi_\theta.
Note that \sum_{s\in S} \eta(s) is a constant that depends only on the initial state s_0 and the policy \pi_\theta.
So d^{\pi_\theta}(s)=\frac{\eta(s)}{\sum_{s'\in S} \eta(s')} is the normalized state visitation distribution under \pi_\theta (the distribution d^{\pi_\theta} used in the objectives above).
\begin{aligned}
\nabla_\theta J(\theta)&=\nabla_\theta V^{\pi}(s_0)\\
&=\sum_{s\in S} \sum_{k=0}^\infty \rho(s_0\to s,k)\phi(s)\\
&=\sum_{s\in S} \eta(s)\phi(s)\\
&=\left(\sum_{s'\in S} \eta(s')\right)\sum_{s\in S} \frac{\eta(s)}{\sum_{s'\in S} \eta(s')}\phi(s)\\
&\propto \sum_{s\in S} \frac{\eta(s)}{\sum_{s'\in S} \eta(s')}\phi(s)\\
&=\sum_{s\in S} d^{\pi_\theta}(s)\sum_{a\in A} \nabla_\theta \pi_\theta(a|s)Q^{\pi_\theta}(s,a)\\
&=\sum_{s\in S} d^{\pi_\theta}(s)\sum_{a\in A} \pi_\theta(a|s)\,\nabla_\theta \log \pi_\theta(a|s)\,Q^{\pi_\theta}(s,a)\\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q^{\pi_\theta}(s,a)]
\end{aligned}
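The theorem can be sanity-checked numerically on a small made-up MDP. The sketch below uses the discounted start-state objective J_1(\theta)=V^{\pi_\theta}(s_0) with a discount \gamma (the proof above took \gamma=1 for brevity) and a tabular softmax policy; all transition probabilities, rewards, and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, s0 = 2, 2, 0.9, 0

# A small random MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards.
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.normal(size=(nS, nA))

def policy(theta):
    """Tabular softmax policy: pi(a|s) = softmax(theta[s, :])."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def value_terms(theta):
    pi = policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)                  # state-to-state kernel under pi
    r_pi = np.sum(pi * R, axis=1)                          # expected one-step reward under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)   # V^pi
    Q = R + gamma * P @ V                                  # Q^pi(s, a)
    d = np.linalg.solve((np.eye(nS) - gamma * P_pi).T, np.eye(nS)[s0])  # discounted visitation from s0
    return pi, V, Q, d

def J(theta):
    return value_terms(theta)[1][s0]                       # J_1(theta) = V^pi(s0)

theta = rng.normal(size=(nS, nA))
pi, V, Q, d = value_terms(theta)

# Policy gradient theorem: grad J = sum_s d(s) sum_a pi(a|s) grad log pi(a|s) Q(s, a).
# For the tabular softmax, grad_{theta[s,b]} log pi(a|s) = 1[a=b] - pi(b|s), which gives:
analytic = d[:, None] * pi * (Q - (pi * Q).sum(axis=1, keepdims=True))

# Finite-difference gradient of J for comparison.
numeric = np.zeros_like(theta)
eps = 1e-5
for s in range(nS):
    for b in range(nA):
        e = np.zeros_like(theta)
        e[s, b] = eps
        numeric[s, b] = (J(theta + e) - J(theta - e)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))                  # tiny: the two gradients agree
```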
Monte-Carlo Policy Gradient
We can use the score function to compute the policy gradient.
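Replacing Q^{\pi_\theta}(s,a) in the theorem with the sampled return G_t gives the Monte-Carlo update \Delta\theta = \alpha\, \nabla_\theta \log\pi_\theta(s_t,a_t)\, G_t (REINFORCE). Below is a minimal sketch with a softmax policy over linear features; the tiny corridor environment, feature map, and hyper-parameters are illustrative stand-ins, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class CorridorEnv:
    """Stub episodic environment: walk from cell 0 to cell 3; -1 reward per step."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):                       # 0 = move left, 1 = move right
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos == 3
        return self.pos, (0.0 if done else -1.0), done

def phi(state, action):
    """One-hot features over (state, action) pairs (illustrative)."""
    v = np.zeros(4 * 2)
    v[state * 2 + action] = 1.0
    return v

def softmax(prefs):
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def run_episode(env, theta, n_actions=2):
    """Roll out one episode with the softmax policy pi_theta and record (s, a, r) triples."""
    trajectory, state, done = [], env.reset(), False
    while not done:
        probs = softmax(np.array([phi(state, a) @ theta for a in range(n_actions)]))
        action = rng.choice(n_actions, p=probs)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

def reinforce_update(theta, trajectory, alpha=0.01, gamma=0.99, n_actions=2):
    """For each visited step t: theta += alpha * G_t * grad log pi_theta(s_t, a_t)."""
    G = 0.0
    for state, action, reward in reversed(trajectory):
        G = reward + gamma * G                              # return from step t onward
        feats = np.array([phi(state, a) for a in range(n_actions)])
        probs = softmax(feats @ theta)
        score = feats[action] - probs @ feats               # softmax score function from above
        theta = theta + alpha * G * score
    return theta

env, theta = CorridorEnv(), np.zeros(8)
for _ in range(500):
    theta = reinforce_update(theta, run_episode(env, theta))
print(len(run_episode(env, theta)))   # typically close to the optimal 3 steps
```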
