# CSE510 Deep Reinforcement Learning (Lecture 12)

## Policy Gradient Theorem

For any differentiable policy $\pi_\theta(s,a)$ and for any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is

$$
\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q^{\pi_\theta}(s,a)\right]
$$
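
As an illustration (my own sketch, not from the slides), this expectation can be estimated from on-policy samples. For a tabular softmax policy $\pi_\theta(s,a)\propto\exp(\theta[s,a])$ the score $\nabla_\theta \log \pi_\theta(s,a)$ has a simple closed form, and any estimate of $Q^{\pi_\theta}(s,a)$ (here a hypothetical `Q_hat`, e.g. a Monte-Carlo return) can be plugged in:

```python
# Sketch of the score-function estimator for a tabular softmax policy (illustrative names).
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities pi_theta(s, .) given a table of logits theta[s, a]."""
    logits = theta[s] - theta[s].max()       # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """grad_theta log pi_theta(s, a): nonzero only in row s, equal to e_a - pi_theta(s, .)."""
    g = np.zeros_like(theta)
    g[s] = -softmax_policy(theta, s)
    g[s, a] += 1.0
    return g

def policy_gradient_estimate(theta, samples, Q_hat):
    """Sample average of grad log pi(s, a) * Q_hat(s, a) over on-policy (s, a) pairs."""
    return np.mean([grad_log_pi(theta, s, a) * Q_hat(s, a) for s, a in samples], axis=0)
```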

## Policy Gradient Methods

Advantages and disadvantages of policy-based RL:

Advantages:

- Better convergence properties
- Effective in high-dimensional or continuous action spaces
- Can learn stochastic policies

Disadvantages:

- Typically converge to a local rather than a global optimum
- Evaluating a policy is typically inefficient and high-variance

### Actor-Critic Methods

#### Q Actor-Critic

Reducing Variance Using a Critic

Monte-Carlo policy gradient still has high variance.

We use a critic to estimate the action-value function $Q_w(s,a)\approx Q^{\pi_\theta}(s,a)$.

Actor-critic algorithms maintain two sets of parameters:

- Critic: updates action-value function parameters $w$
- Actor: updates policy parameters $\theta$, in the direction suggested by the critic

Actor-critic algorithms follow an approximate policy gradient:

$$
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q_w(s,a)\right]
$$

$$
\Delta \theta = \alpha \nabla_\theta \log \pi_\theta(s,a)Q_w(s,a)
$$

Action-Value Actor-Critic:

- Simple actor-critic algorithm based on an action-value critic
- Uses linear value function approximation $Q_w(s,a)=\phi(s,a)^\top w$
- Critic: updates $w$ by linear $TD(0)$
- Actor: updates $\theta$ by policy gradient

```python
# Sketch of one-step QAC: helpers sample_action, sample_reward, sample_transition,
# grad_log_pi, and phi are assumed, with Q_w(s, a, w) = phi(s, a) @ w.
def q_actor_critic(s, theta, w, alpha, beta, gamma, num_steps):
    a = sample_action(s, theta)                        # a ~ pi_theta(s, .)
    for _ in range(num_steps):
        r = sample_reward(s, a)                        # sample reward r = R(s, a)
        s_next = sample_transition(s, a)               # sample transition s' ~ P(. | s, a)
        a_next = sample_action(s_next, theta)          # a' ~ pi_theta(s', .)
        delta = r + gamma * Q_w(s_next, a_next, w) - Q_w(s, a, w)        # TD error
        theta = theta + alpha * grad_log_pi(theta, s, a) * Q_w(s, a, w)  # actor update
        w = w + beta * delta * phi(s, a)               # critic: linear TD(0)
        s, a = s_next, a_next
    return theta, w
```

#### Advantage Actor-Critic

Reducing variance using a baseline:

- We subtract a baseline function $B(s)$ from the policy gradient
- This can reduce the variance without changing the expectation

$$
\begin{aligned}
\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log \pi_\theta(s,a)B(s)\right]&=\sum_{s\in S}d^{\pi_\theta}(s)\sum_{a\in A}\pi_\theta(s,a)\nabla_\theta\log\pi_\theta(s,a)B(s)\\
&=\sum_{s\in S}d^{\pi_\theta}(s)\sum_{a\in A}\nabla_{\theta}\pi_\theta(s,a)B(s)\\
&=\sum_{s\in S}d^{\pi_\theta}(s)B(s)\nabla_\theta\sum_{a\in A}\pi_\theta(s,a)\\
&=\sum_{s\in S}d^{\pi_\theta}(s)B(s)\nabla_\theta 1\\
&=0
\end{aligned}
$$
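
As a quick numerical sanity check (my own illustration, reusing the tabular softmax parameterisation from the earlier sketch), the expected score in any single state is the zero vector, so scaling it by an arbitrary baseline value $B(s)$ leaves the expectation at zero:

```python
# Verify E_{a ~ pi}[grad_theta log pi(s, a) * B(s)] = 0 for one state of a softmax policy.
import numpy as np

logits = np.array([0.3, -1.2, 2.0])     # arbitrary logits theta[s] for a single state
pi = np.exp(logits - logits.max())
pi /= pi.sum()
B = 7.5                                 # arbitrary baseline value B(s)

# For softmax logits, grad log pi(a) = e_a - pi, so the expectation is
# sum_a pi(a) * (e_a - pi) * B = (pi - pi) * B = 0.
expected = sum(pi[a] * (np.eye(3)[a] - pi) * B for a in range(3))
print(expected)                         # ~[0. 0. 0.] up to floating-point error
```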

A good baseline is the state value function $B(s)=V^{\pi_\theta}(s)$.

So we can rewrite the policy gradient using the advantage function $A^{\pi_\theta}(s,a)=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)$:

$$
\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a) A^{\pi_\theta}(s,a)\right]
$$

##### Estimating the Advantage Function

**Method 1:** direct estimation

> May increase the variance

The advantage function can significantly reduce the variance of the policy gradient, so the critic should really estimate the advantage function, for example by estimating both $V^{\pi_\theta}(s)$ and $Q^{\pi_\theta}(s,a)$.

Using two function approximators and two parameter vectors,

$$
\begin{aligned}
V_v(s)&\approx V^{\pi_\theta}(s)\\
Q_w(s,a)&\approx Q^{\pi_\theta}(s,a)\\
A(s,a)&=Q_w(s,a)-V_v(s)
\end{aligned}
$$

Both value functions are then updated by, e.g., TD learning.
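
A minimal sketch of this two-approximator setup, assuming simple one-hot feature maps `phi(s, a)` and `psi(s)` (the feature choice and sizes are illustrative, not from the lecture):

```python
import numpy as np

n_states, n_actions = 5, 3              # illustrative sizes

def phi(s, a):
    """One-hot (s, a) features for the action-value critic Q_w."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def psi(s):
    """One-hot state features for the state-value critic V_v."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def advantage(w, v, s, a):
    """A(s, a) = Q_w(s, a) - V_v(s) with two linear function approximators."""
    return phi(s, a) @ w - psi(s) @ v
```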

**Method 2:** using the TD error

> We can prove that the TD error is an unbiased estimate of the advantage function

For the true value function $V^{\pi_\theta}(s)$, the TD error

$$
\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)
$$

is an unbiased estimate of the advantage function:

$$
\begin{aligned}
\mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta}\mid s,a]&=\mathbb{E}_{\pi_\theta}[r + \gamma V^{\pi_\theta}(s') \mid s,a]-V^{\pi_\theta}(s)\\
&=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)\\
&=A^{\pi_\theta}(s,a)
\end{aligned}
$$

So we can use the TD error to compute the policy gradient:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) \delta^{\pi_\theta}]
$$

In practice, we can use an approximate TD error $\delta_v=r+\gamma V_v(s')-V_v(s)$ to compute the policy gradient.
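
A minimal sketch of a single TD actor-critic update built on this approximate TD error (reusing the hypothetical `grad_log_pi` and `psi` helpers from the earlier sketches; the step sizes `alpha` and `beta` are assumptions):

```python
def td_actor_critic_step(theta, v, s, a, r, s_next, alpha, beta, gamma):
    """One on-policy update: critic is TD(0) on V_v, actor is weighted by the TD error."""
    delta = r + gamma * (psi(s_next) @ v) - (psi(s) @ v)      # approximate TD error delta_v
    theta = theta + alpha * delta * grad_log_pi(theta, s, a)  # actor: score * TD error
    v = v + beta * delta * psi(s)                             # critic: linear TD(0) on V_v
    return theta, v
```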

### Summary of Policy Gradient Algorithms

The policy gradient has many equivalent forms:

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, v_t] && \text{REINFORCE} \\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)] && \text{Q Actor-Critic} \\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, A^{\pi_\theta}(s,a)] && \text{Advantage Actor-Critic} \\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, \delta^{\pi_\theta}] && \text{TD Actor-Critic}
\end{aligned}
$$

Each form leads to a stochastic gradient ascent algorithm.

The critic uses policy evaluation to estimate $Q^\pi(s,a)$, $A^\pi(s,a)$, or $V^\pi(s)$.

## Compatible Function Approximation

If the following two conditions are satisfied:

1. The value function approximation is compatible with the policy:

   $$
   \nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(s,a)
   $$

2. The value function parameters $w$ minimize the mean-squared error:

   $$
   \epsilon = \mathbb{E}_{\pi_\theta}[(Q^{\pi_\theta}(s,a)-Q_w(s,a))^2]
   $$

   Note that $\epsilon$ need not be zero; it only needs to be minimized.

Then the policy gradient is exact:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q_w(s,a)]
$$

Remember:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q^{\pi_\theta}(s,a)]
$$
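
A brief justification, sketched here for completeness (the standard argument, not spelled out in the notes): at a minimum of $\epsilon$ the gradient with respect to $w$ vanishes, and condition 1 lets us replace $\nabla_w Q_w$ by the score, so the approximate and exact gradients coincide.

$$
\begin{aligned}
\nabla_w \epsilon = 0 \;&\Rightarrow\; \mathbb{E}_{\pi_\theta}\left[(Q^{\pi_\theta}(s,a)-Q_w(s,a))\,\nabla_w Q_w(s,a)\right]=0\\
&\Rightarrow\; \mathbb{E}_{\pi_\theta}\left[(Q^{\pi_\theta}(s,a)-Q_w(s,a))\,\nabla_\theta \log\pi_\theta(s,a)\right]=0\\
&\Rightarrow\; \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log\pi_\theta(s,a)\,Q_w(s,a)\right]=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log\pi_\theta(s,a)\,Q^{\pi_\theta}(s,a)\right]=\nabla_\theta J(\theta)
\end{aligned}
$$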

### Challenges with Policy Gradient Methods

- Data inefficiency
  - On-policy method: for each new policy, we need to generate a completely new trajectory
  - The data is thrown out after just one gradient update
  - As complex neural networks need many updates, this makes the training process very slow
- Unstable updates: the step size is very important
  - If the step size is too large:
    - Large step -> bad policy
    - The next batch is generated from the current bad policy -> collect bad samples
    - Bad samples -> worse policy (compare to supervised learning, where the correct labels and data in the following batches may correct it)
  - If the step size is too small: the learning process is slow