# CSE510 Deep Reinforcement Learning (Lecture 12)
## Policy Gradient Theorem
For any differentiable policy $\pi_\theta(s,a)$, and for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is
$$
\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q^{\pi_\theta}(s,a)\right]
$$
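The expectation form follows from the likelihood-ratio identity $\nabla_\theta \pi_\theta(s,a) = \pi_\theta(s,a)\nabla_\theta \log \pi_\theta(s,a)$; a sketch of that final step (the full theorem also shows that the gradient of the state distribution $d^{\pi_\theta}$ does not appear):
$$
\begin{aligned}
\nabla_\theta J(\theta) &= \sum_{s\in S} d^{\pi_\theta}(s)\sum_{a\in A}\nabla_\theta \pi_\theta(s,a)\,Q^{\pi_\theta}(s,a)\\
&= \sum_{s\in S} d^{\pi_\theta}(s)\sum_{a\in A}\pi_\theta(s,a)\,\nabla_\theta \log \pi_\theta(s,a)\,Q^{\pi_\theta}(s,a)\\
&= \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\,Q^{\pi_\theta}(s,a)\right]
\end{aligned}
$$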
## Policy Gradient Methods
Advantages of Policy-Based RL
Advantages:
- Better convergence properties
- Effective in high-dimensional or continuous action spaces
- Can learn stochastic policies
Disadvantages:
- Typically converge to a local rather than a global optimum
- Evaluating a policy is typically inefficient and has high variance
### Actor-Critic Methods
#### Q Actor-Critic
Reducing Variance Using a Critic
Monte-Carlo Policy Gradient still has high variance.
We use a critic to estimate the action-value function $Q_w(s,a)\approx Q^{\pi_\theta}(s,a)$.
Actor-critic algorithms maintain two sets of parameters:
Critic: updates action-value function parameters $w$
Actor: updates policy parameters $\theta$, in the direction suggested by the critic.
Actor-critic algorithms follow an approximate policy gradient:
$$
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q_w(s,a)\right]
$$
$$
\Delta \theta = \alpha \nabla_\theta \log \pi_\theta(s,a)Q_w(s,a)
$$
Action-Value Actor-Critic
- Simple actor-critic algorithm based on action-value critic
- Using linear value function approximation $Q_w(s,a)=\phi(s,a)^\top w$
Critic: updates $w$ by linear $TD(0)$
Actor: updates $\theta$ by policy gradient
```python
def q_actor_critic(s, theta, w, alpha, beta, gamma, num_steps):
    # Sample the initial action from the current policy pi_theta
    a = sample_action(s, theta)
    for _ in range(num_steps):
        # Sample reward r ~ R(s, a) and next state s' ~ P(.|s, a)
        r = sample_reward(s, a)
        s_next = sample_transition(s, a)
        # Sample the next action from the current policy
        a_next = sample_action(s_next, theta)
        # TD(0) error for the action-value critic Q_w(s, a) = phi(s, a)^T w
        delta = r + gamma * Q_w(s_next, a_next, w) - Q_w(s, a, w)
        # Actor: policy-gradient step in the direction suggested by the critic
        theta = theta + alpha * grad_log_pi(s, a, theta) * Q_w(s, a, w)
        # Critic: linear TD(0) update
        w = w + beta * delta * phi(s, a)
        s, a = s_next, a_next
    return theta, w
```
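The helper functions above are placeholders. One minimal way to concretize the policy- and critic-related ones, assuming one-hot state-action features and a softmax (Gibbs) policy built on the same features (all names and sizes here are illustrative; `sample_reward` and `sample_transition` depend on the environment and are left abstract):
```python
import numpy as np

NUM_STATES, NUM_ACTIONS = 10, 2   # assumed sizes for a small discrete problem

def phi(s, a):
    # One-hot state-action features, so Q_w(s,a) = phi(s,a)^T w is a table lookup
    x = np.zeros(NUM_STATES * NUM_ACTIONS)
    x[s * NUM_ACTIONS + a] = 1.0
    return x

def Q_w(s, a, w):
    return phi(s, a) @ w

def action_probs(s, theta):
    # Softmax policy: pi_theta(a|s) proportional to exp(phi(s,a)^T theta)
    logits = np.array([phi(s, b) @ theta for b in range(NUM_ACTIONS)])
    p = np.exp(logits - logits.max())
    return p / p.sum()

def sample_action(s, theta, rng=np.random.default_rng()):
    return rng.choice(NUM_ACTIONS, p=action_probs(s, theta))

def grad_log_pi(s, a, theta):
    # Score function of the softmax policy: phi(s,a) - E_{b~pi}[phi(s,b)]
    p = action_probs(s, theta)
    return phi(s, a) - sum(p[b] * phi(s, b) for b in range(NUM_ACTIONS))
```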
#### Advantage Actor-Critic
Reducing variance using a baseline
- We subtract a baseline function $B(s)$ from the policy gradient
- This can reduce the variance without changing expectation
$$
\begin{aligned}
\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log \pi_\theta(s,a)B(s)\right]&=\sum_{s\in S}d^{\pi_\theta}(s)\sum_{a\in A}\nabla_{\theta}\pi_\theta(s,a)B(s)\\
&=\sum_{s\in S}d^{\pi_\theta}(s)B(s)\nabla_\theta\sum_{a\in A}\pi_\theta(s,a)\\
&=0
\end{aligned}
$$
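A quick numerical check of this identity (a standalone sketch: the softmax policy, random features, and baseline value below are arbitrary illustrative choices, not from the lecture):
```python
import numpy as np

rng = np.random.default_rng(0)
num_actions, d = 4, 6
features = rng.normal(size=(num_actions, d))   # phi(s, a) for one fixed state s
theta = rng.normal(size=d)

# Softmax policy: pi_theta(a|s) proportional to exp(phi(s,a)^T theta)
logits = features @ theta
probs = np.exp(logits - logits.max())
probs /= probs.sum()

def score(a):
    # Score function of the softmax policy: grad_theta log pi_theta(s, a)
    return features[a] - probs @ features

B = 3.7   # an arbitrary baseline value B(s) for this state
grads = [score(rng.choice(num_actions, p=probs)) * B for _ in range(100_000)]
print(np.linalg.norm(np.mean(grads, axis=0)))  # close to 0: the baseline adds no bias
```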
A good baseline is the state value function $B(s)=V^{\pi_\theta}(s)$
So we can rewrite the policy gradient using the advantage function $A^{\pi_\theta}(s,a)=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)$
$$
\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a) A^{\pi_\theta}(s,a)\right]
$$
##### Estimating the Advantage function
**Method 1:** direct estimation
> May increase the variance
The advantage function can significantly reduce the variance of the policy gradient,
so the critic should really estimate the advantage function,
for example by estimating both $V^{\pi_\theta}(s)$ and $Q^{\pi_\theta}(s,a)$
using two function approximators and two parameter vectors:
$$
V_v(s)\approx V^{\pi_\theta}(s)\\
Q_w(s,a)\approx Q^{\pi_\theta}(s,a)\\
A(s,a)=Q_w(s,a)-V_v(s)
$$
And updating both value functions by e.g. TD learning
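A minimal sketch of this two-approximator setup with linear critics, passing hypothetical feature maps `phi(s, a)` and `psi(s)` in explicitly (a sketch under a linear / TD(0) assumption, not a prescribed implementation from the lecture):
```python
def advantage(s, a, w, v, phi, psi):
    # A(s,a) ~ Q_w(s,a) - V_v(s), with Q_w(s,a) = phi(s,a)^T w and V_v(s) = psi(s)^T v
    return phi(s, a) @ w - psi(s) @ v

def td0_critic_step(s, a, r, s_next, a_next, w, v, beta, gamma, phi, psi):
    # Separate TD(0) errors and updates for the two value functions
    delta_q = r + gamma * phi(s_next, a_next) @ w - phi(s, a) @ w
    delta_v = r + gamma * psi(s_next) @ v - psi(s) @ v
    w = w + beta * delta_q * phi(s, a)
    v = v + beta * delta_v * psi(s)
    return w, v
```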
**Method 2:** using the TD error
> We can prove that TD error is an unbiased estimation of the advantage function
For the true value function $V^{\pi_\theta}(s)$, the TD error $\delta^{\pi_\theta}$
$$
\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)
$$
is an unbiased estimate of the advantage function
$$
\begin{aligned}
\mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta}| s,a]&=\mathbb{E}_{\pi_\theta}[r + \gamma V^{\pi_\theta}(s') |s,a]-V^{\pi_\theta}(s)\\
&=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)\\
&=A^{\pi_\theta}(s,a)
\end{aligned}
$$
So we can use the TD error to compute the policy gradient
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) \delta^{\pi_\theta}]
$$
In practice, we can use an approximate TD error $\delta_v=r+\gamma V_v(s')-V_v(s)$ to compute the policy gradient
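Putting this together gives a TD actor-critic loop, analogous to the Q actor-critic code above but with a state-value critic $V_v(s)=\psi(s)^\top v$ (a sketch; `sample_action`, `sample_reward`, `sample_transition`, `V_v`, `psi`, and `grad_log_pi` are assumed helpers in the same spirit as before):
```python
def td_actor_critic(s, theta, v, alpha, beta, gamma, num_steps):
    for _ in range(num_steps):
        a = sample_action(s, theta)
        r = sample_reward(s, a)
        s_next = sample_transition(s, a)
        # Approximate TD error: delta_v = r + gamma * V_v(s') - V_v(s)
        delta = r + gamma * V_v(s_next, v) - V_v(s, v)
        # Actor: the TD error stands in for the advantage A(s, a)
        theta = theta + alpha * grad_log_pi(s, a, theta) * delta
        # Critic: TD(0) update of the state-value parameters
        v = v + beta * delta * psi(s)
        s = s_next
    return theta, v
```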
### Summary of policy gradient algorithms
The policy gradient has many equivalent forms.
$$
\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) v_t] \text{ REINFORCE} \\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q_w(s,a)] \text{ Q Actor-Critic} \\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) A^{\pi_\theta}(s,a)] \text{ Advantage Actor-Critic} \\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) \delta^{\pi_\theta}] \text{ TD Actor-Critic}
\end{aligned}
$$
Each leads to a stochastic gradient ascent algorithm.
The critic uses policy evaluation to estimate $Q^\pi(s,a)$, $A^\pi(s,a)$, or $V^\pi(s)$.
## Compatible Function Approximation
If the following two conditions are satisfied:
1. The value function approximator is compatible with the policy
$$
\nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(s,a)
$$
2. Value function parameters $w$ minimize the MSE
$$
\epsilon = \mathbb{E}_{\pi_\theta}[(Q^{\pi_\theta}(s,a)-Q_w(s,a))^2]
$$
Note that $\epsilon$ need not be zero; it just needs to be minimized.
Then the policy gradient is exact
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q_w(s,a)]
$$
Remember:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q^{\pi_\theta}(s,a)]
$$
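Proof sketch of why these two conditions give the exact gradient: because $w$ minimizes $\epsilon$, we have $\nabla_w \epsilon = 0$, and condition 1 lets us swap $\nabla_w Q_w$ for the score function:
$$
\begin{aligned}
\nabla_w \epsilon &= -2\,\mathbb{E}_{\pi_\theta}\left[\left(Q^{\pi_\theta}(s,a)-Q_w(s,a)\right)\nabla_w Q_w(s,a)\right]=0\\
\Rightarrow\quad & \mathbb{E}_{\pi_\theta}\left[\left(Q^{\pi_\theta}(s,a)-Q_w(s,a)\right)\nabla_\theta \log \pi_\theta(s,a)\right]=0\\
\Rightarrow\quad & \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\,Q_w(s,a)\right]=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\,Q^{\pi_\theta}(s,a)\right]=\nabla_\theta J(\theta)
\end{aligned}
$$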
### Challenges with Policy Gradient Methods
- Data inefficiency
  - On-policy method: for each new policy, we need to generate a completely new trajectory
  - The data is thrown out after just one gradient update
  - As complex neural networks need many updates, this makes the training process very slow
- Unstable updates: the step size is very important
  - If the step size is too large:
    - Large step -> bad policy
    - The next batch is generated from the current bad policy -> bad samples are collected
    - Bad samples -> an even worse policy (compare to supervised learning, where the correct labels and data in the following batches may correct it)
  - If the step size is too small: the learning process is slow