CSE510 Deep Reinforcement Learning (Lecture 14)
Advanced Policy Gradient Methods
Trust Region Policy Optimization (TRPO)
"Recall" from last lecture
\max_{\pi'} \mathbb{E}_{s\sim d^{\pi},a\sim \pi} \left[\frac{\pi'(a|s)}{\pi(a|s)}A^{\pi}(s,a)\right]
such that
\mathbb{E}_{s\sim d^{\pi}} D_{KL}(\pi(\cdot|s)||\pi'(\cdot|s))\leq \delta
Unconstrained penalized objective:
d^*=\arg\max_{d} J(\theta+d)-\lambda(D_{KL}\left[\pi_\theta||\pi_{\theta+d}\right]-\delta)
\theta_{new}=\theta_{old}+d
First-order Taylor expansion for the objective and second-order for the KL:
\approx \arg\max_{d} J(\theta_{old})+\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}d-\frac{1}{2}\lambda(d^\top\nabla_\theta^2 D_{KL}\left[\pi_{\theta_{old}}||\pi_{\theta}\right]\mid_{\theta=\theta_{old}}d)+\lambda \delta
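Dropping the terms that do not depend on d (the constants J(\theta_{old}) and \lambda\delta) and writing F(\theta_{old})=\nabla_\theta^2 D_{KL}\left[\pi_{\theta_{old}}||\pi_{\theta}\right]\mid_{\theta=\theta_{old}} for the Hessian of the KL term (the Fisher information matrix of the policy), the update direction solves
d^*\approx\arg\max_{d} \nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}d-\frac{1}{2}\lambda\, d^\top F(\theta_{old})d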
If you are really interested, try to fill in the details of solving the KL-constrained problem.
Natural Gradient Descent
Setting the gradient with respect to d to zero:
\begin{aligned}
0&=\frac{\partial}{\partial d}\left(-\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}d+\frac{1}{2}\lambda\, d^\top F(\theta_{old})d\right)\\
&=-\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}+\lambda F(\theta_{old})d
\end{aligned}
d=\frac{1}{\lambda} F^{-1}(\theta_{old})\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}
The natural gradient is
\tilde{\nabla}J(\theta)=F^{-1}(\theta_{old})\nabla_\theta J(\theta)
and the natural gradient update is
\theta_{new}=\theta_{old}+\alpha F^{-1}(\theta_{old})\hat{g}
where \hat{g} is the sample estimate of \nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}.
D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\approx \frac{1}{2}(\theta-\theta_{old})^\top F(\theta_{old})(\theta-\theta_{old})
Choose the largest step size \alpha for which the quadratic approximation of the KL constraint is still satisfied, with g_N=\tilde{\nabla}J(\theta) the natural gradient:
\frac{1}{2}(\alpha g_N)^\top F(\alpha g_N)=\delta
\alpha=\sqrt{\frac{2\delta}{g_N^\top F g_N}}
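In practice F(\theta_{old}) is never formed or inverted explicitly. Below is a minimal NumPy sketch of how the natural gradient step is typically computed, assuming the caller supplies the estimated policy gradient g and a function fvp(v) that returns the Fisher-vector product F(\theta_{old})v (both names are illustrative, not from the lecture):

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()                      # residual g - F x (x starts at 0)
    p = r.copy()                      # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        step = rs_old / (p @ Fp)
        x += step * p
        r -= step * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def natural_gradient_step(g, fvp, delta=0.01):
    """Natural gradient g_N = F^{-1} g, scaled by alpha = sqrt(2*delta / (g_N^T F g_N))."""
    g_N = conjugate_gradient(fvp, g)      # g_N ~ F^{-1} g
    quad = g_N @ fvp(g_N)                 # g_N^T F g_N
    alpha = np.sqrt(2.0 * delta / (quad + 1e-8))
    return alpha * g_N                    # the step d added to theta_old
```

Here fvp(v) would typically be implemented with automatic differentiation as \nabla_\theta\left((\nabla_\theta D_{KL})^\top v\right), so the full Fisher matrix is never built.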
However, due to the quadratic approximation, the KL constraint may still be violated.
Line search
We do a line search for the best step size, making sure that we are (see the sketch after this list):
- Improving the objective
- Satisfying the KL constraint
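A minimal sketch of such a backtracking line search, assuming helper functions surrogate(theta) and kl_to_old(theta) that estimate the surrogate objective and the mean KL to the old policy from the collected batch (the names and the backtracking schedule are illustrative):

```python
def backtracking_line_search(theta_old, full_step, surrogate, kl_to_old,
                             delta=0.01, max_backtracks=10, shrink=0.5):
    """Shrink the proposed step until the surrogate improves and the KL constraint holds."""
    obj_old = surrogate(theta_old)
    for i in range(max_backtracks):
        theta_new = theta_old + (shrink ** i) * full_step
        improves = surrogate(theta_new) > obj_old        # the objective improves
        in_region = kl_to_old(theta_new) <= delta        # the KL constraint is satisfied
        if improves and in_region:
            return theta_new
    return theta_old  # reject the update if no acceptable step size is found
```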
TRPO = NPG + line search + monotonic improvement theorem
Summary of TRPO
Pros
- Proper learning step
- Monotonic improvement guarantee
Cons
- Poor scalability
- Second-order optimization: computing the Fisher information matrix and its inverse for the current policy at every update is expensive
- Not very sample efficient
- Requires a large batch of rollouts to approximate the objective and constraint accurately
Proximal Policy Optimization (PPO)
"Proximal Policy Optimization (PPO) performs comparably or better than state-of-the-art approaches while being much simpler to implement and tune." -- OpenAI
Idea:
- The KL constraint helps stabilize training. However, maybe it does not have to be a hard constraint:
- Does it matter if we break the constraint just a few times?
- What if we treat it as a "soft" constraint and add a proximal penalty term to the objective function?
PPO with Adaptive KL Penalty
\max_{\theta} \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta \hat{\mathbb{E}}_t\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]
Use an adaptive \beta value.
L^{KLPEN}(\theta)=\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta \hat{\mathbb{E}}_t\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]
Compute d=\hat{\mathbb{E}}_t\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]
- If d<d_{target}/1.5, \beta\gets \beta/2
- If d>d_{target}\times 1.5, \beta\gets \beta\times 2
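A minimal sketch of this adaptive rule; d_target (like the initial \beta) is a hyperparameter, and the factors 1.5 and 2 are the heuristic values used in the PPO paper:

```python
def update_beta(beta, d, d_target):
    """Adapt the KL-penalty coefficient after each policy update."""
    if d < d_target / 1.5:        # policy changed too little: weaken the penalty
        beta = beta / 2.0
    elif d > d_target * 1.5:      # policy changed too much: strengthen the penalty
        beta = beta * 2.0
    return beta
```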
PPO with Clipped Objective
\max_{\theta} \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]
r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}
- Here, r_t(\theta) measures how much the new policy changes the probability of taking action a_t in state s_t:
  - If r_t > 1: the action becomes more likely under the new policy.
  - If r_t < 1: the action becomes less likely.
- We'd like r_t A_t to increase if A_t > 0 (good actions become more probable) and decrease if A_t < 0.
- But if r_t changes too much, the update becomes unstable, just like in vanilla PG.
We limit r_t(\theta) to be in a range:
L^{CLIP}(\theta)=\hat{\mathbb{E}}_t\left[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)\right]
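A minimal PyTorch sketch of the clipped objective, assuming per-timestep tensors logp_new and logp_old (log-probabilities of the taken actions under the new and old policies) and adv (advantage estimates \hat{A}_t); all names are illustrative:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Negative clipped surrogate L^{CLIP}, returned as a loss to minimize."""
    ratio = torch.exp(logp_new - logp_old)                # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)    # clip(r_t, 1-eps, 1+eps)
    surrogate = torch.min(ratio * adv, clipped * adv)     # pessimistic (lower) bound
    return -surrogate.mean()
```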
Trust Region Policy Optimization (TRPO): don't move further than \delta in KL.
Proximal Policy Optimization (PPO): don't let r_t(\theta) drift further than \epsilon from 1.
PPO in Practice
L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta)-c_1L_t^{VF}(\theta)+c_2S[\pi_\theta](s_t)\right]
Here L_t^{CLIP}(\theta) is the surrogate objective function.
L_t^{VF}(\theta) is a squared-error loss for the "critic": (V_\theta(s_t)-V_t^{target})^2.
S[\pi_\theta](s_t) is an entropy bonus that ensures sufficient exploration by encouraging diversity of actions.
c_1 and c_2 are trade-off coefficients; in the paper, c_1=1 and c_2=0.01.
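A minimal sketch of the combined loss, reusing ppo_clip_loss from the previous block and assuming values and value_targets (critic predictions V_\theta(s_t) and targets V_t^{target}) and dist (the current policy's action distribution, e.g. a torch.distributions object); names are illustrative:

```python
def ppo_total_loss(logp_new, logp_old, adv, values, value_targets, dist,
                   eps=0.2, c1=1.0, c2=0.01):
    """Minimizing this corresponds to maximizing L^{CLIP} - c1*L^{VF} + c2*S[pi_theta]."""
    clip_loss = ppo_clip_loss(logp_new, logp_old, adv, eps)   # -L^{CLIP}
    value_loss = (values - value_targets).pow(2).mean()       # L^{VF}: squared-error critic loss
    entropy = dist.entropy().mean()                           # S[pi_theta]: entropy bonus
    return clip_loss + c1 * value_loss - c2 * entropy
```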
Summary for Policy Gradient Methods
Trust region policy optimization (TRPO)
- Optimization problem formulation
- Natural gradient ascent + monotonic improvement + line search
- But requires second-order optimization
Proximal policy optimization (PPO)
- Clipped objective
- Simple yet effective
Take-away:
- Proper step size is critical for policy gradient methods
- Sample efficiency can be improved by using importance sampling