CSE510 Deep Reinforcement Learning (Lecture 14)
Advanced Policy Gradient Methods
Trust Region Policy Optimization (TRPO)
"Recall" from last lecture
\max_{\pi'} \mathbb{E}_{s\sim d^{\pi},a\sim \pi} \left[\frac{\pi'(a|s)}{\pi(a|s)}A^{\pi}(s,a)\right]
such that
\mathbb{E}_{s\sim d^{\pi}} D_{KL}(\pi(\cdot|s)||\pi'(\cdot|s))\leq \delta
Unconstrained penalized objective:
d^*=\arg\max_{d} J(\theta+d)-\lambda(D_{KL}\left[\pi_\theta||\pi_{\theta+d}\right]-\delta)
\theta_{new}=\theta_{old}+d
First-order Taylor expansion for the objective and second-order for the KL:
\approx \arg\max_{d} J(\theta_{old})+\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}d-\frac{1}{2}\lambda(d^\top\nabla_\theta^2 D_{KL}\left[\pi_{\theta_{old}}||\pi_{\theta}\right]\mid_{\theta=\theta_{old}}d)+\lambda \delta
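Dropping the terms that do not depend on d (the constants J(\theta_{old}) and \lambda\delta) and writing F(\theta_{old})=\nabla_\theta^2 D_{KL}\left[\pi_{\theta_{old}}||\pi_{\theta}\right]\mid_{\theta=\theta_{old}} for the Hessian of the KL term (the Fisher information matrix of the policy), the update direction solves
d^*\approx\arg\max_{d} \nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}d-\frac{1}{2}\lambda\, d^\top F(\theta_{old})d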
If you are really interested, try to fill in the details of solving the KL-constrained problem.
Natural Gradient Descent
Setting the gradient with respect to d to zero:
\begin{aligned}
0&=\frac{\partial}{\partial d}\left(-\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}d+\frac{1}{2}\lambda\, d^\top F(\theta_{old})d\right)\\
&=-\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}+\lambda F(\theta_{old})d
\end{aligned}
d=\frac{1}{\lambda} F^{-1}(\theta_{old})\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}
The natural gradient is
\tilde{\nabla}J(\theta)=F^{-1}(\theta_{old})\nabla_\theta J(\theta)
and the natural gradient update is
\theta_{new}=\theta_{old}+\alpha F^{-1}(\theta_{old})\hat{g}
where \hat{g} is the sample estimate of \nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}.
D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\approx \frac{1}{2}(\theta-\theta_{old})^\top F(\theta_{old})(\theta-\theta_{old})
Choose the largest step size \alpha for which the quadratic approximation of the KL constraint is still satisfied, with g_N=\tilde{\nabla}J(\theta) the natural gradient:
\frac{1}{2}(\alpha g_N)^\top F(\alpha g_N)=\delta
\alpha=\sqrt{\frac{2\delta}{g_N^\top F g_N}}
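In practice F(\theta_{old}) is never formed or inverted explicitly. Below is a minimal NumPy sketch of how the natural gradient step is typically computed, assuming the caller supplies the estimated policy gradient g and a function fvp(v) that returns the Fisher-vector product F(\theta_{old})v (both names are illustrative, not from the lecture):

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()                      # residual g - F x (x starts at 0)
    p = r.copy()                      # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        step = rs_old / (p @ Fp)
        x += step * p
        r -= step * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def natural_gradient_step(g, fvp, delta=0.01):
    """Natural gradient g_N = F^{-1} g, scaled by alpha = sqrt(2*delta / (g_N^T F g_N))."""
    g_N = conjugate_gradient(fvp, g)      # g_N ~ F^{-1} g
    quad = g_N @ fvp(g_N)                 # g_N^T F g_N
    alpha = np.sqrt(2.0 * delta / (quad + 1e-8))
    return alpha * g_N                    # the step d added to theta_old
```

Here fvp(v) would typically be implemented with automatic differentiation as \nabla_\theta\left((\nabla_\theta D_{KL})^\top v\right), so the full Fisher matrix is never built.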
However, due to the quadratic approximation, the KL constraint may still be violated.
Line search
We do a line search for the best step size, making sure that we are (see the sketch after this list):
- Improving the objective
- Satisfying the KL constraint
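A minimal sketch of such a backtracking line search, assuming helper functions surrogate(theta) and kl_to_old(theta) that estimate the surrogate objective and the mean KL to the old policy from the collected batch (the names and the backtracking schedule are illustrative):

```python
def backtracking_line_search(theta_old, full_step, surrogate, kl_to_old,
                             delta=0.01, max_backtracks=10, shrink=0.5):
    """Shrink the proposed step until the surrogate improves and the KL constraint holds."""
    obj_old = surrogate(theta_old)
    for i in range(max_backtracks):
        theta_new = theta_old + (shrink ** i) * full_step
        improves = surrogate(theta_new) > obj_old        # the objective improves
        in_region = kl_to_old(theta_new) <= delta        # the KL constraint is satisfied
        if improves and in_region:
            return theta_new
    return theta_old  # reject the update if no acceptable step size is found
```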
TRPO = NPG + line search + monotonic improvement theorem
Summary of TRPO
Pros
- Proper learning step
- Monotonic improvement guarantee
Cons
- Poor scalability
- Second-order optimization: computing the Fisher information matrix and its inverse for the current policy at every update is expensive
- Not very sample efficient
- Requires a large batch of rollouts to approximate the objective and constraint accurately
Proximal Policy Optimization (PPO)
"Proximal Policy Optimization (PPO) performs comparably or better than state-of-the-art approaches while being much simpler to implement and tune." -- OpenAI
Idea:
- The KL constraint helps stabilize training. However, maybe it does not have to be a hard constraint:
- Does it matter if we break the constraint just a few times?
- What if we treat it as a "soft" constraint and add a proximal penalty term to the objective function?
PPO with Adaptive KL Penalty
\max_{\theta} \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta \hat{\mathbb{E}}_t\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]
Use an adaptive \beta value.
L^{KLPEN}(\theta)=\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta \hat{\mathbb{E}}_t\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]
Compute d=\hat{\mathbb{E}}_t\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]
- If d<d_{target}/1.5, \beta\gets \beta/2
- If d>d_{target}\times 1.5, \beta\gets \beta\times 2
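A minimal sketch of this adaptive rule; d_target (like the initial \beta) is a hyperparameter, and the factors 1.5 and 2 are the heuristic values used in the PPO paper:

```python
def update_beta(beta, d, d_target):
    """Adapt the KL-penalty coefficient after each policy update."""
    if d < d_target / 1.5:        # policy changed too little: weaken the penalty
        beta = beta / 2.0
    elif d > d_target * 1.5:      # policy changed too much: strengthen the penalty
        beta = beta * 2.0
    return beta
```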
PPO with Clipped Objective
\max_{\theta} \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]
r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}
- Here, r_t(\theta) measures how much the new policy changes the probability of taking action a_t in state s_t:
  - If r_t > 1: the action becomes more likely under the new policy.
  - If r_t < 1: the action becomes less likely.
- We'd like r_t A_t to increase if A_t > 0 (good actions become more probable) and decrease if A_t < 0.
- But if r_t changes too much, the update becomes unstable, just like in vanilla PG.
We limit r_t(\theta) to be in a range:
L^{CLIP}(\theta)=\hat{\mathbb{E}}_t\left[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)\right]
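A minimal PyTorch sketch of the clipped objective, assuming per-timestep tensors logp_new and logp_old (log-probabilities of the taken actions under the new and old policies) and adv (advantage estimates \hat{A}_t); all names are illustrative:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Negative clipped surrogate L^{CLIP}, returned as a loss to minimize."""
    ratio = torch.exp(logp_new - logp_old)                # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)    # clip(r_t, 1-eps, 1+eps)
    surrogate = torch.min(ratio * adv, clipped * adv)     # pessimistic (lower) bound
    return -surrogate.mean()
```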
Trust Region Policy Optimization (TRPO): don't move further than \delta in KL.
Proximal Policy Optimization (PPO): don't let r_t(\theta) drift further than \epsilon from 1.
PPO in Practice
L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta)-c_1L_t^{VF}(\theta)+c_2S[\pi_\theta](s_t)\right]
Here L_t^{CLIP}(\theta) is the surrogate objective function.
L_t^{VF}(\theta) is a squared-error loss for the "critic": (V_\theta(s_t)-V_t^{target})^2.
S[\pi_\theta](s_t) is an entropy bonus that ensures sufficient exploration by encouraging diversity of actions.
c_1 and c_2 are trade-off coefficients; in the paper, c_1=1 and c_2=0.01.
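A minimal sketch of the combined loss, reusing ppo_clip_loss from the previous block and assuming values and value_targets (critic predictions V_\theta(s_t) and targets V_t^{target}) and dist (the current policy's action distribution, e.g. a torch.distributions object); names are illustrative:

```python
def ppo_total_loss(logp_new, logp_old, adv, values, value_targets, dist,
                   eps=0.2, c1=1.0, c2=0.01):
    """Minimizing this corresponds to maximizing L^{CLIP} - c1*L^{VF} + c2*S[pi_theta]."""
    clip_loss = ppo_clip_loss(logp_new, logp_old, adv, eps)   # -L^{CLIP}
    value_loss = (values - value_targets).pow(2).mean()       # L^{VF}: squared-error critic loss
    entropy = dist.entropy().mean()                           # S[pi_theta]: entropy bonus
    return clip_loss + c1 * value_loss - c2 * entropy
```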
Summary for Policy Gradient Methods
Trust region policy optimization (TRPO)
- Optimization problem formulation
- Natural gradient ascent + monotonic improvement + line search
- But requires second-order optimization
Proximal policy optimization (PPO)
- Clipped objective
- Simple yet effective
Take-away:
- Proper step size is critical for policy gradient methods
- Sample efficiency can be improved by using importance sampling