This commit is contained in:
Trance-0
2025-10-14 11:19:51 -05:00
parent 4e8139856e
commit e425088318
3 changed files with 198 additions and 6 deletions

View File

@@ -1,8 +1,12 @@
# CSE510 Deep Reinforcement Learning (Lecture 13)
> [!CAUTION]
> Recap from last lecture
>
> This lecture was crammed into 90 minutes and was hard for me to follow, since I have little background in optimization and Lagrangian duality. These notes may contain numerous mistakes, and this checkpoint is a good place to restart learning from.
> For any differentiable policy $\pi_\theta(s,a)$, for any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$
>
> The policy gradient is
>
> $\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q^{\pi_\theta}(s,a)\right]$
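As a concrete (and purely illustrative) reading of this formula, the sketch below estimates the policy gradient for a small tabular softmax policy by averaging $\nabla_\theta \log \pi_\theta(s,a)\,\hat{Q}(s,a)$ over samples. The problem sizes, the helper names, and the use of sampled return estimates in place of $Q^{\pi_\theta}$ are assumptions, not part of the lecture.

```python
# Illustrative sketch only: Monte Carlo estimate of the policy gradient for a
# tabular softmax policy. The sizes and helper names are made up, and q_hat
# stands in for Q^{pi_theta}(s, a).
import numpy as np

n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))  # policy parameters, one row per state

def pi(s):
    """Softmax policy pi_theta(. | s)."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def grad_log_pi(s, a):
    """grad_theta log pi_theta(a | s) for the tabular softmax parameterization."""
    g = np.zeros_like(theta)
    g[s] = -pi(s)       # -pi_theta(.|s) for every action in state s
    g[s, a] += 1.0      # +1 for the action actually taken
    return g

def policy_gradient_estimate(samples):
    """samples: iterable of (s, a, q_hat) tuples collected under pi_theta."""
    return np.mean([grad_log_pi(s, a) * q_hat for s, a, q_hat in samples], axis=0)
```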
## Problems with policy gradient methods
@@ -20,7 +24,7 @@ Unstable update step size is very important
- Bad samples -> an even worse policy (unlike supervised learning, where correct labels and data in later batches can correct the model)
- If step size is too small: the learning process is slow
## Deriving the optimization objective function of Trust Region Policy Optimization (TRPO)
### Objective of Policy Gradient Methods
@@ -30,8 +34,7 @@ $$
J(\pi_\theta)=\mathbb{E}_{\tau\sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]
$$
Here $\tau$ is a trajectory generated by the policy $\pi_\theta$.
The policy objective can be written in terms of the old policy:
@@ -97,6 +100,12 @@ $$
### Lower bound of Optimization
> [!NOTE]
>
> KL (Kullback-Leibler) divergence is a measure of the difference between two probability distributions.
>
> $D_{KL}(\pi(\cdot|s)\|\pi'(\cdot|s))=\int \pi(a|s)\log \frac{\pi(a|s)}{\pi'(a|s)}\,da$
$$
J(\pi')-J(\pi)\geq L_\pi(\pi')-C\max_{s\in S}D_{KL}(\pi(\cdot|s)\|\pi'(\cdot|s))
$$
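A small numeric illustration (not from the lecture) of the penalty term: the maximum KL divergence over states between two discrete action distributions. The toy probabilities below are assumptions chosen only to show the computation.

```python
# Toy computation of max_s D_KL(pi(.|s) || pi'(.|s)) for discrete actions.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors over actions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

pi_old = {0: [0.5, 0.5], 1: [0.9, 0.1]}  # pi(.|s) for two toy states
pi_new = {0: [0.6, 0.4], 1: [0.8, 0.2]}  # pi'(.|s)

max_kl = max(kl_divergence(pi_old[s], pi_new[s]) for s in pi_old)
print(max_kl)  # the max_s D_KL term that discounts the surrogate improvement
```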

View File

@@ -0,0 +1,182 @@
# CSE510 Deep Reinforcement Learning (Lecture 14)
## Advanced Policy Gradient Methods
### Trust Region Policy Optimization (TRPO)
"Recall" from last lecture
$$
\max_{\pi'} \mathbb{E}_{s\sim d^{\pi},a\sim \pi} \left[\frac{\pi'(a|s)}{\pi(a|s)}A^{\pi}(s,a)\right]
$$
such that
$$
\mathbb{E}_{s\sim d^{\pi}} D_{KL}(\pi(\cdot|s)\|\pi'(\cdot|s))\leq \delta
$$
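As a hedged sketch of how these two quantities are estimated in practice, the snippet below computes the importance-sampled surrogate and the mean KL from a batch of rollouts generated by the current policy $\pi$; the array names and shapes are assumptions for illustration.

```python
# Sketch: estimate the surrogate objective and the KL constraint from samples
# drawn under the current policy pi (the "old" policy in the ratio).
import numpy as np

def surrogate_and_kl(logp_old, logp_new, advantages, probs_old, probs_new):
    """
    logp_old, logp_new:   log pi(a_t|s_t) and log pi'(a_t|s_t), shape (T,)
    advantages:           A^pi(s_t, a_t) estimates, shape (T,)
    probs_old, probs_new: full action distributions at s_t, shape (T, n_actions)
    """
    ratio = np.exp(logp_new - logp_old)              # importance weights pi'/pi
    surrogate = float(np.mean(ratio * advantages))   # E[(pi'/pi) A^pi]
    kl = float(np.mean(np.sum(probs_old * (np.log(probs_old) - np.log(probs_new)), axis=1)))
    return surrogate, kl                             # maximize surrogate s.t. kl <= delta
```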
Unconstrained penalized objective:
$$
d^*=\arg\max_{d} J(\theta+d)-\lambda(D_{KL}\left[\pi_\theta||\pi_{\theta+d}\right]-\delta)
$$
$\theta_{new}=\theta_{old}+d$
First-order Taylor expansion of the objective and second-order expansion of the KL divergence:
$$
\approx \arg\max_{d} J(\theta_{old})+\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}d-\frac{1}{2}\lambda(d^T\nabla_\theta^2 D_{KL}\left[\pi_{\theta_{old}}||\pi_{\theta}\right]\mid_{\theta=\theta_{old}}d)+\lambda \delta
$$
If you are really interested, try to fill in the "Solving the KL-Constrained Problem" section.
#### Natural Gradient Descent
Setting the gradient to zero:
$$
\begin{aligned}
0&=\frac{\partial}{\partial d}\left(-\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}d+\frac{1}{2}\lambda d^T F(\theta_{old})d\right)\\
&=-\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}+\lambda F(\theta_{old})d
\end{aligned}
$$
$$
d=\frac{1}{\lambda} F^{-1}(\theta_{old})\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}
$$
The natural gradient is
$$
\tilde{\nabla}J(\theta)=F^{-1}(\theta_{old})\nabla_\theta J(\theta)
$$
The parameter update then applies $F^{-1}$ to the sampled gradient estimate $\hat{g}$:
$$
\theta_{new}=\theta_{old}+\alpha F^{-1}(\theta_{old})\hat{g}
$$
To choose the step size $\alpha$, approximate the KL divergence to second order,
$$
D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\approx \frac{1}{2}(\theta-\theta_{old})^T F(\theta_{old})(\theta-\theta_{old})
$$
and require the natural-gradient step $\alpha g_N$, with $g_N=F^{-1}(\theta_{old})\hat{g}$, to meet the trust-region constraint with equality,
$$
\frac{1}{2}(\alpha g_N)^T F(\alpha g_N)=\delta
$$
which gives
$$
\alpha=\sqrt{\frac{2\delta}{g_N^T F g_N}}
$$
However, because of the quadratic approximation, the KL constraint may still be violated.
#### Line search
We do a line search for the best step size, making sure that we are:
- Improving the objective
- Satisfying the KL constraint
TRPO = Natural Policy Gradient (NPG) + line search + monotonic improvement theorem
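The pieces above fit together roughly as follows. This is a hedged sketch, not reference TRPO code: `theta` and `g` are assumed to be the flattened policy parameters and sampled policy gradient, `fvp(v)` is assumed to return a Fisher-vector product $F(\theta_{old})v$, and `surrogate` and `kl` are assumed callables that evaluate the surrogate objective and the mean KL for a candidate parameter vector.

```python
# Sketch of a TRPO-style update: natural gradient direction via conjugate
# gradient on Fisher-vector products, step size from the KL constraint,
# then a backtracking line search.
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g without ever forming F explicitly."""
    x = np.zeros_like(g)
    r, p = g.copy(), g.copy()
    rs_old = r @ r
    for _ in range(iters):
        Ap = fvp(p)
        step = rs_old / (p @ Ap + 1e-12)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trpo_step(theta, g, fvp, surrogate, kl, delta=0.01, backtrack=0.8, max_tries=10):
    g_n = conjugate_gradient(fvp, g)                        # natural gradient F^{-1} g
    alpha = np.sqrt(2.0 * delta / (g_n @ fvp(g_n) + 1e-12)) # step size from the KL constraint
    old_obj = surrogate(theta)
    for i in range(max_tries):                              # backtracking line search
        theta_new = theta + (backtrack ** i) * alpha * g_n
        if surrogate(theta_new) > old_obj and kl(theta, theta_new) <= delta:
            return theta_new                                # improvement + constraint satisfied
    return theta                                            # reject the step
```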
#### Summary of TRPO
Pros
- Proper learning step
- [Monotonic improvement guarantee](./CSE510_L13.md#Monotonic-Improvement-Theorem)
Cons
- Poor scalability
- Second-order optimization: computing the Fisher Information Matrix and its inverse for the current policy at every update is expensive
- Not quite sample efficient
- Requires a large batch of rollouts for accurate approximation
### Proximal Policy Optimization (PPO)
> Proximal Policy Optimization (PPO), which perform comparably or better than state-of-the-art approaches while being much simpler to implement and tune. -- OpenAI
[link to paper](https://arxiv.org/pdf/1707.06347)
Idea:
- The constraint helps the training process. However, maybe it does not have to be a strict constraint:
- Does it matter if we break the constraint only a few times?
What if we treat it as a “soft” constraint and add a proximal penalty term to the objective function?
#### PPO with Adaptive KL Penalty
$$
\max_{\theta} \hat{\mathbb{E}_t}\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta \hat{\mathbb{E}_t}\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]
$$
Use an adaptive $\beta$ value.
$$
L^{KLPEN}(\theta)=\hat{\mathbb{E}_t}\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta \hat{\mathbb{E}_t}\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]
$$
Compute $d=\hat{\mathbb{E}_t}\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]$
- If $d<d_{target}/1.5$, $\beta\gets \beta/2$
- If $d>d_{target}\times 1.5$, $\beta\gets \beta\times 2$
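A minimal sketch of this adaptation rule; the factors 1.5 and 2 follow the rule above, while the target KL `d_target` and any initial value of $\beta$ are left to the caller and are assumptions here.

```python
# Adaptive KL penalty coefficient: grow beta when the KL is too large,
# shrink it when the KL is too small.
def update_beta(beta, d, d_target):
    if d < d_target / 1.5:
        beta = beta / 2.0   # policies barely moved: penalize less
    elif d > d_target * 1.5:
        beta = beta * 2.0   # policies moved too far: penalize more
    return beta
```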
#### PPO with Clipped Objective
$$
\max_{\theta} \hat{\mathbb{E}_t}\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]
$$
$$
r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}
$$
- Here, $r_t(\theta)$ measures how much the new policy changes the probability of taking action $a_t$ in state $s_t$:
- If $r_t > 1$: the action becomes more likely under the new policy.
- If $r_t < 1$: the action becomes less likely.
- We'd like $r_t\hat{A}_t$ to increase if $\hat{A}_t > 0$ (good actions become more probable) and decrease if $\hat{A}_t < 0$.
- But if $r_t$ changes too much, the update becomes **unstable**, just like in vanilla PG.
We limit $r_t(\theta)$ to be in a range:
$$
L^{CLIP}(\theta)=\hat{\mathbb{E}_t}\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]
$$
> Trust Region Policy Optimization (TRPO): Don't move further than $\delta$ in KL.
> Proximal Policy Optimization (PPO): Don't let $r_t(\theta)$ drift further than $\epsilon$.
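A hedged PyTorch sketch of $L^{CLIP}$, written as a loss to minimize (hence the leading minus sign); the tensor names and the default $\epsilon=0.2$ are illustrative assumptions.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """logp_new / logp_old: log pi(a_t|s_t) under the new and old policies; advantages: A_hat_t."""
    ratio = torch.exp(logp_new - logp_old)                      # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)  # clip(r_t, 1-eps, 1+eps)
    # negative sign: minimizing this loss maximizes the clipped surrogate
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```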
#### PPO in Practice
$$
L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}_t}\left[L_t^{CLIP}(\theta)-c_1L_t^{VF}(\theta)+c_2S[\pi_\theta](s_t)\right]
$$
Here $L_t^{CLIP}(\theta)$ is the clipped surrogate objective.
$L_t^{VF}(\theta)$ is a squared-error loss for the "critic", $(V_\theta(s_t)-V_t^{target})^2$; since the combined objective is maximized, it enters with a negative sign.
$S[\pi_\theta](s_t)$ is an entropy bonus that ensures sufficient exploration by encouraging diversity of actions.
$c_1$ and $c_2$ are trade-off coefficients; in the paper, $c_1=1$ and $c_2=0.01$.
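Putting the three terms together as a single loss to minimize, building on the `ppo_clip_loss` sketch above; the tensor names are again assumptions for illustration.

```python
import torch

def ppo_total_loss(clip_loss, values, value_targets, entropy, c1=1.0, c2=0.01):
    """clip_loss is the (already negated) clipped surrogate; entropy is S[pi_theta](s_t) per sample."""
    value_loss = (values - value_targets).pow(2).mean()  # L^VF
    # minimize: -L^CLIP + c1 * L^VF - c2 * S  (i.e., maximize the combined objective)
    return clip_loss + c1 * value_loss - c2 * entropy.mean()
```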
### Summary for Policy Gradient Methods
Trust region policy optimization (TRPO)
- Optimization problem formulation
- Natural gradient ascent + monotonic improvement + line search
- But require second-order optimization
Proximal policy optimization (PPO)
- Clipped objective
- Simple yet effective
Take-away:
- Proper step size is critical for policy gradient methods
- Sample efficiency can be improved by using importance sampling

View File

@@ -15,5 +15,6 @@ export default {
CSE510_L10: "CSE510 Deep Reinforcement Learning (Lecture 10)",
CSE510_L11: "CSE510 Deep Reinforcement Learning (Lecture 11)",
CSE510_L12: "CSE510 Deep Reinforcement Learning (Lecture 12)",
CSE510_L13: "CSE510 Deep Reinforcement Learning (Lecture 13)",
CSE510_L14: "CSE510 Deep Reinforcement Learning (Lecture 14)",
}