# CSE510 Deep Reinforcement Learning (Lecture 14)
## Advanced Policy Gradient Methods

### Trust Region Policy Optimization (TRPO)

Recall from the last lecture:

$$
\max_{\pi'} \mathbb{E}_{s\sim d^{\pi},a\sim \pi} \left[\frac{\pi'(a|s)}{\pi(a|s)}A^{\pi}(s,a)\right]
$$

such that

$$
\mathbb{E}_{s\sim d^{\pi}} D_{KL}(\pi(\cdot|s)||\pi'(\cdot|s))\leq \delta
$$
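
Both the surrogate objective and the KL constraint can be estimated from samples gathered under the current policy. Below is a minimal NumPy sketch under that assumption; the arrays `logp_old`, `logp_new`, and `advantages` are hypothetical per-sample quantities produced by a rollout and a policy network.

```python
import numpy as np

def surrogate_objective(logp_new, logp_old, advantages):
    """Monte Carlo estimate of E[(pi'/pi) * A^pi] using importance weights."""
    ratio = np.exp(logp_new - logp_old)   # pi'(a|s) / pi(a|s)
    return np.mean(ratio * advantages)

def mean_kl_estimate(logp_old, logp_new):
    """Sample-based estimate of E_s[KL(pi || pi')] when actions are drawn from pi."""
    return np.mean(logp_old - logp_new)
```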
Unconstrained penalized objective:

$$
d^*=\arg\max_{d} J(\theta+d)-\lambda(D_{KL}\left[\pi_\theta||\pi_{\theta+d}\right]-\delta)
$$

$\theta_{new}=\theta_{old}+d$

First-order Taylor expansion of the objective and second-order expansion of the KL term:

$$
\approx \arg\max_{d} J(\theta_{old})+\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}d-\frac{1}{2}\lambda\left(d^T\nabla_\theta^2 D_{KL}\left[\pi_{\theta_{old}}||\pi_{\theta}\right]\mid_{\theta=\theta_{old}}d\right)+\lambda \delta
$$

If you are really interested, try to fill in the derivation for solving the KL-constrained problem yourself.
#### Natural Gradient Descent

Setting the gradient with respect to $d$ to zero:

$$
\begin{aligned}
0&=\frac{\partial}{\partial d}\left(-\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}d+\frac{1}{2}\lambda\, d^T F(\theta_{old})\, d\right)\\
&=-\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}+\lambda F(\theta_{old})\, d
\end{aligned}
$$

$$
d=\frac{1}{\lambda} F^{-1}(\theta_{old})\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}
$$

The constant $\frac{1}{\lambda}$ only rescales the step, so it is absorbed into the step size $\alpha$ below.
The natural gradient is

$$
\tilde{\nabla}J(\theta)=F^{-1}(\theta_{old})\nabla_\theta J(\theta)
$$

and the update is

$$
\theta_{new}=\theta_{old}+\alpha F^{-1}(\theta_{old})\hat{g}
$$

To choose the step size $\alpha$, use the quadratic approximation of the KL divergence,

$$
D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\approx \frac{1}{2}(\theta-\theta_{old})^T F(\theta_{old})(\theta-\theta_{old})
$$

and require the natural-gradient step $\alpha g_N$ (with $g_N=F^{-1}\hat{g}$) to meet the constraint with equality:

$$
\frac{1}{2}(\alpha g_N)^T F(\alpha g_N)=\delta
$$

$$
\alpha=\sqrt{\frac{2\delta}{g_N^T F g_N}}
$$
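
As a concrete illustration, here is a small NumPy sketch of the natural-gradient step with this KL-based step size, assuming a toy gradient `g` and an explicit Fisher matrix `F` (in practice TRPO avoids forming `F` and instead uses conjugate gradient with Fisher-vector products):

```python
import numpy as np

def natural_gradient_step(g, F, delta):
    """Compute the natural gradient g_N = F^{-1} g and scale it to satisfy the KL budget."""
    g_nat = np.linalg.solve(F, g)                        # F^{-1} g without explicitly inverting F
    alpha = np.sqrt(2.0 * delta / (g_nat @ F @ g_nat))   # from (1/2) (alpha g_N)^T F (alpha g_N) = delta
    return alpha * g_nat                                 # proposed update d = alpha * g_N

# Toy two-parameter example
g = np.array([1.0, -0.5])                  # estimated policy gradient
F = np.array([[2.0, 0.3], [0.3, 1.0]])     # symmetric positive-definite Fisher matrix
step = natural_gradient_step(g, F, delta=0.01)
```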
However, because of the quadratic approximation, the KL constraint may still be violated.

#### Line Search

We do a line search for the best step size, making sure that we are

- Improving the objective
- Satisfying the KL constraint

TRPO = NPG + line search + monotonic improvement theorem
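
A minimal sketch of such a backtracking line search, assuming hypothetical callables `objective(theta)` and `kl_to_old(theta)` that evaluate the surrogate objective and the KL divergence from the old policy:

```python
def backtracking_line_search(theta_old, full_step, objective, kl_to_old,
                             delta, max_backtracks=10, shrink=0.5):
    """Shrink the proposed natural-gradient step until it improves the objective
    and satisfies the KL constraint; otherwise keep the old parameters."""
    obj_old = objective(theta_old)
    for i in range(max_backtracks):
        theta_new = theta_old + (shrink ** i) * full_step
        if objective(theta_new) > obj_old and kl_to_old(theta_new) <= delta:
            return theta_new   # accepted step
    return theta_old           # no acceptable step size found
```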
#### Summary of TRPO

Pros

- Proper learning step size
- [Monotonic improvement guarantee](./CSE510_L13.md#Monotonic-Improvement-Theorem)

Cons

- Poor scalability
  - Second-order optimization: computing the Fisher information matrix and its inverse for the current policy at every update is expensive
- Not quite sample efficient
  - Requires a large batch of rollouts for an accurate approximation
### Proximal Policy Optimization (PPO)

> Proximal Policy Optimization (PPO), which perform comparably or better than state-of-the-art approaches while being much simpler to implement and tune. -- OpenAI

[link to paper](https://arxiv.org/pdf/1707.06347)

Idea:

- The KL constraint helps the training process. However, maybe it does not need to be a strict constraint:
  - Does it matter if we break the constraint just a few times?

What if we treat it as a “soft” constraint and add a proximal term to the objective function?
#### PPO with Adaptive KL Penalty

$$
\max_{\theta} \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta \hat{\mathbb{E}}_t\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]
$$

Use an adaptive $\beta$ value.

$$
L^{KLPEN}(\theta)=\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]-\beta \hat{\mathbb{E}}_t\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]
$$

Compute $d=\hat{\mathbb{E}}_t\left[KL[\pi_{\theta_{old}}(\cdot|s_t),\pi_{\theta}(\cdot|s_t)]\right]$ and adjust $\beta$ as follows (a short sketch follows this list):

- If $d<d_{target}/1.5$, $\beta\gets \beta/2$
- If $d>d_{target}\times 1.5$, $\beta\gets \beta\times 2$
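
A minimal sketch of this adaptation rule (names such as `beta`, `mean_kl`, and `d_target` are placeholders for values computed during training):

```python
def update_beta(beta, mean_kl, d_target):
    """Adapt the KL penalty coefficient based on the most recent mean KL divergence."""
    if mean_kl < d_target / 1.5:
        beta /= 2.0     # policy barely moved: weaken the penalty
    elif mean_kl > d_target * 1.5:
        beta *= 2.0     # policy moved too far: strengthen the penalty
    return beta
```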
#### PPO with Clipped Objective

$$
\max_{\theta} \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]
$$

$$
r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}
$$

- Here, $r_t(\theta)$ measures how much the new policy changes the probability of taking action $a_t$ in state $s_t$:
  - If $r_t > 1$: the action becomes more likely under the new policy.
  - If $r_t < 1$: the action becomes less likely.
- We'd like $r_t\hat{A}_t$ to increase if $\hat{A}_t > 0$ (good actions become more probable) and decrease if $\hat{A}_t < 0$.
- But if $r_t$ changes too much, the update becomes **unstable**, just like in vanilla PG.

We limit $r_t(\theta)$ to a range:

$$
L^{CLIP}(\theta)=\hat{\mathbb{E}}_t\left[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)\right]
$$

> Trust Region Policy Optimization (TRPO): Don't move further than $\delta$ in KL.

> Proximal Policy Optimization (PPO): Don't let $r_t(\theta)$ drift further than $\epsilon$.
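
A NumPy sketch of the clipped surrogate objective, assuming per-timestep arrays of log-probabilities and advantage estimates collected from a rollout:

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """L^CLIP: pessimistic minimum of the unclipped and clipped ratio terms."""
    ratio = np.exp(logp_new - logp_old)                      # r_t(theta)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)   # keep r_t within [1 - eps, 1 + eps]
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```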
#### PPO in Practice

$$
L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta)-c_1L_t^{VF}(\theta)+c_2S[\pi_\theta](s_t)\right]
$$

Here $L_t^{CLIP}(\theta)$ is the clipped surrogate objective.

$L_t^{VF}(\theta)$ is a squared-error loss for the "critic", $(V_\theta(s_t)-V_t^{target})^2$. It enters with a negative sign because the combined objective is maximized.

$S[\pi_\theta](s_t)$ is an entropy bonus that ensures sufficient exploration by encouraging diversity of actions.

$c_1$ and $c_2$ are trade-off coefficients; the paper uses $c_1=1$ and $c_2=0.01$.
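
Putting the three terms together, here is a sketch of the combined objective for one batch (the entropy values and value targets are assumed to be supplied by the rollout; in practice the same expression is built in an autodiff framework and maximized with minibatch SGD/Adam):

```python
import numpy as np

def ppo_objective(logp_new, logp_old, advantages, values, value_targets,
                  entropies, epsilon=0.2, c1=1.0, c2=0.01):
    """Clipped surrogate minus value loss plus entropy bonus (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    l_clip = np.mean(np.minimum(ratio * advantages, clipped * advantages))
    l_vf = np.mean((values - value_targets) ** 2)   # squared-error critic loss
    entropy_bonus = np.mean(entropies)              # encourages exploration
    return l_clip - c1 * l_vf + c2 * entropy_bonus
```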
### Summary for Policy Gradient Methods

Trust region policy optimization (TRPO)

- Optimization problem formulation
- Natural gradient ascent + monotonic improvement + line search
- But requires second-order optimization

Proximal policy optimization (PPO)

- Clipped objective
- Simple yet effective

Take-away:

- A proper step size is critical for policy gradient methods
- Sample efficiency can be improved by using importance sampling