# CSE510 Deep Reinforcement Learning (Lecture 13)

> [!CAUTION]
> Recap from last lecture
>
> This lecture was terribly taught in 90 minutes for someone like me with no background in optimization or Lagrangian duality. These notes may include numerous mistakes, and this is a good checkpoint to restart learning from.
>
> For any differentiable policy $\pi_\theta(s,a)$ and for any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$,
>
> the policy gradient is
>
> $\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)\right]$
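
As a concrete illustration, here is a minimal NumPy sketch (not from the lecture) of this score-function estimator on a single-state, three-action bandit with a tabular softmax policy; the Q-values in `true_q` are made-up numbers:

```python
import numpy as np

# Minimal sketch of the estimator
#   grad J(theta) = E_pi[ grad log pi_theta(a|s) * Q(s,a) ]
# on a single-state, 3-action bandit with a tabular softmax policy.
rng = np.random.default_rng(0)
theta = np.zeros(3)                      # one logit per action
true_q = np.array([1.0, 2.0, 0.5])       # assumed Q(s, .), for illustration only

def pi(theta):
    z = np.exp(theta - theta.max())      # numerically stable softmax
    return z / z.sum()

def grad_log_pi(theta, a):
    # d/dtheta log softmax(theta)[a] = one_hot(a) - pi(theta)
    g = -pi(theta)
    g[a] += 1.0
    return g

alpha = 0.1
for _ in range(2000):
    a = rng.choice(3, p=pi(theta))           # a ~ pi_theta(.|s)
    q = true_q[a] + rng.normal(0.0, 0.1)     # noisy sample standing in for Q(s,a)
    theta += alpha * q * grad_log_pi(theta, a)   # stochastic gradient ascent

print(pi(theta).round(3))   # probability mass concentrates on the best action (index 1)
```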

## Problems with the policy gradient method

Unstable update: the step size is very important.

- If the step size is too large: bad samples lead to a worse policy, and unlike supervised learning (where correct labels and data in the following batches may correct the mistake), the next batches are collected with that worse policy (see the sketch after this list)
- If the step size is too small: the learning process is slow
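
Continuing the bandit sketch above (it reuses `pi`, `grad_log_pi`, and `true_q`), a hedged illustration of the step-size problem: with a large `alpha`, a few noisy samples can push the softmax policy to near-deterministic on a suboptimal action, and because later samples are drawn from that worse policy, it rarely recovers:

```python
# Same update rule as above, run with a small and a large step size.
# With alpha = 5.0 the policy often collapses onto whichever action it
# happens to sample early, because the collapsed policy then generates
# almost all subsequent data (no corrective labels, unlike supervised learning).
for alpha in (0.05, 5.0):
    theta = np.zeros(3)
    rng = np.random.default_rng(1)
    for _ in range(500):
        a = rng.choice(3, p=pi(theta))
        q = true_q[a] + rng.normal(0.0, 1.0)     # heavier reward noise
        theta += alpha * q * grad_log_pi(theta, a)
    print(alpha, pi(theta).round(3))
```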

## Deriving the optimization objective function of Trust Region Policy Optimization (TRPO)

### Objective of Policy Gradient Methods

$$
J(\pi_\theta)=\mathbb{E}_{\tau\sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]
$$

Here $\tau$ is a trajectory sampled from the policy $\pi_\theta$.
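
A minimal sketch of estimating $J(\pi_\theta)$ by Monte Carlo, truncating the infinite sum at a finite horizon. The Gymnasium-style `env` interface and the `policy` function are assumptions for illustration, not part of the lecture:

```python
import numpy as np

def estimate_J(env, policy, gamma=0.99, n_trajectories=100, horizon=1000):
    """Monte Carlo estimate of J(pi) = E_{tau ~ pi}[ sum_t gamma^t r_t ]."""
    returns = []
    for _ in range(n_trajectories):
        state, _ = env.reset()
        G, discount = 0.0, 1.0
        for _ in range(horizon):              # truncate the infinite horizon
            action = policy(state)            # a_t ~ pi_theta(.|s_t)
            state, reward, terminated, truncated, _ = env.step(action)
            G += discount * reward
            discount *= gamma
            if terminated or truncated:
                break
        returns.append(G)
    return float(np.mean(returns))            # sample mean over trajectories
```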

The policy objective can be written in terms of the old one via the performance difference lemma:

$$
J(\pi')=J(\pi)+\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t,a_t)\right]
$$

where $A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s)$ is the advantage function of the old policy $\pi$.
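
As a sanity check (a standard argument, not shown in these notes), the lemma follows by writing $A^{\pi}(s_t,a_t)$ as $r_t+\gamma V^{\pi}(s_{t+1})-V^{\pi}(s_t)$ in expectation and telescoping the sum:

$$
\begin{aligned}
\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi}(s_t,a_t)\right]
&=\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^t\left(r_t+\gamma V^{\pi}(s_{t+1})-V^{\pi}(s_t)\right)\right]\\
&=\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]-\mathbb{E}_{s_0}\left[V^{\pi}(s_0)\right]
=J(\pi')-J(\pi).
\end{aligned}
$$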

### Lower bound of the optimization

> [!NOTE]
>
> The Kullback-Leibler (KL) divergence is a measure of the difference between two probability distributions:
>
> $D_{KL}(\pi(\cdot|s)\,\|\,\pi'(\cdot|s))=\int \pi(a|s)\log \frac{\pi(a|s)}{\pi'(a|s)}\,da$
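
For a discrete action space the integral becomes a sum over actions. A small self-contained example (the probabilities are made-up numbers):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) = sum_a p(a) * log(p(a) / q(a)); assumes p, q > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

pi_old = [0.5, 0.3, 0.2]        # pi(.|s) at a fixed state s
pi_new = [0.4, 0.4, 0.2]        # pi'(.|s)
print(kl(pi_old, pi_new))       # ~0.025: the two policies are close
print(kl(pi_old, pi_old))       # 0.0: identical distributions
```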

$$
J(\pi')-J(\pi)\geq L_\pi(\pi')-C\max_{s\in S}D_{KL}(\pi(\cdot|s)\,\|\,\pi'(\cdot|s))
$$
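
A hedged sketch of evaluating the right-hand side on sampled data, assuming $L_\pi(\pi')$ is the usual importance-weighted surrogate $\mathbb{E}_{s,a\sim\pi}\left[\frac{\pi'(a|s)}{\pi(a|s)}A^{\pi}(s,a)\right]$ (its definition falls in a part of the derivation not shown above); all array names here are assumptions:

```python
import numpy as np

def penalized_lower_bound(logp_new, logp_old, advantages, kl_per_state, C):
    """Sample estimate of L_pi(pi') - C * max_s D_KL(pi(.|s) || pi'(.|s))."""
    ratio = np.exp(logp_new - logp_old)              # pi'(a|s) / pi(a|s)
    surrogate = float(np.mean(ratio * advantages))   # estimate of L_pi(pi')
    penalty = C * float(np.max(kl_per_state))        # pessimistic max-KL penalty
    return surrogate - penalty
```

At $\pi'=\pi$ the bound equals zero, so any $\pi'$ that increases it is guaranteed to satisfy $J(\pi')\geq J(\pi)$; TRPO replaces the pessimistic max-KL penalty with a constraint on the KL, which is what gives the trust region.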