# CSE510 Deep Reinforcement Learning (Lecture 13)

> [!CAUTION]
> Recap from last lecture
>
> This lecture was terribly taught in 90 minutes for someone like me with no background in optimization or Lagrangian duality. These notes may include numerous mistakes, and this is a good checkpoint to restart learning from.
>
> For any differentiable policy $\pi_\theta(s,a)$ and for any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$,
>
> the policy gradient is
>
> $\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)\right]$
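
As a concrete illustration, here is a minimal NumPy sketch (not from the lecture) of this score-function estimator on a single-state, three-action bandit with a tabular softmax policy; the Q-values in `true_q` are made-up numbers:

```python
import numpy as np

# Minimal sketch of the estimator
#   grad J(theta) = E_pi[ grad log pi_theta(a|s) * Q(s,a) ]
# on a single-state, 3-action bandit with a tabular softmax policy.
rng = np.random.default_rng(0)
theta = np.zeros(3)                      # one logit per action
true_q = np.array([1.0, 2.0, 0.5])       # assumed Q(s, .), for illustration only

def pi(theta):
    z = np.exp(theta - theta.max())      # numerically stable softmax
    return z / z.sum()

def grad_log_pi(theta, a):
    # d/dtheta log softmax(theta)[a] = one_hot(a) - pi(theta)
    g = -pi(theta)
    g[a] += 1.0
    return g

alpha = 0.1
for _ in range(2000):
    a = rng.choice(3, p=pi(theta))           # a ~ pi_theta(.|s)
    q = true_q[a] + rng.normal(0.0, 0.1)     # noisy sample standing in for Q(s,a)
    theta += alpha * q * grad_log_pi(theta, a)   # stochastic gradient ascent

print(pi(theta).round(3))   # probability mass concentrates on the best action (index 1)
```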

## Problems with the policy gradient method

Unstable update: the step size is very important.

- If the step size is too large: bad samples lead to a worse policy, and unlike supervised learning (where correct labels and data in the following batches may correct the mistake), the next batches are collected with that worse policy (see the sketch after this list)
- If the step size is too small: the learning process is slow
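
Continuing the bandit sketch above (it reuses `pi`, `grad_log_pi`, and `true_q`), a hedged illustration of the step-size problem: with a large `alpha`, a few noisy samples can push the softmax policy to near-deterministic on a suboptimal action, and because later samples are drawn from that worse policy, it rarely recovers:

```python
# Same update rule as above, run with a small and a large step size.
# With alpha = 5.0 the policy often collapses onto whichever action it
# happens to sample early, because the collapsed policy then generates
# almost all subsequent data (no corrective labels, unlike supervised learning).
for alpha in (0.05, 5.0):
    theta = np.zeros(3)
    rng = np.random.default_rng(1)
    for _ in range(500):
        a = rng.choice(3, p=pi(theta))
        q = true_q[a] + rng.normal(0.0, 1.0)     # heavier reward noise
        theta += alpha * q * grad_log_pi(theta, a)
    print(alpha, pi(theta).round(3))
```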

## Deriving the optimization objective function of Trust Region Policy Optimization (TRPO)

### Objective of Policy Gradient Methods

$$
J(\pi_\theta)=\mathbb{E}_{\tau\sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]
$$

Here $\tau$ is a trajectory sampled from the policy $\pi_\theta$.
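
A minimal sketch of estimating $J(\pi_\theta)$ by Monte Carlo, truncating the infinite sum at a finite horizon. The Gymnasium-style `env` interface and the `policy` function are assumptions for illustration, not part of the lecture:

```python
import numpy as np

def estimate_J(env, policy, gamma=0.99, n_trajectories=100, horizon=1000):
    """Monte Carlo estimate of J(pi) = E_{tau ~ pi}[ sum_t gamma^t r_t ]."""
    returns = []
    for _ in range(n_trajectories):
        state, _ = env.reset()
        G, discount = 0.0, 1.0
        for _ in range(horizon):              # truncate the infinite horizon
            action = policy(state)            # a_t ~ pi_theta(.|s_t)
            state, reward, terminated, truncated, _ = env.step(action)
            G += discount * reward
            discount *= gamma
            if terminated or truncated:
                break
        returns.append(G)
    return float(np.mean(returns))            # sample mean over trajectories
```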

The policy objective can be written in terms of the old one via the performance difference lemma:

$$
J(\pi')=J(\pi)+\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t,a_t)\right]
$$

where $A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s)$ is the advantage function of the old policy $\pi$.
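
As a sanity check (a standard argument, not shown in these notes), the lemma follows by writing $A^{\pi}(s_t,a_t)$ as $r_t+\gamma V^{\pi}(s_{t+1})-V^{\pi}(s_t)$ in expectation and telescoping the sum:

$$
\begin{aligned}
\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi}(s_t,a_t)\right]
&=\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^t\left(r_t+\gamma V^{\pi}(s_{t+1})-V^{\pi}(s_t)\right)\right]\\
&=\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]-\mathbb{E}_{s_0}\left[V^{\pi}(s_0)\right]
=J(\pi')-J(\pi).
\end{aligned}
$$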

### Lower bound of the optimization

> [!NOTE]
>
> The Kullback-Leibler (KL) divergence is a measure of the difference between two probability distributions:
>
> $D_{KL}(\pi(\cdot|s)\,\|\,\pi'(\cdot|s))=\int \pi(a|s)\log \frac{\pi(a|s)}{\pi'(a|s)}\,da$
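
For a discrete action space the integral becomes a sum over actions. A small self-contained example (the probabilities are made-up numbers):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) = sum_a p(a) * log(p(a) / q(a)); assumes p, q > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

pi_old = [0.5, 0.3, 0.2]        # pi(.|s) at a fixed state s
pi_new = [0.4, 0.4, 0.2]        # pi'(.|s)
print(kl(pi_old, pi_new))       # ~0.025: the two policies are close
print(kl(pi_old, pi_old))       # 0.0: identical distributions
```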

$$
J(\pi')-J(\pi)\geq L_\pi(\pi')-C\max_{s\in S}D_{KL}(\pi(\cdot|s)\,\|\,\pi'(\cdot|s))
$$
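
A hedged sketch of evaluating the right-hand side on sampled data, assuming $L_\pi(\pi')$ is the usual importance-weighted surrogate $\mathbb{E}_{s,a\sim\pi}\left[\frac{\pi'(a|s)}{\pi(a|s)}A^{\pi}(s,a)\right]$ (its definition falls in a part of the derivation not shown above); all array names here are assumptions:

```python
import numpy as np

def penalized_lower_bound(logp_new, logp_old, advantages, kl_per_state, C):
    """Sample estimate of L_pi(pi') - C * max_s D_KL(pi(.|s) || pi'(.|s))."""
    ratio = np.exp(logp_new - logp_old)              # pi'(a|s) / pi(a|s)
    surrogate = float(np.mean(ratio * advantages))   # estimate of L_pi(pi')
    penalty = C * float(np.max(kl_per_state))        # pessimistic max-KL penalty
    return surrogate - penalty
```

At $\pi'=\pi$ the bound equals zero, so any $\pi'$ that increases it is guaranteed to satisfy $J(\pi')\geq J(\pi)$; TRPO replaces the pessimistic max-KL penalty with a constraint on the KL, which is what gives the trust region.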