# CSE510 Deep Reinforcement Learning (Lecture 13)
> Recap from last lecture
>
> For any differentiable policy $\pi_\theta(s,a)$ and for any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$,
>
> the policy gradient is
>
> $\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)\right]$

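
For concreteness (this example is not from the lecture), here is a minimal NumPy sketch of the score-function estimate of this gradient for a softmax policy on a toy one-state problem; the action values `Q` stand in for $Q^{\pi_\theta}(s,a)$ and are an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-state problem: 3 actions with known action values, standing in for Q^{pi_theta}(s, a).
Q = np.array([1.0, 2.0, 0.5])
theta = np.zeros(3)                                # parameters of a softmax policy

def softmax(t):
    z = np.exp(t - t.max())
    return z / z.sum()

p = softmax(theta)

# Monte Carlo estimate of grad J = E_pi[ grad_theta log pi_theta(a) * Q(a) ]
N = 100_000
a = rng.choice(3, size=N, p=p)                     # sample actions a ~ pi_theta
grad_log_pi = np.eye(3)[a] - p                     # each row: grad_theta log softmax(theta)[a]
grad_est = np.mean(grad_log_pi * Q[a][:, None], axis=0)

# Exact gradient for comparison: sum_a pi(a) * grad log pi(a) * Q(a)
grad_exact = sum(p[k] * (np.eye(3)[k] - p) * Q[k] for k in range(3))
print(grad_est, grad_exact)                        # the two should closely agree
```
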
## Problems with policy gradient methods

Data inefficiency

- On-policy method: for each new policy, we need to generate a completely new trajectory
- The data is thrown out after just one gradient update
- Since complex neural networks need many updates, this makes the training process very slow

Unstable updates: the step size is very important

- If the step size is too large:
  - Large step -> bad policy
  - The next batch is generated from the current bad policy -> we collect bad samples
  - Bad samples -> an even worse policy (compare to supervised learning, where the correct labels and data in the following batches may correct it)
- If the step size is too small: the learning process is slow

## Deriving the optimization objective of Trust Region Policy Optimization (TRPO)

### Objective of Policy Gradient Methods

Policy objective:

$$
J(\pi_\theta)=\mathbb{E}_{\tau\sim \pi_\theta}\sum_{t=0}^{\infty} \gamma^t r_t
$$

where $\tau$ is a trajectory generated by the policy $\pi_\theta$.

The objective of a new policy can be written in terms of the old one:

$$
J(\pi_{\theta'})-J(\pi_{\theta})=\mathbb{E}_{\tau \sim \pi_{\theta'}}\sum_{t=0}^{\infty}\gamma^t A^{\pi_\theta}(s_t,a_t)
$$

or, more succinctly,

$$
J(\pi')-J(\pi)=\mathbb{E}_{\tau\sim \pi'}\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t,a_t)
$$

<details>
<summary> Proof</summary>

$$
\begin{aligned}
&\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi}(s_t,a_t)\right]\\
&=\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^t r_t+\sum_{t=0}^{\infty} \gamma^{t+1}V^{\pi}(s_{t+1})-\sum_{t=0}^{\infty} \gamma^{t}V^\pi(s_t)\right]\\
&=J(\pi')+\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=1}^{\infty} \gamma^{t}V^{\pi}(s_t)-\sum_{t=0}^{\infty} \gamma^{t}V^\pi(s_t)\right]\\
&=J(\pi')-\mathbb{E}_{\tau\sim\pi'}V^{\pi}(s_0)\\
&=J(\pi')-J(\pi)
\end{aligned}
$$

The last step uses the fact that the initial-state distribution does not depend on the policy, so $\mathbb{E}_{\tau\sim\pi'}V^{\pi}(s_0)=J(\pi)$.

</details>

### Importance Sampling

Estimate an expectation under one distribution by sampling from another distribution:

$$
\begin{aligned}
\mathbb{E}_{x\sim p}[f(x)]&=\int f(x)p(x)dx\\
&=\int f(x)\frac{p(x)}{q(x)}q(x)dx\\
&=\mathbb{E}_{x\sim q}\left[f(x)\frac{p(x)}{q(x)}\right]\\
&\approx \frac{1}{N}\sum_{i=1,\,x^i\sim q}^N f(x^i)\frac{p(x^i)}{q(x^i)}
\end{aligned}
$$

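As a quick numerical check of this identity (not part of the lecture notes; the particular distributions are illustrative assumptions), we can estimate $\mathbb{E}_{x\sim p}[x^2]$ for a Gaussian $p$ using samples from a wider Gaussian $q$:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Target p = N(1, 1), proposal q = N(0, 2); estimate E_{x~p}[x^2] (true value: 1^2 + 1 = 2).
x = rng.normal(0.0, 2.0, size=200_000)                                  # x^i ~ q
w = np.exp(log_normal_pdf(x, 1.0, 1.0) - log_normal_pdf(x, 0.0, 2.0))   # p(x^i) / q(x^i)
print(np.mean(x ** 2 * w))                                              # importance-sampling estimate, close to 2
```
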
### Estimating the objective with importance sampling

Discounted state visitation distribution:

$$
d^\pi(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^t P(s_t=s|\pi)
$$

$$
\begin{aligned}
J(\pi')-J(\pi)&=\mathbb{E}_{\tau\sim\pi'}\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t,a_t)\\
&=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi'},\, a\sim \pi'}\left[A^{\pi}(s,a)\right]\\
&=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi'},\, a\sim \pi}\left[A^{\pi}(s,a)\frac{\pi'(a|s)}{\pi(a|s)}\right]
\end{aligned}
$$

Sampling states from $d^{\pi'}$ is still impractical, so we use the old policy $\pi$ to sample states while optimizing the new policy $\pi'$, which gives the surrogate objective

$$
L_\pi(\pi')=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi},\, a\sim \pi}\left[A^{\pi}(s,a)\frac{\pi'(a|s)}{\pi(a|s)}\right]
$$

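A minimal sketch of the corresponding sample estimate (assuming log-probabilities under both policies and advantage estimates are already computed for a batch collected with the old policy $\pi$; constant factors such as $\frac{1}{1-\gamma}$ are dropped since they do not affect the maximization):

```python
import numpy as np

def surrogate_loss(logp_new, logp_old, advantages):
    """Sample estimate of the surrogate L_pi(pi'), up to a constant factor.

    logp_new:   log pi'(a_t | s_t) under the policy being optimized
    logp_old:   log pi(a_t | s_t) under the old (data-collecting) policy
    advantages: estimates of A^pi(s_t, a_t)
    The batch (s_t, a_t) is assumed to have been sampled by the old policy pi.
    """
    ratio = np.exp(logp_new - logp_old)   # importance weight pi'(a|s) / pi(a|s)
    return np.mean(ratio * advantages)
```
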
### Lower bound of the optimization

> [!NOTE]
>
> The Kullback-Leibler (KL) divergence is a measure of the difference between two probability distributions:
>
> $D_{KL}(\pi(\cdot|s)\,\|\,\pi'(\cdot|s))=\int \pi(a|s)\log \frac{\pi(a|s)}{\pi'(a|s)}da$

$$
J(\pi')-J(\pi)\geq L_\pi(\pi')-C\max_{s\in S}D_{KL}(\pi(\cdot|s)\,\|\,\pi'(\cdot|s))
$$

where $C$ is a constant.

Instead of optimizing the objective

$$
\max_{\pi'}\; J(\pi')-J(\pi)
$$

directly, we maximize its lower bound:

$$
\max_{\pi'}\; L_\pi(\pi')-C\max_{s\in S}D_{KL}(\pi(\cdot|s)\,\|\,\pi'(\cdot|s))
$$

### Monotonic Improvement Theorem

Proof of the improvement guarantee: suppose $\pi_{k+1}$ and $\pi_k$ are related by

$$
\pi_{k+1}=\arg\max_{\pi'}\; L_{\pi_k}(\pi')-C\max_{s\in S}D_{KL}(\pi_k(\cdot|s)\,\|\,\pi'(\cdot|s))
$$

$\pi_{k}$ itself is a feasible point, and the objective evaluated at $\pi'=\pi_k$ equals 0:

$$
L_{\pi_k}(\pi_{k})=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi_k},\,a\sim \pi_k}[A^{\pi_k}(s,a)]=0
$$

$$
D_{KL}(\pi_k(\cdot|s)\,\|\,\pi_k(\cdot|s))=0 \quad \text{for all } s
$$

Therefore the optimal value of the objective is $\geq 0$.

By the performance bound, $J(\pi_{k+1})-J(\pi_k)\geq 0$.

### Final objective function

$$
\max_{\pi'}\;\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi},\, a\sim \pi}\left[A^{\pi}(s,a)\frac{\pi'(a|s)}{\pi(a|s)}\right]-C\max_{s\in S}D_{KL}(\pi(\cdot|s)\,\|\,\pi'(\cdot|s))
$$

Approximating the maximum over states by an expectation:

$$
\max_{\pi'}\;\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi},\, a\sim \pi}\left[A^{\pi}(s,a)\frac{\pi'(a|s)}{\pi(a|s)}\right]-C\,\mathbb{E}_{s\sim d^{\pi}}D_{KL}(\pi(\cdot|s)\,\|\,\pi'(\cdot|s))
$$

By Lagrangian duality, this penalized objective is mathematically equivalent to the following problem with a trust-region constraint:

$$
\max_{\pi'}\; L_\pi(\pi')
$$

such that

$$
\mathbb{E}_{s\sim d^{\pi}}D_{KL}(\pi(\cdot|s)\,\|\,\pi'(\cdot|s))\leq \delta
$$

$C$ becomes very large when $\gamma$ is close to one, and the corresponding gradient step size becomes too small:

$$
C\propto \frac{\epsilon \gamma}{(1-\gamma)^2}
$$

- Empirical results show that the step size needs to be more adaptive
- But tuning $C$ is hard (it requires tricks, as in PPO; one such trick is sketched below)
- TRPO instead uses the trust-region constraint and makes $\delta$ a tunable hyperparameter

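
For intuition only, here is the adaptive KL-penalty heuristic from PPO, which adjusts the coefficient based on the measured KL divergence after each update; the factors 1.5 and 2 follow the PPO paper, and the surrounding training loop is assumed:

```python
def adapt_kl_coefficient(C, measured_kl, kl_target):
    """PPO-style adaptive KL penalty: grow the coefficient when the measured KL
    overshoots the target, shrink it when the KL undershoots."""
    if measured_kl > 1.5 * kl_target:
        C *= 2.0
    elif measured_kl < kl_target / 1.5:
        C /= 2.0
    return C
```
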
## Trust Region Policy Optimization (TRPO)

$$
\max_{\pi'}\; L_\pi(\pi')
$$

such that

$$
\mathbb{E}_{s\sim d^{\pi}}D_{KL}(\pi(\cdot|s)\,\|\,\pi'(\cdot|s))\leq \delta
$$

Make a linear approximation to $L_{\pi_{\theta_{old}}}$ and a quadratic approximation to the KL term.

Maximize $g\cdot(\theta-\theta_{old})-\frac{\beta}{2}(\theta-\theta_{old})^\top F(\theta-\theta_{old})$

where $g=\frac{\partial}{\partial \theta}L_{\pi_{\theta_{old}}}(\pi_{\theta})\big\vert_{\theta=\theta_{old}}$ and $F=\frac{\partial^2}{\partial \theta^2}\overline{KL}_{\pi_{\theta_{old}}}(\pi_{\theta})\big\vert_{\theta=\theta_{old}}$.

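A minimal NumPy sketch of the resulting update step, assuming the gradient $g$ and a function computing Fisher-vector products $Fv$ are given: solve $Fx=g$ with conjugate gradient, then rescale the step so the quadratic KL approximation equals $\delta$ (TRPO additionally uses a backtracking line search, which is omitted here):

```python
import numpy as np

def conjugate_gradient(Fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given only Fisher-vector products Fvp(v) = F @ v."""
    x = np.zeros_like(g)
    r = g.copy()                                    # residual g - F x, with x = 0
    p = g.copy()
    rr = r @ r
    for _ in range(iters):
        Fp = Fvp(p)
        alpha = rr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def trpo_step(theta_old, g, Fvp, delta):
    """Natural-gradient step: maximize g.(theta - theta_old)
    subject to 0.5 (theta - theta_old)^T F (theta - theta_old) <= delta."""
    x = conjugate_gradient(Fvp, g)                   # x ~= F^{-1} g
    step_size = np.sqrt(2.0 * delta / (x @ Fvp(x)))  # scale so the quadratic KL approximation hits delta
    return theta_old + step_size * x

# Tiny usage example with an explicit (illustrative) Fisher matrix:
F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -0.5])
print(trpo_step(np.zeros(2), g, lambda v: F @ v, delta=0.01))
```
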
<details>
<summary>Taylor Expansion of KL Term</summary>

Write $d=\theta-\theta_{old}$. Then

$$
D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\approx D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta_{old}})+d^\top \nabla_\theta D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\big\vert_{\theta=\theta_{old}}+\frac{1}{2}d^\top \nabla_\theta^2 D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\big\vert_{\theta=\theta_{old}}\,d
$$

The zeroth- and first-order terms vanish. Writing $P_\theta$ for the density of $\pi_\theta$:

$$
\begin{aligned}
\nabla_\theta D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\big\vert_{\theta=\theta_{old}}&=-\nabla_\theta \mathbb{E}_{x\sim \pi_{\theta_{old}}}\log P_\theta(x)\big\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\nabla_\theta \log P_\theta(x)\big\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\frac{1}{P_{\theta_{old}}(x)}\nabla_\theta P_\theta(x)\big\vert_{\theta=\theta_{old}}\\
&=-\int_x P_{\theta_{old}}(x)\frac{1}{P_{\theta_{old}}(x)}\nabla_\theta P_\theta(x)\big\vert_{\theta=\theta_{old}}\, dx\\
&=-\int_x \nabla_\theta P_\theta(x)\big\vert_{\theta=\theta_{old}}\, dx\\
&=-\nabla_\theta \int_x P_\theta(x)\, dx\,\Big\vert_{\theta=\theta_{old}}\\
&=0
\end{aligned}
$$

The Hessian is the Fisher information matrix of $\pi_{\theta_{old}}$:

$$
\begin{aligned}
\nabla_\theta^2 D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\big\vert_{\theta=\theta_{old}}&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\nabla_\theta^2 \log P_\theta(x)\big\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\nabla_\theta \left(\frac{\nabla_\theta P_\theta(x)}{P_\theta(x)}\right)\bigg\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left(\frac{\nabla_\theta^2 P_\theta(x)\,P_\theta(x)-\nabla_\theta P_\theta(x)\nabla_\theta P_\theta(x)^\top}{P_\theta(x)^2}\right)\bigg\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left(\frac{\nabla_\theta^2 P_\theta(x)\big\vert_{\theta=\theta_{old}}}{P_{\theta_{old}}(x)}\right)+\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left(\nabla_\theta \log P_\theta(x)\nabla_\theta \log P_\theta(x)^\top\right)\big\vert_{\theta=\theta_{old}}\\
&=\mathbb{E}_{x\sim \pi_{\theta_{old}}}\nabla_\theta\log P_\theta(x)\,\nabla_\theta\log P_\theta(x)^\top\big\vert_{\theta=\theta_{old}}
\end{aligned}
$$

The first term drops out because $\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left[\frac{\nabla_\theta^2 P_\theta(x)}{P_{\theta_{old}}(x)}\right]\big\vert_{\theta=\theta_{old}}=\nabla_\theta^2\int_x P_\theta(x)\,dx\,\big\vert_{\theta=\theta_{old}}=0$.

</details>

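As a numerical sanity check of this result (not from the lecture), one can compare a finite-difference Hessian of the KL with the outer-product (Fisher) form for a small softmax distribution:

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def kl(theta_old, theta):
    """D_KL( pi_{theta_old} || pi_theta ) for a categorical softmax distribution."""
    p, q = softmax(theta_old), softmax(theta)
    return np.sum(p * (np.log(p) - np.log(q)))

theta_old = np.array([0.2, -0.1, 0.4])
n, eps = len(theta_old), 1e-3

# Finite-difference Hessian of the KL with respect to theta, evaluated at theta = theta_old
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei, ej = np.eye(n)[i] * eps, np.eye(n)[j] * eps
        H[i, j] = (kl(theta_old, theta_old + ei + ej)
                   - kl(theta_old, theta_old + ei - ej)
                   - kl(theta_old, theta_old - ei + ej)
                   + kl(theta_old, theta_old - ei - ej)) / (4 * eps ** 2)

# Fisher information matrix: E_a[ grad log p(a) grad log p(a)^T ] for the softmax family
p = softmax(theta_old)
F = sum(p[a] * np.outer(np.eye(n)[a] - p, np.eye(n)[a] - p) for a in range(n))

print(np.max(np.abs(H - F)))   # should be tiny (finite-difference error only)
```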