updates

content/CSE510/CSE510_L13.md (new file, 220 lines added)
@@ -0,0 +1,220 @@
# CSE510 Deep Reinforcement Learning (Lecture 13)

> [!CAUTION]
>
> This lecture was very hard to follow for me: it was taught in a rushed 90 minutes and I have no background in optimization or Lagrangian duality. These notes may contain numerous mistakes, and it would be good to restart learning from this checkpoint.

## Problems with policy gradient methods

Data inefficiency

- On-policy method: for each new policy, we need to generate a completely new set of trajectories
- The data is thrown away after just one gradient update
- Since complex neural networks need many updates, this makes training very slow

Unstable updates: the step size is very important

- If the step size is too large:
  - Large step → bad policy
  - The next batch is generated from the current bad policy → bad samples are collected
  - Bad samples → even worse policy (contrast with supervised learning, where the correct labels and data in later batches can correct the model)
- If the step size is too small: learning is very slow

## Deriving the optimization objective function of TRPO

### Objective of Policy Gradient Methods

Policy objective:

$$
J(\pi_\theta)=\mathbb{E}_{\tau\sim \pi_\theta}\sum_{t=0}^{\infty} \gamma^t r_t
$$

Here $\tau$ denotes a trajectory.
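
As a quick sanity check (not from the lecture), the Monte-Carlo estimate of $J(\pi_\theta)$ is just the average discounted return over trajectories sampled from $\pi_\theta$. A minimal Python sketch, where each trajectory is represented only by its list of rewards (an assumption for illustration):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r_t for one sampled trajectory."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

def estimate_objective(trajectories, gamma=0.99):
    """J(pi_theta) ~= average discounted return of trajectories drawn from pi_theta."""
    return sum(discounted_return(tau, gamma) for tau in trajectories) / len(trajectories)
```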

The objective of a new policy can be written in terms of the old one:

$$
J(\pi_{\theta'})-J(\pi_{\theta})=\mathbb{E}_{\tau \sim \pi_{\theta'}}\sum_{t=0}^{\infty}\gamma^t A^{\pi_\theta}(s_t,a_t)
$$

Equivalently, writing $\pi'=\pi_{\theta'}$ and $\pi=\pi_\theta$ for succinctness:

$$
J(\pi')-J(\pi)=\mathbb{E}_{\tau\sim \pi'}\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t,a_t)
$$

<details>
<summary>Proof</summary>

Using $A^{\pi}(s_t,a_t)=R(s_t)+\gamma V^{\pi}(s_{t+1})-V^{\pi}(s_t)$ (in expectation over the next state),

$$
\begin{aligned}
&\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi}(s_t,a_t)\right]\\
&=\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^t R(s_t)+\sum_{t=0}^{\infty} \gamma^{t+1}V^{\pi}(s_{t+1})-\sum_{t=0}^{\infty} \gamma^{t}V^\pi(s_t)\right]\\
&=J(\pi')+\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=1}^{\infty} \gamma^{t}V^{\pi}(s_t)-\sum_{t=0}^{\infty} \gamma^{t}V^\pi(s_t)\right]\\
&=J(\pi')-\mathbb{E}_{\tau\sim\pi'}\left[V^{\pi}(s_0)\right]\\
&=J(\pi')-J(\pi)
\end{aligned}
$$

</details>

### Importance Sampling

Estimate an expectation under one distribution by sampling from another distribution:

$$
\begin{aligned}
\mathbb{E}_{x\sim p}[f(x)]&=\int f(x)p(x)dx\\
&=\int f(x)\frac{p(x)}{q(x)}q(x)dx\\
&=\mathbb{E}_{x\sim q}\left[f(x)\frac{p(x)}{q(x)}\right]\\
&\approx \frac{1}{N}\sum_{i=1}^N f(x^i)\frac{p(x^i)}{q(x^i)},\qquad x^i\sim q
\end{aligned}
$$
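
A small numerical sketch of this estimator (the choice of $p$, $q$, and $f$ below is arbitrary, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p = N(1, 1), proposal q = N(0, 2^2), test function f(x) = x^2.
def p(x): return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)
def q(x): return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))
def f(x): return x ** 2

x = rng.normal(loc=0.0, scale=2.0, size=100_000)  # samples from q, not from p
estimate = np.mean(f(x) * p(x) / q(x))            # (1/N) sum_i f(x^i) p(x^i)/q(x^i)
print(estimate)  # close to E_p[x^2] = 1^2 + 1 = 2
```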

### Estimating the objective with importance sampling

Discounted state visitation distribution:

$$
d^\pi(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^t P(s_t=s|\pi)
$$
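
One way to read this definition: $d^\pi$ is the distribution of the state reached if, at every step, we continue with probability $\gamma$ and stop with probability $1-\gamma$. A minimal sketch under an assumed gym-like environment interface (not lecture code):

```python
import random

def sample_state_from_d_pi(env, policy, gamma=0.99):
    """Draw one state s ~ d^pi by geometric stopping: stop with prob (1 - gamma)."""
    s = env.reset()
    while random.random() < gamma:       # continue with probability gamma
        a = policy(s)
        s, _, done, _ = env.step(a)      # assumed gym-style step() signature
        if done:
            s = env.reset()              # episodic simplification: restart
    return s
```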

$$
\begin{aligned}
J(\pi')-J(\pi)&=\mathbb{E}_{\tau\sim\pi'}\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t,a_t)\\
&=\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi'},\,a\sim \pi'}\left[A^{\pi}(s,a)\right]\\
&=\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi'},\,a\sim \pi}\left[\frac{\pi'(a|s)}{\pi(a|s)}A^{\pi}(s,a)\right]
\end{aligned}
$$

The states are still drawn from the new policy's distribution $d^{\pi'}$, which we cannot sample from before knowing $\pi'$. Approximating $d^{\pi'}$ by the old policy's state distribution $d^{\pi}$ (i.e., using the old policy to sample states for the policy we are trying to optimize), and dropping the constant factor $\frac{1}{1-\gamma}$, gives the surrogate objective:

$$
L_\pi(\pi')=\mathbb{E}_{s\sim d^{\pi},\,a\sim \pi}\left[\frac{\pi'(a|s)}{\pi(a|s)}A^{\pi}(s,a)\right]
$$
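
A minimal sketch of how this surrogate is estimated from a batch collected with the old policy (the array names and the use of log-probabilities are my own assumptions, not from the lecture):

```python
import numpy as np

def surrogate_loss(logp_new, logp_old, advantages):
    """Monte-Carlo estimate of L_pi(pi') over a batch of (s, a) from the old policy.

    logp_new:   log pi'(a|s) under the candidate policy, same states/actions
    logp_old:   log pi(a|s) under the policy that collected the data
    advantages: estimates of A^pi(s, a), e.g. from GAE
    """
    ratio = np.exp(logp_new - logp_old)   # pi'(a|s) / pi(a|s)
    return np.mean(ratio * advantages)
```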

### Lower bound of the objective

$$
J(\pi')-J(\pi)\geq L_\pi(\pi')-C\max_{s\in S}D_{KL}\left(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\right)
$$

where $C$ is a constant.

Instead of optimizing the objective function directly,

$$
\max_{\pi'}\; J(\pi')-J(\pi),
$$

we maximize its lower bound:

$$
\max_{\pi'}\; L_\pi(\pi')-C\max_{s\in S}D_{KL}\left(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\right)
$$
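
For intuition only, a sketch of how this penalized objective could be estimated for a discrete action space (the helper `kl_categorical`, the array layout, and the batch-wise max are my own illustration, not lecture code):

```python
import numpy as np

def kl_categorical(p_old, p_new):
    """KL(pi(.|s) || pi'(.|s)) for a single state with discrete actions."""
    return np.sum(p_old * (np.log(p_old) - np.log(p_new)))

def penalized_objective(surrogate, probs_old, probs_new, C):
    """L_pi(pi') - C * max_s KL, with the max taken over the states in the batch."""
    kls = [kl_categorical(po, pn) for po, pn in zip(probs_old, probs_new)]
    return surrogate - C * max(kls)
```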

### Monotonic Improvement Theorem

Proof of the improvement guarantee: suppose $\pi_{k+1}$ and $\pi_k$ are related by

$$
\pi_{k+1}=\arg\max_{\pi'}\; L_{\pi_k}(\pi')-C\max_{s\in S}D_{KL}\left(\pi_k(\cdot|s)\,\|\,\pi'(\cdot|s)\right)
$$

$\pi_{k}$ is a feasible point, and the objective evaluated at $\pi'=\pi_k$ equals $0$:

$$
L_{\pi_k}(\pi_{k})=\mathbb{E}_{s\sim d^{\pi_k},\,a\sim \pi_k}\left[A^{\pi_k}(s,a)\right]=0
$$

$$
D_{KL}\left(\pi_k(\cdot|s)\,\|\,\pi_k(\cdot|s)\right)=0\quad\text{for all } s
$$

So the optimal value of the objective is $\geq 0$.

By the performance lower bound, $J(\pi_{k+1})-J(\pi_k)\geq 0$.

### Final objective function

$$
\max_{\pi'}\;\mathbb{E}_{s\sim d^{\pi},\,a\sim \pi}\left[\frac{\pi'(a|s)}{\pi(a|s)}A^{\pi}(s,a)\right]-C\max_{s\in S}D_{KL}\left(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\right)
$$

Replacing the maximum over states with an expectation over visited states (an approximation):

$$
\max_{\pi'}\;\mathbb{E}_{s\sim d^{\pi},\,a\sim \pi}\left[\frac{\pi'(a|s)}{\pi(a|s)}A^{\pi}(s,a)\right]-C\,\mathbb{E}_{s\sim d^{\pi}}\,D_{KL}\left(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\right)
$$

By Lagrangian duality, this penalized objective is mathematically equivalent to the following problem with a trust-region constraint:

$$
\max_{\pi'}\; L_\pi(\pi')
$$

such that

$$
\mathbb{E}_{s\sim d^{\pi}}\,D_{KL}\left(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\right)\leq \delta
$$

The penalty coefficient $C$ becomes very large when $\gamma$ is close to one, so the corresponding gradient step size becomes too small:

$$
C\propto \frac{\epsilon \gamma}{(1-\gamma)^2}
$$

- Empirical results show that the penalty coefficient needs to be more adaptive
- But tuning $C$ is hard (it requires tricks, just like in PPO)
- TRPO instead uses the trust-region constraint and makes $\delta$ a tunable hyperparameter

## Trust Region Policy Optimization (TRPO)

$$
\max_{\pi'}\; L_\pi(\pi')
$$

such that

$$
\mathbb{E}_{s\sim d^{\pi}}\,D_{KL}\left(\pi(\cdot|s)\,\|\,\pi'(\cdot|s)\right)\leq \delta
$$

Make a linear approximation to $L_{\pi_{\theta_{old}}}$ and a quadratic approximation to the KL term.

Maximize $g\cdot(\theta-\theta_{old})-\frac{\beta}{2}(\theta-\theta_{old})^T F(\theta-\theta_{old})$,

where $g=\frac{\partial}{\partial \theta}L_{\pi_{\theta_{old}}}(\pi_{\theta})\big\vert_{\theta=\theta_{old}}$ and $F=\frac{\partial^2}{\partial \theta^2}\overline{KL}_{\pi_{\theta_{old}}}(\pi_{\theta})\big\vert_{\theta=\theta_{old}}$.
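
Maximizing this quadratic model gives the natural-gradient direction $F^{-1}g$, which TRPO computes with conjugate gradient (only Fisher-vector products $v\mapsto Fv$ are needed) and then rescales to satisfy the trust region. A minimal sketch; the `fvp` callback and the function names are my own assumptions, not lecture code:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given only Fisher-vector products fvp(v) = F v."""
    x = np.zeros_like(g)
    r, p = g.copy(), g.copy()      # residual r = g - F x (x = 0 initially)
    rr = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def trpo_step(theta_old, g, fvp, delta=0.01):
    """Natural-gradient step scaled so that (approximately) 0.5 d^T F d <= delta."""
    x = conjugate_gradient(fvp, g)                     # x ~= F^{-1} g
    step = np.sqrt(2.0 * delta / (x @ fvp(x) + 1e-8))  # scale to the KL budget
    return theta_old + step * x
```

In practice a backtracking line search on the surrogate and the KL constraint is usually added on top of this step.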

<details>
<summary>Taylor Expansion of KL Term</summary>

With $d=\theta-\theta_{old}$,

$$
D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\approx D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta_{old}})+d^T \nabla_\theta D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\big\vert_{\theta=\theta_{old}}+\frac{1}{2}d^T \nabla_\theta^2 D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\big\vert_{\theta=\theta_{old}}\, d
$$

The zeroth-order term is $0$, and so is the first-order term:

$$
\begin{aligned}
\nabla_\theta D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\big\vert_{\theta=\theta_{old}}&=-\nabla_\theta \mathbb{E}_{x\sim \pi_{\theta_{old}}}\log P_\theta(x)\big\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\nabla_\theta \log P_\theta(x)\big\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\frac{1}{P_{\theta_{old}}(x)}\nabla_\theta P_\theta(x)\big\vert_{\theta=\theta_{old}}\\
&=-\int_x P_{\theta_{old}}(x)\frac{1}{P_{\theta_{old}}(x)}\nabla_\theta P_\theta(x)\big\vert_{\theta=\theta_{old}}\, dx\\
&=-\int_x \nabla_\theta P_\theta(x)\big\vert_{\theta=\theta_{old}}\, dx\\
&=-\nabla_\theta \int_x P_\theta(x)\, dx\,\Big\vert_{\theta=\theta_{old}}\\
&=0
\end{aligned}
$$

The Hessian is the Fisher information matrix:

$$
\begin{aligned}
\nabla_\theta^2 D_{KL}(\pi_{\theta_{old}}\,\|\,\pi_{\theta})\big\vert_{\theta=\theta_{old}}&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\nabla_\theta^2 \log P_\theta(x)\big\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\nabla_\theta \left(\frac{\nabla_\theta P_\theta(x)}{P_\theta(x)}\right)\Big\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left(\frac{\nabla_\theta^2 P_\theta(x)}{P_\theta(x)}-\frac{\nabla_\theta P_\theta(x)\,\nabla_\theta P_\theta(x)^T}{P_\theta(x)^2}\right)\Big\vert_{\theta=\theta_{old}}\\
&=-\int_x \nabla_\theta^2 P_\theta(x)\big\vert_{\theta=\theta_{old}}\, dx+\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left[\nabla_\theta \log P_\theta(x)\,\nabla_\theta \log P_\theta(x)^T\right]\Big\vert_{\theta=\theta_{old}}\\
&=\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left[\nabla_\theta\log P_\theta(x)\,\nabla_\theta\log P_\theta(x)^T\right]\Big\vert_{\theta=\theta_{old}}
\end{aligned}
$$

(The first term vanishes because $\int_x \nabla_\theta^2 P_\theta(x)\,dx=\nabla_\theta^2\int_x P_\theta(x)\,dx=\nabla_\theta^2\,1=0$.)

</details>
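
The final identity, $F=\mathbb{E}\left[\nabla_\theta\log P_\theta(x)\,\nabla_\theta\log P_\theta(x)^T\right]$, is easy to check numerically. A small sketch for a categorical distribution parameterized by logits (purely illustrative, not lecture code), where the Fisher matrix is known in closed form to be $\operatorname{diag}(p)-pp^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=4)
probs = np.exp(logits) / np.exp(logits).sum()

# For a softmax over logits, grad_theta log P(x = i) = e_i - probs.
score = np.eye(4) - probs                 # row i = score vector for outcome i

# Fisher as the second moment of the score: E_x[ grad log P (grad log P)^T ].
F_score = sum(probs[i] * np.outer(score[i], score[i]) for i in range(4))

# Fisher computed directly for the softmax parameterization: diag(p) - p p^T.
F_direct = np.diag(probs) - np.outer(probs, probs)

print(np.allclose(F_score, F_direct))     # True
```
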
@@ -14,5 +14,6 @@ export default {
  CSE510_L9: "CSE510 Deep Reinforcement Learning (Lecture 9)",
  CSE510_L10: "CSE510 Deep Reinforcement Learning (Lecture 10)",
  CSE510_L11: "CSE510 Deep Reinforcement Learning (Lecture 11)",
  CSE510_L12: "CSE510 Deep Reinforcement Learning (Lecture 12)",
  CSE510_L13: "CSE510 Deep Reinforcement Learning (Lecture 13)"
}

content/Math4201/Exam_reviews/Math4201_E1.md (new file, 1 line added)
@@ -0,0 +1 @@
# Math 4201 Exam 1 review
@@ -3,6 +3,7 @@ export default {
  "---": {
    type: 'separator'
  },
  Exam_reviews: "Exam reviews",
  Math4201_L1: "Topology I (Lecture 1)",
  Math4201_L2: "Topology I (Lecture 2)",
  Math4201_L3: "Topology I (Lecture 3)",
@@ -20,4 +21,5 @@ export default {
  Math4201_L15: "Topology I (Lecture 15)",
  Math4201_L16: "Topology I (Lecture 16)",
  Math4201_L17: "Topology I (Lecture 17)",
  Math4201_L18: "Topology I (Lecture 18)"
}