# CSE510 Deep Reinforcement Learning (Lecture 15)

## Motivation

Policy gradient methods work with a stochastic policy:

$$
\pi_\theta(a|s) = P[a|s,\theta]
$$
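
As a concrete illustration (not from the lecture), a stochastic policy $\pi_\theta(a|s)$ over a discrete action space can be parameterized as a softmax over action logits. The network sizes and names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """Minimal sketch of pi_theta(a|s): a softmax policy over discrete actions."""

    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        # Small MLP producing one logit per action (sizes are arbitrary).
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, num_actions)
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        # Returns the distribution pi_theta(.|s); .sample() draws an action and
        # .log_prob(a) gives log pi_theta(a|s) for policy gradient estimates.
        return torch.distributions.Categorical(logits=self.net(state))

policy = CategoricalPolicy(state_dim=4, num_actions=2)
action = policy(torch.zeros(4)).sample()
```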

Advantages:

- Can potentially learn optimal solutions in multi-agent settings
- Can handle partially observable settings
- Provides sufficient exploration

Disadvantages:

- Cannot learn a deterministic policy
- Extension to continuous action spaces is not straightforward

### On-Policy vs. Off-Policy Policy Gradients

On-Policy Policy Gradients:

- Training samples are collected according to the current policy.

Off-Policy Algorithms:

- Enable the reuse of past experience.
- Samples can be collected by an exploratory behavior policy.

How can we design an off-policy policy gradient?

- By using importance sampling.

## Off-Policy Actor-Critic (OffPAC)

A stochastic behavior policy, denoted $\beta(a|s)$, is used for exploration and for collecting data.

The objective function is:

$$
\begin{aligned}
J(\theta)=\mathbb{E}_{s\sim d^\beta}[V^{\pi}(s)]
&= \sum_{s\in S} d^\beta(s) \sum_{a\in A} \pi_\theta(a|s) Q^{\pi}(s,a)
\end{aligned}
$$

$d^\beta(s)$ is the stationary distribution under the behavior policy $\beta(a|s)$.

### Solving the Off-Policy Policy Gradient

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{s\sim d^\beta}\left[\sum_{a\in A} \pi_\theta(a|s) Q^{\pi}(s,a)\right]\\
&= \mathbb{E}_{s\sim d^\beta}\left[\sum_{a\in A} \nabla_\theta \pi_\theta(a|s) Q^{\pi}(s,a)+\pi_\theta(a|s) \nabla_\theta Q^{\pi}(s,a)\right]\\
&\approx \mathbb{E}_{s\sim d^\beta}\left[\sum_{a\in A} \nabla_\theta \pi_\theta(a|s) Q^{\pi}(s,a)\right]\\
&= \mathbb{E}_{s\sim d^\beta}\left[\sum_{a\in A} \beta(a|s) \frac{1}{\beta(a|s)} \nabla_\theta \pi_\theta(a|s) Q^{\pi}(s,a)\right]\\
&= \mathbb{E}_{\beta}\left[\frac{1}{\beta(a|s)} \nabla_\theta \pi_\theta(a|s) Q^{\pi}(s,a)\right]\\
&= \mathbb{E}_{\beta}\left[\frac{\pi_\theta(a|s)}{\beta(a|s)} Q^{\pi}(s,a)\nabla_\theta \log \pi_\theta(a|s)\right]
\end{aligned}
$$

The third line drops the term $\pi_\theta(a|s)\nabla_\theta Q^{\pi}(s,a)$; this is the approximation made in OffPAC.

To compute the off-policy policy gradient, $Q^{\pi}(s,a)$ is estimated given data collected by $\beta$.
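
As a sketch, the importance-weighted gradient in the last line above can be implemented as a surrogate loss. The `policy` module (returning the action distribution $\pi_\theta(\cdot|s)$), the critic `q_hat` approximating $Q^{\pi}$, and the batch fields, including the stored behavior probabilities $\beta(a|s)$, are all illustrative assumptions.

```python
import torch

def off_policy_pg_loss(policy, q_hat, states, actions, behavior_probs):
    """Surrogate loss whose gradient matches
    E_beta[ (pi_theta(a|s)/beta(a|s)) * Q(s,a) * grad log pi_theta(a|s) ]."""
    dist = policy(states)                    # pi_theta(.|s) for the whole batch
    log_pi = dist.log_prob(actions)          # log pi_theta(a|s)
    with torch.no_grad():
        rho = log_pi.exp() / behavior_probs  # importance ratio pi_theta / beta
        q = q_hat(states, actions)           # critic estimate of Q^pi(s,a)
    # Negating turns gradient ascent on J(theta) into a loss to minimize.
    return -(rho * q * log_pi).mean()
```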

Common solutions:

- Importance sampling
- Tree backup
- Gradient temporal-difference learning
- Retrace [Munos et al., 2016]; see also [IMPALA](https://arxiv.org/abs/1802.01561)

### Importance Sampling

Assume that samples come in the form of episodes.

Let $M$ be the number of episodes containing $(s,a)$, and let $t_m$ be the first time $(s,a)$ appears in episode $m$.

The first-visit importance sampling estimator of $Q^{\pi}(s,a)$ is:

$$
Q^{IS}(s,a)\coloneqq \frac{1}{M}\sum_{m=1}^M R_m w_m
$$

$R_m$ is the return following $(s,a)$ in episode $m$:

$$
R_m\coloneqq r_{t_m +1}+\gamma r_{t_m +2}+\cdots+\gamma^{T_m-t_m -1} r_{T_m}
$$

$w_m$ is the importance sampling weight:

$$
w_m\coloneqq \frac{\pi(a_{t_m}|s_{t_m})}{\beta(a_{t_m}|s_{t_m})}\frac{\pi(a_{t_m+1}|s_{t_m+1})}{\beta(a_{t_m+1}|s_{t_m+1})}\cdots\frac{\pi(a_{T_m-1}|s_{T_m-1})}{\beta(a_{T_m-1}|s_{T_m-1})}
$$
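
A minimal NumPy sketch of this estimator follows. It assumes each episode is stored as a list of steps `(state, action, reward, pi_prob, beta_prob)`, where `reward` is the reward received after taking `action` and the two probabilities are $\pi(a_t|s_t)$ and $\beta(a_t|s_t)$; this data layout is an assumption for illustration.

```python
import numpy as np

def q_is(episodes, s, a, gamma=0.99):
    """First-visit importance sampling estimate of Q^pi(s, a)."""
    estimates = []
    for ep in episodes:
        # t_m: first time (s, a) appears in this episode; skip episodes without it.
        t_m = next((t for t, step in enumerate(ep)
                    if step[0] == s and step[1] == a), None)
        if t_m is None:
            continue
        tail = ep[t_m:]
        # R_m: discounted return following (s, a).
        R_m = sum(gamma**k * step[2] for k, step in enumerate(tail))
        # w_m: product of likelihood ratios pi/beta over the remaining actions.
        w_m = np.prod([step[3] / step[4] for step in tail])
        estimates.append(R_m * w_m)
    return float(np.mean(estimates)) if estimates else 0.0
```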

### Per-Decision Algorithm

Consider a single term $R_m w_m$ of the importance sampling estimator:

$$
R_m w_m=\sum_{i=t_m+1}^{T_m}\gamma^{i-t_m-1} r_i \frac{\pi(a_{t_m}|s_{t_m})}{\beta(a_{t_m}|s_{t_m})}\cdots \frac{\pi(a_{i-1}|s_{i-1})}{\beta(a_{i-1}|s_{i-1})}\frac{\pi(a_{i}|s_{i})}{\beta(a_{i}|s_{i})}\cdots \frac{\pi(a_{T_m-1}|s_{T_m-1})}{\beta(a_{T_m-1}|s_{T_m-1})}
$$

Intuitively, $r_i$ cannot depend on the actions taken at time $i$ or later, so their likelihood ratios can be dropped from the term for $r_i$.

This gives the per-decision importance sampling estimator:

$$
Q^{PD}(s,a)\coloneqq \frac{1}{M}\sum_{m=1}^M \sum_{k=1}^{T_m-t_m} \gamma^{k-1} r_{t_m+k}\prod_{i=t_m}^{t_m+k-1} \frac{\pi(a_{i}|s_{i})}{\beta(a_{i}|s_{i})}
$$
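
Using the same illustrative episode format as the $Q^{IS}$ sketch above (steps of `(state, action, reward, pi_prob, beta_prob)`), a minimal sketch of the per-decision estimator is:

```python
import numpy as np

def q_pd(episodes, s, a, gamma=0.99):
    """Per-decision importance sampling estimate of Q^pi(s, a)."""
    estimates = []
    for ep in episodes:
        t_m = next((t for t, step in enumerate(ep)
                    if step[0] == s and step[1] == a), None)
        if t_m is None:
            continue
        total, ratio = 0.0, 1.0
        for k, step in enumerate(ep[t_m:]):
            # The reward r_{t_m+k+1} only carries the ratios of the actions
            # taken up to and including time t_m + k.
            ratio *= step[3] / step[4]
            total += gamma**k * step[2] * ratio
        estimates.append(total)
    return float(np.mean(estimates)) if estimates else 0.0
```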

The per-decision importance sampling estimator is a consistent and unbiased estimator of $Q^{\pi}(s,a)$.

The proof is left as an exercise.

<details>
<summary>Hints</summary>

- Show that the expectation of $Q^{PD}(s,a)$ is the same as that of $Q^{IS}(s,a)$.
- $Q^{IS}(s,a)$ is a consistent and unbiased estimator of $Q^{\pi}(s,a)$.

</details>

## Deterministic Policy Gradient (DPG)

The objective function is:

$$
J(\theta)=\int_{s\in S} \rho^{\mu}(s)\, r(s,\mu_\theta(s))\, ds
$$

where $\rho^{\mu}(s)$ is the stationary state distribution under the policy $\mu_\theta(s)$.

The proof follows the same lines as the standard policy gradient theorem:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\mu_\theta}[\nabla_\theta Q^{\mu_\theta}(s,a)]=\mathbb{E}_{s\sim \rho^{\mu}}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}\right]
$$
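
In code, this gradient is usually obtained by letting automatic differentiation apply the chain rule to $-Q(s,\mu_\theta(s))$. A minimal sketch follows; `actor`, `critic`, and `states` are illustrative assumptions, not part of the lecture.

```python
import torch

def dpg_actor_loss(actor, critic, states):
    """Loss whose gradient w.r.t. the actor parameters is the (negated)
    deterministic policy gradient grad_theta mu(s) * grad_a Q(s, a)|_{a=mu(s)}."""
    actions = actor(states)  # a = mu_theta(s), differentiable in theta
    # Maximizing Q(s, mu_theta(s)) is equivalent to minimizing its negative mean.
    return -critic(states, actions).mean()
```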

### Issues for DPG

The formulations up to now can only use on-policy data.

## Deep Deterministic Policy Gradient (DDPG)