updates
This commit is contained in:
150
content/CSE510/CSE510_L16.md
Normal file
@@ -0,0 +1,150 @@
# CSE510 Deep Reinforcement Learning (Lecture 16)

## Deterministic Policy Gradient (DPG)

### Learning Deterministic Policies

- Deterministic policy gradients [Silver et al., ICML 2014]
  - Explicitly learn a deterministic policy.
  - $a = \mu_\theta(s)$
- Advantages
  - An optimal deterministic policy always exists for an MDP.
  - Naturally handles continuous action spaces.
  - Expected to be more efficient than learning stochastic policies
    - Computing the stochastic gradient requires more samples, as it integrates over both the state and action spaces.
    - The deterministic gradient is preferable, as it integrates over the state space only.

### Deterministic Policy Gradient

The objective function is:

$$
J(\theta)=\int_{s\in S} \rho^{\mu}(s) r(s,\mu_\theta(s)) ds
$$

where $\rho^{\mu}(s)$ is the (discounted) state distribution induced by the deterministic policy $\mu_\theta$.

The deterministic policy gradient theorem (the deterministic analogue of the standard policy gradient theorem) applies the chain rule through the action $a=\mu_\theta(s)$:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{s\sim \rho^{\mu}}[\nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}]
$$
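
The chain-rule term inside the expectation can be checked numerically for a single state. The sketch below is only an illustration: the linear policy `mu`, the quadratic critic `q`, and the chosen numbers are assumptions, not part of the lecture. It compares $\nabla_\theta \mu_\theta(s)\,\nabla_a Q(s,a)\vert_{a=\mu_\theta(s)}$ against a finite-difference gradient of $\theta \mapsto Q(s,\mu_\theta(s))$ with the critic held fixed.

```python
import numpy as np

def mu(theta, s):
    """Hypothetical linear deterministic policy: a = theta^T s (scalar action)."""
    return float(theta @ s)

def q(s, a):
    """Hypothetical fixed critic, quadratic in the action with its peak at a = sum(s)."""
    return -(a - s.sum()) ** 2

def dq_da(s, a):
    """Analytic gradient of the critic with respect to the action."""
    return -2.0 * (a - s.sum())

def dpg_sample(theta, s):
    """One-state DPG term: grad_theta mu(s) * grad_a Q(s, a) evaluated at a = mu(s)."""
    grad_theta_mu = s  # d(theta^T s) / d theta = s
    return grad_theta_mu * dq_da(s, mu(theta, s))

def finite_difference(theta, s, eps=1e-6):
    """Central finite difference of theta -> Q(s, mu_theta(s)), critic held fixed."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (q(s, mu(theta + step, s)) - q(s, mu(theta - step, s))) / (2 * eps)
    return grad

theta = np.array([0.5, -0.2])
s = np.array([1.0, 2.0])
analytic = dpg_sample(theta, s)        # chain-rule form
numeric = finite_difference(theta, s)  # brute-force check
```

In a full algorithm this per-state term would be averaged over states drawn from $\rho^{\mu}$, and the fixed critic would be replaced by a learned one.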

#### Issues for DPG

The formulation so far can only use on-policy data.

A deterministic policy can hardly guarantee sufficient exploration.

- Solution: Off-policy training using a stochastic behavior policy.

#### Off-Policy Deterministic Policy Gradient (Off-DPG)

Use a stochastic behavior policy $\beta(a|s)$ to collect data. The modified objective averages the value of the target policy $\mu_\theta$ over the state distribution of $\beta$:

$$
J(\mu_\theta)=\int_{s\in S} \rho^{\beta}(s) Q^{\mu_\theta}(s,\mu_\theta(s)) ds
$$

The gradient is:

$$
\begin{aligned}
\nabla_\theta J(\mu_\theta) &\approx \int_{s\in S} \rho^{\beta}(s) \nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)} ds\\
&= \mathbb{E}_{s\sim \rho^{\beta}}[\nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}]
\end{aligned}
$$

Importance sampling is avoided in the actor because the deterministic policy removes the integral over actions.
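
The approximation sign in the off-policy gradient can be unpacked by differentiating the objective directly. Assuming the objective evaluates the critic at the target policy's action, $J(\mu_\theta)=\int_{s\in S} \rho^{\beta}(s) Q^{\mu_\theta}(s,\mu_\theta(s)) ds$ as in Silver et al. (2014), and noting that $\rho^{\beta}$ does not depend on $\theta$, the total derivative has two terms. The off-policy DPG keeps only the first and drops the second, which would require the gradient of the action-value function with respect to the policy parameters:

$$
\begin{aligned}
\nabla_\theta J(\mu_\theta) &= \int_{s\in S} \rho^{\beta}(s) \Big[\nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)} + \nabla_\theta Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}\Big] ds\\
&\approx \int_{s\in S} \rho^{\beta}(s) \nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)} ds
\end{aligned}
$$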

#### Policy Evaluation in DPG

Importance sampling can also be avoided in the critic: the TD target bootstraps from the action the target policy would take, $\mu_\theta(s_{t+1})$, rather than from the action sampled by the behavior policy.

A gradient-TD-like algorithm can be applied directly to the critic.

$$
\mathcal{L}_{critic}(w) = \mathbb{E}\left[\left(r_t+\gamma Q^w(s_{t+1},\mu_\theta(s_{t+1}))-Q^w(s_t,a_t)\right)^2\right]
$$

#### Off-Policy Deterministic Actor-Critic

$$
\delta_t=r_t+\gamma Q^w(s_{t+1},\mu_\theta(s_{t+1}))-Q^w(s_t,a_t)
$$

$$
w_{t+1} = w_t + \alpha_w \delta_t \nabla_w Q^w(s_t,a_t)
$$

$$
\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \mu_\theta(s_t) \nabla_a Q^w(s_t,a)\vert_{a=\mu_\theta(s_t)}
$$
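
The three update rules above can be sketched for a scalar state and action. Everything concrete here is an assumption made for illustration: the quadratic feature map `features`, the linear policy $\mu_\theta(s)=\theta s$, the step sizes, and the sample transition. The TD error bootstraps from the target policy's action at the next state, so the behavior action `a` (for example, $\mu_\theta(s)$ plus noise) needs no importance weight.

```python
import numpy as np

def features(s, a):
    """Hypothetical quadratic features, so that Q^w(s, a) = w . phi(s, a)."""
    return np.array([1.0, s, a, s * a, a * a])

def q_value(w, s, a):
    """Linear-in-features critic."""
    return float(w @ features(s, a))

def dq_da(w, s, a):
    """Gradient of the critic with respect to the action."""
    return w[2] + w[3] * s + 2.0 * w[4] * a

def opdac_step(w, theta, s, a, r, s_next, gamma=0.99, alpha_w=0.1, alpha_theta=0.01):
    """One off-policy deterministic actor-critic update for mu_theta(s) = theta * s."""
    a_next = theta * s_next  # action of the target policy, not of the behavior policy
    delta = r + gamma * q_value(w, s_next, a_next) - q_value(w, s, a)
    w_new = w + alpha_w * delta * features(s, a)  # grad_w Q^w(s, a) = phi(s, a)
    a_pi = theta * s
    # Actor step uses the critic before its update; grad_theta mu_theta(s) = s.
    theta_new = theta + alpha_theta * s * dq_da(w, s, a_pi)
    return w_new, theta_new, delta

w = np.zeros(5)
w[4] = -1.0  # start from Q^w(s, a) = -a^2
theta = 0.5
w_new, theta_new, delta = opdac_step(w, theta, s=1.0, a=2.0, r=1.0, s_next=0.0)
```

With this starting critic, the actor step moves $\theta$ toward zero, since $-a^2$ is maximized at $a=0$.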

### Deep Deterministic Policy Gradient (DDPG)

Insights from DQN + Deterministic Policy Gradients

- Use a replay buffer
- Critic is updated every timestep (minibatch sampled from the buffer), bootstrapping from the target networks $Q^{w'}$ and $\mu_{\theta'}$:

$$
\mathcal{L}_{critic}(w) = \mathbb{E}\left[\left(r_t+\gamma Q^{w'}(s_{t+1},\mu_{\theta'}(s_{t+1}))-Q^w(s_t,a_t)\right)^2\right]
$$

Actor is updated every timestep along the gradient:

$$
\nabla_a Q(s_t,a;w)\vert_{a=\mu_\theta(s_t)} \nabla_\theta \mu_\theta(s_t)
$$

Target networks $w'$ and $\theta'$ are smoothly updated at every timestep (soft update with a small $\tau$):

$$
w' \leftarrow \tau w + (1-\tau) w'
$$

$$
\theta' \leftarrow \tau \theta + (1-\tau) \theta'
$$

Exploration: add noise to the action selection: $a_t = \mu_\theta(s_t) + \mathcal{N}_t$

Batch normalization is used when training the networks.
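
The bookkeeping pieces of DDPG (soft target update, noisy action selection, minibatch sampling from a replay buffer) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the target parameters track the online parameters as `target <- tau * online + (1 - tau) * target` with a small `tau`; Gaussian noise is used for simplicity, although other noise processes are possible; and the action bounds, helper names, and toy transitions are hypothetical.

```python
import numpy as np

def soft_update(target, online, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return tau * online + (1.0 - tau) * target

def noisy_action(mu_s, rng, sigma=0.1, low=-1.0, high=1.0):
    """Add Gaussian exploration noise to the deterministic action, then clip to assumed bounds."""
    noise = sigma * rng.standard_normal(np.shape(mu_s))
    return np.clip(mu_s + noise, low, high)

def sample_minibatch(buffer, batch_size, rng):
    """Uniformly sample transitions (with replacement) from a list-based replay buffer."""
    idx = rng.integers(0, len(buffer), size=batch_size)
    return [buffer[i] for i in idx]

rng = np.random.default_rng(0)
w_target = soft_update(np.zeros(3), np.ones(3), tau=0.1)
a = noisy_action(np.array([0.95, -0.95]), rng, sigma=0.5)
# Hypothetical transitions stored as (s, a, r, s_next).
buffer = [(0, 0.0, 1.0, 1), (1, 0.5, 0.0, 2), (2, -0.5, 0.5, 3)]
batch = sample_minibatch(buffer, batch_size=2, rng=rng)
```

Keeping `tau` small makes the bootstrapped targets change slowly, which is the stabilizing idea borrowed from DQN's target network.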

### Extension of DDPG

Overestimation bias is an issue of Q-learning in which the maximization of a noisy value estimate induces a consistent overestimation. DDPG is affected as well, because the actor is updated to maximize the critic's noisy estimate:

$$
\text{DDPG: }\nabla_\theta J(\theta) = \mathbb{E}_{s\sim \rho^{\mu}}[\nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}]
$$

#### Double DQN is not enough

Because of the slow-changing policy in an actor-critic setting:

- The current and target value estimates remain too similar to avoid maximization bias.
- Target value of Double DQN: $r_t + \gamma Q^{w'}(s_{t+1},\mu_\theta(s_{t+1}))$

#### TD3: Twin Delayed Deep Deterministic policy gradient

Address overestimation bias:

- Double Q-learning is unbiased in tabular settings, but still overestimates slightly with function approximation.

With two actors $\pi_{\phi_1}, \pi_{\phi_2}$ and two critics $Q^{\theta_1}, Q^{\theta_2}$ (here $\theta$ denotes critic parameters, $\phi$ actor parameters, and primes target networks), the Double Q-learning targets are:

$$
y_1 = r + \gamma Q^{\theta_2'}(s', \pi_{\phi_1}(s'))
$$

$$
y_2 = r + \gamma Q^{\theta_1'}(s', \pi_{\phi_2}(s'))
$$

It is possible that $Q^{\theta_2}(s, \pi_{\phi_1}(s)) > Q^{\theta_1}(s, \pi_{\phi_1}(s))$, in which case using $Q^{\theta_2}$ in the target exaggerates the overestimation further.

Clipped double Q-learning takes the minimum of the two target critics:

$$
y_1 = r + \gamma \min_{i=1,2} Q^{\theta_i'}(s', \pi_{\phi_1}(s'))
$$
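
The clipped target is a one-line computation once both target critics have been evaluated at the next state. In the sketch below, `q1_next` and `q2_next` stand for those two evaluations at the same next action; the terminal mask `done`, the function name, and the sample numbers are assumptions added for illustration and do not appear in the formula above.

```python
import numpy as np

def clipped_double_q_target(r, done, q1_next, q2_next, gamma=0.99):
    """y = r + gamma * (1 - done) * min(Q1'(s', a'), Q2'(s', a'))."""
    return r + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)

r = np.array([1.0, 0.0, 2.0])
done = np.array([0.0, 0.0, 1.0])       # assumed terminal flags; last transition ends the episode
q1_next = np.array([10.0, 4.0, 7.0])   # first target critic at (s', a')
q2_next = np.array([8.0, 5.0, 9.0])    # second target critic at (s', a')
y = clipped_double_q_target(r, done, q1_next, q2_next, gamma=0.9)
```

Taking the element-wise minimum means that whichever critic is more optimistic for a given transition is ignored, which biases the target toward underestimation rather than overestimation.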
@@ -18,4 +18,5 @@ export default {
  CSE510_L13: "CSE510 Deep Reinforcement Learning (Lecture 13)",
  CSE510_L14: "CSE510 Deep Reinforcement Learning (Lecture 14)",
  CSE510_L15: "CSE510 Deep Reinforcement Learning (Lecture 15)",
  CSE510_L16: "CSE510 Deep Reinforcement Learning (Lecture 16)",
}