$$
y_1 = r + \gamma \min_{i=1,2} Q^{\theta_i'}(s', \pi_{\phi_1}(s'))
$$
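To make the clipped double-Q target concrete, here is a minimal NumPy sketch; `q1_target`, `q2_target`, and `actor_target` are hypothetical stand-ins for the target networks, not code from these notes.

```python
import numpy as np

# Hypothetical stand-ins for the target critics Q^{theta'_1}, Q^{theta'_2}
# and the target actor pi_{phi_1}; in practice these are learned networks.
def q1_target(s, a):
    return -np.sum((a - 0.1 * s) ** 2, axis=-1)

def q2_target(s, a):
    return -np.sum((a - 0.2 * s) ** 2, axis=-1)

def actor_target(s):
    return np.tanh(0.5 * s)

def clipped_double_q_target(r, s_next, gamma=0.99):
    """y_1 = r + gamma * min_{i=1,2} Q^{theta'_i}(s', pi_{phi_1}(s'))."""
    a_next = actor_target(s_next)                  # pi_{phi_1}(s')
    q_min = np.minimum(q1_target(s_next, a_next),  # take the smaller of the
                       q2_target(s_next, a_next))  # two target-critic estimates
    return r + gamma * q_min
```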
High-variance estimates provide a noisy gradient.
Techniques in TD3 to reduce the variance:
- Update the policy at a lower frequency than the value network (delayed policy updates; see the sketch after this block).
- Smooth the value estimate by adding noise to the target action:
$$
y=r+\gamma \mathbb{E}_{\epsilon}[Q^{\theta'}(s', \pi_{\phi'}(s')+\epsilon)]
$$
In practice, the expectation is approximated with a single sample of the noise, giving the update target

$$
y = r + \gamma Q^{\theta'}(s', \pi_{\phi'}(s') + \epsilon)
$$
where $\epsilon\sim \operatorname{clip}(\mathcal{N}(0, \sigma), -c, c)$.
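A minimal sketch of how these two techniques fit together follows; the values `sigma=0.2`, `c=0.5`, and `policy_delay=2` are illustrative, and `update_critics`, `update_actor`, and `update_target_networks` are hypothetical placeholders rather than code from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_target_action(a_next, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0):
    """Target policy smoothing: pi_{phi'}(s') + eps with eps ~ clip(N(0, sigma), -c, c)."""
    eps = np.clip(rng.normal(0.0, sigma, size=a_next.shape), -c, c)
    return np.clip(a_next + eps, a_low, a_high)  # keep the perturbed action in its valid range

# Hypothetical placeholders for the actual gradient steps.
def update_critics():
    pass  # regress both critics toward the smoothed target y

def update_actor():
    pass  # deterministic policy gradient step on the actor

def update_target_networks():
    pass  # Polyak averaging of the target parameters theta', phi'

# Delayed policy updates: the actor and the target networks are refreshed
# only once every `policy_delay` critic updates.
policy_delay = 2
for step in range(1000):  # the number of training steps is illustrative
    update_critics()
    if step % policy_delay == 0:
        update_actor()
        update_target_networks()
```

The smoothed action is fed to the target critics exactly as in the clipped double-Q sketch above.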
#### Other methods
- Generalizable Episodic Memory for Deep Reinforcement Learning
- Distributed Distributional Deep Deterministic Policy Gradient (D4PG)
  - Distributional critic
  - N-step returns are used to update the critic (see the sketch after this list)
  - Multiple distributed parallel actors
  - Prioritized experience replay
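Ignoring the distributional aspect of the critic, here is a minimal sketch of the scalar n-step target used to update it; the names `rewards` and `bootstrap_value` are illustrative, not from these notes.

```python
import numpy as np

def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """y_t = sum_{k=0}^{n-1} gamma^k * r_{t+k} + gamma^n * Q'(s_{t+n}, pi'(s_{t+n})).

    `rewards` holds r_t, ..., r_{t+n-1}; `bootstrap_value` is the target-critic
    estimate at the state reached after n steps.
    """
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(np.dot(discounts, rewards) + gamma ** n * bootstrap_value)

# Example: a 3-step return with rewards [1.0, 0.5, 0.0] and a bootstrap value of 2.0.
y = n_step_return(np.array([1.0, 0.5, 0.0]), bootstrap_value=2.0)
```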