$$
y_1 = r + \gamma \min_{i=1,2} Q^{\theta_i'}(s', \pi_{\phi_1}(s'))
$$
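To make the clipped double-Q target concrete, here is a minimal NumPy sketch; `q1_target`, `q2_target`, and `actor_target` are hypothetical stand-ins for the target networks, not code from these notes.

```python
import numpy as np

# Hypothetical stand-ins for the target critics Q^{theta'_1}, Q^{theta'_2}
# and the target actor pi_{phi_1}; in practice these are learned networks.
def q1_target(s, a):
    return -np.sum((a - 0.1 * s) ** 2, axis=-1)

def q2_target(s, a):
    return -np.sum((a - 0.2 * s) ** 2, axis=-1)

def actor_target(s):
    return np.tanh(0.5 * s)

def clipped_double_q_target(r, s_next, gamma=0.99):
    """y_1 = r + gamma * min_{i=1,2} Q^{theta'_i}(s', pi_{phi_1}(s'))."""
    a_next = actor_target(s_next)                  # pi_{phi_1}(s')
    q_min = np.minimum(q1_target(s_next, a_next),  # take the smaller of the
                       q2_target(s_next, a_next))  # two target-critic estimates
    return r + gamma * q_min
```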
High-variance estimates provide a noisy gradient.
Techniques in TD3 to reduce the variance:
- Update the policy at a lower frequency than the value network (delayed policy updates; see the sketch after this block).
- Smooth the value estimate by adding noise to the target action:
$$
y=r+\gamma \mathbb{E}_{\epsilon}[Q^{\theta'}(s', \pi_{\phi'}(s')+\epsilon)]
$$
In practice, the expectation is approximated with a single sample of the noise, giving the update target

$$
y = r + \gamma Q^{\theta'}(s', \pi_{\phi'}(s') + \epsilon)
$$
where $\epsilon\sim \operatorname{clip}(\mathcal{N}(0, \sigma), -c, c)$.
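A minimal sketch of how these two techniques fit together follows; the values `sigma=0.2`, `c=0.5`, and `policy_delay=2` are illustrative, and `update_critics`, `update_actor`, and `update_target_networks` are hypothetical placeholders rather than code from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_target_action(a_next, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0):
    """Target policy smoothing: pi_{phi'}(s') + eps with eps ~ clip(N(0, sigma), -c, c)."""
    eps = np.clip(rng.normal(0.0, sigma, size=a_next.shape), -c, c)
    return np.clip(a_next + eps, a_low, a_high)  # keep the perturbed action in its valid range

# Hypothetical placeholders for the actual gradient steps.
def update_critics():
    pass  # regress both critics toward the smoothed target y

def update_actor():
    pass  # deterministic policy gradient step on the actor

def update_target_networks():
    pass  # Polyak averaging of the target parameters theta', phi'

# Delayed policy updates: the actor and the target networks are refreshed
# only once every `policy_delay` critic updates.
policy_delay = 2
for step in range(1000):  # the number of training steps is illustrative
    update_critics()
    if step % policy_delay == 0:
        update_actor()
        update_target_networks()
```

The smoothed action is fed to the target critics exactly as in the clipped double-Q sketch above.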
#### Other methods
- Generalizable Episodic Memory for Deep Reinforcement Learning
- Distributed Distributional Deep Deterministic Policy Gradient (D4PG)
  - Distributional critic
  - N-step returns are used to update the critic (see the sketch after this list)
  - Multiple distributed parallel actors
  - Prioritized experience replay
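Ignoring the distributional aspect of the critic, here is a minimal sketch of the scalar n-step target used to update it; the names `rewards` and `bootstrap_value` are illustrative, not from these notes.

```python
import numpy as np

def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """y_t = sum_{k=0}^{n-1} gamma^k * r_{t+k} + gamma^n * Q'(s_{t+n}, pi'(s_{t+n})).

    `rewards` holds r_t, ..., r_{t+n-1}; `bootstrap_value` is the target-critic
    estimate at the state reached after n steps.
    """
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(np.dot(discounts, rewards) + gamma ** n * bootstrap_value)

# Example: a 3-step return with rewards [1.0, 0.5, 0.0] and a bootstrap value of 2.0.
y = n_step_return(np.array([1.0, 0.5, 0.0]), bootstrap_value=2.0)
```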