updates

2025-10-21 11:10:14 -05:00
parent 2d5527303d
commit 9f7d99b745
6 changed files with 183 additions and 0 deletions
--- a/content/CSE510/CSE510_L16.md
+++ b/content/CSE510/CSE510_L16.md
@@ -0,0 +1,150 @@
+# CSE510 Deep Reinforcement Learning (Lecture 16)
+
+## Deterministic Policy Gradient (DPG)
+
+### Learning Deterministic Policies
+
+- Deterministic policy gradients [Silver et al., ICML 2014]
+  - Explicitly learn a deterministic policy.
+  - $a = \mu_\theta(s)$
+- Advantages
+  - An optimal deterministic policy always exists for MDPs
+  - Handles continuous action spaces naturally
+  - Expected to be more efficient than learning stochastic policies
+    - The stochastic policy gradient requires more samples, as it integrates over both the state and action spaces.
+    - The deterministic gradient is preferable, as it integrates over the state space only.
+
+### Deterministic Policy Gradient
+
+The objective function is:
+
+$$
+J(\theta)=\int_{s\in S} \rho^{\mu}(s) r(s,\mu_\theta(s)) ds
+$$
+
+where $\rho^{\mu}(s)$ is the state distribution induced by the deterministic policy $\mu_\theta$.
+
+The policy gradient, from the deterministic policy gradient theorem, is:
+
+$$
+\nabla_\theta J(\theta) = \mathbb{E}_{s\sim \rho^{\mu}}[\nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}]
+$$
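+
+As an illustrative sketch (the scalar policy and the fixed critic below are assumptions chosen for simplicity, not from the lecture), take $\mu_\theta(s)=\theta s$ and suppose the action-value function is held fixed at $Q(s,a)=-(a-2s)^2$. The two factors of the gradient are then:
+
+$$
+\begin{aligned}
+\nabla_\theta \mu_\theta(s) &= s\\
+\nabla_a Q(s,a)\vert_{a=\mu_\theta(s)} &= -2(\theta s-2s) = -2s(\theta-2)\\
+\nabla_\theta \mu_\theta(s)\,\nabla_a Q(s,a)\vert_{a=\mu_\theta(s)} &= -2s^2(\theta-2)
+\end{aligned}
+$$
+
+The product is positive when $\theta<2$ and negative when $\theta>2$, so gradient ascent moves $\theta$ toward $2$, the value at which $\mu_\theta(s)=2s$ maximizes this $Q$ in every state.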
+
+#### Issues for DPG
+
+The formulations up to now can only use on-policy data.
+
+A deterministic policy can hardly guarantee sufficient exploration.
+
+- Solution: Off-policy training using a stochastic behavior policy.
+
+#### Off-Policy Deterministic Policy Gradient (Off-DPG)
+
+Use a stochastic behavior policy $\beta(a|s)$. The modified objective function is:
+
+$$
+J(\mu_\theta)=\int_{s\in S} \rho^{\beta}(s) Q^{\mu_\theta}(s,\mu_\theta(s)) ds
+$$
+
+The gradients are:
+
+$$
+\begin{aligned}
+\nabla_\theta J(\mu_\theta) &\approx \int_{s\in S} \rho^{\beta}(s) \nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)} ds\\
+&= \mathbb{E}_{s\sim \rho^{\beta}}[\nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}]
+\end{aligned}
+$$
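+
+The $\approx$ comes from the chain rule. A sketch of the step, following the argument in Silver et al.: differentiating $Q^{\mu_\theta}(s,\mu_\theta(s))$ with respect to $\theta$ gives two terms, because $\theta$ enters both through the action and through the value function itself:
+
+$$
+\nabla_\theta \left[Q^{\mu_\theta}(s,\mu_\theta(s))\right] = \nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)} + \nabla_\theta Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}
+$$
+
+The off-policy gradient keeps the first term and drops the second, which depends on how the action-value function itself changes with $\theta$ and is hard to estimate.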
+
+Importance sampling is avoided in the actor because the gradient contains no integral over actions.
+
+#### Policy Evaluation in DPG
+
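+Importance sampling can also be avoided in the critic, because the TD target bootstraps from the action chosen by the deterministic policy, $a_{t+1}=\mu_\theta(s_{t+1})$, rather than from an action sampled from $\beta$.
+
+Gradient TD-like algorithms can be applied directly to the critic:
+
+$$
+\mathcal{L}_{critic}(w) = \mathbb{E}\left[\left(r_t+\gamma Q^w(s_{t+1},a_{t+1})-Q^w(s_t,a_t)\right)^2\right]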
+$$
+
+#### Off-Policy Deterministic Actor-Critic
+
+$$
+\delta_t=r_t+\gamma Q^w(s_{t+1},a_{t+1})-Q^w(s_t,a_t)
+$$
+
+$$
+w_{t+1} = w_t + \alpha_w \delta_t \nabla_w Q^w(s_t,a_t)
+$$
+
+$$
+\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \mu_\theta(s_t) \nabla_a Q^w(s_t,a)\vert_{a=\mu_\theta(s_t)}
+$$
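+
+A worked single step may make the three updates concrete. All numbers and the linear forms below are assumptions for illustration: $\mu_\theta(s)=\theta s$, $Q^w(s,a)=w_s s+w_a a$, $\theta_t=0.5$, $w_t=(w_s,w_a)=(0.2,0.4)$, $\gamma=0.9$, $\alpha_w=\alpha_\theta=0.1$, and a transition $s_t=1$, $a_t=0.5$, $r_t=1$, $s_{t+1}=2$ with $a_{t+1}=\mu_\theta(s_{t+1})=1$.
+
+$$
+\begin{aligned}
+\delta_t &= 1 + 0.9\,(0.2\cdot 2 + 0.4\cdot 1) - (0.2\cdot 1 + 0.4\cdot 0.5) = 1 + 0.72 - 0.4 = 1.32\\
+w_{t+1} &= (0.2,\,0.4) + 0.1\cdot 1.32\cdot(1,\,0.5) = (0.332,\,0.466)\\
+\theta_{t+1} &= 0.5 + 0.1\cdot s_t\cdot w_a = 0.5 + 0.1\cdot 1\cdot 0.4 = 0.54
+\end{aligned}
+$$
+
+Here $\nabla_w Q^w(s_t,a_t)=(s_t,a_t)$, $\nabla_\theta\mu_\theta(s_t)=s_t$, and $\nabla_a Q^w=w_a$, evaluated with the pre-update weights $w_t$.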
+
+### Deep Deterministic Policy Gradient (DDPG)
+
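+DDPG combines insights from DQN with deterministic policy gradients:
+
+- Use a replay buffer.
+- The critic is updated every timestep on a minibatch sampled from the buffer, with the TD target computed by slowly updated target networks $Q^{w'}$ and $\mu_{\theta'}$:
+
+$$
+\mathcal{L}_{critic}(w) = \mathbb{E}\left[\left(r_t+\gamma Q^{w'}(s_{t+1},\mu_{\theta'}(s_{t+1}))-Q^w(s_t,a_t)\right)^2\right]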
+$$
+
+The actor is updated every timestep, following the gradient:
+
+$$
+\nabla_a Q(s_t,a;w)|_{a=\mu_\theta(s_t)} \nabla_\theta \mu_\theta(s_t)
+$$
+
+The target networks (parameters $w'$ and $\theta'$) are smoothed with a soft update at every timestep, with $\tau \ll 1$:
+
+$$
+w' \leftarrow \tau w + (1-\tau) w'
+$$
+
+$$
+\theta' \leftarrow \tau \theta + (1-\tau) \theta'
+$$
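+
+To see how slowly the targets move, a short sketch under assumptions: write $w$ for the online critic weights and $w'$ for the target weights (notation assumed here), and hold $w$ fixed. Repeating the soft update $w' \leftarrow \tau w + (1-\tau) w'$ for $k$ steps from an initial $w'_0$ gives
+
+$$
+w'_k - w = (1-\tau)^k\,(w'_0 - w)
+$$
+
+so the gap to the online weights shrinks geometrically. With an illustrative $\tau=0.001$, halving the gap takes $k=\ln 0.5/\ln 0.999\approx 693$ updates.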
+
+Exploration: add noise to the action selection: $a_t = \mu_\theta(s_t) + \mathcal{N}_t$
+
+Batch normalization is used when training the networks.
+
+### Extension of DDPG
+
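+Overestimation bias is an issue of Q-learning in which the maximization of a noisy value estimate induces a consistent overestimation. The same issue carries over to DDPG, because the actor is updated along the gradient of the (possibly overestimated) critic:
+
+$$
+\text{DDPG: }\nabla_\theta J(\theta) = \mathbb{E}_{s\sim \rho^{\mu}}[\nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}]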
+$$
+
+#### Double DQN is not enough
+
+Because the policy changes slowly in an actor-critic setting,
+
+- the current and target value estimates remain too similar to avoid maximization bias.
+- Target value of Double DQN: $r_t + \gamma Q^{w'}(s_{t+1},\mu_\theta(s_{t+1}))$
+
+#### TD3: Twin Delayed Deep Deterministic policy gradient
+
+Address overestimation bias:
+
+- Double Q-learning is unbiased in tabular settings, but a slight overestimation can remain with function approximation.
+
+$$
+y_1 = r + \gamma Q^{\theta_2'}(s', \pi_{\phi_1}(s'))
+$$
+$$
+y_2 = r + \gamma Q^{\theta_1'}(s', \pi_{\phi_2}(s'))
+$$
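+It is possible that $Q^{\theta_2}(s, \pi_{\phi_1}(s)) > Q^{\theta_1}(s, \pi_{\phi_1}(s))$ for some states, because the two critics are not fully independent; in that case the target built from the second critic is itself an overestimate.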
+
+Clipped double Q-learning:
+
+$$
+y_1 = r + \gamma \min_{i=1,2} Q^{\theta_i'}(s', \pi_{\phi_1}(s'))
+$$
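+
+A small numeric illustration of the clipped target (the values are assumptions, not from the lecture): with $r=1$, $\gamma=0.99$, and target-critic estimates $Q^{\theta_1'}=10$ and $Q^{\theta_2'}=8$ at the next state-action pair,
+
+$$
+y_1 = 1 + 0.99\cdot\min(10,\,8) = 8.92
+$$
+
+whereas bootstrapping from $Q^{\theta_1'}$ alone would give $1+0.99\cdot 10=10.9$. Taking the minimum means the target can never exceed what the standard single-critic target would produce, trading possible overestimation for a mild underestimation bias.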
+
--- a/content/CSE510/_meta.js
+++ b/content/CSE510/_meta.js
@@ -18,4 +18,5 @@ export default {
    CSE510_L13: "CSE510 Deep Reinforcement Learning (Lecture 13)",
    CSE510_L14: "CSE510 Deep Reinforcement Learning (Lecture 14)",
    CSE510_L15: "CSE510 Deep Reinforcement Learning (Lecture 15)",
+    CSE510_L16: "CSE510 Deep Reinforcement Learning (Lecture 16)",
 }