# CSE510 Deep Reinforcement Learning (Lecture 16)

## Deterministic Policy Gradient (DPG)

### Learning Deterministic Policies

- Deterministic policy gradients [Silver et al., ICML 2014]
  - Explicitly learn a deterministic policy: $a = \mu_\theta(s)$
- Advantages
  - An optimal deterministic policy exists for MDPs.
  - Naturally handles continuous action spaces.
  - Expected to be more efficient than learning stochastic policies:
    - Computing the stochastic policy gradient requires more samples, since it integrates over both the state and action spaces.
    - The deterministic policy gradient is preferable, since it integrates over the state space only.

### Deterministic Policy Gradient

The objective function is:

$$
J(\theta)=\int_{s\in S} \rho^{\mu}(s)\, r(s,\mu_\theta(s))\, ds
$$

where $\rho^{\mu}(s)$ is the (discounted) state distribution induced by the policy $\mu_\theta$.

The deterministic policy gradient theorem gives:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{s\sim \rho^{\mu}}\big[\nabla_\theta Q^{\mu_\theta}(s,\mu_\theta(s))\big]
= \mathbb{E}_{s\sim \rho^{\mu}}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}\big]
$$

#### Issues for DPG

The formulation so far can only use on-policy data, and a deterministic policy can hardly guarantee sufficient exploration.

- Solution: off-policy training using a stochastic behavior policy.

#### Off-Policy Deterministic Policy Gradient (Off-DPG)

Use a stochastic behavior policy $\beta(a|s)$ to collect data. The modified objective function is:

$$
J_\beta(\mu_\theta)=\int_{s\in S} \rho^{\beta}(s)\, Q^{\mu_\theta}(s,\mu_\theta(s))\, ds
$$

The gradient is:

$$
\begin{aligned}
\nabla_\theta J_\beta(\mu_\theta) &\approx \int_{s\in S} \rho^{\beta}(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}\, ds\\
&= \mathbb{E}_{s\sim \rho^{\beta}}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}\big]
\end{aligned}
$$

Importance sampling is avoided in the actor because there is no integral over actions; a chain-rule sketch of this actor update follows below.
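To make the chain rule in the actor update concrete, here is a minimal PyTorch-style sketch, assuming a learned critic $Q^w(s,a)$ stands in for $Q^{\mu_\theta}$; the network architectures, dimensions, and variable names (`actor`, `critic`, `states`) are illustrative assumptions, not from the lecture.

```python
import torch
import torch.nn as nn

# Illustrative networks (sizes are arbitrary assumptions, not from the lecture).
state_dim, action_dim = 8, 2
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())         # mu_theta(s)
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))                            # Q^w(s, a)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

# States sampled from the behavior distribution rho^beta (random placeholders here).
states = torch.randn(32, state_dim)

# Deterministic policy gradient via the chain rule:
# maximizing E[Q(s, mu(s))] is the same as minimizing -E[Q(s, mu(s))];
# autograd composes grad_a Q(s,a)|_{a=mu(s)} with grad_theta mu(s) automatically.
actions = actor(states)
actor_loss = -critic(torch.cat([states, actions], dim=1)).mean()

actor_opt.zero_grad()
actor_loss.backward()   # gradients flow through the critic into the actor parameters
actor_opt.step()        # only the actor is stepped; the critic's grads would be
                        # zeroed before its own (TD) update in a full agent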
#### Policy Evaluation in DPG

Importance sampling can also be avoided in the critic, so a gradient-TD-style algorithm can be applied directly, minimizing the squared TD error:

$$
\mathcal{L}_{\text{critic}}(w) = \mathbb{E}\big[\big(r_t+\gamma Q^w(s_{t+1},a_{t+1})-Q^w(s_t,a_t)\big)^2\big], \qquad a_{t+1}=\mu_\theta(s_{t+1})
$$

#### Off-Policy Deterministic Actor-Critic

$$
\delta_t=r_t+\gamma Q^w(s_{t+1},\mu_\theta(s_{t+1}))-Q^w(s_t,a_t)
$$

$$
w_{t+1} = w_t + \alpha_w \delta_t \nabla_w Q^w(s_t,a_t)
$$

$$
\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q^w(s_t,a)\vert_{a=\mu_\theta(s_t)}
$$

### Deep Deterministic Policy Gradient (DDPG)

DDPG combines insights from DQN with deterministic policy gradients:

- Use a replay buffer.
- The critic is updated at every timestep on a minibatch sampled from the buffer, with the target computed from the target networks $Q^{w'}$ and $\mu_{\theta'}$:

$$
\mathcal{L}_{\text{critic}}(w) = \mathbb{E}\big[\big(r_t+\gamma Q^{w'}(s_{t+1},\mu_{\theta'}(s_{t+1}))-Q^w(s_t,a_t)\big)^2\big]
$$

- The actor is updated at every timestep along

$$
\nabla_\theta \mu_\theta(s_t)\, \nabla_a Q(s_t,a;w)\vert_{a=\mu_\theta(s_t)}
$$

- The target networks are smoothly ("softly") updated at every timestep:

$$
w' \leftarrow \tau w + (1-\tau)\, w', \qquad \theta' \leftarrow \tau \theta + (1-\tau)\, \theta'
$$

- Exploration: add noise to the action selection, $a_t = \mu_\theta(s_t) + \mathcal{N}_t$.
- Batch normalization is used when training the networks.

### Extensions of DDPG

Overestimation bias is an issue in Q-learning: maximizing over a noisy value estimate systematically overestimates the true value. DDPG inherits this problem because the actor ascends the learned critic:

$$
\text{DDPG:}\quad \nabla_\theta J(\theta) = \mathbb{E}_{s\sim \rho^{\mu}}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}\big]
$$

#### Double DQN is not enough

Because the policy changes slowly in an actor-critic setting, the current and target value estimates remain too similar to avoid maximization bias.

- Target value of Double DQN (target critic, current policy): $r_t + \gamma Q^{w'}(s_{t+1},\mu_\theta(s_{t+1}))$

#### TD3: Twin Delayed Deep Deterministic Policy Gradient

Addressing overestimation bias:

- Double Q-learning is unbiased in the tabular setting, but still overestimates slightly under function approximation.

$$
y_1 = r + \gamma Q^{\theta_2'}(s', \pi_{\phi_1}(s'))
$$

$$
y_2 = r + \gamma Q^{\theta_1'}(s', \pi_{\phi_2}(s'))
$$

- It is still possible that $Q^{\theta_2}(s, \pi_{\phi_1}(s)) > Q^{\theta_1}(s, \pi_{\phi_1}(s))$, i.e., the "independent" estimate can itself be an overestimate.

Clipped double Q-learning instead takes the minimum of the two target critics (a code sketch of this target appears at the end of these notes):

$$
y_1 = r + \gamma \min_{i=1,2} Q^{\theta_i'}(s', \pi_{\phi_1}(s'))
$$

High-variance estimates provide a noisy gradient. TD3 reduces this variance by:

- Updating the policy at a lower frequency than the value network.
- Smoothing the value estimate over actions:

$$
y=r+\gamma\, \mathbb{E}_{\epsilon}\big[Q^{\theta'}(s', \pi_{\phi'}(s')+\epsilon)\big]
$$

In practice, the update target is computed with sampled, clipped noise:

$$
y=r+\gamma\, Q^{\theta'}(s', \pi_{\phi'}(s')+\epsilon), \qquad \epsilon\sim \operatorname{clip}(\mathcal{N}(0, \sigma), -c, c)
$$

#### Other methods

- Generalizable Episodic Memory for Deep Reinforcement Learning
- Distributed Distributional Deep Deterministic Policy Gradient (D4PG)
  - Distributional critic
  - N-step returns are used to update the critic
  - Multiple distributed parallel actors
  - Prioritized experience replay
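Referenced from the TD3 subsection above, here is a minimal PyTorch-style sketch of computing the TD3 target with clipped double Q-learning and target policy smoothing; the network definitions, minibatch placeholders, and the values of $\gamma$, $\sigma$, and $c$ are illustrative assumptions, not from the lecture.

```python
import torch
import torch.nn as nn

def make_q(state_dim, action_dim):
    # A small critic Q(s, a); the architecture is an arbitrary illustration.
    return nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                         nn.Linear(64, 1))

state_dim, action_dim = 8, 2
gamma, sigma, c = 0.99, 0.2, 0.5          # discount, smoothing noise std, noise clip

# Target networks (slowly updated copies of the online actor and the two critics).
actor_target = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                             nn.Linear(64, action_dim), nn.Tanh())  # pi_{phi'}
q1_target, q2_target = make_q(state_dim, action_dim), make_q(state_dim, action_dim)

# Placeholder minibatch (r, s', done) as it would come from the replay buffer.
rewards = torch.randn(32, 1)
next_states = torch.randn(32, state_dim)
dones = torch.zeros(32, 1)

with torch.no_grad():
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    noise = (torch.randn(32, action_dim) * sigma).clamp(-c, c)
    next_actions = (actor_target(next_states) + noise).clamp(-1.0, 1.0)

    # Clipped double Q-learning: take the minimum of the two target critics.
    sa = torch.cat([next_states, next_actions], dim=1)
    target_q = torch.min(q1_target(sa), q2_target(sa))
    y = rewards + gamma * (1.0 - dones) * target_q   # shared TD target
```

Both online critics would then be regressed toward the same target `y`, while the actor and the target networks are updated less frequently, which is the "delayed" part of TD3.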