# CSE510 Deep Reinforcement Learning (Lecture 12)

## Policy Gradient Theorem

For any differentiable policy $\pi_\theta(s,a)$ and for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is

$$
\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)\right]
$$

## Policy Gradient Methods

### Advantages of Policy-Based RL

Advantages:

- Better convergence properties
- Effective in high-dimensional or continuous action spaces
- Can learn stochastic policies

Disadvantages:

- Typically converge to a local rather than a global optimum
- Evaluating a policy is typically inefficient and has high variance

### Actor-Critic Methods

#### Q Actor-Critic

**Reducing variance using a critic.** Monte-Carlo policy gradient still has high variance. We use a critic to estimate the action-value function $Q_w(s,a)\approx Q^{\pi_\theta}(s,a)$.

Actor-critic algorithms maintain two sets of parameters:

- Critic: updates the action-value function parameters $w$
- Actor: updates the policy parameters $\theta$ in the direction suggested by the critic

Actor-critic algorithms follow an approximate policy gradient:

$$
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)\right]
$$

$$
\Delta \theta = \alpha \nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)
$$

**Action-Value Actor-Critic**

- Simple actor-critic algorithm based on an action-value critic
- Uses linear value function approximation $Q_w(s,a)=\phi(s,a)^\top w$
  - Critic: updates $w$ by linear TD(0)
  - Actor: updates $\theta$ by the policy gradient

```python
def q_actor_critic(env, theta, w, alpha, beta, gamma, num_steps):
    """One-step Q actor-critic with a linear critic Q_w(s,a) = phi(s,a)^T w.

    Helpers sample_action, Q_w, grad_log_pi, and phi are assumed given.
    """
    s = env.reset()
    a = sample_action(s, theta)                  # a ~ pi_theta(s, .)
    for _ in range(num_steps):
        r, s_next = env.step(a)                  # sample reward and next state
        a_next = sample_action(s_next, theta)    # a' ~ pi_theta(s', .)
        delta = r + gamma * Q_w(s_next, a_next, w) - Q_w(s, a, w)  # TD error
        # Actor: ascend the approximate policy gradient
        theta = theta + alpha * grad_log_pi(theta, s, a) * Q_w(s, a, w)
        # Critic: linear TD(0) update
        w = w + beta * delta * phi(s, a)
        s, a = s_next, a_next
    return theta, w
```

#### Advantage Actor-Critic

**Reducing variance using a baseline**

- We subtract a baseline function $B(s)$ from the policy gradient
- This can reduce the variance without changing the expectation

$$
\begin{aligned}
\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log \pi_\theta(s,a)B(s)\right]&=\sum_{s\in S}d^{\pi_\theta}(s)\sum_{a\in A}\nabla_{\theta}\pi_\theta(s,a)B(s)\\
&=\sum_{s\in S}d^{\pi_\theta}(s)B(s)\nabla_\theta\sum_{a\in A}\pi_\theta(s,a)\\
&=0
\end{aligned}
$$

A good baseline is the state value function $B(s)=V^{\pi_\theta}(s)$.

So we can rewrite the policy gradient using the advantage function $A^{\pi_\theta}(s,a)=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)$:

$$
\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, A^{\pi_\theta}(s,a)\right]
$$

##### Estimating the Advantage Function

**Method 1:** direct estimation

> May increase the variance

The advantage function can significantly reduce the variance of the policy gradient, so the critic should really estimate the advantage function, for example by estimating both $V^{\pi_\theta}(s)$ and $Q^{\pi_\theta}(s,a)$ with two function approximators and two parameter vectors:

$$
\begin{aligned}
V_v(s)&\approx V^{\pi_\theta}(s)\\
Q_w(s,a)&\approx Q^{\pi_\theta}(s,a)\\
A(s,a)&=Q_w(s,a)-V_v(s)
\end{aligned}
$$

Both value functions are then updated by, e.g., TD learning.
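As a minimal sketch of Method 1 (the linear feature maps `phi_v` and `phi_q`, the SARSA-style TD(0) target for $Q_w$, and the step size `beta` are illustrative assumptions, not part of the lecture), one update from a single transition could look like this:

```python
import numpy as np

def advantage_from_two_critics(v, w, phi_v, phi_q, s, a, r, s_next, a_next,
                               gamma=0.99, beta=0.1):
    """One TD(0) step for both linear critics and the resulting advantage estimate.

    V_v(s) = phi_v(s)^T v   and   Q_w(s,a) = phi_q(s,a)^T w  (NumPy vectors).
    """
    # Current estimates under the two approximators
    V_s = np.dot(phi_v(s), v)
    Q_sa = np.dot(phi_q(s, a), w)

    # Advantage estimate handed to the actor: A(s,a) = Q_w(s,a) - V_v(s)
    advantage = Q_sa - V_s

    # TD(0) updates for both parameter vectors
    v = v + beta * (r + gamma * np.dot(phi_v(s_next), v) - V_s) * phi_v(s)
    w = w + beta * (r + gamma * np.dot(phi_q(s_next, a_next), w) - Q_sa) * phi_q(s, a)
    return v, w, advantage
```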
**Method 2:** using the TD error

> We can prove that the TD error is an unbiased estimate of the advantage function

For the true value function $V^{\pi_\theta}(s)$, the TD error

$$
\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)
$$

is an unbiased estimate of the advantage function:

$$
\begin{aligned}
\mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta}\mid s,a]&=\mathbb{E}_{\pi_\theta}[r + \gamma V^{\pi_\theta}(s') \mid s,a]-V^{\pi_\theta}(s)\\
&=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)\\
&=A^{\pi_\theta}(s,a)
\end{aligned}
$$

So we can use the TD error to compute the policy gradient:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, \delta^{\pi_\theta}]
$$

In practice, we use an approximate TD error $\delta_v=r+\gamma V_v(s')-V_v(s)$ to compute the policy gradient.

### Summary of Policy Gradient Algorithms

The policy gradient has many equivalent forms:

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, v_t] && \text{REINFORCE} \\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)] && \text{Q Actor-Critic} \\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, A^{\pi_\theta}(s,a)] && \text{Advantage Actor-Critic} \\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, \delta^{\pi_\theta}] && \text{TD Actor-Critic}
\end{aligned}
$$

Each form leads to a stochastic gradient ascent algorithm. The critic uses policy evaluation to estimate $Q^\pi(s,a)$, $A^\pi(s,a)$, or $V^\pi(s)$.

## Compatible Function Approximation

If the following two conditions are satisfied:

1. The value function approximator is compatible with the policy:
   $$
   \nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(s,a)
   $$
2. The value function parameters $w$ minimize the mean-squared error
   $$
   \epsilon = \mathbb{E}_{\pi_\theta}[(Q^{\pi_\theta}(s,a)-Q_w(s,a))^2]
   $$
   (note that $\epsilon$ need not be zero; it only needs to be minimized),

then the policy gradient is exact:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)]
$$

Remember:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)]
$$

### Challenges with Policy Gradient Methods

- Data inefficiency
  - On-policy method: for each new policy, we need to generate a completely new trajectory
  - The data is thrown out after just one gradient update (see the sketch below)
  - Since complex neural networks need many updates, this makes training very slow
- Unstable updates: the step size is very important
  - If the step size is too large:
    - Large step -> bad policy
    - The next batch is generated from the current bad policy -> bad samples
    - Bad samples -> worse policy (compare to supervised learning, where the correct labels and data in the following batches may correct it)
  - If the step size is too small: learning is slow
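To tie the TD Actor-Critic form from the summary to the data-inefficiency point above, the sketch below (a minimal illustration assuming a small discrete environment with a tabular softmax policy; the `env` interface and hyperparameter values are assumptions, not the lecture's reference implementation) uses each sampled transition for exactly one actor update and one critic update before discarding it:

```python
import numpy as np

def td_actor_critic(env, num_steps, alpha=0.01, beta=0.1, gamma=0.99):
    """One-step TD actor-critic: delta_v = r + gamma*V_v(s') - V_v(s) weights the score function.

    Assumes env exposes n_states, n_actions, reset() -> s, and step(a) -> (s', r, done).
    """
    theta = np.zeros((env.n_states, env.n_actions))  # softmax policy parameters
    v = np.zeros(env.n_states)                       # tabular critic V_v(s)

    s = env.reset()
    for _ in range(num_steps):
        # Sample a ~ pi_theta(s, .)
        probs = np.exp(theta[s] - theta[s].max())
        probs /= probs.sum()
        a = np.random.choice(env.n_actions, p=probs)

        s_next, r, done = env.step(a)

        # Approximate TD error delta_v = r + gamma * V_v(s') - V_v(s)
        delta = r + (0.0 if done else gamma * v[s_next]) - v[s]

        # Critic: TD(0) update of V_v
        v[s] += beta * delta

        # Actor: grad log pi_theta(s,a) for a softmax policy, weighted by the TD error
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha * delta * grad_log_pi

        # On-policy: the transition is used once and then thrown away
        s = env.reset() if done else s_next
    return theta, v
```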