CSE510 Deep Reinforcement Learning (Lecture 12)

Policy Gradient Theorem

For any differentiable policy \pi_\theta(s,a) and for any of the policy objective functions J=J_1, J_{avR}, or \frac{1}{1-\gamma} J_{avV}, the policy gradient is


\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q^{\pi_\theta}(s,a)\right]
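
As a concrete illustration (not from the lecture), here is a minimal sketch of the score function \nabla_\theta \log \pi_\theta(s,a) for a tabular softmax policy, assuming \pi_\theta(s,a) \propto \exp(\theta[s,a]); the names and the tabular parameterization are illustrative:

import numpy as np

def softmax_policy(state, theta):
    # pi_theta(.|state) for a tabular softmax policy, theta has shape [num_states, num_actions]
    prefs = theta[state] - theta[state].max()   # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def grad_log_pi(state, action, theta):
    # grad wrt theta[state, .] of log pi_theta(state, action) = one_hot(action) - pi_theta(.|state)
    grad = np.zeros_like(theta)
    grad[state] = -softmax_policy(state, theta)
    grad[state, action] += 1.0
    return grad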

Policy Gradient Methods

Advantages of Policy-Based RL

Advantages:

  • Better convergence properties
  • Effective in high-dimensional or continuous action spaces
  • Can learn stochastic policies

Disadvantages:

  • Typically converge to a local rather than global optimum
  • Evaluating a policy is typically inefficient and high variance

Actor-Critic Methods

Q Actor-Critic

Reducing Variance Using a Critic

Monte-Carlo Policy Gradient still has high variance.

We use a critic to estimate the action-value function Q_w(s,a)\approx Q^{\pi_\theta}(s,a).

Actor-critic algorithms maintain two sets of parameters:

Critic: updates action-value function parameters w

Actor: updates policy parameters \theta, in direction suggested by the critic.

Actor-critic algorithms follow an approximate policy gradient:


\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q_w(s,a)\right]

\Delta \theta = \alpha \nabla_\theta \log \pi_\theta(s,a)Q_w(s,a)

Action-Value Actor-Critic

  • Simple actor-critic algorithm based on action-value critic
  • Using linear value function approximation Q_w(s,a)=\phi(s,a)^T w

Critic: updates w by linear TD(0)

Actor: updates \theta by policy gradient

def q_actor_critic(state, theta, w, num_steps, alpha, beta, gamma):
    # sample_action draws a ~ pi_theta(.|s); sample_transition returns (reward, next state)
    action = sample_action(state, theta)
    for _ in range(num_steps):
        reward, next_state = sample_transition(state, action)
        next_action = sample_action(next_state, theta)
        # TD(0) error: delta = r + gamma * Q_w(s', a') - Q_w(s, a)
        delta = reward + gamma * Q_w(next_state, next_action, w) - Q_w(state, action, w)
        # Actor: theta <- theta + alpha * grad_theta log pi_theta(s, a) * Q_w(s, a)
        theta = theta + alpha * grad_log_pi(state, action, theta) * Q_w(state, action, w)
        # Critic: linear TD(0) update with Q_w(s, a) = phi(s, a)^T w
        w = w + beta * delta * phi(state, action)
        state, action = next_state, next_action
    return theta, w
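
For concreteness, a minimal sketch of the linear critic assumed above, using a one-hot feature map (NUM_STATES, NUM_ACTIONS, and the feature choice are illustrative assumptions, not from the lecture):

import numpy as np

NUM_STATES, NUM_ACTIONS = 10, 4   # illustrative sizes

def phi(state, action):
    # One simple choice of features: one-hot over (state, action) pairs
    x = np.zeros(NUM_STATES * NUM_ACTIONS)
    x[state * NUM_ACTIONS + action] = 1.0
    return x

def Q_w(state, action, w):
    # Linear action-value estimate Q_w(s, a) = phi(s, a)^T w
    return phi(state, action) @ w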

Advantage Actor-Critic

Reducing variance using a baseline

  • We subtract a baseline function B(s) from the policy gradient
  • This can reduce the variance without changing the expectation

\begin{aligned}
\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log \pi_\theta(s,a)B(s)\right]&=\sum_{s\in S}d^{\pi_\theta}(s)\sum_{a\in A}\nabla_{\theta}\pi_\theta(s,a)B(s)\\
&=\sum_{s\in S}d^{\pi_\theta}(s)B(s)\nabla_\theta\sum_{a\in A}\pi_\theta(s,a)\\
&=\sum_{s\in S}d^{\pi_\theta}(s)B(s)\nabla_\theta 1\\
&=0
\end{aligned}

A good baseline is the state value function B(s)=V^{\pi_\theta}(s)

So we can rewrite the policy gradient using the advantage function A^{\pi_\theta}(s,a)=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)


\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a) A^{\pi_\theta}(s,a)\right]

Estimating the Advantage function

Method 1: direct estimation (this may itself increase the variance)

The advantage function can significantly reduce the variance of the policy gradient, so the critic should really estimate the advantage function.

For example, by estimating both V^{\pi_\theta}(s) and Q^{\pi_\theta}(s,a).

Using two function approximators and two parameter vectors,


V_v(s)\approx V^{\pi_\theta}(s)\\
Q_w(s,a)\approx Q^{\pi_\theta}(s,a)\\
A(s,a)=Q_w(s,a)-V_v(s)

And updating both value functions by e.g. TD learning
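
A minimal sketch of one such update, assuming linear critics Q_w(s,a)=\phi(s,a)^T w and V_v(s)=\psi(s)^T v; Q_w and phi are as sketched earlier, while V_v and psi are assumed state-feature analogues (all names illustrative):

def direct_advantage_update(state, action, reward, next_state, next_action,
                            w, v, beta_q, beta_v, gamma):
    # TD(0) errors for the two critics Q_w and V_v
    delta_q = reward + gamma * Q_w(next_state, next_action, w) - Q_w(state, action, w)
    delta_v = reward + gamma * V_v(next_state, v) - V_v(state, v)
    # TD updates of both parameter vectors
    w = w + beta_q * delta_q * phi(state, action)
    v = v + beta_v * delta_v * psi(state)
    # Advantage estimate A(s, a) = Q_w(s, a) - V_v(s) fed to the actor
    advantage = Q_w(state, action, w) - V_v(state, v)
    return w, v, advantage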

Method 2: using the TD error

We can prove that the TD error is an unbiased estimate of the advantage function.

For the true value function V^{\pi_\theta}(s), the TD error \delta^{\pi_\theta}


\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)

is an unbiased estimate of the advantage function


\begin{aligned}
\mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta}| s,a]&=\mathbb{E}_{\pi_\theta}[r + \gamma V^{\pi_\theta}(s') |s,a]-V^{\pi_\theta}(s)\\
&=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)\\
&=A^{\pi_\theta}(s,a)
\end{aligned}

So we can use the TD error to compute the policy gradient


\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) \delta^{\pi_\theta}]

In practice, we can use an approximate TD error \delta_v=r+\gamma V_v(s')-V_v(s) to compute the policy gradient
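
Putting this together, a minimal sketch of one TD actor-critic update using the approximate TD error \delta_v (reusing the assumed helpers grad_log_pi, V_v, and psi from the sketches above):

def td_actor_critic_step(state, action, reward, next_state,
                         theta, v, alpha, beta, gamma):
    # Approximate TD error delta_v = r + gamma * V_v(s') - V_v(s),
    # used as a sample of the advantage A(s, a)
    delta = reward + gamma * V_v(next_state, v) - V_v(state, v)
    # Actor: theta <- theta + alpha * grad_theta log pi_theta(s, a) * delta
    theta = theta + alpha * grad_log_pi(state, action, theta) * delta
    # Critic: TD(0) update of the state-value parameters, V_v(s) = psi(s)^T v
    v = v + beta * delta * psi(state)
    return theta, v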

Summary of policy gradient algorithms

The policy gradient has many equivalent forms.


\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) v_t] \text{  REINFORCE} \\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q_w(s,a)] \text{  Q Actor-Critic} \\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) A^{\pi_\theta}(s,a)] \text{  Advantage Actor-Critic} \\
&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) \delta^{\pi_\theta}] \text{  TD Actor-Critic}
\end{aligned}

Each leads to a stochastic gradient ascent algorithm.

The critic uses policy evaluation to estimate Q^\pi(s,a), A^\pi(s,a), or V^\pi(s).

Compatible Function Approximation

If the following two conditions are satisfied:

  1. The value function approximator is compatible with the policy
    
    \nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(s,a)
    
  2. Value function parameters w minimize the MSE
    
    \epsilon = \mathbb{E}_{\pi_\theta}[(Q^{\pi_\theta}(s,a)-Q_w(s,a))^2]
    
    Note that \epsilon need not be zero; it only needs to be minimized.

Then the policy gradient is exact


\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q_w(s,a)]

Remember:


\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q^{\pi_\theta}(s,a)]
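
A short proof sketch: because w minimizes \epsilon, the gradient of the MSE with respect to w is zero, and condition 1 lets us replace \nabla_w Q_w(s,a) with the score function:


\begin{aligned}
\nabla_w \epsilon = 0 &\Rightarrow \mathbb{E}_{\pi_\theta}\left[\left(Q^{\pi_\theta}(s,a)-Q_w(s,a)\right)\nabla_w Q_w(s,a)\right]=0\\
&\Rightarrow \mathbb{E}_{\pi_\theta}\left[\left(Q^{\pi_\theta}(s,a)-Q_w(s,a)\right)\nabla_\theta \log \pi_\theta(s,a)\right]=0\\
&\Rightarrow \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\,Q_w(s,a)\right]=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\,Q^{\pi_\theta}(s,a)\right]=\nabla_\theta J(\theta)
\end{aligned}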

Challenges with Policy Gradient Methods

  • Data Inefficiency
    • On-policy method: for each new policy, we need to generate a completely new trajectory
    • The data is thrown out after just one gradient update
    • As complex neural networks need many updates, this makes the training process very slow
  • Unstable updates: the step size is very important
    • If step size is too large:
      • Large step -> bad policy
      • Next batch is generated from current bad policy -> collect bad samples
      • Bad samples -> even worse policy (unlike supervised learning, where the correct labels in later batches can correct the model)
    • If step size is too small: the learning process is slow