updates

2025-10-03 10:54:02 -05:00
parent 42c09d7103
commit 535b9329f0
6 changed files with 375 additions and 0 deletions
--- a/content/CSE510/CSE510_L12.md
+++ b/content/CSE510/CSE510_L12.md
@@ -0,0 +1,204 @@
+# CSE510 Deep Reinforcement Learning (Lecture 12)
+
+## Policy Gradient Theorem
+
+For any differentiable policy $\pi_\theta(s,a)$ and any of the policy objective functions $J=J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is
+
+$$
+\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q^{\pi_\theta}(s,a)\right]
+$$
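+
+As a sanity check on the theorem, the sketch below evaluates both sides for a one-step MDP (a bandit), where $Q^{\pi_\theta}(s,a)$ is just the reward $r(a)$ and $J(\theta)=\sum_a \pi_\theta(a) r(a)$. The softmax parameterization, the reward values, and all function names are assumptions made for illustration; they are not part of the lecture.
+
```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # Softmax with one logit per action: grad_theta log pi(a) = onehot(a) - pi
    g = -softmax(theta)
    g[a] += 1.0
    return g

def objective(theta, r):
    # J(theta) = sum_a pi(a) r(a) for a one-step MDP
    return softmax(theta) @ r

def policy_gradient(theta, r):
    # E_pi[ grad log pi(a) * Q(a) ], computed exactly by summing over actions
    pi = softmax(theta)
    return sum(pi[a] * grad_log_pi(theta, a) * r[a] for a in range(len(r)))

def finite_difference(theta, r, eps=1e-6):
    # Central differences on J as an independent reference
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (objective(theta + e, r) - objective(theta - e, r)) / (2 * eps)
    return g

theta = np.array([0.2, -0.5, 1.0])   # hypothetical logits
r = np.array([1.0, 0.0, 2.0])        # hypothetical rewards, one per action
pg = policy_gradient(theta, r)
fd = finite_difference(theta, r)
```
+
+Under these assumptions `pg` and `fd` agree up to finite-difference error, which is what the theorem predicts for this simple case.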
+
+## Policy Gradient Methods
+
+Advantages and Disadvantages of Policy-Based RL
+
+Advantages:
+
+- Better convergence properties
+- Effective in high-dimensional or continuous action spaces
+- Can learn stochastic policies
+
+Disadvantages:
+
+- Typically converge to a local rather than global optimum
+- Evaluating a policy is typically inefficient and has high variance
+
+### Actor-Critic Methods
+
+#### Q Actor-Critic
+
+Reducing Variance Using a Critic
+
+Monte-Carlo Policy Gradient still has high variance.
+
+We use a critic to estimate the action-value function $Q_w(s,a)\approx Q^{\pi_\theta}(s,a)$.
+
+Actor-critic algorithms maintain two sets of parameters:
+
+Critic: updates action-value function parameters $w$
+
+Actor: updates policy parameters $\theta$, in direction suggested by the critic.
+
+Actor-critic algorithms follow an approximate policy gradient:
+
+$$
+\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q_w(s,a)\right]
+$$
+$$
+\Delta \theta = \alpha \nabla_\theta \log \pi_\theta(s,a)Q_w(s,a)
+$$
+
+Action-Value Actor-Critic
+
+- Simple actor-critic algorithm based on action-value critic
+- Using linear value function approximation $Q_w(s,a)=\phi(s,a)^T w$
+
+- Critic: updates $w$ by linear TD(0)
+- Actor: updates $\theta$ by policy gradient
+
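+```python
+def q_actor_critic(s, theta, w, sample_action, env_step, phi, grad_log_pi,
+                   num_steps, alpha=0.01, beta=0.01, gamma=0.99):
+    """Action-value actor-critic with a linear critic Q_w(s,a) = phi(s,a)^T w.
+
+    The callables are placeholders supplied by the caller:
+    sample_action(s, theta) -> a, env_step(s, a) -> (reward, next_state),
+    phi(s, a) -> feature vector, grad_log_pi(s, a, theta) -> score vector.
+    """
+    a = sample_action(s, theta)
+    for _ in range(num_steps):
+        r, s_next = env_step(s, a)              # sample reward and transition
+        a_next = sample_action(s_next, theta)   # sample next action from pi_theta
+        q_sa = phi(s, a) @ w
+        # TD(0) error of the critic
+        delta = r + gamma * (phi(s_next, a_next) @ w) - q_sa
+        # Actor: policy-gradient step in the direction suggested by the critic
+        theta = theta + alpha * grad_log_pi(s, a, theta) * q_sa
+        # Critic: linear TD(0) update
+        w = w + beta * delta * phi(s, a)
+        s, a = s_next, a_next
+    return theta, w
+```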
+
+#### Advantage Actor-Critic
+
+Reducing variance using a baseline
+
+- We subtract a baseline function $B(s)$ from the policy gradient
+- This can reduce the variance without changing expectation
+
+$$
+\begin{aligned}
+\mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log \pi_\theta(s,a)B(s)\right]&=\sum_{s\in S}d^{\pi_\theta}(s)\sum_{a\in A}\nabla_{\theta}\pi_\theta(s,a)B(s)\\
+&=\sum_{s\in S}d^{\pi_\theta}(s)B(s)\nabla_\theta\sum_{a\in A}\pi_\theta(s,a)\\
+&=0
+\end{aligned}
+$$
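+
+The zero-expectation property, and the variance reduction it allows, can be checked numerically for a single state. The sketch below assumes a softmax policy over three actions with made-up action values; `q`, the constant baseline `5.0`, and the helper names are illustrative only.
+
```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def score(theta):
    # Row a holds grad_theta log pi(a) = onehot(a) - pi for a softmax policy
    return np.eye(len(theta)) - softmax(theta)

def estimator_stats(theta, q, baseline):
    """Exact mean and total variance of grad log pi(a) * (q[a] - baseline), a ~ pi."""
    pi = softmax(theta)
    samples = score(theta) * (q - baseline)[:, None]   # one row per action
    mean = pi @ samples
    var = pi @ ((samples - mean) ** 2).sum(axis=1)
    return mean, var

theta = np.array([0.1, 0.4, -0.3])   # hypothetical logits
q = np.array([10.0, 11.0, 9.0])      # hypothetical action values for one state
pi = softmax(theta)

# E[grad log pi(a) * B] for a constant baseline B = 5
baseline_term = pi @ (score(theta) * 5.0)

# Same estimator without a baseline and with B = V(s) = sum_a pi(a) q(a)
mean_plain, var_plain = estimator_stats(theta, q, 0.0)
mean_base, var_base = estimator_stats(theta, q, pi @ q)
```
+
+In this example the baseline term vanishes and both estimators share the same mean, while the variance with $B(s)=V(s)$ is much smaller because the action values sit far from zero but close to each other.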
+
+A good baseline is the state value function $B(s)=V^{\pi_\theta}(s)$
+
+So we can rewrite the policy gradient using the advantage function $A^{\pi_\theta}(s,a)=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)$
+
+$$
+\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a) A^{\pi_\theta}(s,a)\right]
+$$
+
+##### Estimating the Advantage function
+
+**Method 1:** direct estimation
+
+> May increase the variance
+
+The advantage function can significantly reduce the variance of the policy gradient.
+
+So the critic should really estimate the advantage function.
+
+For example, by estimating both $V^{\pi_\theta}(s)$ and $Q^{\pi_\theta}(s,a)$.
+
+Using two function approximators and two parameter vectors,
+
+$$
+\begin{aligned}
+V_v(s)&\approx V^{\pi_\theta}(s)\\
+Q_w(s,a)&\approx Q^{\pi_\theta}(s,a)\\
+A(s,a)&=Q_w(s,a)-V_v(s)
+\end{aligned}
+$$
+
+Both value functions are then updated by, e.g., TD learning.
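+
+A minimal sketch of Method 1, assuming tabular value functions (a special case of linear approximation with one-hot features). The learning rate, discount, and the two transitions are made up for illustration, and the $Q$ update uses a SARSA-style TD(0) target as one possible choice.
+
```python
import numpy as np

def td_update_both(V, Q, s, a, r, s_next, a_next, lr=0.1, gamma=0.9):
    """One TD(0) step for V_v and one SARSA-style TD(0) step for Q_w.

    V has shape [n_states]; Q has shape [n_states, n_actions].
    Both arrays are updated in place.
    """
    V[s] += lr * (r + gamma * V[s_next] - V[s])
    Q[s, a] += lr * (r + gamma * Q[s_next, a_next] - Q[s, a])

def advantage(V, Q, s, a):
    # A(s,a) = Q_w(s,a) - V_v(s)
    return Q[s, a] - V[s]

V = np.zeros(2)
Q = np.zeros((2, 2))
# Two hypothetical transitions out of state 0
td_update_both(V, Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
td_update_both(V, Q, s=0, a=0, r=0.0, s_next=1, a_next=0)

adv_good = advantage(V, Q, 0, 1)   # action that was rewarded
adv_bad = advantage(V, Q, 0, 0)    # action that was not
```
+
+Because the two estimates are learned separately, their errors do not cancel in the difference, which is one way to read the remark that this method may increase variance.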
+
+**Method 2:** using the TD error
+
+> We can prove that the TD error is an unbiased estimate of the advantage function
+
+For the true value function $V^{\pi_\theta}(s)$, the TD error $\delta^{\pi_\theta}$
+
+$$
+\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)
+$$
+
+is an unbiased estimate of the advantage function
+
+$$
+\begin{aligned}
+\mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta}| s,a]&=\mathbb{E}_{\pi_\theta}[r + \gamma V^{\pi_\theta}(s') |s,a]-V^{\pi_\theta}(s)\\
+&=Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s)\\
+&=A^{\pi_\theta}(s,a)
+\end{aligned}
+$$
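+
+The identity can be verified exactly on a small MDP. The sketch below builds a random MDP and a random policy (both hypothetical), solves the Bellman equation for $V^{\pi_\theta}$, and compares the conditional expectation of the TD error with the advantage. The array names `P`, `R`, and `pi` are assumptions for this example.
+
```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 3, 2, 0.9

# Hypothetical MDP: P[s, a, s'] transition probabilities, R[s, a, s'] rewards
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(n_s, n_a, n_s))
# Hypothetical stochastic policy pi[s, a]
pi = rng.random((n_s, n_a))
pi /= pi.sum(axis=1, keepdims=True)

# Exact V^pi from the Bellman equation V = r_pi + gamma * P_pi V
r_sa = (P * R).sum(axis=2)                   # expected reward for (s, a)
P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state transitions under pi
r_pi = (pi * r_sa).sum(axis=1)
V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)

Q = r_sa + gamma * P @ V                     # Q^pi(s, a)
A = Q - V[:, None]                           # A^pi(s, a)

# E[delta | s, a] = sum_s' P(s'|s,a) * (r + gamma V(s') - V(s))
delta = R + gamma * V[None, None, :] - V[:, None, None]
expected_delta = (P * delta).sum(axis=2)
```
+
+Here `expected_delta` matches `A` entry by entry. Note that this relies on the true $V^{\pi_\theta}$; with an approximate $V_v$ the estimate is generally biased.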
+
+So we can use the TD error to compute the policy gradient
+
+$$
+\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) \delta^{\pi_\theta}]
+$$
+
+In practice, we can use an approximate TD error $\delta_v=r+\gamma V_v(s')-V_v(s)$ to compute the policy gradient
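+
+One way the resulting TD actor-critic update might look for a single transition is sketched below. It assumes a tabular softmax actor and a tabular critic $V_v$; the step sizes, the `done` flag for terminal transitions, and the two-armed bandit used to exercise it are all hypothetical choices, not something specified in the lecture.
+
```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def td_actor_critic_step(theta, v, s, a, r, s_next, done,
                         alpha=0.1, beta=0.1, gamma=0.99):
    """One online update from a single transition (s, a, r, s_next).

    theta: [n_states, n_actions] softmax logits (actor), updated in place
    v:     [n_states] state values V_v (critic), updated in place
    Returns the approximate TD error delta_v.
    """
    target = r + (0.0 if done else gamma * v[s_next])
    delta = target - v[s]
    # Actor: theta += alpha * grad log pi(a|s) * delta_v
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0
    theta[s] += alpha * delta * grad_log_pi
    # Critic: TD(0) update of V_v
    v[s] += beta * delta
    return delta

# Hypothetical one-state bandit: arm 1 pays 1, arm 0 pays 0, episodes last one step
rng = np.random.default_rng(0)
theta = np.zeros((1, 2))
v = np.zeros(1)
for _ in range(2000):
    a = rng.choice(2, p=softmax(theta[0]))
    r = float(a == 1)
    td_actor_critic_step(theta, v, s=0, a=a, r=r, s_next=0, done=True)
prob_arm1 = softmax(theta[0])[1]
```
+
+Only one set of critic parameters is needed here, since the TD error is computed from $V_v$ alone rather than from both $Q_w$ and $V_v$.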
+
+### Summary of policy gradient algorithms
+
+The policy gradient has many equivalent forms.
+
+$$
+\begin{aligned}
+\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) v_t] \text{  REINFORCE} \\
+&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q_w(s,a)] \text{  Q Actor-Critic} \\
+&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) A^{\pi_\theta}(s,a)] \text{  Advantage Actor-Critic} \\
+&= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) \delta^{\pi_\theta}] \text{  TD Actor-Critic}
+\end{aligned}
+$$
+
+Each leads to a stochastic gradient ascent algorithm.
+
+The critic uses policy evaluation to estimate $Q^\pi(s,a)$, $A^\pi(s,a)$, or $V^\pi(s)$.
+
+## Compatible Function Approximation
+
+If the following two conditions are satisfied:
+
+1. The value function approximator is compatible with the policy
+    $$
+    \nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(s,a)
+    $$
+2. Value function parameters $w$ minimize the MSE
+    $$
+    \epsilon = \mathbb{E}_{\pi_\theta}[(Q^{\pi_\theta}(s,a)-Q_w(s,a))^2]
+    $$
+    Note that $\epsilon$ need not be zero; it only needs to be minimized.
+
+Then the policy gradient is exact
+
+$$
+\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q_w(s,a)]
+$$
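+
+The two conditions can be checked numerically in a single-state example. The sketch below assumes a softmax policy with one logit per action, uses the score function as critic features (condition 1), and fits $w$ by policy-weighted least squares (condition 2). The action values in `q_true` are made up.
+
```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.array([0.3, -0.2, 0.5])       # hypothetical logits
q_true = np.array([1.0, 3.0, -2.0])      # hypothetical Q^pi(s, a) for one state
pi = softmax(theta)

# Condition 1: Q_w(a) = psi(a)^T w with psi(a) = grad_theta log pi(a),
# so that grad_w Q_w(a) = grad_theta log pi(a)
psi = np.eye(3) - pi                     # row a is onehot(a) - pi

# Condition 2: w minimizes sum_a pi(a) * (q_true[a] - psi[a] @ w)^2
sqrt_pi = np.sqrt(pi)[:, None]
w, *_ = np.linalg.lstsq(sqrt_pi * psi, sqrt_pi[:, 0] * q_true, rcond=None)
q_w = psi @ w

grad_exact = pi @ (psi * q_true[:, None])   # E[grad log pi * Q^pi]
grad_critic = pi @ (psi * q_w[:, None])     # E[grad log pi * Q_w]
```
+
+In this example the fitted critic equals `q_true` minus its policy-weighted mean, so the error $\epsilon$ is not zero, yet `grad_critic` still matches `grad_exact`.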
+
+Remember:
+
+$$
+\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q^{\pi_\theta}(s,a)]
+$$
+
+### Challenges with Policy Gradient Methods
+
+- Data Inefficiency
+  - On-policy method: for each new policy, we need to generate a completely new trajectory
+  - The data is thrown out after just one gradient update
+  - As complex neural networks need many updates, this makes the training process very slow
+- Unstable update: step size is very important
+  - If step size is too large:
+    - Large step -> bad policy
+    - Next batch is generated from current bad policy -> collect bad samples
+    - Bad samples -> worse policy (compare to supervised learning, where the correct labels and data in the following batches may correct it)
+  - If step size is too small: the learning process is slow
+