CSE510 Deep Reinforcement Learning (Lecture 13)

Recap from last lecture

For any differentiable policy \pi_\theta(s,a) and any of the policy objective functions J=J_1, J_{avR}, or \frac{1}{1-\gamma} J_{avV}, the policy gradient is

\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_\theta \log \pi_\theta(s,a)Q^{\pi_\theta}(s,a)\right]
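
As a quick illustration, the expectation above can be estimated from sampled transitions with the score-function (REINFORCE-style) estimator. A minimal PyTorch sketch, assuming a hypothetical `policy_net` that maps states to action logits and externally supplied estimates of Q^{\pi_\theta}:

```python
import torch
from torch.distributions import Categorical

def policy_gradient_loss(policy_net, states, actions, q_estimates):
    """Sample-based estimate of E[ grad log pi_theta(a|s) * Q^pi(s,a) ].

    policy_net  : assumed interface, maps a batch of states to action logits
    states      : tensor of shape (N, state_dim)
    actions     : tensor of shape (N,) with the sampled action indices
    q_estimates : tensor of shape (N,) approximating Q^{pi_theta}(s, a)
    """
    log_probs = Categorical(logits=policy_net(states)).log_prob(actions)
    # Negative sign so that minimizing this loss performs gradient ascent on J.
    return -(log_probs * q_estimates.detach()).mean()
```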

Problems with policy gradient methods

Data Inefficiency

  • On-policy method: for each new policy, we need to generate a completely new trajectory
  • The data is thrown out after just one gradient update
  • As complex neural networks need many updates, this makes the training process very slow

Unstable updates: the step size is very important

  • If step size is too large:
    • Large step -> bad policy
    • Next batch is generated from current bad policy → collect bad samples
    • Bad samples -> worse policy (in contrast to supervised learning, where the correct labels and data in later batches can correct the model)
  • If step size is too small: the learning process is slow

Deriving the optimization objective function of Trust Region Policy Optimization (TRPO)

Objective of Policy Gradient Methods

Policy Objective


J(\pi_\theta)=\mathbb{E}_{\tau\sim \pi_\theta}\sum_{t=0}^{\infty} \gamma^t r_t

where \tau is a trajectory generated by the policy \pi_\theta.

The objective of a new policy \pi_{\theta'} can be written in terms of the old one:


J(\pi_{\theta'})-J(\pi_{\theta})=\mathbb{E}_{\tau \sim \pi_{\theta'}}\sum_{t=0}^{\infty}\gamma^t A^{\pi_\theta}(s_t,a_t)

Equivalently for succinctness:


J(\pi')-J(\pi)=\mathbb{E}_{\tau\sim \pi'}\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t,a_t)
Proof

Using A^{\pi}(s_t,a_t)=R(s_t)+\gamma V^{\pi}(s_{t+1})-V^{\pi}(s_t) (which holds in expectation along the trajectory):

\begin{aligned}
&\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi}(s_t,a_t)\right]\\
&=\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^{\infty}\gamma^t R(s_t)+\sum_{t=0}^{\infty} \gamma^{t+1}V^{\pi}(s_{t+1})-\sum_{t=0}^{\infty} \gamma^{t}V^\pi(s_t)\right]\\
&=J(\pi')+\mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=1}^{\infty} \gamma^{t}V^{\pi}(s_t)-\sum_{t=0}^{\infty} \gamma^{t}V^\pi(s_t)\right]\\
&=J(\pi')-\mathbb{E}_{\tau\sim\pi'}\left[V^{\pi}(s_0)\right]\\
&=J(\pi')-J(\pi)
\end{aligned}

Importance Sampling

Estimate an expectation under one distribution by sampling from another distribution:


\begin{aligned}
\mathbb{E}_{x\sim p}[f(x)]&=\int f(x)p(x)dx\\
&=\int f(x)\frac{p(x)}{q(x)}q(x)dx\\
&=\mathbb{E}_{x\sim q}\left[f(x)\frac{p(x)}{q(x)}\right]\\
&\approx \frac{1}{N}\sum_{i=1,x^i\in q}^N f(x^i)\frac{p(x^i)}{q(x^i)}
\end{aligned}
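
A tiny numerical check of this identity, using hypothetical choices p = N(1, 1), q = N(0, 2), and f(x) = x^2 (so that \mathbb{E}_p[f(x)] = 2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Densities of the (hypothetical) target p = N(1, 1) and proposal q = N(0, 2).
p = lambda x: np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)
q = lambda x: np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))
f = lambda x: x ** 2

x = rng.normal(0.0, 2.0, size=200_000)        # samples drawn from q
is_estimate = np.mean(f(x) * p(x) / q(x))     # importance-weighted average
direct = np.mean(f(rng.normal(1.0, 1.0, size=200_000)))  # sampling from p directly

print(is_estimate, direct)  # both are close to E_p[x^2] = 1^2 + 1 = 2
```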

Estimating the objective with importance sampling

Discounted state visitation distribution:


d^\pi(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^t P(s_t=s|\pi)

\begin{aligned}
J(\pi')-J(\pi)&=\mathbb{E}_{\tau\sim\pi'}\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t,a_t)\\
&=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi'}, a\sim \pi'}\left[A^{\pi}(s,a)\right]\\
&=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi'}, a\sim \pi}\left[A^{\pi}(s,a)\frac{\pi'(a|s)}{\pi(a|s)}\right]
\end{aligned}

The states are still drawn from d^{\pi'}, the visitation distribution of the policy we are trying to optimize, which we cannot sample from yet. The surrogate objective therefore uses the old policy to sample states, replacing d^{\pi'} with d^{\pi} (and dropping the constant \frac{1}{1-\gamma}, which does not affect the optimization):


L_\pi(\pi')=\mathbb{E}_{s\sim d^{\pi}, a\sim \pi}\left[A^{\pi}(s,a)\frac{\pi'(a|s)}{\pi(a|s)}\right]
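
In code, the surrogate is just the importance-weighted advantage averaged over data collected with the old policy. A minimal PyTorch sketch, assuming a hypothetical `policy_net` and that `old_log_probs` and `advantages` were stored during rollouts of \pi:

```python
import torch
from torch.distributions import Categorical

def surrogate_objective(policy_net, states, actions, old_log_probs, advantages):
    """Sample estimate of L_pi(pi') = E_{s~d^pi, a~pi}[ A^pi(s,a) pi'(a|s)/pi(a|s) ].

    policy_net is assumed to map a batch of states to action logits.
    """
    new_log_probs = Categorical(logits=policy_net(states)).log_prob(actions)
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi'(a|s) / pi(a|s)
    return (ratio * advantages).mean()                # quantity to maximize
```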

Lower bound for the optimization

Note

KL (Kullback-Leibler) divergence is a measure of the difference between two probability distributions:

D_{KL}(\pi(\cdot|s)||\pi'(\cdot|s))=\int_a \pi(a|s)\log \frac{\pi(a|s)}{\pi'(a|s)}da
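
For a discrete action space the integral becomes a sum over actions. A small sketch with hypothetical categorical action probabilities:

```python
import numpy as np

def kl_divergence(pi_old, pi_new):
    """D_KL(pi_old(.|s) || pi_new(.|s)) for categorical action distributions."""
    pi_old, pi_new = np.asarray(pi_old, float), np.asarray(pi_new, float)
    return float(np.sum(pi_old * np.log(pi_old / pi_new)))

print(kl_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))  # small positive value
print(kl_divergence([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]))  # 0.0 for identical policies
```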


J(\pi')-J(\pi)\geq L_\pi(\pi')-C\max_{s\in S}D_{KL}(\pi(\cdot|s)||\pi'(\cdot|s))

where C is a constant.

Optimizing the objective function:


\max_{\pi'} J(\pi')-J(\pi)

We do this by maximizing the lower bound:


\max_{\pi'} L_\pi(\pi')-C\max_{s\in S}D_{KL}(\pi(\cdot|s)||\pi'(\cdot|s))

Monotonic Improvement Theorem

Proof of improvement guarantee: Suppose \pi_{k+1} and \pi_k are related by


\pi_{k+1}=\arg\max_{\pi'} L_{\pi_k}(\pi')-C\max_{s\in S}D_{KL}(\pi_k(\cdot|s)||\pi'(\cdot|s))

\pi_{k} is a feasible point, and the objective at \pi_k is equal to 0.


L_{\pi_k}(\pi_{k})=\mathbb{E}_{s\sim d^{\pi_k},a\sim\pi_k}[A^{\pi_k}(s,a)]=0

D_{KL}(\pi_k(\cdot|s)||\pi_k(\cdot|s))=0 \text{ for all } s

Hence the optimal value of the objective is \geq 0.

By the performance bound, J(\pi_{k+1})-J(\pi_k)\geq 0.
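
Putting the two observations into the performance bound gives the full chain:

J(\pi_{k+1})-J(\pi_k)\geq L_{\pi_k}(\pi_{k+1})-C\max_{s\in S}D_{KL}(\pi_k(\cdot|s)||\pi_{k+1}(\cdot|s))\geq L_{\pi_k}(\pi_{k})-C\cdot 0=0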

Final objective function


\max_{\pi'}\mathbb{E}_{s\sim d^{\pi}, a\sim \pi}\left[A^{\pi}(s,a)\frac{\pi'(a|s)}{\pi(a|s)}\right]-C\max_{s\in S}D_{KL}(\pi(\cdot|s)||\pi'(\cdot|s))

By approximating the maximum over states with an expectation over states visited by \pi:


\max_{\pi'}\mathbb{E}_{s\sim d^{\pi}, a\sim \pi}\left[A^{\pi}(s,a)\frac{\pi'(a|s)}{\pi(a|s)}\right]-C\,\mathbb{E}_{s\sim d^{\pi}}D_{KL}(\pi(\cdot|s)||\pi'(\cdot|s))

By Lagrangian duality, this penalized objective is mathematically equivalent to the following problem with a trust region constraint:


\max_{\pi'} L_\pi(\pi')

such that


\mathbb{E}_{s\sim d^{\pi}}D_{KL}(\pi(\cdot|s)||\pi'(\cdot|s))\leq \delta

C becomes very large when \gamma is close to 1, and the corresponding gradient step size becomes too small:


C\propto \frac{\epsilon \gamma}{(1-\gamma)^2}
  • Empirical results show that the penalty coefficient needs to be more adaptive
  • But tuning C is hard (PPO later handles this with an adaptive penalty or clipping trick)
  • TRPO instead uses a trust region constraint and makes \delta a tunable hyperparameter.

Trust Region Policy Optimization (TRPO)


\max_{\pi'} L_\pi(\pi')

such that


\mathbb{E}_{s\sim d^{\pi}}D_{KL}(\pi(\cdot|s)||\pi'(\cdot|s))\leq \delta

Make a linear approximation to L_{\pi_{\theta_{old}}} and a quadratic approximation to the KL term.

Maximize g\cdot(\theta-\theta_{old})-\frac{\beta}{2}(\theta-\theta_{old})^T F(\theta-\theta_{old})

where g=\frac{\partial}{\partial \theta}L_{\pi_{\theta_{old}}}(\pi_{\theta})\vert_{\theta=\theta_{old}} and F=\frac{\partial^2}{\partial \theta^2}\overline{KL}_{\pi_{\theta_{old}}}(\pi_{\theta})\vert_{\theta=\theta_{old}}
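
Maximizing this quadratic model gives a natural-gradient direction F^{-1}g; under the trust-region constraint \frac{1}{2}(\theta-\theta_{old})^T F(\theta-\theta_{old})\leq \delta, the standard TRPO step (stated here for completeness) is

\theta_{new}=\theta_{old}+\sqrt{\frac{2\delta}{g^T F^{-1}g}}\,F^{-1}g

In practice F is never inverted explicitly: TRPO approximately solves Fx=g with conjugate gradient and then applies a backtracking line search so that the exact KL constraint and surrogate improvement are satisfied.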

Taylor Expansion of KL Term

Writing d=\theta-\theta_{old}, the KL term expands as

D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\approx D_{KL}(\pi_{\theta_{old}}||\pi_{\theta_{old}})+d^T \nabla_\theta D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\vert_{\theta=\theta_{old}}+\frac{1}{2}d^T \nabla_\theta^2 D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\vert_{\theta=\theta_{old}}d

The zeroth-order term is zero, and the derivations below show that the first-order term also vanishes, so only the quadratic term remains.

\begin{aligned}
\nabla_\theta D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\vert_{\theta=\theta_{old}}&=-\nabla_\theta \mathbb{E}_{x\sim \pi_{\theta_{old}}}\log P_\theta(x)\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\nabla_\theta \log P_\theta(x)\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\frac{\nabla_\theta P_\theta(x)\vert_{\theta=\theta_{old}}}{P_{\theta_{old}}(x)}\\
&=-\int_x P_{\theta_{old}}(x)\frac{\nabla_\theta P_\theta(x)\vert_{\theta=\theta_{old}}}{P_{\theta_{old}}(x)}\, dx\\
&=-\int_x \nabla_\theta P_\theta(x)\vert_{\theta=\theta_{old}}\, dx\\
&=-\nabla_\theta \int_x P_\theta(x)\, dx\,\Big\vert_{\theta=\theta_{old}}\\
&=0
\end{aligned}

\begin{aligned}
\nabla_\theta^2 D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\vert_{\theta=\theta_{old}}&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\nabla_\theta^2 \log P_\theta(x)\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\nabla_\theta \left(\frac{\nabla_\theta P_\theta(x)}{P_\theta(x)}\right)\Big\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left(\frac{\nabla_\theta^2 P_\theta(x)}{P_\theta(x)}-\frac{\nabla_\theta P_\theta(x)\nabla_\theta P_\theta(x)^T}{P_\theta(x)^2}\right)\Big\vert_{\theta=\theta_{old}}\\
&=-\int_x \nabla_\theta^2 P_\theta(x)\vert_{\theta=\theta_{old}}\,dx+\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left(\nabla_\theta \log P_\theta(x)\nabla_\theta \log P_\theta(x)^T\right)\vert_{\theta=\theta_{old}}\\
&=\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left[\nabla_\theta\log P_\theta(x)\nabla_\theta\log P_\theta(x)^T\right]\vert_{\theta=\theta_{old}}
\end{aligned}

The first term vanishes for the same reason as before (\int_x \nabla_\theta^2 P_\theta(x)dx=\nabla_\theta^2 \int_x P_\theta(x)dx=0), and the remaining expression is exactly the Fisher information matrix F.
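
Because F is the Fisher information matrix, it can be accessed through Fisher-vector products without ever forming the full Hessian: differentiate the sampled mean KL once, dot the gradient with a fixed vector, and differentiate again. A minimal PyTorch sketch, assuming `mean_kl` is a scalar tensor (the average KL over sampled states) and `params` is the list of policy parameters:

```python
import torch

def fisher_vector_product(mean_kl, params, v, damping=1e-2):
    """Compute (F + damping*I) v, where F is the Hessian of the mean KL.

    mean_kl : scalar tensor, average KL(pi_old || pi_theta) over sampled states (assumed given)
    params  : list of policy parameters (each with requires_grad=True)
    v       : flat tensor with the same total number of elements as params
    """
    grads = torch.autograd.grad(mean_kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_dot_v = (flat_grad * v).sum()             # scalar: (grad KL) . v
    hvp = torch.autograd.grad(grad_dot_v, params)  # second backward pass yields F v
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * v                  # small damping keeps CG well-conditioned
```

This product is exactly what the conjugate-gradient solver needs in order to compute the F^{-1}g direction in the update above.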