Zheyuan Wu
2025-10-17 11:52:31 -05:00
parent bf344359a1
commit c41f7204a8
5 changed files with 107 additions and 5 deletions


@@ -129,15 +129,15 @@ Proof as exercise.
The objective function is:
$$
-J(\theta)=\int_{s\in S} \pho^{\mu}(s) r(s,\mu_\theta(s)) ds
+J(\theta)=\int_{s\in S} \rho^{\mu}(s) r(s,\mu_\theta(s)) ds
$$
-where $\pho^{\mu}(s)$ is the stationary distribution under the behavior policy $\mu_\theta(s)$.
+where $\rho^{\mu}(s)$ is the stationary distribution under the behavior policy $\mu_\theta(s)$.
The proof follows the same lines as the standard policy gradient theorem.
$$
-\nabla_\theta J(\theta) = \mathbb{E}_{\mu_\theta}[\nabla_\theta Q^{\mu_\theta}(s,a)]=\mathbb{E}_{s\sim \pho^{\mu}}[\nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}]
+\nabla_\theta J(\theta) = \mathbb{E}_{\mu_\theta}[\nabla_\theta Q^{\mu_\theta}(s,a)]=\mathbb{E}_{s\sim \rho^{\mu}}[\nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}]
$$
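
As a sanity check on the chain-rule form above, here is a minimal sketch of the deterministic policy gradient in JAX. It assumes a hypothetical linear policy $\mu_\theta(s)=\theta s$ and a toy differentiable critic standing in for $Q^{\mu_\theta}$, and it approximates the expectation over $s\sim\rho^{\mu}$ with an average over sampled states; none of these specific choices come from the notes.

```python
# Minimal sketch of the DPG estimator (illustrative assumptions throughout):
# grad_theta J ≈ mean_s [ grad_theta mu_theta(s) · grad_a Q(s, a)|_{a = mu_theta(s)} ]
import jax
import jax.numpy as jnp

def mu(theta, s):
    # Hypothetical deterministic linear policy: a = theta @ s
    return theta @ s

def q(s, a):
    # Toy differentiable critic; stands in for Q^{mu_theta}(s, a)
    return -jnp.sum((a - 0.5 * s[:a.shape[0]]) ** 2)

def dpg_single_state(theta, s):
    # Chain-rule form of the DPG theorem at one state s
    a = mu(theta, s)
    dq_da = jax.grad(q, argnums=1)(s, a)              # grad_a Q(s, a)|_{a = mu_theta(s)}
    dmu_dtheta = jax.jacobian(mu)(theta, s)           # grad_theta mu_theta(s)
    return jnp.tensordot(dq_da, dmu_dtheta, axes=1)   # same shape as theta

def dpg_estimate(theta, states):
    # Monte Carlo average over states sampled (assumed) from rho^mu
    grads = jax.vmap(lambda s: dpg_single_state(theta, s))(states)
    return grads.mean(axis=0)

key = jax.random.PRNGKey(0)
theta = jax.random.normal(key, (2, 4))      # action_dim = 2, state_dim = 4
states = jax.random.normal(key, (32, 4))    # batch of sampled states
print(dpg_estimate(theta, states).shape)    # (2, 4)
```

With a learned critic in place of the toy `q`, averaging this per-state term over visited states is exactly how the deterministic gradient is estimated in practice.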
### Issues for DPG