The objective function is:

$$
J(\theta)=\int_{s\in S} \rho^{\mu}(s) \, r(s,\mu_\theta(s)) \, ds
$$

where $\rho^{\mu}(s)$ is the stationary distribution of states under the behavior policy $\mu_\theta$.
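
For intuition, the integral is an expectation over states visited by the policy, so with samples $s_i \sim \rho^{\mu}$ (e.g. collected by running $\mu_\theta$) it admits the Monte Carlo estimate

$$
J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} r\big(s_i, \mu_\theta(s_i)\big).
$$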

The proof follows along the same lines as the standard policy gradient theorem:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\mu_\theta}[\nabla_\theta Q^{\mu_\theta}(s,a)] = \mathbb{E}_{s\sim \rho^{\mu}}\left[\nabla_\theta \mu_\theta(s) \, \nabla_a Q^{\mu_\theta}(s,a)\vert_{a=\mu_\theta(s)}\right]
$$
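
In code, this gradient rarely needs to be assembled by hand: differentiating $Q^{\mu_\theta}(s, \mu_\theta(s))$ with respect to $\theta$ via automatic differentiation applies exactly the chain rule above. Below is a minimal PyTorch-style sketch of one such actor update; the MLP architectures, dimensions, and hyperparameters are illustrative assumptions, not from the source.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2  # hypothetical toy dimensions

# Deterministic actor mu_theta(s) and critic Q(s, a); small MLPs for illustration.
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

# A batch of states standing in for samples s ~ rho^mu.
states = torch.randn(32, state_dim)

# Actor objective: maximize E[Q(s, mu_theta(s))], i.e. minimize its negative.
# Backpropagating through the critic into the actor computes
# grad_theta mu_theta(s) * grad_a Q(s, a)|_{a = mu_theta(s)} automatically.
actions = actor(states)
actor_loss = -critic(torch.cat([states, actions], dim=1)).mean()

actor_opt.zero_grad()
actor_loss.backward()  # critic params also receive grads, but only the actor is stepped
actor_opt.step()
```

This is the same shape of update that DDPG uses for its actor.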
### Issues for DPG