update notations
@@ -198,20 +198,20 @@ $$
Take the softmax policy as an example:
-Weight actions using the linear combination of features $\phi(s,a)^T\theta$:
+Weight actions using the linear combination of features $\phi(s,a)^\top\theta$:
Probability of action is proportional to the exponentiated weights:
$$
-\pi_\theta(s,a) \propto \exp(\phi(s,a)^T\theta)
+\pi_\theta(s,a) \propto \exp(\phi(s,a)^\top\theta)
$$
The score function is
$$
\begin{aligned}
-\nabla_\theta \ln\left[\frac{\exp(\phi(s,a)^T\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^T\theta)}\right] &= \nabla_\theta\left(\ln \exp(\phi(s,a)^T\theta) - \ln \sum_{a'\in A}\exp(\phi(s,a')^T\theta)\right) \\
-&= \phi(s,a) - \frac{\sum_{a'\in A}\phi(s,a')\exp(\phi(s,a')^T\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^T\theta)} \\
+\nabla_\theta \ln\left[\frac{\exp(\phi(s,a)^\top\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^\top\theta)}\right] &= \nabla_\theta\left(\ln \exp(\phi(s,a)^\top\theta) - \ln \sum_{a'\in A}\exp(\phi(s,a')^\top\theta)\right) \\
+&= \phi(s,a) - \frac{\sum_{a'\in A}\phi(s,a')\exp(\phi(s,a')^\top\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^\top\theta)} \\
&= \phi(s,a) - \sum_{a'\in A} \pi_\theta(s,a') \phi(s,a') \\
&= \phi(s,a) - \mathbb{E}_{a'\sim \pi_\theta(s,\cdot)}[\phi(s,a')]
\end{aligned}
@@ -221,7 +221,7 @@ $$
In continuous action spaces, a Gaussian policy is natural
-Mean is a linear combination of state features $\mu(s) = \phi(s)^T\theta$
+Mean is a linear combination of state features $\mu(s) = \phi(s)^\top\theta$
Variance may be fixed at $\sigma^2$, or it can also be parameterized
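
As a sanity check on the two policies in this section, here is a minimal Python sketch using only the standard library. It assumes a single fixed state, a small made-up feature table `phi`, and a fixed standard deviation `sigma`; all function names are hypothetical. The Gaussian score $(a - \mu(s))\,\phi(s)/\sigma^2$ used here is the standard result for a fixed-variance Gaussian and is an addition, not something stated in the notes. Both analytic scores are compared against central finite differences of $\ln \pi_\theta$.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def softmax_policy(phi, theta):
    """phi[a] is the feature vector phi(s, a) for one fixed state s."""
    logits = [dot(f, theta) for f in phi]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - m) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_score(phi, theta, a):
    """Analytic score: phi(s, a) - E_{a' ~ pi_theta}[phi(s, a')]."""
    pi = softmax_policy(phi, theta)
    dim = len(theta)
    expected = [sum(p * f[i] for p, f in zip(pi, phi)) for i in range(dim)]
    return [phi[a][i] - expected[i] for i in range(dim)]


def gaussian_log_prob(phi_s, theta, action, sigma):
    """Log-density of N(mu(s), sigma^2) with mu(s) = phi(s)^T theta."""
    mu = dot(phi_s, theta)
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (action - mu) ** 2 / (2 * sigma ** 2))


def gaussian_score(phi_s, theta, action, sigma):
    """Analytic score for fixed sigma (assumed): (a - mu(s)) * phi(s) / sigma^2."""
    mu = dot(phi_s, theta)
    return [(action - mu) * x / sigma ** 2 for x in phi_s]


def numerical_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at theta."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def max_abs_diff(u, v):
    return max(abs(x - y) for x, y in zip(u, v))


# Illustrative values only: 3 actions, 3 features.
phi = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [0.3, 0.3, 1.0]]
theta = [0.2, -0.4, 0.1]
a = 1

softmax_err = max_abs_diff(
    softmax_score(phi, theta, a),
    numerical_grad(lambda t: math.log(softmax_policy(phi, t)[a]), theta),
)

phi_s, action, sigma = [0.5, -1.0, 2.0], 0.7, 0.5
gaussian_err = max_abs_diff(
    gaussian_score(phi_s, theta, action, sigma),
    numerical_grad(lambda t: gaussian_log_prob(phi_s, t, action, sigma), theta),
)

print("softmax score error:", softmax_err)
print("gaussian score error:", gaussian_err)
```

If the derivation holds, both printed errors should be close to zero. A further check is that the softmax score averages to zero under the policy itself, since $\sum_a \pi_\theta(s,a)\,\phi(s,a)$ is exactly the expectation being subtracted.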