update notations

Trance-0
2025-11-04 12:43:23 -06:00
parent d24c0bdd9e
commit 614479e4d0
27 changed files with 333 additions and 100 deletions

@@ -198,20 +198,20 @@ $$
Take the softmax policy as an example:
Weight actions using a linear combination of features $\phi(s,a)^\top\theta$:
The probability of an action is proportional to its exponentiated weight:
$$
\pi_\theta(s,a) \propto \exp(\phi(s,a)^\top\theta)
$$
The score function is
$$
\begin{aligned}
\nabla_\theta \ln\left[\frac{\exp(\phi(s,a)^\top\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^\top\theta)}\right] &= \nabla_\theta\left(\ln \exp(\phi(s,a)^\top\theta) - \ln \sum_{a'\in A}\exp(\phi(s,a')^\top\theta)\right) \\
&= \phi(s,a) - \frac{\sum_{a'\in A}\phi(s,a')\exp(\phi(s,a')^\top\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^\top\theta)} \\
&= \phi(s,a) - \sum_{a'\in A} \pi_\theta(s,a') \phi(s,a') \\
&= \phi(s,a) - \mathbb{E}_{a'\sim \pi_\theta(s,\cdot)}[\phi(s,a')]
\end{aligned}
$$
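
As a quick sanity check on this derivation, the sketch below compares the analytic score $\phi(s,a) - \mathbb{E}_{a'\sim \pi_\theta(s,\cdot)}[\phi(s,a')]$ against a finite-difference gradient of $\ln \pi_\theta(s,a)$; the features `phi`, weights `theta`, and the dimensions are made-up assumptions for illustration:

```python
import numpy as np

# Numeric check of the softmax score function derived above.
# `phi`, `theta`, and the dimensions are illustrative assumptions.
rng = np.random.default_rng(0)
n_actions, d = 4, 3
phi = rng.normal(size=(n_actions, d))  # row a holds phi(s, a) for a fixed state s
theta = rng.normal(size=d)

def log_pi(t, a):
    # log pi_theta(s, a) for the softmax policy, max-shifted for stability
    logits = phi @ t
    m = logits.max()
    return logits[a] - m - np.log(np.exp(logits - m).sum())

a = 1
logits = phi @ theta
pi = np.exp(logits - logits.max())
pi /= pi.sum()                 # pi_theta(s, .) over the actions
score = phi[a] - pi @ phi      # phi(s,a) - E_{a'~pi_theta}[phi(s,a')]

# Central finite differences of log pi w.r.t. theta should match the score.
eps = 1e-6
fd = np.array([(log_pi(theta + eps * e, a) - log_pi(theta - eps * e, a)) / (2 * eps)
               for e in np.eye(d)])
print(np.allclose(score, fd, atol=1e-5))  # expected: True
```

The two vectors agreeing confirms that the expectation term is exactly the gradient of the log-normalizer.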
@@ -221,7 +221,7 @@ $$
In continuous action spaces, a Gaussian policy is natural
Mean is a linear combination of state features $\mu(s) = \phi(s)^\top\theta$
Variance may be fixed at $\sigma^2$, or it can also be parametrized
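
For comparison with the softmax case, the score function follows directly from the Gaussian log-density; a short derivation, assuming the fixed-variance case with $\mu(s) = \phi(s)^\top\theta$:
$$
\nabla_\theta \ln \pi_\theta(s,a) = \nabla_\theta\left(-\frac{(a - \phi(s)^\top\theta)^2}{2\sigma^2}\right) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^2}
$$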