update notations

Trance-0
2025-11-04 12:43:23 -06:00
parent d24c0bdd9e
commit 614479e4d0
27 changed files with 333 additions and 100 deletions

@@ -198,20 +198,20 @@ $$
Take the softmax policy as an example:
Weight actions using a linear combination of features $\phi(s,a)^\top\theta$:
The probability of an action is proportional to its exponentiated weight:
$$
\pi_\theta(s,a) \propto \exp(\phi(s,a)^\top\theta)
$$
The score function is
$$
\begin{aligned}
\nabla_\theta \ln\left[\frac{\exp(\phi(s,a)^\top\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^\top\theta)}\right] &= \nabla_\theta\left(\ln \exp(\phi(s,a)^\top\theta) - \ln \sum_{a'\in A}\exp(\phi(s,a')^\top\theta)\right) \\
&= \phi(s,a) - \frac{\sum_{a'\in A}\phi(s,a')\exp(\phi(s,a')^\top\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^\top\theta)} \\
&= \phi(s,a) - \sum_{a'\in A} \pi_\theta(s,a') \phi(s,a') \\
&= \phi(s,a) - \mathbb{E}_{a'\sim \pi_\theta(s,\cdot)}[\phi(s,a')]
\end{aligned}
$$
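
As a quick sanity check on this derivation, the sketch below compares the analytic score $\phi(s,a) - \mathbb{E}_{a'\sim \pi_\theta(s,\cdot)}[\phi(s,a')]$ against a finite-difference gradient of $\ln \pi_\theta(s,a)$; the features `phi`, weights `theta`, and the dimensions are made-up assumptions for illustration:

```python
import numpy as np

# Numeric check of the softmax score function derived above.
# `phi`, `theta`, and the dimensions are illustrative assumptions.
rng = np.random.default_rng(0)
n_actions, d = 4, 3
phi = rng.normal(size=(n_actions, d))  # row a holds phi(s, a) for a fixed state s
theta = rng.normal(size=d)

def log_pi(t, a):
    # log pi_theta(s, a) for the softmax policy, max-shifted for stability
    logits = phi @ t
    m = logits.max()
    return logits[a] - m - np.log(np.exp(logits - m).sum())

a = 1
logits = phi @ theta
pi = np.exp(logits - logits.max())
pi /= pi.sum()                 # pi_theta(s, .) over the actions
score = phi[a] - pi @ phi      # phi(s,a) - E_{a'~pi_theta}[phi(s,a')]

# Central finite differences of log pi w.r.t. theta should match the score.
eps = 1e-6
fd = np.array([(log_pi(theta + eps * e, a) - log_pi(theta - eps * e, a)) / (2 * eps)
               for e in np.eye(d)])
print(np.allclose(score, fd, atol=1e-5))  # expected: True
```

The two vectors agreeing confirms that the expectation term is exactly the gradient of the log-normalizer.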
@@ -221,7 +221,7 @@ $$
In continuous action spaces, a Gaussian policy is natural
Mean is a linear combination of state features $\mu(s) = \phi(s)^\top\theta$
Variance may be fixed at $\sigma^2$, or it can also be parametrized
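
For comparison with the softmax case, the score function follows directly from the Gaussian log-density; a short derivation, assuming the fixed-variance case with $\mu(s) = \phi(s)^\top\theta$:
$$
\nabla_\theta \ln \pi_\theta(s,a) = \nabla_\theta\left(-\frac{(a - \phi(s)^\top\theta)^2}{2\sigma^2}\right) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^2}
$$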