update notations

Take the softmax policy as an example:

Weight actions using a linear combination of features $\phi(s,a)^\top\theta$:

The probability of an action is proportional to its exponentiated weight:

$$
\pi_\theta(s,a) \propto \exp(\phi(s,a)^\top\theta)
$$

The score function is

$$
\begin{aligned}
\nabla_\theta \ln\left[\frac{\exp(\phi(s,a)^\top\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^\top\theta)}\right] &= \nabla_\theta(\ln \exp(\phi(s,a)^\top\theta) - (\ln \sum_{a'\in A}\exp(\phi(s,a')^\top\theta))) \\
&= \phi(s,a) - \frac{\sum_{a'\in A}\phi(s,a')\exp(\phi(s,a')^\top\theta)}{\sum_{a'\in A}\exp(\phi(s,a')^\top\theta)} \\
&= \phi(s,a) - \sum_{a'\in A} \pi_\theta(s,a') \phi(s,a') \\
&= \phi(s,a) - \mathbb{E}_{a'\sim \pi_\theta(s,\cdot)}[\phi(s,a')]
\end{aligned}
$$

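A minimal numeric sketch of this score function (the array shapes and names here are illustrative, not from the source):

```python
import numpy as np

def softmax_score(phi, theta, a):
    """Score function of a linear softmax policy.

    phi:   (n_actions, n_features) matrix with rows phi(s, a')
    theta: (n_features,) parameter vector
    a:     index of the taken action
    """
    logits = phi @ theta                     # phi(s,a')^T theta for each a'
    probs = np.exp(logits - logits.max())    # numerically stable softmax
    probs /= probs.sum()                     # pi_theta(s, .)
    return phi[a] - probs @ phi              # phi(s,a) - E_{a'}[phi(s,a')]
```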

In continuous action spaces, a Gaussian policy is natural.

The mean is a linear combination of state features: $\mu(s) = \phi(s)^\top\theta$

The variance may be fixed at $\sigma^2$, or it can also be parametrized.
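
For the fixed-variance case $a\sim\mathcal{N}(\mu(s),\sigma^2)$, the score function has a simple closed form (a standard result, stated here to mirror the softmax case):

$$
\nabla_\theta \log \pi_\theta(s,a) = \frac{(a-\mu(s))\,\phi(s)}{\sigma^2}
$$
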
Action-Value Actor-Critic

- Simple actor-critic algorithm based on an action-value critic
- Using linear value function approximation $Q_w(s,a)=\phi(s,a)^\top w$

Critic: updates $w$ by linear $TD(0)$

Actor: updates $\theta$ by policy gradient
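
Putting the two updates together, one interaction step of the resulting loop might look like this sketch (the `features`, `sample_action`, `score`, and `env.step` interfaces are assumed, not from the source):

```python
import numpy as np

def qac_step(env, s, a, theta, w, features, sample_action, score,
             alpha=1e-2, beta=1e-2, gamma=0.99):
    """One step of action-value actor-critic with Q_w(s,a) = phi(s,a)^T w."""
    s2, r, done = env.step(a)                       # act, observe reward and next state
    a2 = sample_action(s2, theta)                   # a' ~ pi_theta(s', .)
    q = features(s, a) @ w                          # critic estimate Q_w(s,a)
    q2 = 0.0 if done else features(s2, a2) @ w      # Q_w(s',a')
    delta = r + gamma * q2 - q                      # TD(0) error
    w = w + beta * delta * features(s, a)           # critic: linear TD(0) update of w
    theta = theta + alpha * score(s, a, theta) * q  # actor: policy-gradient update
    return s2, a2, theta, w
```

Here `score(s, a, theta)` stands for the score function $\nabla_\theta \log \pi_\theta(s,a)$, e.g. the `softmax_score` above.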

Make a linear approximation to $L_{\pi_{\theta_{old}}}$ and a quadratic approximation to the KL term.

Maximize $g\cdot(\theta-\theta_{old})-\frac{\beta}{2}(\theta-\theta_{old})^\top F(\theta-\theta_{old})$

where $g=\frac{\partial}{\partial \theta}L_{\pi_{\theta_{old}}}(\pi_{\theta})\vert_{\theta=\theta_{old}}$ and $F=\frac{\partial^2}{\partial \theta^2}\overline{KL}_{\pi_{\theta_{old}}}(\pi_{\theta})\vert_{\theta=\theta_{old}}$

<summary>Taylor Expansion of KL Term</summary>

$$
D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\approx D_{KL}(\pi_{\theta_{old}}||\pi_{\theta_{old}})+d^\top \nabla_\theta D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\vert_{\theta=\theta_{old}}+\frac{1}{2}d^\top \nabla_\theta^2 D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\vert_{\theta=\theta_{old}}d
$$

where $d=\theta-\theta_{old}$. The zeroth- and first-order terms vanish, so only the Hessian of the KL matters:

$$
\begin{aligned}
\nabla_\theta^2 D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\vert_{\theta=\theta_{old}}&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\nabla_\theta^2 \log P_\theta(x)\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\nabla_\theta \left(\frac{\nabla_\theta P_\theta(x)}{P_\theta(x)}\right)\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left(\frac{\nabla_\theta^2 P_\theta(x)P_\theta(x)-\nabla_\theta P_\theta(x)\nabla_\theta P_\theta(x)^\top}{P_\theta(x)^2}\right)\vert_{\theta=\theta_{old}}\\
&=-\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left(\frac{\nabla_\theta^2 P_\theta(x)\vert_{\theta=\theta_{old}}}{P_{\theta_{old}}(x)}\right)+\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left(\nabla_\theta \log P_\theta(x)\nabla_\theta \log P_\theta(x)^\top\right)\vert_{\theta=\theta_{old}}\\
&=\mathbb{E}_{x\sim \pi_{\theta_{old}}}\nabla_\theta\log P_\theta(x)\nabla_\theta\log P_\theta(x)^\top\vert_{\theta=\theta_{old}}\\
\end{aligned}
$$

where the first term drops out because $\mathbb{E}_{x\sim \pi_{\theta_{old}}}\left[\frac{\nabla_\theta^2 P_\theta(x)}{P_{\theta_{old}}(x)}\right]\vert_{\theta=\theta_{old}}=\nabla_\theta^2 \int P_\theta(x)\,dx\,\vert_{\theta=\theta_{old}}=\nabla_\theta^2 1=0$.
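
The last line suggests a direct Monte-Carlo estimate of $F$ from on-policy samples; a minimal sketch, where the `grad_log_prob` helper is assumed rather than from the source:

```python
import numpy as np

def empirical_fisher(samples, grad_log_prob):
    """Estimate F = E_x[ grad log P(x) grad log P(x)^T ] at theta_old.

    samples:       x's drawn from pi_theta_old
    grad_log_prob: function x -> gradient of log P_theta(x) at theta_old
    """
    grads = np.stack([grad_log_prob(x) for x in samples])  # (N, d)
    return grads.T @ grads / len(grads)                    # average outer product
```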

$\theta_{new}=\theta_{old}+d$

First-order Taylor expansion for the loss and second-order for the KL:
$$
\approx \arg\max_{d} J(\theta_{old})+\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}d-\frac{1}{2}\lambda(d^\top\nabla_\theta^2 D_{KL}\left[\pi_{\theta_{old}}||\pi_{\theta}\right]\mid_{\theta=\theta_{old}}d)+\lambda \delta
$$

If you are really interested, try to fill in the Solving the KL-Constrained Problem section.

Setting the gradient to zero:
$$
\begin{aligned}
0&=\frac{\partial}{\partial d}\left(-\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}d+\frac{1}{2}\lambda d^\top F(\theta_{old})d\right)\\
&=-\nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}+\lambda F(\theta_{old})d
\end{aligned}
$$
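
Solving for $d$ gives the natural gradient direction (completing the step; the $\frac{1}{\lambda}$ scale is then fixed by the KL budget below):

$$
d = \frac{1}{\lambda} F(\theta_{old})^{-1} \nabla_\theta J(\theta)\mid_{\theta=\theta_{old}}
$$
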
$$
D_{KL}(\pi_{\theta_{old}}||\pi_{\theta})\approx \frac{1}{2}(\theta-\theta_{old})^\top F(\theta_{old})(\theta-\theta_{old})
$$

For an update $d=\alpha g_N$ along the natural gradient direction $g_N$, the largest step size $\alpha$ allowed by the KL budget $\delta$ satisfies

$$
\frac{1}{2}(\alpha g_N)^\top F(\alpha g_N)=\delta
$$

which gives

$$
\alpha=\sqrt{\frac{2\delta}{g_N^\top F g_N}}
$$
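
A compact sketch of the resulting update (dense linear algebra for clarity; in practice $F^{-1}g$ is typically obtained with conjugate gradients, and the names here are illustrative):

```python
import numpy as np

def natural_gradient_step(theta_old, g, F, delta):
    """Natural-gradient update under a KL trust region of size delta.

    g:     policy gradient at theta_old
    F:     Fisher information matrix at theta_old
    delta: KL budget
    """
    g_nat = np.linalg.solve(F, g)                        # g_N = F^{-1} g
    alpha = np.sqrt(2.0 * delta / (g_nat @ F @ g_nat))   # step size from KL constraint
    return theta_old + alpha * g_nat
```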

However, due to the quadratic approximation, the KL constraint may be violated.

So we can learn $f(s_t,a_t)$ from data, and _then_ plan through it.
Model-based reinforcement learning version **0.5**:

1. Run base policy $\pi_0$ (e.g. random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^T$
2. Learn dynamics model $f(s_t,a_t)$ to minimize $\sum_{i}\|f(s_i,a_i)-s_{i+1}\|^2$
3. Plan through $f(s_t,a_t)$ to choose action $a_t$
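
The whole recipe fits in a few lines; a sketch under assumed `env`, `fit_model`, and `plan` interfaces (none of these names come from the source):

```python
def collect(env, policy, n_steps):
    """Roll out `policy` and record (s, a, s') transitions."""
    data, s = [], env.reset()
    for _ in range(n_steps):
        a = policy(s)
        s_next, _, done = env.step(a)       # assumed env interface
        data.append((s, a, s_next))
        s = env.reset() if done else s_next
    return data

def mbrl_v05(env, base_policy, fit_model, plan):
    data = collect(env, base_policy, 1000)  # 1. run pi_0 to collect D
    f = fit_model(data)                     # 2. fit f to minimize sum ||f(s,a) - s'||^2
    return plan(f)                          # 3. plan through f to choose actions
```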

Version 2.0: backpropagate directly into policy
Final version:

1. Run base policy $\pi_0$ (e.g. random policy) to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^T$
2. Learn dynamics model $f(s_t,a_t)$ to minimize $\sum_{i}\|f(s_i,a_i)-s_{i+1}\|^2$
3. Backpropagate through $f(s_t,a_t)$ into the policy to optimize $\pi_\theta(s_t,a_t)$
4. Run the policy $\pi_\theta(s_t,a_t)$ to collect $\mathcal{D} = \{(s_t, a_t, s_{t+1})\}_{t=0}^T$
5. Go to step 2
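
The same loop in code, reusing the hypothetical `collect` helper from the v0.5 sketch (`fit_model` and `optimize_policy` are likewise assumed interfaces):

```python
def mbrl_final(env, base_policy, fit_model, optimize_policy, policy,
               n_iters=10, n_steps=1000):
    data = collect(env, base_policy, n_steps)   # 1. D from the base policy pi_0
    for _ in range(n_iters):                    # 5. go to step 2
        f = fit_model(data)                     # 2. fit the dynamics model
        policy = optimize_policy(policy, f)     # 3. backpropagate through f
        data += collect(env, policy, n_steps)   # 4. collect on-policy data
    return policy
```
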
## Model Learning with High-Dimensional Observations