From 898671c2dd58744d6743848ab5d3141cd4f0b290 Mon Sep 17 00:00:00 2001
From: Trance-0 <60459821+Trance-0@users.noreply.github.com>
Date: Tue, 23 Sep 2025 11:22:34 -0500
Subject: [PATCH] updates

---
 content/CSE510/CSE510_L9.md | 271 ++++++++++++++++++++++++++++++++++++
 content/CSE510/_meta.js     |   1 +
 2 files changed, 272 insertions(+)
 create mode 100644 content/CSE510/CSE510_L9.md

diff --git a/content/CSE510/CSE510_L9.md b/content/CSE510/CSE510_L9.md
new file mode 100644
index 0000000..4e53253
--- /dev/null
+++ b/content/CSE510/CSE510_L9.md
@@ -0,0 +1,271 @@
+# CSE510 Deep Reinforcement Learning (Lecture 9)
+
+## Large state spaces
+
+The RL algorithms presented so far have little chance of solving real-world problems when the state (or action) space is large.
+
+- We can no longer represent the $V$ or $Q$ function as an explicit table.
+
+Even if we had enough memory:
+
+- Never enough training data
+- Learning takes too long
+
+What about large state spaces?
+
+We will now study three other approaches:
+
+- Value function approximation
+- Policy gradient methods
+- Actor-critic methods
+
+## RL with Function Approximation
+
+Solution for large MDPs:
+
+- Estimate the value function using a function approximator
+
+**Value function approximation (VFA)** replaces the table with a general parameterized form:
+
+$$
+\hat{V}(s, \theta) \approx V_\pi(s)
+$$
+
+or
+
+$$
+\hat{Q}(s, a, \theta) \approx Q_\pi(s, a)
+$$
+
+Benefits:
+
+- Generalization: these functions can be trained to map similar states to similar values
+  - Reduce memory usage
+  - Reduce computation time
+  - Reduce the experience needed to learn $V$/$Q$
+
+## Linear Function Approximation
+
+Define a set of state features $f_1(s),\ldots,f_n(s)$
+
+- The features are used as our representation of the state
+- States with similar feature values will be considered similar
+
+A common approximation is to represent $V(s)$ as a linear combination of the features:
+
+$$
+\hat{V}(s, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s)
+$$
+
+The approximation accuracy is fundamentally limited by the information provided by the features.
+
+Can we always define features that allow for a perfect linear approximation?
+
+- Yes. Assign each state an indicator feature. (The $i$th feature is $1$ if and only if the current state is the $i$th state, and $\theta_i$ represents the value of the $i$th state.)
+- However, this requires a feature for each state, which is impractical for large state spaces (no generalization), as the short sketch below illustrates.
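+
+As a quick illustration (a minimal sketch, not part of the lecture notes; the feature map and parameter values are made up), the snippet below evaluates a linear value function $\hat{V}(s,\theta)=\theta_0+\sum_i \theta_i f_i(s)$ and shows that with one-hot indicator features it degenerates into a lookup table with one parameter per state: a perfect fit, but no generalization.
+
+```python
+import numpy as np
+
+def v_hat(features, theta):
+    # V_hat(s, theta) = theta_0 + sum_i theta_i * f_i(s); prepend the bias feature f_0 = 1.
+    return np.concatenate(([1.0], features)) @ theta
+
+n_states = 4
+theta = np.array([0.0, 3.0, -1.0, 2.5, 7.0])   # theta_0 plus one parameter per state (hypothetical values)
+
+def indicator_features(s):
+    return np.eye(n_states)[s]                 # f_i(s) = 1 iff s is the i-th state
+
+print(np.array([v_hat(indicator_features(s), theta) for s in range(n_states)]))
+# prints the per-state parameters (3, -1, 2.5, 7), i.e. exactly a value table
+```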
+
+<details>
+<summary>Example</summary>
+
+Consider a grid with no obstacles, deterministic actions U/D/L/R, no discounting, and a reward of -1 everywhere except +10 at the goal.
+
+The grid is:
+
+|4|5|6|7|8|9|10|
+|---|---|---|---|---|---|---|
+|3|4|5|6|7|8|9|
+|2|3|4|5|6|7|8|
+|1|2|3|4|5|6|7|
+|0|1|2|3|4|5|6|
+|0|0|1|2|3|4|5|
+|0|0|0|1|2|3|4|
+
+Features for state $s=(x, y)$: $f_1(s)=x$, $f_2(s)=y$ (just 2 features)
+
+$$
+V(s) = \theta_0 + \theta_1 x + \theta_2 y
+$$
+
+Is there a good linear approximation?
+
+- Yes.
+- $\theta_0 =10, \theta_1 = -1, \theta_2 = -1$
+- (note: the upper-right corner is the origin)
+
+$$
+V(s) = 10 - x - y
+$$
+
+which subtracts the Manhattan distance to the goal from the goal reward.
+
+---
+
+However, for a different grid, $V(s)=\theta_0 + \theta_1 x + \theta_2 y$ is not a good approximation:
+
+|4|5|6|7|6|5|4|
+|---|---|---|---|---|---|---|
+|5|6|7|8|7|6|5|
+|6|7|8|9|8|7|6|
+|7|8|9|10|9|8|7|
+|6|7|8|9|8|7|6|
+|5|6|7|8|7|6|5|
+|4|5|6|7|6|5|4|
+
+But we can include a new feature $z=|3-x|+|3-y|$ (the Manhattan distance to the center) to get a good approximation (see the fitting sketch after this example):
+
+$$
+V(s) = \theta_0 + \theta_1 x + \theta_2 y + \theta_3 z
+$$
+
+> Usually, we need to define a different set of features (and hence a different approximation) for each problem.
+
+</details>
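+
+To make the last point concrete, here is a minimal sketch (not part of the original notes) that fits $\theta$ by least squares on the second grid using the features $(1, x, y, z)$, under the assumption that $(x, y)$ indexes columns and rows from the top-left corner. The recovered parameters are approximately $(10, 0, 0, -1)$, i.e. $V(s) = 10 - z$, so adding $z$ makes the linear approximation exact for this grid.
+
+```python
+import numpy as np
+
+# Target values of the second grid (rows top to bottom, columns left to right).
+V_true = np.array([
+    [4, 5, 6, 7, 6, 5, 4],
+    [5, 6, 7, 8, 7, 6, 5],
+    [6, 7, 8, 9, 8, 7, 6],
+    [7, 8, 9, 10, 9, 8, 7],
+    [6, 7, 8, 9, 8, 7, 6],
+    [5, 6, 7, 8, 7, 6, 5],
+    [4, 5, 6, 7, 6, 5, 4],
+], dtype=float)
+
+# One feature row [1, x, y, z] per state, with z = |3 - x| + |3 - y|.
+F, v = [], []
+for y in range(7):
+    for x in range(7):
+        z = abs(3 - x) + abs(3 - y)
+        F.append([1.0, x, y, z])
+        v.append(V_true[y, x])
+F, v = np.array(F), np.array(v)
+
+theta, *_ = np.linalg.lstsq(F, v, rcond=None)
+print(np.round(theta, 3))          # approx. [10, 0, 0, -1]  ->  V(s) = 10 - z
+print(np.allclose(F @ theta, v))   # True: the fit is exact for this grid
+```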
+
+### Learning with Linear Function Approximation
+
+Define a set of features $f_1(s),\ldots,f_n(s)$
+
+- The features are used as our representation of the state
+- States with similar feature values will be treated similarly
+- More complex functions require more features
+
+$$
+\hat{V}(s, \theta) =\theta_0 + \sum_{i=1}^n \theta_i f_i(s)
+$$
+
+Our goal is to learn parameter values that approximate the value function well.
+
+- How can we do this?
+- Use TD-based RL and update the parameters based on each experience.
+
+#### TD-based learning with function approximators
+
+1. Start with initial parameter values
+2. Take action according to an exploration/exploitation policy, transitioning from $s_j$ to $s_{j+1}$
+3. Update the estimated model (if one is being learned)
+4. Perform the TD update for each parameter
+   $$
+   \theta_i \gets \theta_i + \alpha \left(R(s_j)+\gamma \hat{V}_\theta(s_{j+1})- \hat{V}_\theta(s_j)\right)f_i(s_j)
+   $$
+5. Goto 2
+
+In general, **the TD update for each parameter is**:
+
+$$
+\theta_i \gets \theta_i + \alpha \left(v(s_j)-\hat{V}_\theta(s_j)\right)f_i(s_j)
+$$
+
+where $v(s_j)$ is a target value for state $s_j$; the choice of target is discussed below, and a short code sketch of this update loop follows.
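+
+The following minimal sketch (not from the lecture) runs this TD loop with a linear value function on a hypothetical 5-state chain: the policy always moves right, the reward is $-1$ per step and $+10$ on the step that reaches the terminal state, and the step size $\alpha=0.1$ is an arbitrary choice. With a bias feature and a position feature the true values are exactly linear, so the parameters converge to them.
+
+```python
+import numpy as np
+
+def features(s):
+    # f_0 = 1 (bias) and f_1 = s (position): the true values are linear in these.
+    return np.array([1.0, float(s)])
+
+def v_hat(s, theta):
+    return theta @ features(s)
+
+def td_update(theta, s, r, s_next, done, alpha=0.1, gamma=1.0):
+    # theta_i <- theta_i + alpha * (r + gamma * V_hat(s') - V_hat(s)) * f_i(s)
+    bootstrap = 0.0 if done else gamma * v_hat(s_next, theta)
+    td_error = r + bootstrap - v_hat(s, theta)
+    return theta + alpha * td_error * features(s)
+
+theta = np.zeros(2)
+for episode in range(500):
+    s = 0
+    done = False
+    while not done:
+        s_next = s + 1                 # the policy always moves right
+        done = (s_next == 4)           # state 4 is terminal
+        r = 10.0 if done else -1.0     # -1 per step, +10 on reaching the goal
+        theta = td_update(theta, s, r, s_next, done)
+        s = s_next
+
+# True values are 7, 8, 9, 10 for states 0..3 (+10 on the last step, -1 before).
+print(np.round([v_hat(s, theta) for s in range(4)], 2))
+```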
+
+<details>
+<summary>Proof from Gradient Descent</summary>
+
+Our goal is to minimize the squared error between our estimated value function and the target value at each visited state:
+
+$$
+E_j(\theta) = \frac{1}{2} \left(\hat{V}_\theta(s_j)-v(s_j)\right)^2
+$$
+
+Here $E_j(\theta)$ is the squared error of example $j$,
+
+$\hat{V}_\theta(s_j)$ is our estimated value function at state $s_j$, and
+
+$v(s_j)$ is the target value at state $s_j$.
+
+After seeing the $j$th example, the **gradient descent rule** tells us that we can decrease the error $E_j(\theta)$ by updating
+
+$$
+\theta_i \gets \theta_i - \alpha \frac{\partial E_j(\theta)}{\partial \theta_i}
+$$
+
+where $\alpha$ is the learning rate.
+
+By the chain rule, we have:
+
+$$
+\begin{aligned}
+\theta_i &\gets \theta_i -\alpha \frac{\partial E_j(\theta)}{\partial \theta_i} \\
+&= \theta_i - \alpha \frac{\partial E_j}{\partial \hat{V}_\theta(s_j)}\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}\\
+&= \theta_i - \alpha \left(\hat{V}_\theta(s_j)-v(s_j)\right)f_i(s_j)
+\end{aligned}
+$$
+
+Note that $\frac{\partial E_j}{\partial \hat{V}_\theta(s_j)}=\hat{V}_\theta(s_j)-v(s_j)$.
+
+For the linear approximation function
+
+$$
+\hat{V}_\theta(s_j) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s_j)
+$$
+
+we have $\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}=f_i(s_j)$.
+
+Thus the TD update for each parameter is:
+
+$$
+\theta_i \gets \theta_i + \alpha \left(v(s_j)-\hat{V}_\theta(s_j)\right)f_i(s_j)
+$$
+
+For linear functions, this update is guaranteed to converge to the best approximation for a suitable learning rate schedule. (A short numerical check of this gradient follows the proof.)
+
+</details>
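+
+As a quick sanity check (a minimal sketch, not from the lecture; the feature vector, parameters, and target value below are made up), the snippet compares the analytic gradient used above, $\frac{\partial E_j}{\partial \theta_i} = \left(\hat{V}_\theta(s_j)-v(s_j)\right)f_i(s_j)$, against a central finite-difference estimate of the gradient of the squared error.
+
+```python
+import numpy as np
+
+f = np.array([1.0, 2.0, -0.5])        # hypothetical features f_i(s_j); f_0 = 1 is the bias
+theta = np.array([0.3, -1.2, 0.7])    # current parameter values
+v_target = 4.0                        # hypothetical target value v(s_j)
+
+def error(theta):
+    # E_j(theta) = 1/2 * (V_hat_theta(s_j) - v(s_j))^2 with a linear V_hat
+    return 0.5 * (theta @ f - v_target) ** 2
+
+# Analytic gradient from the derivation: (V_hat - v) * f_i
+analytic = (theta @ f - v_target) * f
+
+# Central finite differences for each parameter
+eps = 1e-6
+numeric = np.array([
+    (error(theta + eps * np.eye(3)[i]) - error(theta - eps * np.eye(3)[i])) / (2 * eps)
+    for i in range(3)
+])
+
+print(np.allclose(analytic, numeric))  # True: the update rule uses the correct gradient
+```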
+
+What do we use for the **target value** $v(s_j)$?
+
+Use the TD prediction based on the next state $s_{j+1}$ (bootstrapping):
+
+$v(s_j)=R(s_j)+\gamma \hat{V}_\theta(s_{j+1})$
+
+So the TD update for each parameter is:
+
+$$
+\theta_i \gets \theta_i + \alpha \left(R(s_j)+\gamma \hat{V}_\theta(s_{j+1})- \hat{V}_\theta(s_j)\right)f_i(s_j)
+$$
+
+> [!NOTE]
+>
+> Initially, the value function may be all zeros. It is often better to use a denser reward signal to help initialize the value function.
+
+#### Q-function approximation
+
+Instead of features $f(s)$, we use features $f(s,a)$ to approximate $Q(s,a)$:
+
+State-action pairs with similar feature values will be treated similarly.
+
+More complex functions require more complex features.
+
+$$
+\hat{Q}(s,a, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s,a)
+$$
+
+_Features are a function of both state and action._
+
+Just as for TD, we can generalize Q-learning to update the parameters of the Q-function approximation.
+
+Q-learning with Linear Approximators:
+
+1. Start with initial parameter values
+2. Take action according to an exploration/exploitation policy, transitioning from $s$ to $s'$
+3. Perform the TD update for each parameter
+   $$
+   \theta_i \gets \theta_i + \alpha \left(R(s)+\gamma \max_{a'\in A} \hat{Q}_\theta(s',a')- \hat{Q}_\theta(s,a)\right)f_i(s,a)
+   $$
+4. Goto 2
+
+> [!WARNING]
+>
+> Typically the error surface has many local minima, and convergence is no longer guaranteed.
+> However, it often works well in practice.
+
+Here $R(s)+\gamma \max_{a'\in A} \hat{Q}_\theta(s',a')$ is the estimate of $Q(s,a)$ based on an observed transition.
+
+Note that $f_i(s,a)=\frac{\partial \hat{Q}_\theta(s,a)}{\partial \theta_i}$; this needs to be computable in closed form.
+
+## Deep Q-network (DQN)
+
+DQN is a non-linear function approximator that uses a deep neural network to approximate the value function.
+
+The goal is a single agent that can solve any human-level control problem.
+
+- RL defines the objective (the Q-value function)
+- DL learns the hierarchical feature representation
+
+Use a deep network to represent the value function:
diff --git a/content/CSE510/_meta.js b/content/CSE510/_meta.js
index 67feade..2eee658 100644
--- a/content/CSE510/_meta.js
+++ b/content/CSE510/_meta.js
@@ -11,4 +11,5 @@ export default {
   CSE510_L6: "CSE510 Deep Reinforcement Learning (Lecture 6)",
   CSE510_L7: "CSE510 Deep Reinforcement Learning (Lecture 7)",
   CSE510_L8: "CSE510 Deep Reinforcement Learning (Lecture 8)",
+  CSE510_L9: "CSE510 Deep Reinforcement Learning (Lecture 9)",
 }
\ No newline at end of file