From 898671c2dd58744d6743848ab5d3141cd4f0b290 Mon Sep 17 00:00:00 2001
From: Trance-0 <60459821+Trance-0@users.noreply.github.com>
Date: Tue, 23 Sep 2025 11:22:34 -0500
Subject: [PATCH] updates

---
 content/CSE510/CSE510_L9.md | 271 ++++++++++++++++++++++++++++++++++++
 content/CSE510/_meta.js     |   1 +
 2 files changed, 272 insertions(+)
 create mode 100644 content/CSE510/CSE510_L9.md

diff --git a/content/CSE510/CSE510_L9.md b/content/CSE510/CSE510_L9.md
new file mode 100644
index 0000000..4e53253
--- /dev/null
+++ b/content/CSE510/CSE510_L9.md
@@ -0,0 +1,271 @@
+# CSE510 Deep Reinforcement Learning (Lecture 9)
+
+## Large state spaces
+
+The RL algorithms presented so far have little chance of solving real-world problems when the state (or action) space is large.
+
+- We can no longer represent the $V$ or $Q$ function as an explicit table.
+
+Even if we had enough memory:
+
+- Never enough training data
+- Learning takes too long
+
+What about large state spaces?
+
+We will now study three other approaches:
+
+- Value function approximation
+- Policy gradient methods
+- Actor-critic methods
+
+## RL with Function Approximation
+
+Solution for large MDPs:
+
+- Estimate the value function using a function approximator
+
+**Value function approximation (VFA)** replaces the table with a general parameterized form:
+
+$$
+\hat{V}(s, \theta) \approx V_\pi(s)
+$$
+
+or
+
+$$
+\hat{Q}(s, a, \theta) \approx Q_\pi(s, a)
+$$
+
+Benefits:
+
+- Generalization: these functions can be trained to map similar states to similar values
+  - Reduce memory usage
+  - Reduce computation time
+  - Reduce the experience needed to learn $V$/$Q$
+
+## Linear Function Approximation
+
+Define a set of state features $f_1(s),\ldots,f_n(s)$
+
+- The features are used as our representation of the state
+- States with similar feature values will be considered similar
+
+A common approximation is to represent $V(s)$ as a linear combination of the features:
+
+$$
+\hat{V}(s, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s)
+$$
+
+The approximation accuracy is fundamentally limited by the information provided by the features.
+
+Can we always define features that allow for a perfect linear approximation?
+
+- Yes. Assign each state an indicator feature. (The $i$th feature is $1$ if and only if the current state is the $i$th state, and $\theta_i$ represents the value of the $i$th state.)
+- However, this requires a feature for each state, which is impractical for large state spaces (no generalization), as the short sketch below illustrates.
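+
+As a quick illustration (a minimal sketch, not part of the lecture notes; the feature map and parameter values are made up), the snippet below evaluates a linear value function $\hat{V}(s,\theta)=\theta_0+\sum_i \theta_i f_i(s)$ and shows that with one-hot indicator features it degenerates into a lookup table with one parameter per state: a perfect fit, but no generalization.
+
+```python
+import numpy as np
+
+def v_hat(features, theta):
+    # V_hat(s, theta) = theta_0 + sum_i theta_i * f_i(s); prepend the bias feature f_0 = 1.
+    return np.concatenate(([1.0], features)) @ theta
+
+n_states = 4
+theta = np.array([0.0, 3.0, -1.0, 2.5, 7.0])   # theta_0 plus one parameter per state (hypothetical values)
+
+def indicator_features(s):
+    return np.eye(n_states)[s]                 # f_i(s) = 1 iff s is the i-th state
+
+print(np.array([v_hat(indicator_features(s), theta) for s in range(n_states)]))
+# prints the per-state parameters (3, -1, 2.5, 7), i.e. exactly a value table
+```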
+
+<details>
+<summary>Example</summary>
+
+Consider a grid with no obstacles, deterministic actions U/D/L/R, no discounting, and a reward of -1 everywhere except +10 at the goal.
+
+The grid is:
+
+|4|5|6|7|8|9|10|
+|---|---|---|---|---|---|---|
+|3|4|5|6|7|8|9|
+|2|3|4|5|6|7|8|
+|1|2|3|4|5|6|7|
+|0|1|2|3|4|5|6|
+|0|0|1|2|3|4|5|
+|0|0|0|1|2|3|4|
+
+Features for state $s=(x, y)$: $f_1(s)=x$, $f_2(s)=y$ (just 2 features)
+
+$$
+V(s) = \theta_0 + \theta_1 x + \theta_2 y
+$$
+
+Is there a good linear approximation?
+
+- Yes.
+- $\theta_0 =10, \theta_1 = -1, \theta_2 = -1$
+- (note: the upper-right corner is the origin)
+
+$$
+V(s) = 10 - x - y
+$$
+
+which subtracts the Manhattan distance to the goal from the goal reward.
+
+---
+
+However, for a different grid, $V(s)=\theta_0 + \theta_1 x + \theta_2 y$ is not a good approximation:
+
+|4|5|6|7|6|5|4|
+|---|---|---|---|---|---|---|
+|5|6|7|8|7|6|5|
+|6|7|8|9|8|7|6|
+|7|8|9|10|9|8|7|
+|6|7|8|9|8|7|6|
+|5|6|7|8|7|6|5|
+|4|5|6|7|6|5|4|
+
+But we can include a new feature $z=|3-x|+|3-y|$ (the Manhattan distance to the center) to get a good approximation (see the fitting sketch after this example):
+
+$$
+V(s) = \theta_0 + \theta_1 x + \theta_2 y + \theta_3 z
+$$
+
+> Usually, we need to define a different set of features (and hence a different approximation) for each problem.
+
+</details>
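+
+To make the last point concrete, here is a minimal sketch (not part of the original notes) that fits $\theta$ by least squares on the second grid using the features $(1, x, y, z)$, under the assumption that $(x, y)$ indexes columns and rows from the top-left corner. The recovered parameters are approximately $(10, 0, 0, -1)$, i.e. $V(s) = 10 - z$, so adding $z$ makes the linear approximation exact for this grid.
+
+```python
+import numpy as np
+
+# Target values of the second grid (rows top to bottom, columns left to right).
+V_true = np.array([
+    [4, 5, 6, 7, 6, 5, 4],
+    [5, 6, 7, 8, 7, 6, 5],
+    [6, 7, 8, 9, 8, 7, 6],
+    [7, 8, 9, 10, 9, 8, 7],
+    [6, 7, 8, 9, 8, 7, 6],
+    [5, 6, 7, 8, 7, 6, 5],
+    [4, 5, 6, 7, 6, 5, 4],
+], dtype=float)
+
+# One feature row [1, x, y, z] per state, with z = |3 - x| + |3 - y|.
+F, v = [], []
+for y in range(7):
+    for x in range(7):
+        z = abs(3 - x) + abs(3 - y)
+        F.append([1.0, x, y, z])
+        v.append(V_true[y, x])
+F, v = np.array(F), np.array(v)
+
+theta, *_ = np.linalg.lstsq(F, v, rcond=None)
+print(np.round(theta, 3))          # approx. [10, 0, 0, -1]  ->  V(s) = 10 - z
+print(np.allclose(F @ theta, v))   # True: the fit is exact for this grid
+```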
+
+### Learning with Linear Function Approximation
+
+Define a set of features $f_1(s),\ldots,f_n(s)$
+
+- The features are used as our representation of the state
+- States with similar feature values will be treated similarly
+- More complex functions require more features
+
+$$
+\hat{V}(s, \theta) =\theta_0 + \sum_{i=1}^n \theta_i f_i(s)
+$$
+
+Our goal is to learn parameter values that approximate the value function well.
+
+- How can we do this?
+- Use TD-based RL and update the parameters based on each experience.
+
+#### TD-based learning with function approximators
+
+1. Start with initial parameter values
+2. Take action according to an exploration/exploitation policy, transitioning from $s_j$ to $s_{j+1}$
+3. Update the estimated model (if one is being learned)
+4. Perform the TD update for each parameter
+   $$
+   \theta_i \gets \theta_i + \alpha \left(R(s_j)+\gamma \hat{V}_\theta(s_{j+1})- \hat{V}_\theta(s_j)\right)f_i(s_j)
+   $$
+5. Goto 2
+
+In general, **the TD update for each parameter is**:
+
+$$
+\theta_i \gets \theta_i + \alpha \left(v(s_j)-\hat{V}_\theta(s_j)\right)f_i(s_j)
+$$
+
+where $v(s_j)$ is a target value for state $s_j$; the choice of target is discussed below, and a short code sketch of this update loop follows.
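+
+The following minimal sketch (not from the lecture) runs this TD loop with a linear value function on a hypothetical 5-state chain: the policy always moves right, the reward is $-1$ per step and $+10$ on the step that reaches the terminal state, and the step size $\alpha=0.1$ is an arbitrary choice. With a bias feature and a position feature the true values are exactly linear, so the parameters converge to them.
+
+```python
+import numpy as np
+
+def features(s):
+    # f_0 = 1 (bias) and f_1 = s (position): the true values are linear in these.
+    return np.array([1.0, float(s)])
+
+def v_hat(s, theta):
+    return theta @ features(s)
+
+def td_update(theta, s, r, s_next, done, alpha=0.1, gamma=1.0):
+    # theta_i <- theta_i + alpha * (r + gamma * V_hat(s') - V_hat(s)) * f_i(s)
+    bootstrap = 0.0 if done else gamma * v_hat(s_next, theta)
+    td_error = r + bootstrap - v_hat(s, theta)
+    return theta + alpha * td_error * features(s)
+
+theta = np.zeros(2)
+for episode in range(500):
+    s = 0
+    done = False
+    while not done:
+        s_next = s + 1                 # the policy always moves right
+        done = (s_next == 4)           # state 4 is terminal
+        r = 10.0 if done else -1.0     # -1 per step, +10 on reaching the goal
+        theta = td_update(theta, s, r, s_next, done)
+        s = s_next
+
+# True values are 7, 8, 9, 10 for states 0..3 (+10 on the last step, -1 before).
+print(np.round([v_hat(s, theta) for s in range(4)], 2))
+```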
+
+<details>
+<summary>Proof from Gradient Descent</summary>
+
+Our goal is to minimize the squared error between our estimated value function and the target value at each visited state:
+
+$$
+E_j(\theta) = \frac{1}{2} \left(\hat{V}_\theta(s_j)-v(s_j)\right)^2
+$$
+
+Here $E_j(\theta)$ is the squared error of example $j$,
+
+$\hat{V}_\theta(s_j)$ is our estimated value function at state $s_j$, and
+
+$v(s_j)$ is the target value at state $s_j$.
+
+After seeing the $j$th example, the **gradient descent rule** tells us that we can decrease the error $E_j(\theta)$ by updating
+
+$$
+\theta_i \gets \theta_i - \alpha \frac{\partial E_j(\theta)}{\partial \theta_i}
+$$
+
+where $\alpha$ is the learning rate.
+
+By the chain rule, we have:
+
+$$
+\begin{aligned}
+\theta_i &\gets \theta_i -\alpha \frac{\partial E_j(\theta)}{\partial \theta_i} \\
+&= \theta_i - \alpha \frac{\partial E_j}{\partial \hat{V}_\theta(s_j)}\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}\\
+&= \theta_i - \alpha \left(\hat{V}_\theta(s_j)-v(s_j)\right)f_i(s_j)
+\end{aligned}
+$$
+
+Note that $\frac{\partial E_j}{\partial \hat{V}_\theta(s_j)}=\hat{V}_\theta(s_j)-v(s_j)$.
+
+For the linear approximation function
+
+$$
+\hat{V}_\theta(s_j) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s_j)
+$$
+
+we have $\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}=f_i(s_j)$.
+
+Thus the TD update for each parameter is:
+
+$$
+\theta_i \gets \theta_i + \alpha \left(v(s_j)-\hat{V}_\theta(s_j)\right)f_i(s_j)
+$$
+
+For linear functions, this update is guaranteed to converge to the best approximation for a suitable learning rate schedule. (A short numerical check of this gradient follows the proof.)
+
+</details>
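+
+As a quick sanity check (a minimal sketch, not from the lecture; the feature vector, parameters, and target value below are made up), the snippet compares the analytic gradient used above, $\frac{\partial E_j}{\partial \theta_i} = \left(\hat{V}_\theta(s_j)-v(s_j)\right)f_i(s_j)$, against a central finite-difference estimate of the gradient of the squared error.
+
+```python
+import numpy as np
+
+f = np.array([1.0, 2.0, -0.5])        # hypothetical features f_i(s_j); f_0 = 1 is the bias
+theta = np.array([0.3, -1.2, 0.7])    # current parameter values
+v_target = 4.0                        # hypothetical target value v(s_j)
+
+def error(theta):
+    # E_j(theta) = 1/2 * (V_hat_theta(s_j) - v(s_j))^2 with a linear V_hat
+    return 0.5 * (theta @ f - v_target) ** 2
+
+# Analytic gradient from the derivation: (V_hat - v) * f_i
+analytic = (theta @ f - v_target) * f
+
+# Central finite differences for each parameter
+eps = 1e-6
+numeric = np.array([
+    (error(theta + eps * np.eye(3)[i]) - error(theta - eps * np.eye(3)[i])) / (2 * eps)
+    for i in range(3)
+])
+
+print(np.allclose(analytic, numeric))  # True: the update rule uses the correct gradient
+```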
+
+What do we use for the **target value** $v(s_j)$?
+
+Use the TD prediction based on the next state $s_{j+1}$ (bootstrapping):
+
+$v(s_j)=R(s_j)+\gamma \hat{V}_\theta(s_{j+1})$
+
+So the TD update for each parameter is:
+
+$$
+\theta_i \gets \theta_i + \alpha \left(R(s_j)+\gamma \hat{V}_\theta(s_{j+1})- \hat{V}_\theta(s_j)\right)f_i(s_j)
+$$
+
+> [!NOTE]
+>
+> Initially, the value function may be all zeros. It is often better to use a denser reward signal to help initialize the value function.
+
+#### Q-function approximation
+
+Instead of features $f(s)$, we use features $f(s,a)$ to approximate $Q(s,a)$:
+
+State-action pairs with similar feature values will be treated similarly.
+
+More complex functions require more complex features.
+
+$$
+\hat{Q}(s,a, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s,a)
+$$
+
+_Features are a function of both state and action._
+
+Just as for TD, we can generalize Q-learning to update the parameters of the Q-function approximation.
+
+Q-learning with Linear Approximators:
+
+1. Start with initial parameter values
+2. Take action according to an exploration/exploitation policy, transitioning from $s$ to $s'$
+3. Perform the TD update for each parameter
+   $$
+   \theta_i \gets \theta_i + \alpha \left(R(s)+\gamma \max_{a'\in A} \hat{Q}_\theta(s',a')- \hat{Q}_\theta(s,a)\right)f_i(s,a)
+   $$
+4. Goto 2
+
+> [!WARNING]
+>
+> Typically the error surface has many local minima, and convergence is no longer guaranteed.
+> However, it often works well in practice.
+
+Here $R(s)+\gamma \max_{a'\in A} \hat{Q}_\theta(s',a')$ is the estimate of $Q(s,a)$ based on an observed transition.
+
+Note that $f_i(s,a)=\frac{\partial \hat{Q}_\theta(s,a)}{\partial \theta_i}$; this needs to be computable in closed form.
+
+## Deep Q-network (DQN)
+
+DQN is a non-linear function approximator that uses a deep neural network to approximate the value function.
+
+The goal is a single agent that can solve any human-level control problem.
+
+- RL defines the objective (the Q-value function)
+- DL learns the hierarchical feature representation
+
+Use a deep network to represent the value function:
diff --git a/content/CSE510/_meta.js b/content/CSE510/_meta.js
index 67feade..2eee658 100644
--- a/content/CSE510/_meta.js
+++ b/content/CSE510/_meta.js
@@ -11,4 +11,5 @@ export default {
   CSE510_L6: "CSE510 Deep Reinforcement Learning (Lecture 6)",
   CSE510_L7: "CSE510 Deep Reinforcement Learning (Lecture 7)",
   CSE510_L8: "CSE510 Deep Reinforcement Learning (Lecture 8)",
+  CSE510_L9: "CSE510 Deep Reinforcement Learning (Lecture 9)",
 }
\ No newline at end of file