# CSE510 Deep Reinforcement Learning (Lecture 9)

## Large state spaces

The RL algorithms presented so far have little chance of solving real-world problems when the state (or action) space is large:

- We can no longer represent the $V$ or $Q$ function as an explicit table.
- Even if we had enough memory,
  - there is never enough training data;
  - learning takes too long.

So what do we do about large state spaces? We will now study three approaches:

- Value function approximation
- Policy gradient methods
- Actor-critic methods

## RL with Function Approximation

Solution for large MDPs:

- Estimate the value function using a function approximator.

**Value function approximation (VFA)** replaces the table with a general parameterized form:

$$
\hat{V}(s, \theta) \approx V_\pi(s)
$$

or

$$
\hat{Q}(s, a, \theta) \approx Q_\pi(s, a)
$$

Benefits:

- Generalization: the approximator can be trained to map similar states to similar values.
- Reduced memory usage.
- Reduced computation time.
- Reduced experience needed to learn $V$/$Q$.

## Linear Function Approximation

Define a set of state features $f_1(s),\ldots,f_n(s)$:

- The features are used as our representation of the state.
- States with similar feature values will be considered similar.

A common approximation is to represent $V(s)$ as a linear combination of the features:

$$
\hat{V}(s, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s)
$$

The approximation accuracy is fundamentally limited by the information provided by the features.

Can we always define features that allow for a perfect linear approximation?

- Yes: assign each state its own indicator feature (the $i$th feature is $1$ if and only if the current state is the $i$th state, and $\theta_i$ represents the value of the $i$th state).
- However, this requires one feature per state, which is impractical for large state spaces and provides no generalization.
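To make the last point concrete, here is a minimal sketch in Python (assuming NumPy; the five-state setup and the parameter values are illustrative, not from the lecture) showing that indicator features reduce the linear approximator to an explicit table:

```python
import numpy as np

# Illustrative setup (not lecture data): 5 discrete states, bias term theta_0
# omitted for simplicity.
n_states = 5

def indicator_features(s):
    """f_i(s) = 1 iff s is the i-th state (one-hot encoding)."""
    f = np.zeros(n_states)
    f[s] = 1.0
    return f

theta = np.array([0.0, 1.5, -2.0, 3.0, 0.5])  # theta_i is just the value of state i

def v_hat(s, theta):
    """V_hat(s) = sum_i theta_i f_i(s); with one-hot features this reads off theta[s]."""
    return theta @ indicator_features(s)

print(v_hat(2, theta))  # -> -2.0, i.e. the stored value of state 2
```

With one parameter per state, the "approximator" is exactly the tabular representation, which is why this choice does not scale or generalize.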
**Example.** A grid with no obstacles, deterministic actions U/D/L/R, no discounting, and reward $-1$ everywhere except $+10$ at the goal. The grid of state values is:

|4|5|6|7|8|9|10|
|---|---|---|---|---|---|---|
|3|4|5|6|7|8|9|
|2|3|4|5|6|7|8|
|1|2|3|4|5|6|7|
|0|1|2|3|4|5|6|
|0|0|1|2|3|4|5|
|0|0|0|1|2|3|4|

Features for state $s=(x, y)$: $f_1(s)=x$, $f_2(s)=y$ (just two features), so

$$
V(s) = \theta_0 + \theta_1 x + \theta_2 y
$$

Is there a good linear approximation?

- Yes: $\theta_0 = 10$, $\theta_1 = -1$, $\theta_2 = -1$ (note that the upper right is the origin).

This gives

$$
V(s) = 10 - x - y
$$

which subtracts the Manhattan distance to the goal from the goal reward.

---

However, for a different grid, $V(s)=\theta_0 + \theta_1 x + \theta_2 y$ is not a good approximation:

|4|5|6|7|6|5|4|
|---|---|---|---|---|---|---|
|5|6|7|8|7|6|5|
|6|7|8|9|8|7|6|
|7|8|9|10|9|8|7|
|6|7|8|9|8|7|6|
|5|6|7|8|7|6|5|
|4|5|6|7|6|5|4|

But we can include a new feature $z=|3-x|+|3-y|$ (the Manhattan distance to the center) to get a good approximation:

$$
V(s) = \theta_0 + \theta_1 x + \theta_2 y + \theta_3 z
$$

> Usually, we need to define a different approximation for each problem.
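As a quick check of the second claim, here is a sketch (assuming NumPy; the coordinate convention places the goal at $(3,3)$) that fits $\theta$ by least squares over the features $(1, x, y, z)$ and recovers an exact fit:

```python
import numpy as np

# The second grid of state values from the table above.
V = np.array([
    [4, 5, 6, 7, 6, 5, 4],
    [5, 6, 7, 8, 7, 6, 5],
    [6, 7, 8, 9, 8, 7, 6],
    [7, 8, 9, 10, 9, 8, 7],
    [6, 7, 8, 9, 8, 7, 6],
    [5, 6, 7, 8, 7, 6, 5],
    [4, 5, 6, 7, 6, 5, 4],
], dtype=float)

rows, targets = [], []
for y in range(7):
    for x in range(7):
        z = abs(3 - x) + abs(3 - y)       # distance-to-center feature
        rows.append([1.0, x, y, z])       # features (1, x, y, z)
        targets.append(V[y, x])

theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
print(np.round(theta, 6))  # approximately [10, 0, 0, -1], i.e. V(s) = 10 - z
```

The fit shows that the extra feature alone carries all the structure; $x$ and $y$ get zero weight on this grid.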
### Learning with Linear Function Approximation

Define a set of features $f_1(s),\ldots,f_n(s)$:

- The features are used as our representation of the state.
- States with similar feature values will be treated similarly.
- More complex functions require more features.

$$
\hat{V}(s, \theta) =\theta_0 + \sum_{i=1}^n \theta_i f_i(s)
$$

Our goal is to learn parameter values that approximate the value function well.

- How can we do this?
- Use TD-based RL and update the parameters based on each experience.

#### TD-based learning with function approximators

1. Start with initial parameter values.
2. Take an action according to an exploration/exploitation policy.
3. Update the estimated model.
4. Perform the TD update for each parameter:

   $$
   \theta_i \gets \theta_i + \alpha \left(R(s_j)+\gamma \hat{V}_\theta(s_{j+1})- \hat{V}_\theta(s_j)\right)f_i(s_j)
   $$

5. Go to 2.

**The TD update for each parameter is**:

$$
\theta_i \gets \theta_i + \alpha \left(v(s_j)-\hat{V}_\theta(s_j)\right)f_i(s_j)
$$
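The loop above can be written compactly. The following is a minimal sketch, not the lecture's reference code; it assumes a Gymnasium-style environment with `reset()`/`step()`, a feature map `features(s)` whose first entry is a constant $1$ (so the bias $\theta_0$ is folded into $\theta$), and a given behavior policy `policy(s)`:

```python
import numpy as np

def td0_linear_vfa(env, policy, features, n_features,
                   alpha=0.01, gamma=0.99, n_episodes=500):
    """TD(0) with a linear value approximator V_hat(s) = theta @ features(s)."""
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            v_s = theta @ features(s)
            v_next = 0.0 if terminated else theta @ features(s_next)
            td_error = r + gamma * v_next - v_s       # target minus prediction
            theta += alpha * td_error * features(s)   # update every theta_i at once
            s = s_next
    return theta
```

Note that the per-parameter update from the list above becomes a single vector operation, since the feature vector is exactly the gradient of $\hat{V}_\theta(s)$ with respect to $\theta$.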
#### Proof from gradient descent

Our goal is to minimize the squared error between our estimated value function and the target value:

$$
E_j(\theta) = \frac{1}{2} \left(\hat{V}_\theta(s_j)-v(s_j)\right)^2
$$

Here $E_j(\theta)$ is the squared error of example $j$, $\hat{V}_\theta(s_j)$ is our estimated value at state $s_j$, and $v(s_j)$ is the true target value at state $s_j$.

After seeing the $j$th example, the **gradient descent rule** tells us that we can decrease $E_j(\theta)$ by updating

$$
\theta_i \gets \theta_i - \alpha \frac{\partial E_j(\theta)}{\partial \theta_i}
$$

where $\alpha$ is the learning rate. By the chain rule,

$$
\begin{aligned}
\theta_i - \alpha \frac{\partial E_j(\theta)}{\partial \theta_i}
&= \theta_i - \alpha \frac{\partial E_j}{\partial \hat{V}_\theta(s_j)}\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}\\
&= \theta_i - \alpha \left(\hat{V}_\theta(s_j)-v(s_j)\right)f_i(s_j)
\end{aligned}
$$

since $\frac{\partial E_j}{\partial \hat{V}_\theta(s_j)}=\hat{V}_\theta(s_j)-v(s_j)$, and for the linear approximation function

$$
\hat{V}_\theta(s_j) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s_j)
$$

we have $\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}=f_i(s_j)$.

Thus the TD update for each parameter is

$$
\theta_i \gets \theta_i + \alpha \left(v(s_j)-\hat{V}_\theta(s_j)\right)f_i(s_j)
$$

For linear functions, this update is guaranteed to converge to the best approximation for a suitable learning rate.
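As a sanity check on the derivation, here is a small sketch (assuming NumPy; the feature values, parameters, and target are made up for illustration) verifying that the analytic gradient $(\hat{V}_\theta(s_j)-v(s_j))\,f_i(s_j)$ matches a finite-difference estimate of $\partial E_j/\partial \theta_i$:

```python
import numpy as np

# Made-up example data (not from the lecture).
f = np.array([1.0, 2.0, -0.5])      # features f_i(s_j), first entry 1 for the bias theta_0
theta = np.array([0.3, -0.1, 0.7])  # current parameters
v_target = 1.5                      # target value v(s_j)

def error(theta):
    """E_j(theta) = 1/2 (V_hat(s_j) - v(s_j))^2 with V_hat = theta @ f."""
    return 0.5 * (theta @ f - v_target) ** 2

analytic = (theta @ f - v_target) * f          # (V_hat - v) * f_i(s_j)

eps = 1e-6
numeric = np.array([
    (error(theta + eps * e) - error(theta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(analytic, numeric, atol=1e-6))  # -> True
```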
What do we use for the **target value** $v(s_j)$?

Use the TD prediction based on the next state $s_{j+1}$ (bootstrapping):

$$
v(s)=R(s)+\gamma \hat{V}_\theta(s')
$$

So the TD update for each parameter is:

$$
\theta_i \gets \theta_i + \alpha \left(R(s_j)+\gamma \hat{V}_\theta(s_{j+1})- \hat{V}_\theta(s_j)\right)f_i(s_j)
$$

> [!NOTE]
>
> Initially, the value function may be all zeros. It can help to use a denser reward signal to initialize the value function.

#### Q-function approximation

Instead of features $f_i(s)$, we use features $f_i(s,a)$ to approximate $Q(s,a)$:

- State-action pairs with similar feature values will be treated similarly.
- More complex functions require more complex features.

$$
\hat{Q}(s,a, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s,a)
$$

_Features are a function of state and action._

Just as for TD, we can generalize Q-learning to update the parameters of the Q-function approximation.

Q-learning with Linear Approximators:

1. Start with initial parameter values.
2. Take an action according to an exploration/exploitation policy, transitioning from $s$ to $s'$.
3. Perform the TD update for each parameter:

   $$
   \theta_i \gets \theta_i + \alpha \left(R(s)+\gamma \max_{a'\in A} \hat{Q}_\theta(s',a')- \hat{Q}_\theta(s,a)\right)f_i(s,a)
   $$

4. Go to 2.

> [!WARNING]
>
> Typically the objective has many local minima, and convergence is no longer guaranteed.
> However, it often works well in practice.

Here $R(s)+\gamma \max_{a'\in A} \hat{Q}_\theta(s',a')$ is the estimate of $Q(s,a)$ based on an observed transition.

Note that $f_i(s,a)=\frac{\partial \hat{Q}_\theta(s,a)}{\partial \theta_i}$; this derivative needs to be computed in closed form.

## Deep Q-network (DQN)

DQN uses a non-linear function approximator: a deep neural network that approximates the value function.

The goal is a single agent that can solve any human-level control problem:

- RL defines the objective (the Q-value function).
- DL learns the hierarchical feature representation.

Use a deep network to represent the value function:
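For illustration only, here is a minimal sketch (assuming PyTorch; the layer sizes and input/output dimensions are arbitrary assumptions, not the convolutional architecture from the DQN paper) of a network that maps a state vector to one Q-value per action:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Minimal Q-network sketch: state vector in, one Q-value per action out."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)  # shape: (batch, n_actions)

# Usage: greedy action selection for a batch of states.
q = QNetwork(state_dim=4, n_actions=2)
states = torch.randn(8, 4)
greedy_actions = q(states).argmax(dim=1)
```

Here the weights of the network play the role of $\theta$, and the gradient $\partial \hat{Q}_\theta(s,a)/\partial \theta$ is obtained by backpropagation rather than in closed form.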