# CSE510 Deep Reinforcement Learning (Lecture 9)

## Large state spaces

The RL algorithms presented so far have little chance of solving real-world problems when the state (or action) space is large:

- We can no longer represent the $V$ or $Q$ function as an explicit table.

Even if we had enough memory:

- There is never enough training data.
- Learning takes too long.

What can we do about large state spaces? We will now study three other approaches:

- Value function approximation
- Policy gradient methods
- Actor-critic methods

## RL with Function Approximation

Solution for large MDPs:

- Estimate the value function using a function approximator.

**Value function approximation (VFA)** replaces the table with a general parameterized form:

$$
\hat{V}(s, \theta) \approx V_\pi(s)
$$

or

$$
\hat{Q}(s, a, \theta) \approx Q_\pi(s, a)
$$

Benefits:

- Generalization: these functions can be trained to map similar states to similar values.
- Reduced memory usage.
- Reduced computation time.
- Reduced experience needed to learn $V$ or $Q$.

## Linear Function Approximation

Define a set of state features $f_1(s),\ldots,f_n(s)$:

- The features are used as our representation of the state.
- States with similar feature values will be considered similar.

A common approximation is to represent $V(s)$ as a linear combination of the features:

$$
\hat{V}(s, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s)
$$
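
As a concrete illustration, here is a minimal sketch of evaluating such a linear approximator (assuming NumPy; the feature map `features(s)` and the weights are hypothetical placeholders, not something fixed by the lecture):

```python
import numpy as np

def features(s):
    # Hypothetical feature map returning [f_1(s), ..., f_n(s)];
    # here the state is an (x, y) pair and the features are just its coordinates.
    x, y = s
    return np.array([x, y], dtype=float)

def v_hat(s, theta):
    # theta[0] is the bias theta_0; theta[1:] are the feature weights theta_1..theta_n.
    return theta[0] + theta[1:] @ features(s)

theta = np.array([10.0, -1.0, -1.0])  # illustrative weights (cf. the grid example below)
print(v_hat((2, 3), theta))           # 10 - 2 - 3 = 5.0
```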

The approximation accuracy is fundamentally limited by the information provided by the features.

Can we always define features that allow for a perfect linear approximation?

- Yes. Assign each state an indicator feature (the $i$th feature is $1$ if and only if the current state is the $i$th state, and $\theta_i$ represents the value of the $i$th state).
- However, this requires a feature for each state, which is impractical for large state spaces (no generalization).

<details>
<summary>Example</summary>

Grid with no obstacles, deterministic actions U/D/L/R, no discounting, $-1$ reward everywhere except $+10$ at the goal.

The grid, with the value of each cell, is:

|4|5|6|7|8|9|10|
|---|---|---|---|---|---|---|
|3|4|5|6|7|8|9|
|2|3|4|5|6|7|8|
|1|2|3|4|5|6|7|
|0|1|2|3|4|5|6|
|0|0|1|2|3|4|5|
|0|0|0|1|2|3|4|

Features for state $s=(x, y)$: $f_1(s)=x$, $f_2(s)=y$ (just 2 features)

$$
V(s) = \theta_0 + \theta_1 x + \theta_2 y
$$

Is there a good linear approximation?

- Yes.
- $\theta_0 = 10, \theta_1 = -1, \theta_2 = -1$
- (Note that the upper-right corner is the origin.)

$$
V(s) = 10 - x - y
$$

This subtracts the Manhattan distance to the goal from the goal reward.

---

However, for a different grid, $V(s)=\theta_0 + \theta_1 x + \theta_2 y$ is not a good approximation.

|4|5|6|7|6|5|4|
|---|---|---|---|---|---|---|
|5|6|7|8|7|6|5|
|6|7|8|9|8|7|6|
|7|8|9|10|9|8|7|
|6|7|8|9|8|7|6|
|5|6|7|8|7|6|5|
|4|5|6|7|6|5|4|

But we can include a new feature $z=|3-x|+|3-y|$ to get a good approximation.

$V(s) = \theta_0 + \theta_1 x + \theta_2 y + \theta_3 z$

> Usually, we need to define a different approximation for each problem.
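
As a quick check on this example (a plain-Python sketch; it assumes the goal is the center cell, with $x$ the column index and $y$ the row index, consistent with $z=|3-x|+|3-y|$), the weights $\theta = (10, 0, 0, -1)$ fit this grid exactly:

```python
# Values copied from the second grid above; grid[y][x] with x = column, y = row.
grid = [
    [4, 5, 6, 7, 6, 5, 4],
    [5, 6, 7, 8, 7, 6, 5],
    [6, 7, 8, 9, 8, 7, 6],
    [7, 8, 9, 10, 9, 8, 7],
    [6, 7, 8, 9, 8, 7, 6],
    [5, 6, 7, 8, 7, 6, 5],
    [4, 5, 6, 7, 6, 5, 4],
]

def v_hat(x, y):
    z = abs(3 - x) + abs(3 - y)          # the new feature
    return 10 + 0 * x + 0 * y - 1 * z    # theta_0 = 10, theta_1 = theta_2 = 0, theta_3 = -1

print(all(v_hat(x, y) == grid[y][x] for y in range(7) for x in range(7)))  # True
```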

</details>

### Learning with Linear Function Approximation

Define a set of features $f_1(s),\ldots,f_n(s)$:

- The features are used as our representation of the state.
- States with similar feature values will be treated similarly.
- More complex functions require more features.

$$
\hat{V}(s, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s)
$$

Our goal is to learn good parameter values that approximate the value function well.

- How can we do this?
- Use TD-based RL and somehow update the parameters based on each experience.

#### TD-based learning with function approximators

1. Start with initial parameter values
2. Take action according to an exploration/exploitation policy
3. Update the estimated model
4. Perform the TD update for each parameter:

$$
\theta_i \gets \theta_i + \alpha \left(R(s_j)+\gamma \hat{V}_\theta(s_{j+1})- \hat{V}_\theta(s_j)\right)f_i(s_j)
$$

5. Goto 2
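
Written out, the loop looks roughly like the sketch below (assuming NumPy, a hypothetical `env` with `reset() -> s` and `step(a) -> (s', r, done)`, a hypothetical `policy(s)`, and the toy $(x, y)$ feature map; an illustration, not the course's reference implementation):

```python
import numpy as np

def features(s):
    # Hypothetical feature map; the leading 1.0 plays the role of theta_0.
    x, y = s
    return np.array([1.0, x, y])

def td0_linear(env, policy, alpha=0.01, gamma=0.99, episodes=1000):
    theta = np.zeros(len(features(env.reset())))      # 1. initial parameter values
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)                             # 2. act (exploration/exploitation)
            s_next, r, done = env.step(a)             # (3. model update skipped: model-free sketch)
            f = features(s)
            target = r if done else r + gamma * theta @ features(s_next)
            theta += alpha * (target - theta @ f) * f # 4. TD update for every parameter
            s = s_next                                # 5. goto 2
    return theta
```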

**The TD update for each parameter is**:

$$
\theta_i \gets \theta_i + \alpha \left(v(s_j)-\hat{V}_\theta(s_j)\right)f_i(s_j)
$$

<details>
<summary>Proof from Gradient Descent</summary>

Our goal is to minimize the squared error between our estimated value function and each target value:

$$
E_j(\theta) = \frac{1}{2} \left(\hat{V}_\theta(s_j)-v(s_j)\right)^2
$$

Here $E_j(\theta)$ is the squared error of example $j$.

$\hat{V}_\theta(s_j)$ is our estimated value at state $s_j$.

$v(s_j)$ is the true target value at state $s_j$.

After seeing the $j$th state, the **gradient descent rule** tells us that we can decrease the error $E_j(\theta)$ by

$$
\theta_i \gets \theta_i - \alpha \frac{\partial E_j(\theta)}{\partial \theta_i}
$$

Here $\alpha$ is the learning rate.

By the chain rule, we have:

$$
\begin{aligned}
\theta_i &\gets \theta_i -\alpha \frac{\partial E_j(\theta)}{\partial \theta_i} \\
&= \theta_i - \alpha \frac{\partial E_j}{\partial \hat{V}_\theta(s_j)}\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}\\
&= \theta_i - \alpha \left(\hat{V}_\theta(s_j)-v(s_j)\right)f_i(s_j)
\end{aligned}
$$

Note that $\frac{\partial E_j}{\partial \hat{V}_\theta(s_j)}=\hat{V}_\theta(s_j)-v(s_j)$

and $\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}=f_i(s_j)$

For the linear approximation function

$$
\hat{V}_\theta(s_j) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s_j)
$$

we have $\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}=f_i(s_j)$

Thus the TD update for each parameter is:

$$
\theta_i \gets \theta_i + \alpha \left(v(s_j)-\hat{V}_\theta(s_j)\right)f_i(s_j)
$$

For linear functions, this update is guaranteed to converge to the best approximation for a suitable learning rate.
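
As a quick numerical sanity check of the key step $\frac{\partial \hat{V}_\theta(s)}{\partial \theta_i} = f_i(s)$, here is a finite-difference sketch (assuming NumPy and a hypothetical two-feature map):

```python
import numpy as np

def features(s):
    x, y = s
    return np.array([1.0, x, y])   # f_0(s) = 1 acts as the bias feature

def v_hat(theta, s):
    return theta @ features(s)

theta, s, eps = np.array([0.5, -0.2, 0.3]), (2.0, 3.0), 1e-6
grad = np.array([(v_hat(theta + eps * e, s) - v_hat(theta - eps * e, s)) / (2 * eps)
                 for e in np.eye(len(theta))])
print(np.allclose(grad, features(s)))  # True: the gradient w.r.t. theta_i is exactly f_i(s)
```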

</details>

What do we use for the **target value** $v(s_j)$?

Use the TD prediction based on the next state $s_{j+1}$ (bootstrap learning):

$v(s_j)=R(s_j)+\gamma \hat{V}_\theta(s_{j+1})$

So the TD update for each parameter is:

$$
\theta_i \gets \theta_i + \alpha \left(R(s_j)+\gamma \hat{V}_\theta(s_{j+1})- \hat{V}_\theta(s_j)\right)f_i(s_j)
$$

> [!NOTE]
>
> Initially, the value function may be full of zeros. It is often better to use a denser reward signal to initialize the value function.

#### Q-function approximation

Instead of features $f_i(s)$, we use features $f_i(s,a)$ to approximate $Q(s,a)$.

State-action pairs with similar feature values will be treated similarly.

More complex functions require more complex features.

$$
\hat{Q}(s,a, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s,a)
$$

_Features are a function of state and action._

Just as for TD, we can generalize Q-learning to update the parameters of the Q-function approximation.

Q-learning with Linear Approximators:

1. Start with initial parameter values
2. Take action according to an exploration/exploitation policy, transitioning from $s$ to $s'$
3. Perform the TD update for each parameter:

$$
\theta_i \gets \theta_i + \alpha \left(R(s)+\gamma \max_{a'\in A} \hat{Q}_\theta(s',a')- \hat{Q}_\theta(s,a)\right)f_i(s,a)
$$

4. Goto 2
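
A minimal sketch of this loop (assuming NumPy, a small discrete action set, a hypothetical `env` with `reset() -> s` and `step(a) -> (s', r, done)`, and a hypothetical state-action feature map; epsilon-greedy stands in for the exploration/exploitation policy):

```python
import numpy as np

def state_features(s):
    # Hypothetical state features; the leading 1.0 provides a per-action bias term.
    x, y = s
    return np.array([1.0, x, y])

def features_sa(s, a, n_actions):
    # Hypothetical f(s, a): copy the state features into the block belonging to action a.
    fs = state_features(s)
    f = np.zeros(len(fs) * n_actions)
    f[a * len(fs):(a + 1) * len(fs)] = fs
    return f

def q_learning_linear(env, n_actions, alpha=0.01, gamma=0.99, eps=0.1, episodes=1000):
    theta = np.zeros(len(features_sa(env.reset(), 0, n_actions)))   # 1. initial parameters
    q = lambda s, a: theta @ features_sa(s, a, n_actions)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # 2. epsilon-greedy action, then transition from s to s'
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax([q(s, b) for b in range(n_actions)]))
            s_next, r, done = env.step(a)
            # 3. TD update with the bootstrapped target R(s) + gamma * max_a' Q(s', a')
            target = r if done else r + gamma * max(q(s_next, b) for b in range(n_actions))
            f = features_sa(s, a, n_actions)
            theta += alpha * (target - theta @ f) * f
            s = s_next                                               # 4. goto 2
    return theta
```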

> [!WARNING]
>
> Typically the parameter space has many local minima, and convergence is no longer guaranteed.
> However, it often works in practice.

Here $R(s)+\gamma \max_{a'\in A} \hat{Q}_\theta(s',a')$ is the estimate of $Q(s,a)$ based on an observed transition.

Note that $f_i(s,a)=\frac{\partial \hat{Q}_\theta(s,a)}{\partial \theta_i}$; this needs to be computable in closed form.

## Deep Q-network (DQN)

DQN is a non-linear function approximator that uses deep neural networks to approximate the value function.

The goal is a single agent that can solve any human-level control problem.

- RL defines the objective (the Q-value function)
- DL learns the hierarchical feature representation

Use a deep network to represent the value function:
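
For example, a minimal PyTorch-style sketch of such a network (the layer sizes and input/output dimensions are illustrative assumptions, not the architecture from the original DQN paper):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state observation to one Q-value per discrete action."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),   # one output head per action: Q(s, a) for all a
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

q_net = QNetwork(obs_dim=4, n_actions=2)
q_values = q_net(torch.zeros(1, 4))      # shape (1, n_actions)
```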