CSE510 Deep Reinforcement Learning (Lecture 9)
Large state spaces
RL algorithms presented so far have little chance to solve real-world problems when the state (or action) space is large.
- Can no longer represent the V or Q function as explicit tables
Even if we had enough memory
- Never enough training data
- Learning takes too long
What about large state spaces?
We will now study three other approaches
- Value function approximation
- Policy gradient methods
- Actor-critic methods
RL with Function Approximation
Solution for large MDPs:
- Estimate value function using a function approximator
Value function approximation (VFA) replaces the table with a general parameterized form:
\hat{V}(s, \theta) \approx V_\pi(s)
or
\hat{Q}(s, a, \theta) \approx Q_\pi(s, a)
Benefits:
- Generalization: these functions can be trained to map similar states to similar values
- Reduce memory usage
- Reduce computation time
- Reduce experience needed to learn the V/Q
Linear Function Approximation
Define a set of state features f_1(s),\ldots,f_n(s)
- The features are used as our representation of the state
- States with similar feature values will be considered similar
A common approximation is to represent V(s) as a linear combination of the features:
\hat{V}(s, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s)
The approximation accuracy is fundamentally limited by the information provided by the features
Can we always define features that allow for a perfect linear approximation?
- Yes. Assign each state an indicator feature: the $i$th feature is 1 if and only if the current state is the $i$th state, and \theta_i represents the value of the $i$th state.
- However, this requires a feature for each state, which is impractical for large state spaces (no generalization), as the sketch below illustrates.
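As a small illustration, here is a hypothetical three-state sketch (not from the lecture): with indicator features, the linear approximator is exactly a table lookup, one parameter per state.

```python
import numpy as np

# Hypothetical 3-state example: with indicator (one-hot) features,
# the linear form theta . f(s) is exactly a table lookup.
n_states = 3

def indicator_features(s):
    """f_i(s) = 1 if and only if s is the i-th state."""
    f = np.zeros(n_states)
    f[s] = 1.0
    return f

# Setting theta_i to the (made-up) value of state i reproduces the table
# exactly, but needs one parameter per state and gives no generalization.
true_values = np.array([2.0, -1.0, 5.0])
theta = true_values.copy()

for s in range(n_states):
    assert np.isclose(theta @ indicator_features(s), true_values[s])
```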
Example
Grid with no obstacles, deterministic actions U/D/L/R, no discounting, -1 reward everywhere except +10 at goal.
The grid is:
| 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|
| 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| 0 | 0 | 1 | 2 | 3 | 4 | 5 |
| 0 | 0 | 0 | 1 | 2 | 3 | 4 |
Features for state s=(x, y): f_1(s)=x, f_2(s)=y (just 2 features)
V(s) = \theta_0 + \theta_1 x + \theta_2 y
Is there a good linear approximation?
- Yes.
\theta_0 = 10, \theta_1 = -1, \theta_2 = -1 (note: the upper-right corner is the origin)
V(s) = 10 - x - y
This subtracts the Manhattan distance to the goal from the goal reward.
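A minimal sketch of this approximation, assuming x and y count Manhattan steps left and down from the upper-right goal cell (the origin noted above):

```python
import numpy as np

# Assumed convention: x and y are Manhattan steps left/down from the
# upper-right goal cell, matching the note that the upper right is the origin.
theta0, theta1, theta2 = 10.0, -1.0, -1.0

def v_hat(x, y):
    """Linear approximation V(s) = theta0 + theta1 * x + theta2 * y."""
    return theta0 + theta1 * x + theta2 * y

# Reproduce the grid layout above: columns run x = 6..0, rows run y = 0..6.
grid = np.array([[v_hat(x, y) for x in range(6, -1, -1)] for y in range(7)])
print(grid)
```

Most cells match the table above; a few lower-left cells come out negative where the table lists 0, which is why this is a good approximation rather than an exact fit.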
However, for a different grid, V(s)=\theta_0 + \theta_1 x + \theta_2 y is not a good approximation:
| 4 | 5 | 6 | 7 | 6 | 5 | 4 |
|---|---|---|---|---|---|---|
| 5 | 6 | 7 | 8 | 7 | 6 | 5 |
| 6 | 7 | 8 | 9 | 8 | 7 | 6 |
| 7 | 8 | 9 | 10 | 9 | 8 | 7 |
| 6 | 7 | 8 | 9 | 8 | 7 | 6 |
| 5 | 6 | 7 | 8 | 7 | 6 | 5 |
| 4 | 5 | 6 | 7 | 6 | 5 | 4 |
But we can include a new feature z=|3-x|+|3-y| to get a good approximation.
V(s) = \theta_0 + \theta_1 x + \theta_2 y + \theta_3 z
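As a quick check, the sketch below (assuming x is the column index and y the row index, both starting at 0, so the goal is the centre cell) shows that \theta = (10, 0, 0, -1) reproduces the second grid exactly:

```python
import numpy as np

# Assumed convention: x = column index, y = row index, both 0..6,
# so the goal sits at the centre cell (3, 3).
def z_feature(x, y):
    return abs(3 - x) + abs(3 - y)        # Manhattan distance to the centre

def v_hat(x, y):
    # theta_0 = 10, theta_1 = theta_2 = 0, theta_3 = -1
    return 10.0 + 0.0 * x + 0.0 * y - 1.0 * z_feature(x, y)

grid = np.array([[v_hat(x, y) for x in range(7)] for y in range(7)])
print(grid)   # matches the second table: 10 at the centre, 4 in the corners
```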
Usually, we need to define different approximations for different problems.
Learning with Linear Function Approximation
Define a set of features f_1(s),\ldots,f_n(s)
- The features are used as our representation of the state
- States with similar feature values will be treated similarly
- More complex functions require more features
\hat{V}(s, \theta) =\theta_0 + \sum_{i=1}^n \theta_i f_i(s)
Our goal is to learn good parameter values that approximate the value function well
- How can we do this?
- Use TD-based RL and somehow update parameters based on each experience
TD-based learning with function approximators:
1. Start with initial parameter values.
2. Take an action according to an exploration/exploitation policy.
3. Update the estimated model.
4. Perform the TD update for each parameter:
\theta_i \gets \theta_i + \alpha \left(R(s_j)+\gamma \hat{V}_\theta(s_{j+1})- \hat{V}_\theta(s_j)\right)f_i(s_j)
5. Go to step 2.
More generally, with a target value v(s_j), the update for each parameter is:
\theta_i \gets \theta_i + \alpha \left(v(s_j)-\hat{V}_\theta(s_j)\right)f_i(s_j)
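As a minimal sketch (hypothetical names, made-up numbers), this update is a single vector operation once the features and a target are available; the bias \theta_0 is handled by a constant feature f_0(s) = 1:

```python
import numpy as np

def td_parameter_update(theta, features, target, alpha=0.1):
    """theta_i <- theta_i + alpha * (v(s_j) - V_hat(s_j)) * f_i(s_j).

    features[0] is assumed to be the constant 1, so theta[0] acts as theta_0.
    """
    v_hat = theta @ features              # \hat{V}_\theta(s_j)
    return theta + alpha * (target - v_hat) * features

# Made-up usage example:
theta = np.zeros(3)                       # [theta_0, theta_1, theta_2]
f_sj = np.array([1.0, 2.0, 5.0])          # [f_0 = 1, f_1(s_j), f_2(s_j)]
theta = td_parameter_update(theta, f_sj, target=4.0)
print(theta)                              # [0.4, 0.8, 2.0]
```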
Derivation from Gradient Descent
Our goal is to minimize the squared error between our estimated value function and the target value at each example:
E_j(\theta) = \frac{1}{2} \left(\hat{V}_\theta(s_j)-v(s_j)\right)^2
Here E_j(\theta) is the squared error of example j.
\hat{V}_\theta(s_j) is our estimated value function at state s_j.
v(s_j) is the true target value at state s_j.
After seeing the $j$th example, the gradient descent rule tells us that we can decrease E_j(\theta) with the update
\theta_i \gets \theta_i - \alpha \frac{\partial E_j(\theta)}{\partial \theta_i}
here \alpha is the learning rate.
By the chain rule, we have:
\begin{aligned}
\theta_i &\gets \theta_i -\alpha \frac{\partial E_j(\theta)}{\partial \theta_i} \\
&= \theta_i - \alpha \frac{\partial E_j}{\partial \hat{V}_\theta(s_j)}\frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}\\
&= \theta_i - \alpha \left(\hat{V}_\theta(s_j)-v(s_j)\right)f_i(s_j)
\end{aligned}
Note that \frac{\partial E_j}{\partial \hat{V}_\theta(s_j)}=\hat{V}_\theta(s_j)-v(s_j)
and \frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}=f_i(s_j)
For the linear approximation function
\hat{V}_\theta(s_j) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s_j)
we have \frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}=f_i(s_j)
Thus the TD update for each parameter is:
\theta_i \gets \theta_i + \alpha \left(v(s_j)-\hat{V}_\theta(s_j)\right)f_i(s_j)
For linear functions, this update is guaranteed to converge to the best approximation for a suitable learning rate.
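The key step \frac{\partial \hat{V}_\theta(s_j)}{\partial \theta_i}=f_i(s_j) can also be checked numerically; the sketch below (made-up feature and parameter values) compares a central finite-difference estimate against the features:

```python
import numpy as np

theta = np.array([1.0, -0.5, 2.0])        # [theta_0, theta_1, theta_2]
f_sj = np.array([1.0, 3.0, 0.7])          # [f_0 = 1, f_1(s_j), f_2(s_j)], made up

def v_hat(theta):
    return theta @ f_sj                   # linear \hat{V}_\theta(s_j)

eps = 1e-6
for i in range(len(theta)):
    d = np.zeros_like(theta)
    d[i] = eps
    numeric_grad = (v_hat(theta + d) - v_hat(theta - d)) / (2 * eps)
    assert np.isclose(numeric_grad, f_sj[i])   # equals f_i(s_j)
```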
What do we use for the target value v(s_j)?
Use the TD prediction based on the next state s_{j+1} (bootstrapping):
v(s_j)=R(s_j)+\gamma \hat{V}_\theta(s_{j+1})
So the TD update for each parameter is:
\theta_i \gets \theta_i + \alpha \left(R(s_j)+\gamma \hat{V}_\theta(s_{j+1})- \hat{V}_\theta(s_j)\right)f_i(s_j)
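Putting the pieces together, here is a minimal TD(0) sketch on a hypothetical 10-state chain (not from the lecture): the fixed policy stays or steps right with equal probability, every step costs -1, and entering the right end gives +10. The true values of this chain are exactly linear in the position, so a two-feature approximator can represent them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain of N states; the fixed policy stays or moves right with
# equal probability. Each step gives -1, except entering the terminal right
# end, which gives +10. The true values are V(s) = 2s - 7, linear in s.
N, GAMMA, ALPHA = 10, 1.0, 0.02

def features(s):
    return np.array([1.0, s / (N - 1)])   # [f_0 = 1, normalized position]

theta = np.zeros(2)

for _ in range(10000):
    s = 0
    while s < N - 1:
        s_next = s + rng.integers(0, 2)            # stay or step right
        r = 10.0 if s_next == N - 1 else -1.0
        v_next = 0.0 if s_next == N - 1 else theta @ features(s_next)
        td_error = r + GAMMA * v_next - theta @ features(s)
        theta += ALPHA * td_error * features(s)    # per-parameter TD update
        s = s_next

print(theta)   # should approach roughly [-7, 18], the exact fit of V(s) = 2s - 7
```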
Note
Initially, the value function may be all zeros. It is often better to use a denser reward signal to initialize the value function.
Q-function approximation
Instead of f(s), we use f(s,a) to approximate Q(s,a):
State-action pairs with similar feature values will be treated similarly.
More complex functions require more complex features.
\hat{Q}(s,a, \theta) = \theta_0 + \sum_{i=1}^n \theta_i f_i(s,a)
Features are a function of state and action.
Just as for TD, we can generalize Q-learning to update the parameters of the Q-function approximation.
Q-learning with Linear Approximators:
1. Start with initial parameter values.
2. Take an action according to an exploration/exploitation policy, transitioning from s to s'.
3. Perform the TD update for each parameter (sketched in code below):
\theta_i \gets \theta_i + \alpha \left(R(s)+\gamma \max_{a'\in A} \hat{Q}_\theta(s',a')- \hat{Q}_\theta(s,a)\right)f_i(s,a)
4. Go to step 2.
Warning
Typically the error surface has many local minima and convergence is no longer guaranteed. However, this approach often works well in practice.
Here R(s)+\gamma \max_{a'\in A} \hat{Q}_\theta(s',a') is the estimate of Q(s,a) based on an observed transition.
Note that f_i(s,a)=\frac{\partial \hat{Q}_\theta(s,a)}{\partial \theta_i}; this needs to be computable in closed form.
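For concreteness, here is a minimal Q-learning sketch with a linear approximator on a hypothetical five-position corridor (not from the lecture). For simplicity the behavior policy is uniformly random rather than epsilon-greedy; Q-learning is off-policy, so the max in the target still evaluates the greedy policy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical corridor: positions 0..4, actions 0 = left, 1 = right.
# Entering position 4 ends the episode with +10; every other step costs -1.
N_POS, GAMMA, ALPHA = 5, 0.9, 0.05

def features(s, a):
    # f(s, a): constant term, normalized position, action indicator.
    return np.array([1.0, s / (N_POS - 1), float(a)])

theta = np.zeros(3)

def q_hat(s, a):
    return theta @ features(s, a)

for _ in range(3000):
    s = 0
    while s != N_POS - 1:
        a = int(rng.integers(0, 2))                      # random behavior policy
        s_next = max(0, min(N_POS - 1, s + (1 if a == 1 else -1)))
        r = 10.0 if s_next == N_POS - 1 else -1.0
        q_next = 0.0 if s_next == N_POS - 1 else max(q_hat(s_next, b) for b in (0, 1))
        theta += ALPHA * (r + GAMMA * q_next - q_hat(s, a)) * features(s, a)
        s = s_next

print(theta)   # theta[2] should come out positive: "right" is valued above "left"
```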
Deep Q-network (DQN)
DQN is a non-linear function approximator that uses a deep neural network to approximate the value function.
The goal is to build a single agent that can solve any human-level control problem.
- RL defines the objective (the Q-value function)
- DL learns the hierarchical feature representation
Use a deep network to represent the value function: