# CSE510 Deep Reinforcement Learning (Lecture 7)

## Large Scale RL

So far we have represented value functions by a lookup table:

- Every state s has an entry V(s), or
- Every state-action pair (s, a) has an entry Q(s, a)

We would like to use reinforcement learning to solve large problems, e.g.

- Backgammon: 10^20 states
- Computer Go: 10^170 states
- Helicopter, robot, ...: enormous continuous state space

Tabular methods clearly cannot handle this. Why?

- There are too many states and/or actions to store in memory
- It is too slow to learn the value of each state individually
- You cannot generalize across states!

### Value Function Approximation (VFA)

Solution for large MDPs:

- Estimate the value function using a function approximator

**Value function approximation (VFA)** replaces the table with a general parameterized form:

$$
\hat{V}(s, \theta) \approx V_\pi(s)
$$

or

$$
\hat{Q}(s, a, \theta) \approx Q_\pi(s, a)
$$

Benefits:

- Can generalize across states
- Saves memory (only need to store the function approximator parameters; see the sketch below)
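
As a concrete illustration, here is a minimal sketch of a linear value function approximator, $\hat{V}(s, \theta) = \theta^\top \phi(s)$. The feature map `phi`, the update rule, and the numbers are illustrative assumptions, not the lecture's reference implementation; the point is that only $\theta$ is stored, not a table over states.

```python
import numpy as np

def phi(state):
    """Hypothetical feature map: turns a raw state into a fixed-length feature vector."""
    state = np.asarray(state, dtype=float)
    return np.concatenate([[1.0], state, state ** 2])  # bias + linear + quadratic features

class LinearVFA:
    """V_hat(s, theta) = theta . phi(s); only theta is stored, not a table of states."""
    def __init__(self, num_features):
        self.theta = np.zeros(num_features)

    def value(self, state):
        return float(self.theta @ phi(state))

    def update(self, state, target, lr=0.1):
        # Move the prediction toward a target value (e.g., a Monte Carlo return).
        features = phi(state)
        error = target - self.theta @ features
        self.theta += lr * error * features

vfa = LinearVFA(num_features=5)      # matches phi() for a 2-dimensional state
vfa.update(state=[0.5, -0.2], target=1.0)
print(vfa.value([0.5, -0.2]))
```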

### End-to-End RL

End-to-end RL methods replace the hand-designed state representation with raw observations.

- Good: we get rid of the manual design of state representations
- Bad: we need much more data to train the network, since the observation $O_t$ is usually far higher-dimensional than a hand-designed state $S_t$

## Function Approximation

- Linear function approximation
- Neural network function approximation
- Decision tree function approximation
- Nearest neighbor
- ...

In this course, we will focus on **linear combinations of features** and **neural networks**.

Today we will cover deep neural networks (fully connected and convolutional).
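
Before getting into the details, here is a minimal sketch of how $\hat{Q}(s, a, \theta)$ can be represented by a small fully connected network that outputs one Q-value per discrete action. The two-layer structure, layer sizes, and example state are illustrative assumptions only.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

class MLPQNetwork:
    """Q_hat(s, a, theta): a fully connected net mapping a state to one Q-value per action."""
    def __init__(self, state_dim, num_actions, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (state_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, (hidden, num_actions))
        self.b2 = np.zeros(num_actions)

    def q_values(self, state):
        h = relu(np.asarray(state, dtype=float) @ self.W1 + self.b1)
        return h @ self.W2 + self.b2            # one Q estimate per discrete action

q_net = MLPQNetwork(state_dim=4, num_actions=2)
print(q_net.q_values([0.1, -0.3, 0.05, 0.0]))  # e.g., a CartPole-like state
```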

### Artificial Neural Networks

#### Neuron

A single neuron computes a function $f: \mathbb{R}^k \to \mathbb{R}$:

$z = a_1 w_1 + a_2 w_2 + \cdots + a_k w_k + b$

$a_1, a_2, \cdots, a_k$ are the inputs, $w_1, w_2, \cdots, w_k$ are the weights, and $b$ is the bias.

Then we apply an activation function $\sigma(z)$ (usually non-linear).
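
A minimal sketch of one neuron's forward pass; the input values, weights, bias, and the choice of sigmoid as $\sigma$ are made up for illustration.

```python
import numpy as np

def neuron(a, w, b, sigma):
    """Compute sigma(z) with z = a_1*w_1 + ... + a_k*w_k + b."""
    z = np.dot(a, w) + b
    return sigma(z)

a = np.array([1.0, 2.0, -1.0])   # inputs a_1..a_k (illustrative values)
w = np.array([0.5, -0.25, 0.1])  # weights w_1..w_k
b = 0.05                         # bias

print(neuron(a, w, b, sigma=lambda z: 1.0 / (1.0 + np.exp(-z))))  # sigmoid activation
```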

##### Activation functions

ReLU (rectified linear unit):

$$
\text{ReLU}(x) = \max(0, x)
$$

- Bounded below by 0.
- Non-vanishing gradient (for positive inputs).
- No upper bound.

Sigmoid:

$$
\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
$$

- Always positive.
- Bounded between 0 and 1.
- Strictly increasing.
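
A small sketch of these two activations in code (NumPy is an assumption; the lecture does not prescribe a library):

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x): bounded below by 0, unbounded above."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Sigmoid(x) = 1 / (1 + e^{-x}): strictly increasing, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))     # [0. 0. 3.]
print(sigmoid(x))  # roughly [0.119 0.5 0.953]
```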

> [!TIP]
>
> Use ReLU for the earlier (hidden) layers and sigmoid for the output layer.
>
> For fully connected shallow networks, you may use more sigmoid layers.

We can use parallel computing techniques to speed up the computation.
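
For example, the per-neuron sums of a fully connected layer can be computed for a whole batch at once with a single matrix multiplication, which parallel hardware (multi-core CPUs, GPUs) executes efficiently; the shapes below are illustrative.

```python
import numpy as np

batch_size, in_dim, out_dim = 64, 100, 50
X = np.random.randn(batch_size, in_dim)   # a batch of inputs
W = np.random.randn(in_dim, out_dim)      # weights of one fully connected layer
b = np.zeros(out_dim)                     # biases

# One matrix multiply computes z = XW + b for every example and every neuron at once,
# instead of looping over examples and neurons.
Z = X @ W + b
A = np.maximum(0.0, Z)                    # ReLU activation, applied element-wise
print(A.shape)                            # (64, 50)
```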

#### Universal Approximation Theorem

Any continuous function (on a compact domain) can be approximated arbitrarily well by a neural network with a single hidden layer.

(a flat, i.e. shallow, network)

#### Why use deep neural networks?

Motivation from biology:

- Visual cortex

Motivation from circuit theory:

- Compact representation

Modularity:

- Uses data more efficiently

In practice: works better for many domains

- Hard to argue with results.

### Training Neural Networks

- Loss function
- Model
- Optimization

Empirical loss minimization framework:

$$
\arg\min_{\theta} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i; \theta), y_i) + \lambda \Omega(\theta)
$$

$\ell$ is the loss function, $f$ is the model, $\theta$ are the parameters, $\Omega$ is the regularization term, and $\lambda$ is the regularization weight.

Learning is cast as optimization; a short code sketch follows the list below.

- For classification problems, we would like to minimize classification error, e.g. with the logistic or cross-entropy loss.
- For regression problems, we would like to minimize regression error, e.g. the L1 or L2 distance from the ground truth.
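
As a sketch, here is the regularized empirical loss for a regression problem with the squared (L2) loss and the penalty $\Omega(\theta) = \|\theta\|^2$; the linear model and the synthetic data are illustrative choices of $f$ and $(x_i, y_i)$, not part of the lecture.

```python
import numpy as np

def empirical_loss(theta, X, y, lam):
    """(1/n) * sum_i l(f(x_i; theta), y_i) + lam * Omega(theta),
    with f(x; theta) = x . theta, l = squared error, Omega(theta) = ||theta||^2."""
    predictions = X @ theta
    data_term = np.mean((predictions - y) ** 2)
    reg_term = lam * np.sum(theta ** 2)
    return data_term + reg_term

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
theta = np.zeros(3)
print(empirical_loss(theta, X, y, lam=0.01))
```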

#### Stochastic Gradient Descent

Perform updates after seeing each example; a code sketch follows the pseudocode:

- Initialize: $\theta \equiv \{W^{(1)}, b^{(1)}, \cdots, W^{(L)}, b^{(L)}\}$
- For $t = 1, 2, \cdots, T$:
  - For each training example $(x^{(t)}, y^{(t)})$:
    - Compute the gradient step: $\Delta = -\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)}) - \lambda \nabla_\theta \Omega(\theta)$
    - Update: $\theta \gets \theta + \alpha \Delta$
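
A minimal runnable sketch of this loop, reusing the illustrative linear regression model with L2 regularization from above (the model, learning rate, and data are assumptions, not the lecture's):

```python
import numpy as np

def sgd(X, y, lam=0.01, alpha=0.05, num_epochs=20):
    """Per-example SGD for f(x; theta) = x . theta with squared loss and L2 regularization."""
    theta = np.zeros(X.shape[1])                           # initialize parameters
    for _ in range(num_epochs):                            # for t = 1..T
        for x_t, y_t in zip(X, y):                         # for each training example
            grad_loss = 2.0 * (x_t @ theta - y_t) * x_t    # gradient of the squared loss
            grad_reg = 2.0 * lam * theta                   # gradient of Omega(theta) = ||theta||^2
            delta = -grad_loss - grad_reg
            theta = theta + alpha * delta                  # theta <- theta + alpha * Delta
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(sgd(X, y))   # should be close to [1.0, -2.0, 0.5]
```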

To train a neural network, we need:

- A loss function
- A procedure to compute the gradient
- A regularization term

#### Mini-batch and Momentum

Make updates based on a mini-batch of examples (instead of a single example):

- the gradient is computed on the average regularized loss for that mini-batch
- this can give a more accurate estimate of the gradient

Momentum uses an exponential average of previous gradients:

$$
\overline{\nabla}_\theta^{(t)} = \nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)}) + \beta \overline{\nabla}_\theta^{(t-1)}
$$

This can get past plateaus more quickly, by "gaining momentum".
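
A sketch of mini-batch gradient descent with this momentum rule, again reusing the illustrative linear regression setup; the batch size, $\beta$, and learning rate are assumed values.

```python
import numpy as np

def minibatch_sgd_momentum(X, y, lam=0.01, alpha=0.05, beta=0.9, batch_size=32, num_epochs=50):
    """Mini-batch updates with momentum: g_bar <- g + beta * g_bar, theta <- theta - alpha * g_bar."""
    n, d = X.shape
    theta = np.zeros(d)
    g_bar = np.zeros(d)                                  # exponential average of past gradients
    rng = np.random.default_rng(0)
    for _ in range(num_epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            residual = X[batch] @ theta - y[batch]
            grad = 2.0 * X[batch].T @ residual / len(batch) + 2.0 * lam * theta  # average regularized loss
            g_bar = grad + beta * g_bar                  # momentum accumulation
            theta = theta - alpha * g_bar
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(512, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=512)
print(minibatch_sgd_momentum(X, y))   # approximately [1.0, -2.0, 0.5]
```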

### Convolutional Neural Networks

Overview of history:

- CNN
- MLP
- RNN / LSTM / GRU (Gated Recurrent Unit)
- Transformer