# CSE510 Deep Reinforcement Learning (Lecture 7)
## Large Scale RL
So far we have represented value functions by a lookup table:
- Every state $s$ has an entry $V(s)$, or
- Every state-action pair $(s, a)$ has an entry $Q(s, a)$
We want reinforcement learning to solve large problems, e.g.
- Backgammon: $10^{20}$ states
- Computer Go: $10^{170}$ states
- Helicopter, robot, ...: enormous continuous state spaces
Tabular methods clearly cannot handle this. Why not?
- There are too many states and/or actions to store in memory
- It is too slow to learn the value of each state individually
- You cannot generalize across states!
### Value Function Approximation (VFA)
Solution for large MDPs:
- Estimate the value function using a function approximator
**Value function approximation (VFA)** replaces the table with a general parameterized form:
$$
\hat{V}(s, \theta) \approx V_\pi(s)
$$
or
$$
\hat{Q}(s, a, \theta) \approx Q_\pi(s, a)
$$
Benefits:
- Can generalize across states
- Saves memory (only the function approximator's parameters need to be stored)
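As a concrete illustration, here is a minimal sketch of a linear VFA in Python/NumPy. The feature map `phi`, the polynomial features, and the TD-style update target are illustrative assumptions, not a prescribed implementation:

```python
import numpy as np

NUM_FEATURES = 8  # assumed feature dimension for this sketch

def phi(s):
    """Hypothetical feature map: encode state s as a feature vector."""
    return np.array([s ** i for i in range(NUM_FEATURES)], dtype=float)

def v_hat(s, theta):
    """Linear VFA: V_hat(s, theta) = phi(s) . theta (replaces the lookup table)."""
    return phi(s) @ theta

def td_update(s, target, theta, alpha=0.01):
    """Move theta toward a bootstrapped target, e.g. r + gamma * V_hat(s', theta)."""
    error = target - v_hat(s, theta)
    return theta + alpha * error * phi(s)  # gradient of v_hat w.r.t. theta is phi(s)

theta = np.zeros(NUM_FEATURES)  # we store only these parameters, not a table
```

However many states there are, the memory cost is only `NUM_FEATURES` parameters, and updating `theta` for one state changes the estimates of similar states, which is exactly the generalization benefit above.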
### End-to-End RL
End-to-end RL methods replace the hand-designed state representation with raw observations.
- Good: We no longer have to hand-design state representations
- Bad: We need far more data to train the network, since the raw observation $O_t$ is usually much higher dimensional than a hand-designed state $S_t$
## Function Approximation
- Linear function approximation
- Neural network function approximation
- Decision tree function approximation
- Nearest neighbor
- ...
In this course, we will focus on **Linear combination of features** and **Neural networks**.
Today we will cover deep neural networks (fully connected and convolutional).
### Artificial Neural Networks
#### Neuron
$f: \mathbb{R}^k\to \mathbb{R}$
$z=a_1w_1+a_2w_2+\cdots+a_kw_k+b$
$a_1,a_2,\cdots,a_k$ are the inputs, $w_1,w_2,\cdots,w_k$ are the weights, $b$ is the bias.
Then we apply an activation function $\sigma(z)$ (usually non-linear) to produce the neuron's output.
##### Activation functions
ReLU (rectified linear unit):
$$
\text{ReLU}(x) = \max(0, x)
$$
- Bounded below by 0.
- Non-vanishing gradient for positive inputs (the gradient is exactly 0 for negative inputs).
- No upper bound.
Sigmoid:
$$
\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
$$
- Always positive.
- Bounded between 0 and 1.
- Strictly increasing.
> [!TIP]
>
> Use ReLU in the earlier (hidden) layers; use sigmoid in the output layer.
>
> For shallow fully connected networks, you may use more sigmoid layers.
We can use parallel computing techniques to speed up the computation.
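To make this concrete, here is a minimal NumPy sketch of a single neuron and a tiny fully connected network following the tip above (ReLU hidden layer, sigmoid output); the layer sizes and random weights are illustrative assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(a, w, b, activation=relu):
    """One neuron: z = a_1 w_1 + ... + a_k w_k + b, then sigma(z)."""
    return activation(np.dot(a, w) + b)

# Tiny network, per the tip: ReLU for the hidden layer, sigmoid at the output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)  # 4 inputs -> 3 hidden units
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)  # 3 hidden units -> 1 output

x = rng.normal(size=4)
h = relu(x @ W1 + b1)      # each column of W1 is one neuron, computed in parallel
y = sigmoid(h @ W2 + b2)   # output bounded in (0, 1)
```

The matrix form is what makes parallel computation natural: a whole layer of neurons is a single matrix multiply.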
#### Universal Approximation Theorem
Any continuous function (on a compact domain) can be approximated arbitrarily well by a neural network with a single hidden layer, given enough hidden units.
(a single "flat" layer)
#### Why use deep neural networks?
Motivation from Biology
- Visual Cortex
Motivation from circuit theory
- Compact representation
Modularity
- Uses data more efficiently
In Practice: works better for many domains
- Hard to argue with results.
### Training Neural Networks
- Loss function
- Model
- Optimization
Empirical loss minimization framework:
$$
\arg\min_{\theta} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i; \theta), y_i)+\lambda \Omega(\theta)
$$
$\ell$ is the loss function, $f$ is the model, $\theta$ is the parameters, $\Omega$ is the regularization term, $\lambda$ is the regularization parameter.
Learning is cast as optimization.
- For classification problems, we would like to minimize classification error, e.g., logistic or cross entropy loss.
- For regression problems, we would like to minimize regression error, e.g., L1 or L2 distance from the ground truth.
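As a small illustration, here are minimal NumPy versions of these two loss choices; the function names and the binary-classification setup are assumptions of this sketch:

```python
import numpy as np

def cross_entropy(p_hat, y, eps=1e-12):
    """Binary cross-entropy loss: -[y log p + (1 - y) log(1 - p)], averaged."""
    p_hat = np.clip(p_hat, eps, 1.0 - eps)  # avoid log(0)
    return float(-(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)).mean())

def l2_loss(y_hat, y):
    """Mean squared (L2) distance from the ground truth."""
    return float(((y_hat - y) ** 2).mean())
```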
#### Stochastic Gradient Descent
Perform updates after seeing each example:
- Initialize: $\theta\equiv\{W^{(1)},b^{(1)},\cdots,W^{(L)},b^{(L)}\}$
- For $t=1,2,\cdots,T$:
- For each training example $(x^{(t)},y^{(t)})$:
- Compute gradient: $\Delta = -\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)})-\lambda\nabla_\theta \Omega(\theta)$
- $\theta \gets \theta + \alpha \Delta$
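A minimal sketch of this loop for a generic differentiable model; `grad_loss` and `grad_reg` stand in for gradients computed by backpropagation and are assumptions of this sketch:

```python
def sgd(theta, data, grad_loss, grad_reg, alpha=0.01, lam=1e-4, epochs=10):
    """Plain SGD: update theta after seeing each example, as in the pseudocode above."""
    for _ in range(epochs):                           # t = 1, 2, ..., T
        for x_t, y_t in data:                         # one training example at a time
            delta = -grad_loss(theta, x_t, y_t) - lam * grad_reg(theta)
            theta = theta + alpha * delta             # theta <- theta + alpha * Delta
    return theta
```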
To train a neural network, we need:
- Loss function
- Procedure to compute the gradient
- Regularization term
#### Mini-batch and Momentum
Make updates based on a mini-batch of examples (instead of a single example)
- the gradient is computed on the average regularized loss for that mini-batch
- can give a more accurate estimate of the gradient
Momentum uses an exponentially decaying average of previous gradients:
$$
\overline{\nabla}_\theta^{(t)}=\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)})+\beta\overline{\nabla}_\theta^{(t-1)}
$$
This can get past plateaus more quickly, by "gaining momentum".
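Combining the two ideas, here is a hedged sketch of mini-batch SGD with the momentum rule above; the array-based batching and the hyperparameter values are illustrative assumptions:

```python
import numpy as np

def minibatch_sgd_momentum(theta, X, Y, grad_loss, alpha=0.01, beta=0.9,
                           batch_size=32, epochs=10):
    """Mini-batch SGD with momentum: g_bar_t = g_t + beta * g_bar_{t-1}."""
    g_bar = np.zeros_like(theta)
    n = len(X)
    for _ in range(epochs):
        perm = np.random.permutation(n)           # shuffle examples each epoch
        for i in range(0, n, batch_size):
            idx = perm[i:i + batch_size]
            g = grad_loss(theta, X[idx], Y[idx])  # gradient of the average loss on the batch
            g_bar = g + beta * g_bar              # exponential average of past gradients
            theta = theta - alpha * g_bar         # descend along the smoothed gradient
    return theta
```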
### Convolutional Neural Networks
Overview of history:
- CNN
- MLP
- RNN/LSTM/GRU(Gated Recurrent Unit)
- Transformer