# CSE510 Deep Reinforcement Learning (Lecture 7)

## Large Scale RL

So far we have represented value functions by a lookup table

- Every state s has an entry V(s), or
- Every state-action pair (s, a) has an entry Q(s, a)

Reinforcement learning should be used to solve large problems, e.g.

- Backgammon: 10^20 states
- Computer Go: 10^170 states
- Helicopter, robot, ...: enormous continuous state space

Tabular methods cannot handle problems of this size. Why?

- There are too many states and/or actions to store in memory
- It is too slow to learn the value of each state individually
- You cannot generalize across states!

### Value Function Approximation (VFA)

Solution for large MDPs:

- Estimate the value function using a function approximator

**Value function approximation (VFA)** replaces the table with a general parameterized form:

$$
\hat{V}(s, \theta) \approx V_\pi(s)
$$

or

$$
\hat{Q}(s, a, \theta) \approx Q_\pi(s, a)
$$

Benefits:

- Can generalize across states
- Save memory (only need to store the function approximator parameters)
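
As a minimal sketch of the parameterized form above, the snippet below uses a linear approximator $\hat{V}(s, \theta) = \theta^\top \phi(s)$. The feature map `features`, the one-dimensional state, and the parameter values are all hypothetical and chosen only for illustration.

```python
def features(state):
    """Hypothetical feature map phi(s): bias, position, position squared."""
    x = float(state)
    return [1.0, x, x * x]


def v_hat(state, theta):
    """Linear value estimate: dot product of parameters and features."""
    return sum(w * f for w, f in zip(theta, features(state)))


# Three parameters are stored, no matter how many states exist.
theta = [0.5, -1.0, 2.0]

print(v_hat(0.0, theta))  # 0.5
print(v_hat(2.0, theta))  # 6.5
```

Because nearby states share features, changing `theta` to fix the estimate at one state also moves the estimates at similar states, which is the generalization a lookup table lacks.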

### End-to-End RL

End-to-end RL methods replace the hand-designed state representation with raw observations.

- Good: We get rid of manual design of state representations
- Bad: We need far more data to train the network, since the raw observation $O_t$ is usually much higher-dimensional than a hand-designed state $S_t$

## Function Approximation

- Linear function approximation
- Neural network function approximation
- Decision tree function approximation
- Nearest neighbor 
- ...

In this course, we will focus on **Linear combination of features** and **Neural networks**.

Today we cover deep neural networks (fully connected and convolutional).

### Artificial Neural Networks

#### Neuron

A neuron computes a function $f: \mathbb{R}^k\to \mathbb{R}$.

$z=a_1w_1+a_2w_2+\cdots+a_kw_k+b$

$a_1,a_2,\cdots,a_k$ are the inputs, $w_1,w_2,\cdots,w_k$ are the weights, $b$ is the bias.

The neuron's output is $\sigma(z)$, where $\sigma$ is an activation function (usually non-linear).
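
The computation of a single neuron can be sketched as follows. This is only an illustration: the input values, weights, and the choice of sigmoid as the activation are assumptions, not part of the lecture.

```python
import math


def sigmoid(z):
    """Logistic activation, used here only as an example of sigma."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation):
    """Weighted sum z = a_1 w_1 + ... + a_k w_k + b, then the activation."""
    z = sum(a * w for a, w in zip(inputs, weights)) + bias
    return activation(z)


# Illustrative values: z = 1.0 * 0.5 + 2.0 * (-0.25) + 0.0 = 0.0
out = neuron([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid)
print(out)  # 0.5
```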

##### Activation functions

ReLU (rectified linear unit):

$$
\text{ReLU}(x) = \max(0, x)
$$

- Bounded below by 0.
- Non-vanishing gradient for positive inputs (the derivative is 1 for $x > 0$ and 0 for $x < 0$).
- No upper bound.

Sigmoid:

$$
\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
$$

- Always positive.
- Bounded between 0 and 1.
- Strictly increasing.
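
To make the listed properties concrete, here is a small sketch of both activations and their derivatives. The helper names (`relu_grad`, `sigmoid_grad`) and the sample inputs are hypothetical; the derivative of ReLU at exactly 0 is set to 0 here by convention.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Derivative of ReLU; the value at x == 0 is a convention (0 here).
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


# Compare values and slopes at a few sample inputs.
for x in (-10.0, 0.5, 10.0):
    print(x, relu(x), relu_grad(x), sigmoid(x), sigmoid_grad(x))
```

For large positive inputs the sigmoid slope shrinks toward 0 while the ReLU slope stays at 1, which is one reason ReLU is a common choice for hidden layers.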

> [!TIP]
>
> Use ReLU for the hidden layers and sigmoid for the output layer.
>
> For fully connected shallow networks, you may use more sigmoid layers.

We can use parallel computing techniques to speed up the computation.

#### Universal Approximation Theorem

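Any continuous function on a compact domain can be approximated arbitrarily well by a neural network with a single hidden layer, provided the layer has enough hidden units and a suitable non-linear activation.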

(flat layer)

#### Why use deep neural networks?

Motivation from Biology

- Visual Cortex

Motivation from circuit theory

- Compact representation

Modularity

- Uses data more efficiently

In Practice: works better for many domains

- Hard to argue with results.

### Training Neural Networks

- Loss function
- Model
- Optimization

Empirical loss minimization framework:

$$
\argmin_{\theta} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i; \theta), y_i)+\lambda \Omega(\theta)
$$

Here $\ell$ is the loss function, $f$ is the model, $\theta$ are the parameters, $n$ is the number of training examples, $\Omega$ is the regularization term, and $\lambda$ is the regularization parameter.

Learning is cast as optimization.

- For classification problems, we would like to minimize classification error, typically through a differentiable surrogate such as logistic or cross-entropy loss.
- For regression problems, we would like to minimize regression error, e.g., L1 or L2 distance from the ground truth.
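
The regularized objective above can be evaluated directly. The sketch below assumes a one-dimensional linear model, an L2 loss, and an L2 penalty for $\Omega$; these choices, the function names, and the toy dataset are illustrative assumptions rather than anything prescribed by the lecture.

```python
def l2_loss(pred, target):
    """Squared error between a prediction and its ground truth."""
    return (pred - target) ** 2


def l2_penalty(theta):
    """Omega(theta): sum of squared parameters."""
    return sum(w * w for w in theta)


def linear_model(x, theta):
    """f(x; theta) = w * x + b, with theta = [w, b]."""
    w, b = theta
    return w * x + b


def objective(theta, data, model, loss, penalty, lam):
    """Average loss over the dataset plus lam times the penalty."""
    avg_loss = sum(loss(model(x, theta), y) for x, y in data) / len(data)
    return avg_loss + lam * penalty(theta)


# Toy data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

print(objective([2.0, 1.0], data, linear_model, l2_loss, l2_penalty, lam=0.0))
print(objective([2.0, 1.0], data, linear_model, l2_loss, l2_penalty, lam=0.1))
```

With `lam=0.0` the exact parameters give zero objective; with `lam=0.1` the same parameters pay a penalty of `0.1 * (2.0**2 + 1.0**2)`, so the minimizer shifts toward smaller weights.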

#### Stochastic Gradient Descent

Perform an update after seeing each example:

- Initialize: $\theta\equiv\{W^{(1)},b^{(1)},\cdots,W^{(L)},b^{(L)}\}$
- For $t=1,2,\cdots,T$:
  - For each training example $(x^{(t)},y^{(t)})$:
    - Compute the update direction (the negative gradient): $\Delta = -\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)})-\lambda\nabla_\theta \Omega(\theta)$
    - Update the parameters: $\theta \gets \theta + \alpha \Delta$, where $\alpha$ is the learning rate
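
A minimal sketch of this per-example update, assuming a one-dimensional linear model with squared loss and an L2 penalty so that the gradients can be written by hand. The function name `sgd`, the hyperparameter defaults, and the toy data are hypothetical.

```python
import random


def sgd(data, theta, alpha=0.05, lam=0.0, epochs=200, seed=0):
    """Per-example SGD for f(x) = w*x + b with squared loss and L2 penalty."""
    rng = random.Random(seed)
    data = list(data)
    w, b = theta
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a random order
        for x, y in data:
            err = (w * x + b) - y
            # Delta = -grad(loss) - lam * grad(Omega),
            # with loss = err^2 and Omega = w^2 + b^2.
            delta_w = -2.0 * err * x - lam * 2.0 * w
            delta_b = -2.0 * err - lam * 2.0 * b
            # theta <- theta + alpha * Delta
            w += alpha * delta_w
            b += alpha * delta_b
    return w, b


# Toy data generated from y = 2x + 1.
data = [(-1.0, -1.0), (0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = sgd(data, (0.0, 0.0))
print(w, b)  # should be close to 2 and 1
```

In a neural network the hand-written gradients would be replaced by backpropagation, but the update rule itself stays the same.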

To train a neural network, we need:

- Loss function
- Procedure to compute the gradient
- Regularization term

#### Mini-batch and Momentum

Make updates based on a mini-batch of examples (instead of a single example)

- the gradient is computed on the average regularized loss for that mini-batch
- can give a more accurate estimate of the gradient

Momentum replaces the current gradient with an exponentially weighted sum of the current and previous gradients:

$$
\overline{\nabla}_\theta^{(t)}=\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)})+\beta\overline{\nabla}_\theta^{(t-1)}
$$

Here $\beta \in [0, 1)$ controls how much of the previous gradients is retained. Momentum can get past plateaus more quickly by "gaining momentum".
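
Both ideas can be combined in one training loop. The sketch below follows the accumulation rule above (current gradient plus $\beta$ times the previous accumulated gradient) on the same kind of toy linear regression; the function names `batch_grad` and `train`, the hyperparameters, and the data are assumptions made for illustration.

```python
import random


def batch_grad(theta, batch, lam):
    """Gradient of the average regularized squared loss over one mini-batch."""
    w, b = theta
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    n = len(batch)
    return [gw / n + 2.0 * lam * w, gb / n + 2.0 * lam * b]


def train(data, theta, alpha=0.01, beta=0.9, lam=0.0,
          batch_size=2, epochs=300, seed=0):
    """Mini-batch gradient descent with momentum."""
    rng = random.Random(seed)
    data = list(data)
    theta = list(theta)
    velocity = [0.0] * len(theta)  # accumulated gradient, initially zero
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            g = batch_grad(theta, data[start:start + batch_size], lam)
            # Accumulated gradient: current gradient + beta * previous one.
            velocity = [gi + beta * vi for gi, vi in zip(g, velocity)]
            # Step against the accumulated gradient.
            theta = [ti - alpha * vi for ti, vi in zip(theta, velocity)]
    return theta


# Toy data generated from y = 2x + 1.
data = [(-1.0, -1.0), (0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = train(data, [0.0, 0.0])
print(w, b)  # should be close to 2 and 1
```

Note that with this unnormalized accumulation the effective step size is roughly `alpha / (1 - beta)`, so `alpha` is chosen smaller here than it would be without momentum.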

### Convolutional Neural Networks

Overview of history:

- CNN
- MLP
- RNN/LSTM/GRU(Gated Recurrent Unit)
- Transformer