# CSE510 Deep Reinforcement Learning (Lecture 7) ## Large Scale RL So far we have represented value functions by a lookup table - Every state s has an entry V(s), or - Every state-action pair (s, a) has an entry Q(s, a) Reinforcement learning should be used to solve large problems, e.g. - Backgammon: 10^20 states - Computer Go: 10^170 states - Helicopter, robot, ...: enormous continuous state space Tabular methods clearly cannot handle this.. why? - There are too many states and/or actions to store in memory - It is too slow to learn the value of each state individually - You cannot generalize across states! ### Value Function Approximation (VFA) Solution for large MDPs: - Estimate the value function using a function approximator **Value function approximation (VFA)** replaces the table with general parameterize form: $$ \hat{V}(s, \theta) \approx V_\pi(s) $$ or $$ \hat{Q}(s, a, \theta) \approx Q_\pi(s, a) $$ Benefit: - Can generalize across states - Save memory (only need to store the function approximator parameters) ### End-to-End RL End-to-end RL methods replace the hand-designed state representation with raw observations. - Good: We get rid of manual design of state representations - Bad: we need tons of data to train the network since O_t usually WAY more high dimensional than hand-designed S_t ## Function Approximation - Linear function approximation - Neural network function approximation - Decision tree function approximation - Nearest neighbor - ... In this course, we will focus on **Linear combination of features** and **Neural networks**. Today we will do Deep neural networks (fully connected and convolutional). ### Artificial Neural Networks #### Neuron $f(x) = \mathbb{R}^k\to \mathbb{R}$ $z=a_1w_1+a_2w_2+\cdots+a_kw_k+b$ $a_1,a_2,\cdots,a_k$ are the inputs, $w_1,w_2,\cdots,w_k$ are the weights, $b$ is the bias. Then we have activation function $\sigma(z)$ (usually non-linear) ##### Activation functions ReLU (rectified linear unit): $$ \text{ReLU}(x) = \max(0, x) $$ - Bounded below by 0. - Non-vanishing gradient. - No upper bound. Sigmoid: $$ \text{Sigmoid}(x) = \frac{1}{1 + e^{-x}} $$ - Always positive. - Bounded between 0 and 1. - Strictly increasing. > [!TIP] > > Use relu for previous layers, use sigmoid for output layer. > > For fully connected shallow networks, you may use more sigmoid layers. We can use parallel computing techniques to speed up the computation. #### Universal Approximation Theorem Any continuous function can be approximated by a neural network with a single hidden layer. (flat layer) #### Why use deep neural networks? Motivation from Biology - Visual Cortex Motivation from circuit theory - Compact representation Modularity - More efficiently using data In Practice: works better for many domains - Hard to argue with results. ### Training Neural Networks - Loss function - Model - Optimization Empirical loss minimization framework: $$ \argmin_{\theta} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i; \theta), y_i)+\lambda \Omega(\theta) $$ $\ell$ is the loss function, $f$ is the model, $\theta$ is the parameters, $\Omega$ is the regularization term, $\lambda$ is the regularization parameter. Learning is cast as optimization. - For classification problems, we would like to minimize classification error, e.g., logistic or cross entropy loss. - For regression problems, we would like to minimize regression error, e.g. L1 or L2 distance from groundtruth. #### Stochastic Gradient Descent Perform updates after seeing each example: - Initialize: $\theta\equiv\{W^{(1)},b^{(1)},\cdots,W^{(L)},b^{(L)}\}$ - For $t=1,2,\cdots,T$: - For each training example $(x^{(t)},y^{(t)})$: - Compute gradient: $\Delta = -\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)})-\lambda\nabla_\theta \Omega(\theta)$ - $\theta \gets \theta + \alpha \Delta$ Training a neural network, we need: - Loss function - Procedure to compute the gradient - Regularization term #### Mini-batch and Momentum Make updates based on a mini-batch of examples (instead of a single example) - the gradient is computed on the average regularized loss for that mini-batch - can give a more accurate estimate of the gradient Momentum can use an exponential average of previous gradients. $$ \overline{\nabla}_\theta^{(t)}=\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)})+\beta\overline{\nabla}_\theta^{(t-1)} $$ can get pass plateaus more quickly, by "gaining momentum". ### Convolutional Neural Networks Overview of history: - CNN - MLP - RNN/LSTM/GRU(Gated Recurrent Unit) - Transformer