Then we have the activation function $\sigma(z)$ (usually non-linear).

##### Activation functions

ReLU (rectified linear unit):

$$
\text{ReLU}(x) = \max(0, x)
$$

- Bounded below by 0.
- Non-vanishing gradient for positive inputs.
- No upper bound.

Sigmoid:

$$
\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
$$

- Always positive.
- Bounded between 0 and 1.
- Strictly increasing.

> [!TIP]
>
> Use ReLU for the hidden layers and sigmoid for the output layer.
>
> For fully connected shallow networks, you may use more sigmoid layers.

We can use parallel computing techniques to speed up the computation.
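
As an illustration, here is a minimal NumPy sketch of the two activations above; the function names `relu` and `sigmoid` are my own, not from the notes.

```python
import numpy as np

def relu(x):
    # Elementwise max(0, x): bounded below by 0, no upper bound,
    # gradient 1 for positive inputs (non-vanishing) and 0 otherwise.
    return np.maximum(0, x)

def sigmoid(x):
    # 1 / (1 + e^{-x}): always positive, bounded in (0, 1), strictly increasing.
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))     # [0. 0. 3.]
print(sigmoid(z))  # approx. [0.119 0.5 0.953]
```

Both operate elementwise on arrays, which is what makes the layer computations easy to parallelize.
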
#### Universal Approximation Theorem

Any continuous function (on a compact domain) can be approximated to arbitrary accuracy by a neural network with a single hidden layer, given enough hidden units.

(A single wide, "flat" hidden layer suffices in principle.)

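
As a rough, self-contained illustration (not from the notes), the sketch below fits a single-hidden-layer network to $\sin(x)$ with plain gradient descent; the hidden width, learning rate, and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: approximate sin(x) on [-pi, pi] with one hidden sigmoid layer.
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)

H = 32                                          # hidden width
W1 = rng.normal(0.0, 1.0, (1, H)); b1 = np.zeros(H)
W2 = rng.normal(0.0, 1.0, (H, 1)); b2 = np.zeros(1)

lr = 0.05
for step in range(5000):
    a = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))    # hidden activations, shape (200, H)
    pred = a @ W2 + b2                          # linear output, shape (200, 1)

    err = pred - y
    grad_out = 2.0 * err / len(x)               # d(MSE)/d(pred)

    # Backpropagate through the two layers.
    gW2 = a.T @ grad_out; gb2 = grad_out.sum(0)
    grad_a = (grad_out @ W2.T) * a * (1.0 - a)  # sigmoid derivative
    gW1 = x.T @ grad_a;   gb1 = grad_a.sum(0)

    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print("final MSE:", float(np.mean((pred - y) ** 2)))
```

With enough hidden units the fit can be made arbitrarily good, which is what the theorem guarantees; it says nothing about how easy that fit is to find.
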
#### Why use deep neural networks?

Motivation from biology:

- Visual cortex (hierarchical processing of visual input).

Motivation from circuit theory:

- Compact representation: deep circuits can represent some functions much more compactly than shallow ones.

Modularity:

- Reusing intermediate features lets the network use data more efficiently.

In practice: deep networks work better for many domains.

- Hard to argue with results.

### Training Neural Networks

- Loss function
- Model
- Optimization

Empirical loss minimization framework:

$$
\arg\min_{\theta} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i; \theta), y_i) + \lambda \Omega(\theta)
$$

Here $\ell$ is the loss function, $f$ is the model, $\theta$ are the parameters, $\Omega$ is the regularization term, and $\lambda$ is the regularization parameter.

Learning is cast as optimization.

- For classification problems, we want to minimize classification error; in practice we minimize a surrogate such as logistic or cross-entropy loss.
- For regression problems, we want to minimize regression error, e.g., L1 or L2 distance from the ground truth.

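
For concreteness, here is a small NumPy sketch of the regularized empirical loss for a classifier, using cross-entropy for $\ell$ and a squared L2 norm for $\Omega$; the function name and arguments are mine, not from the notes.

```python
import numpy as np

def regularized_empirical_loss(probs, y, theta, lam):
    """Average cross-entropy over n examples plus lam * ||theta||^2.

    probs: (n, k) predicted class probabilities f(x_i; theta)
    y:     (n,)   integer class labels
    theta: flat vector of model parameters
    lam:   regularization strength (lambda)
    """
    n = len(y)
    nll = -np.log(probs[np.arange(n), y] + 1e-12)  # per-example cross-entropy
    omega = np.sum(theta ** 2)                      # Omega(theta): L2 regularizer
    return nll.mean() + lam * omega

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y = np.array([0, 1])
theta = np.array([0.5, -0.3, 1.2])
print(regularized_empirical_loss(probs, y, theta, lam=1e-3))  # approx. 0.2917
```
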
#### Stochastic Gradient Descent

Perform updates after seeing each example (a code sketch follows the list):

- Initialize $\theta \equiv \{W^{(1)}, b^{(1)}, \cdots, W^{(L)}, b^{(L)}\}$
- For $t = 1, 2, \cdots, T$:
  - For each training example $(x^{(t)}, y^{(t)})$:
    - Compute the gradient: $\Delta = -\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)}) - \lambda \nabla_\theta \Omega(\theta)$
    - Update: $\theta \gets \theta + \alpha \Delta$

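
A minimal sketch of this loop in NumPy, using a linear model with squared loss and an L2 regularizer as stand-ins for $f$, $\ell$, and $\Omega$ (the data and hyperparameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X w* + noise; model f(x; theta) = W x + b.
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

W = np.zeros(3); b = 0.0           # initialize theta = {W, b}
alpha, lam, T = 0.01, 1e-4, 20     # step size, reg. strength, epochs

for t in range(T):
    for i in rng.permutation(len(X)):        # one example at a time
        x_i, y_i = X[i], y[i]
        err = (W @ x_i + b) - y_i            # f(x_i; theta) - y_i
        # Gradient of 1/2 * err^2 + lam * ||W||^2 w.r.t. W and b.
        gW = err * x_i + 2.0 * lam * W
        gb = err
        # theta <- theta + alpha * Delta, with Delta = -gradient.
        W -= alpha * gW
        b -= alpha * gb

print("learned W:", W, "b:", round(b, 4))    # W should be close to [1, -2, 0.5]
```

For a real network the update step is the same; only the computation of $\nabla_\theta \ell$ changes (backpropagation through the layers).
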
To train a neural network, we need:

- A loss function
- A procedure to compute the gradient
- A regularization term

#### Mini-batch and Momentum

Make updates based on a mini-batch of examples (instead of a single example):

- The gradient is computed on the average regularized loss for that mini-batch.
- This can give a more accurate estimate of the gradient.

Momentum uses an exponentially decaying average of previous gradients:

$$
\overline{\nabla}_\theta^{(t)} = \nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)}) + \beta \overline{\nabla}_\theta^{(t-1)}
$$

This can get past plateaus more quickly, by "gaining momentum".

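
A hedged sketch of both ideas together: mini-batch gradients plus a momentum term accumulating an exponential average of past gradients. The batch size and $\beta$ below are arbitrary choices, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)

X = rng.normal(size=(256, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=256)

W = np.zeros(3)
vel = np.zeros(3)                     # running exponential average of gradients
alpha, lam, beta, batch = 0.05, 1e-4, 0.9, 32

for epoch in range(50):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        idx = order[start:start + batch]
        Xb, yb = X[idx], y[idx]
        # Gradient of the average regularized loss over this mini-batch.
        err = Xb @ W - yb
        grad = Xb.T @ err / len(idx) + 2.0 * lam * W
        # Momentum: bar_grad_t = grad_t + beta * bar_grad_{t-1}.
        vel = grad + beta * vel
        W -= alpha * vel

print("learned W:", W)                # should be close to [1, -2, 0.5]
```
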
### Convolutional Neural Networks

Overview of history:

- CNN
- MLP
- RNN/LSTM/GRU (Gated Recurrent Unit)
- Transformer