Then we have the activation function $\sigma(z)$ (usually non-linear).
##### Activation functions
ReLU (rectified linear unit):
$$
\text{ReLU}(x) = \max(0, x)
$$
- Bounded below by 0.
- Gradient does not vanish for positive inputs (it is exactly 0 for negative inputs).
- No upper bound.
Sigmoid:
$$
\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
$$
- Always positive.
- Bounded between 0 and 1.
- Strictly increasing.
> [!TIP]
>
> Use ReLU for hidden layers and sigmoid for the output layer.
>
> For fully connected shallow networks, you may use sigmoid in more layers.
We can use parallel computing techniques (e.g., vectorized matrix operations) to speed up the computation.
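As a quick illustration, here is a minimal NumPy sketch of both activations (the helper names are mine, not from the notes); the vectorized operations apply elementwise across a whole layer, which is what makes them easy to parallelize:

```python
import numpy as np

def relu(x):
    # Elementwise max(0, x): bounded below by 0, no upper bound.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Logistic function 1 / (1 + e^{-x}): always positive, bounded in (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))     # [0. 0. 3.]
print(sigmoid(z))  # approximately [0.119 0.5   0.953]
```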
#### Universal Approximation Theorem
Any continuous function on a compact domain can be approximated to arbitrary accuracy by a neural network with a single, sufficiently wide hidden layer (a "flat" architecture).
#### Why use deep neural networks?
Motivation from biology
- Visual cortex (organized as a hierarchy of processing layers)
Motivation from circuit theory
- Compact representation (depth can represent some functions with far fewer units)
Modularity
- Uses data more efficiently
In practice: works better in many domains
- Hard to argue with results.
### Training Neural Networks
- Loss function
- Model
- Optimization
Empirical loss minimization framework:
$$
\arg\min_{\theta} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i; \theta), y_i)+\lambda \Omega(\theta)
$$
Here $\ell$ is the loss function, $f$ is the model, $\theta$ are the parameters, $\Omega$ is the regularization term, and $\lambda$ is the regularization parameter.
Learning is cast as optimization.
- For classification problems, we would like to minimize classification error, e.g., logistic or cross-entropy loss.
- For regression problems, we would like to minimize regression error, e.g., L1 or L2 distance from the ground truth.
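To make the framework concrete, here is a minimal sketch of the regularized empirical loss, assuming an illustrative linear model $f(x;\theta) = x^\top\theta$ with squared loss and L2 regularization $\Omega(\theta) = \|\theta\|^2$ (these specific choices are mine, not from the notes):

```python
import numpy as np

def regularized_loss(theta, X, y, lam):
    # Average squared loss of the linear model f(x; theta) = x @ theta
    # over n examples, plus the weighted L2 penalty lam * ||theta||^2.
    residuals = X @ theta - y
    empirical = np.mean(residuals ** 2)
    return empirical + lam * np.sum(theta ** 2)
```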
#### Stochastic Gradient Descent
Perform updates after seeing each example:
- Initialize: $\theta\equiv\{W^{(1)},b^{(1)},\cdots,W^{(L)},b^{(L)}\}$
- For $t=1,2,\cdots,T$ (epochs):
  - For each training example $(x^{(i)},y^{(i)})$:
    - Compute gradient: $\Delta = -\nabla_\theta \ell(f(x^{(i)}; \theta), y^{(i)})-\lambda\nabla_\theta \Omega(\theta)$
    - $\theta \gets \theta + \alpha \Delta$
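A minimal sketch of this loop for the illustrative linear model above; the gradient is hand-derived for that specific model rather than computed by backpropagation:

```python
import numpy as np

def sgd(X, y, lam=1e-3, alpha=0.01, epochs=10, seed=0):
    # Update theta after seeing each individual example, stepping along
    # Delta = -(gradient of loss) - lam * (gradient of the L2 penalty).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = rng.normal(scale=0.01, size=d)  # initialize parameters
    for _ in range(epochs):
        for i in rng.permutation(n):        # visit examples in random order
            grad_loss = 2.0 * (X[i] @ theta - y[i]) * X[i]
            grad_reg = 2.0 * theta          # gradient of ||theta||^2
            theta -= alpha * (grad_loss + lam * grad_reg)
    return theta
```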
To train a neural network, we need:
- Loss function
- Procedure to compute the gradient
- Regularization term
#### Mini-batch and Momentum
Make updates based on a mini-batch of examples (instead of a single example)
- the gradient is computed on the average regularized loss for that mini-batch
- can give a more accurate estimate of the gradient
Momentum uses an exponentially weighted average of previous gradients:
$$
\overline{\nabla}_\theta^{(t)}=\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)})+\beta\overline{\nabla}_\theta^{(t-1)}
$$
This can get past plateaus more quickly, by "gaining momentum".
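A sketch of the momentum update, assuming a caller-supplied `grad_fn` that returns the regularized mini-batch gradient at the current parameters (the function name and interface are illustrative):

```python
import numpy as np

def sgd_momentum(grad_fn, theta0, alpha=0.01, beta=0.9, steps=100):
    # avg_grad accumulates an exponentially weighted sum of past gradients,
    # matching the update above; beta controls how quickly old gradients decay.
    theta = theta0.copy()
    avg_grad = np.zeros_like(theta)
    for _ in range(steps):
        avg_grad = grad_fn(theta) + beta * avg_grad
        theta -= alpha * avg_grad
    return theta
```

With $\beta=0$ this reduces to plain SGD.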
### Convolutional Neural Networks
Overview of history:
- CNN
- MLP
- RNN/LSTM/GRU (Gated Recurrent Unit)
- Transformer