Then we have the activation function $\sigma(z)$ (usually non-linear).

##### Activation functions

ReLU (rectified linear unit):

$$
\text{ReLU}(x) = \max(0, x)
$$

- Bounded below by 0.
- Non-vanishing gradient for positive inputs.
- No upper bound.

Sigmoid:

$$
\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
$$

- Always positive.
- Bounded between 0 and 1.
- Strictly increasing.

> [!TIP]
>
> Use ReLU for the hidden layers and sigmoid for the output layer.
>
> For fully connected shallow networks, you may use more sigmoid layers.

We can use parallel computing techniques to speed up the computation.
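
As an illustration, here is a minimal NumPy sketch of the two activations above; the function names `relu` and `sigmoid` are my own, not from the notes.

```python
import numpy as np

def relu(x):
    # Elementwise max(0, x): bounded below by 0, no upper bound,
    # gradient 1 for positive inputs (non-vanishing) and 0 otherwise.
    return np.maximum(0, x)

def sigmoid(x):
    # 1 / (1 + e^{-x}): always positive, bounded in (0, 1), strictly increasing.
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))     # [0. 0. 3.]
print(sigmoid(z))  # approx. [0.119 0.5 0.953]
```

Both operate elementwise on arrays, which is what makes the layer computations easy to parallelize.
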
#### Universal Approximation Theorem

Any continuous function (on a compact domain) can be approximated to arbitrary accuracy by a neural network with a single hidden layer, given enough hidden units.

(A single wide, "flat" hidden layer suffices in principle.)

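
As a rough, self-contained illustration (not from the notes), the sketch below fits a single-hidden-layer network to $\sin(x)$ with plain gradient descent; the hidden width, learning rate, and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: approximate sin(x) on [-pi, pi] with one hidden sigmoid layer.
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)

H = 32                                          # hidden width
W1 = rng.normal(0.0, 1.0, (1, H)); b1 = np.zeros(H)
W2 = rng.normal(0.0, 1.0, (H, 1)); b2 = np.zeros(1)

lr = 0.05
for step in range(5000):
    a = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))    # hidden activations, shape (200, H)
    pred = a @ W2 + b2                          # linear output, shape (200, 1)

    err = pred - y
    grad_out = 2.0 * err / len(x)               # d(MSE)/d(pred)

    # Backpropagate through the two layers.
    gW2 = a.T @ grad_out; gb2 = grad_out.sum(0)
    grad_a = (grad_out @ W2.T) * a * (1.0 - a)  # sigmoid derivative
    gW1 = x.T @ grad_a;   gb1 = grad_a.sum(0)

    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print("final MSE:", float(np.mean((pred - y) ** 2)))
```

With enough hidden units the fit can be made arbitrarily good, which is what the theorem guarantees; it says nothing about how easy that fit is to find.
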
#### Why use deep neural networks?

Motivation from biology:

- Visual cortex (hierarchical processing of visual input).

Motivation from circuit theory:

- Compact representation: deep circuits can represent some functions much more compactly than shallow ones.

Modularity:

- Reusing intermediate features lets the network use data more efficiently.

In practice: deep networks work better for many domains.

- Hard to argue with results.

### Training Neural Networks

- Loss function
- Model
- Optimization

Empirical loss minimization framework:

$$
\arg\min_{\theta} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i; \theta), y_i) + \lambda \Omega(\theta)
$$

Here $\ell$ is the loss function, $f$ is the model, $\theta$ are the parameters, $\Omega$ is the regularization term, and $\lambda$ is the regularization parameter.

Learning is cast as optimization.

- For classification problems, we want to minimize classification error; in practice we minimize a surrogate such as logistic or cross-entropy loss.
- For regression problems, we want to minimize regression error, e.g., L1 or L2 distance from the ground truth.

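
For concreteness, here is a small NumPy sketch of the regularized empirical loss for a classifier, using cross-entropy for $\ell$ and a squared L2 norm for $\Omega$; the function name and arguments are mine, not from the notes.

```python
import numpy as np

def regularized_empirical_loss(probs, y, theta, lam):
    """Average cross-entropy over n examples plus lam * ||theta||^2.

    probs: (n, k) predicted class probabilities f(x_i; theta)
    y:     (n,)   integer class labels
    theta: flat vector of model parameters
    lam:   regularization strength (lambda)
    """
    n = len(y)
    nll = -np.log(probs[np.arange(n), y] + 1e-12)  # per-example cross-entropy
    omega = np.sum(theta ** 2)                      # Omega(theta): L2 regularizer
    return nll.mean() + lam * omega

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y = np.array([0, 1])
theta = np.array([0.5, -0.3, 1.2])
print(regularized_empirical_loss(probs, y, theta, lam=1e-3))  # approx. 0.2917
```
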
#### Stochastic Gradient Descent

Perform updates after seeing each example (a code sketch follows the list):

- Initialize $\theta \equiv \{W^{(1)}, b^{(1)}, \cdots, W^{(L)}, b^{(L)}\}$
- For $t = 1, 2, \cdots, T$:
  - For each training example $(x^{(t)}, y^{(t)})$:
    - Compute the gradient: $\Delta = -\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)}) - \lambda \nabla_\theta \Omega(\theta)$
    - Update: $\theta \gets \theta + \alpha \Delta$

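
A minimal sketch of this loop in NumPy, using a linear model with squared loss and an L2 regularizer as stand-ins for $f$, $\ell$, and $\Omega$ (the data and hyperparameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X w* + noise; model f(x; theta) = W x + b.
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

W = np.zeros(3); b = 0.0           # initialize theta = {W, b}
alpha, lam, T = 0.01, 1e-4, 20     # step size, reg. strength, epochs

for t in range(T):
    for i in rng.permutation(len(X)):        # one example at a time
        x_i, y_i = X[i], y[i]
        err = (W @ x_i + b) - y_i            # f(x_i; theta) - y_i
        # Gradient of 1/2 * err^2 + lam * ||W||^2 w.r.t. W and b.
        gW = err * x_i + 2.0 * lam * W
        gb = err
        # theta <- theta + alpha * Delta, with Delta = -gradient.
        W -= alpha * gW
        b -= alpha * gb

print("learned W:", W, "b:", round(b, 4))    # W should be close to [1, -2, 0.5]
```

For a real network the update step is the same; only the computation of $\nabla_\theta \ell$ changes (backpropagation through the layers).
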
To train a neural network, we need:

- A loss function
- A procedure to compute the gradient
- A regularization term

#### Mini-batch and Momentum

Make updates based on a mini-batch of examples (instead of a single example):

- The gradient is computed on the average regularized loss for that mini-batch.
- This can give a more accurate estimate of the gradient.

Momentum uses an exponentially decaying average of previous gradients:

$$
\overline{\nabla}_\theta^{(t)} = \nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)}) + \beta \overline{\nabla}_\theta^{(t-1)}
$$

This can get past plateaus more quickly, by "gaining momentum".

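
A hedged sketch of both ideas together: mini-batch gradients plus a momentum term accumulating an exponential average of past gradients. The batch size and $\beta$ below are arbitrary choices, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)

X = rng.normal(size=(256, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=256)

W = np.zeros(3)
vel = np.zeros(3)                     # running exponential average of gradients
alpha, lam, beta, batch = 0.05, 1e-4, 0.9, 32

for epoch in range(50):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        idx = order[start:start + batch]
        Xb, yb = X[idx], y[idx]
        # Gradient of the average regularized loss over this mini-batch.
        err = Xb @ W - yb
        grad = Xb.T @ err / len(idx) + 2.0 * lam * W
        # Momentum: bar_grad_t = grad_t + beta * bar_grad_{t-1}.
        vel = grad + beta * vel
        W -= alpha * vel

print("learned W:", W)                # should be close to [1, -2, 0.5]
```
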
### Convolutional Neural Networks

Overview of history:

- CNN
- MLP
- RNN/LSTM/GRU (Gated Recurrent Unit)
- Transformer