updates
This commit is contained in:
@@ -75,15 +75,116 @@ Then we have activation function $\sigma(z)$ (usually non-linear)
##### Activation functions

ReLU (rectified linear unit):

$$
\text{ReLU}(x) = \max(0, x)
$$

- Bounded below by 0.
- Non-vanishing gradient.
- No upper bound.

Sigmoid:

$$
\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
$$

- Always positive.
- Bounded between 0 and 1.
- Strictly increasing.

> [!TIP]
>
> Use ReLU for the hidden layers and sigmoid for the output layer.
>
> For fully connected shallow networks, you may use more sigmoid layers.

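As a concrete illustration of the two functions above, here is a minimal NumPy sketch (the function names and the NumPy dependency are my own choices, not from the lecture notes):

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """ReLU(x) = max(0, x), element-wise; floored at 0, unbounded above."""
    return np.maximum(0.0, x)

def sigmoid(x: np.ndarray) -> np.ndarray:
    """Sigmoid(x) = 1 / (1 + exp(-x)); strictly increasing, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))     # [0. 0. 3.]
print(sigmoid(z))  # approx [0.119 0.5   0.953]
```
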
We can use parallel computing techniques to speed up the computation.

#### Universal Approximation Theorem

Any continuous function (on a compact domain) can be approximated arbitrarily well by a neural network with a single hidden layer.

(That is, a "flat" network: one wide hidden layer.)

#### Why use deep neural networks?

Motivation from biology:

- The visual cortex processes visual input through a hierarchy of layers.

Motivation from circuit theory:

- Compact representation: deep circuits can represent some functions far more compactly than shallow ones.

Modularity:

- Intermediate features are reused, so data is used more efficiently.

In practice: deep networks work better for many domains.

- Hard to argue with results.

### Training Neural Networks

- Loss function
- Model
- Optimization

Empirical loss minimization framework:

$$
\arg\min_{\theta} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i; \theta), y_i)+\lambda \Omega(\theta)
$$

Here $\ell$ is the loss function, $f$ is the model, $\theta$ denotes the parameters, $\Omega$ is the regularization term, and $\lambda$ is the regularization parameter.

Learning is cast as optimization.

- For classification problems, we minimize a classification loss, e.g., logistic or cross-entropy loss.
- For regression problems, we minimize a regression loss, e.g., the L1 or L2 distance from the ground truth.

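A small illustrative sketch of the two kinds of losses mentioned above, assuming NumPy; the helper names are hypothetical, not from the lecture:

```python
import numpy as np

def cross_entropy(p_hat: np.ndarray, y: np.ndarray) -> float:
    """Binary cross-entropy between predicted probabilities p_hat and labels y in {0, 1}."""
    eps = 1e-12  # avoid log(0)
    return float(-np.mean(y * np.log(p_hat + eps) + (1 - y) * np.log(1 - p_hat + eps)))

def l2_loss(y_hat: np.ndarray, y: np.ndarray) -> float:
    """Mean squared (L2) distance between predictions and the ground truth."""
    return float(np.mean((y_hat - y) ** 2))

y = np.array([1.0, 0.0, 1.0])
print(cross_entropy(np.array([0.9, 0.2, 0.7]), y))  # small when predictions match labels
print(l2_loss(np.array([0.8, 0.1, 1.2]), y))        # 0.03
```
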
#### Stochastic Gradient Descent

Perform updates after seeing each example:

- Initialize: $\theta\equiv\{W^{(1)},b^{(1)},\cdots,W^{(L)},b^{(L)}\}$
- For $t=1,2,\cdots,T$:
  - For each training example $(x^{(t)},y^{(t)})$:
    - Compute the gradient: $\Delta = -\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)})-\lambda\nabla_\theta \Omega(\theta)$
    - Update: $\theta \gets \theta + \alpha \Delta$

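A minimal sketch of this update rule. To keep the gradient explicit, the model below is plain logistic regression with L2 regularization; it is an illustrative stand-in for $f$, not the lecture's network:

```python
import numpy as np

def sgd(X, y, alpha=0.1, lam=1e-3, epochs=10):
    """Per-example SGD for L2-regularized logistic regression (illustrative stand-in for f)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):                          # t = 1, ..., T
        for x_t, y_t in zip(X, y):                   # each training example (x^(t), y^(t))
            p = 1.0 / (1.0 + np.exp(-x_t @ theta))   # model output f(x; theta)
            grad_loss = (p - y_t) * x_t              # gradient of the logistic loss
            grad_reg = lam * theta                   # gradient of Omega(theta) = ||theta||^2 / 2
            delta = -(grad_loss + grad_reg)          # Delta from the procedure above
            theta = theta + alpha * delta            # theta <- theta + alpha * Delta
    return theta

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])
print(sgd(X, y))
```
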
Training a neural network, we need:

- Loss function
- Procedure to compute the gradient
- Regularization term

#### Mini-batch and Momentum

Make updates based on a mini-batch of examples (instead of a single example):

- The gradient is computed on the average regularized loss over that mini-batch.
- This can give a more accurate estimate of the gradient.

Momentum uses an exponentially decaying average of the previous gradients:

$$
\overline{\nabla}_\theta^{(t)}=\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)})+\beta\overline{\nabla}_\theta^{(t-1)}
$$

This can get past plateaus more quickly, by "gaining momentum".

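A sketch of one mini-batch update with this momentum term, assuming a user-supplied per-example gradient function `grad`; all names here are illustrative:

```python
import numpy as np

def minibatch_momentum_step(theta, batch, grad, velocity, alpha=0.1, beta=0.9):
    """One mini-batch update with an exponential average of previous gradients."""
    X_b, y_b = batch
    # Average gradient of the (regularized) loss over the mini-batch.
    g = np.mean([grad(theta, x, y) for x, y in zip(X_b, y_b)], axis=0)
    velocity = g + beta * velocity    # running (exponential) average of gradients
    theta = theta - alpha * velocity  # descend along the averaged gradient
    return theta, velocity
```
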
### Convolutional Neural Networks

Overview of history:

- CNN
- MLP
- RNN/LSTM/GRU (Gated Recurrent Unit)
- Transformer

content/CSE5313/CSE5313_L7.md (new file, +165 lines)
@@ -0,0 +1,165 @@

# CSE5313 Coding and information theory for data science (Lecture 7)

## Linear codes (continued)

Let $\mathcal{C}$ be an $[n,k,d]_{\mathbb{F}}$ linear code.

There are two equivalent ways to describe a linear code:

1. A generator matrix $G\in \mathbb{F}^{k\times n}_q$ with $k$ rows and $n$ columns, entries taken from $\mathbb{F}_q$: $\mathcal{C}=\{xG \mid x\in \mathbb{F}_q^k\}$.
2. A parity check matrix $H\in \mathbb{F}^{(n-k)\times n}_q$ with $n-k$ rows and $n$ columns, entries taken from $\mathbb{F}_q$: $\mathcal{C}=\{c\in \mathbb{F}_q^n : Hc^T=0\}$.

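As a quick illustration of the two descriptions, a small NumPy sketch over $\mathbb{F}_2$; the matrices below belong to the $[3,2]$ single-parity-bit code and are chosen only as an example:

```python
import numpy as np

# Example over F_2: the [3, 2] single-parity-bit code.
G = np.array([[1, 0, 1],
              [0, 1, 1]])  # generator matrix, k x n
H = np.array([[1, 1, 1]])  # parity check matrix, (n-k) x n

def encode(x):
    """Codeword c = xG over F_2."""
    return (x @ G) % 2

def is_codeword(c):
    """c is in the code iff H c^T = 0 over F_2."""
    return np.all((H @ c) % 2 == 0)

for x in [np.array([0, 1]), np.array([1, 1])]:
    c = encode(x)
    print(x, "->", c, "valid:", is_codeword(c))
```
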
### Dual code

#### Definition of dual code

$C^{\perp}$ is the set of all vectors in $\mathbb{F}^n$ that are orthogonal to every vector in $C$:

$$
C^{\perp}=\{x\in \mathbb{F}^n:x\cdot c=0\text{ for all }c\in C\}
$$

There are two equivalent alternative descriptions:

1. $C^{\perp}=\{x\in \mathbb{F}^n : Gx^T=0\}$ (it suffices to check orthogonality against a basis of $C$, i.e., the rows of $G$)
2. $C^{\perp}=\{xH \mid x\in \mathbb{F}^{n-k}\}$

By the rank-nullity theorem, $\dim(C^{\perp})=n-\dim(C)=n-k$.

> [!WARNING]
>
> $C^{\perp}\cap C=\{0\}$ is not always true.
>
> For example, let $C=\{(0,0),(1,1)\}\subseteq \mathbb{F}_2^2$. Then $C^{\perp}=\{(0,0),(1,1)\}=C$, since $(1,1)\begin{pmatrix} 1\\ 1\end{pmatrix}=1+1=0$ over $\mathbb{F}_2$.

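A brute-force check of the example in the warning, purely illustrative:

```python
import itertools
import numpy as np

C = [np.array([0, 0]), np.array([1, 1])]

# C-perp: all vectors of F_2^2 orthogonal (mod 2) to every codeword in C.
dual = [np.array(x) for x in itertools.product([0, 1], repeat=2)
        if all((np.dot(x, c) % 2) == 0 for c in C)]

print([tuple(v) for v in dual])  # [(0, 0), (1, 1)] -- the dual equals C itself
```
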
### Examples of binary codes

#### Trivial code

Let $\mathbb{F}=\mathbb{F}_2$ and let $C=\mathbb{F}^n$.

The generator matrix is the identity matrix.

The parity check matrix is the zero matrix (there are no parity constraints).

The minimum distance is 1.

#### Parity code

Let $\mathbb{F}=\mathbb{F}_2$ and let $C=\{(c_1,c_2,\cdots,c_{k},\sum_{i=1}^k c_i) : c_i\in \mathbb{F}_2\}$, i.e., append a single parity bit to each message.

The generator matrix is:

$$
G=\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 & 1\\
0 & 1 & 0 & \cdots & 0 & 1\\
0 & 0 & 1 & \cdots & 0 & 1\\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & 0 & \cdots & 1 & 1
\end{pmatrix}
$$

The parity check matrix is:

$$
H=\begin{pmatrix}
1 & 1 & 1 & \cdots & 1 & 1
\end{pmatrix}
$$

The minimum distance is 2.

$C^{\perp}$ is the repetition code.

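A small numeric check of the parity code for $k=3$ (a sketch; the brute-force minimum-distance search is only feasible for tiny codes):

```python
import itertools
import numpy as np

k = 3
G = np.hstack([np.eye(k, dtype=int), np.ones((k, 1), dtype=int)])  # [I_k | 1]
H = np.ones((1, k + 1), dtype=int)                                  # all-ones row

print((G @ H.T) % 2)  # zero matrix: every row of G satisfies the parity check

# Minimum distance = minimum weight of a non-zero codeword (the code is linear).
codewords = [(np.array(x) @ G) % 2 for x in itertools.product([0, 1], repeat=k)]
print(min(int(c.sum()) for c in codewords if c.any()))  # 2
```
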
#### Lemma for minimum distance

The minimum distance of $\mathcal{C}$ is the maximum integer $d$ such that every set of $d-1$ columns of $H$ is linearly independent.

<details>
<summary>Proof</summary>

Assume the minimum distance is $d$. We show that every set of $d-1$ columns of $H$ is linearly independent.

- Fact: in a linear code, the minimum distance equals the minimum weight of a non-zero codeword (since $d_H(x,y)=w_H(x-y)$).

Indeed, if some $d-1$ columns of $H$ were linearly dependent, then $Hc^T=0$ for some non-zero $c\in \mathcal{C}$ with $w_H(c)\le d-1<d$, contradicting the minimum distance.

The reverse direction is similar.

</details>

#### The Hamming code

Let $m\in \mathbb{N}$. Take all $2^m-1$ non-zero vectors in $\mathbb{F}_2^m$ and put them as the columns of a matrix $H$.

Example: for $m=3$,

$$
H=\begin{pmatrix}
1 & 0 & 0 & 1 & 1 & 0 & 1\\
0 & 1 & 0 & 1 & 0 & 1 & 1\\
0 & 0 & 1 & 0 & 1 & 1 & 1
\end{pmatrix}
$$

The minimum distance is 3.

<details>
<summary>Proof for minimum distance</summary>

By the lemma for minimum distance: every 2 columns of $H$ are linearly independent (the columns are distinct and non-zero), while some 3 columns are linearly dependent, so the minimum distance is 3.

</details>

So the code can correct at most 1 error.

The length of the code is $n=2^m-1$, and the dimension is $k=2^m-m-1$.

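A sketch that builds the Hamming parity check matrix for a given $m$ by listing all non-zero vectors of $\mathbb{F}_2^m$ as columns; the column ordering is arbitrary:

```python
import itertools
import numpy as np

def hamming_parity_check(m: int) -> np.ndarray:
    """m x (2^m - 1) parity check matrix: columns are all non-zero vectors of F_2^m."""
    cols = [v for v in itertools.product([0, 1], repeat=m) if any(v)]
    return np.array(cols, dtype=int).T

H = hamming_parity_check(3)
print(H.shape)  # (3, 7): n = 2^m - 1 = 7, and k = n - m = 4 = 2^m - m - 1
```
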
#### Hadamard code

Define the code by an encoding function:

$E: \mathbb{F}_2^m\to \mathbb{F}_2^{2^m}$, $E(x)=(xy_1^T,\cdots,xy_{2^m}^T)$, where $y_1,\cdots,y_{2^m}$ are all the vectors of $\mathbb{F}_2^m$.

The set of codewords is the image of $E$.

This is a linear code, since each coordinate $xy_i^T$ is linear in $x$.

If $x_1,x_2,\ldots,x_m$ is a basis of $\mathbb{F}_2^m$, then $E(x_1),E(x_2),\ldots,E(x_m)$ is a basis of $\mathcal{C}$, so the dimension of $\mathcal{C}$ is $m$.

The minimum distance is $2^{m-1}$.

<details>
<summary>Proof for minimum distance</summary>

Since the code is linear, the minimum distance equals the minimum weight of a non-zero codeword.

For each non-zero $x\in \mathbb{F}_2^m$, there are exactly $2^{m-1}$ vectors $y\in \mathbb{F}_2^m$ with $xy^T=1$, so $w_H(E(x))=2^{m-1}$.

Hence for all non-zero $x$ we have $d(E(x),E(0))=w_H(E(x))=2^{m-1}$.

</details>

The coordinate corresponding to $y_i=0$ is redundant (it is always $0$), so we remove it from $E(x)$ to obtain the punctured Hadamard code.

The length of the punctured code is $2^m-1$.

So $E: \mathbb{F}_2^m\to \mathbb{F}_2^{2^m-1}$ defines a linear code.

Its generator matrix is the parity check matrix of the Hamming code.

The dual of the Hamming code is the (punctured) Hadamard code.

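A sketch of the (punctured) Hadamard encoding, which multiplies $x$ by all non-zero vectors of $\mathbb{F}_2^m$ (the columns of the Hamming parity check matrix); purely illustrative:

```python
import itertools
import numpy as np

def punctured_hadamard_encode(x: np.ndarray) -> np.ndarray:
    """E(x) = (x . y^T for every non-zero y in F_2^m), i.e., x times the Hamming parity check matrix."""
    m = len(x)
    ys = [np.array(v) for v in itertools.product([0, 1], repeat=m) if any(v)]
    return np.array([int(x @ y) % 2 for y in ys])

m = 3
codewords = [punctured_hadamard_encode(np.array(x)) for x in itertools.product([0, 1], repeat=m)]
weights = [int(c.sum()) for c in codewords if c.any()]
print(len(codewords[0]))  # length 2^m - 1 = 7
print(min(weights))       # minimum distance 2^(m-1) = 4
```
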
@@ -9,4 +9,5 @@ export default {
CSE5313_L4: "CSE5313 Coding and information theory for data science (Lecture 4)",
CSE5313_L5: "CSE5313 Coding and information theory for data science (Lecture 5)",
CSE5313_L6: "CSE5313 Coding and information theory for data science (Lecture 6)",
CSE5313_L7: "CSE5313 Coding and information theory for data science (Lecture 7)",
}