From 3fbbb89f5e617fffb25630d76407228e3bc58275 Mon Sep 17 00:00:00 2001
From: Trance-0 <60459821+Trance-0@users.noreply.github.com>
Date: Tue, 16 Sep 2025 12:48:24 -0500
Subject: [PATCH] updates

---
 content/CSE510/CSE510_L7.md   | 119 ++++++++++++++++++++++--
 content/CSE5313/CSE5313_L7.md | 165 ++++++++++++++++++++++++++++++++++
 content/CSE5313/_meta.js      |   1 +
 3 files changed, 276 insertions(+), 9 deletions(-)
 create mode 100644 content/CSE5313/CSE5313_L7.md

diff --git a/content/CSE510/CSE510_L7.md b/content/CSE510/CSE510_L7.md
index 5a1cde7..6d678e4 100644
--- a/content/CSE510/CSE510_L7.md
+++ b/content/CSE510/CSE510_L7.md
@@ -75,15 +75,116 @@ Then we have activation function $\sigma(z)$ (usually non-linear)
 
 ##### Activation functions
 
-Always positive.
+ReLU (rectified linear unit):
+
+$$
+\text{ReLU}(x) = \max(0, x)
+$$
+
+- Bounded below by 0.
+- Non-vanishing gradient for positive inputs.
+- No upper bound.
+
+Sigmoid:
+
+$$
+\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
+$$
+
+- Always positive.
+- Bounded between 0 and 1.
+- Strictly increasing.
+
+> [!TIP]
+>
+> Use ReLU for the hidden layers and sigmoid for the output layer.
+>
+> For fully connected shallow networks, you may use more sigmoid layers.
+
+We can use parallel computing techniques to speed up the computation.
+
+#### Universal Approximation Theorem
+
+Any continuous function (on a compact domain) can be approximated arbitrarily well by a neural network with a single, sufficiently wide hidden layer.
+
+(a "flat" network: one hidden layer)
+
+#### Why use deep neural networks?
+
+Motivation from biology
+
+- Visual cortex
+
+Motivation from circuit theory
+
+- Compact representation
+
+Modularity
+
+- Uses data more efficiently
+
+In practice: works better for many domains
+
+- Hard to argue with results.
+
+### Training Neural Networks
+
+- Loss function
+- Model
+- Optimization
+
+Empirical loss minimization framework:
+
+$$
+\arg\min_{\theta} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i; \theta), y_i)+\lambda \Omega(\theta)
+$$
+
+$\ell$ is the loss function, $f$ is the model, $\theta$ is the parameters, $\Omega$ is the regularization term, and $\lambda$ is the regularization parameter.
+
+Learning is cast as optimization.
+
+- For classification problems, we minimize classification error, e.g., logistic or cross-entropy loss.
+- For regression problems, we minimize regression error, e.g., L1 or L2 distance from the ground truth.
+
+#### Stochastic Gradient Descent
+
+Perform updates after seeing each example:
+
+- Initialize: $\theta\equiv\{W^{(1)},b^{(1)},\cdots,W^{(L)},b^{(L)}\}$
+- For $t=1,2,\cdots,T$:
+  - For each training example $(x^{(t)},y^{(t)})$:
+    - Compute gradient: $\Delta = -\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)})-\lambda\nabla_\theta \Omega(\theta)$
+    - Update: $\theta \gets \theta + \alpha \Delta$
+
+To train a neural network, we need:
+
+- A loss function
+- A procedure to compute the gradient
+- A regularization term
+
+#### Mini-batch and Momentum
+
+Make updates based on a mini-batch of examples (instead of a single example):
+
+- The gradient is computed on the average regularized loss for that mini-batch.
+- This can give a more accurate estimate of the gradient.
+
+Momentum uses an exponential average of previous gradients:
+
+$$
+\overline{\nabla}_\theta^{(t)}=\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)})+\beta\overline{\nabla}_\theta^{(t-1)}
+$$
+
+This can get past plateaus more quickly, by "gaining momentum".
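+
+A minimal NumPy sketch of mini-batch SGD with momentum (for a single sigmoid unit with cross-entropy loss and L2 regularization); the hyperparameter values and the helper name `sgd_momentum` are illustrative, not from the lecture:
+
+```python
+import numpy as np
+
+def sgd_momentum(X, y, epochs=10, batch_size=32, alpha=0.1, beta=0.9, lam=1e-4):
+    """Mini-batch SGD with an exponential average of past gradients (momentum)."""
+    n, d = X.shape
+    W, b = np.zeros(d), 0.0
+    vW, vb = np.zeros(d), 0.0                             # running (momentum) gradients
+    for _ in range(epochs):
+        for i in range(0, n, batch_size):
+            xb, yb = X[i:i + batch_size], y[i:i + batch_size]
+            p = 1.0 / (1.0 + np.exp(-(xb @ W + b)))       # sigmoid output
+            gW = xb.T @ (p - yb) / len(yb) + lam * W      # loss gradient + L2 term
+            gb = np.mean(p - yb)
+            vW, vb = gW + beta * vW, gb + beta * vb       # momentum: average of gradients
+            W, b = W - alpha * vW, b - alpha * vb         # theta <- theta + alpha * Delta
+    return W, b
+```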
+
+### Convolutional Neural Networks
+
+Overview of history:
+
+- CNN
+- MLP
+- RNN/LSTM/GRU (Gated Recurrent Unit)
+- Transformer
-- ReLU (rectified linear unit):
-
-  $$
-  \text{ReLU}(x) = \max(0, x)
-  $$
-- Sigmoid:
-
-  $$
-  \text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
-  $$
diff --git a/content/CSE5313/CSE5313_L7.md b/content/CSE5313/CSE5313_L7.md
new file mode 100644
index 0000000..37b54e8
--- /dev/null
+++ b/content/CSE5313/CSE5313_L7.md
@@ -0,0 +1,165 @@
+# CSE5313 Coding and information theory for data science (Lecture 7)
+
+## Linear codes, continued
+
+Let $\mathcal{C}$ be an $[n,k,d]_{\mathbb{F}_q}$ linear code.
+
+There are two equivalent ways to describe a linear code:
+
+1. A generator matrix $G\in \mathbb{F}^{k\times n}_q$ with $k$ rows and $n$ columns, entries taken from $\mathbb{F}_q$: $\mathcal{C}=\{xG \mid x\in \mathbb{F}_q^k\}$.
+2. A parity check matrix $H\in \mathbb{F}^{(n-k)\times n}_q$ with $n-k$ rows and $n$ columns, entries taken from $\mathbb{F}_q$: $\mathcal{C}=\{c\in \mathbb{F}_q^n : Hc^T=0\}$.
+
+### Dual code
+
+#### Definition of dual code
+
+$C^{\perp}$ is the set of all vectors in $\mathbb{F}^n$ that are orthogonal to every vector in $C$:
+
+$$
+C^{\perp}=\{x\in \mathbb{F}^n : x\cdot c=0\text{ for all }c\in C\}
+$$
+
+Equivalently:
+
+1. $C^{\perp}=\{x\in \mathbb{F}^n : Gx^T=0\}$ (it suffices to check orthogonality against a basis of $C$, i.e., the rows of $G$).
+2. $C^{\perp}=\{xH \mid x\in \mathbb{F}^{n-k}\}$.
+
+By the rank-nullity theorem, $\dim(C^{\perp})=n-\dim(C)=n-k$.
+
+> [!WARNING]
+>
+> $C^{\perp}\cap C=\{0\}$ is not always true.
+>
+> Let $C=\{(0,0),(1,1)\}\subseteq \mathbb{F}_2^2$. Then $C^{\perp}=\{(0,0),(1,1)\}=C$ since $(1,1)\begin{pmatrix} 1\\ 1\end{pmatrix}=0$ over $\mathbb{F}_2$.
+
+### Examples of binary codes
+
+#### Trivial code
+
+Let $\mathbb{F}=\mathbb{F}_2$.
+
+Let $C=\mathbb{F}^n$.
+
+The generator matrix is the identity matrix.
+
+The parity check matrix is the zero matrix (there are no parity constraints).
+
+The minimum distance is 1.
+
+#### Parity code
+
+Let $\mathbb{F}=\mathbb{F}_2$.
+
+Let $C=\{(c_1,c_2,\cdots,c_{k},\sum_{i=1}^k c_i) : c_i\in\mathbb{F}_2\}$.
+
+The generator matrix is:
+
+$$
+G=\begin{pmatrix}
+1 & 0 & 0 & \cdots & 0 & 1\\
+0 & 1 & 0 & \cdots & 0 & 1\\
+0 & 0 & 1 & \cdots & 0 & 1\\
+\vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
+0 & 0 & 0 & \cdots & 1 & 1
+\end{pmatrix}
+$$
+
+The parity check matrix is:
+
+$$
+H=\begin{pmatrix}
+1 & 1 & 1 & \cdots & 1 & 1
+\end{pmatrix}
+$$
+
+The minimum distance is 2.
+
+$C^{\perp}$ is the repetition code.
+
+#### Lemma for minimum distance
+
+The minimum distance of $\mathcal{C}$ is the maximum integer $d$ such that every set of $d-1$ columns of $H$ is linearly independent.
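+
+Before the proof, a quick computational illustration using the parity code above (a minimal sketch, assuming NumPy; the brute-force check is for illustration only):
+
+```python
+import numpy as np
+from itertools import product
+
+k = 3
+G = np.hstack([np.eye(k, dtype=int), np.ones((k, 1), dtype=int)])  # [k+1, k] parity code
+H = np.ones((1, k + 1), dtype=int)                                  # single parity check
+
+assert np.all(G @ H.T % 2 == 0)        # every row of G is orthogonal to the parity check
+
+# Minimum distance by brute force = minimum weight of a nonzero codeword.
+codewords = [np.array(x, dtype=int) @ G % 2 for x in product([0, 1], repeat=k)]
+print(min(int(c.sum()) for c in codewords if c.any()))
+# 2: every single column of H is nonzero (hence independent), but any two columns are dependent.
+```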
+
+<details>
+<summary>Proof</summary>
+
+Assume the minimum distance is $d$. We show that every set of $d-1$ columns of $H$ is linearly independent.
+
+- Fact: in a linear code, the minimum distance equals the minimum weight of a nonzero codeword (since $d_H(x,y)=w_H(x-y)$).
+
+Indeed, if some $d-1$ columns of $H$ were linearly dependent, then $Hc^T=0$ for some nonzero $c$ supported on those columns, i.e., a codeword with $w_H(c)\le d-1<d$, a contradiction. Conversely, a codeword of weight exactly $d$ exists, and its support gives $d$ linearly dependent columns of $H$, so $d$ is the maximum such integer.
+
+</details>
+
+#### The Hamming code
+
+Let $m\in \mathbb{N}$.
+
+Take all $2^m-1$ non-zero vectors in $\mathbb{F}_2^m$.
+
+Put them as the columns of a matrix $H$.
+
+Example: for $m=3$,
+
+$$
+H=\begin{pmatrix}
+1 & 0 & 0 & 1 & 1 & 0 & 1\\
+0 & 1 & 0 & 1 & 0 & 1 & 1\\
+0 & 0 & 1 & 0 & 1 & 1 & 1
+\end{pmatrix}
+$$
+
+The minimum distance is 3.
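+
+The "all nonzero columns" construction is easy to generate for any $m$. Below is a small sketch (assuming NumPy; the helper name `hamming_parity_check` is made up) that also confirms the distance claim for $m=3$ by brute force, ahead of the proof below:
+
+```python
+import numpy as np
+from itertools import product
+
+def hamming_parity_check(m):
+    """Parity check matrix whose columns are all 2^m - 1 nonzero vectors of F_2^m."""
+    cols = [v for v in product([0, 1], repeat=m) if any(v)]
+    return np.array(cols, dtype=int).T   # shape (m, 2**m - 1); column order may differ from the example above
+
+H = hamming_parity_check(3)
+n = H.shape[1]                           # 7
+# Minimum distance = smallest weight of a nonzero c with H c^T = 0 (brute force over F_2^7).
+d = min(sum(c) for c in product([0, 1], repeat=n)
+        if any(c) and not (H @ np.array(c) % 2).any())
+print(n, d)                              # 7 3
+```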
+
+<details>
+<summary>Proof for minimum distance</summary>
+
+Use the lemma for the minimum distance. Every 2 columns of $H$ are linearly independent, since the columns are nonzero and pairwise distinct (over $\mathbb{F}_2$, two columns are dependent only if one is zero or they are equal). However, there exist 3 linearly dependent columns, e.g., $(1,0,0)^T+(0,1,0)^T+(1,1,0)^T=0$. Hence the minimum distance is 3.
+
+</details>
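+
+Because the syndrome of a single-bit error at position $i$ equals the $i$-th column of $H$, a single error can be located and flipped. A minimal decoding sketch (assuming NumPy; the function name is illustrative):
+
+```python
+import numpy as np
+from itertools import product
+
+m = 3
+cols = [v for v in product([0, 1], repeat=m) if any(v)]
+H = np.array(cols, dtype=int).T                        # parity check matrix, shape (3, 7)
+
+def correct_single_error(r):
+    """If the syndrome is nonzero, flip the bit whose column of H matches it."""
+    s = tuple(int(b) for b in H @ r % 2)
+    if any(s):
+        r = r.copy()
+        r[cols.index(s)] ^= 1
+    return r
+
+# Example: corrupt one bit of a codeword (the zero codeword is a codeword) and decode.
+c = np.zeros(7, dtype=int)
+r = c.copy(); r[4] ^= 1
+assert np.array_equal(correct_single_error(r), c)
+```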
+
+So the maximum number of errors that can be corrected is $\lfloor (3-1)/2 \rfloor = 1$.
+
+The length of the code is $n = 2^m-1$.
+
+The dimension is $k = n - m = 2^m-m-1$.
+
+#### Hadamard code
+
+Define the code by the encoding function $E: \mathbb{F}_2^m\to \mathbb{F}_2^{2^m}$, $E(x)=(xy_1^T,\cdots,xy_{2^m}^T)$, where $y_1,\cdots,y_{2^m}$ is an enumeration of all vectors in $\mathbb{F}_2^m$.
+
+The code is the image of $E$.
+
+This is a linear code, since each coordinate $xy_i^T$ is a linear function of $x$, so $E(x+x')=E(x)+E(x')$.
+
+If $x_1,x_2,\ldots,x_m$ is a basis of $\mathbb{F}_2^m$, then $E(x_1),E(x_2),\ldots,E(x_m)$ is a basis of $\mathcal{C}$ (as $E$ is linear and injective).
+
+So the dimension of $\mathcal{C}$ is $m$.
+
+The minimum distance is $2^{m-1}$.
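+
+For example, for $m=2$, writing vectors as bit strings and ordering the $y_i$ as $00,01,10,11$, the four codewords are
+
+$$
+E(00)=(0,0,0,0),\quad E(01)=(0,1,0,1),\quad E(10)=(0,0,1,1),\quad E(11)=(0,1,1,0),
+$$
+
+and every nonzero codeword has weight $2^{m-1}=2$, matching the claimed minimum distance.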
+
+<details>
+<summary>Proof for minimum distance</summary>
+
+Since the code is linear, the minimum distance equals the minimum weight of a nonzero codeword.
+
+For each nonzero $x\in \mathbb{F}_2^m$, the map $y\mapsto xy^T$ is a nonzero linear functional, so exactly half of the $2^m$ vectors $y\in\mathbb{F}_2^m$ satisfy $xy^T=1$. Hence $E(x)$ has exactly $2^{m-1}$ coordinates equal to 1.
+
+Therefore, for all nonzero $x$ we have $w_H(E(x))=d_H(E(x),E(0))=2^{m-1}$, and the minimum distance is $2^{m-1}$.
+
+</details>
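+
+A quick exhaustive check of this weight claim for $m=3$ (a sketch, assuming NumPy):
+
+```python
+import numpy as np
+from itertools import product
+
+m = 3
+ys = np.array(list(product([0, 1], repeat=m)), dtype=int)   # all 2^m vectors y_i
+encode = lambda x: ys @ np.array(x) % 2                      # E(x) = (x y_i^T for every y_i)
+weights = {int(encode(x).sum()) for x in product([0, 1], repeat=m) if any(x)}
+print(weights)                    # {4}: every nonzero codeword has weight 2^(m-1)
+```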
+
+The coordinate corresponding to $y_i=0$ is always zero (redundant); removing it from every codeword gives the punctured Hadamard code.
+
+The length of the punctured code is $2^m-1$.
+
+So $E: \mathbb{F}_2^m\to \mathbb{F}_2^{2^m-1}$ defines a $[2^m-1, m, 2^{m-1}]$ linear code.
+
+The generator matrix of the punctured Hadamard code is the parity check matrix of the Hamming code.
+
+The dual of the Hamming code is the (punctured) Hadamard code.
+
diff --git a/content/CSE5313/_meta.js b/content/CSE5313/_meta.js
index 132f5b1..2068cc7 100644
--- a/content/CSE5313/_meta.js
+++ b/content/CSE5313/_meta.js
@@ -9,4 +9,5 @@ export default {
   CSE5313_L4: "CSE5313 Coding and information theory for data science (Lecture 4)",
   CSE5313_L5: "CSE5313 Coding and information theory for data science (Lecture 5)",
   CSE5313_L6: "CSE5313 Coding and information theory for data science (Lecture 6)",
+  CSE5313_L7: "CSE5313 Coding and information theory for data science (Lecture 7)",
 }
\ No newline at end of file