updates
This commit is contained in:
@@ -75,15 +75,116 @@ Then we have activation function $\sigma(z)$ (usually non-linear)
##### Activation functions

ReLU (rectified linear unit):

$$
\text{ReLU}(x) = \max(0, x)
$$

- Bounded below by 0.
- Non-vanishing gradient.
- No upper bound.

Sigmoid:

$$
\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
$$

- Always positive.
- Bounded between 0 and 1.
- Strictly increasing.

> [!TIP]
>
> Use ReLU for the hidden layers and sigmoid for the output layer.
>
> For fully connected shallow networks, you may use more sigmoid layers.

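As a concrete illustration of the two functions above, here is a minimal NumPy sketch (the function names and the NumPy dependency are my own choices, not from the lecture notes):

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """ReLU(x) = max(0, x), element-wise; floored at 0, unbounded above."""
    return np.maximum(0.0, x)

def sigmoid(x: np.ndarray) -> np.ndarray:
    """Sigmoid(x) = 1 / (1 + exp(-x)); strictly increasing, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))     # [0. 0. 3.]
print(sigmoid(z))  # approx [0.119 0.5   0.953]
```
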
We can use parallel computing techniques to speed up the computation.

#### Universal Approximation Theorem

Any continuous function (on a compact domain) can be approximated arbitrarily well by a neural network with a single hidden layer.

(That is, a "flat" network: one wide hidden layer.)

#### Why use deep neural networks?

Motivation from biology:

- The visual cortex processes visual input through a hierarchy of layers.

Motivation from circuit theory:

- Compact representation: deep circuits can represent some functions far more compactly than shallow ones.

Modularity:

- Intermediate features are reused, so data is used more efficiently.

In practice: deep networks work better for many domains.

- Hard to argue with results.

### Training Neural Networks

- Loss function
- Model
- Optimization

Empirical loss minimization framework:

$$
\arg\min_{\theta} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i; \theta), y_i)+\lambda \Omega(\theta)
$$

Here $\ell$ is the loss function, $f$ is the model, $\theta$ denotes the parameters, $\Omega$ is the regularization term, and $\lambda$ is the regularization parameter.

Learning is cast as optimization.

- For classification problems, we minimize a classification loss, e.g., logistic or cross-entropy loss.
- For regression problems, we minimize a regression loss, e.g., the L1 or L2 distance from the ground truth.

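A small illustrative sketch of the two kinds of losses mentioned above, assuming NumPy; the helper names are hypothetical, not from the lecture:

```python
import numpy as np

def cross_entropy(p_hat: np.ndarray, y: np.ndarray) -> float:
    """Binary cross-entropy between predicted probabilities p_hat and labels y in {0, 1}."""
    eps = 1e-12  # avoid log(0)
    return float(-np.mean(y * np.log(p_hat + eps) + (1 - y) * np.log(1 - p_hat + eps)))

def l2_loss(y_hat: np.ndarray, y: np.ndarray) -> float:
    """Mean squared (L2) distance between predictions and the ground truth."""
    return float(np.mean((y_hat - y) ** 2))

y = np.array([1.0, 0.0, 1.0])
print(cross_entropy(np.array([0.9, 0.2, 0.7]), y))  # small when predictions match labels
print(l2_loss(np.array([0.8, 0.1, 1.2]), y))        # 0.03
```
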
#### Stochastic Gradient Descent

Perform updates after seeing each example:

- Initialize: $\theta\equiv\{W^{(1)},b^{(1)},\cdots,W^{(L)},b^{(L)}\}$
- For $t=1,2,\cdots,T$:
  - For each training example $(x^{(t)},y^{(t)})$:
    - Compute the gradient: $\Delta = -\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)})-\lambda\nabla_\theta \Omega(\theta)$
    - Update: $\theta \gets \theta + \alpha \Delta$

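A minimal sketch of this update rule. To keep the gradient explicit, the model below is plain logistic regression with L2 regularization; it is an illustrative stand-in for $f$, not the lecture's network:

```python
import numpy as np

def sgd(X, y, alpha=0.1, lam=1e-3, epochs=10):
    """Per-example SGD for L2-regularized logistic regression (illustrative stand-in for f)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):                          # t = 1, ..., T
        for x_t, y_t in zip(X, y):                   # each training example (x^(t), y^(t))
            p = 1.0 / (1.0 + np.exp(-x_t @ theta))   # model output f(x; theta)
            grad_loss = (p - y_t) * x_t              # gradient of the logistic loss
            grad_reg = lam * theta                   # gradient of Omega(theta) = ||theta||^2 / 2
            delta = -(grad_loss + grad_reg)          # Delta from the procedure above
            theta = theta + alpha * delta            # theta <- theta + alpha * Delta
    return theta

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])
print(sgd(X, y))
```
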
Training a neural network, we need:

- Loss function
- Procedure to compute the gradient
- Regularization term

#### Mini-batch and Momentum

Make updates based on a mini-batch of examples (instead of a single example):

- The gradient is computed on the average regularized loss over that mini-batch.
- This can give a more accurate estimate of the gradient.

Momentum uses an exponentially decaying average of the previous gradients:

$$
\overline{\nabla}_\theta^{(t)}=\nabla_\theta \ell(f(x^{(t)}; \theta), y^{(t)})+\beta\overline{\nabla}_\theta^{(t-1)}
$$

This can get past plateaus more quickly, by "gaining momentum".

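A sketch of one mini-batch update with this momentum term, assuming a user-supplied per-example gradient function `grad`; all names here are illustrative:

```python
import numpy as np

def minibatch_momentum_step(theta, batch, grad, velocity, alpha=0.1, beta=0.9):
    """One mini-batch update with an exponential average of previous gradients."""
    X_b, y_b = batch
    # Average gradient of the (regularized) loss over the mini-batch.
    g = np.mean([grad(theta, x, y) for x, y in zip(X_b, y_b)], axis=0)
    velocity = g + beta * velocity    # running (exponential) average of gradients
    theta = theta - alpha * velocity  # descend along the averaged gradient
    return theta, velocity
```
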
### Convolutional Neural Networks

Overview of history:

- CNN
- MLP
- RNN/LSTM/GRU (Gated Recurrent Unit)
- Transformer

content/CSE5313/CSE5313_L7.md (new file, +165 lines)
@@ -0,0 +1,165 @@

# CSE5313 Coding and information theory for data science (Lecture 7)

## Linear codes (continued)

Let $\mathcal{C}$ be an $[n,k,d]_{\mathbb{F}}$ linear code.

There are two equivalent ways to describe a linear code:

1. A generator matrix $G\in \mathbb{F}^{k\times n}_q$ with $k$ rows and $n$ columns, entries taken from $\mathbb{F}_q$: $\mathcal{C}=\{xG \mid x\in \mathbb{F}_q^k\}$.
2. A parity check matrix $H\in \mathbb{F}^{(n-k)\times n}_q$ with $n-k$ rows and $n$ columns, entries taken from $\mathbb{F}_q$: $\mathcal{C}=\{c\in \mathbb{F}_q^n : Hc^T=0\}$.

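As a quick illustration of the two descriptions, a small NumPy sketch over $\mathbb{F}_2$; the matrices below belong to the $[3,2]$ single-parity-bit code and are chosen only as an example:

```python
import numpy as np

# Example over F_2: the [3, 2] single-parity-bit code.
G = np.array([[1, 0, 1],
              [0, 1, 1]])  # generator matrix, k x n
H = np.array([[1, 1, 1]])  # parity check matrix, (n-k) x n

def encode(x):
    """Codeword c = xG over F_2."""
    return (x @ G) % 2

def is_codeword(c):
    """c is in the code iff H c^T = 0 over F_2."""
    return np.all((H @ c) % 2 == 0)

for x in [np.array([0, 1]), np.array([1, 1])]:
    c = encode(x)
    print(x, "->", c, "valid:", is_codeword(c))
```
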
### Dual code

#### Definition of dual code

$C^{\perp}$ is the set of all vectors in $\mathbb{F}^n$ that are orthogonal to every vector in $C$:

$$
C^{\perp}=\{x\in \mathbb{F}^n:x\cdot c=0\text{ for all }c\in C\}
$$

There are two equivalent alternative descriptions:

1. $C^{\perp}=\{x\in \mathbb{F}^n : Gx^T=0\}$ (it suffices to check orthogonality against a basis of $C$, i.e., the rows of $G$)
2. $C^{\perp}=\{xH \mid x\in \mathbb{F}^{n-k}\}$

By the rank-nullity theorem, $\dim(C^{\perp})=n-\dim(C)=n-k$.

> [!WARNING]
>
> $C^{\perp}\cap C=\{0\}$ is not always true.
>
> For example, let $C=\{(0,0),(1,1)\}\subseteq \mathbb{F}_2^2$. Then $C^{\perp}=\{(0,0),(1,1)\}=C$, since $(1,1)\begin{pmatrix} 1\\ 1\end{pmatrix}=1+1=0$ over $\mathbb{F}_2$.

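A brute-force check of the example in the warning, purely illustrative:

```python
import itertools
import numpy as np

C = [np.array([0, 0]), np.array([1, 1])]

# C-perp: all vectors of F_2^2 orthogonal (mod 2) to every codeword in C.
dual = [np.array(x) for x in itertools.product([0, 1], repeat=2)
        if all((np.dot(x, c) % 2) == 0 for c in C)]

print([tuple(v) for v in dual])  # [(0, 0), (1, 1)] -- the dual equals C itself
```
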
### Examples of binary codes

#### Trivial code

Let $\mathbb{F}=\mathbb{F}_2$ and let $C=\mathbb{F}^n$.

The generator matrix is the identity matrix.

The parity check matrix is the zero matrix (there are no parity constraints).

The minimum distance is 1.

#### Parity code

Let $\mathbb{F}=\mathbb{F}_2$ and let $C=\{(c_1,c_2,\cdots,c_{k},\sum_{i=1}^k c_i) : c_i\in \mathbb{F}_2\}$, i.e., append a single parity bit to each message.

The generator matrix is:

$$
G=\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 & 1\\
0 & 1 & 0 & \cdots & 0 & 1\\
0 & 0 & 1 & \cdots & 0 & 1\\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & 0 & \cdots & 1 & 1
\end{pmatrix}
$$

The parity check matrix is:

$$
H=\begin{pmatrix}
1 & 1 & 1 & \cdots & 1 & 1
\end{pmatrix}
$$

The minimum distance is 2.

$C^{\perp}$ is the repetition code.

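A small numeric check of the parity code for $k=3$ (a sketch; the brute-force minimum-distance search is only feasible for tiny codes):

```python
import itertools
import numpy as np

k = 3
G = np.hstack([np.eye(k, dtype=int), np.ones((k, 1), dtype=int)])  # [I_k | 1]
H = np.ones((1, k + 1), dtype=int)                                  # all-ones row

print((G @ H.T) % 2)  # zero matrix: every row of G satisfies the parity check

# Minimum distance = minimum weight of a non-zero codeword (the code is linear).
codewords = [(np.array(x) @ G) % 2 for x in itertools.product([0, 1], repeat=k)]
print(min(int(c.sum()) for c in codewords if c.any()))  # 2
```
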
#### Lemma for minimum distance

The minimum distance of $\mathcal{C}$ is the maximum integer $d$ such that every set of $d-1$ columns of $H$ is linearly independent.

<details>
<summary>Proof</summary>

Assume the minimum distance is $d$. We show that every set of $d-1$ columns of $H$ is linearly independent.

- Fact: in a linear code, the minimum distance equals the minimum weight of a non-zero codeword (since $d_H(x,y)=w_H(x-y)$).

Indeed, if some $d-1$ columns of $H$ were linearly dependent, then $Hc^T=0$ for some non-zero $c\in \mathcal{C}$ with $w_H(c)\le d-1<d$, contradicting the minimum distance.

The reverse direction is similar.

</details>

#### The Hamming code

Let $m\in \mathbb{N}$. Take all $2^m-1$ non-zero vectors in $\mathbb{F}_2^m$ and put them as the columns of a matrix $H$.

Example: for $m=3$,

$$
H=\begin{pmatrix}
1 & 0 & 0 & 1 & 1 & 0 & 1\\
0 & 1 & 0 & 1 & 0 & 1 & 1\\
0 & 0 & 1 & 0 & 1 & 1 & 1
\end{pmatrix}
$$

The minimum distance is 3.

<details>
<summary>Proof for minimum distance</summary>

By the lemma for minimum distance: every 2 columns of $H$ are linearly independent (the columns are distinct and non-zero), while some 3 columns are linearly dependent, so the minimum distance is 3.

</details>

So the code can correct at most 1 error.

The length of the code is $n=2^m-1$, and the dimension is $k=2^m-m-1$.

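A sketch that builds the Hamming parity check matrix for a given $m$ by listing all non-zero vectors of $\mathbb{F}_2^m$ as columns; the column ordering is arbitrary:

```python
import itertools
import numpy as np

def hamming_parity_check(m: int) -> np.ndarray:
    """m x (2^m - 1) parity check matrix: columns are all non-zero vectors of F_2^m."""
    cols = [v for v in itertools.product([0, 1], repeat=m) if any(v)]
    return np.array(cols, dtype=int).T

H = hamming_parity_check(3)
print(H.shape)  # (3, 7): n = 2^m - 1 = 7, and k = n - m = 4 = 2^m - m - 1
```
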
#### Hadamard code

Define the code by an encoding function:

$E: \mathbb{F}_2^m\to \mathbb{F}_2^{2^m}$, $E(x)=(xy_1^T,\cdots,xy_{2^m}^T)$, where $y_1,\cdots,y_{2^m}$ are all the vectors of $\mathbb{F}_2^m$.

The set of codewords is the image of $E$.

This is a linear code, since each coordinate $xy_i^T$ is linear in $x$.

If $x_1,x_2,\ldots,x_m$ is a basis of $\mathbb{F}_2^m$, then $E(x_1),E(x_2),\ldots,E(x_m)$ is a basis of $\mathcal{C}$, so the dimension of $\mathcal{C}$ is $m$.

The minimum distance is $2^{m-1}$.

<details>
<summary>Proof for minimum distance</summary>

Since the code is linear, the minimum distance equals the minimum weight of a non-zero codeword.

For each non-zero $x\in \mathbb{F}_2^m$, there are exactly $2^{m-1}$ vectors $y\in \mathbb{F}_2^m$ with $xy^T=1$, so $w_H(E(x))=2^{m-1}$.

Hence for all non-zero $x$ we have $d(E(x),E(0))=w_H(E(x))=2^{m-1}$.

</details>

The coordinate corresponding to $y_i=0$ is redundant (it is always $0$), so we remove it from $E(x)$ to obtain the punctured Hadamard code.

The length of the punctured code is $2^m-1$.

So $E: \mathbb{F}_2^m\to \mathbb{F}_2^{2^m-1}$ defines a linear code.

Its generator matrix is the parity check matrix of the Hamming code.

The dual of the Hamming code is the (punctured) Hadamard code.

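A sketch of the (punctured) Hadamard encoding, which multiplies $x$ by all non-zero vectors of $\mathbb{F}_2^m$ (the columns of the Hamming parity check matrix); purely illustrative:

```python
import itertools
import numpy as np

def punctured_hadamard_encode(x: np.ndarray) -> np.ndarray:
    """E(x) = (x . y^T for every non-zero y in F_2^m), i.e., x times the Hamming parity check matrix."""
    m = len(x)
    ys = [np.array(v) for v in itertools.product([0, 1], repeat=m) if any(v)]
    return np.array([int(x @ y) % 2 for y in ys])

m = 3
codewords = [punctured_hadamard_encode(np.array(x)) for x in itertools.product([0, 1], repeat=m)]
weights = [int(c.sum()) for c in codewords if c.any()]
print(len(codewords[0]))  # length 2^m - 1 = 7
print(min(weights))       # minimum distance 2^(m-1) = 4
```
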
@@ -9,4 +9,5 @@ export default {
CSE5313_L4: "CSE5313 Coding and information theory for data science (Lecture 4)",
CSE5313_L5: "CSE5313 Coding and information theory for data science (Lecture 5)",
CSE5313_L6: "CSE5313 Coding and information theory for data science (Lecture 6)",
CSE5313_L7: "CSE5313 Coding and information theory for data science (Lecture 7)",
}