CSE559A Lecture 8
Paper review sharing.
Recap: Three ways to think about linear classifiers
- Geometric view: hyperplanes in the feature space
- Algebraic view: linear functions of the features
- Visual view: one template per class
Continuing with linear classification models
Two-layer networks as combinations of templates.
Interpretability is lost as depth increases.
A two-layer network is a universal approximator: it can approximate any continuous function (on a compact domain) to arbitrary accuracy, but the hidden layer may need to be huge.
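A minimal sketch of such a two-layer network in NumPy; the layer sizes and the ReLU non-linearity are illustrative assumptions, not from the lecture:

```python
import numpy as np

def two_layer_forward(x, W1, b1, W2, b2):
    """Forward pass of a two-layer network: linear -> ReLU -> linear."""
    h = np.maximum(0, x @ W1 + b1)  # hidden layer: one "template" per unit
    return h @ W2 + b2              # output layer: one score per class

# Illustrative sizes: 3072-dim inputs (e.g. 32x32x3 images), 100 hidden units, 10 classes
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.01, (3072, 100)); b1 = np.zeros(100)
W2 = rng.normal(0, 0.01, (100, 10));   b2 = np.zeros(10)
scores = two_layer_forward(rng.normal(size=(1, 3072)), W1, b1, W2, b2)  # shape (1, 10)
```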
Supervised learning outline
- Collect training data
- Specify model (select hyper-parameters)
- Train model
Hyper-parameter selection
- Number of layers, number of units per layer, learning rate, etc.
- Type of non-linearity, regularization, etc.
- Type of loss function, etc.
- SGD settings: batch size, number of epochs, etc.
Hyper-parameter search
- Use a validation set to evaluate the performance of the model.
- Never peek at the test set.
- Use the training set for K-fold cross-validation.
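A minimal sketch of K-fold cross-validation for choosing a hyper-parameter, here with scikit-learn; the classifier, the toy data, and the candidate values are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X = np.random.randn(500, 20)             # toy training data
y = (X[:, 0] > 0).astype(int)            # toy labels

best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:          # candidate regularization strengths
    accs = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        clf = LogisticRegression(C=C, max_iter=1000).fit(X[train_idx], y[train_idx])
        accs.append(clf.score(X[val_idx], y[val_idx]))  # validation accuracy on this fold
    if np.mean(accs) > best_acc:
        best_C, best_acc = C, np.mean(accs)
# Only after fixing best_C would the model be evaluated once on the held-out test set.
```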
Backpropagation
Computation graphs
SGD update for each parameter
w_k\gets w_k-\eta\frac{\partial e}{\partial w_k}
e is the error function and \eta is the learning rate.
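A one-line sketch of this update in NumPy (the function and parameter names are illustrative):

```python
def sgd_step(w, grad_w, eta=0.01):
    """One SGD update: w_k <- w_k - eta * de/dw_k, applied elementwise."""
    return w - eta * grad_w
```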
Using the chain rule
Suppose k=1, e=l(f_1(x,w_1),y)
Example: e=(f_1(x,w_1)-y)^2
So h_1=f_1(x,w_1)=w_1^\top x and e=l(h_1,y)=(y-h_1)^2
\frac{\partial e}{\partial w_1}=\frac{\partial e}{\partial h_1}\frac{\partial h_1}{\partial w_1}
\frac{\partial e}{\partial h_1}=2(h_1-y)
\frac{\partial h_1}{\partial w_1}=x
\frac{\partial e}{\partial w_1}=2(h_1-y)x
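A quick sketch verifying this derivation numerically with a central-difference gradient check; all names and the toy values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x, w1, y = rng.normal(size=3), rng.normal(size=3), 0.5

def error(w):
    return (w @ x - y) ** 2          # e = (h_1 - y)^2 with h_1 = w^T x

analytic = 2 * (w1 @ x - y) * x      # de/dw_1 = 2(h_1 - y) x from the chain rule
eps = 1e-6
numeric = np.array([(error(w1 + eps * np.eye(3)[i]) - error(w1 - eps * np.eye(3)[i])) / (2 * eps)
                    for i in range(3)])
assert np.allclose(analytic, numeric, atol=1e-4)
```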