# CSE559A Lecture 8
Paper review sharing.
## Recap: Three ways to think about linear classifiers
- Geometric view: Hyperplanes in the feature space
- Algebraic view: Linear functions of the features
- Visual view: One template per class (see the sketch after this list)
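A minimal NumPy sketch tying these views together; the feature dimension, class count, and random values below are illustrative assumptions, not anything from the lecture:
```python
import numpy as np

# Linear classifier: scores = W x + b.
# Visual view: each row of W is one template, matched against x by a dot product.
# Geometric view: each row of W (with its bias) defines a hyperplane in feature space.
rng = np.random.default_rng(0)
d, n_classes = 3072, 10                  # e.g. a flattened 32x32x3 image, 10 classes (assumed)
x = rng.standard_normal(d)               # feature vector
W = rng.standard_normal((n_classes, d))  # one template per class
b = rng.standard_normal(n_classes)

scores = W @ x + b                       # algebraic view: a linear function of the features
pred = int(np.argmax(scores))            # predicted class = best-matching template
print(scores.shape, pred)
```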
## Continue on linear classification models
Two-layer networks can be viewed as combinations of templates.
Interpretability is lost as depth increases.
A two-layer network is a **universal approximator** (it can approximate any continuous function to arbitrary accuracy), but the hidden layer may need to be huge.
[Multi-layer networks demo](https://playground.tensorflow.org)
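A rough NumPy sketch of the two-layer forward pass under the template view: the hidden layer computes many (non-linear) template responses and the output layer combines them linearly. The ReLU non-linearity, layer sizes, and initialization are assumptions for illustration, not the lecture's code:
```python
import numpy as np

def two_layer_forward(x, W1, b1, W2, b2):
    """Two-layer network: hidden ReLU layer followed by a linear output layer."""
    h = np.maximum(0.0, W1 @ x + b1)  # hidden units: thresholded template responses
    return W2 @ h + b2                # class scores: linear combination of those responses

rng = np.random.default_rng(0)
d, n_hidden, n_classes = 3072, 256, 10                # sizes assumed for illustration
x = rng.standard_normal(d)
W1 = 0.01 * rng.standard_normal((n_hidden, d))
b1 = np.zeros(n_hidden)
W2 = 0.01 * rng.standard_normal((n_classes, n_hidden))
b2 = np.zeros(n_classes)
print(two_layer_forward(x, W1, b1, W2, b2).shape)     # (10,)
```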
### Supervised learning outline
1. Collect training data
2. Specify model (select hyper-parameters)
3. Train model
#### Hyper-parameter selection
- Number of layers, number of units per layer, learning rate, etc.
- Type of non-linearity, regularization, etc.
- Type of loss function, etc.
- SGD settings: batch size, number of epochs, etc. (an example configuration is sketched after this list)
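A purely hypothetical configuration object collecting the choices above (every name and value is a placeholder, not a recommendation):
```python
# Hypothetical hyper-parameter configuration; all names and values are placeholders.
config = {
    "n_layers": 2,
    "units_per_layer": 256,
    "nonlinearity": "relu",
    "regularization": {"weight_decay": 1e-4},
    "loss": "cross_entropy",
    "learning_rate": 1e-2,
    "batch_size": 128,
    "n_epochs": 20,
}
```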
#### Hyper-parameter search
Use a validation set to evaluate the performance of each candidate model.
Never peek at the test set.
Use the training set for K-fold cross-validation.
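A minimal sketch of K-fold cross-validation over the training set for comparing hyper-parameter settings; `train_and_eval` is a hypothetical callback standing in for "train with one setting, return validation accuracy":
```python
import numpy as np

def k_fold_score(X, y, train_and_eval, k=5, seed=0):
    """Mean validation score over k folds of the training data (the test set is never used)."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_eval(X[train_idx], y[train_idx], X[val_idx], y[val_idx]))
    return float(np.mean(scores))

# Hypothetical usage: keep the hyper-parameter setting with the best mean fold score.
# best_cfg = max(candidate_configs,
#                key=lambda cfg: k_fold_score(X_train, y_train, make_train_and_eval(cfg)))
```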
### Backpropagation
#### Computation graphs
SGD update for each parameter
$$
w_k\gets w_k-\eta\frac{\partial e}{\partial w_k}
$$
$e$ is the error function.
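In code, one SGD step applies this update to every parameter; the parameter names and gradient values below are made up, and the gradients would come from backpropagation as derived next:
```python
import numpy as np

def sgd_step(params, grads, eta=0.01):
    """In-place SGD update: w_k <- w_k - eta * de/dw_k for every parameter."""
    for name, w in params.items():
        w -= eta * grads[name]

# Illustrative usage with made-up values.
params = {"w1": np.array([0.5, -0.3])}
grads = {"w1": np.array([0.2, 0.1])}   # would come from backprop
sgd_step(params, grads, eta=0.1)
print(params["w1"])                    # [ 0.48 -0.31]
```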
#### Using the chain rule
Suppose $k=1$, $e=l(f_1(x,w_1),y)$
Example: $e=(f_1(x,w_1)-y)^2$
So $h_1=f_1(x,w_1)=w_1^\top x$, $e=l(h_1,y)=(h_1-y)^2$
$$
\frac{\partial e}{\partial w_1}=\frac{\partial e}{\partial h_1}\frac{\partial h_1}{\partial w_1}
$$
$$
\frac{\partial e}{\partial h_1}=2(h_1-y)
$$
$$
\frac{\partial h_1}{\partial w_1}=x
$$
$$
\frac{\partial e}{\partial w_1}=2(h_1-y)x
$$
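A quick NumPy sanity check of this result, comparing the analytic gradient $2(h_1-y)x$ with a centered finite-difference estimate (all values are made up):
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)   # input features
y = 1.5                      # target
w1 = rng.standard_normal(5)  # weights

h1 = w1 @ x
analytic = 2.0 * (h1 - y) * x   # de/dw1 from the chain rule above

# Centered finite differences on e(w1) = (w1^T x - y)^2.
eps = 1e-6
numeric = np.zeros_like(w1)
for i in range(len(w1)):
    wp, wm = w1.copy(), w1.copy()
    wp[i] += eps
    wm[i] -= eps
    numeric[i] = ((wp @ x - y) ** 2 - (wm @ x - y) ** 2) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # expected: True
```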
#### General backpropagation algorithm