# CSE559A Lecture 7

## Computer Vision (Artificial Neural Networks for Image Understanding)

An early example of image understanding with neural networks: [Backpropagation Applied to Handwritten Zip Code Recognition (LeCun et al., 1989)].

Central idea: representation change; each layer computes a new feature representation of its input.

Plan for the next few weeks:

1. How do we train such models?
2. What are the building blocks?
3. How should we combine those building blocks?

## How do we train such models?

CV is finally useful for tasks such as:

1. Image classification
2. Image segmentation
3. Object detection

ImageNet Large Scale Visual Recognition Challenge (ILSVRC):

- 1000 classes
- 1.2 million training images
- 100,000 test images

### Deep Learning (just neural networks)

What changed: bigger datasets, larger models, faster computers, and lots of incremental improvements.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Two convolutional layers followed by three fully connected layers
        # (a LeNet-style architecture that expects 32x32 single-channel inputs).
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))  # flatten for the linear layers
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


# Create a synthetic PyTorch dataset and dataloader
# (32x32 inputs to match the architecture above).
dataset = torch.utils.data.TensorDataset(
    torch.randn(1000, 1, 32, 32), torch.randint(10, (1000,))
)
dataloader = torch.utils.data.DataLoader(
    dataset, batch_size=4, shuffle=True, num_workers=2
)

# Training process
net = Net()
optimizer = optim.Adam(net.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Loop over the dataset multiple times
for epoch in range(2):
    for i, data in enumerate(dataloader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

print("Finished Training")
```

### Supervised Learning

- Training: given a dataset, learn a mapping from inputs to outputs.
- Testing: given a new input, predict its output.

Example: linear classification models. Find a linear function that separates the data:

$$
f(x) = w^T x + b
$$

See [Linear classification models](http://cs231n.github.io/linear-classify/) for a simple illustration of a linear classifier.

### Empirical loss minimization framework

Given a training set (assumed to consist of i.i.d. samples), find the model that minimizes the loss function.

Examples of loss functions:

$\ell_1$ loss:

$$
\ell(f(x; w), y) = |f(x; w) - y|
$$

$\ell_2$ loss:

$$
\ell(f(x; w), y) = (f(x; w) - y)^2
$$

### Linear classification models

$$
\hat{L}(w) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i; w), y_i)
$$

In general, it is hard to find the global minimum of $\hat{L}$.

#### Linear regression

However, if we use the $\ell_2$ loss, we can find the global minimum:

$$
\hat{L}(w) = \frac{1}{n} \sum_{i=1}^n (f(x_i; w) - y_i)^2
$$

This is a convex function, so the global minimum can be found in closed form. The gradient is:

$$
\nabla_w \|Xw - Y\|^2 = 2X^T(Xw - Y)
$$

Setting the gradient to zero gives:

$$
w = (X^T X)^{-1} X^T Y
$$

The same result can also be derived from a maximum-likelihood perspective (assuming Gaussian noise).

#### Logistic regression

Sigmoid function:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

The logistic-regression loss has no closed-form minimizer, so we cannot use the normal equations; instead, we minimize it iteratively.
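To make the closed-form linear-regression solution above concrete, here is a minimal sketch on synthetic data; the shapes, noise scale, and tolerance are illustrative assumptions, not values from the lecture.

```python
import torch

# Minimal sketch of the normal-equation solution w = (X^T X)^{-1} X^T Y
# on synthetic data (shapes and noise level are illustrative assumptions).
n, d = 100, 3
X = torch.randn(n, d)
w_true = torch.randn(d)
Y = X @ w_true + 0.1 * torch.randn(n)

# Solve the linear system X^T X w = X^T Y; solving is more numerically
# stable than explicitly forming the matrix inverse.
w_hat = torch.linalg.solve(X.T @ X, X.T @ Y)
print(torch.allclose(w_hat, w_true, atol=0.1))  # recovered weights are close
```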
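Since logistic regression has no closed form, a minimal sketch (shapes, learning rate, and step count are illustrative assumptions) minimizes its loss with the gradient steps formalized in the next section:

```python
import torch

# Minimal logistic-regression sketch on synthetic linearly separable data;
# all shapes, the learning rate, and the step count are illustrative.
X = torch.randn(100, 3)
y = (X @ torch.tensor([1.0, -2.0, 0.5]) > 0).float()  # binary labels in {0, 1}
w = torch.zeros(3, requires_grad=True)

for _ in range(200):
    p = torch.sigmoid(X @ w)  # predicted probabilities
    loss = torch.nn.functional.binary_cross_entropy(p, y)
    loss.backward()
    with torch.no_grad():  # one full-batch gradient step
        w -= 0.1 * w.grad
        w.grad.zero_()
```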
#### Gradient Descent

Full-batch gradient descent:

$$
w \leftarrow w - \eta \nabla_w \hat{L}(w)
$$

Stochastic gradient descent (one randomly chosen sample per step):

$$
w \leftarrow w - \eta \nabla_w \ell(f(x_i; w), y_i)
$$

Mini-batch gradient descent (a random batch $B$ per step):

$$
w \leftarrow w - \eta \frac{1}{|B|} \sum_{i \in B} \nabla_w \ell(f(x_i; w), y_i)
$$

At each step, we update the weights using the average gradient over a mini-batch selected randomly from the training set.

#### Multi-class classification

Use the softmax function to convert the outputs (logits) into a probability distribution.

## Neural Networks

From linear to non-linear:

- Shallow approach: use a feature transformation to make the data linearly separable.
- Deep approach: stack multiple layers of linear models, with non-linearities in between.

Common non-linear activation functions:

- ReLU:
  $$
  \text{ReLU}(x) = \max(0, x)
  $$
- Sigmoid:
  $$
  \text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
  $$
- Tanh:
  $$
  \text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
  $$

### Backpropagation
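As a preview of the topic named above, here is a minimal sketch of backpropagation using PyTorch's autograd on a tiny two-layer network with a ReLU non-linearity; all sizes, the synthetic data, and the learning rate are illustrative assumptions.

```python
import torch

# Tiny two-layer network: sizes and data are illustrative assumptions.
x = torch.randn(8, 3)                       # mini-batch of 8 inputs
y = torch.randint(2, (8,))                  # random class labels
W1 = torch.randn(3, 5, requires_grad=True)
W2 = torch.randn(5, 2, requires_grad=True)

h = torch.relu(x @ W1)                      # hidden layer with ReLU
logits = h @ W2
loss = torch.nn.functional.cross_entropy(logits, y)

loss.backward()                             # backprop: chain rule through the graph
with torch.no_grad():                       # one mini-batch SGD step
    for W in (W1, W2):
        W -= 0.1 * W.grad
        W.grad.zero_()
```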