# CSE5519 Advances in Computer Vision (Lecture 3)

## Reminders

- First Example notebook due Sep 18
- Project proposal due Sep 23

## Continued: A brief history (time) of computer vision

### Theme changes

#### 1980s

- “Definitive” detectors
  - Edges: Canny (1986); corners: Harris & Stephens (1988)
- Multiscale image representations
  - Witkin (1983), Burt & Adelson (1984), Koenderink (1984, 1987), etc.
- Markov Random Field models: Geman & Geman (1984)
- Segmentation by energy minimization
  - Kass, Witkin & Terzopoulos (1987), Mumford & Shah (1989)

#### Conferences, journals, books

- Conferences: ICPR (1973), CVPR (1983), ICCV (1987), ECCV (1990)
- Journals: TPAMI (1979), IJCV (1987)
- Books: Duda & Hart (1972), Marr (1982), Ballard & Brown (1982), Horn (1986)

#### 1980s: The dead ends

- Alignment-based recognition
  - Faugeras & Hebert (1983), Grimson & Lozano-Perez (1984), Lowe (1985), Huttenlocher & Ullman (1987), etc.
- Aspect graphs
  - Koenderink & Van Doorn (1979), Plantinga & Dyer (1986), Hebert & Kanade (1985), Ikeuchi & Kanade (1988), Gigus & Malik (1990)
- Invariants: Mundy & Zisserman (1992)

#### 1980s: Meanwhile...

- Neocognitron: Fukushima (1980)
- Back-propagation: Rumelhart, Hinton & Williams (1986)
  - Origins in control theory and optimization: Kelley (1960), Dreyfus (1962), Bryson & Ho (1969), Linnainmaa (1970)
  - Application to neural networks: Werbos (1974)
  - Interesting blog post: “Backpropagating through time, or: How come BP hasn’t been invented earlier?”
- Parallel Distributed Processing: Rumelhart et al. (1987)
- Neural networks for digit recognition: LeCun et al. (1989)

#### 1990s

Multi-view geometry, statistical and appearance-based models for recognition, and the first approaches to (class-specific) object detection.

Geometry (mostly) solved:

- Fundamental matrix: Faugeras (1992)
- Normalized 8-point algorithm: Hartley (1997)
- RANSAC for robust fundamental matrix estimation: Torr & Murray (1997)
- Bundle adjustment: Triggs et al. (1999)
- Hartley & Zisserman book (2000)
- Projective structure from motion: Faugeras & Luong (2001)

Data enters the scene:

- Appearance-based models: Turk & Pentland (1991), Murase & Nayar (1995)
  - PCA for face recognition: Turk & Pentland (1991)
  - Image manifolds
- Keypoint-based image indexing
  - Schmid & Mohr (1996), Lowe (1999)
- Constellation models for object categories
  - Burl, Weber & Perona (1998), Weber, Welling & Perona (2000)
- First sustained use of classifiers and negative data
  - Face detectors: Rowley, Baluja & Kanade (1996), Osuna, Freund & Girosi (1997), Schneiderman & Kanade (1998), Viola & Jones (2001)
  - Convolutional nets: LeCun et al. (1998)
- Graph cut image inference
  - Boykov, Veksler & Zabih (1998)
- Segmentation
  - Normalized cuts: Shi & Malik (2000)
  - Berkeley segmentation dataset: Martin et al. (2001)
- Video processing
  - Layered motion models: Adelson & Wang (1993)
  - Robust optical flow: Black & Anandan (1993)
  - Probabilistic curve tracking: Isard & Blake (1998)

#### 2000s: Keypoints and reconstruction

- Keypoints craze
  - Kadir & Brady (2001), Mikolajczyk & Schmid (2002), Matas et al. (2004), Lowe (2004), Bay et al. (2006), etc.
- 3D reconstruction "in the wild"
  - SfM in the wild
  - Multi-view stereo, stereo on GPUs
- Generic object recognition
  - Constellation models
  - Bags of features
  - Datasets: Caltech-101 -> ImageNet
- Generic object detection
  - PASCAL dataset
  - HOG, deformable part models
- Action and activity recognition: "misc. early efforts"

#### 1990s-2000s: Dead ends (?)
- Probabilistic graphical models
- Perceptual organization

#### 2010s: Deep learning, big data

Relative to earlier approaches, deep learning methods:

- can be more accurate (often much more accurate)
- are faster (often much faster)
- are adaptable to new problems

Deep Convolutional Neural Networks:

- Many layers, some of which are convolutional (usually near the input)
- Early layers "extract features"
- Trained using stochastic gradient descent on very large datasets
- Many possible loss functions (depending on task)

Additional benefits:

- High-quality software frameworks
- "New" network layers
  - Dropout (in effect, simultaneously trains many models)
  - ReLU activation (enables faster training because gradients don’t saturate)
- Bigger datasets
  - reduce overfitting
  - improve robustness
  - enable larger, deeper networks
- Deeper networks eliminate the need for hand-engineered features

(A minimal code sketch of this recipe appears at the end of these notes.)

### Where did we go wrong?

In retrospect, computer vision has had several periods of "spinning its wheels":

- We've always **prioritized methods that could already do interesting things** over potentially more promising methods that could not yet deliver
- We've undervalued simple methods, data, and learning
- When nothing worked, we **distracted ourselves with fancy math**
- On a few occasions, we unaccountably **ignored methods that later proved to be "game changers"** (RANSAC, SIFT)
- We've had some problems with **bandwagon jumping and intellectual snobbery**

But it's not clear whether any of it mattered in the end.
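To make the deep-CNN recipe from the 2010s section concrete, here is a minimal sketch (my own illustration, not code from the lecture): a few convolutional layers near the input that extract features, ReLU activations, dropout before the linear classifier, and one epoch of stochastic gradient descent with a cross-entropy loss. It assumes PyTorch and 28x28 grayscale inputs; the `SmallConvNet` architecture, layer sizes, and hyperparameters are illustrative assumptions, not values from any particular paper.

```python
# Minimal sketch of the deep-CNN recipe: convolutional layers near the input,
# ReLU activations, dropout, and SGD with a task-dependent loss (here cross-entropy).
# Assumes PyTorch and 28x28 grayscale inputs; all sizes are placeholder choices.

import torch
import torch.nn as nn


class SmallConvNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Early convolutional layers "extract features"
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                   # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                   # 14x14 -> 7x7
        )
        # Later layers classify, with dropout for regularization
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),
            nn.Linear(32 * 7 * 7, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


def train_one_epoch(model, loader, lr=0.01):
    """One pass of stochastic gradient descent over a labeled dataset."""
    criterion = nn.CrossEntropyLoss()                 # task-dependent loss
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for images, labels in loader:                     # mini-batches of (image, label)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()                               # back-propagation
        optimizer.step()                              # SGD update
```

The same skeleton scales by stacking more convolutional blocks and training on a larger dataset; the loss function is swapped out depending on the task.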