# CSE5519 Advances in Computer Vision (Lecture 3)
## Reminders
- First Example notebook due Sep 18
- Project proposal due Sep 23
## Continued: A brief history (time) of computer vision
### Theme changes
#### 1980s
- “Definitive” detectors
  - Edges: Canny (1986); corners: Harris & Stephens (1988)
- Multiscale image representations
  - Witkin (1983), Burt & Adelson (1984), Koenderink (1984, 1987), etc.
- Markov Random Field models: Geman & Geman (1984)
- Segmentation by energy minimization
  - Kass, Witkin & Terzopoulos (1987), Mumford & Shah (1989)
#### Conferences, journals, books
- Conferences: ICPR (1973), CVPR (1983), ICCV (1987), ECCV (1990)
- Journals: TPAMI (1979), IJCV (1987)
- Books: Duda & Hart (1972), Marr (1982), Ballard & Brown (1982), Horn (1986)
#### 1980s: The dead ends
- Alignment-based recognition
  - Faugeras & Hebert (1983), Grimson & Lozano-Perez (1984), Lowe (1985), Huttenlocher & Ullman (1987), etc.
- Aspect graphs
  - Koenderink & Van Doorn (1979), Plantinga & Dyer (1986), Hebert & Kanade (1985), Ikeuchi & Kanade (1988), Gigus & Malik (1990)
- Invariants: Mundy & Zisserman (1992)
#### 1980s: Meanwhile...
- Neocognitron: Fukushima (1980)
- Back-propagation: Rumelhart, Hinton & Williams (1986)
  - Origins in control theory and optimization: Kelley (1960), Dreyfus (1962), Bryson & Ho (1969), Linnainmaa (1970)
  - Application to neural networks: Werbos (1974)
  - Interesting blog post: "Backpropagating through time, or: How come BP hasn't been invented earlier?"
- Parallel Distributed Processing: Rumelhart et al. (1987)
- Neural networks for digit recognition: LeCun et al. (1989)
#### 1990s
Multi-view geometry, statistical and appearance-based models for recognition, and the first approaches to (class-specific) object detection.

Geometry (mostly) solved (see the short refresher after the list below)
- Fundamental matrix: Faugeras (1992)
- Normalized 8-point algorithm: Hartley (1997)
- RANSAC for robust fundamental matrix estimation: Torr & Murray (1997)
- Bundle adjustment: Triggs et al. (1999)
- Hartley & Zisserman book (2000)
- Projective structure from motion: Faugeras and Luong (2001)
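
As a quick refresher on the object these papers estimate (standard material from the Hartley & Zisserman book, added here for context rather than taken from the lecture): the fundamental matrix relates corresponding points x ↔ x′ between two uncalibrated views through the epipolar constraint

$$
\mathbf{x}'^{\top} F \, \mathbf{x} = 0, \qquad F \in \mathbb{R}^{3\times 3}, \quad \operatorname{rank}(F) = 2 .
$$

Because the constraint is linear in the entries of F, eight or more correspondences give a direct linear estimate (the normalized 8-point algorithm); RANSAC makes the estimate robust to mismatched points by fitting F to many random minimal subsets and keeping the solution with the most inliers, and bundle adjustment then refines cameras and structure jointly.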

Data enters the scene
- Appearance-based models: Turk & Pentland (1991), Murase & Nayar (1995)
  - PCA for face recognition: Turk & Pentland (1991)
  - Image manifolds
- Keypoint-based image indexing
  - Schmid & Mohr (1996), Lowe (1999)
- Constellation models for object categories
  - Burl, Weber & Perona (1998), Weber, Welling & Perona (2000)

First sustained use of classifiers and negative data
- Face detectors: Rowley, Baluja & Kanade (1996), Osuna, Freund & Girosi (1997), Schneiderman & Kanade (1998), Viola & Jones (2001)
- Convolutional nets: LeCun et al. (1998)

Graph cut image inference
- Boykov, Veksler & Zabih (1998)

Segmentation
- Normalized cuts: Shi & Malik (2000)
- Berkeley segmentation dataset: Martin et al. (2001)

Video processing
- Layered motion models: Adelson & Wang (1993)
- Robust optical flow: Black & Anandan (1993)
- Probabilistic curve tracking: Isard & Blake (1998)
#### 2000s: Keypoints and reconstruction
Keypoints craze
- Kadir & Brady (2001), Mikolajczyk & Schmid (2002), Matas et al. (2004), Lowe (2004), Bay et al. (2006), etc.

3D reconstruction "in the wild"
- SfM in the wild
- Multi-view stereo, stereo on GPUs

Generic object recognition
- Constellation models
- Bags of features
- Datasets: Caltech-101 -> ImageNet

Generic object detection
- PASCAL dataset
- HOG, Deformable part models

Action and activity recognition
- "Misc. early efforts"
#### 1990s-2000s: Dead ends (?)
- Probabilistic graphical models
- Perceptual organization
#### 2010s: Deep learning, big data
Compared to earlier approaches, deep learning methods can be more accurate (often much more accurate), faster (often much faster), and adaptable to new problems.

Deep Convolutional Neural Networks (a minimal training sketch follows the lists below)
- Many layers, some of which are convolutional (usually near the input)
- Early layers "extract features"
- Trained using stochastic gradient descent on very large datasets
- Many possible loss functions (depending on task)

Additional benefits:
- High-quality software frameworks
- "New" network layers
  - Dropout (effectively trains many models simultaneously)
  - ReLU activation (faster training because gradients do not saturate for positive activations)
- Bigger datasets
  - reduce overfitting
  - improve robustness
  - enable larger, deeper networks
- Deeper networks eliminate the need for hand-engineered features
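
To make the recipe above concrete, here is a minimal sketch, assuming PyTorch: convolutional layers near the input act as the learned feature extractor, ReLU activations and dropout appear as listed above, a cross-entropy loss covers the classification case, and training is one step of stochastic gradient descent. The layer sizes, hyperparameters, and fake mini-batch are illustrative assumptions, not part of the lecture.

```python
# A minimal sketch, not the lecture's code: layer sizes, hyperparameters, and
# the fake mini-batch below are illustrative assumptions.
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Early layers are convolutional and act as the learned "feature extractor".
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),                      # non-saturating activation
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Later layers combine the features for the task (here: classification).
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),              # dropout regularization
            nn.Linear(32 * 8 * 8, num_classes),  # assumes 32x32 input images
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallConvNet()
criterion = nn.CrossEntropyLoss()           # the loss depends on the task
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# One stochastic gradient descent step on a fake mini-batch of 32x32 RGB images.
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)     # forward pass + loss
loss.backward()                             # back-propagation of gradients
optimizer.step()                            # parameter update
```

Swapping the loss (e.g., for detection or segmentation) changes the task while the convolutional feature extractor stays the same, which is the sense in which deeper networks replace hand-engineered features.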
### Where did we go wrong?
In retrospect, computer vision has had several periods of "spinning its wheels"
- We've always **prioritized methods that could already do interesting things** over potentially more promising methods that could not yet deliver
- We've undervalued simple methods, data, and learning
- When nothing worked, we **distracted ourselves with fancy math**
- On a few occasions, we unaccountably **ignored methods that later proved to be "game changers"** (RANSAC, SIFT)
- We've had some problems with **bandwagon jumping and intellectual snobbery**
But it's not clear whether any of it mattered in the end.