# CSE5519 Advances in Computer Vision (Lecture 3)

## Reminders

- First Example notebook due Sep 18
- Project proposal due Sep 23

## Continued: A brief history (time) of computer vision

### Theme changes

#### 1980s

- “Definitive” detectors
  - Edges: Canny (1986); corners: Harris & Stephens (1988)
- Multiscale image representations
  - Witkin (1983), Burt & Adelson (1984), Koenderink (1984, 1987), etc.
- Markov Random Field models: Geman & Geman (1984)
- Segmentation by energy minimization
  - Kass, Witkin & Terzopoulos (1987), Mumford & Shah (1989)

#### Conferences, journals, books

- Conferences: ICPR (1973), CVPR (1983), ICCV (1987), ECCV (1990)
- Journals: TPAMI (1979), IJCV (1987)
- Books: Duda & Hart (1972), Marr (1982), Ballard & Brown (1982), Horn (1986)

#### 1980s: The dead ends

- Alignment-based recognition
  - Faugeras & Hebert (1983), Grimson & Lozano-Perez (1984), Lowe (1985), Huttenlocher & Ullman (1987), etc.
- Aspect graphs
  - Koenderink & Van Doorn (1979), Plantinga & Dyer (1986), Hebert & Kanade (1985), Ikeuchi & Kanade (1988), Gigus & Malik (1990)
- Invariants: Mundy & Zisserman (1992)

#### 1980s: Meanwhile...

- Neocognitron: Fukushima (1980)
- Back-propagation: Rumelhart, Hinton & Williams (1986)
  - Origins in control theory and optimization: Kelley (1960), Dreyfus (1962), Bryson & Ho (1969), Linnainmaa (1970)
  - Application to neural networks: Werbos (1974)
  - Interesting blog post: “Backpropagating through time, or: How come BP hasn’t been invented earlier?”
- Parallel Distributed Processing: Rumelhart et al. (1987)
- Neural networks for digit recognition: LeCun et al. (1989)

#### 1990s

Multi-view geometry, statistical and appearance-based models for recognition, and the first approaches to (class-specific) object detection.

Geometry (mostly) solved:

- Fundamental matrix: Faugeras (1992)
- Normalized 8-point algorithm: Hartley (1997)
- RANSAC for robust fundamental matrix estimation: Torr & Murray (1997)
- Bundle adjustment: Triggs et al. (1999)
- Hartley & Zisserman book (2000)
- Projective structure from motion: Faugeras & Luong (2001)

Data enters the scene:

- Appearance-based models: Turk & Pentland (1991), Murase & Nayar (1995)
  - PCA for face recognition: Turk & Pentland (1991)
  - Image manifolds
- Keypoint-based image indexing
  - Schmid & Mohr (1996), Lowe (1999)
- Constellation models for object categories
  - Burl, Weber & Perona (1998), Weber, Welling & Perona (2000)
- First sustained use of classifiers and negative data
  - Face detectors: Rowley, Baluja & Kanade (1996), Osuna, Freund & Girosi (1997), Schneiderman & Kanade (1998), Viola & Jones (2001)
  - Convolutional nets: LeCun et al. (1998)
- Graph cut image inference
  - Boykov, Veksler & Zabih (1998)
- Segmentation
  - Normalized cuts: Shi & Malik (2000)
  - Berkeley segmentation dataset: Martin et al. (2001)
- Video processing
  - Layered motion models: Adelson & Wang (1993)
  - Robust optical flow: Black & Anandan (1993)
  - Probabilistic curve tracking: Isard & Blake (1998)

#### 2000s: Keypoints and reconstruction

- Keypoints craze
  - Kadir & Brady (2001), Mikolajczyk & Schmid (2002), Matas et al. (2004), Lowe (2004), Bay et al. (2006), etc.
- 3D reconstruction "in the wild"
  - SfM in the wild
  - Multi-view stereo, stereo on GPUs
- Generic object recognition
  - Constellation models
  - Bags of features
  - Datasets: Caltech-101 -> ImageNet
- Generic object detection
  - PASCAL dataset
  - HOG, deformable part models
- Action and activity recognition: "misc. early efforts"

#### 1990s-2000s: Dead ends (?)
- Probabilistic graphical models
- Perceptual organization

#### 2010s: Deep learning, big data

Relative to earlier approaches, deep learning methods:

- can be more accurate (often much more accurate)
- are faster (often much faster)
- are adaptable to new problems

Deep Convolutional Neural Networks:

- Many layers, some of which are convolutional (usually near the input)
- Early layers "extract features"
- Trained using stochastic gradient descent on very large datasets
- Many possible loss functions (depending on task)

Additional benefits:

- High-quality software frameworks
- "New" network layers
  - Dropout (in effect, simultaneously trains many models)
  - ReLU activation (enables faster training because gradients don’t saturate)
- Bigger datasets
  - reduce overfitting
  - improve robustness
  - enable larger, deeper networks
- Deeper networks eliminate the need for hand-engineered features

(A minimal code sketch of this recipe appears at the end of these notes.)

### Where did we go wrong?

In retrospect, computer vision has had several periods of "spinning its wheels":

- We've always **prioritized methods that could already do interesting things** over potentially more promising methods that could not yet deliver
- We've undervalued simple methods, data, and learning
- When nothing worked, we **distracted ourselves with fancy math**
- On a few occasions, we unaccountably **ignored methods that later proved to be "game changers"** (RANSAC, SIFT)
- We've had some problems with **bandwagon jumping and intellectual snobbery**

But it's not clear whether any of it mattered in the end.
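To make the deep-CNN recipe from the 2010s section concrete, here is a minimal sketch (my own illustration, not code from the lecture): a few convolutional layers near the input that extract features, ReLU activations, dropout before the linear classifier, and one epoch of stochastic gradient descent with a cross-entropy loss. It assumes PyTorch and 28x28 grayscale inputs; the `SmallConvNet` architecture, layer sizes, and hyperparameters are illustrative assumptions, not values from any particular paper.

```python
# Minimal sketch of the deep-CNN recipe: convolutional layers near the input,
# ReLU activations, dropout, and SGD with a task-dependent loss (here cross-entropy).
# Assumes PyTorch and 28x28 grayscale inputs; all sizes are placeholder choices.

import torch
import torch.nn as nn


class SmallConvNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Early convolutional layers "extract features"
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                   # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                   # 14x14 -> 7x7
        )
        # Later layers classify, with dropout for regularization
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),
            nn.Linear(32 * 7 * 7, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


def train_one_epoch(model, loader, lr=0.01):
    """One pass of stochastic gradient descent over a labeled dataset."""
    criterion = nn.CrossEntropyLoss()                 # task-dependent loss
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for images, labels in loader:                     # mini-batches of (image, label)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()                               # back-propagation
        optimizer.step()                              # SGD update
```

The same skeleton scales by stacking more convolutional blocks and training on a larger dataset; the loss function is swapped out depending on the task.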