# CSE559A Lecture 15

## Continuing with object detection

### Two strategies for object detection

#### R-CNN: Region proposals + CNN features

![R-CNN](https://notenextra.trance-0.com/CSE559A/R-CNN.png)

#### Fast R-CNN: CNN features + RoI pooling

![Fast R-CNN](https://notenextra.trance-0.com/CSE559A/Fast-R-CNN.png)

Bilinear interpolation is used to sample the CNN features inside each proposal.

#### Region of interest pooling

![RoI pooling](https://notenextra.trance-0.com/CSE559A/RoI-pooling.png)

RoI pooling is differentiable, so gradients can be backpropagated through it to the shared feature map.

### New material

#### Faster R-CNN

Use one CNN to generate region proposals and another to classify them; in practice the two stages share convolutional features.

##### Region proposal network

Idea: put an "anchor box" of fixed size over each position in the feature map and try to predict whether this box is likely to contain an object.

Introduce anchor boxes at multiple scales and aspect ratios to handle a wider range of object sizes and shapes.

![Anchor boxes](https://notenextra.trance-0.com/CSE559A/Anchor-boxes.png)

### Single-stage and multi-resolution detection

#### YOLO

You Only Look Once (YOLO) is a real-time, single-stage object detection system.

1. Take conv feature maps at 7x7 resolution
2. Add two FC layers to predict, at each location, a score for each class and 2 bboxes with confidences

For PASCAL VOC, the output is $7 \times 7 \times 30$, where $30 = 20 + 2 \times (4 + 1)$: 20 class scores plus 2 boxes, each with 4 coordinates and 1 confidence.

![YOLO](https://notenextra.trance-0.com/CSE559A/YOLO.png)

##### YOLO Network Head

```python
# Continues an existing Sequential `model`; YOLO_Reshape is a custom
# layer (defined elsewhere) that reshapes the output to the 7x7x30 grid.
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, LeakyReLU
from tensorflow.keras.regularizers import l2

model.add(Conv2D(1024, (3, 3), padding='same', kernel_regularizer=l2(0.0005)))
model.add(LeakyReLU(alpha=0.1))  # 'lrelu' is not a built-in activation string
model.add(Conv2D(1024, (3, 3), padding='same', kernel_regularizer=l2(0.0005)))
model.add(LeakyReLU(alpha=0.1))
# Flatten so the FC layers can reason globally over the whole image
model.add(Flatten())
model.add(Dense(512))
model.add(Dense(1024))
model.add(Dropout(0.5))
# One vector of class scores + 2 boxes (4 coords + confidence) per grid cell
model.add(Dense(7 * 7 * 30, activation='sigmoid'))
model.add(YOLO_Reshape(target_shape=(7, 7, 30)))
model.summary()
```

#### YOLO results

1. Each grid cell predicts only two boxes and can only have one class, which limits the number of nearby objects that can be detected
2. Localization accuracy suffers compared to Fast(er) R-CNN due to coarser features and errors on small boxes
3. 7x speedup over Faster R-CNN (45-155 FPS vs. 7-18 FPS)

#### YOLOv2

1. Remove the FC layers; do convolutional prediction with anchor boxes instead
2. Increase the resolution of input images and conv feature maps
3. Improve accuracy using batch normalization and other tricks

#### SSD

SSD (Single Shot MultiBox Detector) is a multi-resolution object detector.

![SSD](https://notenextra.trance-0.com/CSE559A/SSD.png)

1. Predict boxes of different sizes from different conv maps
2. Each level of resolution has its own predictor

##### Feature Pyramid Network

- Improve the predictive power of lower-level feature maps by adding contextual information from higher-level feature maps
- Predict different sizes of bounding boxes from different levels of the pyramid (but share the parameters of the predictors)

#### RetinaNet

RetinaNet combines a feature pyramid network with the focal loss, which down-weights the standard cross-entropy loss for well-classified examples.

![RetinaNet](https://notenextra.trance-0.com/CSE559A/RetinaNet.png)

> Cross-entropy loss:
> $$CE(p_t) = - \log(p_t)$$

The focal loss is defined as:

$$
FL(p_t) = - (1 - p_t)^{\gamma} \log(p_t)
$$

Increasing $\gamma$ reduces the loss for well-classified examples: with $\gamma = 2$, an example with $p_t = 0.9$ has its loss scaled by $(1 - 0.9)^2 = 0.01$, a 100x reduction. With $\gamma = 0$, the focal loss reduces to cross-entropy.

#### YOLOv3

Minor refinements over YOLOv2.

### Alternative approaches

#### CornerNet

Represent a bounding box by a pair of keypoints: its top-left and bottom-right corners. An hourglass network predicts heatmaps for the corners, which are then grouped into boxes.

#### CenterNet

Represent each object by its center point, with the box width and height regressed at that location (see the decoding sketch below).
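To make the center-point representation concrete, here is a minimal decoding sketch in PyTorch. It assumes the network outputs a per-class center heatmap (after a sigmoid) and a width/height map at the same resolution; the `decode_centers` name and the offset-free decoding are illustrative simplifications, not the lecture's code.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, wh, k=100):
    """Turn center-point predictions into boxes.

    heatmap: (C, H, W) per-class center scores in [0, 1]
    wh:      (2, H, W) predicted box width/height at each location
    """
    C, H, W = heatmap.shape
    # Keep only local maxima: a 3x3 max-pool acts as a cheap NMS
    pooled = F.max_pool2d(heatmap[None], kernel_size=3, stride=1, padding=1)[0]
    heatmap = heatmap * (pooled == heatmap).float()
    # Top-k peaks over all classes and positions
    scores, idx = heatmap.flatten().topk(k)
    cls = idx // (H * W)
    ys = (idx % (H * W)) // W
    xs = idx % W
    w, h = wh[0, ys, xs], wh[1, ys, xs]
    cx, cy = xs.float(), ys.float()
    boxes = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
    return boxes, scores, cls  # boxes in feature-map coordinates
```

The full model also regresses a sub-pixel center offset and scales boxes back to image coordinates; both are omitted here for brevity.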
#### Detection Transformer

DETR uses a transformer architecture for object detection.

![DETR](https://notenextra.trance-0.com/CSE559A/DETR.png)

DETR uses a conventional CNN backbone to learn a 2D representation of an input image. The model flattens it and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, which we call object queries, and additionally attends to the encoder output. Each output embedding of the decoder is passed to a shared feed-forward network (FFN) that predicts either a detection (class and bounding box) or a "no object" class.
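This description maps almost line-for-line onto code. The sketch below follows the spirit of the minimal PyTorch demo in the DETR paper; `MinimalDETR`, the learned row/column positional encodings, and the hyperparameters (256-d model, 100 queries, 6 encoder/decoder layers) are taken from that demo as assumptions, not from the lecture.

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    def __init__(self, num_classes, d_model=256, num_queries=100):
        super().__init__()
        # Conventional CNN backbone, kept up to the last conv feature map
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=6,
                                          num_decoder_layers=6)
        # Learned positional encodings for the flattened feature map
        self.row_embed = nn.Parameter(torch.rand(50, d_model // 2))
        self.col_embed = nn.Parameter(torch.rand(50, d_model // 2))
        # Learned object queries: one output slot per query
        self.query_embed = nn.Parameter(torch.rand(num_queries, d_model))
        # Shared FFN heads: class scores (incl. "no object") and box
        self.class_head = nn.Linear(d_model, num_classes + 1)
        self.bbox_head = nn.Linear(d_model, 4)

    def forward(self, x):
        feat = self.proj(self.backbone(x))            # (B, d, H, W)
        B, d, H, W = feat.shape
        pos = torch.cat([                             # (H*W, 1, d)
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        src = pos + feat.flatten(2).permute(2, 0, 1)  # (H*W, B, d)
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)
        hs = self.transformer(src, tgt)               # (num_queries, B, d)
        return self.class_head(hs), self.bbox_head(hs).sigmoid()
```

During training, the fixed set of query predictions is matched to ground-truth objects via bipartite (Hungarian) matching; that loss is omitted here.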