From 5438f5344972f09e4dfbc6dae145022f16d33306 Mon Sep 17 00:00:00 2001
From: Zheyuan Wu <60459821+Trance-0@users.noreply.github.com>
Date: Tue, 18 Mar 2025 14:20:14 -0500
Subject: [PATCH] Update CSE559A_L15.md

---
 pages/CSE559A/CSE559A_L15.md | 98 ++++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)

diff --git a/pages/CSE559A/CSE559A_L15.md b/pages/CSE559A/CSE559A_L15.md
index 8599d5f..13ab90c 100644
--- a/pages/CSE559A/CSE559A_L15.md
+++ b/pages/CSE559A/CSE559A_L15.md
@@ -26,6 +26,104 @@ Use backpropagation to get the gradient of the proposal.
 
 Use one CNN to generate region proposals. And use another CNN to classify the proposals.
 
+##### Region proposal network
+Idea: put an "anchor box" of fixed size over each position in the feature map and try to predict whether this box is likely to contain an object.
+Introduce anchor boxes at multiple scales and aspect ratios to handle a wider range of object sizes and shapes.
+
+![Anchor boxes](https://notenextra.trance-0.com/CSE559A/Anchor-boxes.png)
+
+### Single-stage and multi-resolution detection
+
+#### YOLO
+
+You Only Look Once (YOLO) is a single-stage, real-time object detection system.
+
+1. Take conv feature maps at 7x7 resolution
+2. Add two FC layers to predict, at each location, a score for each class and 2 bboxes with confidences
+
+For PASCAL VOC, the output is 7×7×30 (30 = 20 class scores + 2 boxes × (4 coordinates + 1 confidence))
+
+![YOLO](https://notenextra.trance-0.com/CSE559A/YOLO.png)
+
+##### YOLO Network Head
+
+```python
+from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, LeakyReLU
+from tensorflow.keras.regularizers import l2
+
+# 'lrelu' is not a built-in Keras activation string, so use explicit LeakyReLU layers
+model.add(Conv2D(1024, (3, 3), kernel_regularizer=l2(0.0005)))
+model.add(LeakyReLU(0.1))
+model.add(Conv2D(1024, (3, 3), kernel_regularizer=l2(0.0005)))
+model.add(LeakyReLU(0.1))
+# use a flatten layer so the FC layers can reason over the whole image
+model.add(Flatten())
+model.add(Dense(512))
+model.add(Dense(1024))
+model.add(Dropout(0.5))
+# 7x7 grid with 30 outputs per cell: 20 class scores + 2 boxes * (4 coords + 1 confidence)
+model.add(Dense(7 * 7 * 30, activation='sigmoid'))
+# YOLO_Reshape is a custom layer (defined elsewhere) that reshapes the output into the grid
+model.add(YOLO_Reshape(target_shape=(7, 7, 30)))
+model.summary()
+```
+
+#### YOLO results
+
+1. Each grid cell predicts only two boxes and a single class, which limits the number of nearby objects that can be detected
+2. Localization accuracy suffers compared to Fast(er) R-CNN due to coarser features; errors are largest on small boxes
+3. 7x speedup over Faster R-CNN (45-155 FPS vs. 7-18 FPS)
+
+#### YOLOv2
+
+1. Remove the FC layers and do convolutional prediction with anchor boxes instead
+2. Increase the resolution of input images and conv feature maps
+3. Improve accuracy using batch normalization and other tricks
+
+#### SSD
+
+SSD (Single Shot MultiBox Detector) is a single-stage, multi-resolution object detector.
+
+![SSD](https://notenextra.trance-0.com/CSE559A/SSD.png)
+
+1. Predict boxes of different sizes from different conv feature maps
+2. Each level of resolution has its own predictor
+
+##### Feature Pyramid Network
+
+- Improve the predictive power of lower-level feature maps by adding contextual information from higher-level feature maps
+- Predict different sizes of bounding boxes from different levels of the pyramid (but share the parameters of the predictors)
+
+#### RetinaNet
+
+RetinaNet combines a feature pyramid network with the focal loss, which down-weights the standard cross-entropy loss for well-classified examples.
+
+> Cross-entropy loss:
+> $$CE(p_t) = - \log(p_t)$$
+
+The focal loss is defined as:
+
+$$
+FL(p_t) = - (1 - p_t)^{\gamma} \log(p_t)
+$$
+
+We can increase $\gamma$ to further down-weight well-classified examples; for example, with $\gamma = 2$, an example with $p_t = 0.9$ has its loss scaled by $(1 - 0.9)^2 = 0.01$.
+
+#### YOLOv3
+
+Minor refinements over YOLOv2, such as a deeper backbone (Darknet-53) and box prediction at three different scales.
+
+### Alternative approaches
+
+#### CornerNet
+
+Use a pair of corners (the top-left and bottom-right keypoints) to represent each bounding box.
+
+Use an hourglass network to aggregate the information needed to predict the corners; corners that belong to the same object are grouped by their associated embeddings.
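+
+Below is a small, hypothetical sketch of the corner-grouping idea (not CornerNet's actual code): assume the hourglass network has already produced, for each candidate corner, its location, score, and a 1-D associative embedding; the `pair_corners` helper, the `emb_thresh` value, and the toy inputs are invented here for illustration.
+
+```python
+import torch
+
+def pair_corners(tl, br, emb_thresh=0.5):
+    """tl, br: (N, 4) / (M, 4) tensors with rows (x, y, score, embedding)."""
+    boxes = []
+    for x1, y1, s1, e1 in tl.tolist():
+        for x2, y2, s2, e2 in br.tolist():
+            # keep a pair only if the bottom-right corner lies below and to the
+            # right of the top-left corner and the two embeddings are close
+            if x2 > x1 and y2 > y1 and abs(e1 - e2) < emb_thresh:
+                boxes.append([x1, y1, x2, y2, (s1 + s2) / 2])
+    return torch.tensor(boxes)  # (K, 5): box corners + combined score
+
+# toy example: two top-left and two bottom-right corners -> two grouped boxes
+tl = torch.tensor([[10., 10., 0.9, 0.1], [50., 40., 0.8, 0.7]])
+br = torch.tensor([[40., 60., 0.85, 0.15], [90., 90., 0.75, 0.72]])
+print(pair_corners(tl, br))
+```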
+
+#### CenterNet
+
+Use a single center keypoint to represent each object; the box size is regressed from the features at the predicted center.
+
+#### Detection Transformer
+
+Use a transformer encoder-decoder architecture to predict the set of objects directly.
+
+![DETR](https://notenextra.trance-0.com/CSE559A/DETR.png)
+
+DETR uses a conventional CNN backbone to learn a 2D representation of an input image. The model flattens it and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, which we call object queries, and additionally attends to the encoder output. We pass each output embedding of the decoder to a shared feed-forward network (FFN) that predicts either a detection (class and bounding box) or a "no object" class.
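+
+As a rough illustration of that pipeline, here is a minimal DETR-style model in PyTorch. This is a sketch under assumptions: PyTorch and torchvision are available, the hidden size (256), 100 object queries, 6+6 transformer layers, and the ResNet-50 backbone mirror the paper's defaults, and `SimpleDETR` is a name invented here; it is not the official implementation and omits the bipartite-matching loss.
+
+```python
+import torch
+from torch import nn
+from torchvision.models import resnet50
+
+
+class SimpleDETR(nn.Module):
+    def __init__(self, num_classes, hidden_dim=256, num_queries=100, nheads=8,
+                 num_encoder_layers=6, num_decoder_layers=6):
+        super().__init__()
+        # CNN backbone: keep everything up to the final conv feature map
+        backbone = resnet50(weights=None)
+        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
+        self.conv = nn.Conv2d(2048, hidden_dim, 1)  # project features to transformer width
+        self.transformer = nn.Transformer(hidden_dim, nheads,
+                                          num_encoder_layers, num_decoder_layers)
+        # learned object queries (decoder inputs) and a learned 2D positional encoding
+        self.query_embed = nn.Parameter(torch.rand(num_queries, hidden_dim))
+        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
+        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
+        # shared FFN heads: class scores (+1 for "no object") and normalized boxes
+        self.class_head = nn.Linear(hidden_dim, num_classes + 1)
+        self.bbox_head = nn.Linear(hidden_dim, 4)
+
+    def forward(self, x):
+        feat = self.conv(self.backbone(x))                  # (B, hidden_dim, H, W)
+        B, C, H, W = feat.shape
+        # concatenate per-column and per-row embeddings into one encoding per position
+        pos = torch.cat([
+            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
+            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
+        ], dim=-1).flatten(0, 1).unsqueeze(1)                # (H*W, 1, hidden_dim)
+        src = pos + feat.flatten(2).permute(2, 0, 1)         # (H*W, B, hidden_dim)
+        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)  # (num_queries, B, hidden_dim)
+        hs = self.transformer(src, tgt)                      # (num_queries, B, hidden_dim)
+        return self.class_head(hs), self.bbox_head(hs).sigmoid()
+
+
+# each of the 100 object queries yields one (class, box) prediction per image
+model = SimpleDETR(num_classes=20)
+logits, boxes = model(torch.rand(1, 3, 320, 320))
+print(logits.shape, boxes.shape)  # (100, 1, 21), (100, 1, 4)
+```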