Update CSE5519_J3.md

2025-10-23 14:32:35 -05:00
parent dbb201ef37
commit 5d70688a7d
1 changed files with 20 additions and 0 deletions
--- a/content/CSE5519/CSE5519_J3.md
+++ b/content/CSE5519/CSE5519_J3.md
@@ -1,2 +1,22 @@
 # CSE5519 Advances in Computer Vision (Topic J: 2023 - 2024: Open-Vocabulary Object Detection)
 ## Grounding DINO
 [link to the paper](https://arxiv.org/pdf/2303.05499)
 ### Novelty in Grounding DINO
 - Use CLIP to enhance the DETR features
 1. Contrastive loss for text-region alignment
 2. Localization loss: box regression (DINO style)
 3. Auxiliary loss across decoder layers
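
The three loss terms above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes queries are already matched one-to-one with ground-truth boxes (no Hungarian matching), uses plain binary cross-entropy for the text-region alignment term and L1 for box regression, and the names `alignment_logits`, `bce_with_logits`, `l1_box_loss`, `total_loss`, `w_align`, and `w_box` are all hypothetical.

```python
import math

def alignment_logits(queries, tokens):
    """Dot product between each query feature and each text-token feature."""
    return [[sum(q * t for q, t in zip(query, token)) for token in tokens]
            for query in queries]

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over a (num_queries x num_tokens) grid."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for z, y in zip(row_logits, row_targets):
            # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z)
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count

def l1_box_loss(pred_boxes, gt_boxes):
    """Mean absolute error over box coordinates, e.g. (cx, cy, w, h)."""
    diffs = [abs(p - g)
             for pred, gt in zip(pred_boxes, gt_boxes)
             for p, g in zip(pred, gt)]
    return sum(diffs) / len(diffs)

def total_loss(per_layer_outputs, tokens, targets, gt_boxes,
               w_align=1.0, w_box=1.0):
    """Apply the same loss to every decoder layer (auxiliary supervision).

    per_layer_outputs: list of (queries, pred_boxes), one pair per decoder layer.
    w_align and w_box are illustrative weights, not values from the paper.
    """
    loss = 0.0
    for queries, pred_boxes in per_layer_outputs:
        logits = alignment_logits(queries, tokens)
        loss += w_align * bce_with_logits(logits, targets)
        loss += w_box * l1_box_loss(pred_boxes, gt_boxes)
    return loss
```

Summing over `per_layer_outputs` is what makes the loss "auxiliary": intermediate decoder layers receive the same supervision as the final one.
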

 The top 900 bounding boxes are used for inference.
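
Keeping the top-scoring boxes can be sketched as follows. This is an assumption-laden illustration: each box is scored by the sigmoid of its largest text-token logit, and the helper name `select_top_boxes` and its default `k=900` are hypothetical stand-ins for the number quoted above.

```python
import math

def select_top_boxes(logits, boxes, k=900):
    """Return up to k (box, score) pairs, highest score first.

    logits: one list of per-token alignment logits for each box.
    boxes:  one box per row of logits, in the same order.
    """
    # Score each box by its best-matching text token.
    scores = [1.0 / (1.0 + math.exp(-max(row))) for row in logits]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(boxes[i], scores[i]) for i in order[:k]]
```

In practice a confidence threshold would typically be applied to the returned scores before counting or visualizing detections.
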
 > [!TIP]
 >
 > This paper shows a novel approach to open-vocabulary object detection by marrying DINO with CLIP. The authors use a DINO model to get the query features and then use a grounding head to get the bounding box and class label.
 >
 > I'm really interested in the number of bounding boxes used for inference. How fine-grained are these boxes? Could they serve as a good reference for counting problems and logical reasoning, for example detecting a hand with 6 fingers?