From 5d70688a7dbcf434781d74b15a00d98d0a4aa8cc Mon Sep 17 00:00:00 2001
From: Zheyuan Wu <60459821+Trance-0@users.noreply.github.com>
Date: Thu, 23 Oct 2025 14:32:35 -0500
Subject: [PATCH] Update CSE5519_J3.md

---
 content/CSE5519/CSE5519_J3.md | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/content/CSE5519/CSE5519_J3.md b/content/CSE5519/CSE5519_J3.md
index 264d8a6..bb4484e 100644
--- a/content/CSE5519/CSE5519_J3.md
+++ b/content/CSE5519/CSE5519_J3.md
@@ -1,2 +1,22 @@
 # CSE5519 Advances in Computer Vision (Topic J: 2023 - 2024: Open-Vocabulary Object Detection)
 
+## Grounding DINO
+
+[link to the paper](https://arxiv.org/pdf/2303.05499)
+
+### Novelty in Grounding DINO
+
+- Use text features (from a BERT text encoder) to enhance the DETR-style image features
+
+1. Contrastive loss for text-region alignment
+2. Localization loss for box regression (DINO style)
+3. Auxiliary loss across decoder layers
+
+The top 900 bounding boxes are used for inference.
+
+> [!TIP]
+>
+> This paper shows a novel approach to open-vocabulary object detection by marrying DINO with grounded pre-training. The authors use a DINO model to get the query features and then use a grounding head to predict the bounding box and class label.
+>
+> I'm really interested in the number of bounding boxes used for inference. I wonder how fine-grained the bounding boxes are. Could they serve as a good reference for counting problems and logical reasoning, for example detecting a hand with six fingers?
+
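The text-region alignment and the top-k box selection in the notes above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation. The helper names (`alignment_logits`, `select_top_k`), the toy feature vectors, and the choice to score each query by its best-matching text token are all assumptions made for the example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def alignment_logits(query_feats, token_feats):
    """Similarity matrix of shape [num_queries][num_tokens].

    Each entry is the dot product between one decoder query feature
    and one text token feature (hypothetical toy features here).
    """
    return [[dot(q, t) for t in token_feats] for q in query_feats]


def select_top_k(query_feats, token_feats, boxes, k):
    """Keep the k boxes whose queries align best with any text token.

    Assumption: a query's score is the maximum sigmoid similarity
    over all text tokens.
    """
    logits = alignment_logits(query_feats, token_feats)
    scores = [max(sigmoid(x) for x in row) for row in logits]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(boxes[i], scores[i]) for i in order[:k]]


# Toy example: 3 queries, 2 text tokens, boxes as (cx, cy, w, h).
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
tokens = [[2.0, 0.0], [0.0, 0.5]]
boxes = [(0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3), (0.8, 0.8, 0.1, 0.1)]

top = select_top_k(queries, tokens, boxes, k=2)
for box, score in top:
    print(box, round(score, 3))
```

With `k=900` and real decoder outputs, the same ranking step would correspond to keeping the top 900 boxes mentioned in the notes. During training, the similarity matrix would feed the contrastive loss instead of being ranked.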