Update CSE5519_J3.md
@@ -1,2 +1,22 @@
# CSE5519 Advances in Computer Vision (Topic J: 2023 - 2024: Open-Vocabulary Object Detection)
## Grounding DINO
[link to the paper](https://arxiv.org/pdf/2303.05499)
### Novelty in Grounding DINO
- Enhance the features of a DETR-style detector (DINO) with text features; the paper uses a BERT text encoder rather than CLIP
1. Contrastive loss for text-region alignment
2. Localization loss for box regression (L1 and GIoU, DINO style)
3. Auxiliary loss across decoder layers
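
The three loss terms above can be sketched in plain Python. This is only a minimal illustration under several assumptions: a single query already matched to a single ground-truth box (Hungarian matching is omitted), boxes in `(x1, y1, x2, y2)` format with positive area, plain binary cross-entropy standing in for the focal loss used in the paper, and loss weights of 1.0. The names `alignment_loss`, `layer_loss`, `total_loss`, and the dictionary keys are hypothetical and do not come from the official code.

```python
import math


def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def alignment_loss(query_feat, token_feats, target):
    """Mean binary cross-entropy over query-to-token dot-product logits.

    target[i] is 1 if text token i describes the matched object, else 0.
    (Assumption: BCE replaces the focal loss to keep the sketch short.)
    """
    total = 0.0
    for tok, t in zip(token_feats, target):
        x = dot(query_feat, tok)
        # Numerically stable BCE-with-logits.
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total / len(target)


def l1_loss(pred_box, gt_box):
    """Sum of absolute coordinate differences."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def giou(a, b):
    """Generalized IoU of two positive-area (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def layer_loss(pred, gt, token_feats, w_align=1.0, w_l1=1.0, w_giou=1.0):
    """Loss for one decoder layer; the weights are placeholders."""
    return (w_align * alignment_loss(pred["feat"], token_feats, gt["tokens"])
            + w_l1 * l1_loss(pred["box"], gt["box"])
            + w_giou * (1.0 - giou(pred["box"], gt["box"])))


def total_loss(per_layer_preds, gt, token_feats):
    """Auxiliary supervision: apply the same loss to every decoder layer."""
    return sum(layer_loss(p, gt, token_feats) for p in per_layer_preds)
```

Summing the per-layer losses is what makes the third term "auxiliary": intermediate decoder layers receive the same supervision as the final one, so each layer is pushed toward a usable prediction.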

At inference, the model keeps the top 900 queries, so each image yields 900 candidate bounding boxes.
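
One plausible reading of the 900 figure is the number of decoder queries kept by language-guided query selection: each image token is scored by its best similarity to any text token, and the highest-scoring tokens initialize the queries. The sketch below illustrates that idea in plain Python. The function name `select_top_queries`, its arguments, and the toy 2-D features are hypothetical, and a real implementation would operate on batched tensors instead of lists.

```python
def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_top_queries(image_feats, text_feats, num_queries=900):
    """Return indices of the image tokens that best match any text token."""
    # Score each image token by its strongest similarity to a text token.
    scores = [max(dot(img, txt) for txt in text_feats) for img in image_feats]
    # Rank token indices by score, highest first, and keep the top ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]


# Toy 2-D features, made up for illustration only.
image_feats = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
top2 = select_top_queries(image_feats, text_feats, num_queries=2)
```

With fewer image tokens than `num_queries`, the function simply returns all of them, so the default of 900 acts as an upper bound in this sketch.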
> [!TIP]
>
> This paper presents a novel approach to open-vocabulary object detection by marrying DINO with grounded pre-training. The authors use a DINO model to obtain the query features and then a grounding head to predict the bounding box and class label for each query.
>
> I'm really interested in the number of bounding boxes used at inference. How fine-grained are these boxes? Could they serve as a good reference for counting problems and logical reasoning, for example recognizing a hand with six fingers?