# CSE5519 Advances in Computer Vision (Topic J: 2023 - 2024: Open-Vocabulary Object Detection)

## Grounding DINO

[link to the paper](https://arxiv.org/pdf/2303.05499)

### Novelty in Grounding DINO

- Use CLIP to enhance the features of a DETR-style detector

1. Contrastive loss for text-region alignment
2. Localization loss for box regression (DINO style)
3. Auxiliary loss across decoder layers

Top 900 bounding boxes for inference.

> [!TIP]
>
> This paper shows a novel approach to open-vocabulary object detection by marrying DINO with CLIP. The authors use a DINO model to get the query features and then use a grounding head to get the bounding box and class label.
>
> I'm really interested in the number of bounding boxes for inference. I wonder how fine-grained the bounding boxes are. Do they serve as a good reference for counting problems and for logical reasoning, for example detecting a hand with 6 fingers?
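
The text-region alignment and the "top 900 boxes" step in the notes above can be illustrated with a toy sketch. This is not the paper's implementation: it assumes each candidate region and each text token is already a feature vector, scores every region-token pair with a sigmoid over their dot product, and keeps the `k` regions with the highest best-token score. The helper names (`region_text_scores`, `select_top_k`) and the toy vectors are hypothetical; in the notes, `k` would be 900.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def region_text_scores(region_feats, token_feats):
    """Score every (region, token) pair as sigmoid(region . token)."""
    return [[sigmoid(dot(r, t)) for t in token_feats] for r in region_feats]


def select_top_k(region_feats, token_feats, k):
    """Keep the k regions whose best-matching token score is highest.

    Returns a list of (region_index, token_index, score), best first.
    """
    scores = region_text_scores(region_feats, token_feats)
    best = []
    for i, row in enumerate(scores):
        j = max(range(len(row)), key=lambda idx: row[idx])
        best.append((i, j, row[j]))
    # Sort by score, highest first (stable for ties).
    best.sort(key=lambda item: -item[2])
    return best[:k]


if __name__ == "__main__":
    # Toy features (assumed values, for illustration only).
    regions = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    tokens = [[2.0, 0.0], [0.0, 3.0]]  # e.g. embeddings for "cat", "dog"
    for region_idx, token_idx, score in select_top_k(regions, tokens, k=2):
        print(region_idx, token_idx, round(score, 3))
```

With these toy values, region 1 pairs with token 1 and region 0 pairs with token 0, while region 2 is dropped because it matches neither token well. A fixed budget like this bounds how many objects can be reported per image, which bears on the counting question raised in the tip.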