# CSE5519 Advances in Computer Vision (Topic J: 2025: Open-Vocabulary Object Detection)

## DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Input:

- Text prompt encoder
- Visual prompt encoder
- Customized prompt encoder

Output:

- Box (object selection)
- Mask (pixel embedding map)
- Keypoint (object pose, joint estimation)
- Language (semantic understanding)

> [!TIP]
>
> This model provides the latest solution for the open-vocabulary object detection task and uses the Grounding-100M dataset to surpass previous benchmark results.
>
> Some figures in the paper show interesting differences between human recognition and computer vision. One example is the fruits with smiling faces: in the object detection task, DINO-X seems to focus on the fruit itself rather than on the smiling face. I wonder whether it can capture the abstract meaning of this representation, how its behavior differs from human recognition, and why.
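
The input and output lists above can be read as a small interface: one of three prompt types goes in, and each detected object comes back with a box plus optional mask, keypoint, and language outputs. The sketch below only illustrates that shape. Every class, field, and function name is hypothetical and is not the DINO-X API, and the field contents (for example, a user-drawn box as the visual prompt) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

# NOTE: all names here are hypothetical; this is not the DINO-X API.

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class TextPrompt:
    phrase: str  # assumed: a category name or free-form phrase


@dataclass
class VisualPrompt:
    box: Box  # assumed: a region drawn on the image by the user


@dataclass
class CustomizedPrompt:
    embedding: List[float]  # assumed: a prompt embedding prepared in advance


# The three input branches listed in the notes.
Prompt = Union[TextPrompt, VisualPrompt, CustomizedPrompt]


@dataclass
class ObjectResult:
    """One detected object with the four output types from the notes."""
    box: Box                                            # object selection
    mask: Optional[List[List[bool]]] = None             # per-pixel mask
    keypoints: Optional[List[Tuple[float, float]]] = None  # pose / joints
    caption: Optional[str] = None                       # language output


def available_outputs(result: ObjectResult) -> List[str]:
    """Return the names of the outputs that are present for this object."""
    names = ["box"]  # a box is always present in this sketch
    if result.mask is not None:
        names.append("mask")
    if result.keypoints is not None:
        names.append("keypoint")
    if result.caption is not None:
        names.append("language")
    return names


# Example: an object that has only a box and a language description.
prompt: Prompt = TextPrompt(phrase="fruit")
fruit = ObjectResult(box=(10.0, 20.0, 110.0, 140.0),
                     caption="a fruit with a smiling face")
print(available_outputs(fruit))
```

Keeping the mask, keypoints, and caption optional reflects that not every object needs every output; for instance, keypoints only make sense for object categories that have a defined pose.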