CSE5519 Advances in Computer Vision (Topic J: 2025: Open-Vocabulary Object Detection)

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Input:

  • Text prompt encoder
  • Visual prompt encoder
  • Customized prompt encoder

Output:

  • Box (object selection)
  • Mask (pixel embedding map)
  • Keypoint (object pose, joint estimation)
  • Language (semantic understanding)
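
To make the input/output summary above concrete, here is a minimal sketch of how the three prompt types and the four per-object outputs could be represented as plain data structures. Every name below (`TextPrompt`, `VisualPrompt`, `CustomizedPrompt`, `ObjectPrediction`, `available_outputs`) is hypothetical and chosen for illustration; none of this is the DINO-X API, and the field layouts are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

# Hypothetical types for illustration only; not the DINO-X API.
Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) layout


@dataclass
class TextPrompt:
    phrase: str  # e.g. a category name or a free-form phrase


@dataclass
class VisualPrompt:
    box: Box  # a region drawn on the image by the user


@dataclass
class CustomizedPrompt:
    name: str  # identifier of a prompt tuned for a specific scenario


Prompt = Union[TextPrompt, VisualPrompt, CustomizedPrompt]


@dataclass
class ObjectPrediction:
    box: Box                                   # box head
    mask: Optional[List[List[int]]] = None     # mask head (binary H x W grid)
    keypoints: List[Tuple[float, float]] = field(default_factory=list)  # keypoint head
    caption: Optional[str] = None              # language head


def available_outputs(pred: ObjectPrediction) -> List[str]:
    """List which of the four output types are populated for one object."""
    names = ["box"]  # in this sketch a box is always present
    if pred.mask is not None:
        names.append("mask")
    if pred.keypoints:
        names.append("keypoint")
    if pred.caption is not None:
        names.append("language")
    return names
```

Under these assumptions, a detection that carries only a box and a caption would be written as `ObjectPrediction(box=(0.0, 0.0, 10.0, 10.0), caption="an apple")`, and `available_outputs` would report the box and language outputs for it.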

Tip

This model is a recent solution for the open-vocabulary object detection task; it uses the Grounding-100M dataset to set new benchmark results.

In some of the figures in the paper, I found interesting differences between human recognition and CV. Take the fruits with smiling faces: in the object detection task, DINO-X seems to focus on the fruit itself rather than on the smiling face. I wonder whether the model can capture the abstract meaning of this representation, how that differs from human recognition, and why.