updates

2025-11-20 10:15:25 -06:00
parent 1fac4c46fa
commit d90faea29b
3 changed files with 40 additions and 1 deletions
--- a/content/CSE5519/CSE5519_J5.md
+++ b/content/CSE5519/CSE5519_J5.md
@@ -1,2 +1,22 @@
 # CSE5519 Advances in Computer Vision (Topic J: 2025: Open-Vocabulary Object Detection)

+## DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
+
+Input:
+
+- Text prompt encoder
+- Visual prompt encoder
+- Customized prompt encoder
+
+Output:
+
+- Box (object selection)
+- Mask (pixel embedding map)
+- Keypoint (object pose, joint estimation)
+- Language (semantic understanding)
+
+> [!TIP]
+>
+> This model offers one of the latest solutions for the open-vocabulary object detection task and uses the Grounding-100M dataset to surpass previous benchmark results.
+>
+> Some figures in the paper show interesting differences between human recognition and CV, for example the fruits with smiling faces. In the object detection task, DINO-X seems to focus on the fruit itself rather than the smiling face. I wonder whether the model can capture the abstract meaning of such a representation, how that differs from human recognition, and why.