# CSE5519 Advances in Computer Vision (Topic J: 2025: Open-Vocabulary Object Detection)

## DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Input:

- Text prompt encoder
- Visual prompt encoder
- Customized prompt encoder

Output:

- Box (object selection)
- Mask (pixel embedding map)
- Keypoint (object pose, joints estimation)
- Language (semantic understanding)
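
To make the input and output lists above concrete, here is a minimal Python sketch of how the three prompt types and the four output kinds could be represented as plain data. Every name in it (`TextPrompt`, `VisualPrompt`, `CustomizedPrompt`, `Detection`, `encoder_for`) is hypothetical and chosen for illustration; this is not the DINO-X API. The fields on each prompt (phrases, boxes and points, a named embedding) are my own assumptions about what each encoder consumes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

# (x_min, y_min, x_max, y_max) in pixels; the coordinate convention is an assumption.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


@dataclass
class TextPrompt:
    """Free-form category names or phrases (hypothetical container)."""
    phrases: List[str]


@dataclass
class VisualPrompt:
    """Regions marked on an image; box/point fields are assumed for illustration."""
    boxes: List[Box] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)


@dataclass
class CustomizedPrompt:
    """Reference to a specialized prompt embedding (hypothetical field)."""
    name: str


Prompt = Union[TextPrompt, VisualPrompt, CustomizedPrompt]


@dataclass
class Detection:
    """One detected object; only the box is mandatory, other outputs are optional."""
    box: Box
    score: float
    mask: Optional[List[List[bool]]] = None   # per-pixel mask
    keypoints: Optional[List[Point]] = None   # pose / joint locations
    caption: Optional[str] = None             # language output


def encoder_for(prompt: Prompt) -> str:
    """Return which of the three encoders listed in the notes handles this prompt."""
    if isinstance(prompt, TextPrompt):
        return "text prompt encoder"
    if isinstance(prompt, VisualPrompt):
        return "visual prompt encoder"
    if isinstance(prompt, CustomizedPrompt):
        return "customized prompt encoder"
    raise TypeError(f"unsupported prompt type: {type(prompt).__name__}")
```

For example, `encoder_for(TextPrompt(["fruit"]))` routes to the text prompt encoder, and a `Detection` created with only a box and a score leaves the mask, keypoint, and language outputs empty.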

> [!TIP]
>
> DINO-X is a recent solution to the open-vocabulary object detection task; it uses the Grounding-100M dataset to set new benchmark results.
>
> Some figures in the paper show interesting differences between human recognition and computer vision, for example the fruits with smiling faces. In the object detection task, DINO-X seems to focus on the fruit itself rather than the smiling face. I wonder whether the model can capture the abstract meaning of such a representation, how its behavior differs from human recognition, and why.