# CSE5519 Advances in Computer Vision (Topic H: 2025: Safety, Robustness, and Evaluation of CV Models)

## Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographic Robustness in Object Recognition

Does adding geographical context to CLIP prompts improve recognition across geographies?

Yes, it improves accuracy by about 1%.
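A rough sketch of the idea (my own illustration, not the paper's code): zero-shot CLIP classification where each text prompt is suffixed with a country name and compared against the plain prompt. The class list, country, and image path below are placeholders.

```python
# Sketch: zero-shot CLIP classification with plain vs. geography-augmented prompts.
# Requires the OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["stove", "toothbrush", "bicycle"]   # placeholder class names
country = "Nigeria"                            # placeholder target geography

plain_prompts = [f"a photo of a {c}" for c in classes]
geo_prompts = [f"a photo of a {c} in {country}" for c in classes]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    image_feat = model.encode_image(image)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    for name, prompts in [("plain", plain_prompts), ("geo", geo_prompts)]:
        text_feat = model.encode_text(clip.tokenize(prompts).to(device))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)
        print(name, {c: round(p.item(), 3) for c, p in zip(classes, probs[0])})
```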
Can an LLM provide useful geographic descriptive knowledge to improve recognition?

Yes.
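As one way to make that concrete (a sketch, not the paper's pipeline), an instruction-tuned LLM can be asked for geography-specific visual descriptors of a class, and the answers folded into the CLIP prompts above. The `openai` client, model name, class, and country are all placeholder assumptions.

```python
# Sketch: ask an LLM for geography-specific visual descriptors of a class,
# then fold them into CLIP-style text prompts. Assumes the `openai` package
# and an OPENAI_API_KEY in the environment; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

def geo_descriptors(class_name: str, country: str, n: int = 3) -> list[str]:
    """Return n short descriptions of how `class_name` typically looks in `country`."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                f"List {n} short visual descriptions of what a {class_name} "
                f"typically looks like in {country}. One per line, no numbering."
            ),
        }],
    )
    return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

# Each descriptor becomes its own prompt; their text features can be averaged
# or used as an ensemble when scoring against the image feature.
descriptors = geo_descriptors("stove", "Nigeria")
geo_knowledge_prompts = [f"a photo of a stove, {d}" for d in descriptors]
```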
How can we optimize soft prompts for CLIP using an accessible data source while accounting for target geographies that are not represented in the training set?

Where can soft prompts enhanced with geographical knowledge provide the most benefit?
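On the soft-prompt question, here is a minimal CoOp-style sketch (my own illustration, not the paper's method): a few context vectors are made learnable, spliced into CLIP's text encoder in place of hand-written prompt words, and then optimized on whatever labeled data is accessible. Class names, the context length, and the training loop are placeholders.

```python
# Sketch: CoOp-style learnable soft prompts for CLIP's text encoder.
# Placeholder tokens "X" are embedded, then replaced by learnable context vectors.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()

classes = ["stove", "toothbrush", "bicycle"]   # placeholder class names
n_ctx = 4                                      # number of learnable context tokens

ctx = nn.Parameter(0.02 * torch.randn(n_ctx, model.token_embedding.embedding_dim, device=device))

prompts = [" ".join(["X"] * n_ctx) + f" {c}." for c in classes]
tokenized = clip.tokenize(prompts).to(device)
with torch.no_grad():
    embedded = model.token_embedding(tokenized)          # (n_cls, 77, dim)

def class_text_features(ctx: torch.Tensor) -> torch.Tensor:
    # Splice the learnable context in place of the "X" placeholders (positions 1..n_ctx),
    # then run CLIP's text transformer the same way encode_text() does.
    x = torch.cat([embedded[:, :1],
                   ctx.unsqueeze(0).expand(len(classes), -1, -1),
                   embedded[:, 1 + n_ctx:]], dim=1)
    x = x + model.positional_embedding
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x)
    x = x[torch.arange(x.shape[0]), tokenized.argmax(dim=-1)] @ model.text_projection
    return x / x.norm(dim=-1, keepdim=True)

optimizer = torch.optim.SGD([ctx], lr=2e-3)
# Training loop (omitted): for each labeled batch, compute image features,
# logits = 100 * image_feat @ class_text_features(ctx).T, take cross-entropy loss,
# then loss.backward() and optimizer.step() to update only ctx.
```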
> [!TIP]
>
> This paper proposes an effective way to improve model performance by self-querying an LLM for geographical knowledge.
>
> I wonder where the limit of LLM-generated context lies in terms of performance improvement. In principle, an LLM could generate most of the plausible contexts before prediction and use them to boost accuracy; however, injecting additional (possibly irrelevant) information may also induce hallucinations. I wonder whether there is a general way to have an LLM produce a suitable context for a task and reliably use it to improve performance.

# CSE5519 Advances in Computer Vision (Topic J: 2025: Open-Vocabulary Object Detection)
## DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Input:

- Text prompt encoder
- Visual prompt encoder
- Customized prompt encoder

Output:

- Box (object selection)
- Mask (pixel embedding map)
- Keypoint (object pose, joint estimation)
- Language (semantic understanding)
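DINO-X itself isn't reproduced here, but the text-prompted open-set detection interface it extends can be tried locally with its predecessor Grounding DINO. Below is a rough sketch via Hugging Face `transformers`; the checkpoint name, phrases, and image path are placeholders, and detection quality will differ from DINO-X.

```python
# Sketch: text-prompted open-set detection with Grounding DINO (a predecessor of
# DINO-X) via Hugging Face transformers; not the DINO-X model itself.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

checkpoint = "IDEA-Research/grounding-dino-tiny"   # placeholder checkpoint
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)

image = Image.open("example.jpg")                  # placeholder image
text = "a smiling face. an orange. a banana."      # lowercase phrases, period-separated

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map raw predictions back to phrase labels and pixel-space boxes.
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
)[0]
labels = results.get("text_labels", results["labels"])  # key name differs across versions
for score, box, label in zip(results["scores"], results["boxes"], labels):
    print(f"{label}: {score:.2f} at {[round(v) for v in box.tolist()]}")
```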
> [!TIP]
>
> This model provides the latest solution to open-vocabulary object detection, using the Grounding-100M dataset to set new benchmark results.
>
> In some of the figures shown in the paper, I noticed interesting differences between human recognition and computer vision, for example the fruits drawn with smiling faces. In the detection results, DINO-X seems to focus on the fruit itself rather than the smiling face. I wonder whether such models can capture the abstract meaning of this kind of representation, how that differs from human recognition, and why.