Zheyuan Wu
2025-09-09 14:23:28 -05:00
parent 341af61ea9
commit 384e538bc9
2 changed files with 50 additions and 0 deletions


@@ -38,3 +38,12 @@ Zero-shot CLIP still generalizes poorly to data that is truly out-of-distribution
> From the definition of the general task CLIP solves, and the experimental results comparing Zero-Shot CLIP against a Linear Probe on ResNet50, I can see that Zero-Shot CLIP outperforms the Linear Probe on tasks whose concepts humans "frequently label" in natural language, such as a car's brand or the location where a photo was taken. It performs badly when the concept is rarely labeled by humans or is more abstract, such as the distance to the camera, the number of objects in an image, or satellite images of terrain.
>
> Is the CLIP model really learning enough knowledge from general natural-language descriptions of images? If the descriptions were more comprehensive, would CLIP outperform the Linear Probe on ResNet50?
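
To make the zero-shot side of this comparison concrete, here is a minimal sketch loosely following the openai `clip` package's README; the class names and image path are placeholder assumptions, not values from the paper or lecture.

```python
# Minimal zero-shot classification sketch with CLIP.
# Assumes: pip install git+https://github.com/openai/CLIP.git
# The class prompts and "example.jpg" are hypothetical placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["a photo of a sedan", "a photo of a truck"]  # hypothetical labels
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(class_names).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so dot products equal cosine similarities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Class probabilities from scaled cosine similarities.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```

No training is involved: the class names themselves act as the "labels", which is exactly why concepts that humans rarely write down in captions are hard for this setup.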
### New takes from the lecture
- Flickr image label selection.
- Visual Commonsense Reasoning dataset (VisualBERT).
- Loss based on cosine similarity between caption pairs (see the sketch below).
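
Assuming the last point refers to CLIP's symmetric image-text contrastive objective, here is a minimal sketch; the function name, temperature value, and tensor shapes are my assumptions for illustration.

```python
# Sketch of a CLIP-style symmetric contrastive loss: cosine similarity
# between paired embeddings, cross-entropy against the matching index.
# Names (image_emb, text_emb, temperature) are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so that dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # (N, N) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)  # i-th pair matches
    # Symmetric cross-entropy: image->text over rows, text->image over columns.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```

Each row/column of the similarity matrix is a classification over the batch, so the diagonal (true image-caption pairs) is pushed up while all mismatched pairs are pushed down.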