updates
@@ -38,3 +38,12 @@ Zero-shot CLIP still generalizes poorly to data that is truly out-of-distribution
> From the definition of the general task CLIP can solve, and the experimental results of Zero-Shot CLIP vs. a Linear Probe on ResNet50, I can see that Zero-Shot CLIP does better on tasks whose concepts people frequently describe in captions, e.g., the car brand or the location where a photo was taken. It performs badly when the concept is rarely captioned or more abstract, e.g., the distance to the camera, counting objects in an image, or classifying satellite images of terrain.
>
> Is CLIP really learning enough from these generic natural-language descriptions of images? If the descriptions were more comprehensive, would CLIP outperform the Linear Probe on ResNet50?
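
For reference, the zero-shot comparison discussed above is typically run by scoring an image against natural-language prompts built from the class names. Below is a minimal sketch assuming the openai/CLIP package; the model variant, prompt template, class names, and image path are illustrative assumptions, not values from these notes.

```python
import torch
import clip  # openai/CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # assumed model variant

# Class names become natural-language prompts; "photo.jpg" is a placeholder path.
class_names = ["a sedan", "a pickup truck", "a sports car"]
text_tokens = clip.tokenize([f"a photo of {c}" for c in class_names]).to(device)
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # Cosine similarity: L2-normalize both embeddings, then take dot products.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(class_names[probs.argmax().item()])  # predicted class for the image
```
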
### New takes from the lecture
Flickr image label selection.
Visual Commonsense Reasoning dataset (VisualBERT).
Loss based on cosine similarity between caption pairs.
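
If this refers to a CLIP-style contrastive objective over matched pairs, one common way to write a cosine-similarity loss is sketched below; the function name, temperature value, and the generic "paired embeddings" framing (image/caption or caption/caption) are my assumptions, not from the lecture.

```python
import torch
import torch.nn.functional as F

def paired_cosine_contrastive_loss(emb_a: torch.Tensor,
                                    emb_b: torch.Tensor,
                                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over cosine similarities of matched pairs.

    emb_a, emb_b: (batch, dim) embeddings where row i of each tensor is a matched pair.
    """
    # Cosine similarity = dot product of L2-normalized vectors.
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.T / temperature  # (batch, batch) similarity matrix

    # Matched pairs lie on the diagonal; all other entries act as negatives.
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    loss_ab = F.cross_entropy(logits, targets)    # a -> b direction
    loss_ba = F.cross_entropy(logits.T, targets)  # b -> a direction
    return (loss_ab + loss_ba) / 2
```

With a batch of matched image and caption embeddings, calling this as `paired_cosine_contrastive_loss(img_emb, txt_emb)` reduces to the symmetric contrastive loss used to train CLIP.
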