updates
@@ -1,2 +1,21 @@
# CSE5519 Advances in Computer Vision (Topic B: 2025: Vision-Language Models)
## Molmo and PixMo:
[link to paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Deitke_Molmo_and_PixMo_Open_Weights_and_Open_Data_for_State-of-the-Art_CVPR_2025_paper.pdf)
## Novelty in Molmo and PixMo
PixMo dataset (712k images with long descriptions of 200+ words)

- Simplified two-stage training pipeline
- Standard architecture: a tokenizer and a ViT image encoder (CLIP), with the image embeddings pooled and passed to a decoder-only LLM.
- Overlapping multi-crop policy
  - Large images are cropped into multiple tiles, with overlapping regions between neighboring crops.
- Training over multiple annotations
- Text-only residual dropout
- Optimizer setups
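
To make the overlapping multi-crop idea above concrete, here is a minimal sketch of how crop boxes with overlapping borders could be computed. It is not the paper's exact procedure: the function name `overlapping_crops` and the `crop=336` and `overlap=56` values are assumptions chosen for illustration, and the sketch skips any resizing or padding a real pipeline would apply.

```python
def overlapping_crops(width, height, crop=336, overlap=56):
    """Return (left, top, right, bottom) boxes that cover the image.

    Hypothetical helper: crop size and overlap are illustrative values,
    not settings taken from the paper.
    """
    if not 0 <= overlap < crop:
        raise ValueError("overlap must be non-negative and smaller than crop")

    def starts(size):
        # An image side that fits in one crop needs a single tile.
        if size <= crop:
            return [0]
        stride = crop - overlap
        # Step by the stride, then add a final tile flush with the border
        # so the whole side is covered.
        positions = list(range(0, size - crop, stride))
        positions.append(size - crop)
        return positions

    return [
        (x, y, min(x + crop, width), min(y + crop, height))
        for y in starts(height)
        for x in starts(width)
    ]


boxes = overlapping_crops(616, 336)
print(boxes)
```

For a 616x336 image this yields two tiles that share a 56-pixel-wide strip, so patches near the seam appear in both crops and keep some surrounding context.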
> [!TIP]
>
> This paper provides an interesting dataset and a refined training pipeline, and the resulting models perform comparably to current closed-source SOTA models. What is the paper's contribution from an algorithmic perspective? It seems to be mainly a test of a new dataset with a slightly altered training pipeline.