updates

2025-11-11 12:45:58 -06:00
parent 51c9f091d6
commit 34afc00a7f
5 changed files with 387 additions and 0 deletions
--- a/content/CSE5519/CSE5519_B5.md
+++ b/content/CSE5519/CSE5519_B5.md
@@ -1,2 +1,21 @@
 # CSE5519 Advances in Computer Vision (Topic B: 2025: Vision-Language Models)

+## Molmo and PixMo
+
+[link to paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Deitke_Molmo_and_PixMo_Open_Weights_and_Open_Data_for_State-of-the-Art_CVPR_2025_paper.pdf)
+
+## Novelty in Molmo and PixMo
+
+PixMo dataset (712k images, each with a long description of 200+ words)
+
+- Simplified two-stage training pipeline
+  - Standard architecture: a tokenizer, a ViT image encoder (CLIP), and pooled image embeddings that are passed to a decoder-only LLM.
+- Overlapping multi-crop policy
+  - Split a large image into multiple crops with overlapping regions, so that each crop fits the image encoder's input size.
+- Training over multiple annotations
+  - Text-only residual dropout
+- Optimizer setups
+
+> [!TIP]
+>
+> This paper provides an interesting dataset and a refined training pipeline whose resulting models are comparable to current closed-source SOTA models. What is the paper's contribution from an algorithmic perspective? It seems to be mainly a test of a new dataset with a slightly altered training pipeline.
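
The overlapping multi-crop bullet in the notes above can be made concrete with a small sketch. This is not the paper's exact policy: the function names (`tile_starts`, `overlapping_crop_boxes`), the 336-pixel crop size, and the 56-pixel overlap are illustrative assumptions, and a real pipeline would typically also resize or pad the image. It only shows how to tile an image with fixed-size crops so that neighbouring crops share a border region.

```python
def tile_starts(length, crop, overlap):
    """Start offsets of crops along one axis.

    Neighbouring crops share at least `overlap` pixels, and the last
    crop is placed flush with the far edge of the image.
    """
    if length <= crop:
        return [0]  # the whole axis fits into a single crop
    stride = crop - overlap
    starts = list(range(0, length - crop, stride))
    starts.append(length - crop)  # final crop ends exactly at the edge
    return starts


def overlapping_crop_boxes(width, height, crop=336, overlap=56):
    """Return (left, top, right, bottom) boxes covering the image.

    `crop` and `overlap` are hypothetical defaults chosen for
    illustration, not values taken from the notes.
    """
    if not 0 <= overlap < crop:
        raise ValueError("overlap must satisfy 0 <= overlap < crop")
    xs = tile_starts(width, crop, overlap)
    ys = tile_starts(height, crop, overlap)
    # Row-major order: left to right, then top to bottom.
    return [
        (x, y, min(x + crop, width), min(y + crop, height))
        for y in ys
        for x in xs
    ]


if __name__ == "__main__":
    for box in overlapping_crop_boxes(1000, 600):
        print(box)
```

The boxes use the same (left, top, right, bottom) ordering that Pillow's `Image.crop` expects, so each one could be passed to it directly. An image smaller than the crop size yields a single box covering the whole image.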