Trance-0/NoteNextra-origin

2025-11-03 23:56:22 -06:00

CSE5519 Advances in Computer Vision (Topic B: 2024: Vision-Language Models)

Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)

link to the paper

This paper shows that visual instruction tuning can improve the performance of vision-language models.

Novelty in LLaVA-1.5

Scaling to high-resolution images by dividing each image into a grid of patches while maintaining data efficiency.
Compositional ability: training on long-form language reasoning together with shorter visual reasoning can improve the model's writing ability.
Randomly downsampling the training data does not significantly degrade performance.
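
The grid-splitting idea in the first point can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper name `grid_boxes` and the 336-pixel patch size are assumptions made for the example. It only computes the crop box of each grid cell; each crop would then be passed to the vision encoder separately.

```python
def grid_boxes(width, height, patch_size):
    """Return (left, top, right, bottom) crop boxes that tile an image.

    Hypothetical helper for illustration. Cells on the right and bottom
    edges are clipped to the image bounds instead of being padded.
    """
    cols = -(-width // patch_size)   # ceiling division
    rows = -(-height // patch_size)
    boxes = []
    for r in range(rows):            # row-major order: left to right, top to bottom
        for c in range(cols):
            left, top = c * patch_size, r * patch_size
            right = min(left + patch_size, width)
            bottom = min(top + patch_size, height)
            boxes.append((left, top, right, bottom))
    return boxes


# A 672x672 image with an assumed 336-pixel encoder input gives a 2x2 grid.
boxes = grid_boxes(672, 672, 336)
```

Because every cell matches the encoder's native input size, the encoder itself does not need to be retrained for higher resolutions, which is one plausible reading of why the approach stays data-efficient.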

Tip

This paper shows that LLaVA-1.5 obeys the scaling law and that splitting high-resolution images into grids maintains data efficiency. I wonder why this method is not applied to multi-image understanding tasks. Why can't we assign an index embedding to each image and feed the whole image set to the model for better understanding?
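
To make the question concrete, here is a rough sketch of what "index embeddings" could look like. It is purely hypothetical and not something the paper proposes or evaluates: the function name, the embedding table, and the toy dimensions are all assumptions. Each image's visual tokens get a per-image vector added to them before the tokens are concatenated into one sequence.

```python
import random


def add_image_index_embeddings(image_tokens, index_table):
    """Tag each image's tokens with that image's index embedding.

    image_tokens: list over images; each entry is a list of token vectors.
    index_table:  one embedding vector per image position (hypothetical;
                  in a real model this table would be learned).
    Returns a single flat token sequence covering all images.
    """
    sequence = []
    for image_idx, tokens in enumerate(image_tokens):
        index_vec = index_table[image_idx]
        for token in tokens:
            # Element-wise sum marks which image the token came from.
            sequence.append([t + e for t, e in zip(token, index_vec)])
    return sequence


rng = random.Random(0)
dim = 4  # toy embedding size, chosen only for the example
index_table = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(3)]
image_tokens = [[[0.0] * dim for _ in range(2)] for _ in range(3)]  # 3 images, 2 tokens each
sequence = add_image_index_embeddings(image_tokens, index_table)
```

Whether such a scheme actually helps would depend on training data that contains multiple images per sample, which this sketch does not address.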