Trance-0
2025-11-04 11:30:12 -06:00
parent 71ada8d498
commit d24c0bdd9e
3 changed files with 145 additions and 1 deletions

@@ -14,4 +14,4 @@ This paper shows that the visual instruction tuning can improve the performance
 >[!TIP]
 >
-> This paper shows that LLaVA-1.5 obeys the scaling law and splitting the high resolution images into grids to maintain the data efficiency. I wonder why this method is not applicable to multi-image understanding tasks? Why we cannot assign index embeddings to each image and push the image sets to the model for better understanding?
+> This paper shows that LLaVA-1.5 obeys the scaling law and splitting the high resolution images into grids to maintain the data efficiency. I wonder why this method is not applicable to multi-image understanding tasks? Why we cannot assign index embeddings to each image and push the image sets to the model for better understanding? What are the technical challenges to implement this idea?