updates
@@ -14,4 +14,4 @@ This paper shows that the visual instruction tuning can improve the performance
 
 >[!TIP]
 >
-> This paper shows that LLaVA-1.5 obeys the scaling law and splitting the high resolution images into grids to maintain the data efficiency. I wonder why this method is not applicable to multi-image understanding tasks? Why we cannot assign index embeddings to each image and push the image sets to the model for better understanding?
+> This paper shows that LLaVA-1.5 obeys the scaling law and splitting the high resolution images into grids to maintain the data efficiency. I wonder why this method is not applicable to multi-image understanding tasks? Why we cannot assign index embeddings to each image and push the image sets to the model for better understanding? What are the technical challenges to implement this idea?