# CSE5519 Advances in Computer Vision (Topic B: 2023: Vision-Language Models)
## InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
[link to paper](https://arxiv.org/pdf/2305.06500)
> [!TIP]
>
> This paper introduces InstructBLIP, a framework for instruction-tuning vision-language models so that they follow text instructions.
>
> It builds on BLIP-2 and consists of three submodules: a frozen image encoder, a frozen LLM, and a Query Transformer (Q-Former) that bridges the two (a rough sketch of this data flow follows the note).
>
> From the qualitative results, we can see some hints that the model follows the text instructions, but I wonder whether this framework could also be extended to image editing and generation tasks. What might be the difficulties in migrating this framework to context-aware image generation?
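
The note above describes the architecture in words; below is a minimal, hypothetical PyTorch sketch of that data flow (image encoder → Q-Former with learnable queries → projection into the LLM's embedding space). The module names, dimensions, and the use of `nn.TransformerDecoder` as a stand-in Q-Former are assumptions for illustration only, not the authors' implementation or the LAVIS API.

```python
# Hypothetical sketch of the InstructBLIP-style data flow:
# frozen image encoder -> Q-Former with learnable queries -> frozen LLM.
import torch
import torch.nn as nn


class QFormerBridge(nn.Module):
    """Learnable query tokens cross-attend to image features, then get
    projected into the LLM's embedding space (a stand-in for the Q-Former)."""

    def __init__(self, num_queries=32, vis_dim=768, llm_dim=1024):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        layer = nn.TransformerDecoderLayer(d_model=vis_dim, nhead=8, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=2)
        self.to_llm = nn.Linear(vis_dim, llm_dim)

    def forward(self, image_feats):                  # (B, N_patches, vis_dim)
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q = self.qformer(tgt=q, memory=image_feats)  # queries attend to image features
        return self.to_llm(q)                        # (B, num_queries, llm_dim)


# Toy stand-ins for the frozen components (the real model uses a ViT and a full LLM).
image_encoder = nn.Linear(3 * 16 * 16, 768)          # fake "patch embedder"
bridge = QFormerBridge(vis_dim=768, llm_dim=1024)

patches = torch.randn(2, 196, 3 * 16 * 16)           # batch of 2 images, 196 flattened patches
image_feats = image_encoder(patches)                 # (2, 196, 768)
visual_prompt = bridge(image_feats)                  # (2, 32, 1024)

# The LLM would then consume [visual prompt ; instruction token embeddings];
# dummy instruction embeddings just illustrate the concatenation step here.
instr_embeds = torch.randn(2, 12, 1024)
llm_input = torch.cat([visual_prompt, instr_embeds], dim=1)
print(llm_input.shape)                               # torch.Size([2, 44, 1024])
```

In the paper, only the Q-Former (and its projection) is trained while the image encoder and LLM stay frozen; InstructBLIP's addition over BLIP-2 is that the instruction tokens are also fed into the Q-Former so the extracted queries become instruction-aware, which the toy sketch above does not show.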