Zheyuan Wu
2025-10-01 23:24:12 -05:00
parent dfc89afc54
commit 5ce0c8773b
2 changed files with 52 additions and 0 deletions


@@ -1,2 +1,29 @@
# CSE5519 Advances in Computer Vision (Topic I: 2022: Embodied Computer Vision and Robotics)
## DayDreamer: World Models for Physical Robot Learning
[link to paper](https://arxiv.org/pdf/2206.14176)
This is a real-world learning framework for robotics.
### Novelty in the integration of world-model learning with reinforcement learning
It leverages the Dreamer algorithm for fast robot learning in the real world.
Two neural-network components are trained on experience drawn from the replay buffer.
#### Encoder
The encoder fuses all sensory modalities into discrete codes. The decoder reconstructs the inputs from the codes, providing a rich learning signal and enabling human inspection of model predictions.
A **recurrent state-space model** is trained to predict future codes given actions, without observing the intermediate inputs.
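
To make the architecture concrete, here is a minimal PyTorch sketch of the encoder / decoder / RSSM trio. All names, shapes, and the use of continuous codes with squared-error losses are illustrative assumptions; the actual implementation uses per-modality encoders, categorical latents, and likelihood/KL objectives.

```python
# Illustrative sketch only: sizes, module names, and losses are assumptions.
import torch
import torch.nn as nn

class WorldModelSketch(nn.Module):
    """Toy version of the encoder / decoder / recurrent state-space model described above."""
    def __init__(self, obs_dim=64, act_dim=8, code_dim=32, hidden=256):
        super().__init__()
        # Encoder: fuse (here, already-flattened) sensory inputs into a latent code.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(),
                                     nn.Linear(hidden, code_dim))
        # Decoder: reconstruct the inputs from the code (the "rich learning signal").
        self.decoder = nn.Sequential(nn.Linear(code_dim, hidden), nn.ELU(),
                                     nn.Linear(hidden, obs_dim))
        # RSSM: predict the next code from the recurrent state and the action,
        # without looking at the intermediate observations.
        self.rssm = nn.GRUCell(code_dim + act_dim, hidden)
        self.next_code = nn.Linear(hidden, code_dim)

    def loss(self, obs, act):
        """obs: (B, T, obs_dim), act: (B, T, act_dim), sampled from the replay buffer."""
        B, T, _ = obs.shape
        codes = self.encoder(obs)                                    # (B, T, code_dim)
        h = obs.new_zeros(B, self.rssm.hidden_size)
        recon, pred = 0.0, 0.0
        for t in range(T - 1):
            recon += ((self.decoder(codes[:, t]) - obs[:, t]) ** 2).mean()
            h = self.rssm(torch.cat([codes[:, t], act[:, t]], dim=-1), h)
            pred += ((self.next_code(h) - codes[:, t + 1].detach()) ** 2).mean()
        return recon + pred
```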
#### World model learning
The world model enables massively parallel policy optimization from imagined rollouts in the compact latent space using a large batch size, without having to reconstruct sensory inputs. Dreamer trains a _policy network_ and a _value network_ on these imagined rollouts.
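
A rough sketch of the actor-critic update on imagined rollouts. The `world_model.imagine_step` interface, the policy returning a `torch.distributions` object, and the plain discounted returns are assumptions for illustration; Dreamer itself uses λ-returns and backpropagates through the learned dynamics.

```python
# Illustrative sketch only: interfaces and the REINFORCE-style objective are assumptions.
import torch

def train_on_imagination(world_model, actor, critic, actor_opt, critic_opt,
                         start_states, horizon=15, gamma=0.99):
    """Roll the learned dynamics forward from replayed latent states, then update
    the policy and value networks on the imagined trajectory (no decoding needed)."""
    s, rewards, values, log_probs = start_states, [], [], []
    for _ in range(horizon):
        dist = actor(s)                          # assumed: returns a torch.distributions object
        a = dist.rsample()
        log_probs.append(dist.log_prob(a).sum(-1))
        values.append(critic(s.detach()).squeeze(-1))   # detach so critic loss only trains the critic
        s, r = world_model.imagine_step(s, a)    # assumed: latent-only transition + predicted reward
        rewards.append(r)

    # Discounted returns bootstrapped from the value of the final imagined state.
    ret, returns = critic(s.detach()).squeeze(-1).detach(), []
    for r in reversed(rewards):
        ret = r + gamma * ret
        returns.insert(0, ret)
    returns, values = torch.stack(returns), torch.stack(values)    # (H, B)

    advantage = (returns - values).detach()
    actor_loss = -(torch.stack(log_probs) * advantage).mean()      # REINFORCE-style surrogate
    critic_loss = ((values - returns.detach()) ** 2).mean()

    actor_opt.zero_grad(); critic_opt.zero_grad()
    (actor_loss + critic_loss).backward()
    actor_opt.step(); critic_opt.step()
    return actor_loss.item(), critic_loss.item()
```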
> [!TIP]
>
> This paper uses online reinforcement learning with replay buffers to train directly in a real environment.
>
> The key limitation is the long real-world training time: unlike a simulator, which can generate large batches of data concurrently, the real robot collects experience slowly. Would it be more efficient, in terms of training time and repair cost, to pre-train parts of the model in a simulator and then fine-tune on real-world data? The paper includes a few comparisons between simulator training and the real-world model, but I wonder what the story is on the other side: how does purely simulator-based training go?


@@ -1,2 +1,27 @@
# CSE5519 Advances in Computer Vision (Topic J: 2022: Open-Vocabulary Object Detection)
## Class-agnostic Object Detection with Multi-modal Transformer
[link to paper](https://arxiv.org/pdf/2111.11430)
MViT is the proposed Multi-modal Vision Transformer.
It achieves state-of-the-art performance on various downstream tasks such as class-agnostic object detection.
### Novelty in MViT
1. GPV: a unified architecture for multi-task learning, trained on data from five different vision-language tasks.
2. MDETR: a modulated detection transformer trained to detect objects in an image conditioned on a text query.
3. MAVL: Multi-scale Attention Vision Transformer, which uses multi-scale spatial context to achieve efficient training.
   - MSDA: multi-scale deformable attention, which samples a small set of keys around each query's reference location in the image (see the sketch after this list).
   - Late multi-modal fusion: uses the spatial structure of the image to sparsely sample keys for each query point.
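
A simplified, single-scale, single-head sketch of the deformable-attention sampling idea (the offset scaling and module layout are assumptions; the real MSDA operates over multiple feature levels and attention heads):

```python
# Illustrative sketch only: single scale, single head, arbitrary offset scaling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttentionSketch(nn.Module):
    def __init__(self, dim=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offsets = nn.Linear(dim, n_points * 2)   # predicted (dx, dy) per sampled key
        self.weights = nn.Linear(dim, n_points)       # attention weight per sampled key
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat_map):
        """
        queries:    (B, Q, C) query embeddings
        ref_points: (B, Q, 2) reference locations in [-1, 1] grid coordinates
        feat_map:   (B, C, H, W) image feature map
        """
        B, Q, C = queries.shape
        # Each query predicts a few sampling locations around its reference point.
        offsets = self.offsets(queries).view(B, Q, self.n_points, 2).tanh() * 0.1
        locs = (ref_points.unsqueeze(2) + offsets).clamp(-1, 1)          # (B, Q, P, 2)
        value = self.value_proj(feat_map.flatten(2).transpose(1, 2))     # (B, H*W, C)
        value = value.transpose(1, 2).view(B, C, *feat_map.shape[-2:])
        # Sparsely sample the feature map only at the predicted locations.
        sampled = F.grid_sample(value, locs, align_corners=False)        # (B, C, Q, P)
        attn = self.weights(queries).softmax(-1)                         # (B, Q, P)
        out = (sampled.permute(0, 2, 3, 1) * attn.unsqueeze(-1)).sum(2)  # (B, Q, C)
        return self.out_proj(out)
```

Each query attends to only `n_points` sampled locations instead of all H×W positions of the feature map, which is where the efficiency gain comes from.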
The model has strong generalization ability: it can detect objects that occur only a few times in the training datasets (lynx, humidifier, and armadillo). However, it cannot generalize to medical imaging.
The model also has enhanced interactability: it can comprehend queries such as "all objects" or "long objects" and select the corresponding objects.
> [!TIP]
>
> This is an interesting paper that provides a comprehensive framework for multimodal vision-language models. It uses different components that specialize in each task to achieve SOTA performance and proposes multi-scale deformable attention to speed up the training process. The final model has strong generalization ability and impressive interactability in understanding abstract concepts like "all" and "tall".
>
> I wonder where MViT's understanding of abstract natural-language concepts comes from. Does the model learn correlations between words, like "tall" with man and "all" with a large bounding box, or does it perform logical reasoning? Suppose we give the model same-sized plushes of a monkey and a whale, with the monkey slightly larger. If we ask the model for "large objects", will it select the monkey plush or the whale plush?