This commit is contained in:
Trance-0
2025-09-15 23:32:06 -05:00
parent 6f615b78f5
commit b51bdc5e17
4 changed files with 83 additions and 0 deletions


@@ -1,2 +1,33 @@
# CSE5519 Advances in Computer Vision (Topic I: 2021 and before: Embodied Computer Vision and Robotics)
## ViNG: Learning Open-World Navigation with Visual Goals.
[link to the paper](https://arxiv.org/pdf/2012.09812)
We consider the problem of goal-directed visual navigation: a robot is tasked with navigating to a goal location $G$ given an image observation $o_G$ taken at $G$. In addition to navigating to the goal, the robot also needs to recognize when it has reached the goal, signaling that the task has been completed. The robot does not have a spatial map of the environment, but we assume that it has access to a small number of trajectories that it has collected previously. This data will be used to construct a graph over the environment using a learned distance and reachability function. We make no assumptions on the nature of the trajectories: they may be obtained by human teleoperation, self-exploration, or a result of a random walk. Each trajectory is a dense sequence of observations $o_1, o_2, \ldots, o_n$ recorded by its on-board camera. Since the robot only observes the world from a single on-board camera and does not run any state estimation, our system operates in a partially observed setting. Our system commands continuous linear and angular velocities.
The system runs on an NVIDIA Jetson TX2 computer and operates solely on images taken from the onboard camera.
Our experiments demonstrate that two key technical insights contribute to significantly improved performance in the real-world setting: graph pruning (Sec. IV-B2) and negative mining (Sec. IV-A1). Comparisons to prior methods in Section V and ablation studies in Section V-D show that these improvements enable ViNG to learn goal-conditioned policies entirely from offline data, avoiding the need for simulators and online sampling, while prior methods struggle to attain good performance, particularly for long-horizon goals.
### Novelty in ViNG
#### Learning dynamical distance
More precisely, we will learn to predict the estimated number of time steps required by a controller to navigate from one observation to another. This function must encapsulate knowledge of physics beyond just geometry.
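As a concrete illustration, here is a minimal Python sketch (hypothetical, not the authors' code) of how such dynamical-distance training pairs could be generated from a single recorded trajectory: the supervision signal is simply the number of time steps separating two observations on the same trajectory.

```python
import random

def sample_distance_pair(trajectory, max_horizon=20):
    """Sample one training example from a list of consecutive observations.

    trajectory: observations o_1, ..., o_n recorded at consecutive time steps.
    max_horizon: assumed cap on how far ahead we look (not specified in the note).
    """
    i = random.randrange(len(trajectory) - 1)
    j = min(i + random.randint(1, max_horizon), len(trajectory) - 1)
    # (o_i, o_j) is the input pair; (j - i) is the dynamical-distance label.
    return trajectory[i], trajectory[j], j - i
```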
1. Negative Mining: Augment the training data with observation pairs drawn from different trajectories, which serve as negative (far-apart) examples when training the distance model.
2. Graph Pruning: As the robot gathers more experience, maintaining a dense graph of traversability across all observation nodes becomes redundant and infeasible, as the graph size grows quadratically. For our experiments, we sparsify trajectories by thresholding the edges that get added to the graph: edges that are easily traversable $(T(o_i, o_j) < \delta_{\text{sparsify}})$ are not added to the graph, since the controller can traverse those edges with high probability.
3. Weighted Distance: We use the predicted traversability as an edge weight and compute distances between nodes with a weighted Dijkstra search (see the sketch after this list).
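A minimal sketch of how the sparsified graph and the weighted Dijkstra planner described above could be wired together, assuming a trained distance predictor `predict_distance(o_a, o_b)`; the `d_max` cutoff for pairs predicted to be unreachable is my own assumption, not stated in this note.

```python
import networkx as nx

def build_traversability_graph(observations, predict_distance, d_sparsify=3.0, d_max=20.0):
    graph = nx.DiGraph()
    graph.add_nodes_from(range(len(observations)))
    for i, o_i in enumerate(observations):
        for j, o_j in enumerate(observations):
            if i == j:
                continue
            d = predict_distance(o_i, o_j)
            # Skip easily traversable edges (d < d_sparsify), following the pruning rule above;
            # d_max is an assumed cutoff for pairs predicted to be unreachable.
            if d_sparsify <= d <= d_max:
                graph.add_edge(i, j, weight=d)
    return graph

def plan_path(graph, start, goal):
    # Shortest path under the predicted dynamical distances (weighted Dijkstra).
    return nx.dijkstra_path(graph, start, goal, weight="weight")
```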
> [!TIP]
>
> This is a really interesting paper that uses a learned policy and a positional graph to navigate in the real world.
>
> I don't have time to check the implementation, but I assume it does not use a deep neural network dedicated to recognizing the goal pattern. As the authors mention, performance is sensitive to seasonal and illumination changes. I wonder whether a pattern recognition model could help the system recognize the goal more easily. Is there a way to do this?
>
> How does the model generalize its knowledge about the topology of the environment, and how does it know it is still on the correct path when the robot is interrupted by other objects, for example crossing bicycles or leaves dropped on the ground?


@@ -1,2 +1,29 @@
# CSE5519 Advances in Computer Vision (Topic J: 2021 and before: Open-Vocabulary Object Detection)
## MDETR - Modulated Detection for End-to-End Multi-Modal Understanding
[link to the paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Kamath_MDETR_-_Modulated_Detection_for_End-to-End_Multi-Modal_Understanding_ICCV_2021_paper.pdf)
MDETR uses a convolutional backbone to extract visual features, and a language model such as RoBERTa to extract text features. The features of both modalities are projected to a shared embedding space, concatenated and fed to a transformer encoder-decoder that predicts the bounding boxes of the objects and their grounding in text.
Our approach to modulated detection builds on the DETR system.
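The following is a minimal sketch of this modulated-detection pipeline as I understand it; module sizes, the use of `nn.Transformer` as a stand-in for the DETR-style encoder-decoder, and the single box head are my assumptions, not MDETR's actual implementation. Projected visual and text features are concatenated into one sequence that the decoder attends to with learned object queries.

```python
import torch
import torch.nn as nn

class ModulatedDetectorSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=100, text_dim=768, visual_dim=2048):
        super().__init__()
        self.visual_proj = nn.Conv2d(visual_dim, d_model, kernel_size=1)  # project CNN features
        self.text_proj = nn.Linear(text_dim, d_model)                     # project RoBERTa features
        self.transformer = nn.Transformer(d_model, batch_first=True)      # stand-in encoder-decoder
        self.query_embed = nn.Embedding(num_queries, d_model)             # learned object queries
        self.bbox_head = nn.Linear(d_model, 4)                            # (cx, cy, w, h)

    def forward(self, visual_feats, text_feats):
        # visual_feats: (B, visual_dim, H, W) from the convolutional backbone
        # text_feats:   (B, L, text_dim) from a language model such as RoBERTa
        v = self.visual_proj(visual_feats).flatten(2).transpose(1, 2)   # (B, H*W, d_model)
        t = self.text_proj(text_feats)                                  # (B, L, d_model)
        memory_input = torch.cat([v, t], dim=1)                         # shared multimodal sequence
        queries = self.query_embed.weight.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        hs = self.transformer(memory_input, queries)                    # decoded object queries
        return self.bbox_head(hs).sigmoid()                             # normalized boxes per query
```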
### Novelty in MDETR
We present the two additional loss functions used by MDETR, which encourage alignment between the image and the text. Both use the same source of annotations: free-form text with aligned bounding boxes.
The first, termed the soft token prediction loss, is a non-parametric alignment loss.
The second, termed the text-query contrastive alignment loss, is a parametric loss function enforcing similarity between aligned object queries and text tokens.
While the soft token prediction loss uses positional information to align objects to text, the contrastive alignment loss enforces alignment between the embedded representations of the objects at the output of the decoder and the text representations at the output of the cross encoder.
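As an illustration of the second loss, here is a minimal sketch (assumed shapes, temperature, and symmetric formulation; not MDETR's code) of a contrastive alignment term between object-query embeddings and text-token embeddings that have already been projected into a shared space. Here `positive_map[i, j]` marks that query `i` should align with token `j`.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(queries, tokens, positive_map, temperature=0.07):
    """queries: (N, d) decoder outputs; tokens: (L, d) text features; positive_map: (N, L) bool."""
    positive_map = positive_map.float()
    queries = F.normalize(queries, dim=-1)
    tokens = F.normalize(tokens, dim=-1)
    logits = queries @ tokens.T / temperature               # (N, L) similarity matrix

    # For each object query, pull its aligned tokens closer than all other tokens.
    log_prob_q2t = F.log_softmax(logits, dim=1)
    loss_q2t = -(log_prob_q2t * positive_map).sum(1) / positive_map.sum(1).clamp(min=1)

    # Symmetric term: for each token, pull its aligned object queries closer.
    log_prob_t2q = F.log_softmax(logits, dim=0)
    loss_t2q = -(log_prob_t2q * positive_map).sum(0) / positive_map.sum(0).clamp(min=1)

    return (loss_q2t.mean() + loss_t2q.mean()) / 2
```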
During MDETR pre-training, the model is trained to detect all objects mentioned in the question. To extend it for question answering, we provide QA specific queries in addition to the object queries as input to the transformer decoder. We use specialized heads for different question types.
> [!TIP]
>
> This paper really shows the power of the transformer architecture in object detection.
>
> I was shocked by the in-context learning ability of the model: it knows which object the word "what" refers to in the text query.
>
> I wonder if it is possible to use this model in reverse to generate a concise and comprehensive description of an image. Maybe it could be combined with an image generation model, with one acting as the generator and the other as the discriminator, to capture the essence of the topology of the image. Would that train a better image generation model and a better transformer for object description?


@@ -1,3 +1,4 @@
export default {
index: "Math 401, Fall 2025: Overview of thesis",
Math401_S1: "Math 401, Fall 2025: Thesis notes, Section 1",
}


@@ -0,0 +1,24 @@
import time

import pyperclip

def clean_clipboard_content():
    # Get the current content of the clipboard
    clipboard_content = pyperclip.paste()
    # Remove line breaks and collapse runs of whitespace into single spaces
    cleaned_content = ' '.join(clipboard_content.split())
    # Round-trip through UTF-8 (validates that the text is encodable)
    utf8_content = cleaned_content.encode('utf-8').decode('utf-8')
    # Replace the clipboard content with the cleaned text
    pyperclip.copy(utf8_content)
    return utf8_content

previous_content = ""
while True:
    current_content = pyperclip.paste()
    if current_content != previous_content:
        # Clean new clipboard content and remember the cleaned result,
        # so the rewritten clipboard does not re-trigger a clean.
        previous_content = clean_clipboard_content()
    time.sleep(0.1)  # Check for new content every 0.1 seconds