From de936c4f836df8a6996dca753a1134442588ac96 Mon Sep 17 00:00:00 2001
From: Zheyuan Wu <60459821+Trance-0@users.noreply.github.com>
Date: Tue, 23 Sep 2025 21:14:58 -0500
Subject: [PATCH] updates

---
 content/CSE5519/CSE5519_A2.md | 28 ++++++++++++++++++++++++++++
 content/CSE5519/CSE5519_E2.md | 20 ++++++++++++++++++--
 2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/content/CSE5519/CSE5519_A2.md b/content/CSE5519/CSE5519_A2.md
index 0ecb884..0326cd5 100644
--- a/content/CSE5519/CSE5519_A2.md
+++ b/content/CSE5519/CSE5519_A2.md
@@ -4,4 +4,32 @@
 
 [link to the paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Cheng_Masked-Attention_Mask_Transformer_for_Universal_Image_Segmentation_CVPR_2022_paper.pdf)
 
+### Definitions in Semantic Segmentation
+
+- Semantic Segmentation: classify each pixel into a semantic class (return a class label for every pixel; instances of the same class are not distinguished).
+- Panoptic Segmentation: classify each pixel into a semantic class and an instance identity, unifying semantic and instance segmentation.
+- Instance Segmentation: segment each object instance (return one mask with a single class label per instance).
+
 ### Novelty in Masked-Attention Mask Transformer
+
+The authors propose a new universal architecture that covers semantic, instance, and panoptic segmentation.
+
+#### Masked-Attention in the Transformer
+
+Masked attention is a variant of cross-attention that only attends within the foreground region of the mask predicted by the previous decoder layer. This accelerates convergence, under the hypothesis that local features around the predicted region are sufficient to update the query features, while context information can still be gathered through self-attention.
+
+#### Multi-scale high-resolution features
+
+The feature pyramid produced by the pixel decoder at several resolutions of the original image is used, with a positional embedding and a learnable scale-level embedding added, so that one scale is fed to one Transformer decoder layer at a time.
+
+#### Improvements on training and inference
+
+Dropout is not necessary and usually decreases performance.
+
+Importance sampling is used to compute the mask loss on a few sampled points instead of entire masks, which reduces memory usage during training.
+
+> [!TIP]
+>
+> Compared with previous work, this paper shows the potential of a universal architecture for panoptic segmentation by replacing the cross-attention in the Transformer decoder with masked attention.
+>
+> However, compared with other Transformer-based works, this paper does not show the generalization ability of the model across different datasets: additional training is required to adapt the model to a new dataset. Is this due to the lack of a generalizable dataset? If we increase the variance of the dataset, will the model generalize better? Will performance on specialized datasets degrade compared with a model trained on a single dataset?
\ No newline at end of file
diff --git a/content/CSE5519/CSE5519_E2.md b/content/CSE5519/CSE5519_E2.md
index b6c97c7..3d1edf9 100644
--- a/content/CSE5519/CSE5519_E2.md
+++ b/content/CSE5519/CSE5519_E2.md
@@ -1,5 +1,21 @@
 # CSE5519 Advances in Computer Vision (Topic E: 2022: Deep Learning for Geometric Computer Vision)
 
-## MeshLoc: Mesh-Based Visual Localization
+## Map-free Visual Relocalization: Metric Pose Relative to a Single Image
 
-[link to the paper](https://arxiv.org/pdf/2210.05494)
\ No newline at end of file
+[link to the paper](https://arxiv.org/pdf/2210.05494)
+
+This paper proposes a map-free visual relocalization method that estimates the metric pose of a query image relative to a single reference image, so only two images are used in total.
+
+### Novelty in Map-free Visual Relocalization
+
+A single image is used as the reference view to estimate the metric pose of a query image; unlike classical two-view geometry, which recovers translation only up to an unknown scale factor, the metric scale must be estimated as well (see the sketch below).
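+
+For contrast, here is a minimal sketch of the classical two-view baseline, not the paper's method; the helper `relative_pose` and its arguments are my own hypothetical illustration:
+
+```python
+# Classical two-view relative pose from matched keypoints: the essential
+# matrix gives a rotation and a unit-norm translation direction, so the
+# metric scale is unobservable from two images alone -- this is the gap
+# that map-free relocalization has to close.
+import cv2
+import numpy as np
+
+def relative_pose(kpts0: np.ndarray, kpts1: np.ndarray, K: np.ndarray):
+    """Estimate (R, t) between two views; ||t|| is arbitrary (scale-free)."""
+    E, inliers = cv2.findEssentialMat(kpts0, kpts1, K, method=cv2.RANSAC)
+    _, R, t, _ = cv2.recoverPose(E, kpts0, kpts1, K, mask=inliers)
+    return R, t  # t has unit norm; metric scale needs extra information
+```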
+
+Relative pose regression is used to estimate the pose of the query image relative to the reference view (a minimal sketch of this idea is given at the end of this note).
+
+> [!TIP]
+>
+> This paper reminds me of DispNet, which estimates disparity from the left and right images. They also feed the data across several ResNet layers.
+>
+> After reading this paper, I'm impressed by the custom dataset of more than 600 places of interest around the world. The authors consider the frame-to-frame variation (lighting conditions, seasonal changes, etc.) of traditional datasets too low compared with their custom dataset, where photos are taken by different people with different equipment. This provides a good benchmark for evaluating model performance.
+>
+> However, I did not see the performance of the model on traditional datasets. How does the model adapt to them? Can map-free visual relocalization improve performance on traditional datasets, or on tasks like structure from motion and pose estimation? I also wonder how long the model takes to train, and how efficient it is in memory usage and inference time compared with traditional models.
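+
+Below is a minimal sketch of relative pose regression, assuming a siamese ResNet-18 encoder and a quaternion-plus-translation head; this is my own simplification, and `RelPoseRegressor` with its layer sizes is hypothetical rather than the paper's exact architecture:
+
+```python
+# Relative pose regression sketch: a shared (siamese) CNN encodes the
+# reference and query images; an MLP head regresses a unit quaternion for
+# rotation and a translation in meters, so metric scale is learned from data.
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import torchvision.models as models
+
+class RelPoseRegressor(nn.Module):
+    def __init__(self):
+        super().__init__()
+        backbone = models.resnet18(weights=None)
+        # drop the classification head, keep the 512-d global feature
+        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
+        self.head = nn.Sequential(
+            nn.Linear(2 * 512, 256), nn.ReLU(),
+            nn.Linear(256, 7),  # 4 quaternion values + 3 translation values
+        )
+
+    def forward(self, ref: torch.Tensor, qry: torch.Tensor):
+        f_ref = self.encoder(ref).flatten(1)  # (B, 512)
+        f_qry = self.encoder(qry).flatten(1)  # (B, 512)
+        out = self.head(torch.cat([f_ref, f_qry], dim=1))
+        q = F.normalize(out[:, :4], dim=1)    # unit quaternion (rotation)
+        t = out[:, 4:]                        # metric translation
+        return q, t
+
+# usage: q, t = RelPoseRegressor()(ref_batch, qry_batch) with (B, 3, H, W) inputs
+```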