This commit is contained in:
Zheyuan Wu
2025-09-23 21:14:58 -05:00
parent 620b8c99f4
commit de936c4f83
2 changed files with 46 additions and 2 deletions


@@ -4,4 +4,32 @@
[link to the paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Cheng_Masked-Attention_Mask_Transformer_for_Universal_Image_Segmentation_CVPR_2022_paper.pdf)
### Definitions in Image Segmentation
- Semantic Segmentation: classify each pixel into a semantic class (return a class label for every pixel, without separating object instances).
- Panoptic Segmentation: classify each pixel into a semantic class and, for countable "thing" classes, an instance identity.
- Instance Segmentation: detect and segment each object instance (return one mask per instance, each with a single class label).
### Novelty in Masked-Attention Mask Transformer
The authors propose Mask2Former, a universal architecture that handles semantic, instance, and panoptic segmentation with a single model design.
#### Masked-Attention in the Transformer
Masked attention is a variant of cross-attention that only attends within the foreground region of the mask predicted by the previous decoder layer. This accelerates convergence, under the assumption that the local features inside that region are sufficient to update the query features and gather context.
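A minimal sketch of the mechanism (tensor names and the empty-mask fallback are my assumptions about a typical implementation, not the authors' code): attention logits outside each query's predicted foreground mask are set to -inf before the softmax, so each query can only attend inside its own mask region.

```python
import torch

def masked_attention(queries, keys, values, mask_logits, threshold=0.5):
    """Cross-attention restricted to each query's predicted foreground mask.

    queries:     (B, Q, C) query features from the Transformer decoder
    keys/values: (B, N, C) flattened image features from the pixel decoder
    mask_logits: (B, Q, N) mask prediction from the previous decoder layer
    """
    attn = torch.einsum("bqc,bnc->bqn", queries, keys) / queries.shape[-1] ** 0.5
    # Only attend inside the thresholded foreground region of each mask.
    foreground = mask_logits.sigmoid() > threshold
    attn = attn.masked_fill(~foreground, float("-inf"))
    # If a query's mask is completely empty, fall back to uniform attention
    # over all locations to avoid NaNs from an all--inf softmax.
    empty = foreground.sum(dim=-1, keepdim=True) == 0
    attn = torch.where(empty, torch.zeros_like(attn), attn)
    return torch.einsum("bqn,bnc->bqc", attn.softmax(dim=-1), values)
```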
#### Multi-scale high-resolution features
Use the feature pyramid produced by the pixel decoder at 1/32, 1/16, and 1/8 of the original image resolution, add a sinusoidal positional embedding and a learnable scale-level embedding to each scale, and feed one scale to each successive Transformer decoder layer in round-robin order.
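A rough sketch of how the pyramid could be prepared for the decoder (module and tensor names are my assumptions):

```python
import torch
import torch.nn as nn

class MultiScaleFeatures(nn.Module):
    """Flatten pixel-decoder features and attach positional/scale embeddings."""

    def __init__(self, channels=256, num_scales=3):
        super().__init__()
        # One learnable scale-level embedding per feature resolution.
        self.level_embed = nn.Embedding(num_scales, channels)

    def forward(self, feature_pyramid, pos_embeds):
        # feature_pyramid / pos_embeds: lists of (B, C, H_i, W_i) tensors,
        # e.g. at 1/32, 1/16, and 1/8 of the input resolution.
        out = []
        for i, (feat, pos) in enumerate(zip(feature_pyramid, pos_embeds)):
            x = feat.flatten(2).transpose(1, 2)   # (B, H_i*W_i, C)
            p = pos.flatten(2).transpose(1, 2)    # (B, H_i*W_i, C)
            out.append(x + p + self.level_embed.weight[i])
        return out  # decoder layer l attends to out[l % num_scales]
```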
#### Improvements on the training and inference
Dropout is not necessary and usually decreases performance.
Compute the mask loss on a small set of sampled points (chosen with importance sampling) instead of the full-resolution mask, which significantly reduces training memory.
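A simplified sketch of a point-sampled mask loss (PointRend-style; the function name, defaults, and the exact sampling split are my assumptions):

```python
import torch
import torch.nn.functional as F

def point_sampled_mask_loss(pred_logits, gt_masks, num_points=12544,
                            oversample=3, importance_ratio=0.75):
    """BCE mask loss on sampled points instead of the full-resolution mask.

    pred_logits, gt_masks: (B, 1, H, W); gt_masks is a float 0/1 tensor.
    """
    B, device = pred_logits.shape[0], pred_logits.device
    # 1. Oversample random candidate points in normalized [0, 1] coordinates.
    coords = torch.rand(B, num_points * oversample, 2, device=device)
    grid = coords[:, None] * 2 - 1                      # (B, 1, P, 2) in [-1, 1]
    logits = F.grid_sample(pred_logits, grid, align_corners=False)[:, 0, 0]
    # 2. Keep the most uncertain candidates (logits near 0), fill the rest randomly.
    n_imp = int(num_points * importance_ratio)
    idx = (-logits.abs()).topk(n_imp, dim=1).indices
    rand = torch.randint(num_points * oversample, (B, num_points - n_imp),
                         device=device)
    idx = torch.cat([idx, rand], dim=1)
    picked = torch.gather(coords, 1, idx[:, :, None].expand(-1, -1, 2))
    grid = picked[:, None] * 2 - 1
    # 3. Evaluate prediction and ground truth only at the sampled points.
    pred = F.grid_sample(pred_logits, grid, align_corners=False)[:, 0, 0]
    gt = F.grid_sample(gt_masks, grid, align_corners=False)[:, 0, 0]
    return F.binary_cross_entropy_with_logits(pred, gt)
```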
> [!TIP]
>
> Compared with previous works, this paper shows the potential of a universal segmentation architecture by replacing the standard cross-attention in the Transformer decoder with masked attention.
>
> However, compared with other Transformer-based works, this paper does not demonstrate the model's generalization across datasets: additional training is required to adapt the model to each new dataset. Is this due to the lack of a sufficiently diverse dataset? If we increased the variance of the training data, would the model generalize better? And would its performance on specialized datasets degrade compared with a model trained on a single dataset?


@@ -1,5 +1,21 @@
# CSE5519 Advances in Computer Vision (Topic E: 2022: Deep Learning for Geometric Computer Vision)
## MeshLoc: Mesh-Based Visual Localization
## Map-free Visual Relocalization: Metric Pose Relative to a Single Image
[link to the paper](https://arxiv.org/pdf/2210.05494)
This paper proposes a map-free visual relocalization method that estimates the metric pose of a query image relative to a single reference image, so only two images are used in total.
### Novelty in Map-free Visual Relocalization
Use a single image of a place of interest as the reference view and estimate the metric relative pose of a query image; classical two-view geometry alone recovers this pose only up to a scale factor, so the metric scale must come from the model itself.
Use relative pose regression to estimate the pose, including its metric scale, relative to the reference view.
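A toy sketch of a relative pose regression head (the architecture, backbone choice, and output parameterization are my assumptions, not the paper's model): both images are encoded, their features concatenated, and the network regresses a rotation quaternion plus a metric translation. Predicting the translation directly in metres is what makes the pose metric rather than up-to-scale.

```python
import torch
import torch.nn as nn
import torchvision

class RelativePoseRegressor(nn.Module):
    """Regress the metric relative pose between a reference and a query image."""

    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        # Drop the classification head; keep global-average-pooled features.
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.head = nn.Sequential(
            nn.Linear(2 * 512, 256), nn.ReLU(),
            nn.Linear(256, 7),  # 4 quaternion + 3 translation components
        )

    def forward(self, ref_img, query_img):
        f = torch.cat([self.encoder(ref_img).flatten(1),
                       self.encoder(query_img).flatten(1)], dim=1)
        out = self.head(f)
        quat = nn.functional.normalize(out[:, :4], dim=1)  # unit quaternion
        trans = out[:, 4:]                                 # translation in metres
        return quat, trans
```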
> [!TIP]
>
> This paper reminds me of DispNet, which estimates disparity from a left/right image pair; it likewise passes the data through several ResNet-style layers.
>
> After reading this paper, I am impressed by the custom dataset of 600+ places of interest around the world. The authors argue that the frame-to-frame variation in traditional datasets is too low (lighting conditions, seasonal changes, etc.) compared with their dataset, where photos are taken by different people with different equipment. This also yields a good benchmark for evaluating model performance.
>
> However, I did not see the model's performance on traditional datasets. How well does the model adapt to them? Could map-free visual relocalization improve performance on traditional datasets, or on tasks like structure from motion and pose estimation? I also wonder how long the model takes to train, and how efficient it is in memory usage and inference time compared with traditional models.