update preview questions

This commit is contained in:
Trance-0
2025-09-03 23:17:01 -05:00
parent ea138d56b8
commit 95bb726462
2 changed files with 89 additions and 0 deletions


@@ -1,2 +1,60 @@
# CSE5519 Advances in Computer Vision (Topic C: 2021 and before: Neural Rendering)
## NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
[link to the paper](https://arxiv.org/pdf/2003.08934)
We represent a static scene as a continuous 5D function:
$$
F: (\mathbf{x}, \mathbf{d}) = (x, y, z, \theta, \phi) \mapsto (\sigma, \mathbf{c})
$$
where $(x, y, z)$ denotes a 3D position in space, $(\theta, \phi)$ specifies a viewing direction, $\sigma$ is the volume density at point $(x, y, z)$ (which acts as a differential opacity controlling how much radiance is accumulated along a ray), and $\mathbf{c}$ is the emitted RGB radiance in direction $(\theta, \phi)$ at that point.
Our method learns this function $F$ by optimizing a deep, fully-connected neural network (a multilayer perceptron, or MLP) that maps each 5D input coordinate $(x, y, z, \theta, \phi)$ to a corresponding volume density $\sigma$ and view-dependent color $\mathbf{c}$.
The expected color of a camera ray $\mathbf{r}(t)=\mathbf{o}+t\mathbf{d}$, where $\mathbf{o}$ is the camera origin and $\mathbf{d}$ is the viewing direction, is:
$$
C(\mathbf{r})=\int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt
$$
where $T(t)$ is the transmittance along the ray, i.e. the probability that the ray travels from $t_n$ to $t$ without being absorbed:
$$
T(t)=\exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds\right)
$$
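In practice, $C(\mathbf{r})$ is estimated with quadrature over samples drawn along the ray. Below is a minimal NumPy sketch of the paper's discrete estimator $\hat{C}(r)=\sum_i T_i(1-\exp(-\sigma_i\delta_i))c_i$; the function name and the random inputs standing in for MLP outputs are hypothetical.
```python
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Quadrature estimate of C(r) from densities and colors sampled along one ray.

    sigmas: (N,) volume densities at the sample depths
    colors: (N, 3) RGB radiance at the sample depths
    t_vals: (N,) increasing sample depths in [t_n, t_f]
    """
    # delta_i = t_{i+1} - t_i; the last sample gets an effectively infinite interval
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)
    alphas = 1.0 - np.exp(-sigmas * deltas)            # per-sample opacity
    # T_i = exp(-sum_{j<i} sigma_j delta_j) via an exclusive cumulative product
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                           # w_i = T_i (1 - exp(-sigma_i delta_i))
    return weights @ colors                            # expected ray color, shape (3,)

# Hypothetical stand-ins for network outputs along one ray
t = np.linspace(2.0, 6.0, 64)
color = render_ray(np.random.rand(64), np.random.rand(64, 3), t)
```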
### Novelty in NeRF
#### Positional encoding
Deep networks are biased towards learning lower-frequency functions. The authors additionally show that mapping the inputs to a higher-dimensional space using high-frequency functions before passing them to the network enables better fitting of data that contains high-frequency variation.
Let $\gamma(p)$ be the positional encoding of $p$ that maps $\mathbb{R}$ to $\mathbb{R}^{2L}$ where $L$ is the number of frequencies.
$$
\gamma(p)=\left[\sin\left(2^0\pi p\right), \cos\left(2^0\pi p\right), \ldots, \sin\left(2^{L-1}\pi p\right), \cos\left(2^{L-1}\pi p\right)\right]
$$
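A minimal NumPy sketch of this encoding (function name mine); the sin and cos terms are grouped rather than interleaved, which only permutes the output components. The paper applies $\gamma$ to each coordinate of the normalized position $\mathbf{x}$ with $L=10$ and to each component of the unit viewing direction with $L=4$.
```python
import numpy as np

def positional_encoding(p, L):
    """Map each scalar in p to its 2L-dimensional sinusoidal features.

    p: (...,) array of coordinates (the paper normalizes positions to [-1, 1])
    L: number of frequency bands
    Returns shape (..., 2L).
    """
    freqs = (2.0 ** np.arange(L)) * np.pi              # 2^0 pi, ..., 2^{L-1} pi
    angles = p[..., None] * freqs                      # (..., L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

x = np.array([0.3, -0.1, 0.7])                         # a hypothetical 3D position
gamma_x = positional_encoding(x, L=10)                 # shape (3, 20), fed to the MLP
```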
#### Hierarchical volume sampling
Optimize a "coarse" and a "fine" network simultaneously.
Let $\hat{C}_c(r)$ be the coarse prediction of the camera ray color.
$$
\hat{C}_c(r)=\sum_{i=1}^{N_c} w_i c_i,\quad w_i=T_i(1-\exp(-\sigma_i \delta_i))
$$
where $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples and $T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$.
Normalizing these weights as $\hat{w}_i = w_i / \sum_{j=1}^{N_c} w_j$ yields a piecewise-constant PDF along the ray. We sample a second set of $N_f$ locations from this distribution using inverse transform sampling, evaluate our "fine" network at the union of the first and second sets of samples, and compute the final rendered color of the ray $\hat{C}_f(r)$ using all $N_c+N_f$ samples.
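A minimal NumPy sketch of that inverse-transform step (names are mine): the normalized weights define a piecewise-constant PDF over the coarse depth bins, and we invert its piecewise-linear CDF at uniform samples.
```python
import numpy as np

def sample_pdf(bin_edges, weights, n_fine, rng=None):
    """Draw n_fine depths from the piecewise-constant PDF given by coarse weights.

    bin_edges: (N_c + 1,) depths bounding the coarse bins
    weights:   (N_c,) unnormalized weights w_i = T_i (1 - exp(-sigma_i delta_i))
    """
    rng = rng or np.random.default_rng()
    pdf = weights / np.sum(weights)                    # probability mass per bin
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])      # (N_c + 1,) piecewise-linear CDF
    u = rng.uniform(size=n_fine)
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(weights) - 1)
    denom = cdf[idx + 1] - cdf[idx]                    # invert the CDF linearly within each bin
    frac = (u - cdf[idx]) / np.where(denom > 0.0, denom, 1.0)
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])

# Hypothetical coarse pass: 64 bins between the near and far planes
edges = np.linspace(2.0, 6.0, 65)
t_fine = sample_pdf(edges, np.random.rand(64), n_fine=128)
# The fine network is then evaluated at the union of coarse and fine depths.
```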
> [!TIP]
>
> 1. This paper reminds me of Gaussian Splatting. In this paper's setting, we can treat the scene as a function of 5D coordinates (all the cameras are focused on the world origin). However, in general settings, we have 6D coordinates (3D position and 3D direction). Is there any way to use Gaussian Splatting to reconstruct the scene?
> 2. In the positional encoding, the function $\gamma(p)$ reminds me of the Fourier transform. Is there any connection between the two?


@@ -1,2 +1,33 @@
# CSE5519 Advances in Computer Vision (Topic F: 2021 and before: Representation Learning)
## A Simple Framework for Contrastive Learning of Visual Representations
[link to the paper](https://arxiv.org/pdf/2002.05709)
~~Laughing my ass off when I see 75% accuracy on ImageNet. Can't believe what the authors think a few years later, when Deep Learning is becoming the dominant paradigm in Computer Vision.~~
In this work, we introduce a simple framework for contrastive learning of visual representations, which we call SimCLR.
Wait, that IS a NEURAL NETWORK?
## General Framework
The SimCLR framework has four components (a sketch of the loss follows the list):
- A stochastic data augmentation module
- A neural network base encoder $f(\cdot)$
- A small neural network projection head $g(\cdot)$
- A contrastive loss function (NT-Xent)
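The contrastive loss is NT-Xent (normalized temperature-scaled cross-entropy). Here is a minimal NumPy sketch, assuming the two augmented views of image $k$ sit in rows $2k$ and $2k+1$ of the projection matrix; that layout and the names are my choice, not the paper's.
```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent over 2N projections z = g(f(x)), views of image k paired as rows (2k, 2k+1).

    For each row i, the positive is its partner view; all other 2N - 2 rows are negatives.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via unit vectors
    sim = (z @ z.T) / temperature                      # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # drop self-similarity from the softmax
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = np.arange(len(z)) ^ 1                        # partner index: 2k <-> 2k + 1
    return -log_prob[np.arange(len(z)), pos].mean()

# Hypothetical projection-head outputs for N = 4 images (8 augmented views)
loss = nt_xent_loss(np.random.randn(8, 128))
```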
## Novelty in SimCLR
Self-supervised contrastive learning built from composed data augmentations, a learnable nonlinear projection head between the representation and the contrastive loss, and large-batch training without a memory bank.
> [!TIP]
>
> In the section "Training with Large Batch Size", the authors mentioned that:
>
> To keep it simple, we do not train the model with a memory bank (Wu et al., 2018; He et al., 2019). Instead, we vary the training batch size N from 256 to 8192. A batch size of 8192 gives us 16382 negative examples per positive pair from both augmentation views.
>
> They also use the LARS optimizer to stabilize training at these batch sizes.
>
> What does "memory bank" mean here? And what is the LARS optimizer, and how does it benefit the training?