upgrade structures and migrate to nextra v4

This commit is contained in:
Zheyuan Wu
2025-07-06 12:40:25 -05:00
parent 76e50de44d
commit 717520624d
317 changed files with 18143 additions and 22777 deletions


@@ -0,0 +1,59 @@
# CSE559A Lecture 1
## Introducing the syllabus
See the syllabus on Canvas.
## Motivational introduction for computer vision
Computer vision is the study of automatically understanding and interpreting images and videos:
1. vision for measurement (measurement, segmentation)
2. vision for perception and interpretation (labeling)
3. search and organization (retrieval, image or video archives)
### What is an image?
A 2D array of numbers.
### Vision is hard
There is a close connection to graphics: graphics renders an image from a model, while computer vision has to recover the model from the image (an inverse problem).
#### Are A and B the same color?
It depends on the context and on what you mean by "the same".
todo
#### Chair detector example.
A naive detector: slide a template over the image with double for loops.
#### Our visual system is not perfect.
Some optical illusion images.
todo, embed images here.
### Ridiculously brief history of computer vision
1960s: interpretation of synthetic worlds
1970s: some progress on interpreting selected images
1980s: ANNs come and go; shift toward geometry and increased mathematical rigor
1990s: face recognition; statistical analysis in vogue
2000s: becoming useful; significant use of machine learning; large annotated datasets available; video processing starts.
2010s: Deep learning with ConvNets
2020s: Image synthesis; continued improvement across tasks, vision-language models.
## How computer vision is used now
### OCR, Optical Character Recognition
Technology to convert scanned docs to text.


@@ -0,0 +1,148 @@
# CSE559A Lecture 10
## Convolutional Neural Networks
### Convolutional Layer
Output feature map resolution depends on padding and stride
Padding: add zeros around the input image
Stride: the step of the convolution
Example:
1. Convolutional layer for 5x5 image with 3x3 kernel, padding 1, stride 1 (no skipping pixels)
   - Input: 5x5 image
   - Output: 5x5 feature map, (5-3+2*1)/1+1=5
2. Convolutional layer for 5x5 image with 3x3 kernel, padding 1, stride 2 (skipping pixels)
   - Input: 5x5 image
   - Output: 3x3 feature map, (5-3+2*1)/2+1=3
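A minimal PyTorch check of the two output sizes above, a sketch of the formula out = (in - kernel + 2*padding)/stride + 1:
```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 5, 5)  # a single-channel 5x5 input
print(nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=1)(x).shape)  # torch.Size([1, 1, 5, 5])
print(nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)(x).shape)  # torch.Size([1, 1, 3, 3])
```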
_Learned weights can be thought of as local templates_
```python
import torch
import torch.nn as nn
# suppose input image is HxWx3 (assume RGB image)
conv_layer = nn.Conv2d(in_channels=3, # input channel, input is HxWx3
out_channels=64, # output channel (number of filters), output is HxWx64
kernel_size=3, # kernel size
padding=1, # padding, this ensures that the output feature map has the same resolution as the input image, H_out=H_in, W_out=W_in
stride=1) # stride
```
Usually followed by a ReLU activation function
```python
conv_layer = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1, stride=1)
relu = nn.ReLU()
```
Suppose the input is $H\times W\times K$ and the output feature map is $H\times W\times L$ with kernel size $F\times F$; this layer has $F^2\times K\times L$ weights and requires about $F^2\times K\times L\times H\times W$ multiply-add operations.
Viewed as matrix multiplication (im2col), each convolution multiplies a $D\times (K^2C)$ filter matrix with a $(K^2C)\times N$ patch matrix, where $D$ is the number of filters (output channels), $C$ the number of input channels, $K$ the kernel size, and $N$ the number of output locations.
### Variants 1x1 convolutions, depthwise convolutions
#### 1x1 convolutions
![1x1 convolution](https://notenextra.trance-0.com/CSE559A/1x1_layer.png)
1x1 convolution: $F=1$. This layer operates at the pixel level; it is a **pixel-wise** linear combination (mixing) of the feature channels.
Used to save computation, reduce the number of parameters.
Example: 3x3 conv layer with 256 channels at input and output.
Option 1: naive way:
```python
conv_layer = nn.Conv2d(in_channels=256, out_channels=256, kernel_size=3, padding=1, stride=1)
```
This takes $256\times 3 \times 3\times 256=589,824$ parameters.
Option 2: 1x1 convolution:
```python
conv_layer = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1, padding=0, stride=1)
conv_layer = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1, stride=1)
conv_layer = nn.Conv2d(in_channels=64, out_channels=256, kernel_size=1, padding=0, stride=1)
```
This takes $256\times 1\times 1\times 64 + 64\times 3\times 3\times 64 + 64\times 1\times 1\times 256 = 16,384 + 36,864 + 16,384 = 69,632$ parameters.
This loses some information, but saves a lot of parameters.
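A minimal PyTorch check of the two parameter counts above (bias terms disabled so the numbers match):
```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

naive = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1, bias=False),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(64, 256, kernel_size=1, bias=False),
)
print(n_params(naive))       # 589824
print(n_params(bottleneck))  # 69632
```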
#### Depthwise convolutions
Depthwise convolution: each of the $K$ input channels is convolved with its own filter, giving a $K\to K$ feature map; this saves computation and reduces the number of parameters.
![Depthwise convolution](https://notenextra.trance-0.com/CSE559A/Depthwise_layer.png)
#### Grouped convolutions
Grouped convolution: split the channels into groups and convolve each group independently, in a similar manner (depthwise convolution is the special case of one channel per group).
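Both depthwise and grouped convolutions map onto the `groups` argument of `nn.Conv2d`; a minimal sketch:
```python
import torch.nn as nn

K = 64
# depthwise: one 3x3 filter per input channel (groups == number of channels), K -> K feature maps
depthwise = nn.Conv2d(K, K, kernel_size=3, padding=1, groups=K)
# a pointwise 1x1 conv often follows to mix channels (depthwise-separable convolution)
pointwise = nn.Conv2d(K, 128, kernel_size=1)
# grouped: channels are split into groups that are convolved independently (here 4 groups)
grouped = nn.Conv2d(K, 128, kernel_size=3, padding=1, groups=4)
```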
### Backward pass
Vector-matrix form:
$$
\frac{\partial e}{\partial x}=\frac{\partial e}{\partial z}\frac{\partial z}{\partial x}
$$
Suppose a 1D convolution with a 3-tap kernel $(w_1,w_2,w_3)$; the input feature map is $\ldots, x_{i-1}, x_i, x_{i+1}, \ldots$, and $\ldots, z_{i-1}, z_i, z_{i+1}, \ldots$ is the output feature map, then:
The convolution operation can be written as:
$$
z_i = w_1x_{i-1} + w_2x_i + w_3x_{i+1}
$$
The gradient with respect to the input is:
$$
\frac{\partial e}{\partial x_i} = \sum_{j=-1}^{1}\frac{\partial e}{\partial z_{i+j}}\frac{\partial z_{i+j}}{\partial x_i} = \sum_{j=-1}^{1}\frac{\partial e}{\partial z_{i+j}}w_{2-j}
$$
### Max-pooling
Get max value in the local region.
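In PyTorch this is `nn.MaxPool2d`; a minimal sketch:
```python
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # max over each 2x2 region, halving H and W
```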
#### Receptive field
The receptive field of a unit is the region of the input feature map whose values contribute to the response of that unit (either in the previous layer or in the initial image)
## Architecture of CNNs
### AlexNet (2012-2013)
Successor of LeNet-5, but with a few significant changes
- Max pooling, ReLU nonlinearity
- Dropout regularization
- More data and bigger model (7 hidden layers, 650K units, 60M params)
- GPU implementation (50x speedup over CPU)
- Trained on two GPUs for a week
#### Key points
Most floating point operations occur in the convolutional layers.
Most of the memory usage is in the early convolutional layers.
Nearly all parameters are in the fully-connected layers.
### VGGNet (2014)
### GoogLeNet (2014)
### ResNet (2015)
### Beyond ResNet (2016 and onward): Wide ResNet, ResNeXT, DenseNet


@@ -0,0 +1,141 @@
# CSE559A Lecture 11
## Continue on Architecture of CNNs
### AlexNet (2012-2013)
Successor of LeNet-5, but with a few significant changes
- Max pooling, ReLU nonlinearity
- Dropout regularization
- More data and bigger model (7 hidden layers, 650K units, 60M params)
- GPU implementation (50x speedup over CPU)
- Trained on two GPUs for a week
#### Architecture for AlexNet
- Input: 224x224x3
- 11x11 conv, stride 4, 96 filters
- 3x3 max pooling, stride 2
- 5x5 conv, 256 filters, padding 2
- 3x3 max pooling, stride 2
- 3x3 conv, 384 filters, padding 1
- 3x3 conv, 384 filters, padding 1
- 3x3 conv, 256 filters, padding 1
- 3x3 max pooling, stride 2
- 4096-unit FC, ReLU
- 4096-unit FC, ReLU
- 1000-unit FC, softmax
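A rough `nn.Sequential` sketch of the convolutional stack listed above (exact padding and shape details vary between AlexNet descriptions; `torchvision.models.alexnet` is the reference implementation):
```python
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(), nn.MaxPool2d(3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(3, stride=2),
)
# the three fully connected layers (4096, 4096, 1000) follow after flattening
```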
#### Key points for AlexNet
Most floating point operations occur in the convolutional layers.
Most of the memory usage is in the early convolutional layers.
Nearly all parameters are in the fully-connected layers.
#### Further refinement (ZFNet, 2013)
Best paper award at ILSVRC 2013.
Nicely visualizes the feature maps.
### VGGNet (2014)
All the cov layers are 3x3 filters with stride 1 and padding 1. Take advantage of pooling to reduce the spatial dimensionality.
#### Architecture for VGGNet
- Input: 224x224x3
- 3x3 conv, 64 filters, padding 1
- 3x3 conv, 64 filters, padding 1
- 2x2 max pooling, stride 2
- 3x3 conv, 128 filters, padding 1
- 3x3 conv, 128 filters, padding 1
- 2x2 max pooling, stride 2
- 3x3 conv, 256 filters, padding 1
- 3x3 conv, 256 filters, padding 1
- 2x2 max pooling, stride 2
- 3x3 conv, 512 filters, padding 1
- 3x3 conv, 512 filters, padding 1
- 3x3 conv, 512 filters, padding 1
- 2x2 max pooling, stride 2
- 3x3 conv, 512 filters, padding 1
- 3x3 conv, 512 filters, padding 1
- 3x3 conv, 512 filters, padding 1
- 2x2 max pooling, stride 2
- 4096-unit FC, ReLU
- 4096-unit FC, ReLU
- 1000-unit FC, softmax
#### Key points for VGGNet
- Sequence of deeper networks trained progressively
- Large receptive fields replaced by successive layers of 3x3 convs with ReLU in between
- A single 7x7 conv takes $49K^2$ parameters, while three stacked 3x3 convs take $27K^2$ (for $K$ channels in and out)
#### Pretrained models
- Use the pretrained network as a feature extractor (remove the last layer and train a new linear layer): transfer learning (see the sketch below)
- Add RNN layers to generate captions
- Fine-tune the model for the new task (finetuning)
- Keep the earlier layers fixed and only train the new prediction layer
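A minimal transfer-learning sketch (assumes a recent torchvision; the 10-class head is an arbitrary example):
```python
import torch.nn as nn
import torchvision.models as models

model = models.vgg16(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False                  # freeze the pretrained feature extractor
model.classifier[-1] = nn.Linear(4096, 10)   # new prediction layer for a 10-class task
```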
### GoogLeNet (2014)
Stem network at the start aggressively downsamples input.
#### Key points for GoogLeNet
- Parallel paths with different receptive field sizes and operations are a means to capture sparse patterns of correlations in the stack of feature maps
- Use 1x1 convs to reduce dimensionality
- Use Global Average Pooling (GAP) to replace the fully connected layer
- Auxiliary classifiers to improve training
  - Training using loss at the end of the network didn't work well: the network is too deep, and gradients don't provide useful model updates
  - As a hack, attach "auxiliary classifiers" at several intermediate points in the network that also try to classify the image and receive loss
  - _GoogLeNet was before batch normalization; with batch normalization, the auxiliary classifiers were removed._
### ResNet (2015)
152 layers
[ResNet paper](https://arxiv.org/abs/1512.03385)
#### Key points for ResNet
- The residual module
- Introduce `skip` or `shortcut` connections to avoid the degradation problem
- Make it easy for network layers to represent the identity mapping
- Directly performing 3×3 convolutions with 256 feature maps at input and output:
- $256 \times 256 \times 3 \times 3 \approx 600K$ operations
- Using 1×1 convolutions to reduce 256 to 64 feature maps, followed by 3×3 convolutions, followed by 1×1 convolutions to expand back to 256 maps:
- $256 \times 64 \times 1 \times 1 \approx 16K$
- $64 \times 64 \times 3 \times 3 \approx 36K$
- $64 \times 256 \times 1 \times 1 \approx 16K$
- Total $\approx 70K$
_Possibly the first model with top-5 error rate better than human performance._
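A minimal PyTorch sketch of a bottleneck residual block as described above (batch normalization omitted for brevity):
```python
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, channels=256, mid=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1), nn.ReLU(),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(mid, channels, kernel_size=1),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # the skip connection makes it easy to represent the identity mapping
        return self.relu(x + self.body(x))
```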
### Beyond ResNet (2016 and onward): Wide ResNet, ResNeXT, DenseNet
#### Wide ResNet
Reduce number of residual blocks, but increase number of feature maps in each block
- More parallelizable, better feature reuse
- 16-layer WRN outperforms 1000-layer ResNets, though with much larger # of parameters
#### ResNeXt
- Propose “cardinality” as a new factor in network design, apart from depth and width
- Claim that increasing cardinality is a better way to increase capacity than increasing depth or width
#### DenseNet
- Use Dense block between conv layers
- Less parameters than ResNet
Next class:
Transformer architectures


@@ -0,0 +1,159 @@
# CSE559A Lecture 12
## Transformer Architecture
### Outline
**Self-Attention Layers**: An important network module, which often has a global receptive field
**Sequential Input Tokens**: Breaking the restriction to 2d input arrays
**Positional Encodings**: Representing the metadata of each input token
**Exemplar Architecture**: The Vision Transformer (ViT)
**Moving Forward**: What does this new module enable? Who wins in the battle between transformers and CNNs?
### The big picture
CNNs
- Local receptive fields
- Struggles with global content
- Shape of intermediate layers is sometimes a pain
Things we might want:
- Use information from across the image
- More flexible shape handling
- Multiple modalities
Our Hero: MultiheadAttention
Use positional encodings to represent the metadata of each input token
## Self-Attention layers
### Comparing ways of handling sequential data
#### RNN
![Image of RNN](https://notenextra.trance-0.com/CSE559A/RNN.png)
Works on **Ordered Sequences**
- Good at long sequences: after one RNN layer, the final hidden state $h_T$ sees the whole sequence
- Bad at parallelization: need to compute hidden states sequentially
#### 1D conv
![Image of 1D conv](https://notenextra.trance-0.com/CSE559A/1D_Conv.png)
Works on **Multidimensional Grids**
- Bad at long sequences: need to stack many conv layers for outputs to see the whole sequence
- Good at parallelization: Each output can be computed in parallel
#### Self-Attention
![Image of self-attention](https://notenextra.trance-0.com/CSE559A/Self_Attention.png)
Works on **Set of Vectors**
- Good at Long sequences: Each output can attend to all inputs
- Good at parallelization: Each output can be computed in parallel
- Bad at saving memory: Need to store all inputs in memory
### Encoder-Decoder Architecture
The encoder is constructed by stacking multiple self-attention layers and feed-forward networks.
#### Word Embeddings
Translate tokens to vector space
```python
class Embedder(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        return self.embed(x)
```
#### Positional Embeddings
The positional encodings are a way to represent the position of each token in the sequence.
Combined with the word embeddings, we get the input to the self-attention layer with information about the position of each token in the sequence.
> The reason why we just add the positional encodings to the word embeddings is _perhaps_ that we want the model to self-assign weights to the word-token and positional-token.
#### Query, Key, Value
The query, key, and value are the three components of the self-attention layer.
They are used to compute the attention weights.
```python
class SelfAttention(nn.Module):
    def __init__(self, d_model, d_k, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.d_k = d_k
        # separate linear projections for queries, keys, and values
        self.q_linear = nn.Linear(d_model, d_k)
        self.k_linear = nn.Linear(d_model, d_k)
        self.v_linear = nn.Linear(d_model, d_k)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(d_k, d_k)

    def forward(self, q, k, v, mask=None):
        # project inputs to query, key, and value spaces
        k = self.k_linear(k)
        q = self.q_linear(q)
        v = self.v_linear(v)
        # calculate attention weights and apply them to the values
        outputs = attention(q, k, v, self.d_k, mask, self.dropout)
        # apply output linear transformation
        outputs = self.out(outputs)
        return outputs
```
#### Attention
```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, d_k, mask=None, dropout=None):
    # scaled dot-product attention scores
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        mask = mask.unsqueeze(1)
        scores = scores.masked_fill(mask == 0, -1e9)
    scores = F.softmax(scores, dim=-1)
    if dropout is not None:
        scores = dropout(scores)
    outputs = torch.matmul(scores, v)
    return outputs
```
The query, key are used to compute the attention map, and the value is used to compute the attention output.
#### Multi-Head self-attention
The multi-head self-attention is a self-attention layer that has multiple heads.
Each head has its own query, key, and value.
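A minimal sketch using PyTorch's built-in `nn.MultiheadAttention` (illustrative shapes; `batch_first=True` assumes a reasonably recent PyTorch):
```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
x = torch.randn(2, 100, 256)        # batch of 2 sequences, 100 tokens, d_model = 256
out, attn_weights = mha(x, x, x)    # self-attention: queries, keys, and values are all x
print(out.shape)                    # torch.Size([2, 100, 256])
```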
### Computing Attention Efficiently
- Standard attention has $O(n^2)$ complexity in the sequence length
- Sparse and linear attention variants can reduce this toward $O(n)$


@@ -0,0 +1,59 @@
# CSE559A Lecture 13
## Positional Encodings
### Fixed Positional Encodings
Set of sinusoids of different frequencies.
$$
f(p,2i)=\sin(\frac{p}{10000^{2i/d}})\quad f(p,2i+1)=\cos(\frac{p}{10000^{2i/d}})
$$
[source](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)
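A minimal PyTorch sketch of these fixed sinusoidal encodings (assumes an even $d$):
```python
import torch

def positional_encoding(num_positions, d_model):
    p = torch.arange(num_positions).unsqueeze(1).float()   # positions p
    i = torch.arange(0, d_model, 2).float()                # even dimension indices 2i
    angles = p / (10000 ** (i / d_model))
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(angles)                        # f(p, 2i)
    pe[:, 1::2] = torch.cos(angles)                        # f(p, 2i+1)
    return pe
```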
### Positional Encodings in Reconstruction
An MLP has a hard time learning high-frequency information from raw scalar inputs $(x,y)$.
Example: network mapping from $(x,y)$ to $(r,g,b)$.
### Generalized Positional Encodings
- Dependence on location, scale, metadata, etc.
- Can just be fully learned (use `nn.Embedding` and optimize based on a categorical input.)
## Vision Transformer (ViT)
### Class Token
In Vision Transformers, a special token called the class token is added to the input sequence to aggregate information for classification tasks.
### Hidden CNN Modules
- PxP convolution with stride P (split the image into patches and use positional encoding); see the sketch below
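A minimal PyTorch sketch of the patch-embedding step (patch size 16 and embedding dimension 768 are typical ViT-Base values, used here as assumptions):
```python
import torch
import torch.nn as nn

P, d_model = 16, 768
patch_embed = nn.Conv2d(3, d_model, kernel_size=P, stride=P)   # each PxP patch -> one token
x = torch.randn(1, 3, 224, 224)
tokens = patch_embed(x).flatten(2).transpose(1, 2)              # shape [1, 196, 768]: 14x14 patches
```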
### ViT + ResNet Hybrid
Build a hybrid model that applies the vision transformer on top of the features from a 50-layer ResNet.
## Moving Forward
At least for now, CNN and ViT architectures reach similar performance, at least on ImageNet.
- General Consensus: once the architecture is big enough, and not designed terribly, it can do well.
- Differences remain:
- Computational efficiency
- Ease of use in other tasks and with other input data
- Ease of training
## Wrap up
Self attention as a key building block
Flexible input specification using tokens with positional encodings
A wide variety of architectural styles
Up Next:
Training deep neural networks


@@ -0,0 +1,73 @@
# CSE559A Lecture 14
## Object Detection
Evaluation metric: AP (Average Precision).
### Benchmarks
#### PASCAL VOC Challenge
20 Challenge classes.
CNN increases the accuracy of object detection.
#### COCO dataset
Common objects in context.
Semantic segmentation: every pixel is assigned a class label.
Instance segmentation: every pixel is classified and grouped into object instances.
### Object detection: outline
Proposal generation
Object recognition
#### R-CNN
Proposal generation: use selective search to generate region proposals.
Feature extraction: use a CNN (AlexNet fine-tuned on PASCAL VOC) to extract features from each proposal.
Classification: use SVMs to classify the proposals.
Pros:
- Much more accurate than previous approaches
- Any deep architecture can immediately be "plugged in"
Cons:
- Not a single end-to-end trainable system
  - Fine-tune network with softmax classifier (log loss)
  - Train post-hoc linear SVMs (hinge loss)
  - Train post-hoc bounding box regressors (least squares)
- Training is slow (~2000 CNN passes per image)
- Inference (detection) was slow
#### Fast R-CNN
Proposal generation (selective search, as before).
Run the CNN once over the whole image, then extract per-proposal features from the shared feature map.
##### ROI pooling and ROI alignment
ROI pooling:
- Divide each proposal into a fixed grid on the shared feature map and max-pool within each cell (the proposal is quantized to feature-map coordinates)
ROI alignment:
- Sample the feature map at exact, non-quantized locations inside the proposal using bilinear interpolation
Use bounding box regression to refine the proposal.


@@ -0,0 +1,131 @@
# CSE559A Lecture 15
## Continue on object detection
### Two strategies for object detection
#### R-CNN: Region proposals + CNN features
![R-CNN](https://notenextra.trance-0.com/CSE559A/R-CNN.png)
#### Fast R-CNN: CNN features + RoI pooling
![Fast R-CNN](https://notenextra.trance-0.com/CSE559A/Fast-R-CNN.png)
Use bilinear interpolation to get the features of the proposal.
#### Region of interest pooling
![RoI pooling](https://notenextra.trance-0.com/CSE559A/RoI-pooling.png)
Gradients can be backpropagated through RoI pooling into the shared feature map.
### New materials
#### Faster R-CNN
Use a CNN (the region proposal network) to generate region proposals, and a second network head to classify them; the two share convolutional features.
##### Region proposal network
Idea: put an "anchor box" of fixed size over each position in the feature map and try to predict whether this box is likely to contain an object.
Introduce anchor boxes at multiple scales and aspect ratios to handle a wider range of object sizes and shapes.
![Anchor boxes](https://notenextra.trance-0.com/CSE559A/Anchor-boxes.png)
### Single-stage and multi-resolution detection
#### YOLO
You only look once (YOLO) is a state-of-the-art, real-time object detection system.
1. Take conv feature maps at 7x7 resolution
2. Add two FC layers to predict, at each location, a score for each class and 2 bboxes with confidences
For PASCAL, output is 7×7×30 (30=20 + 2(4+1))
![YOLO](https://notenextra.trance-0.com/CSE559A/YOLO.png)
##### YOLO Network Head
```python
# Keras-style sketch of the YOLO head; the backbone convolutional layers before this point are omitted.
# 'lrelu' assumes a registered leaky-ReLU activation, and YOLO_Reshape is a custom layer.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense, Dropout
from tensorflow.keras.regularizers import l2

model = Sequential()
# ... YOLO backbone convolutional layers ...
model.add(Conv2D(1024, (3, 3), activation='lrelu', kernel_regularizer=l2(0.0005)))
model.add(Conv2D(1024, (3, 3), activation='lrelu', kernel_regularizer=l2(0.0005)))
# use flatten layer for global reasoning
model.add(Flatten())
model.add(Dense(512))
model.add(Dense(1024))
model.add(Dropout(0.5))
model.add(Dense(7 * 7 * 30, activation='sigmoid'))
model.add(YOLO_Reshape(target_shape=(7, 7, 30)))  # reshape the output to the 7x7x30 grid
model.summary()
```
#### YOLO results
1. Each grid cell predicts only two boxes and can only have one class, which limits the number of nearby objects that can be predicted
2. Localization accuracy suffers compared to Fast(er) R-CNN due to coarser features, errors on small boxes
3. 7x speedup over Faster R-CNN (45-155 FPS vs. 7-18 FPS)
#### YOLOv2
1. Remove FC layer, do convolutional prediction with anchor boxes instead
2. Increase resolution of input images and conv feature maps
3. Improve accuracy using batch normalization and other tricks
#### SSD
SSD is a multi-resolution object detector.
![SSD](https://notenextra.trance-0.com/CSE559A/SSD.png)
1. Predict boxes of different size from different conv maps
2. Each level of resolution has its own predictor
##### Feature Pyramid Network
- Improve predictive power of lower-level feature maps by adding contextual information from higher-level feature maps
- Predict different sizes of bounding boxes from different levels of the pyramid (but share parameters of predictors)
#### RetinaNet
RetinaNet combines a feature pyramid network with the focal loss, which down-weights the standard cross-entropy loss for well-classified examples.
![RetinaNet](https://notenextra.trance-0.com/CSE559A/RetinaNet.png)
> Cross-entropy loss:
> $$CE(p_t) = - \log(p_t)$$
The focal loss is defined as:
$$
FL(p_t) = - (1 - p_t)^{\gamma} \log(p_t)
$$
We can increase $\gamma$ to reduce the loss for well-classified examples.
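A minimal sketch of the focal loss as a function of $p_t$, the predicted probability of the true class:
```python
import torch

def focal_loss(p_t, gamma=2.0):
    # down-weights well-classified examples (p_t close to 1); gamma = 0 recovers cross-entropy
    return -((1 - p_t) ** gamma) * torch.log(p_t)
```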
#### YOLOv3
Minor refinements
### Alternative approaches
#### CornerNet
Use a pair of corners to represent the bounding box.
Use hourglass network to accumulate the information of the corners.
#### CenterNet
Use a center point to represent the bounding box.
#### Detection Transformer
Use transformer architecture to detect the object.
![DETR](https://notenextra.trance-0.com/CSE559A/DETR.png)
DETR uses a conventional CNN backbone to learn a 2D representation of an input image. The model flattens it and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, which we call object queries, and additionally attends to the encoder output. We pass each output embedding of the decoder to a shared feed forward network (FFN) that predicts either a detection (class and bounding box) or a "no object" class.


@@ -0,0 +1,114 @@
# CSE559A Lecture 16
## Dense image labelling
### Semantic segmentation
Use one-hot encoding to represent the class of each pixel.
### General Network design
Design a network with only convolutional layers, make predictions for all pixels at once.
Can the network operate at full image resolution?
Practical solution: first downsample, then upsample
### Outline
- Upgrading a Classification Network to Segmentation
- Operations for dense prediction
- Transposed convolutions, unpooling
- Architectures for dense prediction
- DeconvNet, U-Net, "U-Net"
- Instance segmentation
- Mask R-CNN
- Other dense prediction problems
### Fully Convolutional Networks
"upgrading" a classification network to a dense prediction network
1. Convert "fully connected" layers to 1x1 convolutions
2. Make the input image larger
3. Upsample the output
Start with an existing classification CNN ("an encoder")
Then use bilinear interpolation and transposed convolutions to make full resolution.
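A minimal PyTorch sketch of step 1, converting a fully connected classifier head into an equivalent 1x1 convolution (the 512 features and 21 classes are assumed values):
```python
import torch.nn as nn

fc = nn.Linear(512, 21)                   # original classifier head: 512-dim features -> 21 classes
conv = nn.Conv2d(512, 21, kernel_size=1)  # same weights, reshaped into 1x1 filters
conv.weight.data.copy_(fc.weight.data.view(21, 512, 1, 1))
conv.bias.data.copy_(fc.bias.data)
# applied to a larger feature map, conv now outputs a spatial grid of class scores
```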
### Operations for dense prediction
#### Transposed Convolutions
Use the filter to "paint" in the output: place copies of the filter on the output, multiply by corresponding value in the input, sum where copies of the filter overlap
We can increase the resolution of the output by using a larger stride in the convolution.
- For stride 2, dilate the input by inserting rows and columns of zeros between adjacent entries, convolve with flipped filter
- Sometimes called convolution with fractional input stride 1/2
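A minimal PyTorch sketch: a stride-2 transposed convolution doubles the spatial resolution.
```python
import torch
import torch.nn as nn

up = nn.ConvTranspose2d(in_channels=64, out_channels=64, kernel_size=2, stride=2)
x = torch.randn(1, 64, 16, 16)
print(up(x).shape)   # torch.Size([1, 64, 32, 32])
```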
#### Unpooling
Max unpooling:
- Remember the locations of the maxima from the corresponding max-pooling layer
- Place each value at its remembered location in the output and fill the rest with zeros
Nearest neighbor unpooling:
- Copy each input value to all locations of its output region
### Architectures for dense prediction
#### DeconvNet
![DeconvNet](https://notenextra.trance-0.com/CSE559A/DeconvNet.png)
_How is information about location encoded in the network?_
#### U-Net
![U-Net](https://notenextra.trance-0.com/CSE559A/U-Net.png)
- Like FCN, fuse upsampled higher-level feature maps with higher-res, lower-level feature maps (like residual connections)
- Unlike FCN, fuse by concatenation, predict at the end
#### Extended U-Net Architecture
Many variants of U-Net would replace the "encoder" of the U-Net with other architectures.
![Extended U-Net Architecture Example](https://notenextra.trance-0.com/CSE559A/ExU-Net.png)
##### Encoder/Decoder v.s. U-Net
![Encoder/Decoder v.s. U-Net](https://notenextra.trance-0.com/CSE559A/EncoderDecoder_vs_U-Net.png)
### Instance Segmentation
#### Mask R-CNN
Mask R-CNN = Faster R-CNN + FCN on Region of Interest
### Extend to keypoint prediction?
- Use a similar architecture to Mask R-CNN
_Continue on Tuesday_
### Other tasks
#### Panoptic feature pyramid network
![Panoptic Feature Pyramid Network](https://notenextra.trance-0.com/CSE559A/Panoptic_Feature_Pyramid_Network.png)
#### Depth and normal estimation
![Depth and Normal Estimation](https://notenextra.trance-0.com/CSE559A/Depth_and_Normal_Estimation.png)
D. Eigen and R. Fergus, Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture, ICCV 2015
#### Colorization
R. Zhang, P. Isola, and A. Efros, Colorful Image Colorization, ECCV 2016


@@ -0,0 +1,184 @@
# CSE559A Lecture 17
## Local Features
### Types of local features
#### Edge
Goal: Identify sudden changes in image intensity
Generate edge map as human artists.
An edge is a place of rapid change in the image intensity function.
Take the absolute value of the first derivative of the image intensity function.
For 2d functions, $\frac{\partial f}{\partial x}=\lim_{\Delta x\to 0}\frac{f(x+\Delta x)-f(x)}{\Delta x}$
For discrete images data, $\frac{\partial f}{\partial x}\approx \frac{f(x+1)-f(x)}{1}$
Run convolution with kernel $[1,0,-1]$ to get the first derivative in the x direction, without shifting. (generic kernel is $[1,-1]$)
Prewitt operator:
$$
M_x=\begin{bmatrix}
1 & 0 & -1 \\
1 & 0 & -1 \\
1 & 0 & -1 \\
\end{bmatrix}
\quad
M_y=\begin{bmatrix}
1 & 1 & 1 \\
0 & 0 & 0 \\
-1 & -1 & -1 \\
\end{bmatrix}
$$
Sobel operator:
$$
M_x=\begin{bmatrix}
1 & 0 & -1 \\
2 & 0 & -2 \\
1 & 0 & -1 \\
\end{bmatrix}
\quad
M_y=\begin{bmatrix}
1 & 2 & 1 \\
0 & 0 & 0 \\
-1 & -2 & -1 \\
\end{bmatrix}
$$
Roberts operator:
$$
M_x=\begin{bmatrix}
1 & 0 \\
0 & -1 \\
\end{bmatrix}
\quad
M_y=\begin{bmatrix}
0 & 1 \\
-1 & 0 \\
\end{bmatrix}
$$
Image gradient:
$$
\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)
$$
Gradient magnitude:
$$
||\nabla f|| = \sqrt{\left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2}
$$
Gradient direction:
$$
\theta = \tan^{-1}\left(\frac{\frac{\partial f}{\partial y}}{\frac{\partial f}{\partial x}}\right)
$$
The gradient points in the direction of the most rapid increase in intensity.
> Application: Gradient-domain image editing
>
> Goal: solve for pixel values in the target region to match gradients of the source region while keeping the rest of the image unchanged.
>
> [Poisson Image Editing](http://www.cs.virginia.edu/~connelly/class/2014/comp_photo/proj2/poisson.pdf)
Noisy edge detection:
When the intensity function is very noisy, we can use a Gaussian smoothing filter to reduce the noise before taking the gradient.
Suppose pixels of the true image $f_{i,j}$ are corrupted by Gaussian noise $n_{i,j}$ with mean 0 and variance $\sigma^2$.
Then the finite difference of adjacent noisy pixels is $g_{i,j}=(f_{i,j}+n_{i,j})-(f_{i,j+1}+n_{i,j+1})$, whose noise component is distributed as $N(0,2\sigma^2)$: differentiation amplifies noise.
To find edges, look for peaks in $\frac{d}{dx}(f\ast g)$ where $g$ is the Gaussian smoothing filter.
Or we can directly convolve with the derivative of the Gaussian, since $\frac{d}{dx}(f\ast g)=f\ast\frac{d}{dx}g$:
$$
\frac{d}{dx}g(x,\sigma)=-\frac{x}{\sigma^2}\cdot\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{x^2}{2\sigma^2}}
$$
##### Separability of the Gaussian filter
The 2D Gaussian is separable: it can be written as a product of two 1D filters.
$$
g(x,y,\sigma)=g(x,\sigma)\,g(y,\sigma),\quad g(x,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{x^2}{2\sigma^2}}
$$
##### Separable Derivative of Gaussian (DoG) filter
$$
\frac{d}{dx}g(x,y)\propto -x\exp\left(-\frac{x^2+y^2}{2\sigma^2}\right)
\quad \frac{d}{dy}g(x,y)\propto -y\exp\left(-\frac{x^2+y^2}{2\sigma^2}\right)
$$
##### Derivative of Gaussian: Scale
Using Gaussian derivatives with different values of $\sigma$ finds structures at different scales or frequencies
(Take the hybrid image as an example)
##### Canny edge detector
1. Smooth the image with a Gaussian filter
2. Compute the gradient magnitude and direction of the smoothed image
3. Thresholding gradient magnitude
4. Non-maxima suppression
- For each location `q` above the threshold, check that the gradient magnitude is higher than at adjacent points `p` and `r` in the direction of the gradient
5. Thresholding the non-maxima suppressed gradient magnitude
6. Hysteresis thresholding
- Use two thresholds: high and low
- Start with a seed edge pixel with a gradient magnitude greater than the high threshold
- Follow the gradient direction to find all connected pixels with a gradient magnitude greater than the low threshold
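A minimal sketch of running a Canny detector via OpenCV (the filenames are hypothetical; the two thresholds are the hysteresis thresholds):
```python
import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, threshold1=100, threshold2=200)  # low and high hysteresis thresholds
cv2.imwrite("edges.png", edges)
```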
##### Top-down segmentation
Data-driven top-down segmentation:
#### Interest point
Key point matching:
1. Find a set of distinctive keypoints in the image
2. Define a region of interest around each keypoint
3. Compute a local descriptor from the normalized region
4. Match local descriptors between images
Characteristic of good features:
- Repeatability
- The same feature can be found in several images despite geometric and photometric transformations
- Saliency
- Each feature is distinctive
- Compactness and efficiency
- Many fewer features than image pixels
- Locality
- A feature occupies a relatively small area of the image; robust to clutter and occlusion
##### Harris corner detector
### Applications of local features
#### Image alignment
#### 3D reconstruction
#### Motion tracking
#### Robot navigation
#### Indexing and database retrieval
#### Object recognition


@@ -0,0 +1,68 @@
# CSE559A Lecture 18
## Continue on Harris Corner Detector
Goal: Descriptor distinctiveness
- We want to be able to reliably determine which point goes with which.
- Must provide some invariance to geometric and photometric differences.
Harris corner detector:
> Other existing variants:
> - Hessian & Harris: [Beaudet '78], [Harris '88]
> - Laplacian, DoG: [Lindeberg '98], [Lowe 1999]
> - Harris-/Hessian-Laplace: [Mikolajczyk & Schmid '01]
> - Harris-/Hessian-Affine: [Mikolajczyk & Schmid '04]
> - EBR and IBR: [Tuytelaars & Van Gool '04]
> - MSER: [Matas '02]
> - Salient Regions: [Kadir & Brady '01]
> - Others…
### Deriving a corner detection criterion
- Basic idea: we should easily recognize the point by looking through a small window
- Shifting a window in any direction should give a large change in intensity
Corner is the point where the intensity changes in all directions.
Criterion:
Change in appearance of window $W$ for the shift $(u,v)$:
$$
E(u,v) = \sum_{x,y\in W} [I(x+u,y+v) - I(x,y)]^2
$$
First-order Taylor approximation for small shifts $(u,v)$:
$$
I(x+u,y+v) \approx I(x,y) + I_x u + I_y v
$$
plug into $E(u,v)$:
$$
\begin{aligned}
E(u,v) &= \sum_{(x,y)\in W} [I(x+u,y+v) - I(x,y)]^2 \\
&\approx \sum_{(x,y)\in W} [I(x,y) + I_x u + I_y v - I(x,y)]^2 \\
&= \sum_{(x,y)\in W} [I_x u + I_y v]^2 \\
&= \sum_{(x,y)\in W} [I_x^2 u^2 + 2 I_x I_y u v + I_y^2 v^2]
\end{aligned}
$$
Consider the second moment matrix, summed over the window $W$:
$$
M = \sum_{(x,y)\in W}\begin{bmatrix}
I_x^2 & I_x I_y \\
I_x I_y & I_y^2
\end{bmatrix}
$$
In the coordinate frame of its eigenvectors, $M$ is diagonal with eigenvalues $a$ and $b$:
$$
M\sim\begin{bmatrix}
a & 0 \\
0 & b
\end{bmatrix}
$$
If either $a$ or $b$ is small, then the window is not a corner.
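A minimal NumPy/OpenCV sketch that builds the entries of the second moment matrix from image gradients and scores each pixel by the smaller eigenvalue of $M$ (the filename, window size, and box-filter weighting are assumptions):
```python
import cv2
import numpy as np

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
Ix = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
Iy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)
# windowed sums (here local averages) of gradient products: the entries of M at every pixel
Sxx = cv2.boxFilter(Ix * Ix, -1, (5, 5))
Syy = cv2.boxFilter(Iy * Iy, -1, (5, 5))
Sxy = cv2.boxFilter(Ix * Iy, -1, (5, 5))
# smaller eigenvalue of M = [[Sxx, Sxy], [Sxy, Syy]]; small values mean "not a corner"
trace = Sxx + Syy
det = Sxx * Syy - Sxy * Sxy
lambda_min = trace / 2 - np.sqrt(np.maximum((trace / 2) ** 2 - det, 0))
```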


@@ -0,0 +1,71 @@
# CSE559A Lecture 19
## Feature Detection
### Behavior of corner features with respect to Image Transformations
To be useful for image matching, “the same” corner features need to show up despite geometric and photometric transformations
We need to analyze how the corner response function and the corner locations change in response to various transformations
#### Affine intensity change
Solution:
- Only derivatives of the intensity are used, so the response is invariant to an additive intensity shift
- Intensity scaling changes the response magnitude, but corner locations are largely preserved
#### Image translation
Solution:
- Derivatives and window function are shift invariant
#### Image rotation
Second moment ellipse rotates but its shape (i.e. eigenvalues) remains the same
#### Scaling
Under scaling, a corner may be classified as a series of edges; the corner response is not scale invariant.
## Automatic Scale selection for interest point detection
### Scale space
We want to extract keypoints with characteristic scales that are equivariant (or covariant) with respect to scaling of the image
Approach: compute a scale-invariant response function over neighborhoods centered at each location $(x,y)$ and a range of scales $\sigma$, find scale-space locations $(x,y,\sigma)$ where this function reaches a local maximum
A particularly convenient response function is given by the scale-normalized Laplacian of Gaussian (LoG) filter:
$$
\nabla^2_{norm}g=\sigma^2\left(\frac{\partial^2 g}{\partial x^2}+\frac{\partial^2 g}{\partial y^2}\right)
$$
![Visualization of LoG](https://notenextra.trance-0.com/CSE559A/Laplacian_of_Gaussian.png)
#### Edge detection with LoG
![Edge detection with LoG](https://notenextra.trance-0.com/CSE559A/Edge_detection_with_LoG.png)
#### Blob detection with LoG
![Blob detection with LoG](https://notenextra.trance-0.com/CSE559A/Blob_detection_with_LoG.png)
### Difference of Gaussians (DoG)
DoG has a little more flexibility, since you can select the scales of the Gaussians.
### Scale-invariant feature transform (SIFT)
The main goal of SIFT is to enable image matching in the presence of significant transformations
- To recognize the same keypoint in multiple images, we need to match appearance descriptors or "signatures" in their neighborhoods
- Descriptors that are locally invariant w.r.t. scale and rotation can handle a wide range of global transformations
### Maximum stable extremal regions (MSER)
Based on Watershed segmentation algorithm
Select regions that are stable over a large parameter range


@@ -0,0 +1,165 @@
# CSE559A Lecture 2
## The Geometry of Image Formation
Mapping between image and world coordinates.
Today's focus:
$$
x=K[R\ t]X
$$
### Pinhole Camera Model
Add a barrier to block off most of the rays.
- Reduce blurring
- The opening known as the **aperture**
$f$ is the focal length.
$c$ is the center of the aperture.
#### Focal length/ Field of View (FOV)/ Zoom
- Focal length: the distance between the aperture (pinhole) and the image plane.
- Field of View (FOV): the angular extent of the scene captured; it is determined by the focal length and the sensor size.
- Zoom: increasing the focal length narrows the FOV and magnifies the image.
#### Other types of projection
Beyond the pinhole/perspective camera model, there are other types of projection.
- Radial distortion
- 360-degree camera
- Equirectangular Panoramas
- Random lens
- Rotating sensors
- Photofinishing
- Tiltshift lens
### Perspective Geometry
Length and area are not preserved.
Angle is not preserved.
But straight lines are still straight.
Parallel lines in the world intersect at a **vanishing point** on the image plane.
Vanishing lines: the set of all vanishing points of parallel lines in the world on the same plane in the world.
Vertical vanishing point at infinity.
### Camera/Projection Matrix
Linear projection model.
$$
x=K[R\ t]X
$$
- $x$: image coordinates 2d (homogeneous coordinates)
- $X$: world coordinates 3d (homogeneous coordinates)
- $K$: camera matrix (3x3 and invertible)
- $R$: camera rotation matrix (3x3)
- $t$: camera translation vector (3x1)
#### Homogeneous coordinates
- 2D: $$(x, y)\to\begin{bmatrix}x\\y\\1\end{bmatrix}$$
- 3D: $$(x, y, z)\to\begin{bmatrix}x\\y\\z\\1\end{bmatrix}$$
converting from homogeneous to inhomogeneous coordinates:
- 2D: $$\begin{bmatrix}x\\y\\w\end{bmatrix}\to(x/w, y/w)$$
- 3D: $$\begin{bmatrix}x\\y\\z\\w\end{bmatrix}\to(x/w, y/w, z/w)$$
When $w=0$, the point is at infinity.
Homogeneous coordinates are invariant under scaling by a non-zero scalar:
$$
k\begin{bmatrix}x\\y\\w\end{bmatrix}=\begin{bmatrix}kx\\ky\\kw\end{bmatrix}\to\left(\frac{kx}{kw},\frac{ky}{kw}\right)=\left(\frac{x}{w},\frac{y}{w}\right)
$$
A convenient way to represent a point at infinity is to use a unit vector.
Line equation: $ax+by+c=0$
$$
line_i=\begin{bmatrix}a_i\\b_i\\c_i\end{bmatrix}
$$
Append a 1 to pixel coordinates to get homogeneous coordinates.
$$
pixel_i=\begin{bmatrix}u_i\\v_i\\1\end{bmatrix}
$$
Line given by cross product of two points:
$$
line_i=pixel_1\times pixel_2
$$
Intersection of two lines given by cross product of the lines:
$$
pixel_i=line_1\times line_2
$$
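A minimal NumPy sketch of these point/line operations in homogeneous coordinates (the specific pixel values are made up):
```python
import numpy as np

p1 = np.array([100.0, 200.0, 1.0])    # two pixels in homogeneous coordinates
p2 = np.array([300.0, 250.0, 1.0])
line1 = np.cross(p1, p2)              # line through both points: [a, b, c] with ax + by + c = 0
line2 = np.array([1.0, 0.0, -150.0])  # another line, x = 150
p = np.cross(line1, line2)            # intersection of the two lines, in homogeneous coordinates
p = p / p[2]                          # back to inhomogeneous coordinates: (150.0, 212.5, 1.0)
```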
#### Pinhole Camera Projection Matrix
Intrinsic Assumptions:
- Unit aspect ratio
- No skew
- Optical center at (0,0)
Extrinsic Assumptions:
- No rotation
- No translation (camera at world origin)
$$
x=K[I\ 0]X\implies w\begin{bmatrix}u\\v\\1\end{bmatrix}=\begin{bmatrix}f&0&0&0\\0&f&0&0\\0&0&1&0\end{bmatrix}\begin{bmatrix}x\\y\\z\\1\end{bmatrix}
$$
Removing the assumptions:
Intrinsic assumptions:
- Unit aspect ratio
- No skew
Extrinsic assumptions:
- No rotation
- No translation
$$
x=K[I\ 0]X\implies w\begin{bmatrix}u\\v\\1\end{bmatrix}=\begin{bmatrix}\alpha&0&u_0&0\\0&\beta&v_0&0\\0&0&1&0\end{bmatrix}\begin{bmatrix}x\\y\\z\\1\end{bmatrix}
$$
Adding skew:
$$
x=K[I\ 0]X\implies w\begin{bmatrix}u\\v\\1\end{bmatrix}=\begin{bmatrix}\alpha&s&u_0&0\\0&\beta&v_0&0\\0&0&1&0\end{bmatrix}\begin{bmatrix}x\\y\\z\\1\end{bmatrix}
$$
Finally, adding camera rotation and translation:
$$
x=K[R\ t]X\implies w\begin{bmatrix}u\\v\\1\end{bmatrix}=\begin{bmatrix}\alpha&s&u_0\\0&\beta&v_0\\0&0&1\end{bmatrix}\begin{bmatrix}r_{11}&r_{12}&r_{13}&t_x\\r_{21}&r_{22}&r_{23}&t_y\\r_{31}&r_{32}&r_{33}&t_z\end{bmatrix}\begin{bmatrix}x\\y\\z\\1\end{bmatrix}
$$
What are the degrees of freedom of the camera matrix?
- rotation: 3
- translation: 3
- intrinsic matrix $K$: 5
Total: 11


@@ -0,0 +1,145 @@
# CSE559A Lecture 20
## Local feature descriptors
Detection: Identify the interest points
Description: Extract vector feature descriptor surrounding each interest point.
Matching: Determine correspondence between descriptors in two views
### Image representation
Histogram of oriented gradients (HOG)
- Quantization
- Grids: fast but applicable only with few dimensions
- Clustering: slower but can quantize data in higher dimensions
- Matching
- Histogram intersection or Euclidean may be faster
- Chi-squared often works better
- Earth movers distance is good for when nearby bins represent similar values
#### SIFT vector formation
Computed on rotated and scaled version of window according to computed orientation & scale
- resample the window
Based on gradients weighted by a Gaussian of variance half the window (for smooth falloff)
4x4 array of gradient orientation histogram weighted by magnitude
8 orientations x 4x4 array = 128 dimensions
Motivation: some sensitivity to spatial layout, but not too much.
For matching:
- Extraordinarily robust detection and description technique
- Can handle changes in viewpoint
- Up to about 60 degree out-of-plane rotation
- Can handle significant changes in illumination
- Sometimes even day vs. night
- Fast and efficient—can run in real time
- Lots of code available
#### SURF
- Fast approximation of SIFT idea
- Efficient computation by 2D box filters & integral images
- 6 times faster than SIFT
- Equivalent quality for object identification
#### Shape context
![Shape context descriptor](https://notenextra.trance-0.com/CSE559A/Shape_context_descriptor.png)
#### Self-similarity Descriptor
![Self-similarity descriptor](https://notenextra.trance-0.com/CSE559A/Self-similarity_descriptor.png)
## Local feature matching
### Matching
Simplest approach: Pick the nearest neighbor. Threshold on absolute distance
Problem: Lots of self similarity in many photos
Solution: Nearest neighbor with low ratio test
![Comparison of keypoint detectors](https://notenextra.trance-0.com/CSE559A/Comparison_of_keypoint_detectors.png)
## Deep Learning for Correspondence Estimation
![Deep learning for correspondence estimation](https://notenextra.trance-0.com/CSE559A/Deep_learning_for_correspondence_estimation.png)
## Optical Flow
### Field
Motion field: the projection of the 3D scene motion into the image
Magnitude of vectors is determined by metric motion
Only caused by motion
Optical flow: the apparent motion of brightness patterns in the image
Magnitude of vectors is measured in pixels
Can be caused by lighting changes
### Brightness constancy constraint, aperture problem
Machine Learning Approach
- Collect examples of inputs and outputs
- Design a prediction model suitable for the task
- Invariances, Equivariances; Complexity; Input and Output shapes and semantics
- Specify loss functions and train model
- Limitations: Requires training the model; Requires a sufficiently complete training dataset; Must re-learn known facts; Higher computational complexity
Optimization Approach
- Define properties we expect to hold for a correct solution
- Translate properties into a cost function
- Derive an algorithm to solve for the cost function
- Limitations: Often requires making overly simple assumptions on properties; Some tasks can't be easily defined
Given frames at times $t-1$ and $t$, estimate the apparent motion field $u(x,y)$ and $v(x,y)$ between them
Brightness constancy constraint: projection of the same point looks the same in every frame
$$
I(x,y,t-1) = I(x+u(x,y),y+v(x,y),t)
$$
Additional assumptions:
- Small motion: points do not move very far
- Spatial coherence: points move like their neighbors
Trick for solving:
Brightness constancy constraint:
$$
I(x,y,t-1) = I(x+u(x,y),y+v(x,y),t)
$$
Linearize the right-hand side using Taylor expansion:
$$
I(x,y,t-1) \approx I(x,y,t) + I_x u(x,y) + I_y v(x,y)
$$
$$
I_x u(x,y) + I_y v(x,y) + I(x,y,t) - I(x,y,t-1) = 0
$$
Hence,
$$
I_x u(x,y) + I_y v(x,y) + I_t = 0
$$


@@ -0,0 +1,215 @@
# CSE559A Lecture 21
## Continue on optical flow
### The brightness constancy constraint
$$
I_x u(x,y) + I_y v(x,y) + I_t = 0
$$
Given the gradients $I_x, I_y$ and $I_t$, can we uniquely recover the motion $(u,v)$?
- Suppose $(u, v)$ satisfies the constraint: $\nabla I \cdot (u,v) + I_t = 0$
- Then $\nabla I \cdot (u+u', v+v') + I_t = 0$ for any $(u', v')$ s.t. $\nabla I \cdot (u', v') = 0$
- Interpretation: the component of the flow perpendicular to the gradient (i.e., parallel to the edge) cannot be recovered!
#### Aperture problem
- The brightness constancy constraint is only valid for a small patch in the image
- For a large motion, the patch may look very different
Consider the barber pole illusion
### Estimating optical flow (Lucas-Kanade method)
- Consider a small patch in the image
- Assume the motion is constant within the patch
- Then we can solve for the motion $(u, v)$ by minimizing the error:
$$
I_x u(x,y) + I_y v(x,y) + I_t = 0
$$
How to get more equations for a pixel?
Spatial coherence constraint: assume the pixel's neighbors have the same $(u,v)$
If we have $n$ pixels in the neighborhood, then we can set up a linear least squares system:
$$
\begin{bmatrix}
I_x(x_1, y_1) & I_y(x_1, y_1) \\
\vdots & \vdots \\
I_x(x_n, y_n) & I_y(x_n, y_n)
\end{bmatrix}
\begin{bmatrix}
u \\ v
\end{bmatrix} = -\begin{bmatrix}
I_t(x_1, y_1) \\ \vdots \\ I_t(x_n, y_n)
\end{bmatrix}
$$
#### Lucas-Kanade flow
Let $A=
\begin{bmatrix}
I_x(x_1, y_1) & I_y(x_1, y_1) \\
\vdots & \vdots \\
I_x(x_n, y_n) & I_y(x_n, y_n)
\end{bmatrix}$
$b = \begin{bmatrix}
I_t(x_1, y_1) \\ \vdots \\ I_t(x_n, y_n)
\end{bmatrix}$
$d = \begin{bmatrix}
u \\ v
\end{bmatrix}$
The solution is $d=(A^T A)^{-1} A^T b$
Lucas-Kanade flow:
- Find $(u,v)$ minimizing $\sum_{i} (I(x_i+u,y_i+v,t)-I(x_i,y_i,t-1))^2$
- use Taylor approximation of $I(x_i+u,y_i+v,t)$ for small shifts $(u,v)$ to obtain closed-form solution
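A minimal NumPy sketch of this least-squares solve for one window, given spatial and temporal gradient patches `Ix`, `Iy`, `It` (the names are assumptions):
```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It):
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)  # n x 2 matrix of gradients
    b = -It.ravel()                                  # right-hand side
    d, *_ = np.linalg.lstsq(A, b, rcond=None)        # solves (A^T A) d = A^T b
    return d                                         # the flow (u, v) for this window
```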
### Refinement for Lucas-Kanade
In some cases, the Lucas-Kanade method may not work well:
- The motion is large (larger than a pixel)
- A point does not move like its neighbors
- Brightness constancy does not hold
#### Iterative refinement (for large motion)
Iterative Lucas-Kanade algorithm:
1. Estimate velocity at each pixel by solving the Lucas-Kanade equations
2. Warp $I_t$ towards $I_{t+1}$ using the estimated flow field
   - use image warping techniques
3. Repeat until convergence
Iterative refinement is limited due to aliasing
#### Coarse-to-fine refinement (for large motion)
- Estimate flow at a coarse level
- Refine the flow at a finer level
- Use the refined flow to warp the image
- Repeat until convergence
![Lucas Kanade coarse-to-fine refinement](https://notenextra.trance-0.com/CSE559A/Lucas_Kanade_coarse-to-fine_refinement.png)
#### Representing moving images with layers (for a point may not move like its neighbors)
- The image can be decomposed into a moving layer and a stationary layer
- The moving layer is the layer that moves
- The stationary layer is the layer that does not move
![Lucas Kanade refinement with layers](https://notenextra.trance-0.com/CSE559A/Lucas_Kanade_refinement_with_layers.png)
### SOTA models
#### 2009
Start with something similar to Lucas-Kanade
- gradient constancy
- energy minimization with smoothing term
- region matching
- keypoint matching (long-range)
#### 2015
Deep neural networks
- Use a deep neural network to represent the flow field
- Use synthetic data to train the network (e.g., the FlyingChairs dataset of rendered floating chairs)
#### 2023
GMFlow
use Transformer to model the flow field
## Robust Fitting of parametric models
Challenges:
- Noise in the measured feature locations
- Extraneous data: clutter (outliers), multiple lines
- Missing data: occlusions
### Least squares fitting
Normal least squares fitting
$y=mx+b$ is not a good model for the data since there might be vertical lines
Instead we use total least squares
Line parametrization: $ax+by=d$
$(a,b)$ is the unit normal to the line (i.e., $a^2+b^2=1$)
$d$ is the distance between the line and the origin
Perpendicular distance between point $(x_i, y_i)$ and line $ax+by=d$ (assuming $a^2+b^2=1$):
$$
|ax_i + by_i - d|
$$
Objective function:
$$
E = \sum_{i=1}^n (ax_i + by_i - d)^2
$$
Solve for $d$ first: $d =a\bar{x}+b\bar{y}$
Plugging back in:
$$
E = \sum_{i=1}^n (a(x_i-\bar{x})+b(y_i-\bar{y}))^2 = \left\|\begin{bmatrix}x_1-\bar{x}&y_1-\bar{y}\\\vdots&\vdots\\x_n-\bar{x}&y_n-\bar{y}\end{bmatrix}\begin{pmatrix}a\\b\end{pmatrix}\right\|^2
$$
We want to find $N$ that minimizes $\|UN\|^2$ subject to $\|N\|^2= 1$
Solution is given by the eigenvector of $U^T U$ associated with the smallest eigenvalue
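A minimal NumPy sketch of total least squares line fitting as derived above:
```python
import numpy as np

def fit_line_tls(pts):                  # pts: (n, 2) array of (x, y)
    mean = pts.mean(axis=0)
    U = pts - mean                      # centered data matrix
    # eigenvector of U^T U with the smallest eigenvalue gives the unit normal (a, b)
    w, V = np.linalg.eigh(U.T @ U)      # eigenvalues returned in ascending order
    a, b = V[:, 0]
    d = a * mean[0] + b * mean[1]
    return a, b, d
```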
Drawbacks:
- Sensitive to outliers
### Robust fitting
General approach: find model parameters 𝜃 that minimize
$$
\sum_{i} \rho_{\sigma}(r(x_i;\theta))
$$
$r(x_i;\theta)$: residual of $x_i$ w.r.t. model parameters $\theta$
$\rho_{\sigma}$: robust function with scale parameter $\sigma$, e.g., $\rho_{\sigma}(u)=\frac{u^2}{\sigma^2+u^2}$
Nonlinear optimization problem that must be solved iteratively
- Least squares solution can be used for initialization
- Scale of robust function should be chosen carefully
Drawbacks:
- Need to manually choose the robust function and scale parameter
### RANSAC
Voting schemes
Random sample consensus: very general framework for model fitting in the presence of outliers
Outline:
- Randomly choose a small initial subset of points
- Fit a model to that subset
- Find all inlier points that are "close" to the model and reject the rest as outliers
- Do this many times and choose the model with the most inliers
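A minimal sketch of this loop for line fitting, reusing the `fit_line_tls` sketch from earlier in these notes (a toy illustration of RANSAC, not the lecture's code):
```python
import numpy as np

def ransac_line(pts, n_iters=100, thresh=2.0):
    best_inliers = np.zeros(len(pts), dtype=bool)
    for _ in range(n_iters):
        sample = pts[np.random.choice(len(pts), 2, replace=False)]  # minimal subset of points
        a, b, d = fit_line_tls(sample)
        residuals = np.abs(pts @ np.array([a, b]) - d)               # point-to-line distances
        inliers = residuals < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_line_tls(pts[best_inliers])                           # refit on all inliers
```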
### Hough transform


@@ -0,0 +1,260 @@
# CSE559A Lecture 22
## Continue on Robust Fitting of parametric models
### RANSAC
#### Definition: RANdom SAmple Consensus
RANSAC is a non-deterministic algorithm for fitting a model to a set of data points that contains outliers.
Pros:
- Simple and general
- Applicable to many different problems
- Often works well in practice
Cons:
- Lots of parameters to set
- Number of iterations grows exponentially as outlier ratio increases
- Can't always get a good initialization of the model based on the minimum number of samples.
### Hough Transform
Use point-line duality to find lines.
In practice, we don't use (m,b) parameterization.
Instead, we use polar parameterization:
$$
\rho = x \cos \theta + y \sin \theta
$$
Algorithm outline:
- Initialize accumulator $H$ to all zeros
- For each feature point $(x,y)$
- For $\theta = 0$ to $180$
- $\rho = x \cos \theta + y \sin \theta$
- $H(\theta, \rho) += 1$
- Find the value(s) of $(\theta, \rho)$ where $H(\theta, \rho)$ is a local maximum (perform NMS on the accumulator array)
- The detected line in the image is given by $\rho = x \cos \theta + y \sin \theta$
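A minimal NumPy sketch of the accumulator loop above ($\rho$ is shifted by `rho_max` so that negative values index correctly; `rho_max` should be at least the image diagonal):
```python
import numpy as np

def hough_lines(points, rho_max, n_theta=180):
    thetas = np.deg2rad(np.arange(n_theta))
    H = np.zeros((n_theta, 2 * rho_max), dtype=int)        # accumulator over (theta, rho)
    for x, y in points:
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        H[np.arange(n_theta), rhos + rho_max] += 1         # one vote per theta for this point
    return H, thetas                                        # find local maxima of H afterwards
```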
#### Effect of noise
![Hough transform with noise](https://notenextra.trance-0.com/CSE559A/Hough_transform_noise.png)
Noise makes the peak fuzzy.
#### Effect of outliers
![Hough transform with outliers](https://notenextra.trance-0.com/CSE559A/Hough_transform_outliers.png)
Outliers can break the peak.
#### Pros and Cons
Pros:
- Can deal with non-locality and occlusion
- Can detect multiple instances of a model
- Some robustness to noise: noise points unlikely to contribute consistently to any single bin
- Leads to a surprisingly general strategy for shape localization (more on this next)
Cons:
- Complexity increases exponentially with the number of model parameters
- In practice, not used beyond three or four dimensions
- Non-target shapes can produce spurious peaks in parameter space
- It's hard to pick a good grid size
### Generalized Hough Transform
Template representation: for each type of landmark point, store all possible displacement vectors towards the center
Detecting the template:
For each feature in a new image, look up that feature type in the model and vote for the possible center locations associated with that type in the model
#### Implicit shape models
Training:
- Build codebook of patches around extracted interest points using clustering
- Map the patch around each interest point to closest codebook entry
- For each codebook entry, store all positions it was found, relative to object center
Testing:
- Given test image, extract patches, match to codebook entry
- Cast votes for possible positions of object center
- Search for maxima in voting space
- Extract weighted segmentation mask based on stored masks for the codebook occurrences
## Image alignment
### Affine transformation
Simple fitting procedure: linear least squares
Approximates viewpoint changes for roughly planar objects and roughly orthographic cameras
Can be used to initialize fitting for more complex models
Fitting an affine transformation:
$$
\begin{bmatrix}
&&&\cdots\\
x_i & y_i & 0&0&1&0\\
0&0&x_i&y_i&0&1\\
&&&\cdots\\
\end{bmatrix}
\begin{bmatrix}
m_1\\
m_2\\
m_3\\
m_4\\
t_1\\
t_2\\
\end{bmatrix}
=
\begin{bmatrix}
\vdots\\
x_i'\\
y_i'\\
\vdots\\
\end{bmatrix}
$$
Only need 3 points to solve for 6 parameters.
### Homography
Recall that
$$
x' = \frac{a x + b y + c}{g x + h y + i}, \quad y' = \frac{d x + e y + f}{g x + h y + i}
$$
Use 2D homogeneous coordinates:
$(x,y) \rightarrow \begin{pmatrix}x \\ y \\ 1\end{pmatrix}$
$\begin{pmatrix}x\\y\\w\end{pmatrix} \rightarrow (x/w,y/w)$
Reminder: all homogeneous coordinate vectors that are (non-zero) scalar multiples of each other represent the same point
Equation for homography in homogeneous coordinates:
$$
\begin{pmatrix}
x' \\
y' \\
1
\end{pmatrix}
\cong
\begin{pmatrix}
h_{11} & h_{12} & h_{13} \\
h_{21} & h_{22} & h_{23} \\
h_{31} & h_{32} & h_{33}
\end{pmatrix}
\begin{pmatrix}
x \\
y \\
1
\end{pmatrix}
$$
Constraint from a match $(x_i,x_i')$, $x_i'\cong Hx_i$
How can we get rid of the scale ambiguity?
Cross product trick:$x_i' × Hx_i=0$
The cross product is defined as:
$$
\begin{pmatrix}a\\b\\c\end{pmatrix} \times \begin{pmatrix}a'\\b'\\c'\end{pmatrix} = \begin{pmatrix}bc'-b'c\\ca'-c'a\\ab'-a'b\end{pmatrix}
$$
Let $h_1^T, h_2^T, h_3^T$ be the rows of $H$. Then
$$
x_i' × Hx_i=\begin{pmatrix}
x_i' \\
y_i' \\
1
\end{pmatrix} \times \begin{pmatrix}
h_1^T x_i \\
h_2^T x_i \\
h_3^T x_i
\end{pmatrix}
=
\begin{pmatrix}
y_i' h_3^T x_i - h_2^T x_i \\
h_1^T x_i - x_i' h_3^T x_i \\
x_i' h_2^T x_i - y_i' h_1^T x_i
\end{pmatrix}
$$
Rearranging the terms:
$$
\begin{bmatrix}
0^T &-x_i^T &y_i' x_i^T \\
x_i^T &0^T &-x_i' x_i^T \\
-y_i' x_i^T &x_i' x_i^T &0^T
\end{bmatrix}
\begin{bmatrix}
h_1 \\
h_2 \\
h_3
\end{bmatrix} = 0
$$
These equations aren't independent! So, we only need two.
### Robust alignment
#### Descriptor-based feature matching
Extract features
Compute putative matches
Loop:
- Hypothesize transformation $T$
- Verify transformation (search for other matches consistent with $T$)
#### RANSAC
Even after filtering out ambiguous matches, the set of putative matches still contains a very high percentage of outliers
RANSAC loop:
- Randomly select a seed group of matches
- Compute transformation from seed group
- Find inliers to this transformation
- If the number of inliers is sufficiently large, re-compute least-squares estimate of transformation on all of the inliers
At the end, keep the transformation with the largest number of inliers


@@ -0,0 +1,15 @@
# CSE559A Lecture 23
## DUSt3R
Dense and Unconstrained Stereo 3D Reconstruction of arbitrary image collections.
[Github DUST3R](https://github.com/naver/dust3r)


@@ -0,0 +1 @@


@@ -0,0 +1,217 @@
# CSE559A Lecture 25
## Geometry and Multiple Views
### Cues for estimating Depth
#### Multiple Views (the strongest depth cue)
Two common settings:
**Stereo vision**: a pair of cameras, usually with some constraints on the relative position of the two cameras.
**Structure from (camera) motion**: cameras observing a scene from different viewpoints
Structure and depth are inherently ambiguous from single views.
Other hints for depth:
- Occlusion
- Perspective effects
- Texture
- Object motion
- Shading
- Focus/Defocus
#### Focus on Stereo and Multiple Views
Stereo correspondence: Given a point in one of the images, where could its corresponding points be in the other images?
Structure: Given projections of the same 3D point in two or more images, compute the 3D coordinates of that point
Motion: Given a set of corresponding points in two or more images, compute the camera parameters
#### A simple example of estimating depth with stereo:
Stereo: shape from "motion" between two views
We'll need to consider:
- Info on camera pose ("calibration")
- Image point correspondences
![Simple stereo system](https://notenextra.trance-0.com/CSE559A/Simple_stereo_system.png)
Assume parallel optical axes and known camera parameters (i.e., calibrated cameras). What is the expression for $Z$?
Similar triangles $(p_l, P, p_r)$ and $(O_l, P, O_r)$:
$$
\frac{T-x_l+x_r}{Z-f}=\frac{T}{Z}
$$
$$
Z = \frac{f \cdot T}{x_l-x_r}
$$
### Camera Calibration
Use a scene with known geometry
- Correspond image points to 3d points
- Get least squares solution (or non-linear solution)
Solving unknown camera parameters:
$$
\begin{bmatrix}
su\\
sv\\
s
\end{bmatrix}
= \begin{bmatrix}
m_{11} & m_{12} & m_{13} & m_{14}\\
m_{21} & m_{22} & m_{23} & m_{24}\\
m_{31} & m_{32} & m_{33} & m_{34}
\end{bmatrix}
\begin{bmatrix}
X\\
Y\\
Z\\
1
\end{bmatrix}
$$
Method 1: Homogeneous linear system. Solve for m's entries using least squares.
$$
\begin{bmatrix}
X_1 & Y_1 & Z_1 & 1 & 0 & 0 & 0 & 0 & -u_1X_1 & -u_1Y_1 & -u_1Z_1 & -u_1 \\
0 & 0 & 0 & 0 & X_1 & Y_1 & Z_1 & 1 & -v_1X_1 & -v_1Y_1 & -v_1Z_1 & -v_1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\
X_n & Y_n & Z_n & 1 & 0 & 0 & 0 & 0 & -u_nX_n & -u_nY_n & -u_nZ_n & -u_n \\
0 & 0 & 0 & 0 & X_n & Y_n & Z_n & 1 & -v_nX_n & -v_nY_n & -v_nZ_n & -v_n
\end{bmatrix}
\begin{bmatrix} m_{11} \\ m_{12} \\ m_{13} \\ m_{14} \\ m_{21} \\ m_{22} \\ m_{23} \\ m_{24} \\ m_{31} \\ m_{32} \\ m_{33} \\ m_{34} \end{bmatrix} = 0
$$
Method 2: Non-homogeneous linear system. Solve for m's entries using least squares.
**Advantages**
- Easy to formulate and solve
- Provides initialization for non-linear methods
**Disadvantages**
- Doesn't directly give you camera parameters
- Doesn't model radial distortion
- Can't impose constraints, such as known focal length
**Non-linear methods are preferred**
- Define error as difference between projected points and measured points
- Minimize error using Newton's method or other non-linear optimization
#### Triangulation
Given projections of a 3D point in two or more images (with known camera matrices), find the coordinates of the point
##### Approach 1: Geometric approach
Find shortest segment connecting the two viewing rays and let $X$ be the midpoint of that segment
![Triangulation geometric approach](https://notenextra.trance-0.com/CSE559A/Triangulation_geometric_approach.png)
##### Approach 2: Non-linear optimization
Minimize error between projected point and measured point
$$
||\operatorname{proj}(P_1 X) - x_1||_2^2 + ||\operatorname{proj}(P_2 X) - x_2||_2^2
$$
![Triangulation non-linear optimization](https://notenextra.trance-0.com/CSE559A/Triangulation_non_linear_optimization.png)
##### Approach 3: Linear approach
$x_1\cong P_1X$ and $x_2\cong P_2X$
$x_1\times P_1X = 0$ and $x_2\times P_2X = 0$
$[x_1]_{\times}P_1X = 0$ and $[x_2]_{\times}P_2X = 0$
Rewrite as:
$$
a\times b=\begin{bmatrix}
0 & -a_3 & a_2\\
a_3 & 0 & -a_1\\
-a_2 & a_1 & 0
\end{bmatrix}
\begin{bmatrix}
b_1\\
b_2\\
b_3
\end{bmatrix}
=[a_{\times}]b
$$
Using **singular value decomposition**, we can solve for $X$
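A numpy sketch of the linear approach, assuming `P1`, `P2` are the known $3\times 4$ camera matrices and `x1`, `x2` are the corresponding homogeneous image points:
```python
import numpy as np

def skew(a):
    # Matrix form of the cross product: skew(a) @ b == np.cross(a, b)
    return np.array([[0, -a[2], a[1]],
                     [a[2], 0, -a[0]],
                     [-a[1], a[0], 0]])

def triangulate_linear(P1, P2, x1, x2):
    # Stack [x1]x P1 X = 0 and [x2]x P2 X = 0 and solve the homogeneous system by SVD
    A = np.vstack([skew(x1) @ P1, skew(x2) @ P2])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]   # back to non-homogeneous coordinates
```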
### Epipolar Geometry
What constraints must hold between two projections of the same 3D point?
Given a 2D point in one view, where can we find the corresponding point in the other view?
Given only 2D correspondences, how can we calibrate the two cameras, i.e., estimate their relative position and orientation and the intrinsic parameters?
Key ideas:
- We can answer all these questions without knowledge of the 3D scene geometry
- Important to think about projections of camera centers and visual rays into the other view
#### Epipolar Geometry Setup
![Epipolar geometry setup](https://notenextra.trance-0.com/CSE559A/Epipolar_geometry_setup.png)
Suppose we have two cameras with centers $O,O'$
The baseline is the line connecting the two camera centers
Epipoles $e,e'$ are where the baseline intersects the image planes, or projections of the other camera in each view
Consider a point $X$, which projects to $x$ and $x'$
The plane formed by $X,O,O'$ is called an epipolar plane
There is a family of planes passing through $O$ and $O'$
Each epipolar line is the projection of the other camera's viewing ray into the image plane
**Epipolar lines** connect the epipoles to the projections of $X$
Equivalently, they are intersections of the epipolar plane with the image planes; thus, they come in matching pairs.
**Application**: This constraint can be used to find correspondences between points in the two cameras: given a feature in one image, its match in the other image must lie on the corresponding epipolar line.
![Epipolar line for converging cameras](https://notenextra.trance-0.com/CSE559A/Epipolar_line_for_converging_cameras.png)
Epipoles are finite and may be visible in the image.
![Epipolar line for parallel cameras](https://notenextra.trance-0.com/CSE559A/Epipolar_line_for_parallel_cameras.png)
Epipoles are at infinity; epipolar lines are parallel.
![Epipolar line for perpendicular cameras](https://notenextra.trance-0.com/CSE559A/Epipolar_line_for_perpendicular_cameras.png)
Epipole is "focus of expansion" and coincides with the principal point of the camera
Epipolar lines go out from principal point
Next class:
### The Essential and Fundamental Matrices
### Dense Stereo Matching

View File

@@ -0,0 +1,177 @@
# CSE559A Lecture 26
## Continue on Geometry and Multiple Views
### The Essential and Fundamental Matrices
#### Math of the epipolar constraint: Calibrated case
Recall Epipolar Geometry
![Epipolar Geometry Configuration](https://notenextra.trance-0.com/CSE559A/Epipolar_geometry_setup.png)
Epipolar constraint:
Take the first camera frame as the world frame, so that for a 3D point $y$ we have $x\cong [I|0]\begin{pmatrix}y\\1\end{pmatrix}=y$ and $x'\cong [R|t]\begin{pmatrix}y\\1\end{pmatrix}=Ry+t$.
Since $x'$, $t$, and $Ry$ are coplanar, $x'\cdot [t\times (Ry)]=0$, which we can rewrite as
$$
x'^T E x = 0,\quad E=[t]_{\times}R
$$
where $E$ is the **Essential Matrix**.
$E x$ is the epipolar line associated with $x$ ($l'=Ex$)
$E^T x'$ is the epipolar line associated with $x'$ ($l=E^T x'$)
$E e=0$ and $E^T e'=0$ ($x$ and $x'$ don't matter)
$E$ is singular (rank 2) and has five degrees of freedom.
#### Epipolar constraint: Uncalibrated case
If the calibration matrices $K$ and $K'$ are unknown, we can write the epipolar constraint in terms of unknown normalized coordinates:
$$
x'^T_{norm} E x_{norm} = 0
$$
where $x_{norm}=K^{-1} x$, $x'_{norm}=K'^{-1} x'$
$$
x'^T_{norm} E x_{norm} = 0\implies x'^T Fx=0
$$
where $F=K'^{-T}EK^{-1}$ is the **Fundamental Matrix**.
$$
(x',y',1)\begin{bmatrix}
f_{11} & f_{12} & f_{13} \\
f_{21} & f_{22} & f_{23} \\
f_{31} & f_{32} & f_{33}
\end{bmatrix}\begin{pmatrix}
x\\y\\1
\end{pmatrix}=0
$$
Properties of $F$:
$F x$ is the epipolar line associated with $x$ ($l'=F x$)
$F^T x'$ is the epipolar line associated with $x'$ ($l=F^T x'$)
$F e=0$ and $F^T e'=0$
$F$ is singular (rank two) and has seven degrees of freedom
#### Estimating the fundamental matrix
Given: correspondences $x=(x,y,1)^T$ and $x'=(x',y',1)^T$
Constraint: $x'^T F x=0$
$$
(x',y',1)\begin{bmatrix}
f_{11} & f_{12} & f_{13} \\
f_{21} & f_{22} & f_{23} \\
f_{31} & f_{32} & f_{33}
\end{bmatrix}\begin{pmatrix}
x\\y\\1
\end{pmatrix}=0
$$
**Each pair of correspondences gives one equation (one constraint)**
At least 8 pairs of correspondences are needed to solve for the 9 elements of $F$ (The eight point algorithm)
We know $F$ needs to be singular/rank 2. How do we force it to be singular?
Solution: take SVD of the initial estimate and throw out the smallest singular value
$$
F=U\begin{bmatrix}
\sigma_1 & 0 & 0\\
0 & \sigma_2 & 0\\
0 & 0 & 0
\end{bmatrix}V^T
$$
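A sketch of the resulting eight-point algorithm (without the Hartley normalization that is usually applied in practice), assuming `x1`, `x2` are $n\times 2$ arrays of corresponding points with $n\geq 8$:
```python
import numpy as np

def eight_point(x1, x2):
    u, v = x1[:, 0], x1[:, 1]
    up, vp = x2[:, 0], x2[:, 1]
    # Each correspondence x'^T F x = 0 gives one row of the constraint matrix
    A = np.column_stack([up*u, up*v, up, vp*u, vp*v, vp, u, v, np.ones(len(u))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value
    U, S, Vt2 = np.linalg.svd(F)
    S[2] = 0
    return U @ np.diag(S) @ Vt2
```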
## Structure from Motion
Not always uniquely solvable.
If we scale the entire scene by some factor $k$ and, at the same time, scale the camera matrices by the factor of $1/k$, the projections of the scene points remain exactly the same:
$x\cong PX =(1/k P)(kX)$
Without a reference measurement, it is impossible to recover the absolute scale of the scene!
In general, if we transform the scene using a transformation $Q$ and apply the inverse transformation to the camera matrices, then the image observations do not change:
$x\cong PX =(P Q^{-1})(QX)$
### Types of Ambiguities
![Ambiguities in projection](https://notenextra.trance-0.com/CSE559A/Ambiguities_in_projection.png)
### Affine projection: more general than orthographic
A general affine projection is a 3D-to-2D linear mapping plus translation:
$$
P=\begin{bmatrix}
a_{11} & a_{12} & a_{13} & t_1 \\
a_{21} & a_{22} & a_{23} & t_2 \\
0 & 0 & 0 & 1
\end{bmatrix}=\begin{bmatrix}
A & t \\
0^T & 1
\end{bmatrix}
$$
In non-homogeneous coordinates:
$$
\begin{pmatrix}
x\\y
\end{pmatrix}=\begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23}
\end{bmatrix}\begin{pmatrix}
X\\Y\\Z
\end{pmatrix}+\begin{pmatrix}
t_1\\t_2
\end{pmatrix}=AX+t
$$
### Affine Structure from Motion
Given: $m$ images of $n$ fixed 3D points such that
$$
x_{ij}=A_iX_j+t_i, \quad i=1,\dots,m, \quad j=1,\dots,n
$$
Problem: use the $mn$ correspondences $x_{ij}$ to estimate $m$ projection matrices $A_i$ and translation vectors $t_i$, and $n$ points $X_j$
The reconstruction is defined up to an arbitrary affine transformation $Q$ (12 degrees of freedom):
$$
\begin{bmatrix}
A & t \\
0^T & 1
\end{bmatrix}\rightarrow\begin{bmatrix}
A & t \\
0^T & 1
\end{bmatrix}Q^{-1}, \quad \begin{pmatrix}X_j\\1\end{pmatrix}\rightarrow Q\begin{pmatrix}X_j\\1\end{pmatrix}
$$
How many constraints and unknowns for $m$ images and $n$ points?
$2mn$ constraints and $8m + 3n$ unknowns
To be able to solve this problem, we must have $2mn \geq 8m+3n-12$ (affine ambiguity takes away 12 dof)
E.g., for two views, we need four point correspondences

View File

@@ -0,0 +1,357 @@
# CSE559A Lecture 3
## Image formation
### Degrees of Freedom
$$
x=K[R|t]X
$$
$$
w\begin{bmatrix}
x\\
y\\
1
\end{bmatrix}
=
\begin{bmatrix}
\alpha & s & u_0 \\
0 & \beta & v_0 \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
r_{11} & r_{12} & r_{13} &t_x\\
r_{21} & r_{22} & r_{23} &t_y\\
r_{31} & r_{32} & r_{33} &t_z\\
\end{bmatrix}
\begin{bmatrix}
x\\
y\\
z\\
1
\end{bmatrix}
$$
### Impact of translation of camera
$$
p=K[R|t]\begin{bmatrix}
x\\
y\\
z\\
0
\end{bmatrix}=K[R]\begin{bmatrix}
x\\
y\\
z\\
\end{bmatrix}
$$
The projection of a point at infinity (a vanishing point) is invariant to camera translation.
### Recover world coordinates from pixel coordinates
$$
w\begin{bmatrix}
u\\
v\\
1
\end{bmatrix}=K[R|t]X
$$
Key issue: the projective scale $w$ is unknown. Write $w=1/s$:
$$
\begin{aligned}
\begin{bmatrix}
u\\
v\\
1
\end{bmatrix}
&=sK[R|t]X\\
K^{-1}\begin{bmatrix}
u\\
v\\
1
\end{bmatrix}
&=s[R|t]X\\
R^{-1}K^{-1}\begin{bmatrix}
u\\
v\\
1
\end{bmatrix}&=s[I|R^{-1}t]X\\
R^{-1}K^{-1}\begin{bmatrix}
u\\
v\\
1
\end{bmatrix}&=[I|R^{-1}t]sX\\
R^{-1}K^{-1}\begin{bmatrix}
u\\
v\\
1
\end{bmatrix}&=sX+sR^{-1}t\\
\frac{1}{s}R^{-1}K^{-1}\begin{bmatrix}
u\\
v\\
1
\end{bmatrix}-R^{-1}t&=X\\
\end{aligned}
$$
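A small numpy sketch of the final expression, assuming the scale $s$ (effectively the depth of the point along its ray) is known from elsewhere:
```python
import numpy as np

def backproject(u, v, s, K, R, t):
    # X = (1/s) R^{-1} K^{-1} [u, v, 1]^T - R^{-1} t
    uv1 = np.array([u, v, 1.0])
    return (1.0 / s) * np.linalg.inv(R) @ np.linalg.inv(K) @ uv1 - np.linalg.inv(R) @ t
```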
## Projective Geometry
### Orthographic Projection
Special case of perspective projection when $f\to\infty$
- Distance for the center of projection is infinite
- Also called parallel projection
- Projection matrix is
$$
w\begin{bmatrix}
u\\
v\\
1
\end{bmatrix}=
\begin{bmatrix}
f & 0 & 0 & 0\\
0 & f & 0 & 0\\
0 & 0 & 0 & s\\
\end{bmatrix}
\begin{bmatrix}
x\\
y\\
z\\
1
\end{bmatrix}
$$
Continue in later part of the course
## Image processing foundations
### Motivation for image processing
Representational Motivation:
- We need more than raw pixel values
Computational Motivation:
- Many image processing operations must be run across many locations in an image
- A loop in Python is slow
- High-level libraries reduce errors, developer time, and algorithm runtime
- Two common libraries:
- Torch+Torchvision: Focus on deep learning
- scikit-image: Focus on classical image processing algorithms
### Operations on images
#### Point operations
Operations that are applied to one pixel at a time
Negative image
$$
I_{neg}(x,y)=L-1-I(x,y)
$$
Power law transformation:
$$
I_{out}(x,y)=cI(x,y)^{\gamma}
$$
- $c$ is a constant
- $\gamma$ is the gamma value
Contrast stretching
use function to stretch the range of pixel values
$$
I_{out}(x,y)=f(I(x,y))
$$
- $f$ is a function that stretches the range of pixel values
Image histogram
- Histogram of an image is a plot of the frequency of each pixel value
Limitations:
- No spatial information
- No information about the relationship between pixels
#### Linear filtering in spatial domain
Operations that are applied to a neighborhood at each position
Used to:
- Enhance image features
- Denoise, sharpen, resize
- Extract information about image structure
- Edge detection, corner detection, blob detection
- Detect image patterns
- Template matching
- Convolutional Neural Networks
Image filtering
Take the dot product of a kernel with each local neighborhood of the image
$$
h[m,n]=\sum_{k}\sum_{l}g[k,l]f[m+k,n+l]
$$
```python
import numpy as np

def filter2d(image, kernel):
    """
    Apply a 2D correlation filter to a grayscale image.
    For teaching only, do not use this in practice.
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Zero-pad so the output has the same size as the input
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode='constant')
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            # Element-wise product of the kernel and the local neighborhood, then sum
            out[i, j] = np.sum(kernel * padded[i:i + kh, j:j + kw])
    return out
```
Computational cost: $k^2mn$, assume $k$ is the size of the kernel and $m$ and $n$ are the dimensions of the image
Do not use this in practice, use built-in functions instead.
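For example, with `scipy.ndimage` (a sketch; `correlate` slides the kernel over the image without flipping it):
```python
import numpy as np
from scipy import ndimage

image = np.random.rand(256, 256)
box = np.ones((3, 3)) / 9.0
smoothed = ndimage.correlate(image, box, mode='nearest')
```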
**Box filter**
$$
\frac{1}{9}\begin{bmatrix}
1 & 1 & 1\\
1 & 1 & 1\\
1 & 1 & 1
\end{bmatrix}
$$
Smooths the image
**Identity filter**
$$
\begin{bmatrix}
0 & 0 & 0\\
0 & 1 & 0\\
0 & 0 & 0
\end{bmatrix}
$$
Does not change the image
**Sharpening filter**
$$
\begin{bmatrix}
0 & 0 & 0 \\
0 & 2 & 0 \\
0 & 0 & 0
\end{bmatrix}-
\frac{1}{9}\begin{bmatrix}
1 & 1 & 1 \\
1 & 1 & 1 \\
1 & 1 & 1
\end{bmatrix}
$$
Enhances the image edges
**Vertical edge detection**
$$
\begin{bmatrix}
1 & 0 & -1 \\
2 & 0 & -2 \\
1 & 0 & -1
\end{bmatrix}
$$
Detects vertical edges
**Horizontal edge detection**
$$
\begin{bmatrix}
1 & 2 & 1 \\
0 & 0 & 0 \\
-1 & -2 & -1
\end{bmatrix}
$$
Detects horizontal edges
Key properties:
- Linear:
  - `filter(I,f_1+f_2)=filter(I,f_1)+filter(I,f_2)`
- Scale invariant:
  - `filter(I,af)=a*filter(I,f)`
- Shift invariant:
  - `filter(shift(I),f)=shift(filter(I,f))`
- Commutative:
  - `filter(f_1,f_2)=filter(f_2,f_1)`
- Associative:
  - `filter(filter(I,f_1),f_2)=filter(I,filter(f_1,f_2))`
- Distributive:
  - `filter(I,f_1+f_2)=filter(I,f_1)+filter(I,f_2)`
- Identity:
  - `filter(I,f_0)=I`, where `f_0` is the unit impulse
Important filter:
**Gaussian filter**
$$
G(x,y)=\frac{1}{2\pi\sigma^2}e^{-\frac{x^2+y^2}{2\sigma^2}}
$$
Smooths the image (Gaussian blur)
Common mistake: a filter window whose size does not match $\sigma$ (the Gaussian gets truncated); visualize the filter before applying it, and size the window so its edge is about $3\sigma$ from the center
Properties of Gaussian filter:
- Remove high frequency components
- Convolution with self is another Gaussian filter
- Separable kernel:
- `G(x,y)=G(x)G(y)` (factorable into the product of two 1D Gaussian filters)
##### Filter Separability
- Separable filter:
- `f(x,y)=f(x)f(y)`
Example:
$$
\begin{bmatrix}
1 & 2 & 1 \\
2 & 4 & 2 \\
1 & 2 & 1
\end{bmatrix}=
\begin{bmatrix}
1 \\
2 \\
1
\end{bmatrix}\times
\begin{bmatrix}
1 & 2 & 1
\end{bmatrix}
$$
Gaussian filter is separable
$$
G(x,y)=\frac{1}{2\pi\sigma^2}e^{-\frac{x^2+y^2}{2\sigma^2}}=G(x)G(y)
$$
This reduces the computational cost of the filter from $k^2mn$ to $2kmn$
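A quick check of this with scipy's 1D Gaussian filter: filtering the rows and then the columns should match the full 2D Gaussian filter up to floating-point error.
```python
import numpy as np
from scipy import ndimage

image = np.random.rand(512, 512)
full_2d = ndimage.gaussian_filter(image, sigma=2)
# Separable version: 1D Gaussian along rows, then along columns
sep = ndimage.gaussian_filter1d(ndimage.gaussian_filter1d(image, sigma=2, axis=0), sigma=2, axis=1)
print(np.allclose(full_2d, sep))  # True
```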

View File

@@ -0,0 +1,196 @@
# CSE559A Lecture 4
## Practical issues with filtering
$$
h[m,n]=\sum_{k}\sum_{l}g[k,l]f[m+k,n+l]
$$
Loss of information on edges of image
- The filter window falls off the edge of the image
- Need to extrapolate
- Methods:
- clip filter
- wrap around (extend the image periodically)
- copy edge (extend the image by copying the edge pixels)
- reflect across edge (extend the image by reflecting the edge pixels)
## Convolution vs Correlation
- Convolution:
- The filter is flipped and convolved with the image
$$
h[m,n]=\sum_{k=i}^{m}\sum_{l=i}^{n}g[k,l]f[m-k,n-l]
$$
- Correlation:
- The filter is not flipped and convolved with the image
$$
h[m,n]=\sum_{k}\sum_{l}g[k,l]f[m+k,n+l]
$$
The distinction does not matter for deep learning, since the kernel weights are learned either way.
```python
scipy.signal.convolve2d(image, kernel, mode='same')
scipy.signal.correlate2d(image, kernel, mode='same')
```
but PyTorch uses correlation in its convolution layers: the "convolution" in PyTorch is actually what scipy calls correlation.
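A small check of this equivalence on a single-channel image (shapes and tolerances are illustrative):
```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy import signal

img = np.random.rand(8, 8).astype(np.float32)
ker = np.random.rand(3, 3).astype(np.float32)

# PyTorch "convolution" without padding == scipy correlation in 'valid' mode
out_torch = F.conv2d(torch.from_numpy(img)[None, None], torch.from_numpy(ker)[None, None])
out_scipy = signal.correlate2d(img, ker, mode='valid')
print(np.allclose(out_torch[0, 0].numpy(), out_scipy, atol=1e-5))  # True
```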
## Frequency domain representation of linear image filters
TL;DR: It can be helpful to think about linear spatial filters in terms of their frequency-domain representation
- Fourier transform and frequency domain
- The convolution theorem
Hybrid image: More in homework 2
The human visual system picks up mainly the low frequencies of an image when it is viewed from far away, and mainly the high frequencies when it is viewed up close
### Change of basis from an image perspective
For vectors:
- Vector -> Invertible matrix multiplication -> New vector
- Normally we think of the standard/natural basis, with unit vectors in the direction of the axes
For images:
- Image -> Vector -> Invertible matrix multiplication -> New vector -> New image
- Standard basis is just a collection of one-hot images
Use `im.flatten()` to convert an image to a vector
$$
Image(M^{-1}GMVec(I))
$$
- M is the change of basis matrix, $M^{-1}M=I$
- G is the operation we want to perform
- Vec(I) is the vectorized image
#### Lossy image compression (JPEG)
- JPEG is a lossy compression algorithm
- It uses the DCT (Discrete Cosine Transform) to transform the image to the frequency domain
- The DCT is a linear operation, so it can be represented as a matrix multiplication
- The JPEG algorithm then quantizes the coefficients and entropy codes them (use Huffman coding)
## Thinking in frequency domain
### Fourier transform
Any univariate function can be represented as a weighted sum of sine and cosine functions
$$
X[k]=\sum_{n=0}^{N-1}x[n]e^{-2\pi ikn/N}=\sum_{n=0}^{N-1}x[n]\left[\cos\left(\frac{2\pi}{N}kn\right)-i\sin\left(\frac{2\pi}{N}kn\right)\right]
$$
- $X[k]$ is the Fourier transform of $x[n]$
- $e^{-2\pi ikn/N}$ is the basis function
- $x[n]$ is the original function
Real part:
$$
\text{Re}(X[k])=\sum_{n=0}^{N-1}x[n]\cos\left(\frac{2\pi}{N}kn\right)
$$
Imaginary part:
$$
\text{Im}(X[k])=-\sum_{n=0}^{N-1}x[n]\sin\left(\frac{2\pi}{N}kn\right)
$$
Fourier transform stores the magnitude and phase of the sine and cosine function at each frequency
- Amplitude: encodes how much signal there is at a particular frequency
- Phase: encodes the spatial information (indirectly)
- For mathematical convenience, this is often written as a complex number
Amplitude: $A=\sqrt{\text{Re}(\omega)^2+\text{Im}(\omega)^2}$
Phase: $\phi=\tan^{-1}\left(\frac{\text{Im}(\omega)}{\text{Re}(\omega)}\right)$
So each frequency component can be represented as $A\sin(\omega t+\phi)$
Example:
$g(t)=\sin(2\pi ft)+\frac{1}{3}\sin(2\pi (3f)t)$
### Fourier analysis of images
Intensity image and Fourier image
Signals can be composed.
![jpeg basis](https://notenextra.trance-0.com/CSE559A/8x8_DCT_basis.png)
Note: frequency domain is often visualized using a log of the absolute value of the Fourier transform
Blurring the image removes the high-frequency components (keeping the center of the centered frequency-domain representation and suppressing the outer regions)
## Convolution theorem
The Fourier transform of the convolution of two functions is the product of their Fourier transforms
$$
F[f*g]=F[f]F[g]
$$
- $F$ is the Fourier transform
- $*$ is the convolution
Convolution in spatial domain is equivalent to multiplication in frequency domain
$$
g*h=F^{-1}[F[g]F[h]]
$$
- $F^{-1}$ is the inverse Fourier transform
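A numpy/scipy check of the theorem; `fftconvolve` multiplies the transforms under the hood, and the explicit FFT product gives the circular convolution:
```python
import numpy as np
from scipy import signal

img = np.random.rand(64, 64)
ker = np.random.rand(5, 5)

spatial = signal.convolve2d(img, ker, mode='same')
freq = signal.fftconvolve(img, ker, mode='same')   # FFT, multiply, inverse FFT
print(np.allclose(spatial, freq))  # True

# Direct statement of the theorem (circular convolution, kernel anchored at the origin)
circular = np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(ker, s=img.shape)))
```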
### Is convolution invertible?
- Undoing a convolution in the image domain corresponds to division in the frequency domain
$$
g=F^{-1}\left[\frac{F[g*h]}{F[h]}\right]
$$
- This is not always possible, because $F[h]$ may be zero and we may not know the filter
Small perturbations in the frequency domain can cause large perturbations in the spatial domain and vice versa
Deconvolution is hard and an active area of research
- Even if you know the filter, it is not always possible to invert the convolution, requires strong regularization
- If you don't know the filter, it is even harder
## 2D image transformations
### Array slicing and image warping
Fast operation for extracting a subimage
- cropped image `image[10:20, 10:20]`
- flipped image `image[::-1, ::-1]`
Image warping allows more flexible operations
#### Upsampling an image
- Upsampling an image is the process of increasing the resolution of the image
Bilinear interpolation:
- Use a distance-weighted average of the 4 nearest pixels to determine the value of the new pixel (see the sketch after this list)
Other interpolation methods:
- Bicubic interpolation: Use a weighted combination of the 16 nearest pixels to determine the value of the new pixel
- Nearest neighbor interpolation: Use the value of the nearest pixel to determine the value of the new pixel
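A minimal sketch of bilinear interpolation at a single continuous location `(y, x)` in a grayscale image:
```python
import numpy as np

def bilinear_sample(image, y, x):
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, image.shape[0] - 1), min(x0 + 1, image.shape[1] - 1)
    wy, wx = y - y0, x - x0
    # Distance-weighted average of the 4 nearest pixels
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom
```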

View File

@@ -0,0 +1,222 @@
# CSE559A Lecture 5
## Continue on linear interpolation
- In bilinear interpolation, the result never exceeds the range of the sample values; extreme values occur at the samples (the cell boundary).
- In bicubic interpolation, the result can overshoot, so extreme values may occur between samples.
`scipy.interpolate.RegularGridInterpolator`
### Image transformations
Image warping is the process of applying a transformation $T$ to an image.
Parametric (global) warping: $T(x,y)=(x',y')$
Geometric transformation: $T(x,y)=(x',y')$, applied to every pixel in the same way (global)
#### Translation
$T(x,y)=(x+a,y+b)$
matrix form:
$$
\begin{pmatrix}
x'\\y'
\end{pmatrix}
=
\begin{pmatrix}
1&0\\0&1
\end{pmatrix}
\begin{pmatrix}
x\\y
\end{pmatrix}
+
\begin{pmatrix}
a\\b
\end{pmatrix}
$$
#### Scaling
$T(x,y)=(s_xx,s_yy)$ matrix form:
$$
\begin{pmatrix}
x'\\y'
\end{pmatrix}
=
\begin{pmatrix}
s_x&0\\0&s_y
\end{pmatrix}
\begin{pmatrix}
x\\y
\end{pmatrix}
$$
#### Rotation
$T(x,y)=(x\cos\theta-y\sin\theta,x\sin\theta+y\cos\theta)$
matrix form:
$$
\begin{pmatrix}
x'\\y'
\end{pmatrix}
=
\begin{pmatrix}
\cos\theta&-\sin\theta\\\sin\theta&\cos\theta
\end{pmatrix}
\begin{pmatrix}
x\\y
\end{pmatrix}
$$
To undo the rotation, we rotate the image by $-\theta$. Since $R$ is orthogonal, this is equivalent to applying $R^T$ to the image.
#### Affine transformation
$T(x,y)=(a_1x+a_2y+a_3,b_1x+b_2y+b_3)$
matrix form:
$$
\begin{pmatrix}
x'\\y'
\end{pmatrix}
=
\begin{pmatrix}
a_1&a_2&a_3\\b_1&b_2&b_3
\end{pmatrix}
\begin{pmatrix}
x\\y\\1
\end{pmatrix}
$$
An affine transformation combines translation, scaling, rotation, and shear in a single matrix.
#### Projective homography
$T(x,y)=(\frac{ax+by+c}{gx+hy+i},\frac{dx+ey+f}{gx+hy+i})$
$$
w\begin{pmatrix}
x'\\y'\\1
\end{pmatrix}
=
\begin{pmatrix}
a&b&c\\d&e&f\\g&h&i
\end{pmatrix}
\begin{pmatrix}
x\\y\\1
\end{pmatrix}
$$
### Image warping
#### Forward warping
Send each source pixel to its transformed position in the output image and write its value there.
- May leave holes where no source pixel lands on an output pixel.
#### Inverse warping
For each output pixel, use the inverse mapping to find where it comes from in the source image and interpolate there.
- Some mappings may not be invertible.
#### Which one is better?
- Inverse warping is usually better: it is more efficient and doesn't have a problem with holes.
- However, it may not always be possible to find the inverse mapping.
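A sketch of inverse warping for an affine transform `x_src = A^{-1}(x_dst - t)`, using scipy's bilinear sampling (`order=1`):
```python
import numpy as np
from scipy import ndimage

def inverse_warp_affine(image, A, t, out_shape):
    # For each output pixel (x', y'), look up its source location (x, y) = A^{-1}((x', y') - t)
    ys, xs = np.mgrid[0:out_shape[0], 0:out_shape[1]]
    dst = np.stack([xs.ravel(), ys.ravel()])              # (2, N) output (x, y) coordinates
    src = np.linalg.inv(A) @ (dst - t.reshape(2, 1))      # corresponding source coordinates
    # map_coordinates expects (row, col) order; order=1 gives bilinear interpolation
    vals = ndimage.map_coordinates(image, [src[1], src[0]], order=1)
    return vals.reshape(out_shape)

# Example: rotate an image by 30 degrees about the origin
theta = np.deg2rad(30)
A = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
img = np.random.rand(100, 100)
warped = inverse_warp_affine(img, A, np.zeros(2), img.shape)
```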
## Sampling and Aliasing
### Naive sampling
- Remove half of the rows and columns in the image.
Example:
When sampling a sine wave too sparsely, the samples may be interpreted as a different (lower-frequency) wave.
#### Nyquist-Shannon sampling theorem
- A bandlimited signal can be uniquely determined by its samples if the sampling rate is greater than twice the maximum frequency of the signal.
- If the sampling rate is less than twice the maximum frequency of the signal, the signal will be aliased.
#### Anti-aliasing
- Sample more frequently. (not always possible)
- Get rid of all frequencies that are greater than half of the new sampling frequency.
- Use a low-pass filter to remove all frequencies above half of the new sampling frequency (e.g., a Gaussian filter).
```python
import scipy.ndimage as ndimage

def down_sample(image):
    # Low-pass filter first to remove frequencies above the new Nyquist limit
    im_blur = ndimage.gaussian_filter(image, sigma=1)
    # Keep every second row and column
    return im_blur[::2, ::2]
```
## Nonlinear filtering
### Median filter
Replace the value of a pixel with the median value of its neighbors.
- Good for removing salt and pepper noise. (black and white dot noise)
### Morphological operations
Binary image: image with only 0 and 1.
Let $B$ be a structuring element and $A$ be the original image (binary image).
- Erosion: $A\ominus B = \{p\mid B_p\subseteq A\}$, this is the set of all points that are completely covered by $B$.
- Dilation: $A\oplus B = \{p\mid B_p\cap A\neq\emptyset\}$, this is the set of all points that are at least partially covered by $B$.
- Opening: $A\circ B = (A\ominus B)\oplus B$, this is the set of all points that are at least partially covered by $B$ after erosion.
- Closing: $A\bullet B = (A\oplus B)\ominus B$, this is the set of all points that are completely covered by $B$ after dilation.
Boundary extraction: use XOR operation on eroded image and original image.
Connected component labeling: label the connected components in the image. _Use the prebuilt functions in `scipy.ndimage`; see the sketch below._
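A sketch of these operations with `scipy.ndimage` on a random binary image:
```python
import numpy as np
from scipy import ndimage

binary = np.random.rand(64, 64) > 0.7
structure = np.ones((3, 3), dtype=bool)

eroded = ndimage.binary_erosion(binary, structure)
dilated = ndimage.binary_dilation(binary, structure)
opened = ndimage.binary_opening(binary, structure)
closed = ndimage.binary_closing(binary, structure)

boundary = binary ^ eroded                 # boundary extraction via XOR
labels, num = ndimage.label(binary)        # connected component labeling
```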
## Light, Camera/Eyes, and Color
### Principles of grouping and Gestalt Laws
- Proximity: objects that are close to each other are more likely to be grouped together.
- Similarity: objects that are similar are more likely to be grouped together.
- Closure: objects that form a closed path are more likely to be grouped together.
- Continuity: objects that form a continuous path are more likely to be grouped together.
### Light and surface interactions
A photon's life choices:
- Absorption
- Diffuse reflection (nice to model) (lambertian surface)
- Specular reflection (mirror-like) (perfect mirror)
- Transparency
- Refraction
- Fluorescence (returns different color)
- Subsurface scattering (candles)
- Phosphorescence
- Interreflection
#### BRDF (Bidirectional Reflectance Distribution Function)
$$
\rho(\theta_i,\phi_i,\theta_o,\phi_o)
$$
- $\theta_i$ is the angle of incidence.
- $\phi_i$ is the azimuthal angle of incidence.
- $\theta_o$ is the angle of reflection.
- $\phi_o$ is the azimuthal angle of reflection.

View File

@@ -0,0 +1,213 @@
# CSE559A Lecture 6
## Continue on Light, eye/camera, and color
### BRDF (Bidirectional Reflectance Distribution Function)
$$
\rho(\theta_i,\phi_i,\theta_o,\phi_o)
$$
#### Diffuse Reflection
- Dull, matte surface like chalk or latex paint
- Most often used in computer vision
- Brightness _does_ depend on direction of illumination
Diffuse reflection is governed by Lambert's law: $I_d = k_d (N\cdot L) I_i$
- $N$: surface normal
- $L$: light direction
- $I_i$: incident light intensity
- $k_d$: albedo
$$
\rho(\theta_i,\phi_i,\theta_o,\phi_o)=k_d \cos\theta_i
$$
#### Photometric Stereo
Suppose there are three light sources, $L_1, L_2, L_3$, and we have the following measurements:
$$
I_1 = k_d N\cdot L_1
$$
$$
I_2 = k_d N\cdot L_2
$$
$$
I_3 = k_d N\cdot L_3
$$
Stacking the three measurements gives a linear system in $g = k_d N$; solving it (e.g., by least squares) yields the albedo $k_d=\|g\|$ and the normal $N=g/\|g\|$.
Will not do this in the lecture.
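For reference, a least-squares sketch for a single pixel, assuming `L` stacks the (three or more) light directions as rows and `I` stacks the measured intensities:
```python
import numpy as np

def recover_normal(L, I):
    # Solve L g = I in the least-squares sense, where g = k_d * N
    g, *_ = np.linalg.lstsq(L, I, rcond=None)
    k_d = np.linalg.norm(g)      # albedo
    N = g / k_d                  # unit surface normal
    return N, k_d
```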
#### Specular Reflection
- Mirror-like surface
$$
I_e=\begin{cases}
I_i & \text{if } V=R \\
0 & \text{if } V\neq R
\end{cases}
$$
- $V$: view direction
- $R$: reflection direction
- $\theta_i$: angle between the incident light and the surface normal
Near-perfect mirrors have a highlight concentrated around $R$.
A common model:
$$
I_e=k_s (V\cdot R)^{n_s}I_i
$$
- $k_s$: specular reflection coefficient
- $n_s$: shininess (imperfection of the surface)
- $I_i$: incident light intensity
#### Phong illumination model
- Phong approximation of surface reflectance
- Assume reflectance is modeled by three components
- Diffuse reflection
- Specular reflection
- Ambient reflection
$$
I_e=k_a I_a + I_i \left[k_d (N\cdot L) + k_s (V\cdot R)^{n_s}\right]
$$
- $k_a$: ambient reflection coefficient
- $I_a$: ambient light intensity
- $k_d$: diffuse reflection coefficient
- $k_s$: specular reflection coefficient
- $n_s$: shininess
- $I_i$: incident light intensity
Many other models.
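A small sketch that evaluates this model for unit vectors `N`, `L`, `V`, `R` (all parameter values are hypothetical):
```python
import numpy as np

def phong(N, L, V, R, k_a, I_a, k_d, k_s, n_s, I_i):
    diffuse = k_d * max(np.dot(N, L), 0.0)
    specular = k_s * max(np.dot(V, R), 0.0) ** n_s
    return k_a * I_a + I_i * (diffuse + specular)
```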
#### Measuring BRDF
Use Gonioreflectometer.
- Device for measuring the reflectance of a surface as a function of the incident and reflected angles.
- Can be used to measure the BRDF of a surface.
BRDF dataset:
- MERL dataset
- CURET dataset
### Camera/Eye
#### DSLR Camera
- Pinhole camera model
- Lens
- Aperture (the pinhole)
- Sensor
- ...
#### Digital Camera block diagram
![Digital Camera block diagram](https://notenextra.trance-0.com/CSE559A/DigitalCameraBlockDiagram.png)
Scanning protocols:
- Global shutter: all pixels are exposed at the same time
- Interlaced: odd and even lines are exposed at different times
- Rolling shutter: each line is exposed as it is read out
#### Eye
- Pupil
- Iris
- Retina
- Rods and cones
- ...
#### Eye Movements
- Saccade
- Can be consciously controlled. Related to perceptual attention.
- 200ms to initiation, 20 to 200ms to carry out. Large amplitude.
- Smooth pursuit
- Tracking an object
- Difficult w/o an object to track!
- Microsaccade and Ocular microtremor (OMT)
- Involuntary. Smaller amplitude. Especially evident during prolonged fixation.
#### Contrast Sensitivity
- Uniform contrast image content, with increasing frequency
- Why not uniform across the top?
- Low frequencies: harder to see because of slower intensity changes
- Higher frequencies: harder to see because of ability of our visual system to resolve fine features
### Color Perception
Visible light spectrum: 380 to 780 nm
- 400 to 500 nm: blue
- 500 to 600 nm: green
- 600 to 700 nm: red
#### HSV model
We use Gaussian functions to model the sensitivity of the human eye to different wavelengths.
- Hue: color (the wavelength of the highest peak of the sensitivity curve)
- Saturation: color purity (the variance of the sensitivity curve)
- Value: color brightness (the highest peak of the sensitivity curve)
#### Color Sensing in Camera (RGB)
- 3-chip vs. 1-chip: quality vs. cost
Bayer filter:
- Why more green?
- Human eye is more sensitive to green light.
#### Color spaces
Images in python:
As matrix.
```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from skimage import io
def plot_rgb_3d(image_path):
image = io.imread(image_path)
r, g, b = image[:,:,0], image[:,:,1], image[:,:,2]
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(r.flatten(), g.flatten(), b.flatten(), c=image.reshape(-1, 3)/255.0, marker='.')
ax.set_xlabel('Red')
ax.set_ylabel('Green')
ax.set_zlabel('Blue')
plt.show()
plot_rgb_3d('image.jpg')
```
Other color spaces:
- YCbCr (fast to compute, usually used in TV)
- HSV
- L\*a\*b\* (CIELAB, perceptually uniform color space)
Most information is in the intensity channel.
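A quick way to see this with skimage (a sketch; the first YCbCr channel is the luma/intensity channel):
```python
import numpy as np
from skimage import color, data

rgb = data.astronaut() / 255.0          # sample RGB image shipped with skimage, scaled to [0, 1]
ycbcr = color.rgb2ycbcr(rgb)
luma = ycbcr[:, :, 0]                   # intensity channel
# The luma channel typically varies far more than the two chroma channels
print(luma.std(), ycbcr[:, :, 1].std(), ycbcr[:, :, 2].std())
```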

View File

@@ -0,0 +1,228 @@
# CSE559A Lecture 7
## Computer Vision (In Artificial Neural Networks for Image Understanding)
Early example of image understanding using Neural Networks: [Back propagation for zip code recognition]
Central idea: representation change at each layer of features.
Plan for next few weeks:
1. How do we train such models?
2. What are those building blocks
3. How should we combine those building blocks?
## How do we train such models?
CV is finally useful...
1. Image classification
2. Image segmentation
3. Object detection
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
- 1000 classes
- 1.2 million images
- 10000 test images
### Deep Learning (Just neural networks)
Bigger datasets, larger models, faster computers, lots of incremental improvements.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.conv1 = nn.Conv2d(1, 6, 5)
self.conv2 = nn.Conv2d(6, 16, 5)
self.fc1 = nn.Linear(16 * 5 * 5, 120)
self.fc2 = nn.Linear(120, 84)
self.fc3 = nn.Linear(84, 10)
def forward(self, x):
x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
x = F.max_pool2d(F.relu(self.conv2(x)), 2)
x = x.view(-1, self.num_flat_features(x))
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = self.fc3(x)
return x
def num_flat_features(self, x):
size = x.size()[1:]
num_features = 1
for s in size:
num_features *= s
return num_features
# create pytorch dataset and dataloader
dataset = torch.utils.data.TensorDataset(torch.randn(1000, 1, 32, 32), torch.randint(10, (1000,)))  # 32x32 inputs match the 16*5*5 flattened size
dataloader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2)
# training process
net = Net()
optimizer = optim.Adam(net.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
# loop over the dataset multiple times
for epoch in range(2):
for i, data in enumerate(dataloader, 0):
inputs, labels = data
optimizer.zero_grad()
outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
print(f"Finished Training")
```
Some generated code above.
### Supervised Learning
Training: given a dataset, learn a mapping from input to output.
Testing: given a new input, predict the output.
Example: Linear classification models
Find a linear function that separates the data.
$$
f(x) = w^T x + b
$$
[Linear classification models](http://cs231n.github.io/linear-classify/)
Simple representation of a linear classifier.
### Empirical loss minimization framework
Given a training set, find a model that minimizes the loss function.
Assume iid samples.
Example of loss function:
l1 loss:
$$
\ell(f(x; w), y) = |f(x; w) - y|
$$
l2 loss:
$$
\ell(f(x; w), y) = (f(x; w) - y)^2
$$
### Linear classification models
$$
\hat{L}(w) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i; w), y_i)
$$
In general, it is hard to find the global minimum.
#### Linear regression
However, if we use l2 loss, we can find the global minimum.
$$
\hat{L}(w) = \frac{1}{n} \sum_{i=1}^n (f(x_i; w) - y_i)^2
$$
This is a convex function, so we can find the global minimum.
The gradient is:
$$
\nabla_w||Xw-Y||^2 = 2X^T(Xw-Y)
$$
Set the gradient to 0, we get:
$$
w = (X^T X)^{-1} X^T Y
$$
From the maximum likelihood perspective, we can also derive the same result.
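A numpy sketch of the closed-form solution on synthetic data (using `solve`/`lstsq` rather than an explicit inverse for numerical stability):
```python
import numpy as np

X = np.random.randn(100, 5)                    # 100 samples, 5 features
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
Y = X @ w_true + 0.01 * np.random.randn(100)

w_closed = np.linalg.solve(X.T @ X, X.T @ Y)   # w = (X^T X)^{-1} X^T Y
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(w_closed, w_lstsq))          # True
```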
#### Logistic regression
Sigmoid function:
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$
The logistic regression loss is convex but has no closed-form minimizer, so we cannot use the normal equations; we minimize it with gradient-based methods instead.
#### Gradient Descent
Full batch gradient descent (average gradient over all $n$ training samples):
$$
w \leftarrow w - \eta \nabla_w \hat{L}(w)
$$
Stochastic gradient descent (gradient of the loss on a single randomly chosen sample):
$$
w \leftarrow w - \eta \nabla_w \ell(f(x_i; w), y_i)
$$
Mini-batch gradient descent (average gradient over a randomly sampled mini-batch $B$):
$$
w \leftarrow w - \eta \frac{1}{|B|}\sum_{i\in B} \nabla_w \ell(f(x_i; w), y_i)
$$
At each step, we update the weights using the average gradient of the mini-batch; the mini-batch is selected randomly from the training set.
#### Multi-class classification
Use softmax function to convert the output to a probability distribution.
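Concretely, for a score vector $z\in\mathbb{R}^C$ the softmax is
$$
\operatorname{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{C} e^{z_k}}
$$
and the model is trained with the cross-entropy loss on these probabilities.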
## Neural Networks
From linear to non-linear.
- Shallow approach:
- Use feature transformation to make the data linearly separable.
- Deep approach:
- Stack multiple layers of linear models.
Common non-linear functions:
- ReLU:
- $$
\text{ReLU}(x) = \max(0, x)
$$
- Sigmoid:
- $$
\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
$$
- Tanh:
- $$
\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
$$
### Backpropagation

View File

@@ -0,0 +1,80 @@
# CSE559A Lecture 8
Paper review sharing.
## Recap: Three ways to think about linear classifiers
Geometric view: Hyperplanes in the feature space
Algebraic view: Linear functions of the features
Visual view: One template per class
## Continue on linear classification models
Two-layer networks can be seen as combinations of templates.
Interpretability is lost as depth increases.
A two layer network is a **universal approximator** (we can approximate any continuous function to arbitrary accuracy). But the hidden layer may need to be huge.
[Multi-layer networks demo](https://playground.tensorflow.org)
### Supervised learning outline
1. Collect training data
2. Specify model (select hyper-parameters)
3. Train model
#### Hyper-parameters selection
- Number of layers, number of units per layer, learning rate, etc.
- Type of non-linearity, regularization, etc.
- Type of loss function, etc.
- SGD settings: batch size, number of epochs, etc.
#### Hyper-parameter searching
Use validation set to evaluate the performance of the model.
Never peek at the test set.
Use the training set to do K-fold cross validation.
### Backpropagation
#### Computation graphs
SGD update for each parameter
$$
w_k\gets w_k-\eta\frac{\partial e}{\partial w_k}
$$
$e$ is the error function.
#### Using the chain rule
Suppose $k=1$, $e=l(f_1(x,w_1),y)$
Example: $e=(f_1(x,w_1)-y)^2$
So $h_1=f_1(x,w_1)=w^T_1x$, $e=l(h_1,y)=(y-h_1)^2$
$$
\frac{\partial e}{\partial w_1}=\frac{\partial e}{\partial h_1}\frac{\partial h_1}{\partial w_1}
$$
$$
\frac{\partial e}{\partial h_1}=2(h_1-y)
$$
$$
\frac{\partial h_1}{\partial w_1}=x
$$
$$
\frac{\partial e}{\partial w_1}=2(h_1-y)x
$$
#### General backpropagation algorithm

View File

@@ -0,0 +1,102 @@
# CSE559A Lecture 9
## Continue on ML for computer vision
### Backpropagation
#### Computation graphs
SGD update for each parameter
$$
w_k\gets w_k-\eta\frac{\partial e}{\partial w_k}
$$
$e$ is the error function.
#### Using the chain rule
Suppose $k=1$, $e=l(f_1(x,w_1),y)$
Example: $e=(f_1(x,w_1)-y)^2$
So $h_1=f_1(x,w_1)=w^T_1x$, $e=l(h_1,y)=(y-h_1)^2$
$$
\frac{\partial e}{\partial w_1}=\frac{\partial e}{\partial h_1}\frac{\partial h_1}{\partial w_1}
$$
$$
\frac{\partial e}{\partial h_1}=2(h_1-y)
$$
$$
\frac{\partial h_1}{\partial w_1}=x
$$
$$
\frac{\partial e}{\partial w_1}=2(h_1-y)x
$$
For the general cases,
$$
\frac{\partial e}{\partial w_k}=\frac{\partial e}{\partial h_K}\frac{\partial h_K}{\partial h_{K-1}}\cdots\frac{\partial h_{k+2}}{\partial h_{k+1}}\frac{\partial h_{k+1}}{\partial h_k}\frac{\partial h_k}{\partial w_k}
$$
Where the upstream gradient $\frac{\partial e}{\partial h_K}$ is known, and the local gradient $\frac{\partial h_k}{\partial w_k}$ is known.
#### General backpropagation algorithm
The adding layer is the gradient distributor layer.
The multiplying layer is the gradient switcher layer.
The max operation is the gradient router layer.
![Images of propagation](https://notenextra.trance-0.com/CSE559A/General_computation_graphs_for_MLP.png)
Simple example: Element-wise operation (ReLU)
$f(x)=ReLU(x)=max(0,x)$
$$
\frac{\partial z}{\partial x}=\begin{pmatrix}
\frac{\partial z_1}{\partial x_1} & 0 & \cdots & 0 \\
0 & \frac{\partial z_2}{\partial x_2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{\partial z_n}{\partial x_n}
\end{pmatrix}
$$
Where $\frac{\partial z_i}{\partial x_j}=1$ if $i=j$ and $z_i>0$, otherwise $\frac{\partial z_i}{\partial x_j}=0$.
When $x_i<0$ for all $i$, $\frac{\partial z}{\partial x}=0$ (dead ReLU)
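A small PyTorch check of this behavior (gradients flow only through the entries where the input is positive):
```python
import torch

x = torch.tensor([-2.0, -0.5, 1.0, 3.0], requires_grad=True)
z = torch.relu(x)
z.sum().backward()
print(x.grad)   # tensor([0., 0., 1., 1.]) -- zero gradient where the input is negative
```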
Other examples on ppt.
## Convolutional Neural Networks
### Basic Convolutional layer
#### Flatten layer
A fully connected layer operates on the vectorized image.
With a multi-layer perceptron, the network effectively learns a set of templates.
![Flatten layer](https://notenextra.trance-0.com/CSE559A/Flatten_layer.png)
#### Convolutional layer
Limit the receptive fields of units, tiles them over the input image, and share the weights.
Equivalent to sliding the learned filter over the image, computing dot products at each location.
![Convolutional layer](https://notenextra.trance-0.com/CSE559A/Convolutional_layer.png)
Padding: Add a border of zeros around the image. (higher padding, larger output size)
Stride: The step size of the filter. (higher stride, smaller output size)
### Variants 1x1 convolutions, depthwise convolutions
### Backward pass

content/CSE559A/_meta.js Normal file
View File

@@ -0,0 +1,32 @@
export default {
//index: "Course Description",
"---":{
type: 'separator'
},
CSE559A_L1: "Computer Vision (Lecture 1)",
CSE559A_L2: "Computer Vision (Lecture 2)",
CSE559A_L3: "Computer Vision (Lecture 3)",
CSE559A_L4: "Computer Vision (Lecture 4)",
CSE559A_L5: "Computer Vision (Lecture 5)",
CSE559A_L6: "Computer Vision (Lecture 6)",
CSE559A_L7: "Computer Vision (Lecture 7)",
CSE559A_L8: "Computer Vision (Lecture 8)",
CSE559A_L9: "Computer Vision (Lecture 9)",
CSE559A_L10: "Computer Vision (Lecture 10)",
CSE559A_L11: "Computer Vision (Lecture 11)",
CSE559A_L12: "Computer Vision (Lecture 12)",
CSE559A_L13: "Computer Vision (Lecture 13)",
CSE559A_L14: "Computer Vision (Lecture 14)",
CSE559A_L15: "Computer Vision (Lecture 15)",
CSE559A_L16: "Computer Vision (Lecture 16)",
CSE559A_L17: "Computer Vision (Lecture 17)",
CSE559A_L18: "Computer Vision (Lecture 18)",
CSE559A_L19: "Computer Vision (Lecture 19)",
CSE559A_L20: "Computer Vision (Lecture 20)",
CSE559A_L21: "Computer Vision (Lecture 21)",
CSE559A_L22: "Computer Vision (Lecture 22)",
CSE559A_L23: "Computer Vision (Lecture 23)",
CSE559A_L24: "Computer Vision (Lecture 24)",
CSE559A_L25: "Computer Vision (Lecture 25)",
CSE559A_L26: "Computer Vision (Lecture 26)",
}

content/CSE559A/index.md Normal file
View File

@@ -0,0 +1,4 @@
# CSE 559A: Computer Vision
## Course Description