updates

content/CSE510/CSE510_L22.md (new file)

# CSE510 Deep Reinforcement Learning (Lecture 22)

## Offline Reinforcement Learning

### Requirements for Current Successes

- Access to the environment model or a simulator
- Exploration and trial-and-error are not costly

#### Background: Offline RL

- The success of modern machine learning
  - Scalable data-driven learning methods (GPT-4, CLIP, DALL·E, Sora)
- Reinforcement learning
  - Online learning paradigm
  - Interaction is expensive & dangerous
    - Healthcare, robotics, recommendation...
- Can we develop data-driven offline RL?

#### Definition in Offline RL

- The policy $\pi_k$ is updated with a static dataset $\mathcal{D}$, which is collected by an _unknown behavior policy_ $\pi_\beta$
  - Interaction is not allowed

- $\mathcal{D}=\{(s_i,a_i,s_i',r_i)\}$
  - $s\sim d^{\pi_\beta}(s)$
  - $a\sim \pi_\beta(a|s)$
  - $s'\sim p(s'|s,a)$
  - $r\gets r(s,a)$
- Objective: $\max_\pi\sum_{t=0}^{T}\mathbb{E}_{s_t\sim d^\pi(s),a_t\sim \pi(a|s)}[\gamma^t r(s_t,a_t)]$
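
As a minimal illustration of this setup (the environment interface and all names below are illustrative, not from the lecture), the static dataset is collected once by the behavior policy and then frozen:

```python
def collect_offline_dataset(env_step, behavior_policy, init_state, num_transitions):
    """Roll out the (unknown) behavior policy pi_beta once to build the static dataset D."""
    dataset, s = [], init_state
    for _ in range(num_transitions):
        a = behavior_policy(s)        # a ~ pi_beta(a|s)
        s_next, r = env_step(s, a)    # s' ~ p(s'|s,a), r = r(s,a)
        dataset.append((s, a, s_next, r))
        s = s_next
    return dataset                    # D = {(s_i, a_i, s_i', r_i)}; no further interaction allowed
```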

#### Key challenge in Offline RL

Distribution shift.

How about using traditional reinforcement learning (bootstrapping)?

$$
Q(s,a)=r(s,a)+\gamma \max_{a'\in A} Q(s',a')
$$

$$
\pi(s)=\arg\max_{a\in A} Q(s,a)
$$

But notice that

$$
P_{\pi_\beta}(s,a)\neq P_{\pi_f}(s,a),
$$

i.e., the state-action distribution induced by the behavior policy $\pi_\beta$ differs from the one induced by the learned policy $\pi_f$. The bootstrapped $\max_{a'\in A} Q(s',a')$ therefore queries actions that may never appear in $\mathcal{D}$, and those out-of-distribution estimates cannot be corrected, since further interaction is not allowed.
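
A minimal sketch of where this bites, assuming a small discrete state and action space and the `dataset` from the sketch above (all names are illustrative):

```python
from collections import defaultdict

def offline_q_learning(dataset, actions, gamma=0.99, lr=0.1, sweeps=100):
    """Tabular Q-learning run only over the static dataset D (no interaction)."""
    Q = defaultdict(float)
    for _ in range(sweeps):
        for (s, a, s_next, r) in dataset:
            # The max ranges over *all* actions, including actions pi_beta never
            # took in s_next: their Q-values are never grounded by data, yet they
            # can dominate the bootstrap target (the distribution-shift problem).
            target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += lr * (target - Q[(s, a)])
    return Q
```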

@@ -24,4 +24,5 @@ export default {
CSE510_L19: "CSE510 Deep Reinforcement Learning (Lecture 19)",
CSE510_L20: "CSE510 Deep Reinforcement Learning (Lecture 20)",
CSE510_L21: "CSE510 Deep Reinforcement Learning (Lecture 21)",
CSE510_L22: "CSE510 Deep Reinforcement Learning (Lecture 22)",
}

content/CSE5313/CSE5313_L21.md (new file)

# CSE5313 Coding and information theory for data science (Lecture 21)

## Gradient coding

### Intro to Statistical Machine Learning

The learning problem is given as follows:

**Unknown** target function $f:X\to Y$.

Training data $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^n$.

Learning algorithm $\mathbb{A}$ and hypothesis set $\mathcal{H}$.

The goal is to find $g\approx f$.

Common hypothesis sets:

- Linear classifiers:

$$
f(x)=\operatorname{sign}(wx^\top)
$$

- Linear regressors:

$$
f(x)=wx^\top
$$

- Neural networks:
  - Concatenated linear classifiers (or differentiable approximations thereof).

The dataset $\mathcal{D}$ is drawn from an unknown distribution $D$.

#### Common approach – Empirical Risk Minimization

- Q: How do we quantify $f\approx g$?
- A: Use a loss function $\ell$.
  - A function which measures the deviation between $g$'s output and $f$'s output.
- Using the chosen loss function, define two measures:
  - True risk: $\mathbb{E}_{D}[\ell(f,g)]$
  - Empirical risk: $ER(g)=\frac{1}{n}\sum_{i=1}^n \ell(g(x_i),y_i)$

Note that machine learning here operates over $\mathbb{R}$.
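
A small numerical sketch of the empirical risk for a linear regressor (the squared loss is my choice of example loss, not fixed by the lecture):

```python
import numpy as np

def empirical_risk(w, X, y):
    """ER(w) = (1/n) * sum_i loss(g(x_i), y_i) for the linear regressor g(x) = w x^T."""
    predictions = X @ w                      # g(x_i) = w x_i^T for every row x_i of X
    return np.mean((predictions - y) ** 2)   # squared loss as the example loss
```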

### Gradient Descent (Motivation)

Parameterize $g$ by some real vector $\vec{w}$; we want to minimize $ER(\vec{w})$.

Algorithm:

- Initialize $\vec{w}_0$.
- For $t=1,2,\cdots,T$:
  - Compute $\nabla_{\vec{w}} ER(\vec{w})$.
  - $\vec{w}\gets \vec{w}-\eta\nabla_{\vec{w}} ER(\vec{w})$
  - Terminate if some stopping condition is met.
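
Continuing the sketch above (still with squared loss, so the gradient of the empirical risk has a closed form; the step size and iteration count are arbitrary):

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, T=100):
    """Full-batch gradient descent on ER(w) for the linear regressor g(x) = w x^T."""
    n, p = X.shape
    w = np.zeros(p)                           # initialize w_0
    for _ in range(T):
        grad = (2.0 / n) * X.T @ (X @ w - y)  # gradient of the mean squared loss
        w = w - eta * grad                    # gradient step
    return w
```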

Bottleneck: Calculating $\nabla_{\vec{w}} ER(\vec{w})=\frac{1}{n}\sum_{i=1}^n \nabla_{\vec{w}} \ell(g(x_i),y_i)$.

Potentially $O(PN)$ per step, where $N$ is the number of data points and $P$ is the dimension of the feature space.

Solution: Parallelize.

#### Distributed Gradient Descent

Idea: use a distributed system with one **master** and several **workers**.

Problem: Stragglers (slow servers, roughly 5–6× slower than the average response time).

Potential solutions:

- Wait for all servers:
  - Accurate, but slow.
- Sum the results without the slowest ones:
  - Less accurate, but faster.
- Introduce redundancy:
  - Send each $\mathcal{D}_i$ to more than one server.
  - Each server receives more than one $\mathcal{D}_i$.
  - Each server sends a linear combination of the partial gradients of its $\mathcal{D}_i$'s.
  - The master decodes the sum of partial gradients from the linear combinations.

### Problem setup

System setup: 1 master $M$ and $n$ workers $W_1,\cdots,W_n$.

A dataset $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^n$, split into partial datasets $\mathcal{D}_1,\cdots,\mathcal{D}_n$.

Each worker $i$:

- Receives $d$ of the partial datasets, where $d$ is the replication factor.
- Computes a vector $v_j$ (the partial gradient) from each partial dataset $\mathcal{D}_j$ it received.
- Returns a linear combination $u_i=\sum_{j} \alpha_{i,j} v_j$ to the master, where $\alpha_{i,j}\neq 0$ only for the $d$ datasets held by worker $i$.

The master:

- Waits for the first $n-s$ of the $u_i$'s to arrive ($s$ is the straggler tolerance factor).
- Linearly combines these $u_i$'s, with coefficients $\lambda_i$, to get the full gradient.

Goal: Retrieve $\sum_j v_j$ regardless of which $n-s$ workers responded.

- Computation of the full gradient that tolerates any $s$ stragglers.

The $\alpha_{i,j}$'s are fixed (they do not depend on the data). They form a **gradient coding matrix** $B\in\mathbb{C}^{n\times n}$.

Row $i$ of $B$ has $d$ non-zero entries $\alpha_{i,j}$, at some positions.

The $\lambda_i$'s:

- Might depend on the identity of the $n-s$ responding workers.
- Nevertheless must exist in every case.

Recall:

- The master must be able to recover $\sum_j v_j$ from any $n-s$ responses.

Let $\mathcal{K}$ be the set of indices of the responses.

Let $B_\mathcal{K}$ be the submatrix of $B$ formed by the rows indexed by $\mathcal{K}$.

Must have:

- For every $\mathcal{K}$ of size $n-s$ there exist coefficients $\lambda_1,\cdots,\lambda_{n-s}$ such that:

$$
(\lambda_1,\cdots,\lambda_{n-s})B_\mathcal{K}=(1,1,\cdots,1)=\mathbb{I}
$$

Then if $\mathcal{K}=\{i_1,\cdots,i_{n-s}\}$ responded,

$$
(\lambda_1,\cdots,\lambda_{n-s})\begin{pmatrix}
u_{i_1}\\
u_{i_2}\\
\vdots\\
u_{i_{n-s}}
\end{pmatrix}=\sum_{j=1}^n v_j.
$$

#### Definition of a gradient coding matrix

For replication factor $d$ and straggler tolerance factor $s$: $B\in\mathbb{C}^{n\times n}$ is a gradient coding matrix if:

- $\mathbb{I}$ is in the span of any $n-s$ of its rows.
- Every row of $B$ contains at most $d$ nonzero elements.

A gradient coding matrix implies a gradient coding algorithm:

- The master sends to worker $i$ the partial datasets indexed by $S_i$, where $S_i$ is the set of nonzero indices of row $i$ of $B$.
- Worker $i$:
  - Computes $\mathcal{D}_{i,\ell}\to v_{i,\ell}$ for $\ell=1,\cdots,d$.
  - Sends $u_i=\sum_{j=1}^d \alpha_{i,j}v_{i,j}$ to the master.
- Let $\mathcal{K}=\{i_1,\cdots,i_{n-s}\}$ be the indices of the first $n-s$ responses.
- Since $\mathbb{I}$ is in the span of any $n-s$ rows of $B$, there exist $\lambda_1,\cdots,\lambda_{n-s}$ such that $(\lambda_1,\cdots,\lambda_{n-s})(u_{i_1},\cdots,u_{i_{n-s}})^\top=\sum_j v_j$.
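
A small NumPy sketch of this protocol, assuming a valid gradient coding matrix `B` is already given (the decoding step solves $(\lambda_1,\cdots,\lambda_{n-s})B_\mathcal{K}=\mathbb{I}$ via least squares; all names are illustrative):

```python
import numpy as np

def worker_messages(B, V):
    """u_i = sum_j B[i, j] * v_j, where row j of V is the partial gradient v_j."""
    return B @ V                      # shape (n, gradient_dimension)

def master_decode(B, U, responding):
    """Recover sum_j v_j from the n - s responses u_i, i in `responding`."""
    B_K = B[responding, :]            # rows of B for the workers that responded
    ones = np.ones(B.shape[1])
    # Solve lambda @ B_K = (1, ..., 1); a valid gradient coding matrix guarantees a solution.
    lam, *_ = np.linalg.lstsq(B_K.T, ones, rcond=None)
    return lam @ U[responding, :]     # equals sum_j v_j
```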

#### Construction of Gradient Coding Matrices

Goal:

- For a given straggler tolerance parameter $s$, we wish to construct a gradient coding matrix $B$ with the smallest possible $d$.

- Tools:

I. Cyclic Reed-Solomon codes over the complex numbers.
II. Definition of $\mathcal{C}^\perp$ (the dual of $\mathcal{C}$) and $\mathcal{C}^R$ (the reverse of $\mathcal{C}$).
III. A simple lemma about MDS codes.

Recall: An $[n,k]$ Reed-Solomon code over a field $\mathcal{F}$ is as follows.

- Fix distinct $\alpha_1,\cdots,\alpha_n\in\mathcal{F}$.
- $\mathcal{C}=\{(f(\alpha_1),f(\alpha_2),\cdots,f(\alpha_n)) \mid f\in\mathcal{F}[x],\ \deg f<k\}$.
- Dimension $k$ and minimum distance $n-k+1$ follow from $\mathcal{F}$ being a field.
- Also works for $\mathcal{F}=\mathbb{C}$.

### I. Cyclic Reed-Solomon codes over the complex numbers

The following Reed-Solomon code over the complex numbers is a cyclic code.

- Let $i=\sqrt{-1}$.
- For $j\in \{0,\cdots,n-1\}$, choose $a_j=e^{2\pi i j/n}$. The $a_j$'s are the roots of unity of order $n$.
- Use these $a_j$'s to define a Reed-Solomon code as usual.

This code is cyclic:

- Let $c=(f_c(a_0),f_c(a_1),\cdots,f_c(a_{n-1}))$ for some $f_c(x)\in \mathbb{C}[x]$ with $\deg f_c<k$.
- Need to show that the cyclic shift $c'=(f_c(a_1),\cdots,f_c(a_{n-1}),f_c(a_0))$ is also a codeword.

Define $g(x)=f_c(a_1 x)$. Since $a_1 a_j=a_{(j+1)\bmod n}$,

$$
c'=(f_c(a_1),f_c(a_2),\cdots,f_c(a_{n-1}),f_c(a_0))=(g(a_0),g(a_1),\cdots,g(a_{n-1})),
$$

and $\deg g=\deg f_c<k$, so $c'\in\mathcal{C}$.
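
A quick numerical check of the cyclicity claim (the parameters $n$ and $k$ are arbitrary):

```python
import numpy as np

n, k = 7, 4
a = np.exp(2j * np.pi * np.arange(n) / n)    # a_j = e^{2*pi*i*j/n}: n-th roots of unity
G = np.vstack([a**e for e in range(k)])      # row e evaluates x^e at a_0, ..., a_{n-1}

rng = np.random.default_rng(0)
f = rng.standard_normal(k)                   # coefficients of a random polynomial of degree < k
c = f @ G                                    # codeword (f(a_0), ..., f(a_{n-1}))

c_shift = np.roll(c, -1)                     # cyclic shift (f(a_1), ..., f(a_{n-1}), f(a_0))
g = f * a[1] ** np.arange(k)                 # coefficients of g(x) = f(a_1 * x)
assert np.allclose(c_shift, g @ G)           # the shift is again a codeword
```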

### II. Dual and Reversed Codes

- Let $\mathcal{C}=[n,k,d]_{\mathbb{F}}$ be an MDS code.

#### Definition for the dual code of $\mathcal{C}$

The dual code of $\mathcal{C}$ is

$$
\mathcal{C}^\perp=\{c'\in \mathbb{F}^n \mid c'c^\top=0\text{ for all }c\in \mathcal{C}\}.
$$

Claim: $\mathcal{C}^\perp$ is an $[n,n-k,k+1]_{\mathbb{F}}$ code (in particular, also MDS).

#### Definition for the reversed code of $\mathcal{C}$

The reversed code of $\mathcal{C}$ is

$$
\mathcal{C}^R=\{(c_{n-1},\cdots,c_0)\mid(c_0,\cdots,c_{n-1})\in \mathcal{C}\}.
$$

We claim that if $\mathcal{C}$ is cyclic, then $\mathcal{C}^R$ is cyclic.

### III. Lemma about MDS codes

Let $\mathcal{C}=[n,k,n-k+1]_{\mathbb{F}}$ be an MDS code.

#### Lemma

For any subset $\mathcal{K}\subset \{0,\cdots,n-1\}$ of size $n-k+1$, there exists $c\in \mathcal{C}$ whose support (set of nonzero indices) is $\mathcal{K}$.

<details>
<summary>Proof</summary>

Let $G\in \mathbb{F}^{k\times n}$ be a generator matrix, and let $G_{\mathcal{K}^c}\in \mathbb{F}^{k\times (k-1)}$ be its restriction to the columns not indexed by $\mathcal{K}$.

$G_{\mathcal{K}^c}$ has more rows than columns, so there exists a nonzero $v\in \mathbb{F}^{k}$ such that $vG_{\mathcal{K}^c}=0$.

So $c=vG$ has at least $|\mathcal{K}^c|=k-1$ zeros, in the entries indexed by $\mathcal{K}^c$.

The remaining $n-(k-1)=n-k+1$ entries of $c$, indexed by $\mathcal{K}$, must all be nonzero: $G$ has full rank, so $c=vG\neq 0$, and a nonzero codeword of an MDS code has Hamming weight at least $n-k+1$, i.e., at most $k-1$ zero entries.

Thus the support of $c$ is $\mathcal{K}$.

</details>
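
A small NumPy sketch of the proof idea: find a left null vector of $G_{\mathcal{K}^c}$ and use it to build a codeword supported on $\mathcal{K}$ (function and variable names are mine):

```python
import numpy as np

def codeword_with_support(G, K):
    """For a k x n MDS generator G and |K| = n - k + 1, return a codeword supported on K."""
    k, n = G.shape
    comp = [j for j in range(n) if j not in K]   # K^c, which has k - 1 elements
    # A nonzero v with v @ G[:, comp] = 0, i.e. a null vector of G_{K^c}^T:
    _, _, Vh = np.linalg.svd(G[:, comp].T)
    v = Vh[-1].conj()
    return v @ G                                 # zero on K^c; nonzero on K since the code is MDS

# Example with the cyclic Reed-Solomon generator G from the previous sketch (n = 7, k = 4):
# c = codeword_with_support(G, K={0, 1, 2, 3})   # |K| = n - k + 1 = 4
```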

### Constructing the gradient coding matrix

Consider any $n$ workers and $s$ stragglers.

Let $d=s+1$.

Let $\mathcal{C}=[n,n-s]_{\mathbb{C}}$ be the cyclic RS code built in I.

Then by III, there exists $c\in \mathcal{C}$ whose support is the first $n-(n-s)+1=s+1$ entries.

Denote $c=(\beta_1,\cdots,\beta_{s+1},0,0,\cdots,0)$ for some nonzero $\beta_1,\cdots,\beta_{s+1}$.

Build:

$B\in \mathbb{C}^{n\times n}$ whose columns are all the cyclic shifts of $c$.

We claim that $B$ is a gradient coding matrix, schematically:

$$
B=\begin{pmatrix}
\beta_1 & 0 & \cdots & 0 & \beta_{s+1} & \beta_s & \cdots & \beta_2 \\
\beta_2 & \beta_1 & & & 0 & \beta_{s+1} & & \vdots \\
\vdots & \beta_2 & \ddots & & & 0 & \ddots & \beta_{s+1} \\
\beta_{s+1} & \vdots & \ddots & 0 & & & \ddots & 0 \\
0 & \beta_{s+1} & & \beta_1 & \ddots & & & \vdots \\
\vdots & 0 & \ddots & \beta_2 & & \ddots & & \vdots \\
 & \vdots & & \vdots & \ddots & & \ddots & 0 \\
0 & 0 & \cdots & \beta_{s+1} & \beta_s & \beta_{s-1} & \cdots & \beta_1
\end{pmatrix}
$$

<details>
<summary>Proof</summary>

Every row of $B$ is a codeword in $\mathcal{C}^R$.

- Specifically, a cyclic shift of $(0,\cdots,0,\beta_{s+1},\cdots,\beta_1)$.
- Hence every row contains at most $d=s+1$ nonzeros.

$\mathbb{I}$ is in the span of any $n-s$ rows of $B$.

- Observe that $\mathbb{I}\in \mathcal{C}$ (evaluate the polynomial $f(x)=1$ at $a_0,\cdots,a_{n-1}$).
- Then $\mathbb{I}\in \mathcal{C}^R$.
- Therefore, it suffices to show that any $n-s$ rows span $\mathcal{C}^R$.
- Since $\dim \mathcal{C}^R=\dim \mathcal{C}=n-s$, it suffices to show that any $n-s$ rows are linearly independent.

Observe: the leftmost $n-s$ columns of $B$ are linearly independent, and therefore span $\mathcal{C}$.

Assume for contradiction that there exist $n-s$ linearly dependent rows.

Then there exists a nonzero $v\in \mathbb{C}^{n}$, supported on those $n-s$ rows, such that $vB=0$.

In particular, $v$ is orthogonal to every column of $B$, hence to a basis of $\mathcal{C}$.

So $v\in \mathcal{C}^\perp$.

From II, $\mathcal{C}^\perp$ is an $[n,s]$ MDS code, and hence every nonzero $v\in \mathcal{C}^\perp$ has Hamming weight $\geq n-s+1$.

This contradicts the fact that $v$ has at most $n-s$ nonzero entries.

</details>
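
A NumPy sketch that puts the pieces together for small, arbitrary parameters: build $c$ from the cyclic RS code, form $B$ from its cyclic shifts, and check that $\mathbb{I}$ lies in the span of any $n-s$ rows (all names are mine):

```python
import numpy as np
from itertools import combinations

n, s = 7, 2
k = n - s                                      # C = [n, n-s] cyclic RS code over the complex numbers
a = np.exp(2j * np.pi * np.arange(n) / n)      # evaluation points: n-th roots of unity
G = np.vstack([a**e for e in range(k)])        # generator matrix of C

# Codeword c supported on the first s + 1 coordinates (lemma III):
comp = list(range(s + 1, n))                   # K^c = {s+1, ..., n-1}, k - 1 columns
_, _, Vh = np.linalg.svd(G[:, comp].T)
c = Vh[-1].conj() @ G                          # c = (beta_1, ..., beta_{s+1}, 0, ..., 0)

# B: columns are all the cyclic shifts of c.
B = np.stack([np.roll(c, j) for j in range(n)], axis=1)

# Check: the all-ones vector is in the span of ANY n - s rows of B.
ones = np.ones(n)
for rows in combinations(range(n), n - s):
    B_K = B[list(rows), :]
    lam, *_ = np.linalg.lstsq(B_K.T, ones, rcond=None)
    assert np.allclose(lam @ B_K, ones)        # lambda @ B_K = (1, ..., 1)
```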

### Bound for gradient coding

We want $s$ to be large and $d$ to be small.

How small can $d$ be with respect to $s$? A worked example follows the argument below.

- A: Build a bipartite graph.
  - Left side: $n$ workers $W_1,\cdots,W_n$.
  - Right side: $n$ partial datasets $D_1,\cdots,D_n$.
  - Connect $W_i$ to $D_j$ if worker $i$ holds $D_j$.
    - Equivalently, if $B_{i,j}\neq 0$.
  - $\deg (W_i) \leq d$ by definition.
  - $\deg (D_j)\geq s+1$: otherwise some $s$ stragglers could hold all copies of $D_j$, and its partial gradient would be lost.
  - The sum of degrees on the left is $\leq nd$, on the right it is $\geq n(s+1)$, and the two sums are equal.
  - So $d\geq s+1$.

We can break this lower bound using approximate computation of the gradient.
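
For instance, with the illustrative parameters used in the sketch above, $n=7$ workers and $s=2$ stragglers: the right side needs at least $7\cdot(2+1)=21$ edges while the left side provides at most $7d$, so $d\geq 3$. The cyclic-RS construction with $d=s+1=3$ meets this bound with equality.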

@@ -24,4 +24,5 @@ export default {
CSE5313_L18: "CSE5313 Coding and information theory for data science (Lecture 18)",
CSE5313_L19: "CSE5313 Coding and information theory for data science (Lecture 19)",
CSE5313_L20: "CSE5313 Coding and information theory for data science (Lecture 20)",
CSE5313_L21: "CSE5313 Coding and information theory for data science (Lecture 21)",
}

@@ -1,2 +1,21 @@

# CSE5519 Advances in Computer Vision (Topic B: 2025: Vision-Language Models)

## Molmo and PixMo

[link to paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Deitke_Molmo_and_PixMo_Open_Weights_and_Open_Data_for_State-of-the-Art_CVPR_2025_paper.pdf)

## Novelty in Molmo and PixMo

PixMo dataset (712k images with long, 200+ word descriptions)

- Simplified two-stage training pipeline
  - Standard architecture: tokenizer, ViT image encoder (CLIP), and pooling of the image embeddings into a decoder-only LLM.
- Overlapping multi-crop policy (a generic illustration follows this list)
  - Large images are handled by cropping them into overlapping regions.
- Training over multiple annotations
- Text-only residual dropout
- Optimizer setups
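
As a generic illustration of overlapping cropping (this is not the paper's exact policy; the crop size and overlap values below are arbitrary):

```python
def overlapping_crops(width, height, crop=336, overlap=56):
    """Tile a large image into fixed-size crops whose neighbors overlap."""
    stride = crop - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), stride):
        for left in range(0, max(width - overlap, 1), stride):
            # Clamp so crops stay inside the image (assuming width, height >= crop).
            l = min(left, max(width - crop, 0))
            t = min(top, max(height - crop, 0))
            boxes.append((l, t, l + crop, t + crop))
    return boxes

# e.g. overlapping_crops(1024, 768) -> list of (left, top, right, bottom) boxes
```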

> [!TIP]
>
> This paper provides an interesting dataset and a refined training pipeline whose performance is comparable to the current closed-source SOTA. What is the contribution of the paper from the algorithm perspective? It seems to be mainly a test of a new dataset with a slightly altered training pipeline.