From 34afc00a7f9fa82088005314cc38389dfb51cfcc Mon Sep 17 00:00:00 2001
From: Trance-0 <60459821+Trance-0@users.noreply.github.com>
Date: Tue, 11 Nov 2025 12:45:58 -0600
Subject: [PATCH] updates

---
 content/CSE510/CSE510_L22.md   |  50 ++++++
 content/CSE510/_meta.js        |   1 +
 content/CSE5313/CSE5313_L21.md | 316 +++++++++++++++++++++++++++++++++
 content/CSE5313/_meta.js       |   1 +
 content/CSE5519/CSE5519_B5.md  |  19 ++
 5 files changed, 387 insertions(+)
 create mode 100644 content/CSE510/CSE510_L22.md
 create mode 100644 content/CSE5313/CSE5313_L21.md

diff --git a/content/CSE510/CSE510_L22.md b/content/CSE510/CSE510_L22.md
new file mode 100644
index 0000000..18066e9
--- /dev/null
+++ b/content/CSE510/CSE510_L22.md
@@ -0,0 +1,50 @@
+# CSE510 Deep Reinforcement Learning (Lecture 22)
+
+## Offline Reinforcement Learning
+
+### Requirements for Current Successes
+
+- Access to the Environment Model or Simulator
+- Not Costly for Exploration or Trial-and-Error
+
+#### Background: Offline RL
+
+- The success of modern machine learning
+  - Scalable data-driven learning methods (GPT-4, CLIP, DALL·E, Sora)
+- Reinforcement learning
+  - Online learning paradigm
+  - Interaction is expensive & dangerous
+  - Healthcare, Robotics, Recommendation...
+- Can we develop data-driven offline RL?
+
+#### Definition of Offline RL
+
+- The policy $\pi_k$ is updated with a static dataset $\mathcal{D}$, which is collected by an _unknown behavior policy_ $\pi_\beta$
+- Interaction is not allowed
+
+- $\mathcal{D}=\{(s_i,a_i,s_i',r_i)\}$
+- $s\sim d^{\pi_\beta}(s)$
+- $a\sim \pi_\beta(a|s)$
+- $s'\sim p(s'|s,a)$
+- $r\gets r(s,a)$
+- Objective: $\max_\pi\sum_{t=0}^{T}\mathbb{E}_{s_t\sim d^\pi(s),a_t\sim \pi(a|s)}[\gamma^t r(s_t,a_t)]$
+
+#### Key challenge in Offline RL
+
+Distribution shift.
+
+How about using traditional reinforcement learning (bootstrapping)?
+
+$$
+Q(s,a)=r(s,a)+\gamma \max_{a'\in A} Q(s',a')
+$$
+
+$$
+\pi(s)=\arg\max_{a\in A} Q(s,a)
+$$
+
+But notice that
+
+$$
+P_{\pi_\beta}(s,a)\neq P_{\pi_f}(s,a)
+$$
\ No newline at end of file
diff --git a/content/CSE510/_meta.js b/content/CSE510/_meta.js
index 780b495..9680727 100644
--- a/content/CSE510/_meta.js
+++ b/content/CSE510/_meta.js
@@ -24,4 +24,5 @@ export default {
   CSE510_L19: "CSE510 Deep Reinforcement Learning (Lecture 19)",
   CSE510_L20: "CSE510 Deep Reinforcement Learning (Lecture 20)",
   CSE510_L21: "CSE510 Deep Reinforcement Learning (Lecture 21)",
+  CSE510_L22: "CSE510 Deep Reinforcement Learning (Lecture 22)",
 }
\ No newline at end of file
diff --git a/content/CSE5313/CSE5313_L21.md b/content/CSE5313/CSE5313_L21.md
new file mode 100644
index 0000000..8a731c7
--- /dev/null
+++ b/content/CSE5313/CSE5313_L21.md
@@ -0,0 +1,316 @@
+# CSE5313 Coding and information theory for data science (Lecture 21)
+
+## Gradient coding
+
+### Intro to Statistical Machine Learning
+
+We are given the following learning problem:
+
+**Unknown** target function $f:X\to Y$.
+
+Training data $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^n$.
+
+Learning algorithm $\mathbb{A}$ and hypothesis set $\mathcal{H}$.
+
+Goal: find $g\in\mathcal{H}$ such that $g\approx f$.
+
+Common hypothesis sets:
+
+- Linear classifiers:
+
+$$
+f(x)=\mathrm{sign}(wx^\top)
+$$
+
+- Linear regressors:
+
+$$
+f(x)=wx^\top
+$$
+
+- Neural networks:
+  - Concatenated linear classifiers (or differentiable approximations thereof).
+
+The dataset $\mathcal{D}$ is drawn from an unknown distribution $D$.
+
+#### Common approach – Empirical Risk Minimization
+
+- Q: How do we quantify $f\approx g$?
+- A: Use a loss function.
+  - A function which measures the deviation between $g$'s output and $f$'s output.
+- Using the chosen loss function, define two measures:
+  - True risk: $\mathbb{E}_{(x,y)\sim D}[\ell(g(x),y)]$
+  - Empirical risk: $ER=\frac{1}{n}\sum_{i=1}^n \ell(g(x_i),y_i)$
+
+Machine learning is done over $\mathbb{R}$.
+
+### Gradient Descent (Motivation)
+
+Parameterizing $g$ by a real vector $\vec{w}$, we want to minimize $ER(\vec{w})$.
+
+Algorithm (a short code sketch follows the list):
+
+- Initialize $\vec{w}_0$.
+- For $t=1,2,\cdots,T$:
+  - Compute $\nabla_{\vec{w}} ER(\vec{w})$.
+  - $\vec{w}\gets \vec{w}-\eta\nabla_{\vec{w}} ER(\vec{w})$
+  - Terminate if some stop condition is met.
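+
+Below is a minimal NumPy sketch (my own illustration, not part of the lecture) of this loop for a linear regressor $g(x)=wx^\top$ with squared loss; the toy data, step size, and stopping rule are arbitrary choices.
+
+```python
+import numpy as np
+
+# Toy data: N points in P dimensions, labels from a hidden linear model.
+rng = np.random.default_rng(0)
+N, P = 200, 5
+X = rng.normal(size=(N, P))
+w_true = rng.normal(size=P)
+y = X @ w_true + 0.1 * rng.normal(size=N)
+
+def empirical_risk_grad(w):
+    # ER(w) = (1/N) * sum_i (w.x_i - y_i)^2, so the gradient is (2/N) X^T (Xw - y).
+    return (2.0 / N) * (X.T @ (X @ w - y))
+
+w = np.zeros(P)      # initialize w_0
+eta = 0.1            # learning rate
+for t in range(500):
+    g = empirical_risk_grad(w)      # O(N * P) work per iteration
+    w = w - eta * g
+    if np.linalg.norm(g) < 1e-8:    # stop condition
+        break
+
+# The iterate approaches the least-squares minimizer of the empirical risk.
+print(np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0], atol=1e-3))
+```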
+
+Bottleneck: calculating $\nabla_{\vec{w}} ER(\vec{w})=\frac{1}{n}\sum_{i=1}^n \nabla_{\vec{w}} \ell(g(x_i),y_i)$.
+
+Potentially $O(PN)$, where $N$ is the number of data points and $P$ is the dimension of the feature space.
+
+Solution: parallelize.
+
+#### Distributed Gradient Descent
+
+Idea: use a distributed system with a **master** and **workers**.
+
+Problem: stragglers (slow servers, 5-6 times slower than the average).
+
+Potential solutions:
+
+- Wait for all servers:
+  - Accurate, but slow.
+- Sum the results without the slowest ones:
+  - Less accurate, but faster.
+- Introduce redundancy:
+  - Send each $\mathcal{D}_i$ to more than one server.
+  - Each server receives more than one $\mathcal{D}_i$.
+  - Each server sends a linear combination of the partial gradients of its $\mathcal{D}_i$'s.
+  - The master decodes the sum of partial gradients from the linear combinations.
+
+### Problem setup
+
+System setup: 1 master $M$, $n$ workers $W_1,\cdots,W_n$.
+
+A dataset $\mathcal{D}=\{(x_i,y_i)\}$, partitioned into $n$ partial datasets $\mathcal{D}_1,\cdots,\mathcal{D}_n$.
+
+Each worker $i$:
+
+- Receives $d$ of the partial datasets, where $d$ is the replication factor.
+- Computes a vector $v_j$ (the partial gradient) from each partial dataset $\mathcal{D}_j$ it holds.
+- Returns a linear combination $u_i=\sum_{j=1}^n \alpha_{i,j} v_j$ to the master (with $\alpha_{i,j}=0$ when worker $i$ does not hold $\mathcal{D}_j$).
+
+The master:
+
+- Waits for the first $n-s$ $u_i$'s to arrive ($s$ is the straggler tolerance factor).
+- Linearly combines the $u_i$'s to get the final gradient.
+
+Goal: retrieve $\sum_i v_i$ regardless of which $n-s$ workers responded.
+
+- Computation of the full gradient that tolerates any $s$ stragglers.
+
+The $\alpha_{i,j}$'s are fixed (they do not depend on the data). They form a **gradient coding matrix** $B\in\mathbb{C}^{n\times n}$.
+
+Row $i$ has $d$ nonzero entries $\alpha_{i,j}$ at some positions.
+
+The $\lambda_i$'s:
+
+- Might depend on the identity of the $n-s$ responses.
+- Nevertheless, they must exist in every case.
+
+Recall:
+
+- The master must be able to recover $\sum_i v_i$ from any $n-s$ responses.
+
+Let $\mathcal{K}$ be the indices of the responses.
+
+Let $B_\mathcal{K}$ be the submatrix of $B$ formed by the rows indexed by $\mathcal{K}$.
+
+Must have:
+
+- For every $\mathcal{K}$ of size $n-s$ there exist coefficients $\lambda_1,\cdots,\lambda_{n-s}$ such that:
+
+$$
+(\lambda_1,\cdots,\lambda_{n-s})B_\mathcal{K}=(1,1,\cdots,1)=\mathbb{I}
+$$
+
+Then if $\mathcal{K}=\{i_1,\cdots,i_{n-s}\}$ responded,
+
+$$
+(\lambda_1,\cdots,\lambda_{n-s})\begin{pmatrix}
+u_{i_1}\\
+u_{i_2}\\
+\vdots\\
+u_{i_{n-s}}
+\end{pmatrix}=\sum_i v_i
+$$
+
+#### Definition of a gradient coding matrix
+
+For replication factor $d$ and straggler tolerance factor $s$, $B\in\mathbb{C}^{n\times n}$ is a gradient coding matrix if the following hold (a numerical check is sketched after the list):
+
+- $\mathbb{I}$ is in the span of any $n-s$ rows.
+- Every row of $B$ contains at most $d$ nonzero elements.
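+
+A small sanity check of this definition (my own sketch, assuming NumPy; the example matrix is a toy choice): for every subset of $n-s$ rows, solve for the $\lambda$'s by least squares and verify that they reproduce the all-ones vector.
+
+```python
+import numpy as np
+from itertools import combinations
+
+def is_gradient_coding_matrix(B, s, d, tol=1e-9):
+    n = B.shape[0]
+    ones = np.ones(n)
+    # Condition 2: every row has at most d nonzero entries.
+    if np.any((np.abs(B) > tol).sum(axis=1) > d):
+        return False
+    # Condition 1: the all-ones vector lies in the span of any n - s rows.
+    for K in combinations(range(n), n - s):
+        B_K = B[list(K), :]                                   # (n-s) x n submatrix
+        lam, *_ = np.linalg.lstsq(B_K.T, ones, rcond=None)    # lam @ B_K ~= (1, ..., 1)
+        if not np.allclose(lam @ B_K, ones, atol=1e-6):
+            return False
+    return True
+
+# Trivial example with n = 2 workers, s = 1 straggler, d = 2:
+# both workers hold both partial datasets, so either response alone suffices.
+B = np.array([[1.0, 1.0],
+              [2.0, 2.0]])
+print(is_gradient_coding_matrix(B, s=1, d=2))   # True
+```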
+
+A gradient coding matrix implies a gradient coding algorithm:
+
+- The master sends the partial datasets indexed by $S_i$ to worker $i$, where $S_i$ is the set of nonzero indices of row $i$ of $B$.
+- Worker $i$:
+  - Computes $\mathcal{D}_{i,\ell}\to v_{i,\ell}$ for $\ell=1,\cdots,d$.
+  - Sends $u_i=\sum_{j=1}^d \alpha_{i,j}v_{i,j}$ to the master.
+- Let $\mathcal{K}=\{i_1,\cdots,i_{n-s}\}$ be the indices of the first $n-s$ responses.
+- Since $\mathbb{I}$ is in the span of any $n-s$ rows of $B$, there exist $\lambda_1,\cdots,\lambda_{n-s}$ such that $(\lambda_1,\cdots,\lambda_{n-s})(u_{i_1},\cdots,u_{i_{n-s}})^\top=\sum_i v_i$.
+
+#### Construction of Gradient Coding Matrices
+
+Goal:
+
+- For a given straggler tolerance parameter $s$, construct a gradient coding matrix $B$ with the smallest possible $d$.
+
+Tools:
+
+- I. Cyclic Reed-Solomon codes over the complex numbers.
+- II. The definitions of $\mathcal{C}^\perp$ (the dual of $\mathcal{C}$) and $\mathcal{C}^R$ (the reverse of $\mathcal{C}$).
+- III. A simple lemma about MDS codes.
+
+Recall: an $[n,k]$ Reed-Solomon code over a field $\mathcal{F}$ is defined as follows.
+
+- Fix distinct $\alpha_0,\cdots,\alpha_{n-1}\in\mathcal{F}$.
+- $\mathcal{C}=\{(f(\alpha_0),f(\alpha_1),\cdots,f(\alpha_{n-1}))\mid f\in\mathcal{F}[x],\deg f<k\}$.
+- Dimension $k$ and minimum distance $n-k+1$ follow from $\mathcal{F}$ being a field.
+- This also works for $\mathcal{F}=\mathbb{C}$.
+
+### I. Cyclic Reed-Solomon codes over the complex numbers
+
+The following Reed-Solomon code over the complex numbers is a cyclic code.
+
+- Let $i=\sqrt{-1}$.
+- For $j\in \{0,\cdots,n-1\}$, choose $a_j=e^{2\pi i j/n}$. The $a_j$'s are the roots of unity of order $n$.
+- Use these $a_j$'s to define a Reed-Solomon code as usual.
+
+This code is cyclic:
+
+- Let $c=(f_c(a_0),f_c(a_1),\cdots,f_c(a_{n-1}))$ for some $f_c(x)\in \mathbb{C}[x]$ with $\deg f_c<k$.
+- Need to show that the cyclic shift $c'=(f_c(a_1),\cdots,f_c(a_{n-1}),f_c(a_0))$ is also a codeword.
+- Let $g(x)=f_c(a_1x)$; then $\deg g=\deg f_c<k$, and since $a_1a_j=a_{(j+1)\bmod n}$,
+
+$$
+c'=(f_c(a_1),f_c(a_2),\cdots,f_c(a_{n-1}),f_c(a_0))=(g(a_0),g(a_1),\cdots,g(a_{n-1}))\in\mathcal{C}
+$$
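+
+A quick numerical illustration of this argument (my own sketch, assuming NumPy): evaluate a random low-degree polynomial at the $n$-th roots of unity, shift the resulting codeword cyclically, and check that the shift equals the evaluation of $g(x)=f(a_1x)$.
+
+```python
+import numpy as np
+
+n, k = 7, 3
+a = np.exp(2j * np.pi * np.arange(n) / n)          # a_j = e^{2*pi*i*j/n}
+
+rng = np.random.default_rng(1)
+f = rng.normal(size=k) + 1j * rng.normal(size=k)   # coefficients of f, deg f < k
+
+def evaluate(coeffs, points):
+    # Returns (p(points[0]), ..., p(points[-1])) for p(x) = sum_m coeffs[m] x^m.
+    return np.array([np.sum(coeffs * p ** np.arange(len(coeffs))) for p in points])
+
+c = evaluate(f, a)          # codeword (f(a_0), ..., f(a_{n-1}))
+c_shift = np.roll(c, -1)    # cyclic shift (f(a_1), ..., f(a_{n-1}), f(a_0))
+
+g = f * a[1] ** np.arange(k)    # g(x) = f(a_1 * x) has coefficients f_m * a_1^m
+print(np.allclose(c_shift, evaluate(g, a)))   # True: the shift is again a codeword
+```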
+
+### II. Dual and Reversed Codes
+
+Let $\mathcal{C}=[n,k,d]_{\mathbb{F}}$ be an MDS code.
+
+#### Definition of the dual code of $\mathcal{C}$
+
+The dual code of $\mathcal{C}$ is
+
+$$
+\mathcal{C}^\perp=\{c'\in \mathbb{F}^n\mid c'c^\top=0\text{ for all }c\in \mathcal{C}\}
+$$
+
+Claim: $\mathcal{C}^\perp$ is an $[n,n-k,k+1]_{\mathbb{F}}$ code.
+
+#### Definition of the reversed code of $\mathcal{C}$
+
+The reversed code of $\mathcal{C}$ is
+
+$$
+\mathcal{C}^R=\{(c_{n-1},\cdots,c_0)\mid(c_0,\cdots,c_{n-1})\in \mathcal{C}\}
+$$
+
+We claim that if $\mathcal{C}$ is cyclic, then $\mathcal{C}^R$ is cyclic.
+
+### III. Lemma about MDS codes
+
+Let $\mathcal{C}=[n,k,n-k+1]_{\mathbb{F}}$ be an MDS code.
+
+#### Lemma
+
+For any subset $\mathcal{K}\subseteq \{0,\cdots,n-1\}$ of size $n-k+1$, there exists $c\in \mathcal{C}$ whose support (set of nonzero indices) is exactly $\mathcal{K}$.
+
+Proof:
+
+Let $G\in \mathbb{F}^{k\times n}$ be a generator matrix, and let $G_{\mathcal{K}^c}\in \mathbb{F}^{k\times (k-1)}$ be its restriction to the columns not indexed by $\mathcal{K}$.
+
+$G_{\mathcal{K}^c}$ has more rows than columns, so there exists a nonzero $v\in \mathbb{F}^{k}$ such that $vG_{\mathcal{K}^c}=0$.
+
+So $c=vG$ has at least $|\mathcal{K}^c|=k-1$ zeros, in the entries indexed by $\mathcal{K}^c$.
+
+Since $v\neq 0$ and $G$ has full rank, $c$ is a nonzero codeword, so its Hamming weight is at least $d=n-k+1$; hence the remaining $n-(k-1)=n-k+1$ entries of $c$, indexed by $\mathcal{K}$, must all be nonzero.
+
+Thus the support of $c$ is $\mathcal{K}$.
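+
+The proof is constructive. Here is a small numerical sketch of it (my own, assuming NumPy) for a $[7,3]$ Reed-Solomon code over $\mathbb{C}$ and an arbitrary target support: take $v$ from the left null space of $G_{\mathcal{K}^c}$ and check the support of $vG$.
+
+```python
+import numpy as np
+
+n, k = 7, 3
+a = np.exp(2j * np.pi * np.arange(n) / n)
+G = np.vander(a, k, increasing=True).T        # k x n generator matrix, G[m, j] = a_j**m
+
+K = [0, 2, 3, 5, 6]                           # target support, |K| = n - k + 1
+K_c = [j for j in range(n) if j not in K]     # complement: k - 1 columns
+
+# v @ G[:, K_c] = 0  <=>  G[:, K_c].T @ v = 0: take v from the null space (via SVD).
+_, _, Vh = np.linalg.svd(G[:, K_c].T)
+v = Vh[-1].conj()                             # nonzero vector annihilating G_{K^c}
+
+c = v @ G                                     # codeword with zeros on K^c
+support = set(np.flatnonzero(np.abs(c) > 1e-9))
+print(support == set(K))                      # True: the support of c is exactly K
+```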
+
+### Constructing the gradient coding matrix
+
+Consider any $n$ workers and $s$ stragglers.
+
+Let $d=s+1$.
+
+Let $\mathcal{C}=[n,n-s]_{\mathbb{C}}$ be the cyclic RS code built in I.
+
+Then by III, there exists $c\in \mathcal{C}$ whose support is the first $n-(n-s)+1=s+1$ entries.
+
+Denote $c=(\beta_1,\cdots,\beta_{s+1},0,0,\cdots,0)$ for some nonzero $\beta_1,\cdots,\beta_{s+1}$.
+
+Build $B\in \mathbb{C}^{n\times n}$ whose columns are all the cyclic shifts of $c$.
+
+We claim that $B$ is a gradient coding matrix.
+
+$$
+B=\begin{pmatrix}
+\beta_1 & 0 & \cdots & 0 & \beta_{s+1} & \beta_s & \cdots & \beta_2 \\
+\beta_2 & \beta_1 & \ddots & & 0 & \beta_{s+1} & \ddots & \vdots \\
+\vdots & \beta_2 & \ddots & \ddots & & \ddots & \ddots & \beta_{s+1} \\
+\beta_{s+1} & \vdots & \ddots & \ddots & \ddots & & \ddots & 0 \\
+0 & \beta_{s+1} & \ddots & \ddots & \ddots & \ddots & & \vdots \\
+\vdots & 0 & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\
+\vdots & \vdots & \ddots & \ddots & & \ddots & \ddots & 0 \\
+0 & 0 & \cdots & \beta_{s+1} & \beta_s & \beta_{s-1} & \cdots & \beta_1
+\end{pmatrix}
+$$
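+
+A numerical sketch of this construction (my own, assuming NumPy): take $c$ as the evaluation of $f(x)=\prod_{m=s+1}^{n-1}(x-a_m)$, which has degree $n-s-1<n-s$ and vanishes exactly at $a_{s+1},\cdots,a_{n-1}$, so $c\in\mathcal{C}$ and its support is the first $s+1$ entries; then stack all cyclic shifts of $c$ as columns and check the two defining properties.
+
+```python
+import numpy as np
+from itertools import combinations
+
+n, s = 7, 2
+d = s + 1
+a = np.exp(2j * np.pi * np.arange(n) / n)
+
+# c_j = f(a_j) with f(x) = prod_{m=s+1}^{n-1} (x - a_m): deg f = n - s - 1 < n - s.
+c = np.array([np.prod(x - a[s + 1:]) for x in a])   # support of c is {0, ..., s}
+
+# Columns of B are all n cyclic shifts of c.
+B = np.column_stack([np.roll(c, shift) for shift in range(n)])
+
+print((np.abs(B) > 1e-9).sum(axis=1))   # every row has exactly d = s + 1 nonzeros
+
+# The all-ones vector is in the span of any n - s rows.
+ones = np.ones(n)
+ok = True
+for K in combinations(range(n), n - s):
+    B_K = B[list(K), :]
+    lam, *_ = np.linalg.lstsq(B_K.T, ones, rcond=None)
+    ok = ok and np.allclose(lam @ B_K, ones, atol=1e-6)
+print(ok)                               # True
+```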
+
+Proof that $B$ is a gradient coding matrix:
+
+Every row is a codeword in $\mathcal{C}^R$.
+
+- Specifically, a cyclic shift of $(0,\cdots,0,\beta_{s+1},\cdots,\beta_1)$.
+- Hence every row contains at most $d=s+1$ nonzero entries.
+
+$\mathbb{I}$ is in the span of any $n-s$ rows of $B$:
+
+- Observe that $\mathbb{I}\in \mathcal{C}$ (evaluate the polynomial $f(x)=1$ at $a_0,\cdots,a_{n-1}$).
+- Then $\mathbb{I}\in \mathcal{C}^R$ as well.
+- Therefore, it suffices to show that any $n-s$ rows span $\mathcal{C}^R$.
+- Since $\dim \mathcal{C}=\dim \mathcal{C}^R=n-s$, it suffices to show that any $n-s$ rows are linearly independent.
+
+Observe: the leftmost $n-s$ columns of $B$ are linearly independent, and therefore span $\mathcal{C}$.
+
+Assume for contradiction that there exist $n-s$ linearly dependent rows.
+
+Then there exists a nonzero $v\in \mathbb{C}^{n}$, supported on those $n-s$ rows, such that $vB=0$.
+
+In particular, $v$ is orthogonal to a basis of $\mathcal{C}$ (the leftmost $n-s$ columns of $B$).
+
+So $v\in \mathcal{C}^\perp$.
+
+From II, $\mathcal{C}^\perp$ is an $[n,s]$ MDS code, and hence every nonzero $v\in \mathcal{C}^\perp$ has Hamming weight $\geq n-s+1$.
+
+This contradicts the fact that $v$ has Hamming weight at most $n-s$.
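+
+Putting the pieces together, a toy end-to-end run (my own sketch, assuming NumPy): each worker returns one linear combination of its partial gradients, an arbitrary set of $s$ workers straggles, and the master still decodes the exact sum of all partial gradients.
+
+```python
+import numpy as np
+
+n, s = 7, 2
+p = 4                                    # dimension of each partial gradient
+a = np.exp(2j * np.pi * np.arange(n) / n)
+c = np.array([np.prod(x - a[s + 1:]) for x in a])
+B = np.column_stack([np.roll(c, k) for k in range(n)])         # gradient coding matrix
+
+rng = np.random.default_rng(2)
+V = rng.normal(size=(n, p))              # row j holds the partial gradient v_j of D_j
+
+U = B @ V                                # row i is worker i's message u_i = sum_j B[i, j] v_j
+responders = sorted(rng.choice(n, size=n - s, replace=False))  # the first n - s to answer
+
+# Master: find lambda with lambda @ B_K = all-ones, then decode the full gradient.
+lam, *_ = np.linalg.lstsq(B[responders, :].T, np.ones(n), rcond=None)
+decoded = lam @ U[responders, :]
+
+print(np.allclose(decoded, V.sum(axis=0)))   # True: sum of all partial gradients recovered
+```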
+
+### Bound for gradient coding
+
+We want $s$ to be large and $d$ to be small.
+
+How small can $d$ be with respect to $s$?
+
+- A: Build a bipartite graph.
+  - Left side: the $n$ workers $W_1,\cdots,W_n$.
+  - Right side: the $n$ partial datasets $\mathcal{D}_1,\cdots,\mathcal{D}_n$.
+  - Connect $W_i$ to $\mathcal{D}_j$ if worker $i$ holds $\mathcal{D}_j$.
+    - Equivalently, if $B_{i,j}\neq 0$.
+  - $\deg (W_i) = d$ by definition.
+  - $\deg (\mathcal{D}_j)\geq s+1$, since otherwise the $s$ stragglers could be exactly the workers holding $\mathcal{D}_j$, and its partial gradient would be lost.
+  - Summing degrees: the left side gives $nd$, and the right side gives at least $n(s+1)$.
+  - So $d\geq s+1$.
+
+We can break this lower bound using approximate gradient computation.
\ No newline at end of file
diff --git a/content/CSE5313/_meta.js b/content/CSE5313/_meta.js
index bcc1380..f0aa3e9 100644
--- a/content/CSE5313/_meta.js
+++ b/content/CSE5313/_meta.js
@@ -24,4 +24,5 @@ export default {
   CSE5313_L18: "CSE5313 Coding and information theory for data science (Lecture 18)",
   CSE5313_L19: "CSE5313 Coding and information theory for data science (Lecture 19)",
   CSE5313_L20: "CSE5313 Coding and information theory for data science (Lecture 20)",
+  CSE5313_L21: "CSE5313 Coding and information theory for data science (Lecture 21)",
 }
\ No newline at end of file
diff --git a/content/CSE5519/CSE5519_B5.md b/content/CSE5519/CSE5519_B5.md
index 2d3a459..87ccf46 100644
--- a/content/CSE5519/CSE5519_B5.md
+++ b/content/CSE5519/CSE5519_B5.md
@@ -1,2 +1,21 @@
 # CSE5519 Advances in Computer Vision (Topic B: 2025: Vision-Language Models)
 
+## Molmo and PixMo
+
+[link to paper](https://openaccess.thecvf.com/content/CVPR2025/papers/Deitke_Molmo_and_PixMo_Open_Weights_and_Open_Data_for_State-of-the-Art_CVPR_2025_paper.pdf)
+
+## Novelty in Molmo and PixMo
+
+PixMo dataset (712k images with long, 200+ word descriptions)
+
+- Simplified two-stage training pipeline
+  - A standard ViT image encoder (CLIP) with a tokenizer; the pooled image embeddings are fed to a decoder-only LLM.
+- Overlapping multi-crop policy
+  - Large images are cut into overlapping crops.
+- Training over multiple annotations
+  - Text-only residual dropout
+- Optimizer setups
+
+> [!TIP]
+>
+> This paper provides an interesting dataset and a refined training pipeline whose performance is comparable to current closed-source SOTA. What is the contribution of the paper from an algorithmic perspective? It seems to be mainly a test of a new dataset with a slightly altered training pipeline.