# CSE5313 Coding and information theory for data science (Lecture 21)
## Gradient coding
### Intro to Statistical Machine Learning
The learning problem is given by:
**Unknown** target function $f:X\to Y$.
Training data $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^n$.
Learning algorithm: $\mathbb{A}$ and Hypothesis set $\mathcal{H}$.
Goal is to find $g\approx f$,
Common hypothesis sets:
- Linear classifiers:
$$
f(x)=\mathrm{sign}(wx^\top)
$$
- Linear regressors:
$$
f(x)=wx^\top
$$
- Neural networks:
- Concatenated linear classifiers (or differentiable approximations thereof).
The dataset $\mathcal{D}$ is taken from unknown distribution $D$.
#### Common approach Empirical Risk Minimization
- Q: How to quantify $f\approx g$ ?
- A: Use a loss function.
- A function which measures the deviation between $g$'s output and $f$'s output.
- Using the chosen loss function, define two measures:
- True risk: $\mathbb{E}_{(x,y)\sim D}[\ell(g(x),y)]$
- Empirical risk: $\frac{1}{n}\sum_{i=1}^n \ell(g(x_i),y_i)$
Note that machine learning operates over $\mathbb{R}$ (rather than over a finite field).
### Gradient Descent (Motivation)
Parameterizing $g$ by a real vector $\vec{w}$, we want to minimize the empirical risk $ER(\vec{w})$.
Algorithm:
- Initialize $\vec{w}_0$.
- For $t=1,2,\cdots,T$:
- Compute $\nabla_{\vec{w}} ER(\vec{w})$.
- $\vec{w}\gets \vec{w}-\eta\nabla_{\vec{w}} ER(\vec{w})$
- Terminate if some stopping condition is met.
Bottleneck: Calculating $\nabla_{\vec{w}} ER(\vec{w})=\frac{1}{n}\sum_{i=1}^n \nabla_{\vec{w}} \ell(g(x_i),y_i)$.
Potentially $O(PN)$, where $N$ is the number of data points and $P$ is the dimension of the feature space.
Solution: Parallelize.
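A minimal sketch of the loop above for a linear regressor with squared loss (the model, loss, step size, and stopping rule here are illustrative assumptions, not fixed by the lecture):

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, T=1000, tol=1e-8):
    """Minimize ER(w) = (1/n) * sum_i (x_i w - y_i)^2 by full-batch gradient descent."""
    n, p = X.shape
    w = np.zeros(p)                              # initialize w_0
    for _ in range(T):
        grad = (2.0 / n) * X.T @ (X @ w - y)     # full gradient: O(n * p) per iteration
        w = w - eta * grad                       # gradient step
        if np.linalg.norm(grad) < tol:           # stopping condition
            break
    return w
```

Every iteration is dominated by the full-gradient computation, which is exactly the step that distributed gradient descent parallelizes.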
#### Distributed Gradient Descent
Idea: use a distributed system with **master** and **workers**.
Problem: Stragglers (slow servers, roughly 5-6 times slower than the average response time).
Potential Solutions:
- Wait for all servers:
- Accurate, but slow
- Sum results without the slowest ones
- Less accurate, but faster
- Introduce redundancy
- Send each $\mathcal{D}_i$ to more than one server.
- Each server receives more than one $\mathcal{D}_i$.
- Each server sends a linear combination of the partial gradients computed from its $\mathcal{D}_i$'s.
- The master decodes the sum of partial gradients from the linear combination.
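As a concrete illustration of the redundancy idea (a standard toy instance with $n=3$ workers, replication factor $2$, and one tolerated straggler; the coefficients below are one possible choice): let $v_1,v_2,v_3$ be the partial gradients of $\mathcal{D}_1,\mathcal{D}_2,\mathcal{D}_3$, and let the workers return
$$
u_1=\tfrac{1}{2}v_1+v_2,\qquad u_2=v_2-v_3,\qquad u_3=\tfrac{1}{2}v_1+v_3.
$$
Then $u_1+u_3=2u_1-u_2=u_2+2u_3=v_1+v_2+v_3$, so the master recovers the full gradient from any two of the three responses.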
### Problem setups
System setup: 1 master $M$, $n$ workers $W_1,\cdots,W_n$.
A dataset $\mathcal{D}=\{(x_i,y_i)\}$, partitioned into $n$ partial datasets $\mathcal{D}_1,\cdots,\mathcal{D}_n$.
Each worker $j$:
- Receives $d$ of the partial datasets, where $d$ is the replication factor.
- Computes a partial gradient vector $v_i$ from each $\mathcal{D}_i$ it holds.
- Returns a linear combination $u_j=\sum_{i} \alpha_{j,i} v_i$ (over the $\mathcal{D}_i$'s it holds) to the master.
The master:
- Waits for the first $n-s$ $u_i$'s to arrive ($s$ is the straggler tolerance factor).
- Linearly combines the $u_i$'s, with coefficients $\lambda_i$, to obtain the full gradient.
Goal: Retrieve $\sum_i v_i$ regardless of which $n-s$ workers responded.
- Computation of the full gradient that tolerates any $s$ stragglers.
The $\alpha_{i,j}$'s are fixed (do not depend on data). These form a **gradient coding matrix** $B\in\mathbb{C}^{n\times n}$.
Row $i$ has $d$ nonzero entries $\alpha_{i,j}$ at some positions.
The $\lambda_i$'s:
- Might depend on the identity of the $n-s$ responses.
- Nevertheless, they must exist in any case.
Recall:
- The master must be able to recover $\sum_i v_i$ from any $n-s$ responses.
Let $\mathcal{K}$ be the indices of the responses.
Let $B_\mathcal{K}$ be the submatrix of $B$ indexed by $\mathcal{K}$.
Must have:
- For every $\mathcal{K}$ of size $n-s$ there exists coefficients $\lambda_1,\cdots,\lambda_{n-s}$ such that:
$$
(\lambda_1,\cdots,\lambda_{n-s})B_\mathcal{K}=(1,1,1,1,\cdots,1)=\mathbb{I}
$$
Then if $\mathcal{K}=\{i_1,\cdots,i_{n-s}\}$ responded,
$$
(\lambda_1,\cdots,\lambda_{n-s})\begin{pmatrix}
u_{i_1}\\
u_{i_2}\\
\vdots\\
u_{i_{n-s}}
\end{pmatrix}=\sum_i v_i
$$
#### Definition of gradient coding matrix.
For replication factor $d$ and straggler tolerance factor $s$: $B\in\mathbb{C}^{n\times n}$ is a gradient coding matrix if:
- $\mathbb{I}$ is in the span of any $n-s$ rows.
- Every row of $B$ contains at most $d$ nonzero elements.
A gradient coding matrix implies a gradient coding algorithm:
- The master sends the partial datasets indexed by $S_i$ to worker $i$, where $S_i\subseteq\{1,\cdots,n\}$ is the set of nonzero indices of row $i$ of $B$.
- Worker $i$:
- Computes $\mathcal{D}_{i,\ell}\to v_{i,\ell}$ for $\ell=1,\cdots,d$.
- Sends $u_i=\sum_{j=1}^d \alpha_{i,j}v_{i,j}$ to the master.
- Let $\mathcal{K}=\{i_1,\cdots,i_{n-s}\}$ be the indices of the first $n-s$ responses.
- Since $\mathbb{I}$ is in the span of any $n-s$ rows of $B$, there exist $\lambda_1,\cdots,\lambda_{n-s}$ such that $(\lambda_1,\cdots,\lambda_{n-s})(u_{i_1},\cdots,u_{i_{n-s}})^\top=\sum_i v_i$.
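A short numpy sketch of the master's side of this algorithm (the function name and the least-squares solve are illustrative assumptions; any way of finding $\lambda$ with $\lambda B_{\mathcal{K}}=\mathbb{I}$ works, and such a $\lambda$ exists by the definition of a gradient coding matrix):

```python
import numpy as np

def decode_gradient(B, responded, U):
    """Recover sum_i v_i from the first n - s responses.

    B         : (n, n) gradient coding matrix
    responded : list of the n - s worker indices K = {i_1, ..., i_{n-s}}
    U         : (n - s, p) array whose row t is the response u_{i_t}
    """
    B_K = B[responded, :]                          # rows of B indexed by K
    ones = np.ones(B.shape[1], dtype=B.dtype)
    # Solve lambda @ B_K = (1, ..., 1), i.e. B_K^T @ lambda = 1, in the least-squares sense;
    # an exact solution exists since the all-ones vector is in the span of these rows.
    lam, *_ = np.linalg.lstsq(B_K.T, ones, rcond=None)
    return lam @ U                                 # equals sum_i v_i
```

If the partial gradients are real and $B$ is complex, one would take the real part of the result in practice.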
#### Construction of Gradient Coding Matrices
Goal:
- For a given straggler tolerance parameter $s$, we wish to construct a gradient coding matrix $B$ with the smallest possible $d$.
- Tools:
I. Cyclic Reed-Solomon codes over the complex numbers.
II. Definition of $\mathcal{C}^\perp$ (dual of $\mathcal {C}$) and $\mathcal{C}^R$ (reverse of $\mathcal{C}$).
III. A simple lemma about MDS codes.
Recall: An $[n,k]$ Reed-Solomon code over a field $\mathcal{F}$ is defined as follows.
- Fix distinct $\alpha_1,\cdots,\alpha_n\in\mathcal{F}$.
- $\mathcal{C}=\{(f(\alpha_1),f(\alpha_2),\cdots,f(\alpha_n))\mid f\in\mathcal{F}[x],\ \deg f<k\}$.
- Dimension $k$ and minimum distance $n-k+1$ follow from $\mathcal{F}$ being a field.
- Also works for $\mathcal {F}=\mathbb{C}$.
### I. Cyclic Reed-Solomon codes over the complex numbers.
The following Reed-Solomon code over the complex numbers is called a cyclic code.
- Let $i=\sqrt{-1}$.
- For $j\in \{0,\cdots,n-1\}$, choose $a_j=e^{2\pi i j/n}$; the $a_j$'s are the $n$-th roots of unity.
- Use these $a_j$'s to define a Reed-Solomon code as usual.
This code is cyclic:
- Let $c=(f_c(a_0),f_c(a_1),\cdots,f_c(a_{n-1}))\in\mathcal{C}$ for some $f_c(x)\in \mathbb{C}[x]$ with $\deg f_c<k$.
- Need to show that the cyclic shift $c'=(f_c(a_1),f_c(a_2),\cdots,f_c(a_{n-1}),f_c(a_0))$ is also a codeword.
- Let $f_{c'}(x)=f_c(a_1x)$; it has the same degree as $f_c$, and since $a_1a_j=a_{(j+1)\bmod n}$,
$$
c'=(f_c(a_1),f_c(a_2),\cdots,f_c(a_{n-1}),f_c(a_0))=(f_{c'}(a_0),f_{c'}(a_1),\cdots,f_{c'}(a_{n-1}))\in\mathcal{C}.
$$
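A small numpy check of this cyclicity argument (the parameters $n$, $k$ and the random polynomial are arbitrary illustrative choices):

```python
import numpy as np

n, k = 7, 3
a = np.exp(2j * np.pi * np.arange(n) / n)          # n-th roots of unity a_0, ..., a_{n-1}
f = np.random.randn(k)                             # coefficients of f_c (constant term first), deg f_c < k

def evaluate(coeffs, points):
    # np.polyval expects the highest-degree coefficient first
    return np.polyval(coeffs[::-1], points)

c = evaluate(f, a)                                 # codeword (f_c(a_0), ..., f_c(a_{n-1}))
c_shift = np.roll(c, -1)                           # cyclic shift (f_c(a_1), ..., f_c(a_{n-1}), f_c(a_0))

f_shift = f * a[1] ** np.arange(k)                 # coefficients of f_c(a_1 * x), same degree
print(np.allclose(c_shift, evaluate(f_shift, a)))  # True: the shift is again a codeword
```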
### II. Dual and Reversed Codes
- Let $\mathcal{C}=[n,k,d]_{\mathbb{F}}$ be an MDS code.
#### Definition for dual code of $\mathcal{C}$
The dual code of $\mathcal{C}$ is
$$
\mathcal{C}^\perp=\{c'\in \mathbb{F}^n|c'c^\top=0\text{ for all }c\in \mathcal{C}\}
$$
Claim: $\mathcal{C}^\perp$ is an $[n,n-k,k+1]_{\mathbb{F}}$ code.
#### Definition for reversed code of $\mathcal{C}$
The reversed code of $\mathcal{C}$ is
$$
\mathcal{C}^R=\{(c_{n-1},\cdots,c_0)|(c_0,\cdots,c_{n-1})\in \mathcal{C}\}
$$
We claim that if $\mathcal{C}$ is cyclic, then $\mathcal{C}^R$ is cyclic.
### III. Lemma about MDS codes
Let $\mathcal{C}=[n,k,n-k+1]_{\mathbb{F}}$ be an MDS code.
#### Lemma
For any subset $\mathcal{K}\subseteq \{0,\cdots,n-1\}$ of size $n-k+1$, there exists $c\in \mathcal{C}$ whose support (set of nonzero indices) is $\mathcal{K}$.
<details>
<summary>Proof</summary>
Let $G\in \mathbb{F}^{k\times n}$ be a generator matrix, and let $G_{\mathcal{K}^c}\in \mathbb{F}^{k\times (k-1)}$ be its restriction to columns not indexed by $\mathcal{K}$.
$G_{\mathcal{K}^c}$ has more rows than columns, so there exists a nonzero $v\in \mathbb{F}^{k}$ such that $vG_{\mathcal{K}^c}=0$.
So $c=vG$ has at least $|\mathcal{K}^c|=k-1$ zeros, in the entries indexed by $\mathcal{K}^c$.
The remaining $n-(k-1)=n-k+1$ entries of $c$, indexed by $\mathcal{K}$, must all be nonzero: since $G$ has full rank, $c=vG\neq 0$, and a nonzero codeword with fewer than $n-k+1$ nonzero entries would contradict the minimum distance.
Thus the support of $c$ is $\mathcal{K}$.
</details>
### Construct gradient coding matrix
Consider any $n$ workers and $s$ stragglers.
Let $d=s+1$.
Let $\mathcal{C}=[n,n-s]_{\mathbb{C}}$ be the cyclic RS code built in I.
Then by III, there exists $c\in \mathcal{C}$ whose support is the first $n-(n-s)+1=s+1$ entries.
Denote $c=(\beta_1,\cdots,\beta_{s+1},0,0,\cdots,0)$ for some nonzero $\beta_1,\cdots,\beta_{s+1}$.
Build:
$B\in \mathbb{C}^{n\times n}$ whose columns are all cyclic shifts of $c$.
We claim that $B$ is a gradient coding matrix.
$$
B=\begin{pmatrix}
\beta_1 & 0 & \cdots & 0 & \beta_{s+1} & \cdots & \beta_2 \\
\beta_2 & \beta_1 & \ddots & & 0 & \ddots & \vdots \\
\vdots & \vdots & \ddots & \ddots & & \ddots & \beta_{s+1} \\
\beta_{s+1} & \beta_s & & \ddots & \ddots & & 0 \\
0 & \beta_{s+1} & \ddots & & \ddots & \ddots & \vdots \\
\vdots & & \ddots & \ddots & & \ddots & 0 \\
0 & \cdots & 0 & \beta_{s+1} & \beta_s & \cdots & \beta_1
\end{pmatrix}
$$
<details>
<summary>Proof</summary>
Every row is a codeword in $\mathcal{C}^R$.
- Specifically, a shift of $(0,\cdots,0,\beta_{s+1},\cdots,\beta_1)$.
- Then every row contains $\leq d=s+1$ nonzeros.
$\mathbb{I}$ is in the span of any $n-s$ rows of $B$.
- Observe that $\mathbb{I}\in \mathcal{C}$ (evaluate the polynomial $f(x)=1$ at $a_0,\cdots,a_{n-1}$).
- Then $\mathbb{I}\in \mathcal{C}^R$, since the all-ones vector is invariant under reversal.
- Therefore, it suffices to show that any $n-s$ rows of $B$ span $\mathcal{C}^R$.
- Since $\dim \mathcal{C}=\dim \mathcal{C}^R=n-s$, it suffices to show that any $n-s$ rows are independent.
Observe: The leftmost $n-s$ columns of $B$ are linearly independent (their supports $\{j,\cdots,j+s\}$ form a staircase), and since they are cyclic shifts of $c\in\mathcal{C}$ and $\mathcal{C}$ is cyclic, they span $\mathcal{C}$.
Assume for contradiction that there exist $n-s$ linearly dependent rows.
Then there exists a nonzero $v\in \mathbb{C}^{n}$, supported on those $n-s$ row indices, such that $vB=0$.
Hence $v$ is orthogonal to every column of $B$, and the columns span $\mathcal{C}$.
So $v\in \mathcal{C}^\perp$.
From II, $\mathcal{C}^\perp$ is an $[n,s]$ MDS code, and hence every nonzero $v\in \mathcal{C}^\perp$ has Hamming weight $\geq n-s+1$.
This contradicts $v$ having Hamming weight at most $n-s$.
</details>
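A numpy sketch of the full construction together with a brute-force check of both gradient coding properties (choosing $c$ as the evaluation of $f(x)=\prod_{j=s+1}^{n-1}(x-a_j)$ is one concrete way to get support on the first $s+1$ coordinates; it is an illustrative choice, not the only one):

```python
import numpy as np
from itertools import combinations

def gradient_coding_matrix(n, s):
    """Build B in C^{n x n} whose columns are all cyclic shifts of a codeword c
    of the cyclic [n, n-s] RS code with support on the first s + 1 coordinates."""
    a = np.exp(2j * np.pi * np.arange(n) / n)       # n-th roots of unity
    f = np.poly(a[s + 1:])                          # monic poly of degree n-s-1 vanishing at a_{s+1}, ..., a_{n-1}
    c = np.polyval(f, a)                            # c_j = f(a_j); support is {0, ..., s}
    return np.column_stack([np.roll(c, t) for t in range(n)])

n, s = 6, 2
B = gradient_coding_matrix(n, s)

# Every row has at most d = s + 1 nonzero entries.
print(np.all(np.sum(~np.isclose(B, 0), axis=1) <= s + 1))

# The all-ones vector lies in the span of any n - s rows.
ok = True
for K in combinations(range(n), n - s):
    lam, *_ = np.linalg.lstsq(B[list(K), :].T, np.ones(n, dtype=complex), rcond=None)
    ok = ok and np.allclose(lam @ B[list(K), :], np.ones(n))
print(ok)
```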
### Bound for gradient coding
We want $s$ to be large and $d$ to be small.
How small can $d$ be with respect to $s$?
- A: Build a bipartite graph.
- Left side: $n$ workers $W_1,\cdots,W_n$.
- Right side: $n$ partial datasets $D_1,\cdots,D_n$.
- Connect $W_i$ to $D_j$ if worker $i$ holds $D_j$.
- Equivalently, if $B_{i,j}\neq 0$.
- $\deg (W_i) = d$ by definition.
- $\deg (\mathcal{D}_j)\geq s+1$, since otherwise all workers holding $D_j$ could straggle and its partial gradient could not be recovered.
- Summing degrees: the left side gives $nd$ and the right side gives at least $n(s+1)$.
- So $d\geq s+1$.
This lower bound can be circumvented by settling for an approximate gradient computation.