CSE5313 Coding and information theory for data science (Lecture 21)
Gradient coding
Intro to Statistical Machine Learning
The learning problem is given by:
Unknown target function f:X\to Y.
Training data \mathcal{D}=\{(x_i,y_i)\}_{i=1}^n.
Learning algorithm \mathbb{A} and hypothesis set \mathcal{H}.
Goal is to find g\approx f.
Common hypothesis sets:
- Linear classifiers:
f(x)=sign (wx^\top)
- Linear regressors:
f(x)=wx^\top
- Neural networks:
- Concatenated linear classifiers (or differentiable approximations thereof).
The dataset \mathcal{D} is drawn from an unknown distribution D.
Common approach – Empirical Risk Minimization
- Q: How to quantify f\approx g?
- A: Use a loss function.
- A function which measures the deviation between $g$'s output and $f$'s output.
- Using the chosen loss function, define two measures:
  - True risk: \mathbb{E}_{D}[\ell(g(x),f(x))]
  - Empirical risk: \frac{1}{n}\sum_{i=1}^n \ell(g(x_i),y_i)
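A minimal sketch of the empirical risk, assuming a linear regressor g(x)=wx^\top with squared loss (illustrative, not from the lecture):

```python
import numpy as np

def empirical_risk(w, X, y):
    """Empirical risk of the linear regressor g(x) = x @ w under squared loss."""
    predictions = X @ w              # g(x_i) for every training point
    losses = (predictions - y) ** 2  # squared loss l(g(x_i), y_i)
    return losses.mean()             # average over the n training points
```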
Machine learning is over \mathbb{R}.
Gradient Descent (Motivation)
Parameterize g by a real vector \vec{w}; we want to minimize the empirical risk ER(\vec{w}).
Algorithm:
- Initialize \vec{w}_0.
- For t=1,2,\cdots,T:
  - Compute \nabla_{\vec{w}} ER(\vec{w}).
  - Update \vec{w}\gets \vec{w}-\eta\nabla_{\vec{w}} ER(\vec{w}).
  - Terminate if some stopping condition is met.
Bottleneck: Calculating \nabla_{\vec{w}} ER(\vec{w})=\frac{1}{n}\sum_{i=1}^n \nabla_{\vec{w}} \ell(g(x_i),y_i).
Potentially O(PN), where N is the number of data points and P is the dimension of the feature space.
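A minimal full-batch gradient descent sketch for this setting (again assuming a linear regressor with squared loss; the step size and stopping rule are illustrative):

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, T=1000, tol=1e-8):
    """Full-batch gradient descent on the empirical risk of a linear regressor."""
    n, p = X.shape
    w = np.zeros(p)                          # initialize w_0
    for _ in range(T):
        # Gradient of the empirical risk: the average of n per-example gradients,
        # each a length-p vector -- hence the O(PN) cost per iteration.
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w -= eta * grad                      # update step
        if np.linalg.norm(grad) < tol:       # stopping condition
            break
    return w
```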
Solution: Parallelize.
Distributed Gradient Descent
Idea: use a distributed system with master and workers.
Problem: Stragglers (slow servers, e.g., 5-6 times slower than the average response time).
Potential Solutions:
- Wait for all servers:
  - Accurate, but slow.
- Sum the results without the slowest ones:
  - Less accurate, but faster.
- Introduce redundancy:
  - Send each \mathcal{D}_i to more than one server.
  - Each server receives more than one \mathcal{D}_i.
  - Each server sends a linear combination of the partial gradients of its \mathcal{D}_i's.
  - The master decodes the sum of partial gradients from the linear combinations.
Problem setup
System setup: 1 master M, n workers W_1,\cdots,W_n.
A dataset \mathcal{D}=\{(x_i,y_i)\}_{i=1}^n.
Each server j:
- Receives d data points \mathcal{D}_j=\{(x_{j,i},y_{j,i})\}_{i=1}^d, where d is the replication factor.
- Computes a vector v_{j,i} from each (x_{j,i},y_{j,i}).
- Returns a linear combination u_j=\sum_{i=1}^d \alpha_{j,i} v_{j,i} to the master.
The master:
- Waits for the first n-s $u_j$'s to arrive (s is the straggler tolerance factor).
- Linearly combines these $u_j$'s, with coefficients $\lambda_i$, to get the full gradient.
Goal: Retrieve \sum_i v_i regardless of which n-s workers responded.
- I.e., computation of the full gradient that tolerates any s stragglers.
The $\alpha_{i,j}$'s are fixed (do not depend on data). These form a gradient coding matrix B\in\mathbb{C}^{n\times n}.
Row i has d nonzero entries \alpha_{i,j} at some positions.
The $\lambda_i$'s:
- Might depend on the identities of the n-s workers that responded.
- Must nevertheless exist for every such set of responses.
Recall:
- The master must be able to recover \sum_i v_i from any n-s responses.
Let \mathcal{K} be the indices of the responses.
Let B_\mathcal{K} be the submatrix of B consisting of the rows indexed by \mathcal{K}.
Must have:
- For every \mathcal{K} of size n-s, there exist coefficients \lambda_1,\cdots,\lambda_{n-s} such that
(\lambda_1,\cdots,\lambda_{n-s})B_\mathcal{K}=(1,1,\cdots,1)=\mathbb{I}
Then if \mathcal{K}=\{i_1,\cdots,i_{n-s}\} responded,
(\lambda_1,\cdots,\lambda_{n-s})\begin{pmatrix}
u_{i_1}\\
u_{i_2}\\
\vdots\\
u_{i_{n-s}}
\end{pmatrix}=\sum_i v_i,
since each u_{i_j} is the row-i_j combination \sum_{m}\alpha_{i_j,m}v_m, so every v_m appears in \sum_j \lambda_j u_{i_j} with coefficient exactly 1.
Definition of gradient coding matrix.
For replication factor d and straggler tolerance factor s: B\in\mathbb{C}^{n\times n} is a gradient coding matrix if:
- \mathbb{I} is in the span of any n-s rows.
- Every row of B contains at most d nonzero entries.
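For example (an illustrative matrix, not from the lecture), take n=3 workers, s=1 straggler, and d=2:
B=\begin{pmatrix} 1/2 & 1 & 0\\ 0 & 1 & -1\\ 1/2 & 0 & 1\end{pmatrix}
Every row has d=2 nonzero entries, and \mathbb{I}=(1,1,1) lies in the span of any n-s=2 rows: denoting the rows by r_1,r_2,r_3, we have r_1+r_3=\mathbb{I}, 2r_1-r_2=\mathbb{I}, and r_2+2r_3=\mathbb{I}.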
A gradient coding matrix implies a gradient coding algorithm:
- The master sends S_i to worker i, where S_i\subseteq\{1,\cdots,n\} is the set of nonzero positions of row i of B (i.e., the datasets worker i should process).
- Worker i:
  - Computes v_{i,\ell} from \mathcal{D}_{i,\ell} for \ell=1,\cdots,d.
  - Sends u_i=\sum_{j=1}^d \alpha_{i,j}v_{i,j} to the master.
- The master:
  - Lets \mathcal{K}=\{i_1,\cdots,i_{n-s}\} be the indices of the first n-s responses.
  - Since \mathbb{I} is in the span of any n-s rows of B, there exist \lambda_1,\cdots,\lambda_{n-s} such that \lambda_1 u_{i_1}+\cdots+\lambda_{n-s}u_{i_{n-s}}=\sum_i v_i.
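A sketch of the master's decoding step (the numpy-based setup and names are illustrative, not from the lecture):

```python
import numpy as np

def decode_full_gradient(B, responses):
    """Recover sum_i v_i from the first n-s responses.

    B         : (n, n) gradient coding matrix.
    responses : dict mapping worker index i -> its returned vector u_i
                (only the first n-s workers to respond appear here).
    """
    K = sorted(responses)                 # indices of the workers that responded
    B_K = B[K, :]                         # rows of B indexed by K
    ones = np.ones(B.shape[1])
    # Solve lambda @ B_K = (1, ..., 1), i.e. B_K^T @ lambda = 1.
    lam, *_ = np.linalg.lstsq(B_K.T, ones, rcond=None)
    # Combine: sum_j lambda_j * u_{i_j} = sum_i v_i.
    return sum(l * responses[i] for l, i in zip(lam, K))
```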
Construction of Gradient Coding Matrices
Goal:
- For a given straggler tolerance parameter s, we wish to construct a gradient coding matrix B with the smallest possible d.
Tools:
I. Cyclic Reed-Solomon codes over the complex numbers.
II. Definition of \mathcal{C}^\perp (dual of \mathcal {C}) and \mathcal{C}^R (reverse of \mathcal{C}).
III. A simple lemma about MDS codes.
Recall: An [n,k] Reed-Solomon code over a field \mathbb{F} is defined as follows.
- Fix distinct \alpha_1,\cdots,\alpha_n\in\mathbb{F}.
- \mathcal{C}=\{(f(\alpha_1),f(\alpha_2),\cdots,f(\alpha_n))\mid f\in\mathbb{F}[x],\ \deg f<k\}.
- Dimension k and minimum distance n-k+1 follow from \mathbb{F} being a field.
- Also works for \mathbb{F}=\mathbb{C}.
I. Cyclic Reed-Solomon codes over the complex numbers.
The following Reed-Solomon code over the complex numbers is called a cyclic code.
- Let i=\sqrt{-1}.
- For j\in \{0,\cdots,n-1\}, choose a_j=e^{2\pi i j/n}; the $a_j$'s are the roots of unity of order n.
- Use these $a_j$'s as evaluation points to define a Reed-Solomon code as usual.
This code is cyclic:
- Let c=(f_c(a_0),f_c(a_1),\cdots,f_c(a_{n-1})) for some f_c(x)\in \mathbb{C}[x] with \deg f_c<k.
- Need to show that the cyclic shift c'=(f_c(a_1),f_c(a_2),\cdots,f_c(a_{n-1}),f_c(a_0)) is also a codeword.
- Define g(x)=f_c(a_1 x). Then \deg g=\deg f_c<k and g(a_j)=f_c(a_1a_j)=f_c(a_{(j+1)\bmod n}) for all j\in\{0,\cdots,n-1\}, so
c'=(g(a_0),g(a_1),\cdots,g(a_{n-1}))\in\mathcal{C}.
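A quick numerical sanity check of this cyclicity (a sketch; n, k and the random polynomial are arbitrary choices):

```python
import numpy as np

n, k = 7, 3
a = np.exp(2j * np.pi * np.arange(n) / n)          # roots of unity of order n
V = np.vander(a, k, increasing=True)               # n x k evaluation (Vandermonde) matrix

f = np.random.randn(k) + 1j * np.random.randn(k)   # random polynomial of degree < k
c = V @ f                                          # codeword (f(a_0), ..., f(a_{n-1}))
c_shift = np.roll(c, -1)                           # cyclic shift (f(a_1), ..., f(a_0))

# The shift should again be a codeword, i.e. lie in the column span of V.
g, *_ = np.linalg.lstsq(V, c_shift, rcond=None)
print(np.allclose(V @ g, c_shift))                 # True: the shift is also in the code
```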
II. Dual and Reversed Codes
- Let \mathcal{C}=[n,k,d]_{\mathbb{F}} be an MDS code.
Definition for dual code of \mathcal{C}
The dual code of \mathcal{C} is
\mathcal{C}^\perp=\{c'\in \mathbb{F}^n|c'c^\top=0\text{ for all }c\in \mathcal{C}\}
Claim: \mathcal{C}^\perp is an [n,n-k,k+1]_{\mathbb{F}} code.
Definition for reversed code of \mathcal{C}
The reversed code of \mathcal{C} is
\mathcal{C}^R=\{(c_{n-1},\cdots,c_0)|(c_0,\cdots,c_{n-1})\in \mathcal{C}\}
We claim that if \mathcal{C} is cyclic, then \mathcal{C}^R is cyclic.
III. Lemma about MDS codes
Let \mathcal{C}=[n,k,n-k+1]_{\mathbb{F}} be an MDS code.
Lemma
For any subset \mathcal{K}\subset \{0,\cdots,n-1\} of size n-k+1, there exists c\in \mathcal{C} whose support (set of nonzero indices) is exactly \mathcal{K}.
Proof
Let G\in \mathbb{F}^{k\times n} be a generator matrix, and let G_{\mathcal{K}^c}\in \mathbb{F}^{k\times (k-1)} be its restriction to columns not indexed by \mathcal{K}.
G_{\mathcal{K}^c} has more rows than columns, so there exists a nonzero v\in \mathbb{F}^{k} such that vG_{\mathcal{K}^c}=0.
So c=vG is a nonzero codeword (since v\neq 0 and G has rank k) with zeros in the |\mathcal{K}^c|=k-1 entries indexed by \mathcal{K}^c.
The remaining n-(k-1)=n-k+1=d entries of c, indexed by \mathcal{K}, must all be nonzero; otherwise c would have Hamming weight less than d, contradicting the minimum distance.
Thus the support of c is \mathcal{K}.
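A numerical illustration of the lemma for a Reed-Solomon code over \mathbb{C} (a sketch; the parameters and the choice of \mathcal{K} are arbitrary):

```python
import numpy as np

n, k = 7, 3
a = np.exp(2j * np.pi * np.arange(n) / n)       # evaluation points (roots of unity)
G = np.vander(a, k, increasing=True).T          # k x n generator matrix of the RS code

K = [0, 2, 3, 5, 6]                             # any subset of size n - k + 1 = 5
Kc = [j for j in range(n) if j not in K]        # complement, size k - 1

# Find v != 0 with v @ G[:, Kc] = 0 (a left null-space vector, via SVD).
_, _, vh = np.linalg.svd(G[:, Kc].T)
v = vh[-1].conj()
c = v @ G                                       # the codeword from the proof

support = {j for j in range(n) if abs(c[j]) > 1e-9}
print(support == set(K))                        # True: support of c is exactly K
```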
Construction of the gradient coding matrix
Consider n workers and s stragglers.
Let d=s+1.
Let \mathcal{C}=[n,n-s]_{\mathbb{C}} be the cyclic RS code built in I.
Then by III, there exists c\in \mathcal{C} whose support is the first n-(n-s)+1=s+1 entries.
Denote c=(\beta_1,\cdots,\beta_{s+1},0,0,\cdots,0) for some nonzero \beta_1,\cdots,\beta_{s+1}.
Build:
B\in \mathbb{C}^{n\times n} whose columns are all cyclic shifts of c.
We claim that B is a gradient coding matrix.
B=\begin{pmatrix}
\beta_1 & 0 & \cdots & 0 & \beta_{s+1} & \cdots & \beta_2 \\
\beta_2 & \beta_1 & \ddots & & 0 & \ddots & \vdots \\
\vdots & \beta_2 & \ddots & \ddots & & \ddots & \beta_{s+1} \\
\beta_{s+1} & \vdots & \ddots & \beta_1 & 0 & & 0 \\
0 & \beta_{s+1} & & \beta_2 & \beta_1 & \ddots & \vdots \\
\vdots & 0 & \ddots & \vdots & \vdots & \ddots & 0 \\
0 & \cdots & 0 & \beta_{s+1} & \beta_s & \cdots & \beta_1
\end{pmatrix}
Proof
Every row is a codeword in \mathcal{C}^R.
- Specifically, each row is a cyclic shift of (0,\cdots,0,\beta_{s+1},\cdots,\beta_1).
- Hence every row contains at most d=s+1 nonzero entries.
\mathbb{I} is in the span of any n-s rows of B:
- Observe that \mathbb{I}\in \mathcal{C} (evaluate the polynomial f(x)=1 at a_0,\cdots,a_{n-1}).
- Then \mathbb{I}\in \mathcal{C}^R as well, since \mathbb{I} is invariant under reversal.
- Therefore, it suffices to show that any n-s rows span \mathcal{C}^R.
- Since \dim \mathcal{C}^R=\dim \mathcal{C}=n-s, it suffices to show that any n-s rows are linearly independent.
Observe: the leftmost n-s columns of B are linearly independent codewords of \mathcal{C}, and therefore span \mathcal{C}.
Assume for contradiction that there exist n-s linearly dependent rows.
Then there exists a nonzero v\in \mathbb{C}^{n}, supported on these n-s rows (so of Hamming weight \leq n-s), such that vB=0.
Since the columns of B span \mathcal{C}, v is orthogonal to every codeword of \mathcal{C}.
So v\in \mathcal{C}^\perp.
From II, \mathcal{C}^\perp is an [n,s] MDS code, and hence every nonzero v\in \mathcal{C}^\perp has Hamming weight \geq n-s+1.
This contradicts v having Hamming weight \leq n-s.
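A numerical sketch of the whole construction (illustrative parameters; it obtains c via the lemma of III, builds B from its cyclic shifts, and checks the gradient coding property over all possible response sets):

```python
import numpy as np
from itertools import combinations

n, s = 7, 2
k = n - s                                        # dimension of the cyclic RS code C = [n, n-s]
a = np.exp(2j * np.pi * np.arange(n) / n)        # roots of unity (evaluation points)
G = np.vander(a, k, increasing=True).T           # k x n generator matrix of C

# Lemma (III): a codeword c supported exactly on the first s+1 coordinates.
Kc = list(range(s + 1, n))                       # complement of {0, ..., s}, size k-1
_, _, vh = np.linalg.svd(G[:, Kc].T)
c = vh[-1].conj() @ G                            # codeword with support {0, ..., s}
c[np.abs(c) < 1e-10] = 0                         # clean up numerical noise

# Columns of B are all cyclic shifts of c.
B = np.stack([np.roll(c, j) for j in range(n)], axis=1)

# Check: the all-ones vector is in the span of any n-s rows.
ones = np.ones(n)
for K in combinations(range(n), n - s):
    B_K = B[list(K), :]
    lam, *_ = np.linalg.lstsq(B_K.T, ones, rcond=None)
    assert np.allclose(lam @ B_K, ones)
print("B is a gradient coding matrix with d =", s + 1)
```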
Bound for gradient coding
We want s to be large and d to be small.
How small can d be with respect to s?
- A: Build a bipartite graph.
  - Left side: n workers W_1,\cdots,W_n.
  - Right side: n partial datasets \mathcal{D}_1,\cdots,\mathcal{D}_n.
  - Connect W_i to \mathcal{D}_j if worker i holds \mathcal{D}_j, equivalently if B_{i,j}\neq 0.
  - \deg (W_i)\leq d, since row i of B has at most d nonzero entries.
  - \deg (\mathcal{D}_j)\geq s+1, since otherwise the (at most s) workers holding \mathcal{D}_j could all straggle and its partial gradient would be unrecoverable.
  - Summing degrees: the left side gives at most nd, the right side gives at least n(s+1), and the two sums are equal.
  - So d\geq s+1.
We can break the lower bound using approximate computation.