CSE5313 Coding and information theory for data science (Lecture 21)

Gradient coding

Intro to Statistical Machine Learning

Given a learning problem:

Unknown target function f:X\to Y.

Training data \mathcal{D}=\{(x_i,y_i)\}_{i=1}^n.

Learning algorithm \mathbb{A} and hypothesis set \mathcal{H}.

Goal: find g\approx f.

Common hypothesis sets:

  • Linear classifiers:

f(x)=\mathrm{sign}(wx^\top)
  • Linear regressors:

f(x)=wx^\top
  • Neural networks:
    • Concatenated linear classifiers (or differentiable approximation thereof).

The dataset \mathcal{D} is drawn from an unknown distribution D.

Common approach: Empirical Risk Minimization

  • Q: How to quantify f\approx g ?
  • A: Use a loss function.
    • A function which measures the deviation between $g$'s output and $f$'s output.
  • Using the chosen loss function \ell, define two measures:
    • True risk: \mathbb{E}_{(x,y)\sim D}[\ell(g(x),y)]
    • Empirical risk: ER(g)=\frac{1}{n}\sum_{i=1}^n \ell(g(x_i),y_i)
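
As a concrete illustration, here is a minimal sketch (assuming numpy, a linear regressor, and squared loss, all illustrative choices) of computing the empirical risk over a toy dataset:

```python
import numpy as np

def empirical_risk(w, X, y):
    """Empirical risk of the linear regressor g(x) = x @ w under squared loss."""
    predictions = X @ w              # g(x_i) for every data point
    losses = (predictions - y) ** 2  # ell(g(x_i), y_i)
    return losses.mean()             # (1/n) * sum_i ell(g(x_i), y_i)

# Toy dataset: n = 4 points in a 2-dimensional feature space.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
print(empirical_risk(np.zeros(2), X, y))  # risk of the all-zeros hypothesis
```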

Machine learning is over \mathbb{R} (so we will need codes over \mathbb{R} or \mathbb{C} rather than over finite fields).

Gradient Descent (Motivation)

Parameterizing g by some real vector \vec{w}, we want to minimize the empirical risk ER(\vec{w}).

Algorithm:

  • Initialize \vec{w}_0.
  • For t=1,2,\cdots,T:
    • Compute \nabla_{\vec{w}} ER(\vec{w}).
    • \vec{w}\gets \vec{w}-\eta\nabla_{\vec{w}} ER(\vec{w})
    • Terminate if some stopping condition is met.

Bottleneck: Calculating \nabla_{\vec{w}} ER(\vec{w})=\frac{1}{n}\sum_{i=1}^n \nabla_{\vec{w}} \ell(g(x_i),y_i).

Potentially O(PN), where N is the number of data points and P is the dimension of the feature space.
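
To make the cost concrete, here is a minimal full-batch gradient descent sketch (assuming numpy, squared loss, and a linear regressor, as in the earlier snippet); note that every iteration touches all n data points:

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, T=100, tol=1e-8):
    """Gradient descent on the empirical risk of a linear regressor under
    squared loss. Each step computes the full gradient over all n points."""
    n, p = X.shape
    w = np.zeros(p)                          # initialize w_0
    for _ in range(T):
        residual = X @ w - y                 # g(x_i) - y_i for all i
        grad = (2.0 / n) * (X.T @ residual)  # average of the per-point gradients
        w -= eta * grad                      # gradient step
        if np.linalg.norm(grad) < tol:       # stopping condition
            break
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
print(gradient_descent(X, y))
```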

Solution: Parallelize.

Distributed Gradient Descent

Idea: use a distributed system with master and workers.

Problem: Stragglers (slow servers, e.g. 5-6 times slower than the average response time).

Potential Solutions:

  • Wait for all servers:
    • Accurate, but slow
  • Sum results without the slowest ones
    • Less accurate, but faster
  • Introduce redundancy
    • Send each \mathcal{D}_i to more than one server.
    • Each server receives more than one \mathcal{D}_i.
    • Each server sends a linear combination of the partial gradients of its \mathcal{D}_i's.
    • The master decodes the sum of partial gradients from the linear combination.

Problem setups

System setup: 1 master M, n workers W_1,\cdots,W_n.

A dataset \mathcal{D}=\{(x_i,y_i)\}_{i=1}^n, partitioned into n partial datasets \mathcal{D}_1,\cdots,\mathcal{D}_n (here, one data point each).

Each worker i:

  • Receives d of the partial datasets (d is the replication factor).
  • Computes a partial gradient vector v_j from each partial dataset \mathcal{D}_j it holds.
  • Returns a linear combination u_i=\sum_{j} \alpha_{i,j} v_j to the master, where the sum runs over the d partial datasets held by worker i.

The master:

  • Waits for the first n-s $u_i$'s to arrive (s is the straggler tolerance factor).
  • Linearly combines the $u_i$'s to obtain the full gradient.

Goal: Retrieve \sum_i v_i regardless of which n-s workers responded.

  • Computation of the full gradient that tolerates any s stragglers.

The $\alpha_{i,j}$'s are fixed (do not depend on data). These form a gradient coding matrix B\in\mathbb{C}^{n\times n}.

Row i has d nonzero entries \alpha_{i,j}, located at the positions of the partial datasets held by worker i.

The $\lambda_i$'s:

  • Might depend on the identity of the n-s responding workers.
  • Nevertheless, must exist in every case.

Recall:

  • The master must be able to recover \sum_i v_i from any n-s responses.

Let \mathcal{K} be the indices of the responses.

Let B_\mathcal{K} be the submatrix of B indexed by \mathcal{K}.

Must have:

  • For every \mathcal{K} of size n-s there exist coefficients \lambda_1,\cdots,\lambda_{n-s} such that:

(\lambda_1,\cdots,\lambda_{n-s})B_\mathcal{K}=(1,1,\cdots,1)=\mathbb{I}

Then if \mathcal{K}=\{i_1,\cdots,i_{n-s}\} responded,


(\lambda_1,\cdots,\lambda_{n-s})\begin{pmatrix}
u_{i_1}\\
u_{i_2}\\
\vdots\\
u_{i_{n-s}}
\end{pmatrix}=\sum_{j=1}^{n-s}\lambda_j u_{i_j}=\sum_i v_i

Definition of gradient coding matrix.

For replication factor d and straggler tolerance factor s: B\in\mathbb{C}^{n\times n} is a gradient coding matrix if:

  • \mathbb{I}=(1,\cdots,1) is in the span of any n-s rows of B.
  • Every row of B contains at most d nonzero elements.

A gradient coding matrix implies a gradient coding algorithm (a code sketch follows the list below):

  • The master sends S_i to worker i, where S_i is the set of partial datasets indexed by the nonzero entries of row i of B, for i=1,\cdots,n.
  • Worker i:
    • Computes \mathcal{D}_{i,\ell}\to v_{i,\ell} for \ell=1,\cdots,d.
    • Sends u_i=\sum_{j=1}^d \alpha_{i,j}v_{i,j} to the master.
  • Let \mathcal{K}=\{i_1,\cdots,i_{n-s}\} be the indices of the first n-s responses.
  • Since \mathbb{I} is in the span of any n-s rows of B, there exist \lambda_1,\cdots,\lambda_{n-s} such that (\lambda_1,\cdots,\lambda_{n-s})(u_{i_1},\cdots,u_{i_{n-s}})^\top=\sum_i v_i.
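
A minimal simulation of this scheme (assuming numpy; the function names, the array V of stacked partial gradients, and the use of least squares to find the $\lambda_i$'s are illustrative choices, not part of the lecture):

```python
import numpy as np

def encode(B, i, V):
    """Worker i's reply u_i = sum_j alpha_{i,j} v_j. V is an (n, p) array whose
    rows are the partial gradients v_j; only the d rows where B[i] is nonzero
    actually contribute."""
    return B[i] @ V

def decode(B, responded, U):
    """Master-side decoding from the first n - s replies. `responded` lists the
    indices of the responding workers and U stacks their replies u_i. Solves
    lambda @ B_K = (1, ..., 1) and returns lambda @ U = sum_j v_j."""
    B_K = B[responded]                                  # submatrix of responding rows
    ones = np.ones(B.shape[1], dtype=B.dtype)
    lam, *_ = np.linalg.lstsq(B_K.T, ones, rcond=None)  # solve B_K^T lam = 1
    return lam @ U
```

For example, if `B` is a valid gradient coding matrix with straggler tolerance s, then `decode(B, responded, np.array([encode(B, i, V) for i in responded]))` recovers `V.sum(axis=0)` for any set `responded` of n-s workers (up to numerical precision, and up to a negligible imaginary part when B is complex).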

Construction of Gradient Coding Matrices

Goal:

  • For a given straggler tolerance parameter s, we wish to construct a gradient coding matrix B with the smallest possible d.

  • Tools:

I. Cyclic Reed-Solomon codes over the complex numbers.
II. Definition of \mathcal{C}^\perp (dual of \mathcal{C}) and \mathcal{C}^R (reverse of \mathcal{C}).
III. A simple lemma about MDS codes.

Recall: An [n,k] Reed-Solomon code over a field \mathbb{F} is defined as follows.

  • Fix distinct \alpha_0,\cdots,\alpha_{n-1}\in\mathbb{F}.
  • \mathcal{C}=\{(f(\alpha_0),f(\alpha_1),\cdots,f(\alpha_{n-1}))\mid f\in\mathbb{F}[x],\ \deg f<k\}.
  • Dimension k and minimum distance n-k+1 follow from \mathbb{F} being a field.
  • Also works for \mathbb{F}=\mathbb{C}.

I. Cyclic Reed-Solomon codes over the complex numbers.

The following Reed-Solomon code over the complex numbers is called a cyclic code.

  • Let i=\sqrt{-1}.
  • For j\in \{0,\cdots,n-1\}, choose \alpha_j=e^{2\pi i j/n}. The $\alpha_j$'s are the roots of unity of order n.
  • Use these $\alpha_j$'s to define a Reed-Solomon code as usual.

This code is cyclic:

  • Let c=(f_c(\alpha_0),f_c(\alpha_1),\cdots,f_c(\alpha_{n-1})) for some f_c(x)\in \mathbb{C}[x] with \deg f_c<k.
  • Need to show that the cyclic shift c'=(f_c(\alpha_1),\cdots,f_c(\alpha_{n-1}),f_c(\alpha_0)) is also a codeword.

Define g(x)=f_c(\alpha_1 x); then \deg g=\deg f_c<k, and since \alpha_1\alpha_j=\alpha_{j+1} (indices mod n),

c'=(f_c(\alpha_1),f_c(\alpha_2),\cdots,f_c(\alpha_0))=(g(\alpha_0),g(\alpha_1),\cdots,g(\alpha_{n-1}))\in\mathcal{C}
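
A quick numerical check of this cyclicity (assuming numpy; the parameters n=7, k=4 are arbitrary): evaluate a random polynomial of degree < k at the n-th roots of unity, shift the resulting codeword, and verify that the shift is again the evaluation of a degree < k polynomial, namely g(x)=f_c(\alpha_1 x):

```python
import numpy as np

n, k = 7, 4
alpha = np.exp(2j * np.pi * np.arange(n) / n)   # the n-th roots of unity

rng = np.random.default_rng(0)
f = rng.standard_normal(k)                      # coefficients of f_c, deg f_c < k
c = np.polyval(f[::-1], alpha)                  # codeword (f_c(alpha_0), ..., f_c(alpha_{n-1}))

c_shift = np.roll(c, -1)                        # cyclic shift (f_c(alpha_1), ..., f_c(alpha_0))
g = f * alpha[1] ** np.arange(k)                # g(x) = f_c(alpha_1 * x), still of degree < k
print(np.allclose(c_shift, np.polyval(g[::-1], alpha)))  # True
```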

II. Dual and Reversed Codes

  • Let \mathcal{C}=[n,k,n-k+1]_{\mathbb{F}} be an MDS code.

Definition for dual code of \mathcal{C}

The dual code of \mathcal{C} is


\mathcal{C}^\perp=\{c'\in \mathbb{F}^n|c'c^\top=0\text{ for all }c\in \mathcal{C}\}

Claim: \mathcal{C}^\perp is an [n,n-k,k+1]_{\mathbb{F}} code.

Definition for reversed code of \mathcal{C}

The reversed code of \mathcal{C} is


\mathcal{C}^R=\{(c_{n-1},\cdots,c_0)|(c_0,\cdots,c_{n-1})\in \mathcal{C}\}

We claim that if \mathcal{C} is cyclic, then \mathcal{C}^R is cyclic.

III. Lemma about MDS codes

Let \mathcal{C}=[n,k,n-k+1]_{\mathbb{F}} be an MDS code.

Lemma

For any subset \mathcal{K}\subseteq \{0,\cdots,n-1\} of size n-k+1, there exists c\in \mathcal{C} whose support (set of nonzero indices) is exactly \mathcal{K}.

Proof

Let G\in \mathbb{F}^{k\times n} be a generator matrix, and let G_{\mathcal{K}^c}\in \mathbb{F}^{k\times (k-1)} be its restriction to columns not indexed by \mathcal{K}.

G_{\mathcal{K}^c} has more rows than columns, so there exists a nonzero v\in \mathbb{F}^{k} such that vG_{\mathcal{K}^c}=0.

So c=vG is zero in the |\mathcal{K}^c|=k-1 entries indexed by \mathcal{K}^c.

Since v\neq 0 and G has full rank, c\neq 0; as the minimum distance of \mathcal{C} is n-k+1, the remaining n-(k-1)=n-k+1 entries of c, indexed by \mathcal{K}, must all be nonzero.

Thus the support of c is exactly \mathcal{K}.
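
The proof is constructive, and a small numerical sketch of it (assuming numpy and the cyclic RS code from I; the choice of \mathcal{K} is arbitrary) finds such a codeword by taking a null vector of G_{\mathcal{K}^c}:

```python
import numpy as np

n, k = 7, 4
alpha = np.exp(2j * np.pi * np.arange(n) / n)
G = np.vander(alpha, k, increasing=True).T      # k x n generator matrix, G[j, i] = alpha_i^j

K = [0, 2, 3, 6]                                # any subset of size n - k + 1
K_c = [i for i in range(n) if i not in K]       # complement, size k - 1

# A nonzero v with v @ G[:, K_c] = 0, taken from the null space of G_{K^c}^T.
_, _, Vh = np.linalg.svd(G[:, K_c].T)
v = Vh[-1].conj()

c = v @ G                                       # codeword supported exactly on K
print(np.nonzero(~np.isclose(c, 0))[0])         # [0 2 3 6], i.e. exactly K
```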

Construct gradient coding matrix

Consider n workers and at most s stragglers.

Let d=s+1.

Let \mathcal{C} be the [n,n-s]_{\mathbb{C}} cyclic RS code built in I.

Then by III, there exists c\in \mathcal{C} whose support is the first n-(n-s)+1=s+1 entries.

Denote c=(\beta_1,\cdots,\beta_{s+1},0,0,\cdots,0) for some nonzero \beta_1,\cdots,\beta_{s+1}.

Build:

B\in \mathbb{C}^{n\times n} whose columns are all cyclic shifts of c.

We claim that B is a gradient coding matrix.


\begin{pmatrix}
\beta_1     & 0           & \cdots & 0           & \beta_{s+1} & \beta_s     & \cdots & \beta_2 \\
\beta_2     & \beta_1     & \ddots &             & 0           & \beta_{s+1} & \ddots & \vdots  \\
\vdots      & \beta_2     & \ddots & 0           & \vdots      & 0           & \ddots & \beta_{s+1} \\
\beta_{s+1} & \vdots      & \ddots & \beta_1     &             & \vdots      & \ddots & 0 \\
0           & \beta_{s+1} & \ddots & \beta_2     & \ddots      &             &        & \vdots \\
\vdots      & 0           & \ddots & \vdots      &             & \ddots      &        & 0 \\
0           & 0           & \cdots & \beta_{s+1} & \beta_s     & \beta_{s-1} & \cdots & \beta_1
\end{pmatrix}

Proof

Every row is a codeword in \mathcal{C}^R.

  • Specifically, a cyclic shift of (0,\cdots,0,\beta_{s+1},\cdots,\beta_1), the reverse of c.
  • Hence every row contains at most d=s+1 nonzero entries.

\mathbb{I} is in the span of any n-s rows of B.

  • Observe that \mathbb{I}\in \mathcal{C} (evaluate the polynomial f(x)=1 at \alpha_0,\cdots,\alpha_{n-1}).
  • Then \mathbb{I}\in \mathcal{C}^R as well.
  • Therefore, it suffices to show that any n-s rows of B span \mathcal{C}^R.
  • Since \dim \mathcal{C}=\dim \mathcal{C}^R=n-s, it suffices to show that any n-s rows are independent.

Observe: the leftmost n-s columns of B are linearly independent (their first nonzero entries lie in distinct rows), and therefore span \mathcal{C}.

Assume for contradiction that some n-s rows are linearly dependent.

Then there exists a nonzero v\in \mathbb{C}^{n}, supported on these n-s rows, such that vB=0.

Since the columns of B span \mathcal{C}, v is orthogonal to all of \mathcal{C}.

So v\in \mathcal{C}^\perp.

From II, \mathcal{C}^\perp is an [n,s] MDS code, and hence every nonzero codeword of \mathcal{C}^\perp has Hamming weight \geq n-s+1.

This contradicts the fact that v is supported on at most n-s entries.
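
Putting I-III together, here is a sketch (assuming numpy; the function name and the verification via least squares are illustrative) that builds B from cyclic shifts of c and checks both defining properties of a gradient coding matrix:

```python
import numpy as np
from itertools import combinations

def gradient_coding_matrix(n, s):
    """Build an n x n gradient coding matrix with d = s + 1: take the [n, n-s]
    cyclic RS code over C, find a codeword c supported on the first s + 1
    entries, and use all cyclic shifts of c as the columns of B."""
    k = n - s
    alpha = np.exp(2j * np.pi * np.arange(n) / n)
    G = np.vander(alpha, k, increasing=True).T        # k x n generator matrix
    _, _, Vh = np.linalg.svd(G[:, s + 1:].T)          # null vector of G restricted
    c = Vh[-1].conj() @ G                             #   to columns s+1, ..., n-1
    return np.column_stack([np.roll(c, j) for j in range(n)])

n, s = 6, 2
B = gradient_coding_matrix(n, s)
ones = np.ones(n, dtype=complex)
in_span = all(
    np.allclose(np.linalg.lstsq(B[list(K)].T, ones, rcond=None)[0] @ B[list(K)], ones)
    for K in combinations(range(n), n - s)            # every possible responding set
)
print(in_span)                                        # True
print(np.count_nonzero(~np.isclose(B, 0), axis=1))    # [3 3 3 3 3 3], i.e. d = s + 1 per row
```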

Bound for gradient coding

We want s to be large and d to be small.

How small can d be with respect to s?

  • A: Build a bipartite graph.
    • Left side: n workers W_1,\cdots,W_n.
    • Right side: n partial datasets \mathcal{D}_1,\cdots,\mathcal{D}_n.
    • Connect W_i to \mathcal{D}_j if worker i holds \mathcal{D}_j.
      • Equivalently, if B_{i,j}\neq 0.
    • \deg (W_i) = d by definition.
    • \deg (\mathcal{D}_j)\geq s+1, since otherwise the at most s workers holding \mathcal{D}_j could all straggle and its partial gradient would be lost.
    • The sum of degrees on the left is nd, and on the right it is \geq n(s+1).
    • So d\geq s+1.
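
For instance, with n=6 workers and s=2 stragglers, the left side contributes nd=6d edges while the right side requires at least n(s+1)=6\cdot 3=18, so d\geq 3=s+1, which is exactly the replication factor achieved by the construction above.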

We can go below this lower bound by allowing approximate gradient computation.