CSE5313 Coding and information theory for data science (Lecture 22)

Approximate Gradient Coding

Exact gradient computation and approximate gradient computation

In the previous formulation, the gradient \sum_i v_i is computed exactly.

  • Accurate
  • Requires d \geq s + 1 (high replication factor).
  • Need to know s in advance!

However:

  • Approximate gradient computations are very common!
    • E.g., stochastic gradient descent.
  • Machine learning is inherently inaccurate.
    • Relies on biased data, unverified assumptions about model, etc.

Idea: If we relax the exact computation requirement, can we have d < s + 1?

  • No fixed s anymore.

Approximate computation:

  • Exact computation: \nabla \triangleq v = \sum_i v_i = (1, \cdots, 1) (v_1, \cdots, v_n)^\top.
  • Approximate computation: \nabla \triangleq v = \sum_i v_i \approx u (v_1, \cdots, v_n)^\top,
    • Where d_2(u, \mathbb{I}) is "small" (d_2(u, v) = \sqrt{\sum_i (u_i - v_i)^2}).
    • Why?
  • Lemma: Let v_u = u (v_1, \cdots, v_n)^\top. If d_2(u, \mathbb{I}) \leq \epsilon then d_2(v, v_u) \leq \epsilon \cdot \ell_{spec}(V).
    • V is the matrix whose rows are the $v_i$'s.
    • \ell_{spec} is the spectral norm (positive sqrt of maximum eigenvalue of V^\top V).
  • Idea: Distribute S_1, \cdots, S_n as before, and
    • as the master gets more and more responses,
    • it can reconstruct u (v_1, \cdots, v_n)^\top,
    • such that d_2(u, \mathbb{I}) gets smaller and smaller (a small numerical check of this idea follows below).
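As a sanity check of the lemma above, here is a minimal numpy sketch (the dimensions and random data are illustrative, not from the lecture): it builds a matrix V whose rows are the partial gradients, perturbs the all-ones vector into a decoding vector u, and verifies that d_2(v, v_u) \leq d_2(u, \mathbb{I}) \cdot \ell_{spec}(V).

```python
import numpy as np

rng = np.random.default_rng(0)

n, dim = 8, 5                        # n partial gradients, each of dimension dim (illustrative)
V = rng.normal(size=(n, dim))        # rows of V are the partial gradients v_1, ..., v_n

ones = np.ones(n)
v_exact = ones @ V                   # exact gradient: sum_i v_i

u = ones + 0.1 * rng.normal(size=n)  # a decoding vector that is close to the all-ones vector
v_approx = u @ V

eps = np.linalg.norm(u - ones)            # d_2(u, 1)
spec = np.linalg.norm(V, ord=2)           # spectral norm of V (largest singular value)
err = np.linalg.norm(v_exact - v_approx)  # d_2(v, v_u)

print(err, "<=", eps * spec)              # the lemma's bound
assert err <= eps * spec + 1e-9
```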

Note

The requirement d \geq s + 1 no longer applies; s is no longer a parameter of the system, but rather s = n - \#\text{responses} at any given time.

Trivial Scheme

Off the bat, the "do nothing" approach:

  • Send S_i to worker i, i.e., d = 1.
  • Worker i replies with the $i$'th partial gradient v_i.
  • The master averages up all the responses.

How good is that?

  • For u = \frac{n}{n-s} \cdot \mathbb{I}, the factor \frac{n}{n-s} corrects the \frac{1}{n} in v_i = \frac{1}{n} \cdot \nabla \text{ on } S_i.
  • Is this \approx \sum_i v_i? In other words, what is d_2(\frac{n}{n-s} \cdot \mathbb{I}, \mathbb{I})?

Trivial scheme: a \sqrt{\frac{ns}{n-s}} approximation of \mathbb{I} (the decoding vector equals \frac{n}{n-s} on the n - s responses and 0 on the stragglers).
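A quick numpy check (with illustrative n and s, and assuming the stragglers simply contribute coefficient 0) that this trivial decoding vector sits at distance \sqrt{\frac{ns}{n-s}} from \mathbb{I}:

```python
import numpy as np

n, s = 20, 5                      # n workers, s stragglers (illustrative values)
ones = np.ones(n)

u = np.full(n, n / (n - s))       # n/(n-s) on the responding workers...
u[:s] = 0.0                       # ...and 0 on the s stragglers (say, the first s workers)

print(np.linalg.norm(u - ones))   # d_2(u, 1)
print(np.sqrt(n * s / (n - s)))   # equals sqrt(ns/(n-s)) = sqrt(100/15) here
```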

Must do better than that!

Roadmap

  • Quick reminder from linear algebra.
    • Eigenvectors and orthogonality.
  • Quick reminder from graph theory.
    • Adjacency matrix of a graph.
  • Graph theoretic concept: expander graphs.
    • "Well connected" graphs.
    • Extensively studied.
  • An approximate gradient coding scheme from expander graphs.

Linear algebra - Reminder

  • Let A \in \mathbb{R}^{n \times n}.
  • If A v = \lambda v then \lambda is an eigenvalue and v is an eigenvector.
  • v_1, \cdots, v_n \in \mathbb{R}^n are orthonormal:
    • \|v_i\|_2 = 1 for all i.
    • v_i \cdot v_j^\top = 0 for all i \neq j.
  • Nice property: \| \alpha_1 v_1 + \cdots + \alpha_n v_n \|_2 = \sqrt{\sum_i \alpha_i^2}.
  • A is called symmetric if A = A^\top.
  • Theorem: A real and symmetric matrix has an orthonormal basis of eigenvectors.
    • That is, there exists an orthonormal basis v_1, \cdots, v_n such that A v_i = \lambda_i v_i for some $\lambda_i$'s.
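The following numpy sketch illustrates this reminder on a random symmetric matrix (an illustrative example, not part of the lecture): np.linalg.eigh returns an orthonormal eigenbasis, and the norm identity \| \sum_i \alpha_i w_i \|_2 = \sqrt{\sum_i \alpha_i^2} holds in that basis.

```python
import numpy as np

rng = np.random.default_rng(1)

M = rng.normal(size=(6, 6))
A = (M + M.T) / 2                 # a random real symmetric matrix (illustrative)

# eigh returns real eigenvalues and an orthonormal basis of eigenvectors (columns of W).
lam, W = np.linalg.eigh(A)

assert np.allclose(W.T @ W, np.eye(6))       # orthonormality
assert np.allclose(A @ W, W @ np.diag(lam))  # A w_i = lambda_i w_i

# Norm identity in an orthonormal basis: ||sum_i alpha_i w_i||_2 = sqrt(sum_i alpha_i^2).
alpha = rng.normal(size=6)
x = W @ alpha
assert np.isclose(np.linalg.norm(x), np.linalg.norm(alpha))
print("all checks passed")
```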

Graph theory - Reminder

  • Undirected graph G = (V, E).
  • V is a vertex set, usually V = \{1, 2, \cdots, n\}.
  • E \subseteq \binom{V}{2} is an edge set (i.e., E is a collection of subsets of V of size two).
  • Each edge e \in E is of the form e = \{a, b\} for some distinct a, b \in V.
  • Spectral graph theory:
    • Analyze properties of graphs (combinatorial object) using matrices (algebraic object).
    • Specifically, for a graph G let A_G \in \{0,1\}^{n \times n} be the adjacency matrix of G.
    • A_{i,j} = 1 if and only if \{i,j\} \in E (otherwise 0).
    • A is real and symmetric.
    • Therefore, has an orthonormal basis of eigenvectors.

Some nice properties of adjacency matrices

  • Let G = (V, E) be $d$-regular, with adjacency matrix A_G whose (real) eigenvalues \lambda_1 \geq \cdots \geq \lambda_n.
  • Some theorems:
    • \lambda_1 = d.
    • \lambda_n \geq -d, equality if and only if G is bipartite.
    • A_G \mathbb{I}^\top = \lambda_1 \mathbb{I}^\top = d \mathbb{I}^\top (easy to show!).
      • Does that ring a bell? ;)
    • If \lambda_1 = \lambda_2 then G is not connected.
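A small numerical illustration of these theorems, using the 8-cycle (a 2-regular, bipartite graph) as an illustrative toy example not taken from the lecture:

```python
import numpy as np

def cycle_adjacency(n):
    """Adjacency matrix of the n-cycle, a 2-regular graph."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
    return A

A = cycle_adjacency(8)                      # the 8-cycle: 2-regular and bipartite
lam = np.sort(np.linalg.eigvalsh(A))[::-1]  # eigenvalues in decreasing order

print(lam[0])             # lambda_1 = d = 2
print(lam[-1])            # lambda_n = -2: equality with -d, since the even cycle is bipartite
print(A @ np.ones(8))     # A * 1 = d * 1, i.e. the all-ones vector is an eigenvector
```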

Expander graphs - Intuition.

  • An important family of graphs.
  • Multiple applications in:
    • Algorithms, complexity theory, error correcting codes, etc.
  • Intuition: A graph is called an expander if there are no "lonely small sets" of nodes.
  • Every set of at most n/2 nodes is "well connected" to the remaining nodes in the graph.
  • A bit more formally:
    • An infinite family of graphs (G_n)_{n=1}^\infty (where G_n has n nodes) is called an expander family, if the "minimal connectedness" of small sets in G_n does not go to zero with n.

Expander graphs - Definitions.

  • All graphs in this lecture are $d$-regular, i.e., all nodes have the same degree d.
  • For sets of nodes S, T \subseteq V, let E(S, T) be the set of edges between S and T. I.e., E(S, T) = \{\{i,j\} \in E \mid i \in S \text{ and } j \in T\}.
  • For a set of nodes S let:
    • S^c = V \setminus S be its complement.
    • Let \partial S = E(S, S^c) be the boundary of S.
    • I.e., the set of edges between S and its complement S^c.
  • The expansion parameter h_G of G is:
    • I.e., how many edges leave S, relative to its size.
    • How "well connected" S is to the remaining nodes.

Note

h_G = \min_{S \subseteq V, |S| \leq n/2} \frac{|\partial S|}{|S|}.
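For very small graphs, h_G can be computed by brute force directly from this definition. The sketch below (using the complete graph K_6 as an illustrative example of my own choosing) enumerates every set S with |S| \leq n/2:

```python
import itertools
import numpy as np

def expansion(A):
    """Brute-force h_G = min over |S| <= n/2 of |boundary(S)| / |S| (tiny graphs only)."""
    n = A.shape[0]
    best = float("inf")
    for size in range(1, n // 2 + 1):
        for S in itertools.combinations(range(n), size):
            S = list(S)
            Sc = [v for v in range(n) if v not in S]
            boundary = A[np.ix_(S, Sc)].sum()   # number of edges between S and S^c
            best = min(best, boundary / len(S))
    return best

n = 6
A = np.ones((n, n)) - np.eye(n)   # complete graph K_6 (5-regular), an illustrative example
print(expansion(A))               # 3.0: the worst sets have |S| = 3, with 9 boundary edges
```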

  • An infinite family of $d$-regular graphs (G_n)_{n=1}^\infty (where G_n has n nodes) is called an expander family if h(G_n) \geq \epsilon for all n.
    • Same d and same \epsilon for all n.
  • Expander families with large \epsilon are hard to build explicitly.
  • Example: (Lubotsky, Philips and Sarnak '88)
    • V = \mathbb{Z}_p (prime).
    • Connect x to x + 1, x - 1, x^{-1}.
    • d = 3, very small \epsilon.
  • However, random graphs are expanders with high probability.
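Here is a rough numpy sketch of the \mathbb{Z}_p construction above. The boundary cases (x = 0 and the points where x^{-1} coincides with x or with a neighbor) are handled loosely, so this graph is only approximately 3-regular; it is a sketch, not the construction from the paper. It prints the two largest eigenvalues and the resulting spectral gap:

```python
import numpy as np

def zp_graph(p):
    """Graph on Z_p with x ~ x+1, x-1, and x^{-1}; boundary cases handled loosely."""
    A = np.zeros((p, p))
    for x in range(p):
        neighbors = [(x + 1) % p, (x - 1) % p]
        if x != 0:
            neighbors.append(pow(x, -1, p))     # modular inverse (Python 3.8+)
        for y in neighbors:
            if y != x:
                A[x, y] = A[y, x] = 1
    return A

p = 101                                          # a prime, chosen for illustration
A = zp_graph(p)
lam = np.sort(np.linalg.eigvalsh(A))[::-1]
print(lam[0], lam[1], lam[0] - lam[1])           # roughly d = 3, lambda_2, and the spectral gap
```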

Expander graphs - Eigenvalues

  • There is a strong connection between the expansion parameter of a graph and the eigenvalues \lambda_1 \geq \cdots \geq \lambda_n of its adjacency matrix.
  • Some theorems (no proof):
    • \frac{d-\lambda_2}{2} \leq h_G \leq \sqrt{2d (d - \lambda_2)}.
    • d - \lambda_2 is called the spectral gap of G.
      • If the spectral gap is large, G is a good expander.
      • How large can it be?
    • Let \lambda = \max \{|\lambda_2|, |\lambda_n|\}. Then \lambda \geq 2 \sqrt{d - 1} - o_n(1). (Alon-Boppana Theorem).
    • Graphs which achieve the Alon-Boppana bound (i.e., \lambda \leq 2 \sqrt{d - 1}) are called Ramanujan graphs.
      • The "best" expanders.
      • Some constructions are known.
      • Efficient construction of Ramanujan graphs for all parameters is very recent (2016).
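As a concrete check of the Ramanujan condition, the sketch below uses the Petersen graph, a well-known 3-regular graph chosen here only as an example (it is not mentioned in the lecture). Its eigenvalues are 3, 1, -2, so \lambda = 2 \leq 2\sqrt{d-1} = 2\sqrt{2}:

```python
import numpy as np

# Petersen graph: 3-regular on 10 vertices (outer 5-cycle, inner pentagram, 5 spokes).
edges = [(i, (i + 1) % 5) for i in range(5)]            # outer cycle
edges += [(5 + i, 5 + (i + 2) % 5) for i in range(5)]   # inner pentagram
edges += [(i, i + 5) for i in range(5)]                 # spokes

A = np.zeros((10, 10))
for a, b in edges:
    A[a, b] = A[b, a] = 1

d = 3
lam = np.sort(np.linalg.eigvalsh(A))[::-1]
lam_max = max(abs(lam[1]), abs(lam[-1]))                # lambda = max(|lambda_2|, |lambda_n|)
print(lam_max, "<=", 2 * np.sqrt(d - 1))                # 2.0 <= 2.828..., so Petersen is Ramanujan
```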

Approximate GC from Expander Graphs

Back to approximate gradient coding.

  • Let d be any replication parameter.
  • Let G be an expander graph (i.e., taken from an infinite expander family (G_n)_{n=1}^\infty)
    • With eigenvalues \lambda_1 \geq \cdots \geq \lambda_n, and respective eigenvectors w_1, \cdots, w_n
    • Assume \|w_1\|_2 =\| w_2\|_2 = \cdots = \|w_n\|_2 = 1, and w_i w_j^\top = 0 for all i \neq j.
  • Let the gradient coding matrix B=\frac{1}{d} A_G.
    • The eigenvalues of B are \mu_1 = 1 \geq \mu_2 \geq \cdots \geq \mu_n, where \mu_i = \frac{\lambda_i}{d}.
    • Let \mu = \max \{|\mu_2|, |\mu_n|\}.
    • d nonzero entries in each row \Rightarrow Replication factor d.
  • Claim: For any number of stragglers s, we can get close to \mathbb{I}.
    • Much better than the trivial scheme.
    • Proximity is a function of d and \lambda.
  • For every s and any set \mathcal{K} of n - s responses, we build a "decoding vector".
    • A function of s and of the identities of the responding workers.
    • Will be used to linearly combine the n - s responses to get the approximate gradient.
  • Let w_{\mathcal{K}} \in \mathbb{R}^n such that (w_{\mathcal{K}})_i = \begin{cases} -1 & \text{if } i \notin \mathcal{K} \\ \frac{s}{n-s} & \text{if } i \in \mathcal{K} \end{cases}.

Lemma 1: w_{\mathcal{K}} is spanned by w_2, \cdots, w_n, the n - 1 last eigenvectors of A_G.

Proof

w_2, \cdots, w_n are independent, and all orthogonal to w_1 (which is proportional to \mathbb{I}).

\Rightarrow The span of w_2, \cdots, w_n is exactly all vectors whose sum of entries is zero.

Sum of entries of w_{\mathcal{K}} is zero \Rightarrow w_{\mathcal{K}} is in their span.

Corollary: w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n for some $\alpha_i$'s in \mathbb{R}.

Lemma 2: By direct computation, the squared norm of w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n is \|w_{\mathcal{K}}\|_2^2 = \frac{ns}{n-s}.

Corollary: \|w_{\mathcal{K}}\|_2^2 = \sum_{i=2}^n \alpha_i^2 = \frac{ns}{n-s} (from Lemma 2 + orthonormality of w_2, \cdots, w_n).
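Both lemmas are easy to verify numerically; the snippet below (with illustrative n and s) checks that the entries of w_{\mathcal{K}} sum to zero (Lemma 1) and that its squared norm is \frac{ns}{n-s} (Lemma 2):

```python
import numpy as np

n, s = 12, 4                       # illustrative sizes
K = np.arange(s, n)                # indices of the n - s responding workers

w_K = np.full(n, -1.0)             # -1 on the stragglers...
w_K[K] = s / (n - s)               # ...and s/(n-s) on the responders

print(w_K.sum())                   # 0, so w_K lies in the span of w_2, ..., w_n (Lemma 1)
print(w_K @ w_K, n * s / (n - s))  # squared norm equals ns/(n-s) = 6.0 here    (Lemma 2)
```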

The scheme:

  • If the set of responses is \mathcal{K}, the decoding vector is w_{\mathcal{K}} + \mathbb{I}.
  • Notice that \operatorname{supp}(w_{\mathcal{K}} + \mathbb{I}) = \mathcal{K}.
  • The responses the master receives are the rows of B (v_1, \cdots, v_n)^\top indexed by \mathcal{K}.
  • \Rightarrow The master can compute (w_{\mathcal{K}} + \mathbb{I}) B (v_1, \cdots, v_n)^\top.

Left to show: How close is (w_{\mathcal{K}} + \mathbb{I}) B to \mathbb{I}?

Proof

Recall that:

  1. w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n.
  2. $w_i$'s are eigenvectors of A_G (with eigenvalues \lambda_i) and of B = \frac{1}{d} A_G (with eigenvalues \mu_i = \frac{\lambda_i}{d}).

d_2 \big( (w_{\mathcal{K}} + \mathbb{I}) B, \mathbb{I} \big) = d_2 \big( (\mathbb{I} + \alpha_2 w_2 + \cdots + \alpha_n w_n) B, \mathbb{I} \big) (from 1.)

= d_2 (\mathbb{I} + \alpha_2 \mu_2 w_2 + \cdots + \alpha_n \mu_n w_n, \mathbb{I}) (from 2., and \mu_1 = 1, so \mathbb{I} B = \mathbb{I})

= \| \sum_{i=2}^n \alpha_i \mu_i w_i \|_2 (by definition of d_2)

= \sqrt{\sum_{i=2}^n \alpha_i^2 \mu_i^2} ($w_i$'s are orthonormal)

\leq \mu \sqrt{\sum_{i=2}^n \alpha_i^2} = \mu \sqrt{\frac{ns}{n-s}} (since |\mu_i| \leq \mu for i \geq 2, and by the corollary above).
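To tie the pieces together, here is an end-to-end sketch of the scheme on a toy instance. The Petersen graph (3-regular, \mu = 2/3) stands in for the expander and the partial gradients are random; both choices are illustrative and not from the lecture. It compares the distance of (w_{\mathcal{K}} + \mathbb{I}) B from \mathbb{I} against the trivial scheme's \sqrt{\frac{ns}{n-s}}:

```python
import numpy as np

# Toy expander: the Petersen graph (3-regular, 10 nodes, mu = 2/3) -- an illustrative choice.
edges = [(i, (i + 1) % 5) for i in range(5)]
edges += [(5 + i, 5 + (i + 2) % 5) for i in range(5)]
edges += [(i, i + 5) for i in range(5)]
n, d = 10, 3
A = np.zeros((n, n))
for a, b in edges:
    A[a, b] = A[b, a] = 1
B = A / d                                     # gradient coding matrix B = (1/d) A_G

rng = np.random.default_rng(0)
V = rng.normal(size=(n, 4))                   # rows are the partial gradients v_1, ..., v_n
ones = np.ones(n)
exact = ones @ V                              # the exact gradient sum_i v_i

s = 3
K = np.arange(s, n)                           # workers s, ..., n-1 respond; 0, ..., s-1 straggle

w_K = np.full(n, -1.0)
w_K[K] = s / (n - s)
u_exp = w_K + ones                            # decoding vector, supported on K only

u_triv = np.zeros(n)
u_triv[K] = n / (n - s)                       # trivial "do nothing" decoding vector

print(np.linalg.norm(u_exp @ B - ones))       # distance to the all-ones vector (expander scheme)
print(np.linalg.norm(u_triv - ones))          # sqrt(ns/(n-s)) ~ 2.07 (trivial scheme), larger
print(np.linalg.norm(u_exp @ B @ V - exact))  # resulting gradient error (expander scheme)
print(np.linalg.norm(u_triv @ V - exact))     # resulting gradient error (trivial scheme)
```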

Improvement factor

Corollary: If B = \frac{1}{d} A_G for a (d-regular) Ramanujan graph G,

  • \Rightarrow improvement factor \approx \frac{2}{\sqrt{d}} (since \mu = \frac{\lambda}{d} \leq \frac{2\sqrt{d-1}}{d} \approx \frac{2}{\sqrt{d}}).
  • Some explicit constructions of Ramanujan graphs (Lubotsky, Philips and Sarnak '88)
    • with improvement factor \frac{2}{\sqrt{d}} \approx 0.5!

Recap

  • Expander graph: A $d$-regular graph with no lonely small subsets of nodes.
    • Every subset S with |S| \leq n/2 has a large ratio |\partial S| / |S| (one that does not \rightarrow 0 with n).
    • Many constructions exist, random graph is expander w.h.p.
    • The expansion factor is determined by the spectral gap d - \lambda_2,
    • Where \lambda = \max \{|\lambda_2|, |\lambda_n|\}, and \lambda_1 = d \geq \lambda_2 \geq \cdots \geq \lambda_n are the eigenvalues of A_G.
    • "Best" expander = Ramanujan graph = has \lambda \leq 2 d - 1.
  • "Do nothing" approach: approximation \frac{ns}{n-s}.
  • Approximate gradient coding:
    • Send d subsets S_{j_1}, \cdots, S_{j_d} to each node i, which returns a linear combination according to a coefficient matrix B.
    • Let B = \frac{1}{d} A_G, for G a Ramanujan graph: approximation \frac{\lambda}{d} \sqrt{\frac{ns}{n-s}}.
    • Up to 50% closer than "do nothing", at the price of a higher computation load.

Note

Faster = more computation load: each worker now computes d partial gradients instead of 1.