CSE5313 Coding and information theory for data science (Lecture 22)
Approximate Gradient Coding
Exact gradient computation and approximate gradient computation
In the previous formulation, the gradient $\sum_i v_i$ is computed exactly.
- Accurate.
- Requires $d \geq s + 1$ (high replication factor).
- Need to know $s$ in advance!
However:
- Approximate gradient computations are very common!
- E.g., stochastic gradient descent.
- Machine learning is inherently inaccurate.
- Relies on biased data, unverified assumptions about model, etc.
Idea: If we relax the exact computation requirement, can we have $d < s + 1$?
- No fixed $s$ anymore.
Approximate computation:
- Exact computation: $\nabla \triangleq v = \sum_i v_i = (1, \cdots, 1)(v_1, \cdots, v_n)^\top$.
- Approximate computation: $\nabla \triangleq v = \sum_i v_i \approx u (v_1, \cdots, v_n)^\top$,
  - where $d_2(u, \mathbb{I})$ is "small" ($d_2(u, v) = \sqrt{\sum_i (u_i - v_i)^2}$).
- Why?
  - Lemma: Let $v_u = u (v_1, \cdots, v_n)^\top$. If $d_2(u, \mathbb{I}) \leq \epsilon$ then $d_2(v, v_u) \leq \epsilon \cdot \ell_{spec}(V)$, where $V$ is the matrix whose rows are the $v_i$'s and $\ell_{spec}$ is the spectral norm (positive square root of the maximum eigenvalue of $V^\top V$). (See the numerical sketch after this list.)
- Idea: Distribute $S_1, \cdots, S_n$ as before, and
  - as the master gets more and more responses,
  - it can reconstruct $u (v_1, \cdots, v_n)^\top$,
  - such that $d_2(u, \mathbb{I})$ gets smaller and smaller.
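As a sanity check of the lemma above, here is a minimal NumPy sketch; all sizes and the random data are illustrative, not part of the lecture.

```python
import numpy as np

# Minimal sketch of the lemma: if u is close to the all-ones vector,
# then u (v_1, ..., v_n)^T is close to the exact gradient sum_i v_i.
rng = np.random.default_rng(0)
n, p = 10, 5                                 # n partial gradients of dimension p (illustrative)
V = rng.standard_normal((n, p))              # row i plays the role of v_i

ones = np.ones(n)
u = ones + 0.05 * rng.standard_normal(n)     # a combining vector "close" to the all-ones vector

v_exact = ones @ V                           # sum_i v_i
v_approx = u @ V                             # u (v_1, ..., v_n)^T

eps = np.linalg.norm(u - ones)               # d_2(u, 1)
spec = np.linalg.norm(V, ord=2)              # spectral norm of V (largest singular value)
err = np.linalg.norm(v_exact - v_approx)     # d_2(v, v_u)

print(err, "<=", eps * spec)                 # the lemma's guarantee
assert err <= eps * spec + 1e-9
```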
Note
$d \geq s + 1$ no longer holds; $s$ is no longer a parameter of the system, but $s = n - \#\text{responses}$ at any given time.
Trivial Scheme
Off the bat, the "do nothing" approach:
- Send $S_i$ to worker $i$, i.e., $d = 1$.
- Worker $i$ replies with the $i$'th partial gradient $v_i$.
- The master averages all the responses.
How good is that?
- For $u = \frac{n}{n-s} \cdot \mathbb{I}$, the factor $\frac{n}{n-s}$ corrects the $\frac{1}{n}$ in $v_i = \frac{1}{n} \cdot \nabla$ on $S_i$.
- Is this $\approx \sum_i v_i$? In other words, what is $d_2(\frac{n}{n-s} \cdot \mathbb{I}, \mathbb{I})$?
Trivial scheme: the approximation error of $u = \frac{n}{n-s} \cdot \mathbb{I}$ is $\sqrt{\frac{ns}{n-s}}$ (worked out below).
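To see where this value comes from, write the trivial decoding vector as $\frac{n}{n-s} \cdot \mathbb{I}_{\mathcal{K}}$, where $\mathbb{I}_{\mathcal{K}}$ (a notation used only for this computation) is the indicator vector of the $n-s$ responding workers. Then

$$d_2\left(\tfrac{n}{n-s} \cdot \mathbb{I}_{\mathcal{K}},\ \mathbb{I}\right)^2 = (n-s)\left(\tfrac{n}{n-s} - 1\right)^2 + s \cdot 1^2 = \tfrac{s^2}{n-s} + s = \tfrac{ns}{n-s},$$

which grows with $s$: the trivial decoding vector drifts away from $\mathbb{I}$ as more workers straggle.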
Must do better than that!
Roadmap
- Quick reminder from linear algebra.
- Eigenvectors and orthogonality.
- Quick reminder from graph theory.
- Adjacency matrix of a graph.
- Graph theoretic concept: expander graphs.
- "Well connected" graphs.
- Extensively studied.
- An approximate gradient coding scheme from expander graphs.
Linear algebra - Reminder
- Let $A \in \mathbb{R}^{n \times n}$.
- If $A v = \lambda v$ then $\lambda$ is an eigenvalue and $v$ is an eigenvector.
- $v_1, \cdots, v_n \in \mathbb{R}^n$ are orthonormal if $\|v_i\|_2 = 1$ for all $i$ and $v_i \cdot v_j^\top = 0$ for all $i \neq j$.
- Nice property: $\| \alpha_1 v_1 + \cdots + \alpha_n v_n \|_2 = \sqrt{\sum_i \alpha_i^2}$.
- $A$ is called symmetric if $A = A^\top$.
- Theorem: A real and symmetric matrix has an orthonormal basis of eigenvectors.
  - That is, there exists an orthonormal basis $v_1, \cdots, v_n$ such that $A v_i = \lambda_i v_i$ for some $\lambda_i$'s. (A quick numerical check follows this list.)
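A quick NumPy sanity check of these reminders; the $4 \times 4$ matrix is illustrative, not part of the lecture.

```python
import numpy as np

# Real symmetric matrix -> orthonormal eigenbasis, and the "nice property" on norms.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                            # a real symmetric matrix

eigvals, W = np.linalg.eigh(A)               # columns of W form an orthonormal eigenbasis
assert np.allclose(W.T @ W, np.eye(4))       # orthonormality
assert np.allclose(A @ W, W @ np.diag(eigvals))

alpha = rng.standard_normal(4)
combo = W @ alpha                            # alpha_1 v_1 + ... + alpha_n v_n
assert np.isclose(np.linalg.norm(combo), np.sqrt(np.sum(alpha**2)))
```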
Graph theory - Reminder
- Undirected graph $G = (V, E)$.
  - $V$ is a vertex set, usually $V = \{1, 2, \cdots, n\}$.
  - $E \subseteq \binom{V}{2}$ is an edge set (i.e., $E$ is a collection of subsets of $V$ of size two).
  - Each edge $e \in E$ is of the form $e = \{a, b\}$ for some distinct $a, b \in V$.
- Spectral graph theory:
  - Analyze properties of graphs (combinatorial objects) using matrices (algebraic objects).
  - Specifically, for a graph $G$ let $A_G \in \{0,1\}^{n \times n}$ be the adjacency matrix of $G$.
    - $A_{i,j} = 1$ if and only if $\{i,j\} \in E$ (otherwise 0).
    - $A_G$ is real and symmetric.
    - Therefore, it has an orthonormal basis of eigenvectors.
Some nice properties of adjacency matrices
- Let $G = (V, E)$ be $d$-regular, with adjacency matrix $A_G$ whose (real) eigenvalues are $\lambda_1 \geq \cdots \geq \lambda_n$.
- Some theorems (a small numerical check follows this list):
  - $\lambda_1 = d$.
  - $\lambda_n \geq -d$, with equality if and only if $G$ is bipartite.
  - $A_G \mathbb{I}^\top = \lambda_1 \mathbb{I}^\top = d \mathbb{I}^\top$ (easy to show!).
    - Does that ring a bell? ;)
  - If $\lambda_1 = \lambda_2$ then $G$ is not connected.
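A small sketch verifying these properties on the cycle $C_8$, a $2$-regular graph chosen here only for simplicity.

```python
import numpy as np

# Adjacency matrix of the cycle C_n: a d-regular graph with d = 2.
n, d = 8, 2
A = np.zeros((n, n), dtype=int)
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1

eigvals = np.sort(np.linalg.eigvalsh(A))[::-1]   # lambda_1 >= ... >= lambda_n
assert np.isclose(eigvals[0], d)                 # lambda_1 = d
assert eigvals[-1] >= -d - 1e-9                  # lambda_n >= -d (here = -d, since C_8 is bipartite)

ones = np.ones(n)
assert np.allclose(A @ ones, d * ones)           # A_G * 1^T = d * 1^T
```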
Expander graphs - Intuition.
- An important family of graphs.
- Multiple applications in:
- Algorithms, complexity theory, error correcting codes, etc.
- Intuition: A graph is called an expander if there are no "lonely small sets" of nodes.
- Every set of at most
n/2nodes is "well connected" to the remaining nodes in the graph. - A bit more formally:
- An infinite family of graphs
G_nn=1\infty(whereG_nhasnnodes) is called an expander family, if the "minimal connectedness" of small sets inG_ndoes not go to zero withn.
- An infinite family of graphs
Expander graphs - Definitions.
- All graphs in this lecture are $d$-regular, i.e., all nodes have the same degree $d$.
- For sets of nodes $S, T \subseteq V$, let $E(S, T)$ be the set of edges between $S$ and $T$.
  - I.e., $E(S, T) = \{\{i,j\} \in E \mid i \in S \text{ and } j \in T\}$.
- For a set of nodes $S$ let:
  - $S^c = V \setminus S$ be its complement, and
  - $\partial S = E(S, S^c)$ be the boundary of $S$,
    - i.e., the set of edges between $S$ and its complement $S^c$.
- The expansion parameter $h_G$ of $G$ is defined in the note below:
  - I.e., how many edges leave $S$, relative to its size.
  - How "well connected" $S$ is to the remaining nodes.
Note
$h_G = \min_{S \subseteq V,\, |S| \leq n/2} \frac{|\partial S|}{|S|}$.
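For very small graphs, $h_G$ can be computed by brute force directly from this definition; the sketch below is illustrative only (exponential time in $n$).

```python
import numpy as np
from itertools import combinations

def expansion(A):
    """Brute-force h_G = min over |S| <= n/2 of |boundary(S)| / |S|."""
    n = A.shape[0]
    best = float("inf")
    for size in range(1, n // 2 + 1):
        for S in combinations(range(n), size):
            Sc = [j for j in range(n) if j not in S]
            boundary = sum(A[i, j] for i in S for j in Sc)   # |E(S, S^c)|
            best = min(best, boundary / size)
    return best

# Example: the cycle C_6; the worst set is a path of 3 consecutive nodes, so h_G = 2/3.
n = 6
A = np.zeros((n, n), dtype=int)
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1
print(expansion(A))   # 2/3 -- long cycles are poor expanders (h_G -> 0 as n grows)
```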
- An infinite family of $d$-regular graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an expander family if $h(G_n) \geq \epsilon$ for all $n$.
  - Same $d$ and same $\epsilon$ for all $n$.
- Expander families with large $\epsilon$ are hard to build explicitly.
- Example (Lubotzky, Phillips and Sarnak '88): $V = \mathbb{Z}_p$ ($p$ prime).
  - Connect $x$ to $x + 1$, $x - 1$, and $x^{-1}$.
  - $d = 3$, very small $\epsilon$.
- However, random graphs are expanders with high probability.
Expander graphs - Eigenvalues
- There is a strong connection between the expansion parameter of a graph and the eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$ of its adjacency matrix.
- Some theorems (no proof):
  - $\frac{d-\lambda_2}{2} \leq h_G \leq \sqrt{2d (d - \lambda_2)}$.
  - $d - \lambda_2$ is called the spectral gap of $G$.
    - If the spectral gap is large, $G$ is a good expander.
    - How large can it be?
- Let $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$. Then $\lambda \geq 2\sqrt{d-1} - o_n(1)$ (Alon-Boppana Theorem).
- Graphs which attain the Alon-Boppana bound (i.e., $\lambda \leq 2\sqrt{d-1}$) are called Ramanujan graphs (see the empirical check after this list).
  - The "best" expanders.
  - Some constructions are known.
  - Efficient construction of Ramanujan graphs for all parameters is very recent (2016).
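The claim that random graphs are good (nearly Ramanujan) expanders w.h.p. can be checked empirically; the sketch below assumes `networkx` is available as a convenient generator of random $d$-regular graphs.

```python
import numpy as np
import networkx as nx

# Empirical check: a random d-regular graph typically has lambda close to 2*sqrt(d-1).
n, d = 200, 6
G = nx.random_regular_graph(d, n, seed=0)
A = nx.to_numpy_array(G)

eigvals = np.sort(np.linalg.eigvalsh(A))[::-1]   # lambda_1 >= ... >= lambda_n
lam2, lamn = eigvals[1], eigvals[-1]
lam = max(abs(lam2), abs(lamn))

print("lambda_1 =", eigvals[0])                  # = d
print("spectral gap d - lambda_2 =", d - lam2)
print("lambda =", lam, "vs. 2*sqrt(d-1) =", 2 * np.sqrt(d - 1))
```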
Approximate GC from Expander Graphs
Back to approximate gradient coding.
- Let $d$ be any replication parameter.
- Let $G$ be an expander graph (i.e., taken from an infinite expander family $(G_n)_{n=1}^\infty$),
  - with eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$ and respective eigenvectors $w_1, \cdots, w_n$.
  - Assume $\|w_1\|_2 = \|w_2\|_2 = \cdots = \|w_n\|_2 = 1$, and $w_i w_j^\top = 0$ for all $i \neq j$.
- Let the gradient coding matrix be $B = \frac{1}{d} A_G$.
  - The eigenvalues of $B$ are $\mu_1 = 1 \geq \mu_2 \geq \cdots \geq \mu_n$, where $\mu_i = \frac{\lambda_i}{d}$.
  - Let $\mu = \max \{|\mu_2|, |\mu_n|\}$.
  - $d$ nonzero entries in each row $\Rightarrow$ replication factor $d$.
- Claim: For any number of stragglers $s$, we can get close to $\mathbb{I}$.
  - Much better than the trivial scheme.
  - Proximity is a function of $d$ and $\lambda$.
- For every $s$ and any set $\mathcal{K}$ of $n - s$ responses, we build a "decoding vector".
  - A function of $s$ and of the identities of the responding workers.
  - Will be used to linearly combine the $n - s$ responses to get the approximate gradient.
- Let $w_{\mathcal{K}} \in \mathbb{R}^n$ be such that $(w_{\mathcal{K}})_i = \begin{cases} -1 & \text{if } i \notin \mathcal{K} \\ \frac{s}{n-s} & \text{if } i \in \mathcal{K} \end{cases}$.
Lemma 1: $w_{\mathcal{K}}$ lies in the span of $w_2, \cdots, w_n$, the $n - 1$ last eigenvectors of $A_G$.
Proof
$w_2, \cdots, w_n$ are independent, and all orthogonal to $w_1 = \mathbb{I}$.
$\Rightarrow$ The span of $w_2, \cdots, w_n$ is exactly the set of all vectors whose sum of entries is zero.
The sum of entries of $w_{\mathcal{K}}$ is zero ($s \cdot (-1) + (n-s) \cdot \frac{s}{n-s} = 0$) $\Rightarrow$ $w_{\mathcal{K}}$ is in their span.
Corollary: $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$ for some $\alpha_i$'s in $\mathbb{R}$.
Lemma 2: By direct computation, the squared norm of $w_{\mathcal{K}}$ is $\|w_{\mathcal{K}}\|_2^2 = \frac{ns}{n-s}$.
Corollary: $\|w_{\mathcal{K}}\|_2^2 = \sum_{i=2}^n \alpha_i^2 = \frac{ns}{n-s}$ (from Lemma 2 + orthonormality of $w_2, \cdots, w_n$).
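A quick numerical check of the definition of $w_{\mathcal{K}}$ and of Lemmas 1-2; the sizes and the choice of $\mathcal{K}$ are illustrative.

```python
import numpy as np

# w_K: -1 on stragglers, s/(n-s) on the n-s responding workers.
n = 12
K = np.array([0, 1, 2, 4, 5, 7, 8, 9, 11])            # responding workers (so s = 3)
s = n - len(K)

w_K = -np.ones(n)
w_K[K] = s / (n - s)

assert np.isclose(w_K.sum(), 0)                        # entries sum to zero, so w_K is orthogonal to 1 (Lemma 1)
assert np.isclose(w_K @ w_K, n * s / (n - s))          # squared norm = ns/(n-s) (Lemma 2)
```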
The scheme:
- If the set of responses is $\mathcal{K}$, the decoding vector is $w_{\mathcal{K}} + \mathbb{I}$.
- Notice that $\operatorname{supp}(w_{\mathcal{K}} + \mathbb{I}) = \mathcal{K}$.
- The responses the master receives are the rows of $B (v_1, \cdots, v_n)^\top$ indexed by $\mathcal{K}$.
- $\Rightarrow$ The master can compute $(w_{\mathcal{K}} + \mathbb{I}) B (v_1, \cdots, v_n)^\top$.
Left to show: How close is $(w_{\mathcal{K}} + \mathbb{I}) B$ to $\mathbb{I}$?
Proof
Recall that:
1. $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$.
2. The $w_i$'s are eigenvectors of $A_G$ (with eigenvalues $\lambda_i$) and of $B = \frac{1}{d} A_G$ (with eigenvalues $\mu_i = \frac{\lambda_i}{d}$).
$d_2 ((w_{\mathcal{K}} + \mathbb{I}) B, \mathbb{I}) = d_2 ((\mathbb{I} + \alpha_2 w_2 + \cdots + \alpha_n w_n) B, \mathbb{I})$ (from 1.)
$= d_2 (\mathbb{I} + \alpha_2 \mu_2 w_2 + \cdots + \alpha_n \mu_n w_n, \mathbb{I})$ (from 2., and $\mathbb{I} B = \mathbb{I}$ since $\mu_1 = 1$)
$= \|\sum_{i=2}^n \alpha_i \mu_i w_i\|_2$ (by definition of $d_2$)
$= \sqrt{\sum_{i=2}^n \alpha_i^2 \mu_i^2}$ ($w_i$'s are orthonormal)
$\leq \mu \sqrt{\sum_{i=2}^n \alpha_i^2} = \mu \sqrt{\frac{ns}{n-s}}$ (Corollary of Lemma 2).
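Putting everything together, here is an end-to-end sketch of the scheme; a random $d$-regular graph (via `networkx`, assumed available) stands in for the expander/Ramanujan graph, and all sizes are illustrative.

```python
import numpy as np
import networkx as nx

# Approximate gradient coding with B = A_G / d, compared against the trivial scheme.
n, d, s = 200, 16, 40
G = nx.random_regular_graph(d, n, seed=1)
B = nx.to_numpy_array(G) / d                     # gradient coding matrix, mu_1 = 1

rng = np.random.default_rng(2)
stragglers = rng.choice(n, size=s, replace=False)
K = np.setdiff1d(np.arange(n), stragglers)       # responding workers

w_K = -np.ones(n)
w_K[K] = s / (n - s)
decoder = w_K + np.ones(n)                       # decoding vector, supported only on K

err = np.linalg.norm(decoder @ B - np.ones(n))   # d_2((w_K + 1) B, 1)

evals = np.linalg.eigvalsh(B)                    # ascending; evals[-1] = mu_1 = 1
mu = np.abs(evals[:-1]).max()                    # mu = max{|mu_2|, |mu_n|}
trivial = np.sqrt(n * s / (n - s))               # error of the "do nothing" scheme
print("error:", err, " bound:", mu * trivial, " trivial:", trivial)
```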
Improvement factor
Corollary: If $B = \frac{1}{d} A_G$ for a ($d$-regular) Ramanujan graph $G$, then $\mu = \frac{\lambda}{d} \leq \frac{2\sqrt{d-1}}{d} \approx \frac{2}{\sqrt{d}}$
$\Rightarrow$ improvement factor $\approx \frac{2}{\sqrt{d}}$ over the trivial scheme.
- Some explicit constructions of Ramanujan graphs are known (Lubotzky, Phillips and Sarnak '88),
  - with $\frac{2}{\sqrt{d}} \approx 0.5$!
Recap
- Expander graph: A $d$-regular graph with no lonely small subsets of nodes.
  - Every subset with $\leq n/2$ nodes has a large ratio $|\partial S| / |S|$ (not $\rightarrow 0$ with $n$).
  - Many constructions exist; a random graph is an expander w.h.p.
  - The expansion factor is determined by the spectral gap $d - \lambda_2$,
    - where $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$, and $\lambda_1 = d \geq \lambda_2 \geq \cdots \geq \lambda_n$ are the eigenvalues of $A_G$.
  - "Best" expander = Ramanujan graph = has $\lambda \leq 2\sqrt{d-1}$.
- "Do nothing" approach: approximation $\sqrt{\frac{ns}{n-s}}$.
- Approximate gradient coding:
  - Send $d$ subsets $S_{j_1}, \cdots, S_{j_d}$ to each node $i$, which returns a linear combination according to a coefficient matrix $B$.
  - Let $B = \frac{1}{d} A_G$ for $G$ a Ramanujan graph: approximation $\frac{\lambda}{d} \sqrt{\frac{ns}{n-s}}$.
  - Up to 50% closer than "do nothing", at the price of a higher computation load.
Note
Faster = more computation load.