# CSE5313 Coding and information theory for data science (Lecture 21)

## Approximate Gradient Coding

### Exact gradient computation and approximate gradient computation

In the previous formulation, the gradient $\sum_i v_i$ is computed exactly.
- Accurate.
- Requires $d \geq s + 1$ (high replication factor).
- Need to know $s$ in advance!

However:
- Approximate gradient computation is very common!
  - E.g., stochastic gradient descent.
- Machine learning is inherently inaccurate.
  - Relies on biased data, unverified assumptions about the model, etc.

Idea: If we relax the exact computation requirement, can we have $d < s + 1$?
- No fixed $s$ anymore.

Approximate computation:
- Exact computation: $\nabla \triangleq v = \sum_i v_i = \mathbb{I} V$, where $\mathbb{I} = (1, \cdots, 1)$ and $V$ is the matrix whose $i$'th row is $v_i$.
- Approximate computation: $\nabla \triangleq v = \sum_i v_i \approx u V$,
  - where $d_2(u, \mathbb{I})$ is "small" ($d_2(u, v) = \sqrt{\sum_i (u_i - v_i)^2}$).
- Why?
  - Lemma: Let $v_u = u V$. If $d_2(u, \mathbb{I}) \leq \epsilon$ then $d_2(v, v_u) \leq \epsilon \cdot \ell_{spec}(V)$.
    - $V$ is the matrix whose rows are the $v_i$'s.
    - $\ell_{spec}$ is the spectral norm (positive square root of the maximum eigenvalue of $V^\top V$).
- Idea: Distribute $S_1, \cdots, S_n$ as before, and
  - as the master gets more and more responses,
  - it can reconstruct $u V$,
  - such that $d_2(u, \mathbb{I})$ gets smaller and smaller.

> [!NOTE]
> $d \geq s + 1$ no longer holds.
> $s$ is no longer a parameter of the system, but $s = n - \#\text{responses}$ at any given time.

### Trivial Scheme

Off the bat, the "do nothing" approach:
- Send $S_i$ to worker $i$, i.e., $d = 1$.
- Worker $i$ replies with the $i$'th partial gradient $v_i$.
- The master averages up all the responses.

How good is that?
- The master uses $u$ with $u_i = \frac{n}{n-s}$ for responding workers and $u_i = 0$ for stragglers; the factor $\frac{n}{n-s}$ corrects the $\frac{1}{n}$ in $v_i = \frac{1}{n} \cdot \nabla \text{ on } S_i$.
- Is this $\approx \sum_i v_i$? In other words, what is $d_2(u, \mathbb{I})$?

Trivial scheme: $d_2(u, \mathbb{I}) = \sqrt{s + (n-s)\left(\tfrac{n}{n-s} - 1\right)^2} = \sqrt{\tfrac{ns}{n-s}}$ approximation. Must do better than that! (A quick numerical check follows below.)
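As a sanity check, here is a minimal NumPy sketch of the proximity lemma and of the trivial scheme's distance. It is only an illustration: the sizes, the random data, and all variable names (`V`, `u`, `K`, etc.) are assumptions for this example, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 20, 5, 4           # n workers, gradient dimension p, s stragglers (illustrative sizes)

V = rng.normal(size=(n, p))  # row i plays the role of the partial gradient v_i
ones = np.ones(n)
v_exact = ones @ V           # exact gradient: the sum of all partial gradients

# Trivial scheme: rescale the n - s received partial gradients by n / (n - s).
K = rng.choice(n, size=n - s, replace=False)   # responding workers
u = np.zeros(n)
u[K] = n / (n - s)

# Proximity lemma: d_2(v, uV) <= d_2(u, 1) * spectral_norm(V).
dist_u = np.linalg.norm(u - ones)              # equals sqrt(n*s / (n - s))
spec_V = np.linalg.norm(V, 2)                  # largest singular value of V
print(np.linalg.norm(v_exact - u @ V) <= dist_u * spec_V)  # True
print(dist_u, np.sqrt(n * s / (n - s)))        # both print the same value
```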
### Roadmap

- Quick reminder from linear algebra.
  - Eigenvectors and orthogonality.
- Quick reminder from graph theory.
  - Adjacency matrix of a graph.
- Graph-theoretic concept: expander graphs.
  - "Well connected" graphs.
  - Extensively studied.
- An approximate gradient coding scheme from expander graphs.

### Linear algebra - Reminder

- Let $A \in \mathbb{R}^{n \times n}$.
  - If $A v = \lambda v$ then $\lambda$ is an eigenvalue and $v$ is an eigenvector.
- $v_1, \cdots, v_n \in \mathbb{R}^n$ are orthonormal if:
  - $\|v_i\|_2 = 1$ for all $i$.
  - $v_i v_j^\top = 0$ for all $i \neq j$.
  - Nice property: $\| \alpha_1 v_1 + \cdots + \alpha_n v_n \|_2 = \sqrt{\sum_i \alpha_i^2}$.
- $A$ is called symmetric if $A = A^\top$.
  - Theorem: A **real and symmetric** matrix has an orthonormal basis of eigenvectors.
  - That is, there exists an orthonormal basis $v_1, \cdots, v_n$ such that $A v_i = \lambda_i v_i$ for some $\lambda_i$'s.

### Graph theory - Reminder

- Undirected graph $G = (V, E)$.
  - $V$ is a vertex set, usually $V = \{1, 2, \cdots, n\}$.
  - $E \subseteq \binom{V}{2}$ is an edge set (i.e., $E$ is a collection of subsets of $V$ of size two).
  - Each edge $e \in E$ is of the form $e = \{a, b\}$ for some distinct $a, b \in V$.
- Spectral graph theory:
  - Analyze properties of graphs (combinatorial objects) using matrices (algebraic objects).
  - Specifically, for a graph $G$ let $A_G \in \{0,1\}^{n \times n}$ be the adjacency matrix of $G$.
    - $(A_G)_{i,j} = 1$ if and only if $\{i,j\} \in E$ (otherwise $0$).
  - $A_G$ is real and symmetric.
    - Therefore, it has an orthonormal basis of eigenvectors.

#### Some nice properties of adjacency matrices

- Let $G = (V, E)$ be $d$-regular, with adjacency matrix $A_G$ whose (real) eigenvalues are $\lambda_1 \geq \cdots \geq \lambda_n$.
- Some theorems:
  - $\lambda_1 = d$.
  - $\lambda_n \geq -d$, with equality if and only if $G$ is bipartite.
  - $A_G \mathbb{I}^\top = \lambda_1 \mathbb{I}^\top = d \mathbb{I}^\top$ (easy to show!).
    - Does that ring a bell? ;)
  - If $\lambda_1 = \lambda_2$ then $G$ is not connected.

#### Expander graphs - Intuition

- An important family of graphs.
- Multiple applications in:
  - Algorithms, complexity theory, error-correcting codes, etc.
- Intuition: A graph is called an expander if there are no "lonely small sets" of nodes.
  - Every set of at most $n/2$ nodes is "well connected" to the remaining nodes in the graph.
- A bit more formally:
  - An infinite family of graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an **expander family** if the "minimal connectedness" of small sets in $G_n$ does not go to zero with $n$.

#### Expander graphs - Definitions

- All graphs in this lecture are $d$-regular, i.e., all nodes have the same degree $d$.
- For sets of nodes $S, T \subseteq V$, let $E(S, T)$ be the set of edges between $S$ and $T$, i.e., $E(S, T) = \{\{i,j\} \in E \mid i \in S \text{ and } j \in T\}$.
- For a set of nodes $S$ let:
  - $S^c = V \setminus S$ be its complement.
  - $\partial S = E(S, S^c)$ be the boundary of $S$.
    - I.e., the set of edges between $S$ and its complement $S^c$.
- The expansion parameter $h_G$ of $G$ is given below:
  - I.e., how many edges leave $S$, relative to its size.
  - How "well connected" $S$ is to the remaining nodes.

> [!NOTE]
> $h_G = \min_{S \subseteq V, |S| \leq n/2} \frac{|\partial S|}{|S|}$.

- An infinite family of $d$-regular graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an **expander family** if $h(G_n) \geq \epsilon$ for all $n$.
  - Same $d$ and same $\epsilon$ for all $n$.
- Expander families with large $\epsilon$ are hard to build explicitly.
  - Example (Lubotzky, Phillips and Sarnak '88):
    - $V = \mathbb{Z}_p$ ($p$ prime).
    - Connect $x$ to $x + 1$, $x - 1$, and $x^{-1}$.
    - $d = 3$, very small $\epsilon$.
- However, **random** graphs are expanders with high probability.

#### Expander graphs - Eigenvalues

- There is a strong connection between the expansion parameter of a graph and the eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$ of its adjacency matrix.
- Some theorems (no proof):
  - $\frac{d-\lambda_2}{2} \leq h_G \leq \sqrt{2d (d - \lambda_2)}$.
  - $d - \lambda_2$ is called the **spectral gap** of $G$.
  - If the spectral gap is large, $G$ is a good expander.
- How large can it be?
  - Let $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$. Then $\lambda \geq 2\sqrt{d - 1} - o_n(1)$ (Alon-Boppana theorem).
  - Graphs which achieve the Alon-Boppana bound (i.e., $\lambda \leq 2\sqrt{d - 1}$) are called **Ramanujan graphs**.
    - The "best" expanders.
    - Some constructions are known.
    - Efficient construction of Ramanujan graphs for all parameters is very recent (2016).
    - (These eigenvalue facts are illustrated numerically in the sketch below.)
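The following sketch illustrates the eigenvalue facts above on a random $d$-regular graph (random graphs are good expanders w.h.p.). It assumes `networkx` is available for sampling the graph; the sizes are illustrative.

```python
import numpy as np
import networkx as nx   # assumed available; used only to sample a random d-regular graph

n, d = 200, 8           # illustrative sizes
G = nx.random_regular_graph(d, n, seed=1)
A = nx.adjacency_matrix(G).toarray().astype(float)

eig = np.sort(np.linalg.eigvalsh(A))[::-1]     # A is real and symmetric, so eigvalsh applies
lam1, lam2, lamn = eig[0], eig[1], eig[-1]
lam = max(abs(lam2), abs(lamn))

print(lam1)                        # = d for a d-regular graph
print(d - lam2)                    # the spectral gap: the larger it is, the better the expander
print(lam, 2 * np.sqrt(d - 1))     # random regular graphs tend to come close to the Ramanujan bound
```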
#### Approximate GC from Expander Graphs

Back to approximate gradient coding.
- Let $d$ be any replication parameter.
- Let $G$ be an expander graph (i.e., taken from an infinite expander family $(G_n)_{n=1}^\infty$),
  - with eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$ and respective eigenvectors $w_1, \cdots, w_n$.
  - Assume $\|w_1\|_2 = \|w_2\|_2 = \cdots = \|w_n\|_2 = 1$, and $w_i w_j^\top = 0$ for all $i \neq j$.
- Let the gradient coding matrix be $B = \frac{1}{d} A_G$.
  - The eigenvalues of $B$ are $\mu_1 = 1 \geq \mu_2 \geq \cdots \geq \mu_n$, where $\mu_i = \frac{\lambda_i}{d}$.
  - Let $\mu = \max \{|\mu_2|, |\mu_n|\}$.
  - $d$ nonzero entries in each row $\Rightarrow$ replication factor $d$.
- Claim: For any number of stragglers $s$, we can get close to $\mathbb{I}$.
  - Much better than the trivial scheme.
  - Proximity is a function of $d$ and $\lambda$.
- For every $s$ and any set $\mathcal{K}$ of $n - s$ responses, we build a "decoding vector".
  - A function of $s$ and of the identities of the responding workers.
  - Will be used to linearly combine the $n - s$ responses to get the approximate gradient.
- Let $w_{\mathcal{K}} \in \mathbb{R}^n$ be such that $(w_{\mathcal{K}})_i = \begin{cases} -1 & \text{if } i \notin \mathcal{K} \\ \frac{s}{n-s} & \text{if } i \in \mathcal{K} \end{cases}$.

Lemma 1: $w_{\mathcal{K}}$ lies in the span of $w_2, \cdots, w_n$, the $n - 1$ last eigenvectors of $A_G$.
Proof: $w_2, \cdots, w_n$ are independent, and all orthogonal to $w_1$, which is proportional to $\mathbb{I}$.
$\Rightarrow$ The span of $w_2, \cdots, w_n$ is exactly the set of all vectors whose sum of entries is zero.
The sum of entries of $w_{\mathcal{K}}$ is zero ($(n-s) \cdot \frac{s}{n-s} - s \cdot 1 = 0$) $\Rightarrow$ $w_{\mathcal{K}}$ is in their span.

Corollary: $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$ for some $\alpha_i$'s in $\mathbb{R}$.
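Lemma 1 and its corollary are easy to check numerically. The sketch below builds $w_{\mathcal{K}}$ for a random $d$-regular graph and verifies that it is orthogonal to $\mathbb{I}$ and is reconstructed exactly from its coefficients in $w_2, \cdots, w_n$ (again assuming `networkx`; all sizes are illustrative).

```python
import numpy as np
import networkx as nx  # assumed available, as in the earlier sketches

n, d, s = 200, 8, 30   # illustrative sizes
A = nx.adjacency_matrix(nx.random_regular_graph(d, n, seed=1)).toarray().astype(float)
_, vecs = np.linalg.eigh(A)       # eigenvalues ascending; columns are orthonormal eigenvectors
w = vecs[:, ::-1].T               # w[0], ..., w[n-1] ordered by decreasing eigenvalue
# assumes the sampled graph is connected, so w[0] is proportional to the all-ones vector

rng = np.random.default_rng(2)
K = rng.choice(n, size=n - s, replace=False)   # responding workers
w_K = np.full(n, -1.0)
w_K[K] = s / (n - s)

print(abs(w_K.sum()) < 1e-9)                   # True: w_K is orthogonal to the all-ones vector
alphas = w[1:] @ w_K                           # coefficients alpha_2, ..., alpha_n
print(np.allclose(w[1:].T @ alphas, w_K))      # True: w_K lies in span{w_2, ..., w_n} (Lemma 1)
```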
Lemma 2: By direct computation, the squared norm of $w_{\mathcal{K}}$ is $\|w_{\mathcal{K}}\|_2^2 = s \cdot 1 + (n-s) \cdot \frac{s^2}{(n-s)^2} = \frac{ns}{n-s}$.

Corollary: $\sum_{i=2}^n \alpha_i^2 = \|w_{\mathcal{K}}\|_2^2 = \frac{ns}{n-s}$ (from Lemma 2 + orthonormality of $w_2, \cdots, w_n$).

The scheme:
- If the set of responses is $\mathcal{K}$, the decoding vector is $w_{\mathcal{K}} + \mathbb{I}$.
  - Notice that $\operatorname{supp}(w_{\mathcal{K}} + \mathbb{I}) = \mathcal{K}$.
- The responses the master receives are the rows of $B V$ indexed by $\mathcal{K}$ (where $V$ is the matrix whose rows are $v_1, \cdots, v_n$).
- $\Rightarrow$ The master can compute $(w_{\mathcal{K}} + \mathbb{I}) B V$.

Left to show: How close is $(w_{\mathcal{K}} + \mathbb{I}) B$ to $\mathbb{I}$?
Proof. Recall that:
1. $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$.
2. The $w_i$'s are eigenvectors of $A_G$ (with eigenvalues $\lambda_i$) and of $B = \frac{1}{d} A_G$ (with eigenvalues $\mu_i = \frac{\lambda_i}{d}$).

$$
\begin{aligned}
d_2\big((w_{\mathcal{K}} + \mathbb{I}) B, \mathbb{I}\big)
&= d_2\big((\mathbb{I} + \alpha_2 w_2 + \cdots + \alpha_n w_n) B, \mathbb{I}\big) && \text{(from 1.)}\\
&= d_2\big(\mathbb{I} + \alpha_2 \mu_2 w_2 + \cdots + \alpha_n \mu_n w_n, \mathbb{I}\big) && \text{(from 2., and } \mathbb{I} B = \mathbb{I} \text{ since } \mu_1 = 1\text{)}\\
&= \Big\|\textstyle\sum_{i=2}^n \alpha_i \mu_i w_i\Big\|_2 && \text{(by definition of } d_2\text{)}\\
&= \sqrt{\textstyle\sum_{i=2}^n \alpha_i^2 \mu_i^2} && \text{(the } w_i\text{'s are orthonormal)}\\
&\leq \mu \sqrt{\textstyle\sum_{i=2}^n \alpha_i^2} = \mu \sqrt{\tfrac{ns}{n-s}} && \text{(}|\mu_i| \leq \mu \text{ for } i \geq 2\text{, and the corollary of Lemma 2)}.
\end{aligned}
$$
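The whole scheme and the bound just proved can be checked end to end with a short numerical sketch. Again, `networkx`, the seeds, and the sizes are assumptions for illustration only.

```python
import numpy as np
import networkx as nx   # assumed available; any d-regular graph with a good spectral gap will do

n, d, s = 200, 8, 30    # illustrative sizes
A = nx.adjacency_matrix(nx.random_regular_graph(d, n, seed=1)).toarray().astype(float)
B = A / d               # gradient coding matrix; its eigenvalues are lambda_i / d, with mu_1 = 1

eig = np.sort(np.linalg.eigvalsh(B))[::-1]
mu = max(abs(eig[1]), abs(eig[-1]))

rng = np.random.default_rng(3)
K = rng.choice(n, size=n - s, replace=False)   # responding workers
w_K = np.full(n, -1.0)
w_K[K] = s / (n - s)
decoder = w_K + np.ones(n)                     # decoding vector, supported exactly on K

ones = np.ones(n)
err_scheme = np.linalg.norm(decoder @ B - ones)   # d_2((w_K + 1)B, 1)
u_trivial = np.zeros(n)
u_trivial[K] = n / (n - s)
err_trivial = np.linalg.norm(u_trivial - ones)    # sqrt(ns/(n-s)), the "do nothing" distance
bound = mu * np.sqrt(n * s / (n - s))

print(err_scheme, bound)          # err_scheme is at most the bound from the proof
print(err_scheme / err_trivial)   # improvement over the trivial scheme, at most mu
```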
#### Improvement factor

Corollary: If $B = \frac{1}{d} A_G$ for a ($d$-regular) Ramanujan graph $G$,
- $\Rightarrow$ improvement factor $\frac{\lambda}{d} \leq \frac{2\sqrt{d-1}}{d} \approx \frac{2}{\sqrt{d}}$ over the trivial scheme.
- Some explicit constructions of Ramanujan graphs are known (Lubotzky, Phillips and Sarnak '88),
  - with $\frac{2}{\sqrt{d}} \approx 0.5$!

### Recap

- Expander graph: a $d$-regular graph with no lonely small subsets of nodes.
  - Every subset $S$ with $|S| \leq n/2$ nodes has a large ratio $|\partial S| / |S|$ (not $\rightarrow 0$ with $n$).
  - Many constructions exist; a random graph is an expander w.h.p.
- The expansion is determined by the spectral gap $d - \lambda_2$,
  - where $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$, and $\lambda_1 = d \geq \lambda_2 \geq \cdots \geq \lambda_n$ are the eigenvalues of $A_G$.
- "Best" expander = Ramanujan graph = has $\lambda \leq 2\sqrt{d - 1}$.
- "Do nothing" approach: approximation $\sqrt{\frac{ns}{n-s}}$.
- Approximate gradient coding:
  - Send $d$ subsets $S_{j_1}, \cdots, S_{j_d}$ to each node $i$, which returns a linear combination according to a coefficient matrix $B$.
  - Let $B = \frac{1}{d} A_G$ for $G$ a Ramanujan graph: approximation $\frac{\lambda}{d} \sqrt{\frac{ns}{n-s}}$.
  - Up to 50% closer than "do nothing", at the price of a higher computation load.

> [!NOTE]
> Faster = more computation load.
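To put numbers on the "up to 50% closer" statement, the tiny sketch below tabulates the Ramanujan improvement factor $\frac{2\sqrt{d-1}}{d}$ for a few degrees; the chosen degrees are illustrative, not values from the lecture.

```python
import numpy as np

# Improvement factor lambda/d of a Ramanujan graph over the "do nothing" scheme,
# using lambda = 2*sqrt(d - 1), for a few illustrative degrees d.
for d in (4, 8, 16, 32, 64):
    print(d, round(2 * np.sqrt(d - 1) / d, 3), round(2 / np.sqrt(d), 3))
```

For moderate degrees (around $d = 16$) the factor is already roughly $0.5$; larger $d$ gives a better approximation but a higher computation load per worker, as noted above.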