# CSE5313 Coding and information theory for data science (Lecture 21)
## Approximate Gradient Coding
### Exact gradient computation and approximate gradient computation
In the previous formulation, the gradient $\sum_i v_i$
is computed exactly.
- Accurate
- Requires $d \geq s + 1$ (high replication factor).
- Need to know $s$ in advance!
However:
- Approximate gradient computations are very common!
- E.g., stochastic gradient descent.
- Machine learning is inherently inaccurate.
- Relies on biased data, unverified assumptions about model, etc.
Idea: If we relax the exact computation requirement, can we have $d < s + 1$?
- No fixed $s$ anymore.
Approximate computation:
- Exact computation: $\nabla \triangleq v = \sum_i v_i = (1, \cdots, 1)(v_1, \cdots, v_n)^\top$.
- Approximate computation: $\nabla \triangleq v = \sum_i v_i \approx u (v_1, \cdots, v_n)^\top$,
- Where $d_2(u, \mathbb{I})$ is "small" ($d_2(u, v) = \sqrt{\sum_i (u_i - v_i)^2}$).
- Why?
- Lemma: Let $v_u = u (v_1, \cdots, v_n)^\top$. If $d_2(u, \mathbb{I}) \leq \epsilon$ then $d_2(v, v_u) \leq \epsilon \cdot \ell_{spec}(V)$.
- $V$ is the matrix whose rows are the $v_i$'s.
- $\ell_{spec}$ is the spectral norm (positive square root of the maximum eigenvalue of $V^\top V$).
- Idea: Distribute $S_1, \cdots, S_n$ as before, and
- as the master gets more and more responses,
- it can reconstruct $u (v_1, \cdots, v_n)^\top$,
- such that $d_2(u, \mathbb{I})$ gets smaller and smaller.
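As a sanity check, the lemma can be verified numerically; the minimal sketch below (with arbitrary illustrative sizes and an arbitrary decoding vector $u$) compares $d_2(v, v_u)$ against $d_2(u, \mathbb{I}) \cdot \ell_{spec}(V)$.

```python
# Numerical check of the lemma: d_2(v, v_u) <= d_2(u, I) * spectral norm of V.
# Sizes and the perturbation of u are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 5                                 # n partial gradients of dimension p
V = rng.normal(size=(n, p))                  # rows are the partial gradients v_1, ..., v_n
ones = np.ones(n)

v_exact = ones @ V                           # exact gradient: sum of all partial gradients
u = ones + 0.1 * rng.normal(size=n)          # some decoding vector close to the all-ones vector
v_u = u @ V

lhs = np.linalg.norm(v_exact - v_u)                       # d_2(v, v_u)
rhs = np.linalg.norm(u - ones) * np.linalg.norm(V, 2)     # d_2(u, I) * l_spec(V)
print(lhs <= rhs)                            # True: the approximation error is controlled
```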
> [!NOTE]
> $d \geq s + 1$ is no longer required.
> $s$ is no longer a parameter of the system; rather, $s = n - \#responses$ at any given time.
### Trivial Scheme
Off the bat, the "do nothing" approach:
- Send $S_i$ to worker $i$, i.e., $d = 1$.
- Worker $i$ replies with the $i$'th partial gradient $v_i$.
- The master averages up all the responses.
How good is that?
- Take $u$ with entries $\frac{n}{n-s}$ on the $n-s$ responding workers and $0$ on the $s$ stragglers; the factor $\frac{n}{n-s}$ corrects the $\frac{1}{n}$ in $v_i = \frac{1}{n} \cdot \nabla$ on $S_i$.
- Is this $\approx \sum_i v_i$? In other words, what is $d_2(u, \mathbb{I})$?
- Direct computation: $d_2(u, \mathbb{I})^2 = s \cdot 1 + (n-s)\left(\frac{n}{n-s} - 1\right)^2 = \frac{ns}{n-s}$.
Trivial scheme: $\sqrt{\frac{ns}{n-s}}$ approximation.
Must do better than that!
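The following minimal sketch (with illustrative $n$ and $s$) confirms the computation above by comparing the trivial decoding vector to $\mathbb{I}$ directly.

```python
# "Do nothing" scheme: decoding vector with n/(n-s) on responders and 0 on stragglers.
# n and s are illustrative assumptions.
import numpy as np

n, s = 20, 5
u = np.zeros(n)
u[: n - s] = n / (n - s)                     # the first n-s workers respond, the rest straggle
err = np.linalg.norm(u - np.ones(n))         # d_2(u, I)
print(err, np.sqrt(n * s / (n - s)))         # both print sqrt(ns/(n-s)) ~ 2.58
```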
### Roadmap
- Quick reminder from linear algebra.
- Eigenvectors and orthogonality.
- Quick reminder from graph theory.
- Adjacency matrix of a graph.
- Graph theoretic concept: expander graphs.
- "Well connected" graphs.
- Extensively studied.
- An approximate gradient coding scheme from expander graphs.
### Linear algebra - Reminder
- Let $A \in \mathbb{R}^{n \times n}$.
- If $Av = \lambda v$ then $\lambda$ is an eigenvalue and $v$ is an eigenvector.
- $v_1, \cdots, v_n \in \mathbb{R}^n$ are orthonormal:
- $\|v_i\|_2 = 1$ for all $i$.
- $v_i \cdot v_j^\top = 0$ for all $i \neq j$.
- Nice property: $\|\alpha_1 v_1 + \cdots + \alpha_n v_n\|_2 = \sqrt{\sum_i \alpha_i^2}$.
- $A$ is called symmetric if $A = A^\top$.
- Theorem: A **real and symmetric** matrix has an orthonormal basis of eigenvectors.
- That is, there exists an orthonormal basis $v_1, \cdots, v_n$ such that $Av_i = \lambda_i v_i$ for some $\lambda_i$'s.
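A quick numerical illustration of the theorem (a sketch using numpy's `eigh`; the matrix is an arbitrary symmetrized random matrix): the eigenvectors form an orthonormal basis.

```python
# A real symmetric matrix has an orthonormal basis of eigenvectors.
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
A = (M + M.T) / 2                            # make A symmetric
eigvals, eigvecs = np.linalg.eigh(A)         # columns of eigvecs are the eigenvectors

print(np.allclose(eigvecs.T @ eigvecs, np.eye(5)))    # orthonormal basis
print(np.allclose(A @ eigvecs, eigvecs * eigvals))    # A v_i = lambda_i v_i
```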
### Graph theory - Reminder
- Undirected graph $G = (V, E)$.
- $V$ is a vertex set, usually $V = \{1, 2, \cdots, n\}$.
- $E \subseteq \binom{V}{2}$ is an edge set (i.e., $E$ is a collection of subsets of $V$ of size two).
- Each edge $e \in E$ is of the form $e = \{a, b\}$ for some distinct $a, b \in V$.
- Spectral graph theory:
- Analyze properties of graphs (combinatorial objects) using matrices (algebraic objects).
- Specifically, for a graph $G$ let $A_G \in \{0, 1\}^{n \times n}$ be the adjacency matrix of $G$.
- $A_{i,j} = 1$ if and only if $\{i, j\} \in E$ (otherwise 0).
- $A_G$ is real and symmetric.
- Therefore, it has an orthonormal basis of eigenvectors.
#### Some nice properties of adjacency matrices
- Let $G = (V, E)$ be $d$-regular, with adjacency matrix $A_G$ whose (real) eigenvalues are $\lambda_1 \geq \cdots \geq \lambda_n$.
- Some theorems:
- $\lambda_1 = d$.
- $\lambda_n \geq -d$, with equality if and only if $G$ is bipartite.
- $A_G \mathbb{I}^\top = \lambda_1 \mathbb{I}^\top = d \mathbb{I}^\top$ (easy to show!).
- Does that ring a bell? ;)
- If $\lambda_1 = \lambda_2$ then $G$ is not connected.
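These properties are easy to check numerically; the sketch below builds the adjacency matrix of the cycle $C_n$ (a $2$-regular graph, chosen only for illustration) and verifies $\lambda_1 = d$ and that $\mathbb{I}$ is a corresponding eigenvector.

```python
# Adjacency matrix of the cycle graph C_n (d = 2): lambda_1 = d with eigenvector I.
import numpy as np

n, d = 8, 2
A = np.zeros((n, n))
for i in range(n):                            # connect i to i+1 (mod n), symmetrically
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1

eigvals = np.sort(np.linalg.eigvalsh(A))[::-1]          # eigenvalues in descending order
print(np.isclose(eigvals[0], d))                        # lambda_1 = d
print(np.allclose(A @ np.ones(n), d * np.ones(n)))      # A * I^T = d * I^T
```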
#### Expander graphs - Intuition.
- An important family of graphs.
- Multiple applications in:
- Algorithms, complexity theory, error correcting codes, etc.
- Intuition: A graph is called an expander if there are no "lonely small sets" of nodes.
- Every set of at most $n/2$ nodes is "well connected" to the remaining nodes in the graph.
- A bit more formally:
- An infinite family of graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an **expander family**, if the "minimal connectedness" of small sets in $G_n$ does not go to zero with $n$.
#### Expander graphs - Definitions.
- All graphs in this lecture are $d$-regular, i.e., all nodes have the same degree $d$.
- For sets of nodes $S, T \subseteq V$, let $E(S, T)$ be the set of edges between $S$ and $T$. I.e., $E(S, T) = \{\{i, j\} \in E \mid i \in S \text{ and } j \in T\}$.
- For a set of nodes $S$ let:
- $S^c = V \setminus S$ be its complement.
- $\partial S = E(S, S^c)$ be the boundary of $S$.
- I.e., the set of edges between $S$ and its complement $S^c$.
- The expansion parameter $h_G$ of $G$ is:
- I.e., how many edges leave $S$, relative to its size.
- How "well connected" $S$ is to the remaining nodes.
> [!NOTE]
> $h_G = \min_{S \subseteq V, |S| \leq n/2} \frac{|\partial S|}{|S|}$.
- An infinite family of $d$-regular graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an **expander family** if $h(G_n) \geq \epsilon$ for some constant $\epsilon > 0$ and all $n$.
- Same $d$ and same $\epsilon$ for all $n$.
- Expander families with large $\epsilon$ are hard to build explicitly.
- Example (Lubotzky, Phillips and Sarnak '88):
- $V = \mathbb{Z}_p$ ($p$ prime).
- Connect $x$ to $x + 1$, $x - 1$, and $x^{-1}$.
- $d = 3$, very small $\epsilon$.
- However, **random** graphs are expanders with high probability.
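For intuition, $h_G$ can be computed by brute force on small graphs; the sketch below does so for the complete graph $K_6$ (an illustrative choice; the enumeration is exponential in $n$, so this is for illustration only).

```python
# Brute-force expansion parameter h_G = min over |S| <= n/2 of |boundary(S)| / |S|.
import numpy as np
from itertools import combinations

def expansion(A):
    n = A.shape[0]
    h = np.inf
    for size in range(1, n // 2 + 1):
        for S in combinations(range(n), size):
            Sc = [v for v in range(n) if v not in S]
            boundary = A[np.ix_(list(S), Sc)].sum()     # |E(S, S^c)|, the edges leaving S
            h = min(h, boundary / size)
    return h

n = 6
A_complete = np.ones((n, n)) - np.eye(n)                # complete graph K_6 (5-regular)
print(expansion(A_complete))                            # 3.0: every small set has many outgoing edges
```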
#### Expander graphs - Eigenvalues
- There is a strong connection between the expansion parameter of a graph and the eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$ of its adjacency matrix.
- Some theorems (no proof):
- $\frac{d - \lambda_2}{2} \leq h_G \leq \sqrt{2d(d - \lambda_2)}$.
- $d - \lambda_2$ is called the **spectral gap** of $G$.
- If the spectral gap is large, $G$ is a good expander.
- How large can it be?
- Let $\lambda = \max\{|\lambda_2|, |\lambda_n|\}$. Then $\lambda \geq 2\sqrt{d-1} - o_n(1)$ (Alon-Boppana Theorem).
- Graphs which achieve the Alon-Boppana bound (i.e., $\lambda \leq 2\sqrt{d-1}$) are called **Ramanujan graphs**.
- The "best" expanders.
- Some constructions are known.
- Efficient construction of Ramanujan graphs for all parameters is very recent (2016).
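The sketch below (assuming `networkx` is available for generating a random regular graph; $n$ and $d$ are illustrative) shows that a random $d$-regular graph typically has $\lambda$ close to the Alon-Boppana value $2\sqrt{d-1}$.

```python
# Spectrum of a random d-regular graph: lambda_1 = d, and lambda is close to 2*sqrt(d-1).
import numpy as np
import networkx as nx

n, d = 200, 4
G = nx.random_regular_graph(d, n, seed=0)
A = nx.to_numpy_array(G)
eigvals = np.sort(np.linalg.eigvalsh(A))[::-1]

lam = max(abs(eigvals[1]), abs(eigvals[-1]))
print(eigvals[0])                             # 4.0 = d
print(lam, 2 * np.sqrt(d - 1))                # lambda is close to 2*sqrt(d-1) ~ 3.46
```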
#### Approximate GC from Expander Graphs
Back to approximate gradient coding.
- Let $d$ be any replication parameter.
- Let $G$ be an expander graph (i.e., taken from an infinite expander family $(G_n)_{n=1}^\infty$),
- with eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$, and respective eigenvectors $w_1, \cdots, w_n$.
- Assume $\|w_1\|_2 = \|w_2\|_2 = \cdots = \|w_n\|_2 = 1$, and $w_i w_j^\top = 0$ for all $i \neq j$.
- Let the gradient coding matrix be $B = \frac{1}{d} A_G$.
- The eigenvalues of $B$ are $\mu_1 = 1 \geq \mu_2 \geq \cdots \geq \mu_n$, where $\mu_i = \frac{\lambda_i}{d}$.
- Let $\mu = \max\{|\mu_2|, |\mu_n|\}$.
- $d$ nonzero entries in each row $\Rightarrow$ replication factor $d$.
- Claim: For any number of stragglers $s$, we can get close to $\mathbb{I}$.
- Much better than the trivial scheme.
- Proximity is a function of $d$ and $\lambda$.
- For every $s$ and any set $\mathcal{K}$ of $n - s$ responses, we build a "decoding vector".
- A function of $s$ and of the identities of the responding workers.
- It will be used to linearly combine the $n - s$ responses to get the approximate gradient.
- Let $w_{\mathcal{K}} \in \mathbb{R}^n$ be such that $(w_{\mathcal{K}})_i = \begin{cases} -1 & \text{if } i \notin \mathcal{K} \\ \frac{s}{n-s} & \text{if } i \in \mathcal{K} \end{cases}$.
Lemma 1: $w_{\mathcal{K}}$ is in the span of $w_2, \cdots, w_n$, the last $n - 1$ eigenvectors of $A_G$.
<details>
<summary>Proof</summary>
$w_2, \cdots, w_n$ are independent, and all orthogonal to $w_1$ (which is proportional to $\mathbb{I}$).
$\Rightarrow$ The span of $w_2, \cdots, w_n$ is exactly all vectors whose sum of entries is zero.
Sum of entries of $w_{\mathcal{K}}$ is zero $\Rightarrow$ $w_{\mathcal{K}}$ is in their span.
Corollary: $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$ for some $\alpha_i$'s in $\mathbb{R}$.
</details>
Lemma 2: From direct computation, the squared norm of $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$ is $\frac{ns}{n-s}$.
Corollary: $\|w_{\mathcal{K}}\|_2^2 = \sum_{i=2}^n \alpha_i^2 = \frac{ns}{n-s}$ (from Lemma 2 + orthonormality of $w_2, \cdots, w_n$).
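Both lemmas are easy to confirm numerically; the sketch below (with illustrative $n$, $s$ and an arbitrary choice of $\mathcal{K}$) checks that $w_{\mathcal{K}}$ sums to zero and has squared norm $\frac{ns}{n-s}$.

```python
# Properties of the vector w_K: zero entry-sum and squared norm ns/(n-s).
import numpy as np

n, s = 12, 3
K = np.arange(s, n)                           # any set of n - s responding workers
w_K = np.full(n, -1.0)                        # -1 on the stragglers
w_K[K] = s / (n - s)                          # s/(n-s) on the responders

print(np.isclose(w_K.sum(), 0))               # sum zero, so w_K is in the span of w_2, ..., w_n
print(np.isclose(w_K @ w_K, n * s / (n - s))) # squared norm equals ns/(n-s)
```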
The scheme:
- If the set of responses is $\mathcal{K}$, the decoding vector is $w_{\mathcal{K}} + \mathbb{I}$.
- Notice that $\operatorname{supp}(w_{\mathcal{K}} + \mathbb{I}) = \mathcal{K}$.
- The responses the master receives are the rows of $B (v_1, \cdots, v_n)^\top$ indexed by $\mathcal{K}$.
- $\Rightarrow$ The master can compute $(w_{\mathcal{K}} + \mathbb{I}) B (v_1, \cdots, v_n)^\top$.
Left to show: How close is $(w_{\mathcal{K}} + \mathbb{I}) B$ to $\mathbb{I}$?
<details>
<summary>Proof</summary>
Recall that:
1. $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$.
2. $w_i$'s are eigenvectors of $A_G$ (with eigenvalues $\lambda_i$) and of $B = \frac{1}{d} A_G$ (with eigenvalues $\mu_i = \frac{\lambda_i}{d}$).
$d_2((w_{\mathcal{K}} + \mathbb{I}) B, \mathbb{I}) = d_2((\mathbb{I} + \alpha_2 w_2 + \cdots + \alpha_n w_n) B, \mathbb{I})$ (from 1.)
$= d_2(\mathbb{I} + \alpha_2 \mu_2 w_2 + \cdots + \alpha_n \mu_n w_n, \mathbb{I})$ (from 2., and $\mathbb{I} B = \mathbb{I}$ since $\mu_1 = 1$)
$= \|\sum_{i=2}^n \alpha_i \mu_i w_i\|_2$ (by def. of $d_2$)
$= \sqrt{\sum_{i=2}^n \alpha_i^2 \mu_i^2}$ ($w_i$'s are orthonormal)
$\leq \mu \sqrt{\sum_{i=2}^n \alpha_i^2} = \mu \cdot \sqrt{\frac{ns}{n-s}}$ (since $|\mu_i| \leq \mu$ for all $i \geq 2$, and by the corollary above).
</details>
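Putting the pieces together, here is a minimal end-to-end sketch (assuming `networkx` for the regular graph; $n$, $d$, $s$ are illustrative): it builds $B = \frac{1}{d} A_G$, picks a set $\mathcal{K}$ of responders, forms the decoding vector $w_{\mathcal{K}} + \mathbb{I}$, and compares the resulting distance to $\mathbb{I}$ against the trivial scheme.

```python
# Expander-based approximate gradient coding vs. the "do nothing" scheme.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n, d, s = 60, 6, 10
G = nx.random_regular_graph(d, n, seed=0)
B = nx.to_numpy_array(G) / d                  # gradient coding matrix, replication factor d

K = rng.choice(n, size=n - s, replace=False)  # indices of the n - s responding workers
w_K = np.full(n, -1.0)
w_K[K] = s / (n - s)
u = w_K + np.ones(n)                          # decoding vector; supported on K only

err_expander = np.linalg.norm(u @ B - np.ones(n))          # d_2((w_K + I) B, I)

u_trivial = np.zeros(n)
u_trivial[K] = n / (n - s)                    # "do nothing" decoding vector
err_trivial = np.linalg.norm(u_trivial - np.ones(n))       # sqrt(ns/(n-s))

print(err_expander, err_trivial)              # the expander decoding is noticeably closer to I
```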
#### Improvement factor
Corollary: If $B = \frac{1}{d} A_G$ for a ($d$-regular) Ramanujan graph $G$,
- $\Rightarrow$ improvement factor $\mu = \frac{\lambda}{d} \leq \frac{2\sqrt{d-1}}{d} \approx \frac{2}{\sqrt{d}}$.
- Some explicit constructions of Ramanujan graphs are known (Lubotzky, Phillips and Sarnak '88),
- with $\frac{2}{\sqrt{d}} \approx 0.5$!
### Recap
- Expander graph: A $d$-regular graph with no lonely small subsets of nodes.
- Every subset $S$ with $|S| \leq n/2$ nodes has a large ratio $|\partial S| / |S|$ (which does not tend to $0$ with $n$).
- Many constructions exist; a random graph is an expander w.h.p.
- The expansion parameter is determined by the spectral gap $d - \lambda_2$,
- where $\lambda = \max\{|\lambda_2|, |\lambda_n|\}$, and $\lambda_1 = d \geq \lambda_2 \geq \cdots \geq \lambda_n$ are the eigenvalues of $A_G$.
- "Best" expander = Ramanujan graph = has $\lambda \leq 2\sqrt{d-1}$.
- "Do nothing" approach: approximation $\sqrt{\frac{ns}{n-s}}$.
- Approximate gradient coding:
- Send $d$ subsets $S_{j_1}, \cdots, S_{j_d}$ to each node $i$, which returns a linear combination according to a coefficient matrix $B$.
- Let $B = \frac{1}{d} A_G$ for $G$ a Ramanujan graph: approximation $\frac{\lambda}{d} \sqrt{\frac{ns}{n-s}}$.
- Up to 50% closer than "do nothing", at the price of a higher computation load.
>[!NOTE]
> Faster = more computation load (each worker computes $d$ partial gradients instead of one).