# CSE5313 Coding and information theory for data science (Lecture 21)

## Approximate Gradient Coding

### Exact gradient computation and approximate gradient computation

In the previous formulation, the gradient $\sum_i v_i$ is computed exactly.

- Accurate.
- Requires $d \geq s + 1$ (high replication factor).
- Need to know $s$ in advance!

However:

- Approximate gradient computations are very common!
- E.g., stochastic gradient descent.
- Machine learning is inherently inaccurate.
- Relies on biased data, unverified assumptions about the model, etc.

Idea: If we relax the exact computation requirement, can we have $d < s + 1$?

- No fixed $s$ anymore.

Approximate computation:

- Exact computation: $\nabla \triangleq v = \sum_i v_i = (1, \cdots, 1) (v_1, \cdots, v_n)^\top$.
- Approximate computation: $\nabla \triangleq v = \sum_i v_i \approx u (v_1, \cdots, v_n)^\top$,
- Where $d_2(u, \mathbb{I})$ is "small" ($d_2(u, v) = \sqrt{\sum_i (u_i - v_i)^2}$).
- Why?
- Lemma: Let $v_u = u (v_1, \cdots, v_n)^\top$. If $d_2(u, \mathbb{I}) \leq \epsilon$ then $d_2(v, v_u) \leq \epsilon \cdot \ell_{spec}(V)$ (see the numerical sketch below).
- $V$ is the matrix whose rows are the $v_i$'s.
- $\ell_{spec}$ is the spectral norm (the positive square root of the maximum eigenvalue of $V^\top V$).
- Idea: Distribute $S_1, \cdots, S_n$ as before, and
- as the master gets more and more responses,
- it can reconstruct $u (v_1, \cdots, v_n)^\top$,
- such that $d_2(u, \mathbb{I})$ gets smaller and smaller.

> [!NOTE]
> The requirement $d \geq s + 1$ no longer applies.
> $s$ is no longer a parameter of the system; rather, $s = n - \#\text{responses}$ at any given time.
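
To see the lemma in action, here is a minimal numerical sketch (not from the lecture; the variable names, dimensions, and perturbation size are illustrative):

```python
# Check the lemma: for random partial gradients v_1, ..., v_n and a vector u
# close to the all-ones vector, d_2(v, v_u) <= d_2(u, 1) * spectral_norm(V).
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 5                      # n workers, gradients of dimension p
V = rng.standard_normal((n, p))   # rows are the partial gradients v_i

ones = np.ones(n)
u = ones + 0.1 * rng.standard_normal(n)  # a perturbation of the all-ones vector

v = ones @ V                      # exact gradient: sum of the v_i's
v_u = u @ V                       # approximate gradient

eps = np.linalg.norm(u - ones)    # d_2(u, 1)
spec = np.linalg.norm(V, ord=2)   # spectral norm of V
print(np.linalg.norm(v - v_u) <= eps * spec)  # always True
```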

### Trivial Scheme

Off the bat, the "do nothing" approach:

- Send $S_i$ to worker $i$, i.e., $d = 1$.
- Worker $i$ replies with the $i$'th partial gradient $v_i$.
- The master averages all the responses.

How good is that?

- For $u$ with $u_i = \frac{n}{n-s}$ for responding workers and $u_i = 0$ otherwise, the factor $\frac{n}{n-s}$ corrects the $\frac{1}{n}$ in $v_i = \frac{1}{n} \cdot \nabla \text{ on } S_i$.
- Is this $\approx \sum_i v_i$? In other words, what is $d_2(u, \mathbb{I})$?

Trivial scheme: the approximation $u = \frac{n}{n-s} \cdot \mathbb{I}$ restricted to the responders, with $d_2(u, \mathbb{I}) = \sqrt{\frac{ns}{n-s}}$.

Must do better than that! (A quick numerical check of this distance follows.)
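
A quick numerical check of the trivial scheme's distance (illustrative sketch; `n`, `s`, and the assumption that the first $n - s$ workers respond are arbitrary choices):

```python
# The "do nothing" decoding vector u (n/(n-s) on the n-s responders, 0 on the
# s stragglers) has d_2(u, 1) = sqrt(ns/(n-s)).
import numpy as np

n, s = 20, 4
u = np.zeros(n)
u[: n - s] = n / (n - s)          # assume w.l.o.g. the first n-s workers respond

d2 = np.linalg.norm(u - np.ones(n))
print(d2, np.sqrt(n * s / (n - s)))   # both ~2.236 for n=20, s=4
```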

### Roadmap

- Quick reminder from linear algebra.
- Eigenvectors and orthogonality.
- Quick reminder from graph theory.
- Adjacency matrix of a graph.
- Graph theoretic concept: expander graphs.
- "Well connected" graphs.
- Extensively studied.
- An approximate gradient coding scheme from expander graphs.

### Linear algebra - Reminder

- Let $A \in \mathbb{R}^{n \times n}$.
- If $A v = \lambda v$ then $\lambda$ is an eigenvalue and $v$ is an eigenvector.
- $v_1, \cdots, v_n \in \mathbb{R}^n$ are orthonormal:
- $\|v_i\|_2 = 1$ for all $i$.
- $v_i \cdot v_j^\top = 0$ for all $i \neq j$.
- Nice property: $\| \alpha_1 v_1 + \cdots + \alpha_n v_n \|_2 = \sqrt{\sum_i \alpha_i^2}$.
- $A$ is called symmetric if $A = A^\top$.
- Theorem: A **real and symmetric** matrix has an orthonormal basis of eigenvectors.
- That is, there exists an orthonormal basis $v_1, \cdots, v_n$ such that $A v_i = \lambda_i v_i$ for some $\lambda_i$'s.

### Graph theory - Reminder

- Undirected graph $G = (V, E)$.
- $V$ is a vertex set, usually $V = \{1, 2, \cdots, n\}$.
- $E \subseteq \binom{V}{2}$ is an edge set (i.e., $E$ is a collection of subsets of $V$ of size two).
- Each edge $e \in E$ is of the form $e = \{a, b\}$ for some distinct $a, b \in V$.
- Spectral graph theory:
- Analyze properties of graphs (combinatorial objects) using matrices (algebraic objects).
- Specifically, for a graph $G$ let $A_G \in \{0,1\}^{n \times n}$ be the adjacency matrix of $G$.
- $A_{i,j} = 1$ if and only if $\{i,j\} \in E$ (otherwise 0).
- $A_G$ is real and symmetric.
- Therefore, it has an orthonormal basis of eigenvectors.

#### Some nice properties of adjacency matrices

- Let $G = (V, E)$ be $d$-regular, with adjacency matrix $A_G$ whose (real) eigenvalues are $\lambda_1 \geq \cdots \geq \lambda_n$.
- Some theorems (checked numerically below):
- $\lambda_1 = d$.
- $\lambda_n \geq -d$, with equality if and only if $G$ is bipartite.
- $A_G \mathbb{I}^\top = \lambda_1 \mathbb{I}^\top = d \mathbb{I}^\top$ (easy to show!).
- Does that ring a bell? ;)
- If $\lambda_1 = \lambda_2$ then $G$ is not connected.
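
These facts are easy to verify numerically. A minimal sketch, using the $n$-cycle as the $d$-regular graph (an illustrative choice, not from the lecture):

```python
# The n-cycle is 2-regular, so lambda_1 should equal d = 2, the all-ones
# vector should be an eigenvector, and lambda_n = -d iff the cycle is
# bipartite (i.e., n is even).
import numpy as np

n, d = 8, 2
A = np.zeros((n, n))
for i in range(n):                      # adjacency matrix of the n-cycle
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1

eig = np.sort(np.linalg.eigvalsh(A))[::-1]          # eigenvalues, descending
print(np.isclose(eig[0], d))                        # lambda_1 = d
print(np.allclose(A @ np.ones(n), d * np.ones(n)))  # A 1^T = d 1^T
print(np.isclose(eig[-1], -d))          # n = 8 is even => bipartite => -d
```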

#### Expander graphs - Intuition

- An important family of graphs.
- Multiple applications in:
- Algorithms, complexity theory, error correcting codes, etc.
- Intuition: A graph is called an expander if there are no "lonely small sets" of nodes.
- Every set of at most $n/2$ nodes is "well connected" to the remaining nodes in the graph.
- A bit more formally:
- An infinite family of graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an **expander family** if the "minimal connectedness" of small sets in $G_n$ does not go to zero with $n$.

#### Expander graphs - Definitions

- All graphs in this lecture are $d$-regular, i.e., all nodes have the same degree $d$.
- For sets of nodes $S, T \subseteq V$, let $E(S, T)$ be the set of edges between $S$ and $T$. I.e., $E(S, T) = \{\{i,j\} \in E \mid i \in S \text{ and } j \in T\}$.
- For a set of nodes $S$ let:
- $S^c = V \setminus S$ be its complement.
- Let $\partial S = E(S, S^c)$ be the boundary of $S$.
- I.e., the set of edges between $S$ and its complement $S^c$.
- The expansion parameter $h_G$ of $G$ is given in the note below.
- I.e., how many edges leave $S$, relative to its size.
- How "well connected" $S$ is to the remaining nodes.

> [!NOTE]
> $h_G = \min_{S \subseteq V, |S| \leq n/2} \frac{|\partial S|}{|S|}$.

- An infinite family of $d$-regular graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an **expander family** if there exists $\epsilon > 0$ such that $h_{G_n} \geq \epsilon$ for all $n$.
- Same $d$ and same $\epsilon$ for all $n$.
- Expander families with large $\epsilon$ are hard to build explicitly.
- Example: (Lubotzky, Phillips and Sarnak '88)
- $V = \mathbb{Z}_p$ ($p$ prime).
- Connect $x$ to $x + 1$, $x - 1$, and $x^{-1}$.
- $d = 3$, very small $\epsilon$ (a brute-force computation of $h_G$ for this construction appears below).
- However, **random** graphs are expanders with high probability.
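
A brute-force sketch of $h_G$ for this construction (illustrative; treating $0$ as its own inverse is an assumed convention, and for small $p$ the graph is only roughly $3$-regular because $x^{-1}$ may coincide with $x \pm 1$):

```python
# Brute-force the expansion parameter h_G = min over |S| <= n/2 of |bd(S)|/|S|
# for the Z_p graph: connect x to x+1, x-1, and x^{-1} (mod p).
from itertools import combinations

p = 11
edges = set()
for x in range(p):
    inv = pow(x, -1, p) if x != 0 else 0   # convention: 0 is its own inverse
    for y in ((x + 1) % p, (x - 1) % p, inv):
        if x != y:
            edges.add(frozenset((x, y)))

def boundary(S):
    # number of edges with exactly one endpoint in S
    return sum(1 for e in edges if len(e & S) == 1)

h = min(boundary(set(S)) / k
        for k in range(1, p // 2 + 1)
        for S in combinations(range(p), k))
print(h)
```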

#### Expander graphs - Eigenvalues

- There is a strong connection between the expansion parameter of a graph and the eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$ of its adjacency matrix.
- Some theorems (no proof):
- $\frac{d-\lambda_2}{2} \leq h_G \leq \sqrt{2d (d - \lambda_2)}$.
- $d - \lambda_2$ is called the **spectral gap** of $G$.
- If the spectral gap is large, $G$ is a good expander (compare the two extremes in the sketch below).
- How large can it be?
- Let $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$. Then $\lambda \geq 2\sqrt{d - 1} - o_n(1)$ (Alon-Boppana Theorem).
- Graphs which achieve the Alon-Boppana bound (i.e., $\lambda \leq 2\sqrt{d - 1}$) are called **Ramanujan graphs**.
- The "best" expanders.
- Some constructions are known.
- Efficient construction of Ramanujan graphs for all parameters is very recent (2016).
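
An illustrative comparison of the two extremes (not from the lecture): the cycle has a vanishing spectral gap, while the complete graph has the largest possible one:

```python
# Spectral gap d - lambda_2: tiny for the cycle (poor expander), large for
# the complete graph (excellent expander).
import numpy as np

def second_eigenvalue(A):
    return np.sort(np.linalg.eigvalsh(A))[::-1][1]

n = 16
cycle = np.zeros((n, n))
for i in range(n):
    cycle[i, (i + 1) % n] = cycle[i, (i - 1) % n] = 1

complete = np.ones((n, n)) - np.eye(n)

print(2 - second_eigenvalue(cycle))           # ~0.15: d=2, lambda_2=2cos(2*pi/n)
print((n - 1) - second_eigenvalue(complete))  # n: d=n-1, lambda_2=-1
```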

#### Approximate GC from Expander Graphs

Back to approximate gradient coding.

- Let $d$ be any replication parameter.
- Let $G$ be an expander graph (i.e., taken from an infinite expander family $(G_n)_{n=1}^\infty$).
- With eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$, and respective eigenvectors $w_1, \cdots, w_n$.
- Assume $\|w_1\|_2 = \|w_2\|_2 = \cdots = \|w_n\|_2 = 1$, and $w_i w_j^\top = 0$ for all $i \neq j$.
- Let the gradient coding matrix be $B = \frac{1}{d} A_G$.
- The eigenvalues of $B$ are $\mu_1 = 1 \geq \mu_2 \geq \cdots \geq \mu_n$, where $\mu_i = \frac{\lambda_i}{d}$.
- Let $\mu = \max \{|\mu_2|, |\mu_n|\}$.
- $d$ nonzero entries in each row $\Rightarrow$ replication factor $d$.
- Claim: For any number of stragglers $s$, we can get close to $\mathbb{I}$.
- Much better than the trivial scheme.
- Proximity is a function of $d$ and $\lambda$.
- For every $s$ and any set $\mathcal{K}$ of $n - s$ responses, we build a "decoding vector".
- A function of $s$ and of the identities of the responding workers.
- Will be used to linearly combine the $n - s$ responses to get the approximate gradient.
- Let $w_{\mathcal{K}} \in \mathbb{R}^n$ such that $(w_{\mathcal{K}})_i = \begin{cases} -1 & \text{if } i \notin \mathcal{K} \\ \frac{s}{n-s} & \text{if } i \in \mathcal{K} \end{cases}$.

Lemma 1: $w_{\mathcal{K}}$ is spanned by $w_2, \cdots, w_n$, the $n - 1$ last eigenvectors of $A_G$.

<details>
<summary>Proof</summary>

$w_2, \cdots, w_n$ are independent, and all orthogonal to $w_1$ (which is proportional to $\mathbb{I}$).

$\Rightarrow$ The span of $w_2, \cdots, w_n$ is exactly all vectors whose sum of entries is zero.

The sum of entries of $w_{\mathcal{K}}$ is zero $\Rightarrow$ $w_{\mathcal{K}}$ is in their span.

Corollary: $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$ for some $\alpha_i$'s in $\mathbb{R}$.

</details>

Lemma 2: From direct computation, the norm of $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$ is $\sqrt{\frac{ns}{n-s}}$.

Corollary: $\|w_{\mathcal{K}}\|_2^2 = \sum_{i=2}^n \alpha_i^2 = \frac{ns}{n-s}$ (from Lemma 2 + orthonormality of $w_2, \cdots, w_n$).

The scheme:

- If the set of responses is $\mathcal{K}$, the decoding vector is $w_{\mathcal{K}} + \mathbb{I}$.
- Notice that $\operatorname{supp}(w_{\mathcal{K}} + \mathbb{I}) = \mathcal{K}$.
- The responses the master receives are the rows of $B (v_1, \cdots, v_n)^\top$ indexed by $\mathcal{K}$.
- $\Rightarrow$ The master can compute $(w_{\mathcal{K}} + \mathbb{I}) B (v_1, \cdots, v_n)^\top$.

Left to show: How close is $(w_{\mathcal{K}} + \mathbb{I}) B$ to $\mathbb{I}$? (A full numerical sketch of the scheme follows the proof.)

<details>
<summary>Proof</summary>

Recall that:

1. $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$.
2. $w_i$'s are eigenvectors of $A_G$ (with eigenvalues $\lambda_i$) and of $B = \frac{1}{d} A_G$ (with eigenvalues $\mu_i = \frac{\lambda_i}{d}$).

$d_2 ((w_{\mathcal{K}} + \mathbb{I}) B, \mathbb{I}) = d_2 ((\mathbb{I} + \alpha_2 w_2 + \cdots + \alpha_n w_n) B, \mathbb{I})$ (from 1.)

$= d_2 (\mathbb{I} + \alpha_2 \mu_2 w_2 + \cdots + \alpha_n \mu_n w_n, \mathbb{I})$ (from 2., and $\mathbb{I} B = \mu_1 \mathbb{I} = \mathbb{I}$)

$= \left\| \sum_{i=2}^n \alpha_i \mu_i w_i \right\|_2$ (by definition of $d_2$)

$= \sqrt{\sum_{i=2}^n \alpha_i^2 \mu_i^2}$ ($w_i$'s are orthonormal)

$\leq \mu \sqrt{\sum_{i=2}^n \alpha_i^2} = \mu \sqrt{\frac{ns}{n-s}}$ (by the corollary of Lemma 2).

</details>
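
Putting the scheme together, here is a self-contained numerical sketch (illustrative; the complete graph $K_n$ stands in for a Ramanujan graph since it is $d$-regular with $\mu = \frac{1}{n-1}$, and all names are assumptions):

```python
# End-to-end sketch of the expander-based scheme vs. the trivial scheme.
import numpy as np

n, s = 12, 3
d = n - 1                                # K_n is (n-1)-regular
A = np.ones((n, n)) - np.eye(n)
B = A / d                                # gradient coding matrix

ones = np.ones(n)
K = np.arange(s, n)                      # responders; workers 0..s-1 straggle

# Decoding vector w_K + 1: zero on stragglers, n/(n-s) on responders.
w_K = np.where(np.isin(np.arange(n), K), s / (n - s), -1.0)
dec = w_K + ones
assert set(np.flatnonzero(dec)) == set(K)      # supp(w_K + 1) = K

u = dec @ B                              # effective combining vector
u_trivial = np.where(np.isin(np.arange(n), K), n / (n - s), 0.0)

print(np.linalg.norm(u - ones))          # expander scheme: ~0.18
print(np.linalg.norm(u_trivial - ones))  # trivial scheme: sqrt(ns/(n-s)) = 2
```

For $K_n$ all non-top eigenvalues are equal, so the printed distance matches the bound $\mu \sqrt{\frac{ns}{n-s}}$ exactly.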

#### Improvement factor

Corollary: If $B = \frac{1}{d} A_G$ for a ($d$-regular) Ramanujan graph $G$,

- $\Rightarrow$ improvement factor $\mu = \frac{\lambda}{d} \leq \frac{2\sqrt{d-1}}{d} \approx \frac{2}{\sqrt{d}}$ over the trivial scheme.
- Some explicit constructions of Ramanujan graphs (Lubotzky, Phillips and Sarnak '88)
- with $\frac{2}{\sqrt{d}} \approx 0.5$!

### Recap

- Expander graph: A $d$-regular graph with no lonely small subsets of nodes.
- Every subset $S$ with $|S| \leq n/2$ has a large ratio $|\partial S| / |S|$ (one that does not $\rightarrow 0$ with $n$).
- Many constructions exist; a random graph is an expander w.h.p.
- The expansion factor is determined by the spectral gap $d - \lambda_2$,
- Where $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$, and $\lambda_1 = d \geq \lambda_2 \geq \cdots \geq \lambda_n$ are the eigenvalues of $A_G$.
- "Best" expander = Ramanujan graph = has $\lambda \leq 2\sqrt{d - 1}$.
- "Do nothing" approach: approximation $\sqrt{\frac{ns}{n-s}}$.
- Approximate gradient coding:
- Send $d$ subsets $S_{j_1}, \cdots, S_{j_d}$ to each node $i$, which returns a linear combination according to a coefficient matrix $B$.
- Let $B = \frac{1}{d} A_G$, for $G$ a Ramanujan graph: approximation $\frac{\lambda}{d} \sqrt{\frac{ns}{n-s}}$.
- Up to 50% closer than "do nothing", at the price of a higher computation load.

> [!NOTE]
> Faster = more computation load.