# CSE5313 Coding and information theory for data science (Lecture 22)
## Approximate Gradient Coding
### Exact gradient computation and approximate gradient computation
In the previous formulation, the gradient $\sum_i v_i$
is computed exactly.
- Accurate
- Requires $d \geq s + 1$ (high replication factor).
- Need to know $s$ in advance!
However:
- Approximate gradient computations are very common!
- E.g., stochastic gradient descent.
- Machine learning is inherently inaccurate.
- Relies on biased data, unverified assumptions about model, etc.
Idea: If we relax the exact computation requirement, can have $d < s + 1$?
- No fixed $s$ anymore.
Approximate computation:
- Exact computation: $\nabla \triangleq v = \sum_i v_i = (1, \cdots, 1)(v_1, \cdots, v_n)^\top$.
- Approximate computation: $\nabla \triangleq v = \sum_i v_i \approx u (v_1, \cdots, v_n)^\top$,
- Where $d_2(u, \mathbb{I})$ is "small" ($d_2(u, v) = \sqrt{\sum_i (u_i - v_i)^2}$).
- Why?
- Lemma: Let $v_u = u (v_1, \cdots, v_n)^\top$. If $d_2(u, \mathbb{I}) \leq \epsilon$ then $d_2(v, v_u) \leq \epsilon \cdot \ell_{spec}(V)$.
- $V$ is the matrix whose rows are the $v_i$'s.
- $\ell_{spec}$ is the spectral norm (positive sqrt of maximum eigenvalue of $V^\top V$).
- Idea: Distribute $S_1, \cdots, S_n$ as before, and
- as the master gets more and more responses,
- it can reconstruct $u (v_1, \cdots, v_n)^\top$,
- such that $d_2(u, \mathbb{I})$ gets smaller and smaller.
> [!NOTE]
> The requirement $d \geq s + 1$ no longer applies.
> $s$ is no longer a parameter of the system; rather, $s = n - \#\text{responses}$ at any given time.
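The spectral-norm lemma above is essentially the definition of the spectral norm as an operator norm: $\|(\mathbb{I} - u) V\|_2 \leq \|\mathbb{I} - u\|_2 \cdot \ell_{spec}(V)$. A minimal numerical sketch (assuming NumPy; the sizes $n$, $p$ and the random data are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 5                      # n partial gradients, each of dimension p (illustrative)
V = rng.standard_normal((n, p))   # rows are the partial gradients v_1, ..., v_n
ones = np.ones(n)

u = ones + 0.1 * rng.standard_normal(n)   # a vector close to the all-ones vector
eps = np.linalg.norm(u - ones)            # d_2(u, 1)

v_exact = ones @ V                # exact gradient: sum_i v_i
v_approx = u @ V                  # approximate gradient: u (v_1, ..., v_n)^T
spec = np.linalg.norm(V, 2)       # spectral norm of V (largest singular value)

# The lemma: d_2(v, v_u) <= eps * l_spec(V)
print(np.linalg.norm(v_exact - v_approx), "<=", eps * spec)
```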
### Trivial Scheme
Off the bat, the "do nothing" approach:
- Send $S_i$ to worker $i$, i.e., $d = 1$.
- Worker $i$ replies with the $i$'th partial gradient $v_i$.
- The master averages up all the responses.
How good is that?
- The master sums the $n - s$ received partial gradients and rescales by $\frac{n}{n-s}$; i.e., the decoding vector $u$ has $u_i = \frac{n}{n-s}$ for responding workers and $u_i = 0$ for stragglers. The factor $\frac{n}{n-s}$ corrects the $\frac{1}{n}$ in $v_i = \frac{1}{n} \cdot \nabla \text{ on } S_i$.
- Is this $\approx \sum_i v_i$? In other words, what is $d_2(u, \mathbb{I})$?
- Direct computation: $d_2(u, \mathbb{I})^2 = s \cdot 1 + (n-s)\left(\frac{n}{n-s} - 1\right)^2 = \frac{ns}{n-s}$.
Trivial scheme: $\sqrt{\frac{ns}{n-s}}$ approximation.
Must do better than that!
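A quick NumPy sketch of this computation (the values of $n$ and $s$ are arbitrary illustrative choices); it confirms that the trivial decoding vector sits at distance $\sqrt{\frac{ns}{n-s}}$ from $\mathbb{I}$:

```python
import numpy as np

n, s = 20, 4                      # illustrative sizes
K = np.arange(n - s)              # indices of the n - s responding workers

u = np.zeros(n)
u[K] = n / (n - s)                # rescale the received partial gradients, ignore stragglers

ones = np.ones(n)
print(np.linalg.norm(u - ones))   # d_2(u, 1) for the trivial scheme
print(np.sqrt(n * s / (n - s)))   # equals sqrt(ns / (n - s))
```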
### Roadmap
- Quick reminder from linear algebra.
- Eigenvectors and orthogonality.
- Quick reminder from graph theory.
- Adjacency matrix of a graph.
- Graph theoretic concept: expander graphs.
- "Well connected" graphs.
- Extensively studied.
- An approximate gradient coding scheme from expander graphs.
### Linear algebra - Reminder
- Let $A \in \mathbb{R}^{n \times n}$.
- If $A v = \lambda v$ then $\lambda$ is an eigenvalue and $v$ is an eigenvector.
- $v_1, \cdots, v_n \in \mathbb{R}^n$ are orthonormal:
- $\|v_i\|_2 = 1$ for all $i$.
- $v_i \cdot v_j^\top = 0$ for all $i \neq j$.
- Nice property: $\| \alpha_1 v_1 + \cdots + \alpha_n v_n \|_2 = \sqrt{\sum_i \alpha_i^2}$.
- $A$ is called symmetric if $A = A^\top$.
- Theorem: A **real and symmetric** matrix has an orthonormal basis of eigenvectors.
- That is, there exists an orthonormal basis $v_1, \cdots, v_n$ such that $A v_i = \lambda_i v_i$ for some $\lambda_i$'s.
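A small NumPy illustration of the theorem: `numpy.linalg.eigh` returns an orthonormal eigenbasis for a real symmetric matrix (the matrix below is a random symmetric matrix, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                             # symmetrize: A = A^T

lam, W = np.linalg.eigh(A)                    # eigenvalues and eigenvectors (columns of W)
print(np.allclose(W.T @ W, np.eye(5)))        # True: the eigenvectors are orthonormal
print(np.allclose(A @ W, W @ np.diag(lam)))   # True: A w_i = lambda_i w_i
```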
### Graph theory - Reminder
- Undirected graph $G = (V, E)$.
- $V$ is a vertex set, usually $V = \{1, 2, \cdots, n\}$.
- $E \subseteq \binom{V}{2}$ is an edge set (i.e., $E$ is a collection of subsets of $V$ of size two).
- Each edge $e \in E$ is of the form $e = \{a, b\}$ for some distinct $a, b \in V$.
- Spectral graph theory:
- Analyze properties of graphs (combinatorial object) using matrices (algebraic object).
- Specifically, for a graph $G$ let $A_G \in \{0,1\}^{n \times n}$ be the adjacency matrix of $G$.
- $A_{i,j} = 1$ if and only if $\{i,j\} \in E$ (otherwise 0).
- $A$ is real and symmetric.
- Therefore, has an orthonormal basis of eigenvectors.
#### Some nice properties of adjacency matrices
- Let $G = (V, E)$ be $d$-regular, with adjacency matrix $A_G$ whose (real) eigenvalues are $\lambda_1 \geq \cdots \geq \lambda_n$.
- Some theorems:
- $\lambda_1 = d$.
- $\lambda_n \geq -d$, equality if and only if $G$ is bipartite.
- $A_G \mathbb{I}^\top = \lambda_1 \mathbb{I}^\top = d \mathbb{I}^\top$ (easy to show!).
- Does that ring a bell? ;)
- If $\lambda_1 = \lambda_2$ then $G$ is not connected.
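These properties can be checked on a concrete $d$-regular graph; the sketch below uses the cycle $C_8$ (2-regular, and bipartite since 8 is even):

```python
import numpy as np

n, d = 8, 2
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1       # edge {i, i+1 mod n}

lam = np.sort(np.linalg.eigvalsh(A))[::-1]          # lambda_1 >= ... >= lambda_n
print(lam[0])                                       # lambda_1 = d = 2
print(lam[-1])                                      # lambda_n = -d = -2 (bipartite)
print(np.allclose(A @ np.ones(n), d * np.ones(n)))  # all-ones vector is an eigenvector
```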
#### Expander graphs - Intuition.
- An important family of graphs.
- Multiple applications in:
- Algorithms, complexity theory, error correcting codes, etc.
- Intuition: A graph is called an expander if there are no "lonely small sets" of nodes.
- Every set of at most $n/2$ nodes is "well connected" to the remaining nodes in the graph.
- A bit more formally:
- An infinite family of graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an **expander family**, if the "minimal connectedness" of small sets in $G_n$ does not go to zero with $n$.
#### Expander graphs - Definitions.
- All graphs in this lecture are $d$-regular, i.e., all nodes have the same degree $d$.
- For sets of nodes $S, T \subseteq V$, let $E(S, T)$ be the set of edges between $S$ and $T$. I.e., $E(S, T) = \{(i,j) \in E | i \in S \text{ and } j \in T\}$.
- For a set of nodes $S$ let:
- $S^c = V \setminus S$ be its complement.
- Let $\partial S = E(S, S^c)$ be the boundary of $S$.
- I.e., the set of edges between $S$ and its complement $S^c$.
- The expansion parameter $h_G$ of $G$ is:
- I.e., how many edges leave $S$, relative to its size.
- How "well connected" $S$ is to the remaining nodes.
> [!NOTE]
> $h_G = \min_{S \subseteq V, |S| \leq n/2} \frac{|\partial S|}{|S|}$.
- An infinite family of $d$-regular graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an **expander family** if $h(G_n) \geq \epsilon$ for all $n$.
- Same $d$ and same $\epsilon$ for all $n$.
- Expander families with large $\epsilon$ are hard to build explicitly.
- Example: (Lubotsky, Philips and Sarnak '88)
- $V = \mathbb{Z}_p$ (prime).
- Connect $x$ to $x + 1, x - 1, x^{-1}$.
- $d = 3$, very small $\epsilon$.
- However, **random** graphs are expanders with high probability.
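The expansion parameter can be computed by brute force for tiny graphs (the minimization runs over exponentially many sets, so this is only a sanity check). The sketch below, with an illustrative helper `expansion`, shows that cycle graphs are *not* an expander family: $h(C_n) = 4/n \rightarrow 0$.

```python
import numpy as np
from itertools import combinations

def expansion(A):
    """Brute-force h_G = min over |S| <= n/2 of |boundary(S)| / |S| (tiny graphs only)."""
    n = len(A)
    best = float("inf")
    for k in range(1, n // 2 + 1):
        for S in combinations(range(n), k):
            Sc = [i for i in range(n) if i not in S]
            boundary = A[np.ix_(list(S), Sc)].sum()   # number of edges leaving S
            best = min(best, boundary / k)
    return best

def cycle(n):
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
    return A

print(expansion(cycle(8)), expansion(cycle(12)))   # 0.5 and 1/3: h(C_n) -> 0, not an expander family
```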
#### Expander graphs - Eigenvalues
- There is a strong connection between the expansion parameter of a graph and the eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$ of its adjacency matrix.
- Some theorems (no proof):
- $\frac{d-\lambda_2}{2} \leq h_G \leq \sqrt{2d (d - \lambda_2)}$.
- $d - \lambda_2$ is called the **spectral gap** of $G$.
- If the spectral gap is large, $G$ is a good expander.
- How large can it be?
- Let $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$. Then $\lambda \geq 2\sqrt{d - 1} - o_n(1)$. (Alon-Boppana Theorem).
- Graphs which achieve the Alon-Boppana bound (i.e., $\lambda \leq 2\sqrt{d - 1}$) are called **Ramanujan graphs**.
- The "best" expanders.
- Some constructions are known.
- Efficient construction of Ramanujan graphs for all parameters is very recent (2016).
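To see the spectral statement concretely, one can sample a random $d$-regular graph and compare $\lambda = \max\{|\lambda_2|, |\lambda_n|\}$ with the Ramanujan bound $2\sqrt{d-1}$; a sketch assuming `networkx` is available (the sampled value typically lands near, and sometimes slightly above, the bound):

```python
import numpy as np
import networkx as nx                          # assumption: networkx is installed

n, d = 200, 4                                  # illustrative parameters (n * d must be even)
G = nx.random_regular_graph(d, n, seed=0)
A = nx.to_numpy_array(G)

lam = np.sort(np.linalg.eigvalsh(A))[::-1]     # lambda_1 >= ... >= lambda_n
print("spectral gap d - lambda_2 =", d - lam[1])
print("lambda =", max(abs(lam[1]), abs(lam[-1])))
print("Ramanujan bound 2*sqrt(d-1) =", 2 * np.sqrt(d - 1))
```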
#### Approximate GC from Expander Graphs
Back to approximate gradient coding.
- Let $d$ be any replication parameter.
- Let $G$ be an expander graph (i.e., taken from an infinite expander family $(G_n)_{n=1}^\infty$)
- With eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$, and respective eigenvectors $w_1, \cdots, w_n$
- Assume $\|w_1\|_2 =\| w_2\|_2 = \cdots = \|w_n\|_2 = 1$, and $w_i w_j^\top = 0$ for all $i \neq j$.
- Let the gradient coding matrix $B=\frac{1}{d} A_G$.
- The eigenvalues of $B$ are $\mu_1 = 1 \geq \mu_2 \geq \cdots \geq \mu_n$, where $\mu_i = \frac{\lambda_i}{d}$.
- Let $\mu = \max \{|\mu_2|, |\mu_n|\}$.
- $d$ nonzero entries in each row $\Rightarrow$ Replication factor $d$.
- Claim: For any number of stragglers $s$, we can get close to $\mathbb{I}$.
- Much better than the trivial scheme.
- Proximity is a function of $d$ and $\lambda$.
- For every $s$ and any set $\mathcal{K}$ of $n - s$ responses, we build a "decoding vector".
- A function of $s$ and of the identities of the responding workers.
- Will be used to linearly combine the $n - s$ responses to get the approximate gradient.
- Let $w_{\mathcal{K}} \in \mathbb{R}^n$ such that $(w_{\mathcal{K}})_i = \begin{cases} -1 & \text{if } i \notin \mathcal{K} \\ \frac{s}{n-s} & \text{if } i \in \mathcal{K} \end{cases}$.
Lemma 1: $w_{\mathcal{K}}$ is in the span of $w_2, \cdots, w_n$, the last $n - 1$ eigenvectors of $A_G$.
<details>
<summary>Proof</summary>
$w_2, \cdots, w_n$ are independent, and all orthogonal to $w_1$, which is proportional to $\mathbb{I}$.
$\Rightarrow$ The span of $w_2, \cdots, w_n$ is exactly all vectors whose sum of entries is zero.
Sum of entries of $w_{\mathcal{K}}$ is zero $\Rightarrow$ $w_{\mathcal{K}}$ is in their span.
Corollary: $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$ for some $\alpha_i$'s in $\mathbb{R}$.
</details>
Lemma 2: From direct computation, $\|w_{\mathcal{K}}\|_2^2 = \frac{ns}{n-s}$.
Corollary: $\sum_{i=2}^n \alpha_i^2 = \|w_{\mathcal{K}}\|_2^2 = \frac{ns}{n-s}$ (from Lemma 2 + orthonormality of $w_2, \cdots, w_n$).
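Both lemmas are easy to confirm numerically; a short sketch with illustrative $n$, $s$ and an arbitrary response set $\mathcal{K}$:

```python
import numpy as np

n, s = 12, 3
K = np.arange(n - s)                  # any subset of n - s responding workers

w_K = np.full(n, -1.0)                # -1 on stragglers
w_K[K] = s / (n - s)                  # s / (n - s) on responders

print(np.isclose(w_K.sum(), 0))                       # True: entries sum to zero (Lemma 1)
print(np.isclose((w_K ** 2).sum(), n * s / (n - s)))  # True: squared norm is ns/(n-s) (Lemma 2)
```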
The scheme:
- If the set of responses is $\mathcal{K}$, the decoding vector is $w_{\mathcal{K}} + \mathbb{I}$.
- Notice that $\operatorname{supp}(w_{\mathcal{K}} + \mathbb{I}) = \mathcal{K}$.
- The responses the master receives are the rows of $B (v_1, \cdots, v_n)^\top$ indexed by $\mathcal{K}$.
- $\Rightarrow$ The master can compute $(w_{\mathcal{K}} + \mathbb{I}) B (v_1, \cdots, v_n)^\top$.
Left to show: How close is $(w_{\mathcal{K}} + \mathbb{I}) B$ to $\mathbb{I}$?
<details>
<summary>Proof</summary>
Recall that:
1. $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$.
2. $w_i$'s are eigenvectors of $A_G$ (with eigenvalues $\lambda_i$) and of $B = \frac{1}{d} A_G$ (with eigenvalues $\mu_i = \frac{\lambda_i}{d}$).
$d_2 ((w_{\mathcal{K}} + \mathbb{I}) B, \mathbb{I}) = d_2 ((\mathbb{I} + \alpha_2 w_2 + \cdots + \alpha_n w_n) B, \mathbb{I})$ (from 1.)
$= d_2 (\mathbb{I} + \alpha_2 \mu_2 w_2 + \cdots + \alpha_n \mu_n w_n, \mathbb{I})$ (eigenvalues of $B$, and $\mu_1 = 1$, so $\mathbb{I} B = \mathbb{I}$)
$= \|\sum_{i=2}^n \alpha_i \mu_i w_i\|_2$ (by definition of $d_2$)
$= \sqrt{\sum_{i=2}^n \alpha_i^2 \mu_i^2}$ ($w_i$'s are orthonormal)
$\leq \mu \sqrt{\sum_{i=2}^n \alpha_i^2} = \mu \sqrt{\frac{ns}{n-s}}$ (since $|\mu_i| \leq \mu$ for all $i \geq 2$, together with the corollary of Lemma 2).
</details>
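Putting the pieces together, here is an end-to-end numerical sketch of the scheme (the parameters and the choice of a random regular graph via `networkx` are illustrative; a Ramanujan graph would guarantee $\mu \leq \frac{2\sqrt{d-1}}{d}$):

```python
import numpy as np
import networkx as nx                           # assumption: networkx is installed

n, d, s = 100, 8, 10                            # illustrative parameters
G = nx.random_regular_graph(d, n, seed=1)
B = nx.to_numpy_array(G) / d                    # gradient coding matrix B = (1/d) A_G

mu_all = np.sort(np.linalg.eigvalsh(B))[::-1]   # mu_1 = 1 >= mu_2 >= ... >= mu_n
mu = max(abs(mu_all[1]), abs(mu_all[-1]))

ones = np.ones(n)
K = np.arange(n - s)                            # responding workers (last s workers straggle)
w_K = np.full(n, -1.0)
w_K[K] = s / (n - s)
decode = w_K + ones                             # decoding vector, supported on K

err_coded = np.linalg.norm(decode @ B - ones)   # d_2((w_K + 1) B, 1)
err_trivial = np.sqrt(n * s / (n - s))          # "do nothing" distance d_2(u, 1)
print(err_coded, "<=", mu * err_trivial, "<", err_trivial)
```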
#### Improvement factor
Corollary: If $B = \frac{1}{d} A_G$ for a ($d$-regular) Ramanujan graph $G$,
- $\Rightarrow$ improvement factor $\mu = \frac{\lambda}{d} \leq \frac{2\sqrt{d-1}}{d} \approx \frac{2}{\sqrt{d}}$.
- Some explicit constructions of Ramanujan graphs (Lubotsky, Philips and Sarnak '88)
- with $\frac{2}{\sqrt{d}} \approx 0.5$!
### Recap
- Expander graph: A $d$-regular graph with no lonely small subsets of nodes.
- Every subset $S$ with $|S| \leq n/2$ has a large ratio $|\partial S| / |S|$ (not $\rightarrow 0$ with $n$).
- Many constructions exist; a random $d$-regular graph is an expander w.h.p.
- The expansion factor is determined by the spectral gap $d - \lambda_2$,
- Where $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$, and $\lambda_1 = d \geq \lambda_2 \geq \cdots \geq \lambda_n$ are the eigenvalues of $A_G$.
- "Best" expander = Ramanujan graph = has $\lambda \leq 2\sqrt{d - 1}$.
- "Do nothing" approach: approximation $\frac{ns}{n-s}$.
- Approximate gradient coding:
- Send $d$ subsets $S_{j_1}, \cdots, S_{j_d}$ to each node $i$, which returns a linear combination according to a coefficient matrix $B$.
- Let $B = \frac{1}{d} A_G$, for $G$ a Ramanujan graph: approximation $\frac{\lambda}{d} \sqrt{\frac{ns}{n-s}}$.
- Up to 50% closer than "do nothing", at the price of a higher computation load.
>[!NOTE]
> A better (faster-converging) approximation = more computation load per worker.