# CSE5313 Coding and information theory for data science (Lecture 22)
## Approximate Gradient Coding
### Exact gradient computation and approximate gradient computation
In the previous formulation, the gradient $\sum_i v_i$
is computed exactly.
- Accurate
- Requires $d \geq s + 1$ (high replication factor).
- Need to know $s$ in advance!
However:
- Approximate gradient computations are very common!
- E.g., stochastic gradient descent.
- Machine learning is inherently inaccurate.
- Relies on biased data, unverified assumptions about model, etc.
Idea: If we relax the exact computation requirement, can have $d < s + 1$?
- No fixed $s$ anymore.
Approximate computation:
- Exact computation: $\nabla \triangleq v = \sum_i v_i = (1, \cdots, 1)(v_1, \cdots, v_n)^\top$.
- Approximate computation: $\nabla \triangleq v = \sum_i v_i \approx u (v_1, \cdots, v_n)^\top$,
- Where $d_2(u, \mathbb{I})$ is "small" ($d_2(u, v) = \sqrt{\sum_i (u_i - v_i)^2}$).
- Why?
- Lemma: Let $v_u = u (v_1, \cdots, v_n)^\top$. If $d_2(u, \mathbb{I}) \leq \epsilon$ then $d_2(v, v_u) \leq \epsilon \cdot \ell_{spec}(V)$.
- $V$ is the matrix whose rows are the $v_i$'s.
- $\ell_{spec}$ is the spectral norm (positive sqrt of maximum eigenvalue of $V^\top V$).
- Idea: Distribute $S_1, \cdots, S_n$ as before, and
- as the master gets more and more responses,
- it can reconstruct $u (v_1, \cdots, v_n)^\top$,
- such that $d_2(u, \mathbb{I})$ gets smaller and smaller.
> [!NOTE]
> The requirement $d \geq s + 1$ no longer applies.
> $s$ is no longer a parameter of the system; rather, $s = n - \#\text{responses}$ at any given time.
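The spectral-norm lemma above is essentially the definition of the spectral norm as an operator norm: $\|(\mathbb{I} - u) V\|_2 \leq \|\mathbb{I} - u\|_2 \cdot \ell_{spec}(V)$. A minimal numerical sketch (assuming NumPy; the sizes $n$, $p$ and the random data are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 5                      # n partial gradients, each of dimension p (illustrative)
V = rng.standard_normal((n, p))   # rows are the partial gradients v_1, ..., v_n
ones = np.ones(n)

u = ones + 0.1 * rng.standard_normal(n)   # a vector close to the all-ones vector
eps = np.linalg.norm(u - ones)            # d_2(u, 1)

v_exact = ones @ V                # exact gradient: sum_i v_i
v_approx = u @ V                  # approximate gradient: u (v_1, ..., v_n)^T
spec = np.linalg.norm(V, 2)       # spectral norm of V (largest singular value)

# The lemma: d_2(v, v_u) <= eps * l_spec(V)
print(np.linalg.norm(v_exact - v_approx), "<=", eps * spec)
```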
### Trivial Scheme
Off the bat, the "do nothing" approach:
- Send $S_i$ to worker $i$, i.e., $d = 1$.
- Worker $i$ replies with the $i$'th partial gradient $v_i$.
- The master averages up all the responses.
How good is that?
- The master sums the $n - s$ received partial gradients and rescales by $\frac{n}{n-s}$; i.e., the decoding vector $u$ has $u_i = \frac{n}{n-s}$ for responding workers and $u_i = 0$ for stragglers. The factor $\frac{n}{n-s}$ corrects the $\frac{1}{n}$ in $v_i = \frac{1}{n} \cdot \nabla \text{ on } S_i$.
- Is this $\approx \sum_i v_i$? In other words, what is $d_2(u, \mathbb{I})$?
- Direct computation: $d_2(u, \mathbb{I})^2 = s \cdot 1 + (n-s)\left(\frac{n}{n-s} - 1\right)^2 = \frac{ns}{n-s}$.
Trivial scheme: $\sqrt{\frac{ns}{n-s}}$ approximation.
Must do better than that!
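A quick NumPy sketch of this computation (the values of $n$ and $s$ are arbitrary illustrative choices); it confirms that the trivial decoding vector sits at distance $\sqrt{\frac{ns}{n-s}}$ from $\mathbb{I}$:

```python
import numpy as np

n, s = 20, 4                      # illustrative sizes
K = np.arange(n - s)              # indices of the n - s responding workers

u = np.zeros(n)
u[K] = n / (n - s)                # rescale the received partial gradients, ignore stragglers

ones = np.ones(n)
print(np.linalg.norm(u - ones))   # d_2(u, 1) for the trivial scheme
print(np.sqrt(n * s / (n - s)))   # equals sqrt(ns / (n - s))
```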
### Roadmap
- Quick reminder from linear algebra.
- Eigenvectors and orthogonality.
- Quick reminder from graph theory.
- Adjacency matrix of a graph.
- Graph theoretic concept: expander graphs.
- "Well connected" graphs.
- Extensively studied.
- An approximate gradient coding scheme from expander graphs.
### Linear algebra - Reminder
- Let $A \in \mathbb{R}^{n \times n}$.
- If $A v = \lambda v$ then $\lambda$ is an eigenvalue and $v$ is an eigenvector.
- $v_1, \cdots, v_n \in \mathbb{R}^n$ are orthonormal:
- $\|v_i\|_2 = 1$ for all $i$.
- $v_i \cdot v_j^\top = 0$ for all $i \neq j$.
- Nice property: $\| \alpha_1 v_1 + \cdots + \alpha_n v_n \|_2 = \sqrt{\sum_i \alpha_i^2}$.
- $A$ is called symmetric if $A = A^\top$.
- Theorem: A **real and symmetric** matrix has an orthonormal basis of eigenvectors.
- That is, there exists an orthonormal basis $v_1, \cdots, v_n$ such that $A v_i = \lambda_i v_i$ for some $\lambda_i$'s.
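A small NumPy illustration of the theorem: `numpy.linalg.eigh` returns an orthonormal eigenbasis for a real symmetric matrix (the matrix below is a random symmetric matrix, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                             # symmetrize: A = A^T

lam, W = np.linalg.eigh(A)                    # eigenvalues and eigenvectors (columns of W)
print(np.allclose(W.T @ W, np.eye(5)))        # True: the eigenvectors are orthonormal
print(np.allclose(A @ W, W @ np.diag(lam)))   # True: A w_i = lambda_i w_i
```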
### Graph theory - Reminder
- Undirected graph $G = (V, E)$.
- $V$ is a vertex set, usually $V = \{1, 2, \cdots, n\}$.
- $E \subseteq \binom{V}{2}$ is an edge set (i.e., $E$ is a collection of subsets of $V$ of size two).
- Each edge $e \in E$ is of the form $e = \{a, b\}$ for some distinct $a, b \in V$.
- Spectral graph theory:
- Analyze properties of graphs (combinatorial object) using matrices (algebraic object).
- Specifically, for a graph $G$ let $A_G \in \{0,1\}^{n \times n}$ be the adjacency matrix of $G$.
- $A_{i,j} = 1$ if and only if $\{i,j\} \in E$ (otherwise 0).
- $A$ is real and symmetric.
- Therefore, has an orthonormal basis of eigenvectors.
#### Some nice properties of adjacency matrices
- Let $G = (V, E)$ be $d$-regular, with adjacency matrix $A_G$ whose (real) eigenvalues are $\lambda_1 \geq \cdots \geq \lambda_n$.
- Some theorems:
- $\lambda_1 = d$.
- $\lambda_n \geq -d$, equality if and only if $G$ is bipartite.
- $A_G \mathbb{I}^\top = \lambda_1 \mathbb{I}^\top = d \mathbb{I}^\top$ (easy to show!).
- Does that ring a bell? ;)
- If $\lambda_1 = \lambda_2$ then $G$ is not connected.
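These properties can be checked on a concrete $d$-regular graph; the sketch below uses the cycle $C_8$ (2-regular, and bipartite since 8 is even):

```python
import numpy as np

n, d = 8, 2
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1       # edge {i, i+1 mod n}

lam = np.sort(np.linalg.eigvalsh(A))[::-1]          # lambda_1 >= ... >= lambda_n
print(lam[0])                                       # lambda_1 = d = 2
print(lam[-1])                                      # lambda_n = -d = -2 (bipartite)
print(np.allclose(A @ np.ones(n), d * np.ones(n)))  # all-ones vector is an eigenvector
```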
#### Expander graphs - Intuition.
- An important family of graphs.
- Multiple applications in:
- Algorithms, complexity theory, error correcting codes, etc.
- Intuition: A graph is called an expander if there are no "lonely small sets" of nodes.
- Every set of at most $n/2$ nodes is "well connected" to the remaining nodes in the graph.
- A bit more formally:
- An infinite family of graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an **expander family**, if the "minimal connectedness" of small sets in $G_n$ does not go to zero with $n$.
#### Expander graphs - Definitions.
- All graphs in this lecture are $d$-regular, i.e., all nodes have the same degree $d$.
- For sets of nodes $S, T \subseteq V$, let $E(S, T)$ be the set of edges between $S$ and $T$. I.e., $E(S, T) = \{(i,j) \in E | i \in S \text{ and } j \in T\}$.
- For a set of nodes $S$ let:
- $S^c = V \setminus S$ be its complement.
- Let $\partial S = E(S, S^c)$ be the boundary of $S$.
- I.e., the set of edges between $S$ and its complement $S^c$.
- The expansion parameter $h_G$ of $G$ is:
- I.e., how many edges leave $S$, relative to its size.
- How "well connected" $S$ is to the remaining nodes.
> [!NOTE]
> $h_G = \min_{S \subseteq V, |S| \leq n/2} \frac{|\partial S|}{|S|}$.
- An infinite family of $d$-regular graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an **expander family** if $h(G_n) \geq \epsilon$ for all $n$.
- Same $d$ and same $\epsilon$ for all $n$.
- Expander families with large $\epsilon$ are hard to build explicitly.
- Example: (Lubotsky, Philips and Sarnak '88)
- $V = \mathbb{Z}_p$ (prime).
- Connect $x$ to $x + 1, x - 1, x^{-1}$.
- $d = 3$, very small $\epsilon$.
- However, **random** graphs are expanders with high probability.
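The expansion parameter can be computed by brute force for tiny graphs (the minimization runs over exponentially many sets, so this is only a sanity check). The sketch below, with an illustrative helper `expansion`, shows that cycle graphs are *not* an expander family: $h(C_n) = 4/n \rightarrow 0$.

```python
import numpy as np
from itertools import combinations

def expansion(A):
    """Brute-force h_G = min over |S| <= n/2 of |boundary(S)| / |S| (tiny graphs only)."""
    n = len(A)
    best = float("inf")
    for k in range(1, n // 2 + 1):
        for S in combinations(range(n), k):
            Sc = [i for i in range(n) if i not in S]
            boundary = A[np.ix_(list(S), Sc)].sum()   # number of edges leaving S
            best = min(best, boundary / k)
    return best

def cycle(n):
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
    return A

print(expansion(cycle(8)), expansion(cycle(12)))   # 0.5 and 1/3: h(C_n) -> 0, not an expander family
```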
#### Expander graphs - Eigenvalues
- There is a strong connection between the expansion parameter of a graph and the eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$ of its adjacency matrix.
- Some theorems (no proof):
- $\frac{d-\lambda_2}{2} \leq h_G \leq \sqrt{2d (d - \lambda_2)}$.
- $d - \lambda_2$ is called the **spectral gap** of $G$.
- If the spectral gap is large, $G$ is a good expander.
- How large can it be?
- Let $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$. Then $\lambda \geq 2\sqrt{d - 1} - o_n(1)$. (Alon-Boppana Theorem).
- Graphs which achieve the Alon-Boppana bound (i.e., $\lambda \leq 2\sqrt{d - 1}$) are called **Ramanujan graphs**.
- The "best" expanders.
- Some constructions are known.
- Efficient construction of Ramanujan graphs for all parameters is very recent (2016).
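To see the spectral statement concretely, one can sample a random $d$-regular graph and compare $\lambda = \max\{|\lambda_2|, |\lambda_n|\}$ with the Ramanujan bound $2\sqrt{d-1}$; a sketch assuming `networkx` is available (the sampled value typically lands near, and sometimes slightly above, the bound):

```python
import numpy as np
import networkx as nx                          # assumption: networkx is installed

n, d = 200, 4                                  # illustrative parameters (n * d must be even)
G = nx.random_regular_graph(d, n, seed=0)
A = nx.to_numpy_array(G)

lam = np.sort(np.linalg.eigvalsh(A))[::-1]     # lambda_1 >= ... >= lambda_n
print("spectral gap d - lambda_2 =", d - lam[1])
print("lambda =", max(abs(lam[1]), abs(lam[-1])))
print("Ramanujan bound 2*sqrt(d-1) =", 2 * np.sqrt(d - 1))
```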
#### Approximate GC from Expander Graphs
Back to approximate gradient coding.
- Let $d$ be any replication parameter.
- Let $G$ be an expander graph (i.e., taken from an infinite expander family $(G_n)_{n=1}^\infty$)
- With eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$, and respective eigenvectors $w_1, \cdots, w_n$
- Assume $\|w_1\|_2 =\| w_2\|_2 = \cdots = \|w_n\|_2 = 1$, and $w_i w_j^\top = 0$ for all $i \neq j$.
- Let the gradient coding matrix $B=\frac{1}{d} A_G$.
- The eigenvalues of $B$ are $\mu_1 = 1 \geq \mu_2 \geq \cdots \geq \mu_n$, where $\mu_i = \frac{\lambda_i}{d}$.
- Let $\mu = \max \{|\mu_2|, |\mu_n|\}$.
- $d$ nonzero entries in each row $\Rightarrow$ Replication factor $d$.
- Claim: For any number of stragglers $s$, we can get close to $\mathbb{I}$.
- Much better than the trivial scheme.
- Proximity is a function of $d$ and $\lambda$.
- For every $s$ and any set $\mathcal{K}$ of $n - s$ responses, we build a "decoding vector".
- A function of $s$ and of the identities of the responding workers.
- Will be used to linearly combine the $n - s$ responses to get the approximate gradient.
- Let $w_{\mathcal{K}} \in \mathbb{R}^n$ such that $(w_{\mathcal{K}})_i = \begin{cases} -1 & \text{if } i \notin \mathcal{K} \\ \frac{s}{n-s} & \text{if } i \in \mathcal{K} \end{cases}$.
Lemma 1: $w_{\mathcal{K}}$ is in the span of $w_2, \cdots, w_n$, the last $n - 1$ eigenvectors of $A_G$.
<details>
<summary>Proof</summary>
$w_2, \cdots, w_n$ are independent, and all orthogonal to $w_1$, which is proportional to $\mathbb{I}$.
$\Rightarrow$ The span of $w_2, \cdots, w_n$ is exactly all vectors whose sum of entries is zero.
Sum of entries of $w_{\mathcal{K}}$ is zero $\Rightarrow$ $w_{\mathcal{K}}$ is in their span.
Corollary: $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$ for some $\alpha_i$'s in $\mathbb{R}$.
</details>
Lemma 2: From direct computation, $\|w_{\mathcal{K}}\|_2^2 = \frac{ns}{n-s}$.
Corollary: $\sum_{i=2}^n \alpha_i^2 = \|w_{\mathcal{K}}\|_2^2 = \frac{ns}{n-s}$ (from Lemma 2 + orthonormality of $w_2, \cdots, w_n$).
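Both lemmas are easy to confirm numerically; a short sketch with illustrative $n$, $s$ and an arbitrary response set $\mathcal{K}$:

```python
import numpy as np

n, s = 12, 3
K = np.arange(n - s)                  # any subset of n - s responding workers

w_K = np.full(n, -1.0)                # -1 on stragglers
w_K[K] = s / (n - s)                  # s / (n - s) on responders

print(np.isclose(w_K.sum(), 0))                       # True: entries sum to zero (Lemma 1)
print(np.isclose((w_K ** 2).sum(), n * s / (n - s)))  # True: squared norm is ns/(n-s) (Lemma 2)
```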
The scheme:
- If the set of responses is $\mathcal{K}$, the decoding vector is $w_{\mathcal{K}} + \mathbb{I}$.
- Notice that $\operatorname{supp}(w_{\mathcal{K}} + \mathbb{I}) = \mathcal{K}$.
- The responses the master receives are the rows of $B (v_1, \cdots, v_n)^\top$ indexed by $\mathcal{K}$.
- $\Rightarrow$ The master can compute $(w_{\mathcal{K}} + \mathbb{I}) B (v_1, \cdots, v_n)^\top$.
Left to show: How close is $(w_{\mathcal{K}} + \mathbb{I}) B$ to $\mathbb{I}$?
<details>
<summary>Proof</summary>
Recall that:
1. $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$.
2. $w_i$'s are eigenvectors of $A_G$ (with eigenvalues $\lambda_i$) and of $B = \frac{1}{d} A_G$ (with eigenvalues $\mu_i = \frac{\lambda_i}{d}$).
$d_2 ((w_{\mathcal{K}} + \mathbb{I}) B, \mathbb{I}) = d_2 ((\mathbb{I} + \alpha_2 w_2 + \cdots + \alpha_n w_n) B, \mathbb{I})$ (from 1.)
$= d_2 (\mathbb{I} + \alpha_2 \mu_2 w_2 + \cdots + \alpha_n \mu_n w_n, \mathbb{I})$ (eigenvalues of $B$, and $\mu_1 = 1$, so $\mathbb{I} B = \mathbb{I}$)
$= \|\sum_{i=2}^n \alpha_i \mu_i w_i\|_2$ (by definition of $d_2$)
$= \sqrt{\sum_{i=2}^n \alpha_i^2 \mu_i^2}$ ($w_i$'s are orthonormal)
$\leq \mu \sqrt{\sum_{i=2}^n \alpha_i^2} = \mu \sqrt{\frac{ns}{n-s}}$ (since $|\mu_i| \leq \mu$ for all $i \geq 2$, together with the corollary of Lemma 2).
</details>
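Putting the pieces together, here is an end-to-end numerical sketch of the scheme (the parameters and the choice of a random regular graph via `networkx` are illustrative; a Ramanujan graph would guarantee $\mu \leq \frac{2\sqrt{d-1}}{d}$):

```python
import numpy as np
import networkx as nx                           # assumption: networkx is installed

n, d, s = 100, 8, 10                            # illustrative parameters
G = nx.random_regular_graph(d, n, seed=1)
B = nx.to_numpy_array(G) / d                    # gradient coding matrix B = (1/d) A_G

mu_all = np.sort(np.linalg.eigvalsh(B))[::-1]   # mu_1 = 1 >= mu_2 >= ... >= mu_n
mu = max(abs(mu_all[1]), abs(mu_all[-1]))

ones = np.ones(n)
K = np.arange(n - s)                            # responding workers (last s workers straggle)
w_K = np.full(n, -1.0)
w_K[K] = s / (n - s)
decode = w_K + ones                             # decoding vector, supported on K

err_coded = np.linalg.norm(decode @ B - ones)   # d_2((w_K + 1) B, 1)
err_trivial = np.sqrt(n * s / (n - s))          # "do nothing" distance d_2(u, 1)
print(err_coded, "<=", mu * err_trivial, "<", err_trivial)
```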
#### Improvement factor
Corollary: If $B = \frac{1}{d} A_G$ for a ($d$-regular) Ramanujan graph $G$,
- $\Rightarrow$ improvement factor $\mu = \frac{\lambda}{d} \leq \frac{2\sqrt{d-1}}{d} \approx \frac{2}{\sqrt{d}}$.
- Some explicit constructions of Ramanujan graphs (Lubotsky, Philips and Sarnak '88)
- with $\frac{2}{\sqrt{d}} \approx 0.5$!
### Recap
- Expander graph: A $d$-regular graph with no lonely small subsets of nodes.
- Every subset $S$ with $|S| \leq n/2$ has a large ratio $|\partial S| / |S|$ (not $\rightarrow 0$ with $n$).
- Many constructions exist; a random $d$-regular graph is an expander w.h.p.
- The expansion factor is determined by the spectral gap $d - \lambda_2$,
- Where $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$, and $\lambda_1 = d \geq \lambda_2 \geq \cdots \geq \lambda_n$ are the eigenvalues of $A_G$.
- "Best" expander = Ramanujan graph = has $\lambda \leq 2\sqrt{d - 1}$.
- "Do nothing" approach: approximation $\frac{ns}{n-s}$.
- Approximate gradient coding:
- Send $d$ subsets $S_{j_1}, \cdots, S_{j_d}$ to each node $i$, which returns a linear combination according to a coefficient matrix $B$.
- Let $B = \frac{1}{d} A_G$, for $G$ a Ramanujan graph: approximation $\frac{\lambda}{d} \sqrt{\frac{ns}{n-s}}$.
- Up to 50% closer than "do nothing", at the price of a higher computation load.
>[!NOTE]
> A better (faster-converging) approximation = more computation load per worker.