# CSE5313 Coding and information theory for data science (Lecture 21)

## Approximate Gradient Coding

### Exact gradient computation and approximate gradient computation

In the previous formulation, the gradient $\sum_i v_i$ is computed exactly.

- Accurate.
- Requires $d \geq s + 1$ (high replication factor).
- Need to know $s$ in advance!

However:

- Approximate gradient computations are very common!
- E.g., stochastic gradient descent.
- Machine learning is inherently inaccurate.
- Relies on biased data, unverified assumptions about the model, etc.

Idea: If we relax the exact computation requirement, can we have $d < s + 1$?

- No fixed $s$ anymore.

Approximate computation:

- Exact computation: $\nabla \triangleq v = \sum_i v_i = (1, \cdots, 1) (v_1, \cdots, v_n)^\top$.
- Approximate computation: $\nabla \triangleq v = \sum_i v_i \approx u (v_1, \cdots, v_n)^\top$,
- Where $d_2(u, \mathbb{I})$ is "small" ($d_2(u, v) = \sqrt{\sum_i (u_i - v_i)^2}$).
- Why?
- Lemma: Let $v_u = u (v_1, \cdots, v_n)^\top$. If $d_2(u, \mathbb{I}) \leq \epsilon$ then $d_2(v, v_u) \leq \epsilon \cdot \ell_{spec}(V)$ (see the numerical sketch below).
- $V$ is the matrix whose rows are the $v_i$'s.
- $\ell_{spec}$ is the spectral norm (the positive square root of the maximum eigenvalue of $V^\top V$).
- Idea: Distribute $S_1, \cdots, S_n$ as before, and
- as the master gets more and more responses,
- it can reconstruct $u (v_1, \cdots, v_n)^\top$,
- such that $d_2(u, \mathbb{I})$ gets smaller and smaller.

> [!NOTE]
> The requirement $d \geq s + 1$ no longer applies.
> $s$ is no longer a parameter of the system; rather, $s = n - \#\text{responses}$ at any given time.
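
To see the lemma in action, here is a minimal numerical sketch (not from the lecture; the variable names, dimensions, and perturbation size are illustrative):

```python
# Check the lemma: for random partial gradients v_1, ..., v_n and a vector u
# close to the all-ones vector, d_2(v, v_u) <= d_2(u, 1) * spectral_norm(V).
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 5                      # n workers, gradients of dimension p
V = rng.standard_normal((n, p))   # rows are the partial gradients v_i

ones = np.ones(n)
u = ones + 0.1 * rng.standard_normal(n)  # a perturbation of the all-ones vector

v = ones @ V                      # exact gradient: sum of the v_i's
v_u = u @ V                       # approximate gradient

eps = np.linalg.norm(u - ones)    # d_2(u, 1)
spec = np.linalg.norm(V, ord=2)   # spectral norm of V
print(np.linalg.norm(v - v_u) <= eps * spec)  # always True
```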

### Trivial Scheme

Off the bat, the "do nothing" approach:

- Send $S_i$ to worker $i$, i.e., $d = 1$.
- Worker $i$ replies with the $i$'th partial gradient $v_i$.
- The master averages all the responses.

How good is that?

- For $u$ with $u_i = \frac{n}{n-s}$ for responding workers and $u_i = 0$ otherwise, the factor $\frac{n}{n-s}$ corrects the $\frac{1}{n}$ in $v_i = \frac{1}{n} \cdot \nabla \text{ on } S_i$.
- Is this $\approx \sum_i v_i$? In other words, what is $d_2(u, \mathbb{I})$?

Trivial scheme: the approximation $u = \frac{n}{n-s} \cdot \mathbb{I}$ restricted to the responders, with $d_2(u, \mathbb{I}) = \sqrt{\frac{ns}{n-s}}$.

Must do better than that! (A quick numerical check of this distance follows.)
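
A quick numerical check of the trivial scheme's distance (illustrative sketch; `n`, `s`, and the assumption that the first $n - s$ workers respond are arbitrary choices):

```python
# The "do nothing" decoding vector u (n/(n-s) on the n-s responders, 0 on the
# s stragglers) has d_2(u, 1) = sqrt(ns/(n-s)).
import numpy as np

n, s = 20, 4
u = np.zeros(n)
u[: n - s] = n / (n - s)          # assume w.l.o.g. the first n-s workers respond

d2 = np.linalg.norm(u - np.ones(n))
print(d2, np.sqrt(n * s / (n - s)))   # both ~2.236 for n=20, s=4
```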

### Roadmap

- Quick reminder from linear algebra.
- Eigenvectors and orthogonality.
- Quick reminder from graph theory.
- Adjacency matrix of a graph.
- Graph theoretic concept: expander graphs.
- "Well connected" graphs.
- Extensively studied.
- An approximate gradient coding scheme from expander graphs.

### Linear algebra - Reminder

- Let $A \in \mathbb{R}^{n \times n}$.
- If $A v = \lambda v$ then $\lambda$ is an eigenvalue and $v$ is an eigenvector.
- $v_1, \cdots, v_n \in \mathbb{R}^n$ are orthonormal:
- $\|v_i\|_2 = 1$ for all $i$.
- $v_i \cdot v_j^\top = 0$ for all $i \neq j$.
- Nice property: $\| \alpha_1 v_1 + \cdots + \alpha_n v_n \|_2 = \sqrt{\sum_i \alpha_i^2}$.
- $A$ is called symmetric if $A = A^\top$.
- Theorem: A **real and symmetric** matrix has an orthonormal basis of eigenvectors.
- That is, there exists an orthonormal basis $v_1, \cdots, v_n$ such that $A v_i = \lambda_i v_i$ for some $\lambda_i$'s.

### Graph theory - Reminder

- Undirected graph $G = (V, E)$.
- $V$ is a vertex set, usually $V = \{1, 2, \cdots, n\}$.
- $E \subseteq \binom{V}{2}$ is an edge set (i.e., $E$ is a collection of subsets of $V$ of size two).
- Each edge $e \in E$ is of the form $e = \{a, b\}$ for some distinct $a, b \in V$.
- Spectral graph theory:
- Analyze properties of graphs (combinatorial objects) using matrices (algebraic objects).
- Specifically, for a graph $G$ let $A_G \in \{0,1\}^{n \times n}$ be the adjacency matrix of $G$.
- $A_{i,j} = 1$ if and only if $\{i,j\} \in E$ (otherwise 0).
- $A_G$ is real and symmetric.
- Therefore, it has an orthonormal basis of eigenvectors.

#### Some nice properties of adjacency matrices

- Let $G = (V, E)$ be $d$-regular, with adjacency matrix $A_G$ whose (real) eigenvalues are $\lambda_1 \geq \cdots \geq \lambda_n$.
- Some theorems (checked numerically below):
- $\lambda_1 = d$.
- $\lambda_n \geq -d$, with equality if and only if $G$ is bipartite.
- $A_G \mathbb{I}^\top = \lambda_1 \mathbb{I}^\top = d \mathbb{I}^\top$ (easy to show!).
- Does that ring a bell? ;)
- If $\lambda_1 = \lambda_2$ then $G$ is not connected.
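
These facts are easy to verify numerically. A minimal sketch, using the $n$-cycle as the $d$-regular graph (an illustrative choice, not from the lecture):

```python
# The n-cycle is 2-regular, so lambda_1 should equal d = 2, the all-ones
# vector should be an eigenvector, and lambda_n = -d iff the cycle is
# bipartite (i.e., n is even).
import numpy as np

n, d = 8, 2
A = np.zeros((n, n))
for i in range(n):                      # adjacency matrix of the n-cycle
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1

eig = np.sort(np.linalg.eigvalsh(A))[::-1]          # eigenvalues, descending
print(np.isclose(eig[0], d))                        # lambda_1 = d
print(np.allclose(A @ np.ones(n), d * np.ones(n)))  # A 1^T = d 1^T
print(np.isclose(eig[-1], -d))          # n = 8 is even => bipartite => -d
```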

#### Expander graphs - Intuition

- An important family of graphs.
- Multiple applications in:
- Algorithms, complexity theory, error correcting codes, etc.
- Intuition: A graph is called an expander if there are no "lonely small sets" of nodes.
- Every set of at most $n/2$ nodes is "well connected" to the remaining nodes in the graph.
- A bit more formally:
- An infinite family of graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an **expander family** if the "minimal connectedness" of small sets in $G_n$ does not go to zero with $n$.

#### Expander graphs - Definitions

- All graphs in this lecture are $d$-regular, i.e., all nodes have the same degree $d$.
- For sets of nodes $S, T \subseteq V$, let $E(S, T)$ be the set of edges between $S$ and $T$. I.e., $E(S, T) = \{\{i,j\} \in E \mid i \in S \text{ and } j \in T\}$.
- For a set of nodes $S$ let:
- $S^c = V \setminus S$ be its complement.
- Let $\partial S = E(S, S^c)$ be the boundary of $S$.
- I.e., the set of edges between $S$ and its complement $S^c$.
- The expansion parameter $h_G$ of $G$ is given in the note below.
- I.e., how many edges leave $S$, relative to its size.
- How "well connected" $S$ is to the remaining nodes.

> [!NOTE]
> $h_G = \min_{S \subseteq V, |S| \leq n/2} \frac{|\partial S|}{|S|}$.

- An infinite family of $d$-regular graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an **expander family** if there exists $\epsilon > 0$ such that $h_{G_n} \geq \epsilon$ for all $n$.
- Same $d$ and same $\epsilon$ for all $n$.
- Expander families with large $\epsilon$ are hard to build explicitly.
- Example: (Lubotzky, Phillips and Sarnak '88)
- $V = \mathbb{Z}_p$ ($p$ prime).
- Connect $x$ to $x + 1$, $x - 1$, and $x^{-1}$.
- $d = 3$, very small $\epsilon$ (a brute-force computation of $h_G$ for this construction appears below).
- However, **random** graphs are expanders with high probability.
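
A brute-force sketch of $h_G$ for this construction (illustrative; treating $0$ as its own inverse is an assumed convention, and for small $p$ the graph is only roughly $3$-regular because $x^{-1}$ may coincide with $x \pm 1$):

```python
# Brute-force the expansion parameter h_G = min over |S| <= n/2 of |bd(S)|/|S|
# for the Z_p graph: connect x to x+1, x-1, and x^{-1} (mod p).
from itertools import combinations

p = 11
edges = set()
for x in range(p):
    inv = pow(x, -1, p) if x != 0 else 0   # convention: 0 is its own inverse
    for y in ((x + 1) % p, (x - 1) % p, inv):
        if x != y:
            edges.add(frozenset((x, y)))

def boundary(S):
    # number of edges with exactly one endpoint in S
    return sum(1 for e in edges if len(e & S) == 1)

h = min(boundary(set(S)) / k
        for k in range(1, p // 2 + 1)
        for S in combinations(range(p), k))
print(h)
```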

#### Expander graphs - Eigenvalues

- There is a strong connection between the expansion parameter of a graph and the eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$ of its adjacency matrix.
- Some theorems (no proof):
- $\frac{d-\lambda_2}{2} \leq h_G \leq \sqrt{2d (d - \lambda_2)}$.
- $d - \lambda_2$ is called the **spectral gap** of $G$.
- If the spectral gap is large, $G$ is a good expander (compare the two extremes in the sketch below).
- How large can it be?
- Let $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$. Then $\lambda \geq 2\sqrt{d - 1} - o_n(1)$ (Alon-Boppana Theorem).
- Graphs which achieve the Alon-Boppana bound (i.e., $\lambda \leq 2\sqrt{d - 1}$) are called **Ramanujan graphs**.
- The "best" expanders.
- Some constructions are known.
- Efficient construction of Ramanujan graphs for all parameters is very recent (2016).
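
An illustrative comparison of the two extremes (not from the lecture): the cycle has a vanishing spectral gap, while the complete graph has the largest possible one:

```python
# Spectral gap d - lambda_2: tiny for the cycle (poor expander), large for
# the complete graph (excellent expander).
import numpy as np

def second_eigenvalue(A):
    return np.sort(np.linalg.eigvalsh(A))[::-1][1]

n = 16
cycle = np.zeros((n, n))
for i in range(n):
    cycle[i, (i + 1) % n] = cycle[i, (i - 1) % n] = 1

complete = np.ones((n, n)) - np.eye(n)

print(2 - second_eigenvalue(cycle))           # ~0.15: d=2, lambda_2=2cos(2*pi/n)
print((n - 1) - second_eigenvalue(complete))  # n: d=n-1, lambda_2=-1
```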

#### Approximate GC from Expander Graphs

Back to approximate gradient coding.

- Let $d$ be any replication parameter.
- Let $G$ be an expander graph (i.e., taken from an infinite expander family $(G_n)_{n=1}^\infty$).
- With eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$, and respective eigenvectors $w_1, \cdots, w_n$.
- Assume $\|w_1\|_2 = \|w_2\|_2 = \cdots = \|w_n\|_2 = 1$, and $w_i w_j^\top = 0$ for all $i \neq j$.
- Let the gradient coding matrix be $B = \frac{1}{d} A_G$.
- The eigenvalues of $B$ are $\mu_1 = 1 \geq \mu_2 \geq \cdots \geq \mu_n$, where $\mu_i = \frac{\lambda_i}{d}$.
- Let $\mu = \max \{|\mu_2|, |\mu_n|\}$.
- $d$ nonzero entries in each row $\Rightarrow$ replication factor $d$.
- Claim: For any number of stragglers $s$, we can get close to $\mathbb{I}$.
- Much better than the trivial scheme.
- Proximity is a function of $d$ and $\lambda$.
- For every $s$ and any set $\mathcal{K}$ of $n - s$ responses, we build a "decoding vector".
- A function of $s$ and of the identities of the responding workers.
- Will be used to linearly combine the $n - s$ responses to get the approximate gradient.
- Let $w_{\mathcal{K}} \in \mathbb{R}^n$ such that $(w_{\mathcal{K}})_i = \begin{cases} -1 & \text{if } i \notin \mathcal{K} \\ \frac{s}{n-s} & \text{if } i \in \mathcal{K} \end{cases}$.

Lemma 1: $w_{\mathcal{K}}$ is spanned by $w_2, \cdots, w_n$, the $n - 1$ last eigenvectors of $A_G$.

<details>
<summary>Proof</summary>

$w_2, \cdots, w_n$ are independent, and all orthogonal to $w_1$ (which is proportional to $\mathbb{I}$).

$\Rightarrow$ The span of $w_2, \cdots, w_n$ is exactly all vectors whose sum of entries is zero.

The sum of entries of $w_{\mathcal{K}}$ is zero $\Rightarrow$ $w_{\mathcal{K}}$ is in their span.

Corollary: $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$ for some $\alpha_i$'s in $\mathbb{R}$.

</details>

Lemma 2: From direct computation, the norm of $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$ is $\sqrt{\frac{ns}{n-s}}$.

Corollary: $\|w_{\mathcal{K}}\|_2^2 = \sum_{i=2}^n \alpha_i^2 = \frac{ns}{n-s}$ (from Lemma 2 + orthonormality of $w_2, \cdots, w_n$).

The scheme:

- If the set of responses is $\mathcal{K}$, the decoding vector is $w_{\mathcal{K}} + \mathbb{I}$.
- Notice that $\operatorname{supp}(w_{\mathcal{K}} + \mathbb{I}) = \mathcal{K}$.
- The responses the master receives are the rows of $B (v_1, \cdots, v_n)^\top$ indexed by $\mathcal{K}$.
- $\Rightarrow$ The master can compute $(w_{\mathcal{K}} + \mathbb{I}) B (v_1, \cdots, v_n)^\top$.

Left to show: How close is $(w_{\mathcal{K}} + \mathbb{I}) B$ to $\mathbb{I}$? (A full numerical sketch of the scheme follows the proof.)

<details>
<summary>Proof</summary>

Recall that:

1. $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$.
2. $w_i$'s are eigenvectors of $A_G$ (with eigenvalues $\lambda_i$) and of $B = \frac{1}{d} A_G$ (with eigenvalues $\mu_i = \frac{\lambda_i}{d}$).

$d_2 ((w_{\mathcal{K}} + \mathbb{I}) B, \mathbb{I}) = d_2 ((\mathbb{I} + \alpha_2 w_2 + \cdots + \alpha_n w_n) B, \mathbb{I})$ (from 1.)

$= d_2 (\mathbb{I} + \alpha_2 \mu_2 w_2 + \cdots + \alpha_n \mu_n w_n, \mathbb{I})$ (from 2., and $\mathbb{I} B = \mu_1 \mathbb{I} = \mathbb{I}$)

$= \left\| \sum_{i=2}^n \alpha_i \mu_i w_i \right\|_2$ (by definition of $d_2$)

$= \sqrt{\sum_{i=2}^n \alpha_i^2 \mu_i^2}$ ($w_i$'s are orthonormal)

$\leq \mu \sqrt{\sum_{i=2}^n \alpha_i^2} = \mu \sqrt{\frac{ns}{n-s}}$ (by the corollary of Lemma 2).

</details>
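
Putting the scheme together, here is a self-contained numerical sketch (illustrative; the complete graph $K_n$ stands in for a Ramanujan graph since it is $d$-regular with $\mu = \frac{1}{n-1}$, and all names are assumptions):

```python
# End-to-end sketch of the expander-based scheme vs. the trivial scheme.
import numpy as np

n, s = 12, 3
d = n - 1                                # K_n is (n-1)-regular
A = np.ones((n, n)) - np.eye(n)
B = A / d                                # gradient coding matrix

ones = np.ones(n)
K = np.arange(s, n)                      # responders; workers 0..s-1 straggle

# Decoding vector w_K + 1: zero on stragglers, n/(n-s) on responders.
w_K = np.where(np.isin(np.arange(n), K), s / (n - s), -1.0)
dec = w_K + ones
assert set(np.flatnonzero(dec)) == set(K)      # supp(w_K + 1) = K

u = dec @ B                              # effective combining vector
u_trivial = np.where(np.isin(np.arange(n), K), n / (n - s), 0.0)

print(np.linalg.norm(u - ones))          # expander scheme: ~0.18
print(np.linalg.norm(u_trivial - ones))  # trivial scheme: sqrt(ns/(n-s)) = 2
```

For $K_n$ all non-top eigenvalues are equal, so the printed distance matches the bound $\mu \sqrt{\frac{ns}{n-s}}$ exactly.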

#### Improvement factor

Corollary: If $B = \frac{1}{d} A_G$ for a ($d$-regular) Ramanujan graph $G$,

- $\Rightarrow$ improvement factor $\mu = \frac{\lambda}{d} \leq \frac{2\sqrt{d-1}}{d} \approx \frac{2}{\sqrt{d}}$ over the trivial scheme.
- Some explicit constructions of Ramanujan graphs (Lubotzky, Phillips and Sarnak '88)
- with $\frac{2}{\sqrt{d}} \approx 0.5$!

### Recap

- Expander graph: A $d$-regular graph with no lonely small subsets of nodes.
- Every subset $S$ with $|S| \leq n/2$ has a large ratio $|\partial S| / |S|$ (one that does not $\rightarrow 0$ with $n$).
- Many constructions exist; a random graph is an expander w.h.p.
- The expansion factor is determined by the spectral gap $d - \lambda_2$,
- Where $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$, and $\lambda_1 = d \geq \lambda_2 \geq \cdots \geq \lambda_n$ are the eigenvalues of $A_G$.
- "Best" expander = Ramanujan graph = has $\lambda \leq 2\sqrt{d - 1}$.
- "Do nothing" approach: approximation $\sqrt{\frac{ns}{n-s}}$.
- Approximate gradient coding:
- Send $d$ subsets $S_{j_1}, \cdots, S_{j_d}$ to each node $i$, which returns a linear combination according to a coefficient matrix $B$.
- Let $B = \frac{1}{d} A_G$, for $G$ a Ramanujan graph: approximation $\frac{\lambda}{d} \sqrt{\frac{ns}{n-s}}$.
- Up to 50% closer than "do nothing", at the price of a higher computation load.

> [!NOTE]
> Faster = more computation load.