partial updates

This commit is contained in:
Trance-0
2025-11-18 13:25:21 -06:00
parent 946d0b605f
commit 9416bd4956
10 changed files with 1218 additions and 136 deletions

View File

@@ -1,4 +1,4 @@
# CSE5313 Coding and information theory for data science (Lecture 21)
# CSE5313 Coding and information theory for data science (Lecture 22)
## Approximate Gradient Coding

View File

@@ -0,0 +1,242 @@
# CSE5313 Coding and information theory for data science (Lecture 23)
## Coded Computing
### Motivation
Some facts:
- Moore's law is saturating.
- Improving CPU performance is hard.
- Modern datasets are growing remarkably large.
- E.g., TikTok, YouTube.
- Learning tasks are computationally heavy.
- E.g., training neural networks.
Solution: Distributed Computing for Scalability
- Offloading computation tasks to multiple computation nodes.
- Gather and accumulate computation results.
- E.g., Apache Hadoop, Apache Spark, MapReduce.
### General Framework
- The system involves 1 master node and $P$ worker nodes.
- The master has a dataset $D$ and wants $f(D)$, where $f$ is some function.
- The master partitions $D=(D_1,\cdots,D_P)$, and sends $D_i$ to node $i$.
- Every node $i$ computes $g(D_i)$, where $g$ is some function.
- Finally, the master collects $g(D_1),\cdots,g(D_P)$ and computes $f(D)=h(g(D_1),\cdots,g(D_P))$, where $h$ is some function.
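As a toy illustration of this framework (the choice $f=\operatorname{sum}$, $g=$ partial sum, $h=\operatorname{sum}$ is ours, purely illustrative), a minimal uncoded sketch in Python:

```python
import numpy as np

# Minimal sketch of the framework: f(D) = sum(D), g = partial sum, h = sum.
P = 4                                  # number of worker nodes
D = np.arange(100)                     # the master's dataset
parts = np.array_split(D, P)           # partition D = (D_1, ..., D_P)

g_results = [part.sum() for part in parts]  # node i computes g(D_i)
f_of_D = sum(g_results)                     # master computes h(g(D_1), ..., g(D_P))

assert f_of_D == D.sum()
```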
#### Challenges
Stragglers
- Nodes that are significantly slower than the others.
Adversaries
- Nodes that return erroneous results.
- Computation/communication errors.
- Adversarial attacks.
Privacy
- Nodes may be curious about the dataset.
### Resemblance to communication channel
Suppose $f,g=\operatorname{id}$, and let $D=(D_1,\cdots,D_P)\in \mathbb{F}^P$ be a message.
- $D_i$ is a field element.
- $\mathbb{F}$ could be $\mathbb{R}$, $\mathbb{C}$, or a finite field $\mathbb{F}_q$.
Observation: This is a distributed storage system.
- An erasure - node that does not respond.
- An error - node that returns erroneous results.
Solution:
- Add redundancy to the message
- Error-correcting codes.
### Coded Distributed Computing
- The master partitions $D$ and encodes it before sending to $P$ workers.
- Workers perform computations on coded data $\tilde{D}$ and generate coded results $g(\tilde{D})$.
- The master decodes the coded results and obtains $f(D)=h(g(\tilde{D}))$.
### Outline
Matrix-Vector Multiplication
- MDS codes.
- Short-Dot codes.
Matrix-Matrix Multiplication
- Polynomial codes.
- MatDot codes.
Polynomial Evaluation
- Lagrange codes.
- Application to blockchain.
### Trivial solution - replication
Consider computing $y=Ax$ for $A\in\mathbb{F}^{M\times N}$ and $x\in\mathbb{F}^{N}$.
Why no straggler tolerance?
- We employ an individual worker node $i$ to compute $y_i=(a_{i1},\ldots,a_{iN})\cdot (x_1,\ldots,x_N)^\top$.
- If node $i$ straggles or fails, the entry $y_i$ is never obtained.
Replicate the computation?
- Let $r+1$ nodes compute every $y_i$.
We need $P=rM+M$ worker nodes to tolerate $r$ erasures and $\lfloor \frac{r}{2}\rfloor$ adversaries (e.g., $M=10$ and $r=2$ already require $P=30$ nodes).
### Use of MDS codes
Let $2|M$ and $P=3$.
Let $A_1,A_2$ be submatrices of $A$ such that $A=[A_1^\top|A_2^\top]^\top$.
- Worker node 1 computes $A_1\cdot x$.
- Worker node 2 computes $A_2\cdot x$.
- Worker node 3 computes $(A_1+A_2)\cdot x$.
Observation: the results can be obtained from any two worker nodes; e.g., from nodes 1 and 3 we recover $A_2x=(A_1+A_2)x-A_1x$.
Let $G\in \mathbb{F}^{M\times P}$ be the generator matrix of a $(P,M)$ MDS code.
The master node computes $F=G^\top A\in \mathbb{F}^{P\times N}$.
Every worker node $i$ computes $F_i\cdot x$.
- $F_i=(G^\top A)_i$ is the $i$-th row of $G^\top A$.
Notice that $Fx=G^\top A\cdot x=G^\top y$ is a codeword encoding $y=Ax$.
Node $i$ computes an entry in this codeword.
$1$ response = $1$ entry of the codeword.
The master does **not** need all workers to respond to obtain $y$.
- The MDS property allows decoding $y$ from any $M$ responses.
- This scheme tolerates $P-M$ erasures, and the recovery threshold is $K=M$.
- We need only $P=r+M$ worker nodes to tolerate $r$ stragglers or $\lfloor \frac{r}{2}\rfloor$ adversaries.
- With replication, we need $P=rM+M$ worker nodes.
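A minimal numerical sketch of this scheme over $\mathbb{R}$ (the Vandermonde choice of $G$ and all dimensions are illustrative, not from the lecture):

```python
import numpy as np

M, N, P = 4, 6, 7                 # A is M x N; P workers tolerate P - M = 3 stragglers
rng = np.random.default_rng(0)
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

# Generator matrix of a (P, M) MDS code: a real Vandermonde matrix with
# distinct evaluation points (any M rows of G^T form an invertible matrix).
nodes = np.arange(1, P + 1, dtype=float)
G = np.vander(nodes, M, increasing=True).T   # shape (M, P)

F = G.T @ A                       # encoded rows, communicated to workers in advance
responses = F @ x                 # worker i computes F[i] @ x = (G^T y)_i

alive = [0, 2, 4, 6]              # any M responses suffice; the rest may straggle
y = np.linalg.solve(G.T[alive], responses[alive])

assert np.allclose(y, A @ x)      # the master recovers y = Ax
```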
#### Potential improvements for MDS codes
- The matrix $A$ is usually a (trained) model, and $x$ is the data (feature vector).
- $x$ is transmitted frequently, while the rows of $A$ (or $G^\top A$) are communicated in advance.
- Every worker needs to receive the entire $x$ and compute the dot product.
- Communication-heavy
- Can we design a scheme that allows every node to receive only a part of $x$?
### Short-Dot codes
[link to paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8758338)
We want to create a matrix $F\in \mathbb{F}^{P\times N}$ from $A$ such that:
- Every node computes $F_i\cdot x$.
- Every $K$ rows linearly span the row space of $A$.
- Each row of $F$ contains at most $s$ non-zero entries.
In the MDS method, $F=G^\top A$.
- The recovery threshold $K=M$.
- Every worker node needs to receive $s=N$ symbols (the entire $x$).
No free lunch
Can we trade the recovery threshold $K$ for a smaller $s$?
- Every worker node receives less than $N$ symbols.
- The master will need more than $M$ responses to recover the computation result.
#### Construction of Short-Dot codes
Choose a super-regular matrix $B\in \mathbb{F}^{P\times K}$, where $P$ is the number of worker nodes.
- A matrix is super-regular if every square submatrix is invertible.
- Lagrange/Cauchy matrices are super-regular (next lecture).
Create matrix $\tilde{A}$ by stacking some $Z\in \mathbb{F}^{(K-M)\times N}$ below matrix $A$.
Let $F=B\cdot \tilde{A}\in \mathbb{F}^{P\times N}$.
**Short-Dot**: create matrix $F\in \mathbb{F}^{P\times N}$ such that:
- Every $K$ rows linearly span the row space of $A$.
- Each row of $F$ contains at most $s=\frac{P-K+M}{P}\cdot N$ non-zero entries (sparse).
#### Recovery of Short-Dot codes
Claim: Every $K$ rows of $F$ linearly span the row space of $A$.
<details>
<summary>Proof</summary>
Since $B$ is super-regular, it is also MDS, i.e., every $K\times K$ submatrix of $B$ is invertible.
Hence, every row of $A$ can be represented as a linear combination of any $K$ rows of $F$.
That is, for every $\mathcal{X}\subseteq[P]$ with $|\mathcal{X}|=K$, we have $\tilde{A}=(B^{\mathcal{X}})^{-1}F^{\mathcal{X}}$, where $B^{\mathcal{X}}$ and $F^{\mathcal{X}}$ consist of the rows indexed by $\mathcal{X}$.
</details>
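A sketch of this construction and recovery in Python over $\mathbb{R}$, using a Cauchy matrix as the super-regular $B$ (dimensions are illustrative; $Z$ is random here, so $F$ is not yet sparse — the sparsity-inducing choice of $Z$ comes below):

```python
import numpy as np

M, N, P, K = 3, 8, 7, 5                # A is M x N; recovery threshold K > M
rng = np.random.default_rng(1)
A = rng.standard_normal((M, N))

# Super-regular B in R^{P x K}: a Cauchy matrix B[i, j] = 1 / (x_i - y_j)
# with distinct x's and y's; every square submatrix is invertible.
xs = np.arange(P, dtype=float)
ys = np.arange(K, dtype=float) + 0.5
B = 1.0 / (xs[:, None] - ys[None, :])

Z = rng.standard_normal((K - M, N))    # arbitrary for now
A_tilde = np.vstack([A, Z])            # stack Z below A
F = B @ A_tilde                        # shape (P, N); row i is sent to worker i

# Recovery from any K rows: A_tilde = (B^X)^{-1} F^X, then keep the first M rows.
X = [0, 2, 3, 5, 6]
A_rec = np.linalg.solve(B[X], F[X])[:M]
assert np.allclose(A_rec, A)
```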
What about the sparsity of $F$?
- Want each row of $F$ to be sparse.
#### Sparsity of Short-Dot codes
Build a $P\times P$ square matrix in which every row and column contains exactly $P-K+M$ non-zero entries (e.g., a cyclic pattern).
Concatenate $\frac{N}{P}$ such matrices horizontally to obtain the support pattern of $F$; each row then contains $\frac{N}{P}(P-K+M)=\frac{P-K+M}{P}\cdot N=s$ non-zero entries.
[Missing slides 18]
#### Construction of $Z$
We now investigate what $Z$ should look like to construct such a matrix $F$.
- Recall that each column of $F$ must contain $K-M$ zeros.
  - They are indexed by a set $\mathcal{U}\subseteq[P]$ with $|\mathcal{U}|=K-M$.
- Let $B^{\mathcal{U}}\in \mathbb{F}^{(K-M)\times K}$ be the submatrix of $B$ containing the rows indexed by $\mathcal{U}$.
- Since $F=B\tilde{A}$, it follows that $F_j=B\tilde{A}_j$, where $F_j$ and $\tilde{A}_j$ are the $j$-th columns of $F$ and $\tilde{A}$.
- Hence, we need $B^{\mathcal{U}}\tilde{A}_j=0_{(K-M)\times 1}$.
- Split $B^{\mathcal{U}}=[B^{\mathcal{U}}_{[1,M]}|B^{\mathcal{U}}_{[M+1,K]}]$ and $\tilde{A}_j=[A_j^\top|Z_j^\top]^\top$; then $B^{\mathcal{U}}\tilde{A}_j=B^{\mathcal{U}}_{[1,M]}A_j+B^{\mathcal{U}}_{[M+1,K]}Z_j=0_{(K-M)\times 1}$.
- Solving gives $Z_j=-(B^{\mathcal{U}}_{[M+1,K]})^{-1}B^{\mathcal{U}}_{[1,M]}A_j$.
- Note that $B^{\mathcal{U}}_{[M+1,K]}\in \mathbb{F}^{(K-M)\times(K-M)}$ is invertible, since $B$ is super-regular.
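A sketch of this $Z_j$ computation in Python (continuing the Cauchy-matrix example above; the cyclic choice of the zero sets $\mathcal{U}$ is ours, purely to illustrate the formula):

```python
import numpy as np

M, N, P, K = 3, 6, 7, 5
rng = np.random.default_rng(2)
A = rng.standard_normal((M, N))

xs = np.arange(P, dtype=float)
ys = np.arange(K, dtype=float) + 0.5
B = 1.0 / (xs[:, None] - ys[None, :])        # super-regular Cauchy matrix, P x K

# For each column j, choose the K - M rows of F that should be zero.
# A real Short-Dot construction staggers these sets so every ROW of F is
# sparse as well; here we just rotate a window of size K - M.
Z = np.empty((K - M, N))
for j in range(N):
    U = [(j + t) % P for t in range(K - M)]  # zero positions of column j
    B_left, B_right = B[U, :M], B[U, M:]     # B^U split into [1, M] and [M+1, K]
    Z[:, j] = -np.linalg.solve(B_right, B_left @ A[:, j])

F = B @ np.vstack([A, Z])
for j in range(N):
    U = [(j + t) % P for t in range(K - M)]
    assert np.allclose(F[U, j], 0.0)         # the prescribed zeros appear in F
```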

View File

@@ -26,4 +26,5 @@ export default {
CSE5313_L20: "CSE5313 Coding and information theory for data science (Lecture 20)",
CSE5313_L21: "CSE5313 Coding and information theory for data science (Lecture 21)",
CSE5313_L22: "CSE5313 Coding and information theory for data science (Lecture 22)",
CSE5313_L23: "CSE5313 Coding and information theory for data science (Lecture 23)",
}