partial updates

This commit is contained in:
Trance-0
2025-11-18 13:25:21 -06:00
parent 946d0b605f
commit 9416bd4956
10 changed files with 1218 additions and 136 deletions

View File

@@ -1,4 +1,4 @@
# CSE5313 Coding and information theory for data science (Lecture 21)
# CSE5313 Coding and information theory for data science (Lecture 22)
## Approximate Gradient Coding

View File

@@ -0,0 +1,242 @@
# CSE5313 Coding and information theory for data science (Lecture 23)
## Coded Computing
### Motivation
Some facts:
- Moore's law is saturating.
- Improving CPU performance is hard.
- Modern datasets are growing remarkably large.
- E.g., TikTok, YouTube.
- Learning tasks are computationally heavy.
- E.g., training neural networks.
Solution: Distributed Computing for Scalability
- Offloading computation tasks to multiple computation nodes.
- Gather and accumulate computation results.
- E.g., Apache Hadoop, Apache Spark, MapReduce.
### General Framework
- The system involves 1 master node and $P$ worker nodes.
- The master has a dataset $D$ and wants $f(D)$, where $f$ is some function.
- The master partitions $D=(D_1,\cdots,D_P)$, and sends $D_i$ to node $i$.
- Every node $i$ computes $g(D_i)$, where $g$ is some function.
- Finally, the master collects $g(D_1),\cdots,g(D_P)$ and computes $f(D)=h(g(D_1),\cdots,g(D_P))$, where $h$ is some function.
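As a toy illustration of this framework (the choice $f=\operatorname{sum}$, $g=$ partial sum, $h=\operatorname{sum}$ is ours, purely illustrative), a minimal uncoded sketch in Python:

```python
import numpy as np

# Minimal sketch of the framework: f(D) = sum(D), g = partial sum, h = sum.
P = 4                                  # number of worker nodes
D = np.arange(100)                     # the master's dataset
parts = np.array_split(D, P)           # partition D = (D_1, ..., D_P)

g_results = [part.sum() for part in parts]  # node i computes g(D_i)
f_of_D = sum(g_results)                     # master computes h(g(D_1), ..., g(D_P))

assert f_of_D == D.sum()
```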
#### Challenges
Stragglers
- Nodes that are significantly slower than the others.
Adversaries
- Nodes that return erroneous results.
- Computation/communication errors.
- Adversarial attacks.
Privacy
- Nodes may be curious about the dataset.
### Resemblance to communication channel
Suppose $f,g=\operatorname{id}$, and let $D=(D_1,\cdots,D_P)\in \mathbb{F}^P$ be a message.
- $D_i$ is a field element.
- $\mathbb{F}$ could be $\mathbb{R}$, $\mathbb{C}$, or a finite field $\mathbb{F}_q$.
Observation: This is a distributed storage system.
- An erasure - node that does not respond.
- An error - node that returns erroneous results.
Solution:
- Add redundancy to the message
- Error-correcting codes.
### Coded Distributed Computing
- The master partitions $D$ and encodes it before sending to $P$ workers.
- Workers perform computations on coded data $\tilde{D}$ and generate coded results $g(\tilde{D})$.
- The master decodes the coded results and obtains $f(D)=h(g(\tilde{D}))$.
### Outline
Matrix-Vector Multiplication
- MDS codes.
- Short-Dot codes.
Matrix-Matrix Multiplication
- Polynomial codes.
- MatDot codes.
Polynomial Evaluation
- Lagrange codes.
- Application to blockchain.
### Trivial solution - replication
Consider computing $y=Ax$ for $A\in\mathbb{F}^{M\times N}$ and $x\in\mathbb{F}^{N}$.
Why no straggler tolerance?
- We employ an individual worker node $i$ to compute $y_i=(a_{i1},\ldots,a_{iN})\cdot (x_1,\ldots,x_N)^\top$.
- If node $i$ straggles or fails, the entry $y_i$ is never obtained.
Replicate the computation?
- Let $r+1$ nodes compute every $y_i$.
We need $P=rM+M$ worker nodes to tolerate $r$ erasures and $\lfloor \frac{r}{2}\rfloor$ adversaries (e.g., $M=10$ and $r=2$ already require $P=30$ nodes).
### Use of MDS codes
Let $2|M$ and $P=3$.
Let $A_1,A_2$ be submatrices of $A$ such that $A=[A_1^\top|A_2^\top]^\top$.
- Worker node 1 computes $A_1\cdot x$.
- Worker node 2 computes $A_2\cdot x$.
- Worker node 3 computes $(A_1+A_2)\cdot x$.
Observation: the results can be obtained from any two worker nodes; e.g., from nodes 1 and 3 we recover $A_2x=(A_1+A_2)x-A_1x$.
Let $G\in \mathbb{F}^{M\times P}$ be the generator matrix of a $(P,M)$ MDS code.
The master node computes $F=G^\top A\in \mathbb{F}^{P\times N}$.
Every worker node $i$ computes $F_i\cdot x$.
- $F_i=(G^\top A)_i$ is the $i$-th row of $G^\top A$.
Notice that $Fx=G^\top A\cdot x=G^\top y$ is a codeword encoding $y=Ax$.
Node $i$ computes an entry in this codeword.
$1$ response = $1$ entry of the codeword.
The master does **not** need all workers to respond to obtain $y$.
- The MDS property allows decoding $y$ from any $M$ responses.
- This scheme tolerates $P-M$ erasures, and the recovery threshold is $K=M$.
- We need only $P=r+M$ worker nodes to tolerate $r$ stragglers or $\lfloor \frac{r}{2}\rfloor$ adversaries.
- With replication, we need $P=rM+M$ worker nodes.
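A minimal numerical sketch of this scheme over $\mathbb{R}$ (the Vandermonde choice of $G$ and all dimensions are illustrative, not from the lecture):

```python
import numpy as np

M, N, P = 4, 6, 7                 # A is M x N; P workers tolerate P - M = 3 stragglers
rng = np.random.default_rng(0)
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

# Generator matrix of a (P, M) MDS code: a real Vandermonde matrix with
# distinct evaluation points (any M rows of G^T form an invertible matrix).
nodes = np.arange(1, P + 1, dtype=float)
G = np.vander(nodes, M, increasing=True).T   # shape (M, P)

F = G.T @ A                       # encoded rows, communicated to workers in advance
responses = F @ x                 # worker i computes F[i] @ x = (G^T y)_i

alive = [0, 2, 4, 6]              # any M responses suffice; the rest may straggle
y = np.linalg.solve(G.T[alive], responses[alive])

assert np.allclose(y, A @ x)      # the master recovers y = Ax
```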
#### Potential improvements for MDS codes
- The matrix $A$ is usually a (trained) model, and $x$ is the data (feature vector).
- $x$ is transmitted frequently, while the rows of $A$ (or $G^\top A$) are communicated in advance.
- Every worker needs to receive the entire $x$ and compute the dot product.
- Communication-heavy
- Can we design a scheme that allows every node to receive only a part of $x$?
### Short-Dot codes
[link to paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8758338)
We want to create a matrix $F\in \mathbb{F}^{P\times N}$ from $A$ such that:
- Every node computes $F_i\cdot x$.
- Every $K$ rows linearly span the row space of $A$.
- Each row of $F$ contains at most $s$ non-zero entries.
In the MDS method, $F=G^\top A$.
- The recovery threshold $K=M$.
- Every worker node needs to receive $s=N$ symbols (the entire $x$).
No free lunch
Can we trade the recovery threshold $K$ for a smaller $s$?
- Every worker node receives less than $N$ symbols.
- The master will need more than $M$ responses to recover the computation result.
#### Construction of Short-Dot codes
Choose a super-regular matrix $B\in \mathbb{F}^{P\times K}$, where $P$ is the number of worker nodes.
- A matrix is super-regular if every square submatrix is invertible.
- Lagrange/Cauchy matrices are super-regular (next lecture).
Create matrix $\tilde{A}$ by stacking some $Z\in \mathbb{F}^{(K-M)\times N}$ below matrix $A$.
Let $F=B\cdot \tilde{A}\in \mathbb{F}^{P\times N}$.
**Short-Dot**: create matrix $F\in \mathbb{F}^{P\times N}$ such that:
- Every $K$ rows linearly span the row space of $A$.
- Each row of $F$ contains at most $s=\frac{P-K+M}{P}\cdot N$ non-zero entries (sparse).
#### Recovery of Short-Dot codes
Claim: Every $K$ rows of $F$ linearly span the row space of $A$.
<details>
<summary>Proof</summary>
Since $B$ is super-regular, it is also MDS, i.e., every $K\times K$ submatrix of $B$ is invertible.
Hence, every row of $A$ can be represented as a linear combination of any $K$ rows of $F$.
That is, for every $\mathcal{X}\subseteq[P]$ with $|\mathcal{X}|=K$, we have $\tilde{A}=(B^{\mathcal{X}})^{-1}F^{\mathcal{X}}$, where $B^{\mathcal{X}}$ and $F^{\mathcal{X}}$ consist of the rows indexed by $\mathcal{X}$.
</details>
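A sketch of this construction and recovery in Python over $\mathbb{R}$, using a Cauchy matrix as the super-regular $B$ (dimensions are illustrative; $Z$ is random here, so $F$ is not yet sparse — the sparsity-inducing choice of $Z$ comes below):

```python
import numpy as np

M, N, P, K = 3, 8, 7, 5                # A is M x N; recovery threshold K > M
rng = np.random.default_rng(1)
A = rng.standard_normal((M, N))

# Super-regular B in R^{P x K}: a Cauchy matrix B[i, j] = 1 / (x_i - y_j)
# with distinct x's and y's; every square submatrix is invertible.
xs = np.arange(P, dtype=float)
ys = np.arange(K, dtype=float) + 0.5
B = 1.0 / (xs[:, None] - ys[None, :])

Z = rng.standard_normal((K - M, N))    # arbitrary for now
A_tilde = np.vstack([A, Z])            # stack Z below A
F = B @ A_tilde                        # shape (P, N); row i is sent to worker i

# Recovery from any K rows: A_tilde = (B^X)^{-1} F^X, then keep the first M rows.
X = [0, 2, 3, 5, 6]
A_rec = np.linalg.solve(B[X], F[X])[:M]
assert np.allclose(A_rec, A)
```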
What about the sparsity of $F$?
- Want each row of $F$ to be sparse.
#### Sparsity of Short-Dot codes
Build a $P\times P$ square matrix in which every row and column contains exactly $P-K+M$ non-zero entries (e.g., a cyclic pattern).
Concatenate $\frac{N}{P}$ such matrices horizontally to obtain the support pattern of $F$; each row then contains $\frac{N}{P}(P-K+M)=\frac{P-K+M}{P}\cdot N=s$ non-zero entries.
[Missing slides 18]
#### Construction of $Z$
We now investigate what $Z$ should look like to construct such a matrix $F$.
- Recall that each column of $F$ must contain $K-M$ zeros.
  - They are indexed by a set $\mathcal{U}\subseteq[P]$ with $|\mathcal{U}|=K-M$.
- Let $B^{\mathcal{U}}\in \mathbb{F}^{(K-M)\times K}$ be the submatrix of $B$ containing the rows indexed by $\mathcal{U}$.
- Since $F=B\tilde{A}$, it follows that $F_j=B\tilde{A}_j$, where $F_j$ and $\tilde{A}_j$ are the $j$-th columns of $F$ and $\tilde{A}$.
- Hence, we need $B^{\mathcal{U}}\tilde{A}_j=0_{(K-M)\times 1}$.
- Split $B^{\mathcal{U}}=[B^{\mathcal{U}}_{[1,M]}|B^{\mathcal{U}}_{[M+1,K]}]$ and $\tilde{A}_j=[A_j^\top|Z_j^\top]^\top$; then $B^{\mathcal{U}}\tilde{A}_j=B^{\mathcal{U}}_{[1,M]}A_j+B^{\mathcal{U}}_{[M+1,K]}Z_j=0_{(K-M)\times 1}$.
- Solving gives $Z_j=-(B^{\mathcal{U}}_{[M+1,K]})^{-1}B^{\mathcal{U}}_{[1,M]}A_j$.
- Note that $B^{\mathcal{U}}_{[M+1,K]}\in \mathbb{F}^{(K-M)\times(K-M)}$ is invertible, since $B$ is super-regular.
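A sketch of this $Z_j$ computation in Python (continuing the Cauchy-matrix example above; the cyclic choice of the zero sets $\mathcal{U}$ is ours, purely to illustrate the formula):

```python
import numpy as np

M, N, P, K = 3, 6, 7, 5
rng = np.random.default_rng(2)
A = rng.standard_normal((M, N))

xs = np.arange(P, dtype=float)
ys = np.arange(K, dtype=float) + 0.5
B = 1.0 / (xs[:, None] - ys[None, :])        # super-regular Cauchy matrix, P x K

# For each column j, choose the K - M rows of F that should be zero.
# A real Short-Dot construction staggers these sets so every ROW of F is
# sparse as well; here we just rotate a window of size K - M.
Z = np.empty((K - M, N))
for j in range(N):
    U = [(j + t) % P for t in range(K - M)]  # zero positions of column j
    B_left, B_right = B[U, :M], B[U, M:]     # B^U split into [1, M] and [M+1, K]
    Z[:, j] = -np.linalg.solve(B_right, B_left @ A[:, j])

F = B @ np.vstack([A, Z])
for j in range(N):
    U = [(j + t) % P for t in range(K - M)]
    assert np.allclose(F[U, j], 0.0)         # the prescribed zeros appear in F
```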

View File

@@ -26,4 +26,5 @@ export default {
CSE5313_L20: "CSE5313 Coding and information theory for data science (Lecture 20)",
CSE5313_L21: "CSE5313 Coding and information theory for data science (Lecture 21)",
CSE5313_L22: "CSE5313 Coding and information theory for data science (Lecture 22)",
CSE5313_L23: "CSE5313 Coding and information theory for data science (Lecture 23)",
}