# CSE5313 Coding and information theory for data science (Lecture 23)
## Coded Computing

### Motivation

Some facts:
- Moore's law is saturating.
- Improving CPU performance is hard.
- Modern datasets are growing remarkably large.
  - E.g., TikTok, YouTube.
- Learning tasks are computationally heavy.
  - E.g., training neural networks.

Solution: Distributed Computing for Scalability

- Offloading computation tasks to multiple computation nodes.
- Gathering and accumulating computation results.
- E.g., Apache Hadoop, Apache Spark, MapReduce.
### General Framework

- The system involves 1 master node and $P$ worker nodes.
- The master has a dataset $D$ and wants $f(D)$, where $f$ is some function.
- The master partitions $D=(D_1,\cdots,D_P)$ and sends $D_i$ to node $i$.
- Every node $i$ computes $g(D_i)$, where $g$ is some function.
- Finally, the master collects $g(D_1),\cdots,g(D_P)$ and computes $f(D)=h(g(D_1),\cdots,g(D_P))$, where $h$ is some function.
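For concreteness, here is a minimal sketch of this framework in Python (an illustration, not from the lecture), where $f$ is the sum of the dataset, $g$ is a partial sum, and $h$ adds up the partial results:

```python
# Minimal sketch of the master/worker framework.
# Illustrative assumption: f is a global sum, g is a partial sum, h adds the parts.
from multiprocessing import Pool

def g(D_i):
    """Computation performed by a worker node on its partition."""
    return sum(D_i)

def h(partial_results):
    """Aggregation performed by the master."""
    return sum(partial_results)

def master(D, P=4):
    # Partition D into P parts and offload g to P worker processes.
    parts = [D[i::P] for i in range(P)]
    with Pool(P) as pool:
        partials = pool.map(g, parts)
    return h(partials)  # f(D) = h(g(D_1), ..., g(D_P))

if __name__ == "__main__":
    D = list(range(1000))
    assert master(D) == sum(D)  # f = sum in this toy example
```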
#### Challenges

Stragglers

- Nodes that are significantly slower than the others.

Adversaries

- Nodes that return erroneous results.
- Computation/communication errors.
- Adversarial attacks.

Privacy

- Nodes may be curious about the dataset.
### Resemblance to a communication channel

Suppose $f,g=\operatorname{id}$, and let $D=(D_1,\cdots,D_P)\in \mathbb{F}^P$ be a message.

- $D_i$ is a field element.
- $\mathbb{F}$ could be $\mathbb{R}$, $\mathbb{C}$, or a finite field $\mathbb{F}_q$.

Observation: This is a distributed storage system.

- An erasure: a node that does not respond.
- An error: a node that returns erroneous results.

Solution:

- Add redundancy to the message.
- Error-correcting codes.
### Coded Distributed Computing

- The master partitions $D$ and encodes it before sending to $P$ workers.
- Workers perform computations on coded data $\tilde{D}$ and generate coded results $g(\tilde{D})$.
- The master decodes the coded results and obtains $f(D)=h(g(\tilde{D}))$.

### Outline

Matrix-Vector Multiplication

- MDS codes.
- Short-Dot codes.

Matrix-Matrix Multiplication

- Polynomial codes.
- MatDot codes.

Polynomial Evaluation

- Lagrange codes.
- Application to blockchain.
### Trivial solution - replication

Setup: the master holds a matrix $A\in \mathbb{F}^{M\times N}$ and wants to compute $y=A\cdot x$ for a vector $x\in \mathbb{F}^{N}$.

Why no straggler tolerance?

- We employ an individual worker node $i$ to compute $y_i=(a_{i1},\ldots,a_{iN})\cdot (x_1,\ldots,x_N)^\top$.

Replicate the computation?

- Let $r+1$ nodes compute every $y_i$.

We need $P=rM+M$ worker nodes to tolerate $r$ erasures and $\lfloor \frac{r}{2}\rfloor$ adversaries.
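For example, computing $M=10$ entries while tolerating $r=2$ stragglers already requires $P=(r+1)\cdot 10=30$ worker nodes.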
### Use of MDS codes

Example: let $2\mid M$ and $P=3$.

Let $A_1,A_2$ be submatrices of $A$ such that $A=[A_1^\top|A_2^\top]^\top$.

- Worker node 1 computes $A_1\cdot x$.
- Worker node 2 computes $A_2\cdot x$.
- Worker node 3 computes $(A_1+A_2)\cdot x$.

Observation: the results can be obtained from any two worker nodes.
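For instance, if worker 1 straggles, the master recovers $A_1\cdot x=(A_1+A_2)\cdot x-A_2\cdot x$, and similarly for worker 2; if worker 3 straggles, $y=A\cdot x$ is obtained directly from workers 1 and 2.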
In general, let $G\in \mathbb{F}^{M\times P}$ be the generator matrix of a $(P,M)$ MDS code.

The master node computes $F=G^\top A\in \mathbb{F}^{P\times N}$.

Every worker node $i$ computes $F_i\cdot x$.

- $F_i=(G^\top A)_i$ is the $i$-th row of $G^\top A$.

Notice that $Fx=G^\top A\cdot x=G^\top y$ is the codeword encoding $y$.

Node $i$ computes an entry in this codeword.

$1$ response = $1$ entry of the codeword.

The master does **not** need all workers to respond to obtain $y$.

- The MDS property allows decoding from any $M$ of the workers' responses (codeword entries).
- This scheme tolerates $P-M$ erasures, and the recovery threshold is $K=M$.
- We need $P=r+M$ worker nodes to tolerate $r$ stragglers or $\lfloor \frac{r}{2}\rfloor$ adversaries.
- With replication, we need $P=rM+M$ worker nodes.
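The following is a minimal numerical sketch of this scheme (my own illustration, not from the lecture): it uses a real Vandermonde matrix as the MDS generator $G$, simulates missing workers, and decodes $y$ from any $M$ of the $P$ responses by solving a linear system.

```python
# Minimal sketch of MDS-coded matrix-vector multiplication over the reals.
# Assumption: G is a Vandermonde matrix with distinct nodes, so every
# M x M submatrix is invertible (MDS property).
import numpy as np

M, N, P = 4, 6, 7                      # A is M x N, P worker nodes
rng = np.random.default_rng(0)
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

alphas = np.arange(1, P + 1)           # distinct evaluation points
G = np.vander(alphas, M, increasing=True).T   # generator matrix, M x P
F = G.T @ A                            # row i of F is pre-shared with worker i

# Each worker i computes the scalar F_i @ x; suppose only some workers reply.
responses = {i: F[i] @ x for i in range(P)}
responders = [0, 2, 3, 6]              # any M = 4 responders suffice

# Decode: the responses are entries of G^T y, so solve G[:, responders]^T y = b.
G_sub = G[:, responders].T
b = np.array([responses[i] for i in responders])
y_hat = np.linalg.solve(G_sub, b)

assert np.allclose(y_hat, A @ x)       # y recovered despite P - M missing workers
```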
#### Potential improvements for MDS codes

- The matrix $A$ is usually a (trained) model, and $x$ is the data (feature vector).
- $x$ is transmitted frequently, while the rows of $A$ (or $G^\top A$) are communicated in advance.
- Every worker needs to receive the entire $x$ and compute the dot product.
  - Communication-heavy.
- Can we design a scheme that allows every node to receive only a part of $x$?

### Short-Dot codes

[Link to paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8758338)
We want to create a matrix $F\in \mathbb{F}^{P\times N}$ from $A$ such that:

- Every node computes $F_i\cdot x$.
- Every $K$ rows of $F$ linearly span the row space of $A$.
- Each row of $F$ contains at most $s$ non-zero entries.

In the MDS method, $F=G^\top A$:

- The recovery threshold is $K=M$.
- Every worker node needs to receive $s=N$ symbols (the entire $x$).

No free lunch.

Can we trade the recovery threshold $K$ for a smaller $s$?

- Every worker node receives fewer than $N$ symbols.
- The master will need more than $M$ responses to recover the computation result.
#### Construction of Short-Dot codes

Choose a super-regular matrix $B\in \mathbb{F}^{P\times K}$, where $P$ is the number of worker nodes.

- A matrix is super-regular if every square submatrix is invertible.
- Lagrange/Cauchy matrices are super-regular (next lecture).

Create a matrix $\tilde{A}$ by stacking some $Z\in \mathbb{F}^{(K-M)\times N}$ below the matrix $A$.

Let $F=B\cdot \tilde{A}\in \mathbb{F}^{P\times N}$.

**Short-Dot**: create the matrix $F\in \mathbb{F}^{P\times N}$ such that:

- Every $K$ rows linearly span the row space of $A$.
- Each row of $F$ contains at most $s=\frac{P-K+M}{P}\cdot N$ non-zero entries (sparse).
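- For example, with $P=10$, $K=8$, and $M=5$, each row of $F$ has at most $s=\frac{10-8+5}{10}\cdot N=0.7N$ non-zero entries, so each worker receives only 70% of $x$, at the price of a larger recovery threshold $K=8>M=5$.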
#### Recovery of Short-Dot codes
Claim: Every $K$ rows of $F$ linearly span the row space of $A$.

<details>
<summary>Proof</summary>

Since $B$ is super-regular, it is also MDS, i.e., every $K\times K$ submatrix of $B$ is invertible.

Hence, every row of $A$ can be represented as a linear combination of any $K$ rows of $F$.

That is, for every $\mathcal{X}\subseteq[P]$ with $|\mathcal{X}|=K$, we have $\tilde{A}=(B^{\mathcal{X}})^{-1}F^{\mathcal{X}}$, where $B^{\mathcal{X}}$ and $F^{\mathcal{X}}$ are the submatrices consisting of the rows indexed by $\mathcal{X}$.

</details>

What about the sparsity of $F$?

- Want each row of $F$ to be sparse.
#### Sparsity of Short-Dot codes

Build a $P\times P$ square matrix in which each row/column contains $P-K+M$ non-zero entries.

Concatenate $\frac{N}{P}$ such matrices and obtain

[Missing slides 18]
We now investigate what $Z$ should look like to construct such a matrix $F$.

- Recall that each column of $F$ must contain $K-M$ zeros.
  - They are indexed by a set $\mathcal{U}\subseteq[P]$, where $|\mathcal{U}|=K-M$.
  - Let $B^{\mathcal{U}}\in \mathbb{F}^{(K-M)\times K}$ be the submatrix of $B$ containing the rows indexed by $\mathcal{U}$.
- Since $F=B\tilde{A}$, it follows that $F_j=B\tilde{A}_j$, where $F_j$ and $\tilde{A}_j$ are the $j$-th columns of $F$ and $\tilde{A}$.
- Next, we have $B^{\mathcal{U}}\tilde{A}_j=0_{(K-M)\times 1}$.
- Split $B^{\mathcal{U}}=[B^{\mathcal{U}}_{[1,M]},B^{\mathcal{U}}_{[M+1,K]}]$ and $\tilde{A}_j=[A_j^\top,Z_j^\top]^\top$.
- Then $B^{\mathcal{U}}\tilde{A}_j=B^{\mathcal{U}}_{[1,M]}A_j+B^{\mathcal{U}}_{[M+1,K]}Z_j=0_{(K-M)\times 1}$.
- Hence $Z_j=-(B^{\mathcal{U}}_{[M+1,K]})^{-1}B^{\mathcal{U}}_{[1,M]}A_j$.
- Note that $B^{\mathcal{U}}_{[M+1,K]}\in \mathbb{F}^{(K-M)\times(K-M)}$ is invertible.
  - Since $B$ is super-regular.
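The following is a minimal numerical sketch of this construction over the reals (my own illustration, not from the paper): $B$ is a Cauchy matrix, the zero set $\mathcal{U}_j$ of each column $j$ is chosen cyclically, $Z$ is filled in column by column using the formula above, and both the zero pattern and the recovery from any $K$ responses are checked numerically.

```python
# Minimal sketch of the Short-Dot construction over the reals (illustrative).
# Assumptions: B is a Cauchy matrix (hence super-regular), and the zero set
# U_j of column j is chosen cyclically so that each row of F ends up with
# (P-K+M)/P * N non-zero entries.
import numpy as np

M, N, P, K = 3, 8, 8, 6                    # A is M x N, P workers, threshold K
rng = np.random.default_rng(1)
A = rng.standard_normal((M, N))

# Cauchy matrix B in F^{P x K}: B[i, j] = 1 / (a_i - b_j) with distinct points.
a = np.arange(1, P + 1, dtype=float)
b = np.arange(P + 1, P + K + 1, dtype=float)
B = 1.0 / (a[:, None] - b[None, :])

A_tilde = np.zeros((K, N))
A_tilde[:M] = A
for j in range(N):
    U = [(j + t) % P for t in range(K - M)]           # rows of F zeroed in column j
    B_left, B_right = B[U, :M], B[U, M:]              # B^U_{[1,M]} and B^U_{[M+1,K]}
    A_tilde[M:, j] = -np.linalg.solve(B_right, B_left @ A[:, j])   # Z_j

F = B @ A_tilde                                       # F in F^{P x N}, one row per worker

# Column 0 is zero on rows U_0 = {0, 1, 2}; here each row keeps
# (P-K+M)/P * N = 5 non-zero entries out of N = 8.
assert np.allclose(F[[t % P for t in range(K - M)], 0], 0)

# Recovery: each responding worker i returns the scalar F_i . x;
# any K responses determine A_tilde x (hence y = A x) by a linear solve.
x = rng.standard_normal(N)
workers = [0, 2, 3, 4, 6, 7]                          # any K = 6 responders
responses = F[workers, :] @ x
A_tilde_x = np.linalg.solve(B[workers, :], responses)
assert np.allclose(A_tilde_x[:M], A @ x)
```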