CSE5313 Coding and information theory for data science (Lecture 23)
Coded Computing
Motivation
Some facts:
- Moore's law is saturating.
- Improving CPU performance is hard.
- Modern datasets are growing remarkably large.
- E.g., TikTok, YouTube.
- Learning tasks are computationally heavy.
- E.g., training neural networks.
Solution: Distributed Computing for Scalability
- Offloading computation tasks to multiple computation nodes.
- Gather and accumulate computation results.
- E.g., Apache Hadoop, Apache Spark, MapReduce.
General Framework
- The system involves 1 master node and P worker nodes.
- The master has a dataset D and wants f(D), where f is some function.
- The master partitions D=(D_1,\cdots,D_P) and sends D_i to node i.
- Every node i computes g(D_i), where g is some function.
- Finally, the master collects g(D_1),\cdots,g(D_P) and computes f(D)=h(g(D_1),\cdots,g(D_P)), where h is some function.
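A minimal sketch of this framework in Python (the sum-of-squares task, the function names, and the local simulation of workers are illustrative assumptions, not from the lecture):

```python
import numpy as np

# Hypothetical example: f(D) = sum of squares of all entries of D.
# g(D_i) = sum of squares of partition D_i; h adds the partial results.

def g(D_i):
    return np.sum(D_i ** 2)          # local computation at worker i

def h(partial_results):
    return sum(partial_results)      # aggregation at the master

P = 4                                 # number of worker nodes
D = np.arange(12, dtype=float)        # the master's dataset
partitions = np.array_split(D, P)     # D = (D_1, ..., D_P)

# "Send" D_i to worker i and collect g(D_i); workers are simulated locally here.
results = [g(D_i) for D_i in partitions]
f_D = h(results)
assert f_D == np.sum(D ** 2)
```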
Challenges
Stragglers
- Nodes that are significantly slower than the others.
Adversaries
- Nodes that return erroneous results.
- Computation/communication errors.
- Adversarial attacks.
Privacy
- Nodes may be curious about the dataset.
Resemblance to communication channel
Suppose f,g=\operatorname{id}, and let D=(D_1,\cdots,D_P)\in \mathbb{F}^P be a message.
Each D_i is a field element; \mathbb{F} could be \mathbb{R}, \mathbb{C}, or \mathbb{F}_q.
Observation: This is a distributed storage system.
- An erasure - node that does not respond.
- An error - node that returns erroneous results.
Solution:
- Add redundancy to the message
- Error-correcting codes.
Coded Distributed Computing
- The master partitions D and encodes it before sending to the P workers.
- Workers perform computations on the coded data \tilde{D} and generate coded results g(\tilde{D}).
- The master decodes the coded results and obtains f(D)=h(g(\tilde{D})).
Outline
Matrix-Vector Multiplication
- MDS codes.
- Short-Dot codes.
Matrix-Matrix Multiplication
- Polynomial codes.
- MatDot codes.
Polynomial Evaluation
- Lagrange codes.
- Application to Blockchain.
Trivial solution - replication
Setup: the master wants to compute y=Ax, where A\in\mathbb{F}^{M\times N} and x\in\mathbb{F}^{N}.
Why no straggler tolerance?
- We employ an individual worker node i to compute y_i=(a_{i1},\ldots,a_{iN})\cdot (x_1,\ldots,x_N)^T.
- If any single node fails to respond, the corresponding entry of y is lost.
Replicate the computation?
- Let r+1 nodes compute every y_i.
We need P=rM+M worker nodes to tolerate r erasures or \lfloor \frac{r}{2}\rfloor adversaries.
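As an illustrative count (numbers are mine): to tolerate r=2 stragglers with M=10 output entries, replication needs P=(r+1)M=30 workers, whereas the MDS scheme described next needs only P=M+r=12.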
Use of MDS codes
Let 2|M and P=3.
Let A_1,A_2 be submatrices of A such that A=[A_1^\top|A_2^\top]^\top.
- Worker node 1 computes A_1\cdot x.
- Worker node 2 computes A_2\cdot x.
- Worker node 3 computes (A_1+A_2)\cdot x.
Observation: the results can be obtained from any two worker nodes.
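A minimal numpy sketch of this three-worker example (the sizes and the local simulation of workers are illustrative):

```python
import numpy as np

M, N = 4, 6                                   # A is M x N with M even, P = 3 workers
A = np.random.randn(M, N)
x = np.random.randn(N)

A1, A2 = A[: M // 2], A[M // 2 :]             # A = [A1; A2]
results = {1: A1 @ x, 2: A2 @ x, 3: (A1 + A2) @ x}

# Any two responses suffice: e.g., if worker 2 straggles,
# recover A2 x as (A1 + A2) x - A1 x.
y = np.concatenate([results[1], results[3] - results[1]])
assert np.allclose(y, A @ x)
```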
Let G\in \mathbb{F}^{M\times P} be the generator matrix of a (P,M) MDS code.
The master node computes F=G^\top A\in \mathbb{F}^{P\times N}.
Every worker node i computes F_i\cdot x.
F_i=(G^\top A)_i is the i-th row of G^\top A.
Notice that Fx=G^\top A\cdot x=G^\top y is the codeword of y.
Node i computes an entry in this codeword.
1 response = 1 entry of the codeword.
The master does not need all workers to respond to obtain y.
- The MDS property allows decoding from any M of the y_i's.
- This scheme tolerates P-M erasures, and the recovery threshold is K=M.
- We need P=r+M worker nodes to tolerate r stragglers or \frac{r}{2} adversaries.
- With replication, we need P=rM+M worker nodes.
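A hedged numpy sketch of the general scheme over the reals, assuming a Vandermonde matrix with distinct evaluation points as the generator G (so every M\times M submatrix is invertible, giving the MDS property); the sizes and the straggler pattern are illustrative:

```python
import numpy as np

M, N, P = 4, 8, 7                     # decode from any M of P = M + r workers (r = 3)
A = np.random.randn(M, N)
x = np.random.randn(N)

# G in F^{M x P}: Vandermonde on distinct points, so every M x M submatrix is invertible.
alphas = np.arange(1, P + 1, dtype=float)
G = np.vander(alphas, M, increasing=True).T          # shape (M, P)

F = G.T @ A                                          # F in F^{P x N}; row i is sent to worker i
worker_results = F @ x                               # worker i computes F_i @ x, an entry of G^T y

# Suppose only the workers in `responders` reply; any M of them suffice.
responders = [0, 2, 3, 6]
y = np.linalg.solve(G[:, responders].T, worker_results[responders])
assert np.allclose(y, A @ x)
```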
Potential improvements for MDS codes
- The matrix A is usually a (trained) model, and x is the data (feature vector).
- x is transmitted frequently, while the rows of A (or G^\top A) are communicated in advance.
- Every worker needs to receive the entire x and compute the dot product.
- Communication-heavy.
- Can we design a scheme that allows every node to receive only a part of x?
Short-Dot codes
We want to create a matrix F\in \mathbb{F}^{P\times N} from A such that:
- Every node computes F_i\cdot x.
- Every K rows linearly span the row space of A.
- Each row of F contains at most s non-zero entries.
In the MDS method, F=G^\top A.
- The recovery threshold K=M.
- Every worker node needs to receive s=N symbols (the entire x).
No free lunch
Can we trade the recovery threshold K for a smaller s?
- Every worker node receives fewer than N symbols.
- The master will need more than M responses to recover the computation result.
Construction of Short-Dot codes
Choose a super-regular matrix B\in \mathbb{F}^{P\times K}, where P is the number of worker nodes.
- A matrix is super-regular if every square submatrix is invertible.
- Lagrange/Cauchy matrix is super-regular (next lecture).
Create matrix \tilde{A} by stacking some Z\in \mathbb{F}^{(K-M)\times N} below matrix A.
Let F=B\cdot \tilde{A}\in \mathbb{F}^{P\times N}.
Short-Dot: create matrix F\in \mathbb{F}^{P\times N} such that:
- Every K rows linearly span the row space of A.
- Each row of F contains at most s=\frac{P-K+M}{P}\cdot N non-zero entries (sparse).
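As an illustrative instance (numbers are mine, not from the lecture): with P=10 workers, M=4, and recovery threshold K=7, each row of F has at most s=\frac{P-K+M}{P}\cdot N=0.7N non-zero entries, so every worker receives only 70% of x, at the cost of needing K=7 rather than M=4 responses.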
Recovery of Short-Dot codes
Claim: Every K rows of F linearly span the row space of A.
Proof
Since B is super-regular, it is also MDS, i.e., every K\times K submatrix of B is invertible.
Hence, every row of A can be represented as a linear combination of any K rows of F.
That is, for every \mathcal{X}\subseteq[P] with |\mathcal{X}|=K, we have \tilde{A}=(B^{\mathcal{X}})^{-1}F^{\mathcal{X}}, where B^{\mathcal{X}} and F^{\mathcal{X}} contain the rows indexed by \mathcal{X}.
What about the sparsity of F?
- We want each row of F to be sparse.
Sparsity of Short-Dot codes
Build a P\times P square matrix in which every row and column contains P-K+M non-zero entries.
Concatenate \frac{N}{P} such matrices horizontally to obtain a P\times N matrix whose rows each contain \frac{P-K+M}{P}\cdot N non-zero entries; this gives the sparsity pattern of F.
[Missing slides 18]
We now investigate what Z should look like to construct such a matrix F.
- Recall that each column of F must contain K-M zeros.
- They are indexed by a set \mathcal{U}\subseteq[P], where |\mathcal{U}|=K-M.
- Let B^{\mathcal{U}}\in \mathbb{F}^{(K-M)\times K} be the submatrix of B containing the rows indexed by \mathcal{U}.
- Since F=B\tilde{A}, it follows that F_j=B\tilde{A}_j, where F_j and \tilde{A}_j are the j-th columns of F and \tilde{A}.
- Next, we require B^{\mathcal{U}}\tilde{A}_j=0_{(K-M)\times 1}.
- Split B^{\mathcal{U}}=[B^{\mathcal{U}}_{[1,M]}\,|\,B^{\mathcal{U}}_{[M+1,K]}] and \tilde{A}_j=[A_j^\top, Z_j^\top]^\top.
- Then B^{\mathcal{U}}\tilde{A}_j=B^{\mathcal{U}}_{[1,M]}A_j+B^{\mathcal{U}}_{[M+1,K]}Z_j=0_{(K-M)\times 1}.
- Hence Z_j=-(B^{\mathcal{U}}_{[M+1,K]})^{-1}B^{\mathcal{U}}_{[1,M]}A_j.
- Note that B^{\mathcal{U}}_{[M+1,K]}\in \mathbb{F}^{(K-M)\times(K-M)} is invertible, since B is super-regular.
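Putting the construction together, here is a hedged numpy sketch over the reals; the real Cauchy matrix standing in for the super-regular B and the cyclic choice of the zero-pattern sets \mathcal{U}_j are my illustrative assumptions, not prescribed by the lecture:

```python
import numpy as np

M, N = 3, 16                                   # A is M x N
K, P = 5, 8                                    # recovery threshold K, P worker nodes
A = np.random.randn(M, N)
x = np.random.randn(N)

# Super-regular B in F^{P x K}: real Cauchy matrix B[i, j] = 1 / (u_i - v_j).
u = np.arange(P, dtype=float)
v = np.arange(K, dtype=float) + 0.5
B = 1.0 / (u[:, None] - v[None, :])

# For each column j, force K - M zeros in F_j at rows U_j by setting
# Z_j = -(B^{U_j}_{[M+1,K]})^{-1} B^{U_j}_{[1,M]} A_j.
Z = np.zeros((K - M, N))
zero_sets = []
for j in range(N):
    U = [(j * (K - M) + t) % P for t in range(K - M)]   # simple cyclic pattern (illustrative)
    zero_sets.append(U)
    B_U = B[U, :]                                        # (K-M) x K
    Z[:, j] = -np.linalg.solve(B_U[:, M:], B_U[:, :M] @ A[:, j])

A_tilde = np.vstack([A, Z])                    # K x N
F = B @ A_tilde                                # P x N; row i is sent to worker i

# Prescribed zeros => each worker's dot product touches only part of x.
# With this pattern each row of F has (P-K+M)/P * N = 12 of the N = 16 entries non-zero.
for j, U in enumerate(zero_sets):
    assert np.allclose(F[U, j], 0)

# Recovery: any K responses F_i . x determine A_tilde x, whose first M entries are A x.
responders = sorted(np.random.choice(P, K, replace=False))
tilde_y = np.linalg.solve(B[responders, :], (F @ x)[responders])
assert np.allclose(tilde_y[:M], A @ x)
```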