# CSE5313 Coding and information theory for data science (Lecture 23)

## Coded Computing

### Motivation

Some facts:
- Moore's law is saturating.
- Improving CPU performance is hard.
- Modern datasets are growing remarkably large.
  - E.g., TikTok, YouTube.
- Learning tasks are computationally heavy.
  - E.g., training neural networks.

Solution: Distributed Computing for Scalability
- Offloading computation tasks to multiple computation nodes.
- Gathering and accumulating the computation results.
- E.g., Apache Hadoop, Apache Spark, MapReduce.

### General Framework

- The system involves 1 master node and $P$ worker nodes.
- The master has a dataset $D$ and wants $f(D)$, where $f$ is some function.
- The master partitions $D=(D_1,\cdots,D_P)$ and sends $D_i$ to node $i$.
- Every node $i$ computes $g(D_i)$, where $g$ is some function.
- Finally, the master collects $g(D_1),\cdots,g(D_P)$ and computes $f(D)=h(g(D_1),\cdots,g(D_P))$, where $h$ is some function.

#### Challenges

Stragglers
- Nodes that are significantly slower than the others.

Adversaries
- Nodes that return erroneous results.
  - Computation/communication errors.
  - Adversarial attacks.

Privacy
- Nodes may be curious about the dataset.

### Resemblance to a communication channel

Suppose $f,g=\operatorname{id}$, and let $D=(D_1,\cdots,D_P)\in \mathbb{F}^P$ be a message.
- Each $D_i$ is a field element.
- $\mathbb{F}$ could be $\mathbb{R}$, $\mathbb{C}$, or a finite field $\mathbb{F}_q$.

Observation: this is a distributed storage system.
- An erasure: a node that does not respond.
- An error: a node that returns erroneous results.

Solution:
- Add redundancy to the message, i.e., error-correcting codes.

### Coded Distributed Computing

- The master partitions $D$ and encodes it before sending to the $P$ workers.
- Workers perform computations on the coded data $\tilde{D}$ and generate coded results $g(\tilde{D})$.
- The master decodes the coded results and obtains $f(D)=h(g(\tilde{D}))$.

### Outline

Matrix-Vector Multiplication
- MDS codes.
- Short-Dot codes.

Matrix-Matrix Multiplication
- Polynomial codes.
- MatDot codes.

Polynomial Evaluation
- Lagrange codes.
- Application to blockchain.

### Trivial solution - replication

Setting: the master holds $A\in\mathbb{F}^{M\times N}$ and an input vector $x\in\mathbb{F}^N$, and wants $y=A\cdot x$.

Why is there no straggler tolerance?
- We employ an individual worker node $i$ to compute $y_i=(a_{i1},\ldots,a_{iN})\cdot (x_1,\ldots,x_N)^\top$.
- A single slow node then delays the entire computation.

Replicate the computation?
- Let $r+1$ nodes compute every $y_i$.
- We need $P=rM+M$ worker nodes to tolerate $r$ erasures and $\lfloor \frac{r}{2}\rfloor$ adversaries.

### Use of MDS codes

Warm-up: let $2\mid M$ and $P=3$, and let $A_1,A_2$ be submatrices of $A$ such that $A=[A_1^\top|A_2^\top]^\top$.
- Worker node 1 computes $A_1\cdot x$.
- Worker node 2 computes $A_2\cdot x$.
- Worker node 3 computes $(A_1+A_2)\cdot x$.

Observation: the results can be obtained from any two worker nodes.

In general, let $G\in \mathbb{F}^{M\times P}$ be the generator matrix of a $(P,M)$ MDS code. The master node computes $F=G^\top A\in \mathbb{F}^{P\times N}$, and every worker node $i$ computes $F_i\cdot x$.
- $F_i=(G^\top A)_i$ is the $i$-th row of $G^\top A$.

Notice that $Fx=G^\top A\cdot x=G^\top y$ is the codeword of $y$. Node $i$ computes one entry of this codeword, so one response = one entry of the codeword. The master does **not** need all workers to respond to obtain $y$.
- The MDS property allows decoding $y$ from any $M$ responses.
- This scheme tolerates $P-M$ erasures, and the recovery threshold is $K=M$.
- We need $P=r+M$ worker nodes to tolerate $r$ stragglers or $\lfloor\frac{r}{2}\rfloor$ adversaries.
- With replication, we would need $P=rM+M$ worker nodes.
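To make the MDS scheme concrete, here is a minimal numpy sketch. The dimensions, the real field, and the Vandermonde generator are my own illustrative choices, not taken from the lecture: the master encodes $A$, each worker returns one inner product, and the master decodes $y=Ax$ from any $M$ responses.

```python
import numpy as np

# Toy dimensions, chosen only for illustration: A is M x N, P workers.
M, N, P = 4, 6, 6
rng = np.random.default_rng(0)
A = rng.standard_normal((M, N))   # the (trained) model held by the master
x = rng.standard_normal(N)        # the input vector

# Real Vandermonde generator G in F^{M x P}: distinct evaluation points make
# every M x M submatrix invertible, i.e., G generates a (P, M) MDS code.
alphas = np.arange(1.0, P + 1)
G = np.vander(alphas, M, increasing=True).T      # shape (M, P)

F = G.T @ A                                      # encoded matrix, shape (P, N)

# Worker i computes a single inner product F[i] . x and returns it.
responses = {i: F[i] @ x for i in range(P)}

# Suppose only M workers respond (the others are stragglers).
survivors = [0, 2, 3, 5]                         # any M indices suffice
Gs = G[:, survivors]                             # M x M, invertible by the MDS property
codeword_entries = np.array([responses[i] for i in survivors])   # = Gs^T @ y
y = np.linalg.solve(Gs.T, codeword_entries)      # decode y = A @ x

assert np.allclose(y, A @ x)
```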
#### Potential improvements for MDS codes

- The matrix $A$ is usually a (trained) model, and $x$ is the data (feature vector).
- $x$ is transmitted frequently, while the rows of $A$ (or $G^\top A$) are communicated in advance.
- Every worker needs to receive the entire $x$ and compute the dot product.
  - Communication-heavy.
- Can we design a scheme in which every node receives only a part of $x$?

### Short-Dot codes

[link to paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8758338)

We want to create a matrix $F\in \mathbb{F}^{P\times N}$ from $A$ such that:
- Every node computes $F_i\cdot x$.
- Every $K$ rows linearly span the row space of $A$.
- Each row of $F$ contains at most $s$ non-zero entries.

In the MDS method, $F=G^\top A$:
- The recovery threshold is $K=M$.
- Every worker node needs to receive $s=N$ symbols (the entire $x$).

No free lunch: can we trade the recovery threshold $K$ for a smaller $s$?
- Every worker node receives fewer than $N$ symbols.
- The master will need more than $M$ responses to recover the computation result.

#### Construction of Short-Dot codes

Choose a super-regular matrix $B\in \mathbb{F}^{P\times K}$, where $P$ is the number of worker nodes.
- A matrix is super-regular if every square submatrix is invertible.
- Lagrange/Cauchy matrices are super-regular (next lecture).

Create a matrix $\tilde{A}$ by stacking some $Z\in \mathbb{F}^{(K-M)\times N}$ below the matrix $A$, and let $F=B\cdot \tilde{A}\in \mathbb{F}^{P\times N}$.

**Short-Dot**: create a matrix $F\in \mathbb{F}^{P\times N}$ such that:
- Every $K$ rows linearly span the row space of $A$.
- Each row of $F$ contains at most $s=\frac{P-K+M}{P}\cdot N$ non-zero entries (sparse).
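As a sanity check on the super-regularity requirement, the following small numpy sketch (parameters are my own illustrative choices) builds a Cauchy matrix over the reals, one concrete choice of $B$, and verifies by brute force that every square submatrix is invertible.

```python
import itertools
import numpy as np

# Toy parameters, assumed for illustration: P workers, recovery threshold K.
P, K = 6, 4

# Cauchy matrix B[i, j] = 1 / (a_i - b_j), with all a_i and b_j distinct.
a = np.arange(1, P + 1, dtype=float)
b = -np.arange(1, K + 1, dtype=float)
B = 1.0 / (a[:, None] - b[None, :])              # shape (P, K)

# Super-regularity: every square submatrix is invertible.
# For these small sizes we can simply enumerate all square submatrices.
for size in range(1, K + 1):
    for rows in itertools.combinations(range(P), size):
        for cols in itertools.combinations(range(K), size):
            sub = B[np.ix_(rows, cols)]
            assert np.linalg.matrix_rank(sub) == size
print("every square submatrix of B is invertible")
```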
#### Recovery of Short-Dot codes

Claim: every $K$ rows of $F$ linearly span the row space of $A$.

Proof: Since $B$ is super-regular, it is in particular MDS, i.e., every $K\times K$ submatrix of $B$ is invertible. For every $\mathcal{X}\subseteq[P]$ with $|\mathcal{X}|=K$, the corresponding rows satisfy $F^{\mathcal{X}}=B^{\mathcal{X}}\tilde{A}$, and hence $\tilde{A}=(B^{\mathcal{X}})^{-1}F^{\mathcal{X}}$, where $B^{\mathcal{X}}$ and $F^{\mathcal{X}}$ denote the rows of $B$ and $F$ indexed by $\mathcal{X}$. In particular, every row of $A$ (the first $M$ rows of $\tilde{A}$) is a linear combination of the $K$ rows of $F$ indexed by $\mathcal{X}$.
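A minimal numpy sketch of this recovery (the toy dimensions, the real field, and the Cauchy $B$ are my own illustrative assumptions; $Z$ is left random here since the claim holds for any $Z$): the master recovers $\tilde{A}x$, and hence $y=Ax$, from any $K$ responses.

```python
import numpy as np

# Assumed toy dimensions: A is M x N, P workers, recovery threshold K > M.
M, N, P, K = 3, 8, 6, 4
rng = np.random.default_rng(1)
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

# Super-regular Cauchy matrix B in F^{P x K}.
a = np.arange(1, P + 1, dtype=float)
b = -np.arange(1, K + 1, dtype=float)
B = 1.0 / (a[:, None] - b[None, :])

# Stack some Z below A (random here; the sparsity-inducing Z is derived next).
Z = rng.standard_normal((K - M, N))
A_tilde = np.vstack([A, Z])                      # shape (K, N)
F = B @ A_tilde                                  # shape (P, N); row i goes to worker i

# Worker i returns the scalar F[i] . x; suppose only K workers respond.
responses = {i: F[i] @ x for i in range(P)}
X = [1, 2, 4, 5]                                 # any K surviving workers
FXx = np.array([responses[i] for i in X])        # = B^X @ A_tilde @ x
A_tilde_x = np.linalg.solve(B[X], FXx)           # invert the K x K submatrix B^X
y = A_tilde_x[:M]                                # the first M entries are y = A @ x

assert np.allclose(y, A @ x)
```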
What about the sparsity of $F$?
- We want each row of $F$ to be sparse.

#### Sparsity of Short-Dot codes

Build a $P\times P$ square matrix in which every row/column contains $P-K+M$ non-zero entries. Concatenate $\frac{N}{P}$ such matrices to obtain the non-zero pattern of $F$: each row of $F$ then has $s=\frac{P-K+M}{P}\cdot N$ non-zero entries, and each column has $K-M$ zeros. [Missing slides 18]

We now investigate what $Z$ should look like to construct such a matrix $F$.
- Recall that each column of $F$ must contain $K-M$ zeros.
  - They are indexed by a set $\mathcal{U}\subseteq[P]$ with $|\mathcal{U}|=K-M$.
  - Let $B^{\mathcal{U}}\in\mathbb{F}^{(K-M)\times K}$ be the submatrix of $B$ containing the rows indexed by $\mathcal{U}$.
- Since $F=B\tilde{A}$, it follows that $F_j=B\tilde{A}_j$, where $F_j$ and $\tilde{A}_j$ are the $j$-th columns of $F$ and $\tilde{A}$.
- Next, we require $B^{\mathcal{U}}\tilde{A}_j=0_{(K-M)\times 1}$.
- Split $B^{\mathcal{U}}=[B^{\mathcal{U}}_{[1,M]},B^{\mathcal{U}}_{[M+1,K]}]$ and $\tilde{A}_j=[A_j^\top,Z_j^\top]^\top$.
- Then $B^{\mathcal{U}}\tilde{A}_j=B^{\mathcal{U}}_{[1,M]}A_j+B^{\mathcal{U}}_{[M+1,K]}Z_j=0_{(K-M)\times 1}$.
- Hence $Z_j=-(B^{\mathcal{U}}_{[M+1,K]})^{-1}B^{\mathcal{U}}_{[1,M]}A_j$.
- Note that $B^{\mathcal{U}}_{[M+1,K]}\in\mathbb{F}^{(K-M)\times(K-M)}$ is invertible.
  - Since $B$ is super-regular.
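Finally, a numpy sketch of this choice of $Z$. The dimensions, the real Cauchy $B$, and the cyclic zero pattern are my own illustrative assumptions (the lecture's exact pattern is on the missing slide): each column of $Z$ is computed from the formula above, and the resulting $F$ has the prescribed zeros.

```python
import numpy as np

# Assumed toy dimensions: A is M x N, P workers, recovery threshold K > M.
M, N, P, K = 2, 6, 6, 4
rng = np.random.default_rng(2)
A = rng.standard_normal((M, N))

# Super-regular Cauchy matrix B in F^{P x K}.
a = np.arange(1, P + 1, dtype=float)
b = -np.arange(1, K + 1, dtype=float)
B = 1.0 / (a[:, None] - b[None, :])

# For each column j of F, pick the set U_j of K - M rows forced to zero.
# A cyclic pattern (my own choice) spreads the zeros evenly over the P rows.
U = [[(j + t) % P for t in range(K - M)] for j in range(N)]

# Solve B^{U_j} @ [A_j; Z_j] = 0 for Z_j, one column at a time.
Z = np.zeros((K - M, N))
for j in range(N):
    BU = B[U[j], :]                              # (K - M) x K
    BU_left, BU_right = BU[:, :M], BU[:, M:]     # split after the first M columns
    Z[:, j] = -np.linalg.solve(BU_right, BU_left @ A[:, j])

F = B @ np.vstack([A, Z])                        # shape (P, N)

# Column j of F vanishes on the rows in U_j, so with this cyclic pattern each
# row of F keeps at most (P - K + M) * N / P non-zero entries, and worker i
# only needs the entries of x where its row of F is non-zero.
for j in range(N):
    assert np.allclose(F[U[j], j], 0.0)
```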