# CSE5313 Coding and information theory for data science (Lecture 23)
## Coded Computing

### Motivation

Some facts:
- Moore's law is saturating.
- Improving CPU performance is hard.
- Modern datasets are growing remarkably large.
  - E.g., TikTok, YouTube.
- Learning tasks are computationally heavy.
  - E.g., training neural networks.

Solution: Distributed Computing for Scalability

- Offloading computation tasks to multiple computation nodes.
- Gathering and accumulating computation results.
- E.g., Apache Hadoop, Apache Spark, MapReduce.
### General Framework

- The system involves 1 master node and $P$ worker nodes.
- The master has a dataset $D$ and wants $f(D)$, where $f$ is some function.
- The master partitions $D=(D_1,\cdots,D_P)$ and sends $D_i$ to node $i$.
- Every node $i$ computes $g(D_i)$, where $g$ is some function.
- Finally, the master collects $g(D_1),\cdots,g(D_P)$ and computes $f(D)=h(g(D_1),\cdots,g(D_P))$, where $h$ is some function.
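For concreteness, here is a minimal sketch of this framework in Python (an illustration, not from the lecture), where $f$ is the sum of the dataset, $g$ is a partial sum, and $h$ adds up the partial results:

```python
# Minimal sketch of the master/worker framework.
# Illustrative assumption: f is a global sum, g is a partial sum, h adds the parts.
from multiprocessing import Pool

def g(D_i):
    """Computation performed by a worker node on its partition."""
    return sum(D_i)

def h(partial_results):
    """Aggregation performed by the master."""
    return sum(partial_results)

def master(D, P=4):
    # Partition D into P parts and offload g to P worker processes.
    parts = [D[i::P] for i in range(P)]
    with Pool(P) as pool:
        partials = pool.map(g, parts)
    return h(partials)  # f(D) = h(g(D_1), ..., g(D_P))

if __name__ == "__main__":
    D = list(range(1000))
    assert master(D) == sum(D)  # f = sum in this toy example
```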
#### Challenges

Stragglers

- Nodes that are significantly slower than the others.

Adversaries

- Nodes that return erroneous results.
- Computation/communication errors.
- Adversarial attacks.

Privacy

- Nodes may be curious about the dataset.
### Resemblance to a communication channel

Suppose $f,g=\operatorname{id}$, and let $D=(D_1,\cdots,D_P)\in \mathbb{F}^P$ be a message.

- $D_i$ is a field element.
- $\mathbb{F}$ could be $\mathbb{R}$, $\mathbb{C}$, or a finite field $\mathbb{F}_q$.

Observation: This is a distributed storage system.

- An erasure: a node that does not respond.
- An error: a node that returns erroneous results.

Solution:

- Add redundancy to the message.
- Error-correcting codes.
### Coded Distributed Computing

- The master partitions $D$ and encodes it before sending to $P$ workers.
- Workers perform computations on coded data $\tilde{D}$ and generate coded results $g(\tilde{D})$.
- The master decodes the coded results and obtains $f(D)=h(g(\tilde{D}))$.

### Outline

Matrix-Vector Multiplication

- MDS codes.
- Short-Dot codes.

Matrix-Matrix Multiplication

- Polynomial codes.
- MatDot codes.

Polynomial Evaluation

- Lagrange codes.
- Application to blockchain.
### Trivial solution - replication

Setup: the master holds a matrix $A\in \mathbb{F}^{M\times N}$ and wants to compute $y=A\cdot x$ for a vector $x\in \mathbb{F}^{N}$.

Why no straggler tolerance?

- We employ an individual worker node $i$ to compute $y_i=(a_{i1},\ldots,a_{iN})\cdot (x_1,\ldots,x_N)^\top$.

Replicate the computation?

- Let $r+1$ nodes compute every $y_i$.

We need $P=rM+M$ worker nodes to tolerate $r$ erasures and $\lfloor \frac{r}{2}\rfloor$ adversaries.
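For example, computing $M=10$ entries while tolerating $r=2$ stragglers already requires $P=(r+1)\cdot 10=30$ worker nodes.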
### Use of MDS codes

Example: let $2\mid M$ and $P=3$.

Let $A_1,A_2$ be submatrices of $A$ such that $A=[A_1^\top|A_2^\top]^\top$.

- Worker node 1 computes $A_1\cdot x$.
- Worker node 2 computes $A_2\cdot x$.
- Worker node 3 computes $(A_1+A_2)\cdot x$.

Observation: the results can be obtained from any two worker nodes.
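For instance, if worker 1 straggles, the master recovers $A_1\cdot x=(A_1+A_2)\cdot x-A_2\cdot x$, and similarly for worker 2; if worker 3 straggles, $y=A\cdot x$ is obtained directly from workers 1 and 2.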
In general, let $G\in \mathbb{F}^{M\times P}$ be the generator matrix of a $(P,M)$ MDS code.

The master node computes $F=G^\top A\in \mathbb{F}^{P\times N}$.

Every worker node $i$ computes $F_i\cdot x$.

- $F_i=(G^\top A)_i$ is the $i$-th row of $G^\top A$.

Notice that $Fx=G^\top A\cdot x=G^\top y$ is the codeword encoding $y$.

Node $i$ computes an entry in this codeword.

$1$ response = $1$ entry of the codeword.

The master does **not** need all workers to respond to obtain $y$.

- The MDS property allows decoding from any $M$ of the workers' responses (codeword entries).
- This scheme tolerates $P-M$ erasures, and the recovery threshold is $K=M$.
- We need $P=r+M$ worker nodes to tolerate $r$ stragglers or $\lfloor \frac{r}{2}\rfloor$ adversaries.
- With replication, we need $P=rM+M$ worker nodes.
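The following is a minimal numerical sketch of this scheme (my own illustration, not from the lecture): it uses a real Vandermonde matrix as the MDS generator $G$, simulates missing workers, and decodes $y$ from any $M$ of the $P$ responses by solving a linear system.

```python
# Minimal sketch of MDS-coded matrix-vector multiplication over the reals.
# Assumption: G is a Vandermonde matrix with distinct nodes, so every
# M x M submatrix is invertible (MDS property).
import numpy as np

M, N, P = 4, 6, 7                      # A is M x N, P worker nodes
rng = np.random.default_rng(0)
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

alphas = np.arange(1, P + 1)           # distinct evaluation points
G = np.vander(alphas, M, increasing=True).T   # generator matrix, M x P
F = G.T @ A                            # row i of F is pre-shared with worker i

# Each worker i computes the scalar F_i @ x; suppose only some workers reply.
responses = {i: F[i] @ x for i in range(P)}
responders = [0, 2, 3, 6]              # any M = 4 responders suffice

# Decode: the responses are entries of G^T y, so solve G[:, responders]^T y = b.
G_sub = G[:, responders].T
b = np.array([responses[i] for i in responders])
y_hat = np.linalg.solve(G_sub, b)

assert np.allclose(y_hat, A @ x)       # y recovered despite P - M missing workers
```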
#### Potential improvements for MDS codes

- The matrix $A$ is usually a (trained) model, and $x$ is the data (feature vector).
- $x$ is transmitted frequently, while the rows of $A$ (or $G^\top A$) are communicated in advance.
- Every worker needs to receive the entire $x$ and compute the dot product.
  - Communication-heavy.
- Can we design a scheme that allows every node to receive only a part of $x$?

### Short-Dot codes

[Link to paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8758338)
We want to create a matrix $F\in \mathbb{F}^{P\times N}$ from $A$ such that:

- Every node computes $F_i\cdot x$.
- Every $K$ rows of $F$ linearly span the row space of $A$.
- Each row of $F$ contains at most $s$ non-zero entries.

In the MDS method, $F=G^\top A$:

- The recovery threshold is $K=M$.
- Every worker node needs to receive $s=N$ symbols (the entire $x$).

No free lunch.

Can we trade the recovery threshold $K$ for a smaller $s$?

- Every worker node receives fewer than $N$ symbols.
- The master will need more than $M$ responses to recover the computation result.
#### Construction of Short-Dot codes

Choose a super-regular matrix $B\in \mathbb{F}^{P\times K}$, where $P$ is the number of worker nodes.

- A matrix is super-regular if every square submatrix is invertible.
- Lagrange/Cauchy matrices are super-regular (next lecture).

Create a matrix $\tilde{A}$ by stacking some $Z\in \mathbb{F}^{(K-M)\times N}$ below the matrix $A$.

Let $F=B\cdot \tilde{A}\in \mathbb{F}^{P\times N}$.

**Short-Dot**: create the matrix $F\in \mathbb{F}^{P\times N}$ such that:

- Every $K$ rows linearly span the row space of $A$.
- Each row of $F$ contains at most $s=\frac{P-K+M}{P}\cdot N$ non-zero entries (sparse).
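- For example, with $P=10$, $K=8$, and $M=5$, each row of $F$ has at most $s=\frac{10-8+5}{10}\cdot N=0.7N$ non-zero entries, so each worker receives only 70% of $x$, at the price of a larger recovery threshold $K=8>M=5$.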
#### Recovery of Short-Dot codes
Claim: Every $K$ rows of $F$ linearly span the row space of $A$.

<details>
<summary>Proof</summary>

Since $B$ is super-regular, it is also MDS, i.e., every $K\times K$ submatrix of $B$ is invertible.

Hence, every row of $A$ can be represented as a linear combination of any $K$ rows of $F$.

That is, for every $\mathcal{X}\subseteq[P]$ with $|\mathcal{X}|=K$, we have $\tilde{A}=(B^{\mathcal{X}})^{-1}F^{\mathcal{X}}$, where $B^{\mathcal{X}}$ and $F^{\mathcal{X}}$ are the submatrices consisting of the rows indexed by $\mathcal{X}$.

</details>

What about the sparsity of $F$?

- Want each row of $F$ to be sparse.
#### Sparsity of Short-Dot codes

Build a $P\times P$ square matrix in which each row/column contains $P-K+M$ non-zero entries.

Concatenate $\frac{N}{P}$ such matrices and obtain

[Missing slides 18]
We now investigate what $Z$ should look like to construct such a matrix $F$.

- Recall that each column of $F$ must contain $K-M$ zeros.
  - They are indexed by a set $\mathcal{U}\subseteq[P]$, where $|\mathcal{U}|=K-M$.
  - Let $B^{\mathcal{U}}\in \mathbb{F}^{(K-M)\times K}$ be the submatrix of $B$ containing the rows indexed by $\mathcal{U}$.
- Since $F=B\tilde{A}$, it follows that $F_j=B\tilde{A}_j$, where $F_j$ and $\tilde{A}_j$ are the $j$-th columns of $F$ and $\tilde{A}$.
- Next, we have $B^{\mathcal{U}}\tilde{A}_j=0_{(K-M)\times 1}$.
- Split $B^{\mathcal{U}}=[B^{\mathcal{U}}_{[1,M]},B^{\mathcal{U}}_{[M+1,K]}]$ and $\tilde{A}_j=[A_j^\top,Z_j^\top]^\top$.
- Then $B^{\mathcal{U}}\tilde{A}_j=B^{\mathcal{U}}_{[1,M]}A_j+B^{\mathcal{U}}_{[M+1,K]}Z_j=0_{(K-M)\times 1}$.
- Hence $Z_j=-(B^{\mathcal{U}}_{[M+1,K]})^{-1}B^{\mathcal{U}}_{[1,M]}A_j$.
- Note that $B^{\mathcal{U}}_{[M+1,K]}\in \mathbb{F}^{(K-M)\times(K-M)}$ is invertible.
  - Since $B$ is super-regular.
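The following is a minimal numerical sketch of this construction over the reals (my own illustration, not from the paper): $B$ is a Cauchy matrix, the zero set $\mathcal{U}_j$ of each column $j$ is chosen cyclically, $Z$ is filled in column by column using the formula above, and both the zero pattern and the recovery from any $K$ responses are checked numerically.

```python
# Minimal sketch of the Short-Dot construction over the reals (illustrative).
# Assumptions: B is a Cauchy matrix (hence super-regular), and the zero set
# U_j of column j is chosen cyclically so that each row of F ends up with
# (P-K+M)/P * N non-zero entries.
import numpy as np

M, N, P, K = 3, 8, 8, 6                    # A is M x N, P workers, threshold K
rng = np.random.default_rng(1)
A = rng.standard_normal((M, N))

# Cauchy matrix B in F^{P x K}: B[i, j] = 1 / (a_i - b_j) with distinct points.
a = np.arange(1, P + 1, dtype=float)
b = np.arange(P + 1, P + K + 1, dtype=float)
B = 1.0 / (a[:, None] - b[None, :])

A_tilde = np.zeros((K, N))
A_tilde[:M] = A
for j in range(N):
    U = [(j + t) % P for t in range(K - M)]           # rows of F zeroed in column j
    B_left, B_right = B[U, :M], B[U, M:]              # B^U_{[1,M]} and B^U_{[M+1,K]}
    A_tilde[M:, j] = -np.linalg.solve(B_right, B_left @ A[:, j])   # Z_j

F = B @ A_tilde                                       # F in F^{P x N}, one row per worker

# Column 0 is zero on rows U_0 = {0, 1, 2}; here each row keeps
# (P-K+M)/P * N = 5 non-zero entries out of N = 8.
assert np.allclose(F[[t % P for t in range(K - M)], 0], 0)

# Recovery: each responding worker i returns the scalar F_i . x;
# any K responses determine A_tilde x (hence y = A x) by a linear solve.
x = rng.standard_normal(N)
workers = [0, 2, 3, 4, 6, 7]                          # any K = 6 responders
responses = F[workers, :] @ x
A_tilde_x = np.linalg.solve(B[workers, :], responses)
assert np.allclose(A_tilde_x[:M], A @ x)
```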