This commit is contained in:
Zheyuan Wu
2025-10-16 12:46:47 -05:00
parent 59001fc539
commit 2d0dae4461
3 changed files with 298 additions and 0 deletions


@@ -0,0 +1,297 @@
# CSE5313 Coding and information theory for data science (Lecture 14)
## The repair problem
Main challenges:
- Locality (number of contacted servers)
- Bandwidth (number of bits transferred)
In the last lecture we built optimal Locally Recoverable Codes (LRCs) for storage systems.
Let $\mathcal{C}$ be an $[n, k]_q$ code that is an $r$-LRC with minimum distance $d$.
- Bound 1: $\frac{k}{n}\leq \frac{r}{r+1}$.
- Bound 2: $d\leq n-k-\frac{k}{r}+2$.
- Optimal LRC:
- Let $\mathcal{A} = \{\alpha_1, \ldots, \alpha_n\}\subseteq \mathbb{F}_q$, and partition $\mathcal{A}$ into sets $\mathcal{A}_i$ of size $r+1$ for $i=1,\ldots,\frac{n}{r+1}$.
- $g\in \mathbb{F}_q[x]$ is "good" if $\deg(g) = r+1$ and $g$ is constant on each $\mathcal{A}_i$.
- $\mathcal{C} = \{(f_a(\alpha_i))_{i=1}^{n} \mid a\in \mathbb{F}_q^k\}$, where
- $f_a(x) = \sum_{i=0}^{r-1} f_{a,i}(x)\cdot x^i$,
- $f_{a,i}(x) = \sum_{j=0}^{k/r-1} a_{i,j}\cdot g(x)^j$, where $g$ is a good polynomial.
- $\dim \mathcal{C} = k$ and $d = n-k-\frac{k}{r}+2$.
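
The construction above can be checked on a small instance. The sketch below is illustrative only: the parameters $q=13$, $n=9$, $k=4$, $r=2$, the partition, and the choice $g(x)=x^3$ are assumptions picked for the example, not values from the lecture. It encodes with $f_a$, repairs one symbol from the other $r$ symbols of its part by Lagrange interpolation, and finds the minimum distance by brute force.

```python
from itertools import product

# Illustrative parameters (assumed for this example, not from the lecture).
q, n, k, r = 13, 9, 4, 2
parts = [[1, 3, 9], [2, 6, 5], [4, 12, 10]]   # partition of the evaluation points

def g(x):
    """Candidate good polynomial g(x) = x^3: degree r + 1, constant on each part."""
    return pow(x, 3, q)

def f(a, x):
    """f_a(x) = sum_i sum_j a[i][j] * g(x)^j * x^i over F_q."""
    return sum(a[i][j] * pow(g(x), j, q) * pow(x, i, q)
               for i in range(r) for j in range(k // r)) % q

def encode(a):
    """Map each evaluation point to its codeword symbol."""
    return {x: f(a, x) for part in parts for x in part}

def repair(codeword, lost):
    """Recover the symbol at `lost` from the other r points of its part.

    On a single part g is constant, so f_a restricts to a polynomial of
    degree <= r - 1 and Lagrange interpolation through r points suffices.
    """
    part = next(p for p in parts if lost in p)
    helpers = [x for x in part if x != lost]
    total = 0
    for xj in helpers:
        num = den = 1
        for xm in helpers:
            if xm != xj:
                num = num * (lost - xm) % q
                den = den * (xj - xm) % q
        total += codeword[xj] * num * pow(den, -1, q)
    return total % q

def min_distance():
    """Brute-force minimum weight over all nonzero messages (the code is linear)."""
    best = n
    for flat in product(range(q), repeat=k):
        if any(flat):
            a = [flat[i * (k // r):(i + 1) * (k // r)] for i in range(r)]
            best = min(best, sum(1 for v in encode(a).values() if v))
    return best

codeword = encode([[1, 2], [3, 4]])
```

Calling `repair(codeword, x)` for any point `x` should return `codeword[x]`, and `min_distance()` can be compared against $n-k-\frac{k}{r}+2$.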
## Minimizing the repair bandwidth
Goal: understand repair bandwidth.
- What is the minimum repair bandwidth?
- Is repair bandwidth in trade-off with other parameters?
- Tool: The information flow graph.
Spoiler alert:
- Tradeoff: storage vs. repair bandwidth.
- Codes which achieve an optimal tradeoff.
### Information flow graph
We can model the repair problem as a directed graph.
- Source: System admin.
- Sink: Data collector.
- Nodes: Storage servers.
- Nodes leave/crash
- Newcomer replaces them $\to$ new nodes.
- Edges: represent transmission of information. (The weight of an edge is the number of $\mathbb{F}_q$ elements transmitted.)
Main observation:
- $k$ elements from $\mathbb{F}_q$ must "flow" from the source (system admin) to the sink (data collector).
- Any cut $(U,\overline{U})$ which separates source from sink must have capacity at least $k$.
Roadmap:
Information flow graph $\to$ Minimum cut analysis $\to$ Bound on file size $\to$ Storage/bandwidth tradeoff.
#### Basic definitions for information flow graph
> [!WARNING]
>
> This is not the same as definitions in linear codes. $k$ is not the dimension of the code and $d$ is not the minimum distance of the code for general cases.
Parameters:
- $n$ is the number of nodes in the initial system (before any node leaves/crashes).
- $k$ is the number of nodes required to reconstruct the file.
- $d$ is the number of nodes contacted to repair a failed node.
- $\alpha$ is the storage at each node.
- $\beta$ is the edge capacity for repair (the amount downloaded from each helper node).
- $B$ is the file size.
Goal: Find the tradeoff between $n,k,d,\alpha,\beta,B$ using min-cut analysis of the information flow graph.
Initial system:
We denote the system admin by $S$.
Servers are $1,2,\ldots,n$, each with edge capacity $\alpha$.
For each new server:
We have two nodes, $in$ and $out$, joined by an edge of weight $\alpha$.
The $in$ node is connected to the $out$ nodes of $d$ previous servers, each with edge capacity $\beta$.
Data collector: connects to $k$ arbitrary nodes, each with edge capacity $\alpha$.
Observe that:
- File size $B$.
- Any cut separating $S$ from $DC$ must have capacity at least $B$.
- Otherwise, two different files are indistinguishable to $DC$.
#### Bound on bandwidth
Claim: $mincut\geq \sum_{i=0}^{k-1}\min\{(d-i)\beta, \alpha\}$.
Intuition: Let $(U, \overline{U})$ be a cut separating $S$ from $DC$.
- The DC contacts the newest $k$ nodes, say $n_1,n_2,\ldots,n_k$.
- For each of these nodes, the cut crosses either its $\alpha$ edge or its incoming $\beta$ edges.
- For the $i$-th such node (counting from $i=0$), at least $d-i$ of its $\beta$ edges come from $U$.
<details>
<summary>Proof</summary>
Let $(U, \overline{U})$ be a cut separating $S$ from $DC$, with $S\in U$ and $DC\in \overline{U}$.
Every directed acyclic graph has a **topological sort**; order the nodes accordingly.
Let $x_{out}^1$ be the first $out$ node in $\overline{U}$. There are two cases:
- $x_{in}^1\in U$. Then the edge of capacity $\alpha$ from $x_{in}^1$ to $x_{out}^1$ is crossed.
- $x_{in}^1\in \overline{U}$. Then all $d$ incoming edges to $x_{in}^1$ are crossed. (Otherwise, one of them starts at an $out$ node in $\overline{U}$ that precedes $x_{out}^1$ in the topological sort, contradicting the choice of $x_{out}^1$.)
So $x_{out}^1$ contributes at least $\min\{d\beta, \alpha\}$ to the cut capacity.
Let $x_{out}^2$ be the second $out$ node in $\overline{U}$. There are two cases:
- $x_{in}^2\in U$. Then the edge of capacity $\alpha$ from $x_{in}^2$ to $x_{out}^2$ is crossed.
- $x_{in}^2\in \overline{U}$. Then at least $d-1$ incoming edges to $x_{in}^2$ are crossed: at most one of them can start at $x_{out}^1$, the only earlier $out$ node in $\overline{U}$, and every other one starts in $U$.
So $x_{out}^2$ contributes at least $\min\{(d-1)\beta, \alpha\}$ to the cut capacity.
Repeating this argument for the first $k$ $out$ nodes in $\overline{U}$, the $(i+1)$-th one contributes at least $\min\{(d-i)\beta, \alpha\}$, so the minimum cut capacity is at least $\sum_{i=0}^{k-1}\min\{(d-i)\beta, \alpha\}$.
</details>
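
The bound from the proof is easy to evaluate numerically. The helper below is a minimal sketch (the function name `cut_bound` and the sample parameters are made up for illustration). It sums the per-node contributions starting from $i=0$, so the first node contributes $\min\{d\beta,\alpha\}$ and the second $\min\{(d-1)\beta,\alpha\}$, as in the proof.

```python
def cut_bound(k, d, alpha, beta):
    """Evaluate sum_{i=0}^{k-1} min((d - i) * beta, alpha).

    Term i is the contribution of the (i+1)-th `out` node on the sink side:
    the cut crosses either its alpha edge or at least d - i beta edges.
    """
    return sum(min((d - i) * beta, alpha) for i in range(k))

# Hypothetical parameters: k = 3, d = 4, alpha = 3, beta = 1.
# The terms are min(4, 3), min(3, 3), min(2, 3).
bound = cut_bound(3, 4, 3, 1)
```

Increasing $\alpha$ beyond $d\beta$ no longer changes the value, since every term is then capped by its $\beta$ part.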
#### Storage/bandwidth tradeoff
Claim: There exists an information flow graph with $mincut = \sum_{i=0}^{k-1}\min\{(d-i)\beta, \alpha\}$.
Homework: Build this graph as follows:
- Construct the initial graph with $n$ nodes and the system admin.
- Add $n+k$ nodes, each node connects to the most recent $d$ nodes.
- Find the minimum cut capacity.
Corollary: $B\leq \sum_{i=0}^{k-1}\min\{(d-i)\beta, \alpha\}$.
#### Definition of regenerating code
A code which attains $B=\sum_{i=0}^{k-1}\min\{(d-i)\beta, \alpha\}$ is called a regenerating code.
Goal: Find the tradeoff between storage $\alpha$ and repair bandwidth $d\beta$.
Let $\gamma = d\beta$, then $B \leq \sum_{i=0}^{k-1}\min\{(1-i/d)\gamma, \alpha\}$.
Tool: Fix $\gamma$ and $d$, and minimize for $\alpha$ (not shown).
Result: The storage/bandwidth tradeoff.
- Each point on/above the line is feasible.
- Points on the line = regenerating codes.
- One endpoint: Minimum Bandwidth Regenerating (MBR) codes.
- Other endpoint: Minimum Storage Regenerating (MSR) codes.
![Storage/bandwidth tradeoff](https://notenextra.trance-0.com/CSE5313/Storage_bandwidth_tradeoff.png)
For Minimum Storage Regenerating (MSR) codes, we have $\alpha = \frac{B}{k}$, $\beta = \frac{B}{k(d-k+1)}$
For Minimum Bandwidth Regenerating (MBR) codes, we have $\alpha = \frac{dB}{kd-\frac{k(k-1)}{2}}$, $\beta = \frac{B}{kd-\frac{k(k-1)}{2}}$
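
As a sanity check, both endpoints can be plugged back into the bound $B \leq \sum_{i=0}^{k-1}\min\{(d-i)\beta, \alpha\}$. The sketch below uses exact rational arithmetic; the function names and the sample values $B=12$, $k=3$, $d=4$ are assumptions chosen for illustration.

```python
from fractions import Fraction

def cut_bound(k, d, alpha, beta):
    """sum_{i=0}^{k-1} min((d - i) * beta, alpha)."""
    return sum(min((d - i) * beta, alpha) for i in range(k))

def msr_point(B, k, d):
    """(alpha, beta) at the minimum-storage endpoint."""
    return Fraction(B, k), Fraction(B, k * (d - k + 1))

def mbr_point(B, k, d):
    """(alpha, beta) at the minimum-bandwidth endpoint."""
    denom = Fraction(k * d) - Fraction(k * (k - 1), 2)
    return d * B / denom, B / denom

B, k, d = 12, 3, 4                      # hypothetical file size and parameters
alpha_msr, beta_msr = msr_point(B, k, d)
alpha_mbr, beta_mbr = mbr_point(B, k, d)
```

At the MSR point every term of the sum equals $\alpha$, and at the MBR point every term equals $(d-i)\beta$, so in both cases the bound is met with equality.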
Notes:
- In MSR, $\alpha=B/k$: the data collector contacts $k$ nodes and downloads $B/k$ from each to reconstruct the file of size $B$, which is optimal.
- In MBR, $\gamma = d\beta = \alpha$: the newcomer downloads exactly what it stores, just as in replication, but with a much smaller storage overhead than replication.
Regenerating codes, Magic #1:
- MBR: Same repair-bandwidth as replication ($\alpha$), at lower storage costs.
- MSR: Same reconstruction-bandwidth ($B/k$) as replication, at lower storage costs.
Regenerating codes, Magic #2:
- In MSR: $\gamma = d\beta = \frac{dB}{k(d-k+1)}$, $\alpha = \frac{B}{k}$
- In MBR: $\gamma = d\beta = \frac{dB}{kd-\frac{k(k-1)}{2}}$, $\alpha = \frac{dB}{kd-\frac{k(k-1)}{2}}$
- In both cases $\gamma$ is a decreasing function of $d$.
- $\Rightarrow$ Less repair-bandwidth by contacting more nodes, minimized at $d = n - 1$.
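
The monotonicity in $d$ can be seen by tabulating $\gamma$ for every admissible $d$. This is a small sketch with made-up values ($B=60$, $n=10$, $k=5$); the function names are hypothetical.

```python
from fractions import Fraction

def gamma_msr(B, k, d):
    """Repair bandwidth d * beta at the MSR point."""
    return Fraction(d * B, k * (d - k + 1))

def gamma_mbr(B, k, d):
    """Repair bandwidth d * beta at the MBR point (numerator and denominator doubled)."""
    return Fraction(2 * d * B, 2 * k * d - k * (k - 1))

B, n, k = 60, 10, 5                       # hypothetical values, with k >= 2
ds = list(range(k, n))                    # admissible d: k <= d <= n - 1
msr = [gamma_msr(B, k, d) for d in ds]
mbr = [gamma_mbr(B, k, d) for d in ds]

for d, g_msr, g_mbr in zip(ds, msr, mbr):
    print(d, g_msr, g_mbr)
```

In this example each column shrinks as $d$ grows, and the smallest value appears at $d=n-1$.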
### Constructing Minimum bandwidth regenerating (MBR) codes from Maximum Distance Separable (MDS) codes
Observation: From an MBR code with parameters $n, k, d$ and $\beta = 1$, one can construct an MBR code with the same parameters $n, k, d$ and any $\beta$.
Next: Construct an MBR code for $[n, k, d = n - 1]$ and $\beta = 1$.
In any MBR code: $(\alpha, \beta) = \left(\frac{dB}{kd-\frac{k(k-1)}{2}}, \frac{B}{kd-\frac{k(k-1)}{2}}\right)$.
Specifically, for $\beta = 1$:
- Storage $\alpha = d\beta = d = n - 1$.
- File size $B = \left(kd - \binom{k}{2}\right)\beta = kd - \binom{k}{2}$.
Take an $[\binom{n}{2}, B]_q$ MDS code (e.g., Reed-Solomon).
Need $q\geq \binom{n}{2}$.
Consider a complete graph $K_n$ on $n$ nodes.
- $\binom{n}{2}$ edges.
- Place each codeword symbol on a distinct edge.
- Storage server $i$ stores all codeword symbols on the edges incident to node $i$.
- $\alpha = n - 1$.
#### Repairing on MBR codes
The newcomer replacing node $i$ contacts each node $j\neq i$
and downloads the symbol on edge $(i,j)$.
The repair bandwidth is $d\beta = n-1 = \alpha$, which is optimal.
#### Reconstruction on MBR codes
We use an $[\binom{n}{2}, B]_q$ MDS code, so any $B$ codeword symbols suffice to reconstruct the file.
Any $t$ nodes have $\binom{t}{2}$ edges between them, and $(n-1)t-2\binom{t}{2}$ edges to other nodes.
Overall they hold $(n-1)t-\binom{t}{2}$ distinct symbols. For $t=k$, we get $kd-\binom{k}{2}=B$.
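
The combinatorial part of this construction (which node holds which symbol) can be simulated without implementing the MDS code itself. The sketch below tracks symbols only by their edge label; the function names and the value $n=6$, $k=3$ are assumptions made for illustration.

```python
from itertools import combinations

def build_layout(n):
    """Place one codeword symbol on each edge of K_n; node i stores its incident edges."""
    edges = list(combinations(range(n), 2))
    storage = {i: {e for e in edges if i in e} for i in range(n)}
    return edges, storage

def repair_downloads(n, i):
    """Edge (symbol) that each helper j sends to the newcomer replacing node i."""
    return {j: tuple(sorted((i, j))) for j in range(n) if j != i}

def symbols_seen(storage, nodes):
    """Distinct symbols available to a data collector contacting `nodes`."""
    return set().union(*(storage[v] for v in nodes))

n, k = 6, 3                        # hypothetical system size
d = n - 1
B = k * d - k * (k - 1) // 2       # file size for beta = 1
edges, storage = build_layout(n)
```

Every $k$-subset of nodes should see exactly $B$ distinct symbols, which is what the MDS property needs for reconstruction, and the $n-1$ downloads for node $i$ should be exactly the set that node $i$ stored.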
### Constructing Minimum bandwidth regenerating (MBR) codes: Product-Matrix MBR codes
Recall: File size in MBR $B=kd-\binom{k}{2}=\binom{k+1}{2}+k(d-k)$.
Step 1: Arrange the $B=\binom{k+1}{2}+k(d-k)$ symbols in a matrix $M$ as follows:
$$
M=\begin{pmatrix}
S & T\\
T^T & 0
\end{pmatrix}\in \mathbb{F}_q^{d\times d}
$$
- $S$ is a $k\times k$ symmetric matrix containing $\binom{k+1}{2}$ symbols.
- $T$ is a $k\times (d-k)$ matrix containing $k(d-k)$ symbols.
So there are $B$ elements overall.
Step 2: Construct the encoding matrix $C=(\Psi,\Delta)\in \mathbb{F}_q^{n\times d}$
$\Psi\in \mathbb{F}_q^{n\times k}$ such that
- Any $k$ rows of $\Psi$ are linearly independent.
- Example: Vandermonde matrix.
$\Delta\in \mathbb{F}_q^{n\times (d-k)}$ such that
- Any $d$ rows of $C$ are linearly independent.
- Example: Complete $\Psi$ to a full $n\times d$ Vandermonde matrix.
Step 3: Encode the data $M\in \mathbb{F}_q^{d\times d}$ using the encoding matrix $C\in \mathbb{F}_q^{n\times d}$.
- Compute the product $CM$.
- Store the $i$-th row of $CM$ in node $i$.
- Note $CM=(\Psi,\Delta)M=(\Psi S+\Delta T^T, \Psi T)$.
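
Steps 1 to 3 can be sketched over a small prime field. Everything concrete below is an assumption for illustration: the field $\mathbb{F}_{11}$, the parameters $n=6$, $k=3$, $d=4$, the file symbols, and the choice of a Vandermonde matrix with evaluation points $1,\ldots,n$ for $C$.

```python
def build_message_matrix(symbols, k, d):
    """Arrange B = C(k+1, 2) + k(d-k) symbols into the symmetric d x d matrix M."""
    assert len(symbols) == k * (k + 1) // 2 + k * (d - k)
    it = iter(symbols)
    M = [[0] * d for _ in range(d)]
    for a in range(k):                 # S: fill the upper triangle and mirror it
        for b in range(a, k):
            M[a][b] = M[b][a] = next(it)
    for a in range(k):                 # T in the top-right block, T^T in the bottom-left
        for b in range(k, d):
            M[a][b] = M[b][a] = next(it)
    return M                           # the bottom-right (d-k) x (d-k) block stays zero

def vandermonde(n, d, p):
    """n x d Vandermonde matrix over F_p with evaluation points 1..n (requires n < p)."""
    return [[pow(x, j, p) for j in range(d)] for x in range(1, n + 1)]

def matmul(A, B, p):
    """Matrix product over F_p."""
    return [[sum(a * b for a, b in zip(row, col)) % p for col in zip(*B)] for row in A]

p, n, k, d = 11, 6, 3, 4                # hypothetical toy parameters
symbols = [3, 1, 4, 1, 5, 9, 2, 6, 5]   # B = k*d - C(k, 2) = 9 file symbols
M = build_message_matrix(symbols, k, d)
C = vandermonde(n, d, p)                # first k columns play the role of Psi, the rest Delta
stored = matmul(C, M, p)                # row i is what node i stores (alpha = d symbols)
```

With distinct evaluation points, any $d$ rows of `C` and any $k$ rows of its first $k$ columns form invertible Vandermonde matrices, which is the property Step 2 asks for.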
#### Repairing on Product-Matrix MBR codes
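Assume node $i$, which stores $c_iM$, is lost (here $c_i$ denotes the $i$-th row of $C$).
Repair from any $d$ nodes $H = \{h_1, \ldots, h_d\}$.
- Node $h_j$ stores $c_{h_j}M$.
Newcomer contacts each $h_j$: “My name is $i$, and I'm lost.”
Node $h_j$ sends the single symbol $c_{h_j}M c_i^T$ (an inner product).
Newcomer assembles $C_H Mc_i^T$, where $C_H$ consists of the rows of $C$ indexed by $H$.
$C_H$ is invertible by construction (any $d$ rows of $C$ are linearly independent).
- Recover $Mc_i^T$.
- Recover $c_iM = (Mc_i^T)^T$ (since $M$ is symmetric).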
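
The repair procedure can be sketched as follows, again over an assumed toy field $\mathbb{F}_{11}$ with a Vandermonde encoding matrix and a hand-picked symmetric $M$ (all of these are illustrative choices, as are the function names). Each helper sends one field element, and the newcomer solves a $d\times d$ linear system.

```python
def matmul(A, B, p):
    """Matrix product over F_p."""
    return [[sum(a * b for a, b in zip(row, col)) % p for col in zip(*B)] for row in A]

def solve_mod(A, y, p):
    """Solve A x = y over F_p for a square invertible A (Gauss-Jordan elimination)."""
    m = len(A)
    aug = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(m):
        piv = next(r for r in range(col, m) if aug[r][col] % p)
        aug[col], aug[piv] = aug[piv], aug[col]
        inv = pow(aug[col][col], -1, p)
        aug[col] = [v * inv % p for v in aug[col]]
        for r in range(m):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(v - factor * w) % p for v, w in zip(aug[r], aug[col])]
    return [aug[r][m] for r in range(m)]

def repair(stored, C, lost, helpers, p):
    """Rebuild the row c_i M of node `lost` from one symbol per helper."""
    c_i = C[lost]
    # Helper h sends the inner product (c_h M) . c_i^T, a single symbol (beta = 1).
    y = [sum(s * c for s, c in zip(stored[h], c_i)) % p for h in helpers]
    C_H = [C[h] for h in helpers]
    # Solving C_H x = y gives x = M c_i^T, which equals (c_i M)^T since M is symmetric.
    return solve_mod(C_H, y, p)

p, n, k, d = 11, 6, 3, 4                 # hypothetical toy parameters
M = [[3, 1, 4, 2],                       # symmetric, zero bottom-right block
     [1, 1, 5, 6],
     [4, 5, 9, 5],
     [2, 6, 5, 0]]
C = [[pow(x, j, p) for j in range(d)] for x in range(1, n + 1)]   # Vandermonde
stored = matmul(C, M, p)
rebuilt = repair(stored, C, lost=2, helpers=[0, 1, 4, 5], p=p)
```

Here `rebuilt` should match `stored[2]`; the newcomer downloads $d$ symbols and stores $\alpha = d$ symbols.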
#### Reconstruction on Product-Matrix MBR codes
Data Collector (DC) contacts any $k$ nodes $D = \{d_1, \ldots, d_k\}$.
- Node $d_j$ stores $c_{d_j}M$.
Downloads $c_{d_j}M$ from node $d_j$.
DC assembles $C_D M$.
- Recall $CM=(\Psi,\Delta)M=(\Psi S+\Delta T^T, \Psi T)$.
- $C_D M=(\Psi_D,\Delta_D)M=(\Psi_D S+\Delta_D T^T, \Psi_D T)$.
$\Psi_D$ is invertible by construction.
- DC computes $\Psi_D^{-1}C_DM = (S+\Psi_D^{-1}\Delta_D T^T, T)$.
- DC obtains $T$.
- Subtracts $\Psi_D^{-1}\Delta_D T^T$ from $S+\Psi_D^{-1}\Delta_D T^T$ to obtain $S$.
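
Reconstruction can be sketched in the same assumed toy setting ($\mathbb{F}_{11}$, Vandermonde $C$, a hand-picked $M$, hypothetical function names). The code assumes $d > k$ so that $T$ is non-empty, and applies $\Psi_D^{-1}$ by solving one linear system per column.

```python
def matmul(A, B, p):
    """Matrix product over F_p."""
    return [[sum(a * b for a, b in zip(row, col)) % p for col in zip(*B)] for row in A]

def solve_mod(A, y, p):
    """Solve A x = y over F_p for a square invertible A (Gauss-Jordan elimination)."""
    m = len(A)
    aug = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(m):
        piv = next(r for r in range(col, m) if aug[r][col] % p)
        aug[col], aug[piv] = aug[piv], aug[col]
        inv = pow(aug[col][col], -1, p)
        aug[col] = [v * inv % p for v in aug[col]]
        for r in range(m):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(v - factor * w) % p for v, w in zip(aug[r], aug[col])]
    return [aug[r][m] for r in range(m)]

def reconstruct(rows, C_D, k, p):
    """Recover (S, T) from rows = C_D M downloaded from k nodes (assumes d > k)."""
    Psi_D = [c[:k] for c in C_D]
    Delta_D = [c[k:] for c in C_D]

    def left_solve(Y):
        """Return Psi_D^{-1} Y, solving one system per column of Y."""
        cols = [solve_mod(Psi_D, list(col), p) for col in zip(*Y)]
        return [list(r) for r in zip(*cols)]

    X = left_solve(rows)                          # (S + Psi_D^{-1} Delta_D T^T, T)
    T = [r[k:] for r in X]
    Tt = [list(col) for col in zip(*T)]
    corr = left_solve(matmul(Delta_D, Tt, p))     # Psi_D^{-1} Delta_D T^T
    S = [[(a - b) % p for a, b in zip(r[:k], e)] for r, e in zip(X, corr)]
    return S, T

p, n, k, d = 11, 6, 3, 4                 # hypothetical toy parameters
M = [[3, 1, 4, 2],
     [1, 1, 5, 6],
     [4, 5, 9, 5],
     [2, 6, 5, 0]]
C = [[pow(x, j, p) for j in range(d)] for x in range(1, n + 1)]   # Vandermonde
stored = matmul(C, M, p)
D = [1, 3, 5]                            # any k nodes
S, T = reconstruct([stored[j] for j in D], [C[j] for j in D], k, p)
```

The recovered `S` and `T` should equal the top-left and top-right blocks of `M`, so the data collector obtains all $B$ file symbols from $k\alpha = kd$ downloaded symbols.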


@@ -16,4 +16,5 @@ export default {
CSE5313_L11: "CSE5313 Coding and information theory for data science (Recitation 11)",
CSE5313_L12: "CSE5313 Coding and information theory for data science (Lecture 12)",
CSE5313_L13: "CSE5313 Coding and information theory for data science (Lecture 13)",
CSE5313_L14: "CSE5313 Coding and information theory for data science (Lecture 14)",
}

Binary file not shown.
