CSE5313 Coding and information theory for data science (Lecture 12)
Challenge 1: Reconstruction
- Minimize reconstruction bandwidth.
Challenge 2: Repair
- Maintaining data consistency.
- Failed servers must be repaired:
  - By contacting a few other servers (locality, due to geographical constraints).
  - By minimizing bandwidth.
Challenge 3: Storage overhead
- Minimize space consumption.
- Minimize redundancy.
Codes for storage systems
Naive solution: Replication
Locality is 1 (by copying from another server).
This gives the optimal reconstruction bandwidth.
Use codes to improve storage efficiency
Locality is n-d+1, high bandwidth.
Parity codes
Let X_1,X_2,\ldots,X_n\in \mathbb{F}_2^t be the data blocks, and take an extra server to store the parity X_1\oplus X_2\oplus\cdots\oplus X_n.
Reconstruction:
Optimal reconstruction bandwidth; only k servers are needed to reconstruct the file.
Overhead:
Only one additional server is needed.
Repair:
If any server fails, it can be reconstructed from the other n-d+1=n-2+1=n-1 servers.
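As a quick illustration (a minimal Python sketch; the function names are made up and data blocks are modelled as equal-length byte strings), the following encodes data blocks with one XOR parity and repairs any single failed server from the remaining blocks.

```python
# Minimal sketch of a single-parity storage code (illustrative helper names).
from functools import reduce

def xor_blocks(blocks):
    """Bitwise XOR of equal-length byte blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def encode(data_blocks):
    """Store the data blocks plus one parity block (one extra server)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def repair(stored, failed_index):
    """Repair any single failed server by XOR-ing the other blocks."""
    survivors = [b for i, b in enumerate(stored) if i != failed_index]
    return xor_blocks(survivors)

# Example: 4 data blocks of t = 4 bytes each.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xff\x00\xff\x00", b"\x0a\x0b\x0c\x0d"]
stored = encode(data)
assert repair(stored, 2) == stored[2]   # a lost data block is recovered
assert repair(stored, 4) == stored[4]   # the parity block itself can also be repaired
```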
Reed-Solomon codes
Fragment the file as X = (X_1, \ldots, X_k).
Encode with an [n,k] Reed-Solomon code over \mathbb{F}_{2^t}, storing one coded fragment per server; this requires a field of size 2^t\geq n.
Reconstruction:
Any k servers can reconstruct the file.
Overhead:
The field size must satisfy 2^t\geq n; the redundancy is n-k servers.
Repair:
Worse: repairing a single failed server requires contacting n-d+1=k servers, i.e., essentially reconstructing the entire file.
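The lecture does not give an implementation; as a rough illustration, here is a sketch of the evaluate-and-interpolate view of Reed-Solomon codes. It uses a prime field \mathbb{F}_p (p = 257) instead of \mathbb{F}_{2^t} purely to keep the code short, and all names are made up; the principle (evaluate a degree < k polynomial at n distinct points, any k evaluations determine it) is the same.

```python
# Sketch of Reed-Solomon-style encoding/reconstruction over a prime field.
P = 257  # field size; we need at least n distinct evaluation points

def encode(message, n):
    """Treat the k message symbols as polynomial coefficients and evaluate at 1..n."""
    k = len(message)
    return [sum(message[j] * pow(x, j, P) for j in range(k)) % P for x in range(1, n + 1)]

def reconstruct(points, k):
    """Recover the k message symbols from any k (x, y) pairs via Lagrange interpolation."""
    assert len(points) == k
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(points):
        # Lagrange basis polynomial ell_i(x) = prod_{j != i} (x - xj) / (xi - xj)
        basis = [1]  # coefficients of the running product, lowest degree first
        denom = 1
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            denom = denom * (xi - xj) % P
            new = [0] * (len(basis) + 1)   # multiply basis polynomial by (x - xj)
            for d, c in enumerate(basis):
                new[d] = (new[d] - c * xj) % P
                new[d + 1] = (new[d + 1] + c) % P
            basis = new
        scale = yi * pow(denom, P - 2, P) % P
        for d, c in enumerate(basis):
            coeffs[d] = (coeffs[d] + c * scale) % P
    return coeffs

message = [5, 17, 42]                                # k = 3 fragments
codeword = encode(message, n=7)                      # stored on n = 7 servers
subset = [(x, codeword[x - 1]) for x in (2, 5, 7)]   # any k surviving servers
assert reconstruct(subset, k=3) == message
```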
New codes for storage systems
EVENODD code
- One of the first storage-specific codes.
Can an XOR-only code be built that enables reconstruction when two disks are missing?
The locality/bandwidth problem is deferred to the next lecture.
For a prime m, partition the file as X=(X_0,\ldots,X_{m-1}), where each X_i consists of m-1 bits.
Store Y_i=X_i in disks 0,1,\ldots,m-1.
Add two redundant disks Y_m,Y_{m+1}.
View the data as an (m-1)\times m bit array with a_{i,j} the i-th bit of X_j. Then (Y_m)_i is the parity of row i: (Y_m)_i=\bigoplus_{j=0}^{m-1}a_{i,j}. For (Y_{m+1})_i, first define S=a_{0,m-1}\oplus a_{1,m-2}\oplus\cdots\oplus a_{m-2,1} (for m=5: S=a_{0,4}\oplus a_{1,3}\oplus a_{2,2}\oplus a_{3,1}); then (Y_{m+1})_i=S\oplus\bigoplus_{j=0}^{m-1}a_{(i-j)\bmod m,\,j}, where the imaginary row a_{m-1,j}=0.
Example (m = 5):

| Y_0 | Y_1 | Y_2 | Y_3 | Y_4 | Y_5 | Y_6 |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 | 1 | 1 | 0 |
Note that the S diagonal can be extracted from Y_m and Y_{m+1}.
\bigoplus_{i=0}^{m-2}(Y_m)_i\oplus \bigoplus_{i=0}^{m-2}(Y_{m+1})_i=\underbrace{S\oplus\cdots\oplus S}_{m\text{ times}}=S, since m is odd.
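The following Python sketch (my own illustrative code, not from the lecture) implements the encoding rule above on the m = 5 example array and checks that S indeed falls out of the two parity columns.

```python
# Sketch of EVENODD encoding (m prime). The data array a has m-1 rows and m
# columns; a[i][j] is bit i of disk Y_j.

def evenodd_encode(a, m):
    rows = m - 1
    # Y_m: row parities
    y_m = [0] * rows
    for i in range(rows):
        for j in range(m):
            y_m[i] ^= a[i][j]
    # S: parity of the diagonal i + j = m - 1
    s = 0
    for j in range(1, m):
        s ^= a[m - 1 - j][j]
    # Y_{m+1}: diagonal parities, with the imaginary row a[m-1][j] = 0
    y_m1 = [0] * rows
    for i in range(rows):
        y_m1[i] = s
        for j in range(m):
            r = (i - j) % m
            if r != m - 1:          # ghost row contributes 0
                y_m1[i] ^= a[r][j]
    return y_m, y_m1, s

# The m = 5 example from the table above (columns Y_0..Y_4).
a = [[1, 0, 1, 1, 0],
     [0, 1, 1, 0, 0],
     [1, 1, 0, 0, 0],
     [0, 1, 0, 1, 1]]
y5, y6, s = evenodd_encode(a, m=5)
assert y5 == [1, 0, 0, 1] and y6 == [0, 0, 1, 0]   # matches columns Y_5 and Y_6
# S can also be extracted from the two parity disks alone:
s_from_parities = 0
for i in range(4):
    s_from_parities ^= y5[i] ^ y6[i]
assert s_from_parities == s
```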
Goal: Reconstruct if any two disks are missing.
- If Y_m, Y_{m+1} are missing, nothing to do.
- If Y_i, Y_{m+1} are missing for i < m, decode like a parity code.
- If Y_i, Y_m are missing for i < m, similar, using the diagonal parities.
The interesting case: Y_i, Y_j are missing for i,j < m.
Using the same skill we use to solve Sudoku puzzles, we can find the missing values.
First we recover the S diagonal from Y_m and Y_{m+1}.
Then we alternate: recover a bit in one missing column from a diagonal (via Y_{m+1}) that has only one remaining unknown, then recover the bit in the other missing column of that row via Y_m, and repeat.
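To make the zig-zag concrete, here is an illustrative decoder sketch (my own code, hard-coding the m = 5 example table above) for the case of two erased data disks i < j.

```python
# Zig-zag decoding of two missing data disks i < j in the m = 5 example.
m = 5
full = [[1, 0, 1, 1, 0],
        [0, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [0, 1, 0, 1, 1]]
y_row = [1, 0, 0, 1]    # Y_5 (row parities)
y_diag = [0, 0, 1, 0]   # Y_6 (diagonal parities)

def diag_parity(d, y_diag, s, m):
    """Parity of all real cells on diagonal d (cells (r, c) with (r + c) % m == d)."""
    return s if d == m - 1 else (y_diag[d] ^ s)

def decode_two(a, i, j, y_row, y_diag, m):
    """Recover erased data columns i < j < m in place (zig-zag decoding)."""
    rows = m - 1
    # Step 1: recover S from the two parity disks.
    s = 0
    for r in range(rows):
        s ^= y_row[r] ^ y_diag[r]
    # Step 2: start from the diagonal whose column-j cell lies in the ghost row,
    # so its only unknown is in column i.
    d = (m - 1 + j) % m
    while (d - i) % m != m - 1:          # stop when column i's cell is the ghost cell
        r = (d - i) % m
        # Recover a[r][i] from diagonal d (its only remaining unknown).
        val = diag_parity(d, y_diag, s, m)
        for c in range(m):
            rc = (d - c) % m
            if c != i and rc != m - 1:
                val ^= a[rc][c]
        a[r][i] = val
        # Recover a[r][j] from row r (its only remaining unknown).
        val = y_row[r]
        for c in range(m):
            if c != j:
                val ^= a[r][c]
        a[r][j] = val
        d = (d + (j - i)) % m            # next diagonal: its column-j cell is now known

# Erase disks Y_1 and Y_3 in the example and recover them.
damaged = [[row[c] if c not in (1, 3) else 0 for c in range(m)] for row in full]
decode_two(damaged, 1, 3, y_row, y_diag, m)
assert damaged == full
```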
Proof for why it always works
There are m-1 rows, or m if we include a ghost row of all $0$s.
\mathbb{Z}_m is cyclic of prime order, so any non-zero element is a generator.
When alternating between a diagonal constraint and a row constraint, each step shifts by a fixed non-zero offset (the difference of the two missing column indices) modulo m; since any non-zero element generates \mathbb{Z}_m, the zig-zag visits every row before closing the cycle, so it never gets stuck.
This is an example of an array code:
The message (X_0,X_1,\ldots,X_{m-1}) is a matrix in \mathbb{F}_2^{(m-1)\times m}.
The codeword (Y_0,Y_1,\ldots,Y_{m+1}) is a matrix in \mathbb{F}_2^{(m-1)\times (m+2)}.
Encoding is done over \mathbb{F}_q (here q=2, using only XORs).
Locally Recoverable Codes
Locality: when a node j fails,
- A newcomer node joins the system.
- The newcomer contacts a "small" number of helper nodes with the message "repairing $j$".
- Each of the helper nodes sends something to the newcomer.
- The newcomer aggregates the responses to find Y_j.
Notes:
- No adversarial behavior.
- No privacy issues.
- No concern about bandwidth (for now).
Research question:
- How small can the "small number of nodes" be?
- How does that affect the rate/minimum distance of the code?
- How to build codes with this capability?
Definition of locally recoverable code
An [n, k]_q code is called $r$-locally recoverable if
- every codeword symbol y_j has a recovering set R_j \subseteq [n] \setminus \{j\} (where [n]=\{1,2,\ldots,n\}),
- such that y_j is computable from \{y_i : i \in R_j\},
- and |R_j| \leq r for every j \in [n].
Notes:
- From n-d+1 nodes, we can reconstruct the entire file; we always assume k\leq n-d+1.
- We want r\ll n-d+1.
- R_j does not depend on y_j, nor on the codeword y, only on j. (We need to repair without knowing y, y_j.)
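As an illustration of the definition (my own sketch, not a construction from the lecture): group r data bits with one local XOR parity, so every symbol, parities included, has a recovering set of size r; the rate is r/(r+1), matching Bound 1 below.

```python
# Sketch of a simple r-locally recoverable code over F_2: k data bits are split
# into groups of r, and each group stores one local XOR parity. Any single lost
# symbol is recoverable from the r other symbols of its group.

def encode_lrc(data, r):
    """Return the codeword as a list of groups, each (r data bits + 1 parity bit)."""
    assert len(data) % r == 0
    groups = []
    for g in range(0, len(data), r):
        chunk = data[g:g + r]
        parity = 0
        for bit in chunk:
            parity ^= bit
        groups.append(chunk + [parity])
    return groups

def repair(groups, g, pos):
    """Recover symbol `pos` of group `g` from the other r symbols of that group."""
    val = 0
    for idx, bit in enumerate(groups[g]):
        if idx != pos:
            val ^= bit
    return val

data = [1, 0, 1, 1, 0, 0, 1, 1, 0]                  # k = 9 data bits
groups = encode_lrc(data, r=3)                      # n = 12 symbols, rate 9/12 = r/(r+1)
assert repair(groups, g=1, pos=2) == groups[1][2]   # repair a data symbol
assert repair(groups, g=2, pos=3) == groups[2][3]   # repair a local parity
```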
Bounds for Locally Recoverable Codes
Let \mathcal{C} be an $r$-locally recoverable [n, k]_q code with minimum distance d.
Bound 1: \frac{k}{n}\leq \frac{r}{r+1}.
Bound 2: d\leq n-k-\lceil\frac{k}{r}\rceil +2.
Notes:
For r=k, bound 2 becomes d\leq n-k+1.
- This is exactly the Singleton bound, so bound 2 is its natural extension.
For r=1, bound 1 becomes \frac{k}{n}\leq \frac{1}{2}.
- The duplication (repetition) code attains this bound.
For r=1, bound 2 becomes d\leq n-2k+2.
- The duplication (repetition) code attains this bound.
Bound 1
Turán's Lemma
Let G be a directed graph with n vertices. Then there exists an induced directed acyclic subgraph (DAG) of G on at least \frac{n}{1+\mathrm{avg}_i(d^{out}_i)} vertices, where d^{out}_i is the out-degree of vertex i.
Directed graphs have large acyclic subgraphs.
Proof via the probabilistic method
Useful for showing the existence of a large acyclic subgraph, but not for finding it.
Tip
Show that \mathbb{E}[X]\geq \text{something}, and therefore there exists U_\pi with |U_\pi|\geq \text{something}, by the pigeonhole principle.
For a permutation \pi of [n], define U_\pi = \{i \in [n]: \pi(j)>\pi(i) \text{ for every out-neighbor } j \text{ of } i\}.
That is, i\in U_\pi if each of the d_i^{out} outgoing edges from i connects to a node j with \pi(j)>\pi(i).
In other words, within the induced subgraph on U_\pi, every edge goes from a node with a smaller \pi-value to one with a larger \pi-value; all edges go "to the right".
This induced subgraph is clearly acyclic.
Choose \pi uniformly at random and let X=|U_\pi| be a random variable.
Let X_i be the indicator random variable for i\in U_\pi.
So X=\sum_{i=1}^{n} X_i.
Using linearity of expectation, we have
E[X]=\sum_{i=1}^{n} E[X_i]
E[X_i] is the probability that \pi places i before all of its out-neighbors.
There are (d_i^{out}+1)! equally likely relative orderings of i and its out-neighbors.
Of these, d_i^{out}! place i first (the out-neighbors may be ordered arbitrarily after it).
So, E[X_i]=\frac{d_i^{out}!}{(d_i^{out}+1)!}=\frac{1}{d_i^{out}+1}.
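As a quick numerical sanity check (an illustrative simulation, not part of the proof): sample random permutations of a small random digraph and compare the average |U_\pi| with \sum_i \frac{1}{d_i^{out}+1}.

```python
# Sanity check of E[|U_pi|] = sum_i 1/(d_i^out + 1) on a small random digraph.
import random

random.seed(0)
n = 8
# Random directed graph: adjacency list of out-neighbors (no self-loops).
out = {i: [j for j in range(n) if j != i and random.random() < 0.3] for i in range(n)}

def U_size(perm):
    """|U_pi|: number of nodes whose out-neighbors all come later in pi."""
    pos = {v: p for p, v in enumerate(perm)}
    return sum(1 for i in range(n) if all(pos[j] > pos[i] for j in out[i]))

trials = 20000
avg = sum(U_size(random.sample(range(n), n)) for _ in range(trials)) / trials
expected = sum(1 / (len(out[i]) + 1) for i in range(n))
print(f"simulated E[X] ~ {avg:.3f}, formula gives {expected:.3f}")
# Both should agree up to sampling noise; sum_i 1/(d_i + 1) >= n / (1 + avg out-degree)
# by the AM-HM inequality, which yields Turán's lemma.
```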
Continue next time.