diff --git a/content/CSE5313/CSE5313_L12.md b/content/CSE5313/CSE5313_L12.md
new file mode 100644
index 0000000..d75f214
--- /dev/null
+++ b/content/CSE5313/CSE5313_L12.md
@@ -0,0 +1,234 @@
+# CSE5313 Coding and information theory for data science (Lecture 12)
+
+Challenge 1: Reconstruction
+
+- Minimize reconstruction bandwidth.
+
+Challenge 2: Repair
+
+- Maintaining data consistency.
+- Failed servers must be repaired:
+  - By contacting few other servers (locality, due to geographical constraints).
+  - By minimizing bandwidth.
+
+Challenge 3: Storage overhead
+
+- Minimize space consumption.
+- Minimize redundancy.
+
+## Codes for storage systems
+
+### Naive solution: Replication
+
+Locality is 1 (by copying from another server).
+
+This gives the optimal reconstruction bandwidth.
+
+### Use codes to improve storage efficiency
+
+With a general code, locality is $n-d+1$ and repair bandwidth is high.
+
+#### Parity codes
+
+Let $X_1,X_2,\ldots,X_k\in \mathbb{F}_2^t$ be the data blocks ($k=n-1$), and use one extra server to store their parity.
+
+Reconstruction:
+
+Optimal for reconstruction bandwidth. Only $k$ servers are needed to reconstruct the file.
+
+Overhead:
+
+Only one additional server is needed.
+
+Repair:
+
+If any server fails, repair it from the other $n-d+1=n-2+1=n-1$ servers.
+
+#### Reed-Solomon codes
+
+Fragment the file $X = (X_1, \ldots, X_k)$.
+
+Store it on $n$ servers, which requires field size $2^t\geq n$.
+
+Reconstruction:
+
+Any $k$ servers can reconstruct the file.
+
+Overhead:
+
+$n-k$ redundant servers; the field size must satisfy $2^t\geq n$.
+
+Repair:
+
+Worse: repairing one server requires contacting $n-d+1=k$ servers, i.e., enough to reconstruct the whole file.
+
+### New codes for storage systems
+
+#### EVENODD code
+
+- One of the first storage-specific codes.
+
+Can an XOR-only code be built that enables reconstruction when two disks are missing?
+
+The locality/bandwidth problem is left for the next lecture.
+
+For a prime $m$, partition the data as $X=(X_0,\ldots,X_{m-1})$, where each $X_i$ has $m-1$ bits.
+
+Store $Y_i=X_i$ in disks $0,1,\ldots,m-1$.
+
+Add two redundant disks $Y_m,Y_{m+1}$.
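+
+- $(Y_m)_i$ is the parity of row $i$.
+- For $(Y_{m+1})_i$, write $a_{i,j}=(Y_j)_i$ for the bit in row $i$ of data disk $j$, with a ghost row $a_{m-1,j}=0$. First define $S=\bigoplus_{j=1}^{m-1}a_{m-1-j,j}$ (for $m=5$: $S=a_{0,4}\oplus a_{1,3}\oplus a_{2,2}\oplus a_{3,1}$), then $(Y_{m+1})_i=S\oplus \bigoplus_{j=0}^{m-1}a_{(i-j) \bmod m,\,j}$. The table below is an example with $m=5$, where $S=1$.
+
+| $Y_0$ | $Y_1$ | $Y_2$ | $Y_3$ | $Y_4$ | $Y_5$ | $Y_6$ |
+|-------|-------|-------|-------|-------|-------|-------|
+| $1$   | $0$   | $1$   | $1$   | $0$   | $1$   | $0$   |
+| $0$   | $1$   | $1$   | $0$   | $0$   | $0$   | $0$   |
+| $1$   | $1$   | $0$   | $0$   | $0$   | $0$   | $1$   |
+| $0$   | $1$   | $0$   | $1$   | $1$   | $1$   | $0$   |
+
+Note that $S$ can be extracted from $Y_m$ and $Y_{m+1}$ alone: every data bit off the $S$ diagonal appears once in a row parity and once in a diagonal parity and cancels, the $S$ diagonal itself contributes one $S$, and each of the $m-1$ diagonal parities contributes another. Since $m$ is odd,
+
+$$
+\bigoplus_{j=0}^{m-2}(Y_m)_j\oplus \bigoplus_{j=0}^{m-2}(Y_{m+1})_j=\bigoplus_{j=1}^{m}S=S
+$$
+
+Goal: Reconstruct if any two disks are missing.
+
+- If $Y_m, Y_{m+1}$ are missing, all data disks are intact; just recompute the parities.
+- If $Y_i, Y_{m+1}$ are missing for $i < m$, recover $Y_i$ as in a parity code, then recompute $Y_{m+1}$.
+- If $Y_i, Y_{m}$ are missing for $i < m$, the procedure is similar, using the diagonal parities.
+
+The interesting case: $Y_i, Y_j$ are missing for $i,j < m$.
+
+The missing values can be found with the same kind of reasoning used to solve sudoku puzzles.
+
+First we recover $S$ from $Y_m$ and $Y_{m+1}$.
+
+Then we alternate: a diagonal parity from $Y_{m+1}$ gives one missing bit, the row parity from $Y_m$ gives the other missing bit in that row, and so on.
+
+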
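The encoding rules above can be checked against the $m=5$ table with a short sketch. The function names `evenodd_encode` and `extract_s` are illustrative choices, not names from the lecture, and the data layout (one inner list per row, one entry per data disk) is an assumption made for readability.

```python
from functools import reduce
from operator import xor

def evenodd_encode(data, m):
    """Return (row_parity, diag_parity) for an (m-1) x m bit array `data`."""
    a = [list(row) for row in data] + [[0] * m]  # append the all-zero ghost row
    # S is the XOR of the special diagonal a[m-1-j][j], j = 1..m-1.
    s = reduce(xor, (a[m - 1 - j][j] for j in range(1, m)), 0)
    # (Y_m)_i: XOR of row i over the m data disks.
    row_parity = [reduce(xor, a[i], 0) for i in range(m - 1)]
    # (Y_{m+1})_i: S XOR the diagonal a[(i-j) mod m][j], j = 0..m-1.
    diag_parity = [
        s ^ reduce(xor, (a[(i - j) % m][j] for j in range(m)), 0)
        for i in range(m - 1)
    ]
    return row_parity, diag_parity

def extract_s(row_parity, diag_parity):
    """Recover S as the XOR of all bits on the two parity disks (m odd)."""
    return reduce(xor, list(row_parity) + list(diag_parity), 0)

# Data disks Y_0..Y_4 from the m = 5 table, one inner list per row.
DATA = [
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 1],
]
Y5, Y6 = evenodd_encode(DATA, 5)
print(Y5, Y6, extract_s(Y5, Y6))  # [1, 0, 0, 1] [0, 0, 1, 0] 1
```

The printed columns match $Y_5$ and $Y_6$ in the table, and the extracted value agrees with $S=a_{0,4}\oplus a_{1,3}\oplus a_{2,2}\oplus a_{3,1}=1$.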
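+Proof for why it always works
+
+There are $m-1$ rows, or $m$ if we include a ghost row of all $0$s.
+
+$\mathbb{Z}_m$ is cyclic of prime order, so every non-zero element is a generator.
+
+Each diagonal-then-row step moves the current row index by a fixed offset, the distance $j-i \bmod m$ between the two missing columns. This offset is non-zero, hence a generator, so starting from the ghost row the walk visits every row before returning to it.
+
+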
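The zigzag walk in the proof can be sketched as code for the case of two missing data disks. This is a minimal sketch under stated assumptions: the function name `recover_two_data_disks`, the row-major layout, and the use of `None` for erased bits are all hypothetical, and the parity columns are taken from the $m=5$ table rather than recomputed.

```python
def recover_two_data_disks(a, row_parity, diag_parity, i, j, m):
    """Return a copy of the (m-1) x m bit array `a` with columns i < j recovered.

    Whatever is stored in columns i and j of `a` is ignored.
    """
    assert 0 <= i < j < m
    rows = [list(r) for r in a] + [[0] * m]   # ghost row m-1 is all zeros
    for r in rows:
        r[i] = r[j] = 0                         # forget the erased columns
    known = [c for c in range(m) if c not in (i, j)]
    diag = list(diag_parity) + [0]              # pad so diagonal m-1 has an entry

    # S is the XOR of every bit on the two parity disks.
    s = 0
    for bit in list(row_parity) + list(diag_parity):
        s ^= bit

    # Syndromes: the XOR of the two erased bits on each row / diagonal.
    row_syn = [0] * m
    for u in range(m - 1):
        row_syn[u] = row_parity[u]
        for c in known:
            row_syn[u] ^= rows[u][c]
    diag_syn = [0] * m
    for u in range(m):
        diag_syn[u] = s ^ diag[u]
        for c in known:
            diag_syn[u] ^= rows[(u - c) % m][c]

    step = j - i
    r = (m - 1 - step) % m                      # first row reached from the ghost row
    while r != m - 1:
        # The diagonal through (r, j) meets column i in row r + step, already known.
        rows[r][j] = diag_syn[(r + j) % m] ^ rows[(r + step) % m][i]
        # Row r now has a single unknown left, in column i.
        rows[r][i] = row_syn[r] ^ rows[r][j]
        r = (r - step) % m                      # step is a generator of Z_m
    return rows[: m - 1]

DATA = [
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 1],
]
Y5, Y6 = [1, 0, 0, 1], [0, 0, 1, 0]             # parity columns from the table

# Erase disks 0 and 2, then recover them.
damaged = [[None if c in (0, 2) else row[c] for c in range(5)] for row in DATA]
print(recover_two_data_disks(damaged, Y5, Y6, 0, 2, 5) == DATA)  # True
```

The loop terminates after exactly $m-1$ iterations because `step` is non-zero modulo the prime $m$, which is the generator argument from the proof.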
+
+This is an example of an array code:
+
+The message $(X_0,X_1,\ldots,X_{m-1})$ is a matrix in $\mathbb{F}_2^{(m-1)\times m}$.
+
+The codeword $(Y_0,Y_1,\ldots,Y_{m+1})$ is a matrix in $\mathbb{F}_2^{(m-1)\times (m+2)}$.
+
+Encoding is done over the base field $\mathbb{F}_q$ (here $q=2$, so XOR only).
+
+## Locally Recoverable Codes
+
+Locality: when a node $j$ fails,
+
+- A newcomer node joins the system.
+- The newcomer contacts a "small" number of helper nodes with the message "repairing $j$".
+- Each of the helper nodes sends something to the newcomer.
+- The newcomer aggregates the responses to find $Y_j$.
+
+Notes:
+
+- No adversarial behavior.
+- No privacy issues.
+- No concern about bandwidth (for now).
+
+Research questions:
+
+- How small can the "small number of nodes" be?
+- How does that affect the rate/minimum distance of the code?
+- How to build codes with this capability?
+
+### Definition of locally recoverable code
+
+An $[n, k]_q$ code is called $r$-locally recoverable if
+
+- every codeword symbol $y_j$ has a recovering set $R_j \subseteq [n] \setminus \{j\}$ ($[n]=\{1,2,\ldots,n\}$),
+- such that $y_j$ is computable from $y_i$ for all $i \in R_j$.
+- $|R_j| \leq r$ for every $j \in [n]$.
+
+Notes:
+
+- Any $n-d+1$ nodes suffice to reconstruct the entire file (recall $k\leq n-d+1$ by the Singleton bound).
+- We want $r\ll n-d+1$.
+- $R_j$ does not depend on $y_j$, nor on the codeword $y$, only on $j$. (We need to repair without knowing $y,y_j$.)
+
+### Bounds for Locally Recoverable Codes
+
+Let $\mathcal{C}$ be an $r$-locally recoverable $[n, k]_q$ code with minimum distance $d$.
+
+Bound 1: $\frac{k}{n}\leq \frac{r}{r+1}$.
+
+Bound 2: $d\leq n-k-\lceil\frac{k}{r}\rceil +2$.
+
+Notes:
+
+For $r=k$, bound 2 becomes $d\leq n-k+1$.
+
+- Bound 2 is a natural extension of the Singleton bound.
+
+For $r=1$, bound 1 becomes $\frac{k}{n}\leq \frac{1}{2}$.
+
+- The duplication code trivially attains this bound.
+
+For $r=1$, bound 2 becomes $d\leq n-2k+2$.
+
+- The duplication code trivially attains this bound.
+
+### Bound 1
+
+#### Turán's Lemma
+
+Let $G$ be a directed graph with $n$ vertices. Then there exists an induced directed acyclic subgraph (DAG) of $G$ on at least $\frac{n}{1+\operatorname{avg}_i(d^{out}_i)}$ nodes, where $d^{out}_i$ is the out-degree of vertex $i$.
+
+> Directed graphs have large acyclic subgraphs.
+
+
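To get a feel for the statement, the lemma can be checked by brute force on a tiny directed graph. The sketch below is only an illustration: the graph `EDGES` is a made-up example (a directed 4-cycle plus one chord), and the helper names are hypothetical.

```python
from itertools import combinations

def is_acyclic(nodes, edges):
    """Kahn's algorithm on the subgraph induced by `nodes`."""
    nodes = set(nodes)
    indeg = {v: 0 for v in nodes}
    succ = {v: [] for v in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            succ[u].append(v)
            indeg[v] += 1
    ready = [v for v in nodes if indeg[v] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return removed == len(nodes)  # every node peeled off means no cycle

def largest_induced_dag(n, edges):
    """Size of the largest vertex subset inducing an acyclic subgraph (tiny n only)."""
    for size in range(n, 0, -1):
        if any(is_acyclic(sub, edges) for sub in combinations(range(n), size)):
            return size
    return 0

# Hypothetical example: the cycle 0 -> 1 -> 2 -> 3 -> 0 plus the chord 0 -> 2.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
N = 4
bound = N / (1 + len(EDGES) / N)  # average out-degree is |E| / n
print(largest_induced_dag(N, EDGES), bound)
```

Here the largest induced DAG has 3 nodes (for example $\{0,1,2\}$), comfortably above the guaranteed $4/(1+5/4)\approx 1.78$. The lemma only promises existence, so the bound need not be tight.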
+Proof via the probabilistic method
+
+> Useful for showing the existence of a large acyclic subgraph, but not for finding it.
+
+> [!TIP]
+>
+> Show that $\mathbb{E}[X]\geq c$ for some value $c$; then there exists $U_\pi$ with $|U_\pi|\geq c$, because a random variable cannot always be below its mean.
+
+For a permutation $\pi$ of $[n]$, define $U_\pi = \{i \in [n]: \pi(j)>\pi(i) \text{ for every out-neighbor } j \text{ of } i\}$.
+
+That is, $i\in U_\pi$ if each of the $d_i^{out}$ outgoing edges from $i$ connects to a node $j$ with $\pi(j)>\pi(i)$.
+
+In other words, we select a subset of nodes $U_\pi$ such that every outgoing edge of a node in $U_\pi$ leads to a node placed later by $\pi$. All edges go to the right.
+
+The subgraph induced on $U_\pi$ is clearly acyclic.
+
+Choose $\pi$ uniformly at random and let $X=|U_\pi|$ be a random variable.
+
+Let $X_i$ be the indicator random variable for $i\in U_\pi$.
+
+So $X=\sum_{i=1}^{n} X_i$.
+
+Using linearity of expectation, we have
+
+$$
+\mathbb{E}[X]=\sum_{i=1}^{n} \mathbb{E}[X_i]
+$$
+
+$\mathbb{E}[X_i]$ is the probability that $\pi$ places $i$ before all of its out-neighbors.
+
+For each node $i$, there are $(d_i^{out}+1)!$ ways to order the node together with its out-neighbors.
+
+In $d_i^{out}!$ of them, $i$ comes first and the out-neighbors follow in any order.
+
+So, $\mathbb{E}[X_i]=\frac{d_i^{out}!}{(d_i^{out}+1)!}=\frac{1}{d_i^{out}+1}$.
+
+To be continued next time.
+
+
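The identity $\mathbb{E}[X_i]=\frac{1}{d_i^{out}+1}$ can be confirmed exactly on a small graph by enumerating all permutations. This is a sketch for intuition only: the graph `EDGES` is a hypothetical example, and the function names are not from the lecture.

```python
from fractions import Fraction
from itertools import permutations

def out_neighbors(n, edges):
    """Map each vertex to the set of its out-neighbors."""
    out = {v: set() for v in range(n)}
    for u, v in edges:
        out[u].add(v)
    return out

def expected_u_size(n, edges):
    """Exact E|U_pi| averaged over all n! permutations (tiny n only)."""
    out = out_neighbors(n, edges)
    total = 0
    count = 0
    for order in permutations(range(n)):
        pos = {v: p for p, v in enumerate(order)}
        # i is in U_pi when every out-neighbor of i is placed after i.
        total += sum(all(pos[j] > pos[i] for j in out[i]) for i in range(n))
        count += 1
    return Fraction(total, count)

# Hypothetical example: the cycle 0 -> 1 -> 2 -> 3 -> 0 plus the chord 0 -> 2.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
N = 4
exact = expected_u_size(N, EDGES)
formula = sum(Fraction(1, len(nbrs) + 1) for nbrs in out_neighbors(N, EDGES).values())
print(exact, formula)  # 11/6 11/6
```

Both values equal $\frac{1}{3}+3\cdot\frac{1}{2}=\frac{11}{6}$. The step presumably left for next time is the inequality $\sum_i \frac{1}{1+d_i^{out}}\geq \frac{n}{1+\operatorname{avg}_i(d_i^{out})}$, which follows from the convexity of $x\mapsto\frac{1}{1+x}$; in this example $\frac{11}{6}\geq\frac{16}{9}$.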
\ No newline at end of file
diff --git a/content/CSE5313/_meta.js b/content/CSE5313/_meta.js
index b2a3452..8388c92 100644
--- a/content/CSE5313/_meta.js
+++ b/content/CSE5313/_meta.js
@@ -14,4 +14,5 @@ export default {
     CSE5313_L9: "CSE5313 Coding and information theory for data science (Lecture 9)",
     CSE5313_L10: "CSE5313 Coding and information theory for data science (Recitation 10)",
     CSE5313_L11: "CSE5313 Coding and information theory for data science (Recitation 11)",
+    CSE5313_L12: "CSE5313 Coding and information theory for data science (Lecture 12)",
 }
\ No newline at end of file