updates

2025-10-09 12:50:06 -05:00
parent 11a8796868
commit 9734128293
2 changed files with 235 additions and 0 deletions
--- a/content/CSE5313/CSE5313_L12.md
+++ b/content/CSE5313/CSE5313_L12.md
@@ -0,0 +1,234 @@
+# CSE5313 Coding and information theory for data science (Lecture 12)
+
+Challenge 1: Reconstruction
+
+- Minimize reconstruction bandwidth.
+
+Challenge 2: Repair
+
+- Maintaining data consistency.
+- Failed servers must be repaired:
+  - By contacting few other servers (locality, due to geographical constraints).
+  - By minimizing bandwidth.
+
+Challenge 3: Storage overhead
+
+- Minimize space consumption.
+- Minimize redundancy.
+
+## Code for storage systems
+
+### Naive solution: Replication
+
+Locality is 1 (by copying from another server).
+
+This gives the optimal reconstruction bandwidth.
+
+### Use codes to improve storage efficiency
+
+Locality is $n-d+1$ (a failed server may need to contact that many other servers), so repair bandwidth is high.
+
+#### Parity codes
+
+Let $X_1,X_2,\ldots,X_k\in \mathbb{F}_2^t$ be the data blocks, and take one extra server to store the parity $X_1\oplus X_2\oplus\cdots\oplus X_k$, so there are $n=k+1$ servers in total.
+
+Reconstruction:
+
+Optimal for reconstruction bandwidth. Only need $k$ servers to reconstruct the file.
+
+Overhead:
+
+Only one additional server is needed.
+
+Repair:
+
+If any server fails, reconstruct it from the other $n-d+1=n-2+1=n-1$ servers.
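+
+As a small worked instance of this repair rule (the values $k=3$, $t=4$ and the blocks below are made up for illustration), take $X_1=1011$, $X_2=0110$, $X_3=1100$ with parity block $P$. If the server holding $X_2$ fails, it is rebuilt by XORing the $n-1=3$ surviving blocks:
+
+$$
+\begin{aligned}
+P &= X_1\oplus X_2\oplus X_3 = 1011\oplus 0110\oplus 1100 = 0001,\\
+X_2 &= X_1\oplus X_3\oplus P = 1011\oplus 1100\oplus 0001 = 0110.
+\end{aligned}
+$$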
+
+#### Reed-Solomon codes
+
+Fragment the file $X = (X_1, \ldots, X_k)$.
+
+The field size must satisfy $2^t\geq n$, where $n$ is the number of servers storing the file.
+
+Reconstruction:
+
+Any $k$ servers can reconstruct the file.
+
+Overhead:
+
+Uses $n$ servers to store $k$ fragments, i.e. $n-k$ redundant servers, where $n\leq 2^t$.
+
+Repair:
+
+Worse: repairing even a single failed server requires contacting $k$ other servers and downloading enough data to reconstruct the entire file.
+
+### New codes for storage systems
+
+#### EVENODD code
+
+- One of the first storage-specific codes.
+
+Can an XOR-only code be built that enables reconstruction when two disks are missing?
+
+(The locality/bandwidth problem is left for the next lecture.)
+
+For prime $m$, partition $X=(X_0,\ldots,X_{m-1})$ into $m$ blocks, each $X_i$ with $m-1$ bits.
+
+Store $Y_i=X_i$ in disks $0,1,\ldots,m-1$.
+
+Add two redundant disks $Y_m,Y_{m+1}$.
+
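+Write $a_{i,j}=(Y_j)_i$ for bit $i$ of disk $j$, with rows $i=0,\ldots,m-2$ and an imaginary all-zero row $a_{m-1,j}=0$.
+
+- $(Y_m)_i$ is the parity of row $i$: $(Y_m)_i=\bigoplus_{j=0}^{m-1}a_{i,j}$.
+- $(Y_{m+1})_i$ is a diagonal parity. First define $S=\bigoplus_{j=1}^{m-1}a_{m-1-j,j}$ (for $m=5$, $S=a_{0,4}\oplus a_{1,3}\oplus a_{2,2}\oplus a_{3,1}$), then $(Y_{m+1})_i=S\oplus \bigoplus_{j=0}^{m-1}a_{(i-j)\bmod m,\,j}$.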
+
+| $Y_0$ | $Y_1$ | $Y_2$ | $Y_3$ | $Y_4$ | $Y_5$ | $Y_6$ |
+|-------|-------|-------|-------|-------|-------|-------|
+| $1$ | $0$ | $1$ | $1$ | $0$ | $1$ | $0$ |
+| $0$ | $1$ | $1$ | $0$ | $0$ | $0$ | $0$ |
+| $1$ | $1$ | $0$ | $0$ | $0$ | $0$ | $1$ |
+| $0$ | $1$ | $0$ | $1$ | $1$ | $1$ | $0$ |
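+
+As a sanity check, the table above (with $m=5$) can be verified by hand. This sketch assumes the indexing $a_{i,j}=(Y_j)_i$ (row $i$, disk $j$), an all-zero ghost row $a_{4,j}=0$, and that diagonal $i$ consists of the cells $a_{(i-j)\bmod 5,\,j}$:
+
+$$
+\begin{aligned}
+S &= a_{0,4}\oplus a_{1,3}\oplus a_{2,2}\oplus a_{3,1} = 0\oplus 0\oplus 0\oplus 1 = 1,\\
+(Y_5)_0 &= 1\oplus 0\oplus 1\oplus 1\oplus 0 = 1,\\
+(Y_6)_0 &= S\oplus a_{0,0}\oplus a_{4,1}\oplus a_{3,2}\oplus a_{2,3}\oplus a_{1,4} = 1\oplus 1\oplus 0\oplus 0\oplus 0\oplus 0 = 0,\\
+(Y_6)_2 &= S\oplus a_{2,0}\oplus a_{1,1}\oplus a_{0,2}\oplus a_{4,3}\oplus a_{3,4} = 1\oplus 1\oplus 1\oplus 1\oplus 0\oplus 1 = 1.
+\end{aligned}
+$$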
+
+Note that $S$, the parity of the special diagonal, can be extracted from $Y_m$ and $Y_{m+1}$. Since $m$ is odd,
+
+$$
+\bigoplus_{j=0}^{m-2}(Y_m)_j\oplus \bigoplus_{j=0}^{m-2}(Y_{m+1})_j=\bigoplus_{j=1}^{m}S=S
+$$
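+
+One way to see why this identity holds (a sketch, writing $A$ for the XOR of all data bits and $cS$ for the XOR of $c$ copies of $S$; neither notation is used elsewhere in these notes): the row parities XOR to $A$, while the $m-1$ diagonal parities each contain one copy of $S$ and together cover every data bit except those on the $S$ diagonal, whose XOR is $S$ itself.
+
+$$
+\underbrace{A}_{\text{rows}}\oplus\underbrace{(m-1)S\oplus(A\oplus S)}_{\text{diagonals}} = mS = S \quad\text{for odd } m.
+$$
+
+In the $m=5$ table: $(1\oplus 0\oplus 0\oplus 1)\oplus(0\oplus 0\oplus 1\oplus 0)=1$, which matches $S=a_{0,4}\oplus a_{1,3}\oplus a_{2,2}\oplus a_{3,1}=1$.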
+
+Goal: Reconstruct if any two disks are missing.
+
+- If $Y_m, Y_{m+1}$ are missing, there is nothing to decode: all data disks are intact, so simply re-encode.
+- If $Y_i, Y_{m+1}$ are missing for $i < m$, decode like a parity code.
+- If $Y_i, Y_{m}$ are missing for $i < m$, similar, using diagonal parities.
+  
+The interesting case: $Y_i, Y_j$ are missing for $i,j < m$.
+
+Much like solving a sudoku puzzle, we fill in the missing values one at a time.
+
+First we recover $S$ from $Y_m$ and $Y_{m+1}$.
+
+Then we alternate: a diagonal with a single missing bit is solved using $Y_{m+1}$, which leaves a row with a single missing bit that is solved using $Y_m$, and so on.
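+
+To illustrate on the $m=5$ table, suppose $Y_0$ and $Y_2$ are lost (a choice made here for illustration; the indexing $a_{i,j}=(Y_j)_i$, with ghost row $a_{4,j}=0$ and diagonal $i$ being the cells $a_{(i-j)\bmod 5,\,j}$, is assumed). With $S=1$ already recovered, the diagonal through the ghost cell $a_{4,2}$ has only $a_{1,0}$ missing, and the chain of recovered bits proceeds as:
+
+$$
+a_{1,0}\to a_{1,2}\to a_{3,0}\to a_{3,2}\to a_{0,0}\to a_{0,2}\to a_{2,0}\to a_{2,2}
+$$
+
+The first two steps, using diagonal $1$ and then row $1$:
+
+$$
+\begin{aligned}
+a_{1,0} &= (Y_6)_1\oplus S\oplus a_{0,1}\oplus a_{4,2}\oplus a_{3,3}\oplus a_{2,4} = 0\oplus 1\oplus 0\oplus 0\oplus 1\oplus 0 = 0,\\
+a_{1,2} &= (Y_5)_1\oplus a_{1,0}\oplus a_{1,1}\oplus a_{1,3}\oplus a_{1,4} = 0\oplus 0\oplus 1\oplus 0\oplus 0 = 1.
+\end{aligned}
+$$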
+
+<details>
+<summary>Proof for why it always works</summary>
+
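+There are $m-1$ rows, or $m$ including an all-zero ghost row.
+
+$\mathbb{Z}_m$ is cyclic of prime order, so every non-zero element is a generator.
+
+Each diagonal-then-row step moves the row index by a fixed non-zero offset modulo $m$ (the difference between the two missing column indices), which is a generator. Starting from the ghost row, the process therefore visits every row before returning to the ghost row.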
+
+</details>
+
+This is an example of an array code:
+
+The message $(X_0,X_1,\ldots,X_{m-1})$ is a matrix in $\mathbb{F}_2^{(m-1)\times m}$.
+
+The codeword $(Y_0,Y_1,\ldots,Y_{m+1})$ is a matrix in $\mathbb{F}_2^{(m-1)\times (m+2)}$.
+
+Encoding is done over the base field $\mathbb{F}_q$ (here $q=2$, i.e. XOR only), even though each codeword symbol is a whole column.
+
+## Locally Recoverable Codes
+
+Locality: when a node $j$ fails,
+
+- A newcomer node joins the system.
+- The newcomer contacts a "small" number of helper nodes with the message "repairing $j$".
+- Each of the helper nodes sends something to the newcomer.
+- The newcomer aggregates the responses to find $Y_j$.
+
+Notes:
+
+- No adversarial behavior.
+- No privacy issues.
+- No concern about bandwidth (for now).
+
+Research question:
+
+- How small can the "small number of nodes" be?
+- How does that affect the rate/minimum distance of the code?
+- How to build codes with this capability?
+
+### Definition of locally recoverable code
+
+An $[n, k]_q$ code is called $r$-locally recoverable if
+
+- every codeword symbol $y_j$ has a recovering set $R_j \subseteq [n] \setminus \{j\}$ (where $[n]=\{1,2,\ldots,n\}$),
+- $y_j$ is computable from $\{y_i : i \in R_j\}$, and
+- $|R_j| \leq r$ for every $j \in [n]$.
+
+Notes:
+
+- Any $n-d+1$ nodes suffice to reconstruct the entire file; by the Singleton bound, $k\leq n-d+1$ always holds.
+- We want $r\ll n-d+1$.
+- $R_j$ does not depend on $y_j$, nor on the codewords $y$, only on $j$. (Need to repair without knowing $y,y_j$.)
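+
+A small example of the definition (constructed here for illustration, not taken from the lecture): the binary $[6,4]$ code
+
+$$
+(y_1,\ldots,y_6)=(x_1,\;x_2,\;x_1+x_2,\;x_3,\;x_4,\;x_3+x_4)
+$$
+
+is $2$-locally recoverable with recovering sets
+
+$$
+R_1=\{2,3\},\quad R_2=\{1,3\},\quad R_3=\{1,2\},\quad R_4=\{5,6\},\quad R_5=\{4,6\},\quad R_6=\{4,5\},
+$$
+
+since each symbol in a group of three is the sum of the other two, e.g. $y_1=y_2+y_3$.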
+
+### Bounds for Locally Recoverable Codes
+
+Let $\mathcal{C}$ be an $r$-locally recoverable $[n, k]_q$ code with minimum distance $d$.
+
+Bound 1: $\frac{k}{n}\leq \frac{r}{r+1}$.
+
+Bound 2: $d\leq n-k-\lceil\frac{k}{r}\rceil +2$.
+
+Notes:
+
+For $r=k$, bound 2 becomes $d\leq n-k+1$.
+
+- This recovers the Singleton bound, so bound 2 is its natural extension.
+
+For $r=1$, bound 1 becomes $\frac{k}{n}\leq \frac{1}{2}$.
+
+- The duplication code trivially attains this bound.
+
+For $r=1$, bound 2 becomes $d\leq n-2k+2$.
+
+- The duplication code trivially attains this bound.
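+
+As a quick numerical check (parameters chosen here for illustration): the duplication code with $k=3$ stores every symbol twice, so $n=6$, $d=2$ and $r=1$.
+
+$$
+\frac{k}{n}=\frac{3}{6}=\frac{1}{2}=\frac{r}{r+1},\qquad d=2=6-3-\left\lceil\tfrac{3}{1}\right\rceil+2=n-k-\left\lceil\tfrac{k}{r}\right\rceil+2.
+$$
+
+Both bounds hold with equality.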
+
+### Bound 1
+
+#### Turán's Lemma
+
+Let $G$ be a directed graph with $n$ vertices. Then $G$ has an induced acyclic subgraph (a DAG) on at least $\frac{n}{1+\operatorname{avg}_i(d^{out}_i)}$ nodes, where $d^{out}_i$ is the out-degree of vertex $i$.
+
+> Directed graphs have large acyclic subgraphs.
+
+<details>
+<summary>Proof via the probabilistic method</summary>
+
+> Useful for showing the existence of a large acyclic subgraph, but not for finding it.
+
+> [!TIP]
+>
+> Show that $\mathbb{E}[X]\geq \text{something}$; then there exists $\pi$ with $|U_\pi|\geq \text{something}$ by an averaging (pigeonhole-style) argument, since a random variable cannot always be below its mean.
+
+For a permutation $\pi$ of $[n]$, define a subset $U_\pi\subseteq [n]$ as follows.
+
+Let $i\in U_\pi$ if and only if each of the $d_i^{out}$ outgoing edges from $i$ connects to a node $j$ with $\pi(j)>\pi(i)$.
+
+In other words, after ordering the nodes by $\pi$, $U_\pi$ consists of the nodes all of whose outgoing edges point to the right.
+
+The subgraph induced on $U_\pi$ is clearly acyclic, since every edge in it increases $\pi$.
+
+Choose $\pi$ uniformly at random and let $X=|U_\pi|$ be a random variable.
+
+Let $X_i$ be the indicator random variable for $i\in U_\pi$.
+
+So $X=\sum_{i=1}^{n} X_i$.
+
+Using linearity of expectation, we have
+
+$$
+E[X]=\sum_{i=1}^{n} E[X_i]
+$$
+
+$E[X_i]$ is the probability that $\pi$ places $i$ before all of its out-neighbors.
+
+There are $(d_i^{out}+1)!$ relative orders of node $i$ and its out-neighbors, all equally likely.
+
+Of these, $d_i^{out}!$ place $i$ first (the out-neighbors may then be ordered arbitrarily).
+
+So, $E[X_i]=\frac{d_i^{out}!}{(d_i^{out}+1)!}=\frac{1}{d_i^{out}+1}$.
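+
+A toy instance (chosen here for illustration): let $G$ be the directed $3$-cycle $1\to 2\to 3\to 1$, so $d_i^{out}=1$ for every $i$.
+
+$$
+E[X]=\sum_{i=1}^{3}\frac{1}{d_i^{out}+1}=\frac{3}{2}=\frac{n}{1+\operatorname{avg}_i(d_i^{out})}
+$$
+
+Since $|U_\pi|$ is an integer, some $\pi$ gives $|U_\pi|\geq 2$. For example, the identity permutation yields $U_\pi=\{1,2\}$ (node $3$ is excluded because its edge points back to node $1$), and the subgraph induced on $\{1,2\}$ contains only the edge $1\to 2$.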
+
+Continue next time.
+
+</details>
--- a/content/CSE5313/_meta.js
+++ b/content/CSE5313/_meta.js
@@ -14,4 +14,5 @@ export default {
    CSE5313_L9: "CSE5313 Coding and information theory for data science (Lecture 9)",
    CSE5313_L10: "CSE5313 Coding and information theory for data science (Recitation 10)",
    CSE5313_L11: "CSE5313 Coding and information theory for data science (Recitation 11)",
+    CSE5313_L12: "CSE5313 Coding and information theory for data science (Lecture 12)",
 }