updates
This commit is contained in:
278
content/CSE5313/CSE5313_L9.md
Normal file
278
content/CSE5313/CSE5313_L9.md
Normal file
@@ -0,0 +1,278 @@
|
||||
# CSE5313 Coding and information theory for data science (Lecture 9)
|
||||
|
||||
## Explicit optimal codes
|
||||
|
||||
Explicit optimal codes?
|
||||
|
||||
- Singleton, Sphere-packing provide restrictions.
|
||||
- Gilbert-Varshamov provides existence.
|
||||
|
||||
Are there explicit optimal codes? That is,
|
||||
|
||||
- Easily (polynomial time) encodable, decodable.
|
||||
|
||||
Yes! This lecture:
|
||||
|
||||
– Gustave Solomon [1930-1996] (Reed-Solomon code)
|
||||
– Irving S. Reed [1923-2012].
|
||||
– David E. Muller [1924-2008] (Reed-Muller code)
|
||||
|
||||
Using Polynomials over $\mathbb{F}_q$
|
||||
|
||||
## Reed-Solomon code
|
||||
|
||||
> [!NOTE]
|
||||
>
|
||||
> The fundamental theorem of algebra:
|
||||
>
|
||||
> A polynomial of degree $k$ has at most $k$ roots.
|
||||
|
||||
We have two equivalent definitions of a Reed-Solomon code:
|
||||
|
||||
- As polynomial evaluations.
|
||||
- As linear codes (from generator matrix)
|
||||
|
||||
Efficient encoding (as linear codes)
|
||||
|
||||
Efficient decoding (use Euclidean algorithm)
|
||||
|
||||
### Definition of Reed-Solomon code from polynomial evaluations
|
||||
|
||||
> [!DANGER]
|
||||
>
|
||||
> We assume $q\geq n$.
|
||||
|
||||
Every codeword corresponds to a polynomial of degree at most $k-1$.
|
||||
|
||||
Let $f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x]$ ($f_i\in \mathbb{F}_q$ for all $i$, $\deg(f)\leq k-1$).
|
||||
|
||||
Fix **distinct** $a_1,a_2,\ldots,a_n\in \mathbb{F}_q$.
|
||||
|
||||
#### Definition of Reed-Solomon code
|
||||
|
||||
A Reed-Solomon code is $\{f(a_1),f(a_2),\ldots,f(a_n)|f(x)\in \mathbb{F}_q^{k-1}[x]\}$.
|
||||
|
||||
- In words, the set of all evaluations at $a_1,a_2,\ldots,a_n$ of polynomials of degree at most $k-1$.
|
||||
|
||||
<details>
|
||||
<summary>Example of Reed-Solomon code</summary>
|
||||
|
||||
Let $n=5$, $\mathbb{F}_q=\mathbb{Z}_5$, $k=3$.
|
||||
|
||||
||$a_0=0$|$a_1=1$|$a_2=2$|$a_3=3$|$a_4=4$|
|
||||
|---|---|---|---|---|---|
|
||||
|$f(x)=1$|$1$|$1$|$1$|$1$|$1$|
|
||||
|$f(x)=x+2$|$2$|$3$|$4$|$0$|$1$|
|
||||
|$f(x)=x^2+x$|$0$|$2$|$1$|$2$|$0$|
|
||||
|
||||
Here $d=n-k+1=3$.
|
||||
|
||||
</details>
|
||||
|
||||
#### Proposition: Reed-Solomon code is a linear code
|
||||
|
||||
A Reed-Solomon code is $\{f(a_1),f(a_2),\ldots,f(a_n)|f(x)\in \mathbb{F}_q^{k-1}[x]\}$ is a linear code.
|
||||
|
||||
<details>
|
||||
<summary>Proof</summary>
|
||||
|
||||
First the code is closed under addition.
|
||||
|
||||
Let $f(x),g(x)\in \mathbb{F}_q^{k-1}[x]$, then $f(x)+g(x)\in \mathbb{F}_q^{k-1}[x]$.
|
||||
|
||||
$$
|
||||
f(x)+g(x)=\sum_{i=0}^{k-1}(f_i+g_i)x^i
|
||||
$$
|
||||
|
||||
Then the code is closed under scalar multiplication.
|
||||
|
||||
Let $f(x)\in \mathbb{F}_q^{k-1}[x]$, $c\in \mathbb{F}_q$, then $cf(x)\in \mathbb{F}_q^{k-1}[x]$.
|
||||
|
||||
$$
|
||||
cf(x)=\sum_{i=0}^{k-1}(cf_i)x^i
|
||||
$$
|
||||
|
||||
</details>
|
||||
|
||||
The dimension of the code is $k$.
|
||||
|
||||
#### Corollary: The Reed-Solomon code attains the Singleton bound with equality
|
||||
|
||||
The Reed-Solomon code has minimum distance $n-k+1$.
|
||||
|
||||
<details>
|
||||
<summary>Proof</summary>
|
||||
|
||||
Let $c_f=(f(a_1),f(a_2),\ldots,f(a_n))$ and $c_g=(g(a_1),g(a_2),\ldots,g(a_n))$.
|
||||
|
||||
Since $f\neq g$, and $d(c_f,c_g)$ is the minimum distance of the code
|
||||
|
||||
Let $c_{f-g}=(f(a_1)-g(a_1),f(a_2)-g(a_2),\ldots,f(a_n)-g(a_n))$.
|
||||
|
||||
By the lemma for minimum distance, we have $d(c_f,c_g)=w_H(c_{f-g})=w_H((f-g)(a_1),(f-g)(a_2),\ldots,(f-g)(a_n))$ where $f-g\in \mathbb{F}_q^{k-1}[x]$.
|
||||
|
||||
So $n-w_H(c_{f-g})$ is the number of zeros (root) of the polynomial $f-g$.
|
||||
|
||||
So if $f-g$ has more than $k-1$ roots, then $f=g$.
|
||||
|
||||
So $n-d\leq k-1$, $d\geq n-k+1$.
|
||||
|
||||
Which is the Singleton bound.
|
||||
|
||||
</details>
|
||||
|
||||
### Definition of Reed-Solomon code from generator matrix
|
||||
|
||||
Every Reed-Solomon code is of the form $(f(a_1),f(a_2),\ldots,f(a_n))$ for some $f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x]$.
|
||||
|
||||
Observer that the evaluation map is a linear map.
|
||||
|
||||
$f(a_1)=f_0+f_1a_1+f_2a_1^2+\cdots+f_{k-1}a_1^{k-1}$
|
||||
$f(a_2)=f_0+f_1a_2+f_2a_2^2+\cdots+f_{k-1}a_2^{k-1}$
|
||||
$\vdots$
|
||||
$f(a_n)=f_0+f_1a_n+f_2a_n^2+\cdots+f_{k-1}a_n^{k-1}$
|
||||
|
||||
So, every code word can be constructed by
|
||||
|
||||
$$
|
||||
(f(a_1),f(a_2),\ldots,f(a_n))=(f_0,f_1,f_2,\ldots,f_{k-1})\begin{pmatrix}
|
||||
1 & 1 & \cdots & 1\\
|
||||
a_1 & a_2 & \cdots & a_n\\
|
||||
a_1^2 & a_2^2 & \cdots & a_n^2\\
|
||||
\vdots & \vdots & \cdots & \vdots\\
|
||||
a_1^{k-1} & a_2^{k-1} & \cdots & a_n^{k-1}
|
||||
\end{pmatrix}
|
||||
$$
|
||||
|
||||
The generator matrix for Reed-Solomon code is a Vandermonde matrix $V(a_1,a_2,\ldots,a_n)$.
|
||||
|
||||
Fact: $V(a_1,a_2,\ldots,a_n)$ is invertible if and only if $a_1,a_2,\ldots,a_n$ are distinct. (that's how we choose $a_1,a_2,\ldots,a_n$)
|
||||
|
||||
The parity check matrix for Reed-Solomon code is also a Vandermonde matrix $V(a_1,a_2,\ldots,a_n)^T$ with scalar multiples of the columns.
|
||||
|
||||
Some technical lemmas:
|
||||
|
||||
Let $G$ and $H$ be the generator and parity-check matrices of (any) linear code
|
||||
$C = [n, k, d]_{\mathbb{F}_q}$. Then:
|
||||
|
||||
I. Then $H G^T = 0$.
|
||||
II. Any matrix $M \in \mathbb{F}_q^{n-k \times k}$ such that $\rank(M) = n - k$ and $M G^T = 0$ is a parity-check matrix for $C$ (i.e. $C = \ker M$).
|
||||
|
||||
## Reed-Muller code
|
||||
|
||||
Reed-Solomon codes: Evaluations of univariate polynomials of deg ≤ $k-1$.
|
||||
|
||||
Reed-Muller codes: Evaluations of multivariate polynomials of deg $\leq k-1$
|
||||
|
||||
Example:
|
||||
|
||||
$$
|
||||
f(x_1,x_2,x_3)=x_1x_2^2+x_1x_3+x_2+x_2x_3^3
|
||||
$$
|
||||
|
||||
This is a degree 4 polynomial.
|
||||
|
||||
Usually we use $q=2$ for binary codes.
|
||||
|
||||
So $x^2=x$
|
||||
|
||||
### Definition of Reed-Muller code (binary case)
|
||||
|
||||
$$
|
||||
RM(r,m)=\left\{(f(\alpha_1),\ldots,f(\alpha_2^m))|\alpha_i\in \mathbb{F}_2^m,\deg f\leq r\right\}
|
||||
$$
|
||||
|
||||
Facts:
|
||||
|
||||
- Length $n = 2^m$.
|
||||
- Minimum distance $2^{m-r}$ (not shown).
|
||||
- Dimension = # of free coefficients in a multilinear polynomial of degree at most $r$.
|
||||
- Dimension = # of subsets of $\{1, 2, \ldots, m\}$ of size at most $r$
|
||||
- Dimension = $\sum_{i=0}^{r}\binom{m}{i}$
|
||||
|
||||
Exercises: Show that
|
||||
|
||||
1. $C_1 = RM(m-1,m) =$ Parity code.
|
||||
2. $C_2 = RM(m-2,m) =$ Extended Hamming code.
|
||||
3. $C_3 = RM(1,m) =$ Augmented Hadamard.
|
||||
|
||||
## Coding for storage
|
||||
|
||||
### Requirements/Challenges in Storage Systems
|
||||
|
||||
1. Challenge 1: Reconstruction.
|
||||
- The data collector must be able to reconstruct the file, even if some are nonresponsive.
|
||||
- Minimize **reconstruction** bandwidth.
|
||||
2. Challenge 2: Repair.
|
||||
- The system must maintain data consistency.
|
||||
- Failed servers must be repaired:
|
||||
- By contacting few other servers (locality, due to geographical constraints).
|
||||
- By minimizing bandwidth.
|
||||
3. Challenge 3: Storage overhead.
|
||||
- Minimize space consumption.
|
||||
- Minimize redundancy.
|
||||
|
||||
### Naive solution: Replication
|
||||
|
||||
Fragment the file $X = (X_1, \ldots, X_k)$.
|
||||
|
||||
- Size of $X_i$ = Whatever fits in a storage server.
|
||||
|
||||
Hold $r$ copies of each $X_i$.
|
||||
|
||||
- I.e., $n = rk$ servers in the system.
|
||||
|
||||
Storage overhead?
|
||||
|
||||
- $\frac{n}{k} = r$.
|
||||
|
||||
Repair?
|
||||
|
||||
- $X_i$ fails
|
||||
- $\geq r$ failures is lost data.
|
||||
|
||||
Reconstruction?
|
||||
|
||||
- Possible if any $r-1$ servers fail.
|
||||
- Impossible for some $\geq r$ failures.
|
||||
|
||||
### Use codes to improve storage efficiency
|
||||
|
||||
Reconstruction?
|
||||
|
||||
- Lecture 1: If $d_H(\mathcal{C})\geq d$, every pattern of at most $d-1$ erasures is recoverable.
|
||||
|
||||
- Idea: Treat unavailable servers as erasures.
|
||||
|
||||
Is this better/worse than replication?
|
||||
|
||||
- Say we wish to reconstruct from any $\approx \frac{n}{10}$ servers.
|
||||
- What would be the redundancy in replication vs. coding?
|
||||
|
||||
Coding:
|
||||
|
||||
- Can reconstruct file from any $n-d+1\approx \frac{9}{10}n$ servers.
|
||||
- Resulting overhead $\frac{n}{k}=\frac{n}{n-d+1}\approx \frac{10}{9}$ (constant!).
|
||||
|
||||
Replication:
|
||||
|
||||
To reconstruct from any $\frac{9}{10}n$ servers, need $r-1\approx \frac{1}{10}n$
|
||||
|
||||
Repair?
|
||||
|
||||
- Need low locality (repair by contacting few other servers).
|
||||
- Need low bandwidth (repair by downloading as few bits as possible).
|
||||
|
||||
Repair in a replicated system:
|
||||
|
||||
- $X_i$ fails $\Rightarrow$ reconstruct from a different copy.
|
||||
- Locality 1.
|
||||
- Optimal bandwidth.
|
||||
|
||||
Repair in a coded system:
|
||||
|
||||
- repair one $Y_i$ $\approx$ Reconstruct the entire file.
|
||||
- Locality $n-d+1$, high bandwidth.
|
||||
- Much worse than replication.
|
||||
|
||||
New coding challenges: Minimize locality and bandwidth
|
||||
@@ -10,4 +10,6 @@ export default {
|
||||
CSE5313_L5: "CSE5313 Coding and information theory for data science (Lecture 5)",
|
||||
CSE5313_L6: "CSE5313 Coding and information theory for data science (Lecture 6)",
|
||||
CSE5313_L7: "CSE5313 Coding and information theory for data science (Lecture 7)",
|
||||
CSE5313_L8: "CSE5313 Coding and information theory for data science (Lecture 8)",
|
||||
CSE5313_L9: "CSE5313 Coding and information theory for data science (Lecture 9)",
|
||||
}
|
||||
Reference in New Issue
Block a user