Files
NoteNextra-origin/content/CSE5313/CSE5313_L9.md
2025-11-04 12:43:23 -06:00

279 lines
7.3 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# CSE5313 Coding and information theory for data science (Lecture 9)
## Explicit optimal codes
Explicit optimal codes?
- Singleton, Sphere-packing provide restrictions.
- Gilbert-Varshamov provides existence.
Are there explicit optimal codes? That is,
- Easily (polynomial time) encodable, decodable.
Yes! This lecture:
Gustave Solomon [1930-1996] (Reed-Solomon code)
Irving S. Reed [1923-2012].
David E. Muller [1924-2008] (Reed-Muller code)
Using Polynomials over $\mathbb{F}_q$
## Reed-Solomon code
> [!NOTE]
>
> The fundamental theorem of algebra:
>
> A polynomial of degree $k$ has at most $k$ roots.
We have two equivalent definitions of a Reed-Solomon code:
- As polynomial evaluations.
- As linear codes (from generator matrix)
Efficient encoding (as linear codes)
Efficient decoding (use Euclidean algorithm)
### Definition of Reed-Solomon code from polynomial evaluations
> [!CAUTION]
>
> We assume $q\geq n$.
Every codeword corresponds to a polynomial of degree at most $k-1$.
Let $f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x]$ ($f_i\in \mathbb{F}_q$ for all $i$, $\deg(f)\leq k-1$).
Fix **distinct** $a_1,a_2,\ldots,a_n\in \mathbb{F}_q$.
#### Definition of Reed-Solomon code
A Reed-Solomon code is $\{f(a_1),f(a_2),\ldots,f(a_n)|f(x)\in \mathbb{F}_q^{k-1}[x]\}$.
- In words, the set of all evaluations at $a_1,a_2,\ldots,a_n$ of polynomials of degree at most $k-1$.
<details>
<summary>Example of Reed-Solomon code</summary>
Let $n=5$, $\mathbb{F}_q=\mathbb{Z}_5$, $k=3$.
||$a_0=0$|$a_1=1$|$a_2=2$|$a_3=3$|$a_4=4$|
|---|---|---|---|---|---|
|$f(x)=1$|$1$|$1$|$1$|$1$|$1$|
|$f(x)=x+2$|$2$|$3$|$4$|$0$|$1$|
|$f(x)=x^2+x$|$0$|$2$|$1$|$2$|$0$|
Here $d=n-k+1=3$.
</details>
#### Proposition: Reed-Solomon code is a linear code
A Reed-Solomon code is $\{f(a_1),f(a_2),\ldots,f(a_n)|f(x)\in \mathbb{F}_q^{k-1}[x]\}$ is a linear code.
<details>
<summary>Proof</summary>
First the code is closed under addition.
Let $f(x),g(x)\in \mathbb{F}_q^{k-1}[x]$, then $f(x)+g(x)\in \mathbb{F}_q^{k-1}[x]$.
$$
f(x)+g(x)=\sum_{i=0}^{k-1}(f_i+g_i)x^i
$$
Then the code is closed under scalar multiplication.
Let $f(x)\in \mathbb{F}_q^{k-1}[x]$, $c\in \mathbb{F}_q$, then $cf(x)\in \mathbb{F}_q^{k-1}[x]$.
$$
cf(x)=\sum_{i=0}^{k-1}(cf_i)x^i
$$
</details>
The dimension of the code is $k$.
#### Corollary: The Reed-Solomon code attains the Singleton bound with equality
The Reed-Solomon code has minimum distance $n-k+1$.
<details>
<summary>Proof</summary>
Let $c_f=(f(a_1),f(a_2),\ldots,f(a_n))$ and $c_g=(g(a_1),g(a_2),\ldots,g(a_n))$.
Since $f\neq g$, and $d(c_f,c_g)$ is the minimum distance of the code
Let $c_{f-g}=(f(a_1)-g(a_1),f(a_2)-g(a_2),\ldots,f(a_n)-g(a_n))$.
By the lemma for minimum distance, we have $d(c_f,c_g)=w_H(c_{f-g})=w_H((f-g)(a_1),(f-g)(a_2),\ldots,(f-g)(a_n))$ where $f-g\in \mathbb{F}_q^{k-1}[x]$.
So $n-w_H(c_{f-g})$ is the number of zeros (root) of the polynomial $f-g$.
So if $f-g$ has more than $k-1$ roots, then $f=g$.
So $n-d\leq k-1$, $d\geq n-k+1$.
Which is the Singleton bound.
</details>
### Definition of Reed-Solomon code from generator matrix
Every Reed-Solomon code is of the form $(f(a_1),f(a_2),\ldots,f(a_n))$ for some $f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x]$.
Observer that the evaluation map is a linear map.
$f(a_1)=f_0+f_1a_1+f_2a_1^2+\cdots+f_{k-1}a_1^{k-1}$
$f(a_2)=f_0+f_1a_2+f_2a_2^2+\cdots+f_{k-1}a_2^{k-1}$
$\vdots$
$f(a_n)=f_0+f_1a_n+f_2a_n^2+\cdots+f_{k-1}a_n^{k-1}$
So, every code word can be constructed by
$$
(f(a_1),f(a_2),\ldots,f(a_n))=(f_0,f_1,f_2,\ldots,f_{k-1})\begin{pmatrix}
1 & 1 & \cdots & 1\\
a_1 & a_2 & \cdots & a_n\\
a_1^2 & a_2^2 & \cdots & a_n^2\\
\vdots & \vdots & \cdots & \vdots\\
a_1^{k-1} & a_2^{k-1} & \cdots & a_n^{k-1}
\end{pmatrix}
$$
The generator matrix for Reed-Solomon code is a Vandermonde matrix $V(a_1,a_2,\ldots,a_n)$.
Fact: $V(a_1,a_2,\ldots,a_n)$ is invertible if and only if $a_1,a_2,\ldots,a_n$ are distinct. (that's how we choose $a_1,a_2,\ldots,a_n$)
The parity check matrix for Reed-Solomon code is also a Vandermonde matrix $V(a_1,a_2,\ldots,a_n)^\top$ with scalar multiples of the columns.
Some technical lemmas:
Let $G$ and $H$ be the generator and parity-check matrices of (any) linear code
$C = [n, k, d]_{\mathbb{F}_q}$. Then:
I. Then $H G^\top = 0$.
II. Any matrix $M \in \mathbb{F}_q^{n-k \times k}$ such that $\rank(M) = n - k$ and $M G^\top = 0$ is a parity-check matrix for $C$ (i.e. $C = \ker M$).
## Reed-Muller code
Reed-Solomon codes: Evaluations of univariate polynomials of deg ≤ $k-1$.
Reed-Muller codes: Evaluations of multivariate polynomials of deg $\leq k-1$
Example:
$$
f(x_1,x_2,x_3)=x_1x_2^2+x_1x_3+x_2+x_2x_3^3
$$
This is a degree 4 polynomial.
Usually we use $q=2$ for binary codes.
So $x^2=x$
### Definition of Reed-Muller code (binary case)
$$
RM(r,m)=\left\{(f(\alpha_1),\ldots,f(\alpha_2^m))|\alpha_i\in \mathbb{F}_2^m,\deg f\leq r\right\}
$$
Facts:
- Length $n = 2^m$.
- Minimum distance $2^{m-r}$ (not shown).
- Dimension = # of free coefficients in a multilinear polynomial of degree at most $r$.
- Dimension = # of subsets of $\{1, 2, \ldots, m\}$ of size at most $r$
- Dimension = $\sum_{i=0}^{r}\binom{m}{i}$
Exercises: Show that
1. $C_1 = RM(m-1,m) =$ Parity code.
2. $C_2 = RM(m-2,m) =$ Extended Hamming code.
3. $C_3 = RM(1,m) =$ Augmented Hadamard.
## Coding for storage
### Requirements/Challenges in Storage Systems
1. Challenge 1: Reconstruction.
- The data collector must be able to reconstruct the file, even if some are nonresponsive.
- Minimize **reconstruction** bandwidth.
2. Challenge 2: Repair.
- The system must maintain data consistency.
- Failed servers must be repaired:
- By contacting few other servers (locality, due to geographical constraints).
- By minimizing bandwidth.
3. Challenge 3: Storage overhead.
- Minimize space consumption.
- Minimize redundancy.
### Naive solution: Replication
Fragment the file $X = (X_1, \ldots, X_k)$.
- Size of $X_i$ = Whatever fits in a storage server.
Hold $r$ copies of each $X_i$.
- I.e., $n = rk$ servers in the system.
Storage overhead?
- $\frac{n}{k} = r$.
Repair?
- $X_i$ fails
- $\geq r$ failures is lost data.
Reconstruction?
- Possible if any $r-1$ servers fail.
- Impossible for some $\geq r$ failures.
### Use codes to improve storage efficiency
Reconstruction?
- Lecture 1: If $d_H(\mathcal{C})\geq d$, every pattern of at most $d-1$ erasures is recoverable.
- Idea: Treat unavailable servers as erasures.
Is this better/worse than replication?
- Say we wish to reconstruct from any $\approx \frac{n}{10}$ servers.
- What would be the redundancy in replication vs. coding?
Coding:
- Can reconstruct file from any $n-d+1\approx \frac{9}{10}n$ servers.
- Resulting overhead $\frac{n}{k}=\frac{n}{n-d+1}\approx \frac{10}{9}$ (constant!).
Replication:
To reconstruct from any $\frac{9}{10}n$ servers, need $r-1\approx \frac{1}{10}n$
Repair?
- Need low locality (repair by contacting few other servers).
- Need low bandwidth (repair by downloading as few bits as possible).
Repair in a replicated system:
- $X_i$ fails $\Rightarrow$ reconstruct from a different copy.
- Locality 1.
- Optimal bandwidth.
Repair in a coded system:
- repair one $Y_i$ $\approx$ Reconstruct the entire file.
- Locality $n-d+1$, high bandwidth.
- Much worse than replication.
New coding challenges: Minimize locality and bandwidth