# CSE5313 Coding and information theory for data science (Lecture 9) ## Explicit optimal codes Explicit optimal codes? - Singleton, Sphere-packing provide restrictions. - Gilbert-Varshamov provides existence. Are there explicit optimal codes? That is, - Easily (polynomial time) encodable, decodable. Yes! This lecture: – Gustave Solomon [1930-1996] (Reed-Solomon code) – Irving S. Reed [1923-2012]. – David E. Muller [1924-2008] (Reed-Muller code) Using Polynomials over $\mathbb{F}_q$ ## Reed-Solomon code > [!NOTE] > > The fundamental theorem of algebra: > > A polynomial of degree $k$ has at most $k$ roots. We have two equivalent definitions of a Reed-Solomon code: - As polynomial evaluations. - As linear codes (from generator matrix) Efficient encoding (as linear codes) Efficient decoding (use Euclidean algorithm) ### Definition of Reed-Solomon code from polynomial evaluations > [!CAUTION] > > We assume $q\geq n$. Every codeword corresponds to a polynomial of degree at most $k-1$. Let $f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x]$ ($f_i\in \mathbb{F}_q$ for all $i$, $\deg(f)\leq k-1$). Fix **distinct** $a_1,a_2,\ldots,a_n\in \mathbb{F}_q$. #### Definition of Reed-Solomon code A Reed-Solomon code is $\{f(a_1),f(a_2),\ldots,f(a_n)|f(x)\in \mathbb{F}_q^{k-1}[x]\}$. - In words, the set of all evaluations at $a_1,a_2,\ldots,a_n$ of polynomials of degree at most $k-1$.
Example of Reed-Solomon code Let $n=5$, $\mathbb{F}_q=\mathbb{Z}_5$, $k=3$. ||$a_0=0$|$a_1=1$|$a_2=2$|$a_3=3$|$a_4=4$| |---|---|---|---|---|---| |$f(x)=1$|$1$|$1$|$1$|$1$|$1$| |$f(x)=x+2$|$2$|$3$|$4$|$0$|$1$| |$f(x)=x^2+x$|$0$|$2$|$1$|$2$|$0$| Here $d=n-k+1=3$.
#### Proposition: Reed-Solomon code is a linear code A Reed-Solomon code is $\{f(a_1),f(a_2),\ldots,f(a_n)|f(x)\in \mathbb{F}_q^{k-1}[x]\}$ is a linear code.
Proof First the code is closed under addition. Let $f(x),g(x)\in \mathbb{F}_q^{k-1}[x]$, then $f(x)+g(x)\in \mathbb{F}_q^{k-1}[x]$. $$ f(x)+g(x)=\sum_{i=0}^{k-1}(f_i+g_i)x^i $$ Then the code is closed under scalar multiplication. Let $f(x)\in \mathbb{F}_q^{k-1}[x]$, $c\in \mathbb{F}_q$, then $cf(x)\in \mathbb{F}_q^{k-1}[x]$. $$ cf(x)=\sum_{i=0}^{k-1}(cf_i)x^i $$
The dimension of the code is $k$. #### Corollary: The Reed-Solomon code attains the Singleton bound with equality The Reed-Solomon code has minimum distance $n-k+1$.
Proof Let $c_f=(f(a_1),f(a_2),\ldots,f(a_n))$ and $c_g=(g(a_1),g(a_2),\ldots,g(a_n))$. Since $f\neq g$, and $d(c_f,c_g)$ is the minimum distance of the code Let $c_{f-g}=(f(a_1)-g(a_1),f(a_2)-g(a_2),\ldots,f(a_n)-g(a_n))$. By the lemma for minimum distance, we have $d(c_f,c_g)=w_H(c_{f-g})=w_H((f-g)(a_1),(f-g)(a_2),\ldots,(f-g)(a_n))$ where $f-g\in \mathbb{F}_q^{k-1}[x]$. So $n-w_H(c_{f-g})$ is the number of zeros (root) of the polynomial $f-g$. So if $f-g$ has more than $k-1$ roots, then $f=g$. So $n-d\leq k-1$, $d\geq n-k+1$. Which is the Singleton bound.
### Definition of Reed-Solomon code from generator matrix Every Reed-Solomon code is of the form $(f(a_1),f(a_2),\ldots,f(a_n))$ for some $f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x]$. Observer that the evaluation map is a linear map. $f(a_1)=f_0+f_1a_1+f_2a_1^2+\cdots+f_{k-1}a_1^{k-1}$ $f(a_2)=f_0+f_1a_2+f_2a_2^2+\cdots+f_{k-1}a_2^{k-1}$ $\vdots$ $f(a_n)=f_0+f_1a_n+f_2a_n^2+\cdots+f_{k-1}a_n^{k-1}$ So, every code word can be constructed by $$ (f(a_1),f(a_2),\ldots,f(a_n))=(f_0,f_1,f_2,\ldots,f_{k-1})\begin{pmatrix} 1 & 1 & \cdots & 1\\ a_1 & a_2 & \cdots & a_n\\ a_1^2 & a_2^2 & \cdots & a_n^2\\ \vdots & \vdots & \cdots & \vdots\\ a_1^{k-1} & a_2^{k-1} & \cdots & a_n^{k-1} \end{pmatrix} $$ The generator matrix for Reed-Solomon code is a Vandermonde matrix $V(a_1,a_2,\ldots,a_n)$. Fact: $V(a_1,a_2,\ldots,a_n)$ is invertible if and only if $a_1,a_2,\ldots,a_n$ are distinct. (that's how we choose $a_1,a_2,\ldots,a_n$) The parity check matrix for Reed-Solomon code is also a Vandermonde matrix $V(a_1,a_2,\ldots,a_n)^\top$ with scalar multiples of the columns. Some technical lemmas: Let $G$ and $H$ be the generator and parity-check matrices of (any) linear code $C = [n, k, d]_{\mathbb{F}_q}$. Then: I. Then $H G^\top = 0$. II. Any matrix $M \in \mathbb{F}_q^{n-k \times k}$ such that $\rank(M) = n - k$ and $M G^\top = 0$ is a parity-check matrix for $C$ (i.e. $C = \ker M$). ## Reed-Muller code Reed-Solomon codes: Evaluations of univariate polynomials of deg ≤ $k-1$. Reed-Muller codes: Evaluations of multivariate polynomials of deg $\leq k-1$ Example: $$ f(x_1,x_2,x_3)=x_1x_2^2+x_1x_3+x_2+x_2x_3^3 $$ This is a degree 4 polynomial. Usually we use $q=2$ for binary codes. So $x^2=x$ ### Definition of Reed-Muller code (binary case) $$ RM(r,m)=\left\{(f(\alpha_1),\ldots,f(\alpha_2^m))|\alpha_i\in \mathbb{F}_2^m,\deg f\leq r\right\} $$ Facts: - Length $n = 2^m$. - Minimum distance $2^{m-r}$ (not shown). - Dimension = # of free coefficients in a multilinear polynomial of degree at most $r$. - Dimension = # of subsets of $\{1, 2, \ldots, m\}$ of size at most $r$ - Dimension = $\sum_{i=0}^{r}\binom{m}{i}$ Exercises: Show that 1. $C_1 = RM(m-1,m) =$ Parity code. 2. $C_2 = RM(m-2,m) =$ Extended Hamming code. 3. $C_3 = RM(1,m) =$ Augmented Hadamard. ## Coding for storage ### Requirements/Challenges in Storage Systems 1. Challenge 1: Reconstruction. - The data collector must be able to reconstruct the file, even if some are nonresponsive. - Minimize **reconstruction** bandwidth. 2. Challenge 2: Repair. - The system must maintain data consistency. - Failed servers must be repaired: - By contacting few other servers (locality, due to geographical constraints). - By minimizing bandwidth. 3. Challenge 3: Storage overhead. - Minimize space consumption. - Minimize redundancy. ### Naive solution: Replication Fragment the file $X = (X_1, \ldots, X_k)$. - Size of $X_i$ = Whatever fits in a storage server. Hold $r$ copies of each $X_i$. - I.e., $n = rk$ servers in the system. Storage overhead? - $\frac{n}{k} = r$. Repair? - $X_i$ fails - $\geq r$ failures is lost data. Reconstruction? - Possible if any $r-1$ servers fail. - Impossible for some $\geq r$ failures. ### Use codes to improve storage efficiency Reconstruction? - Lecture 1: If $d_H(\mathcal{C})\geq d$, every pattern of at most $d-1$ erasures is recoverable. - Idea: Treat unavailable servers as erasures. Is this better/worse than replication? - Say we wish to reconstruct from any $\approx \frac{n}{10}$ servers. - What would be the redundancy in replication vs. coding? Coding: - Can reconstruct file from any $n-d+1\approx \frac{9}{10}n$ servers. - Resulting overhead $\frac{n}{k}=\frac{n}{n-d+1}\approx \frac{10}{9}$ (constant!). Replication: To reconstruct from any $\frac{9}{10}n$ servers, need $r-1\approx \frac{1}{10}n$ Repair? - Need low locality (repair by contacting few other servers). - Need low bandwidth (repair by downloading as few bits as possible). Repair in a replicated system: - $X_i$ fails $\Rightarrow$ reconstruct from a different copy. - Locality 1. - Optimal bandwidth. Repair in a coded system: - repair one $Y_i$ $\approx$ Reconstruct the entire file. - Locality $n-d+1$, high bandwidth. - Much worse than replication. New coding challenges: Minimize locality and bandwidth