CSE5313 Coding and information theory for data science (Lecture 9)
Explicit optimal codes
Explicit optimal codes?
- Singleton, Sphere-packing provide restrictions.
- Gilbert-Varshamov provides existence.
Are there explicit optimal codes? That is,
- Easily (polynomial time) encodable, decodable.
Yes! This lecture:
- Gustave Solomon [1930-1996] (Reed-Solomon code).
- Irving S. Reed [1923-2012] (Reed-Solomon and Reed-Muller codes).
- David E. Muller [1924-2008] (Reed-Muller code).
Using Polynomials over \mathbb{F}_q
Reed-Solomon code
Note
The fundamental theorem of algebra:
A nonzero polynomial of degree k over a field has at most k roots.
We have two equivalent definitions of a Reed-Solomon code:
- As polynomial evaluations.
- As linear codes (from generator matrix)
Efficient encoding (as linear codes)
Efficient decoding (use Euclidean algorithm)
Definition of Reed-Solomon code from polynomial evaluations
[!DANGER]
We assume
q\geq n.
Every codeword corresponds to a polynomial of degree at most k-1.
Let f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x] (f_i\in \mathbb{F}_q for all i, \deg(f)\leq k-1).
Fix distinct a_1,a_2,\ldots,a_n\in \mathbb{F}_q.
Definition of Reed-Solomon code
A Reed-Solomon code is \{(f(a_1),f(a_2),\ldots,f(a_n)) \mid f(x)\in \mathbb{F}_q^{k-1}[x]\}.
- In words, the set of all evaluations at a_1,a_2,\ldots,a_n of polynomials of degree at most k-1.
Example of Reed-Solomon code
Let n=5, \mathbb{F}_q=\mathbb{Z}_5, k=3.
| | a_0=0 | a_1=1 | a_2=2 | a_3=3 | a_4=4 |
|---|---|---|---|---|---|
| f(x)=1 | 1 | 1 | 1 | 1 | 1 |
| f(x)=x+2 | 2 | 3 | 4 | 0 | 1 |
| f(x)=x^2+x | 0 | 2 | 1 | 2 | 0 |
Here d=n-k+1=3.
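The table above can be reproduced with a few lines of code. This is a minimal sketch that assumes a prime field \mathbb{Z}_p (so arithmetic is plain integer arithmetic modulo p); the names `poly_eval` and `rs_encode` are illustrative and not part of the lecture.

```python
# Sketch: Reed-Solomon encoding by polynomial evaluation over a prime field Z_p.
# Assumption: p is prime; function names are hypothetical helpers.

def poly_eval(coeffs, x, p):
    """Evaluate f(x) = sum(coeffs[i] * x**i) modulo p using Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

def rs_encode(coeffs, points, p):
    """Return the codeword (f(a) for each evaluation point a)."""
    return [poly_eval(coeffs, a, p) for a in points]

p = 5
points = [0, 1, 2, 3, 4]                      # distinct evaluation points in Z_5

row_const = rs_encode([1, 0, 0], points, p)   # f(x) = 1
row_lin = rs_encode([2, 1, 0], points, p)     # f(x) = x + 2
row_quad = rs_encode([0, 1, 1], points, p)    # f(x) = x^2 + x
print(row_const, row_lin, row_quad)
```

Coefficients are listed from f_0 up to f_{k-1}, matching the definition of f(x) given earlier.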
Proposition: Reed-Solomon code is a linear code
A Reed-Solomon code \{(f(a_1),f(a_2),\ldots,f(a_n)) \mid f(x)\in \mathbb{F}_q^{k-1}[x]\} is a linear code.
Proof
First the code is closed under addition.
Let f(x),g(x)\in \mathbb{F}_q^{k-1}[x], then f(x)+g(x)\in \mathbb{F}_q^{k-1}[x].
f(x)+g(x)=\sum_{i=0}^{k-1}(f_i+g_i)x^i
Then the code is closed under scalar multiplication.
Let f(x)\in \mathbb{F}_q^{k-1}[x], c\in \mathbb{F}_q, then cf(x)\in \mathbb{F}_q^{k-1}[x].
cf(x)=\sum_{i=0}^{k-1}(cf_i)x^i
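The dimension of the code is k: for k\leq n the evaluation map f\mapsto (f(a_1),\ldots,f(a_n)) is injective, since a nonzero polynomial of degree at most k-1 has at most k-1<n roots and so cannot vanish at all n points.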
Corollary: The Reed-Solomon code attains the Singleton bound with equality
The Reed-Solomon code has minimum distance n-k+1.
Proof
Let f\neq g be such that c_f=(f(a_1),f(a_2),\ldots,f(a_n)) and c_g=(g(a_1),g(a_2),\ldots,g(a_n)) attain the minimum distance of the code, i.e. d(c_f,c_g)=d.
Let c_{f-g}=(f(a_1)-g(a_1),f(a_2)-g(a_2),\ldots,f(a_n)-g(a_n)).
By the lemma for minimum distance (in a linear code, the distance between two codewords is the weight of their difference), we have d(c_f,c_g)=w_H(c_{f-g})=w_H((f-g)(a_1),(f-g)(a_2),\ldots,(f-g)(a_n)) where f-g\in \mathbb{F}_q^{k-1}[x].
So n-w_H(c_{f-g})=n-d is the number of points a_i that are zeros (roots) of the polynomial f-g.
If f-g had more than k-1 roots, it would be the zero polynomial, i.e. f=g, a contradiction.
So n-d\leq k-1, i.e. d\geq n-k+1.
The Singleton bound gives d\leq n-k+1, so d=n-k+1.
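As a sanity check of the corollary, the minimum distance can be computed by brute force for small parameters. The sketch below assumes the prime field \mathbb{Z}_5 with all five field elements as evaluation points; `rs_min_distance` is a hypothetical helper name. Because the code is linear, it is enough to take the minimum Hamming weight over the nonzero codewords.

```python
from itertools import product

def rs_min_distance(points, k, p):
    """Brute-force minimum Hamming weight over all nonzero codewords.

    For a linear code this equals the minimum distance.
    Assumption: p is prime and the points are distinct elements of Z_p.
    """
    best = len(points)
    for coeffs in product(range(p), repeat=k):
        if not any(coeffs):
            continue  # skip the zero polynomial (the zero codeword)
        word = [sum(c * pow(a, i, p) for i, c in enumerate(coeffs)) % p
                for a in points]
        weight = sum(1 for symbol in word if symbol != 0)
        best = min(best, weight)
    return best

p = 5
points = [0, 1, 2, 3, 4]
n = len(points)
distances = {k: rs_min_distance(points, k, p) for k in (1, 2, 3, 4)}
for k, d in distances.items():
    print(f"k={k}: d={d}, n-k+1={n - k + 1}")
```

The search visits all q^k polynomials, so it is only meant for toy parameters.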
Definition of Reed-Solomon code from generator matrix
Every Reed-Solomon code is of the form (f(a_1),f(a_2),\ldots,f(a_n)) for some f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x].
Observe that the evaluation map is a linear map.
f(a_1)=f_0+f_1a_1+f_2a_1^2+\cdots+f_{k-1}a_1^{k-1}
f(a_2)=f_0+f_1a_2+f_2a_2^2+\cdots+f_{k-1}a_2^{k-1}
\vdots
f(a_n)=f_0+f_1a_n+f_2a_n^2+\cdots+f_{k-1}a_n^{k-1}
So, every code word can be constructed by
(f(a_1),f(a_2),\ldots,f(a_n))=(f_0,f_1,f_2,\ldots,f_{k-1})\begin{pmatrix}
1 & 1 & \cdots & 1\\
a_1 & a_2 & \cdots & a_n\\
a_1^2 & a_2^2 & \cdots & a_n^2\\
\vdots & \vdots & \ddots & \vdots\\
a_1^{k-1} & a_2^{k-1} & \cdots & a_n^{k-1}
\end{pmatrix}
The generator matrix for Reed-Solomon code is a Vandermonde matrix V(a_1,a_2,\ldots,a_n).
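Fact: a square Vandermonde matrix V(a_1,a_2,\ldots,a_k) is invertible if and only if a_1,a_2,\ldots,a_k are distinct (that is why we choose distinct a_1,a_2,\ldots,a_n). In particular, any k columns of the generator matrix are linearly independent.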
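The matrix form of encoding can be sketched as follows, again assuming a prime field \mathbb{Z}_p. The helper names `vandermonde` and `encode_with_generator` are illustrative only.

```python
def vandermonde(points, k, p):
    """k x n matrix whose row i is (a_1^i, ..., a_n^i) modulo p."""
    return [[pow(a, i, p) for a in points] for i in range(k)]

def encode_with_generator(message, G, p):
    """Row vector times matrix: (f_0, ..., f_{k-1}) * G over Z_p."""
    n = len(G[0])
    return [sum(m * G[i][j] for i, m in enumerate(message)) % p
            for j in range(n)]

p, k = 5, 3
points = [0, 1, 2, 3, 4]
G = vandermonde(points, k, p)
codeword = encode_with_generator([0, 1, 1], G, p)   # message of f(x) = x^2 + x
print(G)
print(codeword)
```

The resulting codeword agrees with direct evaluation of f(x)=x^2+x at the five points, which is what the equivalence of the two definitions predicts.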
The parity-check matrix for a Reed-Solomon code is also of Vandermonde type: an (n-k)\times n Vandermonde matrix on a_1,a_2,\ldots,a_n whose columns are multiplied by nonzero scalars.
Some technical lemmas:
Let G and H be the generator and parity-check matrices of (any) linear code C = [n, k, d]_{\mathbb{F}_q}. Then:
I. H G^T = 0.
II. Any matrix M \in \mathbb{F}_q^{(n-k) \times n} such that \operatorname{rank}(M) = n - k and M G^T = 0 is a parity-check matrix for C (i.e. C = \ker M).
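Lemma II can be tried out on the running example. The sketch below assumes the special case where the evaluation points are all of \mathbb{Z}_5 (n=q=5, k=3); in that case the plain Vandermonde matrix with n-k rows already satisfies M G^T=0, because \sum_{a\in\mathbb{F}_q}a^j=0 for 0\leq j<q-1. For other choices of points, nonzero column scalars are needed. All helper names are hypothetical, and `pow(x, -1, p)` requires Python 3.8 or later.

```python
def vandermonde(points, rows, p):
    """Matrix with the given number of rows; row i is (a^i for a in points) mod p."""
    return [[pow(a, i, p) for a in points] for i in range(rows)]

def times_transpose(A, B, p):
    """Compute A * B^T over Z_p."""
    return [[sum(x * y for x, y in zip(ra, rb)) % p for rb in B] for ra in A]

def rank_mod_p(M, p):
    """Rank over Z_p (p prime) by Gauss-Jordan elimination on a copy of M."""
    M = [row[:] for row in M]
    rank = 0
    for col in range(len(M[0])):
        pivot = next((r for r in range(rank, len(M)) if M[r][col] % p), None)
        if pivot is None:
            continue
        M[rank], M[pivot] = M[pivot], M[rank]
        inv = pow(M[rank][col], -1, p)          # modular inverse (Python 3.8+)
        M[rank] = [(v * inv) % p for v in M[rank]]
        for r in range(len(M)):
            if r != rank and M[r][col] % p:
                factor = M[r][col]
                M[r] = [(v - factor * w) % p for v, w in zip(M[r], M[rank])]
        rank += 1
    return rank

p, n, k = 5, 5, 3
points = [0, 1, 2, 3, 4]          # assumption: all of Z_5 is used
G = vandermonde(points, k, p)
M = vandermonde(points, n - k, p) # candidate parity-check matrix
product_MG = times_transpose(M, G, p)
print(product_MG, rank_mod_p(M, p))
```

Since the rank is n-k and the product is the zero matrix, Lemma II says this M is a parity-check matrix for the example code.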
Reed-Muller code
Reed-Solomon codes: Evaluations of univariate polynomials of deg \leq k-1.
Reed-Muller codes: Evaluations of multivariate polynomials of deg \leq k-1.
Example:
f(x_1,x_2,x_3)=x_1x_2^2+x_1x_3+x_2+x_2x_3^3
This is a degree 4 polynomial.
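Usually we use q=2 for binary codes.
Then x^2=x for every x\in \mathbb{F}_2, so it suffices to consider multilinear polynomials (each variable appears with degree at most 1).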
Definition of Reed-Muller code (binary case)
RM(r,m)=\left\{(f(\alpha_1),\ldots,f(\alpha_{2^m})) \mid \deg f\leq r\right\}, where \alpha_1,\ldots,\alpha_{2^m} are all the points of \mathbb{F}_2^m.
Facts:
- Length n = 2^m.
- Minimum distance 2^{m-r} (not shown).
- Dimension = # of free coefficients in a multilinear polynomial of degree at most r.
- Dimension = # of subsets of \{1, 2, \ldots, m\} of size at most r.
- Dimension = \sum_{i=0}^{r}\binom{m}{i}.
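These facts can be checked numerically for small r and m. The following sketch builds a generator matrix whose rows are the evaluations of the multilinear monomials \prod_{i\in S}x_i with |S|\leq r, then finds the minimum weight by exhaustive search. The function names are illustrative, and the search is exponential in the dimension, so it is only suitable for toy parameters.

```python
from itertools import combinations, product
from math import comb

def rm_generator(r, m):
    """Rows: evaluations on all of F_2^m of monomials prod_{i in S} x_i, |S| <= r."""
    pts = list(product((0, 1), repeat=m))
    rows = []
    for size in range(r + 1):
        for S in combinations(range(m), size):
            # The empty set S gives the constant monomial 1.
            rows.append([int(all(pt[i] for i in S)) for pt in pts])
    return rows

def rm_min_distance(r, m):
    """Minimum Hamming weight over all nonzero F_2-combinations of the rows."""
    G = rm_generator(r, m)
    n = len(G[0])
    best = n
    for coeffs in product((0, 1), repeat=len(G)):
        if not any(coeffs):
            continue
        word = [sum(c * row[j] for c, row in zip(coeffs, G)) % 2 for j in range(n)]
        best = min(best, sum(word))
    return best

for r, m in [(1, 3), (2, 4)]:
    G = rm_generator(r, m)
    dim_formula = sum(comb(m, i) for i in range(r + 1))
    print(r, m, len(G[0]), len(G), dim_formula, rm_min_distance(r, m))
```

A nonzero minimum weight also confirms that the monomial rows are linearly independent, so the number of rows really is the dimension.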
Exercises: Show that
- C_1 = RM(m-1,m) = Parity code.
- C_2 = RM(m-2,m) = Extended Hamming code.
- C_3 = RM(1,m) = Augmented Hadamard.
Coding for storage
Requirements/Challenges in Storage Systems
- Challenge 1: Reconstruction.
- The data collector must be able to reconstruct the file, even if some servers are nonresponsive.
- Minimize reconstruction bandwidth.
- Challenge 2: Repair.
- The system must maintain data consistency.
- Failed servers must be repaired:
- By contacting few other servers (locality, due to geographical constraints).
- By minimizing bandwidth.
- Challenge 3: Storage overhead.
- Minimize space consumption.
- Minimize redundancy.
Naive solution: Replication
Fragment the file X = (X_1, \ldots, X_k).
- Size of X_i = whatever fits in a storage server.
Hold r copies of each X_i.
- I.e., n = rk servers in the system.
Storage overhead?
\frac{n}{k} = r.
Repair?
- If X_i fails in all of its copies (\geq r failures), the data is lost.
Reconstruction?
- Possible if any r-1 servers fail.
- Impossible for some \geq r failures.
Use codes to improve storage efficiency
Reconstruction?
- Lecture 1: If d_H(\mathcal{C})\geq d, every pattern of at most d-1 erasures is recoverable.
- Idea: Treat unavailable servers as erasures.
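For a Reed-Solomon code, treating unavailable servers as erasures amounts to polynomial interpolation: any k surviving symbols determine the polynomial of degree at most k-1, and hence the whole codeword. The sketch below uses Lagrange interpolation over a prime field \mathbb{Z}_p; it is one straightforward way to do erasure recovery, not necessarily the decoder used in the course. The helper names are hypothetical, and `pow(x, -1, p)` requires Python 3.8 or later.

```python
def lagrange_interpolate_at(xs, ys, x, p):
    """Value at x of the unique polynomial of degree < len(xs) through (xs[i], ys[i]), mod p."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, -1, p)) % p
    return total

def recover_codeword(received, points, k, p):
    """Fill erased positions (marked None) from any k surviving symbols."""
    alive = [(a, y) for a, y in zip(points, received) if y is not None]
    if len(alive) < k:
        raise ValueError("more than n-k erasures: cannot recover")
    xs, ys = zip(*alive[:k])
    return [lagrange_interpolate_at(xs, ys, a, p) for a in points]

p, k = 5, 3
points = [0, 1, 2, 3, 4]
# Codeword of f(x) = x^2 + x with two servers unavailable (d - 1 = 2 erasures).
received = [0, None, 1, None, 0]
recovered = recover_codeword(received, points, k, p)
print(recovered)
```

With d=n-k+1=3, any two erasures leave k=3 symbols, which is exactly what interpolation needs.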
Is this better/worse than replication?
- Say we wish to reconstruct from any \approx \frac{9}{10}n servers (i.e., tolerate \approx \frac{n}{10} failures).
- What would be the redundancy in replication vs. coding?
Coding:
- Can reconstruct file from any n-d+1\approx \frac{9}{10}n servers.
- Resulting overhead \frac{n}{k}=\frac{n}{n-d+1}\approx \frac{10}{9} (constant!).
Replication:
To reconstruct from any \frac{9}{10}n servers, need r-1\approx \frac{1}{10}n, so the overhead \frac{n}{k}=r\approx \frac{n}{10} grows linearly with n.
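A small numeric comparison makes the gap concrete. This sketch assumes an MDS code (such as Reed-Solomon) on the coding side, so that k=n-d+1, and ignores the divisibility constraint n=rk on the replication side; the function names are illustrative.

```python
def coding_overhead(n, failures):
    """MDS code: reconstruct from any k = n - failures servers; overhead is n / k."""
    k = n - failures
    return n / k

def replication_overhead(failures):
    """Replication survives any r - 1 failures, so r = failures + 1; overhead n / k = r."""
    return failures + 1

for n in (100, 1000):
    t = n // 10                       # tolerate about n/10 unavailable servers
    print(n, round(coding_overhead(n, t), 3), replication_overhead(t))
```

For n=100 this gives an overhead of about 1.11 for coding against 11 for replication, and the replication figure keeps growing with n.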
Repair?
- Need low locality (repair by contacting few other servers).
- Need low bandwidth (repair by downloading as few bits as possible).
Repair in a replicated system:
- X_i fails \Rightarrow reconstruct from a different copy.
- Locality 1.
- Optimal bandwidth.
Repair in a coded system:
- Repairing one Y_i \approx reconstructing the entire file.
- Locality n-d+1, high bandwidth.
- Much worse than replication.
New coding challenges: Minimize locality and bandwidth