Files
NoteNextra-origin/content/CSE5313/CSE5313_L9.md
2025-11-04 12:43:23 -06:00

7.3 KiB
Raw Blame History

CSE5313 Coding and information theory for data science (Lecture 9)

Explicit optimal codes

Explicit optimal codes?

  • Singleton, Sphere-packing provide restrictions.
  • Gilbert-Varshamov provides existence.

Are there explicit optimal codes? That is,

  • Easily (polynomial time) encodable, decodable.

Yes! This lecture:

Gustave Solomon [1930-1996] (Reed-Solomon code) Irving S. Reed [1923-2012]. David E. Muller [1924-2008] (Reed-Muller code)

Using Polynomials over \mathbb{F}_q

Reed-Solomon code

Note

The fundamental theorem of algebra:

A polynomial of degree k has at most k roots.

We have two equivalent definitions of a Reed-Solomon code:

  • As polynomial evaluations.
  • As linear codes (from generator matrix)

Efficient encoding (as linear codes)

Efficient decoding (use Euclidean algorithm)

Definition of Reed-Solomon code from polynomial evaluations

Caution

We assume q\geq n.

Every codeword corresponds to a polynomial of degree at most k-1.

Let f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x] (f_i\in \mathbb{F}_q for all i, \deg(f)\leq k-1).

Fix distinct a_1,a_2,\ldots,a_n\in \mathbb{F}_q.

Definition of Reed-Solomon code

A Reed-Solomon code is \{f(a_1),f(a_2),\ldots,f(a_n)|f(x)\in \mathbb{F}_q^{k-1}[x]\}.

  • In words, the set of all evaluations at a_1,a_2,\ldots,a_n of polynomials of degree at most k-1.
Example of Reed-Solomon code

Let n=5, \mathbb{F}_q=\mathbb{Z}_5, k=3.

a_0=0 a_1=1 a_2=2 a_3=3 a_4=4
f(x)=1 1 1 1 1 1
f(x)=x+2 2 3 4 0 1
f(x)=x^2+x 0 2 1 2 0

Here d=n-k+1=3.

Proposition: Reed-Solomon code is a linear code

A Reed-Solomon code is \{f(a_1),f(a_2),\ldots,f(a_n)|f(x)\in \mathbb{F}_q^{k-1}[x]\} is a linear code.

Proof

First the code is closed under addition.

Let f(x),g(x)\in \mathbb{F}_q^{k-1}[x], then f(x)+g(x)\in \mathbb{F}_q^{k-1}[x].


f(x)+g(x)=\sum_{i=0}^{k-1}(f_i+g_i)x^i

Then the code is closed under scalar multiplication.

Let f(x)\in \mathbb{F}_q^{k-1}[x], c\in \mathbb{F}_q, then cf(x)\in \mathbb{F}_q^{k-1}[x].


cf(x)=\sum_{i=0}^{k-1}(cf_i)x^i

The dimension of the code is k.

Corollary: The Reed-Solomon code attains the Singleton bound with equality

The Reed-Solomon code has minimum distance n-k+1.

Proof

Let c_f=(f(a_1),f(a_2),\ldots,f(a_n)) and c_g=(g(a_1),g(a_2),\ldots,g(a_n)).

Since f\neq g, and d(c_f,c_g) is the minimum distance of the code

Let c_{f-g}=(f(a_1)-g(a_1),f(a_2)-g(a_2),\ldots,f(a_n)-g(a_n)).

By the lemma for minimum distance, we have d(c_f,c_g)=w_H(c_{f-g})=w_H((f-g)(a_1),(f-g)(a_2),\ldots,(f-g)(a_n)) where f-g\in \mathbb{F}_q^{k-1}[x].

So n-w_H(c_{f-g}) is the number of zeros (root) of the polynomial f-g.

So if f-g has more than k-1 roots, then f=g.

So n-d\leq k-1, d\geq n-k+1.

Which is the Singleton bound.

Definition of Reed-Solomon code from generator matrix

Every Reed-Solomon code is of the form (f(a_1),f(a_2),\ldots,f(a_n)) for some f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x].

Observer that the evaluation map is a linear map.

f(a_1)=f_0+f_1a_1+f_2a_1^2+\cdots+f_{k-1}a_1^{k-1} f(a_2)=f_0+f_1a_2+f_2a_2^2+\cdots+f_{k-1}a_2^{k-1} \vdots f(a_n)=f_0+f_1a_n+f_2a_n^2+\cdots+f_{k-1}a_n^{k-1}

So, every code word can be constructed by


(f(a_1),f(a_2),\ldots,f(a_n))=(f_0,f_1,f_2,\ldots,f_{k-1})\begin{pmatrix}
1 & 1 & \cdots & 1\\
a_1 & a_2 & \cdots & a_n\\
a_1^2 & a_2^2 & \cdots & a_n^2\\
\vdots & \vdots & \cdots & \vdots\\
a_1^{k-1} & a_2^{k-1} & \cdots & a_n^{k-1}
\end{pmatrix}

The generator matrix for Reed-Solomon code is a Vandermonde matrix V(a_1,a_2,\ldots,a_n).

Fact: V(a_1,a_2,\ldots,a_n) is invertible if and only if a_1,a_2,\ldots,a_n are distinct. (that's how we choose a_1,a_2,\ldots,a_n)

The parity check matrix for Reed-Solomon code is also a Vandermonde matrix V(a_1,a_2,\ldots,a_n)^\top with scalar multiples of the columns.

Some technical lemmas:

Let G and H be the generator and parity-check matrices of (any) linear code C = [n, k, d]_{\mathbb{F}_q}. Then:

I. Then H G^\top = 0. II. Any matrix M \in \mathbb{F}_q^{n-k \times k} such that \rank(M) = n - k and M G^\top = 0 is a parity-check matrix for C (i.e. C = \ker M).

Reed-Muller code

Reed-Solomon codes: Evaluations of univariate polynomials of deg ≤ k-1.

Reed-Muller codes: Evaluations of multivariate polynomials of deg \leq k-1

Example:


f(x_1,x_2,x_3)=x_1x_2^2+x_1x_3+x_2+x_2x_3^3

This is a degree 4 polynomial.

Usually we use q=2 for binary codes.

So x^2=x

Definition of Reed-Muller code (binary case)


RM(r,m)=\left\{(f(\alpha_1),\ldots,f(\alpha_2^m))|\alpha_i\in \mathbb{F}_2^m,\deg f\leq r\right\}

Facts:

  • Length n = 2^m.
  • Minimum distance 2^{m-r} (not shown).
  • Dimension = # of free coefficients in a multilinear polynomial of degree at most r.
  • Dimension = # of subsets of \{1, 2, \ldots, m\} of size at most r
  • Dimension = \sum_{i=0}^{r}\binom{m}{i}

Exercises: Show that

  1. C_1 = RM(m-1,m) = Parity code.
  2. C_2 = RM(m-2,m) = Extended Hamming code.
  3. C_3 = RM(1,m) = Augmented Hadamard.

Coding for storage

Requirements/Challenges in Storage Systems

  1. Challenge 1: Reconstruction.
    • The data collector must be able to reconstruct the file, even if some are nonresponsive.
      • Minimize reconstruction bandwidth.
  2. Challenge 2: Repair.
    • The system must maintain data consistency.
    • Failed servers must be repaired:
      • By contacting few other servers (locality, due to geographical constraints).
      • By minimizing bandwidth.
  3. Challenge 3: Storage overhead.
    • Minimize space consumption.
      • Minimize redundancy.

Naive solution: Replication

Fragment the file X = (X_1, \ldots, X_k).

  • Size of X_i = Whatever fits in a storage server.

Hold r copies of each X_i.

  • I.e., n = rk servers in the system.

Storage overhead?

  • \frac{n}{k} = r.

Repair?

  • X_i fails
  • \geq r failures is lost data.

Reconstruction?

  • Possible if any r-1 servers fail.
  • Impossible for some \geq r failures.

Use codes to improve storage efficiency

Reconstruction?

  • Lecture 1: If d_H(\mathcal{C})\geq d, every pattern of at most d-1 erasures is recoverable.

  • Idea: Treat unavailable servers as erasures.

Is this better/worse than replication?

  • Say we wish to reconstruct from any \approx \frac{n}{10} servers.
  • What would be the redundancy in replication vs. coding?

Coding:

  • Can reconstruct file from any n-d+1\approx \frac{9}{10}n servers.
  • Resulting overhead \frac{n}{k}=\frac{n}{n-d+1}\approx \frac{10}{9} (constant!).

Replication:

To reconstruct from any \frac{9}{10}n servers, need r-1\approx \frac{1}{10}n

Repair?

  • Need low locality (repair by contacting few other servers).
  • Need low bandwidth (repair by downloading as few bits as possible).

Repair in a replicated system:

  • X_i fails \Rightarrow reconstruct from a different copy.
  • Locality 1.
  • Optimal bandwidth.

Repair in a coded system:

  • repair one Y_i \approx Reconstruct the entire file.
    • Locality n-d+1, high bandwidth.
  • Much worse than replication.

New coding challenges: Minimize locality and bandwidth