Trance-0/NoteNextra-origin

Fork 0

Files

Trance-0 614479e4d0 update notations

2025-11-04 12:43:23 -06:00

7.3 KiB

Raw Blame History

CSE5313 Coding and information theory for data science (Lecture 9)

Explicit optimal codes

Explicit optimal codes?

Singleton, Sphere-packing provide restrictions.
Gilbert-Varshamov provides existence.

Are there explicit optimal codes? That is,

Easily (polynomial time) encodable, decodable.

Yes! This lecture:

– Gustave Solomon [1930-1996] (Reed-Solomon code) – Irving S. Reed [1923-2012]. – David E. Muller [1924-2008] (Reed-Muller code)

Using Polynomials over \mathbb{F}_q

Reed-Solomon code

Note

The fundamental theorem of algebra:

A polynomial of degree k has at most k roots.

We have two equivalent definitions of a Reed-Solomon code:

As polynomial evaluations.
As linear codes (from generator matrix)

Efficient encoding (as linear codes)

Efficient decoding (use Euclidean algorithm)

Definition of Reed-Solomon code from polynomial evaluations

Caution

We assume q\geq n.

Every codeword corresponds to a polynomial of degree at most k-1.

Let f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x] (f_i\in \mathbb{F}_q for all i, \deg(f)\leq k-1).

Fix distinct a_1,a_2,\ldots,a_n\in \mathbb{F}_q.

Definition of Reed-Solomon code

A Reed-Solomon code is \{f(a_1),f(a_2),\ldots,f(a_n)|f(x)\in \mathbb{F}_q^{k-1}[x]\}.

In words, the set of all evaluations at a_1,a_2,\ldots,a_n of polynomials of degree at most k-1.

Example of Reed-Solomon code

Let n=5, \mathbb{F}_q=\mathbb{Z}_5, k=3.

	`a_0=0`	`a_1=1`	`a_2=2`	`a_3=3`	`a_4=4`
`f(x)=1`	`1`	`1`	`1`	`1`	`1`
`f(x)=x+2`	`2`	`3`	`4`	`0`	`1`
`f(x)=x^2+x`	`0`	`2`	`1`	`2`	`0`

Here d=n-k+1=3.

Proposition: Reed-Solomon code is a linear code

A Reed-Solomon code is \{f(a_1),f(a_2),\ldots,f(a_n)|f(x)\in \mathbb{F}_q^{k-1}[x]\} is a linear code.

Proof

First the code is closed under addition.

Let f(x),g(x)\in \mathbb{F}_q^{k-1}[x], then f(x)+g(x)\in \mathbb{F}_q^{k-1}[x].


f(x)+g(x)=\sum_{i=0}^{k-1}(f_i+g_i)x^i

Then the code is closed under scalar multiplication.

Let f(x)\in \mathbb{F}_q^{k-1}[x], c\in \mathbb{F}_q, then cf(x)\in \mathbb{F}_q^{k-1}[x].


cf(x)=\sum_{i=0}^{k-1}(cf_i)x^i

The dimension of the code is k.

Corollary: The Reed-Solomon code attains the Singleton bound with equality

The Reed-Solomon code has minimum distance n-k+1.

Proof

Let c_f=(f(a_1),f(a_2),\ldots,f(a_n)) and c_g=(g(a_1),g(a_2),\ldots,g(a_n)).

Since f\neq g, and d(c_f,c_g) is the minimum distance of the code

Let c_{f-g}=(f(a_1)-g(a_1),f(a_2)-g(a_2),\ldots,f(a_n)-g(a_n)).

By the lemma for minimum distance, we have d(c_f,c_g)=w_H(c_{f-g})=w_H((f-g)(a_1),(f-g)(a_2),\ldots,(f-g)(a_n)) where f-g\in \mathbb{F}_q^{k-1}[x].

So n-w_H(c_{f-g}) is the number of zeros (root) of the polynomial f-g.

So if f-g has more than k-1 roots, then f=g.

So n-d\leq k-1, d\geq n-k+1.

Which is the Singleton bound.

Definition of Reed-Solomon code from generator matrix

Every Reed-Solomon code is of the form (f(a_1),f(a_2),\ldots,f(a_n)) for some f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x].

Observer that the evaluation map is a linear map.

f(a_1)=f_0+f_1a_1+f_2a_1^2+\cdots+f_{k-1}a_1^{k-1} f(a_2)=f_0+f_1a_2+f_2a_2^2+\cdots+f_{k-1}a_2^{k-1} \vdots f(a_n)=f_0+f_1a_n+f_2a_n^2+\cdots+f_{k-1}a_n^{k-1}

So, every code word can be constructed by


(f(a_1),f(a_2),\ldots,f(a_n))=(f_0,f_1,f_2,\ldots,f_{k-1})\begin{pmatrix}
1 & 1 & \cdots & 1\\
a_1 & a_2 & \cdots & a_n\\
a_1^2 & a_2^2 & \cdots & a_n^2\\
\vdots & \vdots & \cdots & \vdots\\
a_1^{k-1} & a_2^{k-1} & \cdots & a_n^{k-1}
\end{pmatrix}

The generator matrix for Reed-Solomon code is a Vandermonde matrix V(a_1,a_2,\ldots,a_n).

Fact: V(a_1,a_2,\ldots,a_n) is invertible if and only if a_1,a_2,\ldots,a_n are distinct. (that's how we choose a_1,a_2,\ldots,a_n)

The parity check matrix for Reed-Solomon code is also a Vandermonde matrix V(a_1,a_2,\ldots,a_n)^\top with scalar multiples of the columns.

Some technical lemmas:

Let G and H be the generator and parity-check matrices of (any) linear code C = [n, k, d]_{\mathbb{F}_q}. Then:

I. Then H G^\top = 0. II. Any matrix M \in \mathbb{F}_q^{n-k \times k} such that \rank(M) = n - k and M G^\top = 0 is a parity-check matrix for C (i.e. C = \ker M).

Reed-Muller code

Reed-Solomon codes: Evaluations of univariate polynomials of deg ≤ k-1.

Reed-Muller codes: Evaluations of multivariate polynomials of deg \leq k-1

Example:


f(x_1,x_2,x_3)=x_1x_2^2+x_1x_3+x_2+x_2x_3^3

This is a degree 4 polynomial.

Usually we use q=2 for binary codes.

So x^2=x

Definition of Reed-Muller code (binary case)


RM(r,m)=\left\{(f(\alpha_1),\ldots,f(\alpha_2^m))|\alpha_i\in \mathbb{F}_2^m,\deg f\leq r\right\}

Facts:

Length n = 2^m.
Minimum distance 2^{m-r} (not shown).
Dimension = # of free coefficients in a multilinear polynomial of degree at most r.
Dimension = # of subsets of \{1, 2, \ldots, m\} of size at most r
Dimension = \sum_{i=0}^{r}\binom{m}{i}

Exercises: Show that

C_1 = RM(m-1,m) = Parity code.
C_2 = RM(m-2,m) = Extended Hamming code.
C_3 = RM(1,m) = Augmented Hadamard.

Coding for storage

Requirements/Challenges in Storage Systems

Challenge 1: Reconstruction.
- The data collector must be able to reconstruct the file, even if some are nonresponsive.
  - Minimize reconstruction bandwidth.
Challenge 2: Repair.
- The system must maintain data consistency.
- Failed servers must be repaired:
  - By contacting few other servers (locality, due to geographical constraints).
  - By minimizing bandwidth.
Challenge 3: Storage overhead.
- Minimize space consumption.
  - Minimize redundancy.

Naive solution: Replication

Fragment the file X = (X_1, \ldots, X_k).

Size of X_i = Whatever fits in a storage server.

Hold r copies of each X_i.

I.e., n = rk servers in the system.

Storage overhead?

\frac{n}{k} = r.

Repair?

X_i fails
\geq r failures is lost data.

Reconstruction?

Possible if any r-1 servers fail.
Impossible for some \geq r failures.

Use codes to improve storage efficiency

Reconstruction?

Lecture 1: If d_H(\mathcal{C})\geq d, every pattern of at most d-1 erasures is recoverable.
Idea: Treat unavailable servers as erasures.

Is this better/worse than replication?

Say we wish to reconstruct from any \approx \frac{n}{10} servers.
What would be the redundancy in replication vs. coding?

Coding:

Can reconstruct file from any n-d+1\approx \frac{9}{10}n servers.
Resulting overhead \frac{n}{k}=\frac{n}{n-d+1}\approx \frac{10}{9} (constant!).

Replication:

To reconstruct from any \frac{9}{10}n servers, need r-1\approx \frac{1}{10}n