From 21ef2250c00a2b488e5aae19c61ea5e6b317e7fa Mon Sep 17 00:00:00 2001 From: Trance-0 <60459821+Trance-0@users.noreply.github.com> Date: Thu, 25 Sep 2025 12:48:46 -0500 Subject: [PATCH] updates --- content/CSE5313/CSE5313_L9.md | 278 ++++++++++++++++++++++++++++++++++ content/CSE5313/_meta.js | 2 + 2 files changed, 280 insertions(+) create mode 100644 content/CSE5313/CSE5313_L9.md diff --git a/content/CSE5313/CSE5313_L9.md b/content/CSE5313/CSE5313_L9.md new file mode 100644 index 0000000..354e67b --- /dev/null +++ b/content/CSE5313/CSE5313_L9.md @@ -0,0 +1,278 @@ +# CSE5313 Coding and information theory for data science (Lecture 9) + +## Explicit optimal codes + +What we know so far: + +- The Singleton and sphere-packing bounds provide restrictions. +- The Gilbert-Varshamov bound provides existence. + +Are there explicit optimal codes? That is, codes that are + +- Easily (in polynomial time) encodable and decodable. + +Yes! This lecture: + +- Reed-Solomon codes: Irving S. Reed [1923-2012] and Gustave Solomon [1930-1996]. +- Reed-Muller codes: Irving S. Reed and David E. Muller [1924-2008]. + +Both constructions use polynomials over $\mathbb{F}_q$. + +## Reed-Solomon code + +> [!NOTE] +> +> A key fact about polynomials over a field: +> +> A nonzero polynomial of degree $k$ has at most $k$ roots. + +We have two equivalent definitions of a Reed-Solomon code: + +- As polynomial evaluations. +- As linear codes (from a generator matrix). + +Efficient encoding (as linear codes). + +Efficient decoding (using the Euclidean algorithm). + +### Definition of Reed-Solomon code from polynomial evaluations + +> [!DANGER] +> +> We assume $q\geq n$, so that $n$ distinct evaluation points exist in $\mathbb{F}_q$. + +Every codeword corresponds to a polynomial of degree at most $k-1$. + +Let $f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x]$ ($f_i\in \mathbb{F}_q$ for all $i$, $\deg(f)\leq k-1$). + +Fix **distinct** $a_1,a_2,\ldots,a_n\in \mathbb{F}_q$. + +#### Definition of Reed-Solomon code + +A Reed-Solomon code is $\{(f(a_1),f(a_2),\ldots,f(a_n))|f(x)\in \mathbb{F}_q^{k-1}[x]\}$. 
+ +- In words, the set of all evaluations at $a_1,a_2,\ldots,a_n$ of polynomials of degree at most $k-1$. + +
+Example of Reed-Solomon code + +Let $n=5$, $\mathbb{F}_q=\mathbb{Z}_5$, $k=3$. + +||$a_0=0$|$a_1=1$|$a_2=2$|$a_3=3$|$a_4=4$| +|---|---|---|---|---|---| +|$f(x)=1$|$1$|$1$|$1$|$1$|$1$| +|$f(x)=x+2$|$2$|$3$|$4$|$0$|$1$| +|$f(x)=x^2+x$|$0$|$2$|$1$|$2$|$0$| + +Here $d=n-k+1=3$. + +
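
The table above can be reproduced with a few lines of code. This is a minimal sketch, assuming a prime field $\mathbb{Z}_p$ so that arithmetic is plain integer arithmetic modulo $p$; the function name `rs_encode` and the coefficient ordering (lowest degree first) are illustrative choices, not part of the lecture.

```python
def rs_encode(coeffs, points, p):
    """Evaluate f(x) = sum(coeffs[i] * x**i) at every point, modulo the prime p.

    coeffs is (f_0, f_1, ..., f_{k-1}); points are the distinct a_1, ..., a_n.
    """
    codeword = []
    for a in points:
        value = 0
        # Horner's rule: process coefficients from highest to lowest degree.
        for c in reversed(coeffs):
            value = (value * a + c) % p
        codeword.append(value)
    return codeword


p = 5
points = [0, 1, 2, 3, 4]  # all of Z_5, so n = 5
print(rs_encode([1, 0, 0], points, p))  # f(x) = 1
print(rs_encode([2, 1, 0], points, p))  # f(x) = x + 2
print(rs_encode([0, 1, 1], points, p))  # f(x) = x^2 + x
```

Each printed list should match the corresponding row of the table.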
+ +#### Proposition: Reed-Solomon code is a linear code + +A Reed-Solomon code $\{(f(a_1),f(a_2),\ldots,f(a_n))|f(x)\in \mathbb{F}_q^{k-1}[x]\}$ is a linear code. + +
+Proof + +First, the code is closed under addition. + +Let $f(x),g(x)\in \mathbb{F}_q^{k-1}[x]$, then $f(x)+g(x)\in \mathbb{F}_q^{k-1}[x]$, since + +$$ +f(x)+g(x)=\sum_{i=0}^{k-1}(f_i+g_i)x^i +$$ + +The sum of the codewords of $f$ and $g$ is the codeword of $f+g$, because $(f+g)(a_j)=f(a_j)+g(a_j)$ for every $j$. + +Next, the code is closed under scalar multiplication. + +Let $f(x)\in \mathbb{F}_q^{k-1}[x]$, $c\in \mathbb{F}_q$, then $cf(x)\in \mathbb{F}_q^{k-1}[x]$, since + +$$ +cf(x)=\sum_{i=0}^{k-1}(cf_i)x^i +$$ + +and the codeword of $cf$ is $c$ times the codeword of $f$. + +
+ +The dimension of the code is $k$. + +#### Corollary: The Reed-Solomon code attains the Singleton bound with equality + +The Reed-Solomon code has minimum distance $n-k+1$. + +
+Proof + +Let $f\neq g$ be two polynomials in $\mathbb{F}_q^{k-1}[x]$ with codewords $c_f=(f(a_1),f(a_2),\ldots,f(a_n))$ and $c_g=(g(a_1),g(a_2),\ldots,g(a_n))$, chosen so that $d(c_f,c_g)=d$ is the minimum distance of the code. + +Let $c_{f-g}=(f(a_1)-g(a_1),f(a_2)-g(a_2),\ldots,f(a_n)-g(a_n))$. + +By the lemma for minimum distance, we have $d(c_f,c_g)=w_H(c_{f-g})=w_H((f-g)(a_1),(f-g)(a_2),\ldots,(f-g)(a_n))$ where $f-g\in \mathbb{F}_q^{k-1}[x]$ is nonzero. + +So $n-w_H(c_{f-g})$ is the number of roots of the polynomial $f-g$ among $a_1,a_2,\ldots,a_n$. + +If $f-g$ had more than $k-1$ roots, then $f=g$, so $f-g$ has at most $k-1$ roots. + +So $n-d\leq k-1$, that is, $d\geq n-k+1$. + +The Singleton bound gives $d\leq n-k+1$, so $d=n-k+1$. + +
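
The corollary can be checked by brute force on small parameters. The sketch below assumes a prime field $\mathbb{Z}_p$ and the evaluation points $0,1,\ldots,n-1$; the helper names `rs_codeword` and `min_distance` are hypothetical. Since the code is linear, its minimum distance equals the minimum Hamming weight over all nonzero codewords, which is what the loop computes.

```python
from itertools import product


def rs_codeword(coeffs, points, p):
    """Codeword (f(a_1), ..., f(a_n)) for f(x) = sum(coeffs[i] * x**i) over Z_p."""
    return tuple(
        sum(c * pow(a, i, p) for i, c in enumerate(coeffs)) % p for a in points
    )


def min_distance(n, k, p):
    """Minimum Hamming weight over all nonzero codewords (requires n <= p)."""
    points = list(range(n))  # distinct elements of Z_p because n <= p
    best = n
    for coeffs in product(range(p), repeat=k):
        if any(coeffs):  # skip the zero polynomial
            weight = sum(1 for s in rs_codeword(coeffs, points, p) if s != 0)
            best = min(best, weight)
    return best


n, k, p = 5, 3, 5
print(min_distance(n, k, p), n - k + 1)
```

The two printed numbers should agree. The search visits all $p^k$ polynomials, so it is only practical for tiny parameters.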
+ +### Definition of Reed-Solomon code from generator matrix + +Every Reed-Solomon codeword is of the form $(f(a_1),f(a_2),\ldots,f(a_n))$ for some $f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x]$. + +Observe that the evaluation map is linear in the coefficients $f_0,f_1,\ldots,f_{k-1}$. + +$f(a_1)=f_0+f_1a_1+f_2a_1^2+\cdots+f_{k-1}a_1^{k-1}$ +$f(a_2)=f_0+f_1a_2+f_2a_2^2+\cdots+f_{k-1}a_2^{k-1}$ +$\vdots$ +$f(a_n)=f_0+f_1a_n+f_2a_n^2+\cdots+f_{k-1}a_n^{k-1}$ + +So, every codeword can be constructed by + +$$ +(f(a_1),f(a_2),\ldots,f(a_n))=(f_0,f_1,f_2,\ldots,f_{k-1})\begin{pmatrix} +1 & 1 & \cdots & 1\\ +a_1 & a_2 & \cdots & a_n\\ +a_1^2 & a_2^2 & \cdots & a_n^2\\ +\vdots & \vdots & \ddots & \vdots\\ +a_1^{k-1} & a_2^{k-1} & \cdots & a_n^{k-1} +\end{pmatrix} +$$ + +The generator matrix for Reed-Solomon code is a $k\times n$ Vandermonde matrix $V(a_1,a_2,\ldots,a_n)$. + +Fact: a square Vandermonde matrix $V(a_1,a_2,\ldots,a_n)$ (the case $k=n$) is invertible if and only if $a_1,a_2,\ldots,a_n$ are distinct. (That's why we choose $a_1,a_2,\ldots,a_n$ distinct.) + +The parity-check matrix for Reed-Solomon code is also a Vandermonde matrix $V(a_1,a_2,\ldots,a_n)^T$ with scalar multiples of the columns. + +Some technical lemmas: + +Let $G$ and $H$ be the generator and parity-check matrices of (any) linear code +$C = [n, k, d]_{\mathbb{F}_q}$. Then: + +I. $H G^T = 0$. +II. Any matrix $M \in \mathbb{F}_q^{(n-k) \times n}$ such that $\operatorname{rank}(M) = n - k$ and $M G^T = 0$ is a parity-check matrix for $C$ (i.e. $C = \ker M$). + +## Reed-Muller code + +Reed-Solomon codes: Evaluations of univariate polynomials of $\deg \leq k-1$. + +Reed-Muller codes: Evaluations of multivariate polynomials of $\deg \leq k-1$. + +Example: + +$$ +f(x_1,x_2,x_3)=x_1x_2^2+x_1x_3+x_2+x_2x_3^3 +$$ + +This is a degree 4 polynomial. + +Usually we use $q=2$ for binary codes. + +Over $\mathbb{F}_2$ we have $x^2=x$ for every $x$, so it suffices to consider multilinear polynomials. + +### Definition of Reed-Muller code (binary case) + +$$ +RM(r,m)=\left\{(f(\alpha_1),\ldots,f(\alpha_{2^m}))|\alpha_i\in \mathbb{F}_2^m,\deg f\leq r\right\} +$$ + +where $\alpha_1,\ldots,\alpha_{2^m}$ are all the points of $\mathbb{F}_2^m$. + +Facts: + +- Length $n = 2^m$. 
+- Minimum distance $2^{m-r}$ (not shown). +- Dimension = # of free coefficients in a multilinear polynomial of degree at most $r$. +- Dimension = # of subsets of $\{1, 2, \ldots, m\}$ of size at most $r$. +- Dimension = $\sum_{i=0}^{r}\binom{m}{i}$. + +Exercises: Show that + +1. $C_1 = RM(m-1,m) =$ Parity code. +2. $C_2 = RM(m-2,m) =$ Extended Hamming code. +3. $C_3 = RM(1,m) =$ Augmented Hadamard. + +## Coding for storage + +### Requirements/Challenges in Storage Systems + +1. Challenge 1: Reconstruction. + - The data collector must be able to reconstruct the file, even if some servers are nonresponsive. + - Minimize **reconstruction** bandwidth. +2. Challenge 2: Repair. + - The system must maintain data consistency. + - Failed servers must be repaired: + - By contacting few other servers (locality, due to geographical constraints). + - By minimizing bandwidth. +3. Challenge 3: Storage overhead. + - Minimize space consumption. + - Minimize redundancy. + +### Naive solution: Replication + +Fragment the file $X = (X_1, \ldots, X_k)$. + +- Size of $X_i$ = Whatever fits in a storage server. + +Hold $r$ copies of each $X_i$. + +- I.e., $n = rk$ servers in the system. + +Storage overhead? + +- $\frac{n}{k} = r$. + +Repair? + +- If a server holding $X_i$ fails, it is repaired from another copy of $X_i$. +- With $\geq r$ failures, data may be lost. + +Reconstruction? + +- Possible after any $r-1$ server failures. +- Impossible for some patterns of $\geq r$ failures. + +### Use codes to improve storage efficiency + +Reconstruction? + +- Lecture 1: If $d_H(\mathcal{C})\geq d$, every pattern of at most $d-1$ erasures is recoverable. + +- Idea: Treat unavailable servers as erasures. + +Is this better/worse than replication? + +- Say we wish to reconstruct even when any $\approx \frac{n}{10}$ servers are unavailable. +- What would be the redundancy in replication vs. coding? + +Coding: + +- Can reconstruct file from any $n-d+1\approx \frac{9}{10}n$ servers. +- Resulting overhead $\frac{n}{k}=\frac{n}{n-d+1}\approx \frac{10}{9}$ (constant!). 
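
To make the comparison concrete, here is a small sketch that computes both overheads when about $\frac{n}{10}$ servers may be unavailable. It assumes an MDS code (one meeting the Singleton bound, such as Reed-Solomon), so $k=n-d+1$, and it uses the replication rule stated earlier that $r$ copies survive any $r-1$ failures. The function names are illustrative only.

```python
def replication_overhead(failures):
    """Storage overhead r needed so that any `failures` failed servers are tolerated.

    r copies of each fragment survive any r - 1 failures, so r = failures + 1.
    """
    return failures + 1


def mds_overhead(n, failures):
    """Storage overhead n / k of an [n, k, d] MDS code with d - 1 = failures.

    For an MDS code k = n - d + 1 = n - failures.
    """
    k = n - failures
    return n / k


for n in (100, 1000, 10000):
    failures = n // 10
    print(n, replication_overhead(failures), round(mds_overhead(n, failures), 3))
```

Under these assumptions the coding overhead stays near $\frac{10}{9}$ for every $n$, while the replication overhead grows with the number of failures to tolerate.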
+ +Replication: + +- To reconstruct from any $\frac{9}{10}n$ servers, we need $r-1\approx \frac{1}{10}n$, so the overhead $r\approx \frac{n}{10}$ grows linearly with $n$. + +Repair? + +- Need low locality (repair by contacting few other servers). +- Need low bandwidth (repair by downloading as few bits as possible). + +Repair in a replicated system: + +- $X_i$ fails $\Rightarrow$ reconstruct from a different copy. +- Locality 1. +- Optimal bandwidth. + +Repair in a coded system: + +- Repairing one $Y_i$ $\approx$ reconstructing the entire file. + - Locality $n-d+1$, high bandwidth. +- Much worse than replication. + +New coding challenges: Minimize locality and bandwidth. diff --git a/content/CSE5313/_meta.js b/content/CSE5313/_meta.js index 2068cc7..35560ed 100644 --- a/content/CSE5313/_meta.js +++ b/content/CSE5313/_meta.js @@ -10,4 +10,6 @@ export default { CSE5313_L5: "CSE5313 Coding and information theory for data science (Lecture 5)", CSE5313_L6: "CSE5313 Coding and information theory for data science (Lecture 6)", CSE5313_L7: "CSE5313 Coding and information theory for data science (Lecture 7)", + CSE5313_L8: "CSE5313 Coding and information theory for data science (Lecture 8)", + CSE5313_L9: "CSE5313 Coding and information theory for data science (Lecture 9)", } \ No newline at end of file