CSE5313 Coding and information theory for data science (Lecture 2)
Review of channel coding
Let F be the input alphabet, \Phi be the output alphabet.
e.g. F=\{0,1\},\mathbb{R}.
Introduce noise: \operatorname{Pr}(c'\text{ received}|c\text{ transmitted}).
We use u to denote the information word to be transmitted,
c the transmitted codeword,
c' the received word, which is given to the decoder,
and u' the decoded information word.
Error if u' \neq u.
Example:
Binary symmetric channel (BSC)
F=\Phi=\{0,1\}
Every bit of c is flipped with probability p.
Binary erasure channel (BEC)
F=\{0,1\}, \Phi=\{0,1,*\}; very common in practice when we are unsure whether a bit was received.
c is transmitted, c' is received.
Each bit of c' equals the corresponding bit of c with probability 1-p, and is erased (replaced by *) with probability p.
Encoding
Encoding E is a function from F^k to F^n, where E(u)=c is the codeword.
Assume n\geq k (we do not compress the information).
A code \mathcal{C} is a subset of F^n.
Encoding is a one-to-one mapping from F^k to \mathcal{C}.
In practice, we usually choose \mathcal{C}\subseteq F^n to be of size |F|^k.
Decoding
D is a function from \Phi^n to \mathcal{C}.
D(c')=\hat{c}
The decoder then outputs the unique u' such that E(u')=\hat{c}.
Our aim is to have u=u'.
Decoding error probability: \operatorname{P}_{err}=\max_{c\in \mathcal{C}}\operatorname{P}_{err}(c).
where \operatorname{P}_{err}(c)=\sum_{y|D(y)\neq c}\operatorname{Pr}(y\text{ received}|c\text{ transmitted}).
Our goal is to construct a decoder D such that \operatorname{P}_{err} is small (and ideally vanishes as n grows).
Example:
Repetition code in binary symmetric channel:
Let F=\Phi=\{0,1\}. Every bit of c is flipped with probability p.
Say k=1, n=3 and let \mathcal{C}=\{000,111\}.
Let the encoder be E(u)=u u u.
The decoder is D(000)=D(100)=D(010)=D(001)=0, D(110)=D(101)=D(011)=D(111)=1.
Exercise: Compute the error probability of the repetition code in binary symmetric channel.
Solution
Recall that P_{err}(c)=\sum_{y|D(y)\neq c}\operatorname{Pr}(y\text{ received}|c\text{ transmitted}).
Use a binomial random variable (the number of flips is Binomial(n,p)):
\begin{aligned}
P_{err}(000)&=\sum_{y|D(y)\neq 000}\operatorname{Pr}(y\text{ received}|000\text{ transmitted})\\
&=\operatorname{Pr}(2\text{ flips or more})\\
&=\binom{n}{2}p^2(1-p)+\binom{n}{3}p^3\\
&=3p^2(1-p)+p^3\\
\end{aligned}
The computation is identical for 111.
P_{err}=\max\{P_{err}(000),P_{err}(111)\}=P_{err}(000)=3p^2(1-p)+p^3.
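As a sanity check (not part of the lecture), here is a minimal Python sketch that enumerates all 2^3 possible received words and recovers the closed form above; the function name p_err_repetition is ours.

```python
from itertools import product

# Exact P_err(000) for the 3-bit repetition code over a BSC(p):
# sum Pr(y received | 000 transmitted) over all y that the majority
# decoder maps to 111, i.e., all y with two or more 1's.
def p_err_repetition(p):
    total = 0.0
    for y in product([0, 1], repeat=3):
        flips = sum(y)                               # bits flipped when 000 was sent
        if flips >= 2:                               # decoded as 1 -> decoding error
            total += p**flips * (1 - p)**(3 - flips)
    return total

p = 0.1
print(p_err_repetition(p))            # 0.028
print(3 * p**2 * (1 - p) + p**3)      # closed form 3p^2(1-p) + p^3 = 0.028
```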
Maximum likelihood principle
For p\leq 1/2, the decoder in the last example is the maximum likelihood decoder.
Notice that \operatorname{Pr}(c'=000|c=000)=(1-p)^3 and \operatorname{Pr}(c'=000|c=111)=p^3.
- If p\leq 1/2, then (1-p)^3\geq p^3, so c=000 is more likely to have been transmitted than c=111.
Similarly, \operatorname{Pr}(c'=001|c=000)=(1-p)^2p and \operatorname{Pr}(c'=001|c=111)=p^2(1-p).
- If p\leq 1/2, then (1-p)^2p\geq p^2(1-p), so c=000 is more likely to have been transmitted than c=111.
For p>1/2, the inequalities reverse and the decisions flip accordingly.
In general, Maximum likelihood decoder is D(c')=\arg\max_{c\in \mathcal{C}}\operatorname{Pr}(c'\text{ received}|c\text{ transmitted}).
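The following is a small Python sketch of this rule for a BSC, assuming the code is small enough to enumerate; the function name ml_decode_bsc and the string representation of words are our own choices. For p<1/2 the likelihood p^d(1-p)^{n-d} decreases with the Hamming distance d, so maximizing it amounts to picking the closest codeword.

```python
def ml_decode_bsc(received, code, p):
    """Brute-force ML decoding over a BSC(p): return the codeword c
    maximizing Pr(received | c) = p^d * (1-p)^(n-d), where d is the
    number of positions in which the received word differs from c."""
    n = len(received)
    def likelihood(c):
        d = sum(ri != ci for ri, ci in zip(received, c))
        return p**d * (1 - p)**(n - d)
    return max(code, key=likelihood)

code = ["000", "111"]                      # the repetition code from above
print(ml_decode_bsc("001", code, p=0.1))   # '000': for p < 1/2, the nearest codeword wins
```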
Defining a "good" code
Two metrics:
- How many redundant bits are needed?
  - e.g. the repetition code with k=1, n=3 sends 2 redundant bits.
- What is the resulting error probability?
  - Depends on the decoding function.
  - Normally, maximum likelihood decoding is assumed.
  - Should go to zero as n grows.
Definition: the rate of the code is r=\frac{k}{n}.
More generally, r=\frac{\log_{|F|}|\mathcal{C}|}{n}.
Definition of information entropy
Let X be a random variable over a discrete set \mathcal{X}.
- That is, every x\in \mathcal{X} has a probability \operatorname{Pr}(X=x).
The entropy H(X) of a discrete random variable X is defined as:
H(X)=\mathbb{E}_{x\sim X}\left[\log \frac{1}{\operatorname{Pr}(x)}\right]=-\sum_{x\in \mathcal{X}}\operatorname{Pr}(x)\log \operatorname{Pr}(x)
When X\sim\text{Bernoulli}(p), we denote H(X)=H(p)=-p\log p-(1-p)\log (1-p).
A deeper explanation will be given later in the course.
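To make the definition concrete, here is a small Python sketch (ours, not from the lecture) computing H(X) from a probability mass function given as a dictionary.

```python
import math

def entropy(pmf):
    """Entropy in bits of a discrete random variable with the given
    probability mass function {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

print(entropy({0: 0.75, 1: 0.25}))          # Bernoulli(0.25): H(0.25) ≈ 0.8113 bits
print(entropy({x: 1/8 for x in range(8)}))  # uniform over 8 outcomes: exactly 3 bits
```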
Which rates are possible?
Claude Shannon '48: Coding theorem for the BSC (binary symmetric channel)
Recall r=\frac{k}{n}.
Let H(\cdot) be the entropy function.
For every 0\leq r<1-H(p),
- there exists a sequence of codes \mathcal{C}_1, \mathcal{C}_2,\ldots of rates r_1,r_2,\ldots and lengths n_1,n_2,\ldots with r_i\geq r,
- that with maximum likelihood decoding satisfies P_{err}\to 0 as i\to \infty.
For any R\geq 1-H(p),
- for any sequence of codes \mathcal{C}_1, \mathcal{C}_2,\ldots of rates r_1,r_2,\ldots and lengths n_1,n_2,\ldots with r_i\geq R,
- and any decoding algorithm,
- P_{err}\to 1 as i\to \infty.
1-H(p) is the capacity of the BSC.
- Informally, the capacity is the best possible rate of a code (asymptotically).
- A special case of a broader theorem (Shannon's coding theorem).
- We will see later in this course.
Polar codes give an explicit construction of codes with rate arbitrarily close to capacity.
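For a feel of the numbers, here is a short Python sketch (ours) evaluating the capacity 1-H(p) for a few crossover probabilities, compared with the rate 1/3 of the repetition code above.

```python
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    return 1 - binary_entropy(p)

# The capacity drops to 0 as p approaches 1/2 (the output becomes pure noise).
for p in (0.01, 0.1, 0.25, 0.5):
    print(f"p = {p:<4}  capacity = {bsc_capacity(p):.4f}")

# At p = 0.1 the capacity is about 0.531, well above the rate 1/3 of the
# repetition code, so Shannon's theorem promises that much better codes exist.
```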
BSC capacity - Intuition
The capacity of the binary symmetric channel with crossover probability p is 1-H(p).
A correct decoder c'\to c essentially identifies two objects:
- the codeword c,
- the error word e=c'-c (subtraction \bmod 2).
c and e are independent of each other.
A typical e has \approx np ones (by the law of large numbers), say n(p\pm \delta).
Exercise:
\operatorname{Pr}(e)=p^{n(p\pm \delta)}(1-p)^{n(1-p\mp \delta)}=2^{-n(H(p)\pm \epsilon)} for some \epsilon that goes to zero as \delta\to 0.
Intuition
There exist \approx 2^{nH(p)} typical error words.
To index those typical error words, we need \log_2 (2^{nH(p)})=nH(p)+O(1) bits to identify the error word e.
To encode the message, we need \log_2 |\mathcal{C}|=k bits (for a binary code with |\mathcal{C}|=2^k).
Since we send n bits in total, we need k+nH(p)+O(1)\leq n, so \frac{k}{n}\leq 1-H(p) asymptotically.
So the rate cannot exceed 1-H(p).
Formal proof
\begin{aligned}
\operatorname{Pr}(e)&=p^{n(p\pm \delta)}(1-p)^{n(1-p\mp \delta)}\\
&=p^{np}(1-p)^{n(1-p)}p^{\pm n\delta}(1-p)^{\mp n\delta}\\
\end{aligned}
And
\begin{aligned}
2^{-n(H(p)\pm \epsilon)}&=2^{-n(-p\log p-(1-p)\log (1-p)\pm \epsilon)}\\
&=2^{np\log p}2^{n(1-p)\log (1-p)}2^{\mp n\epsilon}\\
&=p^{np}(1-p)^{n(1-p)}2^{\mp n\epsilon}\\
\end{aligned}
So we need to find \epsilon>0 (going to zero with \delta) such that
p^{\pm n\delta}(1-p)^{\mp n\delta}=2^{\mp n\epsilon}
Solving for \epsilon (take the "+" branch and \log_2 of both sides):
\begin{aligned}
2^{-n\epsilon}&=p^{n\delta}(1-p)^{-n\delta}\\
-n\epsilon&=\delta n\log p-\delta n\log (1-p)\\
\epsilon&=\delta (\log (1-p)-\log p)\\
\end{aligned}
For p<1/2 this \epsilon is positive and vanishes as \delta\to 0, so \operatorname{Pr}(e)=2^{-n(H(p)\pm \epsilon)} as claimed.
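A quick numerical check of this computation (our sketch, working in \log_2 to avoid floating-point underflow): for an error word with n(p+\delta) ones, \log_2 \operatorname{Pr}(e) indeed equals -n(H(p)+\epsilon) with \epsilon=\delta(\log (1-p)-\log p).

```python
import math

n, p, delta = 1000, 0.1, 0.01
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # binary entropy H(p)
eps = delta * (math.log2(1 - p) - math.log2(p))      # epsilon from the derivation

# log2 of the probability of one error word with n(p + delta) ones
log2_pr_e = n * (p + delta) * math.log2(p) + n * (1 - p - delta) * math.log2(1 - p)

print(log2_pr_e)        # ≈ -500.7
print(-n * (H + eps))   # identical: Pr(e) = 2^{-n(H(p)+eps)}
```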
Hamming distance
How to quantify the noise in the channel?
- Number of flipped bits.
Definition of Hamming distance:
- Denote c=(c_1,c_2,\ldots,c_n) and c'=(c'_1,c'_2,\ldots,c'_n).
- d_H(c,c')=\sum_{i=1}^n [c_i\neq c'_i], i.e., the number of positions in which c and c' differ.
Minimum Hamming distance:
- Let \mathcal{C} be a code.
- d_H(\mathcal{C})=\min_{c_1,c_2\in \mathcal{C},c_1\neq c_2}d_H(c_1,c_2).
Hamming distance is a metric:
- d_H(x,y)\geq 0, with equality iff x=y.
- Symmetry: d_H(x,y)=d_H(y,x).
- Triangle inequality: d_H(x,y)\leq d_H(x,z)+d_H(z,y).
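The definitions of Hamming distance and minimum distance translate directly into code; below is a minimal Python sketch (function names ours).

```python
from itertools import combinations

def hamming_distance(x, y):
    """Number of positions in which x and y differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

def minimum_distance(code):
    """Minimum Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(c1, c2) for c1, c2 in combinations(code, 2))

print(hamming_distance("10110", "11010"))   # 2
print(minimum_distance(["000", "111"]))     # 3 (the repetition code)
```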
Level of error handling
- error detection
- erasure correction
- error correction
Erasure: replacement of an entry by *\not\in F.
Error: substitution of one entry by a different one.
In the following examples, suppose d_H(\mathcal{C})=d.
Error detection
Theorem: If d_H(\mathcal{C})=d, then there exists f:F^n\to \mathcal{C}\cup \{\text{"error detected"}\} that detects every pattern of \leq d-1 errors correctly.
- That is, as long as the channel introduced at most d-1 errors, we can tell whether any error occurred.
- No decoding is needed.
Idea:
Since d_H(\mathcal{C})=d, one needs \geq d errors to cause "confusion".
Proof
The function
f(y)=\begin{cases}
y & \text{if }y\in \mathcal{C}\\
\text{"error detected"} & \text{otherwise}
\end{cases}
will only fail if there are \geq d errors (i.e., if the errors turn one codeword into another codeword).
Erasure correction
Theorem: If d_H(\mathcal{C})=d, then there exists f:(F\cup \{*\})^n\to \mathcal{C}\cup \{\text{"failed"}\} that recovers every pattern of at most d-1 erasures.
Idea:
Suppose d=4.
If 4 erasures occurred, there might be two codewords c,c'\in \mathcal{C} that agree with the received word on the non-erased positions.
If \leq 3 erasures occurred, there is only one such codeword c\in \mathcal{C}.
Error correction
Define the Hamming ball of radius r centered at c as:
B_H(c,r)=\{y\in F^n:d_H(c,y)\leq r\}
Theorem: If d_H(\mathcal{C})\geq d, then there exists f:F^n\to \mathcal{C} that corrects every pattern of at most \lfloor \frac{d-1}{2}\rfloor errors.
Ideas:
The balls \{B_H(c,\lfloor \frac{d-1}{2}\rfloor)\mid c\in \mathcal{C}\} are disjoint.
Use nearest-neighbor decoding; by the triangle inequality, a received word with at most \lfloor \frac{d-1}{2}\rfloor errors is closer to the transmitted codeword than to any other codeword.
Intro to linear codes
Summary: a code of minimum Hamming distance d can
- detect \leq d-1 errors,
- correct \leq d-1 erasures,
- correct \leq \lfloor \frac{d-1}{2}\rfloor errors.
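As a toy demonstration of these three guarantees (our sketch, not from the lecture), take the 3-bit repetition code, which has minimum distance d=3: it detects up to 2 errors, corrects up to 2 erasures, and corrects 1 error.

```python
CODE = ["000", "111"]            # repetition code, minimum distance d = 3

def detect(y):
    """Detect up to d-1 = 2 errors: flag anything that is not a codeword."""
    return y if y in CODE else "error detected"

def correct_erasures(y):
    """Correct up to d-1 = 2 erasures ('*'): find the unique codeword that
    agrees with y on all non-erased positions."""
    matches = [c for c in CODE
               if all(yi == ci for yi, ci in zip(y, c) if yi != "*")]
    return matches[0] if len(matches) == 1 else "failed"

def correct_errors(y):
    """Correct up to floor((d-1)/2) = 1 error: nearest-neighbor decoding."""
    return min(CODE, key=lambda c: sum(yi != ci for yi, ci in zip(y, c)))

print(detect("110"))              # 'error detected' (2 errors are detected)
print(correct_erasures("*1*"))    # '111' (2 erasures are corrected)
print(correct_errors("010"))      # '000' (1 error is corrected)
```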
Problems:
- How to construct good codes, with both k/n and d large?
- How good can these codes possibly be?
- How to encode?
- How to decode over a noisy channel?
Tools
- Linear algebra over finite fields.
Linear codes
Consider F^n as a vector space, and let \mathcal{C}\subseteq F^n be a subspace.
F,\Phi are finite, so we use finite fields: algebraic objects that "imitate" \mathbb{R} and \mathbb{C}, so that F^n can "imitate" \mathbb{R}^n and \mathbb{C}^n.
Formally, they satisfy the field axioms.
Next Lectures:
- Field axioms
- Prime fields (\mathbb{F}_p)
- Field extensions (e.g. \mathbb{F}_{p^t})