diff --git a/content/CSE5313/CSE5313_L26.md b/content/CSE5313/CSE5313_L26.md new file mode 100644 index 0000000..fd7b7f9 --- /dev/null +++ b/content/CSE5313/CSE5313_L26.md @@ -0,0 +1,411 @@ +# CSE5313 Coding and information theory for data science (Lecture 26) + +## Sliced and Broken Information with applications in DNA storage and 3D printing + +### Basic info + +Deoxyribo-Nucleic Acid. + +A double-helix shaped molecule. + +Each helix is a string of + +- Cytosine, +- Guanine, +- Adenine, and +- Thymine. + +Contained inside every living cell. + +- Inside the nucleus. + +Used to encode proteins. + +mRNA carries info to Ribosome as codons of length 3 over GUCA. + +- Each codon produces an amino acid. +- $4^3> 20$, redundancy in nature! + +1st Chargaff rule: + +- The two strands are complements (A-T and G-C). +- $\#A = \#T$ and $\#G = \#C$ in both strands. + +2nd Chargaff rule: + +- $\#A \approx \#T$ and $\#G \approx \#C$ in each strand. +- Can be explained via tandem duplications. + - $GCAGCATT \implies GCAGCAGCATT$. + - Occur naturally during cell mitosis. + +### DNA storage + +DNA synthesis: + +- Artificial creation of DNA from G's, T's, A's, and C's. + +Can be used to store information! + +Advantages: + +- Density. + - 5.5 PB per mm$^3$. +- Stability. + - Half-life 521 years (compare to $\approx 20$ years on hard drives). +- Future proof. + - DNA reading and writing will remain relevant "forever." + +#### DNA storage prototypes + +Some recent attempts: + +- 2011, 659kb. + - Church, Gao, Kosuri, "Next-generation digital information storage in DNA", Science. +- 2018, 200MB. + - Organick et al., "Random access in large-scale DNA data storage," Nature biotechnology. +- CatalogDNA (startup): + - 2019, 16GB. + - 2021, 18 Mbps. + +Companies: + +- Microsoft, Illumina, Western Digital, many startups. + +Challenges: + +- Expensive, Slow. +- Traditional storage media still sufficient and affordable. + +#### DNA Storage models + +In vivo: + +- Implant the synthetic DNA inside a living organism. 
+- Need evolution-correcting codes! +- E.g., coding against tandem-duplications. + +In vitro: + +- Place the synthetic DNA in test tubes. +- Challenge: Can only synthesize short sequences ($\approx 1000$ bp). +- 1 test tube contains millions to billions of short sequences. + +How to encode information? + +How to achieve noise robustness? + +### DNA coding in vitro environment + +Traditional data communication: + +$$ +m\in\{0,1\}^k\mapsto c\in\{0,1\}^n +$$ + +DNA storage: + +$$ +m\in\{0,1\}^k\mapsto c\in \binom{\{0,1\}^L}{M} +$$ + +where $\binom{\{0,1\}^L}{M}$ is the collection of all $M$-subsets of $\{0,1\}^L$. ($0\leq M\leq 2^L$) + +A codeword is a set of $M$ binary strings, each of length $L$. + +"Sliced channel": + +- The message $m$ is encoded to $c\in \{0,1\}^{ML}$, and then sliced to $M$ equal parts. +- Parts may be noisy (substitutions, deletions, etc.). +- Also useful in network packet transmission ($M$ packets of length $L$). + +#### Sliced channel: Figures of merit + +How to quantify the **merit** of a given code $\mathcal{C}$? + +- Want resilience to any $K$ substitutions in all $M$ parts. + +Redundancy: + +- Recall in linear codes, + - $redundancy=length-dimension=\log (size\ of\ space)-\log (size\ of\ code)$. +- In sliced channel: + - $redundancy=\log (size\ of\ space)-\log (size\ of\ code)=\log \binom{2^L}{M}-\log |\mathcal{C}|$. + +Research questions: + +- Bounds on redundancy? +- Code construction? +- Is more redundancy needed in sliced channel? + +#### Sliced channel: Lower bound + +Idea: **Sphere packing** + +Given $c\in \binom{\{0,1\}^L}{M}$ and $K=1$, how many codewords must we exclude? + +
+Example + +- $L=3,M=2$, and let $c=\{001,011\}$. Ball of radius 1 is sized 5. + +All sets within distance 1 of $c$ are: + +Distance 0: + +$$ +\{001,011\} +$$ + +Distance 1: +$$ +\{101,011\}\quad \{011\}\quad \{000,011\}\\ +\{001,111\}\quad \{001\}\quad \{001,010\} +$$ + +7 options, 2 of which ($\{011\}$ and $\{001\}$) are not codewords because they are not sets of size $M=2$. + +So the effective size of the ball is 5. + +
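The count in the example above can be checked by brute force. The following is a minimal Python sketch; the helper name `one_ball` and the representation of a codeword as a `frozenset` of bit strings are assumptions made for illustration, not part of the lecture.

```python
def one_ball(codeword, L):
    """Return all valid M-subsets reachable from `codeword` by at most one bit flip."""
    M = len(codeword)
    ball = {frozenset(codeword)}  # distance 0: the codeword itself
    for s in codeword:
        for i in range(L):
            # Flip bit i of string s.
            flipped = s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]
            candidate = frozenset((codeword - {s}) | {flipped})
            # If the flipped string collides with another string in the set,
            # the result has fewer than M elements and is not a codeword.
            if len(candidate) == M:
                ball.add(candidate)
    return ball


c = frozenset({"001", "011"})
print(len(one_ball(c, L=3)))  # 5
```

The two discarded outcomes are exactly the flips along the hypercube edge between 001 and 011, which is internal to $c$.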
+ +Tool: (Fourier analysis of) Boolean functions + +Introducing hypercube graph: + +- $V=\{0,1\}^L$. +- $\{x,y\}\in E$ if and only if $d_H(x,y)=1$. +- What is the size of $E$? + +Consider $c\in \binom{\{0,1\}^L}{M}$ as a characteristic function: $f_c:\{0,1\}^L\to \{0,1\}$. + +Let $\partial f_c$ be its boundary. + +- All hypercube edges $\{x,y\}$ such that $f_c(x)\neq f_c(y)$. + +#### Lemma of boundary + +Size of 1-ball $\geq |\partial f_c|+1$. + +
+Proof + +Every edge on the boundary represents a unique way of flipping one bit in one string in $c$. + +
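As a sanity check of the lemma on a small instance, the sketch below counts the boundary edges of $f_c$ for $c=\{001,011\}$ with $L=3$. The helper name `boundary_size` is hypothetical and the brute-force enumeration is only meant for tiny $L$.

```python
from itertools import product


def boundary_size(codeword, L):
    """Count hypercube edges {x, y} with exactly one endpoint in `codeword`."""
    count = 0
    for bits in product("01", repeat=L):
        x = "".join(bits)
        if x not in codeword:
            continue
        for i in range(L):
            y = x[:i] + ("1" if x[i] == "0" else "0") + x[i + 1:]
            # Each boundary edge is counted once, from its endpoint inside the set.
            if y not in codeword:
                count += 1
    return count


c = {"001", "011"}
print(boundary_size(c, L=3))  # 4
```

Here $|\partial f_c|+1=5$, which matches the ball size found in the example with $c=\{001,011\}$.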
+ +Need to bound $|\partial f_c|$ from below. + +Tool: Total influence. + +#### Definition of total influence + +The **total influence** $I(f)$ of $f:\{0,1\}^L\to \{0,1\}$ is defined as: + +$$ +\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i})) +$$ + +where $x^{\oplus i}$ equals $x$ with its $i$th bit flipped. + +#### Theorem (Edge-isoperimetric inequality, no proof) + +$I(f)\geq 2\alpha\log\frac{1}{\alpha}$, + +where $\alpha=\min\{\text{fraction of 1's},\text{fraction of 0's}\}$. + +Notice: Let $\partial_i f$ be the $i$-dimensional edges in $\partial f$. + +$$ +\begin{aligned} +I(f)&=\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i}))\\ +&=\sum_{i=1}^L\frac{|\partial_i f|}{2^{L-1}}\\ +&=\frac{|\partial f|}{2^{L-1}}\\ +\end{aligned} +$$ + +Corollary: Let $\epsilon>0$, $L\geq \frac{1}{\epsilon}$ and $M\leq 2^{(1-\epsilon)L}$, and let $c\in \binom{\{0,1\}^L}{M}$. Then, + +$$ +|\partial f_c|\geq 2\times 2^{L-1}\frac{M}{2^L}\log \frac{2^L}{M}\geq M\log \frac{2^L}{M}\geq ML\epsilon +$$ + +Size of the $1$-ball in the sliced channel $\geq ML\epsilon$. + +This implies that $|\mathcal{C}|\leq \frac{\binom{2^L}{M}}{\epsilon ML}$. + +Corollary: + +- Redundancy in the sliced channel with $K=1$ and the above parameters is at least $\log ML-O(1)$. +- A simple generalization (not shown) gives $O(K\log ML)$. + +### Robust indexing + +Idea: Start with $L$ bit string with $\log M$ bits for indexing. + +Problem 1: Indices subject to noise. + +Problem 2: Indexing bits do not carry information $\implies$ higher redundancy. + +[link to paper](https://ieeexplore.ieee.org/document/9174447) + +Idea: Robust indexing + +Instead of using $1,\ldots, M$ for indexing, use $x_1,\ldots, x_M$ such that + +- $\{x_1,\ldots,x_M\}$ are of minimum distance $2K+1$ (solves problem 1), with $|x_i|=O(\log M)$. +- $\{x_1,\ldots,x_M\}$ contain information (solves problem 2). + +$\{x_1,\ldots,x_M\}$ depend on the message. + +- Consider the message $m=(m_1,m_2)$. 
+- Find an encoding function $m_1\mapsto\{\{x_i\}_{i=1}^M|d_H(x_i,x_j)\geq 2K+1\}$ (coding over codes). +- Assume $e$ is such a function (not shown). + +... + +Additional reading: + +Jin Sima, Netanel Raviv, Moshe Schwartz, and Jehoshua Bruck. "Error Correction for DNA Storage." arXiv:2310.01729 (2023). + +- Magazine article. +- Introductory. +- Broad perspective. + +### Information Embedding in 3D printing + +Motivations: + +Threats to public safety. + +- Ghost guns, forging fingerprints, forging keys, fooling facial recognition. + +Solution: Information Embedding. + +- Printer ID, user ID, time/location stamp. + +#### Existing Information Embedding Techniques + +Many techniques exist. + +Information embedding using variations in layers. + +- Width, rotation, etc. +- Magnetic properties. +- Radiative materials. + +Combating Adversarial Noise + +Most techniques are rather accurate. + +- I.e., low bit error rate. + +Challenge: Adversarial damage after use. + +- Scraping. +- Deformation. +- **Breaking**. + +#### A t-break code + +Let $m\in \{0,1\}^k\mapsto c\in \{0,1\}^n$. + +Adversary breaks $c$ at most $t$ times (security parameter). + +Decoder receives a **multiset** of at most $t+1$ fragments. Assume the following: + +- Oriented +- Unordered +- Any length + +#### Lower bound for t-break code + +Claim: A $t$-break code must have $\Omega(t\log (n/t))$ redundancy. + +Lemma: Let $\mathcal{C}$ be a $t$-break code of length $n$, and for $i\in \{0,1,\ldots,n\}$ let $\mathcal{C}_i\subseteq\mathcal{C}$ be the subset of $\mathcal{C}$ containing all codewords of Hamming weight $i$. Then $d_H(\mathcal{C}_i)\geq \lceil \frac{t+1}{2}\rceil$. + +
+Proof of Lemma + +Let $x,y\in \mathcal{C}_i$ for some $i$, and let $\ell=d_H(x,y)$. + +Write ($\circ$ denotes the concatenation operation): + +$x=c_1\circ x_{i_1}\circ c_2\circ x_{i_2}\circ\ldots\circ c_{\ell}\circ x_{i_{\ell}}\circ c_{\ell+1}$ + +$y=c_1\circ y_{i_1}\circ c_2\circ y_{i_2}\circ\ldots\circ c_{\ell}\circ y_{i_{\ell}}\circ c_{\ell+1}$ + +where the $c_j$ are the (possibly empty) common parts, and $x_{i_j}\neq y_{i_j}$ are single bits for all $j\in [\ell]$. + +Break $x$ and $y$ at most $2\ell$ times to produce the multisets (double braces denote a multiset): + +$$ +\mathcal{X}=\{\{c_1,c_2,\ldots,c_{\ell+1},x_{i_1},x_{i_2},\ldots,x_{i_{\ell}}\}\} +$$ + +$$ +\mathcal{Y}=\{\{c_1,c_2,\ldots,c_{\ell+1},y_{i_1},y_{i_2},\ldots,y_{i_{\ell}}\}\} +$$ + +Since $w_H(x)=w_H(y)=i$, the differing bits contain the same number of ones in $x$ and in $y$, and therefore $\mathcal{X}=\mathcal{Y}$. If $2\ell\leq t$, the decoder could not distinguish $x$ from $y$, so $2\ell\geq t+1$, i.e., $\ell\geq \lceil \frac{t+1}{2}\rceil$. + +
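The indistinguishability argument in the proof can be illustrated on a concrete pair of words. The sketch below is an assumption-laden illustration: the words `x` and `y`, and the helpers `isolating_cuts` and `break_at`, are hypothetical and chosen only so that both words have the same Hamming weight and differ in $\ell=2$ positions.

```python
from collections import Counter


def isolating_cuts(x, y):
    """Cut positions that isolate every coordinate where x and y differ."""
    cuts = set()
    for i, (a, b) in enumerate(zip(x, y)):
        if a != b:
            cuts.update({i, i + 1})
    # Cuts at the very start or end of the word do not break anything.
    return sorted(cuts - {0, len(x)})


def break_at(word, cuts):
    """Split `word` at the given positions and return the multiset of fragments."""
    bounds = [0] + sorted(cuts) + [len(word)]
    return Counter(word[a:b] for a, b in zip(bounds, bounds[1:]))


x = "11000011"  # Hamming weight 4
y = "11100001"  # Hamming weight 4, differs from x in positions 2 and 6
cuts = isolating_cuts(x, y)
print(len(cuts))                               # 4
print(break_at(x, cuts) == break_at(y, cuts))  # True
```

With $2\ell=4$ breaks the two fragment multisets coincide, so any code containing both words cannot be a $t$-break code for $t\geq 4$.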
+ +
+Proof of Claim + +Let $j\in \{0,1,\ldots,n\}$ be such that $\mathcal{C}_j$ is the largest among $\mathcal{C}_0,\mathcal{C}_1,\ldots,\mathcal{C}_{n}$. + +$\log |\mathcal{C}|=\log(\sum_{i=0}^n|\mathcal{C}_i|)\leq \log \left((n+1)|\mathcal{C}_j|\right)\leq \log (n+1)+\log |\mathcal{C}_j|$ + +By the Lemma and the ordinary sphere packing bound, for $t'=\lfloor\frac{\lceil \frac{t+1}{2}\rceil-1}{2}\rfloor\approx \frac{t}{4}$, + +$$ +|\mathcal{C}_j|\leq \frac{2^n}{\sum_{i=0}^{t'}\binom{n}{i}} +$$ + +This implies that $n-\log |\mathcal{C}|\geq n-\log(n+1)-\log|\mathcal{C}_j|\geq \dots \geq \Omega(t\log (n/t))$ + +
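To get a feel for the numbers, the chain of inequalities in the proof gives $n-\log|\mathcal{C}|\geq \log\sum_{i=0}^{t'}\binom{n}{i}-\log(n+1)$, which can be evaluated directly. The sketch below is only a numeric illustration; the function name `redundancy_lower_bound` and the sample parameters are assumptions, and logarithms are taken base 2.

```python
from math import comb, log2


def redundancy_lower_bound(n, t):
    """Evaluate log2(sum_{i<=t'} C(n, i)) - log2(n + 1) from the proof of the claim."""
    d = (t + 2) // 2        # ceil((t + 1) / 2), the minimum distance from the lemma
    t_prime = (d - 1) // 2  # packing radius used in the sphere packing bound
    ball = sum(comb(n, i) for i in range(t_prime + 1))
    return log2(ball) - log2(n + 1)


# Hypothetical parameters, for illustration only.
print(f"{redundancy_lower_bound(1000, 20):.2f} bits")
```

For very small $t$ the packing radius $t'$ is 0 and the expression is negative, so the bound is only informative once $t$ is large enough.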
+ +Corollary: In the relevant regime $t=O(n^{1-\epsilon})$, we have $\Omega(t\log n)$ redundancy. + +TRACK LOST HERE + +$t$-break codes: Main ideas. + +- Encoding: + - Need multiple markers across the codeword. + - Construct an adjacency matrix $A$ of markers to record their order. + - Append $RS_{2t}(A)$ to the codeword (as in the sliced channel). +- Decoding (from $t+1$ fragments): + - Locate all surviving markers, and locate $RS_{2t}(A)'$. + - Build an approximate adjacency matrix $A'$ from surviving markers ($d_H(A,A')\leq 2t$). + - Correct $(A',RS_{2t}(A)')\mapsto (A,RS_{2t}(A))$. + - Order the fragments correctly using $A$. +- Tools: + - Random encoding (to have many markers). + - Mutually uncorrelated codes (so that markers will not overlap). + +Tool: Mutually uncorrelated codes. + +- Want: Markers not to overlap. +- Solution: Take markers from a mutually uncorrelated code (existing notion). + - A code $\mathcal{M}$ is called mutually uncorrelated if no suffix of any $m_i\in\mathcal{M}$ is a prefix of another $m_j\in\mathcal{M}$ (including $i=j$). + - Many constructions exist. +- Theorem: For any integer $\ell$ there exists a mutually uncorrelated code $C_{MU}$ of length $\ell$ and size $|C_{MU}|\geq \frac{2^\ell}{32\ell}$. + +Tool: Random encoding. + +- Want: Codewords with many markers from $C_{MU}$, that are not too far apart. +- Problem: Hard to achieve explicitly. +- Workaround: Show that a uniformly random string has this property. +- Random encoding: + - Choose the message at random. + - Suitable for embedding, say, printer ID. + - Not suitable for dynamic information. 
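The mutually uncorrelated property described above is easy to test by brute force for a small marker set. The sketch below assumes the usual reading of the definition (only proper, non-empty suffixes and prefixes are compared); the function name `is_mutually_uncorrelated` and the sample markers are hypothetical.

```python
def is_mutually_uncorrelated(code):
    """Check that no proper non-empty suffix of a word is a prefix of any word (itself included)."""
    for a in code:
        for b in code:
            for k in range(1, min(len(a), len(b))):
                if a[-k:] == b[:k]:
                    return False
    return True


markers = ["00101", "00111"]  # start with "00", end with "1", no other "00" run
print(is_mutually_uncorrelated(markers))   # True
print(is_mutually_uncorrelated(["1010"]))  # False: suffix "10" is also a prefix
```

Markers with this property cannot overlap inside a codeword, which is what lets the decoder locate them unambiguously in each fragment.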
\ No newline at end of file diff --git a/content/CSE5313/CSE5313_L27.md b/content/CSE5313/CSE5313_L27.md new file mode 100644 index 0000000..4db8b3b --- /dev/null +++ b/content/CSE5313/CSE5313_L27.md @@ -0,0 +1 @@ +# CSE5313 Coding and information theory for data science (Lecture 27) \ No newline at end of file diff --git a/content/CSE5313/_meta.js b/content/CSE5313/_meta.js index 909a951..f996b9a 100644 --- a/content/CSE5313/_meta.js +++ b/content/CSE5313/_meta.js @@ -29,4 +29,6 @@ export default { CSE5313_L23: "CSE5313 Coding and information theory for data science (Lecture 23)", CSE5313_L24: "CSE5313 Coding and information theory for data science (Lecture 24)", CSE5313_L25: "CSE5313 Coding and information theory for data science (Lecture 25)", + CSE5313_L26: "CSE5313 Coding and information theory for data science (Lecture 26)", + CSE5313_L27: "CSE5313 Coding and information theory for data science (Lecture 27)", } \ No newline at end of file