# CSE5313 Coding and Information Theory for Data Science (Lecture 26)

## Sliced and Broken Information, with Applications in DNA Storage and 3D Printing

### Basic info

Deoxyribonucleic acid (DNA): a double-helix-shaped molecule. Each helix is a string of
- Cytosine,
- Guanine,
- Adenine, and
- Thymine.

Contained inside every living cell.
- Inside the nucleus.

Used to encode proteins: mRNA carries information to the ribosome as codons of length 3 over {G, U, C, A}.
- Each codon produces an amino acid.
- $4^3 > 20$: redundancy in nature!

1st Chargaff rule:
- The two strands are complements (A-T and G-C).
- $\#A = \#T$ and $\#G = \#C$ across both strands.

2nd Chargaff rule:
- $\#A \approx \#T$ and $\#G \approx \#C$ in each strand.
- Can be explained via tandem duplications.
  - $GCAGCATT \implies GCAGCAGCATT$.
  - Occur naturally during cell mitosis.

### DNA storage

DNA synthesis:
- Artificial creation of DNA from G's, T's, A's, and C's.
- Can be used to store information!

Advantages:
- Density.
  - 5.5 PB per mm³.
- Stability.
  - Half-life of 521 years (compare to $\approx 20$ years on hard drives).
- Future proof.
  - DNA reading and writing will remain relevant "forever."

#### DNA storage prototypes

Some recent attempts:
- 2011, 659 kb.
  - Church, Gao, Kosuri, "Next-generation digital information storage in DNA," Science.
- 2018, 200 MB.
  - Organick et al., "Random access in large-scale DNA data storage," Nature Biotechnology.
- CatalogDNA (startup):
  - 2019, 16 GB.
  - 2021, 18 Mbps.

Companies:
- Microsoft, Illumina, Western Digital, many startups.

Challenges:
- Expensive, slow.
- Traditional storage media are still sufficient and affordable.

#### DNA storage models

In vivo:
- Implant the synthetic DNA inside a living organism.
- Need evolution-correcting codes!
  - E.g., coding against tandem duplications.

In vitro:
- Place the synthetic DNA in test tubes.
- Challenge: can only synthesize short sequences ($\approx 1000$ bp).
- One test tube contains millions to billions of short sequences.

How to encode information?
How to achieve noise robustness?

### DNA coding in the in-vitro environment

Traditional data communication:
$$
m\in\{0,1\}^k\mapsto c\in\{0,1\}^n
$$

DNA storage:
$$
m\in\{0,1\}^k\mapsto c\in \binom{\{0,1\}^L}{M}
$$
where $\binom{\{0,1\}^L}{M}$ is the collection of all $M$-subsets of $\{0,1\}^L$ ($0\leq M\leq 2^L$).

A codeword is a set of $M$ binary strings, each of length $L$.

"Sliced channel":
- The message $m$ is encoded to $c\in \{0,1\}^{ML}$, and then sliced into $M$ equal parts.
- Parts may be noisy (substitutions, deletions, etc.).
- Also useful in network packet transmission ($M$ packets of length $L$).

#### Sliced channel: Figures of merit

How to quantify the **merit** of a given code $\mathcal{C}$?
- Want resilience to any $K$ substitutions across all $M$ parts.

Redundancy:
- Recall that for linear codes,
  - $redundancy=length-dimension=\log (\text{size of space})-\log (\text{size of code})$.
- In the sliced channel:
  - $redundancy=\log (\text{size of space})-\log (\text{size of code})=\log \binom{2^L}{M}-\log |\mathcal{C}|$.

Research questions:
- Bounds on redundancy?
- Code constructions?
- Is more redundancy needed in the sliced channel?

#### Sliced channel: Lower bound

Idea: **sphere packing**.

Given $c\in \binom{\{0,1\}^L}{M}$ and $K=1$, how many codewords must we exclude?
Example: $L=3$, $M=2$, and let $c=\{001,011\}$. All the words at distance at most 1 from $c$ are:

Distance 0:
$$
\{001,011\}
$$

Distance 1 (flip one bit in one string):
$$
\{101,011\}\quad \{011\}\quad \{000,011\}\\
\{001,111\}\quad \{001\}\quad \{001,010\}
$$

7 options in total, but 2 of them ($\{011\}$ and $\{001\}$, where the flip merged the two strings) are not $M$-subsets, hence not codewords. So the effective size of the radius-1 ball is 5.
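The enumeration above is easy to verify by brute force. The following minimal sketch builds the radius-1 ball of $c=\{001,011\}$ by flipping each bit of each string and discarding results that are not $M$-subsets:

```python
# Brute-force check of the L=3, M=2 ball-size example:
# count distinct M-subsets reachable from c = {001, 011}
# by flipping a single bit in a single string.
L, M = 3, 2
c = frozenset(["001", "011"])

def flip(s, i):
    """Return string s with bit i flipped."""
    return s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]

ball = {c}  # distance 0
for s in c:
    for i in range(L):
        candidate = (c - {s}) | {flip(s, i)}
        if len(candidate) == M:  # sets of size < M are not codewords
            ball.add(frozenset(candidate))

print(len(ball))  # -> 5
```

The two merged sets $\{011\}$ and $\{001\}$ are filtered out by the size check, matching the count of 5 derived above.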
#### Tool: (Fourier analysis of) Boolean functions

The hypercube graph:
- $V=\{0,1\}^L$.
- $\{x,y\}\in E$ if and only if $d_H(x,y)=1$.
- What is the size of $E$? (Each of the $2^L$ vertices has degree $L$, so $|E|=L\cdot 2^{L-1}$.)

Consider $c\in \binom{\{0,1\}^L}{M}$ as a characteristic function $f_c:\{0,1\}^L\to \{0,1\}$.

Let $\partial f_c$ be its boundary:
- All hypercube edges $\{x,y\}$ such that $f_c(x)\neq f_c(y)$.

#### Boundary lemma

Size of the 1-ball $\geq |\partial f_c|+1$.
Proof: every edge on the boundary represents a unique way of flipping one bit in one string of $c$, yielding a distinct word at distance 1; together with $c$ itself, this gives at least $|\partial f_c|+1$ words in the ball.
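A quick numeric sanity check of the boundary lemma on the running example $c=\{001,011\}$ ($L=3$): the sketch below counts the boundary edges of $f_c$ by scanning every hypercube edge once.

```python
# Count boundary edges of f_c for c = {001, 011} (L = 3) and
# compare with the radius-1 ball size (5, per the earlier example).
from itertools import product

L = 3
c = {"001", "011"}
f = lambda x: 1 if x in c else 0

def flip(s, i):
    return s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]

# A hypercube edge {x, y} is a boundary edge iff f(x) != f(y).
boundary = 0
for bits in product("01", repeat=L):
    x = "".join(bits)
    for i in range(L):
        y = flip(x, i)
        if x < y and f(x) != f(y):  # count each edge once
            boundary += 1

print(boundary)  # -> 4
```

Here $|\partial f_c| = 4$, so the lemma gives ball size $\geq 5$, which is tight for this example.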
Need to bound $|\partial f_c|$ from below. Tool: total influence.

#### Definition: Total influence

The **total influence** $I(f)$ of $f:\{0,1\}^L\to \{0,1\}$ is defined as:
$$
I(f)=\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i}))
$$
where $x^{\oplus i}$ equals $x$ with its $i$th bit flipped.

#### Theorem (edge-isoperimetric inequality, no proof)

$I(f)\geq 2\alpha\log\frac{1}{\alpha}$, where $\alpha=\min\{\text{fraction of 1's},\text{fraction of 0's}\}$.

Notice: let $\partial_i f$ be the set of $i$-dimensional edges in $\partial f$. Then
$$
\begin{aligned}
I(f)&=\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i}))\\
&=\sum_{i=1}^L\frac{|\partial_i f|}{2^{L-1}}\\
&=\frac{|\partial f|}{2^{L-1}}
\end{aligned}
$$

Corollary: let $\epsilon>0$, $L\geq \frac{1}{\epsilon}$ and $M\leq 2^{(1-\epsilon)L}$, and let $c\in \binom{\{0,1\}^L}{M}$. Then, taking $\alpha=\frac{M}{2^L}$,
$$
|\partial f_c|\geq 2\cdot 2^{L-1}\cdot\frac{M}{2^L}\log \frac{2^L}{M}= M\log \frac{2^L}{M}\geq ML\epsilon,
$$
where the last step uses $M\leq 2^{(1-\epsilon)L}$. So the size of the 1-ball in the sliced channel is $\geq ML\epsilon$, which implies that $|\mathcal{C}|\leq \frac{\binom{2^L}{M}}{\epsilon ML}$.

Corollary:
- The redundancy in the sliced channel with $K=1$ and the above parameters is at least $\log (ML)-O(1)$.
- A simple generalization (not shown) gives $O(K\log (ML))$.

### Robust indexing

Idea: start each $L$-bit string with $\log M$ bits for indexing.

Problem 1: indices are subject to noise.
Problem 2: indexing bits do not carry information $\implies$ higher redundancy.

[link to paper](https://ieeexplore.ieee.org/document/9174447)

Idea: robust indexing. Instead of using $1,\ldots, M$ for indexing, use $x_1,\ldots, x_M$ such that
- $\{x_1,\ldots,x_M\}$ has minimum distance $2K+1$ (solves Problem 1), with $|x_i|=O(\log M)$.
- $\{x_1,\ldots,x_M\}$ carries information (solves Problem 2): $\{x_1,\ldots,x_M\}$ depends on the message.
- Consider the message $m=(m_1,m_2)$.
- Find an encoding function $m_1\mapsto\{\{x_i\}_{i=1}^M \mid d_H(x_i,x_j)\geq 2K+1\}$ (coding over codes).
- Assume $e$ is such a function (construction not shown).

...

Additional reading: Jin Sima, Netanel Raviv, Moshe Schwartz, and Jehoshua Bruck, "Error Correction for DNA Storage," arXiv:2310.01729 (2023).
- Magazine article.
- Introductory.
- Broad perspective.

### Information embedding in 3D printing

Motivation: threats to public safety.
- Ghost guns, forging fingerprints, forging keys, fooling facial recognition.

Solution: information embedding.
- Printer ID, user ID, time/location stamp.

#### Existing information-embedding techniques

Many techniques exist. Information embedding using variations in layers:
- Width, rotation, etc.
- Magnetic properties.
- Radiative materials.

Combating adversarial noise: most techniques are rather accurate.
- I.e., low bit error rate.

Challenge: adversarial damage after use.
- Scraping.
- Deformation.
- **Breaking**.

#### A $t$-break code

Let $m\in \{0,1\}^k\mapsto c\in \{0,1\}^n$. The adversary breaks $c$ at most $t$ times ($t$ is a security parameter). The decoder receives a **multi-set** of at most $t+1$ fragments.

Assume the fragments are:
- oriented,
- unordered, and
- of arbitrary lengths.

#### Lower bound for $t$-break codes

Claim: a $t$-break code must have $\Omega(t\log (n/t))$ redundancy.

Lemma: let $\mathcal{C}$ be a $t$-break code of length $n$, and for $i\in \{0,1,\ldots,n\}$ let $\mathcal{C}_i\subseteq\mathcal{C}$ be the subset of $\mathcal{C}$ containing all codewords of Hamming weight $i$. Then $d_H(\mathcal{C}_i)\geq \lceil \frac{t+1}{2}\rceil$.
Proof of Lemma: assume toward contradiction that there exist $x,y\in \mathcal{C}_i$ for some $i$ with $\ell=d_H(x,y)\leq \lceil \frac{t+1}{2}\rceil-1$, i.e., $2\ell\leq t$. Write ($\circ$ denotes the concatenation operation):

$x=c_1\circ x_{i_1}\circ c_2\circ x_{i_2}\circ\ldots\circ c_{\ell}\circ x_{i_\ell}\circ c_{\ell+1}$
$y=c_1\circ y_{i_1}\circ c_2\circ y_{i_2}\circ\ldots\circ c_{\ell}\circ y_{i_\ell}\circ c_{\ell+1}$

where the $c_j$'s are the (possibly empty) common segments and $x_{i_j}\neq y_{i_j}$ are the single differing bits, for $j\in [\ell]$. Break $x$ and $y$ at most $2\ell\leq t$ times each, isolating the differing bits, to produce the multi-sets:
$$
\mathcal{X}=\{c_1,c_2,\ldots,c_{\ell+1},x_{i_1},x_{i_2},\ldots,x_{i_{\ell}}\}
$$
$$
\mathcal{Y}=\{c_1,c_2,\ldots,c_{\ell+1},y_{i_1},y_{i_2},\ldots,y_{i_{\ell}}\}
$$
Since $w_H(x)=w_H(y)=i$, the single bits $x_{i_1},\ldots,x_{i_\ell}$ contain as many 1's as $y_{i_1},\ldots,y_{i_\ell}$, and therefore $\mathcal{X}=\mathcal{Y}$ as multi-sets, so the decoder cannot distinguish $x$ from $y$, a contradiction. Hence $\ell\geq \lceil \frac{t+1}{2}\rceil$.
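The argument can be seen concretely on a toy pair of strings (hypothetical example values, not codewords of any particular code): $x=0110$ and $y=1010$ have equal weight and Hamming distance 2, and two breaks suffice to make their fragment multi-sets identical.

```python
# Sketch of the Lemma's argument: two equal-weight strings at small
# Hamming distance can be broken into identical fragment multi-sets,
# so no t-break decoder can tell them apart.
from collections import Counter

def break_at(s, cuts):
    """Break string s at the given cut positions; return the fragment multi-set."""
    pieces = []
    prev = 0
    for c in sorted(set(cuts)):
        pieces.append(s[prev:c])
        prev = c
    pieces.append(s[prev:])
    return Counter(pieces)

# x and y differ exactly in positions 0 and 1.
x, y = "0110", "1010"
# Isolate the differing bits with cuts at positions 1 and 2 (2 <= 2*ell breaks).
X = break_at(x, [1, 2])  # fragments: "0", "1", "10"
Y = break_at(y, [1, 2])  # fragments: "1", "0", "10"
print(X == Y)  # -> True: the fragment multi-sets coincide
```

Because the isolated bits carry the same number of 1's on both sides (equal weight), the multi-sets coincide exactly as in the proof.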
Proof of Claim: let $j\in \{0,1,\ldots,n\}$ be such that $\mathcal{C}_j$ is the largest among $\mathcal{C}_0,\mathcal{C}_1,\ldots,\mathcal{C}_{n}$. Then

$\log |\mathcal{C}|=\log\left(\sum_{i=0}^n|\mathcal{C}_i|\right)\leq \log \left((n+1)|\mathcal{C}_j|\right)= \log (n+1)+\log |\mathcal{C}_j|$

By the Lemma and the ordinary sphere-packing bound, for $t'=\lfloor\frac{\lceil \frac{t+1}{2}\rceil-1}{2}\rfloor\approx \frac{t}{4}$,
$$
|\mathcal{C}_j|\leq \frac{2^n}{\sum_{i=0}^{t'}\binom{n}{i}}
$$
This implies that $n-\log |\mathcal{C}|\geq n-\log(n+1)-\log|\mathcal{C}_j|\geq \log\sum_{i=0}^{t'}\binom{n}{i}-\log(n+1)=\Omega(t\log (n/t))$.
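The final bound can be evaluated numerically. The sketch below computes $\log\sum_{i=0}^{t'}\binom{n}{i}-\log(n+1)$ with $t'=\lfloor(\lceil\frac{t+1}{2}\rceil-1)/2\rfloor$ for a few illustrative (arbitrarily chosen) values of $n$ and $t$:

```python
# Numeric evaluation of the redundancy lower bound from the Claim:
# redundancy >= log2(sum_{i=0}^{t'} C(n, i)) - log2(n + 1),
# with t' = floor((ceil((t+1)/2) - 1) / 2) ~ t/4.
from math import comb, log2

def redundancy_lower_bound(n, t):
    t_prime = ((t + 2) // 2 - 1) // 2  # floor((ceil((t+1)/2) - 1) / 2)
    ball = sum(comb(n, i) for i in range(t_prime + 1))
    return log2(ball) - log2(n + 1)

for n, t in [(1024, 8), (1024, 32), (4096, 32)]:
    print(n, t, round(redundancy_lower_bound(n, t), 1))
```

As expected, the bound grows with $t$ for fixed $n$, consistent with the $\Omega(t\log(n/t))$ behavior.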
Corollary: in the relevant regime $t=O(n^{1-\epsilon})$, we have $\Omega(t\log n)$ redundancy.

### $t$-break codes: Main ideas

Encoding:
- Need multiple markers across the codeword.
- Construct an adjacency matrix $A$ of markers to record their order.
- Append $RS_{2t}(A)$ to the codeword (as in the sliced channel).

Decoding (from at most $t + 1$ fragments):
- Locate all surviving markers, and locate $RS_{2t}(A)'$.
- Build an approximate adjacency matrix $A'$ from the surviving markers ($d_H(A, A')\leq 2t$).
- Correct $(A',RS_{2t}(A)')\mapsto (A,RS_{2t}(A))$.
- Order the fragments correctly using $A$.

Tools:
- Random encoding (to have many markers).
- Mutually uncorrelated codes (so that markers will not overlap).

#### Tool: Mutually uncorrelated codes

- Want: markers that do not overlap.
- Solution: take markers from a mutually uncorrelated code (an existing notion).
  - A code $\mathcal{M}$ is called mutually uncorrelated if no proper suffix of any $m_i \in \mathcal{M}$ is a prefix of another $m_j \in \mathcal{M}$ (including $i = j$).
  - Many constructions exist.
- Theorem: for any integer $\ell$ there exists a mutually uncorrelated code $\mathcal{C}_{MU}$ of length $\ell$ and size $|\mathcal{C}_{MU}| \geq \frac{2^\ell}{32\ell}$.

#### Tool: Random encoding

- Want: codewords with many markers from $\mathcal{C}_{MU}$ that are not too far apart.
- Problem: hard to achieve explicitly.
- Workaround: show that a uniformly random string has this property.
- Random encoding:
  - Choose the message at random.
  - Suitable for embedding, say, a printer ID.
  - Not suitable for dynamic information.
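The mutually-uncorrelated property above is easy to test directly. The sketch below checks it for small hypothetical example codes (the passing example is an ad-hoc illustration, not one of the constructions referenced in the lecture):

```python
# Check the mutually-uncorrelated property: no proper suffix of any
# codeword is a prefix of another codeword (including itself).

def is_mutually_uncorrelated(code):
    for m_i in code:
        for m_j in code:
            # proper, nonempty suffixes of m_i
            for k in range(1, len(m_i)):
                if m_j.startswith(m_i[-k:]):
                    return False
    return True

# Fails: suffix "10" of "110" is a prefix of "100".
print(is_mutually_uncorrelated(["110", "100"]))    # -> False
# Passes: words start with "11", end with "0", and have no other "11".
print(is_mutually_uncorrelated(["11000", "11010"]))  # -> True
```

Note that the check includes $i=j$, so each word must also be bifix-free (e.g., "1010" fails on its own, since its suffix "10" is also its prefix).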