# CSE5313 Coding and information theory for data science (Lecture 26)

## Sliced and Broken Information with applications in DNA storage and 3D printing

### Basic info

Deoxyribo-Nucleic Acid. A double-helix shaped molecule. Each helix is a string of
- Cytosine,
- Guanine,
- Adenine, and
- Thymine.

Contained inside every living cell.
- Inside the nucleus.

Used to encode proteins. mRNA carries the information to the ribosome as codons of length 3 over $\{G,U,C,A\}$.
- Each codon produces an amino acid.
- $4^3> 20$, redundancy in nature!

1st Chargaff rule:
- The two strands are complements (A-T and G-C).
- $\#A = \#T$ and $\#G = \#C$ over both strands combined.

2nd Chargaff rule:
- $\#A \approx \#T$ and $\#G \approx \#C$ in each strand.
- Can be explained via tandem duplications.
  - $GCAGCATT \implies GCAGCAGCATT$.
  - Occur naturally during cell mitosis.

### DNA storage

DNA synthesis:
- Artificial creation of DNA from G's, T's, A's, and C's.

Can be used to store information! Advantages:
- Density.
  - 5.5 PB per mm³.
- Stability.
  - Half-life 521 years (compare to $\approx 20$ years on hard drives).
- Future proof.
  - DNA reading and writing will remain relevant "forever."

#### DNA storage prototypes

Some recent attempts:
- 2012, 659 kB.
  - Church, Gao, Kosuri, "Next-generation digital information storage in DNA," Science.
- 2018, 200 MB.
  - Organick et al., "Random access in large-scale DNA data storage," Nature Biotechnology.
- CatalogDNA (startup):
  - 2019, 16 GB.
  - 2021, 18 Mbps.

Companies:
- Microsoft, Illumina, Western Digital, many startups.

Challenges:
- Expensive, slow.
- Traditional storage media still sufficient and affordable.

#### DNA storage models

In vivo:
- Implant the synthetic DNA inside a living organism.
- Need evolution-correcting codes!
  - E.g., coding against tandem duplications.

In vitro:
- Place the synthetic DNA in test tubes.
- Challenge: can only synthesize short sequences ($\approx 1000$ bp).
- One test tube contains millions to billions of short sequences.

How to encode information? How to achieve noise robustness?

### DNA coding in the in vitro environment

Traditional data communication:
$$
m\in\{0,1\}^k\mapsto c\in\{0,1\}^n
$$

DNA storage:
$$
m\in\{0,1\}^k\mapsto c\in \binom{\{0,1\}^L}{M}
$$
where $\binom{\{0,1\}^L}{M}$ is the collection of all $M$-subsets of $\{0,1\}^L$ ($0\leq M\leq 2^L$). A codeword is a set of $M$ binary strings, each of length $L$.

"Sliced channel":
- The message $m$ is encoded to $c\in \{0,1\}^{ML}$, and then sliced into $M$ equal parts.
- Parts may be noisy (substitutions, deletions, etc.).
- Also useful in network packet transmission ($M$ packets of length $L$).

#### Sliced channel: Figures of merit

How to quantify the **merit** of a given code $\mathcal{C}$?
- Want resilience to any $K$ substitutions across all $M$ parts.

Redundancy:
- Recall that in linear codes,
  - $\text{redundancy}=\text{length}-\text{dimension}=\log (\text{size of space})-\log (\text{size of code})$.
- In the sliced channel:
  - $\text{redundancy}=\log (\text{size of space})-\log (\text{size of code})=\log \binom{2^L}{M}-\log |\mathcal{C}|$.

Research questions:
- Bounds on redundancy?
- Code constructions?
- Is more redundancy needed in the sliced channel?

#### Sliced channel: Lower bound

Idea: **Sphere packing**

Given $c\in \binom{\{0,1\}^L}{M}$ and $K=1$, how many codewords must we exclude?
Example: $L=3$, $M=2$, and let $c=\{001,011\}$. All words at distance at most 1 from $c$ (obtained by flipping at most one bit in one of the strings) are:

Distance 0:
$$
\{001,011\}
$$

Distance 1:
$$
\{101,011\}\quad \{011\}\quad \{000,011\}\\
\{001,111\}\quad \{001\}\quad \{001,010\}
$$

7 options, but 2 of them ($\{011\}$ and $\{001\}$, where the flip merges the two strings) are not valid codewords, since they are not $2$-subsets. So the effective size of the radius-1 ball is 5.
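This count is easy to verify by brute force. The following minimal Python sketch (illustrative, not from the lecture) enumerates the radius-1 ball around $c$ by flipping each bit of each string and discarding candidates that are no longer $M$-subsets.

```python
L, M = 3, 2
c = frozenset({"001", "011"})

def ball_radius_1(c, L, M):
    """All valid M-subsets reachable from c by at most one bit flip in one string."""
    ball = {c}  # distance 0: c itself
    for s in c:
        for i in range(L):
            flipped = s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]
            candidate = frozenset(c - {s} | {flipped})
            if len(candidate) == M:  # discard words that collapsed to fewer strings
                ball.add(candidate)
    return ball

ball = ball_radius_1(c, L, M)
print(len(ball))  # 5, matching the count above
for word in sorted(map(sorted, ball)):
    print(word)
```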
Tool: (Fourier analysis of) Boolean functions.

Introducing the hypercube graph:
- $V=\{0,1\}^L$.
- $\{x,y\}\in E$ if and only if $d_H(x,y)=1$.
- What is the size of $E$? (Each of the $2^L$ vertices has degree $L$, so $|E|=L\cdot 2^{L-1}$.)

Consider $c\in \binom{\{0,1\}^L}{M}$ as a characteristic function $f_c:\{0,1\}^L\to \{0,1\}$, i.e., $f_c(x)=1$ if and only if $x\in c$. Let $\partial f_c$ be its boundary:
- All hypercube edges $\{x,y\}$ such that $f_c(x)\neq f_c(y)$.

#### Boundary lemma

Size of the 1-ball $\geq |\partial f_c|+1$.
Proof: Every boundary edge $\{x,y\}$ with $x\in c$ and $y\notin c$ represents a unique way of flipping one bit in one string of $c$, and distinct boundary edges yield distinct words $(c\setminus\{x\})\cup\{y\}$ in the ball. Together with $c$ itself (distance 0), this gives at least $|\partial f_c|+1$ words. $\blacksquare$
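The lemma can be sanity-checked on the earlier example. The following standalone Python sketch (again illustrative, not lecture material) counts the boundary edges of $f_c$ and compares against the size of the radius-1 ball.

```python
def flip(s, i):
    """Flip the i-th bit of a binary string."""
    return s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]

def boundary_size(c, L):
    """Count hypercube edges {x, y} with exactly one endpoint in c.
    Each such edge is counted once, from its endpoint inside c."""
    return sum(1 for x in c for i in range(L) if flip(x, i) not in c)

def ball_size(c, L, M):
    """Size of the radius-1 ball around c in the sliced channel."""
    ball = {c}
    for x in c:
        for i in range(L):
            candidate = frozenset(c - {x} | {flip(x, i)})
            if len(candidate) == M:
                ball.add(candidate)
    return len(ball)

c = frozenset({"001", "011"})
print(boundary_size(c, 3))                             # 4 boundary edges
print(ball_size(c, 3, 2) >= boundary_size(c, 3) + 1)   # True: 5 >= 4 + 1
```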
Need to bound $|\partial f_c|$ from below.

Tool: Total influence.

#### Definition of total influence

The **total influence** $I(f)$ of $f:\{0,1\}^L\to \{0,1\}$ is defined as
$$
I(f)=\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}\left(f(x)\neq f(x^{\oplus i})\right),
$$
where $x^{\oplus i}$ equals $x$ with its $i$th bit flipped.

#### Theorem (edge-isoperimetric inequality, no proof)

$I(f)\geq 2\alpha\log\frac{1}{\alpha}$, where $\alpha=\min\{\text{fraction of 1's},\text{fraction of 0's}\}$.

Notice: Let $\partial_i f$ be the set of $i$-dimensional edges in $\partial f$. Then
$$
\begin{aligned}
I(f)&=\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}\left(f(x)\neq f(x^{\oplus i})\right)\\
&=\sum_{i=1}^L\frac{|\partial_i f|}{2^{L-1}}\\
&=\frac{|\partial f|}{2^{L-1}}.
\end{aligned}
$$

Corollary: Let $\epsilon>0$, $L\geq \frac{1}{\epsilon}$ and $M\leq 2^{(1-\epsilon)L}$, and let $c\in \binom{\{0,1\}^L}{M}$. Then, taking $\alpha=\frac{M}{2^L}$,
$$
|\partial f_c|\geq 2\cdot 2^{L-1}\cdot\frac{M}{2^L}\log \frac{2^L}{M}= M\log \frac{2^L}{M}\geq ML\epsilon.
$$
Hence the size of the $1$-ball in the sliced channel is $\geq ML\epsilon$, which by sphere packing implies $|\mathcal{C}|\leq \frac{\binom{2^L}{M}}{\epsilon ML}$.

Corollary:
- The redundancy of the sliced channel with $K=1$ and the above parameters is at least $\log (ML)-O(1)$.
- A simple generalization (not shown) gives $O(K\log (ML))$.

### Robust indexing

Naive idea: in each length-$L$ string, reserve $\log M$ bits for an index.

Problem 1: Indices are subject to noise.

Problem 2: Indexing bits do not carry information $\implies$ higher redundancy.

[link to paper](https://ieeexplore.ieee.org/document/9174447)

Idea: Robust indexing. Instead of using $1,\ldots, M$ for indexing, use $x_1,\ldots, x_M$ such that
- $\{x_1,\ldots,x_M\}$ has minimum distance $2K+1$ (solves problem 1), with $|x_i|=O(\log M)$.
- $\{x_1,\ldots,x_M\}$ carries information (solves problem 2), i.e., $\{x_1,\ldots,x_M\}$ depends on the message:
  - Consider the message $m=(m_1,m_2)$.
  - Find an encoding function $m_1\mapsto\{\{x_i\}_{i=1}^M \mid d_H(x_i,x_j)\geq 2K+1\}$ ("coding over codes").
  - Assume $e$ is such a function (not shown).

...

Additional reading: Jin Sima, Netanel Raviv, Moshe Schwartz, and Jehoshua Bruck, "Error Correction for DNA Storage," arXiv:2310.01729 (2023).
- Magazine article.
- Introductory.
- Broad perspective.

### Information Embedding in 3D printing

Motivation: Threats to public safety.
- Ghost guns, forging fingerprints, forging keys, fooling facial recognition.

Solution: Information embedding.
- Printer ID, user ID, time/location stamp.

#### Existing Information Embedding Techniques

Many techniques exist, e.g., information embedding using variations in layers:
- Width, rotation, etc.
- Magnetic properties.
- Radiative materials.

#### Combating Adversarial Noise

Most techniques are rather accurate.
- I.e., low bit error rate.

Challenge: Adversarial damage after use.
- Scraping.
- Deformation.
- **Breaking**.

#### A t-break code

Let $m\in \{0,1\}^k\mapsto c\in \{0,1\}^n$. The adversary breaks $c$ at most $t$ times ($t$ is a security parameter), and the decoder receives a **multi-set** of at most $t+1$ fragments. Assume the fragments are:
- Oriented (not reversed).
- Unordered.
- Of arbitrary lengths.

#### Lower bound for t-break codes

Claim: A t-break code must have $\Omega(t\log (n/t))$ redundancy.

Lemma: Let $\mathcal{C}$ be a t-break code of length $n$, and for $i\in \{0,1,\ldots,n\}$ let $\mathcal{C}_i\subseteq\mathcal{C}$ be the subset of $\mathcal{C}$ containing all codewords of Hamming weight $i$. Then $d_H(\mathcal{C}_i)\geq \lceil \frac{t+1}{2}\rceil$.
Proof of Lemma: Let $x,y\in \mathcal{C}_i$ for some $i$, and let $\ell=d_H(x,y)$. Write ($\circ$ denotes the concatenation operation):
$$
x=c_1\circ x_{i_1}\circ c_2\circ x_{i_2}\circ\ldots\circ c_{\ell}\circ x_{i_{\ell}}\circ c_{\ell+1}
$$
$$
y=c_1\circ y_{i_1}\circ c_2\circ y_{i_2}\circ\ldots\circ c_{\ell}\circ y_{i_{\ell}}\circ c_{\ell+1}
$$
where the $c_j$'s are the (possibly empty) common substrings, and $x_{i_j}\neq y_{i_j}$ are single bits for all $j\in [\ell]$. Break $x$ and $y$ $2\ell$ times each, around every differing bit, to produce the multi-sets
$$
\mathcal{X}=\{c_1,c_2,\ldots,c_{\ell+1},x_{i_1},x_{i_2},\ldots,x_{i_{\ell}}\},
$$
$$
\mathcal{Y}=\{c_1,c_2,\ldots,c_{\ell+1},y_{i_1},y_{i_2},\ldots,y_{i_{\ell}}\}.
$$
Since $w_H(x)=w_H(y)=i$, the single-bit fragments $x_{i_1},\ldots,x_{i_\ell}$ and $y_{i_1},\ldots,y_{i_\ell}$ contain the same number of 1's, and therefore $\mathcal{X}=\mathcal{Y}$ as multi-sets. If $2\ell\leq t$, these breaks are within the adversary's budget and the decoder cannot distinguish $x$ from $y$; hence $2\ell\geq t+1$, i.e., $\ell\geq \lceil \frac{t+1}{2}\rceil$. $\blacksquare$
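The fragment-confusion argument can be checked concretely. Below is a short Python sketch (illustrative, with a hand-picked pair of equal-weight words at distance 2) showing that cutting both words around their differing bits yields identical fragment multi-sets.

```python
from collections import Counter

def break_around_diffs(x, y):
    """Cut x and y right before and right after every position where they differ
    (at most 2 cuts per differing bit), and return the fragment multi-sets."""
    diffs = [i for i in range(len(x)) if x[i] != y[i]]
    cuts = sorted({j for i in diffs for j in (i, i + 1)})
    def fragments(s):
        points = [0] + cuts + [len(s)]
        return Counter(s[a:b] for a, b in zip(points, points[1:]) if a < b)
    return fragments(x), fragments(y)

# A hand-picked pair of equal Hamming weight (3) at Hamming distance 2.
x, y = "0110100", "0101100"
fx, fy = break_around_diffs(x, y)
print(fx == fy)  # True: the two fragment multi-sets are indistinguishable
```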
Proof of Claim: Let $j\in \{0,1,\ldots,n\}$ be such that $\mathcal{C}_j$ is the largest among $\mathcal{C}_0,\mathcal{C}_1,\ldots,\mathcal{C}_{n}$. Then
$$
\log |\mathcal{C}|=\log\left(\sum_{i=0}^n|\mathcal{C}_i|\right)\leq \log \left((n+1)|\mathcal{C}_j|\right)= \log (n+1)+\log |\mathcal{C}_j|.
$$
By the Lemma and the ordinary sphere packing bound, $\mathcal{C}_j$ corrects $t'=\lfloor\frac{\lceil \frac{t+1}{2}\rceil-1}{2}\rfloor\approx \frac{t}{4}$ substitutions, so
$$
|\mathcal{C}_j|\leq \frac{2^n}{\sum_{i=0}^{t'}\binom{n}{i}}.
$$
This implies that
$$
n-\log |\mathcal{C}|\geq n-\log(n+1)-\log|\mathcal{C}_j|\geq \log\left(\sum_{i=0}^{t'}\binom{n}{i}\right)-\log(n+1)\geq t'\log\frac{n}{t'}-\log(n+1)=\Omega(t\log (n/t)). \qquad\blacksquare
$$
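For intuition about the numbers, the following Python snippet (illustrative only; the sample parameters are arbitrary) evaluates the bound $\log\sum_{i=0}^{t'}\binom{n}{i}-\log(n+1)$ from the proof, which grows like $t\log(n/t)$ up to the hidden constants ($t'\approx t/4$).

```python
from math import comb, log2

def redundancy_lower_bound(n, t):
    """log2(sum_{i<=t'} C(n, i)) - log2(n + 1), with t' = floor((ceil((t+1)/2) - 1)/2)."""
    t_prime = ((t + 2) // 2 - 1) // 2  # ceil((t+1)/2) = (t+2)//2, then halve again
    ball = sum(comb(n, i) for i in range(t_prime + 1))
    return log2(ball) - log2(n + 1)

for n, t in [(1024, 8), (1024, 32), (4096, 64)]:
    print(n, t, round(redundancy_lower_bound(n, t), 1))
```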
Corollary: In the relevant regime $t=O(n^{1-\epsilon})$, we have $\Omega(t\log n)$ redundancy.

### t-break codes: Main ideas

Encoding:
- Need multiple markers across the codeword.
- Construct an adjacency matrix $A$ of markers to record their order.
- Append $RS_{2t}(A)$ to the codeword (as in the sliced channel).

Decoding (from at most $t + 1$ fragments):
- Locate all surviving markers, and locate $RS_{2t}(A)'$.
- Build an approximate adjacency matrix $A'$ from the surviving markers ($d_H(A, A')\leq 2t$).
- Correct $(A',RS_{2t}(A)')\mapsto (A,RS_{2t}(A))$.
- Order the fragments correctly using $A$.

Tools:
- Random encoding (to have many markers).
- Mutually uncorrelated codes (so that markers will not overlap).

#### Tool: Mutually uncorrelated codes

- Want: Markers that do not overlap.
- Solution: Take the markers from a mutually uncorrelated code (an existing notion).
- A code $\mathcal{M}$ is called mutually uncorrelated if no proper suffix of any $m_i \in \mathcal{M}$ is a prefix of any $m_j \in \mathcal{M}$ (including $i=j$).
- Many constructions exist.

Theorem: For any integer $\ell$ there exists a mutually uncorrelated code $\mathcal{C}_{MU}$ of length $\ell$ and size $|\mathcal{C}_{MU}|\geq \frac{2^\ell}{32\ell}$.

#### Tool: Random encoding

- Want: Codewords with many markers from $\mathcal{C}_{MU}$ that are not too far apart.
- Problem: Hard to achieve explicitly.
- Workaround: Show that a uniformly random string has this property.

Random encoding:
- Choose the message at random.
- Suitable for embedding, say, a printer ID.
- Not suitable for dynamic information.

Let $m>0$ be a parameter. Fix a mutually uncorrelated code $\mathcal{C}_{MU}$ of length $\Theta(\log m)$. Fix $m_1,\ldots, m_t$ from $\mathcal{C}_{MU}$ as "special" markers.

Claim: With probability $1-\frac{1}{\mathrm{poly}(m)}$, a uniformly random string $z\in \{0,1\}^m$ satisfies:
- Every $O(\log^2 m)$ bits contain a marker from $\mathcal{C}_{MU}$.
- Every two non-overlapping substrings of length $c\log m$ are distinct.
- $z$ does not contain any of the special markers $m_1,\ldots, m_t$.

Proof idea:
- Short substrings are abundant.
- Long substrings are rare.

#### Sketch of encoding for t-break codes

Repeatedly sample $z\in \{0,1\}^m$ until it is "good". Find all markers $m_{i_1},\ldots, m_{i_r}$ in it. Build a $|\mathcal{C}_{MU}|\times |\mathcal{C}_{MU}|$ matrix $A$ which records order and distances:
- $A_{i,j}=0$ if $m_i,m_j$ are not adjacent.
- Otherwise, it is the distance between them (in bits).

Append $RS_{2t}(A)$ at the end, and use the special markers $m_1,\ldots, m_t$.

#### Sketch of decoding for t-break codes

Construct a partial adjacency matrix $A'$ from the fragments; a toy illustration of the marker bookkeeping follows below.
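To make the marker bookkeeping concrete, here is a toy Python sketch; the marker set and the helper names (`is_mutually_uncorrelated`, `find_markers`, `adjacency_matrix`) are illustrative choices, not part of the construction. It checks the mutually uncorrelated property, locates marker occurrences in a string, and records the gap (in bits) between consecutive markers in a matrix $A$.

```python
def is_mutually_uncorrelated(words):
    """No proper suffix of any word is a prefix of any word (including itself)."""
    return all(
        not w2.startswith(w1[k:])
        for w1 in words for w2 in words for k in range(1, len(w1))
    )

def find_markers(z, markers):
    """Return (position, marker index) pairs, left to right; the MU property
    guarantees that marker occurrences cannot overlap."""
    hits = []
    for pos in range(len(z)):
        for idx, mk in enumerate(markers):
            if z.startswith(mk, pos):
                hits.append((pos, idx))
    return sorted(hits)

def adjacency_matrix(z, markers):
    """A[i][j] = distance in bits between adjacent occurrences of markers i and j
    (0 if they are not adjacent)."""
    n = len(markers)
    A = [[0] * n for _ in range(n)]
    hits = find_markers(z, markers)
    for (p1, i), (p2, j) in zip(hits, hits[1:]):
        A[i][j] = p2 - p1  # gap between consecutive marker occurrences
    return A

markers = ["11000", "11010"]               # a toy mutually uncorrelated set
print(is_mutually_uncorrelated(markers))   # True
z = "00110000110100"                       # toy string containing both markers
print(adjacency_matrix(z, markers))        # [[0, 6], [0, 0]]
```

In the actual construction the markers come from a $\mathcal{C}_{MU}$ of length $\Theta(\log m)$ and the matrix $A$ is protected by $RS_{2t}(A)$; the toy above only illustrates the order-and-distance bookkeeping.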