CSE5313 Coding and information theory for data science (Lecture 26)
Sliced and Broken Information with applications in DNA storage and 3D printing
Basic info
Deoxyribo-Nucleic Acid.
A double-helix shaped molecule.
Each helix is a string of
- Cytosine,
- Guanine,
- Adenine, and
- Thymine.
Contained inside every living cell.
- Inside the nucleus.
Used to encode proteins.
mRNA carries info to Ribosome as codons of length 3 over GUCA.
- Each codon produces an amino acid.
4^3 = 64 > 20, redundancy in nature!
1st Chargaff rule:
- The two strands are complements (A-T and G-C).
- #A = #T and #G = #C when counting over both strands together.
2nd Chargaff rule:
- #A \approx #T and #G \approx #C in each strand.
- Can be explained via tandem duplications.
- GCAGCATT \implies GCAGCAGCATT.
- Occur naturally during cell mitosis.
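The duplication step above can be sketched in Python (a toy model; the function name and interface are illustrative, not from the lecture):

```python
def tandem_duplicate(s: str, start: int, length: int) -> str:
    """Model a tandem duplication: the substring s[start:start+length]
    is repeated immediately after itself."""
    assert 0 <= start and start + length <= len(s)
    return s[:start + length] + s[start:start + length] + s[start + length:]

# The lecture's example: duplicating a "GCA" block in GCAGCATT
print(tandem_duplicate("GCAGCATT", 3, 3))  # GCAGCAGCATT
```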
DNA storage
DNA synthesis:
- Artificial creation of DNA from G’s, T’s, A’s, and C’s.
Can be used to store information!
Advantages:
- Density.
- 5.5 PB per mm^3.
- Stability.
- Half-life of 521 years (compare to \approx 20 years on hard drives).
- Future proof.
- DNA reading and writing will remain relevant "forever."
DNA storage prototypes
Some recent attempts:
- 2011, 659kb.
- Church, Gao, Kosuri, "Next-generation digital information storage in DNA", Science.
- 2018, 200MB.
- Organick et al., "Random access in large-scale DNA data storage," Nature biotechnology.
- CatalogDNA (startup):
- 2019, 16GB.
- 2021, 18 Mbps.
Companies:
- Microsoft, Illumina, Western Digital, many startups.
Challenges:
- Expensive, Slow.
- Traditional storage media still sufficient and affordable.
DNA Storage models
In vivo:
- Implant the synthetic DNA inside a living organism.
- Need evolution-correcting codes!
- E.g., coding against tandem-duplications.
In vitro:
- Place the synthetic DNA in test tubes.
- Challenge: Can only synthesize short sequences (\approx 1000 bp).
- 1 test tube contains millions to billions of short sequences.
How to encode information?
How to achieve noise robustness?
DNA coding in vitro environment
Traditional data communication:
m\in\{0,1\}^k\mapsto c\in\{0,1\}^n
DNA storage:
m\in\{0,1\}^k\mapsto c\in \binom{\{0,1\}^L}{M}
where \binom{\{0,1\}^L}{M} is the collection of all $M$-subsets of \{0,1\}^L. (0\leq M\leq 2^L)
A codeword is a set of M binary strings, each of length L.
"Sliced channel":
- The message m is encoded to c\in \{0,1\}^{ML}, and then sliced into M equal parts.
- Parts may be noisy (substitutions, deletions, etc.).
- Also useful in network packet transmission (M packets of length L).
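A minimal sketch of the slicing step in Python (names are illustrative; real systems add per-part indexing and error correction):

```python
def slice_codeword(c: str, M: int) -> set:
    """Slice a length-ML binary codeword into M parts of length L.
    The channel delivers the parts as an unordered set, which is why
    the encoder must make the order recoverable (e.g. by indexing)."""
    assert len(c) % M == 0, "codeword length must be M*L"
    L = len(c) // M
    parts = {c[i * L:(i + 1) * L] for i in range(M)}
    assert len(parts) == M, "parts must be distinct to form an M-subset"
    return parts

print(sorted(slice_codeword("000111010", 3)))  # ['000', '010', '111']
```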
Sliced channel: Figures of merit
How to quantify the merit of a given code \mathcal{C}?
- Want resilience to any K substitutions across all M parts.
Redundancy:
- Recall in linear codes,
redundancy=length-dimension=\log (size\ of\ space)-\log (size\ of\ code).
- In sliced channel:
redundancy=\log (size\ of\ space)-\log (size\ of\ code)=\log \binom{2^L}{M}-\log |\mathcal{C}|.
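The redundancy formula can be evaluated directly; a small Python sketch (the toy code size of 14 is an arbitrary assumption for illustration):

```python
from math import comb, log2

def sliced_redundancy(L: int, M: int, code_size: int) -> float:
    """redundancy = log2|space| - log2|code|, where the space is the
    collection of all M-subsets of {0,1}^L, of size C(2^L, M)."""
    return log2(comb(2 ** L, M)) - log2(code_size)

# L=3, M=2: the space has C(8,2) = 28 sets; a code using 14 of them
print(sliced_redundancy(3, 2, 14))  # 1.0 (bit)
```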
Research questions:
- Bounds on redundancy?
- Code construction?
- Is more redundancy needed in sliced channel?
Sliced channel: Lower bound
Idea: Sphere packing
Given c\in \binom{\{0,1\}^L}{M} and K=1, how many codewords must we exclude?
Example
L=3, M=2, and let c=\{001,011\}. The ball of radius 1 has size 5.
All sets at distance at most 1 from c are:
Distance 0:
\{001,011\}
Distance 1:
\{101,011\}\quad \{011\}\quad \{000,011\}\\
\{001,111\}\quad \{001\}\quad \{001,010\}
This gives 7 options, of which 2 (\{011\} and \{001\}) are not valid since they are not M-subsets.
So the effective size of the ball is 5.
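The count of 5 can be verified by enumeration; a Python sketch (the function name is illustrative):

```python
def ball_radius_1(c, L, M):
    """Enumerate all M-subsets of {0,1}^L reachable from codeword c by
    flipping at most one bit in one string; flips that collapse the set
    to fewer than M strings leave the space and are excluded."""
    ball = {frozenset(c)}                      # distance 0
    for s in c:
        for i in range(L):
            flipped = s[:i] + ('1' if s[i] == '0' else '0') + s[i + 1:]
            cand = frozenset((c - {s}) | {flipped})
            if len(cand) == M:
                ball.add(cand)
    return ball

print(len(ball_radius_1({'001', '011'}, L=3, M=2)))  # 5
```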
Tool: (Fourier analysis of) Boolean functions
Introducing hypercube graph:
- V=\{0,1\}^L.
- \{x,y\}\in E if and only if d_H(x,y)=1.
- What is the size of E? (It is L\cdot 2^{L-1}.)
Consider c\in \binom{\{0,1\}^L}{M} as a characteristic function: f_c:\{0,1\}^L\to \{0,1\}.
Let \partial f_c be its boundary.
- All hypercube edges \{x,y\} such that f_c(x)\neq f_c(y).
Lemma of boundary
Size of 1-ball \geq |\partial f_c|+1.
Proof
Every edge on the boundary represents a unique way of flipping one bit in one string in c.
Need to bound |\partial f_c| from below
Tool: Total influence.
Definition of total influence.
The total influence I(f) of f:\{0,1\}^L\to \{0,1\} is defined as:
\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i}))
where x^{\oplus i} equals x with its i-th bit flipped.
Theorem (edge-isoperimetric inequality, no proof):
I(f)\geq 2\alpha\log\frac{1}{\alpha}.
where \alpha=\min\{\text{fraction of 1's},\text{fraction of 0's}\}.
Notice: Let \partial_i f be the $i$-dimensional edges in \partial f.
\begin{aligned}
I(f)&=\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i}))\\
&=\sum_{i=1}^L\frac{|\partial_i f|}{2^{L-1}}\\
&=\frac{|\partial f|}{2^{L-1}}\\
\end{aligned}
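The identity I(f)=|\partial f|/2^{L-1} can be checked by brute force on a small hypercube; a Python sketch using the "dictator" function f(x)=x_1 (an assumed test function, not from the lecture):

```python
from itertools import product

def total_influence(f, L):
    """I(f) = sum_i Pr_x[f(x) != f(x^{xor i})], computed exactly by
    enumerating all 2^L vertices of the hypercube."""
    return sum(
        sum(f(x) != f(x[:i] + (1 - x[i],) + x[i + 1:])
            for x in product((0, 1), repeat=L)) / 2 ** L
        for i in range(L)
    )

def boundary_size(f, L):
    """|boundary of f|: hypercube edges {x,y} with f(x) != f(y)."""
    return sum(
        1
        for x in product((0, 1), repeat=L)
        for i in range(L)
        if x[i] == 0 and f(x) != f(x[:i] + (1,) + x[i + 1:])  # each edge once
    )

L, f = 4, (lambda x: x[0])                 # dictator: determined by bit 1
print(total_influence(f, L))               # 1.0
print(boundary_size(f, L) / 2 ** (L - 1))  # 1.0, matching I(f)
```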
Corollary: Let \epsilon>0, L\geq \frac{1}{\epsilon} and M\leq 2^{(1-\epsilon)L}, and let c\in \binom{\{0,1\}^L}{M}. Then,
|\partial f_c|\geq 2\cdot 2^{L-1}\cdot\frac{M}{2^L}\log \frac{2^L}{M}= M\log \frac{2^L}{M}\geq ML\epsilon
The size of a 1-ball in the sliced channel is \geq ML\epsilon.
This implies that |\mathcal{C}|\leq \frac{\binom{2^L}{M}}{\epsilon ML}.
Corollary:
- Redundancy in the sliced channel with K=1 and the above parameters is at least \log (ML)-O(1).
- A simple generalization (not shown) gives \Omega(K\log (ML)).
Robust indexing
Idea: Start with length-L strings, reserving \log M bits of each for indexing.
Problem 1: Indices subject to noise.
Problem 2: Indexing bits do not carry information \implies higher redundancy.
Idea: Robust indexing
Instead of using 1,\ldots, M for indexing, use x_1,\ldots, x_M such that
- \{x_1,\ldots,x_M\} has minimum distance 2K+1 (solves Problem 1).
- |x_i|=O(\log M).
- \{x_1,\ldots,x_M\} carries information (solves Problem 2).
\{x_1,\ldots,x_M\} depend on the message.
- Consider the message m=(m_1,m_2).
- Find an encoding function m_1\mapsto\{\{x_i\}_{i=1}^M \mid d_H(x_i,x_j)\geq 2K+1\} (coding over codes).
- Assume e is such a function (not shown).
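For tiny parameters, one way to realize such a set of robust indices is a greedy search; a Python sketch (this stands in for the function e, which the lecture does not construct; real constructions are far more efficient):

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def greedy_index_set(length, M, K):
    """Greedily collect M strings with pairwise Hamming distance >= 2K+1,
    to serve as robust indices x_1, ..., x_M."""
    chosen = []
    for cand in product('01', repeat=length):
        if all(hamming(cand, c) >= 2 * K + 1 for c in chosen):
            chosen.append(cand)
            if len(chosen) == M:
                return [''.join(c) for c in chosen]
    raise ValueError("length too short for the requested M and K")

indices = greedy_index_set(length=7, M=4, K=1)
print(indices)   # 4 strings, pairwise distance >= 3 (corrects K=1 errors)
```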
...
Additional reading:
Jin Sima, Netanel Raviv, Moshe Schwartz, and Jehoshua Bruck. "Error Correction for DNA Storage." arXiv:2310.01729 (2023).
- Magazine article.
- Introductory.
- Broad perspective.
Information Embedding in 3D printing
Motivations:
Threats to public safety.
- Ghost guns, forging fingerprints, forging keys, fooling facial recognition.
Solution – Information Embedding.
- Printer ID, user ID, time/location stamp
Existing Information Embedding Techniques
Many techniques exist.
Information embedding using variations in layers.
- Width, rotation, etc.
- Magnetic properties.
- Radiative materials.
Combating Adversarial Noise
Most techniques are rather accurate.
- I.e., low bit error rate.
Challenge: Adversarial damage after use.
- Scraping.
- Deformation.
- Breaking.
A t-break code
A message m\in \{0,1\}^k is encoded to c\in \{0,1\}^n.
Adversary breaks c at most t times (security parameter).
The decoder receives a multi-set of at most t+1 fragments. Assume the following:
- Oriented
- Unordered
- Any length
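The channel can be simulated directly; a Python sketch of the adversary's action (the interface is illustrative):

```python
import random

def break_codeword(c: str, t: int, rng: random.Random) -> list:
    """Break c at up to t positions and return the fragments as an
    unordered list: oriented, unordered, and of arbitrary lengths."""
    n_breaks = rng.randint(0, t)
    positions = sorted(rng.sample(range(1, len(c)), n_breaks))
    cuts = [0] + positions + [len(c)]
    fragments = [c[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]
    rng.shuffle(fragments)    # the decoder receives no ordering
    return fragments

rng = random.Random(0)
print(break_codeword("1011001110", t=3, rng=rng))  # at most 4 fragments
```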
Lower bound for t-break code
Claim: A t-break code must have \Omega(t\log (n/t)) redundancy.
Lemma: Let \mathcal{C} be a t-break code of length n, and for i\in \{0,1,\ldots,n\} let \mathcal{C}_i\subseteq\mathcal{C} be the subset of \mathcal{C} containing all codewords of Hamming weight i. Then d_H(\mathcal{C}_i)\geq \lceil \frac{t+1}{2}\rceil.
Proof of Lemma
Let x,y\in \mathcal{C}_i for some i, and let \ell=d_H(x,y).
Write (\circ denotes the concatenation operation):
x=c_1\circ x_{i_1}\circ c_2\circ x_{i_2}\circ\ldots\circ c_{\ell}\circ x_{i_{\ell}}\circ c_{\ell+1}
y=c_1\circ y_{i_1}\circ c_2\circ y_{i_2}\circ\ldots\circ c_{\ell}\circ y_{i_{\ell}}\circ c_{\ell+1}
where x_{i_j}\neq y_{i_j} for all j\in [\ell], and the common segments c_j may be empty.
Break x and y at the 2\ell positions around the differing bits to produce the multi-sets:
\mathcal{X}=\{c_1,c_2,\ldots,c_{\ell+1},x_{i_1},x_{i_2},\ldots,x_{i_{\ell}}\}
\mathcal{Y}=\{c_1,c_2,\ldots,c_{\ell+1},y_{i_1},y_{i_2},\ldots,y_{i_{\ell}}\}
Since w_H(x)=w_H(y)=i, the differing bits of x and y contain the same number of 1's, and therefore \mathcal{X}=\mathcal{Y}. If 2\ell\leq t this breaking is available to the adversary and the decoder cannot distinguish x from y; hence 2\ell\geq t+1, i.e., \ell\geq \lceil \frac{t+1}{2}\rceil.
Proof of Claim
Let j\in \{0,1,\ldots,n\} be such that \mathcal{C}_j is the largest among \mathcal{C}_0,\mathcal{C}_1,\ldots,\mathcal{C}_{n}.
\log |\mathcal{C}|=\log(\sum_{i=0}^n|\mathcal{C}_i|)\leq \log \left((n+1)|\mathcal{C}_j|\right)= \log (n+1)+\log |\mathcal{C}_j|
By the Lemma and the ordinary sphere-packing bound, for t'=\lfloor\frac{\lceil \frac{t+1}{2}\rceil-1}{2}\rfloor\approx \frac{t}{4},
|\mathcal{C}_j|\leq \frac{2^n}{\sum_{i=0}^{t'}\binom{n}{i}}
This implies that n-\log |\mathcal{C}|\geq n-\log(n+1)-\log|\mathcal{C}_j|\geq \dots \geq \Omega(t\log (n/t))
Corollary: In the relevant regime t=O(n^{1-\epsilon}), we have \Omega(t\log n) redundancy.
t-break codes: Main ideas.
Encoding:
- Need multiple markers across the codeword.
- Construct an adjacency matrix A of markers to record their order.
- Append RS_{2t}(A) to the codeword (as in the sliced channel).
Decoding (from t + 1 fragments):
- Locate all surviving markers, and locate RS_{2t}(A)'.
- Build an approximate adjacency matrix A' from surviving markers (d_H(A, A')\leq 2t).
- Correct (A',RS_{2t}(A)')\mapsto (A,RS_{2t}(A)).
- Order the fragments correctly using A.
Tools:
- Random encoding (to have many markers).
- Mutually uncorrelated codes (so that markers will not overlap).
Tool: Mutually uncorrelated codes.
- Want: Markers not to overlap.
- Solution: Take markers from a mutually uncorrelated code (an existing notion).
- A code \mathcal{M} is called mutually uncorrelated if no suffix of any m_i \in \mathcal{M} is a prefix of another m_j \in \mathcal{M} (including i=j).
- Many constructions exist.
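The mutual-uncorrelation condition is easy to check by brute force; a Python sketch (the example codewords are assumptions chosen for illustration):

```python
def is_mutually_uncorrelated(code):
    """Check that no proper, nonempty suffix of any codeword is a
    prefix of any codeword (including the codeword itself)."""
    return not any(
        u[-k:] == v[:k]
        for u in code
        for v in code
        for k in range(1, len(u))
    )

print(is_mutually_uncorrelated(['11000', '11010']))  # True
print(is_mutually_uncorrelated(['110', '100']))      # False ('10' overlaps)
```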
Theorem: For any integer \ell there exists a mutually uncorrelated code \mathcal{C}_{MU} of length \ell and size |\mathcal{C}_{MU}|\geq \frac{2^\ell}{32\ell}.
Tool: Random encoding.
- Want: Codewords with many markers from \mathcal{C}_{MU} that are not too far apart.
- Problem: Hard to achieve explicitly.
- Workaround: Show that a uniformly random string has this property.
Random encoding:
- Choose the message at random.
- Suitable for embedding, say, printer ID.
- Not suitable for dynamic information.
Let m>0 be a parameter.
Fix a mutually uncorrelated code \mathcal{C}_{MU} of length \Theta(\log m).
Fix m_1,\ldots, m_t from \mathcal{C}_{MU} as "special" markers.
Claim: With probability 1-\frac{1}{\mathrm{poly}(m)}, a uniformly random string z\in \{0,1\}^m satisfies:
- Every O(\log^2 m) consecutive bits contain a marker from \mathcal{C}_{MU}.
- Every two non-overlapping substrings of length c\log m are distinct.
- z does not contain any of the special markers m_1,\ldots, m_t.
Proof idea:
- Short substrings are abundant.
- Long substrings are rare.
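Both halves of the proof idea can be checked empirically; a Python sketch with assumed parameters (m = 4096, pattern lengths 5 and 36 are illustrative choices):

```python
import random

def substring_stats(m, short, long_, rng):
    """In a uniformly random z of length m, count how many of the
    2^short short patterns occur, and how many collisions exist
    among disjoint windows of length long_."""
    z = ''.join(rng.choice('01') for _ in range(m))
    shorts = {z[i:i + short] for i in range(m - short + 1)}
    longs = [z[i:i + long_] for i in range(0, m - long_ + 1, long_)]
    return len(shorts), len(longs) - len(set(longs))

n_short, collisions = substring_stats(4096, 5, 36, random.Random(1))
print(n_short)      # w.h.p. all 32 short patterns appear
print(collisions)   # w.h.p. 0: disjoint long windows are all distinct
```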
Sketch of encoding for t-break codes.
Repeatedly sample z\in \{0,1\}^m until it is "good".
Find all markers m_{i_1},\ldots, m_{i_r} in it.
Build a |\mathcal{C}_{MU}|\times |\mathcal{C}_{MU}| matrix A which records order and distances:
- A_{i,j}=0 if m_i, m_j are not adjacent.
- Otherwise, it is the distance between them (in bits).
Append RS_{2t}(A) at the end, and use the special markers m_1,\ldots, m_t.
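The adjacency-matrix construction can be sketched as follows (a toy Python version; it assumes each marker occurs at most once in z, and interprets "distance" as the gap between marker start positions):

```python
def marker_adjacency(z, markers):
    """Build the adjacency matrix A: A[i][j] holds the distance (in bits)
    between the start positions of consecutive markers m_i, m_j found in z,
    and 0 when m_i, m_j are not adjacent."""
    occurrences = sorted(
        (z.find(m), i) for i, m in enumerate(markers) if m in z
    )
    n = len(markers)
    A = [[0] * n for _ in range(n)]
    for (p1, i), (p2, j) in zip(occurrences, occurrences[1:]):
        A[i][j] = p2 - p1       # distance between start positions
    return A

markers = ['1100', '1011']
z = '00110000010110'            # '1100' starts at 2, '1011' at 9
print(marker_adjacency(z, markers))  # [[0, 7], [0, 0]]
```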
Sketch of decoding for t-break codes.
Construct a partial adjacency matrix A' from fragments.