
CSE5313 Coding and information theory for data science (Lecture 26)

Sliced and Broken Information with applications in DNA storage and 3D printing

Basic info

DNA: deoxyribonucleic acid.

A double-helix shaped molecule.

Each helix is a string of

  • Cytosine,
  • Guanine,
  • Adenine, and
  • Thymine.

Contained inside every living cell.

  • Inside the nucleus.

Used to encode proteins.

mRNA carries information to the ribosome as codons of length 3 over \{G,U,C,A\}.

  • Each codon encodes an amino acid.
  • 4^3 = 64 > 20 amino acids, redundancy in nature!

1st Chargaff rule:

  • The two strands are complements (A-T and G-C).
  • #A = #T and #G = #C in both strands.

2nd Chargaff rule:

  • #A \approx #T and #G \approx #C in each strand.
  • Can be explained via tandem duplications.
    • GCAGCATT \implies GCAGCAGCATT.
    • Occur naturally during cell mitosis.
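Tandem duplication is easy to model in code. A minimal sketch (the function name and signature are mine, for illustration only):

```python
def tandem_duplicate(s, start, length):
    """Insert a copy of s[start:start+length] immediately after the
    original occurrence, modeling a tandem-duplication event."""
    block = s[start:start + length]
    return s[:start + length] + block + s[start + length:]

# The lecture's example: duplicating GCA inside GCAGCATT.
print(tandem_duplicate("GCAGCATT", 3, 3))  # GCAGCAGCATT
```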

DNA storage

DNA synthesis:

  • Artificial creation of DNA from Gs, Ts, As, and Cs.

Can be used to store information!

Advantages:

  • Density.
    • 5.5 PB per mm³.
  • Stability.
    • Half-life of 521 years (compare to ≈ 20 years for hard drives).
  • Future proof.
    • DNA reading and writing will remain relevant "forever."

DNA storage prototypes

Some recent attempts:

  • 2012, 659 KB.
    • Church, Gao, Kosuri, "Next-generation digital information storage in DNA", Science.
  • 2018, 200MB.
    • Organick et al., "Random access in large-scale DNA data storage," Nature biotechnology.
  • CatalogDNA (startup):
    • 2019, 16GB.
    • 2021, 18 Mbps.

Companies:

  • Microsoft, Illumina, Western Digital, many startups.

Challenges:

  • Expensive, Slow.
  • Traditional storage media still sufficient and affordable.

DNA Storage models

In vivo:

  • Implant the synthetic DNA inside a living organism.
  • Need evolution-correcting codes!
  • E.g., coding against tandem-duplications.

In vitro:

  • Place the synthetic DNA in test tubes.
  • Challenge: Can only synthesize short sequences (\approx 1000 bp).
  • 1 test tube contains millions to billions of short sequences.

How to encode information?

How to achieve noise robustness?

DNA coding in vitro environment

Traditional data communication:


m\in\{0,1\}^k\mapsto c\in\{0,1\}^n

DNA storage:


m\in\{0,1\}^k\mapsto c\in \binom{\{0,1\}^L}{M}

where \binom{\{0,1\}^L}{M} is the collection of all $M$-subsets of \{0,1\}^L. (0\leq M\leq 2^L)

A codeword is a set of M binary strings, each of length L.

"Sliced channel":

  • The message m is encoded to c\in \{0,1\}^{ML}, and then sliced into M equal parts.
  • Parts may be noisy (substitutions, deletions, etc.).
  • Also useful in network packet transmission (M packets of length L).
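The slicing step can be sketched as follows (a toy model; the function name is mine):

```python
def slice_codeword(c, M):
    """Split an encoded binary string c of length M*L into M parts of
    length L, as in the sliced channel."""
    assert len(c) % M == 0, "codeword length must be divisible by M"
    L = len(c) // M
    return [c[i * L:(i + 1) * L] for i in range(M)]

# M = 4 parts of length L = 3; the channel then delivers these as an
# unordered set, possibly with noise in individual parts.
parts = slice_codeword("001011110100", 4)
```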

Sliced channel: Figures of merit

How to quantify the merit of a given code \mathcal{C}?

  • Want resilience to any K substitutions spread across the M parts.

Redundancy:

  • Recall that for linear codes,
    • redundancy = length - dimension = \log (size\ of\ space)-\log (size\ of\ code).
  • In the sliced channel:
    • redundancy = \log (size\ of\ space)-\log (size\ of\ code)=\log \binom{2^L}{M}-\log |\mathcal{C}|.
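The sliced-channel redundancy can be computed directly from the definition. A small helper (names are mine), using exact integer binomials from the standard library:

```python
import math

def sliced_redundancy(L, M, code_size):
    """redundancy = log2(size of space) - log2(size of code), where the
    space is the collection of all M-subsets of {0,1}^L."""
    space_bits = math.log2(math.comb(2 ** L, M))
    return space_bits - math.log2(code_size)

# For L = 3, M = 2 the space has C(8, 2) = 28 elements, so a code that
# uses the whole space has zero redundancy.
r = sliced_redundancy(3, 2, 28)
```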

Research questions:

  • Bounds on redundancy?
  • Code construction?
  • Is more redundancy needed in sliced channel?

Sliced channel: Lower bound

Idea: Sphere packing

Given c\in \binom{\{0,1\}^L}{M} and K=1, how many codewords must we exclude?

Example
  • L=3, M=2, and let c=\{001,011\}. The ball of radius 1 has size 5.

All the codewords at distance at most 1 from c are:

Distance 0:


\{001,011\}

Distance 1:


\{101,011\}\quad \{011\}\quad \{000,011\}\\
\{001,111\}\quad \{001\}\quad \{001,010\}

There are 7 options, but 2 of them (the singletons \{011\} and \{001\}) are not valid codewords, since they are not M-subsets.

So the effective size of the ball is 5.
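The example can be verified by brute force. A quick check (function and variable names are mine):

```python
def ball_radius_1(c, L, M):
    """All elements of the sliced-channel space reachable from codeword
    c by at most one substitution in one of its strings."""
    ball = {frozenset(c)}                        # distance 0
    for s in c:
        for i in range(L):
            flipped = s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]
            candidate = (set(c) - {s}) | {flipped}
            if len(candidate) == M:              # singletons are not M-subsets
                ball.add(frozenset(candidate))
    return ball

print(len(ball_radius_1({"001", "011"}, L=3, M=2)))  # 5
```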

Tool: (Fourier analysis of) Boolean functions

Introducing hypercube graph:

  • V=\{0,1\}^L.
  • \{x,y\}\in E if and only if d_H(x,y)=1.
  • What is the size of E? (|E|=L\cdot 2^{L-1}.)

Consider c\in \binom{\{0,1\}^L}{M} as a characteristic function: f_c:\{0,1\}^L\to \{0,1\}.

Let \partial f_c be its boundary.

  • All hypercube edges \{x,y\} such that f_c(x)\neq f_c(y).

Lemma of boundary

Size of 1-ball \geq |\partial f_c|+1.

Proof

Every edge on the boundary represents a unique way of flipping one bit in one string in c.

Need to bound |\partial f_c| from below

Tool: Total influence.

Definition of total influence.

The total influence I(f) of f:\{0,1\}^L\to \{0,1\} is defined as:


\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i}))

where x^{\oplus i} equals x with its $i$-th bit flipped.

Theorem (edge-isoperimetric inequality, stated without proof):

I(f)\geq 2\alpha\log\frac{1}{\alpha}.

where \alpha=\min\{\text{fraction of 1's},\text{fraction of 0's}\}.

Notice: Let \partial_i f be the set of direction-$i$ edges in \partial f. Then:


\begin{aligned}
I(f)&=\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i}))\\
&=\sum_{i=1}^L\frac{|\partial_i f|}{2^{L-1}}\\
&=\frac{|\partial f|}{2^{L-1}}.
\end{aligned}
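The identity I(f)=|\partial f|/2^{L-1} can be checked by full enumeration of the hypercube. A brute-force sketch (names are mine; f takes 0/1 tuples):

```python
from itertools import product

def total_influence(f, L):
    """I(f) = sum over i of Pr_x[f(x) != f(x with bit i flipped)]."""
    count = 0
    for x in product((0, 1), repeat=L):
        for i in range(L):
            y = x[:i] + (1 - x[i],) + x[i + 1:]
            if f(x) != f(y):
                count += 1
    return count / 2 ** L

def boundary_size(f, L):
    """|boundary of f|: hypercube edges whose endpoints get different values.
    Each edge is counted once, from its endpoint with x_i = 0."""
    return sum(
        1
        for x in product((0, 1), repeat=L)
        for i in range(L)
        if x[i] == 0 and f(x) != f(x[:i] + (1,) + x[i + 1:])
    )

# Check the identity for the majority function on L = 3 bits.
L = 3
maj = lambda x: int(sum(x) >= 2)
assert total_influence(maj, L) == boundary_size(maj, L) / 2 ** (L - 1)
```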

Corollary: Let \epsilon>0, L\geq \frac{1}{\epsilon} and M\leq 2^{(1-\epsilon)L}, and let c\in \binom{\{0,1\}^L}{M}. Then,


|\partial f_c|\geq 2\cdot 2^{L-1}\cdot\frac{M}{2^L}\log \frac{2^L}{M}= M\log \frac{2^L}{M}\geq \epsilon ML

The size of the 1-ball in the sliced channel is \geq \epsilon ML.

This implies (by sphere packing) that |\mathcal{C}|\leq \frac{\binom{2^L}{M}}{\epsilon ML}.

Corollary:

  • The redundancy in the sliced channel with K=1 and the above parameters is \log (ML)-O(1).
  • A simple generalization (not shown) gives O(K\log ML) for general K.

Robust indexing

Idea: In each length-L string, reserve \log M bits for indexing.

Problem 1: Indices subject to noise.

Problem 2: Indexing bits do not carry information \implies higher redundancy.

link to paper

Idea: Robust indexing

Instead of using 1,\ldots, M for indexing, use x_1,\ldots, x_M such that

  • \{x_1,\ldots,x_M\} has minimum Hamming distance 2K+1 (solves Problem 1), with |x_i|=O(\log M).
  • \{x_1,\ldots,x_M\} contain information (solves problem 2).

\{x_1,\ldots,x_M\} depend on the message.

  • Consider the message m=(m_1,m_2).
  • Find an encoding function m_1\mapsto\{\{x_i\}_{i=1}^M|d_H(x_i,x_j)\geq 2K+1\} (coding over codes)
  • Assume e is such a function (construction not shown).
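A naive stand-in for such an index set is a greedy search for M strings at pairwise distance \geq 2K+1. This is illustrative only (the names are mine, and a real robust-indexing code also embeds message bits in the choice of the set):

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary strings."""
    return sum(u != v for u, v in zip(a, b))

def greedy_index_set(length, M, K):
    """Greedily collect M binary strings of the given length with
    pairwise Hamming distance >= 2K+1, scanning in lexicographic order."""
    chosen = []
    for v in range(2 ** length):
        s = format(v, "0{}b".format(length))
        if all(hamming(s, c) >= 2 * K + 1 for c in chosen):
            chosen.append(s)
            if len(chosen) == M:
                return chosen
    raise ValueError("no such set for these parameters")

idx = greedy_index_set(length=7, M=4, K=1)  # pairwise distance >= 3
```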

...

Additional reading:

Jin Sima, Netanel Raviv, Moshe Schwartz, and Jehoshua Bruck. "Error Correction for DNA Storage." arXiv:2310.01729 (2023).

  • Magazine article.
  • Introductory.
  • Broad perspective.

Information Embedding in 3D printing

Motivations:

Threats to public safety.

  • Ghost guns, forging fingerprints, forging keys, fooling facial recognition.

Solution: Information embedding.

  • Printer ID, user ID, time/location stamp

Existing Information Embedding Techniques

Many techniques exist.

Information embedding using variations in layers.

  • Width, rotation, etc.
  • Magnetic properties.
  • Radiative materials.

Combating Adversarial Noise

Most techniques are rather accurate.

  • I.e., low bit error rate.

Challenge: Adversarial damage after use.

  • Scraping.
  • Deformation.
  • Breaking.

A t-break code

Let m\in \{0,1\}^k\mapsto c\in \{0,1\}^n

Adversary breaks c at most t times (security parameter).

The decoder receives a multiset of at most t+1 fragments. Assume the fragments are:

  • Oriented
  • Unordered
  • Any length

Lower bound for t-break code

Claim: A t-break code must have \Omega(t\log (n/t)) redundancy.

Lemma: Let \mathcal{C} be a t-break code of length n, and for i\in \{0,1,\ldots,n\} let \mathcal{C}_i\subseteq\mathcal{C} be the subset of \mathcal{C} containing all codewords of Hamming weight i. Then d_H(\mathcal{C}_i)\geq \lceil \frac{t+1}{2}\rceil.

Proof of Lemma

Let x,y\in \mathcal{C}_i for some i, and let \ell=d_H(x,y). Suppose toward contradiction that \ell\leq \lceil \frac{t+1}{2}\rceil-1, so that 2\ell\leq t.

Write (\circ denotes the concatenation operation):

x=c_1\circ x_{i_1}\circ c_2\circ x_{i_2}\circ\ldots\circ c_{\ell}\circ x_{i_{\ell}}\circ c_{\ell+1}

y=c_1\circ y_{i_1}\circ c_2\circ y_{i_2}\circ\ldots\circ c_{\ell}\circ y_{i_{\ell}}\circ c_{\ell+1}

where i_1<\ldots<i_\ell are the positions in which x and y differ, x_{i_j}\neq y_{i_j} are single bits for all j\in [\ell], and the common parts c_j may be empty.

Break x and y at most 2\ell\leq t times each, isolating the differing bits, to produce the multisets:


\mathcal{X}=\{c_1,c_2,\ldots,c_{\ell+1},x_{i_1},x_{i_2},\ldots,x_{i_{\ell}}\}

\mathcal{Y}=\{c_1,c_2,\ldots,c_{\ell+1},y_{i_1},y_{i_2},\ldots,y_{i_{\ell}}\}

Since w_H(x)=w_H(y)=i, the differing bits \{x_{i_j}\}_{j=1}^\ell contain the same number of 1's as \{y_{i_j}\}_{j=1}^\ell, and therefore \mathcal{X}=\mathcal{Y}. The decoder then cannot distinguish x from y, a contradiction. Hence \ell\geq \lceil \frac{t+1}{2}\rceil.
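The key step, that isolating the differing bits of two equal-weight strings yields identical fragment multisets, can be checked concretely (helper names and the example strings are mine):

```python
from collections import Counter

def isolate_bits(s, positions):
    """Break s (at most 2*len(positions) breaks) so that each listed bit
    becomes its own fragment; return the multiset of nonempty fragments."""
    frags, prev = [], 0
    for p in sorted(positions):
        frags.append(s[prev:p])      # common part, may be empty
        frags.append(s[p])           # the isolated bit
        prev = p + 1
    frags.append(s[prev:])
    return Counter(f for f in frags if f)

# x and y have equal Hamming weight and differ in positions 1 and 3,
# so 4 breaks make them indistinguishable as fragment multisets.
x, y = "0100110", "0001110"
diff = [i in range(0) or i for i in range(len(x)) if x[i] != y[i]]
assert isolate_bits(x, diff) == isolate_bits(y, diff)
```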

Proof of Claim

Let j\in \{0,1,\ldots,n\} be such that \mathcal{C}_j is the largest among \mathcal{C}_0,\mathcal{C}_1,\ldots,\mathcal{C}_{n}. Then

\log |\mathcal{C}|=\log\left(\sum_{i=0}^n|\mathcal{C}_i|\right)\leq \log \left((n+1)|\mathcal{C}_j|\right)= \log (n+1)+\log |\mathcal{C}_j|

By the Lemma and the ordinary sphere-packing bound, for t'=\lfloor\frac{\lceil \frac{t+1}{2}\rceil-1}{2}\rfloor\approx \frac{t}{4},


|\mathcal{C}_j|\leq \frac{2^n}{\sum_{i=0}^{t'}\binom{n}{i}}

This implies that n-\log |\mathcal{C}|\geq n-\log(n+1)-\log|\mathcal{C}_j|\geq \log\left(\sum_{i=0}^{t'}\binom{n}{i}\right)-\log(n+1)= \Omega(t\log (n/t))

Corollary: In the relevant regime t=O(n^{1-\epsilon}), we have \Omega(t\log n) redundancy.
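The end of the chain can be evaluated numerically. A helper (names are mine), following the proof's definition of t':

```python
import math

def t_break_redundancy_lb(n, t):
    """Evaluate log2(sum_{i<=t'} C(n, i)) - log2(n + 1), the redundancy
    lower bound from the proof, with t' = floor((ceil((t+1)/2) - 1)/2)."""
    t_prime = (math.ceil((t + 1) / 2) - 1) // 2
    ball = sum(math.comb(n, i) for i in range(t_prime + 1))
    return math.log2(ball) - math.log2(n + 1)

# Grows roughly like (t/4) * log2(n/t) in the regime t = O(n^(1-eps)).
lb = t_break_redundancy_lb(1024, 16)
```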

t-break codes: Main ideas.

Encoding:

  • Need multiple markers across the codeword.
  • Construct an adjacency matrix A of markers to record their order.
  • Append RS_{2t}(A) to the codeword (as in the sliced channel).

Decoding (from t + 1 fragments):

  • Locate all surviving markers, and locate RS_{2t}(A)'.
  • Build an approximate adjacency matrix A' from surviving markers (d_H(A, A' )\leq 2t).
  • Correct (A',RS_{2t}(A)')\mapsto (A,RS_{2t}(A)).
  • Order the fragments correctly using A.

Tools:

  • Random encoding (to have many markers).
  • Mutually uncorrelated codes (so that markers will not overlap).

Tool: Mutually uncorrelated codes.

  • Want: Markers not to overlap.
  • Solution: Take markers from a mutually uncorrelated code (an existing notion).
    • A code \mathcal{M} is called mutually uncorrelated if no proper suffix of any m_i\in \mathcal{M} is a prefix of another m_j\in \mathcal{M} (including i=j). Many constructions exist.
    • Theorem: For any integer \ell there exists a mutually uncorrelated code \mathcal{C}_{MU} of length \ell and size |\mathcal{C}_{MU}|\geq \frac{2^\ell}{32}.
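The mutual-uncorrelation property is straightforward to test by brute force (the checker and the example words are mine, not the lecture's construction):

```python
def is_mutually_uncorrelated(code):
    """True iff no proper suffix of any codeword equals a prefix of any
    codeword (including the same word), so markers can never overlap."""
    for a in code:
        for b in code:
            for k in range(1, len(a)):
                if a[-k:] == b[:k]:
                    return False
    return True

assert is_mutually_uncorrelated(["11000", "11010"])  # a valid marker set
assert not is_mutually_uncorrelated(["101"])         # "1" is suffix and prefix
```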

Tool: Random encoding.

  • Want: Codewords with many markers from \mathcal{C}_{MU} that are not too far apart.
  • Problem: Hard to achieve explicitly.
  • Workaround: Show that a uniformly random string has this property.
  • Random encoding: Choose the message at random.
    • Suitable for embedding, say, a printer ID.
    • Not suitable for dynamic information.