CSE5313 Coding and information theory for data science (Lecture 26)
Sliced and Broken Information with applications in DNA storage and 3D printing
Basic info
Deoxyribo-Nucleic Acid.
A double-helix shaped molecule.
Each helix is a string of
- Cytosine,
- Guanine,
- Adenine, and
- Thymine.
Contained inside every living cell.
- Inside the nucleus.
Used to encode proteins.
mRNA carries info to Ribosome as codons of length 3 over GUCA.
- Each codon produces an amino acid.
4^3 = 64 > 20, redundancy in nature!
1st Chargaff rule:
- The two strands are complements (A-T and G-C).
- #A = #T and #G = #C when counting over both strands together.
2nd Chargaff rule:
- #A \approx #T and #G \approx #C in each strand.
- Can be explained via tandem duplications.
- GCAGCATT \implies GCAGCAGCATT.
- Occur naturally during cell mitosis.
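The duplication step above can be sketched in Python (a toy model; the function name and interface are illustrative, not from the lecture):

```python
def tandem_duplicate(s: str, start: int, length: int) -> str:
    """Model a tandem duplication: the substring s[start:start+length]
    is repeated immediately after itself."""
    assert 0 <= start and start + length <= len(s)
    return s[:start + length] + s[start:start + length] + s[start + length:]

# The lecture's example: duplicating a "GCA" block in GCAGCATT
print(tandem_duplicate("GCAGCATT", 3, 3))  # GCAGCAGCATT
```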
DNA storage
DNA synthesis:
- Artificial creation of DNA from G’s, T’s, A’s, and C’s.
Can be used to store information!
Advantages:
- Density.
- 5.5 PB per mm^3.
- Stability.
- Half-life of 521 years (compare to \approx 20 years on hard drives).
- Future proof.
- DNA reading and writing will remain relevant "forever."
DNA storage prototypes
Some recent attempts:
- 2011, 659kb.
- Church, Gao, Kosuri, "Next-generation digital information storage in DNA", Science.
- 2018, 200MB.
- Organick et al., "Random access in large-scale DNA data storage," Nature biotechnology.
- CatalogDNA (startup):
- 2019, 16GB.
- 2021, 18 Mbps.
Companies:
- Microsoft, Illumina, Western Digital, many startups.
Challenges:
- Expensive, Slow.
- Traditional storage media still sufficient and affordable.
DNA Storage models
In vivo:
- Implant the synthetic DNA inside a living organism.
- Need evolution-correcting codes!
- E.g., coding against tandem-duplications.
In vitro:
- Place the synthetic DNA in test tubes.
- Challenge: Can only synthesize short sequences (\approx 1000 bp).
- 1 test tube contains millions to billions of short sequences.
How to encode information?
How to achieve noise robustness?
DNA coding in vitro environment
Traditional data communication:
m\in\{0,1\}^k\mapsto c\in\{0,1\}^n
DNA storage:
m\in\{0,1\}^k\mapsto c\in \binom{\{0,1\}^L}{M}
where \binom{\{0,1\}^L}{M} is the collection of all $M$-subsets of \{0,1\}^L. (0\leq M\leq 2^L)
A codeword is a set of M binary strings, each of length L.
"Sliced channel":
- The message m is encoded to c\in \{0,1\}^{ML}, and then sliced into M equal parts.
- Parts may be noisy (substitutions, deletions, etc.).
- Also useful in network packet transmission (M packets of length L).
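A minimal sketch of the slicing step in Python (names are illustrative; real systems add per-part indexing and error correction):

```python
def slice_codeword(c: str, M: int) -> set:
    """Slice a length-ML binary codeword into M parts of length L.
    The channel delivers the parts as an unordered set, which is why
    the encoder must make the order recoverable (e.g. by indexing)."""
    assert len(c) % M == 0, "codeword length must be M*L"
    L = len(c) // M
    parts = {c[i * L:(i + 1) * L] for i in range(M)}
    assert len(parts) == M, "parts must be distinct to form an M-subset"
    return parts

print(sorted(slice_codeword("000111010", 3)))  # ['000', '010', '111']
```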
Sliced channel: Figures of merit
How to quantify the merit of a given code \mathcal{C}?
- Want resilience to any K substitutions across all M parts.
Redundancy:
- Recall in linear codes,
redundancy=length-dimension=\log (size\ of\ space)-\log (size\ of\ code).
- In sliced channel:
redundancy=\log (size\ of\ space)-\log (size\ of\ code)=\log \binom{2^L}{M}-\log |\mathcal{C}|.
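The redundancy formula can be evaluated directly; a small Python sketch (the toy code size of 14 is an arbitrary assumption for illustration):

```python
from math import comb, log2

def sliced_redundancy(L: int, M: int, code_size: int) -> float:
    """redundancy = log2|space| - log2|code|, where the space is the
    collection of all M-subsets of {0,1}^L, of size C(2^L, M)."""
    return log2(comb(2 ** L, M)) - log2(code_size)

# L=3, M=2: the space has C(8,2) = 28 sets; a code using 14 of them
print(sliced_redundancy(3, 2, 14))  # 1.0 (bit)
```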
Research questions:
- Bounds on redundancy?
- Code construction?
- Is more redundancy needed in sliced channel?
Sliced channel: Lower bound
Idea: Sphere packing
Given c\in \binom{\{0,1\}^L}{M} and K=1, how many codewords must we exclude?
Example
L=3, M=2, and let c=\{001,011\}. The ball of radius 1 has size 5.
All sets at distance at most 1 from c are:
Distance 0:
\{001,011\}
Distance 1:
\{101,011\}\quad \{011\}\quad \{000,011\}\\
\{001,111\}\quad \{001\}\quad \{001,010\}
This gives 7 options, of which 2 (\{011\} and \{001\}) are not valid since they are not M-subsets.
So the effective size of the ball is 5.
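The count of 5 can be verified by enumeration; a Python sketch (the function name is illustrative):

```python
def ball_radius_1(c, L, M):
    """Enumerate all M-subsets of {0,1}^L reachable from codeword c by
    flipping at most one bit in one string; flips that collapse the set
    to fewer than M strings leave the space and are excluded."""
    ball = {frozenset(c)}                      # distance 0
    for s in c:
        for i in range(L):
            flipped = s[:i] + ('1' if s[i] == '0' else '0') + s[i + 1:]
            cand = frozenset((c - {s}) | {flipped})
            if len(cand) == M:
                ball.add(cand)
    return ball

print(len(ball_radius_1({'001', '011'}, L=3, M=2)))  # 5
```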
Tool: (Fourier analysis of) Boolean functions
Introducing hypercube graph:
- V=\{0,1\}^L.
- \{x,y\}\in E if and only if d_H(x,y)=1.
- What is the size of E? (It is L\cdot 2^{L-1}.)
Consider c\in \binom{\{0,1\}^L}{M} as a characteristic function: f_c:\{0,1\}^L\to \{0,1\}.
Let \partial f_c be its boundary.
- All hypercube edges \{x,y\} such that f_c(x)\neq f_c(y).
Lemma of boundary
Size of 1-ball \geq |\partial f_c|+1.
Proof
Every edge on the boundary represents a unique way of flipping one bit in one string in c.
Need to bound |\partial f_c| from below
Tool: Total influence.
Definition of total influence.
The total influence I(f) of f:\{0,1\}^L\to \{0,1\} is defined as:
\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i}))
where x^{\oplus i} equals x with its i-th bit flipped.
Theorem (edge-isoperimetric inequality, no proof):
I(f)\geq 2\alpha\log\frac{1}{\alpha}.
where \alpha=\min\{\text{fraction of 1's},\text{fraction of 0's}\}.
Notice: Let \partial_i f be the $i$-dimensional edges in \partial f.
\begin{aligned}
I(f)&=\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i}))\\
&=\sum_{i=1}^L\frac{|\partial_i f|}{2^{L-1}}\\
&=\frac{|\partial f|}{2^{L-1}}\\
\end{aligned}
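The identity I(f)=|\partial f|/2^{L-1} can be checked by brute force on a small hypercube; a Python sketch using the "dictator" function f(x)=x_1 (an assumed test function, not from the lecture):

```python
from itertools import product

def total_influence(f, L):
    """I(f) = sum_i Pr_x[f(x) != f(x^{xor i})], computed exactly by
    enumerating all 2^L vertices of the hypercube."""
    return sum(
        sum(f(x) != f(x[:i] + (1 - x[i],) + x[i + 1:])
            for x in product((0, 1), repeat=L)) / 2 ** L
        for i in range(L)
    )

def boundary_size(f, L):
    """|boundary of f|: hypercube edges {x,y} with f(x) != f(y)."""
    return sum(
        1
        for x in product((0, 1), repeat=L)
        for i in range(L)
        if x[i] == 0 and f(x) != f(x[:i] + (1,) + x[i + 1:])  # each edge once
    )

L, f = 4, (lambda x: x[0])                 # dictator: determined by bit 1
print(total_influence(f, L))               # 1.0
print(boundary_size(f, L) / 2 ** (L - 1))  # 1.0, matching I(f)
```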
Corollary: Let \epsilon>0, L\geq \frac{1}{\epsilon} and M\leq 2^{(1-\epsilon)L}, and let c\in \binom{\{0,1\}^L}{M}. Then,
|\partial f_c|\geq 2\cdot 2^{L-1}\cdot\frac{M}{2^L}\log \frac{2^L}{M}= M\log \frac{2^L}{M}\geq ML\epsilon
The size of a 1-ball in the sliced channel is \geq ML\epsilon.
This implies that |\mathcal{C}|\leq \frac{\binom{2^L}{M}}{\epsilon ML}.
Corollary:
- Redundancy in the sliced channel with K=1 and the above parameters is at least \log (ML)-O(1).
- A simple generalization (not shown) gives \Omega(K\log (ML)).
Robust indexing
Idea: Start with length-L strings, reserving \log M bits of each for indexing.
Problem 1: Indices subject to noise.
Problem 2: Indexing bits do not carry information \implies higher redundancy.
Idea: Robust indexing
Instead of using 1,\ldots, M for indexing, use x_1,\ldots, x_M such that
- \{x_1,\ldots,x_M\} has minimum distance 2K+1 (solves Problem 1).
- |x_i|=O(\log M).
- \{x_1,\ldots,x_M\} carries information (solves Problem 2).
\{x_1,\ldots,x_M\} depend on the message.
- Consider the message m=(m_1,m_2).
- Find an encoding function m_1\mapsto\{\{x_i\}_{i=1}^M \mid d_H(x_i,x_j)\geq 2K+1\} (coding over codes).
- Assume e is such a function (not shown).
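For tiny parameters, one way to realize such a set of robust indices is a greedy search; a Python sketch (this stands in for the function e, which the lecture does not construct; real constructions are far more efficient):

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def greedy_index_set(length, M, K):
    """Greedily collect M strings with pairwise Hamming distance >= 2K+1,
    to serve as robust indices x_1, ..., x_M."""
    chosen = []
    for cand in product('01', repeat=length):
        if all(hamming(cand, c) >= 2 * K + 1 for c in chosen):
            chosen.append(cand)
            if len(chosen) == M:
                return [''.join(c) for c in chosen]
    raise ValueError("length too short for the requested M and K")

indices = greedy_index_set(length=7, M=4, K=1)
print(indices)   # 4 strings, pairwise distance >= 3 (corrects K=1 errors)
```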
...
Additional reading:
Jin Sima, Netanel Raviv, Moshe Schwartz, and Jehoshua Bruck. "Error Correction for DNA Storage." arXiv:2310.01729 (2023).
- Magazine article.
- Introductory.
- Broad perspective.
Information Embedding in 3D printing
Motivations:
Threats to public safety.
- Ghost guns, forging fingerprints, forging keys, fooling facial recognition.
Solution – Information Embedding.
- Printer ID, user ID, time/location stamp
Existing Information Embedding Techniques
Many techniques exist.
Information embedding using variations in layers.
- Width, rotation, etc.
- Magnetic properties.
- Radiative materials.
Combating Adversarial Noise
Most techniques are rather accurate.
- I.e., low bit error rate.
Challenge: Adversarial damage after use.
- Scraping.
- Deformation.
- Breaking.
A t-break code
A message m\in \{0,1\}^k is encoded to c\in \{0,1\}^n.
Adversary breaks c at most t times (security parameter).
The decoder receives a multi-set of at most t+1 fragments. Assume the following:
- Oriented
- Unordered
- Any length
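The channel can be simulated directly; a Python sketch of the adversary's action (the interface is illustrative):

```python
import random

def break_codeword(c: str, t: int, rng: random.Random) -> list:
    """Break c at up to t positions and return the fragments as an
    unordered list: oriented, unordered, and of arbitrary lengths."""
    n_breaks = rng.randint(0, t)
    positions = sorted(rng.sample(range(1, len(c)), n_breaks))
    cuts = [0] + positions + [len(c)]
    fragments = [c[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]
    rng.shuffle(fragments)    # the decoder receives no ordering
    return fragments

rng = random.Random(0)
print(break_codeword("1011001110", t=3, rng=rng))  # at most 4 fragments
```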
Lower bound for t-break code
Claim: A t-break code must have \Omega(t\log (n/t)) redundancy.
Lemma: Let \mathcal{C} be a t-break code of length n, and for i\in \{0,1,\ldots,n\} let \mathcal{C}_i\subseteq\mathcal{C} be the subset of \mathcal{C} containing all codewords of Hamming weight i. Then d_H(\mathcal{C}_i)\geq \lceil \frac{t+1}{2}\rceil.
Proof of Lemma
Let x,y\in \mathcal{C}_i for some i, and let \ell=d_H(x,y).
Write (\circ denotes the concatenation operation):
x=c_1\circ x_{i_1}\circ c_2\circ x_{i_2}\circ\ldots\circ c_{\ell}\circ x_{i_{\ell}}\circ c_{\ell+1}
y=c_1\circ y_{i_1}\circ c_2\circ y_{i_2}\circ\ldots\circ c_{\ell}\circ y_{i_{\ell}}\circ c_{\ell+1}
where x_{i_j}\neq y_{i_j} for all j\in [\ell], and the common segments c_j may be empty.
Break x and y at the 2\ell positions around the differing bits to produce the multi-sets:
\mathcal{X}=\{c_1,c_2,\ldots,c_{\ell+1},x_{i_1},x_{i_2},\ldots,x_{i_{\ell}}\}
\mathcal{Y}=\{c_1,c_2,\ldots,c_{\ell+1},y_{i_1},y_{i_2},\ldots,y_{i_{\ell}}\}
Since w_H(x)=w_H(y)=i, the differing bits of x and y contain the same number of 1's, and therefore \mathcal{X}=\mathcal{Y}. If 2\ell\leq t this breaking is available to the adversary and the decoder cannot distinguish x from y; hence 2\ell\geq t+1, i.e., \ell\geq \lceil \frac{t+1}{2}\rceil.
Proof of Claim
Let j\in \{0,1,\ldots,n\} be such that \mathcal{C}_j is the largest among \mathcal{C}_0,\mathcal{C}_1,\ldots,\mathcal{C}_{n}.
\log |\mathcal{C}|=\log(\sum_{i=0}^n|\mathcal{C}_i|)\leq \log \left((n+1)|\mathcal{C}_j|\right)= \log (n+1)+\log |\mathcal{C}_j|
By the Lemma and the ordinary sphere-packing bound, for t'=\lfloor\frac{\lceil \frac{t+1}{2}\rceil-1}{2}\rfloor\approx \frac{t}{4},
|\mathcal{C}_j|\leq \frac{2^n}{\sum_{i=0}^{t'}\binom{n}{i}}
This implies that n-\log |\mathcal{C}|\geq n-\log(n+1)-\log|\mathcal{C}_j|\geq \dots \geq \Omega(t\log (n/t))
Corollary: In the relevant regime t=O(n^{1-\epsilon}), we have \Omega(t\log n) redundancy.
t-break codes: Main ideas.
Encoding:
- Need multiple markers across the codeword.
- Construct an adjacency matrix A of markers to record their order.
- Append RS_{2t}(A) to the codeword (as in the sliced channel).
Decoding (from t + 1 fragments):
- Locate all surviving markers, and locate RS_{2t}(A)'.
- Build an approximate adjacency matrix A' from surviving markers (d_H(A, A')\leq 2t).
- Correct (A',RS_{2t}(A)')\mapsto (A,RS_{2t}(A)).
- Order the fragments correctly using A.
Tools:
- Random encoding (to have many markers).
- Mutually uncorrelated codes (so that markers will not overlap).
Tool: Mutually uncorrelated codes.
- Want: Markers not to overlap.
- Solution: Take markers from a mutually uncorrelated code (an existing notion).
- A code \mathcal{M} is called mutually uncorrelated if no suffix of any m_i \in \mathcal{M} is a prefix of another m_j \in \mathcal{M} (including i=j).
- Many constructions exist.
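The mutual-uncorrelation condition is easy to check by brute force; a Python sketch (the example codewords are assumptions chosen for illustration):

```python
def is_mutually_uncorrelated(code):
    """Check that no proper, nonempty suffix of any codeword is a
    prefix of any codeword (including the codeword itself)."""
    return not any(
        u[-k:] == v[:k]
        for u in code
        for v in code
        for k in range(1, len(u))
    )

print(is_mutually_uncorrelated(['11000', '11010']))  # True
print(is_mutually_uncorrelated(['110', '100']))      # False ('10' overlaps)
```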
Theorem: For any integer \ell there exists a mutually uncorrelated code \mathcal{C}_{MU} of length \ell and size |\mathcal{C}_{MU}|\geq \frac{2^\ell}{32\ell}.
Tool: Random encoding.
- Want: Codewords with many markers from \mathcal{C}_{MU} that are not too far apart.
- Problem: Hard to achieve explicitly.
- Workaround: Show that a uniformly random string has this property.
Random encoding:
- Choose the message at random.
- Suitable for embedding, say, printer ID.
- Not suitable for dynamic information.
Let m>0 be a parameter.
Fix a mutually uncorrelated code \mathcal{C}_{MU} of length \Theta(\log m).
Fix m_1,\ldots, m_t from \mathcal{C}_{MU} as "special" markers.
Claim: With probability 1-\frac{1}{\mathrm{poly}(m)}, a uniformly random string z\in \{0,1\}^m satisfies:
- Every O(\log^2 m) consecutive bits contain a marker from \mathcal{C}_{MU}.
- Every two non-overlapping substrings of length c\log m are distinct.
- z does not contain any of the special markers m_1,\ldots, m_t.
Proof idea:
- Short substrings are abundant.
- Long substrings are rare.
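Both halves of the proof idea can be checked empirically; a Python sketch with assumed parameters (m = 4096, pattern lengths 5 and 36 are illustrative choices):

```python
import random

def substring_stats(m, short, long_, rng):
    """In a uniformly random z of length m, count how many of the
    2^short short patterns occur, and how many collisions exist
    among disjoint windows of length long_."""
    z = ''.join(rng.choice('01') for _ in range(m))
    shorts = {z[i:i + short] for i in range(m - short + 1)}
    longs = [z[i:i + long_] for i in range(0, m - long_ + 1, long_)]
    return len(shorts), len(longs) - len(set(longs))

n_short, collisions = substring_stats(4096, 5, 36, random.Random(1))
print(n_short)      # w.h.p. all 32 short patterns appear
print(collisions)   # w.h.p. 0: disjoint long windows are all distinct
```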
Sketch of encoding for t-break codes.
Repeatedly sample z\in \{0,1\}^m until it is "good".
Find all markers m_{i_1},\ldots, m_{i_r} in it.
Build a |\mathcal{C}_{MU}|\times |\mathcal{C}_{MU}| matrix A which records order and distances:
- A_{i,j}=0 if m_i, m_j are not adjacent.
- Otherwise, it is the distance between them (in bits).
Append RS_{2t}(A) at the end, and use the special markers m_1,\ldots, m_t.
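The adjacency-matrix construction can be sketched as follows (a toy Python version; it assumes each marker occurs at most once in z, and interprets "distance" as the gap between marker start positions):

```python
def marker_adjacency(z, markers):
    """Build the adjacency matrix A: A[i][j] holds the distance (in bits)
    between the start positions of consecutive markers m_i, m_j found in z,
    and 0 when m_i, m_j are not adjacent."""
    occurrences = sorted(
        (z.find(m), i) for i, m in enumerate(markers) if m in z
    )
    n = len(markers)
    A = [[0] * n for _ in range(n)]
    for (p1, i), (p2, j) in zip(occurrences, occurrences[1:]):
        A[i][j] = p2 - p1       # distance between start positions
    return A

markers = ['1100', '1011']
z = '00110000010110'            # '1100' starts at 2, '1011' at 9
print(marker_adjacency(z, markers))  # [[0, 7], [0, 0]]
```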
Sketch of decoding for t-break codes.
Construct a partial adjacency matrix A' from fragments.