# CSE5313 Coding and information theory for data science (Lecture 26)
## Sliced and Broken Information with applications in DNA storage and 3D printing
### Basic info
Deoxyribo-Nucleic Acid.
A double-helix shaped molecule.
Each helix is a string of
- Cytosine,
- Guanine,
- Adenine, and
- Thymine.
Contained inside every living cell.
- Inside the nucleus.
Used to encode proteins.
mRNA carries information to the ribosome as codons of length 3 over the alphabet {G, U, C, A}.
- Each codon produces an amino acid.
- $4^3> 20$, redundancy in nature!
1st Chargaff rule:
- The two strands are complements (A-T and G-C).
- $\#A = \#T$ and $\#G = \#C$ when counted over both strands.
2nd Chargaff rule:
- $\#A \approx \#T$ and $\#G \approx \#C$ within each single strand.
- Can be explained via tandem duplications.
- $GCAGCATT \implies GCAGCAGCATT$.
- Occur naturally during cell mitosis.
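A tandem duplication is just an in-place substring copy; a minimal sketch in Python (the function name is ours):

```python
def tandem_duplicate(s: str, start: int, length: int) -> str:
    """Duplicate the substring s[start:start+length] immediately after
    itself, modeling a tandem duplication event."""
    block = s[start:start + length]
    return s[:start + length] + block + s[start + length:]

# The lecture's example: GCAGCATT -> GCAGCAGCATT (duplicating "GCA")
assert tandem_duplicate("GCAGCATT", 0, 3) == "GCAGCAGCATT"
```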
### DNA storage
DNA synthesis:
- Artificial creation of DNA from Gs, Ts, As, and Cs.
Can be used to store information!
Advantages:
- Density.
- 5.5 PB per mm³.
- Stability.
- Half-life 521 years (compared to ≈ 20 years on hard drives).
- Future proof.
- DNA reading and writing will remain relevant "forever."
#### DNA storage prototypes
Some recent attempts:
- 2011, 659kb.
- Church, Gao, Kosuri, "Next-generation digital information storage in DNA", Science.
- 2018, 200MB.
- Organick et al., "Random access in large-scale DNA data storage," Nature biotechnology.
- CatalogDNA (startup):
- 2019, 16GB.
- 2021, 18 Mbps.
Companies:
- Microsoft, Illumina, Western Digital, many startups.
Challenges:
- Expensive, Slow.
- Traditional storage media still sufficient and affordable.
#### DNA Storage models
In vivo:
- Implant the synthetic DNA inside a living organism.
- Need evolution-correcting codes!
- E.g., coding against tandem-duplications.
In vitro:
- Place the synthetic DNA in test tubes.
- Challenge: Can only synthesize short sequences ($\approx 1000 bp$).
- 1 test tube contains millions to billions of short sequences.
How to encode information?
How to achieve noise robustness?
### DNA coding in vitro environment
Traditional data communication:
$$
m\in\{0,1\}^k\mapsto c\in\{0,1\}^n
$$
DNA storage:
$$
m\in\{0,1\}^k\mapsto c\in \binom{\{0,1\}^L}{M}
$$
where $\binom{\{0,1\}^L}{M}$ is the collection of all $M$-subsets of $\{0,1\}^L$. ($0\leq M\leq 2^L$)
A codeword is a set of $M$ binary strings, each of length $L$.
"Sliced channel":
- The message $m$ is encoded to $c\in \{0,1\}^{ML}$, and then sliced into $M$ equal parts.
- Parts may be noisy (substitutions, deletions, etc.).
- Also useful in network packet transmission ($M$ packets of length $L$).
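The slicing step above can be sketched as follows (a toy model with our own function name; multiplicities are ignored since codewords are $M$-subsets):

```python
def slice_codeword(c: str, M: int) -> frozenset:
    """Slice a length-(M*L) binary string into M equal parts; the
    channel then delivers the parts as an unordered set."""
    assert len(c) % M == 0
    L = len(c) // M
    return frozenset(c[i * L:(i + 1) * L] for i in range(M))

# M = 2 parts of length L = 3:
assert slice_codeword("001011", M=2) == frozenset({"001", "011"})
```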
#### Sliced channel: Figures of merit
How to quantify the **merit** of a given code $\mathcal{C}$?
- Want resilience to any $K$ substitutions in all $M$ parts.
Redundance:
- Recall in linear codes,
- $redundancy=length-dimension=\log (size\ of\ space)-\log (size\ of\ code)$.
- In sliced channel:
- $redundancy=\log (size\ of\ space)-\log (size\ of\ code)=\log \binom{2^L}{M}-\log |\mathcal{C}|$.
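The sliced-channel redundancy can be computed directly for toy parameters (the numeric values below are our own, not from the lecture):

```python
import math

def sliced_redundancy(L: int, M: int, code_size: int) -> float:
    """redundancy = log2(size of space) - log2(size of code), where
    the space is the collection of all M-subsets of {0,1}^L."""
    space = math.comb(2 ** L, M)
    return math.log2(space) - math.log2(code_size)

# Toy parameters: L = 8, M = 4, |C| = 2^20.
r = sliced_redundancy(L=8, M=4, code_size=2 ** 20)
assert r > 0
```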
Research questions:
- Bounds on redundancy?
- Code construction?
- Is more redundancy needed in sliced channel?
#### Sliced channel: Lower bound
Idea: **Sphere packing**
Given $c\in \binom{\{0,1\}^L}{M}$ and $K=1$, how many words must we exclude from being codewords?
<details>
<summary>Example</summary>
- $L=3,M=2$, and let $c=\{001,011\}$.
All words within distance 1 from $c$ are:
Distance 0:
$$
\{001,011\}
$$
Distance 1:
$$
\{101,011\}\quad \{011\}\quad \{000,011\}\\
\{001,111\}\quad \{001\}\quad \{001,010\}
$$
7 options in total, but 2 of them ($\{011\}$ and $\{001\}$, where a substitution merges the two strings) are not valid $2$-subsets.
So the effective size of the ball of radius 1 is 5.
</details>
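The example above can be verified by brute force (a sketch; the helper name is ours):

```python
def ball_radius_1(c: frozenset, L: int, M: int) -> set:
    """All valid M-subsets of {0,1}^L reachable from c by at most one
    single-bit substitution in one part (including c itself)."""
    ball = {c}
    for s in c:
        for i in range(L):
            flipped = s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]
            candidate = frozenset((c - {s}) | {flipped})
            if len(candidate) == M:  # drop collapsed sets like {011}
                ball.add(candidate)
    return ball

c = frozenset({"001", "011"})
assert len(ball_radius_1(c, L=3, M=2)) == 5  # matches the example
```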
Tool: (Fourier analysis of) Boolean functions
Introducing hypercube graph:
- $V=\{0,1\}^L$.
- $\{x,y\}\in E$ if and only if $d_H(x,y)=1$.
- What is the size of $E$? $|E|=L\cdot 2^{L-1}$, since each of the $2^L$ vertices has degree $L$ and each edge is counted twice.
Consider $c\in \binom{\{0,1\}^L}{M}$ as a characteristic function $f_c:\{0,1\}^L\to \{0,1\}$.
Let $\partial f_c$ be its boundary.
- All hypercube edges $\{x,y\}$ such that $f_c(x)\neq f_c(y)$.
#### Lemma of boundary
Size of 1-ball $\geq |\partial f_c|+1$.
<details>
<summary>Proof</summary>
Every edge on the boundary represents a unique way of flipping one bit in one string of $c$, so distinct boundary edges yield distinct words at distance 1; adding $c$ itself gives $|\partial f_c|+1$.
</details>
Need to bound $|\partial f_c|$ from below
Tool: Total influence.
#### Definition of total influence.
The **total influence** $I(f)$ of $f:\{0,1\}^L\to \{0,1\}$ is defined as:
$$
\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i}))
$$
where $x^{\oplus i}$ equals $x$ with its $i$th bit flipped.
#### Theorem (Edge-isoperimetric inequality, no proof)
$I(f)\geq 2\alpha\log\frac{1}{\alpha}$.
where $\alpha=\min\{\text{fraction of 1's},\text{fraction of 0's}\}$.
Notice: Let $\partial_i f$ be the set of edges in $\partial f$ that differ in coordinate $i$.
$$
\begin{aligned}
I(f)&=\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i}))\\
&=\sum_{i=1}^L\frac{|\partial_i f|}{2^{L-1}}\\
&=\frac{|\partial f|}{2^{L-1}}\\
\end{aligned}
$$
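Both the identity $I(f)=|\partial f|/2^{L-1}$ and the edge-isoperimetric inequality can be checked by brute force on the earlier example $c=\{001,011\}$ (a sketch; the helper names are ours):

```python
import math
from itertools import product

def total_influence(f, L: int) -> float:
    """I(f) = sum over i of Pr_x[ f(x) != f(x with bit i flipped) ]."""
    total = 0.0
    for i in range(L):
        disagree = sum(
            f(x) != f(x[:i] + (1 - x[i],) + x[i + 1:])
            for x in product((0, 1), repeat=L)
        )
        total += disagree / 2 ** L
    return total

def boundary_size(f, L: int) -> int:
    """Number of hypercube edges {x, y} with f(x) != f(y)."""
    edges = 0
    for x in product((0, 1), repeat=L):
        for i in range(L):
            y = x[:i] + (1 - x[i],) + x[i + 1:]
            if x < y and f(x) != f(y):  # count each edge once
                edges += 1
    return edges

# Running example: c = {001, 011}, L = 3, encoded as bit tuples.
c = {(0, 0, 1), (0, 1, 1)}
f = lambda x: int(x in c)
L = 3
I = total_influence(f, L)
assert I == boundary_size(f, L) / 2 ** (L - 1)       # I(f) = |∂f| / 2^(L-1)
alpha = len(c) / 2 ** L                               # minority fraction (1's here)
assert I >= 2 * alpha * math.log2(1 / alpha) - 1e-9   # edge-isoperimetric inequality
```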
Corollary: Let $\epsilon>0$, $L\geq \frac{1}{\epsilon}$ and $M\leq 2^{(1-\epsilon)L}$, and let $c\in \binom{\{0,1\}^L}{M}$. Then, taking $\alpha=\frac{M}{2^L}$,
$$
|\partial f_c|\geq 2\cdot 2^{L-1}\cdot\frac{M}{2^L}\log \frac{2^L}{M}= M\log \frac{2^L}{M}\geq ML\epsilon
$$
Hence the size of a 1-ball in the sliced channel is at least $ML\epsilon$.
By sphere packing, this implies that $|\mathcal{C}|\leq \frac{\binom{2^L}{M}}{\epsilon ML}$.
Corollary:
- The redundancy of the sliced channel with $K=1$ and the above parameters is at least $\log (ML)-O(1)$.
- A simple generalization (not shown) gives $O(K\log (ML))$.
### Robust indexing
Idea: Start each length-$L$ string with $\log M$ bits for indexing.
Problem 1: Indices subject to noise.
Problem 2: Indexing bits do not carry information $\implies$ higher redundancy.
[link to paper](https://ieeexplore.ieee.org/document/9174447)
Idea: Robust indexing
Instead of using $1,\ldots, M$ for indexing, use $x_1,\ldots, x_M$ such that
- $\{x_1,\ldots,x_M\}$ have pairwise Hamming distance at least $2K+1$ and $|x_i|=O(\log M)$ (solves problem 1).
- $\{x_1,\ldots,x_M\}$ contain information (solves problem 2).
$\{x_1,\ldots,x_M\}$ depend on the message.
- Consider the message $m=(m_1,m_2)$.
- Find an encoding function $m_1\mapsto\{\{x_i\}_{i=1}^M \mid d_H(x_i,x_j)\geq 2K+1\}$ (coding over codes).
- Assume $e$ is such a function (not shown).
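One naive (exponential-time) way to obtain such an index set is a greedy Gilbert-Varshamov-style search; this is only an illustration with our own names, not the paper's encoding function $e$:

```python
from itertools import product

def greedy_index_set(length: int, M: int, K: int):
    """Greedily pick M binary strings with pairwise Hamming distance
    >= 2K+1 to serve as noise-robust indices. Exponential time; an
    illustration only, not the paper's encoding function e."""
    d_min = 2 * K + 1
    chosen = []
    for cand in product("01", repeat=length):
        if all(sum(a != b for a, b in zip(cand, x)) >= d_min for x in chosen):
            chosen.append(cand)
            if len(chosen) == M:
                return ["".join(x) for x in chosen]
    return None  # length is too short to fit M such indices

idx = greedy_index_set(length=7, M=4, K=1)
assert idx is not None and len(idx) == 4
```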
...
Additional reading:
Jin Sima, Netanel Raviv, Moshe Schwartz, and Jehoshua Bruck. "Error Correction for DNA Storage." arXiv:2310.01729 (2023).
- Magazine article.
- Introductory.
- Broad perspective.
### Information Embedding in 3D printing
Motivations:
Threats to public safety.
- Ghost guns, forging fingerprints, forging keys, fooling facial recognition.
Solution: Information embedding.
- Printer ID, user ID, time/location stamps.
#### Existing Information Embedding Techniques
Many techniques exist.
Information embedding using variations in layers.
- Width, rotation, etc.
- Magnetic properties.
- Radiative materials.
Combating Adversarial Noise
Most techniques are rather accurate.
- I.e., low bit error rate.
Challenge: Adversarial damage after use.
- Scraping.
- Deformation.
- **Breaking**.
#### A t-break code
Encoding: $m\in \{0,1\}^k\mapsto c\in \{0,1\}^n$.
An adversary breaks $c$ in at most $t$ places ($t$ is a security parameter).
The decoder receives a **multi-set** of at most $t+1$ fragments. Assume the fragments are:
- Oriented
- Unordered
- Any length
#### Lower bound for t-break code
Claim: A $t$-break code must have $\Omega(t\log (n/t))$ redundancy.
Lemma: Let $\mathcal{C}$ be a $t$-break code of length $n$, and for $i\in \{0,1,\ldots,n\}$ let $\mathcal{C}_i\subseteq\mathcal{C}$ be the subset of $\mathcal{C}$ containing all codewords of Hamming weight $i$. Then $d_H(\mathcal{C}_i)\geq \lceil \frac{t+1}{2}\rceil$ for every $i$.
<details>
<summary>Proof of Lemma</summary>
Let $x,y\in \mathcal{C}_i$ be distinct for some $i$, and let $\ell=d_H(x,y)$.
Write ($\circ$ denotes the concatenation operation):
$x=c_1\circ x_{i_1}\circ c_2\circ x_{i_2}\circ\ldots\circ c_{\ell}\circ x_{i_{\ell}}\circ c_{\ell+1}$
$y=c_1\circ y_{i_1}\circ c_2\circ y_{i_2}\circ\ldots\circ c_{\ell}\circ y_{i_{\ell}}\circ c_{\ell+1}$
where $x_{i_j}\neq y_{i_j}$ are single bits for all $j\in [\ell]$, and the $c_j$'s are the (possibly empty) common runs.
Break $x$ and $y$ at the $2\ell$ positions around the differing bits to produce the multi-sets:
$$
\mathcal{X}=\{c_1,c_2,\ldots,c_{\ell+1},x_{i_1},x_{i_2},\ldots,x_{i_{\ell}}\}
$$
$$
\mathcal{Y}=\{c_1,c_2,\ldots,c_{\ell+1},y_{i_1},y_{i_2},\ldots,y_{i_{\ell}}\}
$$
Since $w_H(x)=w_H(y)=i$, the bits $x_{i_1},\ldots,x_{i_{\ell}}$ contain exactly as many 1's as $y_{i_1},\ldots,y_{i_{\ell}}$, and therefore $\mathcal{X}=\mathcal{Y}$. If $2\ell\leq t$ this is a legal adversarial break that the decoder cannot disambiguate, so $2\ell\geq t+1$, i.e., $\ell\geq \lceil \frac{t+1}{2}\rceil$.
</details>
<details>
<summary>Proof of Claim</summary>
Let $j\in \{0,1,\ldots,n\}$ be such that $\mathcal{C}_j$ is the largest among $\mathcal{C}_0,\mathcal{C}_1,\ldots,\mathcal{C}_{n}$.
$\log |\mathcal{C}|=\log(\sum_{i=0}^n|\mathcal{C}_i|)\leq \log \left((n+1)|\mathcal{C}_j|\right)= \log (n+1)+\log |\mathcal{C}_j|$
By the Lemma and the ordinary sphere-packing bound, for $t'=\lfloor\frac{\lceil \frac{t+1}{2}\rceil-1}{2}\rfloor\approx \frac{t}{4}$,
$$
|\mathcal{C}_j|\leq \frac{2^n}{\sum_{i=0}^{t'}\binom{n}{i}}
$$
This implies that $n-\log |\mathcal{C}|\geq n-\log(n+1)-\log|\mathcal{C}_j|\geq \log\left(\sum_{i=0}^{t'}\binom{n}{i}\right)-\log(n+1)=\Omega(t\log (n/t))$.
</details>
Corollary: In the relevant regime $t=O(n^{1-\epsilon})$, the redundancy is $\Omega(t\log n)$.
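The sphere-packing step in the proof can be evaluated numerically for concrete parameters (the function name and parameter values are ours):

```python
import math

def t_break_redundancy_lower_bound(n: int, t: int) -> float:
    """Numeric form of the redundancy lower bound from the Claim:
    log2(sum_{i<=t'} C(n, i)) - log2(n + 1), with t' ~ t/4."""
    t_prime = (math.ceil((t + 1) / 2) - 1) // 2
    ball = sum(math.comb(n, i) for i in range(t_prime + 1))
    return math.log2(ball) - math.log2(n + 1)

# Toy parameters: a length-1000 code resisting t = 40 breaks.
bound = t_break_redundancy_lower_bound(1000, 40)
assert bound > 0
```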
### t-break codes: Main ideas.
Encoding:
- Need multiple markers across the codeword.
- Construct an adjacency matrix $A$ of markers to record their order.
- Append $RS_{2t}(A)$ to the codeword (as in the sliced channel).
Decoding (from $t + 1$ fragments):
- Locate all surviving markers, and locate $RS_{2t}(A)'$.
- Build an approximate adjacency matrix $A'$ from surviving markers $(d_H(A, A' )\leq 2t)$.
- Correct $(A',RS_{2t}(A)')\mapsto (A,RS_{2t}(A))$.
- Order the fragments correctly using $A$.
Tools:
- Random encoding (to have many markers).
- Mutually uncorrelated codes (so that markers will not overlap).
#### Tool: Mutually uncorrelated codes.
- Want: Markers not to overlap.
- Solution: Take markers from a Mutually Uncorrelated Codes (existing notion).
- A code $\mathcal{M}$ is called mutually uncorrelated if no proper suffix of any $m_i \in \mathcal{M}$ is a prefix of another $m_j \in \mathcal{M}$ (including $i=j$).
- Many constructions exist.
Theorem: For any integer $\ell$ there exists a mutually uncorrelated code $\mathcal{C}_{MU}$ of length $\ell$ and size $|\mathcal{C}_{MU}|\geq \frac{2^\ell}{32\ell}$.
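Mutual uncorrelatedness is easy to check by brute force for small codes (a sketch; the example codes are ours):

```python
def is_mutually_uncorrelated(code) -> bool:
    """Check that no proper nonempty suffix of any codeword equals a
    prefix of any codeword (including the codeword itself)."""
    for u in code:
        for v in code:
            for k in range(1, len(u)):  # proper suffixes of u
                if u[-k:] == v[:k]:
                    return False
    return True

assert is_mutually_uncorrelated({"11000", "11010"})   # markers cannot overlap
assert not is_mutually_uncorrelated({"1010"})         # suffix "10" = prefix "10"
```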
#### Tool: Random encoding.
- Want: Codewords with many markers from $\mathcal{C}_{MU}$, that are not too far apart.
- Problem: Hard to achieve explicitly.
- Workaround: Show that a uniformly random string has this property.
Random encoding:
- Choose the message at random.
- Suitable for embedding, say, printer ID.
- Not suitable for dynamic information.
Let $m>0$ be a parameter.
Fix a mutually uncorrelated code $\mathcal{C}_{MU}$ of length $\Theta(\log m)$.
Fix $m_1,\ldots, m_t$ from $\mathcal{C}_{MU}$ as "special" markers.
Claim: With probability $1-\frac{1}{\mathrm{poly}(m)}$, a uniformly random string $z\in \{0,1\}^m$ satisfies:
- Every $O(\log^2(m))$ bits contain a marker from $\mathcal{C}_{MU}$.
- Every two non-overlapping substrings of length $c\log m$ are distinct.
- $z$ does not contain any of the special markers $m_1,\ldots, m_t$.
Proof idea:
- Short substrings are abundant.
- Long substrings are rare.
#### Sketch of encoding for t-break codes.
Repeatedly sample $z\in \{0,1\}^m$ until it is "good".
Find all markers $m_{i_1},\ldots, m_{i_r}$ in it.
Build a $|\mathcal{C}_{MU}|\times |\mathcal{C}_{MU}|$ matrix $A$ which records order and distances:
- $A_{i,j}=0$ if $m_i,m_j$ are not adjacent.
- Otherwise, it is the distance between them (in bits).
Append $RS_{2t}(A)$ at the end, and use the special markers $m_1,\ldots, m_t$.
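The marker-adjacency bookkeeping can be sketched as follows (a simplification with our own names, assuming each marker occurs at most once in $z$ and recording start-to-start distances):

```python
def marker_adjacency(z: str, markers):
    """Build an adjacency matrix over the marker set: A[i][j] is the
    start-to-start distance between consecutive occurrences of m_i and
    m_j in z, and 0 means the two markers are not adjacent. A sketch
    assuming each marker occurs at most once in z."""
    found = sorted((z.index(m), i) for i, m in enumerate(markers) if m in z)
    n = len(markers)
    A = [[0] * n for _ in range(n)]
    for (p1, i), (p2, j) in zip(found, found[1:]):
        A[i][j] = p2 - p1
    return A

markers = ["11000", "11010"]        # e.g., from a mutually uncorrelated code
z = "0" + "11000" + "00" + "11010"  # toy "good" string
A = marker_adjacency(z, markers)
assert A[0][1] == 7 and A[1][0] == 0
```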
#### Sketch of decoding for t-break codes.
Construct a partial adjacency matrix $A'$ from fragments.