# CSE5313 Coding and information theory for data science (Lecture 26)

## Sliced and Broken Information with applications in DNA storage and 3D printing

### Basic info

Deoxyribonucleic Acid (DNA).

A double-helix shaped molecule.

Each helix is a string of

- Cytosine,
- Guanine,
- Adenine, and
- Thymine.

Contained inside every living cell.

- Inside the nucleus.

Used to encode proteins.

mRNA carries info to the ribosome as codons of length 3 over GUCA.

- Each codon produces an amino acid.
- $4^3 > 20$: redundancy in nature!
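
The codon count behind this redundancy is a one-liner to verify (plain Python, just for illustration):

```python
from itertools import product

# All codons: words of length 3 over the RNA alphabet {G, U, C, A}.
codons = ["".join(p) for p in product("GUCA", repeat=3)]
print(len(codons))  # 64 codons encode only ~20 amino acids
```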

1st Chargaff rule:

- The two strands are complements (A-T and G-C).
- $\#A = \#T$ and $\#G = \#C$ across the two strands.

2nd Chargaff rule:

- $\#A \approx \#T$ and $\#G \approx \#C$ within each strand.
- Can be explained via tandem duplications.
  - $GCAGCATT \implies GCAGCAGCATT$.
  - Occur naturally during cell mitosis.
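
A tandem duplication copies a substring and inserts the copy right after the original. A minimal sketch reproducing the example above (the function name is my own):

```python
def tandem_duplicate(s: str, start: int, length: int) -> str:
    """Insert a copy of s[start:start+length] immediately after it."""
    end = start + length
    return s[:end] + s[start:end] + s[end:]

# The example from the notes: duplicating the second "GCA" in GCAGCATT.
print(tandem_duplicate("GCAGCATT", 3, 3))  # GCAGCAGCATT
```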

### DNA storage

DNA synthesis:

- Artificial creation of DNA from G's, T's, A's, and C's.

Can be used to store information!

Advantages:

- Density.
  - 5.5 PB per mm³.
- Stability.
  - Half-life of 521 years (compare to ≈ 20 years on hard drives).
- Future proof.
  - DNA reading and writing will remain relevant "forever."

#### DNA storage prototypes

Some recent attempts:

- 2012, 659 KB.
  - Church, Gao, Kosuri, "Next-generation digital information storage in DNA," Science.
- 2018, 200 MB.
  - Organick et al., "Random access in large-scale DNA data storage," Nature Biotechnology.
- CatalogDNA (startup):
  - 2019, 16 GB.
  - 2021, 18 Mbps.

Companies:

- Microsoft, Illumina, Western Digital, many startups.

Challenges:

- Expensive and slow.
- Traditional storage media still sufficient and affordable.

#### DNA storage models

In vivo:

- Implant the synthetic DNA inside a living organism.
- Need evolution-correcting codes!
  - E.g., coding against tandem duplications.

In vitro:

- Place the synthetic DNA in test tubes.
- Challenge: can only synthesize short sequences ($\approx 1000$ bp).
- 1 test tube contains millions to billions of short sequences.

How to encode information?

How to achieve noise robustness?

### DNA coding in the in vitro environment

Traditional data communication:

$$
m\in\{0,1\}^k\mapsto c\in\{0,1\}^n
$$

DNA storage:

$$
m\in\{0,1\}^k\mapsto c\in \binom{\{0,1\}^L}{M}
$$

where $\binom{\{0,1\}^L}{M}$ is the collection of all $M$-subsets of $\{0,1\}^L$ ($0\leq M\leq 2^L$).

A codeword is a set of $M$ binary strings, each of length $L$.

"Sliced channel":

- The message $m$ is encoded to $c\in \{0,1\}^{ML}$, and then sliced into $M$ equal parts.
- Parts may be noisy (substitutions, deletions, etc.).
- Also useful in network packet transmission ($M$ packets of length $L$).
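
The slicing step can be sketched in a few lines (a toy model; the function name is mine):

```python
def slice_codeword(c: str, M: int) -> list:
    """Cut a length-M*L binary string into M equal parts of length L.
    The channel then delivers the parts as an unordered (noisy) set."""
    assert len(c) % M == 0, "codeword length must be divisible by M"
    L = len(c) // M
    return [c[i * L:(i + 1) * L] for i in range(M)]

print(slice_codeword("110100101011", 4))  # ['110', '100', '101', '011']
```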

#### Sliced channel: Figures of merit

How to quantify the **merit** of a given code $\mathcal{C}$?

- Want resilience to any $K$ substitutions across all $M$ parts.

Redundancy:

- Recall that in linear codes,
  - $\text{redundancy}=\text{length}-\text{dimension}=\log (\text{size of space})-\log (\text{size of code})$.
- In the sliced channel:
  - $\text{redundancy}=\log (\text{size of space})-\log (\text{size of code})=\log \binom{2^L}{M}-\log |\mathcal{C}|$.
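
For concreteness, the sliced-channel redundancy can be computed numerically (a sketch; `code_size` stands for whatever $|\mathcal{C}|$ a given construction achieves):

```python
import math

def sliced_redundancy(L: int, M: int, code_size: int) -> float:
    """redundancy = log2|space| - log2|code|, where the space is
    the set of all M-subsets of {0,1}^L, of size binom(2^L, M)."""
    log_space = math.log2(math.comb(2**L, M))
    return log_space - math.log2(code_size)

# With no coding at all (|C| = |space|), the redundancy is 0.
print(sliced_redundancy(8, 4, math.comb(2**8, 4)))  # 0.0
```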

Research questions:

- Bounds on redundancy?
- Code constructions?
- Is more redundancy needed in the sliced channel?

#### Sliced channel: Lower bound

Idea: **sphere packing**.

Given $c\in \binom{\{0,1\}^L}{M}$ and $K=1$, how many potential codewords must we exclude?

<details>
<summary>Example</summary>

- $L=3$, $M=2$, and let $c=\{001,011\}$. The ball of radius 1 around $c$ has size 5.

All the words within distance 1 of $c$ are:

Distance 0:

$$
\{001,011\}
$$

Distance 1:

$$
\{101,011\}\quad \{011\}\quad \{000,011\}\\
\{001,111\}\quad \{001\}\quad \{001,010\}
$$

7 options, and 2 of them (the singletons) are not valid codewords.

So the effective size of the ball is 5.

</details>
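
The example generalizes to a brute-force check (my own helper; sets smaller than $M$ are discarded, since a codeword must contain exactly $M$ distinct strings):

```python
def ball_radius_1(c: frozenset, M: int) -> set:
    """All M-subsets of {0,1}^L reachable from c by at most one bit flip."""
    ball = {c}
    for s in c:
        for i in range(len(s)):
            flipped = s[:i] + str(1 - int(s[i])) + s[i + 1:]
            new = (set(c) - {s}) | {flipped}
            if len(new) == M:  # a flip may merge two strings into one
                ball.add(frozenset(new))
    return ball

print(len(ball_radius_1(frozenset({"001", "011"}), 2)))  # 5
```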

Tool: (Fourier analysis of) Boolean functions.

Introducing the hypercube graph:

- $V=\{0,1\}^L$.
- $\{x,y\}\in E$ if and only if $d_H(x,y)=1$.
- What is the size of $E$? (Each vertex has degree $L$, so $|E|=L\cdot 2^{L-1}$.)

Consider $c\in \binom{\{0,1\}^L}{M}$ as a characteristic function $f_c:\{0,1\}^L\to \{0,1\}$.

Let $\partial f_c$ be its boundary:

- All hypercube edges $\{x,y\}$ such that $f_c(x)\neq f_c(y)$.

#### Lemma of boundary

The size of the 1-ball is $\geq |\partial f_c|+1$.

<details>
<summary>Proof</summary>

Every edge on the boundary represents a unique way of flipping one bit in one string in $c$, and the $+1$ accounts for $c$ itself.

</details>

We need to bound $|\partial f_c|$ from below.

Tool: total influence.

#### Definition of total influence

The **total influence** $I(f)$ of $f:\{0,1\}^L\to \{0,1\}$ is defined as

$$
I(f)=\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i}))
$$

where $x^{\oplus i}$ equals $x$ with its $i$-th bit flipped.

#### Theorem (Edge-isoperimetric inequality, no proof)

$I(f)\geq 2\alpha\log\frac{1}{\alpha}$,

where $\alpha=\min\{\text{fraction of 1's},\text{fraction of 0's}\}$.

Notice: let $\partial_i f$ be the set of $i$-dimensional edges in $\partial f$. Then

$$
\begin{aligned}
I(f)&=\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i}))\\
&=\sum_{i=1}^L\frac{|\partial_i f|}{2^{L-1}}\\
&=\frac{|\partial f|}{2^{L-1}}.
\end{aligned}
$$
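
Both the identity $I(f)=|\partial f|/2^{L-1}$ and the edge-isoperimetric bound can be checked by brute force on the running example (helper names are mine):

```python
import math
from itertools import product

def boundary_size(ones: set, L: int) -> int:
    """|∂f|: number of hypercube edges {x,y} with f(x) != f(y)."""
    count = 0
    for bits in product("01", repeat=L):
        x = "".join(bits)
        for i in range(L):
            y = x[:i] + str(1 - int(x[i])) + x[i + 1:]
            if x < y and (x in ones) != (y in ones):  # count each edge once
                count += 1
    return count

L, ones = 3, {"001", "011"}                       # f_c from the earlier example
influence = boundary_size(ones, L) / 2**(L - 1)   # I(f) = |∂f| / 2^(L-1)
alpha = len(ones) / 2**L
print(influence, 2 * alpha * math.log2(1 / alpha))  # 1.0 1.0 (tight here)
```

The bound is tight in this case because $c=\{001,011\}$ is a subcube of the hypercube.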

Corollary: Let $\epsilon>0$, $L\geq \frac{1}{\epsilon}$ and $M\leq 2^{(1-\epsilon)L}$, and let $c\in \binom{\{0,1\}^L}{M}$. Then,

$$
|\partial f_c|\geq 2\cdot 2^{L-1}\cdot\frac{M}{2^L}\log \frac{2^L}{M}= M\log \frac{2^L}{M}\geq ML\epsilon.
$$

The size of a $1$-ball in the sliced channel is therefore $\geq ML\epsilon$.

This implies that $|\mathcal{C}|\leq \frac{\binom{2^L}{M}}{\epsilon ML}$.

Corollary:

- The redundancy in the sliced channel with $K=1$ and the above parameters is at least $\log (ML)-O(1)$.
- A simple construction (not shown) gives $O(K\log (ML))$.

### Robust indexing

Idea: start each length-$L$ string with $\log M$ bits for indexing.

Problem 1: indices are subject to noise.

Problem 2: indexing bits do not carry information $\implies$ higher redundancy.

[Link to paper](https://ieeexplore.ieee.org/document/9174447)

Idea: robust indexing.

Instead of using $1,\ldots, M$ for indexing, use $x_1,\ldots, x_M$ such that

- $\{x_1,\ldots,x_M\}$ has minimum distance $2K+1$ (solves problem 1), with $|x_i|=O(\log M)$.
- $\{x_1,\ldots,x_M\}$ contains information (solves problem 2).

$\{x_1,\ldots,x_M\}$ depend on the message:

- Consider the message $m=(m_1,m_2)$.
- Find an encoding function $m_1\mapsto\{\{x_i\}_{i=1}^M\mid d_H(x_i,x_j)\geq 2K+1\}$ (coding over codes).
- Assume $e$ is such a function (not shown).
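
One naive way to obtain such a set $\{x_1,\ldots,x_M\}$ is a greedy scan (a toy stand-in only; the actual encoding function $e$ from the paper additionally makes the chosen set depend injectively on $m_1$):

```python
def greedy_indices(length: int, M: int, K: int) -> list:
    """Greedily collect M binary strings with pairwise Hamming
    distance >= 2K+1 by scanning {0,1}^length in order."""
    def dist(a, b):
        return sum(u != v for u, v in zip(a, b))

    chosen = []
    for v in range(2**length):
        x = format(v, f"0{length}b")
        if all(dist(x, y) >= 2 * K + 1 for y in chosen):
            chosen.append(x)
        if len(chosen) == M:
            return chosen
    raise ValueError("no such set for these parameters")

idx = greedy_indices(7, 4, 1)  # 4 indices, pairwise distance >= 3
```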

...

Additional reading:

Jin Sima, Netanel Raviv, Moshe Schwartz, and Jehoshua Bruck, "Error Correction for DNA Storage," arXiv:2310.01729 (2023).

- Magazine article.
- Introductory.
- Broad perspective.

### Information Embedding in 3D printing

Motivations:

Threats to public safety.

- Ghost guns, forging fingerprints, forging keys, fooling facial recognition.

Solution: information embedding.

- Printer ID, user ID, time/location stamp.

#### Existing Information Embedding Techniques

Many techniques exist.

Information embedding using variations in layers:

- Width, rotation, etc.
- Magnetic properties.
- Radiative materials.

Combating adversarial noise:

Most techniques are rather accurate.

- I.e., low bit error rate.

Challenge: adversarial damage after use.

- Scraping.
- Deformation.
- **Breaking**.

#### A t-break code

Let $m\in \{0,1\}^k\mapsto c\in \{0,1\}^n$.

The adversary breaks $c$ at most $t$ times ($t$ is a security parameter).

The decoder receives a **multi-set** of at most $t+1$ fragments. Assume the fragments are:

- Oriented.
- Unordered.
- Of any length.
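
The channel can be modeled in a few lines (sketch; the function name is mine):

```python
def break_codeword(c: str, cuts: list) -> list:
    """Break c at up to t positions; the decoder sees the fragments
    as an unordered multi-set (modeled here by sorting them)."""
    points = [0] + sorted(cuts) + [len(c)]
    fragments = [c[a:b] for a, b in zip(points, points[1:])]
    return sorted(fragments)  # order information is lost

print(break_codeword("110011", [2, 4]))  # ['00', '11', '11']
```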

#### Lower bound for t-break codes

Claim: a t-break code must have $\Omega(t\log (n/t))$ redundancy.

Lemma: Let $\mathcal{C}$ be a t-break code of length $n$, and for $i\in \{0,1,\ldots,n\}$ let $\mathcal{C}_i\subseteq\mathcal{C}$ be the subset of $\mathcal{C}$ containing all codewords of Hamming weight $i$. Then $d_H(\mathcal{C}_i)\geq \lceil \frac{t+1}{2}\rceil$.

<details>
<summary>Proof of Lemma</summary>

Let $x,y\in \mathcal{C}_i$ for some $i$ with $x\neq y$, let $\ell=d_H(x,y)$, and suppose toward contradiction that $\ell\leq\lceil \frac{t+1}{2}\rceil-1$, so that $2\ell\leq t$.

Write ($\circ$ denotes the concatenation operation):

$x=c_1\circ x_{i_1}\circ c_2\circ x_{i_2}\circ\ldots\circ c_{\ell}\circ x_{i_{\ell}}\circ c_{\ell+1}$

$y=c_1\circ y_{i_1}\circ c_2\circ y_{i_2}\circ\ldots\circ c_{\ell}\circ y_{i_{\ell}}\circ c_{\ell+1}$

where the $c_j$ are the (possibly empty) common substrings and $x_{i_j}\neq y_{i_j}$ are single bits for all $j\in [\ell]$.

Break $x$ and $y$ $2\ell$ times each to produce the multi-sets:

$$
\mathcal{X}=\{c_1,c_2,\ldots,c_{\ell+1},x_{i_1},x_{i_2},\ldots,x_{i_{\ell}}\}
$$

$$
\mathcal{Y}=\{c_1,c_2,\ldots,c_{\ell+1},y_{i_1},y_{i_2},\ldots,y_{i_{\ell}}\}
$$

Since $w_H(x)=w_H(y)=i$, the number of 1's among $x_{i_1},\ldots,x_{i_{\ell}}$ equals the number of 1's among $y_{i_1},\ldots,y_{i_{\ell}}$, and since these fragments are single bits, $\mathcal{X}=\mathcal{Y}$. The decoder cannot distinguish $x$ from $y$, a contradiction. Therefore $\ell\geq \lceil \frac{t+1}{2}\rceil$.

</details>

<details>
<summary>Proof of Claim</summary>

Let $j\in \{0,1,\ldots,n\}$ be such that $\mathcal{C}_j$ is the largest among $\mathcal{C}_0,\mathcal{C}_1,\ldots,\mathcal{C}_{n}$. Then

$\log |\mathcal{C}|=\log\left(\sum_{i=0}^n|\mathcal{C}_i|\right)\leq \log \left((n+1)|\mathcal{C}_j|\right)= \log (n+1)+\log |\mathcal{C}_j|$

By the Lemma and the ordinary sphere-packing bound, for $t'=\lfloor\frac{\lceil \frac{t+1}{2}\rceil-1}{2}\rfloor\approx \frac{t}{4}$,

$$
|\mathcal{C}_j|\leq \frac{2^n}{\sum_{i=0}^{t'}\binom{n}{i}}
$$

This implies that $n-\log |\mathcal{C}|\geq n-\log(n+1)-\log|\mathcal{C}_j|\geq \dots \geq \Omega(t\log (n/t))$.

</details>

Corollary: in the relevant regime $t=O(n^{1-\epsilon})$, we have $\Omega(t\log n)$ redundancy.

### t-break codes: Main ideas

Encoding:

- Need multiple markers across the codeword.
- Construct an adjacency matrix $A$ of markers to record their order.
- Append $RS_{2t}(A)$ to the codeword (as in the sliced channel).

Decoding (from at most $t + 1$ fragments):

- Locate all surviving markers, and locate $RS_{2t}(A)'$.
- Build an approximate adjacency matrix $A'$ from the surviving markers ($d_H(A, A')\leq 2t$).
- Correct $(A',RS_{2t}(A)')\mapsto (A,RS_{2t}(A))$.
- Order the fragments correctly using $A$.

Tools:

- Random encoding (to have many markers).
- Mutually uncorrelated codes (so that markers will not overlap).

#### Tool: Mutually uncorrelated codes

- Want: markers that do not overlap.
- Solution: take markers from a mutually uncorrelated code (an existing notion).
  - A code $\mathcal{M}$ is called mutually uncorrelated if no proper suffix of any $m_i \in \mathcal{M}$ is a prefix of another $m_j \in \mathcal{M}$ (including $i=j$).
- Many constructions exist.

Theorem: For any integer $\ell$ there exists a mutually uncorrelated code $\mathcal{C}_{MU}$ of length $\ell$ and size $|\mathcal{C}_{MU}|\geq \frac{2^\ell}{32\ell}$.
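
The definition translates directly into a brute-force check (my own helper; it tests every proper suffix against every prefix):

```python
def is_mutually_uncorrelated(codes: list) -> bool:
    """True iff no proper suffix of any word is a prefix of any word."""
    return not any(
        b.startswith(a[-k:])
        for a in codes
        for b in codes
        for k in range(1, len(a))
    )

print(is_mutually_uncorrelated(["11010", "11000"]))  # True
print(is_mutually_uncorrelated(["1010"]))            # False: suffix "10" is also a prefix
```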

#### Tool: Random encoding

- Want: codewords with many markers from $\mathcal{C}_{MU}$ that are not too far apart.
- Problem: hard to achieve explicitly.
- Workaround: show that a uniformly random string has this property.

Random encoding:

- Choose the message at random.
- Suitable for embedding, say, a printer ID.
- Not suitable for dynamic information.

Let $m>0$ be a parameter.

Fix a mutually uncorrelated code $\mathcal{C}_{MU}$ of length $\Theta(\log m)$.

Fix $m_1,\ldots, m_t$ from $\mathcal{C}_{MU}$ as "special" markers.

Claim: with probability $1-\frac{1}{\mathrm{poly}(m)}$, a uniformly random string $z\in \{0,1\}^m$ satisfies:

- Every $O(\log^2(m))$ consecutive bits contain a marker from $\mathcal{C}_{MU}$.
- Every two non-overlapping substrings of length $c\log m$ are distinct.
- $z$ does not contain any of the special markers $m_1,\ldots, m_t$.

Proof idea:

- Short substrings are abundant.
- Long substrings are rare.

#### Sketch of encoding for t-break codes

Repeatedly sample $z\in \{0,1\}^m$ until it is "good".

Find all markers $m_{i_1},\ldots, m_{i_r}$ in it.

Build a $|\mathcal{C}_{MU}|\times |\mathcal{C}_{MU}|$ matrix $A$ which records order and distances:

- $A_{i,j}=0$ if $m_i,m_j$ are not adjacent.
- Otherwise, it is the distance between them (in bits).
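
A sketch of building $A$ from marker occurrences (toy assumptions of mine: each marker occurs at most once, and "distance" is taken as the offset between the start positions of consecutive markers):

```python
def marker_adjacency(z: str, markers: list) -> list:
    """A[i][j] = distance from marker i to the next marker j in z,
    and 0 if they are not adjacent occurrences."""
    hits = sorted((z.find(m), i) for i, m in enumerate(markers) if m in z)
    n = len(markers)
    A = [[0] * n for _ in range(n)]
    for (p, i), (q, j) in zip(hits, hits[1:]):
        A[i][j] = q - p
    return A

A = marker_adjacency("00110010100", ["110", "101"])
print(A)  # [[0, 4], [0, 0]]
```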

Append $RS_{2t}(A)$ at the end, and use the special markers $m_1,\ldots, m_t$.

#### Sketch of decoding for t-break codes

Construct a partial adjacency matrix $A'$ from the fragments.