updates?

2025-12-02 12:49:03 -06:00
parent 55decba076
commit 7d8196635e
3 changed files with 414 additions and 0 deletions
--- a/content/CSE5313/CSE5313_L26.md
+++ b/content/CSE5313/CSE5313_L26.md
@@ -0,0 +1,411 @@
+# CSE5313 Coding and information theory for data science (Lecture 26)
+
+## Sliced and Broken Information with applications in DNA storage and 3D printing
+
+### Basic info
+
+Deoxyribo-Nucleic Acid.
+
+A double-helix shaped molecule.
+
+Each helix is a string of
+
+- Cytosine,
+- Guanine,
+- Adenine, and
+- Thymine.
+
+Contained inside every living cell.
+
+- Inside the nucleus.
+
+Used to encode proteins.
+
+mRNA carries info to Ribosome as codons of length 3 over GUCA.
+
+- Each codon produces an amino acids.
+- $4^3> 20$, redundancy in nature!
+
+1st Chargaff rule:
+
+- The two strands are complements (A-T and G-C).
+- $#A = #T$ and $#G = #C$ in both strands.
+
+2nd Chargaff rule:
+
+- $#𝐴 ≈ #𝑇$ and $#G \approx #C$ in each strands.
+- Can be explained via tandem duplications.
+  - $GCAGCATT \implies GCAGCAGCATT$.
+  - Occur naturally during cell mitosis.
+
+### DNA storage
+
+DNA synthesis:
+
+- Artificial creation of DNA from G’s, T’s, A’s, and C’s.
+
+Can be used to store information!
+
+Advantages:
+
+- Density.
+  - 5.5 PB per mm3.
+- Stability.
+  - Half-life 521 years (compare to ≈ 20𝑦 on hard drives).
+- Future proof.
+  - DNA reading and writing will remain relevant "forever."
+
+#### DNA storage prototypes
+
+Some recent attempts:
+
+- 2011, 659kb.
+  - Church, Gao, Kosuri, "Next-generation digital information storage in DNA", Science.
+- 2018, 200MB.
+  - Organick et al., "Random access in large-scale DNA data storage," Nature biotechnology.
+- CatalogDNA (startup):
+  - 2019, 16GB.
+  - 2021, 18 Mbps.
+
+Companies:
+
+- Microsoft, Illumina, Western Digital, many startups.
+
+Challenges:
+
+- Expensive, Slow.
+- Traditional storage media still sufficient and affordable.
+
+#### DNA Storage models
+
+In vivo:
+
+- Implant the synthetic DNA inside a living organism.
+- Need evolution-correcting codes!
+- E.g., coding against tandem-duplications.
+
+In vitro:
+
+- Place the synthetic DNA in test tubes.
+- Challenge: Can only synthesize short sequences ($\approx 1000 bp$).
+- 1 test tube contains millions to billions of short sequences.
+
+How to encode information?
+
+How to achieve noise robustness?
+
+### DNA coding in vitro environment
+
+Traditional data communication:
+
+$$
+m\in\{0,1\}^k\mapsto c\in\{0,1\}^n
+$$
+
+DNA storage:
+
+$$
+m\in\{0,1\}^k\mapsto c\in \binom{\{0,1\}^L}{M}
+$$
+
+where $\binom{\{0,1\}^L}{M}$ is the collection of all $M$-subsets of $\{0,1\}^L$. ($0\leq M\leq 2^L$)
+
+A codeword is a set of $M$ binary strings, each of length $L$.
+
+"Sliced channel":
+
+- The message 𝑚 is encoded to $c\in \{0,1\}^{ML}, and then sliced to 𝑀 equal parts.
+- Parts may be noisy (substitutions, deletions, etc.).
+- Also useful in network packet transmission ($M$ packets of length $L$).
+
+#### Sliced channel: Figures of merit
+
+How to quantify the **merit** of a given code $\mathcal{C}$?
+
+- Want resilience to any $K$ substitutions in all $M$ parts.
+
+Redundance:
+
+- Recall in linear codes,
+  - $redundancy=lengt-dimension=\log (size\ of\ space)-\log (size\ of\ code)$.
+- In sliced channel:
+  - $redundancy=\log (size\ of\ space)-\log (size\ of\ code)=\log \binom{2^L}{M}-\log |\mathcal{C}|$.
+
+Research questions:
+
+- Bounds on redundancy?
+- Code construction?
+- Is more redundancy needed in sliced channel?
+
+#### Sliced channel: Lower bound
+
+Idea: **Sphere packing**
+
+Given $c\in \binom{\{0,1\}^L}{M}$ and $K=1$, how may codewords must we exclude?
+
+<details>
+<summary>Example</summary>
+
+- $L=3,M=2$, and let $c=\{001,011\}$. Ball of radius 1 is sized 5.
+
+all the codeword with distance 1 from $c$ are:
+
+Distance 0:
+
+$$
+\{001,011\}
+$$
+
+Distance 1:
+$$
+\{101,011\}\quad \{011\}\quad \{000,011\}\\
+\{001,111\}\quad \{001\}\quad \{001,010\}
+$$
+
+7 options and 2 of them are not codewords.
+
+So the effective size of the ball is 5.
+
+</details>
+
+Tool: (Fourier analysis of) Boolean functions
+
+Introducing hypercube graph:
+
+- $V=\{0,1\}^L$.
+- $\{x,y\}\in E$ if and only if $d_H(x,y)=1$.
+- What is size of $E$?
+
+Consider $c\in \binom{\{0,1\}^L}{M}$ as a characteristic function: $f_c(x)=\{0,1\}^L\to \{0,1\}$.
+
+Let $\partial f_c$ be its boundary.
+
+- All hypercube edges $\{x,y\}$ such that $f_c(x)\neq f_c(y)$.
+
+#### Lemma of boundary
+
+Size of 1-ball $\geq |\partial f_c|+1$.
+
+<details>
+<summary>Proof</summary>
+
+Every edge on the boundary represents a unique way of flipping one bit in one string in $c$.
+
+</details>
+
+Need to bound $|\partial f_c|$ from below
+
+Tool: Total influence.
+
+#### Definition of total influence.
+
+The **total influence** $I(f)$ of $f:\{0,1\}^L\to \{0,1\}$ is defined as:
+
+$$
+\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i}))
+$$
+
+where $x^{\oplus i}$ equals to $x$ with it's $i$th bit flipped.
+
+#### Theorem: Edge-isoperimetric inequality, no proof)
+
+$I(f)\geq 2\alpha\log\frac{1}{\alpha}$.
+
+where $\alpha=\min\{\text{fraction of 1's},\text{fraction of 0's}\}$.
+
+Notice: Let $\partial_i f$ be the $i$-dimensional edges in $\partial f$.
+
+$$
+\begin{aligned}
+I(f)&=\sum_{i=1}^L\operatorname{Pr}_{x\in \{0,1\}^L}(f(x)\neq f(x^{\oplus i}))\\
+&=\sum_{i=1}^L\frac{|\partial_i f|}{2^{L-1}}\\
+&=\frac{||\partial f||}{2^{L-1}}\\
+\end{aligned}
+$$
+
+Corollary: Let $\epsilon>0$, $L\geq \frac{1}{\epsilon}$ and $M\leq 2^{(1-\epsilon)L}$, and let $c\in \binom{\{0,1\}^L}{M}$. Then,
+
+$$
+|\parital f_c|\geq 2\times 2^{L-1}\frac{M}{2^L}\log \frac{2^L}{M}\geq M\log \frac{2^L}{M}\geq ML\epsilon
+$$
+
+Size of $1$ ball is sliced channel $\geq ML\epsilon$.
+
+this implies that $|\mathcal{C}|leq \frac{\binom{2^L}{M}}{\epsilon ML}$
+
+Corollary:
+
+- Redundancy in sliced channel with $K=1$ with the above parameters $\log ML-O(1)$.
+- Simple generation (not shown) gives $O(K\log ML)$.
+
+### Robust indexing
+
+Idea: Start with $L$ bit string with $\log M$ bits for indexing.
+
+Problem 1: Indices subject to noise.
+
+Problem 2: Indexing bits do not carry information $\implies$ higher redundancy.
+
+[link to paper](https://ieeexplore.ieee.org/document/9174447)
+
+Idea: Robust indexing
+
+Instead of using $1,\ldots, M$ for indexing, use $x_1,\ldots, x_M$ such that
+
+- $\{x_1,\ldots,x_M\}$ are of minimum distance $2K+1$ (solves problem 1) $|x_i|=O(\log M)$.
+- $\{x_1,\ldots,x_M\}$ contain information (solves problem 2).
+
+$\{x_1,\ldots,x_M\}$ depend on the message.
+
+- Consider the message $m=(m_1,m_2)$.
+- Find an encoding function $m_1\mapsto\{\{x_i\}_{i=1}^M|d_H(x_i,x_j)\geq 2K+1\}$ (coding over codes)
+- Assume $e$ is such function (not shown).
+
+...
+
+Additional reading:
+
+Jin Sima, Netanel Raviv, Moshe Schwartz, and Jehoshua Bruck. "Error Correction for DNA Storage." arXiv:2310.01729 (2023).
+
+- Magazine article.
+- Introductory.
+- Broad perspective.
+
+### Information Embedding in 3D printing  
+
+Motivations:
+
+Threats to public safety.
+
+- Ghost guns, forging fingerprints, forging keys, fooling facial recognition.
+
+Solution – Information Embedding.
+
+- Printer ID, user ID, time/location stamp
+
+#### Existing Information Embedding Techniques
+
+Many techniques exist.
+
+Information embedding using variations in layers.
+
+- Width, rotation, etc.
+- Magnetic properties.
+- Radiative materials.
+
+Combating Adversarial Noise
+
+Most techniques are rather accurate.
+
+- I.e., low bit error rate.
+
+Challenge: Adversarial damage after use.
+
+- Scraping.
+- Deformation.
+- **Breaking**.
+
+#### A t-break code
+
+Let $m\in \{0,1\}^k\mapsto c\in \{0,1\}^n$
+
+Adversary breaks $c$ at most $t$ times (security parameter).
+
+Decoder receives a **multi**-st of at most $t+1$ fragments. Assume the following:
+
+- Oriented
+- Unordered
+- Any length
+
+#### Lower bound for t-break code
+
+Claim: A t-break code must have $\Omega(t\log (n/t))$ redundancy.
+
+Lemma: Let $\mathcal{C}$ be a t-break code of length $n$, and for $i\in [n]$ and $\mathcal{C}_i\subseteq\mathcal{C}$ be the subset of $\mathcal{C}$ containing all codewords of Hamming weight $i$. Then $d_H(\mathcal{C}_i)\geq \lceil \frac{t+1}{2}\rceil$.
+
+<details>
+<summary>Proof of Lemma</summary>
+
+Let $x,y\in \mathcal{C}_i$ for some $i$, and let $\ell=d_H(x,y)$.
+
+Write ($\circ$ denotes the concatenation operation):
+
+$x=c_1\circ x_{i_1}\circ c_2\circ x_{i_2}\circ\ldots\circ c_{t+1}\circ x_{i_{t+1}}$
+
+$y=c_1\circ y_{i_1}\circ c_2\circ y_{i_2}\circ\ldots\circ c_{t+1}\circ y_{i_{t+1}}$
+
+Where $x_{i_j}\neq y_{i_j}$ for all $j\in [\ell]$.
+
+Break $x$ and $y$ $2\ell$ times to produce the multi-sets:
+
+$$
+\mathcal{X}=\{\{c_1,c_2,\ldots,c_{t+1},x_{i_1},x_{i_2},\ldots,x_{i_{t+1}}\},\{c_1,c_2,\ldots,c_{t+1},x_{i_1},x_{i_2},\ldots,x_{i_{t+1}}\},\ldots,\{c_1,c_2,\ldots,c_{t+1},x_{i_1},x_{i_2},\ldots,x_{i_{t+1}}\}\}
+$$
+
+$$
+\mathcal{Y}=\{\{c_1,c_2,\ldots,c_{t+1},y_{i_1},y_{i_2},\ldots,y_{i_{t+1}}\},\{c_1,c_2,\ldots,c_{t+1},y_{i_1},y_{i_2},\ldots,y_{i_{t+1}}\},\ldots,\{c_1,c_2,\ldots,c_{t+1},y_{i_1},y_{i_2},\ldots,y_{i_{t+1}}\}\}
+$$
+
+$w_H(x)=w_H(y)=i$, and therefore $\mathcal{X}=\mathcal{Y}$ and $\ell\geq \lceil \frac{t+1}{2}\rceil$.
+
+</details>
+
+<details>
+<summary>Proof of Claim</summary>
+
+Let $j\in \{0,1,\ldots,n\}$ be such that $\mathcal{C}_j$ is the largest among $C_0,C_1,\ldots,C_{n}$.
+
+$\log |\mathcal{C}|=\log(\sum_{i=0}^n|\mathcal{C}_i|)\leq \log \left((n+1)|C_j|\right)\leq \log (n+1)+\log |\mathcal{C}_j|$
+
+By Lemma and ordinary sphere packing bound, for $t'=\lfloor\frac{\lceil \frac{t+1}{2}\rceil-1}{2}\rfloor\approx \frac{t}{4}$,
+
+$$
+|C_j|\leq \frac{2^n}{\sum_{t=0}^{t'}\binom{n}{i}}
+$$
+
+This implies that $n-\log |\mathcal{C}|\geq n-\log(n+1)-\log|\mathcal{C}_j|\geq \dots \geq \Omega(t\log (n/t))$
+
+</details>
+
+Corollary: In the relevant regime $t=O(n^{1-\epsilon})$, we have $\Omega(t\log n)$ redundancy.
+
+TRACK LOST HERE
+
+𝑡-break codes: Main ideas.
+• Encoding:
+– Need multiple markers across the codeword.
+– Construct an adjacency matrix 𝐴 of markers to record their order.
+– Append 𝑅𝑆2𝑡 𝐴 to the codeword (as in the sliced channel).
+• Decoding (from 𝑡 + 1 fragments):
+– Locate all surviving markers, and locate 𝑅𝑆2𝑡 𝐴 ′.
+– Build an approximate adjacency matrix 𝐴
+′
+from surviving markers (𝑑𝐻 𝐴, 𝐴′ ≤ 2𝑡).
+– Correct 𝐴
+′
+, 𝑅𝑆2𝑡 𝐴 ′ ↦ 𝐴 , 𝑅𝑆2𝑡 𝐴 .
+– Order the fragments correctly using 𝐴.
+• Tools:
+– Random encoding (to have many markers).
+– Mutually uncorrelated codes (so that markers will not overlap).
+
+Tool: Mutually uncorrelated codes.
+• Want: Markers not to overlap.
+• Solution: Take markers from a Mutually Uncorrelated Codes (existing notion).
+– A code ℳ is called mutually uncorrelated if no suffix of any 𝑚𝑖 ∈ ℳ if a prefix of another
+𝑚𝑗 ∈ ℳ (including 𝑖 = 𝑗).
+– Many constructions exist.
+• Theorem: For any integer ℓ there exists a mutually uncorrelated code 𝐶𝑀𝑈 of length
+ℓ and size 𝐶𝑀𝑈 ≥
+2
+ℓ
+32ℓ
+
+Tool: Random encoding.
+• Want: Codewords with many markers from 𝐶𝑀𝑈, that are not too far apart.
+• Problem: Hard to achieve explicitly.
+• Workaround: Show that a uniformly random string has this property.
+• Random encoding:
+– Choose the message at random.
+– Suitable for embedding, say, printer ID.
+– Not suitable for dynamic information.
--- a/content/CSE5313/CSE5313_L27.md
+++ b/content/CSE5313/CSE5313_L27.md
@@ -0,0 +1 @@
+# CSE5313 Coding and information theory for data science (Lecture 27)
--- a/content/CSE5313/_meta.js
+++ b/content/CSE5313/_meta.js
@@ -29,4 +29,6 @@ export default {
    CSE5313_L23: "CSE5313 Coding and information theory for data science (Lecture 23)",
    CSE5313_L24: "CSE5313 Coding and information theory for data science (Lecture 24)",
    CSE5313_L25: "CSE5313 Coding and information theory for data science (Lecture 25)",
+    CSE5313_L26: "CSE5313 Coding and information theory for data science (Lecture 26)",
+    CSE5313_L27: "CSE5313 Coding and information theory for data science (Lecture 27)",
 }
				`@@ -0,0 +1 @@`
				`# CSE5313 Coding and information theory for data science (Lecture 27)`