NoteNextra-origin/content/CSE5313/CSE5313_L20.md
2025-11-06 13:59:31 -06:00

# CSE5313 Coding and information theory for data science (Lecture 20)
## Review for Private Information Retrieval
### PIR from replicated databases
For 2 replicated databases, we have the following protocol:
- User has $i \sim U_{[m]}$.
- User chooses $r_1, r_2 \sim U_{\mathbb{F}_2^m}$.
- Two queries to each server:
  - $q_{1, 1} = r_1 + e_i$, $q_{1, 2} = r_2$.
  - $q_{2, 1} = r_1$, $q_{2, 2} = r_2 + e_i$.
- Server $j$ responds with $q_{j, 1} c_j^\top$ and $q_{j, 2} c_j^\top$.
- Decoding?
  - $q_{1, 1} c_1^\top + q_{2, 1} c_2^\top = r_1 (c_1 + c_2)^\top + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}$.
  - $q_{1, 2} c_1^\top + q_{2, 2} c_2^\top = r_2 (c_1 + c_2)^\top + e_i c_2^\top = r_2 \cdot 0^\top + x_{i, 2} = x_{i, 2}$.

PIR-rate is $\frac{k}{2k} = \frac{1}{2}$.
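
The cancellation used in the decoding step can be sketched in a few lines of Python. This is a minimal illustration, assuming one bit per file and that both servers hold the same vector `db` (so $c_1 = c_2$ and $c_1 + c_2 = 0$ over $\mathbb{F}_2$); the names `dot_f2` and `retrieve_bit` are hypothetical helpers, not part of the lecture.

```python
import random

def dot_f2(u, v):
    """Inner product of two bit vectors over F_2."""
    return sum(a & b for a, b in zip(u, v)) % 2

def retrieve_bit(db, i, rng=random):
    """Fetch db[i] from two servers that both store db (assumption: one bit per file)."""
    m = len(db)
    r = [rng.randrange(2) for _ in range(m)]  # uniform mask r in F_2^m
    q1 = r.copy()
    q1[i] ^= 1                                # query to server 1: r + e_i
    q2 = r                                    # query to server 2: r
    a1 = dot_f2(q1, db)                       # server 1's response
    a2 = dot_f2(q2, db)                       # server 2's response
    return a1 ^ a2                            # r (c + c)^T + e_i c^T = db[i]

db = [1, 0, 1, 1, 0]
print([retrieve_bit(db, i) for i in range(len(db))])  # [1, 0, 1, 1, 0]
```

Each server sees a uniformly random vector on its own, which is why a single server learns nothing about $i$.
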
### PIR from coded parity-check databases
For 3 coded parity-check databases, we have the following protocol:
- User has $i \sim U_{[m]}$.
- User chooses $r_1, r_2, r_3 \sim U_{\mathbb{F}_2^m}$.
- Three queries to each server:
  - $q_{1, 1} = r_1 + e_i$, $q_{1, 2} = r_2$, $q_{1, 3} = r_3$.
  - $q_{2, 1} = r_1$, $q_{2, 2} = r_2 + e_i$, $q_{2, 3} = r_3$.
  - $q_{3, 1} = r_1$, $q_{3, 2} = r_2$, $q_{3, 3} = r_3 + e_i$.
- Server $j$ responds with $q_{j, 1} c_j^\top, q_{j, 2} c_j^\top, q_{j, 3} c_j^\top$.
- Decoding?
  - $q_{1, 1} c_1^\top + q_{2, 1} c_2^\top + q_{3, 1} c_3^\top = r_1 (c_1 + c_2 + c_3)^\top + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}$.
  - $q_{1, 2} c_1^\top + q_{2, 2} c_2^\top + q_{3, 2} c_3^\top = r_2 (c_1 + c_2 + c_3)^\top + e_i c_2^\top = x_{i, 2}$.
  - $q_{1, 3} c_1^\top + q_{2, 3} c_2^\top + q_{3, 3} c_3^\top = r_3 (c_1 + c_2 + c_3)^\top + e_i c_3^\top = x_{i, 3}$.

PIR-rate is $\frac{k}{3k} = \frac{1}{3}$.
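
A small Python sketch of the parity-check variant follows. It assumes a storage layout that is not spelled out above: each file has two bits, server 1 stores the first bits, server 2 the second bits, and server 3 their XOR, so that $c_1 + c_2 + c_3 = 0$. Only the first two query rounds are simulated, since they already return both bits of the file; `retrieve_file` is a hypothetical name.

```python
import random

def dot_f2(u, v):
    """Inner product of two bit vectors over F_2."""
    return sum(a & b for a, b in zip(u, v)) % 2

def retrieve_file(files, i, rng=random):
    """files[j] = (x_{j,1}, x_{j,2}); returns (x_{i,1}, x_{i,2})."""
    m = len(files)
    c1 = [f[0] for f in files]
    c2 = [f[1] for f in files]
    c3 = [a ^ b for a, b in zip(c1, c2)]          # parity server (assumed layout)
    servers = [c1, c2, c3]
    recovered = []
    for t in range(2):                            # round t targets server t's symbol
        r = [rng.randrange(2) for _ in range(m)]  # fresh mask r_{t+1}
        total = 0
        for j, c in enumerate(servers):
            q = r.copy()
            if j == t:
                q[i] ^= 1                         # only server t gets r + e_i
            total ^= dot_f2(q, c)                 # add server j's response
        recovered.append(total)                   # r (c1+c2+c3)^T + e_i c_t^T
    return tuple(recovered)

files = [(1, 0), (0, 0), (1, 1), (0, 1)]
print([retrieve_file(files, i) for i in range(len(files))])
```
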
## Beyond z=1
### Star-product scheme
Given $x=(x_j)_{j\in [n]}$ and $y=(y_j)_{j\in [n]}$ over $\mathbb{F}_q$, the star-product is defined as:
$$
x \star y = (x_1 y_1, \ldots, x_n y_n)
$$
Given two linear codes, $C,D\subseteq \mathbb{F}_q^n$, the star-product code is defined as:
$$
C \star D = \operatorname{span}_{\mathbb{F}_q} \{x \star y \mid x \in C, y \in D\}
$$
Singleton bound for star-product:
$$
d_{C \star D} \leq n-\dim C-\dim D+2
$$
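
The bound can be checked by brute force on a tiny instance. The sketch below is only an illustration under assumed parameters: the prime field $\mathbb{F}_7$, $n = 6$ evaluation points, and two Reed-Solomon codes of dimensions $k = 2$ and $z = 2$ on the same points. It enumerates every codeword of $C \star D$, so it is practical only for very small codes.

```python
from itertools import product

P = 7                          # prime field size (illustrative assumption)
POINTS = [1, 2, 3, 4, 5, 6]    # n = 6 distinct evaluation points

def rs_generator(dim):
    """Rows are the evaluations of 1, x, ..., x^(dim-1) at POINTS."""
    return [[pow(a, e, P) for a in POINTS] for e in range(dim)]

def star(x, y):
    """Coordinate-wise (star) product over F_P."""
    return tuple(a * b % P for a, b in zip(x, y))

def star_product_code(gen_c, gen_d):
    """All codewords of C * D: linear combinations of basis star products."""
    spanning = [star(c, d) for c in gen_c for d in gen_d]
    words = set()
    for coeffs in product(range(P), repeat=len(spanning)):
        words.add(tuple(sum(a * w[t] for a, w in zip(coeffs, spanning)) % P
                        for t in range(len(POINTS))))
    return words

def min_distance(words):
    """Minimum Hamming weight over the nonzero codewords."""
    return min(sum(v != 0 for v in w) for w in words if any(w))

k, z = 2, 2
code = star_product_code(rs_generator(k), rs_generator(z))
print(len(code), min_distance(code))  # 343 4
```

Here $|C \star D| = 7^3$, so the star-product code has dimension $k + z - 1 = 3$, and its minimum distance $4$ equals $n - k - z + 2$.
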
### PIR from a database coded with any MDS code and z>1
To generalize the previous scheme to $z > 1$, we need to encode multiple $r$'s together.
- As in the ramp scheme.
> Recall from the ramp scheme, we use $r_1, \ldots, r_z \sim U_{\mathbb{F}_q^k}$ as our key vector to protect against collusion of the servers.
In the star-product scheme:
- Files are coded with an MDS code $C$.
- The multiple $r$'s are coded with an MDS code $D$.
- The scheme is based on the minimum distance of $C \star D$.
To code the data:
- Let $C \subseteq \mathbb{F}_q^n$ be an MDS code of dimension $k$.
- For all $j \in [m]$, encode file $x_j = (x_{j, 1}, \ldots, x_{j, k})$ using $G_C$:
$$
\begin{pmatrix}
x_{1, 1} & x_{1, 2} & \cdots & x_{1, k}\\
x_{2, 1} & x_{2, 2} & \cdots & x_{2, k}\\
\vdots & \vdots & \ddots & \vdots\\
x_{m, 1} & x_{m, 2} & \cdots & x_{m, k}
\end{pmatrix} \cdot G_C = \begin{pmatrix}
c_{1, 1} & c_{1, 2} & \cdots & c_{1, n}\\
c_{2, 1} & c_{2, 2} & \cdots & c_{2, n}\\
\vdots & \vdots & \ddots & \vdots\\
c_{m, 1} & c_{m, 2} & \cdots & c_{m, n}
\end{pmatrix}
$$
- For all $j \in [n]$, store $c_j = (c_{1, j}, c_{2, j}, \ldots, c_{m, j})$ (a column of the above matrix) in server $j$.
Let $r_1, \ldots, r_z \sim U_{\mathbb{F}_q^m}$.
To code the queries:
- Let $D \subseteq \mathbb{F}_q^n$ be an MDS code of dimension $z$.
- Encode the $r_j$'s using $G_D=[g_1^\top, \ldots, g_n^\top]$.
$$
(r_1^\top, \ldots, r_z^\top) \cdot G_D = \begin{pmatrix}
r_{1, 1} & r_{2, 1} & \cdots & r_{z, 1}\\
r_{1, 2} & r_{2, 2} & \cdots & r_{z, 2}\\
\vdots & \vdots & \ddots & \vdots\\
r_{1, m} & r_{2, m} & \cdots & r_{z, m}
\end{pmatrix}
\cdot G_D=\left((r_1^\top,\ldots, r_z^\top)g_1^\top,\ldots, (r_1^\top,\ldots, r_z^\top)g_n^\top \right)
$$
To introduce the "errors in known locations" to the encoded $r_j$'s:
- Let $W \in \{0, 1\}^{m \times n}$ with some $d_{C \star D} - 1$ entries in its $i$-th row equal to 1, and all other entries equal to 0.
  - These are the entries we will retrieve.
For every server $j \in [n]$, send $q_j = (r_1^\top, \ldots, r_z^\top) g_j^\top + w_j$, where $w_j$ is the $j$-th column of $W$.
- This is similar to the ramp scheme, where $w_j$ is the "message".
- Privacy against collusion of $z$ servers.
Response from server: $a_j = q_j c_j^\top$.
Decoding? Let $Q \in \mathbb{F}_q^{m \times n}$ be a matrix whose columns are the $q_j$'s.
$$
Q = \begin{pmatrix}
r_1^\top & \cdots & r_z^\top
\end{pmatrix} \cdot G_D + W
$$
- The user has
$$
\begin{aligned}
(q_1 c_1^\top, \ldots, q_n c_n^\top) &= \left(\sum_{j \in [m]} q_{1, j} c_{j, 1}, \ldots, \sum_{j \in [m]} q_{n, j} c_{j, n}\right) \\
&=\sum_{j \in [m]} (q_{1,j}c_{j, 1}, \ldots, q_{n,j}c_{j, n}) \\
&=\sum_{j \in [m]} q^j \star c^j
\end{aligned}
$$
where $q^j$ is a row of $Q$ and $c^j$ is a codeword in $C$ (an $[n, k]_q$ MDS code).
We have:
- $Q=(r_1^\top, \ldots, r_z^\top) \cdot G_D + W$
- $W\in \{0, 1\}^{m \times n}$ with some $d_{C \star D} - 1$ entries in its $i$-th row equal to 1.
- $(q_1 c_1^\top, \ldots, q_n c_n^\top)=\sum_{j \in [m]} q^j \star c^j$
  - Each $q^j$ is a row of $Q$.
  - For $j \neq i$, $q^j = d^j$ is a codeword in $D$.
  - $q^i = d^i + w^i$, where $d^i \in D$ and $w^i$ is the $i$-th row of $W$.
- Therefore:
$$
\begin{aligned}
\sum_{j \in [m]} q^j \star c^j &= \sum_{j \neq i} (d^j \star c^j) + ((d^i + w^i) \star c^i) \\
&= \sum_{j \in [m]} (d^j \star c^j) + w^i \star c^i \\
&= (\text{codeword in } C \star D )+( \text{noise of Hamming weight } \leq d_{C \star D} - 1)
\end{aligned}
$$
Multiply by $H_{C \star D}$ and get $d_{C \star D} - 1$ elements of $c^i$.
- Recall that $c^i = x_i \cdot G_C$
- Repeat $\frac{k}{d_{C \star D} - 1}$ times to obtain $k$ elements of $c^i$.
- Suffices to obtain $x_i$, since $C$ is an $[n, k]_q$ MDS code.
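
One round of the scheme can be simulated end to end on a toy instance. The sketch below makes several assumptions for illustration: the prime field $\mathbb{F}_{11}$, $n = 5$ servers, and Reed-Solomon codes for both $C$ ($k = 2$) and $D$ ($z = 2$). Instead of multiplying by $H_{C \star D}$, it interpolates the $C \star D$ codeword from the coordinates outside the support of $w^i$, which is an equivalent way to exploit errors in known locations. It returns coded symbols of $c^i$; recovering $x_i$ from $k$ of them is ordinary MDS decoding and is omitted. All function names are hypothetical.

```python
import random

P = 11                       # prime field size (illustrative assumption)
POINTS = [1, 2, 3, 4, 5]     # n = 5 servers, one evaluation point each
K, Z = 2, 2                  # dim C = k, dim D = z
N = len(POINTS)
T = N - K - Z + 1            # d_{C*D} - 1 symbols retrieved per round

def rs_encode(msg):
    """Evaluate the polynomial with coefficient list msg at every point."""
    return [sum(c * pow(a, e, P) for e, c in enumerate(msg)) % P for a in POINTS]

def interpolate(xs, ys, x):
    """Value at x of the unique polynomial of degree < len(xs) through (xs, ys)."""
    total = 0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        num = den = 1
        for l, xl in enumerate(xs):
            if l != j:
                num = num * (x - xl) % P
                den = den * (xj - xl) % P
        total = (total + yj * num * pow(den, -1, P)) % P
    return total

def retrieve_coded_symbols(files, i, rng=random):
    """Return the first T symbols of c^i = x_i G_C without revealing i to Z servers."""
    m = len(files)
    coded = [rs_encode(f) for f in files]                    # row j is c^j in C
    Q = [rs_encode([rng.randrange(P) for _ in range(Z)])     # row j is d^j in D
         for _ in range(m)]
    wanted = list(range(T))                                  # support of w^i
    for pos in wanted:
        Q[i][pos] = (Q[i][pos] + 1) % P                      # q^i = d^i + w^i
    answers = [sum(Q[row][j] * coded[row][j] for row in range(m)) % P
               for j in range(N)]                            # server j's response
    clean = [j for j in range(N) if j not in wanted]         # K + Z - 1 coordinates
    xs = [POINTS[j] for j in clean]
    ys = [answers[j] for j in clean]
    # Subtract the interpolated C*D codeword to isolate w^i * c^i.
    return [(answers[pos] - interpolate(xs, ys, POINTS[pos])) % P for pos in wanted]

files = [[3, 5], [7, 1], [0, 9]]
print(retrieve_coded_symbols(files, 1), rs_encode(files[1])[:T])
```

With these parameters $d_{C \star D} - 1 = 2 = k$, so a single round already yields enough coded symbols to decode the file.
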
PIR-rate:
- $\frac{k}{\# \text{ downloaded elements}} = \frac{k}{\frac{k}{d_{C \star D} - 1} \cdot n} = \frac{d_{C \star D} - 1}{n}$
- Singleton bound for star-product: $d_{C \star D} \leq n - \dim C - \dim D + 2$.
- Achieved with equality if $C$ and $D$ are Reed-Solomon codes.
- PIR-rate = $\frac{n - \dim C - \dim D + 1}{n} = \frac{n - k - z + 1}{n}$.
- Intuition:
  - "paying" $k$ for "reconstruction from any $k$".
  - "paying" $z$ for "protection against colluding sets of size $z$".
- Capacity unknown! (as of 2022).
  - Known for special cases, e.g., $k = 1, z = 1$, certain types of schemes, etc.
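
As a worked instance of the rate formula, take $n = 5$, $k = 2$, $z = 2$ (parameters chosen here only for illustration) and assume $C$ and $D$ are Reed-Solomon codes, so that the Singleton bound for the star-product holds with equality:

$$
\begin{aligned}
d_{C \star D} &= n - k - z + 2 = 5 - 2 - 2 + 2 = 3, \\
\text{number of rounds} &= \frac{k}{d_{C \star D} - 1} = \frac{2}{2} = 1, \\
\text{PIR-rate} &= \frac{d_{C \star D} - 1}{n} = \frac{n - k - z + 1}{n} = \frac{2}{5}.
\end{aligned}
$$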
### PIR over graphs
Graph-based replication:
- Every file is replicated twice on two separate servers.
- Every two servers have at most one file in common.
- "file" = "granularity" of data, i.e., the smallest information unit shared by any two servers.
A server that stores $(x_{i, j})_{j=1}^d$ receives $(q_{i, j})_{j=1}^d$, and replies with $\sum_{j=1}^d q_{i, j} \cdot x_{i, j}$.
The idea:
- Consider a 2-server replicated PIR and "split" the queries between the servers.
- Sum the responses, unwanted files "cancel out", while $x_i$ does not.
Problem: Collusion.
Solution: Add per server randomness.
Good for any graph, and any $q \geq 3$ (for simplicity assume $2 | q$).
The protocol:
- Choose random $\gamma \in \mathbb{F}_q^n$, $\nu \in \mathbb{F}_q^m$, and $h \in \mathbb{F}_q \setminus \{0, 1\}$.
- Queries:
  - If node $j$ is incident with edge $\ell$, send $q_{j, \ell} = \gamma_j \cdot \nu_\ell$ to node $j$.
    - I.e., if server $j$ stores file $\ell$.
  - Except one node $j_0$ that stores $x_i$, which gets $q_{j_0, i} = h \cdot \gamma_{j_0} \cdot \nu_i$.
- Server $j$ responds with $a_j = \sum_{\ell=1}^d q_{j, \ell} \cdot x_{j, \ell}$.
  - Where $x_{j, 1}, \ldots, x_{j, d}$ are the files adjacent to it.
<details>
<summary>Example</summary>
- Consider the following graph.
- $n = 5$, $m = 7$, and $i = 3$.
- $q_3 = \gamma_3 \cdot (\nu_2, \nu_3, \nu_6)$ and $a_3 = x_2 \cdot \gamma_3 \nu_2 + x_3 \cdot \gamma_3 \nu_3 + x_6 \cdot \gamma_3 \nu_6$.
- $q_2 = \gamma_2 \cdot (\nu_1, h \nu_3, \nu_4)$ and $a_2 = x_1 \cdot \gamma_2 \nu_1 + x_3 \cdot h \gamma_2 \nu_3 + x_4 \cdot \gamma_2 \nu_4$.
![Example of PIR over graphs](https://notenextra.trance-0.com/CSE5313/PIR_over_graphs.png)
</details>
Correctness:
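- $\sum_{j=1}^5 \gamma_j^{-1} a_j =( h + 1 )\nu_3 x_3$
  - Every other file $x_\ell$ appears in exactly two responses with the same coefficient $\nu_\ell$, so it cancels in characteristic 2.
- $h \neq 1, \nu_3 \neq 0 \implies$ find $x_3$.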
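
The protocol and its correctness check can be sketched over $\mathbb{F}_4$, the smallest field with $2 \mid q$ and $q \geq 3$. Everything concrete here is an assumption for illustration: the 4-server, 5-file graph `EDGES` is hypothetical (it is not the graph in the figure), each file is a single field symbol, and $\gamma$, $\nu$ are drawn with nonzero entries so that $\gamma_j^{-1}$ and $\nu_i^{-1}$ exist.

```python
import random

# F_4 as integers 0..3: addition is XOR, multiplication via discrete logs.
LOG = {1: 0, 2: 1, 3: 2}
EXP = [1, 2, 3]

def mul(a, b):
    """Multiplication in F_4."""
    return 0 if 0 in (a, b) else EXP[(LOG[a] + LOG[b]) % 3]

def inv(a):
    """Multiplicative inverse of a nonzero element of F_4."""
    return EXP[(-LOG[a]) % 3]

# Hypothetical graph: file -> the two servers storing it (at most one shared file per pair).
EDGES = {0: (0, 1), 1: (1, 2), 2: (2, 3), 3: (0, 3), 4: (0, 2)}
N_SERVERS = 4

def retrieve(files, i, rng=random):
    """Return files[i] using the per-server randomness gamma, nu and the scalar h."""
    gamma = [rng.randrange(1, 4) for _ in range(N_SERVERS)]   # nonzero gamma_j
    nu = {l: rng.randrange(1, 4) for l in EDGES}              # nonzero nu_l
    h = rng.choice([2, 3])                                    # h not in {0, 1}
    j0 = EDGES[i][0]                                          # the node that gets h
    answers = [0] * N_SERVERS
    for l, servers in EDGES.items():
        for j in servers:
            q = mul(gamma[j], nu[l])                          # q_{j,l} = gamma_j nu_l
            if l == i and j == j0:
                q = mul(h, q)                                 # q_{j0,i} = h gamma_j0 nu_i
            answers[j] ^= mul(q, files[l])                    # accumulate a_j
    total = 0
    for j in range(N_SERVERS):
        total ^= mul(inv(gamma[j]), answers[j])               # sum of gamma_j^{-1} a_j
    return mul(total, inv(mul(h ^ 1, nu[i])))                 # divide by (h + 1) nu_i

files = [1, 3, 0, 2, 3]
print([retrieve(files, i) for i in EDGES])  # [1, 3, 0, 2, 3]
```
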
Parameters:
- Storage overhead 2 (for any graph).
- Download $n \cdot k$.
- PIR rate 1/n.
Collusion resistance:
1-privacy: Each node sees an entirely random vector.
2-privacy:
- If the two nodes share no edge, the argument is the same as for 1-privacy.
- If they share an edge, e.g.:
  - $\gamma_3 \nu_6$ and $\gamma_4 \nu_6$ are independent.
  - $\gamma_3 \nu_3$ and $h \cdot \gamma_2 \nu_3$ are independent.
S-privacy:
- Let $S \subseteq [n]$ (e.g., $S = \{2,3,5\}$), and consider the query matrix of their mutual files:
$$
Q_S = \operatorname{diag}(\gamma_3, \gamma_2, \gamma_5) \begin{pmatrix} 1 & 0\\ h & 1 \\ 0 & 1\end{pmatrix} \operatorname{diag}(\nu_3, \nu_4)
$$
- It can be shown that $\Pr(Q_S)=\frac{1}{(q-1)^4}$, regardless of $i \implies$ perfect privacy.
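
For a very small instance, the claim that the colluders' view does not depend on $i$ can be verified exhaustively. The sketch below assumes a hypothetical triangle graph (3 servers, 3 files), the field $\mathbb{F}_4$, nonzero entries for $\gamma$ and $\nu$, and a colluding pair $S$. It enumerates all choices of $(\gamma, \nu, h)$ and tabulates the queries that $S$ sees. It checks only this toy case and is not a proof for general graphs or larger $S$.

```python
from collections import Counter
from itertools import product

# F_4 as integers 0..3: addition is XOR, multiplication via discrete logs.
LOG = {1: 0, 2: 1, 3: 2}
EXP = [1, 2, 3]

def mul(a, b):
    """Multiplication in F_4."""
    return 0 if 0 in (a, b) else EXP[(LOG[a] + LOG[b]) % 3]

# Hypothetical triangle graph: file -> the two servers storing it.
EDGES = {0: (0, 1), 1: (1, 2), 2: (0, 2)}
N_SERVERS = 3

def view(S, i, gamma, nu, h):
    """All queries received by the servers in S when file i is requested."""
    seen = []
    for j in S:
        for l, servers in sorted(EDGES.items()):
            if j in servers:
                q = mul(gamma[j], nu[l])
                if l == i and j == servers[0]:   # servers[0] plays the role of j_0
                    q = mul(h, q)
                seen.append(q)
    return tuple(seen)

def view_distribution(S, i):
    """Count each possible view over all nonzero gamma, nu and h in {2, 3}."""
    counts = Counter()
    for gamma in product((1, 2, 3), repeat=N_SERVERS):
        for nu in product((1, 2, 3), repeat=len(EDGES)):
            for h in (2, 3):
                counts[view(S, i, gamma, nu, h)] += 1
    return counts

dists = [view_distribution((0, 1), i) for i in range(3)]
print(dists[0] == dists[1] == dists[2], len(dists[0]))  # True 81
```

The 81 equally likely views match $(q-1)^4$ with $q = 4$, since the two colluding servers see four query entries in total.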