Files
NoteNextra-origin/content/CSE5313/CSE5313_L19.md
2025-11-06 13:59:31 -06:00

234 lines
9.2 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# CSE5313 Coding and information theory for data science (Lecture 19)
## Private information retrieval
### Problem setup
Premise:
- Database $X = \{x_1, \ldots, x_m\}$, each $x_i \in \mathbb{F}_q^k$ is a "file" (e.g., medical record).
- $X$ is coded $X \mapsto \{y_1, \ldots, y_n\}$, $y_j$ stored at server $j$.
- The user (physician) wants $x_i$.
- The user sends a query $q_j \sim Q_j$ to server $j$.
- Server $j$ responds with $a_j \sim A_j$.
Decodability:
- The user can retrieve the file: $H(X_i | A_1, \ldots, A_n) = 0$.
Privacy:
- $i$ is seen as $i \sim U = U_{m}$, reflecting server's lack of knowledge.
- $i$ must be kept private: $I(Q_j; U) = 0$ for all $j \in n$.
> In short, we want to retrieve $x_i$ from the servers without revealing $i$ to the servers.
### Private information retrieval from Replicated Databases
#### Simple case, one server
Say $n = 1, y_1 = X$.
- All data is stored in one server.
- Simple solution:
- $q_1 =$ "send everything".
- $a_1 = y_1 = X$.
Theorem: Information Theoretic PIR with $n = 1$ can only be achieved by downloading the entire database.
- Can we do better if $n > 1$?
#### Collusion parameter
Key question for $n > 1$: Can servers collude?
- I.e., does server $j$ see any $Q_\ell$, $\ell \neq j$?
- Key assumption:
- Privacy parameter $z$.
- At most $z$ servers can collude.
- $z = 1\implies$ No collusion.
- Requirement for $z = 1$: $I(Q_j; U) = 0$ for all $j \in n$.
- Requirement for a general $z$:
- $I(Q_\mathcal{T}; U) = 0$ for all $\mathcal{T} \in n$, $|\mathcal{T}| \leq z$, where $Q_\mathcal{T} = Q_\ell$ for all $\ell \in \mathcal{T}$.
- Motivation:
- Interception of communication links.
- Data breaches.
Other assumptions:
- Computational Private information retrieval (even all the servers are hacked, still cannot get the information -> solve np-hard problem):
- Non-zero MI
#### Private information retrieval from 2-replicated databases
First PIR protocol: Chor et al. FOCS 95.
- The data $X = \{x_1, \ldots, x_m\}$ is replicated on two servers.
- $z = 1$, i.e., no collusion.
- Protocol: User has $i \sim U_{m}$.
- User generates $r \sim U_{\mathbb{F}_q^m}$.
- $q_1 = r, q_2 = r + e_i$ ($e_i \in \mathbb{F}_q^m$ is the $i$-th unit vector, $q_2$ is equivalent to one-time pad encryption of $x_i$ with key $r$).
- $a_j = q_j X^\top = \sum_{\ell \in m} q_j, \ell x_\ell$
- Linear combination of the files according to the query vector $q_j$.
- Decoding?
- $a_2 - a_1 = q_2 - q_1 X^\top = e_i X^\top = x_i$.
- Download?
- $a_j =$ size of file $\implies$ downloading **twice** the size of the file.
- Privacy?
- Since $z = 1$, need to show $I(U; Q_i) = 0$.
- $I(U; Q_1) = I(e_U; F) = 0$ since $U$ and $F$ are independent.
- $I(U; Q_2) = I(e_U; F + e_U) = 0$ since this is one-time pad!
##### Parameters and notations in PIR
Parameters of the system:
- $n =$ # servers (as in storage).
- $m =$ # files.
- $k =$ size of each file (as in storage).
- $z =$ max. collusion (as in secret sharing).
- $t =$ # of answers required to obtain $x_i$ (as in secret sharing).
- $n - t$ servers are “stragglers”, i.e., might not respond.
Figures of merit:
- PIR-rate = $\#$ desired symbols / $\#$ downloaded symbols
- PIR-capacity = largest possible rate.
Notaional conventions:
-The dataset $X = \{x_j\}_{j \in m} = \{x_{j, \ell}\}_{(j, \ell) \in [m] \times [k]}$ is seen as a vector in $\mathbb{F}_q^{mk}$.
- Index $\mathbb{F}_q^{mk}$ using $[m] \times [k]$, i.e., $x_{j, \ell}$ is the $\ell$-th symbol of the $j$-th file.
#### Private information retrieval from 4-replicated databases
Consider $n = 4$ replicated servers, file size $k = 2$, collusion $z = 1$.
Protocol: User has $i \sim U_{m}$.
- Fix distinct nonzero $\alpha_1, \ldots, \alpha_4 \in \mathbb{F}_q$.
- Choose $r \sim U_{\mathbb{F}_q^{2m}}$.
- User sends $q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r$ to each server $j$.
- Server $j$ responds with
$$
a_j = q_j X^\top = e_{i, 1} X^\top + \alpha_j e_{i, 2} X^\top + \alpha_j^2 r X^\top
$$
- This is an evaluation at $\alpha_j$ of the polynomial $f_i(w) = x_{i, 1} + x_{i, 2} \cdot w + r \cdot w^2$.
- Where $r$ is some random combination of the entries of $X$.
- Decoding?
- Any 3 responses suffice to interpolate $f_i$ and obtain $x_i = x_{i, 1}, x_{i, 2}$.
- $\implies t = 3$, (one straggler is allowed)
- Privacy?
- Does $q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r$ look familiar?
- This is a share in [ramp scheme](CSE5313_L18.md#scheme-2-ramp-secret-sharing-scheme-mceliece-sarwate-scheme) with vector messages $m_1 = e_{i, 1}, m_2 = e_{i, 2}, m_i \in \mathbb{F}_q^{2m}$.
- This is equivalent to $2m$ "parallel" ramp scheme over $\mathbb{F}_q$.
- Each one reveals nothing to any $z = 1$ shareholders $\implies$ Private!
### Private information retrieval from general replicated databases
$n$ servers, $m$ files, file size $k$, $X \in \mathbb{F}_q^{mk}$.
Server decodes $x_i$ from any $t$ responses.
Any $\leq z$ servers might collude to infer $i$ ($z < t$).
Protocol: User has $i \sim U_{m}$.
- User chooses $r_1, \ldots, r_z \sim U_{\mathbb{F}_q^{mk}}$.
- User sends $q_j = \sum_{\ell=1}^k e_{i, \ell} \alpha_j^{\ell-1} + \sum_{\ell=1}^z r_\ell \alpha_j^{k+\ell-1}$ to each server $j$.
- Server $j$ responds with $a_j = q_j X^\top = f_i(\alpha_j)$.
- $f_i(w) = \sum_{\ell=1}^k e_{i, \ell} X^\top w^{\ell-1} + \sum_{\ell=1}^z r_\ell X^\top w^{k+\ell-1}$ (random combinations of $X$).
- Caveat: must have $t = k + z$.
- $\implies \deg f_i = k + z - 1 = t - 1$.
- Decoding?
- Interpolation from any $t$ evaluations of $f_i$.
- Privacy?
- Against any $z = t - k$ colluding servers, immediate from the proof of the ramp scheme.
PIR-rate?
- Each $a_j$ is a single field element.
- Download $t = k + z$ elements in $\mathbb{F}_q$ in order to obtain $x_i \in \mathbb{F}_q^k$.
- $\implies$ PIR-rate = $\frac{k}{k+z} = \frac{k}{t}$.
#### Theorem: PIR-capacity for general replicated databases
The PIR-capacity for $n$ replicated databases with $z$ colluding servers, $n - t$ unresponsive servers, and $m$ files is $C = \frac{1-\frac{z}{t}}{1-(\frac{z}{t})^m}$.
- When $m \to \infty$, $C \to 1 - \frac{z}{t} = \frac{t-z}{t} = \frac{k}{t}$.
- The above scheme achieves PIR-capacity as $m \to \infty$
### Private information retrieval from coded databases
#### Problem setup:
Example:
- $n = 3$ servers, $m$ files $x_j$, $x_j = x_{j, 1}, x_{j, 2}$, $k = 2$, and $q = 2$.
- Code each file with a parity code: $x_{j, 1}, x_{j, 2} \mapsto x_{j, 1}, x_{j, 2}, x_{j, 1} + x_{j, 2}$.
- Server $j \in 3$ stores all $j$-th symbols of all coded files.
Queries, answers, decoding, and privacy must be tailored for the code at hand.
With respect to a code $C$ and parameters $n, k, t, z$, such scheme is called coded-PIR.
- The content for server $j$ is denoted by $c_j = c_{j, 1}, \ldots, c_{j, m}$.
- $C$ is usually an MDS code.
#### Private information retrieval from parity-check codes
Example:
Say $z = 1$ (no collusion).
- Protocol: User has $i \sim U_{m}$.
- User chooses $r_1, r_2 \sim U_{\mathbb{F}_2^m}$.
- Two queries to each server:
- $q_{1, 1} = r_1 + e_i$, $q_{1, 2} = r_2$.
- $q_{2, 1} = r_1$, $q_{2, 2} = r_2 + e_i$.
- $q_{3, 1} = r_1$, $q_{3, 2} = r_2$.
- Server $j$ responds with $q_{j, 1} c_j^\top$ and $q_{j, 2} c_j^\top$.
- Decoding?
- $q_{1, 1} c_1^\top + q_{2, 1} c_2^\top + q_{3, 1} c_3^\top = r_1 c_1 + c_2 + c_3 + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}$.
- $q_{1, 1} c_1^\top + q_{2, 1} c_2^\top + q_{3, 1} c_3^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}$.
- $q_{1, 2} c_1^\top + q_{2, 2} c_2^\top + q_{3, 2} c_3^\top = r_2 c_1 + c_2 + c_3^\top + e_i c_2^\top = x_{i, 2}$.
- Privacy?
- Every server sees two uniformly random vectors in $\mathbb{F}_2^m$.
<details>
<summary>Proof from coding-theoretic interpretation</summary>
Let $G = g_1^\top, g_2^\top, g_3^\top$ be the generator matrix.
- For every file $x_j = x_{j, 1}, x_{j, 2}$ we encode $x_j G = (x_{j, 1} g_1^\top, x_{j, 2} g_2^\top, x_{j, 1} g_3^\top) = (c_{j, 1}, c_{j, 2}, c_{j, 3})$.
- Server $j$ stores $X g_j^\top = (x_1^\top, \ldots, x_m^\top)^\top g_j^\top = (c_{j, 1}, \ldots, c_{j, m})^\top$.
- By multiplying by $r_1$, the servers together store a codeword in $C$:
- $r_1 X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top = r_1 X G$.
- By replacing one of the $r_1$s by $r_1 + e_i$, we introduce an error in that entry:
- $\left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) = r_1 X G + (e_i X g_1^\top, 0,0)$.
- Downloading this “erroneous” word from the servers and multiply by $H = h_1^\top, h_2^\top, h_3^\top$ be the parity-check matrix.
$$
\begin{aligned}
\left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) H^\top &= \left(r_1 X G + (e_i X g_1^\top, 0,0)\right) H^\top \\
&= r_1 X G H^\top + (e_i X g_1^\top, 0,0) H^\top \\
&= 0 + x_{i, 1} g_1^\top \\
&= x_{i, 1}.
\end{aligned}
$$
> In homework we will show tha this work with any MDS code ($z=1$).
- Say we obtained $x_{i, 1} g_1^\top, \ldots, x_{i, k} g_k^\top$ (𝑑 1 at a time, how?).
- $x_{i, 1} g_1^\top, \ldots, x_{i, k} g_k^\top = x_{i, B}$, where $B$ is a $k \times k$ submatrix of $G$.
- $B$ is a $k \times k$ submatrix of $G$ $\implies$ invertible! $\implies$ Obtain $x_{i}$.
</details>
> [!TIP]
>
> error + known location $\implies$ erasure. $d = 2 \implies$ 1 erasure is correctable.