# CSE5313 Coding and information theory for data science (Lecture 19)
## Private information retrieval

### Problem setup

Premise:

- Database $X = \{x_1, \ldots, x_m\}$, each $x_i \in \mathbb{F}_q^k$ is a "file" (e.g., medical record).
- $X$ is coded $X \mapsto \{y_1, \ldots, y_n\}$, $y_j$ stored at server $j$.
- The user (physician) wants $x_i$.
- The user sends a query $q_j \sim Q_j$ to server $j$.
- Server $j$ responds with $a_j \sim A_j$.

Decodability:

- The user can retrieve the file: $H(X_i | A_1, \ldots, A_n) = 0$.

Privacy:

- $i$ is modeled as a random variable $U = U_{m}$, uniform over $[m]$, reflecting the servers' lack of knowledge.
- $i$ must be kept private: $I(Q_j; U) = 0$ for all $j \in [n]$.

> In short, we want to retrieve $x_i$ from the servers without revealing $i$ to the servers.

### Private information retrieval from Replicated Databases

#### Simple case, one server

Say $n = 1, y_1 = X$.

- All data is stored in one server.
- Simple solution:
  - $q_1 =$ "send everything".
  - $a_1 = y_1 = X$.

Theorem: Information-theoretic PIR with $n = 1$ can only be achieved by downloading the entire database.

- Can we do better if $n > 1$?

#### Collusion parameter

Key question for $n > 1$: Can servers collude?

- I.e., does server $j$ see any $Q_\ell$, $\ell \neq j$?
- Key assumption:
  - Privacy parameter $z$.
  - At most $z$ servers can collude.
  - $z = 1 \implies$ no collusion.
- Requirement for $z = 1$: $I(Q_j; U) = 0$ for all $j \in [n]$.
- Requirement for a general $z$:
  - $I(Q_\mathcal{T}; U) = 0$ for all $\mathcal{T} \subseteq [n]$ with $|\mathcal{T}| \leq z$, where $Q_\mathcal{T} = (Q_\ell)_{\ell \in \mathcal{T}}$.
- Motivation:
  - Interception of communication links.
  - Data breaches.

Other assumptions:

- Computational private information retrieval: even if all the servers are hacked, they still cannot learn $i$ without solving a computationally hard (e.g., NP-hard) problem.
  - The mutual information is nonzero; privacy rests on computational hardness instead.

#### Private information retrieval from 2-replicated databases

First PIR protocol: Chor et al., FOCS '95.

- The data $X = \{x_1, \ldots, x_m\}$ is replicated on two servers.
- $z = 1$, i.e., no collusion.
- Protocol: User has $i \sim U_{m}$.
  - User generates $r \sim U_{\mathbb{F}_q^m}$.
  - $q_1 = r, q_2 = r + e_i$ ($e_i \in \mathbb{F}_q^m$ is the $i$-th unit vector; $q_2$ is a one-time pad encryption of $e_i$ with key $r$).
  - $a_j = q_j X^\top = \sum_{\ell \in [m]} q_{j, \ell} x_\ell$
    - Linear combination of the files according to the query vector $q_j$.
- Decoding?
  - $a_2 - a_1 = (q_2 - q_1) X^\top = e_i X^\top = x_i$.
- Download?
  - Each $a_j$ has the size of one file $\implies$ downloading **twice** the size of the file.
- Privacy?
  - Since $z = 1$, need to show $I(U; Q_j) = 0$ for $j \in \{1, 2\}$.
  - $I(U; Q_1) = I(e_U; R) = 0$ since $U$ and $R$ are independent, where $R$ is the random variable from which $r$ is drawn.
  - $I(U; Q_2) = I(e_U; R + e_U) = 0$ since this is one-time pad!

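The two-server protocol above can be sketched in a few lines. This is a minimal illustration, not the construction from the paper verbatim: it assumes a small prime field ($q = 101$, arithmetic mod `P`), 0-based file indices, and stores $X$ as a list of $m$ files of length $k$. The names `make_queries`, `answer`, `decode`, and `retrieve` are made up for this example.

```python
import random

P = 101  # small prime, so integers mod P form the field F_P (illustrative choice)

def make_queries(i, m, rng):
    """User side: q1 = r and q2 = r + e_i, with r uniform in F_P^m."""
    r = [rng.randrange(P) for _ in range(m)]
    q2 = r.copy()
    q2[i] = (q2[i] + 1) % P
    return r, q2

def answer(q, X):
    """Server side: the linear combination sum_l q[l] * x_l of the files."""
    k = len(X[0])
    return [sum(q[l] * X[l][s] for l in range(len(X))) % P for s in range(k)]

def decode(a1, a2):
    """User side: a2 - a1 = e_i applied to the database, i.e. x_i."""
    return [(u - v) % P for u, v in zip(a2, a1)]

def retrieve(i, X, rng):
    """Run the whole protocol for file index i (0-based)."""
    q1, q2 = make_queries(i, len(X), rng)
    return decode(answer(q1, X), answer(q2, X))

# Usage: m = 5 files of k = 3 symbols each.
rng = random.Random(0)
X = [[rng.randrange(P) for _ in range(3)] for _ in range(5)]
assert all(retrieve(i, X, rng) == X[i] for i in range(5))
```

Each server sees a single uniformly distributed vector, and the user downloads two answers of length $k$ to obtain one file of length $k$.
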
##### Parameters and notations in PIR

Parameters of the system:

- $n =$ # servers (as in storage).
- $m =$ # files.
- $k =$ size of each file (as in storage).
- $z =$ max. collusion (as in secret sharing).
- $t =$ # of answers required to obtain $x_i$ (as in secret sharing).
  - Up to $n - t$ servers are “stragglers”, i.e., might not respond.

Figures of merit:

- PIR-rate = $\#$ desired symbols / $\#$ downloaded symbols.
  - For example, the 2-replicated protocol above downloads $2k$ symbols to obtain $k$, so its PIR-rate is $\frac{1}{2}$.
- PIR-capacity = largest possible rate.

Notational conventions:

- The dataset $X = \{x_j\}_{j \in [m]} = \{x_{j, \ell}\}_{(j, \ell) \in [m] \times [k]}$ is seen as a vector in $\mathbb{F}_q^{mk}$.
- Index $\mathbb{F}_q^{mk}$ using $[m] \times [k]$, i.e., $x_{j, \ell}$ is the $\ell$-th symbol of the $j$-th file.

#### Private information retrieval from 4-replicated databases

Consider $n = 4$ replicated servers, file size $k = 2$, collusion $z = 1$.

Protocol: User has $i \sim U_{m}$.

- Fix distinct nonzero $\alpha_1, \ldots, \alpha_4 \in \mathbb{F}_q$.
- Choose $r \sim U_{\mathbb{F}_q^{2m}}$.
- User sends $q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r$ to each server $j$.
- Server $j$ responds with

$$
a_j = q_j X^\top = e_{i, 1} X^\top + \alpha_j e_{i, 2} X^\top + \alpha_j^2 r X^\top
$$

- This is an evaluation at $\alpha_j$ of the polynomial $f_i(w) = x_{i, 1} + x_{i, 2} \cdot w + (r X^\top) \cdot w^2$.
  - Here $r X^\top$ is some random combination of the entries of $X$.
- Decoding?
  - Any 3 responses suffice to interpolate $f_i$ and obtain $x_i = (x_{i, 1}, x_{i, 2})$.
  - $\implies t = 3$ (one straggler is allowed).
- Privacy?
  - Does $q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r$ look familiar?
  - This is a share in [ramp scheme](CSE5313_L18.md#scheme-2-ramp-secret-sharing-scheme-mceliece-sarwate-scheme) with vector messages $m_1 = e_{i, 1}, m_2 = e_{i, 2}$, where $m_1, m_2 \in \mathbb{F}_q^{2m}$.
  - This is equivalent to $2m$ "parallel" ramp schemes over $\mathbb{F}_q$.
  - Each one reveals nothing to any $z = 1$ shareholders $\implies$ Private!

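To make the decoding step concrete, suppose servers 1, 2, 3 respond and server 4 is the straggler (an arbitrary choice for illustration). Writing $\rho = r X^\top$ for the random combination ($\rho$ is shorthand introduced here, not notation from the lecture), the three answers satisfy

$$
\begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}
=
\begin{pmatrix}
1 & \alpha_1 & \alpha_1^2 \\
1 & \alpha_2 & \alpha_2^2 \\
1 & \alpha_3 & \alpha_3^2
\end{pmatrix}
\begin{pmatrix} x_{i, 1} \\ x_{i, 2} \\ \rho \end{pmatrix}
$$

The matrix is a Vandermonde matrix with distinct $\alpha_j$, hence invertible, so the user solves for $(x_{i, 1}, x_{i, 2}, \rho)$ and discards $\rho$. The same argument applies to any other 3 of the 4 servers.
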
### Private information retrieval from general replicated databases

$n$ servers, $m$ files, file size $k$, $X \in \mathbb{F}_q^{mk}$.

The user decodes $x_i$ from any $t$ responses.

Any set of at most $z$ servers might collude to infer $i$ ($z < t$).

Protocol: User has $i \sim U_{m}$.

- Fix distinct nonzero $\alpha_1, \ldots, \alpha_n \in \mathbb{F}_q$ (as in the 4-server protocol).
- User chooses $r_1, \ldots, r_z \sim U_{\mathbb{F}_q^{mk}}$.
- User sends $q_j = \sum_{\ell=1}^k e_{i, \ell} \alpha_j^{\ell-1} + \sum_{\ell=1}^z r_\ell \alpha_j^{k+\ell-1}$ to each server $j$.
- Server $j$ responds with $a_j = q_j X^\top = f_i(\alpha_j)$.
  - $f_i(w) = \sum_{\ell=1}^k e_{i, \ell} X^\top w^{\ell-1} + \sum_{\ell=1}^z r_\ell X^\top w^{k+\ell-1}$ (the $r_\ell X^\top$ are random combinations of $X$).
- Caveat: must have $t = k + z$.
  - $\implies \deg f_i \leq k + z - 1 = t - 1$.
- Decoding?
  - Interpolation from any $t$ evaluations of $f_i$.
- Privacy?
  - Against any $z = t - k$ colluding servers, immediate from the proof of the ramp scheme.

PIR-rate?

- Each $a_j$ is a single field element.
- Download $t = k + z$ elements in $\mathbb{F}_q$ in order to obtain $x_i \in \mathbb{F}_q^k$.
- $\implies$ PIR-rate = $\frac{k}{k+z} = \frac{k}{t}$.

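The general protocol can be sketched as follows. This is an illustrative sketch under several assumptions not fixed by the lecture: the field is the prime field with $q = 101$, indices are 0-based, the database is flattened row by row so that coordinate $(i, \ell)$ sits at position `i * k + l`, and interpolation is done by solving the Vandermonde system with Gauss-Jordan elimination. All function names are hypothetical.

```python
import random

P = 101  # prime modulus, so integers mod P form the field F_P (illustrative choice)

def make_queries(i, m, k, z, alphas, rng):
    """One query per server: e_{i,l} weighted by alpha^(l-1), plus z random masks."""
    masks = [[rng.randrange(P) for _ in range(m * k)] for _ in range(z)]
    queries = []
    for a in alphas:
        q = [0] * (m * k)
        for l in range(k):                       # coordinate (i, l) of F_P^(mk)
            q[i * k + l] = pow(a, l, P)
        for l, r in enumerate(masks):            # mask r_l weighted by alpha^(k+l)
            c = pow(a, k + l, P)
            q = [(u + c * v) % P for u, v in zip(q, r)]
        queries.append(q)
    return queries

def answer(q, x_flat):
    """Server side: one field element, the inner product of q with the database."""
    return sum(u * v for u, v in zip(q, x_flat)) % P

def solve_mod(A, b):
    """Solve A c = b over F_P by Gauss-Jordan elimination (A square and invertible)."""
    n = len(A)
    M = [row[:] + [v] for row, v in zip(A, b)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] % P)
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], -1, P)            # modular inverse (Python 3.8+)
        M[col] = [v * inv % P for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(v - f * w) % P for v, w in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def decode(points, k):
    """Interpolate f_i from t pairs (alpha_j, a_j); its first k coefficients are x_i."""
    t = len(points)
    A = [[pow(a, e, P) for e in range(t)] for a, _ in points]
    return solve_mod(A, [y for _, y in points])[:k]

# Usage: n = 4 servers, k = 2, z = 1, so t = 3 and one straggler is tolerated.
rng = random.Random(1)
m, k, z = 5, 2, 1
X = [[rng.randrange(P) for _ in range(k)] for _ in range(m)]
x_flat = [s for row in X for s in row]
alphas = [1, 2, 3, 4]
qs = make_queries(3, m, k, z, alphas, rng)
answers = [(a, answer(q, x_flat)) for a, q in zip(alphas, qs)]
assert decode(answers[1:], k) == X[3]            # the first server did not respond
```

With $k = 2$ and $z = 1$ this reduces to the 4-replicated example: the user downloads $t = 3$ field elements to obtain $k = 2$.
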
#### Theorem: PIR-capacity for general replicated databases

The PIR-capacity for $n$ replicated databases with $z$ colluding servers, $n - t$ unresponsive servers, and $m$ files is $C = \frac{1-\frac{z}{t}}{1-(\frac{z}{t})^m}$.

- When $m \to \infty$, $C \to 1 - \frac{z}{t} = \frac{t-z}{t}$, which equals $\frac{k}{t}$ for the scheme above since $t = k + z$.
- The above scheme achieves PIR-capacity as $m \to \infty$.

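A quick numeric check of the capacity formula, as a sketch: the helper names `pir_capacity` and `scheme_rate` are made up here, and exact rationals are used to avoid rounding. It assumes $z < t$.

```python
from fractions import Fraction

def pir_capacity(z, t, m):
    """C = (1 - z/t) / (1 - (z/t)^m), assuming z < t and m >= 1."""
    rho = Fraction(z, t)
    return (1 - rho) / (1 - rho ** m)

def scheme_rate(k, z):
    """Rate k / (k + z) of the polynomial scheme, where t = k + z."""
    return Fraction(k, k + z)

# Usage: k = 2, z = 1, t = 3. The gap to the scheme's rate shrinks as m grows.
for m in (1, 2, 3, 10):
    gap = pir_capacity(1, 3, m) - scheme_rate(2, 1)
    print(m, pir_capacity(1, 3, m), float(gap))
```

For finite $m$ the capacity is strictly larger than $\frac{k}{t}$, so the scheme is optimal only in the limit of many files.
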
### Private information retrieval from coded databases

#### Problem setup:

Example:

- $n = 3$ servers, $m$ files $x_j$, $x_j = (x_{j, 1}, x_{j, 2})$, $k = 2$, and $q = 2$.
- Code each file with a parity code: $(x_{j, 1}, x_{j, 2}) \mapsto (x_{j, 1}, x_{j, 2}, x_{j, 1} + x_{j, 2})$.
- Server $j \in [3]$ stores all $j$-th symbols of all coded files.

Queries, answers, decoding, and privacy must be tailored for the code at hand.

With respect to a code $C$ and parameters $n, k, t, z$, such a scheme is called coded-PIR.

- The content for server $j$ is denoted by $c_j = (c_{j, 1}, \ldots, c_{j, m})$.
- $C$ is usually an MDS code.

#### Private information retrieval from parity-check codes

Example:

Say $z = 1$ (no collusion).

- Protocol: User has $i \sim U_{m}$.
  - User chooses $r_1, r_2 \sim U_{\mathbb{F}_2^m}$.
  - Two queries to each server:
    - $q_{1, 1} = r_1 + e_i$, $q_{1, 2} = r_2$.
    - $q_{2, 1} = r_1$, $q_{2, 2} = r_2 + e_i$.
    - $q_{3, 1} = r_1$, $q_{3, 2} = r_2$.
  - Server $j$ responds with $q_{j, 1} c_j^\top$ and $q_{j, 2} c_j^\top$.
- Decoding?
  - $q_{1, 1} c_1^\top + q_{2, 1} c_2^\top + q_{3, 1} c_3^\top = r_1 (c_1 + c_2 + c_3)^\top + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}$.
  - $q_{1, 2} c_1^\top + q_{2, 2} c_2^\top + q_{3, 2} c_3^\top = r_2 (c_1 + c_2 + c_3)^\top + e_i c_2^\top = r_2 \cdot 0^\top + x_{i, 2} = x_{i, 2}$.
  - Here $c_1 + c_2 + c_3 = 0$ because $c_3 = c_1 + c_2$ over $\mathbb{F}_2$.
- Privacy?
  - Every server sees two uniformly random vectors in $\mathbb{F}_2^m$, whose joint distribution does not depend on $i$.

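The parity-code example can be checked directly over $\mathbb{F}_2$. The sketch below assumes 0-based file indices and represents bits as Python integers with XOR as addition; the function names are illustrative only.

```python
import random

def encode(X):
    """Parity code per file: servers 1 and 2 hold the two symbols, server 3 their sum."""
    c1 = [x[0] for x in X]
    c2 = [x[1] for x in X]
    c3 = [x[0] ^ x[1] for x in X]
    return [c1, c2, c3]

def dot2(q, c):
    """Inner product over F_2."""
    return sum(a & b for a, b in zip(q, c)) % 2

def make_queries(i, m, rng):
    """Two queries per server; e_i is added to r1 at server 1 and to r2 at server 2."""
    r1 = [rng.randrange(2) for _ in range(m)]
    r2 = [rng.randrange(2) for _ in range(m)]

    def plus_e_i(r):
        s = r.copy()
        s[i] ^= 1
        return s

    return [(plus_e_i(r1), r2), (r1, plus_e_i(r2)), (r1, r2)]

def retrieve(i, servers, rng):
    """Sum the first answers to get x_{i,1} and the second answers to get x_{i,2}."""
    qs = make_queries(i, len(servers[0]), rng)
    answers = [(dot2(qa, c), dot2(qb, c)) for (qa, qb), c in zip(qs, servers)]
    x1 = answers[0][0] ^ answers[1][0] ^ answers[2][0]
    x2 = answers[0][1] ^ answers[1][1] ^ answers[2][1]
    return (x1, x2)

# Usage: m = 6 files of k = 2 bits each.
rng = random.Random(0)
X = [(rng.randrange(2), rng.randrange(2)) for _ in range(6)]
servers = encode(X)
assert all(retrieve(i, servers, rng) == X[i] for i in range(6))
```

The random parts cancel because the three stored vectors sum to zero, which is exactly the parity-check condition used in the decoding step.
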
<details>
<summary>Proof from coding-theoretic interpretation</summary>

Let $G = (g_1^\top, g_2^\top, g_3^\top)$ be the generator matrix, with columns $g_1^\top, g_2^\top, g_3^\top$.

- For every file $x_j = (x_{j, 1}, x_{j, 2})$ we encode $x_j G = (x_j g_1^\top, x_j g_2^\top, x_j g_3^\top) = (c_{1, j}, c_{2, j}, c_{3, j})$.
- Server $j$ stores $X g_j^\top = (x_1^\top, \ldots, x_m^\top)^\top g_j^\top = (c_{j, 1}, \ldots, c_{j, m})^\top$.

- By multiplying by $r_1$, the servers' answers together form a codeword in $C$:
  - $(r_1 X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top) = r_1 X G$.
- By replacing one of the $r_1$’s by $r_1 + e_i$, we introduce an error in that entry:
  - $\left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) = r_1 X G + (e_i X g_1^\top, 0,0)$.
- Download this “erroneous” word from the servers and multiply it by $H^\top$, where $H = (h_1^\top, h_2^\top, h_3^\top)$ is the parity-check matrix:

$$
\begin{aligned}
\left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) H^\top &= \left(r_1 X G + (e_i X g_1^\top, 0,0)\right) H^\top \\
&= r_1 X G H^\top + (e_i X g_1^\top, 0,0) H^\top \\
&= 0 + x_i g_1^\top \\
&= x_{i, 1}.
\end{aligned}
$$

The last two steps use $G H^\top = 0$ and, for the parity code, $H = (1, 1, 1)$ and $g_1 = (1, 0)$.

> In homework we will show that this works with any MDS code ($z=1$).

- Say we obtained $x_i g_1^\top, \ldots, x_i g_k^\top$ ($d - 1$ at a time; how?).
- $(x_i g_1^\top, \ldots, x_i g_k^\top) = x_i B$, where $B$ is a $k \times k$ submatrix of $G$.
- $G$ generates an MDS code $\implies$ $B$ is invertible $\implies$ obtain $x_i$.

</details>

> [!TIP]
>
> Error + known location $\implies$ erasure. $d = 2 \implies$ 1 erasure is correctable.