# CSE5313 Coding and information theory for data science (Lecture 19)

## Private information retrieval

### Problem setup

Premise:

- Database $X = \{x_1, \ldots, x_m\}$, each $x_i \in \mathbb{F}_q^k$ is a "file" (e.g., medical record).
- $X$ is coded $X \mapsto \{y_1, \ldots, y_n\}$, $y_j$ stored at server $j$.
- The user (physician) wants $x_i$.
- The user sends a query $q_j \sim Q_j$ to server $j$.
- Server $j$ responds with $a_j \sim A_j$.

Decodability:

- The user can retrieve the file: $H(X_i | A_1, \ldots, A_n) = 0$.

Privacy:

- $i$ is seen as $i \sim U = U_{m}$, reflecting the servers' lack of knowledge.
- $i$ must be kept private: $I(Q_j; U) = 0$ for all $j \in [n]$.

> In short, we want to retrieve $x_i$ from the servers without revealing $i$ to the servers.

### Private information retrieval from Replicated Databases

#### Simple case, one server

Say $n = 1, y_1 = X$.

- All data is stored in one server.
- Simple solution:
  - $q_1 =$ "send everything".
  - $a_1 = y_1 = X$.

Theorem: Information Theoretic PIR with $n = 1$ can only be achieved by downloading the entire database.

- Can we do better if $n > 1$?

#### Collusion parameter

Key question for $n > 1$: Can servers collude?

- I.e., does server $j$ see any $Q_\ell$, $\ell \neq j$?
- Key assumption:
  - Privacy parameter $z$.
  - At most $z$ servers can collude.
  - $z = 1\implies$ No collusion.
- Requirement for $z = 1$: $I(Q_j; U) = 0$ for all $j \in [n]$.
- Requirement for a general $z$:
  - $I(Q_\mathcal{T}; U) = 0$ for all $\mathcal{T} \subseteq [n]$ with $|\mathcal{T}| \leq z$, where $Q_\mathcal{T} = (Q_\ell)_{\ell \in \mathcal{T}}$.
- Motivation:
  - Interception of communication links.
  - Data breaches.

Other assumptions:

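- Computational private information retrieval: privacy holds even if all the servers are hacked, as long as the adversary cannot solve a computationally hard (e.g., NP-hard) problem.
  - The mutual information between the queries and $U$ is allowed to be nonzero.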

#### Private information retrieval from 2-replicated databases

First PIR protocol: Chor et al., FOCS '95.

- The data $X = \{x_1, \ldots, x_m\}$ is replicated on two servers.
  - $z = 1$, i.e., no collusion.
- Protocol: User has $i \sim U_{m}$.
  - User generates $r \sim U_{\mathbb{F}_q^m}$.
  - $q_1 = r$, $q_2 = r + e_i$ ($e_i \in \mathbb{F}_q^m$ is the $i$-th unit vector; $q_2$ is a one-time pad encryption of $e_i$ with key $r$).
  - $a_j = q_j X^\top = \sum_{\ell \in [m]} q_{j, \ell} x_\ell$.
  - Linear combination of the files according to the query vector $q_j$.
- Decoding?
  - $a_2 - a_1 = (q_2 - q_1) X^\top = e_i X^\top = x_i$.
- Download?
  - Each $a_j$ has the size of one file $\implies$ the user downloads **twice** the size of the file.
- Privacy?
  - Since $z = 1$, need to show $I(U; Q_j) = 0$ for $j \in \{1, 2\}$.
    - $I(U; Q_1) = I(e_U; R) = 0$, where $R$ is the random variable of $r$, since $U$ and $R$ are independent.
    - $I(U; Q_2) = I(e_U; R + e_U) = 0$ since this is a one-time pad!
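The two-server protocol above can be sketched in a few lines. This is a minimal illustration, not code from the lecture: it assumes a prime field size $q = 7$ (so that arithmetic mod $q$ is a field), made-up file contents, 0-based file indices, and hypothetical helper names `make_queries`, `answer`, and `decode`.

```python
import random

P = 7  # assumed prime field size q, so integers mod P form a field

def make_queries(i, m, rng):
    """User side: q1 = r, q2 = r + e_i for a uniformly random r in F_q^m."""
    r = [rng.randrange(P) for _ in range(m)]
    q1 = list(r)
    q2 = [(r[l] + (1 if l == i else 0)) % P for l in range(m)]
    return q1, q2

def answer(query, X):
    """Server side: the linear combination sum_l query[l] * x_l of the files."""
    k = len(X[0])
    return [sum(query[l] * X[l][s] for l in range(len(X))) % P for s in range(k)]

def decode(a1, a2):
    """User side: a2 - a1 = e_i X^T = x_i."""
    return [(b - a) % P for a, b in zip(a1, a2)]

# Made-up database: m = 4 files, each with k = 3 symbols of F_7.
X = [[1, 2, 3], [4, 5, 6], [0, 6, 2], [3, 3, 1]]
rng = random.Random(0)

q1, q2 = make_queries(2, len(X), rng)
recovered = decode(answer(q1, X), answer(q2, X))
```

Each server returns $k$ symbols, so the user downloads $2k$ symbols to obtain the $k$ symbols of $x_i$.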

##### Parameters and notations in PIR

Parameters of the system:

- $n =$ # servers (as in storage).
- $m =$ # files.
- $k =$ size of each file (as in storage).
- $z =$ max. collusion (as in secret sharing).
- $t =$ # of answers required to obtain $x_i$ (as in secret sharing).
  - $n - t$ servers are “stragglers”, i.e., might not respond.

Figures of merit:

- PIR-rate = $\#$ desired symbols / $\#$ downloaded symbols
- PIR-capacity = largest possible rate.

Notational conventions:

- The dataset $X = \{x_j\}_{j \in [m]} = \{x_{j, \ell}\}_{(j, \ell) \in [m] \times [k]}$ is seen as a vector in $\mathbb{F}_q^{mk}$.
- Index $\mathbb{F}_q^{mk}$ using $[m] \times [k]$, i.e., $x_{j, \ell}$ is the $\ell$-th symbol of the $j$-th file.

#### Private information retrieval from 4-replicated databases

Consider $n = 4$ replicated servers, file size $k = 2$, collusion $z = 1$.

Protocol: User has $i \sim U_{m}$.

- Fix distinct nonzero $\alpha_1, \ldots, \alpha_4 \in \mathbb{F}_q$.
- Choose $r \sim U_{\mathbb{F}_q^{2m}}$.
- User sends $q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r$ to each server $j$.
- Server $j$ responds with
  $$
  a_j = q_j X^\top = e_{i, 1} X^\top + \alpha_j e_{i, 2} X^\top + \alpha_j^2 r X^\top
  $$
  - This is an evaluation at $\alpha_j$ of the polynomial $f_i(w) = x_{i, 1} + x_{i, 2} \cdot w + (r X^\top) \cdot w^2$.
  - Here $r X^\top$ is some random combination of the entries of $X$.
- Decoding?
  - Any 3 responses suffice to interpolate $f_i$ and obtain $x_i = (x_{i, 1}, x_{i, 2})$.
  - $\implies t = 3$, (one straggler is allowed)
- Privacy?
  - Does $q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r$ look familiar?
  - This is a share in [ramp scheme](CSE5313_L18.md#scheme-2-ramp-secret-sharing-scheme-mceliece-sarwate-scheme) with vector messages $m_1 = e_{i, 1}, m_2 = e_{i, 2}, m_i \in \mathbb{F}_q^{2m}$.
  - This is equivalent to $2m$ "parallel" ramp schemes over $\mathbb{F}_q$.
  - Each one reveals nothing to any $z = 1$ shareholders $\implies$ Private!
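The privacy claim can be checked by brute force on a toy instance. The sketch below assumes $q = 5$, $m = 2$ files, and $\alpha_j = j$ (all choices made here for illustration). It enumerates every random vector $r$ and compares the resulting distribution of $q_j$ across the two possible values of $i$. The helper names are hypothetical.

```python
from collections import Counter
from itertools import product

P = 5                  # assumed small prime field F_5
M_FILES = 2            # m = 2 files with k = 2 symbols each, so queries live in F_5^4
ALPHAS = [1, 2, 3, 4]  # distinct nonzero evaluation points, one per server

def unit(i, s):
    """e_{i,s}: unit vector in F_q^{2m} selecting symbol s of file i (0-based)."""
    v = [0] * (2 * M_FILES)
    v[2 * i + s] = 1
    return v

def query(i, alpha, r):
    """q_j = e_{i,1} + alpha * e_{i,2} + alpha^2 * r, coordinate by coordinate."""
    e1, e2 = unit(i, 0), unit(i, 1)
    return tuple((e1[c] + alpha * e2[c] + alpha * alpha * r[c]) % P
                 for c in range(2 * M_FILES))

def query_distribution(i, alpha):
    """Histogram of the query seen by one server, over all choices of r."""
    all_r = product(range(P), repeat=2 * M_FILES)
    return Counter(query(i, alpha, r) for r in all_r)

dists = {(i, a): query_distribution(i, a) for i in range(M_FILES) for a in ALPHAS}

# A single server sees the same distribution whichever file is requested.
same_for_all_files = all(dists[(0, a)] == dists[(1, a)] for a in ALPHAS)
```

Because $\alpha_j^2 \neq 0$, the map $r \mapsto q_j$ is a bijection of $\mathbb{F}_q^{2m}$, so each histogram is uniform. This is also why the $\alpha_j$ must be nonzero.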

### Private information retrieval from general replicated databases

$n$ servers, $m$ files, file size $k$, $X \in \mathbb{F}_q^{mk}$.

The user decodes $x_i$ from any $t$ responses.

Any $\leq z$ servers might collude to infer $i$ ($z < t$).

Protocol: User has $i \sim U_{m}$.

- Fix distinct nonzero $\alpha_1, \ldots, \alpha_n \in \mathbb{F}_q$.
- User chooses $r_1, \ldots, r_z \sim U_{\mathbb{F}_q^{mk}}$.
- User sends $q_j = \sum_{\ell=1}^k e_{i, \ell} \alpha_j^{\ell-1} + \sum_{\ell=1}^z r_\ell \alpha_j^{k+\ell-1}$ to each server $j$.
- Server $j$ responds with $a_j = q_j X^\top = f_i(\alpha_j)$.
  - $f_i(w) = \sum_{\ell=1}^k e_{i, \ell} X^\top w^{\ell-1} + \sum_{\ell=1}^z r_\ell X^\top w^{k+\ell-1}$ (random combinations of $X$).
  - Caveat: must have $t = k + z$.
  - $\implies \deg f_i = k + z - 1 = t - 1$.
- Decoding?
  - Interpolation from any $t$ evaluations of $f_i$.
- Privacy?
  - Against any $z = t - k$ colluding servers, immediate from the proof of the ramp scheme.

PIR-rate?

- Each $a_j$ is a single field element.
- Download $t = k + z$ elements in $\mathbb{F}_q$ in order to obtain $x_i \in \mathbb{F}_q^k$.
- $\implies$ PIR-rate = $\frac{k}{k+z} = \frac{k}{t}$.
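The general protocol and its decoding step can be sketched as follows. The concrete parameters ($q = 11$, $m = 3$, $k = 2$, $z = 2$, $n = 6$), the data, the set of responding servers, and all function names are assumptions chosen for illustration. Interpolation is done here by solving the Vandermonde system with Gaussian elimination, which is one of several possible ways to interpolate.

```python
import random

P = 11  # assumed prime field size q

def unit(i, l, m, k):
    """e_{i,l}: unit vector in F_q^{mk} selecting symbol l of file i (0-based)."""
    v = [0] * (m * k)
    v[i * k + l] = 1
    return v

def make_queries(i, m, k, z, alphas, rng):
    """q_j = sum_l e_{i,l} a_j^l + sum_l r_l a_j^(k+l), one vector per server."""
    rs = [[rng.randrange(P) for _ in range(m * k)] for _ in range(z)]
    vecs = [unit(i, l, m, k) for l in range(k)] + rs  # coefficients of w^0..w^(k+z-1)
    return [[sum(v[c] * pow(a, d, P) for d, v in enumerate(vecs)) % P
             for c in range(m * k)] for a in alphas]

def answer(q, x_flat):
    """Server side: a_j = q_j X^T, a single field element."""
    return sum(a * b for a, b in zip(q, x_flat)) % P

def interpolate(points):
    """Coefficients of the polynomial of degree < len(points) through the points."""
    t = len(points)
    rows = [[pow(a, d, P) for d in range(t)] + [y] for a, y in points]
    for col in range(t):
        piv = next(r for r in range(col, t) if rows[r][col])
        rows[col], rows[piv] = rows[piv], rows[col]
        inv = pow(rows[col][col], P - 2, P)  # inverse via Fermat's little theorem
        rows[col] = [v * inv % P for v in rows[col]]
        for r in range(t):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(v - f * w) % P for v, w in zip(rows[r], rows[col])]
    return [rows[d][t] for d in range(t)]

m, k, z = 3, 2, 2
t = k + z                          # answers needed to interpolate f_i
alphas = [1, 2, 3, 4, 5, 6]        # n = 6 distinct nonzero points
X = [[3, 9], [10, 0], [7, 4]]      # made-up files
x_flat = [s for f in X for s in f]

i = 1
queries = make_queries(i, m, k, z, alphas, random.Random(1))
answers = [answer(q, x_flat) for q in queries]

responders = [0, 2, 3, 5]          # any t servers; the other n - t are stragglers
coeffs = interpolate([(alphas[j], answers[j]) for j in responders])
recovered = coeffs[:k]             # the first k coefficients of f_i are x_i
rate = k / t
```

The remaining $z$ coefficients of $f_i$ are the random combinations $r_\ell X^\top$, which the user discards.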

#### Theorem: PIR-capacity for general replicated databases

The PIR-capacity for $n$ replicated databases with $z$ colluding servers, $n - t$ unresponsive servers, and $m$ files is $C = \frac{1-\frac{z}{t}}{1-(\frac{z}{t})^m}$.

- When $m \to \infty$, $C \to 1 - \frac{z}{t} = \frac{t-z}{t} = \frac{k}{t}$.
- The above scheme achieves PIR-capacity as $m \to \infty$.
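To get a feel for how quickly the capacity approaches the rate of the scheme, the formula can be evaluated exactly. The sketch below uses illustrative values $z = 1$, $t = 3$ and hypothetical function names, and it assumes $z < t$ so that the denominator is nonzero.

```python
from fractions import Fraction

def pir_capacity(z, t, m):
    """C = (1 - z/t) / (1 - (z/t)^m), assuming z < t."""
    rho = Fraction(z, t)
    return (1 - rho) / (1 - rho ** m)

def scheme_rate(z, t):
    """Rate k/t = (t - z)/t of the polynomial scheme above."""
    return Fraction(t - z, t)

# Capacity for z = 1, t = 3 and a growing number of files m.
caps = [pir_capacity(1, 3, m) for m in (1, 2, 5, 20)]
gap = caps[-1] - scheme_rate(1, 3)
```

For $m = 1$ the capacity is $1$ (there is nothing to hide), for $m = 2$ it is $3/4$, and the gap to $k/t = 2/3$ shrinks geometrically in $m$.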

### Private information retrieval from coded databases

#### Problem setup

Example:

- $n = 3$ servers, $m$ files $x_j = (x_{j, 1}, x_{j, 2})$, $k = 2$, and $q = 2$.
- Code each file with a parity code: $(x_{j, 1}, x_{j, 2}) \mapsto (x_{j, 1}, x_{j, 2}, x_{j, 1} + x_{j, 2})$.
- Server $j \in [3]$ stores all $j$-th symbols of all coded files.

Queries, answers, decoding, and privacy must be tailored for the code at hand.

With respect to a code $C$ and parameters $n, k, t, z$, such a scheme is called coded-PIR.

- The content for server $j$ is denoted by $c_j = (c_{j, 1}, \ldots, c_{j, m})$.
- $C$ is usually an MDS code.

#### Private information retrieval from parity-check codes

Example:

Say $z = 1$ (no collusion).

- Protocol: User has $i \sim U_{m}$.
- User chooses $r_1, r_2 \sim U_{\mathbb{F}_2^m}$.
- Two queries to each server:
  - $q_{1, 1} = r_1 + e_i$, $q_{1, 2} = r_2$.
  - $q_{2, 1} = r_1$, $q_{2, 2} = r_2 + e_i$.
  - $q_{3, 1} = r_1$, $q_{3, 2} = r_2$.
- Server $j$ responds with $q_{j, 1} c_j^\top$ and $q_{j, 2} c_j^\top$.
- Decoding?
  - $q_{1, 1} c_1^\top + q_{2, 1} c_2^\top + q_{3, 1} c_3^\top = r_1 (c_1 + c_2 + c_3)^\top + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}$.
  - $q_{1, 2} c_1^\top + q_{2, 2} c_2^\top + q_{3, 2} c_3^\top = r_2 (c_1 + c_2 + c_3)^\top + e_i c_2^\top = r_2 \cdot 0^\top + x_{i, 2} = x_{i, 2}$.
- Privacy?
  - Every server sees two uniformly random vectors in $\mathbb{F}_2^m$.
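The parity-code protocol above can be sketched over $\mathbb{F}_2$ with XOR as addition. The database contents, the 0-based indices, and the helper names are assumptions made for this illustration.

```python
import random

def dot2(u, v):
    """Inner product over F_2."""
    return sum(a & b for a, b in zip(u, v)) % 2

def xor(u, v):
    """Vector addition over F_2."""
    return [a ^ b for a, b in zip(u, v)]

X = [(1, 0), (1, 1), (0, 1), (0, 0)]  # made-up database: m = 4 files of k = 2 bits
m = len(X)

# Server contents: c[0] holds all first symbols, c[1] all second symbols,
# c[2] all parity symbols, so c[0] + c[1] + c[2] = 0 over F_2.
c = [
    [x1 for x1, _ in X],
    [x2 for _, x2 in X],
    [x1 ^ x2 for x1, x2 in X],
]

def retrieve(i, rng):
    r1 = [rng.randrange(2) for _ in range(m)]
    r2 = [rng.randrange(2) for _ in range(m)]
    e_i = [1 if l == i else 0 for l in range(m)]
    queries = [
        (xor(r1, e_i), r2),  # server 1: (r1 + e_i, r2)
        (r1, xor(r2, e_i)),  # server 2: (r1, r2 + e_i)
        (r1, r2),            # server 3: (r1, r2)
    ]
    answers = [(dot2(qa, c[j]), dot2(qb, c[j])) for j, (qa, qb) in enumerate(queries)]
    # Summing the first (second) answers cancels r1 (r2) against the parity check.
    x_i1 = answers[0][0] ^ answers[1][0] ^ answers[2][0]
    x_i2 = answers[0][1] ^ answers[1][1] ^ answers[2][1]
    return (x_i1, x_i2)

rng = random.Random(0)
recovered = [retrieve(i, rng) for i in range(m)]
```

In this sketch the user downloads $3 \times 2 = 6$ bits to obtain the $2$ bits of $x_i$.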

<details>
<summary>Proof from coding-theoretic interpretation</summary>

Let $G = (g_1^\top, g_2^\top, g_3^\top)$ be the generator matrix, with columns $g_1^\top, g_2^\top, g_3^\top$.

- For every file $x_j = (x_{j, 1}, x_{j, 2})$ we encode $x_j G = (x_j g_1^\top, x_j g_2^\top, x_j g_3^\top) = (c_{1, j}, c_{2, j}, c_{3, j})$.
- Server $j$ stores $X g_j^\top = (x_1^\top, \ldots, x_m^\top)^\top g_j^\top = (c_{j, 1}, \ldots, c_{j, m})^\top$.

- By multiplying by $r_1$, the servers together store a codeword in $C$:
  - $(r_1 X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top) = r_1 X G$.
- By replacing one of the $r_1$’s by $r_1 + e_i$, we introduce an error in that entry:
  - $\left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) = r_1 X G + (e_i X g_1^\top, 0,0)$.
- Download this “erroneous” word from the servers and multiply it by $H^\top$, where $H = (h_1^\top, h_2^\top, h_3^\top)$ is the parity-check matrix:

$$
\begin{aligned}
\left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) H^\top &= \left(r_1 X G + (e_i X g_1^\top, 0,0)\right) H^\top \\
&= r_1 X G H^\top + (e_i X g_1^\top, 0,0) H^\top \\
&= 0 + x_i g_1^\top \\
&= x_{i, 1}.
\end{aligned}
$$
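For the parity code of the running example, one concrete choice of matrices consistent with the encoding $(x_{j, 1}, x_{j, 2}) \mapsto (x_{j, 1}, x_{j, 2}, x_{j, 1} + x_{j, 2})$ is written out here. The explicit matrices are not stated in the notes, so treat them as an illustration:

$$
G = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}, \qquad
H = \begin{pmatrix} 1 & 1 & 1 \end{pmatrix}, \qquad
G H^\top = \begin{pmatrix} 1 + 1 \\ 1 + 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \text{ over } \mathbb{F}_2.
$$

With this choice $g_1 = (1, 0)$, so $e_i X g_1^\top$ is the first symbol $x_{i, 1}$ of file $i$, and $(e_i X g_1^\top, 0, 0) H^\top = x_{i, 1}$.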

> In homework we will show that this works with any MDS code ($z=1$).

- Say we obtained $x_i g_1^\top, \ldots, x_i g_k^\top$ ($d - 1$ at a time, how?).
- $(x_i g_1^\top, \ldots, x_i g_k^\top) = x_i B$, where $B$ is a $k \times k$ submatrix of $G$.
- $C$ is MDS, so every $k \times k$ submatrix of $G$ is invertible $\implies$ $B$ is invertible $\implies$ obtain $x_i = (x_i B) B^{-1}$.

</details>

> [!TIP]
>
> error + known location $\implies$ erasure. $d = 2 \implies$ 1 erasure is correctable.