# CSE5313 Coding and information theory for data science (Lecture 19)

## Private information retrieval

### Problem setup

Premise:

- Database $X = \{x_1, \ldots, x_m\}$, where each $x_i \in \mathbb{F}_q^k$ is a "file" (e.g., a medical record).
- $X$ is coded as $X \mapsto \{y_1, \ldots, y_n\}$, with $y_j$ stored at server $j$.
- The user (physician) wants $x_i$.
- The user sends a query $q_j \sim Q_j$ to server $j$.
- Server $j$ responds with $a_j \sim A_j$.

Decodability:

- The user can retrieve the file: $H(X_i | A_1, \ldots, A_n) = 0$.

Privacy:

- $i$ is seen as $i \sim U = U_{m}$, reflecting the servers' lack of knowledge.
- $i$ must be kept private: $I(Q_j; U) = 0$ for all $j \in [n]$.

> In short, we want to retrieve $x_i$ from the servers without revealing $i$ to the servers.

### Private information retrieval from Replicated Databases

#### Simple case, one server

Say $n = 1, y_1 = X$.

- All data is stored in one server.
- Simple solution:
  - $q_1 =$ "send everything".
  - $a_1 = y_1 = X$.

Theorem: Information Theoretic PIR with $n = 1$ can only be achieved by downloading the entire database.

- Can we do better if $n > 1$?

#### Collusion parameter

Key question for $n > 1$: Can servers collude?

- I.e., does server $j$ see any $Q_\ell$, $\ell \neq j$?
- Key assumption:
  - Privacy parameter $z$.
  - At most $z$ servers can collude.
  - $z = 1\implies$ No collusion.
- Requirement for $z = 1$: $I(Q_j; U) = 0$ for all $j \in [n]$.
- Requirement for a general $z$:
  - $I(Q_\mathcal{T}; U) = 0$ for all $\mathcal{T} \subseteq [n]$, $|\mathcal{T}| \leq z$, where $Q_\mathcal{T} = \{Q_\ell\}_{\ell \in \mathcal{T}}$.
- Motivation:
  - Interception of communication links.
  - Data breaches.

Other assumptions:

- Computational private information retrieval (even if all the servers are hacked, they still cannot learn $i$ without solving an NP-hard problem):
  - The mutual information between the queries and $U$ is non-zero.

#### Private information retrieval from 2-replicated databases

First PIR protocol: Chor et al. FOCS '95.
- The data $X = \{x_1, \ldots, x_m\}$ is replicated on two servers.
- $z = 1$, i.e., no collusion.
- Protocol: User has $i \sim U_{m}$.
  - User generates $r \sim U_{\mathbb{F}_q^m}$.
  - $q_1 = r, q_2 = r + e_i$ ($e_i \in \mathbb{F}_q^m$ is the $i$-th unit vector; $q_2$ is equivalent to a one-time pad encryption of $e_i$ with key $r$).
  - $a_j = q_j X^\top = \sum_{\ell \in [m]} q_{j, \ell} x_\ell$
    - Linear combination of the files according to the query vector $q_j$.
- Decoding?
  - $a_2 - a_1 = (q_2 - q_1) X^\top = e_i X^\top = x_i$.
- Download?
  - $|a_j| =$ size of file $\implies$ downloading **twice** the size of the file.
- Privacy?
  - Since $z = 1$, need to show $I(U; Q_j) = 0$ for $j \in \{1, 2\}$.
  - Let $R$ denote the random vector of which $r$ is a realization.
  - $I(U; Q_1) = I(e_U; R) = 0$ since $U$ and $R$ are independent.
  - $I(U; Q_2) = I(e_U; R + e_U) = 0$ since this is a one-time pad!

##### Parameters and notations in PIR

Parameters of the system:

- $n =$ # servers (as in storage).
- $m =$ # files.
- $k =$ size of each file (as in storage).
- $z =$ max. collusion (as in secret sharing).
- $t =$ # of answers required to obtain $x_i$ (as in secret sharing).
  - $n - t$ servers are "stragglers", i.e., might not respond.

Figures of merit:

- PIR-rate = $\#$ desired symbols / $\#$ downloaded symbols
- PIR-capacity = largest possible rate.

Notational conventions:

- The dataset $X = \{x_j\}_{j \in [m]} = \{x_{j, \ell}\}_{(j, \ell) \in [m] \times [k]}$ is seen as a vector in $\mathbb{F}_q^{mk}$.
- Index $\mathbb{F}_q^{mk}$ using $[m] \times [k]$, i.e., $x_{j, \ell}$ is the $\ell$-th symbol of the $j$-th file.

#### Private information retrieval from 4-replicated databases

Consider $n = 4$ replicated servers, file size $k = 2$, collusion $z = 1$.

Protocol: User has $i \sim U_{m}$.

- Fix distinct nonzero $\alpha_1, \ldots, \alpha_4 \in \mathbb{F}_q$.
- Choose $r \sim U_{\mathbb{F}_q^{2m}}$.
- User sends $q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r$ to each server $j$.
- Server $j$ responds with

  $$
  a_j = q_j X^\top = e_{i, 1} X^\top + \alpha_j e_{i, 2} X^\top + \alpha_j^2 r X^\top
  $$

  - This is an evaluation at $\alpha_j$ of the polynomial $f_i(w) = x_{i, 1} + x_{i, 2} \cdot w + (r X^\top) \cdot w^2$.
  - Here $r X^\top$ is some random combination of the entries of $X$.
- Decoding?
  - Any 3 responses suffice to interpolate $f_i$ and obtain $x_i = (x_{i, 1}, x_{i, 2})$.
  - $\implies t = 3$ (one straggler is allowed).
- Privacy?
  - Does $q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r$ look familiar?
  - This is a share in [ramp scheme](CSE5313_L18.md#scheme-2-ramp-secret-sharing-scheme-mceliece-sarwate-scheme) with vector messages $m_1 = e_{i, 1}, m_2 = e_{i, 2}$, where $m_1, m_2 \in \mathbb{F}_q^{2m}$.
  - This is equivalent to $2m$ "parallel" ramp schemes over $\mathbb{F}_q$.
  - Each one reveals nothing to any $z = 1$ shareholders $\implies$ Private!

### Private information retrieval from general replicated databases

$n$ servers, $m$ files, file size $k$, $X \in \mathbb{F}_q^{mk}$.

The user decodes $x_i$ from any $t$ responses.

Any $\leq z$ servers might collude to infer $i$ ($z < t$).

Protocol: User has $i \sim U_{m}$.

- Fix distinct nonzero $\alpha_1, \ldots, \alpha_n \in \mathbb{F}_q$, as in the 4-server example.
- User chooses $r_1, \ldots, r_z \sim U_{\mathbb{F}_q^{mk}}$.
- User sends $q_j = \sum_{\ell=1}^k e_{i, \ell} \alpha_j^{\ell-1} + \sum_{\ell=1}^z r_\ell \alpha_j^{k+\ell-1}$ to each server $j$.
- Server $j$ responds with $a_j = q_j X^\top = f_i(\alpha_j)$.
  - $f_i(w) = \sum_{\ell=1}^k e_{i, \ell} X^\top w^{\ell-1} + \sum_{\ell=1}^z r_\ell X^\top w^{k+\ell-1}$ (the last $z$ coefficients are random combinations of $X$).
- Caveat: must have $t = k + z$.
  - $\implies \deg f_i = k + z - 1 = t - 1$.
- Decoding?
  - Interpolation from any $t$ evaluations of $f_i$.
- Privacy?
  - Against any $z = t - k$ colluding servers, immediate from the proof of the ramp scheme.

PIR-rate?

- Each $a_j$ is a single field element.
- Download $t = k + z$ elements in $\mathbb{F}_q$ in order to obtain $x_i \in \mathbb{F}_q^k$.
- $\implies$ PIR-rate = $\frac{k}{k+z} = \frac{k}{t}$.
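
The general replicated protocol above can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions that are not in the notes: the field $\mathbb{F}_q$ is modelled as the integers modulo a prime `P = 101`, file indices are 0-based, and the helper names (`make_queries`, `answer`, `interpolate`, `retrieve`) are hypothetical. Interpolation is done by solving the Vandermonde system with Gauss-Jordan elimination rather than by any particular method from the lecture.

```python
import random

P = 101  # assumed prime modulus: F_q is modelled as the integers mod P


def make_queries(i, m, k, z, alphas, rng):
    """Build q_j in F_P^{mk} for every server j; i is the 0-based file index."""
    r = [[rng.randrange(P) for _ in range(m * k)] for _ in range(z)]
    queries = []
    for a in alphas:
        q = [0] * (m * k)
        for l in range(k):
            q[i * k + l] = pow(a, l, P)           # e_{i,l+1} * alpha^l
        for l in range(z):
            c = pow(a, k + l, P)                  # r_{l+1} * alpha^(k+l)
            q = [(qs + c * rs) % P for qs, rs in zip(q, r[l])]
        queries.append(q)
    return queries


def answer(q, X):
    """Server response a_j = q_j X^T, with X flattened to length mk."""
    return sum(qs * xs for qs, xs in zip(q, X)) % P


def interpolate(points):
    """Coefficients (lowest degree first) of the polynomial of degree
    < len(points) through the given (alpha, value) pairs, mod P."""
    t = len(points)
    rows = [[pow(a, e, P) for e in range(t)] + [y % P] for a, y in points]
    for col in range(t):
        piv = next(r for r in range(col, t) if rows[r][col])
        rows[col], rows[piv] = rows[piv], rows[col]
        inv = pow(rows[col][col], -1, P)          # modular inverse (Python 3.8+)
        rows[col] = [v * inv % P for v in rows[col]]
        for r in range(t):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(v - f * w) % P for v, w in zip(rows[r], rows[col])]
    return [row[t] for row in rows]


def retrieve(i, X, m, k, z, alphas, responders, rng=random):
    """Run the protocol; responders lists the servers that actually reply."""
    t = k + z
    if len(responders) < t:
        raise ValueError("need at least t = k + z responses")
    queries = make_queries(i, m, k, z, alphas, rng)
    points = [(alphas[j], answer(queries[j], X)) for j in responders[:t]]
    return interpolate(points)[:k]                # first k coefficients are x_i


X = [5, 6, 7, 8, 9, 10]                           # m = 3 files of k = 2 symbols
print(retrieve(1, X, m=3, k=2, z=1, alphas=[1, 2, 3, 4], responders=[0, 2, 3]))  # [7, 8]
```

With $n = 4$, $k = 2$, $z = 1$ this reduces to the 4-server example: server index 1 acts as the straggler, and the three remaining answers determine the degree-2 polynomial $f_i$.
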
#### Theorem: PIR-capacity for general replicated databases

The PIR-capacity for $n$ replicated databases with $z$ colluding servers, $n - t$ unresponsive servers, and $m$ files is $C = \frac{1-\frac{z}{t}}{1-(\frac{z}{t})^m}$.

- When $m \to \infty$, $C \to 1 - \frac{z}{t} = \frac{t-z}{t} = \frac{k}{t}$.
- The above scheme achieves PIR-capacity as $m \to \infty$.

### Private information retrieval from coded databases

#### Problem setup:

Example:

- $n = 3$ servers, $m$ files $x_j$, $x_j = (x_{j, 1}, x_{j, 2})$, $k = 2$, and $q = 2$.
- Code each file with a parity code: $(x_{j, 1}, x_{j, 2}) \mapsto (x_{j, 1}, x_{j, 2}, x_{j, 1} + x_{j, 2})$.
- Server $j \in [3]$ stores all $j$-th symbols of all coded files.

Queries, answers, decoding, and privacy must be tailored for the code at hand.

With respect to a code $C$ and parameters $n, k, t, z$, such a scheme is called coded-PIR.

- The content for server $j$ is denoted by $c_j = (c_{j, 1}, \ldots, c_{j, m})$.
- $C$ is usually an MDS code.

#### Private information retrieval from parity-check codes

Example:

Say $z = 1$ (no collusion).

- Protocol: User has $i \sim U_{m}$.
  - User chooses $r_1, r_2 \sim U_{\mathbb{F}_2^m}$.
  - Two queries to each server:
    - $q_{1, 1} = r_1 + e_i$, $q_{1, 2} = r_2$.
    - $q_{2, 1} = r_1$, $q_{2, 2} = r_2 + e_i$.
    - $q_{3, 1} = r_1$, $q_{3, 2} = r_2$.
  - Server $j$ responds with $q_{j, 1} c_j^\top$ and $q_{j, 2} c_j^\top$.
- Decoding?
  - Every coded file sums to zero, so $c_1 + c_2 + c_3 = 0$.
  - $q_{1, 1} c_1^\top + q_{2, 1} c_2^\top + q_{3, 1} c_3^\top = r_1 (c_1 + c_2 + c_3)^\top + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}$.
  - $q_{1, 2} c_1^\top + q_{2, 2} c_2^\top + q_{3, 2} c_3^\top = r_2 (c_1 + c_2 + c_3)^\top + e_i c_2^\top = r_2 \cdot 0^\top + x_{i, 2} = x_{i, 2}$.
- Privacy?
  - Every server sees two uniformly random vectors in $\mathbb{F}_2^m$, whose distribution does not depend on $i$.
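
The parity-code example above can be checked with a short Python sketch. It assumes each file is a pair of bits, uses 0-based file indices, and the function names (`encode`, `dot`, `make_queries`, `retrieve`) are hypothetical helpers chosen for illustration, not part of the lecture.

```python
import random


def encode(files):
    """Return (c_1, c_2, c_3): server j holds the j-th coded bit of every file."""
    c1 = [x1 for x1, _ in files]
    c2 = [x2 for _, x2 in files]
    c3 = [x1 ^ x2 for x1, x2 in files]            # parity symbol x_{j,1} + x_{j,2}
    return c1, c2, c3


def dot(u, v):
    """Inner product over F_2."""
    return sum(a & b for a, b in zip(u, v)) % 2


def make_queries(i, m, rng):
    """Two queries per server; e_i is added to r_1 at server 1 and to r_2 at server 2."""
    r1 = [rng.randrange(2) for _ in range(m)]
    r2 = [rng.randrange(2) for _ in range(m)]

    def plus_e_i(r):
        s = list(r)
        s[i] ^= 1
        return s

    return [(plus_e_i(r1), r2), (r1, plus_e_i(r2)), (r1, r2)]


def retrieve(i, files, rng=random):
    """Run the protocol for 0-based file index i and return (x_{i,1}, x_{i,2})."""
    servers = encode(files)
    queries = make_queries(i, len(files), rng)
    answers = [(dot(q1, c), dot(q2, c)) for (q1, q2), c in zip(queries, servers)]
    # Summing the first (resp. second) answers cancels r_1 (resp. r_2),
    # because c_1 + c_2 + c_3 = 0 over F_2.
    x1 = answers[0][0] ^ answers[1][0] ^ answers[2][0]
    x2 = answers[0][1] ^ answers[1][1] ^ answers[2][1]
    return x1, x2


files = [(0, 1), (1, 1), (1, 0), (0, 0)]
print(retrieve(2, files))                         # (1, 0)
```

In this sketch the user downloads $3 \times 2 = 6$ bits to recover a 2-bit file, so the rate of this particular example is $2/6 = 1/3$.
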
Proof from coding-theoretic interpretation

Let $G = (g_1^\top, g_2^\top, g_3^\top)$ be the generator matrix, and view $X$ as the $m \times k$ matrix whose rows are the files.

- For every file $x_\ell = (x_{\ell, 1}, x_{\ell, 2})$ we encode $x_\ell G = (x_\ell g_1^\top, x_\ell g_2^\top, x_\ell g_3^\top) = (c_{1, \ell}, c_{2, \ell}, c_{3, \ell})$.
- Server $j$ stores $X g_j^\top = (x_1^\top, \ldots, x_m^\top)^\top g_j^\top = (c_{j, 1}, \ldots, c_{j, m})^\top$.
- By multiplying by $r_1$, the servers' responses together form a codeword in $C$:
  - $(r_1 X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top) = r_1 X G$.
- By replacing one of the $r_1$'s by $r_1 + e_i$, we introduce an error in that entry:
  - $\left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) = r_1 X G + (e_i X g_1^\top, 0,0)$.
- Download this "erroneous" word from the servers and multiply it by $H^\top$, where $H = (h_1^\top, h_2^\top, h_3^\top)$ is the parity-check matrix. For the parity code, $H = (1, 1, 1)$ and $g_1 = (1, 0)$, so:

$$
\begin{aligned}
\left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) H^\top &= \left(r_1 X G + (e_i X g_1^\top, 0,0)\right) H^\top \\
&= r_1 X G H^\top + (e_i X g_1^\top, 0,0) H^\top \\
&= 0 + x_i g_1^\top \\
&= x_{i, 1}.
\end{aligned}
$$

> In homework we will show that this works with any MDS code ($z=1$).

- Say we obtained $x_i g_1^\top, \ldots, x_i g_k^\top$ ($d - 1$ at a time, how?).
- $(x_i g_1^\top, \ldots, x_i g_k^\top) = x_i B$, where $B$ is a $k \times k$ submatrix of $G$.
- $B$ is a $k \times k$ submatrix of the generator matrix of an MDS code $\implies$ invertible! $\implies$ Obtain $x_{i}$.
> [!TIP]
>
> error + known location $\implies$ erasure. $d = 2 \implies$ 1 erasure is correctable.
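
The final step of the argument, recovering $x_i$ from any $k$ coded symbols by inverting a $k \times k$ submatrix $B$ of $G$, can be illustrated for the $[3, 2]$ parity code over $\mathbb{F}_2$. This is a minimal sketch: the explicit matrix `G` below is the generator matrix implied by the parity encoding, and `inv2x2_mod2` and `recover` are hypothetical helper names; it does not cover general MDS codes.

```python
from itertools import combinations, product

# Generator matrix of the [3, 2] parity code over F_2; columns are g_1, g_2, g_3.
G = [[1, 0, 1],
     [0, 1, 1]]


def inv2x2_mod2(B):
    """Inverse of a 2x2 matrix over F_2 (raises if B is singular)."""
    (a, b), (c, d) = B
    if (a * d - b * c) % 2 == 0:
        raise ValueError("B is not invertible over F_2")
    # The determinant is 1 and -1 = 1 in F_2, so the inverse is the adjugate.
    return [[d, b], [c, a]]


def recover(symbols, cols):
    """Given symbols = (x_i g_a^T, x_i g_b^T) for columns cols = (a, b),
    return x_i = symbols * B^{-1}, where B consists of those two columns of G."""
    B = [[G[row][c] for c in cols] for row in range(2)]
    Binv = inv2x2_mod2(B)
    return tuple(sum(symbols[s] * Binv[s][t] for s in range(2)) % 2
                 for t in range(2))


# Every pair of columns of G is invertible, so any 2 coded symbols determine x_i.
for x in product([0, 1], repeat=2):
    coded = [sum(x[r] * G[r][c] for r in range(2)) % 2 for c in range(3)]
    for cols in combinations(range(3), 2):
        assert recover((coded[cols[0]], coded[cols[1]]), cols) == x
```

The loop at the end checks all four possible files against all three column pairs, which is exactly the MDS property used in the last bullet for this small code.
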