CSE5313 Coding and information theory for data science (Lecture 19)
Private information retrieval
Problem setup
Premise:
- Database $X = \{x_1, \ldots, x_m\}$, each $x_i \in \mathbb{F}_q^k$ is a "file" (e.g., medical record).
- $X$ is coded $X \mapsto \{y_1, \ldots, y_n\}$, $y_j$ stored at server $j$.
- The user (physician) wants $x_i$.
- The user sends a query $q_j \sim Q_j$ to server $j$.
- Server $j$ responds with $a_j \sim A_j$.
Decodability:
- The user can retrieve the file: $H(X_i | A_1, \ldots, A_n) = 0$.
Privacy:
- $i$ is seen as $i \sim U = U_{[m]}$, reflecting the servers' lack of knowledge.
- $i$ must be kept private: $I(Q_j; U) = 0$ for all $j \in [n]$.
In short, we want to retrieve $x_i$ from the servers without revealing $i$ to the servers.
Private information retrieval from Replicated Databases
Simple case, one server
Say $n = 1$, $y_1 = X$.
- All data is stored in one server.
- Simple solution: $q_1 =$ "send everything", $a_1 = y_1 = X$.
Theorem: Information-theoretic PIR with $n = 1$ can only be achieved by downloading the entire database.
- Can we do better if $n > 1$?
Collusion parameter
Key question for $n > 1$: Can servers collude?
- I.e., does server $j$ see any $Q_\ell$, $\ell \neq j$?
- Key assumption:
  - Privacy parameter $z$.
  - At most $z$ servers can collude.
  - $z = 1 \implies$ No collusion.
- Requirement for $z = 1$: $I(Q_j; U) = 0$ for all $j \in [n]$.
- Requirement for a general $z$: $I(Q_\mathcal{T}; U) = 0$ for all $\mathcal{T} \subseteq [n]$, $|\mathcal{T}| \leq z$, where $Q_\mathcal{T} = (Q_\ell)_{\ell \in \mathcal{T}}$.
- Motivation:
- Interception of communication links.
- Data breaches.
Other assumptions:
- Computational private information retrieval: even if all the servers are hacked, they still cannot learn $i$ without solving a computationally hard problem.
  - The mutual information between queries and $U$ may be nonzero.
Private information retrieval from 2-replicated databases
First PIR protocol: Chor et al., FOCS '95.
- The data $X = \{x_1, \ldots, x_m\}$ is replicated on two servers.
  - $z = 1$, i.e., no collusion.
- Protocol: User has $i \sim U_{[m]}$.
  - User generates $r \sim U_{\mathbb{F}_q^m}$.
  - $q_1 = r$, $q_2 = r + e_i$ ($e_i \in \mathbb{F}_q^m$ is the $i$-th unit vector; $q_2$ is equivalent to a one-time pad encryption of $e_i$ with key $r$).
  - $a_j = q_j X^\top = \sum_{\ell \in [m]} q_{j, \ell} x_\ell$.
  - Linear combination of the files according to the query vector $q_j$.
- Decoding?
  - $a_2 - a_1 = (q_2 - q_1) X^\top = e_i X^\top = x_i$.
- Download?
  - $a_j =$ size of file $\implies$ downloading twice the size of the file.
- Privacy?
  - Since $z = 1$, need to show $I(U; Q_j) = 0$ for $j \in \{1, 2\}$.
  - $I(U; Q_1) = I(e_U; F) = 0$ since $U$ and $F$ are independent, where $F$ denotes the random variable of the key $r$.
  - $I(U; Q_2) = I(e_U; F + e_U) = 0$ since this is one-time pad!
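The two-server protocol can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: it assumes a prime field size `P = 257` so that arithmetic is plain integer arithmetic modulo `P`, stores files as lists of field elements, and uses 0-based file indices. The function names are hypothetical.

```python
import secrets

P = 257  # assumed prime field size q; any prime works for this sketch


def make_queries(i, m):
    """User side: q_1 = r and q_2 = r + e_i, with r uniform over F_P^m."""
    r = [secrets.randbelow(P) for _ in range(m)]
    q2 = list(r)
    q2[i] = (q2[i] + 1) % P
    return r, q2


def answer(query, files):
    """Server side: a_j = sum_l q_{j,l} x_l, a vector of k field elements."""
    k = len(files[0])
    return [sum(c * f[s] for c, f in zip(query, files)) % P for s in range(k)]


def decode(a1, a2):
    """User side: a_2 - a_1 = e_i X^T = x_i."""
    return [(b - a) % P for a, b in zip(a1, a2)]


def retrieve(i, files):
    """Run the whole protocol against two servers holding the same files."""
    q1, q2 = make_queries(i, len(files))
    return decode(answer(q1, files), answer(q2, files))
```

For example, `retrieve(2, files)` returns the third file. Each server sees only one of the two query vectors, and each answer has the size of one file, which matches the download cost of twice the file size.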
Parameters and notations in PIR
Parameters of the system:
- $n =$ # servers (as in storage).
- $m =$ # files.
- $k =$ size of each file (as in storage).
- $z =$ max. collusion (as in secret sharing).
- $t =$ # of answers required to obtain $x_i$ (as in secret sharing).
  - $n - t$ servers are "stragglers", i.e., might not respond.
Figures of merit:
- PIR-rate = # desired symbols / # downloaded symbols.
- PIR-capacity = largest possible rate.
Notational conventions:
- The dataset $X = \{x_j\}_{j \in [m]} = \{x_{j, \ell}\}_{(j, \ell) \in [m] \times [k]}$ is seen as a vector in $\mathbb{F}_q^{mk}$.
- Index $\mathbb{F}_q^{mk}$ using $[m] \times [k]$, i.e., $x_{j, \ell}$ is the $\ell$-th symbol of the $j$-th file.
Private information retrieval from 4-replicated databases
Consider $n = 4$ replicated servers, file size $k = 2$, collusion $z = 1$.
Protocol: User has $i \sim U_{[m]}$.
- Fix distinct nonzero $\alpha_1, \ldots, \alpha_4 \in \mathbb{F}_q$.
- Choose $r \sim U_{\mathbb{F}_q^{2m}}$.
- User sends $q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r$ to each server $j$.
- Server $j$ responds with $a_j = q_j X^\top = e_{i, 1} X^\top + \alpha_j e_{i, 2} X^\top + \alpha_j^2 r X^\top$.
  - This is an evaluation at $\alpha_j$ of the polynomial $f_i(w) = x_{i, 1} + x_{i, 2} \cdot w + (r X^\top) \cdot w^2$.
  - Here $r X^\top$ is some random combination of the entries of $X$.
- Decoding?
  - Any 3 responses suffice to interpolate $f_i$ and obtain $x_i = (x_{i, 1}, x_{i, 2})$.
  - $\implies t = 3$ (one straggler is allowed).
- Privacy?
  - Does $q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r$ look familiar?
  - This is a share in a ramp scheme with vector messages $m_1 = e_{i, 1}$, $m_2 = e_{i, 2}$, where $m_1, m_2 \in \mathbb{F}_q^{2m}$.
  - This is equivalent to $2m$ "parallel" ramp schemes over $\mathbb{F}_q$.
  - Each one reveals nothing to any $z = 1$ shareholders $\implies$ Private!
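The following Python sketch runs the $n = 4$, $k = 2$, $z = 1$ protocol and decodes from any three responses by Lagrange interpolation. It is an illustration under stated assumptions: the prime field size `P = 101`, the evaluation points `ALPHAS`, the flat 0-based layout of $X$ (file $i$ occupies positions $2i$ and $2i + 1$), and all function names are choices made here, not taken from the lecture.

```python
import secrets

P = 101                # assumed prime field size q
ALPHAS = [1, 2, 3, 4]  # assumed distinct nonzero points, one per server


def make_queries(i, m):
    """q_j = e_{i,1} + alpha_j e_{i,2} + alpha_j^2 r, with r uniform over F_P^{2m}."""
    r = [secrets.randbelow(P) for _ in range(2 * m)]
    queries = []
    for a in ALPHAS:
        q = [(a * a * rv) % P for rv in r]
        q[2 * i] = (q[2 * i] + 1) % P          # coordinate (i, 1)
        q[2 * i + 1] = (q[2 * i + 1] + a) % P  # coordinate (i, 2)
        queries.append(q)
    return queries


def answer(q, X):
    """Server side: a_j = q_j X^T, a single field element."""
    return sum(c * x for c, x in zip(q, X)) % P


def interpolate(points):
    """Coefficients (c0, c1, c2) of the degree <= 2 polynomial through 3 points."""
    coeffs = [0, 0, 0]
    for j, (xj, yj) in enumerate(points):
        (xa, _), (xb, _) = [p for l, p in enumerate(points) if l != j]
        # Lagrange basis polynomial: (w - xa)(w - xb) / ((xj - xa)(xj - xb))
        denom = ((xj - xa) * (xj - xb)) % P
        scale = yj * pow(denom, -1, P)
        basis = [xa * xb, -(xa + xb), 1]
        coeffs = [(c + scale * b) % P for c, b in zip(coeffs, basis)]
    return coeffs


def retrieve(i, X, responders=(0, 1, 3)):
    """Recover (x_{i,1}, x_{i,2}) from the three servers listed in `responders`."""
    queries = make_queries(i, len(X) // 2)
    points = [(ALPHAS[j], answer(queries[j], X)) for j in responders]
    c0, c1, _ = interpolate(points)
    return [c0, c1]
```

The default `responders=(0, 1, 3)` treats server 2 as the straggler; any other choice of three servers gives the same result. The modular inverse via `pow(denom, -1, P)` requires Python 3.8 or later.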
Private information retrieval from general replicated databases
$n$ servers, $m$ files, file size $k$, $X \in \mathbb{F}_q^{mk}$.
User decodes $x_i$ from any $t$ responses.
Any $\leq z$ servers might collude to infer $i$ ($z < t$).
Protocol: User has $i \sim U_{[m]}$.
- Fix distinct nonzero $\alpha_1, \ldots, \alpha_n \in \mathbb{F}_q$, as in the 4-server case.
- User chooses $r_1, \ldots, r_z \sim U_{\mathbb{F}_q^{mk}}$.
- User sends $q_j = \sum_{\ell=1}^k e_{i, \ell} \alpha_j^{\ell-1} + \sum_{\ell=1}^z r_\ell \alpha_j^{k+\ell-1}$ to each server $j$.
- Server $j$ responds with $a_j = q_j X^\top = f_i(\alpha_j)$.
  - $f_i(w) = \sum_{\ell=1}^k e_{i, \ell} X^\top w^{\ell-1} + \sum_{\ell=1}^z r_\ell X^\top w^{k+\ell-1}$ (the $r_\ell X^\top$ are random combinations of $X$).
  - Caveat: must have $t = k + z$.
  - $\implies \deg f_i = k + z - 1 = t - 1$.
- Decoding?
  - Interpolation from any $t$ evaluations of $f_i$.
- Privacy?
  - Against any $z = t - k$ colluding servers, immediate from the proof of the ramp scheme.
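The privacy claim can be checked by brute force on a toy instance. The sketch below is an illustration only: it assumes the small parameters $q = 5$, $m = 2$, $k = 1$, $z = 2$ (so $t = 3$) with $n = 4$ servers, enumerates every choice of $(r_1, r_2)$, and compares the exact distribution of the queries seen by a coalition for different values of $i$. All names are hypothetical and indices are 0-based.

```python
from collections import Counter
from itertools import combinations, product

P, M, K, Z = 5, 2, 1, 2  # assumed toy parameters: field F_5, 2 files, k = 1, z = 2
ALPHAS = [1, 2, 3, 4]    # assumed distinct nonzero points, n = 4 servers


def query(i, alpha, rs):
    """q_j = sum_l e_{i,l} alpha^(l-1) + sum_l r_l alpha^(K+l-1) over F_P^{MK}."""
    q = [0] * (M * K)
    for l in range(K):
        q[i * K + l] = pow(alpha, l, P)
    for l, r in enumerate(rs):
        w = pow(alpha, K + l, P)
        q = [(c + w * rv) % P for c, rv in zip(q, r)]
    return tuple(q)


def view_distribution(i, coalition):
    """Exact counts of the query tuples seen by `coalition` when the user wants file i."""
    vectors = list(product(range(P), repeat=M * K))
    counts = Counter()
    for rs in product(vectors, repeat=Z):
        counts[tuple(query(i, ALPHAS[j], rs) for j in coalition)] += 1
    return counts


def is_private(coalition_size):
    """True if every coalition of this size sees the same distribution for all i."""
    return all(
        view_distribution(0, T) == view_distribution(i, T)
        for T in combinations(range(len(ALPHAS)), coalition_size)
        for i in range(1, M)
    )
```

With these parameters, `is_private(2)` holds, while `is_private(3)` fails, since $t = 3$ colluding servers can interpolate the query polynomial and recover $e_i$.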
PIR-rate?
- Each $a_j$ is a single field element.
- Download $t = k + z$ elements in $\mathbb{F}_q$ in order to obtain $x_i \in \mathbb{F}_q^k$.
- $\implies$ PIR-rate $= \frac{k}{k+z} = \frac{k}{t}$.
Theorem: PIR-capacity for general replicated databases
The PIR-capacity for $n$ replicated databases with $z$ colluding servers, $n - t$ unresponsive servers, and $m$ files is $C = \frac{1-\frac{z}{t}}{1-(\frac{z}{t})^m}$.
- When $m \to \infty$, $C \to 1 - \frac{z}{t} = \frac{t-z}{t} = \frac{k}{t}$ (using $t = k + z$).
- The above scheme achieves PIR-capacity as $m \to \infty$.
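To see how quickly the capacity approaches the scheme's rate, the formula can be evaluated exactly with rational arithmetic. This is a small numerical sketch; the function names are hypothetical.

```python
from fractions import Fraction


def pir_capacity(z, t, m):
    """C = (1 - z/t) / (1 - (z/t)^m), as an exact rational number."""
    ratio = Fraction(z, t)
    return (1 - ratio) / (1 - ratio ** m)


def scheme_rate(k, z):
    """Rate k / (k + z) of the polynomial scheme, where t = k + z."""
    return Fraction(k, k + z)


def gap(k, z, m):
    """Difference between the capacity and the scheme's rate for m files."""
    return pir_capacity(z, k + z, m) - scheme_rate(k, z)
```

For $k = 2$, $z = 1$ the gap is positive for every finite $m$ and shrinks geometrically in $m$, consistent with the scheme being capacity-achieving only in the limit.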
Private information retrieval from coded databases
Problem setup:
Example:
- $n = 3$ servers, $m$ files $x_j$, $x_j = (x_{j, 1}, x_{j, 2})$, $k = 2$, and $q = 2$.
- Code each file with a parity code: $(x_{j, 1}, x_{j, 2}) \mapsto (x_{j, 1}, x_{j, 2}, x_{j, 1} + x_{j, 2})$.
- Server $j \in [3]$ stores all $j$-th symbols of all coded files.
Queries, answers, decoding, and privacy must be tailored for the code at hand.
With respect to a code $C$ and parameters $n, k, t, z$, such a scheme is called coded-PIR.
- The content for server $j$ is denoted by $c_j = (c_{j, 1}, \ldots, c_{j, m})$.
- $C$ is usually an MDS code.
Private information retrieval from parity-check codes
Example:
Say $z = 1$ (no collusion).
- Protocol: User has $i \sim U_{[m]}$.
- User chooses $r_1, r_2 \sim U_{\mathbb{F}_2^m}$.
- Two queries to each server:
  - $q_{1, 1} = r_1 + e_i$, $q_{1, 2} = r_2$.
  - $q_{2, 1} = r_1$, $q_{2, 2} = r_2 + e_i$.
  - $q_{3, 1} = r_1$, $q_{3, 2} = r_2$.
- Server $j$ responds with $q_{j, 1} c_j^\top$ and $q_{j, 2} c_j^\top$.
- Decoding?
  - $q_{1, 1} c_1^\top + q_{2, 1} c_2^\top + q_{3, 1} c_3^\top = r_1 (c_1 + c_2 + c_3)^\top + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}$.
  - $q_{1, 2} c_1^\top + q_{2, 2} c_2^\top + q_{3, 2} c_3^\top = r_2 (c_1 + c_2 + c_3)^\top + e_i c_2^\top = r_2 \cdot 0^\top + x_{i, 2} = x_{i, 2}$.
- Privacy?
  - Every server sees two uniformly random vectors in $\mathbb{F}_2^m$.
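The parity-code example can be simulated directly over $\mathbb{F}_2$. The sketch below is an illustration with hypothetical function names; files are pairs of bits and file indices are 0-based.

```python
import secrets


def encode(files):
    """Parity code (x1, x2) -> (x1, x2, x1 + x2); returns the contents c_1, c_2, c_3."""
    c1 = [f[0] for f in files]
    c2 = [f[1] for f in files]
    c3 = [(a + b) % 2 for a, b in zip(c1, c2)]
    return [c1, c2, c3]


def dot(q, c):
    """Inner product over F_2."""
    return sum(a * b for a, b in zip(q, c)) % 2


def plus_e_i(r, i):
    """Return r + e_i over F_2 without modifying r."""
    shifted = list(r)
    shifted[i] ^= 1
    return shifted


def make_queries(i, m):
    """Query pairs (q_{j,1}, q_{j,2}) for servers j = 1, 2, 3."""
    r1 = [secrets.randbelow(2) for _ in range(m)]
    r2 = [secrets.randbelow(2) for _ in range(m)]
    return [(plus_e_i(r1, i), r2), (r1, plus_e_i(r2, i)), (r1, r2)]


def retrieve(i, files):
    """Sum the first answers to get x_{i,1} and the second answers to get x_{i,2}."""
    servers = encode(files)
    queries = make_queries(i, len(files))
    answers = [(dot(qa, c), dot(qb, c)) for (qa, qb), c in zip(queries, servers)]
    x1 = sum(a for a, _ in answers) % 2
    x2 = sum(b for _, b in answers) % 2
    return [x1, x2]
```

Each server receives one vector that is either $r$ or $r + e_i$ and one that is a fresh random vector, so its view does not depend on $i$. In this sketch the user downloads two bits from each of the three servers.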
Proof from coding-theoretic interpretation
Let $G = (g_1^\top, g_2^\top, g_3^\top)$ be the generator matrix, and view $X$ as the $m \times k$ matrix whose rows are the files $x_1, \ldots, x_m$.
- For every file $x_j = (x_{j, 1}, x_{j, 2})$ we encode $x_j G = (x_j g_1^\top, x_j g_2^\top, x_j g_3^\top) = (c_{1, j}, c_{2, j}, c_{3, j})$.
- Server $j$ stores $X g_j^\top = (x_1^\top, \ldots, x_m^\top)^\top g_j^\top = (c_{j, 1}, \ldots, c_{j, m})^\top$.
- By multiplying by $r_1$, the servers together store a codeword in $C$:
  - $(r_1 X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top) = r_1 X G$.
- By replacing one of the $r_1$'s by $r_1 + e_i$, we introduce an error in that entry:
  - $\left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) = r_1 X G + (e_i X g_1^\top, 0, 0)$.
- Download this "erroneous" word from the servers and multiply it by $H^\top$, where $H = (h_1^\top, h_2^\top, h_3^\top)$ is the parity-check matrix:
$$
\begin{aligned}
\left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) H^\top &= \left(r_1 X G + (e_i X g_1^\top, 0,0)\right) H^\top \\
&= r_1 X G H^\top + (e_i X g_1^\top, 0,0) H^\top \\
&= 0 + x_i g_1^\top h_1 \\
&= x_{i, 1}.
\end{aligned}
$$
The last step uses $g_1 = (1, 0)$ and $h_1 = 1$ for the parity code, so $x_i g_1^\top h_1 = x_{i, 1}$.
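The syndrome computation above can be checked numerically for the $[3, 2]$ parity code. This is a sketch under the assumption that $G$ is the systematic generator matrix with columns $g_1 = (1, 0)$, $g_2 = (0, 1)$, $g_3 = (1, 1)$ and $H = (1, 1, 1)$; the helper names are hypothetical and indices are 0-based.

```python
G = [[1, 0, 1],
     [0, 1, 1]]  # assumed generator matrix of the [3, 2] parity code over F_2
H = [1, 1, 1]    # its single parity-check row


def vec_mat(row, matrix):
    """Row vector times matrix over F_2."""
    return [sum(a * b for a, b in zip(row, col)) % 2 for col in zip(*matrix)]


def syndrome(word):
    """word * H^T over F_2; zero exactly on codewords."""
    return sum(a * b for a, b in zip(word, H)) % 2


def downloaded_word(r1, i, X):
    """((r1 + e_i) X g_1^T, r1 X g_2^T, r1 X g_3^T) for the m x 2 file matrix X."""
    shifted = list(r1)
    shifted[i] ^= 1
    clean = vec_mat(vec_mat(r1, X), G)  # r1 X G, a codeword
    erroneous_first = vec_mat(vec_mat(shifted, X), G)[0]
    return [erroneous_first, clean[1], clean[2]]
```

For any `r1`, the syndrome of `vec_mat(vec_mat(r1, X), G)` is 0, while the syndrome of `downloaded_word(r1, i, X)` equals `X[i][0]`, i.e. $x_{i, 1}$.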
In homework we will show that this works with any MDS code ($z = 1$).
- Say we obtained $x_i g_1^\top, \ldots, x_i g_k^\top$ ($d - 1$ at a time, how?).
- $(x_i g_1^\top, \ldots, x_i g_k^\top) = x_i B$, where $B$ is a $k \times k$ submatrix of $G$.
- $B$ is a $k \times k$ submatrix of the generator matrix of an MDS code $\implies$ invertible!
- $\implies$ Obtain $x_i$.
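As a small worked instance of the inversion step, take the $[3, 2]$ parity code over $\mathbb{F}_2$ with $g_1 = (1, 0)$ and $g_3 = (1, 1)$, and suppose (as an assumption for illustration) that the user obtained the symbols corresponding to columns 1 and 3 of $G$:

$$
B = (g_1^\top, g_3^\top) = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad B^{-1} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \text{ over } \mathbb{F}_2,
$$

$$
(x_i g_1^\top, x_i g_3^\top) B^{-1} = (x_{i, 1}, x_{i, 1} + x_{i, 2}) \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} = (x_{i, 1}, x_{i, 1} + x_{i, 1} + x_{i, 2}) = (x_{i, 1}, x_{i, 2}).
$$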
Tip
error + known location $\implies$ erasure.
$d = 2 \implies$ 1 erasure is correctable.