Files
NoteNextra-origin/content/CSE5313/CSE5313_L19.md
2025-11-04 12:43:23 -06:00

9.2 KiB
Raw Blame History

CSE5313 Coding and information theory for data science (Lecture 19)

Private information retrieval

Problem setup

Premise:

  • Database X = \{x_1, \ldots, x_m\}, each x_i \in \mathbb{F}_q^k is a "file" (e.g., medical record).
  • X is coded X \mapsto \{y_1, \ldots, y_n\}, y_j stored at server j.
  • The user (physician) wants x_i.
  • The user sends a query q_j \sim Q_j to server j.
  • Server j responds with a_j \sim A_j.

Decodability:

  • The user can retrieve the file: H(X_i | A_1, \ldots, A_n) = 0.

Privacy:

  • i is seen as i \sim U = U_{m}, reflecting server's lack of knowledge.
  • i must be kept private: I(Q_j; U) = 0 for all j \in n.

In short, we want to retrieve x_i from the servers without revealing i to the servers.

Private information retrieval from Replicated Databases

Simple case, one server

Say n = 1, y_1 = X.

  • All data is stored in one server.
  • Simple solution:
  • q_1 = "send everything".
  • a_1 = y_1 = X.

Theorem: Information Theoretic PIR with n = 1 can only be achieved by downloading the entire database.

  • Can we do better if n > 1?

Collusion parameter

Key question for n > 1: Can servers collude?

  • I.e., does server j see any Q_\ell, \ell \neq j?
  • Key assumption:
    • Privacy parameter z.
    • At most z servers can collude.
    • z = 1\implies No collusion.
  • Requirement for z = 1: I(Q_j; U) = 0 for all j \in n.
  • Requirement for a general z:
    • I(Q_\mathcal{T}; U) = 0 for all \mathcal{T} \in n, |\mathcal{T}| \leq z, where Q_\mathcal{T} = Q_\ell for all \ell \in \mathcal{T}.
  • Motivation:
    • Interception of communication links.
    • Data breaches.

Other assumptions:

  • Computational Private information retrieval (even all the servers are hacked, still cannot get the information -> solve np-hard problem):
  • Non-zero MI

Private information retrieval from 2-replicated databases

First PIR protocol: Chor et al. FOCS 95.

  • The data X = \{x_1, \ldots, x_m\} is replicated on two servers.
    • z = 1, i.e., no collusion.
  • Protocol: User has i \sim U_{m}.
    • User generates r \sim U_{\mathbb{F}_q^m}.
    • q_1 = r, q_2 = r + e_i (e_i \in \mathbb{F}_q^m is the $i$-th unit vector, q_2 is equivalent to one-time pad encryption of x_i with key r).
    • a_j = q_j X^\top = \sum_{\ell \in m} q_j, \ell x_\ell
    • Linear combination of the files according to the query vector q_j.
  • Decoding?
    • a_2 - a_1 = q_2 - q_1 X^\top = e_i X^\top = x_i.
  • Download?
    • a_j = size of file \implies downloading twice the size of the file.
  • Privacy?
    • Since z = 1, need to show I(U; Q_i) = 0.
      • I(U; Q_1) = I(e_U; F) = 0 since U and F are independent.
      • I(U; Q_2) = I(e_U; F + e_U) = 0 since this is one-time pad!
Parameters and notations in PIR

Parameters of the system:

  • n = # servers (as in storage).
  • m = # files.
  • k = size of each file (as in storage).
  • z = max. collusion (as in secret sharing).
  • t = # of answers required to obtain x_i (as in secret sharing).
    • n - t servers are “stragglers”, i.e., might not respond.

Figures of merit:

  • PIR-rate = \# desired symbols / \# downloaded symbols
  • PIR-capacity = largest possible rate.

Notaional conventions:

-The dataset X = \{x_j\}_{j \in m} = \{x_{j, \ell}\}_{(j, \ell) \in [m] \times [k]} is seen as a vector in \mathbb{F}_q^{mk}.

  • Index \mathbb{F}_q^{mk} using [m] \times [k], i.e., x_{j, \ell} is the $\ell$-th symbol of the $j$-th file.

Private information retrieval from 4-replicated databases

Consider n = 4 replicated servers, file size k = 2, collusion z = 1.

Protocol: User has i \sim U_{m}.

  • Fix distinct nonzero \alpha_1, \ldots, \alpha_4 \in \mathbb{F}_q.
  • Choose r \sim U_{\mathbb{F}_q^{2m}}.
  • User sends q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r to each server j.
  • Server j responds with
    
    a_j = q_j X^\top = e_{i, 1} X^\top + \alpha_j e_{i, 2} X^\top + \alpha_j^2 r X^\top
    
    • This is an evaluation at \alpha_j of the polynomial f_i(w) = x_{i, 1} + x_{i, 2} \cdot w + r \cdot w^2.
    • Where r is some random combination of the entries of X.
  • Decoding?
    • Any 3 responses suffice to interpolate f_i and obtain x_i = x_{i, 1}, x_{i, 2}.
    • \implies t = 3, (one straggler is allowed)
  • Privacy?
    • Does q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r look familiar?
    • This is a share in ramp scheme with vector messages m_1 = e_{i, 1}, m_2 = e_{i, 2}, m_i \in \mathbb{F}_q^{2m}.
    • This is equivalent to 2m "parallel" ramp scheme over \mathbb{F}_q.
    • Each one reveals nothing to any z = 1 shareholders \implies Private!

Private information retrieval from general replicated databases

n servers, m files, file size k, X \in \mathbb{F}_q^{mk}.

Server decodes x_i from any t responses.

Any \leq z servers might collude to infer i (z < t).

Protocol: User has i \sim U_{m}.

  • User chooses r_1, \ldots, r_z \sim U_{\mathbb{F}_q^{mk}}.
  • User sends q_j = \sum_{\ell=1}^k e_{i, \ell} \alpha_j^{\ell-1} + \sum_{\ell=1}^z r_\ell \alpha_j^{k+\ell-1} to each server j.
  • Server j responds with a_j = q_j X^\top = f_i(\alpha_j).
    • f_i(w) = \sum_{\ell=1}^k e_{i, \ell} X^\top w^{\ell-1} + \sum_{\ell=1}^z r_\ell X^\top w^{k+\ell-1} (random combinations of X).
    • Caveat: must have t = k + z.
    • \implies \deg f_i = k + z - 1 = t - 1.
  • Decoding?
    • Interpolation from any t evaluations of f_i.
  • Privacy?
    • Against any z = t - k colluding servers, immediate from the proof of the ramp scheme.

PIR-rate?

  • Each a_j is a single field element.
  • Download t = k + z elements in \mathbb{F}_q in order to obtain x_i \in \mathbb{F}_q^k.
  • \implies PIR-rate = \frac{k}{k+z} = \frac{k}{t}.

Theorem: PIR-capacity for general replicated databases

The PIR-capacity for n replicated databases with z colluding servers, n - t unresponsive servers, and m files is C = \frac{1-\frac{z}{t}}{1-(\frac{z}{t})^m}.

  • When m \to \infty, C \to 1 - \frac{z}{t} = \frac{t-z}{t} = \frac{k}{t}.
  • The above scheme achieves PIR-capacity as m \to \infty

Private information retrieval from coded databases

Problem setup:

Example:

  • n = 3 servers, m files x_j, x_j = x_{j, 1}, x_{j, 2}, k = 2, and q = 2.
  • Code each file with a parity code: x_{j, 1}, x_{j, 2} \mapsto x_{j, 1}, x_{j, 2}, x_{j, 1} + x_{j, 2}.
  • Server j \in 3 stores all $j$-th symbols of all coded files.

Queries, answers, decoding, and privacy must be tailored for the code at hand.

With respect to a code C and parameters n, k, t, z, such scheme is called coded-PIR.

  • The content for server j is denoted by c_j = c_{j, 1}, \ldots, c_{j, m}.
  • C is usually an MDS code.

Private information retrieval from parity-check codes

Example:

Say z = 1 (no collusion).

  • Protocol: User has i \sim U_{m}.
  • User chooses r_1, r_2 \sim U_{\mathbb{F}_2^m}.
  • Two queries to each server:
    • q_{1, 1} = r_1 + e_i, q_{1, 2} = r_2.
    • q_{2, 1} = r_1, q_{2, 2} = r_2 + e_i.
    • q_{3, 1} = r_1, q_{3, 2} = r_2.
  • Server j responds with q_{j, 1} c_j^\top and q_{j, 2} c_j^\top.
  • Decoding?
    • q_{1, 1} c_1^\top + q_{2, 1} c_2^\top + q_{3, 1} c_3^\top = r_1 c_1 + c_2 + c_3 + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}.
    • q_{1, 1} c_1^\top + q_{2, 1} c_2^\top + q_{3, 1} c_3^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}.
    • q_{1, 2} c_1^\top + q_{2, 2} c_2^\top + q_{3, 2} c_3^\top = r_2 c_1 + c_2 + c_3^\top + e_i c_2^\top = x_{i, 2}.
  • Privacy?
    • Every server sees two uniformly random vectors in \mathbb{F}_2^m.
Proof from coding-theoretic interpretation

Let G = g_1^\top, g_2^\top, g_3^\top be the generator matrix.

  • For every file x_j = x_{j, 1}, x_{j, 2} we encode x_j G = (x_{j, 1} g_1^\top, x_{j, 2} g_2^\top, x_{j, 1} g_3^\top) = (c_{j, 1}, c_{j, 2}, c_{j, 3}).

  • Server j stores X g_j^\top = (x_1^\top, \ldots, x_m^\top)^\top g_j^\top = (c_{j, 1}, \ldots, c_{j, m})^\top.

  • By multiplying by r_1, the servers together store a codeword in C:

    • r_1 X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top = r_1 X G.
  • By replacing one of the $r_1$s by r_1 + e_i, we introduce an error in that entry:

    • \left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) = r_1 X G + (e_i X g_1^\top, 0,0).
  • Downloading this “erroneous” word from the servers and multiply by H = h_1^\top, h_2^\top, h_3^\top be the parity-check matrix.


\begin{aligned}
\left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) H^\top &= \left(r_1 X G + (e_i X g_1^\top, 0,0)\right) H^\top \\
&= r_1 X G H^\top + (e_i X g_1^\top, 0,0) H^\top \\
&= 0 + x_{i, 1} g_1^\top \\
&= x_{i, 1}.
\end{aligned}

In homework we will show tha this work with any MDS code (z=1).

  • Say we obtained x_{i, 1} g_1^\top, \ldots, x_{i, k} g_k^\top (𝑑 1 at a time, how?).
  • x_{i, 1} g_1^\top, \ldots, x_{i, k} g_k^\top = x_{i, B}, where B is a k \times k submatrix of G.
  • B is a k \times k submatrix of G \implies invertible! \implies Obtain x_{i}.

Tip

error + known location \implies erasure. d = 2 \implies 1 erasure is correctable.