Trance-0/NoteNextra-origin

Fork 0

Files

Trance-0 614479e4d0 update notations

2025-11-04 12:43:23 -06:00

9.2 KiB

Raw Blame History

CSE5313 Coding and information theory for data science (Lecture 19)

Private information retrieval

Problem setup

Premise:

Database X = \{x_1, \ldots, x_m\}, each x_i \in \mathbb{F}_q^k is a "file" (e.g., medical record).
X is coded X \mapsto \{y_1, \ldots, y_n\}, y_j stored at server j.
The user (physician) wants x_i.
The user sends a query q_j \sim Q_j to server j.
Server j responds with a_j \sim A_j.

Decodability:

The user can retrieve the file: H(X_i | A_1, \ldots, A_n) = 0.

Privacy:

i is seen as i \sim U = U_{m}, reflecting server's lack of knowledge.
i must be kept private: I(Q_j; U) = 0 for all j \in n.

In short, we want to retrieve x_i from the servers without revealing i to the servers.

Private information retrieval from Replicated Databases

Simple case, one server

Say n = 1, y_1 = X.

All data is stored in one server.
Simple solution:
q_1 = "send everything".
a_1 = y_1 = X.

Theorem: Information Theoretic PIR with n = 1 can only be achieved by downloading the entire database.

Can we do better if n > 1?

Collusion parameter

Key question for n > 1: Can servers collude?

I.e., does server j see any Q_\ell, \ell \neq j?
Key assumption:
- Privacy parameter z.
- At most z servers can collude.
- z = 1\implies No collusion.
Requirement for z = 1: I(Q_j; U) = 0 for all j \in n.
Requirement for a general z:
- I(Q_\mathcal{T}; U) = 0 for all \mathcal{T} \in n, |\mathcal{T}| \leq z, where Q_\mathcal{T} = Q_\ell for all \ell \in \mathcal{T}.
Motivation:
- Interception of communication links.
- Data breaches.

Other assumptions:

Computational Private information retrieval (even all the servers are hacked, still cannot get the information -> solve np-hard problem):
Non-zero MI

Private information retrieval from 2-replicated databases

First PIR protocol: Chor et al. FOCS ‘95.

The data X = \{x_1, \ldots, x_m\} is replicated on two servers.
- z = 1, i.e., no collusion.
Protocol: User has i \sim U_{m}.
- User generates r \sim U_{\mathbb{F}_q^m}.
- q_1 = r, q_2 = r + e_i (e_i \in \mathbb{F}_q^m is the $i$-th unit vector, q_2 is equivalent to one-time pad encryption of x_i with key r).
- a_j = q_j X^\top = \sum_{\ell \in m} q_j, \ell x_\ell
- Linear combination of the files according to the query vector q_j.
Decoding?
- a_2 - a_1 = q_2 - q_1 X^\top = e_i X^\top = x_i.
Download?
- a_j = size of file \implies downloading twice the size of the file.
Privacy?
- Since z = 1, need to show I(U; Q_i) = 0.
  - I(U; Q_1) = I(e_U; F) = 0 since U and F are independent.
  - I(U; Q_2) = I(e_U; F + e_U) = 0 since this is one-time pad!

Parameters and notations in PIR

Parameters of the system:

n = # servers (as in storage).
m = # files.
k = size of each file (as in storage).
z = max. collusion (as in secret sharing).
t = # of answers required to obtain x_i (as in secret sharing).
- n - t servers are “stragglers”, i.e., might not respond.

Figures of merit:

PIR-rate = \# desired symbols / \# downloaded symbols
PIR-capacity = largest possible rate.

Notaional conventions:

-The dataset X = \{x_j\}_{j \in m} = \{x_{j, \ell}\}_{(j, \ell) \in [m] \times [k]} is seen as a vector in \mathbb{F}_q^{mk}.

Index \mathbb{F}_q^{mk} using [m] \times [k], i.e., x_{j, \ell} is the $\ell$-th symbol of the $j$-th file.

Private information retrieval from 4-replicated databases

Consider n = 4 replicated servers, file size k = 2, collusion z = 1.

Protocol: User has i \sim U_{m}.

Fix distinct nonzero \alpha_1, \ldots, \alpha_4 \in \mathbb{F}_q.
Choose r \sim U_{\mathbb{F}_q^{2m}}.
User sends q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r to each server j.
Server j responds with
```
a_j = q_j X^\top = e_{i, 1} X^\top + \alpha_j e_{i, 2} X^\top + \alpha_j^2 r X^\top
```
- This is an evaluation at \alpha_j of the polynomial f_i(w) = x_{i, 1} + x_{i, 2} \cdot w + r \cdot w^2.
- Where r is some random combination of the entries of X.
Decoding?
- Any 3 responses suffice to interpolate f_i and obtain x_i = x_{i, 1}, x_{i, 2}.
- \implies t = 3, (one straggler is allowed)
Privacy?
- Does q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r look familiar?
- This is a share in ramp scheme with vector messages m_1 = e_{i, 1}, m_2 = e_{i, 2}, m_i \in \mathbb{F}_q^{2m}.
- This is equivalent to 2m "parallel" ramp scheme over \mathbb{F}_q.
- Each one reveals nothing to any z = 1 shareholders \implies Private!

Private information retrieval from general replicated databases

n servers, m files, file size k, X \in \mathbb{F}_q^{mk}.

Server decodes x_i from any t responses.

Any \leq z servers might collude to infer i (z < t).

Protocol: User has i \sim U_{m}.

User chooses r_1, \ldots, r_z \sim U_{\mathbb{F}_q^{mk}}.
User sends q_j = \sum_{\ell=1}^k e_{i, \ell} \alpha_j^{\ell-1} + \sum_{\ell=1}^z r_\ell \alpha_j^{k+\ell-1} to each server j.
Server j responds with a_j = q_j X^\top = f_i(\alpha_j).
- f_i(w) = \sum_{\ell=1}^k e_{i, \ell} X^\top w^{\ell-1} + \sum_{\ell=1}^z r_\ell X^\top w^{k+\ell-1} (random combinations of X).
- Caveat: must have t = k + z.
- \implies \deg f_i = k + z - 1 = t - 1.
Decoding?
- Interpolation from any t evaluations of f_i.
Privacy?
- Against any z = t - k colluding servers, immediate from the proof of the ramp scheme.

PIR-rate?

Each a_j is a single field element.
Download t = k + z elements in \mathbb{F}_q in order to obtain x_i \in \mathbb{F}_q^k.
\implies PIR-rate = \frac{k}{k+z} = \frac{k}{t}.

Theorem: PIR-capacity for general replicated databases

The PIR-capacity for n replicated databases with z colluding servers, n - t unresponsive servers, and m files is C = \frac{1-\frac{z}{t}}{1-(\frac{z}{t})^m}.

When m \to \infty, C \to 1 - \frac{z}{t} = \frac{t-z}{t} = \frac{k}{t}.
The above scheme achieves PIR-capacity as m \to \infty

Private information retrieval from coded databases

Problem setup:

Example:

n = 3 servers, m files x_j, x_j = x_{j, 1}, x_{j, 2}, k = 2, and q = 2.
Code each file with a parity code: x_{j, 1}, x_{j, 2} \mapsto x_{j, 1}, x_{j, 2}, x_{j, 1} + x_{j, 2}.
Server j \in 3 stores all $j$-th symbols of all coded files.

Queries, answers, decoding, and privacy must be tailored for the code at hand.

With respect to a code C and parameters n, k, t, z, such scheme is called coded-PIR.

The content for server j is denoted by c_j = c_{j, 1}, \ldots, c_{j, m}.
C is usually an MDS code.

Private information retrieval from parity-check codes