NoteNextra-origin/content/CSE5313/CSE5313_L20.md
2025-11-06 13:59:31 -06:00


CSE5313 Coding and information theory for data science (Lecture 20)

Review for Private Information Retrieval

PIR from replicated databases

For 2 replicated databases, we have the following protocol:

  • User has i \sim U_{m}.
  • User chooses r_1, r_2 \sim U_{\mathbb{F}_2^m}.
  • Two queries to each server:
    • q_{1, 1} = r_1 + e_i, q_{1, 2} = r_2.
    • q_{2, 1} = r_1, q_{2, 2} = r_2 + e_i.
  • Server j responds with q_{j, 1} c_j^\top and q_{j, 2} c_j^\top.
  • Decoding?
    • q_{1, 1} c_1^\top + q_{2, 1} c_2^\top = r_1 (c_1 + c_2)^\top + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}.
    • q_{1, 2} c_1^\top + q_{2, 2} c_2^\top = r_2 (c_1 + c_2)^\top + e_i c_2^\top = r_2 \cdot 0^\top + x_{i, 2} = x_{i, 2}.

PIR-rate is \frac{k}{2k} = \frac{1}{2}.
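
The cancellation in the decoding step can be sketched in Python. This is a simplified illustration, not the exact protocol above: it assumes each file is a single bit, both servers hold the same bit vector, and only one query pair (r + e_i, r) is sent. The function name `two_server_pir` is hypothetical.

```python
import random

def two_server_pir(database, i, rng):
    """Retrieve database[i] from two replicas without revealing i to either one.

    database: list of bits (assumed one bit per file), stored on both servers.
    """
    m = len(database)
    r = [rng.randrange(2) for _ in range(m)]  # uniform mask over F_2^m
    q1 = r[:]
    q1[i] ^= 1                                # query to server 1: r + e_i
    q2 = r                                    # query to server 2: r
    # Each server returns the inner product of its query with the database.
    a1 = sum(q & x for q, x in zip(q1, database)) % 2
    a2 = sum(q & x for q, x in zip(q2, database)) % 2
    # r cancels over F_2, leaving e_i . database = database[i].
    return a1 ^ a2

db = [1, 0, 1, 1, 0]
recovered = [two_server_pir(db, i, random.Random(i)) for i in range(len(db))]
```

Each server sees a uniformly random vector on its own, which is why a single server learns nothing about i.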

PIR from coded parity-check databases

For 3 coded parity-check databases, we have the following protocol:

  • User has i \sim U_{m}.
  • User chooses r_1, r_2, r_3 \sim U_{\mathbb{F}_2^m}.
  • Three queries to each server:
    • q_{1, 1} = r_1 + e_i, q_{1, 2} = r_2, q_{1, 3} = r_3.
    • q_{2, 1} = r_1, q_{2, 2} = r_2 + e_i, q_{2, 3} = r_3.
    • q_{3, 1} = r_1, q_{3, 2} = r_2, q_{3, 3} = r_3 + e_i.
  • Server j responds with q_{j, 1} c_j^\top, q_{j, 2} c_j^\top, q_{j, 3} c_j^\top.
  • Decoding?
    • q_{1, 1} c_1^\top + q_{2, 1} c_2^\top + q_{3, 1} c_3^\top = r_1 (c_1 + c_2 + c_3)^\top + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}.
    • q_{1, 2} c_1^\top + q_{2, 2} c_2^\top + q_{3, 2} c_3^\top = r_2 (c_1 + c_2 + c_3)^\top + e_i c_2^\top = x_{i, 2}.
    • q_{1, 3} c_1^\top + q_{2, 3} c_2^\top + q_{3, 3} c_3^\top = r_3 (c_1 + c_2 + c_3)^\top + e_i c_3^\top = x_{i, 3}.

PIR-rate is \frac{k}{3k} = \frac{1}{3}.
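
A minimal sketch of the three-server protocol, under an assumption the notes do not spell out: each file has two bits, server 1 stores the first bits, server 2 the second bits, and server 3 their XOR, so that c_1 + c_2 + c_3 = 0 over \mathbb{F}_2. Under that assumption the third decoded value is the parity x_{i,1} + x_{i,2}. The function name `parity_pir` is hypothetical.

```python
import random

def parity_pir(files, i, rng):
    """Run the three queries against a single-parity-check coded database.

    files: list of (bit, bit) pairs. Returns the three decoded symbols for file i.
    """
    m = len(files)
    c1 = [f[0] for f in files]
    c2 = [f[1] for f in files]
    c3 = [a ^ b for a, b in zip(c1, c2)]      # assumed parity column
    servers = [c1, c2, c3]
    masks = [[rng.randrange(2) for _ in range(m)] for _ in range(3)]  # r_1, r_2, r_3

    def dot(q, c):
        return sum(a & b for a, b in zip(q, c)) % 2

    decoded = []
    for t in range(3):                         # t-th query round
        total = 0
        for j in range(3):                     # server j receives q_{j,t}
            q = masks[t][:]
            if j == t:
                q[i] ^= 1                      # only server t gets r_t + e_i
            total ^= dot(q, servers[j])
        decoded.append(total)                  # equals e_i . c_t
    return decoded

files = [(1, 0), (0, 1), (1, 1), (0, 0)]
symbols = parity_pir(files, 2, random.Random(0))
```

The mask r_t is multiplied by c_1 + c_2 + c_3 = 0 when the three answers are summed, so only the e_i term survives.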

Beyond z=1

Star-product theme

Given x=(x_1, \ldots, x_n) and y=(y_1, \ldots, y_n) over \mathbb{F}_q, the star-product is defined as:


x \star y = (x_1 y_1, \ldots, x_n y_n)

Given two linear codes, C,D\subseteq \mathbb{F}_q^n, the star-product code is defined as:


C \star D = \operatorname{span}_{\mathbb{F}_q} \{x \star y \mid x \in C, y \in D\}

Singleton bound for star-product:


d_{C \star D} \leq n-\dim C-\dim D+2
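
As a sanity check on the bound, the sketch below computes the dimension of C \star D for two small Reed-Solomon codes over a prime field. The field size 7, the evaluation points, and all function names are illustrative assumptions. It uses the fact that C \star D is spanned by the star-products of basis vectors of C and D.

```python
P = 7  # prime field size (illustrative)

def rs_generator(dim, points):
    """Vandermonde generator of a Reed-Solomon code: row t holds a**t for each point a."""
    return [[pow(a, t, P) for a in points] for t in range(dim)]

def star(x, y):
    """Coordinate-wise product of two vectors over F_P."""
    return [(a * b) % P for a, b in zip(x, y)]

def rank_mod_p(rows):
    """Rank of a matrix over F_P by Gauss-Jordan elimination."""
    rows = [r[:] for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] % P), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], P - 2, P)       # Fermat inverse
        rows[rank] = [(v * inv) % P for v in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][col] % P:
                f = rows[r][col]
                rows[r] = [(a - f * b) % P for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank

def star_product_dimension(G_C, G_D):
    """dim(C * D): rank of all pairwise star-products of generator rows."""
    return rank_mod_p([star(g, h) for g in G_C for h in G_D])

points = [1, 2, 3, 4, 5, 6]
n, k, z = len(points), 2, 3
dim_star = star_product_dimension(rs_generator(k, points), rs_generator(z, points))
```

Here the star-product of two Reed-Solomon codes on the same points is again a Reed-Solomon code, of dimension k + z - 1, so its minimum distance n - (k + z - 1) + 1 meets the bound with equality.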

PIR from a database coded with any MDS code and z>1

To generalize the previous scheme to z > 1, we need to encode multiple $r$'s together.

  • As in the ramp scheme.

Recall that in the ramp scheme we use r_1, \ldots, r_z \sim U_{\mathbb{F}_q^k} as our key vector to protect against collusion of the servers.

In the star-product scheme:

  • Files are coded with an MDS code C.
  • The multiple $r$'s are coded with an MDS code D.
  • The scheme is based on the minimum distance of C \star D.

To code the data:

  • Let C \subseteq \mathbb{F}_q^n be an MDS code of dimension k.
  • For all j \in [m], encode file x_j = (x_{j, 1}, \ldots, x_{j, k}) using G_C:

\begin{pmatrix}
x_{1, 1} & x_{1, 2} & \cdots & x_{1, k}\\
x_{2, 1} & x_{2, 2} & \cdots & x_{2, k}\\
\vdots & \vdots & \ddots & \vdots\\
x_{m, 1} & x_{m, 2} & \cdots & x_{m, k}
\end{pmatrix} \cdot G_C = \begin{pmatrix}
c_{1, 1} & c_{1, 2} & \cdots & c_{1, n}\\
c_{2, 1} & c_{2, 2} & \cdots & c_{2, n}\\
\vdots & \vdots & \ddots & \vdots\\
c_{m, 1} & c_{m, 2} & \cdots & c_{m, n}
\end{pmatrix}
  • For all j \in [n], store c_j = (c_{1, j}, c_{2, j}, \ldots, c_{m, j}) (a column of the above matrix) in server j.

Let r_1, \ldots, r_z \sim U_{\mathbb{F}_q^m}.

To code the queries:

  • Let D \subseteq \mathbb{F}_q^n be an MDS code of dimension z.
  • Encode the $r_j$'s using G_D=[g_1^\top, \ldots, g_n^\top].

(r_1^\top, \ldots, r_z^\top) \cdot G_D = \begin{pmatrix}
r_{1, 1} & r_{2, 1} & \cdots & r_{z, 1}\\
r_{1, 2} & r_{2, 2} & \cdots & r_{z, 2}\\
\vdots & \vdots & \ddots & \vdots\\
r_{1, m} & r_{2, m} & \cdots & r_{z, m}
\end{pmatrix}
\cdot G_D=\left((r_1^\top,\ldots, r_z^\top)g_1^\top,\ldots, (r_1^\top,\ldots, r_z^\top)g_n^\top \right)

To introduce the "errors in known locations" to the encoded $r_j$'s:

  • Let W \in \{0, 1\}^{m \times n} with some d_{C \star D} - 1 entries in its $i$-th row equal to 1.
  • These are the entries we will retrieve.

For every server j \in [n], send q_j = (r_1^\top, \ldots, r_z^\top) g_j^\top + w_j, where w_j is the $j$-th column of W.

  • This is similar to ramp scheme, where w_j is the "message".
  • Privacy against collusion of z servers.

Response from server: a_j = q_j c_j^\top.

Decoding? Let Q \in \mathbb{F}_q^{m \times n} be a matrix whose columns are the $q_j$'s.


Q = \begin{pmatrix}
r_1^\top & \cdots & r_z^\top
\end{pmatrix} \cdot G_D + W
  • The user has

\begin{aligned}
(q_1 c_1^\top, \ldots, q_n c_n^\top) &= \left(\sum_{j \in [m]} q_{1, j} c_{j, 1}, \ldots, \sum_{j \in [m]} q_{n, j} c_{j, n}\right) \\
&=\sum_{j \in [m]} (q_{1,j}c_{j, 1}, \ldots, q_{n,j}c_{j, n}) \\
&=\sum_{j \in [m]} q^j \star c^j
\end{aligned}

where q^j is a row of Q and c^j is a codeword in C (an [n, k]_q MDS code).

We have:

  • Q=(r_1^\top, \ldots, r_z^\top) \cdot G_D + W
  • W\in \{0, 1\}^{m \times n} with some d_{C \star D} - 1 entries in its $i$-th row equal to 1.
  • (q_1 c_1^\top, \ldots, q_n c_n^\top)=\sum_{j \in [m]} q^j \star c^j
  • Each q^j is a row of Q
    • For j \neq i, q^j = d^j is a codeword in D
    • q^i = d^i + w^i, where d^i \in D and w^i is the $i$-th row of W
  • Therefore:

\begin{aligned}
\sum_{j \in [m]} q^j \star c^j &= \sum_{j \neq i} (d^j \star c^j) + ((d^i + w^i) \star c^i) \\
&= \sum_{j \in [m]} (d^j \star c^j) + w^i \star c^i \\
&= (\text{codeword in } C \star D )+( \text{noise of Hamming weight } \leq d_{C \star D} - 1)
\end{aligned}

Multiply by H_{C \star D} and get d_{C \star D} - 1 elements of c^i.

  • Recall that c^i = x_i \cdot G_C
  • Repeat \frac{k}{d_{C \star D} - 1} times to obtain k elements of c^i.
    • Suffices to obtain x_i, since C is an [n, k]_q MDS code.
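
One round of the scheme can be sketched as follows. Several choices here are assumptions for illustration: the prime field size 13, Reed-Solomon codes for both C and D, and the parameters in the usage lines. The decoding step also swaps in a different but equivalent technique: instead of multiplying by H_{C \star D}, it interpolates the C \star D codeword from the positions where W is zero and subtracts it, which works because the "error" locations are known. All function names are hypothetical.

```python
import random

P = 13  # prime field size (illustrative)

def rs_generator(dim, points):
    """Vandermonde generator of a Reed-Solomon code: row t holds a**t for each point a."""
    return [[pow(a, t, P) for a in points] for t in range(dim)]

def matmul(A, B):
    """Matrix product over F_P."""
    return [[sum(a * b for a, b in zip(row, col)) % P for col in zip(*B)] for row in A]

def interpolate(xs, ys, x):
    """Value at x of the unique polynomial of degree < len(xs) through (xs, ys), mod P."""
    total = 0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for l, xl in enumerate(xs):
            if l != j:
                num = num * (x - xl) % P
                den = den * (xj - xl) % P
        total = (total + yj * num * pow(den, P - 2, P)) % P
    return total

def star_pir_round(X, i, positions, k, z, points, rng):
    """Return {server index: symbol of c^i} for the d_{C*D} - 1 chosen positions."""
    m, n = len(X), len(points)
    C = matmul(X, rs_generator(k, points))            # row j is the codeword c^j
    R = [[rng.randrange(P) for _ in range(z)] for _ in range(m)]  # columns are r_1..r_z
    Q = matmul(R, rs_generator(z, points))            # every row is a codeword of D
    for col in positions:
        Q[i][col] = (Q[i][col] + 1) % P               # add W (ones in row i only)
    # Server `col` answers with the inner product of its query and its stored column.
    answers = [sum(Q[j][col] * C[j][col] for j in range(m)) % P for col in range(n)]
    # Positions outside W carry a pure C*D codeword (a polynomial of degree < k+z-1).
    clean = [col for col in range(n) if col not in positions]
    xs = [points[col] for col in clean]
    ys = [answers[col] for col in clean]
    return {col: (answers[col] - interpolate(xs, ys, points[col])) % P
            for col in positions}

points = list(range(1, 8))                             # n = 7 servers
k, z = 3, 2                                            # d_{C*D} - 1 = n - k - z + 1 = 3
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]    # m = 4 files of k symbols
retrieved = star_pir_round(X, 2, [0, 3, 5], k, z, points, random.Random(1))
```

With these parameters one round already yields k = 3 symbols of c^i, so a single round suffices to reconstruct x_i.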

PIR-rate:

  • = \frac{k}{\# \text{ downloaded elements}} = \frac{k}{\frac{k}{d_{C \star D} - 1} \cdot n} = \frac{d_{C \star D} - 1}{n}
  • Singleton bound for star-product: d_{C \star D} \leq n - \dim C - \dim D + 2.
  • Achieved with equality if C and D are Reed-Solomon codes.
  • PIR-rate = \frac{n - \dim C - \dim D + 1}{n} = \frac{n - k - z + 1}{n}.
  • Intuition:
    • "paying" k for "reconstruction from any $k$".
    • "paying" z for "protection against colluding sets of size $z$".
  • Capacity unknown! (as of 2022).
    • Known for special cases, e.g., k = 1, z = 1, certain types of schemes, etc.
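
The rate formula can be evaluated for concrete parameters with a short helper. This is a sketch that assumes the Singleton bound for the star-product holds with equality (as for Reed-Solomon codes); the function name is hypothetical.

```python
from fractions import Fraction

def star_product_pir_rate(n, k, z):
    """PIR rate (n - k - z + 1) / n, assuming d_{C*D} = n - k - z + 2."""
    d = n - k - z + 2
    if d < 2:
        raise ValueError("need n - k - z + 1 >= 1 to retrieve at least one symbol per round")
    return Fraction(d - 1, n)

rate_example = star_product_pir_rate(10, 4, 3)   # Fraction(2, 5)
```

For n = 2, k = 1, z = 1 and for n = 3, k = 2, z = 1 the formula gives 1/2 and 1/3, matching the rates stated for the two review protocols.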

PIR over graphs

Graph-based replication:

  • Every file is replicated twice on two separate servers.
  • Every two servers have at most one file in common.
  • "file" = "granularity" of data, i.e., the smallest information unit shared by any two servers.

A server that stores (x_{i, j})_{j=1}^d receives (q_{i, j})_{j=1}^d, and replies with \sum_{j=1}^d q_{i, j} \cdot x_{i, j}.

The idea:

  • Consider a 2-server replicated PIR and "split" the queries between the servers.
  • Sum the responses, unwanted files "cancel out", while x_i does not.

Problem: Collusion.

Solution: Add per server randomness.

Good for any graph, and any q \geq 3 (for simplicity assume 2 | q).

The protocol:

  • Choose random \gamma \in \mathbb{F}_q^n, \nu \in \mathbb{F}_q^m, and h \in \mathbb{F}_q \setminus \{0, 1\}.
  • Queries:
    • If node j is incident with edge \ell, send q_{j, \ell} = \gamma_j \cdot \nu_\ell to node j.
    • I.e., if server j stores file \ell.
  • Except one node j_0 that stores x_i, which gets q_{j_0, i} = h \cdot \gamma_{j_0} \cdot \nu_i.
  • Server j responds with a_j = \sum_{\ell=1}^d q_{j, \ell} \cdot x_{j, \ell}.
    • Where x_{j, 1}, \ldots, x_{j, d} are the files adjacent to it.
Example
  • Consider the following graph.
  • n = 5, m = 7, and i = 3.
  • q_3 = \gamma_3 \cdot (\nu_2, \nu_3, \nu_6) and a_3 = x_2 \cdot \gamma_3 \nu_2 + x_3 \cdot \gamma_3 \nu_3 + x_6 \cdot \gamma_3 \nu_6.
  • q_2 = \gamma_2 \cdot (\nu_1, h \nu_3, \nu_4) and a_2 = x_1 \cdot \gamma_2 \nu_1 + x_3 \cdot h \gamma_2 \nu_3 + x_4 \cdot \gamma_2 \nu_4.

Example of PIR over graphs

Correctness:

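  • \sum_{j=1}^5 \gamma_j^{-1} a_j = (h + 1) \nu_3 x_3, since every unwanted file appears exactly twice and cancels in characteristic 2.
  • Since h \neq 1 (so h + 1 \neq 0) and \nu_3 \neq 0, the user can solve for x_3.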
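
The protocol and its correctness check can be sketched over GF(8). The figure is not reproduced here, so the edge list below is a hypothetical graph chosen only to be consistent with the queries shown for servers 2 and 3. The sketch also assumes that all entries of \gamma and \nu are nonzero, since decoding divides by them, and it picks j_0 as the first endpoint of edge i. Function names are hypothetical.

```python
import random

def gf8_mul(a, b):
    """Multiply in GF(8) = F_2[x]/(x^3 + x + 1); elements are ints 0..7, addition is XOR."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b1000:
            a ^= 0b1011          # reduce modulo x^3 + x + 1
    return result

def gf8_inv(a):
    """Inverse of a nonzero element via a^(q-2) = a^6."""
    result = 1
    for _ in range(6):
        result = gf8_mul(result, a)
    return result

# Hypothetical graph: file (edge) -> the two servers (nodes) storing it.
EDGES = {1: (1, 2), 2: (1, 3), 3: (2, 3), 4: (2, 5), 5: (4, 5), 6: (3, 4), 7: (1, 5)}
N_SERVERS = 5

def graph_pir(x, i, rng):
    """x: dict file -> GF(8) element. Returns x[i] recovered from the server answers."""
    gamma = {j: rng.randrange(1, 8) for j in range(1, N_SERVERS + 1)}  # nonzero (assumed)
    nu = {l: rng.randrange(1, 8) for l in EDGES}                       # nonzero (assumed)
    h = rng.randrange(2, 8)                                            # h not in {0, 1}
    j0 = EDGES[i][0]              # the one node whose query for file i is scaled by h
    total = 0
    for j in range(1, N_SERVERS + 1):
        a_j = 0
        for l, ends in EDGES.items():
            if j in ends:                                   # server j stores file l
                q = gf8_mul(gamma[j], nu[l])
                if l == i and j == j0:
                    q = gf8_mul(h, q)
                a_j ^= gf8_mul(q, x[l])
        total ^= gf8_mul(gf8_inv(gamma[j]), a_j)            # sum of gamma_j^{-1} a_j
    # total = (h + 1) * nu_i * x_i; in characteristic 2, h + 1 is h XOR 1.
    return gf8_mul(total, gf8_inv(gf8_mul(h ^ 1, nu[i])))

files = {l: (3 * l) % 8 for l in EDGES}
x3 = graph_pir(files, 3, random.Random(0))
```

Every file other than x_i contributes \nu_\ell x_\ell once from each of its two servers, and the two copies cancel because the field has characteristic 2.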

Parameters:

  • Storage overhead 2 (for any graph).
  • Download n \cdot k.
  • PIR rate 1/n.

Collusion resistance:

1-privacy: Each node sees an entirely random vector.

2-privacy:

  • If the two nodes share no edge, the argument is the same as for 1-privacy.
  • If the two nodes share an edge, e.g.:
    • \gamma_3 \nu_6 and \gamma_4 \nu_6 are independent.
    • \gamma_3 \nu_3 and h \cdot \gamma_2 \nu_3 are independent.

S-privacy:

  • Let S \subseteq [n] (e.g., S = \{2, 3, 5\}), and consider the query matrix of their mutual files:

Q_S = \operatorname{diag}(\gamma_3, \gamma_2, \gamma_5) \begin{pmatrix} 1 &\\ h & 1 \\ & 1\end{pmatrix} \operatorname{diag}(\nu_3, \nu_4)
  • It can be shown that \Pr(Q_S)=\frac{1}{(q-1)^4}, regardless of i \implies perfect privacy.
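
The uniformity claim can be checked by brute force for a small field. The sketch below assumes q = 4, nonzero uniform \gamma and \nu, and compares only two cases: file 3 is the target (server 2's entry is scaled by h \in \{2, 3\}) and file 3 is not the target and no entry of Q_S is scaled (modelled by h = 1). Function names are hypothetical.

```python
from collections import Counter
from itertools import product

def gf4_mul(a, b):
    """Multiply in GF(4) = F_2[x]/(x^2 + x + 1); elements are ints 0..3."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b100:
            a ^= 0b111           # reduce modulo x^2 + x + 1
    return result

NONZERO = (1, 2, 3)

def view_distribution(target_is_file_3):
    """Count how often each value of the four nonzero entries of Q_S occurs."""
    counts = Counter()
    hs = (2, 3) if target_is_file_3 else (1,)
    for g2, g3, g5, n3, n4 in product(NONZERO, repeat=5):
        for h in hs:
            view = (
                gf4_mul(g3, n3),               # server 3, file 3
                gf4_mul(h, gf4_mul(g2, n3)),   # server 2, file 3 (scaled if target)
                gf4_mul(g2, n4),               # server 2, file 4
                gf4_mul(g5, n4),               # server 5, file 4
            )
            counts[view] += 1
    return counts

with_target = view_distribution(True)
without_target = view_distribution(False)
```

In both cases all (q-1)^4 = 81 possible views occur equally often, so the colluding set S = \{2, 3, 5\} cannot tell the two cases apart from Q_S.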