Breaking the MDS-PIR Capacity Barrier via Joint Storage Coding

The capacity of private information retrieval (PIR) from databases coded using maximum distance separable (MDS) codes was previously characterized by Banawan and Ulukus, under the assumption that the messages are encoded and stored separately in the databases. This assumption is also made in most related works in the literature, and the resulting capacity is colloquially referred to as the MDS-PIR capacity. In this work, we consider the question of whether and when this capacity barrier can be broken through joint encoding and storing of the messages. Our main results are two classes of novel code constructions that allow joint encoding, together with the corresponding PIR protocols, which indeed outperform the separate MDS-coded systems. Moreover, we show that a simple but novel expansion technique allows us to generalize these two classes of codes, resulting in a wider range of cases where this capacity barrier can be broken.


Introduction
Private information retrieval (PIR) [1] has attracted significant attention from researchers in the fields of theoretical computer science, cryptography, information theory, and coding theory. In the classical PIR model, a user wishes to retrieve one of the K available messages from N non-communicating databases, each of which has a copy of these K messages. User privacy needs to be preserved during the retrieval process, which requires that the identity of the desired message not be revealed to any single database. To accomplish the task efficiently, good codes need to be designed so that the least amount of data is downloaded. The inverse of the minimum amount of downloaded data per bit of desired message is referred to as the capacity of the PIR system. The capacity of the classical PIR system was characterized precisely in a recent work by Sun and Jafar [2].
In distributed systems, databases may fail; moreover, each storage node (database) is also constrained in its storage space. Erasure codes can be used to improve both storage efficiency and failure resistance, which motivated the investigation of PIR from data encoded with maximum distance separable (MDS) codes [3][4][5][6][7], with coding parameter (N, T), i.e., the messages can be recovered by accessing any T databases. The capacity of PIR from MDS-coded databases (MDS-PIR) was characterized by Banawan and Ulukus [5], and is colloquially referred to as the MDS-PIR capacity.
Hua Sun (email: hua.sun@unt.edu) is with the Department of Electrical Engineering at the University of North Texas. Chao Tian (email: chao.tian@tamu.edu) is with the Department of Electrical and Computer Engineering at Texas A&M University.
In all these existing works, the storage code is designed such that each message is independently encoded and stored in the databases, and thus can also be recovered individually. In fact, even when the storage codes are not necessarily MDS codes, most existing works on private information retrieval have assumed this separate coding architecture [8][9][10][11][12][13], and the only exceptions we are aware of are [20][21][22]. Though this architecture of separately encoding each message offers a simple storage solution with good data reliability, it is by no means the only possible MDS storage coding strategy. Instead, the messages can be stored jointly using an MDS code, which can provide the same level of data reliability with the same amount of storage overhead. Motivated by this observation, we ask the following natural question: When can the MDS-PIR capacity barrier, which was established in [5] for separately encoding the messages using an MDS code, be broken by allowing joint encoding of the messages using an MDS code?
In this work, we show that there are many cases where, by jointly encoding and storing the messages, the messages can be protected using an (N, T) MDS code but retrieved with less data download than under the separate coding architecture. In other words, the capacity barrier for separately encoding the messages can be broken for these cases. More precisely, the mathematical question we ask is for which (K, N, T) parameters jointly encoding and storing the MDS-coded messages can provide a strict improvement in the PIR retrieval rate; we show that this can be done at least in the following two cases:
• (K, N, T) = (2, N, 2) and N ≥ 3;
• (K, N, T) = (K, K + 1, K).
To establish this result, we provide two novel code constructions and PIR protocols which yield a strict performance improvement over the strategy of encoding and storing messages separately using an MDS code. Moreover, we show that through a simple but novel code expansion technique, the MDS-PIR capacity barrier can also be broken for the following cases for an arbitrary integer m ≥ 1:
• (K, N, T) = (2, mN, 2m) and N ≥ 3;
• (K, N, T) = (K, m(K + 1), mK).
The rest of the paper is organized as follows. In Section 2, we provide a precise description of the system model and problem formulation. In Section 3 and Section 4, we provide two novel joint storage codes and PIR protocols. In Section 5, we present a technique which yields more general classes of codes that strictly improve upon separately encoding and storing the messages. Finally, Section 6 concludes the paper.

System Model and Problem Formulation
In this section, we first provide a formal description of the system model, then proceed to pose the problem we seek to answer in this work. A couple of additional remarks to clarify the relation between our system model and those seen in the literature are given at the end of the section.

System Model
There are a total of K mutually independent messages $W_1, W_2, \ldots, W_K$ in the system. Each message is uniformly distributed over $\mathcal{X}^L$, i.e., the set of length-L sequences over the finite alphabet $\mathcal{X}$. The messages are MDS-coded and then distributed to N databases, such that from any T databases, the messages can be fully recovered. Since the messages are (N, T) MDS-coded, it is without loss of generality to assume that $L \cdot K = M \cdot T$ for some integer M.
When a user wishes to retrieve a particular message $W_{k^*}$, N queries $Q^{[k^*]}_{1:N} = (Q^{[k^*]}_1, \ldots, Q^{[k^*]}_N)$ are sent to the databases, where $Q^{[k^*]}_n$ is the query for database-n. The retrieval needs to be information theoretically private, i.e., no database is able to infer any knowledge as to which message is being requested. For this purpose, a random key F in the set $\mathcal{F}$ is used together with the desired message index $k^*$ to generate the set of queries $Q^{[k^*]}_{1:N}$. Each query $Q^{[k^*]}_n$ belongs to the set of allowed queries for database-n, denoted as $\mathcal{Q}_n$. After receiving query $Q^{[k^*]}_n$, database-n responds with an answer $A^{[k^*]}_n$. Each symbol in the answers from database-n belongs to a finite field $\mathcal{A}_n$, and the answers may have multiple (and different numbers of) symbols. Using the answers $A^{[k^*]}_{1:N}$ from all N databases, together with F and $k^*$, the user then reconstructs $\hat{W}_{k^*}$. We shall refer to such a system as a (K, N, T) MDS-PIR system.
A more rigorous definition of a (K, N, T) system can be specified by a set of coding functions as follows. In the following, we denote the cardinality of a set $\mathcal{B}$ as $|\mathcal{B}|$.
Definition 1 A (K, N, T) MDS-PIR code consists of the following coding components:
1. A set of MDS encoding functions $\Phi_n : \mathcal{X}^{LK} \to \mathcal{X}^{M}$, $n \in \{1, \ldots, N\}$, where each $\Phi_n$ encodes all the messages together into the information to be stored at database-n.
2. A set of MDS decoding recovery functions, one for each $\mathcal{T} \subseteq \{1, \ldots, N\}$ such that $|\mathcal{T}| = T$, whose outputs are denoted as $W^{\mathcal{T}}_{1:K}$.
3. A query function that maps the desired message index and the random key to the queries, i.e., for retrieving message $W_{k^*}$, the user sends the query $Q^{[k^*]}_n$ to database-n.
4. An answer length function, i.e., the length of the answer from each database, a non-negative integer, is a deterministic function of the query, but not of the particular realization of the messages.
5. An answer generating function, which produces the answer $A^{[k^*]}_n$ from the content stored at database-n when $Q^{[k^*]}_n$ is the query received by database-n.
6. A reconstruction function $\psi$, i.e., after receiving the answers, the user reconstructs the message as $\hat{W}_{k^*} = \psi(A^{[k^*]}_{1:N}, k^*, F)$.
These functions satisfy the following three requirements:
1. MDS recoverable: For any $\mathcal{T} \subseteq \{1, \ldots, N\}$ such that $|\mathcal{T}| = T$, we have $W^{\mathcal{T}}_{1:K} = W_{1:K}$.
2. Retrieval correctness: For any $k^* \in \{1, \ldots, K\}$, we have $\hat{W}_{k^*} = W_{k^*}$.
3. Privacy: For every $k, k' \in \{1, \ldots, K\}$, $n \in \{1, \ldots, N\}$ and $q \in \mathcal{Q}_n$, $\Pr(Q^{[k]}_n = q) = \Pr(Q^{[k']}_n = q)$.
The retrieval rate is defined as the ratio of the size of the desired message, $L \log_2 |\mathcal{X}|$ bits, to the expected total number of bits downloaded from the N databases. This is the number of bits of desired message information that can be privately retrieved per bit of downloaded data. The maximum possible retrieval rate is referred to as the capacity of the (K, N, T) system.

Separate vs. Joint MDS Storage Codes
In the general problem definition we have provided above, the MDS encoding functions $\Phi_n$ allow the messages to be jointly encoded. For example, suppose we have K = 2 messages and N = 3 databases, and from any T = 2 databases we may decode both messages. A simple jointly encoded MDS storage code is as follows. Each message has L = 2 bits, denoted as $W_1 = (a_1, a_2)$ and $W_2 = (b_1, b_2)$. Each database stores M = LK/T = 2 bits, i.e., database-1 stores $(a_1, a_2)$, database-2 stores $(b_1, b_2)$ and database-3 stores $(a_1 + b_1, a_2 + b_2)$.
However, in almost all existing works in the literature, e.g., [3, 5, 7, 23, 24, 25, 26], the messages are encoded separately. In other words, the MDS encoding functions have the special form $\Phi_n = (\Phi^1_n, \Phi^2_n, \ldots, \Phi^K_n)$, where $\Phi^k_n : \mathcal{X}^L \to \mathcal{X}^{M/K}$ encodes message $W_k$ to its MDS-coded form to be stored at database-n. Correspondingly, the MDS decoding functions decompose into K component functions, where the k-th component decodes message-k from the information regarding $W_k$ stored in the databases in the set $\mathcal{T}$. In particular, since most practical MDS codes are linear, several existing works have directly assumed the MDS encoding functions to be linear and, moreover, the component coding functions $\Phi^k_n$ to be the same for the different messages $W_k$; see e.g., [5, 23]. In other words, in this class of codes, the encoding function $\Phi^k_n$ can be written as the multiplication of the message vector $W_k$ with an $L \times M/K$ encoding matrix $G_n$, whose elements are also in the finite field $\mathcal{X}$.
To compare with the jointly encoded MDS storage example above, we consider the same setting where K = 2 messages, L = 2 bits per message, N = 3 servers, and the MDS parameter T = 2. A separate MDS storage code where each database stores M/K = 1 bit per message is as follows. Database-1 stores $(a_1, b_1)$, database-2 stores $(a_2, b_2)$ and database-3 stores $(a_1 + a_2, b_1 + b_2)$. It is easy to see that for separately encoded MDS storage codes, the storage space is divided evenly among the messages, and each share of the storage space can only be a function of the corresponding message.
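As a quick sanity check of the two examples with K = 2, N = 3, T = 2 described above, the following minimal Python sketch enumerates all 16 message realizations over the binary field and confirms that the contents of any two databases determine both messages. The function names (`joint_code`, `separate_code`, `is_mds`) are illustrative choices made here, not notation from the paper, and addition over the binary field is implemented as XOR.

```python
from itertools import combinations, product


def joint_code(a1, a2, b1, b2):
    """Joint example: database-1 holds W1, database-2 holds W2, database-3 holds W1 + W2."""
    return [(a1, a2), (b1, b2), (a1 ^ b1, a2 ^ b2)]


def separate_code(a1, a2, b1, b2):
    """Separate example: each database holds one coded bit of W1 and one of W2."""
    return [(a1, b1), (a2, b2), (a1 ^ a2, b1 ^ b2)]


def is_mds(encode, n_databases=3, t=2, n_bits=4):
    """Return True if every set of t databases determines all message bits.

    The check is brute force: for each subset of databases, the map from the
    message bits to the contents of that subset must be injective.
    """
    for subset in combinations(range(n_databases), t):
        seen = set()
        for bits in product((0, 1), repeat=n_bits):
            stored = encode(*bits)
            view = tuple(stored[n] for n in subset)
            if view in seen:
                return False  # two different message pairs look identical
            seen.add(view)
    return True


if __name__ == "__main__":
    print("joint code is (3, 2) MDS:   ", is_mds(joint_code))
    print("separate code is (3, 2) MDS:", is_mds(separate_code))
```

Both codes pass the check with the same storage of 2 bits per database, which is the sense in which joint coding offers the same reliability at the same storage overhead.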
Let us denote the capacity of the (K, N, T) MDS-PIR system as $C(K, N, T)$, that of separate MDS coding as $C^{\perp}(K, N, T)$, and that of separate linear MDS coding with a uniform component function as $C^{\oplus}(K, N, T)$. It is clear from the definitions that
$$C(K, N, T) \ge C^{\perp}(K, N, T) \ge C^{\oplus}(K, N, T).$$
It was shown in [5] that
$$C^{\oplus}(K, N, T) = \left(1 + \frac{T}{N} + \cdots + \left(\frac{T}{N}\right)^{K-1}\right)^{-1} = \frac{1 - T/N}{1 - (T/N)^K}. \qquad (13)$$
However, a close inspection of the converse proof in [5] reveals that
$$C^{\perp}(K, N, T) = C^{\oplus}(K, N, T).$$
The issue we thus wish to understand in this work is the relation between $C(K, N, T)$ and $C^{\perp}(K, N, T)$. In particular, we wish to identify the set of (K, N, T) triples such that
$$C(K, N, T) > C^{\perp}(K, N, T),$$
if the set is not empty. We shall show in this work that such triples indeed exist, and that they in fact span a rather wide range.
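To make the comparison concrete, the following Python sketch evaluates the closed-form MDS-PIR capacity of [5], $(1 + T/N + \cdots + (T/N)^{K-1})^{-1}$, with exact rational arithmetic and compares it against the two joint-coding rates quoted later in the paper, namely (N − 1)/N for (2, N, 2) systems and 2/(K + 1) for (K, K + 1, K) systems. The function names are hypothetical and chosen here only for illustration.

```python
from fractions import Fraction


def separate_capacity(K, N, T):
    """Separate MDS-PIR capacity: (1 + T/N + ... + (T/N)^(K-1))^(-1)."""
    ratio = Fraction(T, N)
    return 1 / sum(ratio ** k for k in range(K))


def joint_rate_two_messages(N):
    """Rate quoted in the paper for the joint (2, N, 2) construction."""
    return Fraction(N - 1, N)


def joint_rate_single_parity(K):
    """Rate quoted in the paper for the joint (K, K + 1, K) construction."""
    return Fraction(2, K + 1)


if __name__ == "__main__":
    for N in range(3, 8):
        print((2, N, 2), joint_rate_two_messages(N), separate_capacity(2, N, 2))
    for K in range(2, 7):
        print((K, K + 1, K), joint_rate_single_parity(K),
              separate_capacity(K, K + 1, K))
```

Since the capacity expression depends on N and T only through the ratio T/N, scaling both by the same integer m leaves it unchanged, which is the fact exploited by the expansion technique.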

Further Remarks on the System Model
The result in [5] is in fact slightly stronger than what we have stated in (13). Assume that a particular MDS storage code $\mathcal{C}$ is used in the (K, N, T) system; the corresponding capacities of the (K, N, T) systems described above can then be denoted as $C(K, N, T, \mathcal{C})$, $C^{\perp}(K, N, T, \mathcal{C})$, and $C^{\oplus}(K, N, T, \mathcal{C})$, respectively. The result in [5] can then be stated as follows: for any linear MDS code $\mathcal{C}$, $C^{\oplus}(K, N, T, \mathcal{C}) = C^{\oplus}(K, N, T)$. It is natural to ask whether, for any particular MDS code $\mathcal{C}$ that is not necessarily linear or does not necessarily use a uniform component MDS coding function, $C^{\perp}(K, N, T, \mathcal{C}) = C^{\perp}(K, N, T)$, and more generally whether for any MDS code $\mathcal{C}$, $C(K, N, T, \mathcal{C}) = C(K, N, T)$. We believe this is not true in general; however, it appears difficult to prove or disprove this conjecture.
The MDS recovery requirement implies an information theoretic relation on the contents stored at the databases $n \in \mathcal{T}$, for any $\mathcal{T} \subseteq \{1, 2, \ldots, N\}$ with $|\mathcal{T}| = T$. These conditions can be used to derive converse results for a (K, N, T) system, and are sometimes stated directly (e.g., [24]) as the MDS recovery requirement, instead of enforcing the MDS recovery property on the coding functions.

Code Construction: K = T = 2, N ≥ 3
In this section, we present the storage and PIR code construction when K = T = 2, N ≥ 3 and show that the PIR rate achieved with the proposed joint MDS storage code is strictly higher than the capacity of PIR with separate MDS storage code, i.e., $C(2, N, 2) > C^{\perp}(2, N, 2)$.

Example: N = 4
To illustrate the main idea in a simpler setting, we start with an example where N = 4. We set the message size to L = 3 so that each message consists of 3 symbols from $\mathbb{F}_5$. Denote $W_1 = (a_0; a_1; a_2)$ and $W_2 = (b_0; b_1; b_2)$.
Storage Code: From the joint MDS storage code constraint, each database stores LK/T = 3 symbols, and the stored variables are specified in the following table.
It is easy to verify that we may recover both messages from the storage of any 2 databases. For example, consider database-3 and database-4. It suffices to show that $(a_1 - 2a_2; a_2 - 2a_0; a_0 - 2a_1)$ is invertible to $W_1 = (a_0; a_1; a_2)$. Equivalently, we show that the following matrix has full rank over $\mathbb{F}_5$:
$$\begin{pmatrix} 0 & 1 & -2 \\ -2 & 0 & 1 \\ 1 & -2 & 0 \end{pmatrix}.$$
Its determinant is $-7 \equiv 3 \pmod 5$, which is non-zero.
PIR Code: When we retrieve $W_1$, the answers are shown in the following table.
[Table: answers for $W_1$; columns F, Database-1, Database-2, Database-3, Database-4]
When we retrieve $W_2$, the answers are shown in the following table.
[Table: answers for $W_2$; columns F, Database-1, Database-2, Database-3, Database-4]
Correctness and Privacy: Both correctness and privacy are easy to verify. Correctness follows from the observation that from the 4 symbols downloaded (one from each database), we may decode the 3 desired symbols, as only 1 undesired symbol appears in the answers. Privacy is guaranteed because no matter which message is desired, for each database, the answers are identically distributed. For example, consider database-3. The answers are equally likely to be $a_0 + b_2$, $a_1 + b_0$ and $a_2 + b_1$, regardless of the desired message index.
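The full-rank claim used above for database-3 and database-4 can be checked numerically. The sketch below builds the coefficient matrix of $(a_1 - 2a_2; a_2 - 2a_0; a_0 - 2a_1)$ with respect to $(a_0; a_1; a_2)$ and evaluates its determinant modulo a prime. It assumes the symbols live in the prime field of size 5, which is the smallest field size permitted by the general requirement $p^m \ge (N-3)(N-1)+2$ for N = 4; the helper name `det3_mod` is a hypothetical name introduced here.

```python
def det3_mod(matrix, p):
    """Determinant of a 3x3 integer matrix, reduced modulo the prime p."""
    (a, b, c), (d, e, f), (g, h, i) = matrix
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return det % p


# Rows hold the coefficients of (a0, a1, a2) in
# a1 - 2*a2, a2 - 2*a0 and a0 - 2*a1, respectively.
COEFFS = [
    [0, 1, -2],
    [-2, 0, 1],
    [1, -2, 0],
]

if __name__ == "__main__":
    # Assumption: the field is F_5. A non-zero value means the map is invertible.
    print(det3_mod(COEFFS, 5))  # prints 3
```

The integer determinant is −7, so the same matrix would be singular over a field of characteristic 7, which illustrates why the field has to be chosen with some care.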

General Proof: Arbitrary N ≥ 3
We set the message size to L = N − 1, so that each message consists of N − 1 symbols from $\mathbb{F}_{p^m}$ for a prime number p and an integer m such that $p^m \ge (N-3)(N-1) + 2$. The primitive element of the finite field $\mathbb{F}_{p^m}$ is denoted as $\alpha$.
Storage Code: From the joint MDS storage code constraint, each database stores LK/T = N − 1 symbols, and the stored variables are denoted as $S_1, \ldots, S_N$. The stored variables are built from cyclically shifted versions of the message vectors, where the symbol indices are interpreted modulo N − 1. The proof that the above storage code satisfies the MDS criterion is deferred to Section 3.2.1.
PIR Code: When we retrieve $W_1$, the answers $A^{[1]}_n$ are set as follows.
When we retrieve $W_2$, the answers $A^{[2]}_n$ are set as follows.
$A^{[2]}_n = \ldots$ (42)
Correctness and Privacy: Similar to the example presented in the previous section, both correctness and privacy are easy to verify. Correctness follows from the observation that the N symbols downloaded (one from each database) contain all N − 1 desired symbols and only 1 undesired symbol. Specifically, when $W_1$ is desired, we may recover $W_1$ from the answers $A^{[1]}_{1:N}$, and when $W_2$ is desired, we may recover $W_2$ from the answers $A^{[2]}_{1:N}$. Privacy is guaranteed because no matter which message is desired, $A^{[1]}_n$ and $A^{[2]}_n$ are identically distributed. For n = 1, 2, this is trivial to see; when n ≥ 3, the queries take values in the same set $\{0, 1, \ldots, N-2\}$ for any n, and moreover follow the same uniform distribution on this set for both messages.

Rate that outperforms separate MDS-PIR capacity:
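The desired message has L = N − 1 symbols and we download one symbol from each database, so the rate achieved is $(N-1)/N$. For every N ≥ 3 this is strictly larger than $N/(N+2) = C^{\perp}(2, N, 2)$, the capacity of separate MDS storage code, since $(N-1)(N+2) - N^2 = N - 2 > 0$.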

Proof of MDS storage criterion
We show that from the stored variables of any two databases, $S_i, S_j$, $i < j$, $i, j \in \{1, \ldots, N\}$, we may recover both $W_1$ and $W_2$.
When i = 1, 2, the proof is immediate. Henceforth we consider i ≥ 3. To show that from $(S_i, S_j)$ we may recover $(W_1, W_2)$, it suffices to prove that from $S_i - S_j$ we may recover $W_1$. Note that $S_i - S_j = C_{i,j} W_1$, where $C_{i,j}$ is an $(N-1) \times (N-1)$ circulant matrix whose rows consist of all possible cyclic shifts of a single $1 \times (N-1)$ row vector. We are left to prove that the circulant matrix $C_{i,j}$ has full rank. From a result by Ingleton [27], a circulant matrix has full rank if the following two polynomials have no common root: $f(x)$, the polynomial whose coefficients are the entries of this row vector, and $g(x) = x^{N-1} - 1$.
To show that $f(x)$, $g(x)$ have no common root for all integers i, j with 3 ≤ i < j ≤ N, we argue by contradiction. Suppose on the contrary that there exist an element $x_0 \in \mathbb{F}_{p^m}$ and two integers i, j, 3 ≤ i < j ≤ N, such that $f(x_0) = 0$ and $g(x_0) = 0$. Taking (50) to the (N − 1)-th power and combining with the assumption that $p^m - 2 \ge (N-3)(N-1)$ and that $\alpha$ is a primitive element of $\mathbb{F}_{p^m}$ (so that $\alpha^e \ne 1$ for every integer $0 < e \le (N-3)(N-1)$) [28], we arrive at a contradiction with (54). The proof is now complete.
Remark: The field size may be further reduced by a result from [29]. To ensure that $C_{i,j}$ has full rank, it suffices to ensure that $f(x)$ and $g'(x) = x^r - 1$ have no common root, where $N - 1 = r p^l$ and p, r are co-prime [29]. Using this result and following similar proof steps as above, we may set $p^m \ge (N-3) r + 2$. Note that here r depends on p, so to find the smallest field size, we may search by first fixing p.
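The search suggested in the remark can be sketched in a few lines of Python. The code below walks through the prime powers q in increasing order, strips the factors of the characteristic p from N − 1 to obtain r, and returns the first q that meets the stated sufficient condition $q \ge (N-3) r + 2$; with `use_remark=False` it falls back to the condition $q \ge (N-3)(N-1) + 2$ from the main proof. All function names are hypothetical, and the sketch only evaluates the two sufficient conditions quoted in the text, without constructing the field or the code.

```python
def prime_power_base(q):
    """Return p if q = p**m for a prime p and some m >= 1, otherwise None."""
    if q < 2:
        return None
    p = 2
    while p * p <= q:
        if q % p == 0:
            while q % p == 0:
                q //= p
            return p if q == 1 else None
        p += 1
    return q  # no divisor up to sqrt(q), so q itself is prime


def p_free_part(n, p):
    """Return r such that n = r * p**l with p not dividing r."""
    while n % p == 0:
        n //= p
    return n


def smallest_field_size(N, use_remark=True):
    """Smallest prime power q meeting the sufficient condition for a (2, N, 2) code."""
    q = 2
    while True:
        p = prime_power_base(q)
        if p is not None:
            r = p_free_part(N - 1, p) if use_remark else N - 1
            if q >= (N - 3) * r + 2:
                return q
        q += 1


if __name__ == "__main__":
    for N in range(3, 10):
        print(N, smallest_field_size(N, use_remark=False), smallest_field_size(N))
```

For N = 4, for example, the main-proof condition asks for a field of size at least 5, while the refined condition is already met by the field of size 3, because N − 1 = 3 is then a power of the characteristic and r = 1.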

Code Construction: N = K + 1 = T + 1
In this section, we present the storage and PIR code construction when N = K + 1 = T + 1 and show that the PIR rate achieved with the proposed joint MDS storage code is strictly higher than the capacity of PIR with separate MDS storage code, i.e., C(K, K + 1, K) > C ⊥ (K, K + 1, K).

Example: (K, N, T ) = (3, 4, 3)
To illustrate the main idea in a simpler setting, we start with an example where K = 3, N = 4, T = 3. We set the message size to L = 2 so that each message consists of 2 bits from $\mathbb{F}_2$. Denote the three messages as $W_1$, $W_2$, $W_3$.
Storage Code: From the joint MDS storage code constraint, each database stores LK/T = 2 bits, and the stored variables are specified in the following table.

Table 4: Stored Variables.
[Table columns: Database-1, Database-2, Database-3, Database-4]
The MDS storage criterion is easily verified, i.e., we may recover all three messages from the storage of any 3 databases.
PIR Code: When we retrieve $W_1$, the answers are shown in the following table.
[Table: answers for $W_1$; columns F, Database-1, Database-2, Database-3, Database-4]
When we retrieve $W_2$ or $W_3$, the answers are shown in the following tables.
[Table: answers for $W_2$; columns F, Database-1, Database-2, Database-3, Database-4]
[Table: answers for $W_3$; columns F, Database-1, Database-2, Database-3, Database-4]
Correctness and Privacy: Both correctness and privacy are easy to see.

Rate that outperforms separate MDS-PIR capacity:
The rate achieved is $2/4 = 1/2$, which is strictly larger than $16/37 = C^{\perp}(3, 4, 3)$, the capacity of separate MDS storage code.

General
The proof is a simple generalization of the example presented above. We set L = 2, and each message consists of 2 bits from $\mathbb{F}_2$.
Storage Code: Each database stores LK/T = 2 bits, and the stored variables are specified in the following table.
The MDS storage criterion is easily verified, i.e., we may recover all K messages from the storage of any T = N − 1 databases.
PIR Code: When we retrieve W k , the answers are shown in the following table.
Table 9: Answers for W k .
Correctness and Privacy: Follow immediately.

Rate that outperforms separate MDS-PIR capacity:
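The rate achieved is $2/(K+1)$, while $C^{\perp}(K, K+1, K) = \frac{1/(K+1)}{1 - (K/(K+1))^K}$. The achieved rate is therefore strictly larger whenever $(K/(K+1))^K < 1/2$, which holds for every K ≥ 2. The proof is thus complete.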

Regime Expansion Building upon Base Codes
We show that the two classes of base codes presented in previous sections for (K, N, T) systems can be extended to (K, mN, mT) systems, where m is a positive integer. We present this result in the next two subsections, one for each class of base codes. Let us start from the simpler case of (K, K + 1, K) systems.
The key idea is that we may split the messages and databases into m generic copies so that the same PIR rate is preserved. Note that the separate MDS-PIR capacity depends on N and T only through the ratio T/N, i.e., $C^{\perp}(K, mN, mT) = C^{\perp}(K, N, T)$. Therefore, it suffices to provide a joint MDS storage code for a (K, m(K + 1), mK) system that achieves the same PIR rate as that of a (K, K + 1, K) system (i.e., rate $2/(K+1)$). Such a storage and PIR code construction is presented next.
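Each message is "multiplied" by m so that we set L = 2m, and each message consists of 2m symbols from $\mathbb{F}_q$, where q is an integer power of a prime number and is no smaller than (m + 1)K. To highlight that the message symbols form two segments, we split each message into segment 1 and segment 2, each with m symbols. The databases are divided into K + 1 groups of m databases each. The first K groups store raw message symbols, and the (K + 1)-th group of databases stores generic combinations of the message symbols. The combining coefficients used by the i-th database of this group, $i \in \{1, \ldots, m\}$, are given by the i-th row of an $m \times mK$ Cauchy matrix C with elements C(i, j) of the form
$$C(i, j) = \frac{1}{\alpha_i - \beta_j},$$
for distinct $\alpha_i$'s and $\beta_j$'s in $\mathbb{F}_q$. Note that q ≥ (m + 1)K; therefore, such distinct $\alpha_i$'s and $\beta_j$'s exist.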
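The property that makes a Cauchy matrix attractive for this construction, namely that every square submatrix is invertible, can be checked by brute force on a small instance. The sketch below is restricted to prime fields so that plain modular arithmetic suffices (the paper allows any prime power q), and it uses the standard Cauchy form $1/(\alpha_i - \beta_j)$ with all m + mK evaluation points pairwise distinct; the parameter values and function names are assumptions made here for illustration.

```python
from itertools import combinations


def cauchy_matrix(alphas, betas, p):
    """C(i, j) = 1 / (alpha_i - beta_j) over the prime field F_p."""
    return [[pow((a - b) % p, -1, p) for b in betas] for a in alphas]


def rank_mod_p(rows, p):
    """Rank over F_p, computed by Gauss-Jordan elimination."""
    rows = [[x % p for x in r] for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)
        rows[rank] = [(x * inv) % p for x in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(x - factor * y) % p
                           for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def all_square_submatrices_invertible(C, p):
    """Check every square submatrix of C for full rank over F_p."""
    m, n = len(C), len(C[0])
    for size in range(1, min(m, n) + 1):
        for row_idx in combinations(range(m), size):
            for col_idx in combinations(range(n), size):
                sub = [[C[i][j] for j in col_idx] for i in row_idx]
                if rank_mod_p(sub, p) < size:
                    return False
    return True


if __name__ == "__main__":
    p, m, K = 11, 2, 3                      # illustrative parameters
    alphas = list(range(m))                 # m points for the rows
    betas = list(range(m, m + m * K))       # mK further points for the columns
    C = cauchy_matrix(alphas, betas, p)
    print(all_square_submatrices_invertible(C, p))
```

With this form of the Cauchy matrix, the m + mK evaluation points must all be distinct, so the sketch picks a prime p of at least that size.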
We now verify that the MDS storage criterion is satisfied, i.e., all K messages can be recovered from the storage of any T = mK databases. The two message segments $W^1$, $W^2$ are encoded in the same manner, so it suffices to consider one segment, say segment 1, $W^1$. Suppose that among the T = mK databases, $T_1$ databases are from the first K database groups and the remaining $T - T_1$ databases are from the (K + 1)-th database group; since the (K + 1)-th group contains only m databases, $T_1 \ge m(K-1)$. The $T_1$ databases from the first K database groups contribute $T_1$ raw message symbols from $W^1$, so we only need to show that the remaining $T - T_1$ symbols from $W^1$ can be recovered from the $T - T_1$ databases of the (K + 1)-th database group. This is equivalent to proving that a $(T - T_1) \times (T - T_1)$ sub-matrix of the Cauchy matrix $C \in \mathbb{F}_q^{m \times mK}$ has full rank, which holds for any square sub-matrix of a Cauchy matrix.
PIR Code: When we retrieve $W_k$, the answers are shown in the following table.
Table 11: Answers for $W_k$.
Correctness and Privacy: Privacy follows from the observation that no matter which message is desired, the answer from any database is equally likely to come from message segment 1 or 2. To see correctness, note that all non-desired message symbols that appear in the answers from the (K + 1)-th database group are directly downloaded and thus can be cancelled. Moreover, m desired symbols are directly downloaded, and the other m desired symbols can be successfully recovered because the m linear combinations of desired symbols downloaded from the (K + 1)-th database group have full rank (note that $C \in \mathbb{F}_q^{m \times mK}$ is a Cauchy matrix). The rate achieved is $2/(K+1)$, as L = 2m and we have downloaded one symbol from each of the m(K + 1) databases.
We now show that $C(2, mN, 2m) > C^{\perp}(2, mN, 2m)$, where N ≥ 3 and m is a positive integer. Similar to the reasoning in the previous section, it suffices to provide a joint MDS storage code for a (2, mN, 2m) system that achieves the PIR rate $(N-1)/N$ (the same as that of a (2, N, 2) system from Section 3). The idea is also based on splitting the messages and databases. Let us start with an example where N = 4, m = 2.

Example
The message size is multiplied by m = 2 so that we set L = m(N − 1) = 6 and each message consists of 6 symbols from $\mathbb{F}_q$, where q will be specified later. At this point, it is useful to view q as a sufficiently large prime number. Denote $W_1 = (a_0; a_1; a_2)$, where each $a_i$ consists of m = 2 symbols. We will show that there exist feasible choices of $h_i, g_i$. Specifically, we may choose $h_i, h'_i, g_i, g'_i$ i.i.d. and uniform over $\mathbb{F}_q$.
To verify the MDS storage criterion, we need to show that both messages can be recovered from the storage of any 4 databases. The detailed proof is deferred to the general proof presented in the next section, and we give a sketch here. Every 4 databases contribute 12 linear combinations of the 12 message symbols, and this linear mapping is given by a 12 × 12 matrix. We view its determinant polynomial as a function of the variables $(h_i, h'_i, g_i, g'_i)$. As shown in the general proof, these determinant polynomials are not zero polynomials. Overall we have $\binom{8}{4} = 70$ determinant polynomials and each polynomial has degree at most 12. Consider the product of all such determinant polynomials, which is another polynomial with degree at most $12 \times \binom{8}{4} = 840$. Therefore, by the Schwartz-Zippel lemma, if we set $q > 12 \times \binom{8}{4}$, then the probability that this product polynomial evaluates to a non-zero value is non-zero. In other words, there exists a feasible choice of $(h_i, h'_i, g_i, g'_i)$ that guarantees the storage code satisfies the MDS criterion.
PIR Code: The PIR code is almost identical to that when m = 1. When we retrieve $W_1$, the answers are shown in the following table.
Correctness and Privacy: Privacy is easily seen. To prove correctness, note that non-desired symbols can be cancelled, and we only need to ensure that the received desired equations are invertible to the message symbols. This claim follows from the Schwartz-Zippel lemma, which shows that $(h_{2i-1}; h_{2i}) \in \mathbb{F}_q^{2 \times 2}$ and $(g_{2i-1}; g_{2i}) \in \mathbb{F}_q^{2 \times 2}$ have full rank with non-zero probability over a sufficiently large field. Here we have 12 matrices, each of which has dimension 2 × 2 and has a determinant polynomial of degree at most 2.
Overall, we need to guarantee that correctness and the MDS criterion are simultaneously satisfied. Take the product of all determinant polynomials, whose degree is at most $12 \times \binom{8}{4} + 12 \times 2 = 864$. So we set $q > 864$.
It follows that the vectors $a_{j^*_3}$ and $b_{j^*_4}$ are fully recovered. Hence from any T databases, we may recover $(a_1, \ldots, a_m)$ and $(b_1, \ldots, b_m)$, i.e., all symbols from $W_1$ and $W_2$. Therefore, there indeed exists a choice of $h_{n,j,i}, g_{n,j,i}$ for which the determinant polynomial is not zero.
Finally, we need to consider correctness and the MDS criterion jointly and show that there exists a single choice of $h_{n,j,i}, g_{n,j,i}$ that satisfies both constraints at the same time. The product of all determinant polynomials has degree at most $2m(N-2)(N-1) + 2m(N-1)\binom{mN}{2m}$, and as $q > 2m(N-2)(N-1) + 2m(N-1)\binom{mN}{2m}$, the Schwartz-Zippel lemma guarantees the existence of a feasible choice.
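The field-size requirement in this argument is easy to evaluate for concrete parameters. The Python sketch below computes the degree bound, reading the flattened expression as $2m(N-2)(N-1) + 2m(N-1)\binom{mN}{2m}$; this reading is an assumption, although it agrees with the $12 \times \binom{8}{4} + 12 \times 2$ count of the N = 4, m = 2 example. It then finds the smallest prime above the bound, in line with the suggestion to view q as a sufficiently large prime. The function names are hypothetical.

```python
from math import comb


def degree_bound(N, m):
    """Degree bound on the product of all determinant polynomials."""
    correctness_part = 2 * m * (N - 2) * (N - 1)     # invertibility of answers
    mds_part = 2 * m * (N - 1) * comb(m * N, 2 * m)  # one determinant per T-subset
    return correctness_part + mds_part


def is_prime(n):
    """Trial-division primality test, adequate for small illustrative values."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def smallest_prime_above(bound):
    """Smallest prime q with q > bound."""
    q = bound + 1
    while not is_prime(q):
        q += 1
    return q


if __name__ == "__main__":
    for N, m in [(4, 2), (3, 2), (5, 2), (4, 3)]:
        d = degree_bound(N, m)
        print((N, m), d, smallest_prime_above(d))
```

The bound grows quickly with m and N because of the binomial factor; it only certifies that a feasible choice of coefficients exists and is not claimed to be the smallest workable field size.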

Conclusion
We considered the problem of private information retrieval from MDS-coded databases. In contrast to the prevailing approach in the literature, where the messages are encoded separately using MDS codes, we considered encoding and storing the messages jointly in the databases using an MDS code. There are many cases in which joint MDS coding breaks the capacity barrier of separately coded MDS-PIR. To establish this result, two novel code constructions and the corresponding PIR protocols were presented; moreover, an expansion technique was introduced to allow more general parameters. Characterizing the capacity of PIR with joint MDS storage, especially the converse side, remains an interesting future direction.
1, 2}. Storage Code: Each database stores LK/T = 3 symbols, as specified in the following table. Define