The Capacity of Private Information Retrieval from Decentralized Uncoded Caching Databases

We consider the private information retrieval (PIR) problem from decentralized uncoded caching databases. There are two phases in our problem setting: a caching phase and a retrieval phase. In the caching phase, the system contains a data center storing all $K$ files, each of size $L$ bits, and several databases, each with a storage size constraint of $\mu K L$ bits. Each database independently chooses $\mu K L$ bits out of the total $KL$ bits from the data center to cache, using the same probability distribution in a decentralized manner. In the retrieval phase, a user (retriever) accesses $N$ databases in addition to the data center, and wishes to retrieve a desired file privately. We characterize the optimal normalized download cost to be $\frac{D}{L} = \sum_{n=1}^{N+1} \binom{N}{n-1} \mu^{n-1} (1-\mu)^{N+1-n} \left( 1+ \frac{1}{n} + \dots+ \frac{1}{n^{K-1}} \right)$. We show that the uniform and random caching scheme, originally proposed for decentralized coded caching by Maddah-Ali and Niesen, together with the Sun and Jafar retrieval scheme, originally proposed for PIR from replicated databases, surprisingly results in the lowest normalized download cost. This is the decentralized counterpart of the recent result of Attia, Kumar and Tandon for the centralized case. The converse proof contains several ingredients, such as the interference lower bound, the induction lemma, the replacement of query and answering string random variables with the content of distributed databases, the nature of decentralized uncoded caching databases, and bit marginalization of joint caching distributions.


Introduction
Private information retrieval (PIR) refers to the problem of downloading a desired file from distributed databases while keeping the identity of the desired file private against the databases. In the classical setting of PIR (see Fig. 1), there are $N$ non-communicating databases, each storing the same set of $K$ files. The user wishes to download one of these $K$ files without letting the databases know the identity of the desired file. A simple but highly inefficient way is to download all the files from a particular database, which results in a normalized download cost of $\frac{D}{L} = K$, where $L$ is the file size and $D$ is the total number of downloaded bits from the $N$ databases. The PIR problem originated in the computer science community [1][2][3][4][5] and has drawn attention in the information theory community, with early examples in [6][7][8][9][10][11]. Recently, Sun and Jafar [12] have characterized the optimal normalized download cost for the classical PIR problem to be $\frac{D}{L} = 1 + \frac{1}{N} + \dots + \frac{1}{N^{K-1}}$. After [12], many interesting variants of the classical PIR problem have been investigated. Most of these previous works consider the case where the contents of the databases are fixed a priori in an uncontrollable manner, and a vast majority of them consider the case of replicated databases where each database stores the same set of $K$ files.
Coded caching refers to the problem of properly placing files in users' local storage caches ahead of time and designing efficient delivery schemes at the time of specific user requests, in such a way as to minimize the traffic during the delivery phase. In the original setup [55] (see Fig. 1), a server with $K$ files connects to $N$ users through an error-free shared link, where each user has a local memory which can store up to $M$ files. The system operates in two phases, a placement phase and a delivery phase. In the placement phase, the server places the files into each user's local memory. In the delivery phase, each user requests a file from the server, and the server aims to satisfy all the requests with the lowest traffic load. If the set of users in the two phases is identical, the server can arrange the content in each user's local memory in an optimized manner, which is called centralized coded caching. Reference [55] proposes a symmetric batch caching scheme, which is shown to be optimal for the case of centralized uncoded placement in [56]. If the set of users in the two phases varies, the server cannot arrange the files in user caches in a centralized manner. Instead, the server treats each user identically and independently, which is called decentralized coded caching [57]. Reference [57] proposes a uniform and random caching scheme, which is shown to be optimal for the case of decentralized uncoded placement in [56]. Many interesting variants of the coded caching problem have been investigated in [58][59][60][61][62][63][64][65][66][67][68][69][70][71][72].
The references that are most closely related to our work here are [38,44]. References [38,44] formulate a new type of PIR problem where the content of each database is not fixed a priori, but can be optimized to minimize the download cost. These papers bring the PIR and coded caching problems together in a practically relevant and theoretically interesting manner. In their problem setting (see Fig. 2), there is a data center (server) containing all the $K$ files, where each file is of size $L$ bits, and the system operates in two phases. In the caching phase, there are $N$ databases in the system with a common storage size constraint $\mu$, i.e., each database can store at most $\mu K L$ bits, where $\frac{1}{N} \le \mu \le 1$. In the retrieval phase, a user (retriever) accesses the $N$ databases, and wishes to download a desired file privately. They consider the problem of optimally storing content from the data center in the databases in the caching phase in such a way that the normalized download cost during the retrieval phase is minimized. They focus on the centralized uncoded caching case, i.e., the set of databases in the two phases is identical so that the data center can assign the files to each database in a centralized manner, and caching is uncoded in that each database stores a subset of the bits from the data center (no coding), i.e., each database stores $\mu K L$ bits out of the total $KL$ bits. Surprisingly, they show that the symmetric batch caching scheme proposed in [55] results in the lowest normalized download cost in the retrieval phase.
We consider the PIR problem from decentralized uncoded caching databases. In our problem setting (see Fig. 3), the system also operates in two phases as in [38,44]. However, the sets of databases active in the two phases are different, and we do not know in advance which databases the user (retriever) can access in the retrieval phase. Therefore, we consider a decentralized setting for the caching phase, i.e., the data center treats each database identically and independently, or equivalently, each database chooses a subset of bits to store independently according to the same probability distribution. Here, we aim at designing the optimal probability distribution in the caching phase and the optimal PIR scheme in the retrieval phase such that the normalized download cost in the retrieval phase is minimized. Another main difference between our work and references [38,44] is that, in the caching phase, references [38,44] require that the $N$ databases altogether can reconstruct the entire $K$ files, i.e., when the user (retriever) connects to the $N$ databases, their collective content is equivalent to the content in the data center, so the user can download any desired file. While this can be guaranteed in the centralized setting, in the decentralized setting, where cache placement is probabilistic, we cannot guarantee that any given $N$ databases contain all the bits that exist in the data center. Thus, in order to formulate a meaningful PIR problem, we allow the user (retriever) to access the data center as well as the databases in the retrieval phase. Finally, we remark on another sub-branch of the PIR literature that considers caching: [30][31][32][33]39,52]; there, the user (retriever) itself has a cache memory where it stores a subset of the bits available in the databases. That problem is unrelated to the setting here even though it is also referred to as PIR with caching; in essence, it is PIR with side information.
In this work, for PIR from decentralized caching databases, we show that the uniform and random caching scheme, originally proposed in [57] for decentralized coded caching, results in the lowest expected normalized download cost in the retrieval phase. For the achievability, we apply the PIR scheme in [12] successively for all resulting subfile parts. For the converse, we first apply the lower bound derived in [44], which replaces the random variables for queries and answering strings by the content of the distributed databases in a novel manner, extending the lower bounding techniques in [12, Lemma 5 and Lemma 6]. To compare different probability distributions in the caching phase, we focus on the marginal distributions on each separate bit. Then, by using the nature of decentralization and uncoded caching, we further lower bound the normalized download cost. Finally, we show the matching converse for the expected normalized download cost to be $\frac{D}{L} = \sum_{n=1}^{N+1} \binom{N}{n-1} \mu^{n-1} (1-\mu)^{N+1-n} \left( 1+ \frac{1}{n} + \dots+ \frac{1}{n^{K-1}} \right)$, which yields an exact capacity result for the problem.

System Model
We consider a system consisting of one data center and several databases. The data center stores $K$ independent files, labeled as $W_1, W_2, \dots, W_K$, where each file is of size $L$ bits. Therefore, $H(W_1) = \dots = H(W_K) = L$ and $H(W_1, \dots, W_K) = H(W_1) + \dots + H(W_K)$ (1). Each database has a storage capacity of $\mu K L$ bits, where $0 \le \mu \le 1$.
The system operates in two phases. In the caching phase, we consider the case of uncoded caching, i.e., each database stores a subset of bits from the data center. Due to the storage size constraint, each database stores at most $\mu K L$ bits out of the total $KL$ bits from the data center. Here, we denote the $i$th database as DB$_i$ and use the random variable $Z_i$ to denote the stored content of DB$_i$. Therefore, the storage size constraint for DB$_i$ is $H(Z_i) \le \mu K L$ (2). We consider the decentralized setting for the caching phase, i.e., each database chooses a subset of bits to store independently according to the same probability distribution, denoted by $P_H$. Rigorously, let the random variable $H_i$ denote the indices of the stored bits in DB$_i$. For $N$ databases, the decentralized caching scheme $H = (H_1, \dots, H_N)$ can be specified as $P(H = (h_1, \dots, h_N)) = \prod_{i=1}^{N} P_H(H_i = h_i)$ (3).

In the retrieval phase, the user accesses $N$ databases and the data center. We note that we do not know in advance which $N$ databases are available or which $N$ databases the user will have access to. Here, we also assume that in the retrieval phase, the data center and the $N$ databases do not communicate with each other (no collusion). To simplify the notation, we use DB$_0$ to denote the data center, and therefore $Z_0 = (W_1, \dots, W_K)$ since the data center stores all the $K$ files. The user privately generates an index $\theta \in [K] = \{1, \dots, K\}$, and wishes to retrieve file $W_\theta$ such that it is impossible for either the data center or any individual database to identify $\theta$. For the random variables $\theta$ and $W_1, \dots, W_K$, we have $H(\theta, W_1, \dots, W_K) = H(\theta) + H(W_1) + \dots + H(W_K)$ (4). In order to retrieve file $W_\theta$, the user sends $N + 1$ queries $Q_0^{[\theta]}, \dots, Q_N^{[\theta]}$, where $Q_n^{[\theta]}$ is the query sent to DB$_n$ for file $W_\theta$. Note that the queries are independent of the realization of the $K$ files. Therefore, $I(W_1, \dots, W_K; Q_0^{[\theta]}, \dots, Q_N^{[\theta]}) = 0$ (5). Upon receiving the query $Q_n^{[\theta]}$, DB$_n$ replies with an answering string $A_n^{[\theta]}$, which is a function of $Q_n^{[\theta]}$ and $Z_n$, i.e., $H(A_n^{[\theta]} \mid Q_n^{[\theta]}, Z_n) = 0$ (6). After receiving the answering strings $A_0^{[\theta]}, \dots, A_N^{[\theta]}$ from DB$_0$, $\dots$, DB$_N$, the user needs to decode the desired file $W_\theta$ reliably. By using Fano's inequality, we have the following reliability constraint: $H(W_\theta \mid Q_0^{[\theta]}, \dots, Q_N^{[\theta]}, A_0^{[\theta]}, \dots, A_N^{[\theta]}) = o(L)$ (7), where $o(L)$ denotes a function such that $\frac{o(L)}{L} \to 0$ as $L \to \infty$. To ensure that individual databases do not know which file is retrieved, we have the following privacy constraint: for all $n \in \{0\} \cup [N]$ and all $\theta \in [K]$, $(Q_n^{[1]}, A_n^{[1]}, W_1, \dots, W_K) \sim (Q_n^{[\theta]}, A_n^{[\theta]}, W_1, \dots, W_K)$ (8), where $A \sim B$ means that $A$ and $B$ are identically distributed.

Given that each file is of size $L$ bits, for fixed $K$, $\mu$ and decentralized caching probability distribution $P_H$, let $H$ denote the indices of the cached bits in the $N$ databases available in the retrieval phase. The probability distribution of $H$ is specified in (3). Let $D_H^{[\theta]}$ represent the number of downloaded bits via the answering strings $A_0^{[\theta]}, \dots, A_N^{[\theta]}$. We further denote $D_H$ as the expected number of downloaded bits with respect to different file requests, i.e., $D_H = E_\theta[D_H^{[\theta]}]$. Finally, we denote $D$ as the expected number of downloaded bits with respect to different realizations of the cached bit indices, i.e., $D = E_H[D_H]$. A pair $(D, L)$ is achievable if there exists a PIR scheme satisfying the reliability constraint (7) and the privacy constraint (8). The optimal normalized download cost $D^*$ is defined as $D^* = \inf \left\{ \frac{D}{L} : (D, L) \text{ is achievable} \right\}$. In this work, we aim at characterizing the optimal normalized download cost and finding the optimal decentralized caching probability distribution.

Next, we illustrate the system model and the problem considered with a simple example of $K = 3$ files and $N = 2$ databases in the retrieval phase; see Fig. 4. Consider a data center storing $K = 3$ files, where each file is of size 4 bits. In the caching phase, there are 4 databases in the system, and each database can store at most 4 bits. Each database can always store the first file, which is of size 4 bits; this is caching option 1 in Fig. 4. Alternatively, each database can uniformly and randomly choose 4 bits out of the total 12 bits from the data center to store; one such realization is shown as caching option 2 in Fig. 4. Each database can also choose 2 bits from the first file and 1 bit from each of the remaining two files to store; one such realization is shown as caching option 3 in Fig. 4.
We require each database to use the same probability distribution to choose the bits to store in order to satisfy the decentralized requirement. In this example, we assume that the user can access the data center and $N = 2$ databases in the retrieval phase, say the first and the third database, and the user wishes to download a file privately. Our questions are as follows: What is the optimal probability distribution to use in the caching phase? What is the optimal PIR scheme to use in the retrieval phase? How can we jointly design the schemes in the two phases such that the expected normalized download cost in the second phase is the lowest?

Main Results and Discussions
We characterize the optimal normalized download cost for PIR from decentralized uncoded caching databases in the following theorem.
Theorem 1 For PIR from decentralized uncoded caching databases with $K$ files, where each file is of size $L$ bits, $N$ databases in addition to a data center available in the retrieval phase, and a storage size constraint of $\mu K L$ bits, $0 < \mu < 1$, for each database, the optimal normalized download cost is $\frac{D}{L} = \sum_{n=1}^{N+1} \binom{N}{n-1} \mu^{n-1} (1-\mu)^{N+1-n} \left( 1+ \frac{1}{n} + \dots+ \frac{1}{n^{K-1}} \right)$ (11).

The achievability scheme is provided in Section 4, and the converse proof is shown in Section 5. We first use the following example to show the main ingredients of Theorem 1.

Motivating Example: K = 3 and N = 2
In this example, we consider the case where the data center stores $K = 3$ independent files labeled as $A$, $B$, and $C$, where each file is of size $L$ bits. In the caching phase, several databases with a storage capacity of $3\mu L$ bits are present in the system. We will show that the optimal normalized download cost is $\frac{D}{L} = \frac{17}{18}\mu^2 - \frac{5}{2}\mu + 3$ when $N = 2$ databases in addition to the data center are available in the retrieval phase.
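As a quick sanity check, the closed-form expression stated for the optimal normalized download cost can be evaluated with exact rational arithmetic and compared against the quadratic claimed for $K = 3$, $N = 2$. The sketch below is only illustrative: the function names `optimal_normalized_download_cost` and `example_polynomial` are our own and do not appear in the paper.

```python
from fractions import Fraction
from math import comb


def optimal_normalized_download_cost(K, N, mu):
    """Evaluate D/L = sum_{n=1}^{N+1} C(N, n-1) mu^(n-1) (1-mu)^(N+1-n) (1 + 1/n + ... + 1/n^(K-1))."""
    mu = Fraction(mu)
    total = Fraction(0)
    for n in range(1, N + 2):
        # Probability that a bit is held by exactly n nodes:
        # the data center plus n - 1 of the N databases.
        weight = comb(N, n - 1) * mu ** (n - 1) * (1 - mu) ** (N + 1 - n)
        # Download cost per desired bit when the content is replicated in n nodes.
        pir_cost = sum(Fraction(1, n ** k) for k in range(K))
        total += weight * pir_cost
    return total


def example_polynomial(mu):
    """The quadratic stated for K = 3 and N = 2: (17/18) mu^2 - (5/2) mu + 3."""
    mu = Fraction(mu)
    return Fraction(17, 18) * mu ** 2 - Fraction(5, 2) * mu + 3


if __name__ == "__main__":
    for mu in (Fraction(0), Fraction(1, 3), Fraction(1, 2), Fraction(1)):
        print(mu, optimal_normalized_download_cost(3, 2, mu), example_polynomial(mu))
```

Using `Fraction` avoids floating-point rounding, so the two expressions can be compared for exact equality at any rational $\mu$.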

Achievability Scheme
In the caching phase, to satisfy the storage size constraint, each database randomly and uniformly stores $3\mu L$ bits out of the total $3L$ bits from the data center. Each database operates independently through the same probability distribution, resulting in decentralized caching.
In the retrieval phase, suppose $N = 2$ databases, labeled as DB$_1$ and DB$_2$, in addition to the data center, labeled as DB$_0$, are available to the user, and the user wishes to retrieve file $A$ privately. Let us first focus on one file, say $A$. We can partition file $A$ into four subfiles, $A = (A_0, A_{0,1}, A_{0,2}, A_{0,1,2})$ (12), where, for $S \subseteq \{0, 1, 2\}$, $A_S$ denotes the bits of file $A$ which are stored only in the databases in $S$. For example, $A_0$ denotes the bits of file $A$ stored only in DB$_0$, $A_{0,2}$ denotes the bits of file $A$ stored only in DB$_0$ and DB$_2$, and so on. Since each bit is stored in the data center, 0 exists in the label of every partition. By the law of large numbers, $|A_0| = L(1-\mu)^2 + o(L)$, $|A_{0,1}| = |A_{0,2}| = L\mu(1-\mu) + o(L)$, and $|A_{0,1,2}| = L\mu^2 + o(L)$ (13), when the file size is large enough. We can do the same partitions for files $B$ and $C$.

To retrieve file $A$ privately, we first retrieve the subfile $A_{0,1,2}$ privately. We apply the PIR scheme proposed in [12] to retrieve the subfile $A_{0,1,2}$. Subfile $A_{0,1,2}$ is replicated in 3 databases and the total number of files is 3, since we also have $B_{0,1,2}$ and $C_{0,1,2}$. Therefore, we download $L\mu^2 \left(1 + \frac{1}{3} + \frac{1}{9}\right) + o(L)$ (14) bits. We also need to retrieve the subfile $A_{0,1}$ privately. Subfile $A_{0,1}$ is replicated in 2 databases and the total number of files is 3, since we also have $B_{0,1}$ and $C_{0,1}$. By applying the PIR scheme in [12], we download $L\mu(1-\mu) \left(1 + \frac{1}{2} + \frac{1}{4}\right) + o(L)$ (15) bits. Next, we need to retrieve the subfile $A_{0,2}$ privately. Using [12], we download $L\mu(1-\mu) \left(1 + \frac{1}{2} + \frac{1}{4}\right) + o(L)$ (16) bits. Finally, we need to retrieve $A_0$ privately. Using [12], we download $L(1-\mu)^2 (1 + 1 + 1) + o(L)$ (17) bits. By adding (14), (15), (16) and (17), we show that the normalized download cost $\frac{D}{L} = \frac{13}{9}\mu^2 + \frac{7}{2}\mu(1-\mu) + 3(1-\mu)^2 = \frac{17}{18}\mu^2 - \frac{5}{2}\mu + 3$ (18) is achievable.
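The subfile-by-subfile accounting in this example can be sketched in a few lines. The snippet assumes the large-file limit, in which each bit is cached by a given database with probability $\mu$, so the subfile fractions are $(1-\mu)^2$, $\mu(1-\mu)$, $\mu(1-\mu)$ and $\mu^2$; the helper names `sun_jafar_cost` and `example_breakdown` are hypothetical labels introduced here for illustration.

```python
from fractions import Fraction


def sun_jafar_cost(num_nodes, num_files):
    """Normalized download cost 1 + 1/M + ... + 1/M^(K-1) for M replicated nodes and K files."""
    return sum(Fraction(1, num_nodes ** k) for k in range(num_files))


def example_breakdown(mu, num_files=3):
    """Per-subfile normalized download for K = 3, N = 2 (large-file approximation)."""
    mu = Fraction(mu)
    # name -> (fraction of file A in this subfile, number of nodes holding it)
    subfiles = {
        "A_{0}": ((1 - mu) ** 2, 1),
        "A_{0,1}": (mu * (1 - mu), 2),
        "A_{0,2}": (mu * (1 - mu), 2),
        "A_{0,1,2}": (mu ** 2, 3),
    }
    return {
        name: fraction * sun_jafar_cost(nodes, num_files)
        for name, (fraction, nodes) in subfiles.items()
    }


if __name__ == "__main__":
    parts = example_breakdown(Fraction(1, 2))
    for name, cost in parts.items():
        print(name, cost)
    print("total", sum(parts.values()))
```

Summing the four contributions at any rational $\mu$ reproduces the quadratic $\frac{17}{18}\mu^2 - \frac{5}{2}\mu + 3$ of the example.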

Converse Proof
Here, we show that among all the decentralized caching probability distributions $P_H$, the lowest normalized download cost for $N = 2$ databases is as shown in (18). Given a decentralized caching probability distribution $P_H$, we have a resulting $H$ in the retrieval phase. We lower bound $D_H$ first. In the retrieval phase, the stored contents of DB$_0$, DB$_1$, and DB$_2$ are fixed and uncoded, i.e., $Z_0$, $Z_1$ and $Z_2$ are fixed and uncoded. We can apply the lower bound in [44, Eqn. (31)] as the lower bound for $D_H$, where (20) holds due to $Z_0 = (W_1, W_2, W_3)$, and (21) holds due to (2). We note that different $H$ result in different $Z_1$ and $Z_2$.

We lower bound $D$ now. From (22), we obtain the bound (23) on $D$. Let the random variables $X^{(n)}_{i,j}$, $i = 1, \dots, L$, $j = 1, \dots, K$, be the indicator functions showing whether the $i$th bit of file $W_j$ is cached in DB$_n$ or not, i.e., $X^{(n)}_{i,j} = 1$ means that the $i$th bit of file $W_j$ is stored in DB$_n$ and $X^{(n)}_{i,j} = 0$ means that it is not stored in DB$_n$. For DB$_1$, we have $X^{(1)}_{1,1} + X^{(1)}_{2,1} + \dots + X^{(1)}_{L,3} \le 3\mu L$ (24) due to the storage size constraint in (2). We note that $P_H$ induces probability measures on the random variables $X^{(n)}_{i,j}$, and let $X^{(n)}_{i,j} = 1$ with probability $p_{i,j}$, where we remove the superscript $n$ since each database adopts the same probability distribution $P_H$ to choose the cached bits due to the decentralized property. By taking the expectation of (24) and applying the linearity of expectation, we have $E[X^{(1)}_{1,1}] + \dots + E[X^{(1)}_{L,3}] \le 3\mu L$ (25), which yields $\sum_{j=1}^{3} \sum_{i=1}^{L} p_{i,j} \le 3\mu L$ (26). Let the random variables $V_{i,j}$, $i = 1, \dots, L$, $j = 1, \dots, K$, be the indicator functions showing that the $i$th bit of file $W_j$ is cached in neither DB$_1$ nor DB$_2$, i.e., $V_{i,j} = 1$ means that the $i$th bit of file $W_j$ is not stored in either DB$_1$ or DB$_2$. Therefore, since the two databases cache independently, we have $P(V_{i,j} = 1) = (1 - p_{i,j})^2$. Evaluating (23) with these indicator variables and continuing from (23), we obtain the lower bound (31) in terms of $p_{1,1}, \dots, p_{L,3}$, which are subject to (26). To further lower bound the right-hand side of (31), we minimize it with respect to $p_{i,j}$ subject to (26). Hence, we consider the corresponding Lagrangian. From the KKT conditions, the minimum is attained when all $p_{i,j}$ are equal. Thus, we can further lower bound (31) by letting $p_{i,j} = \mu$ for all $i$ and $j$. Therefore, we show that the optimal normalized download cost is $\frac{17}{18}\mu^2 - \frac{5}{2}\mu + 3$ when $N = 2$ databases in addition to the data center are available in the retrieval phase. To achieve the optimal normalized download cost, each database should randomly and uniformly store the bits in the caching phase.
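The role of the bit marginals $p_{i,j}$ can be illustrated numerically. The sketch below rests on an assumption that is ours and not a transcription of the paper's bound: the per-bit contribution is taken to be the convex function $f(p) = \frac{17}{18}p^2 - \frac{5}{2}p + 3$, inferred from the final result for $K = 3$, $N = 2$. Under that assumption, it compares the marginals induced by the three caching options of Fig. 4 ($K = 3$, $L = 4$, 4 cached bits per database, i.e., $\mu = \frac{1}{3}$). With files this small the law of large numbers does not apply, so the numbers only illustrate the convexity argument.

```python
from fractions import Fraction


def per_bit_cost(p):
    """Assumed per-bit objective f(p) = (17/18) p^2 - (5/2) p + 3 (our assumption, see lead-in)."""
    p = Fraction(p)
    return Fraction(17, 18) * p ** 2 - Fraction(5, 2) * p + 3


def average_cost(marginals):
    """Average of f over all K*L bits, given each bit's caching probability."""
    return sum(per_bit_cost(p) for p in marginals) / len(marginals)


# Marginal caching probabilities for the 12 bits (a1..a4, b1..b4, c1..c4).
option_1 = [Fraction(1)] * 4 + [Fraction(0)] * 8        # always cache the first file
option_2 = [Fraction(1, 3)] * 12                        # 4 of 12 bits, uniformly at random
option_3 = [Fraction(1, 2)] * 4 + [Fraction(1, 4)] * 8  # 2 bits of a, 1 bit each of b and c

if __name__ == "__main__":
    for name, marginals in (("option 1", option_1), ("option 2", option_2), ("option 3", option_3)):
        # Every option spends the same budget: the marginals sum to mu*K*L = 4.
        print(name, "budget", sum(marginals), "average cost", average_cost(marginals))
```

All three options use the full storage budget, yet the uniform marginals of option 2 give the smallest average, as expected from Jensen's inequality for a convex $f$.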

Further Examples and Numerical Results
Now, we use different scenarios to illustrate the optimal normalized download cost in (11). We first consider the scenario where the data center contains $K = 10$ files, each database has storage size constraint $\mu = \frac{1}{2}$, and in the retrieval phase, the user can access $N = 0, \dots, 30$ databases in addition to the data center. We plot the expected normalized download cost versus the number of available databases in Fig. 5. When $N = 0$, in order to download the desired file privately, the user should download all the files in the data center, and this results in a download cost of $\frac{D}{L} = K = 10$. As the number of accessible databases increases, the normalized download cost decreases. We next consider the scenario where the data center contains $K = 10$ files, and the user can access $N = 5$ databases in addition to the data center in the retrieval phase. We plot the expected normalized download cost versus the storage size constraint $\mu$ in Fig. 6. When $\mu = 0$, in order to download the desired file privately, the user should download all the files in the data center, resulting in $\frac{D}{L} = K = 10$. As $\mu$ increases, the normalized download cost decreases. Finally, we conclude this section with the following general remarks about our main result.
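The two scenarios described above can be reproduced numerically from the abstract's closed-form expression. This is a minimal floating-point sketch; the function name `download_cost` and the grid of $\mu$ values (steps of 0.05) are our own choices, not taken from the paper.

```python
from math import comb


def download_cost(K, N, mu):
    """Evaluate the normalized download cost D/L for K files, N databases, storage fraction mu."""
    total = 0.0
    for n in range(1, N + 2):
        # Probability that a bit sits in the data center and exactly n - 1 databases.
        weight = comb(N, n - 1) * mu ** (n - 1) * (1 - mu) ** (N + 1 - n)
        # 1 + 1/n + ... + 1/n^(K-1)
        total += weight * sum(n ** (-k) for k in range(K))
    return total


# Scenario 1: K = 10, mu = 1/2, N = 0, ..., 30.
costs_vs_N = [download_cost(10, N, 0.5) for N in range(31)]
# Scenario 2: K = 10, N = 5, mu swept over [0, 1] in steps of 0.05 (grid is an assumption).
costs_vs_mu = [download_cost(10, 5, m / 20) for m in range(21)]

if __name__ == "__main__":
    for N in (0, 1, 5, 10, 30):
        print("N =", N, "D/L =", round(costs_vs_N[N], 4))
    for m in (0, 5, 10, 15, 20):
        print("mu =", m / 20, "D/L =", round(costs_vs_mu[m], 4))
```

Both sequences start at $K = 10$ and decrease monotonically, matching the qualitative behavior described for the two plots.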

Remarks
Remark 1 The achievability scheme consists of two parts: the design of the probability distribution in the caching phase and the PIR scheme in the retrieval phase. We find that the uniform and random caching scheme, originally proposed in [57] for decentralized coded caching, results in the optimal normalized download cost in the retrieval phase. We remark here that the symmetric batch caching scheme, originally proposed in [55] for centralized coded caching, also results in the optimal normalized download cost for PIR from centralized uncoded caching databases [44]. In the retrieval phase, according to the distribution of the subfiles, we apply the PIR scheme proposed in [12] for all subfiles to retrieve the desired file.
Remark 2 For the converse, we first apply the lower bound derived in [44], which introduces new ingredients in addition to the interference lower bound lemma and the induction lemma in [12, Lemma 5 and Lemma 6]. We note that in [44] the authors replace the random variables for queries and answering strings by the contents of the distributed databases in a novel way, which is crucial for the converse. With this replacement, we can account for different cached content in the caching phase resulting in different lower bounds on the normalized download cost in the retrieval phase. Due to the nature of uncoded caching, this replacement facilitates a further lower bound. For the decentralized problem here, to compare different probability distributions in the caching phase, we focus on the marginal distributions on each bit. This transformation allows us to use the linearity of expectation, together with the nature of decentralization and uncoded caching, to further lower bound the expected normalized download cost.
Remark 3 A more directly related PIR problem from centralized uncoded caching databases for our setting is the one where, in the caching phase, the data center arranges the files in $N$ databases in a centralized manner, and in the retrieval phase, the user also has access to the data center in addition to the $N$ databases. This is different from the problem setting in [38,44], since there the user can only access the $N$ databases in the retrieval phase. As a side note, we can show that the symmetric batch caching scheme is still optimal for this extended problem setting where the data center also participates in the PIR stage. Rigorously, the optimal trade-off between storage and download cost in this case is given by the lower convex envelope of a set of $(\mu, D(\mu))$ pairs indexed by $t = 0, 1, \dots, N$. To achieve this trade-off, the data center arranges the files into the $N$ databases as in [38,44]. In the retrieval phase, the user also accesses the data center; therefore, the subfiles are stored in one more database. For the converse, we no longer require all the $N$ databases to reconstruct the entire $K$ files as in [38,44]. Thus, while in [38,44] the smallest allowable $\mu$ is $\mu = \frac{1}{N}$, since the $N$ databases need to reconstruct the entire $K$ files, here, since the user can access the data center, the parameter $\mu$ starts from 0. Now, we can compare PIR from centralized caching databases and PIR from decentralized caching databases fairly, since in the retrieval phase, the user can access the data center in both cases. We consider the case where $K = 10$ and $N = 5$, and plot the result in Fig. 7.

Achievability Scheme
The achievability scheme consists of two parts: the design of the probability distribution used in the caching phase and the PIR scheme used in the retrieval phase. In the caching phase, each database uniformly and randomly stores $\mu K L$ bits from the data center. The storage size constraint in (2) is satisfied directly. Each database operates independently through the same probability distribution, resulting in decentralized caching.
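In the retrieval phase, suppose there are $N$ databases in addition to the data center available to the user. Each file $W_j$ can be expressed as $W_j = \bigcup_{S \subseteq \{0, 1, \dots, N\},\, 0 \in S} W_{j,S}$, where $W_{j,S}$ represents the bits of file $W_j$ which are stored only in the databases in $S$. Since each bit must be stored in the data center, i.e., DB$_0$, we have $\{0\} \subseteq S$. By the law of large numbers, $|W_{j,S}| = L \mu^{|S|-1} (1-\mu)^{N+1-|S|} + o(L)$ when the file size is large enough.

To retrieve the desired file, say $W_j$, privately, we retrieve each subfile, $W_{j,S}$, privately. Subfile $W_{j,S}$ is replicated in $|S|$ databases, and for each of these $|S|$ databases, there are $K$ subfiles, i.e., $W_{k,S}$, $k = 1, \dots, K$. We apply the PIR scheme in [12] to retrieve $W_{j,S}$ privately by downloading $|W_{j,S}| \left( 1 + \frac{1}{|S|} + \dots + \frac{1}{|S|^{K-1}} \right)$ bits. We also note that there are $\binom{N}{|S|-1}$ types of $W_{j,S}$ for each value of $|S|$. Therefore, the following normalized download cost is achievable: $\frac{D}{L} = \sum_{n=1}^{N+1} \binom{N}{n-1} \mu^{n-1} (1-\mu)^{N+1-n} \left( 1+ \frac{1}{n} + \dots+ \frac{1}{n^{K-1}} \right)$.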
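The law-of-large-numbers step in the scheme above can be illustrated with a small Monte Carlo sketch. It simulates uniform and random caching, counts how many nodes hold each bit, and applies the per-subfile cost $1 + \frac{1}{n} + \dots + \frac{1}{n^{K-1}}$. Two simplifications are our own assumptions: the replication counts are pooled over all $KL$ bits rather than tracked per file (by symmetry the per-file fractions behave the same way), and the function names, file size, and seed are illustrative only.

```python
import random
from collections import Counter


def simulate_subfile_fractions(K, L, N, mu, seed=0):
    """Fraction of the K*L bits held by exactly n nodes (data center included), n = 1..N+1."""
    rng = random.Random(seed)
    total_bits = K * L
    cache_size = int(mu * total_bits)
    # Every bit is in the data center (DB_0), so replication starts at 1.
    replication = [1] * total_bits
    for _ in range(N):
        # Each database independently caches a uniformly random subset of mu*K*L bits.
        for bit in rng.sample(range(total_bits), cache_size):
            replication[bit] += 1
    counts = Counter(replication)
    return {n: counts.get(n, 0) / total_bits for n in range(1, N + 2)}


def empirical_download_cost(K, L, N, mu, seed=0):
    """Normalized download cost obtained by applying the replicated-PIR cost to each subfile class."""
    fractions = simulate_subfile_fractions(K, L, N, mu, seed)
    return sum(
        fraction * sum(n ** (-k) for k in range(K))
        for n, fraction in fractions.items()
    )


if __name__ == "__main__":
    print(simulate_subfile_fractions(K=3, L=20000, N=2, mu=0.5))
    print(empirical_download_cost(K=3, L=20000, N=2, mu=0.5))
```

For $K = 3$, $N = 2$, $\mu = \frac{1}{2}$ the empirical value should land close to $\frac{143}{72} \approx 1.986$, the value of the closed-form expression, with the gap shrinking as $L$ grows.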

Converse Proof
We first derive a lower bound for $D_H$. Since in the retrieval phase the contents of DB$_0$, $\dots$, DB$_N$ are fixed to be $Z_0, \dots, Z_N$, we can use the lower bound derived in [44, Eqn. (71)] to serve as the lower bound for $D_H$. A key step to obtain [44, Eqn. (71)] is to replace the query and answering string random variables with the content of each database, i.e., replacement of the queries $Q_n^{[k]}$ and answering strings $A_n^{[k]}$ with the stored contents $Z_n$. With this replacement, one can account for different cached content in the caching phase resulting in different lower bounds on the normalized download cost in the retrieval phase. In addition, due to the nature of uncoded caching, this replacement facilitates a further lower bound. Moreover, to obtain [44, Eqn. (71)], the authors find interesting recursive relationships to compactly deal with the nested harmonic sums. Therefore, from [44, Eqn. (71)] we obtain a lower bound on $D_H$ in terms of the quantities $x_l$ given in (42), where $W_{1:K,S}$ represents the bits of files $W_{1:K}$ which are stored in the databases in $S$.
In the following lemma, we develop a lower bound for $E[x_l]$.
Lemma 1 For $l \in [1 : N + 1]$, and $x_l$ given in (42), we have

To compare different probability distributions in the caching phase, we focused on the marginal distributions on individual bits. By using the nature of decentralization and uncoded caching, we further lower bounded the normalized download cost. Finally, we showed the matching converse for the expected normalized download cost, obtaining the exact capacity of the resulting PIR problem.

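[Figure 4: Example with $K = 3$ files of 4 bits each. Data center: $a_1 a_2 a_3 a_4\ b_1 b_2 b_3 b_4\ c_1 c_2 c_3 c_4$. Caching option 1: $a_1 a_2 a_3 a_4$. Caching option 2: $a_3 b_1 b_4 c_2$. Caching option 3: $a_1 a_2 b_1 c_1$. The user accesses the system in the retrieval phase.]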

Figure 5: PIR from different numbers of available databases in the retrieval phase with $K = 10$ and $\mu = \frac{1}{2}$.

Figure 7: PIR from centralized caching databases and decentralized caching databases.