A Multi-Keyword Searchable Encryption Scheme Based on Probability Trapdoor over Encryption Cloud Data

With the rapid development of cloud computing, massive amounts of data are transferred to cloud servers to reduce storage and management costs. For privacy reasons, data should be encrypted before being uploaded. In the encrypted domain (ED), however, many data-processing methods that work in the plain domain are no longer applicable, and data retrieval has become a significant obstacle to cloud storage services. To overcome this limitation, we propose a multi-keyword searchable encryption scheme based on probabilistic trapdoors. First, a probabilistic keyword trapdoor is established to ensure that the scheme can resist indistinguishability attacks. Based on this trapdoor, we introduce a keyword vector that enables multi-keyword search during data retrieval in the ED. Both the security and the performance analysis confirm the advantages of the proposed scheme in terms of search functionality and complexity.


Introduction
Presently, with the rapid development of Internet technology, data are undergoing explosive growth, placing an ever-increasing burden on data management and computation, a burden that cloud computing was designed to relieve. As a new data-processing paradigm, cloud computing became popular worldwide and was quickly adopted by users and organizations, strongly promoted by IBM, Google, Yahoo, and other companies.
Cloud storage, an extension of the cloud computing application field, provides users with data storage and access. Due to its convenience, low cost, and flexible access, outsourcing local data to cloud servers is becoming more and more popular. However, because of the cloud server's vulnerability and the uncertainty of the external environment, data stored in the cloud suffer from privacy leakage [1][2][3][4][5]. Therefore, users prefer to encrypt their private data and upload it to the cloud server in the encrypted domain (ED). Traditionally, ciphertext retrieval downloads the entire document set in the ED to the local side first and decrypts all the ciphertext to obtain the required data and documents. However, this is inefficient for data outsourced to the cloud, which traditional ciphertext retrieval methods were not designed for [6,7]. Intuitively, users expect an efficient ciphertext retrieval method that obtains the required data from cloud storage without downloading and decrypting the entire ciphertext, so that the practical potential of cloud storage can be unleashed.
Towards this requirement, searchable encryption technology [8] was created. It completes data retrieval in the ED without disclosing data privacy. With ciphertext retrieval, the cloud server only needs to perform a ciphertext search according to the keyword trapdoor provided by the users, without the users downloading the entire set of ciphertext documents. Thus, ciphertext queries are accelerated and bandwidth consumption is saved. Unfortunately, most existing searchable encryption schemes complete ciphertext retrieval with deterministic trapdoors, which makes them unable to resist indistinguishability attacks. Some searchable encryption schemes based on probabilistic trapdoors can resist indistinguishability attacks, but at the cost of functionality: these schemes can only complete a single-keyword search, not a multi-keyword search. Consequently, in this paper, we propose a searchable encryption scheme that supports multi-keyword searches. The main contributions are as follows.
(1) A multi-keyword searchable encryption scheme is presented. By introducing the keyword vector when constructing the trapdoor, the scheme can realize multi-keyword searches in the ciphertext search process, thus enhancing the scheme's functionality.
(2) In this scheme, the probabilistic trapdoor construction makes the scheme resistant to indistinguishability attacks and ensures its security.
(3) The comparison results between this scheme and other searchable encryption schemes prove that this scheme has distinct advantages over other schemes in terms of the search function and storage complexity.
The remainder of this paper is organized as follows. Section 2 gives a brief review of related work in the literature. Section 3 presents the system model, threat model, and design goal, and introduces the definitions and preliminaries. Section 4 gives the whole process of the multi-keyword search based on probabilistic trapdoors. The security analysis of the mechanism, the storage complexity, and the performance analysis are presented in Section 5. The conclusions are drawn in Section 6.

Related Works
Generally, searchable encryption has begun to take shape and can be roughly divided into three categories: single-keyword search, multi-keyword search, and fuzzy keyword search. The specific research works are described as follows.

Single Keyword Search
The concept of searchable encryption was proposed by Song et al. [7], who also presented the first symmetric searchable encryption scheme for ciphertext data retrieval. However, the search complexity of the scheme increases linearly with the size of the document collection; it can only perform a simple keyword search, at high cost and low efficiency. Later, Boneh et al. [9] proposed the first public key-based searchable encryption scheme (PKES), which relies on bilinear mapping operations, resulting in high computation costs and low search efficiency. Considering the privacy of the trapdoor, Curtmola et al. [10] adopted inverted-index technology to improve search efficiency. With the inverted index, the search complexity depends only on the number of keywords and is independent of the size of the document collection. Notice that this scheme defined the security goals of a symmetric searchable encryption scheme for the first time.

Multi-Keyword Search
Undoubtedly, single-keyword searchable schemes cannot meet the user's need to retrieve data with multiple keywords. In 2004, Golle et al. [11] proposed the first multi-keyword searchable encryption scheme supporting simple queries. A more practical query scheme was soon given by Boneh et al. [12], which supports arbitrary conjunctive queries, such as comparison queries and subset queries. Cao et al. [13] introduced the first searchable encryption scheme that truly supports multi-keyword search; it returns output sorted by the relative weight of documents, which saves network bandwidth. Later, in 2014, Cao et al. [14] put forward a privacy-preserving multi-keyword searchable encryption scheme. It was the first scheme to introduce coordinate matching into multi-keyword ranked search, although its accuracy is insufficient because it does not consider the weight differences among keywords. In 2015, the inverted index was first employed to realize a multi-keyword search by Wang et al. [15]. To improve the efficiency of multi-keyword schemes, Xia et al. [16] designed a tree-based index according to the vector model and the term frequency and inverse document frequency models, and introduced the index into the search process. Recently, Ding et al. [17] also constructed a tree-based index and proposed a random traversal algorithm, which enables the scheme to complete the ciphertext search more quickly.

Fuzzy Keyword Search
Fuzzy keyword search allows users to input content with subtle errors or format inconsistencies, which greatly improves the practicality of a scheme and the user experience. In 2010, the first fuzzy keyword search scheme was proposed by Li et al. [18], in which the similarity of keywords was measured by edit distance. For massive data collections this approach is not feasible, because the size of the fuzzy keyword set might grow exponentially, leading to costly memory consumption and resource waste. To overcome this problem, Locality Sensitive Hashing (LSH) was introduced to improve fuzzy search by Wang et al. [19], so the memory-hungry fuzzy keyword set could be abandoned. Unfortunately, due to the predefined Bloom filter or vector requirement, a high storage overhead cannot be fully avoided, so the scheme does not work well when the target data set is too large. In [20], Fu et al. employed a gram-based fuzzy set to implement a fuzzy keyword search with better efficiency. However, the scheme cannot withstand indistinguishability attacks because it generates deterministic keyword trapdoors. Tahir et al. [21] proposed a keyword search scheme based on a probabilistic trapdoor, which can resist indistinguishability attacks; this scheme supports single-keyword search but cannot complete multi-keyword search. To support logic queries over encrypted data, Ref. [8] presented a fuzzy search scheme which is expected to be combined with exact search.

System Model
As illustrated in Figure 1, the system model includes two main entities: the Cloud Servers (CS) and the Client User (CU). With sufficient computation and storage resources, CS provides users or organizations with management and maintenance of massive data, storage services, quick access, and complex computing services to obtain commercial benefits. CU uses these services provided by CS. Generally, for data privacy and security, CU prefers data being outsourced to CS in ED, i.e., data should be encrypted first. To search the interested data, CU submits the corresponding keyword trapdoor to CS. Then CS returns the corresponding document data.

(Figure 1 depicts the encrypted documents, the index table, and the searchable trapdoors exchanged between CU and CS.)

Threat Model
In this scheme, we consider CS to be "semi-honest", since it can be any third party providing cloud services. In other words, the cloud server satisfies the following descriptions.
(1) CS should ensure data security and integrity, and it will not remove or tamper with the data outsourced by CU.
(2) CS should execute CU's query request honestly according to the preset protocol and return the complete query result.
(3) CS is curious and wants to infer and analyze additional private information from the retrieved data.
Throughout the scenario, CU is an honest entity who faithfully encrypts the outsourced documents, builds the searchable index table, and uploads them to CS without colluding with CS.

Design Goal
To complete a multi-keyword search on ciphertext in the above model, the proposed scheme's objectives are as follows.

• Data privacy: the scheme ensures data security and prevents CS from obtaining any additional private information during the whole interactive process, including the document collection, index table, and keyword trapdoors.

Notations
Before presenting the algorithms in the proposed scheme, we list the used notations in Table 1.

Table 1. Notations.

Notation                      Description
λ                             The security parameter.
δ                             The threshold used to extract keywords of documents.
m                             The total number of keywords.
n                             The total number of plaintext documents.
RF                            The correlation frequency of the keywords with the documents.
–                             The set of ciphertext documents.
W_i                           The set of the i-th plaintext document's keywords.
W                             The keyword set of all plaintext documents.
W′ = {w_1, w_2, ..., w_l}     The set of the keywords to be searched.
F                             The set of ciphertext documents returned by CS.
f                             The set of plaintext documents corresponding to F.

• (K, k_s, p) ← KeyGen(1^λ): a probabilistic algorithm run by CU, which takes a security parameter λ as input and outputs the master key K, a session key k_s, and a prime number p.
• I ← Build_Index(K, D): a deterministic algorithm run by CU, which takes the master key K and a document collection D as input and outputs the secure index table I.
• T_W′ ← Build_Trap(K, k_s, W′): a probabilistic algorithm run by CU. It takes the master key K, the session key k_s, and a set of keywords to be searched W′ as input, and outputs a set of keyword trapdoors T_W′.
• F ← Search_Output(k_s, I, T_W′): a deterministic algorithm run by CS. It takes the session key k_s, the secure index table I, and the set of keyword trapdoors T_W′ as input, and outputs the collection F of ciphertext documents containing the searched keywords.
• f ← Dec_Document(K, F): a deterministic algorithm run by CU, which takes the master key K and the collection of ciphertext documents F as input, and outputs the collection of plaintext documents f.

Correctness
Generally, an MSE scheme ∏ is correct if, for any (K, k_s, p) output by KeyGen(1^λ), any I output by Build_Index(K, D), any T_W′ output by Build_Trap(K, k_s, W′), any F output by Search_Output(k_s, I, T_W′), and any f output by Dec_Document(K, F), the returned plaintext set satisfies f = { D_i ∈ D : W′ ⊆ W_i }. Here, W′ is the set of the keywords to be searched, and W_i is the set of keywords extracted from the i-th plaintext document.

Threshold
In the considered scenario, CU sets the threshold δ which is critical for controlling the extraction of keywords. When the threshold is met, it indicates that the word can be used as a keyword. In this paper, δ is set according to CUs' requirements.

Preliminaries
For a searchable encryption scheme, keyword extraction is crucial for the subsequent ciphertext search. In this paper, we use the TF*IDF technique to extract keywords from the plaintext documents. The specific process is described as follows.
(1) For each word w in the plaintext document f_i, CU calculates its term frequency TF_{f_i,w} = n_{f_i,w} / N_{f_i}, where n_{f_i,w} represents the number of occurrences of w in the document and N_{f_i} is the total number of words in f_i.
(2) CU calculates the inverse document frequency IDF_w = log(|f| / p), where |f| represents the total number of documents and p = |{f_q ∈ f : w ∈ f_q}| represents the number of documents which contain w.
(3) CU computes TF_{f_i,w} · IDF_w, and takes the word w as a keyword of the plaintext document f_i when the value is larger than or equal to the preset threshold δ.
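A minimal sketch of the TF*IDF extraction steps above in Python; the function name and data layout are illustrative, not taken from the paper:

```python
import math
from collections import Counter

def extract_keywords(docs, delta):
    """Extract keywords from each plaintext document via TF*IDF.

    docs: list of documents, each given as a list of word tokens.
    delta: threshold; w is a keyword of document i when TF*IDF >= delta.
    Returns a list of keyword sets, one per document.
    """
    total_docs = len(docs)
    # Number of documents containing each word (for the IDF term).
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    keyword_sets = []
    for doc in docs:
        counts = Counter(doc)
        n_words = len(doc)
        keywords = set()
        for w, n_w in counts.items():
            tf = n_w / n_words                        # TF_{f_i,w}
            idf = math.log(total_docs / doc_freq[w])  # IDF_w
            if tf * idf >= delta:
                keywords.add(w)
        keyword_sets.append(keywords)
    return keyword_sets
```

Note that a word appearing in every document gets IDF = 0 and is never selected, which matches the intent of filtering out uninformative words.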

KeyGen Phase
CU inputs a security parameter λ and gets the master key K and a session key k_s. In addition, a prime number p is generated randomly using a Cryptographically Secure Pseudo-Random Number Generator (CSPRNG). In this phase, K ∈ {0, 1}^λ, k_s ∈ {0, 1}^λ, and p is a randomly generated (λ + 1)-bit prime.
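A possible sketch of this KeyGen phase, assuming the stdlib `secrets` module as the CSPRNG and a Miller-Rabin test for the (λ + 1)-bit prime; the helper names are illustrative:

```python
import secrets

def is_probable_prime(n, rounds=40):
    """Miller-Rabin primality test (probabilistic)."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = secrets.randbelow(n - 3) + 2
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def key_gen(lam):
    """KeyGen(1^λ): master key K, session key k_s, and a (λ+1)-bit prime p."""
    K = secrets.token_bytes(lam // 8)    # master key, λ bits
    k_s = secrets.token_bytes(lam // 8)  # session key, λ bits
    while True:
        # Force the top bit so p really has λ+1 bits, and the low bit so p is odd.
        p = secrets.randbits(lam + 1) | (1 << lam) | 1
        if is_probable_prime(p):
            return K, k_s, p
```

For example, `key_gen(128)` yields two 16-byte keys and a 129-bit prime.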

Build_Index Phase
According to the keyword collection W extracted from the plaintext document collection D, CU builds the index table I. The steps of this build_index phase are as follows.
(1) Extract the keyword set W_i = {w_{i1}, w_{i2}, ..., w_{im_i}} of each plaintext document D_i in the plaintext document set D = {D_1, D_2, ..., D_n}, and put it into the keyword set W = {W_1, W_2, ..., W_n}.
(2) Select a hash function H following Ref. [11], and use the master key K to calculate the hash value H_K(w) of each keyword.
In the subsequent steps, the entries of a two-dimensional array A are set to A[i][j] = r · RF_ij when the j-th keyword appears in the document D_i, and to A[i][j] = 0 otherwise, where r represents a random number and RF_ij is the correlation frequency of the j-th keyword with the document D_i. Multiplying RF_ij by the random number r covers up the relative frequency of the keywords with respect to the documents, which helps prevent frequency-analysis attacks and the leakage of document size.
(7) Take the array A as the index table I and send it to CS.
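The surviving parts of this phase can be sketched as follows, assuming H is instantiated as HMAC-SHA-256 and that the array A is filled as described above (the intermediate steps are elided in the text, so the exact layout of A is an assumption):

```python
import hmac
import hashlib
import secrets

def build_index(K, doc_keyword_sets, all_keywords, rel_freq):
    """Sketch of Build_Index: an n x m array A whose entry A[i][j] is
    r * RF_ij when keyword j appears in document i, and 0 otherwise.

    K: master key (bytes).
    doc_keyword_sets: list of per-document keyword sets (one per document).
    all_keywords: ordered list of the m distinct keywords.
    rel_freq: rel_freq[i][w] gives RF of keyword w in document i.
    """
    def h(word):
        # Keyed hash H_K(w); HMAC-SHA-256 stands in for the paper's H.
        return hmac.new(K, word.encode(), hashlib.sha256).hexdigest()

    hashed = [h(w) for w in all_keywords]      # hashed keyword labels
    index = []
    for i, kws in enumerate(doc_keyword_sets):
        r = secrets.randbelow(2**32 - 1) + 1   # random mask r, never zero
        row = [r * rel_freq[i][w] if w in kws else 0 for w in all_keywords]
        index.append(row)
    return hashed, index
```

Because each row uses a fresh random r, equal correlation frequencies in different documents produce different masked values, which is what hides the raw frequencies from CS.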

Build_Trap Phase
In the build_trap phase, CU constructs the keyword trapdoor set T_W′. The implemented steps are as follows.
(1) Calculate the hash value H_K(w_t) of the keyword w_t according to the hash function H and the master key K, and set a_t = H_K(w_t), where w_t ∈ W′ = {w_1, w_2, ..., w_l}, t = 1, 2, ..., l, and l represents the number of keywords in the searched keyword set W′.
(2) Use the master key K to encrypt the keyword w_t, get the ciphertext Enc_K(w_t) of the keyword w_t, and set b_t = Enc_K(w_t).
(3) Calculate c_t = a_t · b_t, as well as the hash value H_{k_s}(b_t) of the keyword ciphertext b_t = Enc_K(w_t) according to the hash function H and the session key k_s. Finally, set d_t = H_{k_s}(b_t).
(4) Construct the vector T_t = (0, ..., r, ..., 0): if the keyword w_t appears at a given position among the m keywords, the value at the corresponding position of T_t is r; otherwise, the value at that position is 0.
(5) Get the trapdoor T_{w_t} = (d_t, c_t, T_t).
(6) Get the trapdoor collection T_W′ = {T_{w_1}, ..., T_{w_t}, ..., T_{w_l}} and send it to CS.
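The steps above can be sketched as follows, assuming HMAC-SHA-256 for the keyed hash H and a toy XOR stream cipher standing in for the probabilistic Enc_K (the paper would use a real cipher such as AES-CBC); all helper names are illustrative:

```python
import hmac
import hashlib
import secrets

def _enc(K, msg):
    # Stand-in probabilistic cipher: keystream = SHA-256(K || nonce) XORed
    # with the (short) message. Illustrative only, not the paper's cipher.
    nonce = secrets.token_bytes(16)
    stream = hashlib.sha256(K + nonce).digest()
    ct = bytes(m ^ s for m, s in zip(msg, stream))
    return nonce + ct

def build_trap(K, k_s, search_words, all_keywords):
    """Build_Trap: for each searched keyword w_t, emit T_wt = (d_t, c_t, T_t)."""
    r = secrets.randbelow(2**32 - 1) + 1          # random mask for the vectors
    traps = []
    for w in search_words:
        a = int.from_bytes(hmac.new(K, w.encode(), hashlib.sha256).digest(),
                           "big")                 # a_t = H_K(w_t)
        b = _enc(K, w.encode())                   # b_t = Enc_K(w_t), randomized
        c = a * int.from_bytes(b, "big")          # c_t = a_t * b_t
        d = hmac.new(k_s, b, hashlib.sha256).digest()  # d_t = H_ks(b_t)
        # Keyword-position vector: r at w_t's position among the m keywords.
        T = [r if kw == w else 0 for kw in all_keywords]
        traps.append((d, c, T))
    return traps
```

Because the nonce in `_enc` is fresh per call, repeated queries for the same keyword produce different trapdoors, which is the property that defeats indistinguishability attacks.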

Search_Output Phase
In the search_output phase, CS performs the keyword search on the ciphertext documents according to the received index table I and the trapdoor set T_W′. The specific steps are as follows.
(3) If R ≥ l, return the document Enc_K(D_h) related to the document identifier Enc_K(id(D_h)), and add it to the candidate set F′, where 1 ≤ h ≤ n.
(4) Intersect the candidate ciphertext document sets F′ and F″, and get the final result F = F′ ∩ F″.
(5) Send the collection of ciphertext documents F to CU.
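Since the first Search_Output steps are elided in the text, the sketch below makes an assumption: R counts how many of the l searched-keyword positions have a nonzero masked frequency in a document's row of A. The names are illustrative:

```python
def search_output(index_rows, enc_docs, trap_vectors, l):
    """Sketch of the matching step: a document matches when the number of
    searched-keyword positions with a nonzero masked frequency reaches l.

    index_rows: the n rows of the masked-frequency array A.
    enc_docs: the n ciphertext documents aligned with those rows.
    trap_vectors: the T_t vectors from the l keyword trapdoors.
    """
    result = []
    for row, doc in zip(index_rows, enc_docs):
        # R = number of searched keywords present in this document's row
        # (assumption: the paper's first Search_Output steps are elided).
        R = sum(1 for T in trap_vectors
                if any(t != 0 and x != 0 for t, x in zip(T, row)))
        if R >= l:
            result.append(doc)
    return result
```

With l trapdoors and the threshold R ≥ l, only documents containing every searched keyword are returned, which is the conjunctive multi-keyword semantics the scheme targets.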

Dec_Documents Phase
To obtain the required data in this dec_document phase, CU decrypts the ciphertext document set F with the master key K and gets the plaintext document set f, which contains the multi-keyword set W′.

Privacy Leakage
This section examines the information that might be compromised by the scheme, i.e., the components of the polynomial-time algorithms that might reveal private information: the index table I, the trapdoor set T_W′, and the search output F. Under the standard model, the privacy leakage of the scheme is defined as follows.

• Leakage L_1: this leakage function is related to the index table I. It is assumed that the index table I generated by the client user is leaked to the cloud server and the attacker A.
• Leakage L_2: this leakage function is related to the trapdoor T_{w_i} of the keyword w_i. It is presumed that the trapdoor T_{w_i} generated by the client user is released to the cloud server and the attacker A.
• Leakage L_3: this leakage function is related to the Search_Output result generated by the trapdoor T_{w_i}. It is supposed that the Search_Output result is revealed to the cloud server and the attacker A.

Privacy Leakage Analysis
Since the trapdoor is generated with probabilistic encryption algorithms and hash functions, the leaked information associated with the trapdoor is meaningless and will not be discussed further in this subsection. In the proposed MSE scheme, random numbers are deployed to obscure the correlation frequencies; the masked entries still expose whether a keyword is in a document or not, but this does not affect the untraceability and indistinguishability of the trapdoor. Therefore, we focus on the security and privacy issues caused by the leakage functions L_1 and L_3. The scheme remains secure even if the information covered by (L_1, L_2, L_3) is disclosed: the leaked information is either encrypted, a hash value, or hidden by random numbers. It is presumed that the adversary cannot acquire the master key K or the hash function H. More precisely, since the hash function is not invertible, no one can recover the hashed message in polynomial time from a given hash value. Besides, we use a probabilistic encryption algorithm to encrypt the messages. Therefore, the attacker cannot obtain meaningful information in polynomial time, and the leakage of (L_1, L_2, L_3) has no influence on our scheme.

Storage Overhead
To evaluate our scheme's storage overhead, we consider the CU and the CS separately. In our scheme, the CU stores a master key K, a session key k_s, and a prime number p. Given the security parameter λ, the size of p is λ + 1 bits. If we use 128-bit AES-CBC to achieve confidentiality and SHA-256 as the keyed cryptographic hash function, the keys k_s and K together require 256 bits. Obtained from the output of SHA-256, we have 256 bits for λ and 257 bits for p. In total, the CU will consume (256 + 257)/8 = 64.125 bytes of storage.
Referring to the CS, it keeps the encrypted documents and the secure index table. Following Ref. [21], we use D_avg to represent the average storage consumption of an encrypted document. Thus, a document set with n documents requires n × D_avg bytes. For the secure index, the storage overhead is (m + 1)(n + 1) × β bytes, where n and m represent the total number of documents and the total number of keywords, respectively, and β is the number of bytes required by each item of A in Section 4.2. For instance, β can be 32 if we use SHA-256 in the scheme. Hence, the total storage overhead at the CS is (m + 1)(n + 1) × β + n × D_avg bytes.
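The storage accounting above can be reproduced with two small helper functions; the names and defaults are illustrative, with the per-item size of the index taken as 32 bytes (SHA-256 output):

```python
def cu_storage_bytes(key_bits_total=256, prime_bits=257):
    """CU side: the two keys (256 bits together) plus the 257-bit prime p."""
    return (key_bits_total + prime_bits) / 8

def cs_storage_bytes(n, m, item_bytes=32, d_avg=0):
    """CS side: secure index of (m+1)(n+1) items plus n encrypted documents
    of average size d_avg bytes."""
    return (m + 1) * (n + 1) * item_bytes + n * d_avg
```

For example, `cu_storage_bytes()` reproduces the 64.125-byte figure, and `cs_storage_bytes(10, 5)` gives the index cost for 10 documents and 5 keywords.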

Performance Analysis
In this section, we conduct a performance analysis by comparing our scheme with the state-of-the-art schemes [21][22][23][24][25] in terms of functionality and computational complexity. Let n denote the total number of plaintext documents, m the number of distinct keywords extracted from the entire plaintext document set, h the computational complexity of the hash function (e.g., SHA-256), and e the complexity of the encryption algorithm (e.g., AES-CBC). Similar to our scheme discussed in Section 3.5, these schemes comprise five phases, i.e., the KeyGen, Build_Index, Build_Trap, Search_Output, and Dec_Documents phases. Generally, their KeyGen and Dec_Documents phases are essentially the same, so we focus on the remaining three phases, which are detailed in Table 2. Notice that Ref. [21] provides schemes with and without ranking, whose computational complexities in the Build_Index phase are O(mn + n) and O(mn), respectively. In our scheme, the Build_Index phase requires O(mn), since we prefer the no-ranking strategy for multi-keyword search. Unlike a single-keyword search, our scheme inevitably has to intersect the sets retrieved by the different keywords. Hence the Build_Trap phase is bounded by O((2h + e)l) for multi-keyword support, where l is the number of searched keywords; in contrast, Ref. [21] only consumes O(2h + e) for a single-keyword search. As discussed in Section 4.4, in theory, traversing the two-dimensional array A, calculating T_t · T_t^T, and obtaining the final result F = F′ ∩ F″ require O(mn), O(m²), and O(n²), respectively. However, on the one hand, calculating T_t · T_t^T consumes less than O(m²) due to sparsity.
On the other hand, obtaining the final result F is done between F′ and F″, which are subsets of the ciphertext document set. The computational complexity can be written as O(εn²), where ε ∈ (0, 1]; in practice, ε can approach 1. Consequently, the Search_Output phase is bounded by O(mn + m² + n²). Based on the scheme of Tahir et al. [21], which only supports single-keyword search, we give an improved scheme that supports multi-keyword searchable encryption. As discussed previously, three of the five phases in our scheme are the same as in [21]. Besides, our scheme consumes more in the Build_Trap and Search_Output phases because the second retrieval cannot be avoided, even when searching in plaintext. For the Build_Trap phase, the complexity of our scheme is linear in l relative to that of [21]. If n = m, the complexity of the Search_Output phase is γ times that of [21], where γ ∈ (1, 3). In practice, the improved functionality takes more time than [21]; therefore, we omit the simulation details for the computation overhead in this study.

Conclusions
In this paper, a searchable encryption scheme based on probabilistic trapdoors is proposed, which can not only support multi-keyword search but also resist indistinguishability attacks. The construction of probabilistic trapdoors makes our scheme resistant to indistinguishability attacks, and the keyword vector introduced in the trapdoor construction enables multi-keyword search in the ciphertext search process. Finally, comparison results between our scheme and other searchable encryption schemes show that ours has distinct advantages in terms of search functionality and storage complexity.