Substring Position Search over Encrypted Cloud Data Supporting Efficient Multi-User Setup †

Existing Searchable Encryption (SE) solutions are able to handle simple Boolean search queries, such as single or multi-keyword queries, but cannot handle substring search queries over encrypted data that also involve identifying the position of the substring within the document. These types of queries are relevant in areas such as searching DNA data. In this paper, we propose a tree-based Substring Position Searchable Symmetric Encryption (SSP-SSE) to overcome the existing gap. Our solution efficiently finds occurrences of a given substring over encrypted cloud data. Specifically, our construction uses the position heap tree data structure and achieves asymptotic efficiency comparable to that of an unencrypted position heap tree. Our encryption takes O(kn) time, and the resulting ciphertext is of size O(kn), where k is a security parameter and n is the size of stored data. The search takes $O(m^2 + occ)$ time and three rounds of communication, where m is the length of the queried substring and occ is the number of occurrences of the substring in the document collection. We prove that the proposed scheme is secure against chosen-query attacks that involve an adaptive adversary. Finally, we extend SSP-SSE to the multi-user setting where an arbitrary group of cloud users can submit substring queries to search the encrypted data.


Introduction
Owing to the wide adoption of cloud computing services, public, as well as private organizations now outsource their data to remote servers. Cloud computing services provide efficient and cost-effective solutions for data storage. Nevertheless, outsourced data may contain sensitive information that needs to be protected. Traditional encryption techniques protect the data from unauthorized access; however, they introduce new challenges to data utilization. Specifically, allowing users to efficiently search over encrypted data is one of the most pressing issues in cloud computing.
In order to enable search over encrypted data, many Searchable Encryption (SE) schemes have been proposed in recent years [1–20]. (Note, we use the term searchable encryption somewhat loosely to include schemes, such as private information retrieval, as well.) Generally, SE solutions involve building an encrypted searchable index that hides the sensitive information from the remote server, yet still allows a search on the encrypted data. SE solutions differ in the level of efficiency and security guarantees that they offer; however, most of them support only exact keyword search. As a result, they do not tolerate the format inconsistencies that are a frequent part of typical cloud user behavior. It is quite common that search queries do not exactly match the pre-set keywords due to a lack of exact knowledge about the data. For example, suppose a financial company stores its employees' income tax documents in encrypted form in the cloud. A tax accountant may issue a search query of "form1040", which covers multiple keywords, such as "form1040", "form1040A", "form1040-ez" and "form1040es", and she wants to find the position of the first occurrence of the query in each encrypted document that contains this string of characters. This significant drawback of existing schemes underlines an important need for new techniques that support search flexibility over encrypted documents. In this work, we consider the problem of efficient substring position search over encrypted data. The users can query the remote untrusted server for a set of encrypted documents that contain a substring of characters. The cloud server retrieves the set of matching documents together with the positions where the queried string begins.
An important application of this work is in the area of searching a genome sequence against genomic databases. Such a search can be used in the analysis of genetic diseases, genetic fingerprinting or genetic genealogy, and it requires results that report not merely whether the sequence matches, but also the position of the sequence within the genome database. The major contribution of our work is to initiate the study of a very important problem, namely substring position search over encrypted data. Our solution should not be considered a complete approach to the subject, which has very strong future directions of research. Nonetheless, our solution provides the preliminary foundation for the study of the subject, including formal definitions, building blocks, a basic construction, as well as security proofs. In this work, we continue exploring the line of recent searchable encryption solutions, but from a slightly different standpoint.
We now give an overview of our contributions:
1. We present a Substring Position Searchable Symmetric Encryption (SSP-SSE) scheme that allows a substring search over an encrypted document collection. The scheme is based on the position heap tree data structure recently proposed by Ehrenfeucht et al. [21].
2. We formally define two leakage functions and security against the adaptive chosen-query attack of a tree-based SSP-SSE scheme. Apart from the traditional access and search patterns, we include the definition of the path pattern in the leakage functions of a tree-based searchable encryption. We show that SSP-SSE enjoys the strong notion of semantic security [6].
3. We present a construction that is very efficient and does not require large ciphertext space. Our encryption takes $O(kn)$ time, and the ciphertext is of size $O(kn)$, where k is the security parameter and n is the size of stored data. The search protocol takes $O(m^2 + occ)$ time and three rounds of communication, where m is the length of the queried substring and occ is the number of occurrences of the substring in the document collection. We perform a thorough experimental evaluation of our solution on a real-world genomic dataset.
4. We consider a natural extension of the SSP-SSE scheme, where an arbitrary group of data users can submit substring queries to search the encrypted collection. We design a scheme that supports a distributed setup, where data users choose their own secret keys rather than receive a key from a trusted authority. We formally define a Multi-User Substring Position Searchable Symmetric Encryption (MSSP-SSE) and present an efficient construction.
We organize the rest of the paper as follows: Section 2 gives an outline of the most recent related work. In Section 3, we give an overview of the system and threat models, notations and preliminaries. In Section 4, we present algorithms and data structures that allow a substring search on the plaintext data. We give a brief overview of each data structure and later present a discussion on choosing the right data structure to enable substring search in an untrusted cloud environment. In Section 5, we provide the details of the SSP-SSE scheme and define the security definitions and requirements. Section 6 is devoted to security and performance analysis. The extension of our solution towards an arbitrary group of users is presented in Section 7. Lastly, we conclude in Section 8.

Related Work
Efficient searchable encryption methods are extensively studied in the literature. Traditional searchable encryption schemes focus on the problem of searching for a keyword in the document collection. In this setting, each document is assumed to consist of a sequence of keywords. The cloud server must be able to determine which encrypted documents contain a particular queried keyword, which is also encrypted. Song et al. [2] presented the first searchable symmetric encryption scheme. Their scheme has provable security properties and a search complexity that is linear in the length of the document collection. Goh et al. [3] introduced formal security definitions of searchable symmetric encryption and proposed a scheme that is based on Bloom filters [22]. The scheme requires a linear search time and may return some false positive results. Many other schemes have been proposed to improve the efficiency of keyword search by implementing an inverted searchable index [4,6,13,23]. Chang et al. [13] showed an index construction that enables keyword search without false positive results. Curtmola et al. [6] gave the first solution that enables sublinear search time for the entire document collection. Here, the searchable index consists of a keyword trapdoor and encrypted document identifiers whose corresponding data files contain the keyword. Recently, Cao et al. [10] proposed the multi-keyword ranked search scheme. The solution ranks encrypted documents based on a similarity score. The score is calculated between the search query (that contains multiple keywords) and the set of encrypted documents. Moataz et al. [4] developed the Boolean Symmetric Searchable Encryption (BSSE) scheme. The scheme is based on the orthogonalization of the keywords according to the Gram-Schmidt process. Later, Moataz et al. [24] proposed the Conjunctive Symmetric Searchable Encryption scheme that allows conjunctive keyword search on encrypted documents with different privacy assurances. Orencik's solution [5] is a privacy-preserving multi-keyword search method that utilizes minhash functions.
In the public-key setting, Boneh et al. [8] were the first to propose a searchable encryption using asymmetric cryptography. The authors developed a construction where anyone with the public key can write to the data stored on the remote server, but only authorized users with the private key can search. Another asymmetric solution was provided by Di Crescenzo et al. in [25], where the authors designed a public-key encryption scheme with keyword search based on a variant of the quadratic residuosity problem. To support more complex queries, conjunctive keyword search, subset query and range query over encrypted data have also been proposed in [7,9,14,26,27].
All of the schemes above support only exact keyword search, i.e., there is no tolerance of format inconsistencies in the search. Li et al. [17] were the first to propose a fuzzy keyword search scheme over encrypted data. The authors developed a solution that constructs fuzzy keyword sets based on the document collection and later uses the edit distance to measure the similarity between the keyword query and the sets. Wang et al. [18] improved the previous work and proposed a scheme that achieves constant search time complexity. Later, Boldyreva et al. [19] gave an efficient fuzzy-searchable encryption (EFSE) scheme to locate similar records. The main drawback of fuzzy keyword search solutions is that they require a large ciphertext and computation overhead and, thus, may not be suitable for real-world cloud storage systems.
The SSE solution proposed by Curtmola et al. [6] can be adapted to allow the substring search over encrypted data. To do this, we would have to generate all possible substrings of each keyword extracted from the document collection and consider these substrings as keywords. However, this solution induces a very large storage requirement since the data owner would have to generate and keep all possible substrings of every keyword. For example, for any keyword of length n, there are $\frac{n(n+1)}{2}$ possible substrings. For a document collection with m distinct keywords of length n, it would take $m \times \frac{n(n+1)}{2}$ entries in SSE to keep all substrings of all keywords at the cloud server.
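The storage blow-up of the substring-as-keyword approach can be checked with a small sketch. The helper names `all_substrings` and `sse_entries` are hypothetical and introduced only for illustration; substrings are counted by start and end position, so repeated substrings are counted more than once.

```python
def all_substrings(keyword: str) -> list[str]:
    """Enumerate every non-empty substring of keyword, one per (start, end) pair."""
    n = len(keyword)
    return [keyword[i:j] for i in range(n) for j in range(i + 1, n + 1)]


def sse_entries(num_keywords: int, keyword_length: int) -> int:
    """Number of index entries for num_keywords keywords of equal length,
    following the count m * n(n+1)/2 given in the text."""
    return num_keywords * keyword_length * (keyword_length + 1) // 2


# A single keyword of length 8 already expands to 36 index entries.
print(len(all_substrings("form1040")))  # 36
print(sse_entries(1000, 8))             # 36000
```

The quadratic growth in the keyword length is what makes this adaptation impractical for long strings such as genome sequences.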

System and Threat Models
Consider a cloud data hosting service shown in Figure 1 that involves three entities: the cloud provider, the data owner and the cloud user. The data owner has a collection of documents D that he wants to outsource to the cloud provider in a form C, encrypted with a secret key K. To enable search capability over C, the data owner constructs a searchable index I from D and then uploads both the index I and the encrypted document collection C to the cloud provider. When an authorized cloud user wants to perform a search on remote data, she first connects to the data owner to acquire the secret key K and the trapdoor information. The trapdoor serves to output a secure search query Q without revealing its original input. Moreover, the trapdoor learning process is a one-time operation, and thus, the cloud user does not need to contact the data owner anymore. Finally, the cloud user submits the search query Q to the cloud provider. Upon receiving Q, the cloud provider is responsible for searching the index I and returning the matching set of encrypted documents L ⊆ C. Later, the cloud user uses the secret key K to decrypt L to its original view. As in previous works [6,11,12], the cloud provider is assumed to be an honest-but-curious entity. "Honest" means that the cloud provider can provide reliable data storage: it is always available to the users; it correctly follows the designated protocol specification; and it provides all services as expected. "Curious" means that the cloud provider may execute some background analysis to breach the confidentiality of the stored data. In the rest of the paper, we assume that the cloud provider and the adversary are the same entity. We do not consider the cloud provider being able to link the search query to a specific user.

Preliminaries and Notations
Let $D = \{D_1, D_2, \ldots, D_l\}$ be an original set of documents, and let $C = \{C_1, C_2, \ldots, C_l\}$ be an encrypted collection of documents from D. If $D_i$ and $D_j$ are two documents, we denote by $D_i \| D_j$ their concatenation into a text t. If A is an algorithm, then $a \leftarrow A(\ldots)$ represents the result of applying the algorithm A.
In addition to the notations above, we also make use of cryptographic notations. We begin with definitions of Pseudorandom Functions (PRF) and Pseudorandom Permutations (PRP), which are polynomial-time computable functions that cannot be distinguished from random functions by any probabilistic polynomial-time (PPT) adversary, and random oracles to which all parties have black box access.

Definition 1. (Pseudorandom Function (PRF)). A function
• Given a key $K \in \{0,1\}^k$ and an input $X \in \{0,1\}^n$, there is an algorithm to compute $F_K(X) = F(X, K)$.
• For any t-time oracle algorithm A, we have: where $F = \{f : \{0,1\}^n \to \{0,1\}^l\}$ and A makes at most q queries to the oracle.

Definition 3. (Symmetric Key Encryption (SKE)).
A symmetric key encryption scheme consists of the following PPT algorithms:
• $Gen(1^k)$: a key generation algorithm that inputs a security parameter k and outputs a secret key K.
• Enc(K, m): a probabilistic algorithm that inputs a secret key K and message m, and outputs a ciphertext c.
• Dec(K, c): a deterministic algorithm that inputs a secret key K and ciphertext c, and outputs a message m or the special symbol ⊥ (if decryption failed).
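The (Gen, Enc, Dec) interface can be illustrated with a toy sketch built only from the Python standard library. This is an assumption-laden illustration, not the scheme used in the paper and not suitable for production: the keystream is derived from HMAC-SHA256 in counter mode, a MAC tag is appended so that Dec can signal failure, and `None` stands in for ⊥. The function names `gen`, `enc` and `dec` are hypothetical.

```python
import hashlib
import hmac
import secrets


def gen(k: int = 32) -> bytes:
    """Gen(1^k): sample a uniformly random k-byte secret key."""
    return secrets.token_bytes(k)


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: HMAC-SHA256 over (nonce, counter) blocks."""
    out = b""
    counter = 0
    while len(out) < length:
        block = b"enc" + nonce + counter.to_bytes(8, "big")
        out += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return out[:length]


def enc(key: bytes, message: bytes) -> bytes:
    """Enc(K, m): probabilistic, a fresh nonce is drawn for every call."""
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(message))
    body = bytes(a ^ b for a, b in zip(message, stream))
    tag = hmac.new(key, b"mac" + nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def dec(key: bytes, ciphertext: bytes):
    """Dec(K, c): deterministic; returns None (standing in for the failure symbol)."""
    if len(ciphertext) < 48:
        return None
    nonce, body, tag = ciphertext[:16], ciphertext[16:-32], ciphertext[-32:]
    expected = hmac.new(key, b"mac" + nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))
```

Encrypting the same message twice yields different ciphertexts because of the fresh nonce, which mirrors the requirement that Enc be probabilistic.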

Definition 4. (SKE correctness).
Given the symmetric encryption scheme SKE that consists of three algorithms (Gen, Enc, Dec), for all k and all m, such that $K \leftarrow Gen(1^k)$, we require: $Dec(K, Enc(K, m)) = m$.
We also require SKE to be secure against Pseudorandom Chosen-Plaintext Attacks (PCPA). We now give the definition of the PCPA security of the SKE scheme.

Definition 5. (PCPA-security).
Let SKE = (Gen, Enc, Dec) be a symmetric encryption scheme, let A be an adversary, and let $PCPA_{SKE,A}(k)$ be a probabilistic experiment that is run as follows:
• Use the security parameter k to output the secret key $K \leftarrow Gen(1^k)$.
• The adversary A is given oracle access to $Enc_K(\cdot)$.
The symmetric encryption scheme SKE is PCPA-secure if for all polynomial-size adversaries A, where the probability is over the choice of bit b and the coins of Gen and Enc.

Substring Search Algorithms
In this section, we present the most popular algorithms and data structures that allow a substring search on the plaintext data. Specifically, we focus on mature data structures, like the suffix tree [28] and the suffix array [29], that have been widely used in many substring search applications. We also present the details of the recently proposed position heap tree [21]. For each data structure, we give a short overview with examples, and then, we present the computation and storage efficiency. Lastly, we present a discussion about choosing the right data structure to enable substring position search in an untrusted cloud environment.

Suffix Tree
The suffix tree [28,30,31] is a tree-like representation of text supporting a wide range of applications on strings. The suffix tree is a preprocessed data structure that enables a substring search on the stored string. We now give the definition of the suffix tree:
Definition 6. (Suffix tree). A suffix tree for string $t = t_1 \ldots t_n$ is a rooted, directed tree with the following properties:
• Each edge is labeled with a non-empty substring of t, named the edge label.
• Every internal node has at least two children.
• No two edges out of a node have edge labels starting with the same character.
• The tree has n leaves, labeled from 1 to n.

Definition 7. (Path label).
The path label of a node is the concatenation of the edge labels on the path from the root to that node.

Definition 8. (Suffix tree search).
A string χ is a substring of t if and only if it is a prefix of some suffix of t.
Figure 2a shows an example of a suffix tree constructed from the text "coconut". To check if a string χ is a substring, the algorithm searches for a path from the root whose labels match χ. For instance, searching for the string "coconut", we begin at the root node and start checking the neighbor edge labels, down to the matching node, i.e., Node 1 is the matching one, and it corresponds to the occurrence in the text. Similarly, the search of "co" leads us to the intermediate node whose leaf nodes (1,3) are the positions in the text. The suffix tree can be constructed in $O(n)$ time for a string of length n [31]. Furthermore, it can be shown that a suffix tree has at most 2n nodes, and storing the edge labels for all edges would require $O(n^2)$ space in the worst case. (Consider a suffix tree for the string $t_1 t_2 \ldots t_n$, where each $t_i$ is unique. The suffix tree would contain a distinct edge for each of the n suffixes $t_1 \ldots t_n$, $t_2 \ldots t_n$, ..., $t_n$. These suffixes have a total length of $O(n^2)$.) Searching for a substring χ of length m takes $O(m + occ)$ time to find all occurrences of χ, where occ is the number of occurrences.

Suffix Array
A suffix array [29] is a sorted index array of all suffixes of a string. The suffix array data structure is used in full text indices, within the field of computational biology and others.

Definition 9. (Suffix array).
Given a text t of length n, the suffix array for t is an array of integers ranging from one to n specifying the lexicographic ordering of the suffixes of the string t.
A suffix array can be built in $O(n)$ time for a string of length n [29]. To search for a substring χ of length m, the search can be executed as a simple binary search over the suffix array, i.e., at each element probed by the binary search, we compare the suffix of t at that element's position with the substring χ. Thus, the search for any substring can be performed in $O(m \times \log(n))$ time. This complexity can be improved by adding the longest common prefix information, so the search can be executed in $O(m + \log(n))$ (see [29] for details).
Consider an example of a suffix array in Figure 2b constructed from the text "coconut". Searching for "coconut" gives the occurrence of (1), while searching for "co" results in (1,3) occurrences in the text.
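The binary search over a suffix array can be sketched as follows. This is a minimal illustration under stated assumptions: the array is built by naive sorting (which is not the linear-time construction referenced above), positions are 1-based and counted from the left as in the "coconut" example, and the function names `build_suffix_array` and `find_occurrences` are hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return 1-based start positions of all suffixes in lexicographic order.
    Naive construction by sorting, for illustration only."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_occurrences(text: str, sa: list[int], query: str) -> list[int]:
    """Locate all occurrences of query with two binary searches over sa."""
    m = len(query)

    def prefix(idx: int) -> str:
        # First m characters of the suffix stored at sa[idx].
        start = sa[idx] - 1
        return text[start:start + m]

    # Leftmost suffix whose m-character prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Leftmost suffix whose m-character prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "coconut"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "co"))       # [1, 3]
print(find_occurrences(text, sa, "coconut"))  # [1]
```

Each probe compares at most m characters, which gives the $O(m \log n)$ search bound discussed above. The comparisons rely on the plaintext lexicographic order, which is exactly what is unavailable once the array is encrypted.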

Position Heap Tree
We now give an overview of the position heap tree data structure [21].

Definition 10. (Position heap).
The position heap Λ of text t is a tree constructed by iteratively inserting the suffixes $(t_1, t_2, \ldots, t_n)$ of t in ascending order into Λ. That is, $t_i$ is inserted by creating a new node in Λ that is the shortest prefix of $t_i$ that is not already a node of the tree and labeling it with position i.

Figure 2c shows an example of the position heap tree Λ constructed from the text "coconut". The first suffix "t" of the text creates the root node in Λ with the position label of "1". Next, the second suffix "ut" of the text creates a new node with Position 2 and a connecting edge with label "u"; the third suffix "nut" creates a new node with Position 3 and a connecting edge with label "n". Similarly, the seventh suffix "coconut" creates a new node with Position 7 and a connecting edge with label "o" since there is already a node in Λ with edge label "c" created by the fifth suffix "conut". Following Definition 10, the position heap Λ is constructed. The construction can be executed for any text t, and since it is deterministic, the position heap Λ for a text is unique.
We now present the definition of the search in the position heap tree.

Definition 11. (Position heap search).
The position heap search of all occurrences of a substring χ of text t in Λ consists of the following steps:
• Index into the position heap Λ to find the longest prefix p of χ that is a node of Λ. For each ancestor p′ of p, look up the position i stored in p′. Here, position i is an occurrence of p′. Determine if this occurrence is followed by χ − p′. If yes, report i as an occurrence of χ.
• If p = χ, also report all positions of the descendants of p.
Using the example tree in Figure 2c, the search for a substring "co" leads to the node (7). The set of traversed ancestor nodes (5, 1) needs an inspection with text t. Indeed, only Position 5 matches the substring "co". Therefore, the positions of substring query "co" are (5,7). In the case of substring query "coconut", the search algorithm falls off the tree; thus, the search algorithm returns a set of traversed nodes (7, 5, 1) for an inspection, where 7 is the only matching occurrence of "coconut" in the text.
The position heap tree for a text of length n can be constructed in $O(n)$ time [21]. All positions of substring χ of length m can be found in $O(m^2 + occ)$, where occ is the number of occurrences reported. We refer the reader to [21] for a detailed discussion on position heap properties.
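Definitions 10 and 11 can be sketched in a few lines of plaintext code. This is an illustrative sketch, not the linear-time construction of [21]: it inserts suffixes one by one, represents nodes as nested dictionaries, and numbers positions from the right end of the text as in the "coconut" example. The names `build_position_heap` and `search_position_heap` are hypothetical.

```python
def build_position_heap(text: str) -> dict:
    """Insert suffixes from shortest to longest; suffix i starts i characters
    from the right end of text. The root holds position 1."""
    n = len(text)
    root = {"pos": 1, "children": {}}
    for i in range(2, n + 1):
        node = root
        for ch in text[n - i:]:
            child = node["children"].get(ch)
            if child is None:
                # Shortest prefix of suffix i that is not yet a node.
                node["children"][ch] = {"pos": i, "children": {}}
                break
            node = child
    return root


def search_position_heap(root: dict, text: str, query: str) -> list[int]:
    """Report all positions (numbered from the right) where query occurs."""
    n = len(text)
    path = [root]
    node = root
    for ch in query:
        child = node["children"].get(ch)
        if child is None:
            break
        node = child
        path.append(node)

    hits = set()
    if len(path) - 1 == len(query):
        # The whole query is a node: it and all its descendants are occurrences.
        stack = [node]
        while stack:
            cur = stack.pop()
            hits.add(cur["pos"])
            stack.extend(cur["children"].values())
        candidates = path[:-1]
    else:
        # The search fell off the tree: every traversed node must be checked.
        candidates = path

    for cand in candidates:
        if text[n - cand["pos"]:].startswith(query):
            hits.add(cand["pos"])
    return sorted(hits)


heap = build_position_heap("coconut")
print(search_position_heap(heap, "coconut", "co"))       # [5, 7]
print(search_position_heap(heap, "coconut", "coconut"))  # [7]
```

The candidate check compares the query against the text at up to m traversed nodes, which is the source of the $m^2$ term in the search bound.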

Discussion
Recall that our goal is to construct a scheme that allows a substring search over encrypted data outsourced to the cloud. In our system model, the data owner has a set of documents with sensitive information that he wants to upload in encrypted form to the cloud provider. To enable the substring search, the data owner constructs a searchable index I from the set of plaintext documents D, and then, he places I at the cloud provider to allow the substring search. Our goal is to choose an optimal data structure that has low storage requirements and fast search execution. Later, we use the selected data structure to construct the searchable index I in our scheme.
As we have noted previously, many substring matching algorithms have been proposed, and they differ in terms of storage requirements and search execution. We outline the comparison of substring search data structures in Table 1. We use several comparison parameters: the construction time, the search execution time and the storage requirements. The construction time corresponds to the time it takes to create a data structure with the input of the text t of size n. The search execution time is the time it takes to find all occurrences of substring χ of length m in the text t. The cloud storage describes the storage of all combined textual labels in each data structure.
From Table 1, we can see that the suffix tree, suffix array and position heap tree have $O(n)$ preprocessing time of text t of size n. In search, the suffix tree has the best $O(m + occ)$ execution time for a substring χ of length m. However, the suffix tree has at most 2n nodes, and it would take $O(n^2)$ space to store the text at the cloud provider. On the other hand, the suffix array can be constructed with n elements, but it has $O(n^2)$ storage. Only the position heap tree allows us to have the low storage $O(n)$ with n nodes; however, the substring search execution takes $O(m^2 + occ)$ time. In the rest of the paper, we use the position heap tree data structure as the main construction block for our scheme. In our choice between the data structures, we believe that the $O(n)$ storage requirement is the predominant criterion, since expanding any large dataset (e.g., the human genome with three billion letters) to $O(n^2)$ storage would cause a substantial waste of cloud computing resources.
1 Note that the suffix array data structure stores only the array of integers (no need to store the suffixes of the text), and the array can be accessed by running a binary search algorithm in log(n) time, i.e., each time we access an element in the suffix array, we execute a lexicographical comparison of the suffix at the element position and the given substring query. This can be executed locally (by the data owner); however, in our system model defined in Section 3.1, the data owner sends the data and the constructed searchable index to the untrusted cloud provider. Both the data and the searchable index are encrypted, so no plaintext (and no lexicographical order) is leaked to the cloud provider. If we were to encrypt the suffix array by encrypting each element of the suffix array, then the cloud provider would not be able to execute the search in log(n) (in fact, it would observe only the ciphertext at each element in the array, which gives no order for the binary search execution). To keep the binary search log(n) time, one solution is to store encrypted suffixes in each node of the binary search tree and to use an expensive homomorphic encryption (e.g., the work by Gentry et al. [32]) that allows the search on the encrypted binary search tree. However, this would take $O(n^2)$ as the worst case storage for all suffixes in the tree.

Definition 12. (Substring Position Searchable Symmetric Encryption (SSP-SSE)).
A tree-based SSP-SSE scheme over a set of documents D is a tuple of six polynomial-time algorithms (KeyGen, BuildTree, Encrypt, ConstructQuery, Search, Decrypt), as follows:
1. $K \leftarrow KeyGen(1^k)$: a probabilistic key generation algorithm to set up the SSP-SSE scheme. The algorithm takes a security parameter k and outputs a set of secret keys K.

Definition 13. (SSP-SSE correctness).
We say that the tree-based SSP-SSE scheme is correct if ∀k ∈ N, The SSP-SSE correctness ensures the proper output if all SSP-SSE algorithms are executed honestly by the cloud provider.

Security Model Definitions
The security goal of any searchable encryption scheme is to reveal as little information as possible to the adversary. Intuitively, in the SSP-SSE scheme, we want to provide the following security guarantees: given a searchable index I and a set of encrypted documents $C = \{C_1, \ldots, C_l\}$ to the adversary, no valuable information about the original documents $D = \{D_1, \ldots, D_l\}$ is leaked to the adversary; given a set of incoming search queries $Q = \{Q_1, \ldots, Q_t\}$, the adversary cannot learn any practical information about the content of the search query $Q_i$ or the original document collection D. However, these security guarantees are difficult to achieve, and most known searchable encryption schemes [3,4,6,12,13] reveal some information, namely the access pattern and the search pattern. In SSP-SSE, we follow a similar approach to [6] to weaken the security guarantees and allow some limited information to leak to the adversary.

Definition 14. (Access pattern).
Given the l encrypted documents C, where $C = \{C_1, \ldots, C_l\}$, and the search query vector Q, where $Q = \{Q_1, \ldots, Q_t\}$ is of size t, the access pattern κ(C, Q) includes the set of document identifiers induced by the search query vector Q.

Definition 15. (Search pattern).
Given the l encrypted documents C, where $C = \{C_1, \ldots, C_l\}$, and the search query vector Q, where $Q = \{Q_1, \ldots, Q_t\}$ is of size t, the search pattern γ(C, Q) is an $l \times t$ binary matrix, such that ∀i ∈ [1; l] and ∀j ∈ [1; t], the cell element of the i-th row and j-th column is one if the document identifier $id_i$ is returned by the search query $Q_j$, and zero otherwise. The search pattern reveals whether the same search was executed in the past or not.
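The binary matrix of Definition 15 can be illustrated with a short sketch. The function name `search_pattern` and the representation of query results as sets of 1-based document identifiers are assumptions made for this example.

```python
def search_pattern(num_docs: int, results: list[set[int]]) -> list[list[int]]:
    """Build the binary matrix with one row per document and one column per
    query; entry (i, j) is 1 when document i is returned by query j."""
    return [
        [1 if doc_id in hits else 0 for hits in results]
        for doc_id in range(1, num_docs + 1)
    ]


# Three documents, three queries; the first and third queries return the same set.
matrix = search_pattern(3, [{1, 3}, {2}, {1, 3}])
for row in matrix:
    print(row)
```

In this example the first and third columns are identical, which is the kind of repetition across queries that the search pattern exposes to the cloud provider.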
Since our solution is based on the position heap tree data structure, we would like to capture the path pattern security notion. The path pattern of the position heap tree reveals the path traversed from the root node to the matching node for a given search query. Let SSP-SSE be a tree-based SSE scheme that consists of six algorithms as described in Definition 12. Let A be a stateful adversary and S be a stateful simulator. We consider two probabilistic experiments $Real_A$ and $Ideal_{A,S}$ that involve A, as well as S, with two stateful leakage algorithms $L_1$ and $L_2$ and security parameter k:
$Real_A(k)$: The challenger runs $KeyGen(1^k)$ to output the key set K. The adversary A sends the constructed plaintext position heap tree Λ and collection D to the challenger and receives a tuple (I, C) ← Encrypt(K, Λ, D) from the challenger. The adversary A makes a polynomial number of adaptive string searches $\chi = \chi_1, \ldots, \chi_t$ and sends them to the challenger. A then receives the search queries generated by the challenger, such that $Q_i \leftarrow ConstructQuery(K, \chi_i)$. The adversary returns one if his or her queries return the expected result, otherwise zero.
$Ideal_{A,S}(k)$: The adversary A outputs the tuple (D, Λ), where Λ ← BuildTree(D), and sends it to the simulator. Given the leakage $L_1$, the simulator S generates the tuple (I, C) and sends it to the adversary. A makes a polynomial number of adaptive string searches $\chi = \chi_1, \ldots, \chi_t$ and sends them to the simulator. Given the leakage $L_2$, the simulator S sends the appropriate search queries to the adversary. Finally, A returns one in the case of a successful experiment, otherwise zero.
We say that SSP-SSE is adaptively secure against the chosen-query attack if for all probabilistic polynomial time adversaries A, there exists a non-uniform probabilistic polynomial time simulator S, such that:

SSP-SSE Construction
We now present the details of the proposed SSP-SSE scheme. The scheme consists of two phases, namely the setup phase and the search phase. The setup phase is done once by the data owner to upload the set of encrypted documents and the searchable index to the cloud provider. In this phase, the data owner uses the KeyGen, BuildTree and Encrypt algorithms to encrypt the document collection, as well as to construct the searchable index. The search phase is performed by the cloud user every time a query is submitted. In this phase, the cloud user invokes the ConstructQuery algorithm to generate the search query. The cloud provider executes the Search algorithm to output matching results. Finally, the cloud user invokes the Decrypt algorithm to decrypt the document collection to the original view. Our scheme is based on a set of important notations shown in Algorithm 1. We outline the details of the setup phase in Section 5.3.1. We later show the search phase in Section 5.3.2.
• $t = (t_1, t_2, \ldots, t_n)$: the text constructed from document collection D, where $t_i$ is the letter in text t at position i.

Setup Phase
The setup phase (Algorithm 2) includes the KeyGen, the BuildTree and the Encrypt algorithms. Let k be a security parameter, and let SKE = (Gen, Enc, Dec) be a PCPA-secure symmetric-key encryption scheme. The data owner begins with the KeyGen algorithm that inputs a security parameter k and outputs a set of keys $K_1, K_X, K_Y, K_V$ and a set of random keys $K_Q, K_L, K_2, K_3 \xleftarrow{R} \{0,1\}^k$. He will use these keys to encrypt the document collection $D = (D_1, \ldots, D_l)$ and construct the searchable index I.
First, the data owner constructs a position heap tree Λ using the BuildTree algorithm outlined in Definition 10. The BuildTree algorithm inputs the text t, where t is constructed from the document collection $D = D_1 \| \$ \| \ldots \| \$ \| D_l$ padded with the unique terminator string \$, and outputs the single position heap tree Λ. In order to handle multiple documents in the collection, the data owner adds auxiliary information to each node that contains the document identifier $D_i$ and the position of the letter in $D_i$. For example, if the character "a" appears in the document $D_1$ at Position 1, the node in Λ will have the extra information $pid(D_1) = id(D_1) \| 1$. Formally, we concatenate the identifier of $D_j$ (j ∈ [1; l]) with the position i of character $t_i$ in the document $D_j$, i.e., $pid(D_j) = id(D_j) \| pos_{D_j}$, and add this information to each node in the position heap tree. Figure 3a shows an example of the position heap tree Λ of the text "ab$aaa$bb" constructed from three concatenated documents $(D_1, D_2, D_3)$, where $D_1$ has text "bb", $D_2$ has text "aaa" and $D_3$ has text "ab". Note that a search of "ab" in the position heap tree returns a set of nodes (9, 4, 1) where only 9 is the matching node, and it describes the document position $D_3 \| 2$. Thus, the search query "ab" appears only in the document $D_3$ at Position 2.
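The mapping from a global text position to a per-document identifier can be sketched as follows. This sketch rests on assumptions inferred from the "ab$aaa$bb" example: positions are counted from the right end of the text, so $D_1$ occupies the rightmost characters, and terminator characters receive no identifier. The function name `position_ids` and the tuple representation of pid are hypothetical.

```python
def position_ids(docs: list[str], sep: str = "$") -> tuple[str, dict]:
    """Concatenate documents D_1..D_l and map each global position
    (counted from the right) to a pair (document index, local position)."""
    # D_1 is placed at the right end so that it receives the lowest positions.
    text = sep.join(reversed(docs))
    pid = {}
    pos = 0
    for doc_index, doc in enumerate(docs, start=1):
        for local in range(1, len(doc) + 1):
            pos += 1
            pid[pos] = (doc_index, local)
        pos += 1  # skip the terminator that follows this document
    return text, pid


text, pid = position_ids(["bb", "aaa", "ab"])
print(text)    # ab$aaa$bb
print(pid[9])  # (3, 2)
```

With this mapping, the matching node 9 from the example resolves to document $D_3$ at Position 2.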

SETUP PHASE.
KeyGen($1^k$): given the security parameter $k$, generate the keys $K_1, K_X, K_Y, K_V$ and the random keys $K_Q, K_L, K_2, K_3 \xleftarrow{R} \{0,1\}^k$.
BuildTree($D$): given the document collection $D = (D_1, \ldots, D_l)$:
1. construct the text $t = t_1 t_2 \ldots t_n$ from the document collection $D$, and input $t$ of size $n$ to build the position heap tree $\Lambda$.
2. index into $\Lambda$; for each node $\nu[i]$, add the document identifier and position $pid(D_j) = id(D_j) \| pos_{D_j}$.
3. output the position heap tree $\Lambda$.
Encrypt($K$, $\Lambda$, $D$): given the secret key set $K$, the position heap tree $\Lambda$ and the set of documents $D = (D_1, \ldots, D_l)$.
Build encrypted tree:
1. index into $\Lambda$ and traverse from the root node.
2. for each node $\nu[i]$, encrypt the path label with PRF $F$ and the node index with SKE, as detailed in the text that follows.
3. output the encrypted $\Lambda$.
Build encrypted arrays:
1. for each character $t_i$ of $t$ indexed from right to left (i.e., $t_n t_{n-1} \ldots t_1$), set $X[P_{K_2}(i)] = \mathrm{SKE.Enc}_{K_X}(t_i)$.
2. for each $i \in [1, n]$, set $Y[P_{K_3}(i)] = \mathrm{SKE.Enc}_{K_Y}(V(\nu[i]))$.
Encrypt document collection:
1. for each document $D_i$, where $i \in [1, l]$, let $C_i \leftarrow \mathrm{SKE.Enc}_{K_1}(D_i)$.

The data owner constructs a searchable index based on the position heap tree data structure. To present the details, we use the example position heap tree $\Lambda$ shown in Figure 3b. The figure depicts the position heap tree $\Lambda$ constructed from the text $t$ = "abaaababbabaaba" and the text array $X$ (shown at the top of the figure), where each array element holds a single character of the text $t$, indexed from right to left.
The data owner begins by extracting position information from $\Lambda$ as follows: index each node in the tree $\Lambda$, and create a position array $Y$ such that each index in $Y$ corresponds to the node value of $\Lambda$. Figure 4a shows an example of the left-side branch of the position heap tree $\Lambda$ and the constructed position array $Y$. In this example, nodes in $\Lambda$ are marked with a red index, and their corresponding values (positions) are stored as elements in $Y$. (Figure 4a shows the position array $Y$ for nine nodes of $\Lambda$ for demonstration purposes only; the actual algorithm is executed on all nodes in $\Lambda$.)

With this, the data owner is ready to encrypt the position heap tree $\Lambda$, the text array $X$ and the position array $Y$. First, to encrypt the position heap tree $\Lambda$, the data owner uses a pseudorandom function $F: \{0,1\}^k \times \{0,1\}^* \rightarrow \{0,1\}^k$ and the PCPA-secure symmetric-key encryption scheme SKE = (Gen, Enc, Dec). For each node $i$ in $\Lambda$, the data owner applies PRF $F$ with key $K_Q$ to the concatenation of the path label of node $i$, the depth of node $i$, the encrypted path label of its parent node and the secret key $K_L$. Figure 4b shows an example of the path label encryption. For instance, the label of Node 4 is obtained in this way from the encrypted label of its parent node. The root path label is a special case. In this way, the data owner encrypts all path labels in the tree. This hides the plaintext path labels of the same character at different levels of the tree $\Lambda$. Moreover, it makes the ciphertext unique for all path labels in the tree. To hide the index information of each node in $\Lambda$, the data owner applies SKE encryption with key $K_V$ to the index of the node, i.e., $V_i = \mathrm{SKE.Enc}_{K_V}(i)$, where $i \in [1, n]$. For instance, the value of Node 8 is $V_8 = \mathrm{SKE.Enc}_{K_V}(8)$. With no plaintext left in $\Lambda$, the data owner outputs the encrypted position heap tree $\Lambda$.
Second, the data owner uses a pseudorandom permutation $P: \{0,1\}^k \times \{0,1\}^n \rightarrow \{0,1\}^n$ and the PCPA-secure symmetric-key encryption scheme SKE to hide the plaintext elements of the text array $X$ and the position array $Y$. For each index $i$ ($i \in [1, n]$) in $X$, the data owner applies PRP $P$ with secret key $K_2$, i.e., $P_{K_2}(i)$. To the corresponding character $t_i$ at index $i$ in $X$, the data owner applies SKE with secret key $K_X$, i.e., $\mathrm{SKE.Enc}_{K_X}(t_i)$. The data owner sets the encrypted array $X$ as $X[P_{K_2}(i)] = \mathrm{SKE.Enc}_{K_X}(t_i)$. Next, for each $i$ ($i \in [1, n]$) in $Y$, the data owner uses PRP $P$ with secret key $K_3$ and SKE with secret key $K_Y$ as follows: $Y[P_{K_3}(i)] = \mathrm{SKE.Enc}_{K_Y}(V(\nu[i]))$.

Finally, the data owner encrypts each document $D_i$ in the collection $D$ using the PCPA-secure symmetric-key encryption scheme SKE with secret key $K_1$ to produce the encrypted document $C_i \leftarrow \mathrm{SKE.Enc}_{K_1}(D_i)$. The data owner then uploads the encrypted collection $C$ along with the searchable index $I = (\Lambda, X, Y)$ to the cloud provider. The collection is now available for selective retrieval from the cloud.
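The encryption of the arrays $X$ and $Y$ can be sketched as follows. This is a toy illustration under loudly stated assumptions, not the prototype's code: HMAC-SHA256 stands in for the PRF, the PRP $P$ on $[1, n]$ is replaced by a keyed ranking of the indices, and `ske_enc` is a simple nonce-based XOR cipher standing in for a PCPA-secure SKE. All function names are hypothetical, and `node_values` (the value $V(\nu[i])$ for each node index) is supplied by the caller.

```python
import hashlib
import hmac
import secrets


def prf(key, msg):
    """HMAC-SHA256 used as a stand-in pseudorandom function."""
    return hmac.new(key, msg, hashlib.sha256).digest()


def ske_enc(key, plaintext):
    """Toy randomized encryption: XOR with a PRF keystream under a fresh nonce."""
    nonce = secrets.token_bytes(16)
    stream = prf(key, nonce)
    if len(plaintext) > len(stream):
        raise ValueError("toy cipher handles at most 32 bytes")
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def ske_dec(key, ciphertext):
    """Invert ske_enc by regenerating the keystream from the stored nonce."""
    nonce, body = ciphertext[:16], ciphertext[16:]
    return bytes(c ^ s for c, s in zip(body, prf(key, nonce)))


def keyed_permutation(key, n):
    """Stand-in for the PRP P on [1, n]: rank the indices by their PRF value."""
    order = sorted(range(1, n + 1), key=lambda i: prf(key, i.to_bytes(8, "big")))
    return {i: rank for rank, i in enumerate(order, start=1)}


def encrypt_arrays(text, node_values, k2, k3, kx, ky):
    """Build X[P_K2(i)] = Enc_KX(t_i) and Y[P_K3(i)] = Enc_KY(V(nu[i]))."""
    n = len(text)
    p2, p3 = keyed_permutation(k2, n), keyed_permutation(k3, n)
    X, Y = {}, {}
    for i in range(1, n + 1):
        t_i = text[n - i]  # characters are indexed right to left
        X[p2[i]] = ske_enc(kx, t_i.encode())
        Y[p3[i]] = ske_enc(ky, node_values[i].to_bytes(4, "big"))
    return X, Y, p2, p3


k2, k3, kx, ky = (secrets.token_bytes(16) for _ in range(4))
# Illustrative input only: a three-character text with node i storing position i.
X, Y, p2, p3 = encrypt_arrays("aba", {1: 1, 2: 2, 3: 3}, k2, k3, kx, ky)
```

A key holder can recompute $P_{K_2}(i)$ to locate and decrypt the character at position $i$, whereas the arrays alone expose only ciphertexts at permuted indices.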

Search Phase
The search phase (Algorithm 3) includes the interactive ConstructQuery and Search algorithms, which are executed between the cloud user and the cloud provider. The cloud user keeps the set of secret keys. In order to search for a substring $\chi$ of length $m$, the cloud user begins by creating a search query $Q$: for each character $\chi_i$ in $\chi$, the cloud user applies PRF $F$ with secret key $K_Q$ to the concatenation of $\chi_i$, $i$, the output of the previous query $Q_{i-1}$ and the secret parameter $K_L$. The cloud user forms a query $Q = (Q_1, \ldots, Q_m)$ and sends $Q$ to the cloud provider. For instance, for the substring "aba", the cloud user creates $Q = (Q_1, Q_2, Q_3)$ and sends it to the cloud provider. The cloud server indexes into the encrypted position heap tree $\Lambda$ and, for each given $Q_i$, matches the encrypted label $L$ of each node in $\Lambda$ against $Q_i$, continuing until the longest matching node $\nu_{match}$ in $\Lambda$ is found. The cloud server returns the set of ancestor and descendant nodes of $\nu_{match}$ to the cloud user. Using the example in Figure 4b and the search query $Q = (Q_1, Q_2, Q_3)$, the cloud provider returns the set of encrypted ancestor nodes $(\mathrm{SKE.Enc}_{K_V}(1), \mathrm{SKE.Enc}_{K_V}(2), \mathrm{SKE.Enc}_{K_V}(6))$ and the set of encrypted descendant nodes $(\mathrm{SKE.Enc}_{K_V}(7), \mathrm{SKE.Enc}_{K_V}(8))$.
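The chained construction of path labels and query tokens can be sketched as follows. This is a minimal illustration, assuming HMAC-SHA256 as the PRF $F$ and a fixed placeholder (`ROOT_LABEL`) for the root's special-case label, whose exact form is not reproduced here. The function names and the flat set of labels held by the server are hypothetical simplifications of the encrypted tree.

```python
import hashlib
import hmac
import secrets

ROOT_LABEL = b"\x00" * 32  # assumption: placeholder for the root's special-case label


def token(k_q, k_l, char, depth, prev):
    """F_{K_Q}(char || depth || previous token || K_L), with HMAC-SHA256 as F."""
    msg = char.encode() + depth.to_bytes(4, "big") + prev + k_l
    return hmac.new(k_q, msg, hashlib.sha256).digest()


def encrypt_labels(k_q, k_l, paths):
    """Data owner: encrypt the path label of every node; paths must be prefix-closed."""
    labels = {"": ROOT_LABEL}
    for path in sorted(paths, key=len):
        labels[path] = token(k_q, k_l, path[-1], len(path), labels[path[:-1]])
    return labels


def construct_query(k_q, k_l, chi):
    """Cloud user: Q_i depends on chi_i, i and the previous token Q_{i-1}."""
    query, prev = [], ROOT_LABEL
    for i, char in enumerate(chi, start=1):
        prev = token(k_q, k_l, char, i, prev)
        query.append(prev)
    return query


def longest_match(encrypted_labels, query):
    """Cloud provider: count how many leading tokens equal a stored label."""
    matched = 0
    for q in query:
        if q not in encrypted_labels:
            break
        matched += 1
    return matched


k_q, k_l = secrets.token_bytes(16), secrets.token_bytes(16)
labels = encrypt_labels(k_q, k_l, ["a", "b", "ab", "aba"])
server_view = set(labels.values())  # the server never sees the plaintext paths
```

Because each token depends on the previous one, the same character at different depths or under different parents yields different labels, and a query matches exactly the nodes along its own path.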
Now, the cloud user applies the SKE scheme with secret key $K_V$ to decrypt the ancestor and descendant nodes, obtaining the ancestor nodes (1, 2, 6) and the descendant nodes (7, 8). Next, the cloud user applies PRP $P$ with secret key $K_3$ to each decrypted node, i.e., $y_{idx} = P_{K_3}(idx)$, where $idx$ ranges over (1, 2, 6, 7, 8), and sends the resulting query $y$ to the cloud provider.
The cloud provider uses array $Y$ to fetch the elements $Y[y_i]$ at the indices $y_i$ ($i \in [1; 5]$) and sends back the results. Once they are received, the cloud user applies SKE with secret key $K_Y$ to decrypt the positions stored in the ancestor and descendant nodes, i.e., positions (1, 3, 6) in the ancestor nodes and positions (11, 15) in the descendant nodes. According to Definition 11, the descendant positions (11, 15) are occurrences of the query "aba" in the text, whereas the ancestor positions (1, 3, 6) require an inspection, since some of them may point at "aba" in the text. Note that, since the substring "aba" has a length of three, the substring may occupy positions (6, 5, 4) and (3, 2, 1) in the text. Therefore, to launch the inspection, the cloud user applies PRP $P$ with secret key $K_2$ to each position (6, 5, 4, 3, 2, 1) as $x_{idx} = P_{K_2}(idx)$ and sends the query $x$ to the cloud provider. The cloud provider then uses array $X$ and sends back the elements $X[x_i]$ ($i \in [1; 6]$). The cloud user uses SKE.Dec with secret key $K_X$ to decrypt the characters $t_j$ at positions (6, 5, 4, 3, 2, 1) (i.e., the received characters are (a, b, a, a, b, a)). Using this information, the cloud user verifies whether the substring characters $\chi_i$ match the received characters $t_j$ at each ancestor position. The inspection of the ancestors shows that only positions 6 and 3 are matches. Thus, the cloud user concludes that the substring query "aba" occurs at positions (3, 6, 11, 15) in the text.
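The worked example can be reproduced on the plaintext data structure. The sketch below builds a position heap naively (quadratic time in the worst case, unlike the linear-time construction the scheme relies on) and then performs the search with ancestor inspection. It omits all encryption, identifies each node by the position it stores rather than by a separate node index, and uses hypothetical function names.

```python
def build_position_heap(text):
    """Naive position heap: children[node] maps a character to a child node.

    Nodes are identified by the position they store; position 1 is the root.
    """
    n = len(text)
    t = {i: text[n - i] for i in range(1, n + 1)}  # right-to-left indexing
    children = {1: {}}
    for i in range(2, n + 1):
        node, j = 1, i
        # Follow the suffix t_i t_{i-1} ... t_1 as far as the heap allows;
        # the walk always stops before the suffix is exhausted.
        while t[j] in children[node]:
            node = children[node][t[j]]
            j -= 1
        children[node][t[j]] = i
        children[i] = {}
    return t, children


def search(text, chi):
    """Return (ancestor positions, descendant positions, verified occurrences)."""
    t, children = build_position_heap(text)
    m = len(chi)
    node, ancestors, depth = 1, [], 0
    while depth < m and chi[depth] in children[node]:
        ancestors.append(node)
        node = children[node][chi[depth]]
        depth += 1
    descendants = []
    if depth == m:
        # The reached node spells chi: it and all its descendants are occurrences.
        stack = [node]
        while stack:
            v = stack.pop()
            descendants.append(v)
            stack.extend(children[v].values())
    else:
        ancestors.append(node)  # chi left the tree; every visited node is a candidate
    # Inspect each ancestor position against the text, character by character.
    verified = [p for p in ancestors
                if p >= m and all(t[p - k] == chi[k] for k in range(m))]
    return sorted(ancestors), sorted(descendants), sorted(verified + descendants)


result = search("abaaababbabaaba", "aba")
```

For the text and query of the example, `result` contains the ancestor positions (1, 3, 6), the descendant positions (11, 15) and the final occurrences (3, 6, 11, 15).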
Note that, if multiple documents are involved in the original text construction, the ancestor and descendant nodes contain the document identifiers, which the cloud user can later use to download the matching encrypted documents and decrypt them locally using the PCPA-secure symmetric-key encryption scheme SKE with secret key $K_1$.

Security
In this section, we focus on the security of the SSP-SSE scheme. First, we show that the SSP-SSE scheme is correct according to Definition 13. Second, we prove that the SSP-SSE scheme is secure against the Chosen-Query Attack (CQA-2) executed by an adaptive adversary according to Definition 17.
Proof. The index $I$ in the Search algorithm consists of the encrypted position heap tree $\Lambda$ and the two encrypted arrays $X$ and $Y$. Since the path labels in $\Lambda$ and the search query $Q$ are both computed with the same instance of the pseudorandom function $F$ under the same secret key $K_Q$, the correctness of the SSP-SSE scheme relies on the correctness of the pseudorandom function.
When the cloud provider receives the search query $Q$ in the Search algorithm, it traverses the path labels in the encrypted position heap tree $\Lambda$ according to Definition 11. The search query $Q$ is constructed by applying the pseudorandom function $F$ with key $K_Q$ to the substring $\chi$. Each encrypted path label in $\Lambda$ is constructed by applying the pseudorandom function $F$ with key $K_Q$ to the characters extracted from the plaintext document collection $D = \{D_1, \ldots, D_l\}$. The Search algorithm outputs true if the document $D_i$ contains the string of characters $\chi$. Thus, the cloud provider outputs a set of documents that matches the search query $Q$.

Theorem 19. (Security).
Let SKE be a symmetric PCPA-secure encryption scheme, $F$ be a pseudorandom function and $P$ be a pseudorandom permutation. The Substring Position Searchable Symmetric Encryption (SSP-SSE) scheme presented above is $(L_1, L_2)$-adaptively secure against chosen-query attacks as defined in Definition 17 (CQA-2 security), where $L_1$ and $L_2$ are the possible leakages.
In a nutshell, the proof of security of the SSP-SSE scheme works as follows. The simulator $S$ generates a simulated searchable index $I$ that consists of a simulated encrypted position heap tree $\Lambda$, a simulated position array $Y$ and a simulated text array $X$, i.e., $I = (\Lambda, Y, X)$, as well as a simulated set of ciphertexts $C = \{C_1, \ldots, C_l\}$. Both $I$ and $C$ are constructed using the leakage $L_1$, which discloses the number of encrypted documents, the size of the encrypted documents and the identifier of each encrypted document. The simulated encrypted position heap tree $\Lambda$ is constructed using the pseudorandom function $F$ and the symmetric-key encryption scheme SKE on random values from $\{0, 1\}$. Both simulated arrays $Y$ and $X$ are constructed using the pseudorandom permutation $P$ and the symmetric-key encryption scheme SKE on random values from $\{0, 1\}$. The security of the proposed scheme relies on the following assumptions. The pseudo-randomness of $F$ guarantees that the simulated encrypted position heap tree $\Lambda$ is indistinguishable from the real encrypted position heap tree $\Lambda$. The pseudo-randomness of $P$ guarantees that the simulated $Y$ and $X$ are indistinguishable from the real $Y$ and $X$. Moreover, the simulated set of ciphertexts $C$ is indistinguishable from the real encrypted document collection $C$.
The search algorithm is simulated in a similar way, which requires keeping track of the dependencies between the search query and the resulting output. However, since the real search query is constructed with the pseudorandom function $F$ and the pseudorandom permutation $P$, the adversary is not able to distinguish it from the simulated query. Similarly, the simulated outcome of the search is indistinguishable from the real set of nodes. We outline the formal proof as follows.
Proof. A polynomial-size simulator $S$ can be defined such that, for any challenger and any polynomial-time adversary $A$, the outputs of the two experiments $\mathrm{Ideal}_{A,S}(k)$ and $\mathrm{Real}_A(k)$ with security parameter $k$ are computationally indistinguishable according to Definition 17. We now describe the details of the experiment $\mathrm{Ideal}_{A,S}(k)$, which defines the simulator $S$.
• $S(1^k, L_1)$: The simulator $S$ is given the leakage $L_1$, which reveals the number and size of the documents, as well as the identifier of each encrypted document. The simulator $S$ randomly generates a set of simulated ciphertexts $C$ and a simulated searchable index $I$ as follows:
- Simulator $S$ outputs the set of simulated ciphertexts.
- Simulator $S$ sets the simulated encrypted position heap tree $\Lambda$, where each node is set using the pseudorandom function $F$ and the SKE scheme on random values, and outputs the simulated encrypted position heap tree $\Lambda$.
- Simulator $S$ then constructs the simulated arrays $X$ and $Y$ using the pseudorandom permutation $P$ and the SKE scheme on random values.
- Simulator $S$ outputs the simulated searchable index $I = (\Lambda, Y, X)$ and the set of simulated ciphertexts $C$.
At this point, the simulator $S$ has generated the set of simulated encrypted documents $C$ and the simulated index $I$. Next, the adversary $A$ adaptively queries the polynomial-size simulator $S$ as follows.
• $S(1^k, L_1, L_2)$: The adversary $A$ sends a new query $Q$ to the simulator $S$. The simulator then starts collecting the dependencies between the incoming search query and the resulting output.
- Given the search query $Q$, simulator $S$ traverses the simulated encrypted position heap tree $\Lambda$ starting from the root node, following the simulated path labels to find the set of matching encrypted nodes in $\Lambda$. The simulator outputs the set of simulated matching nodes: ancestors and descendants.
- Given the search requests $(y_1, \ldots, y_{num})$, the simulator performs a search in the simulated array $Y$ and returns the matching elements $(Y_1, \ldots, Y_{num})$.
- Given the search requests $(x_1, \ldots, x_h)$, the simulator performs a search in the simulated array $X$ and returns the matching elements $(X_1, \ldots, X_h)$.
We now need to show that the outputs of the two experiments $\mathrm{Ideal}_{A,S}(k)$ and $\mathrm{Real}_A(k)$ are indistinguishable. Since the simulator generates the set of ciphertexts $C$ at random, the output of the simulator is indistinguishable from the real ciphertexts that are generated with the PCPA-secure symmetric encryption scheme SKE using secret key $K_1$. Otherwise, the adversary could distinguish between the output of the PCPA-secure symmetric encryption scheme SKE and a random value. Next, the simulated encrypted position heap tree $\Lambda$ is indistinguishable from the real encrypted position heap tree. Otherwise, the adversary could distinguish between the output of the pseudorandom function $F$ with secret key $K_Q$ and random values. Similarly, the simulated arrays $Y$ and $X$ are indistinguishable from the real arrays $Y$ and $X$. Otherwise, the adversary could distinguish between random values and the output of the pseudorandom permutation $P$ with keys $K_2$, $K_3$ and of the SKE scheme with keys $K_Y$, $K_X$. Thus, we conclude that the outputs of the two experiments are indistinguishable.

Performance
In this section, we outline the performance of the proposed solution. We assume that encryption and decryption using the SKE scheme take $O(k)$ time, where $k$ is the security parameter. We also assume that selecting an element from an array takes $O(1)$ time.
We first focus on the encryption efficiency of the SSP-SSE scheme. Given the plaintext position heap tree $\Lambda$ with $n$ nodes, we compute the encrypted position heap tree $\Lambda$ using SKE in $O(kn)$ time.

We have developed and implemented a proof-of-concept prototype of the SSP-SSE scheme in C++. Our prototype leverages the libtomcrypt cryptographic library [33], a portable C cryptographic library that supports symmetric ciphers, one-way hashes, pseudo-random number generators and a plethora of support routines. We use libtomcrypt to build the searchable index $I$ and to encrypt the document collection. We use AES-CTR encryption for the SKE symmetric-key encryption scheme, HMAC-SHA1 for the pseudorandom function $F$ and DES encryption for the pseudorandom permutation $P$.
We present an experimental evaluation of the SSP-SSE scheme on a real-world dataset: the Genome database [34] (published by the National Center for Biotechnology Information, National Institutes of Health), which contains sequence data from the whole genomes of over 1000 species or strains. The database includes all three main domains of life (Bacteria, Archaea and Eukaryota), as well as many viruses, phages, viroids, plasmids and organelles. All experiments were performed on a six-core Intel Xeon E5645 2.40-GHz processor with 98 GB of memory running 64-bit Fedora 23. The cloud server, data owner and cloud user applications were run on the same machine, as the network communication overhead was assumed to be negligible.
For our experiments, we pick large mRNA transcript datasets of various insects. Table 2 shows the details of the experimental set. Figure 5a shows the overhead of constructing the encrypted position heap tree $\Lambda$. We compare the construction time of the plaintext position heap tree (the original algorithm) with that of the encrypted position heap tree proposed in this work. Figure 5b shows the storage overhead of the searchable index $I$, which consists of the encrypted position heap tree $\Lambda$, the position array $Y$ and the text array $X$. In short, we observe that the proposed scheme adds insignificant overhead to the computation time; however, its storage overhead depends on the block size of the underlying encryption schemes. We believe that the proposed solution can be easily deployed in a real-world cloud environment.

Multi-User Substring Position Searchable Symmetric Encryption
Our original system model, shown in Figure 1, includes only three entities. To take an important step towards widespread adoption of searchable encryption techniques, there is a need to efficiently support hundreds or even thousands of users in the cloud. In this section, we consider a simple extension to our work, where a data owner has a document collection and a group of data users wants to query the encrypted data in the cloud.
Curtmola et al.'s [6] solution extends the single-user searchable encryption framework with broadcast encryption [35]: the data owner sends the searchable index and the encrypted document collection to the cloud, and a group of cloud users is allowed to invoke the search over the encrypted cloud data. The framework describes a solution where the data owner distributes a single shared secret key among the group of cloud users. However, this solution may not work in a real-world cloud environment that involves a potentially large number of data users, since a single secret key is given to all participants. For instance, if the data owner decides to revoke the search access of one cloud user, he/she will have to generate a new key and distribute it to the remaining users. It is preferable that each cloud participant keeps its own secret key, thus making key management easier and more efficient.
We propose a new multi-user substring position searchable symmetric encryption (MSSP-SSE) scheme that solves the problem of managing access privileges and searching for a substring over encrypted cloud data. Our solution is based on the distributed broadcast encryption scheme [36]. First, we present the definitions of a multi-user substring position searchable symmetric encryption scheme. Later, we give an efficient construction that combines the ideas of the single-user SSP-SSE scheme with the distributed broadcast encryption scheme.

Preliminaries
In this section, we present several definitions used in our work. We begin with the definition of the Witness Pseudo-Random Function (WPRF). Informally, a witness PRF for an NP language $L$ is a PRF $F$ such that anyone with a valid witness that $x \in L$ can compute $F(x)$ without the secret key, but for all $x \notin L$, $F(x)$ is computationally hidden without knowledge of the secret key. Formally, a witness PRF is defined as follows.

We now present the security model for a Multi-user Substring Position Searchable Symmetric Encryption (MSSP-SSE) scheme. Intuitively, our security model requires the security of the single-user SSP-SSE scheme and the security of the distributed broadcast encryption scheme. We formalize the security requirements of the MSSP-SSE scheme as follows:

Definition 20. (Witness Pseudo-Random Function (WPRF) [36]). A triple of algorithms
• Given the searchable index $I$ and the set of encrypted documents $C = \{C_1, \ldots, C_l\}$, the adversary should learn nothing about the original document collection $D = \{D_1, \ldots, D_l\}$.
• Given the set of incoming search queries $Q = \{Q_1, \ldots, Q_m\}$, the access pattern, the search pattern and the path pattern, the adversary should learn nothing about the content of each search query $Q_i$ or the content of the resulting documents.
• Once a user is removed from the set of authorized cloud users, he/she is no longer allowed to invoke a search over the encrypted documents in the cloud. Thus, we require the revocation of cloud users.
In MSSP-SSE, we use the adaptive semantic security notion of the single-user SSP-SSE scheme. It provides security against an adaptive adversary: the cloud server does not learn anything about the document collection and the search queries beyond the access, search and path patterns. However, with the addition of the access privilege property, we extend our security definitions to cover the Remove functionality (Algorithm 4). We define the Rev experiment as follows:

Definition 24. (Revocation). Let MSSP-SSE = (KeyGen, BuildTree, Encrypt, Join, GroupSetup, Remove, ConstructQuery, Search, Decrypt) be a group SSP-SSE scheme, $k$ be a security parameter and $A = (A_1, A_2, A_3)$ be an adversary. We use the probabilistic experiment $\mathrm{Rev}_{\mathrm{MSSP\text{-}SSE},A}(k)$, where $O(I, C, st_S, \cdot)$ is an oracle that takes a search query $Q$ as input and outputs the ciphertexts in $C$ indexed by $L \leftarrow \mathrm{Search}(I, \Omega, c_r)$ if $L \neq \bot$, and $\bot$ otherwise. We say that the Remove algorithm achieves user revocation if the revocation condition holds for all polynomial-size adversaries $A = (A_1, A_2, A_3)$, where the probability is taken over the coins of KeyGen, Join, GroupSetup, Remove and Encrypt.

MSSP-SSE Construction
Algorithm 5 shows the details of our multi-user scheme MSSP-SSE = (KeyGen, BuildTree, Encrypt, Join, GroupSetup, Remove, ConstructQuery, Search, Decrypt). Let SSP-SSE = (KeyGen, BuildTree, Encrypt, ConstructQuery, Search, Decrypt) be a single-user substring position searchable symmetric encryption scheme. Let BE-NIKE-WPRF = (Setup, Join, Enc, Dec) be a distributed broadcast encryption scheme. We require standard security notions for broadcast encryption, i.e., in addition to providing PCPA-security, it provides revocation-scheme security against a group of revoked users. Let $\rho$ be a pseudorandom permutation such that $\rho: \{0,1\}^k \times \{0,1\}^t \rightarrow \{0,1\}^t$ ($\rho$ can be constructed as a pseudorandom permutation over domains of arbitrary size [37]), where $t$ is the size of the search query $Q$ in the SSP-SSE scheme. We assume that the cloud server does not collude with revoked users; otherwise, our construction cannot prevent a revoked user from invoking the search.

We now describe the scheme using the following hospital example. Consider a doctor (data owner) who has performed a set of early cancer screening tests on a patient and wishes to share the resulting documents with a group of hospital nurses (data users). To remove the burden of key management, the doctor enables a distributed setup, where each nurse generates his or her own secret key and a group of authorized participants is established that includes a head nurse and his or her subordinate nurses. First, the doctor samples the security parameter $k$ and generates the set of encryption keys $K$, the secret key $\lambda$ and the group $g$ for the distributed broadcast encryption. Second, the doctor encrypts the resulting documents with the PCPA-secure symmetric encryption scheme SKE and sends the searchable index $I$ to the cloud server. Next, each participating nurse invokes the Join algorithm with the secret $\lambda$ and the group $g$ (both distributed by the doctor) to generate $(sk, (z, ek))$, where the secret $sk$ is kept private and $(z, ek)$ are published to the cloud server. Now, the head nurse (group owner) creates a group of authorized users that are allowed to invoke a search over the encrypted documents in the cloud. The head nurse launches the GroupSetup algorithm, where she selects the public values $\{z_i, ek_i\}_{i \in h}$ of the authorized participants $h \in g$, samples a random secret parameter $r$ and invokes the distributed broadcast encryption to output $c_r$.
In order to search for a substring $\chi$, an authorized nurse first contacts the cloud provider to receive the latest ciphertext $c_r$ and invokes the distributed broadcast encryption with his or her own secret $sk$ and the public values $\{z_i, ek_i\}_{i \in h}$ to recover the secret $r$. If $r$ is successfully recovered, the nurse then constructs a single-user search query $Q$, encrypts it with the pseudorandom permutation $\rho$ under key $r$ and sends $\rho_r(Q)$ to the cloud provider. The cloud provider recovers the search query $Q$ by computing $\rho_r^{-1}(\rho_r(Q))$. Here, the key $r$ is known only to the data owner and the set of authorized users, which includes the cloud provider. Next, the interactive ConstructQuery and Search algorithms are executed between the authorized nurse and the cloud server.
If a nurse $o$ is no longer an authorized user in the system, the head nurse samples a new key $r'$ and generates a new ciphertext $c_{r'}$. The new $c_{r'}$ is sent to the cloud provider to replace the old $c_r$. Since the revoked nurse $o$ is not able to recover the new secret $r'$, a search query permuted by $o$ will not yield a valid search query at the cloud provider. This simple extra layer provided by the pseudorandom permutation $\rho$ prevents cloud users from performing a successful search once they are removed from the system. MSSP-SSE inherits the security and performance of the single-user SSP-SSE scheme. Our construction is very efficient, since the cloud provider only needs to evaluate a pseudorandom permutation to check the access privileges, thus eliminating the need for more expensive authentication protocols.
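The effect of re-keying $\rho$ on a revoked user can be sketched as follows. This is an illustration under stated assumptions: $\rho$ is instantiated here as a four-round Feistel network with HMAC-SHA256 round functions over a 32-byte block, which is one standard way to obtain a keyed permutation and is not necessarily the construction of [37]. The names `rho` and `rho_inverse` are hypothetical, and the random 32-byte value stands in for one single-user query token.

```python
import hashlib
import hmac
import secrets


def _round(key, rnd, half):
    """Round function: HMAC over the round number and one half of the block."""
    return hmac.new(key, bytes([rnd]) + half, hashlib.sha256).digest()[:len(half)]


def rho(key, block, rounds=4):
    """Keyed permutation on even-length blocks of at most 64 bytes (Feistel network)."""
    h = len(block) // 2
    left, right = block[:h], block[h:]
    for rnd in range(rounds):
        left, right = right, bytes(a ^ b for a, b in zip(left, _round(key, rnd, right)))
    return left + right


def rho_inverse(key, block, rounds=4):
    """Undo rho by running the rounds in reverse order."""
    h = len(block) // 2
    left, right = block[:h], block[h:]
    for rnd in reversed(range(rounds)):
        left, right = bytes(a ^ b for a, b in zip(right, _round(key, rnd, left))), left
    return left + right


r = secrets.token_bytes(16)      # group secret recovered from c_r
query = secrets.token_bytes(32)  # stands in for one single-user query token
wrapped = rho(r, query)          # what the user sends to the cloud provider
r_new = secrets.token_bytes(16)  # sampled by the group owner after a revocation
```

A cloud provider holding `r` recovers the query with `rho_inverse(r, wrapped)`. After the group owner switches to `r_new`, a query wrapped under the old `r` unwraps to an unrelated value, so it no longer matches any label in the encrypted tree.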

Conclusions
In this work, we present a new Substring Position Searchable Symmetric Encryption scheme (SSP-SSE) that allows efficient substring search on encrypted documents outsourced to the cloud. Specifically, our solution efficiently finds the occurrences and positions of a substring over encrypted cloud data. We formally define the leakage functions and security notions of SSP-SSE. We show that our scheme is secure against chosen-query attacks executed by an adaptive adversary. We also present a multi-user SSP-SSE scheme that supports a distributed setup, where data users choose their own secret keys rather than receiving a key from a trusted authority. As future work, we plan to focus on enhancing query privacy in SSP-SSE while keeping the good properties of the current design. Furthermore, we plan to extend the SSP-SSE scheme to support dynamic updates of the document collection, which will allow query execution when the document corpus is modified.

• The adversary $A$ outputs a message $m$.
• Let $c_0 \leftarrow \mathrm{Enc}_K(m)$ and $c_1 \xleftarrow{R} C$, where $C$ denotes the set of all possible ciphertexts. A bit $b$ is chosen at random, and $c_b$ is given to the adversary $A$.
• The adversary $A$ is again given oracle access to $\mathrm{Enc}_K(\cdot)$, and $A$ makes a polynomial number of queries before outputting a bit $b'$.
• The experiment outputs one if $b' = b$, and zero otherwise.

Figure 2. An example of the data structures constructed from the text "coconut". (a) A suffix tree; (b) a suffix array; (c) a position heap tree.

2. $(\Lambda) \leftarrow$ BuildTree($D$): a deterministic algorithm to build a position heap tree $\Lambda$. The algorithm takes a document collection $D = (D_1, \ldots, D_l)$ and outputs a position heap tree $\Lambda$.
3. $(I, C) \leftarrow$ Encrypt($K$, $\Lambda$, $D$): a probabilistic algorithm to encrypt a position heap tree $\Lambda$ and a document corpus $D$. The algorithm takes as input a set of secret keys $K$, a position heap tree $\Lambda$ and a document corpus $D$. Its output is a searchable index $I$ and an encrypted collection $C = (C_1, \ldots, C_l)$.
4. $[(Q) \leftarrow$ ConstructQuery($K$, $\chi$)$] \leftrightarrow [(L) \leftarrow$ Search($I$, $Q$)$]$: two deterministic algorithms that are executed interactively between the cloud user and the cloud provider. The ConstructQuery algorithm takes as input a set of secret keys $K$ and a substring $\chi$, and it outputs a search query $Q$. The Search algorithm takes as input a searchable index $I$ and a search query $Q$, and it finds the set of matching encrypted document identifiers $L \in C$.
5. $(D_i, pos_{D_i}) \leftarrow$ Decrypt($K$, $C_i$): a deterministic algorithm that takes a set of secret keys $K$ and a ciphertext $C_i$ as input and outputs the original document $D_i$, $\forall i \in [1; l]$, and the set of positions $pos_{D_i}$ of $\chi$ in $D_i$.

Definition 16. (Path pattern). Given the encrypted documents $C$, where $C = \{C_1, \ldots, C_l\}$, and the searchable index $I$ built from the document collection, the path pattern of $(C, I)$ induced by the search query vector $Q$, where $Q = \{Q_1, \ldots, Q_t\}$ is of size $t$, is a tuple $\delta(C, I, Q)$ that reveals the set of identifiers of the nodes in the index $I$ that are reached by the queries $Q_i$, $i \in [1; t]$.
Now, we define the leakage functions to capture all of the information leakage in this work:
• Leakage $L_1(I, C)$. Given the encrypted collection $C = \{C_1, \ldots, C_l\}$ and the searchable index $I$, the leakage consists of the following information: the number of encrypted documents, the size of the encrypted documents and the identifier of each encrypted document.
• Leakage $L_2(Q, I, C)$. Given the encrypted collection $C = \{C_1, \ldots, C_l\}$, the searchable index $I$ and the search query $Q$, the leakage function outputs the access pattern $\kappa(C, Q)$, the search pattern $\gamma(C, Q)$ and the path pattern $\delta(C, I, Q)$.
Definition 17. (Security against adaptive Chosen-Query Attack (CQA2)).
i.e., apply PRF $F$ with key $K_Q$ to the concatenation of the path label $L$ of $\nu[i]$, the depth of the node $\nu[i]$, the encrypted parent label $L_{parent}$ of $\nu[i]$ and the secret key $K_L$.
2. output $C = (C_1, C_2, \ldots, C_l)$.
Output: index $I = (\Lambda, X, Y)$ and encrypted document collection $C = (C_1, C_2, \ldots, C_l)$.

Figure 4. Construction of a searchable index. (a) An example of the position array Y; (b) an example of the path label encryption of the position heap tree.

The arrays $X$ and $Y$ each have $n$ elements and can be computed in $O(kn)$ time. Therefore, encryption takes $O(kn)$ time, and the total ciphertext is $O(kn)$ in size. We now analyze the efficiency of the proposed search algorithm. The cloud user takes a substring $\chi$ of length $m$ as input and outputs a search query in $O(m)$ time. The cloud provider uses $\Lambda$, performs $m$ matches in the tree and retrieves $occ$ descendant nodes in $O(m + occ)$ time. The cloud user then computes the elements $y_1, \ldots, y_{m+occ}$, and the cloud provider retrieves $Y[y_1], \ldots, Y[y_{m+occ}]$ in $O(m + occ)$ time. The cloud user then computes the elements $x_1, \ldots, x_{m^2}$ (the cloud user wants to inspect $m$ ancestor positions, and the substring $\chi$ of length $m$ may appear at each ancestor position), and the cloud provider retrieves $X[x_1], \ldots, X[x_{m^2}]$ in $O(m^2)$ time. The cloud user then inspects each of the $m$ ancestors against $m$ characters, which takes $O(m^2)$ time. Thus, the cloud user and the cloud provider each take $O(m^2 + occ)$ computation time in the query protocol, and three rounds of communication are needed to complete the execution of the protocol.
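The per-step costs of the query protocol can be collected into a single bound. The decomposition below only restates the counts given in the analysis; the symbol $T_{\text{search}}$ is introduced here for illustration.

```latex
\begin{align*}
T_{\text{search}}
  &= \underbrace{O(m)}_{\text{query construction}}
   + \underbrace{O(m + occ)}_{\text{tree traversal}}
   + \underbrace{O(m + occ)}_{\text{lookups in } Y}
   + \underbrace{O(m^2)}_{\text{lookups in } X \text{ and inspection}} \\
  &= O(m^2 + occ).
\end{align*}
```

The quadratic term comes only from inspecting the at most $m$ ancestor positions, each against $m$ characters; the descendants contribute the $occ$ term without any inspection.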

Figure 5. Experimental results. (a) The construction of the position heap tree; (b) the searchable index storage.

Table 1. Comparison of plaintext substring search data structures. $n$ is the length of the text $t$, $m$ is the length of the substring $\chi$, and $occ$ is the number of occurrences of $\chi$ in $t$.

Table 2. Details of the experimental set. Columns: Organism Name, Description, mRNA Size (MB).
(Gen, F, Eval) is a witness PRF if:
4. Dec($\{z_i, ek_i\}_{i \in [g]}$, $sk$, $c_m$): a deterministic algorithm to decrypt $c_m$. The algorithm invokes NIKE-WPRF.KeyGen($\{z_i, ek_i\}_{i \in [g]}$, $sk$) to derive $k$. If $k \neq \bot$, then the algorithm decrypts $c_m$ using $k$ and outputs the original message $m$.

1. $(K, \lambda, g) \leftarrow$ KeyGen($1^k$): a probabilistic key generation algorithm to set up the SSP-SSE scheme. The algorithm takes a security parameter $k$ and outputs a set of secret keys $K$, a secret parameter $\lambda$ and a group $g$.
2. $(\Lambda) \leftarrow$ BuildTree($D$): a deterministic algorithm to build a position heap tree $\Lambda$. The algorithm takes a document collection $D = \{D_1, \ldots, D_l\}$ and constructs a position heap tree $\Lambda$.
3. $(I, C) \leftarrow$ Encrypt($K$, $\Lambda$, $D$): a probabilistic algorithm to encrypt a position heap tree and a document corpus. The algorithm takes as input a set of secret keys $K$, a position heap tree $\Lambda$ and a document corpus $D$. Its output is a searchable index $I$ and an encrypted collection $C = \{C_1, \ldots, C_l\}$.
4. $(sk, (z, ek)) \leftarrow$ Join($\lambda$, $g$): a probabilistic algorithm run by each data user to participate in the scheme. The algorithm invokes BE-NIKE-WPRF.Join with the secret parameter $\lambda$ and the group order $g$ as input. It outputs a pair $(sk, (z, ek))$.
5. $c_r \leftarrow$ GroupSetup($\{z_i, ek_i\}_{i \in [h]}$, $sk$): a probabilistic algorithm run by the group owner to establish the group $h \subseteq g$ of authorized data users. The algorithm runs BE-NIKE-WPRF.Enc with the public values $\{z_i, ek_i\}_{i \in [h]}$, the group owner's secret key $sk$ and a sampled secret $r$ as input. The output is the ciphertext $c_r$.
6. $c_r \leftarrow$ Remove($\{z_i, ek_i\}_{i \in [h \setminus o]}$, $sk$): a probabilistic algorithm run by the group owner to remove a user $o$ from the set of authorized users. The algorithm invokes BE-NIKE-WPRF.Enc with the set of public values $\{z_i, ek_i\}_{i \in [h \setminus o]}$, the group owner's secret key $sk$ and a new secret $r$ as input. The output is the ciphertext $c_r$.
7. $[(Q) \leftarrow$ ConstructQuery($K$, $\chi$, $c_r$)$] \leftrightarrow [(L) \leftarrow$ Search($I$, $Q$, $c_r$)$]$: two deterministic algorithms that are executed interactively between the authorized cloud user and the cloud provider. The ConstructQuery algorithm takes as input a set of secret keys $K$, the ciphertext $c_r$ and a substring $\chi$, and it outputs a search query $Q$. The Search algorithm uses the query $Q$, the searchable index $I$ and the ciphertext $c_r$, and it outputs a sequence of identifiers $L \in C$.
8. $(D_i, pos_{D_i}) \leftarrow$ Decrypt($K$, $C_i$): a deterministic algorithm that takes a set of secret keys $K$ and a ciphertext $C_i$ as input and outputs the original document $D_i$, $\forall i \in [1; l]$, and the set of positions $pos_{D_i}$ of $\chi$ in $D_i$.