Substring Position Search over Encrypted Cloud Data Supporting Efficient Multi-User Setup

Strizhov, Mikhail; Osman, Zachary; Ray, Indrajit

doi:10.3390/fi8030028


Substring Position Search over Encrypted Cloud Data Supporting Efficient Multi-User Setup^†

by Mikhail Strizhov ^*, Zachary Osman and Indrajit Ray

Computer Science Department, Colorado State University, Fort Collins, CO 80523, USA

^* Author to whom correspondence should be addressed.

^† This paper is an extended version of our paper published in the 2015 IEEE International Conference on Cloud Engineering (IC2E), Tempe, AZ, USA, 9–13 March 2015.

Future Internet 2016, 8(3), 28; https://doi.org/10.3390/fi8030028

Submission received: 15 January 2016 / Revised: 17 June 2016 / Accepted: 23 June 2016 / Published: 4 July 2016

(This article belongs to the Special Issue Security in Cloud Computing and Big Data)


Abstract:

Existing Searchable Encryption (SE) solutions are able to handle simple Boolean search queries, such as single or multi-keyword queries, but cannot handle substring search queries over encrypted data that also involve identifying the position of the substring within the document. These types of queries are relevant in areas such as searching DNA data. In this paper, we propose a tree-based Substring Position Searchable Symmetric Encryption (SSP-SSE) to overcome the existing gap. Our solution efficiently finds occurrences of a given substring over encrypted cloud data. Specifically, our construction uses the position heap tree data structure and achieves asymptotic efficiency comparable to that of an unencrypted position heap tree. Our encryption takes $O(kn)$ time, and the resulting ciphertext is of size $O(kn)$, where k is a security parameter and n is the size of stored data. The search takes $O(m^{2} + occ)$ time and three rounds of communication, where m is the length of the queried substring and $occ$ is the number of occurrences of the substring in the document collection. We prove that the proposed scheme is secure against chosen-query attacks that involve an adaptive adversary. Finally, we extend SSP-SSE to the multi-user setting where an arbitrary group of cloud users can submit substring queries to search the encrypted data.

Keywords:

substring position search; searchable symmetric encryption; cloud computing; position heap tree

PACS:

J0101


1. Introduction

Owing to the wide adoption of cloud computing services, public, as well as private organizations now outsource their data to remote servers. Cloud computing services provide efficient and cost-effective solutions for data storage. Nevertheless, outsourced data may contain sensitive information that needs to be protected. Traditional encryption techniques protect the data from unauthorized access; however, they introduce new challenges to data utilization. Specifically, allowing users to efficiently search over encrypted data is one of the most pressing issues in cloud computing.

In order to enable search over encrypted data, many Searchable Encryption (SE) schemes have been proposed in recent years [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]. (Note, we use the term searchable encryption somewhat loosely to include schemes, such as private information retrieval, as well.) Generally, SE solutions involve building an encrypted searchable index that hides the sensitive information from the remote server, yet allows a search on the encrypted data. SE solutions differ in the level of efficiency and security guarantees that they offer; however, most of them support only exact keyword search. As a result, they do not tolerate the format inconsistencies that are a frequent part of typical cloud user behavior. It is quite common that search queries do not exactly match the pre-set keywords because the user lacks exact knowledge about the data. For example, suppose a financial company stores its employees’ income tax documents in encrypted form in the cloud. A tax accountant may issue the search query “form1040”, which matches multiple keywords, such as “form1040”, “form1040A”, “form1040-ez” and “form1040es”, and she wants to find the position of the first occurrence of the query in each encrypted document that contains this string of characters. This significant drawback of existing schemes underlines an important need for new techniques that support search flexibility over encrypted documents. In this work, we consider the problem of efficient substring position search over encrypted data. The users can query the remote untrusted server for a set of encrypted documents that contain a given substring. The cloud server retrieves the set of matching documents together with the positions where the queried string begins.

An important application of this work is in the area of searching a genome sequence against genomic databases. Such a search can be used in the analysis of genetic diseases, genetic fingerprinting or genetic genealogy and requires results that report not simply whether the queried sequence matches, but also the position of the sequence within the genome database. The major contribution of our work is to initiate the study of a very important problem, namely substring position search over encrypted data. Our solution should not be considered a complete treatment of the subject, which offers strong directions for future research. Nonetheless, our solution provides the preliminary foundation for the study of the subject, including formal definitions, building blocks, basic construction, as well as security proofs. In this work, we continue exploring the line of recent searchable encryption solutions, but from a slightly different standpoint.

We now give an overview of our contributions:

We present a Substring Position Searchable Symmetric Encryption (SSP-SSE) scheme that allows a substring search over an encrypted document collection. The scheme is based on a position heap tree data structure recently proposed by Ehrenfeucht et al. [21].
We formally define two leakage functions and security against the adaptive chosen-query attack of a tree-based SSP-SSE scheme. Apart from the traditional access and search patterns, we include the definition of the path pattern in the leakage functions of a tree-based searchable encryption. We show that SSP-SSE enjoys the strong notion of semantic security [6].
We present a construction that is very efficient and does not require large ciphertext space. Our encryption takes $O (k n)$ time, and the ciphertext is of size $O (k n)$ , where k is the security parameter and n is the size of stored data. The search protocol takes $O (m^{2} + o c c)$ time and three rounds of communication, where m is the length of the queried substring and $o c c$ is the number of occurrences of the substring in the document collection. We perform a thorough experimental evaluation of our solution on a real-world genomic dataset.
We consider a natural extension of the SSP-SSE scheme, where an arbitrary group of data users can submit substring queries to search the encrypted collection. We design a scheme that supports a distributed setup, where data users choose their own secret key rather than receive the key from a trusted authority. We formally define a Multi-User Substring Position Searchable Symmetric Encryption (MSSP-SSE) and present an efficient construction.

We organize the rest of the paper as follows: Section 2 gives an outline of the most recent related work. In Section 3, we give an overview of the system and threat models, notations and preliminaries. In Section 4, we present algorithms and data structures that allow a substring search on the plaintext data. We give a brief overview of each data structure and later present a discussion on choosing the right data structure to enable substring search in an untrusted cloud environment. In Section 5, we provide the details of the SSP-SSE scheme and define the security definitions and requirements. Section 6 is devoted to security and performance analysis. The extension of our solution towards an arbitrary group of users is presented in Section 7. Lastly, we conclude in Section 8.

2. Related Work

Efficient searchable encryption methods are extensively studied in the literature. Traditional searchable encryption schemes focus on the problem of searching for a keyword in the document collection. In this setting, each document is assumed to consist of a sequence of keywords. The cloud server must be able to determine which encrypted documents contain a particular queried keyword, which is also encrypted. Song et al. [2] presented the first searchable symmetric encryption scheme. Their scheme has provable security properties and a search complexity that is linear in the length of the document collection. Goh et al. [3] introduced formal security definitions of searchable symmetric encryption and proposed a scheme that is based on Bloom filters [22]. The scheme requires a linear search time and may return false positive results. Many other schemes have been proposed to improve the efficiency of keyword search by implementing an inverted searchable index [4,6,13,23]. Chang et al. [13] showed an index construction that enables keyword search without false positive results. Curtmola et al. [6] gave the first solution that enables sublinear search time for the entire document collection. Here, the searchable index consists of a keyword trapdoor and encrypted document identifiers whose corresponding data files contain the keyword. Recently, Cao et al. [10] proposed the multi-keyword ranked search scheme. The solution ranks encrypted documents based on a similarity score. The score is calculated between the search query (that contains multiple keywords) and the set of encrypted documents. Moataz et al. [4] developed the Boolean Symmetric Searchable Encryption (BSSE) scheme. The scheme is based on the orthogonalization of the keywords according to the Gram–Schmidt process. Later, Moataz et al. [24] proposed the Conjunctive Symmetric Searchable Encryption scheme that allows conjunctive keyword search on encrypted documents with different privacy assurances. Orencik’s solution [5] is a privacy-preserving multi-keyword search method that utilizes minhash functions.

In the public-key setting, Boneh et al. [8] were the first to propose a searchable encryption using asymmetric cryptography. The authors developed the construction where anyone with the public key can write to the data stored on the remote server, but only authorized users with the private key can search. The other asymmetric solution was provided by Di Crescenzo et al. in [25], where the authors designed a public-key encryption scheme with keyword search based on a variant of the quadratic residuosity problem. To support more complex queries, conjunctive keyword search, subset query and range query over encrypted data have also been proposed in [7,9,14,26,27].

All of the schemes above support only exact keyword search, i.e., there is no tolerance of format inconsistencies in the search. Li et al. [17] were the first to propose a fuzzy keyword search scheme over encrypted data. The authors developed the solution that constructs fuzzy keyword sets based on document collection and later uses the edit distance to measure the similarity between keyword query and the sets. Wang et al. [18] improved previous work and proposed a scheme that achieves constant search time complexity. Later, Boldyreva et al. [19] gave an efficient fuzzy-searchable encryption (EFSE) scheme to locate the similar records. The main drawback of fuzzy keyword search solutions is that they require a large ciphertext and computation overhead and, thus, may not be suitable for the real-world cloud storage systems.

The SSE solution proposed by Curtmola et al. [6] can be adapted to allow the substring search over encrypted data. To do this, we would have to generate all possible substrings of each keyword extracted from the document collection and consider these substrings as keywords. However, this solution induces a very large storage requirement since the data owner would have to generate and keep all possible substrings of every keyword. For example, for any keyword of length n, there are $\frac{n \times (n + 1)}{2}$ possible substrings. For a document collection with m distinct keywords of length n, it would take $m \times \frac{n \times (n + 1)}{2}$ entries in SSE to keep all substrings of all keywords at the cloud server.
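To make this blow-up concrete, the following minimal Python sketch counts the substrings of a keyword by enumerating every (start, end) pair and checks the result against the closed form. The helper name `count_substrings` and the collection size of 10,000 keywords are illustrative assumptions, not values taken from the scheme.

```python
def count_substrings(keyword: str) -> int:
    """Count all non-empty substrings of keyword, one per (start, end) pair."""
    n = len(keyword)
    return sum(1 for start in range(n) for end in range(start + 1, n + 1))


keyword = "form1040"
n = len(keyword)

# The enumeration agrees with n * (n + 1) / 2, here 8 * 9 / 2 = 36.
assert count_substrings(keyword) == n * (n + 1) // 2

# Index entries needed for m distinct keywords of length n
# (m is a hypothetical collection size chosen for illustration).
m = 10_000
entries = m * n * (n + 1) // 2
print(entries)  # 360000
```

Even for short keywords, the number of index entries grows quadratically in the keyword length, which is the storage cost the position heap approach tries to avoid.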

3. Background and Building Blocks

3.1. System and Threat Models

Consider a cloud data hosting service shown in Figure 1 that involves three entities: the cloud provider, the data owner and the cloud user. The data owner has a collection of documents D that he wants to outsource to the cloud provider in a form C, encrypted with a secret key K. To enable search capability over C, the data owner constructs a searchable index I from D and then uploads both the index I and the encrypted document collection C to the cloud provider. When an authorized cloud user wants to perform a search on remote data, she first connects to the data owner to acquire the secret key K and the trapdoor information. The trapdoor serves to output secure search query Q without revealing its original input. Moreover, the trapdoor learning process is a one-time operation, and thus, the cloud user does not need to contact the data owner anymore. Finally, the cloud user submits the search query Q to the cloud provider. Upon receiving Q, the cloud provider is responsible for searching the index I and returning the matching set of encrypted documents L⊆C. Later, the cloud user uses the secret key K to decrypt L to its original view.

As in previous works [6,11,12], the cloud provider is assumed to be an honest-but-curious entity. “Honest” means that the cloud provider can provide reliable data storage: it is always available to the users; it correctly follows the designated protocol specification; and it provides all services as expected. “Curious” means that the cloud provider may execute some background analysis to breach the confidentiality of the stored data. In the rest of the paper, we assume that the cloud provider and the adversary are the same entity. We do not consider the cloud provider being able to link the search query to a specific user.

3.2. Preliminaries and Notations

Let $D = \{D_{1}, D_{2}, \dots, D_{l}\}$ be an original set of documents, and let $C = \{C_{1}, C_{2}, \dots, C_{l}\}$ be the encrypted collection of documents from D. If $D_{i}$ and $D_{j}$ are two documents, we denote their concatenation, a text t, by $D_{i} \| D_{j}$. If A is an algorithm, then $a \leftarrow A(\dots)$ represents the result of applying the algorithm A.

In addition to the notations above, we also make use of cryptographic notations. We begin with definitions of Pseudorandom Functions (PRF) and Pseudorandom Permutations (PRP), which are polynomial-time computable functions that cannot be distinguished from random functions by any probabilistic polynomial-time (PPT) adversary, and random oracles to which all parties have black box access.

Definition 1.

(Pseudorandom Function (PRF)). A function $f: \{0, 1\}^{k} \times \{0, 1\}^{n} \to \{0, 1\}^{l}$ is a $(t, ϵ, q)$-pseudorandom function if:

Given a key $K \in \{0, 1\}^{k}$ and an input $X \in \{0, 1\}^{n}$, there is an efficient algorithm to compute $f_{K}(X) = f(K, X)$.
For any t-time oracle algorithm A, we have:

$$\left| \Pr_{K \leftarrow \{0, 1\}^{k}} [A^{f_{K}}] - \Pr_{f \in F} [A^{f}] \right| < ϵ \tag{1}$$

where $F = \{f : \{0, 1\}^{n} \to \{0, 1\}^{l}\}$ and A makes at most q queries to the oracle.

Definition 2.

(Pseudorandom Permutation (PRP)). If the pseudorandom function f in Definition 1 is bijective, then it is a pseudorandom permutation $f: \{0, 1\}^{k} \times \{0, 1\}^{n} \to \{0, 1\}^{n}$.

Definition 3.

(Symmetric Key Encryption (SKE)). A symmetric key encryption scheme consists of the following PPT algorithms:

$\mathrm{Gen}(1^{k})$: a key generation algorithm that inputs a security parameter k and outputs a secret key K.
$\mathrm{Enc}(K, m)$: a probabilistic algorithm that inputs a secret key K and message m, and outputs a ciphertext c.
$\mathrm{Dec}(K, c)$: a deterministic algorithm that inputs a secret key K and ciphertext c, and outputs a message m or the special symbol ⊥ (if decryption failed).

Definition 4.

(SKE correctness). Given the symmetric encryption scheme SKE that consists of three algorithms (Gen, Enc, Dec), for all k, all keys $K \leftarrow \mathrm{Gen}(1^{k})$ and all messages m, we require:

$$\mathrm{Dec}(K, \mathrm{Enc}(K, m)) = m. \tag{2}$$
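The interface of Definitions 3 and 4 can be illustrated with a toy Python sketch. It assumes HMAC-SHA256 as a stand-in pseudorandom function used in counter mode plus an authentication tag; the names `gen`, `enc`, `dec`, the nonce and tag lengths and the use of `None` for ⊥ are all choices made for this illustration. It is not the SKE instantiation used by the scheme and should not be treated as a vetted cipher.

```python
import hashlib
import hmac
import secrets

NONCE_LEN = 16  # assumed nonce size in bytes
TAG_LEN = 32    # size of an HMAC-SHA256 tag in bytes


def gen(k: int = 32) -> bytes:
    """Gen(1^k): sample a uniformly random k-byte secret key."""
    return secrets.token_bytes(k)


def _prf(key: bytes, data: bytes) -> bytes:
    """HMAC-SHA256, used here as a heuristic PRF."""
    return hmac.new(key, data, hashlib.sha256).digest()


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Expand (key, nonce) into `length` pseudorandom bytes, 32 bytes per block."""
    blocks = [_prf(key, nonce + counter.to_bytes(8, "big"))
              for counter in range((length + 31) // 32)]
    return b"".join(blocks)[:length]


def enc(key: bytes, message: bytes) -> bytes:
    """Enc(K, m): probabilistic, because a fresh nonce is drawn on every call."""
    enc_key, mac_key = _prf(key, b"enc"), _prf(key, b"mac")
    nonce = secrets.token_bytes(NONCE_LEN)
    stream = _keystream(enc_key, nonce, len(message))
    body = bytes(a ^ b for a, b in zip(message, stream))
    return nonce + body + _prf(mac_key, nonce + body)


def dec(key: bytes, ciphertext: bytes):
    """Dec(K, c): deterministic; returns None (standing in for ⊥) on failure."""
    if len(ciphertext) < NONCE_LEN + TAG_LEN:
        return None
    enc_key, mac_key = _prf(key, b"enc"), _prf(key, b"mac")
    nonce = ciphertext[:NONCE_LEN]
    body = ciphertext[NONCE_LEN:-TAG_LEN]
    tag = ciphertext[-TAG_LEN:]
    if not hmac.compare_digest(tag, _prf(mac_key, nonce + body)):
        return None
    stream = _keystream(enc_key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


key = gen()
c = enc(key, b"coconut")
assert dec(key, c) == b"coconut"  # correctness in the sense of Definition 4
assert dec(gen(), c) is None      # a different key fails the tag check
```

Encrypting the same message twice yields different ciphertexts because of the fresh nonce, which is the behavior that the probabilistic Enc of Definition 3 asks for.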

We also require SKE to be secure against Pseudorandom Chosen-Plaintext Attacks (PCPA). We now give the definition of the PCPA security of the SKE scheme.

Definition 5.

(PCPA-security). Let SKE = (Gen, Enc, Dec) be a symmetric encryption scheme and A be an adversary. Consider the probabilistic experiment $\mathrm{PCPA}_{SKE, A}(k)$ that is run as follows:

Use the security parameter k to generate the secret key $K \leftarrow \mathrm{Gen}(1^{k})$.
The adversary A is given oracle access to $\mathrm{Enc}_{K}(\cdot)$.
The adversary A outputs a message m.
Let $c_{0} \leftarrow \mathrm{Enc}_{K}(m)$ and $c_{1} \overset{R}{\leftarrow} C$, where C denotes the set of all possible ciphertexts. A bit b is chosen at random, and $c_{b}$ is given to the adversary A.
The adversary A is again given oracle access to $\mathrm{Enc}_{K}(\cdot)$ and, after a polynomial number of queries, outputs a bit $b^{'}$.
The experiment outputs one if $b = b^{'}$, otherwise zero.

The symmetric encryption scheme SKE is PCPA-secure if for all polynomial-size adversaries A,

$$\Pr[\mathrm{PCPA}_{SKE, A}(k) = 1] \leq \frac{1}{2} + \mathrm{negl}(k), \tag{3}$$

where the probability is over the choice of bit b and the coins of Gen and Enc.
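The steps of the PCPA experiment can be traced in a short Python harness. This is a sketch under several assumptions: the toy `enc` below (an HMAC-SHA256 pad, limited to 32-byte messages) merely stands in for a real SKE scheme, the ciphertext space is taken to be all byte strings of the same length as the real ciphertext, and the adversary is split into two hypothetical callbacks, `adversary_choose` and `adversary_guess`.

```python
import hashlib
import hmac
import secrets


def gen(k: int = 32) -> bytes:
    """Gen(1^k): sample a random k-byte key."""
    return secrets.token_bytes(k)


def enc(key: bytes, message: bytes) -> bytes:
    """Toy randomized encryption for messages of at most 32 bytes."""
    if len(message) > 32:
        raise ValueError("toy scheme supports at most 32 bytes")
    nonce = secrets.token_bytes(16)
    pad = hmac.new(key, nonce, hashlib.sha256).digest()
    return nonce + bytes(a ^ b for a, b in zip(message, pad))


def pcpa_experiment(adversary_choose, adversary_guess) -> int:
    key = gen()                                 # step 1: generate K
    oracle = lambda m: enc(key, m)              # steps 2 and 5: oracle access to Enc_K
    message = adversary_choose(oracle)          # step 3: adversary outputs m
    c0 = enc(key, message)                      # step 4: real ciphertext
    c1 = secrets.token_bytes(len(c0))           # step 4: random element of the ciphertext space
    b = secrets.randbelow(2)                    # step 4: random challenge bit
    b_guess = adversary_guess(oracle, c1 if b else c0)  # step 5: adversary outputs b'
    return int(b == b_guess)                    # step 6: one if b = b', else zero


# A naive adversary that ignores the challenge and guesses at random.
choose = lambda oracle: b"form1040"
guess = lambda oracle, challenge: secrets.randbelow(2)

trials = 1000
wins = sum(pcpa_experiment(choose, guess) for _ in range(trials))
print(wins / trials)  # a blind guesser is expected to win about half the time
```

PCPA security requires that no polynomial-size adversary does noticeably better than this blind guesser; the harness only runs the game and makes no claim about the security of the toy scheme.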

4. Substring Search Algorithms

In this section, we present the most popular algorithms and data structures that allow a substring search on the plaintext data. Specifically, we focus on mature data structures, like the suffix tree [28] and the suffix array [29], that have been widely used in many substring search applications. We also present the details of the recently-proposed position heap tree [21]. For each data structure, we give a short overview with examples, and then, we present the computation and storage efficiency. Lastly, we present a discussion about choosing the right data structure to enable substring position search in an untrusted cloud environment.

4.1. Suffix Tree

A suffix tree [28,30,31] is a tree-like representation of text supporting a wide range of applications on strings. The suffix tree is a pre-processed data structure that enables a substring search on the stored string. We now give the definition of the suffix tree:

Definition 6.

(Suffix tree). A suffix tree for string $t = t_{1} \dots t_{n}$ is a rooted, directed tree with the following properties:

Each edge is labeled with a non-empty substring of t, named the edge label.
Every internal node has at least two children.
No two edges out of a node have edge labels starting with the same character.
The tree has n leaves, labeled from 1 to n.

Definition 7.

(Path label). The path label of a node is the concatenation of the edge labels on the path from the root to that node.

Definition 8.

(Suffix tree search). A string χ is a substring of t if and only if it is a prefix of some suffix of t.

Figure 2a shows an example of a suffix tree constructed from the text “coconut”. To check if a string χ is a substring, the algorithm searches for a path from the root whose labels match χ. For instance, searching for the string “coconut”, we begin at the root node and compare the outgoing edge labels, descending to the matching node, i.e., Node 1 is the matching one, and it corresponds to the occurrence in the text. Similarly, the search for “co” leads us to the intermediate node whose leaf nodes $(1, 3)$ are the positions in the text.

The suffix tree can be constructed in $O(n)$ time for a string of length n [31]. Furthermore, it can be shown that a suffix tree has at most $2n$ nodes, and storing the edge labels for all edges would require $O(n^{2})$ space in the worst case. (Consider a suffix tree for the string $t_{1} t_{2} \dots t_{n}$, where each $t_{i}$ is unique. The suffix tree would contain a distinct edge for each of the n suffixes $t_{1} \dots t_{n}$, $t_{2} \dots t_{n}$, $\dots$, $t_{n}$. These suffixes have a total length of $O(n^{2})$.) Searching for a substring χ of length m takes $O(m + occ)$ time to find all occurrences of χ, where $occ$ is the number of occurrences.

4.2. Suffix Array

A suffix array [29] is a sorted index array of all suffixes of a string. The suffix array data structure is used in full text indices, within the field of computational biology and others.

Definition 9.

(Suffix array). Given a text t of length n, the suffix array for t is an array of integers ranging from one to n specifying the lexicographic ordering of the suffixes of the string t.

A suffix array can be built in $O(n)$ time for a string of length n [29]. To search for a substring χ of length m, the search can be executed as a simple binary search over the suffix array, i.e., at each step of the binary search, we compare the suffix of t that starts at the position stored in the current array element with the substring χ. Thus, the search for any substring can be performed in $O(m \times \log(n))$ time. This complexity can be improved by adding the longest common prefix information, so the search can be executed in $O(m + \log(n))$ time (see [29] for details).

Consider an example of a suffix array in Figure 2b constructed from the text “coconut”. Searching for “coconut” gives the occurrence $(1)$, while searching for “co” results in the occurrences $(1, 3)$ in the text.
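The binary search over a suffix array can be sketched in a few lines of Python. This is a minimal illustration, not the construction of [29]: the array is built by naive sorting of all suffixes (which is slower than the linear-time construction), no longest common prefix information is used, and the function names are chosen for this example. Positions are reported 1-based from the left, as in the example above.

```python
def build_suffix_array(text: str) -> list:
    """Start offsets (0-based) of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list, pattern: str) -> list:
    """Return the sorted 1-based positions where pattern occurs in text."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(start + 1 for start in suffix_array[first:lo])


text = "coconut"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "co"))       # [1, 3]
print(find_occurrences(text, sa, "coconut"))  # [1]
```

Each comparison inspects at most m characters and the two binary searches take $O(\log(n))$ steps each, which gives the $O(m \times \log(n))$ search time quoted above.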

4.3. Position Heap Tree

We now give an overview of the position heap tree data structure [21].

Definition 10.

(Position heap). The position heap Λ of text t is a tree constructed by iteratively inserting the suffixes $(t_{1}, t_{2}, \dots, t_{n})$ of t in ascending order of length into Λ. That is, $t_{i}$ is inserted by creating a new node in Λ that is the shortest prefix of $t_{i}$ that is not already a node of the tree and labeling it with position i.

Figure 2c shows an example of position heap tree Λ constructed from the text “coconut”. The first suffix “t” of text creates the root node in Λ with the position label of “1”. Next, the second suffix “ut” of text creates a new node with Position 2 and connecting edge with label “u”; the third suffix “nut” creates a new node with Position 3 and connecting edge with label “n”. Similarly, the seventh suffix “coconut” creates a new node with Position 7 and connecting edge with label “o” since there is already a node in Λ with edge label “c” created by the fifth suffix “conut”. Following Definition 10, the position heap Λ is constructed. The construction can be executed for any text t, and since it is deterministic, the position heap Λ for a text is unique.

We now present the definition of the search in the position heap tree.

Definition 11.

(Position heap search). The position heap search of all occurrences of a substring χ of text t in Λ consists of the following steps:

Index into the position heap Λ to find the longest prefix p of χ that is a node of Λ. For each ancestor $p^{'}$ of p, lookup the position i stored in $p^{'}$ . Here, position i is an occurrence of $p^{'}$ . Determine if this occurrence is followed by $χ - p^{'}$ . If yes, report i as an occurrence of χ.
If $p = χ$ , also report all positions of the descendants of p.

Using the example tree in Figure 2c, the search for the substring “co” leads to the node $(7)$. The set of traversed ancestor nodes $(5, 1)$ needs to be checked against the text t. Indeed, only Position 5 matches the substring “co”. Therefore, the positions of the substring query “co” are $(5, 7)$. In the case of the substring query “coconut”, the search algorithm falls off the tree; thus, it returns the set of traversed nodes $(7, 5, 1)$ for inspection, where 7 is the only matching occurrence of “coconut” in the text.
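The construction of Definition 10 and the search of Definition 11 can be reproduced with a small Python sketch. It is a naive illustration rather than the linear-time algorithm of [21]: each node is stored in a dictionary keyed by its path label (an assumption made for brevity), the root is the empty path created by the first suffix, and position i denotes the suffix consisting of the last i characters of the text, as in the “coconut” example.

```python
def build_position_heap(text: str) -> dict:
    """Map each node, identified by its path label, to its position."""
    heap = {}
    n = len(text)
    for i in range(1, n + 1):
        suffix = text[n - i:]  # i-th suffix: the last i characters
        # Insert the shortest prefix of the suffix that is not yet a node.
        for length in range(i + 1):
            prefix = suffix[:length]
            if prefix not in heap:
                heap[prefix] = i
                break
    return heap


def search(heap: dict, text: str, pattern: str) -> list:
    """Return all positions at which pattern occurs, following Definition 11."""
    n = len(text)
    # Walk down to the longest prefix p of pattern that is a node.
    depth = 0
    while depth < len(pattern) and pattern[:depth + 1] in heap:
        depth += 1
    hits = set()
    # Check the position stored at p and at each of its ancestors against the text.
    for d in range(depth + 1):
        position = heap[pattern[:d]]
        if text[n - position:].startswith(pattern):
            hits.add(position)
    # If p equals the pattern, every descendant of p is an occurrence as well.
    if depth == len(pattern):
        hits.update(pos for path, pos in heap.items() if path.startswith(pattern))
    return sorted(hits)


text = "coconut"
heap = build_position_heap(text)
print(search(heap, text, "co"))       # [5, 7]
print(search(heap, text, "coconut"))  # [7]
```

For “coconut”, the sketch creates the nodes with path labels "", "u", "n", "o", "c", "oc" and "co" at Positions 1 to 7, and the two queries return the same positions as the walkthrough above.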

The position heap tree for a text of length n can be constructed in $O(n)$ time [21]. All positions of a substring χ of length m can be found in $O(m^{2} + occ)$ time, where $occ$ is the number of occurrences reported. We refer the reader to [21] for a detailed discussion on position heap properties.

4.4. Discussion

Recall that our goal is to construct a scheme that allows a substring search over encrypted data outsourced to the cloud. In our system model, the data owner has a set of documents with sensitive information that he wants to upload in encrypted form to the cloud provider. To enable the substring search, the data owner constructs a searchable index I from the set of plaintext documents D, and then, he places I at the cloud provider to allow the substring search. Our goal is to choose an optimal data structure that has low storage requirements and fast search execution. Later, we use the selected data structure to construct the searchable index I in our scheme.

As we have noted previously, many substring matching algorithms have been proposed, and they differ in terms of storage requirements and search execution. We outline the comparison of substring search data structures in Table 1. We use several comparison parameters: the construction time, the search execution time and the storage requirements. The construction time corresponds to the time it takes to create a data structure from an input text t of size n. The search execution time is the time it takes to find all occurrences of a substring χ of length m in the text t. The cloud storage is the space needed to store all textual labels of the data structure combined.

From Table 1, we can see that the suffix tree, suffix array and position heap tree have $O(n)$ preprocessing time for a text t of size n. In search, the suffix tree has the best execution time, $O(m + occ)$, for a substring χ of length m. However, the suffix tree has at most $2n$ nodes, and it would take $O(n^{2})$ space to store the text at the cloud provider. On the other hand, the suffix array can be constructed with n elements, but it has $O(n^{2})$ storage. Only the position heap tree allows us to have the low storage of $O(n)$ with n nodes; however, the substring search execution takes $O(m^{2} + occ)$ time. In the rest of the paper, we use the position heap tree data structure as the main building block for our scheme. In our choice between the data structures, we believe that the $O(n)$ storage requirement is the predominant criterion, since expanding any large dataset (e.g., the human genome with three billion letters) to $O(n^{2})$ storage would cause a substantial waste of cloud computing resources.

5. Substring Position Searchable Symmetric Encryption

5.1. Algorithm Definitions

Definition 12.

(Substring Position Searchable Symmetric Encryption (SSP-SSE)). A tree-based SSP-SSE scheme over a set of documents D is a tuple of six polynomial-time algorithms (KeyGen, BuildTree, Encrypt, ConstructQuery, Search, Decrypt), as follows:

$K \leftarrow \mathrm{KeyGen}(1^{k})$: a probabilistic key generation algorithm to set up the SSP-SSE scheme. The algorithm takes a security parameter k and outputs a set of secret keys K.
$Λ \leftarrow \mathrm{BuildTree}(D)$: a deterministic algorithm to build a position heap tree Λ. The algorithm takes a document collection $D = (D_{1}, \dots, D_{l})$ and outputs a position heap tree Λ.
$(I, C) \leftarrow \mathrm{Encrypt}(K, Λ, D)$: a probabilistic algorithm to encrypt a position heap tree Λ and document corpus D. The algorithm inputs a set of secret keys K, a position heap tree Λ and a document corpus D. The output of the algorithm is a searchable index I and an encrypted collection $C = (C_{1}, \dots, C_{l})$.
$[Q \leftarrow \mathrm{ConstructQuery}(K, χ)] \leftrightarrow [L \leftarrow \mathrm{Search}(I, Q)]$: two deterministic algorithms that are executed interactively between the cloud user and the cloud provider. The ConstructQuery algorithm inputs a set of secret keys K and a substring χ, and it outputs a search query Q. The Search algorithm inputs a searchable index I and a search query Q. The algorithm finds the set of matching encrypted documents $L \subseteq C$.
$(D_{i}, pos_{D_{i}}) \leftarrow \mathrm{Decrypt}(K, C_{i})$: a deterministic algorithm that takes a set of secret keys K and a ciphertext $C_{i}$ as input and outputs, $\forall i \in [1; n]$, the original document $D_{i}$ and the set $pos_{D_{i}}$ of positions of χ in $D_{i}$.

Definition 13.

(SSP-SSE correctness). We say that the tree-based SSP-SSE scheme is correct if $\forall k \in N$, ∀K produced by $\mathrm{KeyGen}(1^{k})$, ∀D, ∀Λ output by $\mathrm{BuildTree}(D)$, ∀χ, $\forall i \in [1; n]$:

$$\mathrm{Search}(\mathrm{Encrypt}(K, Λ, D), \mathrm{ConstructQuery}(K, χ)) = C(χ) \; \wedge \; \mathrm{Decrypt}(K, C_{i}) = (D_{i}, pos_{D_{i}}) \tag{4}$$

The SSP-SSE correctness ensures the proper output if all SSP-SSE algorithms are executed honestly by the cloud provider.

5.2. Security Model Definitions

The security goal of any searchable encryption scheme is to reveal as little information as possible to the adversary. Intuitively, in the SSP-SSE scheme, we want to provide the following security guarantees: given a searchable index I and a set of encrypted documents $C = \{C_{1}, \dots, C_{l}\}$, the adversary learns no valuable information about the original documents $D = \{D_{1}, \dots, D_{l}\}$; given a set of incoming search queries $Q = \{Q_{1}, \dots, Q_{t}\}$, the adversary cannot learn any practical information about the content of the search query $Q_{i}$ or the original document collection D. However, these security guarantees are difficult to achieve, and most known searchable encryption schemes [3,4,6,12,13] reveal some information, namely the access pattern and the search pattern. In SSP-SSE, we follow a similar approach to [6], weakening the security guarantees and allowing some limited information to be leaked to the adversary.

Definition 14.

(Access pattern). Given the n encrypted documents C, where $C = \{C_{1}, \dots, C_{l}\}$, and the search query vector Q, where $Q = \{Q_{1}, \dots, Q_{t}\}$ of size t, the access pattern $κ(C, Q)$ includes the set of document identifiers induced by the search query vector Q.

Definition 15.

(Search pattern). Given the n encrypted documents C, where $C = \{C_{1}, \dots, C_{l}\}$, and the search query vector Q, where $Q = \{Q_{1}, \dots, Q_{t}\}$ of size t, the search pattern $γ(C, Q)$ is an $n \times t$ binary matrix, such that $\forall i \in [1; n]$ and $\forall j \in [1; t]$, the element in the i-th row and j-th column is one if the document identifier $id_{i}$ is returned by the search query $Q_{j}$. The search pattern reveals whether the same search was executed in the past or not.
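As a small illustration of the search pattern matrix, the Python sketch below builds the $n \times t$ binary matrix from per-query result sets. The document identifiers, the result sets and the helper name `search_pattern` are made up for this example; a real adversary would observe the results at the cloud provider rather than receive them as plaintext sets.

```python
def search_pattern(doc_ids: list, results: list) -> list:
    """Build the n x t binary matrix of the search pattern.

    doc_ids: the n document identifiers.
    results: t sets, where results[j] holds the identifiers returned by query Q_j.
    """
    return [[1 if doc_id in hits else 0 for hits in results] for doc_id in doc_ids]


doc_ids = ["id1", "id2", "id3"]                      # hypothetical identifiers
results = [{"id1", "id3"}, {"id2"}, {"id1", "id3"}]  # hypothetical query results
matrix = search_pattern(doc_ids, results)
print(matrix)  # [[1, 0, 1], [0, 1, 0], [1, 0, 1]]

# Pairs of queries whose columns coincide returned identical result sets.
t = len(results)
repeated = [(a, b) for a in range(t) for b in range(a + 1, t)
            if all(row[a] == row[b] for row in matrix)]
print(repeated)  # [(0, 2)]
```

Identical columns, as for the first and third query here, are what lets an observer suspect that a search was repeated, which is the kind of leakage the search pattern is meant to capture.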

Since our solution is based on the position heap tree data structure, we would like to capture the path pattern security notion. The path pattern of the position heap tree reveals the path traversed from the root node to the matching node for a given search query.

Definition 16.

(Path pattern). Given the n encrypted documents C, where $C = \{C_{1}, \dots, C_{l}\}$, and the searchable index I built from the document collection, the path pattern of $(C, I)$ induced by the search query vector Q, where $Q = \{Q_{1}, \dots, Q_{t}\}$ of size t, is a tuple $δ(C, I, Q)$ that reveals the set of identifiers of nodes in the index I that are reached by query $Q_{i \in [1; t]}$.

Now, we define the leakage functions to capture all of the information leakage we have in this work:

Leakage $L_{1}(I, C)$. Given the encrypted collection $C = \{C_{1}, \dots, C_{l}\}$ and the searchable index I, the leakage consists of the following information: the number of encrypted documents, the size of encrypted documents and the identifier of each encrypted document.
Leakage $L_{2}(Q, I, C)$. Given the encrypted collection $C = \{C_{1}, \dots, C_{l}\}$, the searchable index I and the search query Q, the leakage function outputs the access pattern $κ(C, Q)$, search pattern $γ(C, Q)$ and path pattern $δ(C, I, Q)$.

Definition 17.

(Security against adaptive Chosen-Query Attack (CQA2)). Let SSP-SSE be a tree-based SSE scheme that consists of six algorithms as described in Definition 12. Let

A

be a stateful adversary and

S

be a stateful simulator. We consider two probabilistic experiments

R e a l_{A}

and

I d e a l_{A, S}

that involve

A

, as well as

S

, with two stateful leakage algorithms

L_{1}

and

L_{2}

and security parameter k:

\underline{R e a l_{A} (k)}

: The challenger runs the

K e y G e n (1^{k})

to output the key set K. The adversary

A

sends constructed plaintext position heap tree Λ and collection D to the challenger and receives a tuple

(I, C)

←

E n c r y p t (K, Λ, D)

from the challenger. The adversary

A

makes a polynomial number of adaptive string searches χ =

χ_{1},

\dots,

χ_{t}

and sends them to the challenger.

A

then receives the search queries generated by the challenger, such that

Q_{i}

←

C o n s t r u c t Q u e r y (K, χ_{i})

. The adversary returns one if his or her queries return the expected result, otherwise zero.

\underline{I d e a l_{A, S} (k)}

: The adversary

A

outputs the tuple

(D

,

Λ)

, where Λ←

B u i l d T r e e (D)

, and sends it to the simulator. Given the leakage

L_{1}

, simulator

S

generates the tuple

(I, C)

and sends it to the adversary.

A

makes a polynomial number of adaptive string searches χ =

χ_{1},

\dots,

χ_{t}

and sends them to the simulator. Given the leakage

L_{2}

, the simulator

S

sends the appropriate search queries to the adversary. Finally,

A

returns one in the case of successful experiment, otherwise zero.

We say that SSP-SSE is adaptively secure against the chosen-query attack if for all probabilistic polynomial time adversaries

A

, there exists a non-uniform probabilistic polynomial time simulator

S

, such that:

\left| \Pr [\mathrm{Real}_{A} (k) = 1] - \Pr [\mathrm{Ideal}_{A, S} (k) = 1] \right| \leq \mathrm{negl} (k)

(5)

5.3. SSP-SSE Construction

We now present the details of the proposed SSP-SSE scheme. The scheme consists of two phases, namely the setup phase and the search phase. The setup phase is performed once by the data owner to upload the set of encrypted documents and the searchable index to the cloud provider. In this phase, the data owner uses the KeyGen, BuildTree and Encrypt algorithms to encrypt the document collection and to construct the searchable index. The search phase is performed by the cloud user each time a query is submitted. In this phase, the cloud user invokes the ConstructQuery algorithm to generate the search query. The cloud provider executes the Search algorithm to output the matching results. Finally, the cloud user invokes the Decrypt algorithm to recover the plaintext of the matching documents. Our scheme relies on the notation shown in Algorithm 1. We outline the details of the setup phase in Section 5.3.1 and the search phase in Section 5.3.2.

Algorithm 1: Notations.

t = $(t_{1}, t_{2}, \dots, t_{n})$ - the text constructed from document collection D. $t_{i}$ is the letter in text t at position i.
$ν [i]$ - the node in Λ at index i ( $i \in [1; n]$ ).
$V (ν [i])$ - the position value of node $ν [i]$ in Λ.
$p i d (D_{j})$ = $i d (D_{j}) | | p o s_{D_{j}}$ - concatenation of document identifier $D_{j}$ ( $j \in [1; l]$ ) with position i of character $t_{i}$ in the document $D_{j}$ .
$L (ν [i])$ - the path label of node $ν [i]$ in position heap tree Λ.
$L_{p a r e n t} (ν [i])$ - the path label of $ν [i]$ ’s parent node.
$d e p t h (ν [i])$ - the depth of node $ν [i]$ in Λ.
$\bar{ν [i]}$ - the encrypted node in $\bar{Λ}$ at index i.
$\bar{V} (\bar{ν [i]})$ - the encrypted value of node $\bar{ν [i]}$ in $\bar{Λ}$ .
$\bar{L} (\bar{ν [i]})$ - the encrypted path label of node $\bar{ν [i]}$ .
$\bar{L_{p a r e n t}} (\bar{ν [i]})$ - the encrypted path label $\bar{ν [i]}$ ’s parent node.
$d e s c e n d a n t s (\bar{ν [i]})$ - the set of descendant (child) nodes in the subtree rooted at node $\bar{ν [i]}$ . If $\bar{ν [i]}$ is the leaf node, then $d e s c e n d a n t s (\bar{ν [i]})$ = 0.
$a n c e s t o r s (\bar{ν [i]})$ - the set of ancestor (parent) nodes at node $\bar{ν [i]}$ . If $\bar{ν [i]}$ is root node, then $a n c e s t o r s (\bar{ν [i]})$ = 0.

5.3.1. Setup Phase

The setup phase (Algorithm 2) includes the KeyGen, the BuildTree and the Encrypt algorithms. Let k be a security parameter, and let SKE = (Gen, Enc, Dec) be a PCPA-secure symmetric-key encryption scheme. The data owner begins with the KeyGen algorithm, which takes the security parameter k as input and outputs a set of keys

K_{1}

,

K_{X}

,

K_{Y}

,

K_{V}

and a set of random keys

K_{Q}

,

K_{L}

,

K_{2}

,

K_{3}

\overset{R}{\leftarrow}

{0, 1}^{k}

. He will use these keys to encrypt the document collection D =

(D_{1},

\dots,

D_{l})

and construct searchable index I.

Algorithm 2: SSP-SSE setup phase.

Let

S K E

=

(G e n,

E n c,

D e c)

be a PCPA-secure symmetric-key encryption scheme; let F:

{0, 1}^{k}

×

{0, 1}^{*}

→

{0, 1}^{k}

be a PRF; and let P:

{0, 1}^{k}

×

{0, 1}^{n}

→

{0, 1}^{n}

be a PRP.
SETUP PHASE.

\underline{KeyGen (1^{k})}

: given the security parameter k, generate

K_{1}

,

K_{X}

,

K_{Y}

,

K_{V}

←

S K E . G e n (1^{k})

and

K_{Q}

,

K_{L}

,

K_{2}

,

K_{3}

\overset{R}{\leftarrow}

{0, 1}^{k}

. Output the key set K =

(K_{1}

,

K_{X}

,

K_{Y}

,

K_{V}

,

K_{Q}

,

K_{L}

,

K_{2}

,

K_{3})

.
BuildTree(D) : given the document collection D =

(D_{1},

\dots,

D_{l})

:

construct text t = $t_{1}$ $t_{2}$ … $t_{n}$ from document collection D, and input t of size n to build the position heap tree Λ.
index into Λ, for each node $ν [i]$ (i∈ $[1,$ $n]$ ):
(a)
set $V (ν [i])$ = $p i d (D_{j}) | | V (ν [i])$ , where $D_{j}$ ( $j \in [1, l]$ ) is the document in collection D.
output the position heap tree Λ

Encrypt(K, Λ, D) : given the secret key set K, position heap tree Λ and the set of documents D =

(D_{1},

\dots,

D_{l})

.
Build encrypted tree:

index into Λ, traverse from the root node:
for each node $ν [i]$ (i∈ $[1,$ $n]$ ):
(a)
set $\bar{L} (\bar{ν [i]})$ = $F_{K_{Q}} (L (ν [i]) | | d e p t h (ν [i]) | | \bar{L_{p a r e n t}} (\bar{ν [i]}) | | K_{L})$ (i.e., apply PRF F with key $K_{Q}$ on the concatenation of the path label L of $ν [i]$ , the depth of the node $ν [i]$ , the encrypted parent label $\bar{L_{p a r e n t}}$ of $ν [i]$ and the secret key $K_{L}$ ).
(b)
set $\bar{V} (\bar{ν [i]})$ = $S K E . E n c_{K_{V}} (i)$ .
output encrypted $\bar{Λ}$ .

Build encrypted arrays:

for each character $t_{i}$ of t indexed from right-to-left (i.e., $t_{n}$ $t_{n - 1}$ … $t_{1}$ ), set an array $X [P_{K_{2}} (i)]$ = $S K E . E n c_{K_{X}} (t_{i})$ .
for each i = $[1,$ $n]$ : set an array $Y [P_{K_{3}} (i)]$ = $S K E . E n c_{K_{Y}} (V (ν [i]))$ .

Encrypt document collection:

for each document $D_{i}$ where $i \in [1, l]$ , let $C_{i}$ ← $S K E . E n c_{K_{1}} (D_{i})$ .
output C = $(C_{1},$ $C_{2},$ $\dots,$ $C_{l})$ .

Output: index I =

(\bar{Λ},

X,

Y)

and encrypted document collection C =

(C_{1},

C_{2},

\dots,

C_{l})

.

First, the data owner constructs a position heap tree Λ using the BuildTree algorithm outlined in Definition 10. The BuildTree algorithm inputs the text t, where t is constructed from the document collection D =

D_{1} | | $ \dots $

| | D_{l}

padded with the unique terminator string $ and outputs the single position heap tree Λ. In order to handle multiple documents in the collection, the data owner adds auxiliary information to each node that contains the document identifier

D_{i}

and the position of the letter in

D_{i}

. For example, if the character “a” appears in the document

D_{1}

at Position 1, the node in Λ will have extra information of

p i d (D_{1})

=

i d (D_{1}) | | 1

. Formally, we concatenate the identifier of

D_{j}

(

j \in [1; l]

) with position i of character

t_{i}

in the document

D_{j}

, i.e.,

p i d (D_{j})

=

i d (D_{j}) | | p o s_{D_{j}}

, and add this information in each node in the position heap tree. Figure 3a shows an example of position heap tree Λ of the text “ab$aaa$bb” constructed from three concatenated documents

(D_{1},

D_{2},

D_{3})

, where

D_{1}

has text “bb”,

D_{2}

has text “aaa” and

D_{3}

has text “ab”. Note, a search of “ab” in the position heap tree returns a set of nodes

(9, 4, 1)

where only 9 is the matching node, and it describes the document position of

D_{3} | | 2

. Thus, the search query “ab” appears only in the document

D_{3}

at Position 2.
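The construction described above can be sketched in a few lines of Python. This is a minimal plaintext sketch, assuming the usual position heap insertion rule (for each position i, follow the longest existing path that is a prefix of the suffix starting at i, then add one child) and right-to-left position numbering; the names `Node`, `build_position_heap` and `path_positions` are illustrative only, and document identifiers are omitted.

```python
class Node:
    def __init__(self, pos):
        self.pos = pos        # position value stored in the node
        self.children = {}    # edge character -> child Node


def build_position_heap(text):
    """Build a position heap for `text`; position 1 is the last character."""
    n = len(text)
    root = Node(1)            # the root holds position 1
    for i in range(2, n + 1):
        suffix = text[n - i:]           # suffix that starts at position i
        node, depth = root, 0
        # Follow the longest prefix of the suffix that is already a path.
        while suffix[depth] in node.children:
            node = node.children[suffix[depth]]
            depth += 1
        # Extend that path by one character; the new node stores position i.
        node.children[suffix[depth]] = Node(i)
    return root


def path_positions(root, pattern):
    """Positions of the nodes visited while matching `pattern` from the root."""
    node, visited = root, [root.pos]
    for ch in pattern:
        if ch not in node.children:
            break
        node = node.children[ch]
        visited.append(node.pos)
    return visited


heap = build_position_heap("ab$aaa$bb")
path = path_positions(heap, "ab")
print(path)
```

For the text "ab$aaa$bb", the path visited for "ab" holds positions 1, 4 and 9, which agrees with the node set (9, 4, 1) discussed above.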

The data owner constructs the searchable index that is based on the position heap tree data structure. To present the details, we use an example of the position heap tree Λ shown in Figure 3b. The figure depicts constructed position heap tree Λ from text t = “abaaababbabaaba” and text array X (shown at the top of the figure), where each array element has a single character of text t indexed from right-to-left.

The data owner begins by extracting position information from Λ as follows: index each node in tree Λ, and create a position array Y, such that each index in Y corresponds to the node value of Λ. Figure 4a shows an example of the left-side branch of position heap tree Λ and constructed position array Y. In this example, nodes in Λ are marked with the red color index, and their corresponding values (positions) are stored as elements in Y. (In Figure 4a we show an example of the position array Y for nine nodes of Λ for demonstration purposes only. The actual algorithm is executed on all nodes in Λ.) With this, the data owner is ready to encrypt the position heap tree Λ, text array X and position array Y data structures.

First, to encrypt the position heap tree Λ, the data owner uses a pseudorandom function F:

{0, 1}^{k}

×

{0, 1}^{*}

→

{0, 1}^{k}

and PCPA-secure symmetric-key encryption scheme SKE = (Gen, Enc, Dec). For each node i in Λ, the data owner applies PRF F with key

K_{Q}

on the concatenation of the path label of node i, the depth of node i, the encrypted path label of the parent of node i and the secret key

K_{L}

. Figure 4b shows an example of the path label encryption. For instance, the label of Node 4 is

L_{4}

=

F_{K_{Q}} (a | | 3 | | L_{3} | | K_{L})

, where

L_{3}

=

F_{K_{Q}} (a | | 2 | | L_{2} | | K_{L})

. The root path label is a special case, and its label is

L_{1}

=

F_{K_{Q}} (a | | 0 | | \emptyset | | K_{L})

. In this way, the data owner encrypts all path labels in the tree. This hides the plaintext path labels of the same character at different levels of the tree Λ. Moreover, this makes the ciphertext unique for all path labels in the tree. To hide the index information of each node in Λ, the data owner uses SKE encryption with key

K_{V}

on the index of the node, i.e.,

V_{i}

=

S K E . E n c_{K_{V}} (i)

, where

i \in [1, n]

. For instance, the value of Node 8 is

V_{8}

=

S K E . E n c_{K_{V}} (8)

. With no plaintext left in Λ, the data owner outputs an encrypted position heap tree

\bar{Λ}

.
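The chained label encryption can be illustrated with a short Python sketch. It assumes HMAC-SHA256 as a stand-in for the PRF F, a length-prefixed encoding of the concatenated fields, and a fixed convention for the root label; the helper names `prf` and `encrypt_labels`, the keys and the sample paths are hypothetical and are not taken from the prototype.

```python
import hashlib
import hmac


def prf(key: bytes, *parts: bytes) -> bytes:
    """Stand-in for F: HMAC-SHA256 over length-prefixed fields (assumed encoding)."""
    msg = b"".join(len(p).to_bytes(4, "big") + p for p in parts)
    return hmac.new(key, msg, hashlib.sha256).digest()


def encrypt_labels(paths, k_q, k_l):
    """Map each plaintext path (string of edge characters) to its chained label.

    `paths` must be prefix-closed, as the paths of a tree always are.
    """
    enc = {"": prf(k_q, b"", b"0", b"", k_l)}   # assumed root convention
    for path in sorted(paths, key=len):          # parents before children
        if path == "":
            continue
        parent_label = enc[path[:-1]]
        enc[path] = prf(k_q, path[-1].encode(), str(len(path)).encode(),
                        parent_label, k_l)
    return enc


# Hypothetical keys and a small prefix-closed set of paths.
k_q, k_l = b"Q" * 16, b"L" * 16
paths = ["", "a", "aa", "aaa", "b", "ab"]
enc = encrypt_labels(paths, k_q, k_l)

# The same character 'a' at depths 1, 2 and 3 yields three different labels.
print(enc["a"].hex()[:16], enc["aa"].hex()[:16], enc["aaa"].hex()[:16])
```

Because each label depends on its depth and on the parent label, equal characters at different places in the tree produce unrelated outputs, which is the property the paragraph above relies on.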

Second, the data owner utilizes a pseudorandom permutation P:

{0, 1}^{k}

×

{0, 1}^{n}

→

{0, 1}^{n}

and PCPA-secure symmetric-key encryption SKE to hide plaintext elements of text array X and position array Y. For each i (

i \in [1, n]

) in X, the data owner applies PRP P with secret key

K_{2}

on each i, i.e.,

P_{K_{2}} (i)

. For each corresponding character

t_{i}

at index i in X, he applies SKE with secret key

K_{X}

on character

t_{i}

, i.e.,

S K E . E n c_{K_{X}} (t_{i})

. The data owner sets the encrypted array X as

X [P_{K_{2}} (i)]

=

S K E . E n c_{K_{X}} (t_{i})

. Next, for each i (

i \in [1, n]

) in Y, he utilizes PRP P with secret key

K_{3}

and SKE with secret key

K_{Y}

as follows:

Y [P_{K_{3}} (i)]

=

S K E . E n c_{K_{Y}} (V_{i})

, where

V_{i}

is the i-th element in Y.
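The layout of the two encrypted arrays can be sketched as follows. This is a toy Python sketch under loud assumptions: the PRP P is replaced by a keyed ranking of indices by their HMAC tags, and SKE is replaced by an unauthenticated HMAC-based stream cipher with a random nonce. Neither stand-in is the AES-CTR or DES instantiation of the prototype, neither is suitable for real use, and all function names are hypothetical.

```python
import hashlib
import hmac
import os


def keyed_permutation(key: bytes, n: int) -> dict:
    """Toy stand-in for PRP P on {1..n}: rank the indices by their HMAC tag."""
    order = sorted(range(1, n + 1),
                   key=lambda i: hmac.new(key, i.to_bytes(8, "big"),
                                          hashlib.sha256).digest())
    return {i: rank for rank, i in enumerate(order, start=1)}


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    out, counter = b"", 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(4, "big"),
                        hashlib.sha256).digest()
        counter += 1
    return out[:length]


def ske_enc(key: bytes, plaintext: bytes) -> bytes:
    """Toy randomized encryption: nonce || (plaintext XOR keystream)."""
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def ske_dec(key: bytes, ciphertext: bytes) -> bytes:
    nonce, body = ciphertext[:16], ciphertext[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


def build_arrays(text, values, k_2, k_3, k_x, k_y):
    """X[P_K2(i)] = Enc_KX(t_i) and Y[P_K3(i)] = Enc_KY(V_i), for i in [1, n]."""
    n = len(text)
    p2, p3 = keyed_permutation(k_2, n), keyed_permutation(k_3, n)
    X, Y = {}, {}
    for i in range(1, n + 1):
        X[p2[i]] = ske_enc(k_x, text[n - i].encode())        # t_i, right-to-left
        Y[p3[i]] = ske_enc(k_y, str(values[i - 1]).encode())
    return X, Y, p2, p3


text = "abaaababbabaaba"
values = list(range(1, len(text) + 1))   # hypothetical: node i stores position i
k_2, k_3, k_x, k_y = (os.urandom(16) for _ in range(4))
X, Y, p2, p3 = build_arrays(text, values, k_2, k_3, k_x, k_y)
print(ske_dec(k_x, X[p2[6]]))            # character at position 6
```

A holder of K_2 and K_X can recompute the permuted index and decrypt a single cell, which is how the array lookups of the search phase use these arrays.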

Finally, the data owner encrypts each document

D_{i}

in the collection D using the PCPA-secure symmetric-key encryption scheme SKE with secret key

K_{1}

to produce the encrypted document

C_{i}

←

S K E . E n c_{K_{1}} (D_{i})

. Afterwards, the data owner uploads the encrypted collection C along with the searchable index I =

(\bar{Λ}, X, Y)

to the cloud provider. Now, the collection is available for selective cloud retrieval.

5.3.2. Search Phase

The search phase (Algorithm 3) includes both the ConstructQuery and the Search interactive algorithms that are executed between the cloud user and the cloud provider. The cloud user keeps the set of secret keys K =

(K_{1}

,

K_{X}

,

K_{Y}

,

K_{V}

,

K_{Q}

,

K_{L}

,

K_{2}

,

K_{3})

received from the data owner.

Algorithm 3: SSP-SSE search phase.

SEARCH PHASE.
[(Q) ← ConstructQuery(K, χ)] ↔ [(L) ← Search(I,Q)] is an interactive protocol between the cloud user and the cloud provider. The cloud user keeps the key set K =

(K_{1}

,

K_{X}

,

K_{Y}

,

K_{V}

,

K_{Q}

,

K_{L}

,

K_{2}

,

K_{3})

and queries the cloud provider for a substring χ. The cloud provider executes the search on the searchable index I =

(\bar{Λ},

X,

Y)

and returns results back to the cloud user.

cloud user: given the secret key $K_{Q}$ and the string of interest χ, output the search query Q as follows:
(a)
for each character $χ_{i}$ , $i \in [1; m]$ , where $m = | χ |$
set $Q_{i}$ = $F_{K_{Q}} (χ_{i} | | i | | Q_{i - 1} | | K_{L})$ (i.e., apply PRF F with key $K_{Q}$ on the concatenation of character $χ_{i}$ of χ, i, output of query $Q_{i - 1}$ and secret key $K_{L}$ .)
set Q = $(Q_{1}, Q_{2}, \dots, Q_{m})$ .
(b)
send search query Q to the cloud provider.
cloud provider: index into $\bar{Λ}$ , start at the root node
(a)
for each $Q_{i}$ and each node $\bar{ν}$ in $\bar{Λ}$ , match the encrypted label $\bar{L} (\bar{ν})$ to $Q_{i}$ . Continue until the longest matching node $\bar{ν [m a t c h]}$ is found.
(b)
If $\bar{ν [m a t c h]}$ ≠ ⊥, return $(d e s c e n d a n t s (\bar{ν [m a t c h]}),$ $a n c e s t o r s (\bar{ν [m a t c h]}))$ , otherwise return ⊥.
cloud user: let TMP-AN and TMP-DE be two arrays; let TMP-RES = TMP-AN + TMP-DE be an array that combines elements from TMP-AN and TMP-DE.
(a)
for each node $\bar{ν}$ in $a n c e s t o r s (\bar{ν [m a t c h]})$ :
if $S K E . D e c_{K_{V}} (\bar{V} (\bar{ν}))$ =⊥, abort. Otherwise, output $i d x$ , and add to TMP-AN.
(b)
for each node $\bar{ν}$ in $d e s c e n d a n t s (\bar{ν [m a t c h]})$ :
if $S K E . D e c_{K_{V}} (\bar{V} (\bar{ν}))$ = ⊥, abort. Otherwise, output $i d x$ , and add to TMP-DE.
(c)
set TMP-RES = TMP-AN + TMP-DE, for each $i d x$ in TMP-RES, set $y_{i d x}$ = $P_{K_{3}} (i d x)$ . Send $(y_{1}, \dots, y_{n u m})$ to the cloud provider.
cloud provider: get $Y_{i} = Y [y_{i}]$ ( $i \in [1, n u m])$ , output $(Y_{1},$ $\dots,$ $Y_{n u m})$ .
cloud user: let AN and DE be two arrays.
(a)
for i = $[1, m]$ , if $S K E . D e c_{K_{Y}} (Y_{i})$ = ⊥, abort; otherwise, add output to AN.
(b)
for i = $[m + 1, n u m]$ , if $S K E . D e c_{K_{Y}} (Y_{i})$ = ⊥, abort; otherwise, add output to DE.
(c)
parse each element from AN as $p i d (D) | | p o s$ .
(d)
for each $p o s$ in AN, for $j = p o s,$ $j > (p o s - m)$ (where $(p o s - m)$ ≥ 0), $j - -$ , let $x_{j}$ = $P_{K_{2}} (j)$ , send $(x_{1}, \dots, x_{h})$ to the cloud provider.
cloud provider: get $X_{i} = X [x_{i}]$ ( $i \in [1, h]$ ), output $(X_{1}, \dots, X_{h})$ .
cloud user: let REAL-AN be an array.
(a)
for $i = [1; h]$ , if $S K E . D e c_{K_{X}} (X_{i})$ = ⊥, abort. Otherwise, parse the output as $t_{j}$ .
(b)
for each $p o s$ in AN, compare characters $χ_{u}$ = $t_{j}$ , where $u = 0,$ $u < m,$ $u + +$ and $j = p o s,$ $j > (p o s - m)$ (where $(p o s - m)$ ≮ 0), $j - -$ . If all $χ_{u}$ = $t_{j}$ match at given $p o s$ , add $p o s$ to REAL-AN; otherwise, ignore $p o s$ .
(c)
let RES = REAL-AN + DE. Parse each element of array $R E S$ as $i d (D_{h}) | | p o s_{D_{h}}$ , where $p o s_{D_{h}}$ is the position of substring χ in document $D_{h}$ (h∈ $[1,$ $l]$ ).

\underline{Decrypt (K_{1}, K_{2}, C_{i})}

:

retrieve set C = $(C_{1},$ $\dots,$ $C_{k})$ from the cloud provider.
$D_{i}$ ← $S K E . D e c_{K_{1}} (C_{i})$ , where $i \in [1; k]$ .
output $((D_{1}, p o s_{D_{1}}),$ $\dots,$ $(D_{k}, p o s_{D_{k}}))$ .

In order to search a substring χ of length m, the cloud user begins with creating a search query Q: for each character

χ_{i}

in χ, he applies PRF F with secret key

K_{Q}

on the concatenation of

χ_{i}

, i, the output of previous query

Q_{i - 1}

and the secret parameter

K_{L}

. The cloud user forms a query Q =

(Q_{1}, \dots, Q_{m})

and sends Q to the cloud provider. For instance, for a substring “aba”, the cloud user creates

Q_{1}

=

F_{K_{Q}} (a | 1 | L_{r o o t} | K_{L})

(

L_{r o o t}

is shared by the data owner with the cloud user),

Q_{2}

=

F_{K_{Q}} (b | 2 | Q_{1} | K_{L})

,

Q_{3}

=

F_{K_{Q}} (a | 3 | Q_{2} | K_{L})

and sends Q =

(Q_{1}, Q_{2}, Q_{3})

to the cloud provider. The cloud server indexes into the encrypted position heap tree

\bar{Λ}

, and for each given

Q_{i}

, it matches encrypted label

\bar{L}

of each node in

\bar{Λ}

to

Q_{i}

and continues until the longest matching node

\bar{ν_{m a t c h}}

in

\bar{Λ}

is found. The cloud server returns the set of ancestor and descendant nodes of

\bar{ν_{m a t c h}}

to the cloud user. Using the example in Figure 4b and search query Q =

(Q_{1}, Q_{2}, Q_{3})

, the cloud provider returns the set of encrypted ancestor nodes (

S K E . E n c_{K_{V}} (1),

S K E . E n c_{K_{V}} (2)

,

S K E . E n c_{K_{V}} (6)

) and the set of encrypted descendant nodes (

S K E . E n c_{K_{V}} (7),

S K E . E n c_{K_{V}} (8)

).
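The token chaining and the provider-side matching can be sketched together. This is a minimal Python sketch, assuming HMAC-SHA256 for F, a length-prefixed field encoding, a root label derived by a hypothetical convention, and an encrypted tree stored as a map from each encrypted label to the labels of its children; the plaintext paths listed are those of the position heap for "abaaababbabaaba", and every name below is illustrative.

```python
import hashlib
import hmac


def prf(key: bytes, *parts: bytes) -> bytes:
    """Stand-in for F: HMAC-SHA256 over length-prefixed fields (assumed encoding)."""
    msg = b"".join(len(p).to_bytes(4, "big") + p for p in parts)
    return hmac.new(key, msg, hashlib.sha256).digest()


def build_encrypted_tree(paths, k_q, k_l, root_label):
    """Data owner side: return {encrypted label: set of child encrypted labels}."""
    label = {"": root_label}
    children = {root_label: set()}
    for path in sorted(paths, key=len):          # parents before children
        if not path:
            continue
        parent = label[path[:-1]]
        enc = prf(k_q, path[-1].encode(), str(len(path)).encode(), parent, k_l)
        label[path] = enc
        children[parent].add(enc)
        children[enc] = set()
    return children


def construct_query(chi, k_q, k_l, root_label):
    """Cloud user side: Q_i = F_KQ(chi_i || i || Q_{i-1} || K_L), with Q_0 = root label."""
    tokens, prev = [], root_label
    for i, ch in enumerate(chi, start=1):
        prev = prf(k_q, ch.encode(), str(i).encode(), prev, k_l)
        tokens.append(prev)
    return tokens


def server_match(children, root_label, tokens):
    """Cloud provider side: follow tokens from the root; stop at the longest match."""
    current, matched = root_label, 0
    for q in tokens:
        if q not in children[current]:
            break
        current, matched = q, matched + 1
    return current, matched


k_q, k_l = b"Q" * 16, b"L" * 16                  # hypothetical keys
root_label = prf(k_q, b"root", k_l)              # hypothetical root convention
paths = ["", "b", "a", "aa", "ba", "ab", "bab", "bb", "abb", "babb",
         "aba", "aab", "aaa", "baa", "abaa"]
tree = build_encrypted_tree(paths, k_q, k_l, root_label)

tokens = construct_query("aba", k_q, k_l, root_label)
node, matched = server_match(tree, root_label, tokens)
print(matched)   # number of query tokens matched along the path
```

The provider only compares opaque byte strings, so it learns which nodes were visited (the path pattern) but not the characters of the query.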

Now, the cloud user applies the SKE scheme with secret key

K_{V}

to decrypt the ancestor and descendant nodes, i.e.,

(1, 2, 6)

ancestor nodes and

(7, 8)

descendant nodes. Next, he uses PRP P with secret key

K_{3}

on each decrypted node, i.e.,

y_{i d x}

=

P_{K_{3}} (i d x)

, where

i d x

is

(1, 2, 6, 7, 8

), and sends the resulting query y to the cloud provider.

The cloud provider uses array Y to fetch the elements at index

y_{i}

(

i \in [1; 5]

) as

Y [y_{i}]

and sends back the results. Once received, the cloud user applies SKE with secret key

K_{Y}

to decrypt the positions in the ancestor and descendant nodes, i.e.,

(1, 3, 6)

positions in ancestor nodes and

(11, 15)

positions in descendant nodes. According to Definition 11, descendant nodes

(11, 15)

are the positions of query “aba” in the text, and ancestor nodes

(1, 3, 6)

require an inspection, since some of them can point at “aba” in the text. Note, since the substring “aba” has a length of three, the substring may exist at positions

(6, 5, 4)

and

(3, 2, 1)

in the text. Therefore, to launch the inspection, the cloud user applies PRP P with secret key

K_{2}

at each position

(6, 5, 4, 3, 2, 1)

as

x_{i d x} = P_{K_{2}} (i d x)

and sends query x to the cloud provider.

Now, the cloud provider uses array X and sends back the elements of the array at index

X [x_{i}]

(i ∈

[1; 6]

). The cloud user uses SKE.Dec with secret key

K_{X}

to decrypt the characters

t_{j}

at positions

(6, 5, 4, 3, 2, 1)

(i.e., received characters are

(a, b, a, a, b, a)

). Using this information, the cloud user verifies if substring characters

χ_{i}

match received characters

t_{j}

at each ancestor position. The inspection of ancestors shows that only

(6, 3)

are matching positions. Thus, the cloud user concludes that the substring query “aba” occurs at positions

(3, 6, 11, 15)

in the text.
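The inspection of ancestor positions in this example can be reproduced with a short Python sketch. It works on the plaintext for clarity: the callback `fetch_char` stands in for fetching $X [P_{K_{2}} (j)]$ from the cloud provider and decrypting it with $K_{X}$, and the function name `verify_ancestors` is hypothetical.

```python
def verify_ancestors(chi, ancestor_positions, fetch_char):
    """Keep the ancestor positions at which chi really occurs.

    Positions are numbered right-to-left, so an occurrence at `pos`
    covers positions pos, pos - 1, ..., pos - m + 1.
    """
    m = len(chi)
    confirmed = []
    for pos in ancestor_positions:
        if pos - m < 0:        # fewer than m characters are left at this position
            continue
        window = "".join(fetch_char(j) for j in range(pos, pos - m, -1))
        if window == chi:
            confirmed.append(pos)
    return confirmed


text = "abaaababbabaaba"


def fetch_char(j):
    """Plaintext stand-in for decrypting X[P_K2(j)]: character at position j."""
    return text[len(text) - j]


real_an = verify_ancestors("aba", [1, 3, 6], fetch_char)
result = sorted(real_an + [11, 15])   # descendant positions need no inspection
print(real_an, result)
```

Position 1 is skipped because fewer than three characters remain there, while positions 3 and 6 are confirmed, matching the outcome described above.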

Note, if multiple documents are involved in the original text construction, ancestor and descendant nodes contain the document identifiers, which can be later used by the cloud user to download the matching encrypted documents and decrypt them locally using PCPA-secure symmetric-key encryption SKE with secret key

K_{1}

.

6. Security and Performance Analysis

6.1. Security

In this section, we focus on the security of the SSP-SSE scheme. First, we show that the SSP-SSE scheme is correct according to Definition 13. Second, we prove that the SSP-SSE scheme is secure against the Chosen-Query Attack (CQA-2) executed by the adaptive adversary according to Definition 17.

Theorem 18.

(Correctness). The Substring Positions Searchable Symmetric Encryption (SSP-SSE) scheme consisting of six polynomial-time algorithms (KeyGen, BuildTree, Encrypt, ConstructQuery, Search, Decrypt) is correct according to Definition 13.

Proof.

The index I in the Search algorithm consists of the encrypted position heap tree

\bar{Λ}

and two arrays X, Y (both encrypted). Since the path labels in

\bar{Λ}

and the search query Q are both encrypted with the same instance of pseudorandom function F with the same secret key

K_{Q}

, the correctness of the SSP-SSE scheme relies on the correctness of the pseudorandom function.

When the cloud provider receives the search query Q in the Search algorithm, it traverses the path labels in the encrypted position heap tree

\bar{Λ}

according to Definition 11. Search query Q is constructed using the pseudorandom function F applied on the substring χ with key

K_{Q}

. Each encrypted path label in

\bar{Λ}

is constructed using the pseudorandom function F with the key

K_{Q}

on the set of characters extracted from the plaintext document collection

D = {D_{1}, \dots, D_{l}}

. The search algorithm outputs true if the document

D_{i}

contains the string of characters χ. Thus, the cloud provider outputs a set of documents that matches the search query Q. ☐

Theorem 19.

(Security). Let SKE be a symmetric PCPA-secure encryption scheme, F be a pseudorandom function and P be a pseudorandom permutation. Substring Position Searchable Symmetric Encryption (SSP-SSE) presented above is

(L_{1}

,

L_{2})

-adaptively secure against chosen-query attacks defined in Definition 17 (CQA-2 security), where

L_{1}

and

L_{2}

are the possible leakages.

In a nutshell, the proof of security of SSP-SSE scheme works as follows. The simulator

S

generates a simulated searchable index

\tilde{I}

that consists of simulated encrypted position heap tree

\tilde{Λ}

, simulated position array

\tilde{Y}

and simulated text array

\tilde{X}

, i.e.,

\tilde{I}

=

(\tilde{Λ},

\tilde{Y},

\tilde{X})

; as well as the simulated set of ciphertexts

\tilde{C}

=

{\tilde{C_{1}}, \dots, \tilde{C_{l}}}

. Both

\tilde{I}

and

\tilde{C}

are constructed using the leakage

L_{1}

that discloses the number of encrypted documents, the size of the encrypted documents and the identifier of each encrypted document. The simulated encrypted position heap tree

\tilde{Λ}

is constructed using the pseudorandom function F and symmetric-key encryption SKE with random values

{0, 1}

. Both simulated

\tilde{Y}

and

\tilde{X}

are constructed using the pseudorandom permutation P and symmetric-key encryption SKE on random values

{0, 1}

. The security of the proposed scheme relies on the following assumptions. The pseudo-randomness of F guarantees that the simulated encrypted position heap tree

\tilde{Λ}

is indistinguishable from the real encrypted position heap tree

\bar{Λ}

. The pseudo-randomness of P will guarantee that simulated

\tilde{Y}

and

\tilde{X}

are indistinguishable from the real Y and X. Moreover, the simulated set of ciphertext

\tilde{C}

is indistinguishable from the real encrypted document collection C.

The search algorithm is simulated in a similar way, which requires keeping track of the dependencies between the search queries and the resulting outputs. However, since the real search query is constructed with pseudorandom function F and pseudorandom permutation P, the adversary is not able to distinguish it from the simulated query. Similarly, the simulated outcome of the search is indistinguishable from the real set of nodes. We outline the formal proof as follows.

Proof.

Polynomial-size simulator

S

can be defined such that for any challenger and any polynomial-time adversary

A

, the outputs of two experiments

I d e a l_{A, S} (k)

and

R e a l_{A} (k)

with security parameter k are computationally indistinguishable according to Definition 17. We now describe the details of experiment

I d e a l_{A, S} (k)

that presents the simulator

S

.

$S (1^{k}, L_{1})$ : The simulator $S$ has a leakage $L_{1}$ , which gives the simulator information about the number and size of documents, as well as identifier of each encrypted document. The simulator $S$ randomly generates a set of simulated ciphertexts $\tilde{C}$ and simulated searchable index $\tilde{I}$ as follows:
–
Simulator $S$ outputs the set of ciphertexts $\tilde{C} = {\tilde{C_{1}}, \dots, \tilde{C_{l}}}$ , where $\tilde{C_{i}} \overset{R}{\leftarrow} {0, 1}^{| D_{i} |}$ .
–
Simulator $S$ sets the simulated encrypted position heap tree $\tilde{Λ}$ , where each node is set as $\tilde{V} (ν [i])$ $\overset{R}{\leftarrow}$ ${0, 1}^{k}$ and each path label of node $ν [i]$ is set as $\tilde{L} (ν [i])$ $\overset{R}{\leftarrow}$ ${0, 1}^{k}$ , where $i \in [1; n]$ . The simulator outputs the encrypted position heap tree $\tilde{Λ}$ .
–
Simulator $S$ then constructs simulated arrays $\tilde{X}$ and $\tilde{Y}$ : $\tilde{X} [i]$ $\overset{R}{\leftarrow}$ ${0, 1}^{k}$ and $\tilde{Y} [i]$ $\overset{R}{\leftarrow}$ ${0, 1}^{k}$ , where $i \in [1; n]$ .
–
Simulator $S$ outputs simulated searchable index $\tilde{I}$ = $(\tilde{Λ},$ $\tilde{Y},$ $\tilde{X})$ and the set of simulated ciphertexts $\tilde{C}$ .
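The setup step of the simulator can be sketched in Python as follows. This is only an illustrative sketch: it draws uniformly random byte strings (rather than bit strings), takes the number of nodes n as an explicit argument, and does not model the shape of the tree; the function name `simulate_setup` and its parameters are hypothetical.

```python
import secrets


def simulate_setup(doc_sizes, n, k_bytes=16):
    """Produce a simulated index and ciphertext set from sizes alone.

    doc_sizes: size in bytes of each encrypted document (part of leakage L1).
    n:         number of tree nodes / array cells (assumed to be known).
    k_bytes:   length of each simulated label, value and array cell.
    """
    # Simulated ciphertexts: random strings of the leaked sizes.
    c_sim = [secrets.token_bytes(size) for size in doc_sizes]
    # Simulated tree: each node gets a random label and a random value.
    tree_sim = [{"label": secrets.token_bytes(k_bytes),
                 "value": secrets.token_bytes(k_bytes)} for _ in range(n)]
    # Simulated arrays: one random cell per index.
    x_sim = [secrets.token_bytes(k_bytes) for _ in range(n)]
    y_sim = [secrets.token_bytes(k_bytes) for _ in range(n)]
    return (tree_sim, y_sim, x_sim), c_sim


index_sim, c_sim = simulate_setup(doc_sizes=[2, 3, 2], n=9)
tree_sim, y_sim, x_sim = index_sim
print(len(c_sim), len(tree_sim), len(x_sim), len(y_sim))
```

No key and no plaintext enters `simulate_setup`, which is the point of the simulation argument: everything the simulator outputs is derived from the leakage alone.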

At this point, the simulator

S

generated the set of simulated encrypted documents

\tilde{C}

and simulated index

\tilde{I}

. Next, the adversary

A

adaptively queries the polynomial-size simulator

S

as follows.

$S (1^{k}, L_{1}, L_{2})$ : The adversary $A$ sends a new query Q to the simulator $S$ . The simulator then starts collecting various dependencies between the incoming search query and the resulting output.
–
With given search query Q, simulator $S$ traverses the simulated encrypted position heap tree $\tilde{Λ}$ starting from the root node, following the simulated path labels to find the set of matching encrypted nodes in $\tilde{Λ}$ . The simulator outputs the set of simulated matching nodes: $\tilde{a n c e s t o r s}$ and $\tilde{d e s c e n d a n t s}$ .
–
With given search requests $({\tilde{y}}_{1}, \dots, {\tilde{y}}_{n u m})$ , the simulator performs a search in simulated array $\tilde{Y}$ and returns matching elements $({\tilde{Y}}_{1}, \dots, {\tilde{Y}}_{n u m})$ .
–
With given search requests $({\tilde{x}}_{1}, \dots, {\tilde{x}}_{h})$ , the simulator performs a search in simulated array $\tilde{X}$ and returns matching elements $({\tilde{X}}_{1}, \dots, {\tilde{X}}_{h})$ .

We now need to show that the outputs of the two experiments

I d e a l_{A, S} (k)

and

R e a l_{A} (k)

are indistinguishable. Since the simulator randomly generates the set of ciphertexts

\tilde{C}

, the output of the simulator is computationally indistinguishable from the real ciphertexts that are generated with the PCPA-secure symmetric encryption scheme SKE using secret key

K_{1}

. Otherwise, this would mean that the adversary could distinguish between the output of the PCPA-secure symmetric encryption scheme SKE and a random value. Next, the simulated encrypted position heap tree

\tilde{Λ}

is computationally indistinguishable from the real encrypted position heap tree. Otherwise, this would mean that the adversary could distinguish between the output of pseudorandom function F with secret key

K_{Q}

and the random values. Similarly, the simulated arrays

\tilde{Y}

and

\tilde{X}

are computationally indistinguishable from the real arrays Y and X. Otherwise, this would mean that the adversary could distinguish between the output of pseudorandom permutation P with keys

K_{2}

,

K_{3}

, SKE scheme with keys

K_{Y}

,

K_{X}

and the random values. Thus, it is concluded that the outputs of the two experiments are indistinguishable. ☐

6.2. Performance

In this section, we outline the performance of the proposed solution. We assume that the encryption and decryption using the SKE scheme take

O (k)

time, where k is the security parameter. We also assume that the element selection from the array takes

O (1)

time.

We first focus on the encryption efficiency of the SSP-SSE scheme. Given plaintext position heap tree Λ with n nodes, we compute encrypted position heap tree Λ using SKE in

O (k n)

time. The arrays X and Y each have n elements and can be computed in

O (k n)

time. Therefore, encryption takes

O (k n)

time, and the total ciphertext is

O (k n)

in size.

We now analyze the efficiency of the proposed search algorithm. The cloud user inputs a substring χ of length m and outputs a search query in

O (m)

time. The cloud provider uses

\bar{Λ}

, performs m matches in the tree and retrieves

o c c

descendant nodes, in

O (m + o c c)

time. The cloud user then computes

y_{1}, \dots, y_{m + o c c}

elements, and the cloud provider retrieves

Y [y_{1}], \dots, Y [y_{m + o c c}]

in

O (m + o c c)

time. The cloud user then computes

x_{1}, \dots, x_{m^{2}}

elements (the cloud user wants to inspect m ancestor positions and the substring χ of m length that may appear at each ancestor position), and the cloud provider retrieves

X [x_{1}], \dots, X [x_{m^{2}}]

in

O (m^{2})

time. Now, the cloud user performs an inspection of m ancestors m times, making execution in

O (m^{2})

time. Thus, both the cloud user and the cloud provider take computation time

O (m^{2} + o c c)

in the query protocol and three rounds of communication to complete the execution of the protocol.
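The individual costs above can be summed in one line. The following is only a restatement of the steps already listed, with the grouping labels added for readability:

```latex
\begin{aligned}
T_{\text{search}}
  &= \underbrace{O(m)}_{\text{query construction}}
   + \underbrace{O(m + occ)}_{\text{tree traversal}}
   + \underbrace{O(m + occ)}_{\text{lookups in } Y}
   + \underbrace{O(m^{2})}_{\text{lookups in } X}
   + \underbrace{O(m^{2})}_{\text{ancestor inspection}} \\
  &= O(m^{2} + occ)
\end{aligned}
```

In the “aba” example of Section 5.3.2, m = 3 and occ = 2: the cloud user requests m + occ = 5 elements of Y and 6 elements of X, which is within the bound of m^{2} = 9.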

We have developed and implemented a proof-of-concept prototype of the SSP-SSE scheme in C++. Our prototype leverages the libtomcrypt cryptographic library [33], which is a portable C cryptographic library that supports symmetric ciphers, one-way hashes, pseudo-random number generators and a plethora of support routines. We use libtomcrypt to build the searchable index I and encrypt the document collection. We utilize AES-CTR encryption for the SKE symmetric-key encryption scheme, HMAC-SHA1 for pseudorandom function F and DES encryption for pseudorandom permutation P.

We show a thorough experimental evaluation of the SSP-SSE scheme on a real-world dataset: the Genome database [34] (published by the National Center for Biotechnology Information, National Institutes of Health) that contains sequence data from the whole genomes of over 1000 species or strains. The database includes all three main domains of life (Bacteria, Archaea and Eukaryota), as well as many viruses, phages, viroids, plasmids and organelles. All experiments have been performed on a six-core Intel Xeon E5645 2.40-GHz processor and 98 GB memory running 64-bit Fedora 23. The cloud server, data owner and cloud user applications were run on the same machine, as the network communication overhead was assumed to be negligible.

For our experiments, we pick large mRNA transcript datasets of various insects. Table 2 shows the details of the experimental set. Figure 5a shows the overhead of constructing the encrypted position heap tree Λ. We compare the time of the construction of the plaintext position heap tree (original algorithm) and the encrypted position heap tree proposed in this work. Figure 5b shows the storage overhead of searchable index I that consists of encrypted position heap tree Λ, position array Y and text array X. In short, we notice that the proposed scheme adds insignificant overhead to the computation time; however, its storage overhead depends on the block cipher size of the underlying encryption schemes. We believe that the proposed solution can be easily deployed in a real-world cloud environment.

7. Multi-User Substring Position Searchable Symmetric Encryption

Our original system model, shown in Figure 1, includes only a single instance of each of its three entities. To make an important step towards widespread adoption of searchable encryption techniques, there is a need to efficiently support hundreds or even thousands of users in the cloud. In this section, we consider a simple extension to our work, where a data owner has a document collection and a group of data users wants to query the encrypted data in the cloud.

Curtmola et al. [6] extend the single-user searchable encryption framework with broadcast encryption [35]: the data owner sends the searchable index and the encrypted document collection to the cloud, and a group of cloud users is allowed to invoke the search over the encrypted cloud data. In this framework, the data owner distributes a single shared secret key among the group of cloud users. However, this solution may not work in a real-world cloud environment that involves a potentially large number of data users, since a single secret key is given to all participants. For instance, if the data owner decides to revoke the search access of one cloud user, he/she has to generate a new key and distribute it to all of the remaining users. It is preferable for each cloud participant to keep its own secret key, which makes key management easier and more efficient.

We propose a new multi-user substring position searchable symmetric encryption (MSSP-SSE) scheme that solves the problem of managing access privileges and searching a substring over encrypted cloud data. Our solution is based on the distributed broadcast encryption scheme [36]. First, we present the definitions of a multi-user substring position searchable symmetric encryption scheme. Later, we give an efficient construction that combines the ideas of a single-user SSP-SSE scheme with the distributed broadcast encryption scheme.

7.1. Preliminaries

In this section, we present several definitions used in our work. We begin with the definition of the Witness Pseudo-Random Function (WPRF). Informally, a witness PRF for an $NP$ language L is a PRF F, such that anyone with a valid witness that $x \in L$ can compute $F(x)$ without the secret key, but for all $x \notin L$, $F(x)$ is computationally hidden without knowledge of the secret key. Formally, a witness PRF is defined as follows.

Definition 20.

(Witness Pseudo-Random Function (WPRF) [36]). A triple of algorithms $(Gen, F, Eval)$ is a witness PRF if:

$Gen$: a probabilistic algorithm that inputs a security parameter λ and a circuit $R : X \times W \to \{0, 1\}$, and outputs a secret function key $fk$ and a public evaluation key $ek$.
$F$: a deterministic algorithm that inputs the function key $fk$ and an input $x \in X$ and outputs some output $y \in Y$ for some set $Y$.
$Eval$: a deterministic algorithm that inputs the evaluation key $ek$, an input $x \in X$ and a witness $w \in W$ and that produces an output $y \in Y$ or ⊥.

A witness PRF is correct if the following holds:

$$Eval(ek, x, w) = \begin{cases} F(fk, x) & \text{if } R(x, w) = 1 \\ \bot & \text{if } R(x, w) = 0 \end{cases} \quad \text{for all } x \in X,\ w \in W. \tag{6}$$

A multiparty key exchange protocol allows a group of g users to simultaneously post a message to a public bulletin board, with each user retaining a user-specific secret. After reading off the contents of the bulletin board, all users establish the same shared secret key. The multiparty key exchange protocol consists of the following algorithms.

Definition 21.

(Non-Interactive Multiparty Key Exchange protocol (NIKE-WPRF) [36]). Let $G : S \to Z$ be a pseudo-random generator with $|S| / |Z| \leq negl$. Let $WPRF = (Gen, F, Eval)$ be a witness PRF. Let $R_{g} : Z^{g} \times (S \times [g]) \to \{0, 1\}$ be a relation that outputs one on input $((z_{1}, \dots, z_{g}), (s, i))$ if and only if $z_{i} = G(s)$. The non-interactive key exchange protocol consists of:

$Publish(λ, g)$: a probabilistic algorithm to output public and secret keys. The algorithm inputs the security parameter λ and the group order g. It computes $(fk, ek) \overset{R}{\leftarrow} Gen(λ, R_{g})$. Next, it picks a random seed $sk \overset{R}{\leftarrow} S$ and computes $z \leftarrow G(sk)$. It outputs a secret key $sk$ and public values $(z, ek)$, where $sk$ is kept secret and $(z, ek)$ are published to the bulletin board.
$KeyGen(\{z_{i}, ek_{i}\}_{i \in [g]}, sk)$: a deterministic algorithm that inputs the public values published by the g group members and the user’s secret $sk$. It outputs a group key $k = Eval(ek_{i}, (z_{1}, \dots, z_{g}), (sk, i))$.
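The Publish and KeyGen steps can be illustrated with a deliberately insecure toy model in Python. It only mirrors the interfaces and the correctness condition: the evaluation key here simply embeds the function key, which a real witness PRF must never do. The helper names, the SHA-256 stand-in for G, the 0-based user index and the choice to evaluate under the first published evaluation key (so that all members compute the same function) are assumptions of this sketch.

```python
import hashlib
import hmac
import secrets


def G(seed: bytes) -> bytes:
    """Stand-in for the PRG G: S -> Z (SHA-256 of a 16-byte seed)."""
    return hashlib.sha256(seed).digest()


def R_g(x, w) -> bool:
    """Relation R_g: accepts ((z_1, ..., z_g), (s, i)) iff z_i = G(s)."""
    zs, (s, i) = x, w
    return 0 <= i < len(zs) and zs[i] == G(s)


def gen(relation):
    """Toy Gen: ek embeds fk, so this offers NO security (interface only)."""
    fk = secrets.token_bytes(32)
    return fk, {"fk": fk, "relation": relation}


def F(fk: bytes, x) -> bytes:
    """F(fk, x) over a tuple of fixed-length byte strings."""
    return hmac.new(fk, b"".join(x), hashlib.sha256).digest()


def eval_(ek, x, w):
    """Eval returns F(fk, x) for a valid witness and None (modeling ⊥) otherwise."""
    return F(ek["fk"], x) if ek["relation"](x, w) else None


def publish():
    """Publish: sample a seed sk, post (z, ek) with z = G(sk)."""
    _, ek = gen(R_g)
    sk = secrets.token_bytes(16)
    return sk, (G(sk), ek)


def key_gen(board, sk, i):
    """KeyGen: derive the group key from the bulletin board and one's own seed."""
    zs = tuple(z for z, _ in board)
    ek = board[0][1]  # assumption: every member evaluates under the first posted ek
    return eval_(ek, zs, (sk, i))


# Three users post to the board; each derives the same key from its own seed.
users = [publish() for _ in range(3)]
board = [public for _, public in users]
group_keys = [key_gen(board, sk, i) for i, (sk, _) in enumerate(users)]
```

A party whose seed does not match any posted value z_i holds no valid witness, so `key_gen` returns `None` for it.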

Broadcast encryption [35] allows an encryptor (data owner) to broadcast a message to a subset of recipients (data users). The system is said to be collusion resistant if users outside the recipient set cannot learn any information about the plaintext, even when they collude. The recent work by Zhandry [36] proposes a distributed broadcast encryption scheme that removes the burden of key management from the encryptor and lets the participating recipients establish the group themselves. Each user is allowed to pick the desired participants and to establish a shared key. The distributed broadcast encryption scheme is defined as follows.

Definition 22.

(Distributed Broadcast Encryption over NIKE (BE-NIKE-WPRF) [36]). The distributed broadcast encryption scheme over multi-party non-interactive key exchange protocol consists of the four following algorithms:

$Setup$: a probabilistic algorithm to set up the BE-NIKE-WPRF scheme. The algorithm outputs a secret parameter λ and group order g.
$Join(λ, g)$: a probabilistic algorithm to join the scheme that is executed by each participant. The algorithm inputs a secret parameter λ and group order g. The algorithm invokes $NIKE\text{-}WPRF.Publish(λ, g)$ to output secret $sk$ and public values $(z, ek)$. The user makes $(z, ek)$ publicly available to other participants.
$Enc(\{z_{i}, ek_{i}\}_{i \in [g]}, sk, m)$: a probabilistic algorithm to encrypt message m under the shared key. The algorithm inputs the set of public values $\{z_{i}, ek_{i}\}_{i \in [g]}$, secret key $sk$ and plaintext message m. The algorithm runs $NIKE\text{-}WPRF.KeyGen(\{z_{i}, ek_{i}\}_{i \in [g]}, sk)$ to derive the shared key k. The algorithm outputs a ciphertext $c_{m}$, which is the encryption of message m using the shared key k.
$Dec(\{z_{i}, ek_{i}\}_{i \in [g]}, sk, c_{m})$: a deterministic algorithm to decrypt $c_{m}$. The algorithm invokes $NIKE\text{-}WPRF.KeyGen(\{z_{i}, ek_{i}\}_{i \in [g]}, sk)$ to derive k. If $k \neq \bot$, then the algorithm decrypts $c_{m}$ using k and outputs the original message m.

7.2. Algorithm Definitions

Definition 23.

(Multi-User Substring Position Searchable Symmetric Encryption (MSSP-SSE)). A tree-based MSSP-SSE scheme over a set of documents D is a tuple of nine polynomial-time algorithms (KeyGen, BuildTree, Encrypt, Join, GroupSetup, Remove, ConstructQuery, Search, Decrypt), as follows:

$(K, λ, g) \leftarrow KeyGen(1^{k})$: a probabilistic key generation algorithm to set up the MSSP-SSE scheme. The algorithm takes a security parameter k and outputs a set of secret keys K, secret parameter λ and group g.
$(Λ) \leftarrow BuildTree(D)$: a deterministic algorithm to build a position heap tree Λ. The algorithm takes a document collection $D = \{D_{1}, \dots, D_{l}\}$ and constructs a position heap tree Λ.
$(I, C) \leftarrow Encrypt(K, Λ, D)$: a probabilistic algorithm to encrypt a position heap tree and document corpus. The algorithm inputs a set of secret keys K, a position heap tree Λ and a document corpus D. The output of the algorithm is a searchable index I and an encrypted collection $C = \{C_{1}, \dots, C_{l}\}$.
$(sk, (z, ek)) \leftarrow Join(λ, g)$: a probabilistic algorithm run by each data user to participate in the scheme. The algorithm invokes $BE\text{-}NIKE\text{-}WPRF.Join$ with an input of secret parameter λ and group order g. It outputs a pair $(sk, (z, ek))$.
$c_{r} \leftarrow GroupSetup(\{z_{i}, ek_{i}\}_{i \in [h]}, sk)$: a probabilistic algorithm run by the group owner to establish the group $h \subseteq g$ of authorized data users. The algorithm runs $BE\text{-}NIKE\text{-}WPRF.Enc$ with an input of public values $\{z_{i}, ek_{i}\}_{i \in [h]}$, the group owner’s secret key $sk$ and a sampled secret r. The output is the ciphertext $c_{r}$.
$c_{r} \leftarrow Remove(\{z_{i}, ek_{i}\}_{i \in [h \setminus o]}, sk)$: a probabilistic algorithm run by the group owner to remove a user o from the set of authorized users. The algorithm invokes $BE\text{-}NIKE\text{-}WPRF.Enc$ with the set of public values $\{z_{i}, ek_{i}\}_{i \in [h \setminus o]}$, the group owner’s secret key $sk$ and a new secret r. The output is the ciphertext $c_{r}$.
$[(Q) \leftarrow ConstructQuery(K, χ, c_{r})] \leftrightarrow [(L) \leftarrow Search(I, Q, c_{r})]$: two deterministic algorithms that are executed interactively between the authorized cloud user and the cloud provider. ConstructQuery inputs a set of secret keys K, ciphertext $c_{r}$ and a substring χ, and it outputs a search query Q. Search uses the query Q, searchable index I and ciphertext $c_{r}$, and it outputs a sequence of identifiers $L \in C$.
$(D_{i}, pos_{D_{i}}) \leftarrow Decrypt(K, C_{i})$: a deterministic algorithm that takes a set of secret keys K and a ciphertext $C_{i}$ as input, and it outputs an original document $D_{i}$, $\forall i \in [1; n]$, and a set of χ’s positions $pos_{D_{i}}$ in $D_{i}$.

We now present the security model for a Multi-user Substring Position Searchable Symmetric Encryption (MSSP-SSE) scheme. Intuitively, our security model requires the security of a single-user SSP-SSE scheme and the security of a distributed broadcast encryption scheme. We formalize the security requirements of MSSP-SSE scheme as follows:

Given the searchable index I and the set of encrypted documents $C = \{C_{1}, \dots, C_{l}\}$, the adversary should learn nothing about the original document collection $D = \{D_{1}, \dots, D_{l}\}$.
Given the set of incoming search queries $Q = \{Q_{1}, \dots, Q_{m}\}$, the access pattern, the search pattern and the path pattern, the adversary should learn nothing about the content of each search query $Q_{i}$ or the content of the resulting documents.
Once a user is removed from the set of authorized cloud users, he/she is no longer allowed to invoke a search over encrypted documents in the cloud. Thus, we require the revocation of the cloud users.

In MSSP-SSE, we use the adaptive semantic security notion of a single-user SSP-SSE scheme. It provides the security against an adaptive adversary: the cloud server does not learn anything about the document collection and search queries beyond the access, search and path patterns. However, with the addition of the access privilege property, we expand our security definitions towards the Remove functionality (Algorithm 4). We define the Rev algorithm as follows:

Definition 24.

(Revocation). Let MSSP-SSE = (KeyGen, BuildTree, Encrypt, Join, GroupSetup, Remove, ConstructQuery, Search, Decrypt) be a group SSP-SSE scheme, k be a security parameter and $A = (A_{1}, A_{2}, A_{3})$ be an adversary. We use the following probabilistic experiment $Rev_{MSSP-SSE, A}(k)$:

Algorithm 4: $Rev_{MSSP-SSE, A}(k)$.

$(st_{A}, D) \leftarrow A_{1}(1^{k})$
$(sk_{A}, (z_{A}, ek_{A})) \leftarrow Join(λ, g)$
$c_{r} \leftarrow GroupSetup((z_{A}, ek_{A}), sk)$
$(Λ) \leftarrow BuildTree(D)$
$(I, C) \leftarrow Encrypt(K, Λ, D)$
$st_{A} \leftarrow A_{2}^{O(I, C, st_{S}, \cdot)}(st_{A}, sk_{A}, (z_{A}, ek_{A}), c_{r})$
$c_{r}' \leftarrow Remove((z_{A}, ek_{A}), sk)$
$Q \leftarrow A_{3}(st_{A})$
$L \leftarrow Search(st_{S}, I, Q, c_{r}')$
if $L \neq \bot$, output one, otherwise output zero

where $O(I, C, st_{S}, \cdot)$ is an oracle that inputs a search query Q and outputs ciphertexts C indexed by $L \leftarrow Search(I, Ω, c_{r}')$ if $L \neq \bot$ and ⊥ otherwise. We say that the Remove algorithm achieves user revocation if for all polynomial-size adversaries $A = (A_{1}, A_{2}, A_{3})$, the following holds:

$$Pr[Rev_{MSSP-SSE, A}(k) = 1] \leq negl(k), \tag{7}$$

where the probability is over the coins of KeyGen, Join, GroupSetup, Remove and Encrypt.

7.3. MSSP-SSE Construction

Algorithm 5 shows the details of our multi-user scheme MSSP-SSE = (KeyGen, BuildTree, Encrypt, Join, GroupSetup, Remove, ConstructQuery, Search, Decrypt). Let SSP-SSE = (KeyGen, BuildTree, Encrypt, ConstructQuery, Search, Decrypt) be a single-user substring position searchable symmetric encryption scheme. Let BE-NIKE-WPRF = (Setup, Join, Enc, Dec) be a distributed broadcast encryption scheme. We require standard security notions for broadcast encryption, i.e., in addition to providing PCPA-security, it provides revocation-scheme security against a group of revoked users. Let ρ be a pseudorandom permutation, such that $ρ : \{0, 1\}^{k} \times \{0, 1\}^{t} \to \{0, 1\}^{t}$ (ρ can be constructed as a pseudorandom permutation over domains of arbitrary size [37]), where t is the size of search query Q in the SSP-SSE scheme. We assume that the cloud server does not collude with revoked users; otherwise, our construction cannot prevent a revoked user from invoking the search.
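As an illustration of a keyed, invertible permutation on byte strings of adjustable length, the following Python sketch builds a balanced Feistel network whose round function is HMAC-SHA256. This is a generic textbook construction chosen for brevity, not the small-domain method of [37] and not the DES instantiation used in the prototype; the function names, the four-round default and the restriction to even-length blocks are assumptions of the sketch.

```python
import hashlib
import hmac


def _round(key: bytes, rnd: int, half: bytes) -> bytes:
    """Round function: HMAC-SHA256 output stretched or truncated to len(half)."""
    out, counter = b"", 0
    while len(out) < len(half):
        out += hmac.new(key, bytes([rnd, counter]) + half, hashlib.sha256).digest()
        counter += 1
    return out[:len(half)]


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def prp(key: bytes, block: bytes, rounds: int = 4) -> bytes:
    """Apply the keyed permutation to an even-length block."""
    if len(block) % 2:
        raise ValueError("block length must be even")
    h = len(block) // 2
    left, right = block[:h], block[h:]
    for rnd in range(rounds):
        # One Feistel round: (L, R) -> (R, L xor f(R))
        left, right = right, _xor(left, _round(key, rnd, right))
    return left + right


def prp_inverse(key: bytes, block: bytes, rounds: int = 4) -> bytes:
    """Undo prp by running the rounds in reverse order."""
    if len(block) % 2:
        raise ValueError("block length must be even")
    h = len(block) // 2
    left, right = block[:h], block[h:]
    for rnd in reversed(range(rounds)):
        # Inverse round: (L', R') -> (R' xor f(L'), L')
        left, right = _xor(right, _round(key, rnd, left)), left
    return left + right
```

Because every round is invertible regardless of the round function, `prp_inverse(key, prp(key, q))` returns `q` for any even-length `q`, while inverting under a different key yields an unrelated string.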

Algorithm 5: MSSP-SSE construction.

$KeyGen(1^{k})$:
generate $K \leftarrow SSP\text{-}SSE.KeyGen(1^{k})$.
generate $(λ, g) \leftarrow BE\text{-}NIKE\text{-}WPRF.Setup(1^{k})$.
Output the key set K, secret parameter λ and group g.

$BuildTree(D)$:
Given a document collection $D = \{D_{1}, \dots, D_{l}\}$, output $Λ \leftarrow SSP\text{-}SSE.BuildTree(D)$.

$Encrypt(K, Λ, D)$:
set $(I, C) \leftarrow SSP\text{-}SSE.Encrypt(K, Λ, D)$.
Output $(I, C)$.

$Join(λ, g)$:
generate $(sk, (z, ek)) \leftarrow BE\text{-}NIKE\text{-}WPRF.Join(λ, g)$.
Keep $sk$ private; output $(z, ek)$ to the cloud server.

$GroupSetup(\{z_{i}, ek_{i}\}_{i \in [h]}, sk)$:
pick $h \subseteq g$ and get public values $\{z_{i}, ek_{i}\}_{i \in [h]}$ from the cloud server.
sample $r \leftarrow \{0, 1\}^{s}$ and compute $c_{r} \leftarrow BE\text{-}NIKE\text{-}WPRF.Enc(\{z_{i}, ek_{i}\}_{i \in [h]}, sk, r)$.
Output $c_{r}$ to the cloud server.

$Remove(\{z_{i}, ek_{i}\}_{i \in [h \setminus o]}, sk)$:
set $(h \setminus o) \subseteq g$, and retrieve public values $\{z_{i}, ek_{i}\}_{i \in [h \setminus o]}$ from the cloud server.
sample new $r \leftarrow \{0, 1\}^{s}$, and compute $c_{r} \leftarrow BE\text{-}NIKE\text{-}WPRF.Enc(\{z_{i}, ek_{i}\}_{i \in [h \setminus o]}, sk, r)$.
Output new $c_{r}$ to the cloud server.

$[(Q) \leftarrow ConstructQuery(K, χ, c_{r})] \leftrightarrow [(L) \leftarrow Search(I, Q, c_{r})]$:
cloud user:
(a) get $c_{r}$ from the cloud server.
(b) compute $r \leftarrow BE\text{-}NIKE\text{-}WPRF.Dec(\{z_{i}, ek_{i}\}_{i \in [h]}, sk, c_{r})$. If r = ⊥, output ⊥.
(c) calculate $Q' \leftarrow SSP\text{-}SSE.ConstructQuery(K, χ)$ and $Q \leftarrow ρ_{r}(Q')$.
cloud provider:
(a) compute $r \leftarrow BE\text{-}NIKE\text{-}WPRF.Dec(\{z_{i}, ek_{i}\}_{i \in [h]}, sk, c_{r})$.
(b) calculate $Q' \leftarrow ρ_{r}^{-1}(Q)$.
(c) get $L \leftarrow SSP\text{-}SSE.Search(I, Q')$, where $L \in C$.
(d) output L.

$Decrypt(K, C_{i})$:
Output $(D_{i}, pos_{D_{i}}) \leftarrow SSP\text{-}SSE.Decrypt(K, C_{i})$.

We now describe the scheme using the following hospital example. Consider a doctor (data owner) who has performed a set of early cancer screening tests on a patient and wishes to share the resulting documents with a group of hospital nurses (data users). To remove the burden of key management, the doctor enables a distributed setup, where each nurse generates his or her own secret key, and a group of authorized participants is established that includes a head nurse and his or her subordinate nurses. First, the doctor samples the security parameter k and generates the set of encryption keys K, the secret parameter λ and the group g for the distributed broadcast encryption. Second, the doctor encrypts the resulting documents with the PCPA-secure symmetric encryption scheme SKE and uploads the encrypted documents together with the searchable index I to the cloud server. Next, each participating nurse invokes the Join algorithm with λ and g (both distributed by the doctor) to generate $(sk, (z, ek))$, where the secret $sk$ is kept private and $(z, ek)$ are published to the cloud server.

Now, the head nurse (group owner) creates a group of authorized users that are allowed to invoke a search over encrypted documents in the cloud. The head nurse launches the GroupSetup algorithm, where she selects the public values $\{z_{i}, ek_{i}\}_{i \in [h]}$ of the authorized participants $h \subseteq g$, samples a random secret r and invokes the distributed broadcast encryption to output $c_{r}$.

In order to search for a substring χ, an authorized nurse first contacts the cloud provider to receive the latest ciphertext $c_{r}$ and runs the decryption algorithm of the distributed broadcast encryption scheme with his or her own secret $sk$ and the public values $\{z_{i}, ek_{i}\}_{i \in [h]}$ to recover the secret r. If r is successfully recovered, the nurse constructs a single-user search query $Q'$, permutes it with the pseudorandom permutation ρ keyed with r and sends $ρ_{r}(Q')$ to the cloud provider. The cloud provider recovers the search query $Q'$ by computing $ρ_{r}^{-1}(ρ_{r}(Q'))$. Here, the key r is known only to the data owner and the set of authorized users, which includes the cloud provider. Next, the interactive ConstructQuery and Search algorithms are executed between the authorized nurse and the cloud server.

If a nurse o is no longer an authorized user in the system, the head nurse samples a new key $r'$ and generates a new ciphertext $c_{r}'$. The new $c_{r}'$ is sent to the cloud provider to replace the old $c_{r}$. Since the revoked nurse o is not able to recover the new secret $r'$, a search query Q that he or she permutes under the old key r will not be mapped back to a valid search query by the cloud provider. This simple extra layer given by the pseudo-random permutation ρ prevents cloud users from performing a successful search once they are removed from the system.
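The effect of replacing r can be traced in a small Python model of the query-wrapping layer. The broadcast encryption step is omitted (the server is handed r directly, standing in for decrypting $c_{r}$), the searchable index is modeled as a plain dictionary keyed by the single-user query token, and ρ is a toy four-round Feistel permutation. The class `CloudServer`, the method names and the sample token are hypothetical and exist only for this sketch.

```python
import hashlib
import hmac
import secrets


def _f(key: bytes, rnd: int, half: bytes) -> bytes:
    return hmac.new(key, bytes([rnd]) + half, hashlib.sha256).digest()[:len(half)]


def rho(key: bytes, block: bytes, inverse: bool = False) -> bytes:
    """Toy keyed permutation on even-length blocks of at most 64 bytes."""
    h = len(block) // 2
    left, right = block[:h], block[h:]
    if not inverse:
        for rnd in range(4):
            left, right = right, bytes(a ^ b for a, b in zip(left, _f(key, rnd, right)))
    else:
        for rnd in reversed(range(4)):
            left, right = bytes(a ^ b for a, b in zip(right, _f(key, rnd, left))), left
    return left + right


class CloudServer:
    """Holds the index and the current group secret r (model only)."""

    def __init__(self, index: dict):
        self.index = index
        self.r = None

    def install_group_secret(self, r: bytes) -> None:
        # Stands in for storing c_r and recovering r via broadcast decryption.
        self.r = r

    def search(self, wrapped_query: bytes):
        inner = rho(self.r, wrapped_query, inverse=True)  # Q' = rho_r^{-1}(Q)
        return self.index.get(inner)                      # None models ⊥


# A 32-byte stand-in for the single-user query Q'.
token = hashlib.sha256(b"illustrative single-user query").digest()
server = CloudServer({token: ["C_1", "C_3"]})

r = secrets.token_bytes(16)
server.install_group_secret(r)
before = server.search(rho(r, token))        # authorized user, current r

r_new = secrets.token_bytes(16)              # group owner runs Remove
server.install_group_secret(r_new)
revoked = server.search(rho(r, token))       # revoked user still holds old r
current = server.search(rho(r_new, token))   # remaining user holds r_new
```

In this model the revoked user's query is unwrapped under the new key into a string that matches no index entry, so the lookup fails, whereas remaining users are unaffected. As in the construction itself, this relies on the server not colluding with the revoked user.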

MSSP-SSE inherits the security and performance of the single-user SSP-SSE scheme. Our construction is very efficient, since the cloud provider needs only to execute a pseudorandom permutation to evaluate the access privileges, thus eliminating the need for more expensive authentication protocols.

8. Conclusions

In this work, we present a new Substring Position Searchable Symmetric Encryption scheme (SSP-SSE) that allows efficient substring search on encrypted documents outsourced to the cloud. Specifically, our solution efficiently finds the occurrences and positions of a substring over encrypted cloud data. We formally define the leakage functions and security notions of SSP-SSE. We show that our scheme is secure against chosen-query attacks executed by an adaptive adversary. We also present a multi-user SSP-SSE scheme that supports a distributed setup, where data users choose their own secret key rather than receive the key from a trusted authority. As future work, we plan to focus on enhancing query privacy in SSP-SSE, while keeping all of the good properties in the current design. Furthermore, we plan to expand the SSP-SSE scheme to support dynamic updates on the document collection that will allow query execution when the document corpus is modified.

Acknowledgments

This work was partially supported by the U.S. National Science Foundation under Grant No. 0905232. We are grateful to anonymous referees for their constructive comments and valuable suggestions that helped to improve this paper.

Author Contributions

M.S. contributed to the design of the proposed scheme, literature survey and manuscript preparation; Z.O. implemented the scheme and performed the experiments; I.R. supervised this research work.

Conflicts of Interest

The authors declare no conflict of interest.

References

Strizhov, M.; Ray, I. Substring Position Search over Encrypted Cloud Data Using Tree-Based Index. In Proceedings of the 2015 IEEE International Conference on Cloud Engineering (IC2E), Tempe, AZ, USA, 9–13 March 2015.
Song, D.X.; Wagner, D.; Perrig, A. Practical Techniques for Searches on Encrypted Data. In Proceedings of the 2000 IEEE Symposium on Security and Privacy, Berkeley, CA, USA, 14–17 May 2000.
Goh, E.J. Secure Indexes. Cryptology ePrint Archive, Report 2003/216. 2003. Available online: http://eprint.iacr.org/2003/216/ (accessed on 10 January 2016).
Moataz, T.; Shikfa, A. Boolean Symmetric Searchable Encryption. In Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, Hangzhou, China, 8–10 May 2013.
Orencik, C.; Kantarcioglu, M.; Savas, E. A Practical and Secure Multi-keyword Search Method over Encrypted Cloud Data. In Proceedings of the 6th IEEE International Conference on Cloud Computing, Santa Clara, CA, USA, 28 June–3 July 2013.
Curtmola, R.; Garay, J.; Kamara, S.; Ostrovsky, R. Searchable Symmetric Encryption: Improved Definitions and Efficient Constructions. In Proceedings of the 13th ACM Conference on Computer and Communications Security, Alexandria, VA, USA, 30 October–3 November 2006.
Boneh, D.; Waters, B. Conjunctive, Subset, and Range Queries on Encrypted Data. In Proceedings of the 4th IACR Theory of Cryptography Conference, Amsterdam, The Netherlands, 21–24 February 2007.
Boneh, D.; Crescenzo, G.D.; Ostrovsky, R.; Persiano, G. Public Key Encryption with Keyword Search. In Proceedings of the EUROCRYPT 2004, Interlaken, Switzerland, 2–6 May 2004.
Lai, J.; Zhou, X.; Deng, R.H.; Li, Y.; Chen, K. Expressive Search on Encrypted Data. In Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, Hangzhou, China, 8–10 May 2013.
Cao, N.; Wang, C.; Li, M.; Ren, K.; Lou, W. Privacy-Preserving Multi-keyword Ranked Search over Encrypted Cloud Data. In Proceedings of the 30th IEEE International Conference on Computer Communications, Shanghai, China, 31 July–2 August 2011.
Cash, D.; Jarecki, S.; Jutla, C.; Krawczyk, H.; Rosu, M.C.; Steiner, M. Highly-Scalable Searchable Symmetric Encryption with Support for Boolean Queries. In Proceedings of the 33rd Annual International Cryptology Conference CRYPTO 2013, Santa Barbara, CA, USA, 18–22 August 2013.
Kamara, S.; Papamanthou, C.; Roeder, T. Dynamic Searchable Symmetric Encryption. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, Raleigh, NC, USA, 16–18 October 2012.
Chang, Y.C.; Mitzenmacher, M. Privacy Preserving Keyword Searches on Remote Encrypted Data. In Proceedings of the 3rd International Conference on Applied Cryptography and Network Security, New York, NY, USA, 7–10 June 2005.
Shi, E.; Bethencourt, J.; Chan, T.H.H.; Song, D.; Perrig, A. Multi-Dimensional Range Query over Encrypted Data. In Proceedings of the 2007 IEEE Symposium on Security and Privacy, Berkeley, CA, USA, 20–23 May 2007.
Agrawal, R.; Kiernan, J.; Srikant, R.; Xu, Y. Order-Preserving Encryption for Numeric Data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, 13–18 June 2004.
Blanton, M. Achieving Full Security in Privacy-Preserving Data Mining. In Proceedings of the 3rd IEEE International Conference on Privacy, Security, Risk and Trust, Boston, MA, USA, 9–11 October 2011.
Li, J.; Wang, Q.; Wang, C.; Cao, N.; Ren, K.; Lou, W. Fuzzy Keyword Search over Encrypted Data in Cloud Computing. In Proceedings of the 29th Conference on Information Communications, London, UK, 7–9 September 2010.
Wang, C.; Ren, K.; Yu, S.; Urs, K. Achieving Usable and Privacy-assured Similarity Search over Outsourced Cloud Data. In Proceedings of the 31st Conference on Information Communications, Hertfordshire, UK, 29–31 October 2012.
Boldyreva, A.; Chenette, N. Efficient Fuzzy Search on Encrypted Data. In Proceedings of the 21st International Workshop on Fast Software Encryption, London, UK, 3–5 March 2014.
Strizhov, M.; Ray, I. Multi-keyword Similarity Search over Encrypted Cloud Data. In Proceedings of the ICT Systems Security and Privacy Protection, Marrakech, Morocco, 2–4 June 2014.
Ehrenfeucht, A.; McConnell, R.M.; Osheim, N.; Woo, S.W. Position Heaps: A Simple and Dynamic Text Indexing Data Structure. J. Discret. Algorithms 2011, 9, 100–121.
Bloom, B.H. Space/Time Trade-offs in Hash Coding with Allowable Errors. Commun. ACM 1970, 13, 422–426.
Wang, C.; Cao, N.; Li, J.; Ren, K.; Lou, W. Secure Ranked Keyword Search over Encrypted Cloud Data. In Proceedings of the 2010 IEEE 30th International Conference on Distributed Computing Systems, Genoa, Italy, 21–25 June 2010.
Moataz, T.; Justus, B.; Ray, I.; Cuppens-Boulahia, N.; Cuppens, F.; Ray, I. Privacy-Preserving Multiple Keyword Search on Outsourced Data in the Clouds. In Proceedings of the Data and Applications Security and Privacy XXVIII, Vienna, Austria, 14–16 July 2014.
Crescenzo, G.D.; Saraswat, V. Public Key Encryption with Searchable Keywords Based on Jacobi Symbols. In Proceedings of the 8th International Conference on Cryptology in India, Chennai, India, 9–13 December 2007.
Golle, P.; Staddon, J.; Waters, B. Secure Conjunctive Keyword Search over Encrypted Data. In Proceedings of the Applied Cryptography and Network Security 2004, Yellow Mountain, China, 8–11 June 2004.
Hwang, Y.H.; Lee, P.J. Public Key Encryption with Conjunctive Keyword Search and Its Extension to a Multi-user System. In Proceedings of the First International Conference on Pairing-Based Cryptography, Tokyo, Japan, 2–4 July 2007.
Weiner, P. Linear Pattern Matching Algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory (Swat 1973), Washington, DC, USA, 15–17 October 1973; pp. 1–11.
Manber, U.; Myers, G. Suffix Arrays: A New Method for On-Line String Searches. SIAM J. Comput. 1993, 22, 935–948.
Gusfield, D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology; Cambridge University Press: New York, NY, USA, 1997.
Ukkonen, E. On-line Construction of Suffix Trees. Algorithmica 1995, 14, 249–260.
Gentry, C.; Goldman, K.; Halevi, S.; Jutla, C.; Raykova, M.; Wichs, D. Optimizing ORAM and Using It Efficiently for Secure Computation. In Proceedings of the 13th Privacy Enhancing Technologies Symposium, Bloomington, IN, USA, 10–12 July 2013.
LibTomCrypt. Cryptographic Toolkit. 2016. Available online: https://github.com/libtom/libtomcrypt (accessed on 10 May 2016).
NCBI. Genome Database. 2016. Available online: http://www.ncbi.nlm.nih.gov/genome (accessed on 10 May 2016).
Fiat, A.; Naor, M. Broadcast Encryption. In Proceedings of the 13th Annual International Cryptology Conference CRYPTO ’93, Santa Barbara, CA, USA, 22–26 August 1993.
Zhandry, M. How to Avoid Obfuscation Using Witness PRFs. In Proceedings of the 13th International Conference on Theory of Cryptography TCC 2016, Tel Aviv, Israel, 10–13 January 2016.
Morris, B.; Rogaway, P.; Stegers, T. How to Encipher Messages on a Small Domain. In Proceedings of the CRYPTO 2009, Santa Barbara, CA, USA, 16–20 August 2009.

Figure 1. Cloud data hosting architecture.

Figure 2. An example of the data structure constructed from the text “coconut”. (a) A suffix tree; (b) a suffix array; (c) a position heap tree.

Figure 3. An example of the position heap tree. (a) Constructed from the text “ab$aaa$bb” extracted from documents $(D_{1}, D_{2}, D_{3})$; (b) constructed from the text “abaaababbabaaba”.

Figure 4. Construction of a searchable index. (a) An example of position array Y; (b) an example of the path label encryption of position heap tree.

Figure 5. Experimental results. (a) The construction of the position heap tree; (b) the searchable index storage.

**Table 1.** Comparison of plaintext substring search data structures. n is the length of the text t, m is the length of the substring χ, $occ$ is the number of occurrences of χ in t.
Data Structure	Construction	Search	Cloud Storage
Suffix Tree	$O(n)$	$O(m + occ)$	$O(n^{2})$
Suffix Array	$O(n)$	$O(m + \log(n))$	$O(n^{2})$^{1}
Position Heap Tree	$O(n)$	$O(m^{2} + occ)$	$O(n)$

^{1} Note that the suffix array data structure stores only the array of integers (no need to store the suffixes of the text), and the array can be searched with a binary search algorithm in $\log(n)$ time, i.e., each time we access an element of the suffix array, we execute a lexicographical comparison between the suffix at that element's position and the given substring query. This can be executed locally (by the data owner); however, in our system model defined in Section 3.1, the data owner sends the data and the constructed searchable index to the malicious cloud provider. Both the data and the searchable index are encrypted, so no plaintext (and no lexicographical order) is leaked to the cloud provider. If we were to encrypt the suffix array by encrypting each of its elements, the cloud provider would not be able to execute the search in $\log(n)$ time (in fact, it would observe only the ciphertext at each element of the array, which gives no ordering to guide the binary search). To keep the $\log(n)$ binary search time, one solution is to store encrypted suffixes in each node of a binary search tree and to use an expensive homomorphic encryption (e.g., the work by Gentry et al. [32]) that allows the search on the encrypted binary search tree. However, this would take $O(n^{2})$ worst-case storage for all suffixes in the tree.
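The dependence of the suffix array on plaintext comparisons can be seen in a short plaintext-only Python sketch. It builds the array naively by sorting suffix start positions and then runs two binary searches whose every step compares the query against a piece of the text, which is exactly the operation an encrypted array does not permit. The function names are illustrative, and the naive construction is chosen for brevity rather than efficiency.

```python
def build_suffix_array(text: str) -> list:
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list, pattern: str) -> list:
    """Return all start positions of pattern using two binary searches."""
    m = len(pattern)

    def prefix(k: int) -> str:
        # First m characters of the k-th smallest suffix (plaintext access).
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


text = "coconut"
suffix_array = build_suffix_array(text)
positions = find_occurrences(text, suffix_array, "co")
```

Each call to `prefix` reads plaintext characters and compares them lexicographically; if the array elements or the text were encrypted with a semantically secure scheme, the server could not evaluate these comparisons.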

**Table 2.** Experimental database.
Organism Name	mRNA Size (MB)	Organism Name	mRNA Size (MB)
Dufourea novaeangliae	28	Papilio polytes	41
Bactrocera dorsalis	49	Fopius arisanus	60
Halyomorpha halys	63	Tribolium castaneum	63
Stomoxys calcitrans	70	Orussus abietinus	72
Nasonia vitripennis	75	Linepithema humile	77

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Strizhov, M.; Osman, Z.; Ray, I. Substring Position Search over Encrypted Cloud Data Supporting Efficient Multi-User Setup. Future Internet 2016, 8, 28. https://doi.org/10.3390/fi8030028
