Article

Privacy-Preserving Hierarchical Top-k Nearest Keyword Search on Graphs

School of Cyberspace Security, Hainan University, 58 Renmin Avenue, Haikou 570228, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(4), 736; https://doi.org/10.3390/electronics14040736
Submission received: 15 January 2025 / Revised: 8 February 2025 / Accepted: 12 February 2025 / Published: 13 February 2025
(This article belongs to the Section Computer Science & Engineering)

Abstract

Graph search techniques are increasingly vital for applications involving labeled or textual content on network vertices. A key task is the top-k nearest keyword (kNK) search on undirected graphs, where a query vertex and a set of keywords are used to identify the k closest vertices containing those keywords. With cloud storage widely used for outsourcing graph services, ensuring data privacy and security is critical. Existing solutions employ encrypted indexes for privacy-preserving keyword searches but lack fine-grained access control, limiting their ability to accommodate diverse user needs. To address this, we propose privacy-preserving hierarchical top-k nearest keyword search on graphs (PH-kNK), a novel scheme that enhances privacy-preserving top-k keyword searches by integrating hierarchical access control. PH-kNK introduces hierarchical query entry indexes that regulate access at multiple security levels, significantly improving privacy, security and adaptability. The granular query entry indexes established by our approach enable users with higher security levels to query the graph structure and access the corresponding vertices while maintaining transparency for lower-level users. The scheme leverages pseudo-random mapping, order-preserving encryption and re-encryption of search indexes to ensure robust data security. Experimental results on real-world datasets demonstrate the scheme’s high efficiency and validate its security.

1. Introduction

1.1. Background

Keyword search techniques on graphs have gained significant popularity in recent years for various real-life applications in networks that contain labeled or textual content on their vertices. In these scenarios, a top-k nearest keyword (kNK) search [1,2,3] is one of the most commonly used search types where a query request consists of a query vertex and a set of query keywords. The objective is to identify k vertices in the graph that contain the query keywords and are closest to the query vertex.
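As a point of reference, the plaintext (unencrypted) kNK problem described above can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the graph is unweighted and stored as an adjacency dictionary, a vertex matches when it carries every query keyword, and the query vertex itself may be returned. The names `knk_search`, `adj` and `keywords_of` are hypothetical and do not come from the schemes cited here.

```python
from collections import deque

def knk_search(adj, keywords_of, query_vertex, query_keywords, k):
    """Return up to k (vertex, distance) pairs nearest to query_vertex
    whose keyword sets contain every query keyword (unweighted BFS)."""
    wanted = set(query_keywords)
    seen = {query_vertex}
    queue = deque([(query_vertex, 0)])
    results = []
    while queue and len(results) < k:
        v, dist = queue.popleft()
        # BFS visits vertices in non-decreasing distance, so matches arrive nearest-first.
        if wanted <= keywords_of.get(v, set()):
            results.append((v, dist))
        for u in adj.get(v, ()):
            if u not in seen:
                seen.add(u)
                queue.append((u, dist + 1))
    return results

# Toy graph: a square a-b-d-c-a with keywords on each vertex.
adj = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
keywords_of = {"a": {"cafe"}, "b": {"bank"}, "c": {"cafe", "bank"}, "d": {"bank"}}
top2 = knk_search(adj, keywords_of, "a", ["bank"], k=2)
```

A plain traversal like this touches the whole neighborhood of the query vertex, which is one reason index-based approaches are typically preferred on large graphs.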
Cloud storage has become an increasingly popular solution for data owners seeking to outsource graph services, driven by the need for cost efficiency, enhanced scheme responsiveness and performance optimization. The simultaneous evolution of cloud services and big data represents a pivotal trend in the current digital economy. However, this rapid growth comes with significant risks. For instance, in 2017, a misconfiguration in Amazon AWS S3 buckets led to the exposure of sensitive information belonging to millions of users, affecting numerous prominent companies and organizations by leaking personal identity and financial data. Similarly, in 2019, security experts discovered a vulnerability within a service on Aliyun that could allow attackers to access user RAM permissions and subsequently obtain all authorized privileges. During the same year, researchers identified that some user API keys on Tencent Cloud were stored unprotected in publicly accessible code repositories, which could have been exploited by attackers to access sensitive user data and cloud resources. Tencent Cloud responded promptly by enhancing its security measures.
Recently, advancements in cloud services have included the continuous maturation of technologies such as artificial intelligence, blockchain and big data computing services. These developments underscore the growing universality, versatility and standardization of cloud-based solutions. Throughout the service lifecycle, both the data owned by service providers and the personal data of users are highly sensitive and must be protected diligently. Ensuring data privacy is a critical concern. To secure outsourced data, encrypted indexes are frequently employed. Specifically, to maintain the confidentiality of graph data, original keyword searches can only be conducted on encrypted versions.
As data owners increasingly rely on cloud storage for outsourcing graph services [4], the need for efficient algorithms becomes even more critical. One of the most commonly used searches on graph structures is the shortest path search, which is pivotal in applications such as transportation networks, where it calculates the shortest route between two points for navigation purposes. The shortest path problem has been extensively studied across various contexts. Numerous algorithms have been proposed to efficiently identify the shortest path between two vertices in a graph, such as transforming the graph into a directed graph and constructing a variant of a DT tree to discover time-limited shortest paths [5]. Other approaches address searchable encryption [6,7,8,9]. Still others are based on the shortest path community expansion algorithm, which leverages the dense connections within communities and the sparse connections between them. This method uses the shortest path technique for expansion, developing index data structures and algorithms for path queries on general graphs, thereby optimizing performance for both serial and parallel graphs and solving multi-graph problems in polynomial time [10]. Additionally, researchers have explored the use of graph mining schemes to effectively represent specific unstructured data as graph data [11,12].
In practical network applications that involve vertex labels or textual content, keyword search technology within graph structures has gained increasing significance. A crucial operation in this context is the “top-k Nearest Keyword” (kNK) search on undirected graphs [13] where query requests typically involve a query vertex and a set of query keywords. The primary goal is to identify vertices in the graph that not only contain the query keywords but are also closest to the query vertex. The kNK search problem has garnered significant attention from the database and data mining communities. Lei Zou introduced the Top-k Correlated Subgraph Search (top-cgSearch) problem [14], developed the pattern growth algorithm (PG-Search) and conducted searches using graph indexing methods, significantly improving the efficiency of each independent local search involved. Similar problems were also studied [15].
Looking towards the future, considering factors like storage costs, service responsiveness and performance optimization, the trend is expected to favor the widespread adoption of cloud storage for outsourcing graph services. Throughout the service lifecycle, data belonging to both data owners and users is highly sensitive and requires meticulous protection. Data privacy remains a paramount concern. Encrypted indexes play a significant role in ensuring the security of outsourced data. To preserve the confidentiality of graph data, original keyword searches are confined to encrypted versions.
Existing solutions, such as the PPkNK scheme [16], excel at performing privacy-protected searches for the top-k nearest keywords in a graph. These methods involve constructing two-level indexes, utilizing encryption techniques and intersecting privacy-protected sets to ensure both privacy and operational efficiency during searches. Furthermore, a hierarchical searchable encryption scheme based on blockchain indexing, known as HSE-BI [17], has been introduced. This innovative solution provides robust fine-grained access control and maintains the reliability of search results through the consistency and immutability of blockchain-based indexing. However, these schemes are not specifically designed for graph structure searches. Existing kNK schemes fall short of addressing the needs of users who require access to information from vertices with varying secrecy levels for the same keyword.
To address this, our scheme provides fine-grained access control for kNK searches over encrypted graphs. Each vertex is assigned a security level, so that users with higher security levels can query the graph structure and access the corresponding vertices, whereas these vertices remain hidden during the query process for users with lower security levels. This approach caters to scenarios where users require access to data with varying sensitivities, enabling flexible and secure querying.
Imagine a collaboration between multiple research institutes across different countries working on a global health initiative. These institutes need to share and analyze large datasets, such as patient records, clinical trial results and genetic data, to develop treatments for a particular disease. However, due to varying privacy regulations and the sensitivity of the data, not all researchers can access the same level of information. Researchers in different countries need access to this data to collaborate effectively, but they must comply with their respective national regulations. This means that access to certain data must be restricted based on the user’s location, role and security clearance. Each piece of data, represented as a vertex in the graph, is assigned a security level based on its sensitivity. For example, basic research data might be accessible to all researchers, while patient records are only accessible to researchers with the highest security clearance. Our scheme allows international researchers to collaborate effectively by providing them with the information they need while complying with local data privacy regulations. High-level researchers can access detailed data for their analysis, while others only see the information they are permitted to view. The privacy-preserving methods ensure that sensitive data remains secure throughout the process.

1.2. Our Contribution

  • We propose a hierarchical privacy-preserving top-k nearest keyword search scheme (PH-kNK) on graphs, which achieves both hierarchical access control and privacy preservation.
  • The scheme provides fine-grained access control: a user with a given security level can only access vertices whose security level is at or below that level.
  • We analyze the proposed scheme and conduct experiments on real-world datasets. The experimental results show that the proposed scheme is more efficient than existing solutions.

2. Related Works

The concept of searchable encryption technology was first introduced by Song et al. in 2000 [18]. Since then, research has extensively expanded, focusing on improving search efficiency and security across diverse applications.
In tackling the challenges of secure, efficient querying over complex, structured data, Chase and Kamara made a groundbreaking leap with their concept of structured encryption [19]. Moving beyond the realm of symmetric searchable encryption (SSE), they formulated structured encryption to handle data as intricate as web graphs or social networks, pushing the boundaries of private data access. Their work brought forth the Chosen-Query Attack (CQA) security model, which reshaped security standards for structured data, along with a suite of encryption schemes: for example, matrix-structured encryption to enable rapid lookups, keyword-searchable labeled data encryption and multiple graph-based encryption schemes.
Recent literature on searchable encryption over graph-structured data has explored various aspects of security, accuracy and efficiency. As shown in Table 1, MVSSE [20] can achieve hierarchical privacy protection on non-structured data. k-NK [13], PPkNK [16], Aton [21] and other frameworks have made strides in handling large-scale knowledge graphs, but fail to fully address privacy-preserving searches in hierarchical graph structures. Our work uniquely integrates these dimensions, providing a comprehensive solution for top-k keyword searches in hierarchical privacy-preserving graph structures.
As shown in Figure 1, research on searchable encryption covers data security, searchability, user authentication and access control. Several schemes address keyword search on graphs [22,23]. Our scheme takes all of these aspects into account: it addresses the limitation of existing kNK schemes by offering fine-grained access control, improves query efficiency with a novel hierarchical index structure and enhances data security through advanced encryption techniques.
Another research focus is keyword search within encrypted graph structures. Graph encryption schemes, such as those proposed by Teng et al., support privacy-preserving top-k nearest keyword searches on encrypted graphs, using secure indexing and encryption techniques to balance security and efficiency [16,24]. Moreover, distance-based keyword search methods have gained attention, such as the keyLabel-Indexing algorithm, which efficiently handles shortest path queries on encrypted graph data [25]. More recently, semantic search engines leveraging graph data have been introduced to improve the accuracy of keyword searches. Cozza and Sellami proposed frameworks integrating semantic knowledge from linked web data, enhancing search engines’ ability to process complex queries [26,27]. Figure 2 illustrates the development of graph structure encryption and hierarchical encryption. Our solution combines techniques from two areas: hierarchical searchable encryption and keyword search on encrypted graphs.

3. Preliminaries

3.1. Pruned 2-Hop Labeling

The pruned landmark labeling technique [28], which builds upon the 2-hop cover framework, is a method designed for querying shortest path distances efficiently in large-scale networks. It involves creating distance labels (also known as distance-aware 2-hop covers) for each vertex, allowing for rapid computation of shortest paths between vertices. This is done by applying breadth-first search (BFS) to precompute distance labels and utilizing pruning during the search.
In this method, a label $L(v)$ is generated for each vertex $v$. The label comprises pairs of the form $(u, \delta_{uv})$, where $u$ denotes a vertex and $\delta_{uv}$ is the shortest distance between vertices $u$ and $v$. The collection of these labels, $\{L(v)\}_{v \in V}$, forms the index used for queries. To answer a distance query between two vertices $s$ and $t$, the query function is defined as follows:
$$\mathrm{Query}(s, t, L) = \min \{ \delta_{vs} + \delta_{vt} \mid (v, \delta_{vs}) \in L(s),\ (v, \delta_{vt}) \in L(t) \}.$$
If $L(s)$ and $L(t)$ do not share any common vertex, the query result is $\infty$, signifying the absence of a path between $s$ and $t$ based on the labels. A label set $L$ is termed a 2-hop cover if, for any vertices $s$ and $t$, $\mathrm{Query}(s, t, L)$ equals $d_G(s, t)$, the shortest path distance in the graph $G$.
For optimization, each vertex’s label $L(v)$ is pre-sorted by vertex indices, enabling an efficient merge-join algorithm for querying. This reduces the computational complexity of each query to $O(|L(s)| + |L(t)|)$, ensuring fast retrieval of shortest path distances.
The method guarantees the construction of a valid 2-hop cover, ensuring that for any vertex pair $(s, t)$, $\mathrm{Query}(s, t, L)$ returns the exact shortest path distance $d_G(s, t)$.
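The labeling and query steps described in this subsection can be illustrated with a small sketch for unweighted graphs. It is a minimal sketch under stated assumptions, not the implementation from [28]: the vertex processing order (here, roughly by degree) is an assumption, and the names `build_pll_index` and `query` are hypothetical. Labels store the rank of a hub vertex rather than its identifier, so they stay sorted and the merge-join applies directly.

```python
from collections import deque

def query(label_s, label_t):
    """Merge-join over two rank-sorted labels; returns inf when no hub is shared."""
    best = float("inf")
    i = j = 0
    while i < len(label_s) and j < len(label_t):
        hub_s, dist_s = label_s[i]
        hub_t, dist_t = label_t[j]
        if hub_s == hub_t:
            best = min(best, dist_s + dist_t)
            i += 1
            j += 1
        elif hub_s < hub_t:
            i += 1
        else:
            j += 1
    return best

def build_pll_index(adj, order):
    """Run a pruned BFS from each vertex in `order`; labels[v] holds (hub_rank, dist) pairs."""
    rank = {v: i for i, v in enumerate(order)}
    labels = {v: [] for v in adj}
    for root in order:
        dist = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            d = dist[u]
            # Prune: earlier hubs already certify a distance <= d between root and u.
            if query(labels[root], labels[u]) <= d:
                continue
            labels[u].append((rank[root], d))
            for w in adj[u]:
                if w not in dist:
                    dist[w] = d + 1
                    queue.append(w)
    return labels

# Path graph 0-1-2-3 plus an isolated vertex 4.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}
labels = build_pll_index(adj, order=[1, 2, 0, 3, 4])
d03 = query(labels[0], labels[3])
d04 = query(labels[0], labels[4])
```

Because roots are processed in rank order, every label is appended in increasing hub rank, which is what lets `query` run in time linear in the two label sizes.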

3.2. Order-Preserving Encryption

The Order-Preserving Encryption (OPE) scheme, as outlined by Boldyreva et al. [29], retains the relative order of plaintext values in their corresponding ciphertexts, resembling the behavior of pseudorandom functions (PRFs). OPE takes advantage of the mathematical properties of order-preserving functions and relies on the hypergeometric distribution for efficient sampling in encryption processes.
In formal terms, consider two subsets $A$ and $B$ of the natural numbers $\mathbb{N}$, with $|A| \le |B|$. A function $f : A \to B$ is called order-preserving (or strictly increasing) if, for any $i, j \in A$, $f(i) > f(j)$ if and only if $i > j$.
An encryption scheme $\mathcal{SE} = (\mathcal{K}, \mathrm{Enc}, \mathrm{Dec})$ is order-preserving when, for every secret key $K$ generated by the key generation algorithm $\mathcal{K}$, the encryption function $\mathrm{Enc}(K, \cdot)$ is an order-preserving function from the plaintext space $\mathcal{D}$ to the ciphertext space $\mathcal{R}$, where both spaces consist of natural numbers encoded as strings. Typically, the plaintext space is denoted as $[M]$ and the ciphertext space as $[N]$, where $N \ge M$.
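The order-preserving property can be made concrete with a toy table-based mapping from $[M]$ to $[N]$. This is only an illustration of the property under stated assumptions: it stores the whole mapping as a sorted random subset of the ciphertext space, it uses Python's non-cryptographic `random` module seeded with the key, and it does not implement the hypergeometric sampling used in [29]. All function names are hypothetical.

```python
import random

def toy_ope_table(key, M, N):
    """Map plaintexts 0..M-1 to a keyed, strictly increasing sequence in 0..N-1."""
    if N < M:
        raise ValueError("ciphertext space must be at least as large as the plaintext space")
    rng = random.Random(key)  # keyed and deterministic, but NOT cryptographically secure
    return sorted(rng.sample(range(N), M))

def toy_ope_enc(table, d):
    """Encrypt plaintext d by table lookup."""
    return table[d]

def toy_ope_dec(table, c):
    """Invert the mapping; raises ValueError if c is not a valid ciphertext."""
    return table.index(c)

table = toy_ope_table(key=2025, M=16, N=1024)
c3, c7 = toy_ope_enc(table, 3), toy_ope_enc(table, 7)
```

Since the table is a sorted list of distinct values, comparing two ciphertexts always gives the same answer as comparing their plaintexts, which is exactly what allows the server to order encrypted values without decrypting them.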

3.3. Proxy Re-Encryption

Proxy Re-encryption (PRE) is a cryptographic mechanism designed to securely transform ciphertext from one key to another. Introduced by Blaze et al. at Eurocrypt in 1998, PRE allows a ciphertext encrypted under the delegator’s public key to be transformed so that a delegatee can decrypt it using their own private key, without revealing the plaintext to the proxy performing the transformation. This transformation is achieved using a re-encryption key, which the delegator provides to the proxy. The key enables the semi-trusted proxy to convert the original ciphertext for the delegatee without accessing the underlying data. Umbral Encryption [30] employs a Key Encapsulation Mechanism (KEM) to delegate decryption capabilities from Alice to Bob. For example, if data is encrypted using Alice’s public key, $c_A = \mathrm{encrypt}(pk_A, m)$, Bob can be authorized to decrypt the data by having the proxy re-encrypt the ciphertext with a transformation key $rk_{A \to B}$, that is, $c_B = \mathrm{reencrypt}(rk_{A \to B}, c_A)$. Bob can then decrypt the re-encrypted ciphertext $c_B$ using his private key.
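The encrypt, re-encrypt and decrypt flow can be illustrated with a toy ElGamal-style construction in the spirit of the original scheme by Blaze et al. This sketch is not Umbral and is not secure: the group parameters are tiny, and, unlike Umbral (where the re-encryption key is derived from Alice's secret key and Bob's public key), the re-encryption key here is computed from both secret keys. All names and parameters below are illustrative assumptions.

```python
import secrets

P = 2039  # toy safe prime, P = 2 * Q + 1 (far too small for real use)
Q = 1019  # prime order of the subgroup generated by G
G = 4     # a square modulo P, so it generates the order-Q subgroup

def keygen():
    sk = secrets.randbelow(Q - 1) + 1         # secret exponent in [1, Q - 1]
    return sk, pow(G, sk, P)                  # (sk, pk = G^sk)

def encrypt(pk, m):
    r = secrets.randbelow(Q - 1) + 1
    return (m * pow(G, r, P)) % P, pow(pk, r, P)   # (m * G^r, G^(sk * r))

def rekey(sk_a, sk_b):
    return (sk_b * pow(sk_a, -1, Q)) % Q      # rk_{A->B} = b / a mod Q

def reencrypt(rk, ct):
    c1, c2 = ct
    return c1, pow(c2, rk, P)                 # G^(a * r) becomes G^(b * r)

def decrypt(sk, ct):
    c1, c2 = ct
    g_r = pow(c2, pow(sk, -1, Q), P)          # recover G^r
    return (c1 * pow(g_r, -1, P)) % P

sk_a, pk_a = keygen()
sk_b, pk_b = keygen()
m = 1234                                      # message encoded as an integer in [1, P - 1]
c_a = encrypt(pk_a, m)                        # c_A = encrypt(pk_A, m)
c_b = reencrypt(rekey(sk_a, sk_b), c_a)       # c_B = reencrypt(rk_{A->B}, c_A)
```

The proxy only ever raises the second ciphertext component to the power `rk`, so it transforms the ciphertext without learning `m` or either secret key on its own.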

4. Model and Definition

This section outlines the architecture of the proposed scheme and its associated security model. It details the structure of the graph and describes the framework for the graph encryption scheme.
Our scheme consists of three main entities: the data owner, the cloud server and the users. The data owner holds the graph data, generates the indexes, encrypts both the graph and its indexes and is responsible for key management. The encrypted index is sent to the cloud server, which processes query tokens from users and delivers search results. Users, on the other hand, must request access to specific query levels and the corresponding keys from the data owner. Once they have the search results from the cloud server, they decrypt the data to retrieve the final results.
The graph in our scheme is a labeled structure designed to support PH-kNK searches. This feature allows users to query the graph for vertices that meet specified permission conditions, contain the queried keywords and are within the k-nearest to the queried vertex.

4.1. System Construction

This section discusses the search framework of the hierarchical labeled graph. A formal definition of the hierarchical labeled graph is presented below.

4.1.1. Definition of Hierarchical Labeled Graph

Definition 1 (hierarchical labeled graph).
A graph $G = (V, E)$ comprises a set of vertices and a set of edges. Each vertex is represented as a tuple $(v_i, w_i, l_i)$, where $v_i \in V$, $w_i \in W$ and $l_i \in L$. Here, $V$ refers to the set of vertex identifiers, $W$ refers to the set of keyword identifiers and $L$ refers to the set of vertex security levels. The set $E$ consists of all the edges in the graph; each edge is denoted by $e = (v_i, v_j)$, where $v_i$ and $v_j$ are two vertices in $V$.
The graph security framework assigns an independent security level to each vertex through the triple structure $(\mathit{vertex}, \mathit{word}, \mathit{level})$ formally defined in Definition 1.
The mechanism of user access to vertices is based on Mandatory Access Control (MAC), where both users and vertices are assigned levels of access. The user security level, as well as the level of each vertex, is determined by the data owner, who also provides the necessary authorization. In summary, access control follows hierarchical constraints: users can only retrieve keywords from vertices whose security levels $l_i$ satisfy $l_i \le l_u$, where $l_u$ denotes the user security level.
Our scheme does not prescribe how the data owner assigns security levels. In real-world application scenarios, the data owner can follow the relevant rules and regulations. For example, in healthcare scenarios, the data owner (a healthcare institution or research organization) can assign different security levels to different kinds of records, such as patient records, clinical trial data, genetic metadata and public research summaries. The data owner also assigns access levels to users, and only those with the appropriate security level can access the high-level data.
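The access rule, under which a user at level $l_u$ may only see vertices whose level does not exceed $l_u$, can be sketched on plaintext data as follows. The `Vertex` class, the field names and the numeric levels are hypothetical and chosen only for illustration; in the scheme itself this filtering is enforced through the encrypted indexes rather than by an explicit comparison.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vertex:
    vid: str      # vertex identifier v_i
    keyword: str  # keyword w_i
    level: int    # security level l_i assigned by the data owner

def visible_vertices(vertices, keyword, user_level):
    """Vertices carrying `keyword` whose level does not exceed the user's level."""
    return [v for v in vertices if v.keyword == keyword and v.level <= user_level]

records = [
    Vertex("v1", "genome", 3),  # e.g. patient-level genetic data
    Vertex("v2", "genome", 1),  # e.g. a public research summary
    Vertex("v3", "trial", 2),
]
low = visible_vertices(records, "genome", user_level=1)
high = visible_vertices(records, "genome", user_level=3)
```

A level-1 user querying "genome" sees only `v2`, while a level-3 user sees both `v1` and `v2`.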

4.1.2. Graph Encryption Scheme

Below is the framework for the proposed PH-kNK scheme, consisting of seven algorithms: Π = {KeyGen, BuildIndex, EncryptIndex, KeyAssign, QueryToken, kNKSearch, Decrypt}. The architecture of the scheme is shown in Figure 3.

4.2. Graph Encryption Scheme

The proposed scheme comprises seven core algorithms, which are outlined below:
$K \leftarrow \mathrm{KeyGen}(\lambda)$: This algorithm takes a security parameter $\lambda$ as input and generates a set of secret keys $K$.
$I \leftarrow \mathrm{BuildIndex}(G)$: Given a graph $G$ and its edges $E$, this algorithm constructs an index set $I$.
$I_{enc} \leftarrow \mathrm{EncryptIndex}(K, I)$: This algorithm encrypts the index $I$ produced by BuildIndex using the key set $K$ and outputs the encrypted search index $I_{enc}$.
$K_U \leftarrow \mathrm{KeyAssign}(K)$: This algorithm takes the data owner’s key set $K$ as input and produces a unique user key $K_U$.
$T \leftarrow \mathrm{QueryToken}(K_U, S)$: This algorithm generates a query token $T$ from the user’s secret key $K_U$ and the query information $S$.
$R_{enc} \leftarrow \mathrm{kNKSearch}(T, I_{enc})$: This algorithm takes the query token $T$ and the encrypted index $I_{enc}$ as input and returns an encrypted query result $R_{enc}$.
$R \leftarrow \mathrm{Decrypt}(K_U, R_{enc})$: This algorithm decrypts the encrypted query result $R_{enc}$ using the user’s key $K_U$, producing the final result $R$.

4.3. Security Model

In this paper, the server is considered semi-honest, or “honest-but-curious”, which means that it executes the algorithms of the scheme faithfully while trying to learn any information about the data owner’s graph. The security goal of the scheme is to minimize the information leakage to the server. Furthermore, we define two leakage functions, $\mathcal{L}_1$ and $\mathcal{L}_2$. $\mathcal{L}_1$ represents the information that the server can learn from the encrypted search index $I_{enc}$, and $\mathcal{L}_2$ represents the information that the server can learn from the query token $T$ and the corresponding query result $R_{enc}$.
The formal security definition is described as follows.
Definition 2 (Security Model).
Let Π = (KeyGen, KeyAssign, BuildIndex, EncryptIndex, QueryToken, kNKSearch, Decrypt) be a graph encryption scheme supporting hierarchical top-k keyword search, $\mathcal{A}$ be a semi-honest adversary, $\mathcal{S}$ be a simulator and $\mathcal{L}_1$ and $\mathcal{L}_2$ be two leakage functions. Consider the following experiments:
$\mathbf{Real}_{\mathcal{A}}(\lambda)$:
In the setup phase, $\mathcal{A}$ first chooses a graph $G$ and sends it to the challenger. The challenger then runs KeyGen($\lambda$) to generate a set of keys $K$ and runs KeyAssign($K$) to produce a user key $K_U$. After that, the challenger runs BuildIndex($G$) and EncryptIndex($K$, $I$) to obtain the encrypted search index $I_{enc}$. The challenger sends $I_{enc}$ to $\mathcal{A}$.
In the query phase, $\mathcal{A}$ adaptively chooses a polynomial number of hierarchical top-k keyword search queries. For each query $S$, the challenger runs QueryToken($K_U$, $S$) and sends the corresponding query token $T$ to $\mathcal{A}$. Then $\mathcal{A}$ runs kNKSearch($T$, $I_{enc}$) and obtains the encrypted query result $R_{enc}$. Finally, $\mathcal{A}$ outputs a bit $b$ as the output of the experiment.
$\mathbf{Ideal}_{\mathcal{A},\mathcal{S}}(\lambda)$:
In the setup phase, $\mathcal{A}$ first chooses a graph $G$ and sends it to the challenger. Then $\mathcal{S}$ simulates an encrypted search index $I_{enc}$ using the leakage function $\mathcal{L}_1(G)$ and sends $I_{enc}$ to $\mathcal{A}$.
In the query phase, $\mathcal{A}$ adaptively chooses a polynomial number of hierarchical top-k keyword search queries. For each query $S$, $\mathcal{S}$ simulates a query token $T$ using the leakage function $\mathcal{L}_2(G, S)$ and sends $T$ to $\mathcal{A}$. Then $\mathcal{A}$ runs kNKSearch($T$, $I_{enc}$) and obtains the encrypted query result $R_{enc}$. Finally, $\mathcal{A}$ outputs a bit $b$ as the output of the experiment.
We say that the proposed scheme Π is $(\mathcal{L}_1, \mathcal{L}_2)$-secure against the adaptive chosen-query attack if, for every probabilistic polynomial-time (PPT) adversary $\mathcal{A}$, there exists a simulator $\mathcal{S}$ such that
$$\left| \Pr[\mathbf{Real}_{\mathcal{A}}(\lambda) = 1] - \Pr[\mathbf{Ideal}_{\mathcal{A},\mathcal{S}}(\lambda) = 1] \right| \le \mathrm{negl}(\lambda),$$
where negl(·) is a negligible function.

5. Algorithm Construction

This section outlines the seven algorithms implemented within the PH-kNK scheme. Table 2 outlines the various notation used in this paper.

5.1. Building Blocks

5.1.1. Pruned Landmark Labeling (PLL)

The Pruned Landmark Labeling (PLL) technique consists of two algorithms: PLL_indexgen($\mathit{graph}$) and PLL_search($v_{start}$, $v_{end}$, $\mathit{index}$). The PLL_indexgen algorithm is responsible for generating a search index for the graph, which enhances query performance. In contrast, the PLL_search algorithm uses the index to calculate the shortest distance between two specific vertices, $v_{start}$ and $v_{end}$.
  • PLL_indexgen($\mathit{graph}$) → search index: Generates an efficient search index for the graph.
  • PLL_search($v_{start}$, $v_{end}$, $\mathit{index}$) → dis: Calculates the shortest distance between two vertices in the graph.

5.1.2. Order-Preserving Encryption (OPE)

The Order-Preserving Encryption (OPE) algorithm guarantees that for two plaintext values $d_1$ and $d_2$, if $d_1 > d_2$, their corresponding encrypted values $c_1$ and $c_2$ maintain the same order, $c_1 > c_2$. This property allows for secure range queries on encrypted data.
  • OPE_Enc(K, d) → c: Encrypts the plaintext value d using the secret key K.
  • OPE_Dec(K, c) → d: Decrypts the encrypted value c using the secret key K to retrieve the plaintext.

5.1.3. Proxy Re-Encryption (PRE)

The Proxy Re-Encryption (PRE) scheme facilitates encrypted data sharing among different users. It allows an intermediary (the proxy) to re-encrypt data from one user to another without accessing the plaintext. The key functions of this scheme are:
  • PRE_KeyGen() → keypair (pk, sk): Generates a pair of public and private keys.
  • PRE_ReKeyGen(skA, pkB, N, t) → N fragments of the re-encryption key: Generates fragments of the re-encryption key, allowing re-encryption from Alice to Bob.
  • PRE_Encapsulate(pkA) → K, capsule: Encapsulates a symmetric key and generates a capsule using Alice’s public key.
  • PRE_Decapsulate(skA, capsule) → K or ⊥: Decrypts the capsule using Alice’s private key to retrieve the symmetric key.
  • PRE_ReEncapsulate(kFrag, capsule) → cFrag: Re-encapsulates a fragment of the capsule using the re-encryption key fragment.
  • PRE_DecapsulateFrags(skB, $\mathit{cFrag}_i$, capsule) → K or ⊥: Uses Bob’s private key to decrypt the capsule fragments and retrieve the symmetric key.
  • PRE_Encrypt(K, M) → C: Encrypts a message M using the symmetric key K.
  • PRE_Decrypt(K, C) → M or ⊥: Decrypts the ciphertext using the symmetric key and retrieves the original message.

5.2. KeyGen Algorithm

The data owner runs the KeyGen algorithm (Algorithm 1), which produces two randomly generated keys for HMAC and OPE, respectively, with the key length determined by the security parameter $\lambda$. Additionally, it generates the data owner’s proxy re-encryption key set $\mathit{OwnerKey} = \{\mathit{secretkey}_O, \mathit{publickey}_O, \mathit{signingkey}_O, \mathit{verifyingkey}_O, \mathit{signer}_O\}$ using the PRE_KeyGen algorithm.
Algorithm 1 KeyGen
Input: A security parameter $\lambda$
Output: A set of secret keys $K$
  1: $K_1 \xleftarrow{\$} \{0, 1\}^{\lambda}$
  2: $K_2 \xleftarrow{\$} \{0, 1\}^{\lambda}$
  3: $\mathit{OwnerKey} \leftarrow$ PRE_KeyGen()
  4: Return $K \leftarrow \{K_1, K_2, \mathit{OwnerKey}\}$
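The steps of Algorithm 1 can be sketched with the Python standard library. This is a minimal sketch under stated assumptions: $\lambda$ is given in bits and is a multiple of 8, and PRE_KeyGen is passed in as a callable because the concrete PRE library is not fixed here. The placeholder `fake_pre_keygen` is hypothetical and exists only so that the sketch runs on its own.

```python
import secrets

def keygen(lam, pre_keygen):
    """Return the key set {K1, K2, OwnerKey}; `lam` is the security parameter in bits."""
    if lam % 8 != 0:
        raise ValueError("this sketch expects lam to be a multiple of 8")
    k1 = secrets.token_bytes(lam // 8)  # K1: key for HMAC
    k2 = secrets.token_bytes(lam // 8)  # K2: key for OPE
    owner_key = pre_keygen()            # stand-in for PRE_KeyGen()
    return {"K1": k1, "K2": k2, "OwnerKey": owner_key}

def fake_pre_keygen():
    """Hypothetical placeholder; a real deployment would call a PRE library here."""
    return {"secret_key": secrets.token_bytes(32), "public_key": None}

K = keygen(128, fake_pre_keygen)
```

Using `secrets` rather than `random` matters here, since the keys must come from a cryptographically secure source.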

5.3. Buildindex Algorithm

The BuildIndex algorithm (Algorithm 2) is run by the data owner. First, the data owner constructs $\mathit{wordindex}$ and $\mathit{entryindex}$ by grouping the vertices $(v, w, l)$ of $G$ into lists $\mathit{word}_i$ according to their keywords $w_i$. The elements of each $\mathit{word}_i$ are then arranged in descending order of vertex security level $l$. Because $\mathit{word}_i$ is sorted in descending order, a query for keyword $w_i$ only needs to find the first element whose level is less than or equal to the queried security level $l$. The position of that element in the list $\mathit{word}_i$, recorded in $\mathit{entryindex}$, serves as the entry point for querying $w_i$ at level $l$. The basic structure of the word index is keyword: (vertex, level), and the basic structure of the entry index is keyword: level: entry. The $\mathit{queryindex}$ is constructed with a 2-hop labeling search scheme on the hierarchical labeled graph, with the purpose of enabling minimum-distance queries between two vertices of the graph.
The process of establishing the indexes $\mathit{wordindex}$, $\mathit{entryindex}$ and $\mathit{queryindex}$ is as follows:
  • Initialization of Index $\mathit{wordindex}$: $\mathit{wordindex}$ is initialized by traversing the set $W$. Each keyword $w$ in $W$ serves as a key, while the associated value is an array of tuples $(v, l)$, where $v$ is a vertex containing the keyword $w$ and $l$ is its level. Thus, the format of $\mathit{wordindex}$ entries is $([v, l])_w$.
  • Initialization of Index $\mathit{entryindex}$: $\mathit{entryindex}$ is initialized for each keyword by creating an array with levels from 0 up to $l$, associating each level $l$ with the entry index $v$ that corresponds to that level. The $\mathit{entryindex}$ is structured as $([l] : v)_w$. To secure this structure, $\mathit{entryindex}$ applies HMAC to a constant $c$ and the entry $v$, which restricts backward decryption and prevents direct access to subsequent array elements.
  • Initialization of Index $\mathit{queryindex}$: $\mathit{queryindex}$ is constructed with the pruned 2-hop labeling algorithm to compute the shortest distance between two vertices in the queried graph. In $\mathit{queryindex}$, each vertex $v_i$ in the graph is a key, with values represented as pairs $(v_j, d)$, where $v_j$ is a reachable vertex and $d$ denotes the shortest distance from $v_i$ to $v_j$. This structure allows efficient querying of shortest path distances in the graph.
Algorithm 2 BuildIndex
Input: A graph $G$
Output: A set of indexes $I$
   1: Parse $G$ as $(V, E)$
   2: Parse $V$ as $(v, w, l)$
   3: Initialize a dictionary $\mathit{wordindex}$
   4: Initialize a dictionary $\mathit{entryindex}$
   5: Initialize a dictionary $\mathit{queryindex}$
   6: $\mathit{wordindex} = \{\mathit{word}_i \mid \mathit{word}_i \in W\}$
   7: for $(v_i, w_i, l_i) \in V$ do
   8:       for $w_j \in W$ do
   9:            if $w_i = w_j$ then
  10:                 $\mathit{word}_i \leftarrow (v_i, w_i, l_i)$
  11:           $\mathit{word}_i[l_i] \leftarrow \mathit{word}_i[l_i].\mathit{desc}$ (sort $\mathit{word}_i$ by level in descending order)
  12:          for $l_k \in L$ do
  13:                while $\mathit{word}_i[l_i] \le l_k$ do
  14:                      $\mathit{entryindex}[\mathit{word}_i][l_i] \leftarrow i$
  15: Generate PLL index
  16: $\mathit{queryindex} \leftarrow$ PLL_indexgen($G$)
  17: Return $I \leftarrow \{\mathit{wordindex}, \mathit{entryindex}, \mathit{queryindex}\}$
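The construction of the word index and the entry index can be sketched on plaintext data as follows. This is an illustrative reading of the description above rather than the authors' code: the function name, the tuple layout `(v, w, l)` and the convention that an entry equal to the list length means "no visible vertex" are all assumptions. The PLL-based query index is omitted.

```python
from collections import defaultdict

def build_word_and_entry_index(vertices, levels):
    """vertices: iterable of (v, w, l) triples; levels: all security levels in use."""
    wordindex = defaultdict(list)
    for v, w, l in vertices:
        wordindex[w].append((v, l))            # keyword -> list of (vertex, level)
    entryindex = {}
    for w, items in wordindex.items():
        items.sort(key=lambda pair: pair[1], reverse=True)  # descending security level
        entryindex[w] = {}
        for lk in levels:
            # Position of the first element whose level is <= lk;
            # len(items) is used when no element is visible at this level.
            pos = next((i for i, (_, l) in enumerate(items) if l <= lk), len(items))
            entryindex[w][lk] = pos
    return dict(wordindex), entryindex

vertices = [("v1", "genome", 3), ("v2", "genome", 1), ("v3", "genome", 2), ("v4", "trial", 2)]
wordindex, entryindex = build_word_and_entry_index(vertices, levels=[1, 2, 3])

# A level-2 user querying "genome" starts reading at the recorded entry point.
visible = wordindex["genome"][entryindex["genome"][2]:]
```

Because each list is sorted by descending level, everything from the entry point to the end of the list is visible to the user, and everything before it is not.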

5.4. Encryptindex Algorithm

The main function of this Algorithm 3 is to encrypt the search index, input the encryption key group K and the index set I to be encrypted and output an encrypted index set I E n c , run at the data owner. For  w o r d i n d e x , each keyword is encrypted using K 1 and HAMC and each vertex in the w o r d i list is encrypted with two layers. Firstly, a layer of K 2 is used to encrypt OPE and then a data owner public key is used to proxy re-encrypt, outputting ciphertext and capsule where capsule are used to authorize users to decrypt. For  e n t r y i n d e x , encrypt the keywords using the same HAMC and secret key and OPE for each level. For  q u e r y i n d e x , it use two layers of encryption including symmetric encryption and proxy re-encryption for each vertex. it does not contain keywords, while vertices use a single-layer K 1 sequence preserving encryption OPE.
The algorithm EncryptIndex takes as input a set of indexes $I$ and a set of keys $K$ and outputs an encrypted index set $I_{enc}$. First, the key set $K$ is parsed into $K_1$, $K_2$ and $\mathit{OwnerKey}$. Similarly, the index set $I$ is parsed into three parts: wordindex, entryindex and queryindex.
For each word $\mathit{word}_i$ in wordindex, the algorithm processes its elements as follows: each element $w_i$ is encrypted using the HMAC algorithm with $K_1$, resulting in $w_i^{(enc)}$. Each element $v_i$ is encrypted using the SE algorithm with $K_1$, yielding $v_i^{(se)}$. A capsule is then generated using the PRE_Encapsulate function and $v_i$ is further encrypted using PRE_Encrypt with the capsule, producing $v_i^{(enc)}$. For each element $l_i$, the OPE_Enc algorithm with $K_2$ is applied to generate $l_i^{(ope)}$. Another capsule is generated and $l_i$ is encrypted using PRE_Encrypt with the capsule, resulting in $l_i^{(enc)}$. The encrypted results for each item in wordindex are stored in $\mathit{wordindex}_{enc}$.
Algorithm 3 EncryptIndex
Input: A set of indexes $I$ and a set of keys $K$
Output: An encrypted index set $I_{enc}$
   1: Parse $K$ as $(K_1, K_2, \mathit{OwnerKey})$
   2: Parse $I$ as $(\mathit{wordindex}, \mathit{entryindex}, \mathit{queryindex})$
   3: for $\mathit{word}_i \in \mathit{wordindex}$ do
   4:     for $w_i \in \mathit{word}_i$ do
   5:         $w_i^{(enc)} \leftarrow \mathrm{HMAC}(K_1, w_i)$
   6:     for $v_i \in \mathit{word}_i$ do
   7:         $v_i^{(se)} \leftarrow \mathrm{SE\_Encrypt}(K_1, v_i)$
   8:         $\mathit{capsule} \leftarrow \mathrm{PRE\_Encapsulate}(pk_A)$
   9:         $v_i^{(enc)} \leftarrow \mathrm{PRE\_Encrypt}(\mathit{capsule}, v_i)$
  10:     for $l_i \in \mathit{word}_i$ do
  11:         $l_i^{(ope)} \leftarrow \mathrm{OPE\_Enc}(K_2, l_i)$
  12:         $\mathit{capsule} \leftarrow \mathrm{PRE\_Encapsulate}(pk_A)$
  13:         $l_i^{(enc)} \leftarrow \mathrm{PRE\_Encrypt}(\mathit{capsule}, l_i)$
  14:     Let $\mathit{wordindex}_{enc}$ be the encrypted index of each item in $\mathit{wordindex}$
  15: for $\mathit{word}_i \in \mathit{entryindex}$ do
  16:     $w_i^{(enc)} \leftarrow \mathrm{HMAC}(K_1, w_i)$
  17:     for $l_i \in \mathit{word}_i$ do
  18:         $l_i^{(enc)} \leftarrow \mathrm{OPE\_Enc}(K_2, l_i)$
  19: Set $c = 0$
  20: for $v_i \in \mathit{queryindex}$ do
  21:     $v_i^{(se)} \leftarrow \mathrm{SE\_Encrypt}(K_1, v_i)$
  22:     for $(u, d) \in \mathit{queryindex}[v_i]$ do
  23:         $u = \mathrm{HMAC}(v_i^{(se)}, K_2)$
  24:         $T_v^c = g(\mathrm{HMAC}(K_1, v), c)$
  25:         $\mathit{queryindex}[v_i] = (u \,\|\, d \,\|\, \mathrm{HMAC}(K_1, u) \oplus g(T_v^c, c)) \oplus T_v^c$
  26:         Set $c = c + 1$
  27:         $\mathit{capsule} \leftarrow \mathrm{PRE\_Encapsulate}(pk_A)$
  28:         $v_i^{(enc)} \leftarrow \mathrm{PRE\_Encrypt}(\mathit{capsule}, v_i)$
  29:         $d_{iu}^{(enc)} \leftarrow \mathrm{OPE}(K_2, d_{iu})$
  30: Return $\mathit{capsule}$, $I_{enc} \leftarrow (\mathit{wordindex}_{enc}, \mathit{entryindex}_{enc}, \mathit{queryindex}_{enc})$
For each $\mathit{word}_i$ in entryindex, the algorithm encrypts each element $w_i$ using the HMAC algorithm with $K_1$, producing $w_i^{(enc)}$. Each element $l_i$ is encrypted using the OPE_Enc algorithm with $K_2$, yielding $l_i^{(enc)}$.
For the queryindex, a counter $c$ is initialized to 0. For each element $v_i$ in queryindex, the algorithm encrypts $v_i$ using the SE algorithm with $K_1$, resulting in $v_i^{(se)}$. For each tuple $(u, d)$ in $\mathit{queryindex}[v_i]$, $u$ is computed using the HMAC algorithm applied to $v_i^{(se)}$ and $K_2$. The value $T_v^c$ is calculated as $g(\mathrm{HMAC}(K_1, v), c)$ and $\mathit{queryindex}[v_i]$ is updated to $(u \,\|\, d \,\|\, \mathrm{HMAC}(K_1, u) \oplus g(T_v^c, c)) \oplus T_v^c$. The counter $c$ is incremented by 1. A capsule is generated using PRE_Encapsulate and $v_i$ is encrypted using PRE_Encrypt with the capsule, producing $v_i^{(enc)}$. The value $d$ is encrypted using the OPE algorithm with $K_2$, yielding $d_{iu}^{(enc)}$.
Finally, the algorithm returns the generated capsule and the encrypted index set $I_{enc}$, which consists of $\mathit{wordindex}_{enc}$, $\mathit{entryindex}_{enc}$ and $\mathit{queryindex}_{enc}$. The data owner then transmits the encrypted index to the server.
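The masking of a queryindex entry can be illustrated with a small round-trip sketch. It assumes that the combining operators in the update of queryindex are XOR (as the name of the intermediate value in the search algorithm suggests), that HMAC-SHA256 instantiates both the HMAC and the function $g$, and that payloads fit in 32 bytes. These choices and all helper names are hypothetical, and the snippet shows only the algebra, not a secure construction.

```python
import hashlib
import hmac


def prf(key: bytes, msg: bytes) -> bytes:
    """HMAC-SHA256 used as the pseudorandom function."""
    return hmac.new(key, msg, hashlib.sha256).digest()


def g(token: bytes, c: int) -> bytes:
    """Hypothetical instantiation of g: a PRF keyed by the token, applied to the counter."""
    return prf(token, c.to_bytes(4, "big"))


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def mask_entry(k1: bytes, v: bytes, c: int, payload: bytes):
    """Return (T, ciphertext) for one entry; payload must be at most 32 bytes."""
    if len(payload) > 32:
        raise ValueError("payload too long for this sketch")
    t = g(prf(k1, v), c)                    # T = g(HMAC(K1, v), c)
    padded = payload.ljust(32, b"\0")       # zero padding (payload must not end in NUL)
    return t, xor(xor(padded, g(t, c)), t)  # (payload XOR g(T, c)) XOR T


def unmask_entry(t: bytes, c: int, cipher: bytes) -> bytes:
    """Invert mask_entry: (cipher XOR T) XOR g(T, c)."""
    return xor(xor(cipher, t), g(t, c)).rstrip(b"\0")


token, cipher = mask_entry(b"K1-demo-key", b"v3", 0, b"v1|1")
recovered = unmask_entry(token, 0, cipher)
```

Because XOR is its own inverse, applying the same two masks in either order recovers the padded payload.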

5.5. KeyAssign Algorithm

Before a new user issues an encrypted query, the steps in Algorithm 4 must be carried out. The user runs the key generation algorithm to produce a public and private key pair and requests permission from the data owner. The data owner then generates the proxy re-encryption key fragments (KeyFrags) for the user and sends them together with the OPE key ($K_2$), which the user needs to generate query tokens. The server runs the same steps as the user, except that it does not receive $K_2$.
Algorithm 4 KeyAssign
Input: A data owner secret key and a user public key $K$
Output: A user key set $K_U$
  1: $(\mathit{UserSecretKey}, \mathit{UserPublicKey}) \xleftarrow{\$} \mathrm{PRE\_KeyGen}$
  2: $\mathit{KeyFrags} \xleftarrow{\$} \mathrm{PRE\_ReKeyGen}(\mathit{OwnerSecretKey}, \mathit{UserPublicKey})$
  3: Return $K_U \leftarrow (\mathit{UserSecretKey}, \mathit{UserPublicKey}, \mathit{KeyFrags}, K_2)$

5.6. Query Token Algorithm

When a user initiates a query request, the QueryToken algorithm (Algorithm 5) is run. It takes as input a user key set $K_U$ and a query message $S$ and outputs a query token $T$. The user key set $K_U$ is parsed into the public key PublicKey and the secret key SecretKey. The query message $S$ is parsed into keyword, level and vertex. The data owner then assigns keys through the KeyAssign function, based on the keyword, level and vertex: it generates the key fragments KeyFrags and an encrypted keyword $\mathit{word}_{enc}$, together with the other necessary keys, including $K$ and OwnerKey.
Algorithm 5 QueryToken
Input: A user key set $K_U$ and a query message $S$
Output: A query token $T$
   1: Parse $K_U$ as $(\mathit{PublicKey}, \mathit{SecretKey})$
   2: Parse $S$ as $(\mathit{keyword}, \mathit{level}, \mathit{vertex})$
   3: $\mathit{KeyFrags}, \mathit{word}_{enc} \leftarrow (K_1, K_2, \mathit{OwnerKey})$
   4: $(\mathit{KeyFrags}, \mathit{OwnerPublickey}) \leftarrow \mathrm{KeyAssign}(\mathit{keyword}, \mathit{level}, \mathit{vertex})$
   5: $\mathit{level}_{enc} \leftarrow \mathrm{OPE\_Enc}(\mathit{level})$
   6: $\mathit{vertex}_{enc} \leftarrow \mathrm{SE}(\mathit{vertex})$
   7: Return $T \leftarrow (\mathit{keyword}_{enc}, \mathit{level}_{enc}, \mathit{vertex}_{enc}, k)$
Then, the level $\mathit{level}$ is encrypted using the OPE_Enc function to produce $\mathit{level}_{enc}$. The vertex $\mathit{vertex}$ is encrypted using the SE function to produce $\mathit{vertex}_{enc}$.
The user finally gets the query token $T$, which consists of the encrypted keyword $\mathit{keyword}_{enc}$, the encrypted level $\mathit{level}_{enc}$, the encrypted vertex $\mathit{vertex}_{enc}$ and the parameter $k$.
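The shape of such a token can be sketched with the Python standard library alone. The sketch below is not the paper's implementation: the order-preserving map `toy_ope` is a deliberately simple, insecure stand-in for a real OPE scheme, the vertex pseudonym is an HMAC stand-in for symmetric encryption (so it cannot be decrypted), and all names are hypothetical.

```python
import hashlib
import hmac


def prf(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()


def toy_ope(k2: bytes, level: int) -> int:
    """Toy strictly increasing keyed map; NOT a secure OPE scheme."""
    noise = int.from_bytes(prf(k2, str(level).encode())[:2], "big") % 1000
    return level * 1000 + noise  # noise < 1000, so the order of levels is kept


def query_token(k1: bytes, k2: bytes, keyword: str, level: int, vertex: str, k: int):
    """Build (keyword_enc, level_enc, vertex_enc, k)."""
    keyword_enc = prf(k1, keyword.encode()).hex()
    level_enc = toy_ope(k2, level)
    # Deterministic stand-in for SE(vertex); a real scheme would be decryptable.
    vertex_enc = prf(k1, b"vertex:" + vertex.encode()).hex()
    return keyword_enc, level_enc, vertex_enc, k


token = query_token(b"K1-demo", b"K2-demo", "w1", 3, "v3", 2)
```

Because the keyword and vertex components are deterministic, the server can match them against the encrypted indexes, which is also why repeated queries are visible in the search pattern.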

5.7. kNKSearch Algorithm

When the server receives a query request, it first searches the encrypted entryindex. It looks for an entry level value that is not less than the level in the query and then checks whether the next value in the entry list is not greater than that level. If not, it continues down the list to the next entry. If both conditions are met then, by the order-preserving property of the encryption, the current entry level corresponds to the queried level, and the server passes this entry to the next search stage. Starting from the corresponding entry in $\mathit{word}_i$ within $W$, it extracts all vertices from the list and includes them in the result candidate set.
Taking the input vertex as src and each vertex in the result candidate set as des, the server calculates the distance between src and des and adds the des vertex, along with the calculated distance, to the result candidate set. The candidate set is then sorted in ascending order of distance, and the first $k$ results are returned as the final result, that is, the top-$k$ nearest keyword result for the input vertex within the security level limit.
The kNKSearch algorithm (Algorithm 6) takes as input a query token $T$ and a set of encrypted indexes $I_{enc}$ and outputs a set of encrypted search results $R_{enc}$. The query token $T$ is parsed into four parts: $\mathit{keyword}_{enc}$, $\mathit{level}_{enc}$, $\mathit{vertex}_{enc}$ and $k$.
Algorithm 6 kNKSearch
Input: A query token $T$ and a set of encrypted indexes $I_{enc}$
Output: A set of encrypted search results $R_{enc}$
   1: Parse $T$ as $(\mathit{keyword}_{enc}, \mathit{level}_{enc}, \mathit{vertex}_{enc}, k)$
   2: for $\mathit{word}_i \in \mathit{wordindex}$ do
   3:     if $\mathit{word}_i = \mathit{keyword}_{enc}$ then
   4:         for $v_{enc} \in \mathit{word}_i$ do
   5:             $v_{reenc} \leftarrow \mathrm{ReEncapsulation}(\mathit{capsule}, \mathit{keyfrag})$
   6:             $v_{ce} \leftarrow \mathrm{Decrypt}(\mathit{secretkey}_u, \mathit{publickey}_o, \mathit{capsule}, \mathit{keyfrag}, v_{reenc})$
   7:         while $\mathit{entry}_{l_j} < \mathit{level}_{enc}$ do
   8:             $\mathit{entry}_{l_j} = \mathit{entry}_{l_j} + 1$
   9: for $(l, \mathit{entry}) \in \mathit{entryindex}\{\mathit{keyword}_{enc}\}$ do
  10:     if $\mathit{entry} \le \mathit{level}_{enc}$ then
  11:         break
  12: Set $c = 0$
  13: while $T_v^c \ne \bot$ do
  14:     $T_v^c = g(\mathrm{HMAC}(K_1, v), c)$
  15:     $v_{xor} = (\mathit{queryindex}[v_i] \oplus T_v^c) \oplus g(T_v^c, c)$
  16:     Parse $v_{xor}$ as $v_{ce} : (u, d)$
  17:     if $c \ge \mathit{entry}$ then
  18:         add $v_{ce}$ to $R$
  19:     $c = c + 1$
  20: $\mathit{src} = \mathit{vertex}_{enc}$
  21: for $v_{ce} \in R$ do
  22:     $\mathit{des} = v_{ce}$
  23:     $\mathit{distance} = \mathrm{PLL\_search}(\mathit{src}, \mathit{des})$
  24:     $R \leftarrow (v_{ce}, \mathit{distance})$
  25: $R \leftarrow \mathrm{sort}(R, \mathit{distance}, \mathrm{asc})$
  26: for $i < k$ do
  27:     $R_{enc} \leftarrow R_i$
  28: Return $R_{enc}$
For each word $\mathit{word}_i$ in wordindex, the algorithm checks if $\mathit{word}_i$ matches $\mathit{keyword}_{enc}$. If a match is found, for each encrypted value $v_{enc}$ in $\mathit{word}_i$, the algorithm performs the following steps: $v_{reenc}$ is obtained by applying the ReEncapsulation function to the capsule and key fragment. Then, $v_{ce}$ is decrypted using the Decrypt function with the user's secret key $\mathit{secretkey}_u$, the owner's public key $\mathit{publickey}_o$, the capsule, the key fragment and $v_{reenc}$.
For each entry $\mathit{entry}_{ij}$ in $\mathit{word}_i$, if $\mathit{entry}_{ij}$ is less than $\mathit{level}_{enc}$, it is incremented by 1. The algorithm then iterates over the entries in entryindex associated with $\mathit{keyword}_{enc}$. If an entry is less than or equal to $\mathit{level}_{enc}$, the loop breaks.
A counter $c$ is initialized to 0. While $T_v^c$ is not $\bot$, the algorithm computes $T_v^c$ as $g(\mathrm{HMAC}(K_1, v), c)$. The value $v_{xor}$ is calculated as $(\mathit{queryindex}[v_i] \oplus T_v^c) \oplus g(T_v^c, c)$. The algorithm then parses $v_{xor}$ as $v_{ce} : (u, d)$. If $c$ is greater than or equal to the current entry, $v_{ce}$ is added to the result set $R$. The counter $c$ is incremented by 1.
The source vertex src is set to $\mathit{vertex}_{enc}$. For each $v_{ce}$ in $R$, the destination vertex des is set to $v_{ce}$ and the distance between src and des is computed using the PLL_search function. The result set $R$ is updated to include $v_{ce}$ and the computed distance. The results are sorted in ascending order.
Finally, the top $k$ results from $R$ are selected and stored in $R_{enc}$, which is returned as the output of the algorithm.
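The final ranking step can be sketched on plaintext values as follows. The sketch uses `heapq.nsmallest` instead of a full sort, which yields the same top-$k$ list; the `distance` callback stands in for PLL_search, and the candidate names and distances are made up for illustration.

```python
import heapq


def top_k_nearest(src, candidates, k, distance):
    """Return the k candidates closest to src as (vertex, distance) pairs."""
    scored = [(distance(src, v), v) for v in candidates]
    # nsmallest orders by distance first, then by vertex name on ties.
    return [(v, d) for d, v in heapq.nsmallest(k, scored)]


# Hypothetical distances from the query vertex "q" to three candidates.
toy_distances = {("q", "a"): 3, ("q", "b"): 1, ("q", "c"): 2}
result = top_k_nearest("q", ["a", "b", "c"], 2, lambda s, t: toy_distances[(s, t)])
```

For small $k$ relative to the candidate set, selecting with a heap avoids sorting candidates that can never appear in the answer.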

5.8. Decrypt Algorithm

The user receives the query result set $R$ returned by the server. Because the vertices in the result set are encrypted, the user decrypts them locally. The ciphertext can be re-encrypted through multiple proxy points using the capsule and the previously authorized key fragments. Then, using their own private key, the data owner's public key, the $\mathit{capsule}$ and the key fragments, the user decrypts the re-encrypted ciphertext and obtains the vertex plaintext, which serves as the final query result and concludes the query.
The Decrypt algorithm (Algorithm 7) takes as input a user key set $K_U$, the owner's public key $k_o$ and an encrypted query result $R_{enc}$. It outputs a decrypted set of query results $R$.
Algorithm 7 Decrypt
Input: A user key set $K_U$, the owner's public key $k_o$ and a query result $R_{enc}$
Output: A decrypted set of query results $R$
  1: Parse $K_U$ as $(\mathit{secretkey}_u, \mathit{capsule}, \mathit{keyfrag}, K_2)$
  2: for $v_{reenc} \in R_{enc}$ do
  3:     $v_{se} \leftarrow \mathrm{PREdecrypt}(\mathit{secretkey}_u, k_o, \mathit{capsule}, \mathit{keyfrag}, v_{reenc})$
  4:     $v \leftarrow \mathrm{SEDecrypt}(K_1, v_{se})$
  5:     $\mathit{dis} \leftarrow \mathrm{OPE\_Dec}(\mathit{distance}_{ope}, K_2)$
  6: Return $R$
First, the user key set $K_U$ is parsed into four parts: the user's secret key $\mathit{secretkey}_u$, a capsule, a key fragment keyfrag and a key $K_2$.
For each re-encrypted value $v_{reenc}$ in $R_{enc}$, $v_{se}$ is obtained by applying the PRE_decrypt function with the user's secret key $\mathit{secretkey}_u$, the owner's public key $k_o$, the capsule, the key fragment keyfrag and $v_{reenc}$. The value $v$ is then decrypted using the SEDecrypt function with the key $K_1$ and the value $v_{se}$. The distance is decrypted using the OPE_Dec function with the key $K_2$ and the encrypted distance $\mathit{distance}_{ope}$.
The algorithm returns the decrypted set of query results $R$.
Taking Figure 4 as an example, the data owner initializes the wordindex dictionary, which contains all keywords $w_1, w_2$ as keys. Then, for each keyword $w_i$, a list is created in which every element represents a vertex in the graph that has keyword $w_i$. The unique values of the constant $c$ and $T_v^c$ mark the position of the vertex in the index when the index is encrypted. In the second step, to improve query efficiency, the data owner initializes a dictionary called entryindex. This index narrows the range of vertices to be searched in wordindex: for each security level, it records the position in wordindex, counted from $c = 0$, at which the vertices accessible at that level begin. Like wordindex, it uses all keywords in the dictionary as keys. Each value is a fixed-length list whose length corresponds to the highest permission level, and the list contains the entry point for each query level.
When a user at security level $l_3$ queries the top-2 vertices containing keyword $w_1$ nearest to $v_3$, the query steps are as follows:
  • Step 1: The user generates a query token $T = (F(w_1), \mathrm{SE.Enc}(v_3), 2, \mathrm{OPE.Enc}(l_3))$ and sends it to the server.
  • Step 2: The server uses $F(w_1)$ as a key to get the four values $\mathrm{OPE.Enc}(l_4)$, $\mathrm{OPE.Enc}(l_3)$, $\mathrm{OPE.Enc}(l_2)$, $\mathrm{OPE.Enc}(l_1)$ in the entryindex. The server compares the encrypted user level $\mathrm{OPE.Enc}(l_3)$ from the query token with the encrypted entries in the entry index. Leveraging the order-preserving property of OPE, it identifies the position where the previous entry is greater than $l_3$ and the next entry is smaller. It retrieves the tuple $(\mathrm{OPE.Enc}(l_3), 1)$ and extracts the value 1 to access the corresponding entry in the word index $F(w_1)$. This means that the server skips the first vertex $v_1$, whose level $l_4$ is higher than the user level $l_3$, and retrieves the values from index 1 to the end of the list, here the tuple $(\mathrm{SE.Enc}(v_2), \mathrm{OPE.Enc}(l_2))$.
  • Step 3: Finally, $\mathrm{SE.Enc}(v_2)$ is combined with $\mathrm{SE.Enc}(v_3)$ from the query token and the query index as inputs to the PLL_search function for further query processing. The server obtains the distance between $v_2$ and $v_3$ from the query index: the queryindex shows that the distance from $v_1$ to $v_3$ is 1 and that the distance from $v_1$ to $v_2$ is also 1. Therefore, the distance between $v_3$ and $v_2$ is $dis(v_3, v_2) = dis(v_3, v_1) + dis(v_1, v_2) = 2$ and the final top-2 nearest query result is $(\mathrm{SE.Enc}(v_2))$. The user then decrypts the result and obtains $v_2$.
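The entry lookup of Step 2 can be replayed on toy values. In the sketch below, `enc` is a hypothetical order-preserving stand-in for OPE.Enc (a plain increasing function, with no security), levels are represented by the integers 1 to 4, and the posting list and entry list mirror the example for keyword $w_1$.

```python
def enc(level: int) -> int:
    """Hypothetical order-preserving stand-in for OPE.Enc; not secure."""
    return 7 * level + 3


def find_entry(entries, level_enc):
    """entries: (encrypted level, position) pairs sorted from highest to lowest level."""
    for lvl_enc, position in entries:
        if lvl_enc <= level_enc:   # comparison works directly on OPE ciphertexts
            return position
    return None


# Posting list for w1, sorted by level descending: v1 has level 4, v2 has level 2.
postings_w1 = ["v1", "v2"]
entries_w1 = [(enc(4), 0), (enc(3), 1), (enc(2), 1), (enc(1), 2)]

position = find_entry(entries_w1, enc(3))   # user at level 3
candidates = postings_w1[position:]         # v1 is skipped
```

A user at level 4 would get position 0 and see both vertices, while a user at level 1 would get position 2 and an empty candidate list.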

6. Analysis and Experiment

6.1. Security Analysis

We first describe the two leakage functions from Section 4.3, $L_1$ and $L_2$, which represent the information that an adversary can learn from the encrypted search index set $I_{enc} = (\mathit{wordindex}_{enc}, \mathit{entryindex}_{enc}, \mathit{queryindex}_{enc})$ in the setup process and from the query tokens $T$ in the query process, respectively.
The function $L_1$ leaks graph structure, keyword and hierarchy metadata as follows: the adversary $\mathcal{A}$ learns $|V|$ by analyzing the encrypted query index. $\mathcal{A}$ infers the number of keywords $|W|$ and the number of vertices containing each keyword, $v(w)$, from the encrypted $\mathit{wordindex}_{enc}$. The number of security levels $|L|$ is revealed via $\mathit{entryindex}_{enc}$.
The function $L_2$ leaks the search pattern, access pattern and result cardinality as follows: $\mathcal{A}$ detects repeated queries via identical keyword or vertex tokens. $\mathcal{A}$ can learn which vertices two query results have in common. The size of the top-$k$ result is leaked.
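To make the setup leakage concrete, the following sketch reads these quantities off the shapes of the encrypted indexes alone, without any key. It assumes the three indexes are stored as dictionaries keyed by opaque ciphertexts; the layout, the dummy values and the function name are illustrative assumptions.

```python
def leakage_l1(wordindex_enc, entryindex_enc, queryindex_enc):
    """Metadata visible from index shapes alone (no decryption involved)."""
    return {
        "num_vertices": len(queryindex_enc),
        "num_keywords": len(wordindex_enc),
        # Posting-list lengths reveal how many vertices share each keyword.
        "vertices_per_keyword": sorted(len(p) for p in wordindex_enc.values()),
        # Entry lists have one slot per security level.
        "num_levels": max((len(e) for e in entryindex_enc.values()), default=0),
    }


# Dummy ciphertext placeholders; only the sizes matter here.
leak = leakage_l1(
    wordindex_enc={"kw_a": ["c1", "c2"], "kw_b": ["c3"]},
    entryindex_enc={"kw_a": [0, 1, 1, 2], "kw_b": [0, 0, 1, 1]},
    queryindex_enc={"x1": [], "x2": [], "x3": []},
)
```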
Theorem 1.
Assuming that HMAC, SE, OPE and PRE are secure, PH-kNK is $(L_1, L_2)$-secure against adaptive chosen-query attacks (CQA2) under the semi-honest adversarial model.
Proof. 
To prove Theorem 1, we construct the two experiments $\mathrm{Real}_{\mathcal{A}}(\lambda)$ and $\mathrm{Ideal}_{\mathcal{A},\mathcal{S}}(\lambda)$ introduced in Section 4.3.
$\mathrm{Real}_{\mathcal{A}}(\lambda)$ Experiment:
  • $\mathcal{A}$ selects graphs $G_0$ and $G_1$ that satisfy identical conditions: (1) the same number of vertices $|V|$; (2) the same number of edges $|E|$; (3) the same number of keywords $|W|$ and the same number of vertices containing each keyword, $v(m)$; (4) the same number of security levels $|L|$ and the same number of vertices at each level, $v(l)$.
  • The challenger $\mathcal{C}$ encrypts $G_b$ ($b \in \{0, 1\}$) and sends $I_{enc}$ to $\mathcal{A}$.
  • $\mathcal{A}$ adaptively queries and receives tokens/results.
  • $\mathcal{A}$ outputs a guess $b'$. The experiment succeeds if $b' = b$.
In the setup process, $\mathcal{A}$ obtains nothing beyond the leakage $L_1$, so $\mathcal{A}$ cannot distinguish $G_0$ from $G_1$. In the query process, $L_2$ leaks only search and access patterns and result sizes, which are equal for $G_0$ and $G_1$. If SE, HMAC, OPE and PRE are secure, $\mathcal{A}$'s advantage is negligible:
$$\mathrm{Adv}_{\mathcal{A}} = \Pr[\mathrm{Real}_{\mathcal{A}}(\lambda) = 1] - \frac{1}{2} = \mathrm{negl}(\lambda).$$
$\mathrm{Ideal}_{\mathcal{A},\mathcal{S}}(\lambda)$ Experiment:
  • Simulator $\mathcal{S}$ uses $L_1$ and $L_2$ to generate a simulated index $I_{enc}$ and simulated tokens $T$.
  • $\mathcal{S}$ replaces cryptographic primitives with random sampling.
Simulated tokens and indexes are indistinguishable from real ones due to the security of SE, HMAC, OPE and PRE, so $\mathcal{A}$ cannot distinguish the simulated $I_{enc}$ and $T$ from their real counterparts. Hence, $\mathrm{Adv}_{\mathcal{A}}$ in the ideal experiment is also negligible:
$$\mathrm{Adv}_{\mathcal{A}} = \Pr[\mathrm{Ideal}_{\mathcal{A},\mathcal{S}}(\lambda) = 1] - \frac{1}{2} = \mathrm{negl}(\lambda).$$
Conclusion. Since $\mathcal{A}$'s advantages in both experiments are negligible, PH-kNK is $(L_1, L_2)$-secure against adaptive chosen-query attacks (CQA2) under the semi-honest adversarial model. □

6.2. Theoretical Analysis

In this section, we analyze the complexity of each algorithm in the scheme. The findings are presented in Table 3. In the KeyGen phase, the data owner generates a set of keys, including the HMAC and symmetric encryption key, the order-preserving encryption (OPE) key and the re-encryption key used by the owner to encrypt the vertices. The process is performed by the owner and involves a fixed amount of computation and storage, so both the storage and computational complexities are constant.
In the KeyAssign phase, the server S locally generates its own set of keys and sends the public key to the owner for assignment. The assigned key fragments are stored by the server. Both steps have a computational complexity of $O(1)$.
In the BuildIndex phase, the owner constructs the index, which involves: (1) building wordindex, in which keywords serve as identifiers, with storage and computational complexity $O(|V||W|)$; (2) building entryindex for queries at the different levels $L$, with storage and computational complexity $O(|L||W|)$; and (3) building queryindex, with computational complexity $O(|V|m)$ and storage complexity $O(|V|^2)$. The overall storage complexity of BuildIndex is therefore $O((|L| + |V|)|W| + |V|^2)$ and the overall computational complexity is $O((|L| + |V|)|W| + |V|m)$. Each step iterates over the corresponding sets, so the algorithm's complexity is determined by the parameters $|L|$, $|V|$ and $|W|$. Here $|L|$ is the number of hierarchy levels in our experiment, $|V|$ is the number of vertices in the graph and $|W|$ is the number of keywords in the keyword list.
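As a rough numerical illustration of these bounds, the expressions can be evaluated directly. The helper below drops constant factors, so it gives orders of magnitude only. The parameter `m` is passed through exactly as it appears in the complexity expression, and the keyword count used in the example call is a hypothetical value, not one reported in the paper.

```python
def build_index_bounds(num_vertices, num_keywords, num_levels, m):
    """Evaluate the BuildIndex storage and computation expressions (constants dropped)."""
    shared = (num_levels + num_vertices) * num_keywords  # wordindex + entryindex terms
    storage = shared + num_vertices ** 2                  # queryindex storage term
    compute = shared + num_vertices * m                   # queryindex build term
    return storage, compute


# ego-Facebook has 4039 vertices; 10 levels is the maximum used in the experiments.
# The keyword count (100) and m (88,234, the edge count) are illustrative choices.
storage, compute = build_index_bounds(4039, 100, 10, 88_234)
```

For graphs of this size, the quadratic queryindex term dominates the storage expression, which is consistent with the index-building phases being the memory-heavy ones.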
In the EncryptIndex algorithm, the owner encrypts the indexes and sends them to the server. Encrypting wordindex, which uses HMAC and SE for the keyword index and OPE for the levels, has storage and computational complexity $O(|V||W|)$. Encrypting the entry index, which uses both OPE and re-encryption, has storage and computational complexity $O(|L||W|)$. The encryption operations are performed by the owner, who sends the encrypted index to the server for storage. In the QueryToken phase, the user requests key allocation and generates a query token from the desired vertex, keyword, level and other information. The complexity of this algorithm is $O(1)$.
In the kNKSearch phase, the server performs queries based on the tokens received from the user. The complexity of the query process for different keywords depends on the number of vertices containing the respective keyword, $v(m)$. In the Decrypt phase, the user receives the encrypted query results from the server and decrypts the encrypted vertices within them.

6.3. Experimental Evaluation

The experiment is conducted under the following environment.
  • RAM: 16 GB
  • Operating system: Windows 11
  • CPU: 11th Gen Intel(R) Core(TM) i5-11400H @ 2.70 GHz
  • Language: Python
  • Libraries: (1) the hashlib library, used to construct a pseudorandom function; (2) the pyope library, used for order-preserving encryption (OPE); (3) the umbral library, based on OpenSSL, used for proxy re-encryption; (4) the Pruned Landmark Labeling (PLL) algorithm, used to obtain the shortest distances in the graph.
  • Experiment parameters: (1) a 128-bit security parameter; (2) 1 to 10 security levels; (3) keyword frequencies from 10 to 10,000.
We implemented the PH-kNK scheme in Python. To assess the real-world applicability of our scheme, we conducted scalability tests on two large datasets, ego-Facebook and Facebook LPPN, from the Stanford Large Network Dataset Collection website [31]. As shown in Table 4, these datasets vary in size and complexity, providing a testbed for assessing the scheme's efficiency and scalability. The ego-Facebook dataset consists of 'friends lists' from Facebook and has 4039 vertices and 88,234 edges. The Facebook LPPN dataset has 22,470 vertices and 171,002 edges.
We compared our scheme with Aton [21], a state-of-the-art solution for privacy-preserving top-k keyword searches. Figure 5a,b illustrate the query time of both schemes on the ego-Facebook and Facebook LPPN datasets, respectively. The x-axis represents the keyword frequency, defined as the number of vertices containing the specified keyword, ranging from 1 to $10^4$. For each keyword frequency, the query vertex $v$ is randomly selected, 100 queries of the same security level are performed and the average result is recorded. As shown in Figure 5a,b, the PH-kNK scheme generally performs better at keyword frequencies of $10^3$ and below, whereas at keyword frequencies of $10^4$ and above the Aton scheme demonstrates superior query efficiency. Here, an "extreme keyword frequency" refers to a scenario such as a 22,470-vertex graph that has only 10 keywords. Our solution therefore has the advantage within the range of non-extreme keyword frequencies and remains highly competitive in most practical scenarios, where extreme keyword frequencies are less common.
The results demonstrate that the PH-kNK scheme outperforms Aton in scenarios with moderate keyword frequencies (up to $10^3$). For instance, on the ego-Facebook dataset, the PH-kNK scheme achieved an average query time of 198 ms for a keyword frequency of 10, compared to 243 ms for Aton. Similarly, on the Facebook LPPN dataset, the PH-kNK scheme recorded an average query time of 277 ms for a keyword frequency of 10, while Aton required 303 ms. The average query efficiency increased by 14.35% on ego-Facebook and by 12.86% on Facebook LPPN.
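For the two data points quoted at a keyword frequency of 10, the relative reduction in query time can be computed directly. The helper below is only an arithmetic aid with a hypothetical name. Its single-point results differ from the reported averages of 14.35% and 12.86%, which presumably aggregate over all tested keyword frequencies.

```python
def relative_reduction(baseline_ms: float, ours_ms: float) -> float:
    """Percentage by which ours_ms is lower than baseline_ms."""
    return (baseline_ms - ours_ms) / baseline_ms * 100


ego_facebook = relative_reduction(243, 198)    # Aton 243 ms vs PH-kNK 198 ms
facebook_lppn = relative_reduction(303, 277)   # Aton 303 ms vs PH-kNK 277 ms
print(f"ego-Facebook: {ego_facebook:.2f}%, Facebook LPPN: {facebook_lppn:.2f}%")
```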
Moreover, we measured memory usage and time cost to provide a more comprehensive analysis of efficiency and scalability. Table 5 reports the memory consumption during the critical phases of our scheme on the ego-Facebook and Facebook LPPN datasets. The memory overhead is concentrated primarily in the BuildIndex and EncryptIndex stages, while the memory usage of KeyGen and KeyAssign is relatively low. kNKSearch, on the other hand, requires loading the search indexes, which results in a high memory overhead.
As illustrated in Figure 5c,d, the relationship between query time and the number of graph vertices can be summarized as follows: as the number of vertices and edges increases, queries become slower; conversely, fewer vertices and edges result in faster queries. This trend is also observed in the Aton scheme.
Furthermore, the search results at various levels are presented in Figure 6. The Aton scheme does not incorporate security-level constraints on keywords and vertices, so it cannot be compared with our proposed scheme in the context of hierarchical searchable encryption. To evaluate the hierarchical behavior of our scheme, we generated 100 random queries for each security level from 1 to 10, with keyword frequencies ranging from 1000 to 3000. The results in Figure 6 show that higher query levels are slower, because higher query levels contain more search results than lower levels.
Table 6 presents the query time performance of the two schemes under the same dataset and keyword frequency, with the top k values ranging from 10 to 50. As shown, the query time increases as k increases.

7. Conclusions

Overall, our contributions include the introduction of a novel scheme for kNK graph search, the implementation of fine-grained access control and the utilization of secure encryption techniques. We present the architecture of the proposed scheme, provide a detailed description of the algorithm and conduct simulation experiments to validate its performance. Notably, our scheme incorporates hierarchical search functionality that is currently absent in existing solutions. Experimental results on real-world datasets, such as ego-Facebook and Facebook LPPN, demonstrate that the PH-kNK scheme achieves significant improvements in query efficiency, particularly for moderate keyword frequencies (up to $10^3$). Compared to existing solutions like Aton, our scheme performs better in scenarios with non-extreme keyword frequencies, offering faster query times and better scalability.
Our scheme is applicable to a diverse range of real-world application scenarios such as healthcare, finance and social networks where data privacy is paramount. The ability to perform privacy-preserving searches with fine-grained access control ensures compliance with ethical standards and data protection regulations. These applications highlight the practical relevance of our work.
Future work will focus on addressing scalability challenges across large graph sizes and heterogeneous data types and on integrating the scheme with existing infrastructures such as Neo4j and Amazon Neptune. Another possible research direction is supporting dynamic updates of the graph structure. Additionally, we plan to explore enhancing the scheme to resist more advanced adversarial models and to incorporate post-quantum cryptographic techniques for long-term security.

Author Contributions

Conceptualization, Z.X.; methodology, X.Z.; software, X.Z.; validation, X.Z. and C.H.; formal analysis, X.Z. and J.L.; investigation, C.H. and J.L.; resources, Z.X.; writing—original draft preparation, X.Z.; writing—review and editing, C.H. and J.L.; supervision, Z.X.; project administration, Z.X.; funding acquisition, Z.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Hainan University High-level Talent Scientific Research Start-up Fund No. RZ2100003340 and National Natural Science Foundation of China No. 62262012. The APC was funded by Hainan University High-level Talent Scientific Research Start-up Fund No. RZ2100003340.

Data Availability Statement

The data featured in this study can be requested from the corresponding author, subject to privacy restrictions related to the Hainan Project data.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tao, Y.; Papadopoulos, S.; Sheng, C.; Stefanidis, K. Nearest keyword search in xml documents. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, Athens, Greece, 12–16 June 2011; pp. 589–600.
  2. Wang, B.; Yu, S.; Lou, W.; Hou, Y.T. Privacy-preserving multi-keyword fuzzy search over encrypted data in the cloud. In Proceedings of the IEEE INFOCOM 2014-IEEE Conference on Computer Communications, Toronto, ON, Canada, 27 April–2 May 2014; pp. 2112–2120.
  3. Qiao, M.; Qin, L.; Cheng, H.; Yu, J.X.; Tian, W. Top-k nearest keyword search on large graphs. Proc. VLDB Endow. 2013, 6, 901–912.
  4. Cao, N.; Yang, Z.; Wang, C.; Ren, K.; Lou, W. Privacy-preserving query over encrypted graph-structured data in cloud computing. In Proceedings of the 2011 31st International Conference on Distributed Computing Systems, Minneapolis, MN, USA, 20–24 June 2011; pp. 393–402.
  5. Jiang, M.; Fu, A.W.C.; Wong, R.C.W. Exact top-k nearest keyword search in large networks. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Australia, 31 May–4 June 2015; pp. 393–404.
  6. Amorim, I.; Costa, I. Leveraging Searchable Encryption through Homomorphic Encryption: A Comprehensive Analysis. Mathematics 2023, 11, 2948.
  7. Amorim, I.; Costa, I. Homomorphic Encryption: An Analysis of its Applications in Searchable Encryption. arXiv 2023, arXiv:2306.14407.
  8. Gui, Z.; Paterson, K.G.; Patranabis, S. Rethinking searchable symmetric encryption. In Proceedings of the 2023 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 21–25 May 2023; pp. 1401–1418.
  9. Noorallahzadeh, M.; Alimoradi, R.; Gholami, A. Searchable Encryption Taxonomy: Survey. J. Appl. Secur. Res. 2023, 18, 880–924.
  10. Zou, L.; Chen, L. Dominant graph: An efficient indexing structure to answer top-k queries. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, Cancun, Mexico, 7–12 April 2008; pp. 536–545.
  11. Lysenko, A.; Roznovăţ, I.A.; Saqi, M.; Mazein, A.; Rawlings, C.J.; Auffray, C. Representing and querying disease networks using graph databases. BioData Min. 2016, 9, 1–19.
  12. Ortega-Guzmán, V.H.; Gutiérrez-Preciado, L.; Cervantes, F.; Alcaraz-Mejia, M. A Methodology for Knowledge Discovery in Labeled and Heterogeneous Graphs. Appl. Sci. 2024, 14, 838.
  13. Hertz, A.; Plumettaz, M.; Zufferey, N. Variable space search for graph coloring. Discret. Appl. Math. 2008, 156, 2551–2560.
  14. Zou, L.; Chen, L.; Lu, Y. Top-K correlation sub-graph search in graph databases. In Database Systems for Advanced Applications, Proceedings of the 14th International Conference, DASFAA 2009, Brisbane, Australia, 21–23 April 2009; Proceedings 14; Springer: Berlin/Heidelberg, Germany, 2009; pp. 168–185.
  15. Yuan, Y.; Wang, G.; Chen, L.; Wang, H. Efficient keyword search on uncertain graph data. IEEE Trans. Knowl. Data Eng. 2013, 25, 2767–2779.
  16. Teng, Y.; Cheng, X.; Su, S.; Bi, R. Privacy-preserving top-k nearest keyword search on outsourced graphs. In Proceedings of the 2016 IEEE Trustcom/BigDataSE/ISPA, Tianjin, China, 23–26 August 2016; pp. 815–822.
  17. Li, Y.; Zhou, F.; Ji, D.; Xu, Z. A Hierarchical Searchable Encryption Scheme Using Blockchain-Based Indexing. Electronics 2022, 11, 3832.
  18. Song, D.X.; Wagner, D.; Perrig, A. Practical techniques for searches on encrypted data. In Proceedings of the 2000 IEEE Symposium on Security and Privacy. S&P 2000, Berkeley, CA, USA, 14–17 May 2000; pp. 44–55.
  19. Chase, M.; Kamara, S. Structured encryption and controlled disclosure. In Advances in Cryptology-ASIACRYPT 2010, Proceedings of the 16th International Conference on the Theory and Application of Cryptology and Information Security, Singapore, 5–9 December 2010; Proceedings 16; Springer: Berlin/Heidelberg, Germany, 2010; pp. 577–594.
  20. Liu, X.; Yang, G.; Mu, Y.; Deng, R.H. Multi-user verifiable searchable symmetric encryption for cloud storage. IEEE Trans. Dependable Secur. Comput. 2018, 17, 1322–1332.
  21. Shen, M.; Wang, M.; Xu, K.; Zhu, L. Privacy-preserving approximate top-K nearest keyword queries over encrypted graphs. In Proceedings of the 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS), Tokyo, Japan, 25–28 June 2021; pp. 1–10.
  22. Cheng, J.; Zhang, Y.; Ye, Q.; Du, H. High-precision shortest distance estimation for large-scale social networks. In Proceedings of the IEEE INFOCOM 2016-The 35th Annual IEEE International Conference on Computer Communications, San Francisco, CA, USA, 10–14 April 2016; pp. 1–9.
  23. Delling, D.; Goldberg, A.V.; Pajor, T.; Werneck, R.F. Robust distance queries on massive networks. In Algorithms-ESA 2014, Proceedings of the 22th Annual European Symposium, Wroclaw, Poland, 8–10 September 2014; Proceedings 21; Springer: Berlin/Heidelberg, Germany, 2014; pp. 321–333.
  24. Yang, J.; Yao, W.; Zhang, W. Keyword search on large graphs: A survey. Data Sci. Eng. 2021, 6, 142–162. [Google Scholar] [CrossRef]
  25. Li, P.; Zhou, F.; Xu, Z.; Li, Y.; Xu, J. Privacy-Preserving Top-K Nearest Keyword Search Queryies over Encrypted Graph Data. In Proceedings of the 2021 IEEE 6th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 22–24 October 2021; pp. 531–537. [Google Scholar]
  26. Sellami, S.; Zarour, N.E. Keyword-based faceted search interface for knowledge graph construction and exploration. Int. J. Web Inf. Syst. 2022, 18, 453–486. [Google Scholar] [CrossRef]
  27. Cozza, V. Towards a framework for graph-based keyword search over relational data. Int. J. Intell. Inf. Database Syst. 2022, 15, 183–198. [Google Scholar] [CrossRef]
  28. Akiba, T.; Iwata, Y.; Yoshida, Y. Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22–27 June 2013; pp. 349–360. [Google Scholar]
  29. Boldyreva, A.; Chenette, N.; Lee, Y.; O’neill, A. Order-preserving symmetric encryption. In Advances in Cryptology-EUROCRYPT 2009, Proceedings of the 28th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Cologne, Germany, 26–30 April 2009; Proceedings 28; Springer: Berlin/Heidelberg, Germany, 2009; pp. 224–241. [Google Scholar]
  30. Nunez, D. Umbral: A Threshold Proxy Re-Encryption Scheme; NuCypher Inc and NICS Lab, University of Malaga: Malaga, Spain, 2018. [Google Scholar]
  31. Leskovec, J.; Krevl, A. SNAP Datasets: Stanford Large Network Dataset Collection. 2014. Available online: http://snap.stanford.edu/data (accessed on 1 March 2024).
Figure 1. Research on searchable encryption.
Figure 2. Overview of Graph Structure Encryption and Hierarchical Encryption.
Figure 3. Model of PH-kNK scheme.
Figure 4. An example of query steps.
Figure 5. Relationship between keyword frequency and query time: (a) on the ego-Facebook dataset; (b) on the Facebook LPPN dataset; (c) on both datasets for PH-kNK; (d) on both datasets for Aton.
Figure 6. The Relationship of Search Times and Hierarchies.
Table 1. Comparison of privacy protection and graph types across different methods.
k-NK | MVSSE | PPkNK | Aton | PH-kNK
Privacy protection
labeled graphs
accurate search
hierarchical search
Table 2. Notations.
| Notation | Denotation |
| --- | --- |
| $G$ | A graph |
| $n$ | The number of vertices in the graph $G$ |
| $v_i$ | Vertex $v_i$ ($1 \le i \le n$) in the graph |
| $w_i$ | Keyword that vertex $v_i$ contains |
| $l_i$ | Level of vertex $v_i$ |
| $V$ | A tuple of $(v_i, w_i, l_i)$, $v_i \in V$, $w_i \in W$, $l_i \in L$ |
| $E$ | Edges in the graph $G$ |
| $word_i$ | A tuple containing $v_i$, $w_i$, $l_i$ in $wordindex$ |
| $dis_{ij}$ | Shortest distance between vertex $v_i$ and vertex $v_j$ |
| $K_1$ | A secret key for hashing and symmetric encryption |
| $K_2$ | A secret key for order-preserving encryption |
| $wordindex$ | An index generated for search, sorted by keywords |
| $entryindex$ | An index indicating where to start the search in $wordindex$ |
| $queryindex$ | A search index of the PLL algorithm used to obtain $dis_{ij}$ |
| $L_{enc}$ | The set of encrypted indexes above |
| PLL | A pruned landmark labeling scheme |
| OPE | An order-preserving encryption scheme |
| PRE | A proxy re-encryption scheme (Umbral) |
| SE | A symmetric encryption scheme |
| HMAC | A hash-based message authentication code |
| $g$ | A pseudorandom function |
Table 3. The time and space complexities of algorithms.
| Algorithm | $Stor_O$ | $Stor_U$ | $Stor_S$ | $Comp_O$ | $Comp_U$ | $Comp_S$ |
| --- | --- | --- | --- | --- | --- | --- |
| KeyGen | $O(\lambda)$ | - | - | $O(\lambda)$ | - | - |
| KeyAssign | $O(\lambda)$ | $O(1)$ | $O(1)$ | $O(\lambda)$ | - | - |
| BuildIndex | $O((\lvert L\rvert + \lvert V\rvert)\lvert W\rvert + \lvert V\rvert^2)$ | - | - | $O((\lvert L\rvert + \lvert V\rvert)\lvert W\rvert + \lvert V\rvert m)$ | - | - |
| EncryptIndex | - | - | $O((\lvert L\rvert + \lvert V\rvert)\lvert W\rvert)$ | $O(\lvert V\rvert \lvert W\rvert)$ | - | - |
| QueryToken | - | $O(1)$ | $O(1)$ | - | $O(1)$ | - |
| KnkSearch | - | - | $O(1)$ | - | - | $O(v(w))$ |
| Decrypt | - | $O(1)$ | - | - | $O(1)$ | - |

$\lambda$: security parameter; $|L|$: number of hierarchies; $|W|$: number of keywords; $|V|$: number of vertices; $v(w)$: number of vertices containing keyword $w$; $m$: number of edges.
Table 4. Dataset Information.
| Dataset | $\lvert V\rvert$ | $\lvert E\rvert$ | $\lvert W\rvert$ |
| --- | --- | --- | --- |
| ego-Facebook | 4039 | 88,234 | 1311 |
| Facebook LPPN | 22,470 | 171,002 | 4714 |
Table 5. Memory Usage Analysis.
| Process | Dataset | KeyGen | KeyAssign | Build Index | Encrypt Index | kNKSearch |
| --- | --- | --- | --- | --- | --- | --- |
| Memory peak (KB) | ego-Facebook | 7.47 | 92.78 | 11,652.75 | 16,697.89 | 16,868.49 |
| Memory peak (KB) | Facebook LPPN | 7.52 | 93.21 | 115,566.41 | 161,575.56 | 166,739.00 |
| Time cost (ms) | ego-Facebook | 0.095 | 0.91 | 1899.33 | 18,176.5 | 198.55 |
| Time cost (ms) | Facebook LPPN | 0.095 | 1.24 | 61,031.88 | 1,477,650.34 | 277.83 |
Table 6. Top-k query time results.
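| Scheme | Dataset | k = 10 | k = 20 | k = 30 | k = 40 | k = 50 |
| --- | --- | --- | --- | --- | --- | --- |
| PH-kNK | ego-Facebook | 198 | 217 | 232 | 240 | 262 |
| PH-kNK | Facebook LPPN | 277 | 286 | 305 | 316 | 338 |
| Aton | ego-Facebook | 243 | 259 | 270 | 285 | 302 |
| Aton | Facebook LPPN | 303 | 326 | 340 | 361 | 373 |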
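The gap between the two schemes in Table 6 is easier to read as a relative reduction than as raw query times. The short sketch below recomputes that reduction from the tabulated values only. The names `QUERY_TIME` and `relative_reduction` are hypothetical helpers introduced for this illustration, and the unit is assumed to be milliseconds, as in Table 5.

```python
# Query times copied from Table 6 (unit assumed to be ms, as in Table 5).
K_VALUES = [10, 20, 30, 40, 50]
QUERY_TIME = {
    ("PH-kNK", "ego-Facebook"): [198, 217, 232, 240, 262],
    ("PH-kNK", "Facebook LPPN"): [277, 286, 305, 316, 338],
    ("Aton", "ego-Facebook"): [243, 259, 270, 285, 302],
    ("Aton", "Facebook LPPN"): [303, 326, 340, 361, 373],
}


def relative_reduction(dataset):
    """Percentage by which PH-kNK query time is below Aton's, per k."""
    ours = QUERY_TIME[("PH-kNK", dataset)]
    base = QUERY_TIME[("Aton", dataset)]
    return [round(100 * (b - o) / b, 1) for o, b in zip(ours, base)]


if __name__ == "__main__":
    for dataset in ("ego-Facebook", "Facebook LPPN"):
        for k, pct in zip(K_VALUES, relative_reduction(dataset)):
            print(f"{dataset:14s} k={k:2d}: {pct:.1f}% lower than Aton")
```

On these figures the reduction is between about 13% and 19% on ego-Facebook and between about 9% and 13% on Facebook LPPN. This is arithmetic on the reported numbers, not a new measurement.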
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Zhu, X.; Xu, Z.; Hu, C.; Lin, J. Privacy-Preserving Hierarchical Top-k Nearest Keyword Search on Graphs. Electronics 2025, 14, 736. https://doi.org/10.3390/electronics14040736