1. Introduction
Large language models (LLMs) have become indispensable in numerous domains, significantly advancing applications in data management, mining, and analysis. Efficient retrieval of massive and heterogeneous data serves as the foundation for LLM training and deployment. Taking advantage of rapid advances in computing, storage, and communication technologies, the explosive growth of Internet data has provided strong momentum for the continuous development of LLMs. However, such large-scale, structurally diverse and complex-origin Internet data pose significant challenges to data management systems supporting LLM training and applications. On the one hand, traditional data retrieval techniques have become inadequate, urgently requiring new technologies and tools for data management and analysis. On the other hand, privacy leakage issues in LLMs’ usage of such data demand immediate solutions.
In terms of data retrieval, the k-nearest neighbor (kNN) search [1] represents one of the most commonly used techniques in LLMs [2,3,4], aiming to find the k elements closest to a given query from a dataset. For example, by modeling images as high-dimensional vector data, kNN enables image recognition and classification. However, as data scales grow, traditional exact kNN methods face challenges of high computational complexity and substantial storage requirements, leading to the emergence of approximate nearest neighbor (ANN) search as an effective alternative. For example, in corpora containing millions or even billions of text snippets, identifying the text most similar to a user query for reference is essential. Traditional exact nearest neighbor search methods, which compute the distance between the input and every snippet, become prohibitively expensive at this scale. Among ANN techniques, locality-sensitive hashing (LSH) has become one of the most widely used. LSH maps data points into hash buckets, ensuring that similar points have a higher probability of being mapped to the same bucket, thereby partially mitigating the “curse of dimensionality.” Nevertheless, despite the strong performance of LSH in approximate nearest neighbor searches over large-scale high-dimensional data, it still suffers from limited query accuracy.
Regarding secure data storage and management, cloud computing, a computing paradigm enabled by Internet technology, utilizes the Internet as its foundational infrastructure for connectivity and interaction [5]. This allows users to access and leverage remote computational resources without local ownership or maintenance, providing powerful computing capabilities and massive storage capacity. Consequently, data owners typically entrust cloud servers with managing large-scale high-dimensional data. When LLM users need to perform neighbor queries, they simply send requests to the cloud and await results, significantly reducing local storage and computational costs. However, with growing concerns about data security and privacy across sectors, protecting user data has become a critical challenge in data retrieval and analysis. Privacy protection measures are essential during data retrieval to prevent unpredictable malicious cyber attacks from compromising cloud-based data security [6]. Typically, data must be stored in encrypted form on the cloud to prevent attackers from inferring plaintext through user queries. Researchers have proposed various encrypted data retrieval schemes, including homomorphic encryption, searchable encryption, and secure multiparty computation. However, these methods often suffer from excessive computational complexity, reduced query efficiency, or high memory overhead. Therefore, developing appropriate secure nearest neighbor query schemes has become an urgent challenge, requiring a careful balance among retrieval efficiency, query accuracy, and data security.
The nearest neighbor graph (NNG), one of the fundamental tools for high-dimensional vector data retrieval, demonstrates exceptional performance in storing neighborhood information for large-scale datasets. Compared with LSH techniques, NNGs can provide query results that more closely approximate exact nearest neighbors. Building upon this advantage, we propose an NNG-based Secure Approximate Nearest Neighbor (NNG-SANN) scheme that integrates an NNG with LSH technology. Our approach makes three key contributions:
High-accuracy secure indexing: We innovatively combine LSH with NNG technology to construct a secure index structure for high-dimensional vector data. This design significantly improves query accuracy while maintaining retrieval efficiency.
Refined greedy partitioning method: We design a greedy subset partitioning method that groups data points sharing a hash value, together with their NNG neighbors, into equal-sized subsets, preventing data explosion and ensuring that the encrypted subsets stored on the server remain indistinguishable.
Extensive experiments on both real-world and synthetic datasets demonstrate that our scheme achieves statistically significant improvements in secure ANN query accuracy compared to state-of-the-art SANN methods.
The rest of the paper is organized as follows: Section 2 discusses related work on NN, ANN, and SANN queries. Section 3 describes the preliminaries of the NNG-SANN scheme, and Section 4 presents the NNG-SANN model. Section 5 provides the theoretical analysis of NNG-SANN, Section 6 reports the experiments, and Section 7 concludes the paper.
2. Related Works
The field of nearest neighbor search has witnessed the emergence of various solutions, including both exact and ANN search methods, to meet diverse application requirements for data scale and precision. To further enhance data privacy and security, research on secure nearest neighbor search continues to advance and explore new frontiers. Below we present an overview of relevant methodologies.
2.1. Nearest Neighbor Query
Numerous studies have employed tree-based structures to design indexing methods for high-dimensional vector data to achieve exact kNN queries. Bentley [7] proposed the k-d tree algorithm, which recursively partitions the k-dimensional space into hyper-rectangular regions by splitting the space into left and right subspaces based on data point distribution, forming a binary search tree structure. This approach proves effective for space partitioning and search in low-dimensional spaces by eliminating unnecessary subspaces to reduce the search scope. Guttman [8] developed the R-tree algorithm, which divides the data space into distinct polygonal regions and organizes adjacent regions into a tree structure. Through dynamic adjustments to accommodate data variations, this method enables efficient retrieval of high-dimensional data. Chakrabarti and Mehrotra [9] introduced a hybrid tree structure that combines the advantages of different tree types to address challenges in exact queries over high-dimensional spaces, dynamically selecting indexing strategies according to dataset characteristics. Additionally, the VA-File [10] reduces disk I/O costs through quantization compression and approximate storage, though this approximation inevitably sacrifices some data fidelity, leading to compromised retrieval accuracy.
While these solutions demonstrate satisfactory performance for nearest neighbor retrieval in low-dimensional vector spaces, their effectiveness deteriorates significantly when applied to high-dimensional data. This performance degradation stems from the inherent data sparsity in high-dimensional spaces, which renders tree structures ineffective in evaluating inter-point distances or similarities, ultimately causing index failure.
2.2. Approximate Nearest Neighbor Query
The nearest neighbor search problem in LLMs faces two major challenges: first, the increasing dimensionality of data vectors complicates similarity calculations; second, constructing efficient index structures for large-scale high-dimensional vector datasets is difficult. To address these difficulties, approximate k-nearest neighbor (K-ANN) search has been proposed, which accepts a certain reduction in query accuracy in exchange for improved search efficiency. In recent years, numerous solutions have emerged, with common K-ANN methods primarily including product quantization (PQ)-based approaches, LSH-based methods, and NNG-based techniques.
Jégou et al. [11] employed product quantization to compute asymmetric distances for obtaining approximate nearest neighbors. Ge et al. subsequently proposed optimized product quantization (OPQ) [12] to reduce quantization errors, although significant error increases may occur with highly imbalanced dataset distributions. Indyk introduced LSH, which maps high-dimensional vectors into compact hash values. Datar et al. developed E2LSH [13], the first application of LSH in Euclidean space, albeit with substantial storage requirements. To address this limitation, entropy-based LSH [14] and multi-probe LSH [15] were proposed to retrieve more candidate points from a single hash table, thereby reducing the number of required hash tables. Gan et al. presented C2LSH [16], which adopts a dynamic collision counting approach to measure inter-point similarity. For graph-based approaches, Malkov et al. proposed the navigable small world (NSW) graph [17] and its enhanced version, the hierarchical navigable small world (HNSW) graph [18], which demonstrate high efficiency for large-scale dataset searches but suffer from considerable memory consumption during graph index construction.
Among the aforementioned methods, LSH-based approaches demonstrate favorable performance and relatively low computational overhead for ANN searches in high-dimensional vector spaces. However, these methods often struggle to guarantee query accuracy. In contrast, graph-based methods achieve higher query precision but require substantial computational resources for index construction. Current solutions still face challenges in providing both efficient and highly accurate ANN searches for high-dimensional vector data.
2.3. Secure Approximate Nearest Neighbor Query
Secure nearest neighbor search has emerged as one of the prominent techniques for ensuring data privacy in LLMs. Searchable encryption, a core technology in this domain, enables computation and query operations on ciphertext data, making it particularly suitable for meeting privacy protection requirements in data analysis and computational tasks of LLMs.
Khoshgozaran and Shahabi [19] proposed a secure index structure based on order-preserving encryption (OPE), which supports both range queries and nearest neighbor searches on encrypted data. Demertzis et al. [20] developed a secure indexing approach utilizing Paillier homomorphic encryption, allowing additive homomorphic operations on ciphertext. Peng et al. introduced the LS-RQ scheme [21], which reduces communication frequency between users and cloud servers while improving query efficiency.
In summary, index-based query schemes demonstrate significant advantages in the field of secure high-dimensional vector data retrieval. The design of efficient, accurate, and secure query methods remains an actively researched direction with broad application prospects in practical scenarios.
3. Preliminaries
In this section, we describe the preliminaries of the NNG-SANN scheme in detail.
3.1. Locality-Sensitive Hashing
LSH is a widely used ANN technique that enables efficient search for data points most similar to a query vector in large datasets. The fundamental principle of LSH involves partitioning the data space into multiple segments and assigning each segment to a hash bucket, thereby ensuring that similar data points have a high probability of being mapped to the same bucket. This mechanism significantly enhances the efficiency of nearest neighbor searches. The LSH function employed in our scheme is calculated as follows:

$$h_{\mathbf{a},r}(\mathbf{v}) = \left\lfloor \frac{\mathbf{a} \cdot \mathbf{v} + r}{w} \right\rfloor \qquad (1)$$

where v is an arbitrary vector of the original data, v ∈ R^d; a is a randomly selected d-dimensional vector whose components follow the standard normal distribution N(0, 1); and r is a random number, r ∈ [0, w). Specifically, the inner product a · v projects the original d-dimensional vector v onto a randomly selected direction, producing a scalar value. Then, in order to control the range of the hash value, a fixed value w defined by the user is introduced; w is a scaling factor that adjusts the interval width so that the hash value falls within an appropriate range. Meanwhile, the random value r, selected uniformly from [0, w), randomly offsets the hash value.
The above hash function maps a d-dimensional vector to a single integer, h: R^d → N, where N is the set of natural numbers. Given two distance thresholds R1 and R2 (R2 > R1) and two probability parameters P1 and P2 (P1 > P2), a hash function family satisfying the following properties is called (R1, R2, P1, P2) LSH:

If ||u − v|| ≤ R1, then Pr[h(u) = h(v)] ≥ P1.

If ||u − v|| ≥ R2, then Pr[h(u) = h(v)] ≤ P2.

For the function in Equation (1), the collision probability of two vectors u and v at distance c = ||u − v|| is

$$p(c) = \Pr\big[h(\mathbf{u}) = h(\mathbf{v})\big] = \int_{0}^{w} \frac{1}{c}\, f\!\left(\frac{t}{c}\right)\left(1 - \frac{t}{w}\right)\mathrm{d}t \qquad (2)$$

where f denotes the probability density function of the absolute value of the standard normal distribution. The parameters P1 and P2 are then obtained as P1 = p(R1) and P2 = p(R2).
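To make the mapping in Equation (1) concrete, the following is a minimal NumPy sketch of one such hash function; the function name and parameter values (d = 128, w = 4.0) are illustrative choices, not part of the scheme's specification.

```python
import numpy as np

def make_lsh(d, w, rng):
    """Sample one (a, r) pair and return the hash function of Equation (1)."""
    a = rng.standard_normal(d)   # random projection direction, components ~ N(0, 1)
    r = rng.uniform(0, w)        # random offset drawn uniformly from [0, w)
    def h(v):
        return int(np.floor((a @ v + r) / w))  # floor((a . v + r) / w)
    return h

rng = np.random.default_rng(0)
h = make_lsh(d=128, w=4.0, rng=rng)
v = rng.standard_normal(128)
u = v + 0.01 * rng.standard_normal(128)  # a point very close to v
print(h(v), h(u))  # close points collide with high probability
```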
3.2. Nearest Neighbor Graph
A neighborhood graph is utilized to represent proximity relationships among data points. In such a graph, each node is connected to its nearest neighbor nodes. Each node corresponds to a data point in the dataset, while edges indicate similarity or distance between data points. Typically, connections between nodes are determined based on a specific distance metric. In this work, Euclidean distance is adopted for computation. ANN search aims to efficiently identify the approximate nearest neighbors for each point in a dataset, and neighborhood graphs can be employed to accelerate this search process. Below, we introduce two currently prevalent high-efficiency neighborhood graph structures.
3.2.1. NSW
NSW is an efficient graph-based data structure that exhibits properties of small-world networks while maintaining strong navigability. It enables rapid node localization in large-scale networks, demonstrating exceptional performance in efficient ANN search.
The NSW structure is specifically designed to possess small-world network characteristics, meaning that any two nodes in the graph are typically connected by a short path. This property ensures efficient local search operations, allowing neighboring nodes to be identified effectively. Moreover, the NSW incorporates long-range connections between nodes, significantly enhancing navigation efficiency. During ANN search, this facilitates rapid traversal across the graph to locate potential candidates. In Figure 1, the arrows illustrate an example path obtained via a greedy algorithm from an entry point to a query point.
When sequentially inserting new points into the NSW graph, a naive search method can be employed to identify nearest neighbors for each incoming point. Specifically, for each point to be inserted, the algorithm first computes its distances to existing nodes in the graph, then selects m nearest nodes as its immediate neighbors. The new point is subsequently connected to these m neighbors, thereby expanding the graph’s topological structure. This incremental process progressively enriches the NSW graph by establishing local neighborhood relationships among nodes, which fundamentally supports subsequent nearest neighbor search operations.
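As a minimal illustration of this naive insertion procedure (not the authors' implementation), the following Python sketch connects each new point to its m nearest existing nodes by brute-force distance computation:

```python
import numpy as np

def nsw_insert(graph, points, new_point, m):
    """Connect a new point to its m nearest existing nodes (naive NSW insertion)."""
    idx = len(points)
    if points:
        # Compute distances from the new point to all existing nodes.
        dists = [np.linalg.norm(np.asarray(p) - np.asarray(new_point)) for p in points]
        nearest = np.argsort(dists)[:m]        # indices of the m closest nodes
        graph[idx] = set(int(n) for n in nearest)
        for n in graph[idx]:                   # edges are bidirectional
            graph[n].add(idx)
    else:
        graph[idx] = set()                     # first node has no neighbors yet
    points.append(new_point)
```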
Following the construction of the NSW structure and node insertion, an efficient method becomes essential for retrieving m nearest neighbors of a given query point. This step proves critical as it directly determines both the search speed and accuracy within the NSW framework. Algorithm 1 presents the formal neighbor search procedure for NSW graphs, demonstrating how rapid and reliable nearest neighbor queries can be achieved through this optimized approach.
Algorithm 1 ANN queries on NNGs
Input: graph structure: G, the element to be searched: q, the number of nearest neighbors to be searched: m, the entry node: ep. Output: the m nearest neighbors of q: S
1: C ← {ep} //Initialize the dynamic candidate list
2: W ← C //The shadow list of C
3: V ← {ep} //A collection used to record the visited nodes
4: compute dist(ep, q) //Calculate the distance from the query point
5: while C ≠ ∅ do
6:   f ← the element of W furthest from q //Update the current furthest result
7:   c ← the element of C nearest to q, removed from C //Take points in sequence from C
8:   if dist(c, q) ≤ dist(f, q) then
9:     N ← the neighbors of c in G //Find the neighbor points of point c
10:    for each e ∈ N do
11:      if e ∉ V then
12:        V ← V ∪ {e}
13:        C ← C ∪ {e}, W ← W ∪ {e}
14:      end if
15:    end for
16:  end if
17: end while
18: sort W //Arrange in ascending order of distance to q
19: return the first m elements of W as S //Return the first m as nearest neighbors
3.2.2. HNSW
The HNSW structure is an enhanced version of the NSW graph architecture, achieving efficient nearest neighbor search on large-scale datasets. Compared to NSW, HNSW introduces a hierarchical organization through skip lists and multi-layered node arrangements, which significantly improves search efficiency. As illustrated in Figure 2, HNSW incorporates additional connection pointers at each layer, enabling a skip-based search process that enhances overall performance.
In this hierarchical structure, each data point has a predefined probability (typically 50%) of being promoted to the next higher-level ordered linked list. This multi-layered organization facilitates concurrent search operations across multiple levels, substantially accelerating the search process. The search algorithm initiates from the highest layer and progressively descends to lower levels until either locating the target or reaching the base layer, ensuring both efficiency and completeness of the search procedure.
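The promotion rule can be sketched as a sequence of coin flips, skip-list style; the helper below is an illustration with an assumed promotion probability p = 0.5 and an arbitrary level cap, not the exact HNSW level-assignment formula:

```python
import random

def sample_level(p=0.5, max_level=16):
    """Promote a node to the next higher layer with probability p, skip-list style."""
    level = 0
    while random.random() < p and level < max_level:
        level += 1
    return level

# Roughly half the nodes stop at layer 0, a quarter reach layer 1, and so on.
levels = [sample_level() for _ in range(100_000)]
print([levels.count(l) for l in range(5)])
```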
The HNSW algorithm organizes connections into distinct hierarchical levels based on their lengths, enabling efficient multi-layer graph traversal. This hierarchical architecture ensures that each node maintains a constant number of connections, independent of the overall network size. As depicted in Figure 3, the search process initiates from the topmost layer, which contains only the longest-range connections.
The search algorithm employs a greedy traversal strategy at each layer until reaching a local minimum. The process then transitions to the next lower layer, where the search restarts from the previously identified local minimum. This iterative procedure continues until the bottom layer is reached. In Figure 3, the search begins at a top-layer element, with red arrows indicating the traversal path generated by the greedy algorithm from the entry point to the query location. The complete hierarchical search procedure is formally described in Algorithm 2. In our scheme, the HNSW structure is used to retrieve neighborhood information that supplements the candidate set and improves query accuracy.
Algorithm 2 ANN queries on NNGs with HNSW structures
Input: graph structure of HNSW: hnsw, the element to be searched: q, the number of nearest neighbors to be returned: k, the size of the dynamic candidate list: ef. Output: the k elements closest to q.
1: W ← ∅ //Create an empty set to store the closest elements found so far
2: ep ← the entry point of hnsw //Obtain the entry point of the HNSW graph
3: L ← the layer of ep //Obtain the number of the layer where the entry point is located
4: for lc ← L down to 1 do
5:   W ← search layer lc from ep (Algorithm 1) //Find the nearest ef elements on the current layer lc
6:   ep ← the element of W closest to q //Update the entry point to the element in W that is closest to q
7: end for
8: W ← search layer 0 from ep (Algorithm 1) //Find the nearest ef elements on the bottom layer
9: return the k elements of W closest to q
3.2.3. Secure Nearest Neighbor Query Framework
In this paper, the data owner stores ciphertext data along with secure indexes on the cloud server. The cloud server responds to query tasks from the LLMs, and the retrieved data is ultimately decrypted at the LLM user's end. Within this cloud service model, there exist potential threats from malicious or semi-honest servers that may attack the data. Malicious servers may compromise data integrity and security through means such as query result modification, data tampering, or denial of service. Semi-honest servers, while not actively attacking data, may analyze transmitted data during processing or exploit access privileges to snoop on data, attempting to infer sensitive information or identify special patterns to serve their own interests or third-party demands. This paper assumes the server to be semi-honest, necessitating measures to restrict the server's data access and utilization to prevent it from obtaining private information from either the dataset or the queries. Therefore, this study employs encryption techniques to protect plaintext data, combined with a greedy subset partitioning approach, which effectively prevents information leakage and satisfies IND-CPA security.
The SANN scheme based on NNGs involves three main participants: the data owner, the user, and the server:
The Data Owner processes the original dataset and partitions it into subsets, then outsources the encrypted data to the cloud server for management.
The User uses LLMs to submit a query request and interacts once with the server to retrieve a candidate set.
The Server stores the encrypted dataset and upon receiving a query request, searches for the corresponding subset based on the query.
The secret key and the secure index table are assumed to be transmitted to the user via a secure channel and stored locally. As illustrated in Figure 4, during the query process, the LLM interacts once with the cloud server. The query token t is generated from the local index rather than the original query vector q, thereby preventing direct exposure of the query data during transmission. The cloud server retrieves the labels of encrypted subsets according to t to identify the candidate set S and returns the encrypted candidate set to the LLM. The LLM user, who possesses the decryption key, decrypts the returned results to obtain the candidate set. Finally, the user performs a local sequential search on the small-scale candidate set to determine the approximate k-nearest neighbors.
3.3. Symmetric Encryption
To ensure data security and privacy, we employ symmetric encryption for data protection. Symmetric encryption utilizes the same key for both encryption and decryption operations. This method is characterized by its simplicity, strong security, high-speed processing, and low computational overhead, making it an efficient cryptographic approach. Owing to its minimal key management requirements and ease of implementation and maintenance, symmetric encryption has been widely adopted in applications such as data transmission and storage, establishing itself as a fundamental component in the field of information security.
The Advanced Encryption Standard (AES) algorithm employed in this study is a block cipher: it divides plaintext into fixed-size blocks and encrypts each block in sequence. In Cipher Block Chaining (CBC) mode, each plaintext block undergoes an XOR operation with the preceding ciphertext block before encryption. This chaining ensures that each ciphertext block depends on the previous one, thereby enhancing randomness and security.
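As a minimal sketch of this step, assuming the PyCryptodome library, each serialized subset can be encrypted under AES-CBC with a fresh random IV prepended to the ciphertext:

```python
from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes
from Crypto.Util.Padding import pad, unpad

def encrypt_subset(sk: bytes, plaintext: bytes) -> bytes:
    """AES-CBC encryption with a fresh IV prepended to the ciphertext."""
    iv = get_random_bytes(AES.block_size)
    cipher = AES.new(sk, AES.MODE_CBC, iv)
    return iv + cipher.encrypt(pad(plaintext, AES.block_size))

def decrypt_subset(sk: bytes, blob: bytes) -> bytes:
    """Split off the IV, decrypt, and strip the padding."""
    iv, ct = blob[:AES.block_size], blob[AES.block_size:]
    cipher = AES.new(sk, AES.MODE_CBC, iv)
    return unpad(cipher.decrypt(ct), AES.block_size)

sk = get_random_bytes(16)  # 128-bit AES key
blob = encrypt_subset(sk, b"serialized subset data")
assert decrypt_subset(sk, blob) == b"serialized subset data"
```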
While the inherent security strength of the algorithm influences the safety of symmetric encryption, its algorithms are typically public, and the same key is used for both encryption and decryption. Thus, the security primarily relies on the confidentiality of the symmetric key. To safeguard data during transmission, establishing a secure key management mechanism is essential. Additionally, regular key updates are crucial to effectively reduce key lifespan and mitigate the risk of key exposure.
In multi-user scenarios, the server must generate and securely store distinct keys for each user. As the number of users grows, the volume of keys managed by the server increases accordingly. To address this challenge, the server must implement a secure and reliable key storage system to ensure the safe management of each user’s key. Furthermore, the server must handle key distribution and updates efficiently, guaranteeing that all users receive the latest keys in a timely manner.
4. NNG-SANN
As outlined in Section 3, while LSH-based ANN schemes can significantly improve query efficiency, they often suffer from either low result accuracy or excessive memory consumption. To enhance query accuracy, this paper proposes a novel scheme that incorporates NNGs to assist in index construction while employing symmetric cryptographic techniques to ensure the storage security of the original data. The subsequent sections describe the scheme in detail: Section 4.1 presents formal definitions of the key components, Section 4.2 elaborates on the data subset partitioning strategy and index construction algorithm, and Section 4.3 details the query algorithm.
4.1. Framework
The NNG-based SANN scheme primarily consists of the following four phases:
Preparation: Select a dataset D and a secret key sk. We denote D_i as the i-th subset of D, and each subset is associated with a uniquely generated random tag t_i. Construct an index table I to store the mapping (D_i, t_i) between subsets and their corresponding tags. Encrypt the subsets and store them on the server in the form of (t_i, E(D_i)). For ease of illustration, we use the notations in Table 1.
Query T generated by LLMs: For a given query q, locate the corresponding subset tag according to the index table I, then generate a query request T.
Secure query on server: Upon receiving the query request T from the user, the server returns the encrypted candidate set E(S).
Local search: Decrypt the returned E(S) using key sk to obtain the plaintext candidate set S, then perform a sequential kNN search locally on S.
4.2. Design of Index
During the offline preparation phase, the data owner should generate an index table I and multiple encrypted subsets. To achieve this objective, we designed an index construction algorithm based on LSH and NNG, which theoretically guarantees a certain level of accuracy for ANN queries. Furthermore, to prevent data explosion and ensure data security, a greedy partitioning method is employed to divide the entire dataset.
4.2.1. Rapid Positioning Based on LSH
As shown in Formula (1), a hash function maps a d-dimensional vector to an integer-valued hash space. As indicated by the properties of LSH, similar data points have a high probability of being mapped to the same hash bucket after computation. By increasing the number of hash functions used, the probability of similar data being projected into identical hash buckets can be further improved. However, as more hash functions are employed, the number of generated hash mappings also increases, which severely consumes memory space and reduces query efficiency while increasing the number of I/O operations.
Therefore, in the proposed scheme, we first apply a single LSH function to perform hash computations on all data. Each vector is assigned a corresponding hash value, and each hash bucket stores all data points mapped to that value. Algorithm 3 illustrates the generation method of the hash mapping table. Additionally, in Section 4.4, a method utilizing two hash functions is introduced to enhance the accuracy of query results.
Algorithm 3 Generation of the hash mapping table
Input: dataset: D, a parameter that controls the width of the interval in the LSH function: w. Output: hash bucket mapping index, where each bucket stores all the points mapped to that hash value: hash_index.
1: D ← normalize(D) //Select a normalization method according to the characteristics of the dataset
2: w ← user-defined interval width
3: r ← a random number drawn uniformly from [0, w) //The offset of the LSH function is set to a random number
4: a ← a random d-dimensional vector drawn from N(0, 1)
5: for each v ∈ D do
6:   hv ← ⌊(a · v + r) / w⌋ //Calculate the hash value for each vector of the dataset in sequence
7:   hash_index[hv] ← hash_index[hv] ∪ {v} //Add the data to the corresponding hash bucket (creating the bucket if it does not exist)
8: end for
To improve the performance of the LSH algorithm, the dataset needs to be normalized before hash computation. The min-max normalization and Z-score normalization methods are primarily employed in this paper, with the choice of normalization technique determined by the specific characteristics of each dataset. Min-max normalization linearly scales the data to a specified range, typically [0, 1], and is calculated as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where x' is the standardized data, x is the original data, x_max is the maximum value of all elements in the dataset, and x_min is the minimum value of all elements in the dataset. Z-score normalization transforms the data into a standard normal distribution with zero mean and unit standard deviation. This processing ensures that the data exhibits comparable scales across different dimensions. The calculation formula is as follows:

$$x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$

where μ_j is the mean value of column j, and σ_j is the standard deviation of column j. After normalization, the dataset demonstrates enhanced stability and reliability during LSH computation, thereby ensuring the accuracy and efficiency of the algorithm.
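Both normalizations are straightforward in NumPy; the sketch below follows the formulas above (global extremes for min-max, column statistics for Z-score), with a small epsilon guard added as an implementation detail not specified in the paper:

```python
import numpy as np

def min_max_normalize(X):
    """Linearly scale all elements of X into [0, 1] using the global extremes."""
    lo, hi = X.min(), X.max()
    return (X - lo) / max(hi - lo, 1e-12)

def z_score_normalize(X):
    """Shift each column of X to zero mean and unit standard deviation."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / np.maximum(sigma, 1e-12)
```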
4.2.2. Generation of the NNG
The algorithms introduced in Section 3.2 concern the construction and querying of NNGs. When only a small number of hash functions are employed for mapping, the collision probability in Equation (2) implies a non-negligible chance that similar points fail to collide. This leads to a higher rate of false negatives during queries, where originally similar data points are erroneously mapped to different hash buckets and subsequently filtered out, thereby compromising query accuracy.
To address this issue, the structural information of the NNG can be leveraged to augment the data within each hash bucket. Specifically, if a neighbor of the query vector q is found in a particular hash bucket, other data points in the same bucket can be assumed to be potential neighbors of q as well. Thus, during the retrieval of bucket entries, this neighboring point can serve as a reference to additionally fetch a set of nearby nodes, expanding the candidate pool for the query. In other words, the neighborhood relationships encoded in the NNG can provide supplementary candidate nodes, compensating for potential inaccuracies introduced by the LSH mapping process. This strategy enhances both the accuracy and recall rate of query results while mitigating the probability of false negatives caused by hash collisions.
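A sketch of this supplementation step follows; hash_index and nng are placeholders for the hash mapping table and precomputed neighbor lists built in Section 4.2, with points referenced by integer IDs:

```python
def augmented_candidates(q_hash, hash_index, nng):
    """Expand a hash bucket with each member's precomputed NNG neighbors."""
    bucket = hash_index.get(q_hash, [])   # point IDs mapped to this hash value
    candidates = set(bucket)
    for point_id in bucket:
        candidates.update(nng[point_id])  # supplement with the point's NNG neighbors
    return candidates
```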
The NNG structure of HNSW is illustrated in Figure 3. In the proposed method, we first construct an HNSW graph on the normalized dataset. The bottom layer of this graph structure stores the original data information and neighborhood relationships of all nodes, enabling direct and efficient retrieval of neighboring nodes for each vector. However, due to the uncertainty in the number of connections, the distribution of neighboring nodes may become highly imbalanced. Moreover, the heuristic neighbor selection method may introduce long-distance connections to certain nodes to ensure global connectivity, so the nodes connected at the bottom layer are not necessarily the m nearest neighbors of a given point.
To address these limitations, after constructing the HNSW graph, our approach employs the NNG search algorithm (Algorithm 4) to identify the m nearest neighbors for each vector. Thanks to the hierarchical structure's efficient search capability and the heuristic algorithm's ability to overcome the local-optimum limitation inherent in traditional NNGs, Algorithm 4 achieves highly accurate results. This significantly enriches the candidate set, thereby enhancing the overall robustness of the proposed solution.
Algorithm 4 Search the neighbor points
Input: dataset: D, the number of nearest neighbors to be selected: M. Output: neighbor information list: G
1: hnsw ← an empty HNSW structure //Initialize the HNSW structure
2: for each v ∈ D do
3:   insert v into hnsw //Insert nodes in sequence to create the HNSW structure
4: end for
5: for each v ∈ D do
6:   G[v] ← the M nearest neighbors of v found in hnsw //Find M nearest neighbors for each node via Algorithm 2
7: end for
4.2.3. Greedy Division of Subsets
Based on the aforementioned construction schemes of LSH and NNG, the indexing structure and greedy subset partitioning method for the proposed approach are deigned carefully. As illustrated in
Figure 5, Algorithm 3 is employed to perform LSH computation for each vector in the dataset and construct a hash mapping table. To enhance query accuracy, each data point is supplemented with m neighbor nodes to form a neighbor list. These data points are then sorted in descending order based on their hash values, followed by subset partitioning within the range from the minimum to maximum hash values. The partitioning principles and corresponding analysis are detailed as follows:
1. All data points sharing identical hash values, along with their neighbor nodes, must be aggregated into the same subset while minimizing the total number of subsets. This reduces communication overhead and ensures minimal data retrieval, thereby enhancing query efficiency and response time.
2. Subset sizes should be balanced as uniformly as possible to prevent data explosion while providing security guarantees. This is achieved by maintaining indistinguishable encrypted subsets of identical sizes on the server.
3. Each subset maintains independent and unique upper/lower bounds to prevent query failures caused by overlapping index ranges across multiple subsets, ensuring the feasibility of local indexing.
According to principle 2, to maintain balanced subset sizes (i.e., aggregating an equal number of vector data points into each subset), it is necessary to maximally merge hash buckets before selectively supplementing subsets with partial vector data from subsequent hash buckets. This ensures uniform subset sizes. However, a special case may occur when processing the final subset: after the merging and filling procedures, the last subset may contain an insufficient number of data points. In such cases, partial data from the preceding subset can be redistributed to achieve equal size across all subsets. As specified in principle 3, for each partition, the upper and lower bounds of the current subset's hash range, encompassing all data points within its hash buckets, serve as the partition's index bounds. Furthermore, each data subset is assigned a unique identifier t, and the mapping (t, [lb, ub]) between the identifier and the subset's hash bounds is incorporated into the index structure I. The complete algorithm is presented in Algorithm 5.
Algorithm 5 Greedy division
Input: dataset: D, hash table: hash_index, hash value set: hash_value. Output: the data of the subsets: SD, the table of indexes: I
1: sort hash_value in ascending order //Arrange the buckets in ascending order according to the hash value
2: size ← |D| / nop //Dynamically set the subset size
3: SD_0 ← ∅
4: i ← 0
5: while hash_value is not exhausted do
6:   if SD_i = ∅ then
7:     lb_i ← the next unprocessed hash value //The lower bound of the current subset
8:   end if
9:   append the points of the next bucket, together with their neighbor lists, to SD_i
10:  if |SD_i| ≥ size then
11:    ub_i ← the hash value just processed //The upper bound of the current subset
12:    SD_{i+1} ← the elements of SD_i beyond position size
13:    SD_i ← the first size elements of SD_i //Remove the redundant elements of the current subset
14:    t_i ← a randomly generated label //Randomly generate labels
15:    I ← I ∪ {(t_i, [lb_i, ub_i])}
16:    i ← i + 1
17:  end if
18: end while
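For concreteness, the following Python sketch mirrors the greedy division under simplifying assumptions: neighbor supplementation and last-subset rebalancing are elided, and boundary handling is simplified relative to principle 3.

```python
import secrets

def greedy_division(hash_index, nop):
    """Split hash buckets (in ascending hash order) into nop tagged, equal-sized subsets."""
    values = sorted(hash_index)                      # hash values in ascending order
    total = sum(len(hash_index[h]) for h in values)
    size = total // nop                              # dynamically set subset size
    subsets, index = {}, {}
    current, lb = [], values[0]
    for h in values:
        current.extend(hash_index[h])                # merge whole buckets in order
        if len(current) >= size:
            tag = secrets.token_hex(8)               # randomly generated label
            subsets[tag] = current[:size]
            index[tag] = (lb, h)                     # hash range covered by this subset
            current = current[size:]                 # overflow opens the next subset
            lb = h
    return subsets, index
```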
4.3. Secure Query
The data owner processes the original dataset by generating a symmetric key sk and a hash function h, and configures the parameters of the hash function. Using the hash mapping table from Algorithm 3 and the greedy division in Algorithm 5, the data subsets are partitioned and an index table I is constructed. The index table stores the hash value range (minimum and maximum) of each subset along with the unique label assigned to it. Subsequently, the data owner employs the AES encryption algorithm with the key sk to encrypt each data subset D_i into its ciphertext form E(D_i). Each encrypted subset is then associated with its corresponding label t_i and uploaded to the cloud server. To ensure secure querying, the data owner transmits the symmetric key sk, the hash function h, and the index table I to authorized query users through a secure channel. This prevents the cloud server from obtaining the key and decrypting the encrypted data subsets.
To retrieve the k-nearest neighbors of a query vector q, the system first computes its hash value h(q) using the predefined hash function h. The corresponding data subset is then located in the local index through Algorithm 6 by sequentially comparing h(q) with the hash ranges in the index table. When h(q) falls within a specific subset's hash range, the associated subset tag is added to the query request T, along with the tags of its immediately adjacent subsets to enhance query accuracy. For cases where h(q) lies between two subsets' hash ranges, both neighboring subset tags are recorded. Boundary conditions are handled by selecting the first subset's tag when h(q) is below the minimum hash value, or the last subset's tag when h(q) exceeds the maximum value. This comprehensive approach effectively mitigates interval selection bias and significantly improves query precision.
On the server side, the encrypted subset data are stored in the form of (t_i, E(D_i)). Upon receiving a query request T containing the required subset tags from users, the server retrieves the corresponding encrypted data files. In our experimental setup, each encrypted subset is stored as an individual file named according to its assigned tag, enabling efficient retrieval. This design allows the server to simply fetch and transmit the encrypted candidate set E(S) to the user based on the tags specified in query T, while maintaining data confidentiality throughout the process.
Upon receiving the encrypted candidate set E(S), the user decrypts it using the pre-shared symmetric key sk obtained from the data owner, thereby obtaining the plaintext candidate set S. The user then performs a local sequential search on S by computing the distance between each candidate element and the query vector q. All candidate elements are sorted in ascending order of their calculated distances, from which the top-k elements are selected as the final query results. This process ensures both data confidentiality during transmission and accurate k-nearest neighbor retrieval through local computation.
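The local search itself is a plain sequential scan; a minimal NumPy sketch, where candidates is the decrypted candidate set as a 2-D array:

```python
import numpy as np

def local_knn(candidates, q, k):
    """Sequential k-NN over the decrypted candidate set."""
    C = np.asarray(candidates)
    dists = np.linalg.norm(C - q, axis=1)  # Euclidean distance to the query
    order = np.argsort(dists)[:k]          # ascending distance, keep the top-k
    return C[order], dists[order]
```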
Algorithm 6 Query request generation
Input: query vector: q, index table: I, subset quantity: nop. Output: query request: T
1: for i ← 0 to nop − 1 do
2:   H[i] ← I[i] //Read the hash ranges and tags from the index table into H
3: end for
4: hq ← h(q) //Calculate the hash value of the query vector
5: for i ← 0 to nop − 1 do
6:   if lb_i ≤ hq ≤ ub_i then //hq falls within the range of subset i
7:     T ← T ∪ {t_i} together with the tags of the adjacent subsets
8:   end if
9:   if ub_i < hq < lb_{i+1} then //hq lies between subsets i and i + 1
10:    T ← T ∪ {t_i, t_{i+1}}
11:  end if
12:  if hq < lb_0 then //hq is below the minimum hash value
13:    T ← T ∪ {t_0}
14:  end if
15:  if hq > ub_{nop−1} then //hq exceeds the maximum hash value
16:    T ← T ∪ {t_{nop−1}}
17:  end if
18: end for
4.4. Optimization
To further enhance the query accuracy of the algorithm, we employ two LSH functions (as shown in Equation (1)) to construct two index tables. During query processing, the intersection of the candidate sets retrieved from both index tables is selected as the final candidate set. This approach reduces the likelihood of false positive points being included in the candidate set, because the two distinct hash functions exhibit a degree of independence in their hash value computations. Consequently, the probability of data point overlap between the two hash buckets decreases, thereby minimizing the risk of misjudgment. Additionally, this method improves query stability and reduces dependency on parameter selection and hash function choices.
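A sketch of the two-table lookup: retrieve_candidates is a placeholder for the bucket lookup plus NNG supplementation described in Section 4.2, applied to each index table independently.

```python
def query_two_tables(q, table1, table2, retrieve_candidates):
    """Intersect the candidate sets from two independent LSH index tables."""
    s1 = retrieve_candidates(q, table1)  # candidates under the first LSH function
    s2 = retrieve_candidates(q, table2)  # candidates under the second LSH function
    return s1 & s2                       # the intersection filters false positives
```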
The data owner must perform two separate hash function computations on the dataset while still leveraging the nearest neighbor relationships in the NNG to supplement candidate nodes. This process generates index tables and partitions data subsets, enhancing query accuracy and reducing the size of the final candidate set for sequential search, thus improving local search efficiency. However, storing multiple index tables increases spatial overhead, and reading multiple data groups to compute their intersection incurs additional I/O operations and time costs. To balance efficiency and accuracy, we limit the implementation to only two LSH functions, ensuring improved query precision while maintaining computational performance.
7. Conclusions
The secure nearest neighbor query protects data privacy in retrieval tasks, particularly for LLMs. Existing methods face performance degradation in high-dimensional spaces due to the curse of dimensionality. While LSH alleviates this issue by trading precision for efficiency, it often yields false positives when only a limited number of hash functions are used. To address this, we propose a secure query method combining LSH with NNGs. The NNG supplements potential false negatives in hash buckets, improving candidate quality. Data are encrypted using AES before being outsourced to semi-honest servers. During queries, the user retrieves a small candidate set in one interaction and performs a local search for the approximate k-nearest neighbors. Experiments show that neighbor supplementation enhances accuracy and recall, especially for small k. For larger k, adding a second LSH function improves precision with minimal overhead. This approach balances efficiency and security in privacy-preserving retrieval.
The proposed scheme achieves efficient nearest-neighbor retrieval but has two limitations: (1) While increasing LSH functions enhances accuracy for larger queries, it incurs higher computational costs, suggesting future work on optimized LSH variants to reduce false positives. (2) Although HNSW graphs enable fast searches, their memory-intensive construction warrants investigation of more compact graph structures to maintain performance while lowering spatial overhead.