1. Introduction
Large language models (LLMs) have become indispensable in numerous domains, significantly advancing applications in data management, mining, and analysis. Efficient retrieval of massive and heterogeneous data serves as the foundation for LLM training and deployment. Taking advantage of rapid advances in computing, storage, and communication technologies, the explosive growth of Internet data has provided strong momentum for the continuous development of LLMs. However, such large-scale, structurally diverse and complex-origin Internet data pose significant challenges to data management systems supporting LLM training and applications. On the one hand, traditional data retrieval techniques have become inadequate, urgently requiring new technologies and tools for data management and analysis. On the other hand, privacy leakage issues in LLMs’ usage of such data demand immediate solutions.
In terms of data retrieval, the k-nearest neighbor (kNN) search [1] represents one of the most commonly used techniques in LLMs [2,3,4], aiming to find the k elements closest to a given query from a dataset. For example, by modeling images as high-dimensional vector data, kNN enables image recognition and classification. However, as data scales grow, traditional exact kNN methods face challenges of high computational complexity and substantial storage requirements, leading to the emergence of approximate nearest neighbor (ANN) search as an effective alternative. For example, in corpora containing millions or even billions of text snippets, identifying the text most similar to a user query for reference is essential. Traditional exact nearest neighbor search methods, which compute the distance between the input and every snippet, become prohibitively expensive at this scale. Among ANN techniques, locality-sensitive hashing (LSH) has become one of the most widely used. LSH maps data points into hash buckets, ensuring that similar points have a higher probability of being mapped to the same bucket, thereby partially mitigating the “curse of dimensionality.” Nevertheless, despite the strong performance of LSH in approximate nearest neighbor searches over large-scale high-dimensional data, it still suffers from limited query accuracy.
Regarding secure data storage and management, cloud computing, a computing paradigm enabled by Internet technology, utilizes the Internet as its foundational infrastructure for connectivity and interaction [5]. This allows users to access and leverage remote computational resources without local ownership or maintenance, providing powerful computing capabilities and massive storage capacity. Consequently, data owners typically entrust cloud servers with managing large-scale high-dimensional data. When LLM users need to perform neighbor queries, they simply send requests to the cloud and await results, significantly reducing local storage and computational costs. However, with growing concerns about data security and privacy across sectors, protecting user data has become a critical challenge in data retrieval and analysis. Privacy protection measures are essential during data retrieval to prevent unpredictable malicious cyber attacks from compromising cloud-based data security [6]. Typically, data must be stored in encrypted form on the cloud to prevent attackers from inferring plaintext through user queries. Researchers have proposed various encrypted data retrieval schemes, including homomorphic encryption, searchable encryption, and secure multiparty computation. However, these methods often suffer from excessive computational complexity, reduced query efficiency, or high memory overhead. Therefore, developing appropriate secure nearest neighbor query schemes has become an urgent challenge, requiring a careful balance among retrieval efficiency, query accuracy, and data security.
The nearest neighbor graph (NNG), one of the fundamental tools for high-dimensional vector data retrieval, demonstrates exceptional performance in storing neighborhood information for large-scale datasets. Compared with LSH techniques, NNGs can provide query results that more closely approximate exact nearest neighbors. Building upon this advantage, we propose an NNG-based Secure Approximate Nearest Neighbor (NNG-SANN) scheme that integrates an NNG with LSH technology. Our approach makes three key contributions:
High-accuracy secure indexing: We innovatively combine LSH with NNG technology to construct a secure index structure for high-dimensional vector data. This design significantly improves query accuracy while maintaining retrieval efficiency.
Refined greedy partitioning method: We design a greedy subset partitioning method that groups data points sharing a hash value, together with their NNG neighbors, into equal-sized subsets, preventing data explosion and ensuring that the encrypted subsets stored on the server remain indistinguishable.
Extensive experiments on both real-world and synthetic datasets demonstrate that our scheme achieves statistically significant improvements in secure ANN query accuracy compared to state-of-the-art SANN methods.
The rest of the paper is organized as follows: Section 2 discusses related work on NN, ANN, and SANN queries. Section 3 describes the preliminaries of the NNG-SANN scheme, and Section 4 presents the NNG-SANN model. Section 5 provides the theoretical analysis of NNG-SANN, Section 6 reports the experiments, and Section 7 concludes the paper.
2. Related Works
The field of nearest neighbor search has witnessed the emergence of various solutions, including both exact and ANN search methods, to meet diverse application requirements for data scale and precision. To further enhance data privacy and security, research on secure nearest neighbor search continues to advance and explore new frontiers. Below we present an overview of relevant methodologies.
2.1. Nearest Neighbor Query
Numerous studies have employed tree-based structures to design indexing methods for high-dimensional vector data to achieve exact kNN queries. Bentley [7] proposed the k-d tree algorithm, which recursively partitions the k-dimensional space into hyper-rectangular regions by splitting the space into left and right subspaces based on data point distribution, forming a binary search tree structure. This approach proves effective for space partitioning and search in low-dimensional spaces by eliminating unnecessary subspaces to reduce the search scope. Guttman [8] developed the R-tree algorithm, which divides the data space into distinct polygonal regions and organizes adjacent regions into a tree structure. Through dynamic adjustments to accommodate data variations, this method enables efficient retrieval of high-dimensional data. Chakrabarti and Mehrotra [9] introduced a hybrid tree structure that combines the advantages of different tree types to address challenges in exact queries over high-dimensional spaces, dynamically selecting indexing strategies according to dataset characteristics. Additionally, the VA-File [10] reduces disk I/O costs through quantization compression and approximate storage, though this approximation inevitably sacrifices some data fidelity, leading to compromised retrieval accuracy.
While these solutions demonstrate satisfactory performance for nearest neighbor retrieval in low-dimensional vector spaces, their effectiveness deteriorates significantly when applied to high-dimensional data. This performance degradation stems from the inherent data sparsity in high-dimensional spaces, which renders tree structures ineffective in evaluating inter-point distances or similarities, ultimately causing index failure.
2.2. Approximate Nearest Neighbor Query
The nearest neighbor search problem in LLMs faces two major challenges: first, the increasing dimensionality of data vectors complicates similarity calculations; second, constructing efficient index structures for large-scale high-dimensional vector datasets is difficult. To address these difficulties, approximate k-nearest neighbor (K-ANN) search has been proposed, which accepts a certain reduction in query accuracy in exchange for improved search efficiency. In recent years, numerous solutions have emerged, with common K-ANN methods primarily including product quantization (PQ)-based approaches, LSH-based methods, and NNG-based techniques.
Jégou et al. [11] employed product quantization to compute asymmetric distances for obtaining approximate nearest neighbors. Ge et al. subsequently proposed optimized product quantization (OPQ) [12] to reduce quantization errors, although significant error increases may occur with highly imbalanced dataset distributions. Indyk introduced LSH, which maps high-dimensional vectors into compact hash values. Datar et al. developed E2LSH [13], the first application of LSH in Euclidean space, albeit with substantial storage requirements. To address this limitation, entropy-based LSH [14] and multi-probe LSH [15] were proposed to retrieve more candidate points from a single hash table, thereby reducing the number of required hash tables. Gan et al. presented C2LSH [16], which adopts a dynamic collision counting approach to measure inter-point similarity. For graph-based approaches, Malkov et al. proposed the navigable small world (NSW) graph [17] and its enhanced version, the hierarchical navigable small world (HNSW) graph [18], which demonstrate high efficiency for large-scale dataset searches but suffer from considerable memory consumption during graph index construction.
Among the aforementioned methods, LSH-based approaches demonstrate favorable performance and relatively low computational overhead for ANN searches in high-dimensional vector spaces. However, these methods often struggle to guarantee query accuracy. In contrast, graph-based methods achieve higher query precision but require substantial computational resources for index construction. Current solutions still face challenges in providing both efficient and highly accurate ANN searches for high-dimensional vector data.
2.3. Secure Approximate Nearest Neighbor Query
Secure nearest neighbor search has emerged as one of the prominent techniques for ensuring data privacy in LLMs. Searchable encryption, a core technology in this domain, enables computation and query operations on ciphertext data, making it particularly suitable for meeting privacy protection requirements in data analysis and computational tasks of LLMs.
Khoshgozaran and Shahabi [19] proposed a secure index structure based on order-preserving encryption (OPE), which supports both range queries and nearest neighbor searches on encrypted data. Demertzis et al. [20] developed a secure indexing approach utilizing Paillier homomorphic encryption, allowing additive homomorphic operations on ciphertext. Peng et al. introduced the LS-RQ scheme [21], which reduces communication frequency between users and cloud servers while improving query efficiency.
In summary, index-based query schemes demonstrate significant advantages in the field of secure high-dimensional vector data retrieval. The design of efficient, accurate, and secure query methods remains an actively researched direction with broad application prospects in practical scenarios.
3. Preliminaries
In this section, we describe the preliminaries of the NNG-SANN scheme in detail.
3.1. Locality-Sensitive Hashing
LSH is a widely used ANN technique that enables efficient search for data points most similar to a query vector in large datasets. The fundamental principle of LSH involves partitioning the data space into multiple segments and assigning each segment to a hash bucket, thereby ensuring that similar data points have a high probability of being mapped to the same bucket. This mechanism significantly enhances the efficiency of nearest neighbor searches. The LSH function employed in our scheme is calculated as follows:

$$h_{\mathbf{a},r}(\mathbf{v}) = \left\lfloor \frac{\mathbf{a} \cdot \mathbf{v} + r}{w} \right\rfloor \qquad (1)$$

where v is an arbitrary vector of the original data, v ∈ R^d; a is a randomly selected d-dimensional vector whose components follow the standard normal distribution N(0, 1); and r is a random number, r ∈ [0, w). Specifically, the inner product a · v projects the original d-dimensional vector v onto a randomly selected direction, producing a scalar value. Then, in order to control the range of the hash value, a fixed value w defined by the user is introduced; w is a scaling factor that adjusts the interval width so that the hash value falls within an appropriate range. Meanwhile, the random value r, selected uniformly from [0, w), randomly offsets the hash value.
The above hash function maps a d-dimensional vector to a single integer, h: R^d → N, where N is the set of natural numbers. Given two distance thresholds R1 and R2 (R2 > R1) and two probability parameters P1 and P2 (P1 > P2), a hash function family satisfying the following properties is called (R1, R2, P1, P2) LSH:

If ||u − v|| ≤ R1, then Pr[h(u) = h(v)] ≥ P1.

If ||u − v|| ≥ R2, then Pr[h(u) = h(v)] ≤ P2.

For the function in Equation (1), the collision probability of two vectors u and v at distance c = ||u − v|| is

$$p(c) = \Pr\big[h(\mathbf{u}) = h(\mathbf{v})\big] = \int_{0}^{w} \frac{1}{c}\, f\!\left(\frac{t}{c}\right)\left(1 - \frac{t}{w}\right)\mathrm{d}t \qquad (2)$$

where f denotes the probability density function of the absolute value of the standard normal distribution. The parameters P1 and P2 are then obtained as P1 = p(R1) and P2 = p(R2).
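To make the mapping in Equation (1) concrete, the following is a minimal NumPy sketch of one such hash function; the function name and parameter values (d = 128, w = 4.0) are illustrative choices, not part of the scheme's specification.

```python
import numpy as np

def make_lsh(d, w, rng):
    """Sample one (a, r) pair and return the hash function of Equation (1)."""
    a = rng.standard_normal(d)   # random projection direction, components ~ N(0, 1)
    r = rng.uniform(0, w)        # random offset drawn uniformly from [0, w)
    def h(v):
        return int(np.floor((a @ v + r) / w))  # floor((a . v + r) / w)
    return h

rng = np.random.default_rng(0)
h = make_lsh(d=128, w=4.0, rng=rng)
v = rng.standard_normal(128)
u = v + 0.01 * rng.standard_normal(128)  # a point very close to v
print(h(v), h(u))  # close points collide with high probability
```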
3.2. Nearest Neighbor Graph
A neighborhood graph is utilized to represent proximity relationships among data points. In such a graph, each node is connected to its nearest neighbor nodes. Each node corresponds to a data point in the dataset, while edges indicate similarity or distance between data points. Typically, connections between nodes are determined based on a specific distance metric. In this work, Euclidean distance is adopted for computation. ANN search aims to efficiently identify the approximate nearest neighbors for each point in a dataset, and neighborhood graphs can be employed to accelerate this search process. Below, we introduce two currently prevalent high-efficiency neighborhood graph structures.
3.2.1. NSW
NSW is an efficient graph-based data structure that exhibits properties of small-world networks while maintaining strong navigability. It enables rapid node localization in large-scale networks, demonstrating exceptional performance in efficient ANN search.
The NSW structure is specifically designed to possess small-world network characteristics, meaning that any two nodes in the graph are typically connected by a short path. This property ensures efficient local search operations, allowing neighboring nodes to be identified effectively. Moreover, the NSW incorporates long-range connections between nodes, significantly enhancing navigation efficiency. During ANN search, this facilitates rapid traversal across the graph to locate potential candidates. In Figure 1, the arrows illustrate an example path obtained via a greedy algorithm from an entry point to a query point.
When sequentially inserting new points into the NSW graph, a naive search method can be employed to identify nearest neighbors for each incoming point. Specifically, for each point to be inserted, the algorithm first computes its distances to existing nodes in the graph, then selects m nearest nodes as its immediate neighbors. The new point is subsequently connected to these m neighbors, thereby expanding the graph’s topological structure. This incremental process progressively enriches the NSW graph by establishing local neighborhood relationships among nodes, which fundamentally supports subsequent nearest neighbor search operations.
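As a minimal illustration of this naive insertion procedure (not the authors' implementation), the following Python sketch connects each new point to its m nearest existing nodes by brute-force distance computation:

```python
import numpy as np

def nsw_insert(graph, points, new_point, m):
    """Connect a new point to its m nearest existing nodes (naive NSW insertion)."""
    idx = len(points)
    if points:
        # Compute distances from the new point to all existing nodes.
        dists = [np.linalg.norm(np.asarray(p) - np.asarray(new_point)) for p in points]
        nearest = np.argsort(dists)[:m]        # indices of the m closest nodes
        graph[idx] = set(int(n) for n in nearest)
        for n in graph[idx]:                   # edges are bidirectional
            graph[n].add(idx)
    else:
        graph[idx] = set()                     # first node has no neighbors yet
    points.append(new_point)
```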
Following the construction of the NSW structure and node insertion, an efficient method becomes essential for retrieving m nearest neighbors of a given query point. This step proves critical as it directly determines both the search speed and accuracy within the NSW framework. Algorithm 1 presents the formal neighbor search procedure for NSW graphs, demonstrating how rapid and reliable nearest neighbor queries can be achieved through this optimized approach.
Algorithm 1 ANN queries on NNGs
Input: graph structure: G, the element to be searched: q, the number of nearest neighbors to be searched: m, the entry node: ep. Output: the m nearest neighbors of q: S
1: C ← {ep} //Initialize the dynamic candidate list
2: W ← C //The shadow list of C
3: V ← {ep} //A collection used to record the visited nodes
4: compute dist(ep, q) //Calculate the distance from the query point
5: while C ≠ ∅ do
6:   f ← the element of W furthest from q //Update the current furthest result
7:   c ← the element of C nearest to q, removed from C //Take points in sequence from C
8:   if dist(c, q) ≤ dist(f, q) then
9:     N ← the neighbors of c in G //Find the neighbor points of point c
10:    for each e ∈ N do
11:      if e ∉ V then
12:        V ← V ∪ {e}
13:        C ← C ∪ {e}, W ← W ∪ {e}
14:      end if
15:    end for
16:  end if
17: end while
18: sort W //Arrange in ascending order of distance to q
19: return the first m elements of W as S //Return the first m as nearest neighbors
3.2.2. HNSW
The HNSW structure is an enhanced version of the NSW graph architecture, achieving efficient nearest neighbor search on large-scale datasets. Compared to NSW, HNSW introduces a hierarchical organization through skip lists and multi-layered node arrangements, which significantly improves search efficiency. As illustrated in Figure 2, HNSW incorporates additional connection pointers at each layer, enabling a skip-based search process that enhances overall performance.
In this hierarchical structure, each data point has a predefined probability (typically 50%) of being promoted to the next higher-level ordered linked list. This multi-layered organization facilitates concurrent search operations across multiple levels, substantially accelerating the search process. The search algorithm initiates from the highest layer and progressively descends to lower levels until either locating the target or reaching the base layer, ensuring both efficiency and completeness of the search procedure.
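The promotion rule can be sketched as a sequence of coin flips, skip-list style; the helper below is an illustration with an assumed promotion probability p = 0.5 and an arbitrary level cap, not the exact HNSW level-assignment formula:

```python
import random

def sample_level(p=0.5, max_level=16):
    """Promote a node to the next higher layer with probability p, skip-list style."""
    level = 0
    while random.random() < p and level < max_level:
        level += 1
    return level

# Roughly half the nodes stop at layer 0, a quarter reach layer 1, and so on.
levels = [sample_level() for _ in range(100_000)]
print([levels.count(l) for l in range(5)])
```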
The HNSW algorithm organizes connections into distinct hierarchical levels based on their lengths, enabling efficient multi-layer graph traversal. This hierarchical architecture ensures that each node maintains a constant number of connections, independent of the overall network size. As depicted in Figure 3, the search process initiates from the topmost layer, which contains only the longest-range connections.
The search algorithm employs a greedy traversal strategy at each layer until reaching a local minimum. The process then transitions to the next lower layer, where the search restarts from the previously identified local minimum. This iterative procedure continues until the bottom layer is reached. In Figure 3, the search begins at a top-layer element, with red arrows indicating the traversal path generated by the greedy algorithm from the entry point to the query location. The complete hierarchical search procedure is formally described in Algorithm 2. In our scheme, the HNSW structure is used to retrieve neighborhood information that supplements the candidate set and improves query accuracy.
Algorithm 2 ANN queries on NNGs with HNSW structures
Input: graph structure of HNSW: hnsw, the element to be searched: q, the number of nearest neighbors to be returned: k, the size of the dynamic candidate list: ef. Output: the k elements closest to q.
1: W ← ∅ //Create an empty set to store the closest elements found so far
2: ep ← the entry point of hnsw //Obtain the entry point of the HNSW graph
3: L ← the layer of ep //Obtain the number of the layer where the entry point is located
4: for lc ← L down to 1 do
5:   W ← search layer lc from ep (Algorithm 1) //Find the nearest ef elements on the current layer lc
6:   ep ← the element of W closest to q //Update the entry point to the element in W that is closest to q
7: end for
8: W ← search layer 0 from ep (Algorithm 1) //Find the nearest ef elements on the bottom layer
9: return the k elements of W closest to q
3.2.3. Secure Nearest Neighbor Query Framework
In this paper, the data owner stores ciphertext data along with secure indexes on the cloud server. The cloud server responds to query tasks from the LLMs, and the retrieved data is ultimately decrypted at the LLM user's end. Within this cloud service model, there exist potential threats from malicious or semi-honest servers that may attack the data. Malicious servers may compromise data integrity and security through means such as query result modification, data tampering, or denial of service. Semi-honest servers, while not actively attacking data, may analyze transmitted data during processing or exploit access privileges to snoop on data, attempting to infer sensitive information or identify special patterns to serve their own interests or third-party demands. This paper assumes the server to be semi-honest, necessitating measures to restrict the server's data access and utilization to prevent it from obtaining private information from either the dataset or the queries. Therefore, this study employs encryption techniques to protect plaintext data, combined with a greedy subset partitioning approach, which effectively prevents information leakage and satisfies IND-CPA security.
The SANN scheme based on NNGs involves three main participants: the data owner, the user, and the server:
The Data Owner processes the original dataset and partitions it into subsets, then outsources the encrypted data to the cloud server for management.
The User uses LLMs to submit a query request and interacts once with the server to retrieve a candidate set.
The Server stores the encrypted dataset and upon receiving a query request, searches for the corresponding subset based on the query.
The secret key and the secure index table are assumed to be transmitted to the user via a secure channel and stored locally. As illustrated in Figure 4, during the query process, the LLM interacts once with the cloud server. The query token t is generated from the local index rather than the original query vector q, thereby preventing direct exposure of the query data during transmission. The cloud server retrieves the labels of encrypted subsets according to t to identify the candidate set S and returns the encrypted candidate set to the LLM. The LLM user, who possesses the decryption key, decrypts the returned results to obtain the candidate set. Finally, the user performs a local sequential search on the small-scale candidate set to determine the approximate k-nearest neighbors.
3.3. Symmetric Encryption
To ensure data security and privacy, we employ symmetric encryption for data protection. Symmetric encryption utilizes the same key for both encryption and decryption operations. This method is characterized by its simplicity, strong security, high-speed processing, and low computational overhead, making it an efficient cryptographic approach. Owing to its minimal key management requirements and ease of implementation and maintenance, symmetric encryption has been widely adopted in applications such as data transmission and storage, establishing itself as a fundamental component in the field of information security.
The Advanced Encryption Standard (AES) algorithm employed in this study is a block cipher: it divides plaintext into fixed-size blocks and encrypts each block in sequence. In Cipher Block Chaining (CBC) mode, each plaintext block undergoes an XOR operation with the preceding ciphertext block before encryption. This chaining ensures that each ciphertext block depends on the previous one, thereby enhancing randomness and security.
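As a minimal sketch of this step, assuming the PyCryptodome library, each serialized subset can be encrypted under AES-CBC with a fresh random IV prepended to the ciphertext:

```python
from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes
from Crypto.Util.Padding import pad, unpad

def encrypt_subset(sk: bytes, plaintext: bytes) -> bytes:
    """AES-CBC encryption with a fresh IV prepended to the ciphertext."""
    iv = get_random_bytes(AES.block_size)
    cipher = AES.new(sk, AES.MODE_CBC, iv)
    return iv + cipher.encrypt(pad(plaintext, AES.block_size))

def decrypt_subset(sk: bytes, blob: bytes) -> bytes:
    """Split off the IV, decrypt, and strip the padding."""
    iv, ct = blob[:AES.block_size], blob[AES.block_size:]
    cipher = AES.new(sk, AES.MODE_CBC, iv)
    return unpad(cipher.decrypt(ct), AES.block_size)

sk = get_random_bytes(16)  # 128-bit AES key
blob = encrypt_subset(sk, b"serialized subset data")
assert decrypt_subset(sk, blob) == b"serialized subset data"
```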
While the inherent security strength of the algorithm influences the safety of symmetric encryption, its algorithms are typically public, and the same key is used for both encryption and decryption. Thus, the security primarily relies on the confidentiality of the symmetric key. To safeguard data during transmission, establishing a secure key management mechanism is essential. Additionally, regular key updates are crucial to effectively reduce key lifespan and mitigate the risk of key exposure.
In multi-user scenarios, the server must generate and securely store distinct keys for each user. As the number of users grows, the volume of keys managed by the server increases accordingly. To address this challenge, the server must implement a secure and reliable key storage system to ensure the safe management of each user’s key. Furthermore, the server must handle key distribution and updates efficiently, guaranteeing that all users receive the latest keys in a timely manner.
4. NNG-SANN
As outlined in Section 3, while LSH-based ANN schemes can significantly improve query efficiency, they often suffer from either low result accuracy or excessive memory consumption. To enhance query accuracy, this paper proposes a novel scheme that incorporates NNGs to assist in index construction while employing symmetric cryptographic techniques to ensure the storage security of the original data. The subsequent sections describe the scheme in detail: Section 4.1 presents formal definitions of the key components, Section 4.2 elaborates on the data subset partitioning strategy and index construction algorithm, and Section 4.3 details the query algorithm.
4.1. Framework
The NNG-based SANN scheme primarily consists of the following four phases:
Preparation: Select a dataset D and a secret key sk. We denote D_i as the i-th subset of D, and each subset is associated with a uniquely generated random tag t_i. Construct an index table I to store the mapping (D_i, t_i) between subsets and their corresponding tags. Encrypt the subsets and store them on the server in the form of (t_i, E(D_i)). For ease of illustration, we use the notations in Table 1.
Query T generated by LLMs: For a given query q, locate the corresponding subset tag according to the index table I, then generate a query request T.
Secure query on server: Upon receiving the query request T from the user, the server returns the encrypted candidate set E(S).
Local search: Decrypt the returned E(S) using key sk to obtain the plaintext candidate set S, then perform a sequential kNN search locally on S.
4.2. Design of Index
During the offline preparation phase, the data owner should generate an index table I and multiple encrypted subsets. To achieve this objective, we designed an index construction algorithm based on LSH and NNG, which theoretically guarantees a certain level of accuracy for ANN queries. Furthermore, to prevent data explosion and ensure data security, a greedy partitioning method is employed to divide the entire dataset.
4.2.1. Rapid Positioning Based on LSH
As shown in Formula (1), a hash function maps a d-dimensional vector to an integer-valued hash space. As indicated by the properties of LSH, similar data points have a high probability of being mapped to the same hash bucket after computation. By increasing the number of hash functions used, the probability of similar data being projected into identical hash buckets can be further improved. However, as more hash functions are employed, the number of generated hash mappings also increases, which severely consumes memory space and reduces query efficiency while increasing the number of I/O operations.
Therefore, in the proposed scheme, we first apply a single LSH function to perform hash computations on all data. Each vector is assigned a corresponding hash value, and each hash bucket stores all data points mapped to that value. Algorithm 3 illustrates the generation method of the hash mapping table. Additionally, in Section 4.4, a method utilizing two hash functions is introduced to enhance the accuracy of query results.
Algorithm 3 Generation of the hash mapping table
Input: dataset: D, a parameter that controls the width of the interval in the LSH function: w. Output: hash bucket mapping index, where each bucket stores all the points mapped to that hash value: hash_index.
1: D ← normalize(D) //Select a normalization method according to the characteristics of the dataset
2: w ← user-defined interval width
3: r ← a random number drawn uniformly from [0, w) //The offset of the LSH function is set to a random number
4: a ← a random d-dimensional vector drawn from N(0, 1)
5: for each v ∈ D do
6:   hv ← ⌊(a · v + r) / w⌋ //Calculate the hash value for each vector of the dataset in sequence
7:   hash_index[hv] ← hash_index[hv] ∪ {v} //Add the data to the corresponding hash bucket (creating the bucket if it does not exist)
8: end for
To improve the performance of the LSH algorithm, the dataset needs to be normalized before hash computation. The min-max normalization and Z-score normalization methods are primarily employed in this paper, with the choice of normalization technique determined by the specific characteristics of each dataset. Min-max normalization linearly scales the data to a specified range, typically [0, 1], and is calculated as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where x' is the standardized data, x is the original data, x_max is the maximum value of all elements in the dataset, and x_min is the minimum value of all elements in the dataset. Z-score normalization transforms the data into a standard normal distribution with zero mean and unit standard deviation. This processing ensures that the data exhibits comparable scales across different dimensions. The calculation formula is as follows:

$$x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$

where μ_j is the mean value of column j, and σ_j is the standard deviation of column j. After normalization, the dataset demonstrates enhanced stability and reliability during LSH computation, thereby ensuring the accuracy and efficiency of the algorithm.
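Both normalizations are straightforward in NumPy; the sketch below follows the formulas above (global extremes for min-max, column statistics for Z-score), with a small epsilon guard added as an implementation detail not specified in the paper:

```python
import numpy as np

def min_max_normalize(X):
    """Linearly scale all elements of X into [0, 1] using the global extremes."""
    lo, hi = X.min(), X.max()
    return (X - lo) / max(hi - lo, 1e-12)

def z_score_normalize(X):
    """Shift each column of X to zero mean and unit standard deviation."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / np.maximum(sigma, 1e-12)
```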
4.2.2. Generation of the NNG
The algorithms introduced in Section 3.2 concern the construction and querying of NNGs. When only a small number of hash functions are employed for mapping, the collision probability in Equation (2) implies a non-negligible chance that similar points fail to collide. This leads to a higher rate of false negatives during queries, where originally similar data points are erroneously mapped to different hash buckets and subsequently filtered out, thereby compromising query accuracy.
To address this issue, the structural information of the NNG can be leveraged to augment the data within each hash bucket. Specifically, if a neighbor of the query vector q is found in a particular hash bucket, other data points in the same bucket can be assumed to be potential neighbors of q as well. Thus, during the retrieval of bucket entries, this neighboring point can serve as a reference to additionally fetch a set of nearby nodes, expanding the candidate pool for the query. In other words, the neighborhood relationships encoded in the NNG can provide supplementary candidate nodes, compensating for potential inaccuracies introduced by the LSH mapping process. This strategy enhances both the accuracy and recall rate of query results while mitigating the probability of false negatives caused by hash collisions.
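A sketch of this supplementation step follows; hash_index and nng are placeholders for the hash mapping table and precomputed neighbor lists built in Section 4.2, with points referenced by integer IDs:

```python
def augmented_candidates(q_hash, hash_index, nng):
    """Expand a hash bucket with each member's precomputed NNG neighbors."""
    bucket = hash_index.get(q_hash, [])   # point IDs mapped to this hash value
    candidates = set(bucket)
    for point_id in bucket:
        candidates.update(nng[point_id])  # supplement with the point's NNG neighbors
    return candidates
```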
The NNG structure of HNSW is illustrated in Figure 3. In the proposed method, we first construct an HNSW graph on the normalized dataset. The bottom layer of this graph structure stores the original data information and neighborhood relationships of all nodes, enabling direct and efficient retrieval of neighboring nodes for each vector. However, due to the uncertainty in the number of connections, the distribution of neighboring nodes may become highly imbalanced. Moreover, the heuristic neighbor selection method may introduce long-distance connections to certain nodes to ensure global connectivity, so the nodes connected at the bottom layer are not necessarily the m nearest neighbors of a given point.
To address these limitations, after constructing the HNSW graph, our approach employs the NNG search algorithm (Algorithm 4) to identify the m nearest neighbors for each vector. Thanks to the hierarchical structure's efficient search capability and the heuristic algorithm's ability to overcome the local-optimum limitation inherent in traditional NNGs, Algorithm 4 achieves highly accurate results. This significantly enriches the candidate set, thereby enhancing the overall robustness of the proposed solution.
Algorithm 4 Search the neighbor points
Input: dataset: D, the number of nearest neighbors to be selected: M. Output: neighbor information list: G
1: hnsw ← an empty HNSW structure //Initialize the HNSW structure
2: for each v ∈ D do
3:   insert v into hnsw //Insert nodes in sequence to create the HNSW structure
4: end for
5: for each v ∈ D do
6:   G[v] ← the M nearest neighbors of v found in hnsw //Find M nearest neighbors for each node via Algorithm 2
7: end for
4.2.3. Greedy Division of Subsets
Based on the aforementioned construction schemes of LSH and NNG, the indexing structure and greedy subset partitioning method for the proposed approach are deigned carefully. As illustrated in
Figure 5, Algorithm 3 is employed to perform LSH computation for each vector in the dataset and construct a hash mapping table. To enhance query accuracy, each data point is supplemented with m neighbor nodes to form a neighbor list. These data points are then sorted in descending order based on their hash values, followed by subset partitioning within the range from the minimum to maximum hash values. The partitioning principles and corresponding analysis are detailed as follows:
1. All data points sharing identical hash values, along with their neighbor nodes, must be aggregated into the same subset while minimizing the total number of subsets. This reduces communication overhead and ensures minimal data retrieval, thereby enhancing query efficiency and response time.
2. Subset sizes should be balanced as uniformly as possible to prevent data explosion while providing security guarantees. This is achieved by maintaining indistinguishable encrypted subsets of identical sizes on the server.
3. Each subset maintains independent and unique upper/lower bounds to prevent query failures caused by overlapping index ranges across multiple subsets, ensuring the feasibility of local indexing.
According to principle 2, to maintain balanced subset sizes (i.e., aggregating an equal number of vector data points into each subset), it is necessary to maximally merge hash buckets before selectively supplementing subsets with partial vector data from subsequent hash buckets. This ensures uniform subset sizes. However, a special case may occur when processing the final subset: after the merging and filling procedures, the last subset may contain an insufficient number of data points. In such cases, partial data from the preceding subset can be redistributed to achieve equal size across all subsets. As specified in principle 3, for each partition, the upper and lower bounds of the current subset's hash range, encompassing all data points within its hash buckets, serve as the partition's index bounds. Furthermore, each data subset is assigned a unique identifier t, and the mapping (t, [lb, ub]) between the identifier and the subset's hash bounds is incorporated into the index structure I. The complete algorithm is presented in Algorithm 5.
Algorithm 5 Greedy division
Input: dataset: D, hash table: hash_index, hash value set: hash_value. Output: the data of the subsets: SD, the table of indexes: I
1: sort hash_value in ascending order //Arrange the buckets in ascending order according to the hash value
2: size ← |D| / nop //Dynamically set the subset size
3: SD_0 ← ∅
4: i ← 0
5: while hash_value is not exhausted do
6:   if SD_i = ∅ then
7:     lb_i ← the next unprocessed hash value //The lower bound of the current subset
8:   end if
9:   append the points of the next bucket, together with their neighbor lists, to SD_i
10:  if |SD_i| ≥ size then
11:    ub_i ← the hash value just processed //The upper bound of the current subset
12:    SD_{i+1} ← the elements of SD_i beyond position size
13:    SD_i ← the first size elements of SD_i //Remove the redundant elements of the current subset
14:    t_i ← a randomly generated label //Randomly generate labels
15:    I ← I ∪ {(t_i, [lb_i, ub_i])}
16:    i ← i + 1
17:  end if
18: end while
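For concreteness, the following Python sketch mirrors the greedy division under simplifying assumptions: neighbor supplementation and last-subset rebalancing are elided, and boundary handling is simplified relative to principle 3.

```python
import secrets

def greedy_division(hash_index, nop):
    """Split hash buckets (in ascending hash order) into nop tagged, equal-sized subsets."""
    values = sorted(hash_index)                      # hash values in ascending order
    total = sum(len(hash_index[h]) for h in values)
    size = total // nop                              # dynamically set subset size
    subsets, index = {}, {}
    current, lb = [], values[0]
    for h in values:
        current.extend(hash_index[h])                # merge whole buckets in order
        if len(current) >= size:
            tag = secrets.token_hex(8)               # randomly generated label
            subsets[tag] = current[:size]
            index[tag] = (lb, h)                     # hash range covered by this subset
            current = current[size:]                 # overflow opens the next subset
            lb = h
    return subsets, index
```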
4.3. Secure Query
The data owner processes the original dataset by generating a symmetric key sk and a hash function h, and configures the parameters of the hash function. Using the hash mapping table from Algorithm 3 and the greedy division in Algorithm 5, the data subsets are partitioned and an index table I is constructed. The index table stores the hash value range (minimum and maximum) of each subset along with the unique label assigned to it. Subsequently, the data owner employs the AES encryption algorithm with the key sk to encrypt each data subset D_i into its ciphertext form E(D_i). Each encrypted subset is then associated with its corresponding label t_i and uploaded to the cloud server. To ensure secure querying, the data owner transmits the symmetric key sk, the hash function h, and the index table I to authorized query users through a secure channel. This prevents the cloud server from obtaining the key and decrypting the encrypted data subsets.
To retrieve the k-nearest neighbors of a query vector q, the system first computes its hash value h(q) using the predefined hash function h. The corresponding data subset is then located in the local index through Algorithm 6 by sequentially comparing h(q) with the hash ranges in the index table. When h(q) falls within a specific subset's hash range, the associated subset tag is added to the query request T, along with the tags of its immediately adjacent subsets to enhance query accuracy. For cases where h(q) lies between two subsets' hash ranges, both neighboring subset tags are recorded. Boundary conditions are handled by selecting the first subset's tag when h(q) is below the minimum hash value, or the last subset's tag when h(q) exceeds the maximum value. This comprehensive approach effectively mitigates interval selection bias and significantly improves query precision.
On the server side, the encrypted subset data are stored in the form of (t_i, E(D_i)). Upon receiving a query request T containing the required subset tags from users, the server retrieves the corresponding encrypted data files. In our experimental setup, each encrypted subset is stored as an individual file named according to its assigned tag, enabling efficient retrieval. This design allows the server to simply fetch and transmit the encrypted candidate set E(S) to the user based on the tags specified in query T, while maintaining data confidentiality throughout the process.
Upon receiving the encrypted candidate set E(S), the user decrypts it using the pre-shared symmetric key sk obtained from the data owner, thereby obtaining the plaintext candidate set S. The user then performs a local sequential search on S by computing the distance between each candidate element and the query vector q. All candidate elements are sorted in ascending order of their calculated distances, from which the top-k elements are selected as the final query results. This process ensures both data confidentiality during transmission and accurate k-nearest neighbor retrieval through local computation.
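The local search itself is a plain sequential scan; a minimal NumPy sketch, where candidates is the decrypted candidate set as a 2-D array:

```python
import numpy as np

def local_knn(candidates, q, k):
    """Sequential k-NN over the decrypted candidate set."""
    C = np.asarray(candidates)
    dists = np.linalg.norm(C - q, axis=1)  # Euclidean distance to the query
    order = np.argsort(dists)[:k]          # ascending distance, keep the top-k
    return C[order], dists[order]
```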
Algorithm 6 Query request generation
Input: query vector: q, index table: I, subset quantity: nop. Output: query request: T
1: for i ← 0 to nop − 1 do
2:   H[i] ← I[i] //Read the hash ranges and tags from the index table into H
3: end for
4: hq ← h(q) //Calculate the hash value of the query vector
5: for i ← 0 to nop − 1 do
6:   if lb_i ≤ hq ≤ ub_i then //hq falls within the range of subset i
7:     T ← T ∪ {t_i} together with the tags of the adjacent subsets
8:   end if
9:   if ub_i < hq < lb_{i+1} then //hq lies between subsets i and i + 1
10:    T ← T ∪ {t_i, t_{i+1}}
11:  end if
12:  if hq < lb_0 then //hq is below the minimum hash value
13:    T ← T ∪ {t_0}
14:  end if
15:  if hq > ub_{nop−1} then //hq exceeds the maximum hash value
16:    T ← T ∪ {t_{nop−1}}
17:  end if
18: end for
4.4. Optimization
To further enhance the query accuracy of the algorithm, we employ two LSH functions (as shown in Equation (1)) to construct two index tables. During query processing, the intersection of the candidate sets retrieved from both index tables is selected as the final candidate set. This approach reduces the likelihood of false positive points being included in the candidate set, because the two distinct hash functions exhibit a degree of independence in their hash value computations. Consequently, the probability of data point overlap between the two hash buckets decreases, thereby minimizing the risk of misjudgment. Additionally, this method improves query stability and reduces dependency on parameter selection and hash function choices.
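A sketch of the two-table lookup: retrieve_candidates is a placeholder for the bucket lookup plus NNG supplementation described in Section 4.2, applied to each index table independently.

```python
def query_two_tables(q, table1, table2, retrieve_candidates):
    """Intersect the candidate sets from two independent LSH index tables."""
    s1 = retrieve_candidates(q, table1)  # candidates under the first LSH function
    s2 = retrieve_candidates(q, table2)  # candidates under the second LSH function
    return s1 & s2                       # the intersection filters false positives
```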
The data owner must perform two separate hash function computations on the dataset while still leveraging the nearest neighbor relationships in the NNG to supplement candidate nodes. This process generates index tables and partitions data subsets, enhancing query accuracy and reducing the size of the final candidate set for sequential search, thus improving local search efficiency. However, storing multiple index tables increases spatial overhead, and reading multiple data groups to compute their intersection incurs additional I/O operations and time costs. To balance efficiency and accuracy, we limit the implementation to only two LSH functions, ensuring improved query precision while maintaining computational performance.
7. Conclusions
The secure nearest neighbor query protects data privacy in retrieval tasks, particularly for LLMs. Existing methods face performance degradation in high-dimensional spaces due to the curse of dimensionality. While LSH alleviates this issue by trading precision for efficiency, it often yields false positives when only a limited number of hash functions are used. To address this, we propose a secure query method combining LSH with NNGs. The NNG supplements potential false negatives in hash buckets, improving candidate quality. Data are encrypted using AES before being outsourced to semi-honest servers. During queries, the user retrieves a small candidate set in one interaction and performs a local search for the approximate k-nearest neighbors. Experiments show that neighbor supplementation enhances accuracy and recall, especially for small k. For larger k, adding a second LSH function improves precision with minimal overhead. This approach balances efficiency and security in privacy-preserving retrieval.
The proposed scheme achieves efficient nearest-neighbor retrieval but has two limitations: (1) While increasing LSH functions enhances accuracy for larger queries, it incurs higher computational costs, suggesting future work on optimized LSH variants to reduce false positives. (2) Although HNSW graphs enable fast searches, their memory-intensive construction warrants investigation of more compact graph structures to maintain performance while lowering spatial overhead.