Article

Parallel Approaches for SNN-Based Nearest Neighbor Search in High-Dimensional Embedding Spaces: Application to Face Recognition

Department of Artificial Intelligence, Lviv Polytechnic National University, 79013 Lviv, Ukraine
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(18), 10139; https://doi.org/10.3390/app151810139
Submission received: 28 August 2025 / Revised: 11 September 2025 / Accepted: 16 September 2025 / Published: 17 September 2025

Abstract

The rapid growth of high-dimensional biometric data requires fast and accurate similarity search methods for real-time applications. This study proposes, for the first time, two efficient parallel implementations of the exact Sorting-based Nearest Neighbor (SNN) algorithm using OpenMP for CPUs and CUDA for GPUs. Comparative evaluation against conventional exact search methods—k-d tree and ball tree—on LFW embeddings, including FaceNet512 and VGG-Face, demonstrates an up to 58× speedup on GPUs while maintaining full accuracy. Analysis of the full recognition pipeline shows that parallelization reduces search times to about 27% of total processing, highlighting the method’s stability and efficiency for modern embeddings. These results confirm the applicability of the proposed approaches for real-time biometric identification, with potential extensions to streaming data, hybrid computing environments, and other high-dimensional representations.

1. Introduction

The rapid advancement of deep learning technologies has increased the accuracy of artificial intelligence systems, particularly in the field of face recognition, as confirmed by numerous studies [1,2,3,4,5]. These systems sometimes surpass human recognition accuracy. This progress has been driven by the use of high-dimensional vector representations that effectively encode unique facial features [6]. However, this success has also given rise to a new fundamental challenge: the combination of high data dimensionality and the exponential growth of biometric databases has turned similarity search into a critical computational bottleneck, limiting the performance and scalability of modern systems [7,8].
One of the key operations in high-dimensional data analysis tasks, including face recognition, is Fixed-Radius Nearest Neighbor Search (FRNNS). This task involves identifying all points located within a specified radius from a query point and is critical for many algorithms in clustering, anomaly detection, and similarity searches in large databases [9]. As data’s dimensionality and volume increase, traditional algorithms for FRNNS face significant computational challenges caused by the “curse of dimensionality”, which severely reduces their performance and scalability.
The Sorting-based Nearest Neighbor (SNN) method [10], on which this study is based, is one of the promising exact approaches for solving nearest neighbor search problems, including FRNNS, in such spaces. A parallel implementation of SNN can significantly improve search performance while maintaining 100% accuracy, which is particularly important for applications with stringent reliability requirements, such as security systems and law enforcement. Therefore, optimizing and scaling the SNN algorithm for FRNNS in large biometric databases becomes an essential step toward overcoming current computational challenges.
As facial databases scale up, three bottlenecks emerge: the indexing time for graph-based methods grows superlinearly [11], distributed systems face load balancing issues in clusters, and hardware limitations constrain GPU acceleration [12]. In medical image studies, there is a risk of missing critical features in diagnostic processes [13], while security systems experience false alarms with larger datasets. These challenges highlight the need for nearest neighbor search (NNS) algorithms that remain efficient across varying dataset sizes and complexities without compromising exact search accuracy.
Existing approaches to addressing this problem fail to meet the requirements of modern, particularly mission-critical, applications. Traditional exact search methods, such as tree-based algorithms (k-d trees), experience significant performance degradation when working with high-dimensional face embeddings due to the so-called “curse of dimensionality,” rendering them entirely unsuitable for real-time systems. On the other hand, approximate nearest neighbors (ANN) methods, while offering high speed, achieve it at the cost of a fundamental trade-off—sacrificing guaranteed accuracy [14,15]. Such a compromise is unacceptable for a wide range of critical applications, such as security systems and law enforcement, where a false negative result can have extremely serious consequences.
This presents an unacceptable dilemma: either use exact but impractically slow methods or fast but potentially unreliable ones. This study aims to address precisely this dilemma. It is based on the use of the modern exact SNN algorithm, which by its nature is better adapted to high-dimensional spaces than tree-based structures. The key contribution of this work lies in the parallel implementation of this algorithm, which is a necessary step toward unlocking its full potential. This enables performance that meets the strict low-latency requirements of large-scale real-time systems without sacrificing accuracy. Therefore, the development of such optimized, exact search methods is not merely a technical improvement but a critical necessity for the continued advancement and reliable deployment of state-of-the-art artificial intelligence systems.
High-dimensional facial vector embeddings produced by modern models, such as FaceNet [16] or VGG-Face [17], typically have dimensionalities ranging from 128 to 512 and are normalized onto the unit hypersphere. In such spaces, vector proximity reflects identity similarity; however, challenges arise due to distance concentration [9] and the hubness phenomenon [18], which are characteristic of high-dimensional data. The SNN method naturally accounts for these properties: by sorting along the principal component and applying projection-based bounds, it eliminates a substantial number of candidates while preserving exactness. Therefore, SNN aligns well with the geometry of facial embeddings, enabling an efficient and robust nearest neighbor search in face recognition tasks.
This work presents a novel methodological adaptation of the exact SNN algorithm for high-dimensional facial embeddings. Unlike prior studies, it addresses the combined challenges of the curse of dimensionality and hubness while preserving exact recall. Precomputation strategies are introduced to enhance query efficiency, the algorithm shows robustness on large-scale datasets, and a comprehensive analysis of an end-to-end face recognition pipeline confirms its suitability for real-time, mission-critical applications.
Main contributions of the work:
  • Introducing a methodological adaptation of the exact SNN algorithm for high-dimensional facial embeddings, effectively addressing challenges such as the curse of dimensionality and hubness while maintaining exact recall.
  • Proposing a precomputation strategy in the query phase that improves efficiency without compromising accuracy, offering a scalable alternative to tree-based exact search methods.
  • Conducting a comprehensive performance and stability analysis on large-scale datasets, demonstrating the algorithm’s robustness to increasing dimensionality and data volume.
  • Providing an end-to-end evaluation of a face recognition pipeline, highlighting integration into real-time systems with high reliability and demonstrating practical scalability limits.
The results of this study demonstrate that it is possible to develop reliable, high-performance face recognition systems capable of processing large high-dimensional biometric databases without loss of accuracy. The proposed strategies allow adaptation to modern multi-core CPUs and GPUs, reduce system response time in real-time applications, optimize computational resource usage, and lower operational costs. These outcomes are particularly valuable for critical applications, such as security systems and law enforcement, and provide insights for broader applications requiring efficient handling of large high-dimensional datasets.

2. State-of-the-Art Review

2.1. Classification and Detailed Overview of Existing FRNNS Methods

Fixed-radius nearest neighbor search algorithms can be divided into two main categories: exact methods, which guarantee finding all points within a given radius, and approximate methods, which trade some accuracy for a significant speedup. In the context of this work, the primary focus is on exact methods; however, considering the principles of approximate methods is useful for understanding current optimization trends.

2.1.1. Exact Search Methods

Exact FRNNS methods ensure complete coverage of relevant results: no point within the specified radius R from the query point will be missed. At the same time, no point outside this radius will be included in the response. The first such group of methods comprises spatial indexing approaches. This category includes algorithms that operate on hierarchical data structures, particularly tree-based ones, which allow partitioning the space into subregions and efficiently pruning areas where neighbors cannot be located. This group includes the following:
  • K-d trees are among the earliest and most well-known structures of this type. They implement recursive partitioning of the space using hyperplanes parallel to the coordinate axes, changing the splitting axis at each tree level [10]. This approach demonstrates high efficiency in low-dimensional spaces (d < 20), but as dimensionality increases, its performance drops sharply due to the so-called “curse of dimensionality” [19].
  • Ball trees organize data in the form of nested hyperspheres. Each node of the tree corresponds to a hypersphere covering a certain subset of points. Compared to k-d trees, ball trees handle non-uniformly distributed data more effectively; however, in high-dimensional spaces, they also become inefficient [10].
  • Other tree-based structures include Vantage Point trees (VP-trees), which partition the space relative to a chosen reference point based on a distance threshold [10]; Random Projection trees (RP-trees), which employ random projections [20]; and Cover trees, which provide theoretical guarantees for search time, although their implementation is complex [10]. Regardless of the specific type, the efficiency of tree-based methods significantly decreases as the dimensionality d approaches log N, where N is the number of points in the dataset, and performance approaches that of brute-force search [10].
Despite these limitations, the development of new tree-based approaches continues. In particular, in 2025, a framework called LeaFi was proposed, which leverages machine learning to improve pruning efficiency in tree-based structures oriented toward time series. LeaFi provides a substantial speedup (up to 32×) while maintaining a high recall level—99% [21]. Although this technology is primarily designed for time series tasks, its approach can be adapted to general FRNNS problems in high-dimensional spaces.
The second group of exact methods comprises sorting-based approaches. This class of algorithms involves the preliminary sorting of data according to certain features, which allows accelerating computations. These include the following:
  • SNN is an exact algorithm for FRNNS, developed during 2022–2024 (Chen, Güttel), which demonstrates superiority over traditional methods [10]. It is based on three key principles: (1) candidate exclusion using a sorting criterion, which restricts the search space; (2) the use of precomputed scalar products to reduce arithmetic complexity; (3) reformulation of computations as matrix operations, enabling the use of high-performance BLAS libraries. The indexing stage includes data centering, computation of the principal component via SVD, sorting of points according to their projection onto this component, and storing the norms of the centered vectors. The query stage involves binary search on the sorted projections and filtering according to the distance criterion. The key advantages of SNN include the following: no need for hyperparameter tuning (except for the radius R), guaranteed accuracy, high speed (often outperforming both tree-based algorithms and optimized brute force), and flexibility, as the rapid index update allows the method to be applied to streaming data.
  • K-means for k-NN (kMkNN) is a method that applies preliminary clustering using k-means to partition the data, after which the search is carried out only within a limited number of relevant clusters. The use of the triangle inequality allows for reducing the number of distance computations. Although this method is primarily designed for the k-NN task [22], its principles can be partially transferred to FRNNS.
The most straightforward but computationally expensive approach is brute-force search. This method consists of computing the distance from the query to every point and checking the radius condition. It remains relevant as a baseline reference for evaluating efficiency, particularly in tasks with small datasets or low dimensionality. Modern implementations of brute-force search exploit hardware acceleration—such as SIMD instructions or BLAS libraries—to optimize Euclidean distance computations, as well as GPU parallelization. A comprehensive comparison of the key exact FRNNS algorithms, including their theoretical properties, advantages, and limitations, is presented in Table 1.
In FRNNS tasks for high-dimensional spaces, the advantages of classical tree-based structures gradually diminish compared to brute-force search. In this context, newer methods such as SNN demonstrate significant benefits due to their ability to substantially narrow the search space by adapting to the geometry of high-dimensional data, as well as their efficient distance computation implementations. Moreover, the lack of the necessity to tune a large number of hyperparameters makes these methods practically more convenient compared to graph-based approaches, such as HNSW, which are highly sensitive to parameter settings [23]. Therefore, sorting-based approaches, particularly SNN, can be considered a promising direction for further research and optimization of exact FRNNS algorithms.

2.1.2. Approximate Search Methods

Although this work focuses on exact search methods, it is worth considering the potential of ANN for FRNNS tasks. Unlike exact methods, approximate methods do not guarantee the identification of all or the most precise neighbors; however, they provide a significant reduction in query processing time and resource consumption, which is critical when working with large datasets in high-dimensional spaces. A significant portion of modern ANN algorithms, such as HNSW and ScaNN, were primarily developed for the k-NN problem rather than FRNNS [10], which complicates their direct adaptation to threshold-based scenarios. This particularly concerns the selection of an appropriate threshold R and ensuring sufficient recall while minimizing false positives. Nevertheless, the conceptual foundations of ANN methods can be leveraged to improve heuristics in exact FRNNS methods, especially in terms of efficient pruning of the search space.
Among the most successful approximate search methods in recent years are graph-based approaches. These methods involve constructing a proximity graph, where nodes represent data points and edges connect neighboring points. Search in such a graph is performed via iterative local traversal, starting from one or several entry points and gradually approaching the query point [23]. Algorithms such as HNSW [24], NSG [25], DiskANN [12], and VAMANA [20] achieve high accuracy even in complex spaces, albeit at the cost of significant overhead in graph construction and parameter tuning, such as the size of the candidate list. An important characteristic of these methods is the presence of irregular memory access during search, which reduces cache efficiency.
Another class of methods, Locality Sensitive Hashing (LSH), is based on applying specialized hash functions that map nearby points to the same hash table buckets with high probability. This approach allows for a substantial reduction in the number of points considered during a query [10]. LSH has theoretical guarantees for certain distance metrics, but achieving a high accuracy often requires the use of many hash functions and tables, which, in turn, increases memory consumption. The effectiveness of the method largely depends on the proper selection of a hash function family that corresponds to the distance characteristics of the given task [20].
A separate group is formed by vector quantization methods, whose primary goal is to compress input data to accelerate computations and reduce memory requirements. The most well-known among them is Product Quantization (PQ), which partitions the space into Cartesian products of subspaces and performs clustering within each to form codebooks. Using these codebooks, vectors are approximated by short sequences of codes, enabling fast distance evaluation to a query [20]. Variants of this approach, such as Optimized PQ (OPQ) and Binary Quantization (BQ), allow for even higher compression or distance computation using simple operations on binary vectors [26]. However, this comes at the cost of a loss in accuracy due to quantization. To compensate for such a loss, a re-ranking step based on exact distances is often applied.
Although ANN methods do not guarantee full recovery of the neighbor set in FRNNS, their potential should not be underestimated. Many of the concepts underlying efficient approximate search—such as graph structures, data compression techniques, and heuristics for fast preliminary pruning—can inspire the development of hybrid or heuristically enhanced exact methods. An example of such an approach is presented in [26], where, for the k-NN problem in a distributed environment, global space partitioning using VP-trees is combined with local graph-based search based on HNSW. Thus, even in the context of exact FRNNS, understanding the limitations and advantages of ANN methods is an important element in the strategic design of new solutions.

2.2. Modern Approaches to Optimizing FRNNS Algorithms

To address the challenges associated with the efficiency of FRNNS, especially in high-dimensional spaces, various optimization approaches have been developed and investigated. These approaches can be tentatively categorized as described below.

2.2.1. Dimensionality Reduction Methods

Dimensionality reduction methods aim to transform data from a high-dimensional space into a lower-dimensional space while preserving the information essential for solving the given task. Such a transformation allows for a reduction in computational complexity during neighbor search, decreases memory requirements, and partially mitigates the negative effects of the “curse of dimensionality” [27].
Among the most well-known classical approaches are linear methods. In particular, Principal Component Analysis (PCA) identifies orthogonal directions along which data variance is maximized and projects the data onto a space spanned by the first few principal components. Another common method, Linear Discriminant Analysis (LDA), aims to enhance between-class separability while maintaining within-class compactness [28].
For tasks where the data structure is complex and nonlinear, linear methods are often insufficient. In such cases, methods like t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) are employed, which preserve local distances and topology in low-dimensional projections [29]. Although these approaches are primarily used for visualization, they are sometimes applied as a preprocessing step in search tasks.
A separate group comprises deep learning-based methods, notably autoencoders and their variants. These models are trained to create compact latent representations of input data that can reconstruct the inputs with acceptable accuracy. Variational Autoencoders (VAEs), for instance, model the latent space with a probabilistic structure, further enhancing the generative properties of the models [30].
In the context of exact FRNNS, the use of dimensionality reduction methods is problematic. On one hand, projecting data into a lower-dimensional space can substantially reduce computations and facilitate index construction. On the other hand, almost all known dimensionality reduction methods are approximate and entail information loss, which introduces the risk of decreased accuracy. If the information lost during transformation is important for identifying neighbors within radius R, the FRNNS results in the projected space may not correspond to those in the original space.
In the literature, dimensionality reduction is most often considered in the context of approximate nearest neighbor search (ANNS), where a certain level of error is acceptable. In contrast, for exact FRNNS, such an error may be critical. Therefore, using dimensionality reduction in exact FRNNS methods requires caution. One must either employ specialized techniques capable of preserving the necessary geometric properties or clearly understand the limits of approximation and the impact of information loss on the results. Otherwise, there is a risk that an exact FRNNS task de facto becomes an approximate search task without explicitly declaring this transition.

2.2.2. Parallelization Technologies

Parallelization of computations is a key approach to optimizing performance in FRNNS tasks, especially when handling large volumes of data. The main idea is to distribute the computational workload, both during the index construction phase and during query execution, across multiple computational resources—CPU cores, GPUs, or nodes in distributed computing systems [19,20,31,32].
Parallel processing on a CPU can be implemented either through multithreading within a single processor, for example using OpenMP, or via inter-process communication on multiprocessor systems or clusters using technologies such as MPI [33]. This enables parallelization of query processing or data fragment handling. For instance, the zd-tree, as a data structure for low-dimensional space, supports efficient parallel updates, insertions, and deletions of points in a streaming manner [20,34,35].
GPUs also demonstrate significant acceleration potential. The massively parallel architecture of GPUs is well suited for computations over large sets of homogeneous operations, particularly for parallel distance calculations or construction of certain stages of index structures. For example, the RadiK algorithm, designed for top-k selection in k-NN tasks and implemented using GPUs, incorporates several efficient techniques (including hierarchical atomic operations, record buffering, and optimal use of global memory) that can be adapted for FRNNS tasks [36].
At the same time, successful implementation of parallelization for FRNNS is accompanied by a number of technical challenges:
  • First, it is important to ensure the correct distribution of data among threads or computational nodes to avoid redundant computations and minimize inter-process communication overheads.
  • Second, in cases of uneven data distribution or varying query complexity, load balancing mechanisms should be implemented to prevent overloading certain resources and inefficient idling of others.
  • Additionally, communication costs should be optimized, which is especially relevant in cluster systems or in CPU–GPU interactions, where data transfer often becomes a bottleneck. Finally, the need for synchronization—such as coordinating access to shared structures or merging results—can introduce additional overheads and significantly complicate the implementation of an efficient parallel algorithm.
Although the scaling potential is significant, effective implementation of parallel FRNNS, particularly on heterogeneous architectures combining a CPU and GPU, or for streaming and dynamic data, requires deep engineering solutions. Simple parallel division of the computational loop is usually insufficient; algorithmic and hardware-oriented optimizations are necessary. For example, [20] proposes a hybrid k-NN implementation using MPI and OpenMP, which improves load balancing through partial data replication. Meanwhile, the RadiK implementation demonstrates the complexity of efficient GPU programming, requiring careful memory management and synchronization [36]. Studies on parallel graph search, such as AverSearch, also highlight the challenges of balancing and coordinating threads in ANNS tasks [24].

2.2.3. Innovative Indexing and Query Processing Strategies

Recent research increasingly focuses not only on dimensionality reduction and parallelization but also on direct optimization of the indexing and query processing stages in FRNNS tasks. One promising direction involves adaptive methods that avoid rigidly fixed parameters and dynamically adjust to the properties of the dataset, query distribution, or available computational resources.
Another approach is the use of machine learning models to build so-called Learned Index Structures. For example, the LeaFi system allows for more accurate prediction of distances in tree-like structures, thereby reducing the number of unnecessary computations during searches [21]. In ANNS tasks, a similar principle is implemented in the VSAG framework, which automatically tunes parameters of graph algorithms, including HNSW, thus eliminating the need for manual tuning [14].
Optimization of memory access also plays an important role. With large datasets or random access patterns (as in graph structures), cache misses increase. To minimize them, prefetching techniques are employed, as implemented in VSAG [23].
Another direction is reducing the cost of distance computations. For instance, the FINGER method [12], although developed for ANNS, demonstrates the idea of adaptive approximation: if a potential candidate clearly falls outside the radius, computing its exact distance becomes unnecessary [12,24]. Similar heuristics can also be useful for accelerating exact FRNNS if applied cautiously.
In the field of parallel computing, asynchronous architectures deserve special mention, as in the AverSearch framework [24]. This model avoids rigid thread synchronization and provides flexible dynamic load balancing, which is especially effective for queries of varying complexity and for heterogeneous graph structures.
Overall, there is a clear shift from universal and static approaches toward dynamic, adaptive, and even learned solutions that take into account data specifics, query behavior, and hardware characteristics. Many of these innovations were initially developed in the context of ANNS, but their gradual adaptation to exact FRNNS opens new opportunities for developing high-performance algorithms. A summarized overview of contemporary approaches to FRNNS optimization is presented in Table 2.
Over the past five years, the field of nearest neighbor search, and FRNNS in particular, has witnessed several key trends: growing interest in exact methods that compete with ANN in terms of speed, integration of machine learning for index optimization, active development of adaptations for modern hardware architectures (CPU, GPU), the emergence of algorithms tailored to specific data types, and deeper investigations into the effects of metrics in high-dimensional spaces. Among the resulting achievements are the introduction of the SNN algorithm, which combines accuracy, speed, and parameter-free operation; advances in ANNS techniques that can be adapted for FRNNS; and a better understanding of the “curse of dimensionality”.
At the same time, several open challenges remain: limited scalability of FRNNS for very large N or d; a lack of theoretical guarantees for modern methods; difficulties in efficient parallelization on heterogeneous systems; the absence of solutions for dynamic data; and issues related to fairness, robustness against attacks, and adaptive selection of the radius R.
In this context, the SNN method, based on sorting points by the first principal component, demonstrates promising results: high accuracy, speed, ease of implementation using BLAS, good scalability, and suitability for parallelization. This makes it a strong candidate for exact search tasks in facial recognition systems. Based on the conducted analysis, it is expected that a parallel implementation of the SNN algorithm can significantly reduce query processing times without compromising the accuracy of results. Leveraging the statistical characteristics of embeddings is anticipated to further constrain the search space effectively. Consequently, the developed solution has the potential to achieve performance comparable to ANN methods while maintaining the accuracy required for facial recognition tasks.

3. Materials and Methods

3.1. Data Representation

Let $P = \{p_1, p_2, \ldots, p_n\} \subset \mathbb{R}^d$ be a set of facial vector representations, where each vector $p_i$, $i = \overline{1, n}$, corresponds to a specific facial sample in a d-dimensional feature space obtained using convolutional neural networks (CNNs). The number of vectors $n = |P|$ can reach hundreds of thousands, and the dimensionality $d \in \{128, 256, 512, 1024\}$ depends on the specific embedding model.
Let $q \in \mathbb{R}^d$ be a query vector. The task is to find all vectors in P that lie within a fixed Euclidean radius r, i.e., to solve $FRNNS(q, r) = \{p_i \in P : \|q - p_i\|_2 \le r\}$, $i = \overline{1, n}$.
It is important to note that the facial embeddings used in this study were not L2-normalized. Consequently, the Euclidean distance was chosen over cosine similarity, as it represents the natural distance in the feature space constructed by the FaceNet (Google Inc., Mountain View, CA, USA) and VGG-Face models (Visual Geometry Group, University of Oxford, Oxford, UK), where vector magnitude can encode meaningful information. The radius r is therefore a threshold specifically calibrated for this metric space.
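For reference, the definition above corresponds directly to a brute-force scan over the database, which also serves as the correctness baseline later in this paper. The following is a minimal C++ sketch (function and variable names are illustrative, row-major storage is assumed; this is not the authors' code):

#include <cmath>
#include <cstddef>
#include <vector>

// Brute-force FRNNS: return indices of all p_i with ||q - p_i||_2 <= r.
// data is a row-major n x d matrix; this is the O(nd) baseline.
std::vector<std::size_t> frnns_brute_force(const std::vector<double>& data,
                                           std::size_t n, std::size_t d,
                                           const std::vector<double>& q,
                                           double r) {
    std::vector<std::size_t> result;
    const double r_sq = r * r;
    for (std::size_t i = 0; i < n; ++i) {
        double dist_sq = 0.0;
        for (std::size_t j = 0; j < d; ++j) {
            const double diff = q[j] - data[i * d + j];
            dist_sq += diff * diff;
        }
        if (dist_sq <= r_sq) result.push_back(i);
    }
    return result;
}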

3.2. Sorting-Based Nearest Neighbor (SNN)

The approaches proposed in this work are based on the SNN method [10], which involves pre-sorting the vectors in the database P according to a chosen Euclidean or cosine similarity relative to a reference vector, followed by pruning potentially distant vectors outside the radius r using a lower-bound distance estimate.
The theoretical guarantee that SNN finds all neighbors without omission (i.e., achieves a recall of 1.0) is formally established in the original work [10]. The method’s correctness stems from a lower bound on the Euclidean distance derived from the Cauchy–Schwarz inequality.
Let $\mu$ be the mean vector of the dataset P. For a query point q and any data point $p_i$, $i = \overline{1, n}$, their centered representations are $\bar{q} = q - \mu$ and $\bar{p}_i = p_i - \mu$. Let $v$ be the unit vector representing the first principal component of the centered data. The projections of these vectors onto $v$ are $proj_q = \bar{q}^{T} v$ and $proj_i = \bar{p}_i^{T} v$, $i = \overline{1, n}$.
The distance between the projections can be expressed as $|proj_q - proj_i| = |(\bar{q} - \bar{p}_i)^{T} v|$. By the Cauchy–Schwarz inequality, we have
$$|(\bar{q} - \bar{p}_i)^{T} v| \le \|\bar{q} - \bar{p}_i\| \, \|v\|.$$
Since $v$ is a unit vector, $\|v\| = 1$, which simplifies the inequality to
$$|(\bar{q} - \bar{p}_i)^{T} v| \le \|\bar{q} - \bar{p}_i\|.$$
This inequality shows that the Euclidean distance between any two points is always greater than or equal to the distance between their projections onto any line. Therefore, if a point $p_i$ is a true neighbor within radius r (i.e., $\|\bar{q} - \bar{p}_i\| \le r$), it is mathematically guaranteed that its projection also satisfies $|proj_q - proj_i| \le r$. The initial filtering step, which identifies all candidates within this projection-based boundary, thus creates a superset of the true neighbors, ensuring none are missed.
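As a simple numerical illustration (values chosen purely for exposition): for $\bar{q} = (3, 4)$, $\bar{p}_i = (0, 0)$, and $v = (1, 0)$, the exact distance is $\|\bar{q} - \bar{p}_i\| = 5$, while the projection gap is $|proj_q - proj_i| = 3 \le 5$. A point whose projection gap already exceeds the radius r can therefore be discarded without ever computing its exact distance.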
The correctness of this approach relies on three key conditions:
  • Data centering is essential, as the projection is onto the principal component of the centered data, which ensures the validity of the distance bound.
  • The guarantee holds specifically for the Euclidean distance.
  • Sorting is not required for correctness, but ordering the data by projection value is the critical step that enables an efficient binary search to identify the candidate set in O(log n) time.
To efficiently handle the computationally intensive stages of the SNN algorithm—specifically during the indexing phase (computing the mean, data centering, principal component calculation, projections, norms, and sorting) and the query phase (query centering, query projection, inner product computation with candidates, and final filtering)—two main parallel computing technologies are employed: OpenMP for multicore CPUs and CUDA for GPUs. To further accelerate computations, especially in high-dimensional settings, two approaches are proposed, as described below.

3.2.1. Parallel SNN Approach (OpenMP-Based): Indexing Phase

This phase is executed once for a given dataset to prepare structures that accelerate the search. The implementation sequence consists of four main steps, with pseudocode provided in Algorithms 1–4.
  • Step 1: Computing the mean vector and data centering
First, the mean value is computed for each dimension across the entire dataset. Then, each data point is centered by subtracting this mean vector. Parallelization is achieved by distributing the computations among threads. The mean computation is parallelized across dimensions (#pragma omp parallel for), while centering is performed simultaneously across data points and their dimensions (#pragma omp parallel for collapse(2)).
Algorithm 1. Mean Calculation and Centering
// Calculate mean vector in parallel
FUNCTION compute_mean(data[n][d]):
  mean_vector[d] = {0}
  PARALLEL FOR dimension j FROM 0 TO d-1:
   sum_j = 0
   FOR point i FROM 0 TO n-1:
    sum_j += data[i][j]
   mean_vector[j] = sum_j/n
  RETURN mean_vector
// Center data matrix in parallel
FUNCTION center_data(data[n][d], mean_vector[d]):
  PARALLEL FOR point i FROM 0 TO n-1 AND dimension j FROM 0 TO d-1 (using collapse(2)):
   data[i][j] -= mean_vector[j]
  RETURN data
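As an illustration of Step 1, a minimal C++/OpenMP sketch of Algorithm 1 could look as follows (function and variable names are assumptions made for this example, not the authors' implementation):

#include <cstddef>
#include <omp.h>
#include <vector>

// Mean computation parallelized over dimensions; centering parallelized
// over both points and dimensions (collapse(2)), mirroring Algorithm 1.
std::vector<double> compute_mean(const std::vector<double>& data, int n, int d) {
    std::vector<double> mean(d, 0.0);
    #pragma omp parallel for
    for (int j = 0; j < d; ++j) {
        double sum = 0.0;
        for (int i = 0; i < n; ++i) sum += data[static_cast<std::size_t>(i) * d + j];
        mean[j] = sum / n;
    }
    return mean;
}

void center_data(std::vector<double>& data, int n, int d,
                 const std::vector<double>& mean) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < d; ++j)
            data[static_cast<std::size_t>(i) * d + j] -= mean[j];
}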
  • Step 2: Computing the first principal component v1
To find the direction of maximum data variance, an iterative power method is employed. Although the iteration loop itself is executed sequentially, the main computational load—matrix-vector multiplication—is performed using highly optimized BLAS functions (for example, cblas_dgemv), which can be internally multithreaded.
Algorithm 2. Power Iteration for First Principal Component
FUNCTION compute_first_pc(centered_data[n][d]):
 v1[d] = random_vector()
 FOR iter FROM 0 TO max_iterations:
  // BLAS calls can be internally multi-threaded
  temp_vec = BLAS_CALL(matrix_vector_mult, centered_data, v1)
  v1 = BLAS_CALL(matrix_transpose_vector_mult, centered_data, temp_vec)
  v1 = BLAS_CALL(vector_normalize, v1)
 RETURN v1
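A possible C++ realization of the power iteration in Algorithm 2 using CBLAS calls (cblas_dgemv, cblas_dnrm2, cblas_dscal); the starting vector and iteration count are illustrative choices, not the authors' settings:

#include <cblas.h>
#include <cmath>
#include <cstddef>
#include <vector>

// Power iteration for the first principal component of the centered data X (n x d).
// Each iteration computes t = X * v, then v = X^T * t, then normalizes v.
std::vector<double> first_principal_component(const std::vector<double>& X,
                                              std::size_t n, std::size_t d,
                                              int max_iterations = 50) {
    // Fixed start for reproducibility; a random start vector is also common.
    std::vector<double> v(d, 1.0 / std::sqrt(static_cast<double>(d)));
    std::vector<double> t(n, 0.0);
    for (int iter = 0; iter < max_iterations; ++iter) {
        // t = X * v
        cblas_dgemv(CblasRowMajor, CblasNoTrans, static_cast<int>(n), static_cast<int>(d),
                    1.0, X.data(), static_cast<int>(d), v.data(), 1, 0.0, t.data(), 1);
        // v = X^T * t
        cblas_dgemv(CblasRowMajor, CblasTrans, static_cast<int>(n), static_cast<int>(d),
                    1.0, X.data(), static_cast<int>(d), t.data(), 1, 0.0, v.data(), 1);
        // Normalize v to unit length.
        const double norm = cblas_dnrm2(static_cast<int>(d), v.data(), 1);
        if (norm > 0.0) cblas_dscal(static_cast<int>(d), 1.0 / norm, v.data(), 1);
    }
    return v;
}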
  • Step 3: Data projection and norms calculation
Each centered vector x i is projected onto v 1 . In parallel, the squared norm of each x i is computed. Projections are calculated in a single BLAS gemv call, while norm computations are parallelized using the directive #pragma omp parallel for.
Algorithm 3. Projections and Norms Calculation
FUNCTION compute_projections_and_norms(centered_data[n][d], v1[d]):
 // Calculate all projections at once using BLAS
 projections[n] = BLAS_CALL(matrix_vector_mult, centered_data, v1)
 // Calculate norms in parallel
 results[n] = empty_array_of_tuples
 PARALLEL FOR point i FROM 0 TO n-1:
  norm_sq_i = BLAS_CALL(dot_product, centered_data[i], centered_data[i])
  // Store projection, original index, and norm together
  results[i] = (projections[i], i, norm_sq_i)
 RETURN results
  • Step 4: Parallel Sorting
The set of tuples results is sorted by the projection value. The Concurrency::parallel_sort library is used, which efficiently distributes the sorting task across CPU cores.
Algorithm 4. Parallel Sorting
FUNCTION sort_by_projection(results[n]):
 // parallel_sort partitions the data and sorts sub-arrays in parallel
 PARALLEL_SORT(results, compare_by_first_element)
 RETURN results
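For Step 4, a short sketch of the sorting call using the PPL routine named in the text (MSVC-specific; the tuple layout follows Algorithm 3 and is an assumption of this example; on other toolchains std::sort with a parallel execution policy is an equivalent option):

#include <ppl.h>   // concurrency::parallel_sort (MSVC Parallel Patterns Library)
#include <cstddef>
#include <tuple>
#include <vector>

// Each entry holds (projection, original index, squared norm), as in Algorithm 3.
using Entry = std::tuple<double, std::size_t, double>;

void sort_by_projection(std::vector<Entry>& entries) {
    concurrency::parallel_sort(entries.begin(), entries.end(),
        [](const Entry& a, const Entry& b) {
            return std::get<0>(a) < std::get<0>(b);  // compare by projection value
        });
}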

3.2.2. Parallel SNN Approach (OpenMP-Based): Query Phase

In this phase, neighbors are searched for the query vector. The largest performance gain is achieved when processing queries in batches. The sequence of operations is presented in pseudocode in Algorithms 5 and 6.
  • Step 1: Query preparation and dot product computation
The query vector q is centered. To accelerate the final distance verification, a vector of dot products between the centered query and all database points is precomputed. This operation corresponds to a matrix-vector multiplication and is efficiently executed with a single cblas_dgemv call.
Algorithm 5. Preparation and Dot Product Calculation
FUNCTION prepare_query(query_vec[d], data[n][d], v1[d], mean[d]):
 // Center the query vector
  centered_q[d]
 // This small loop can be parallelized conditionally
 PARALLEL FOR IF (d > 100) dimension j FROM 0 TO d-1:
  centered_q[j] = query_vec[j] − mean[j]
 // Calculate projection and norm of the query vector using BLAS
 q_projection = BLAS_CALL(dot_product, v1, centered_q)
 q_norm_sq = BLAS_CALL(dot_product, centered_q, centered_q)
 // Pre-calculate all dot products between database points and the query vector
  all_dot_products[n] = BLAS_CALL(matrix_vector_mult, data, centered_q)
  RETURN centered_q, q_projection, q_norm_sq, all_dot_products
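The key precomputation in Algorithm 5 amounts to a single matrix-vector product; a hedged C++ sketch (names are illustrative, centered row-major data assumed):

#include <cblas.h>
#include <cstddef>
#include <vector>

// Precompute <p_i_centered, q_centered> for all n database points in one BLAS call.
// This shifts the bulk of the query cost into a single, highly optimized gemv.
std::vector<double> all_dot_products(const std::vector<double>& centered_data,
                                     std::size_t n, std::size_t d,
                                     const std::vector<double>& centered_q) {
    std::vector<double> dots(n, 0.0);
    cblas_dgemv(CblasRowMajor, CblasNoTrans,
                static_cast<int>(n), static_cast<int>(d), 1.0,
                centered_data.data(), static_cast<int>(d),
                centered_q.data(), 1, 0.0, dots.data(), 1);
    return dots;
}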
  • Step 2: Candidate search and final filtering
A narrow range of candidates is determined using binary search (std::lower_bound/std::upper_bound) on the sorted projection array. The final loop, which verifies the exact Euclidean distance, is parallelized over the identified candidates. This allows multiple candidates to be checked simultaneously, which is especially efficient when the number of candidates is large.
Algorithm 6. Filtering
FUNCTION filter_query(sorted_data[n], R, q_projection, q_norm_sq, all_dot_products[n]):
 // Find candidate range using binary search (sequential)
 lower_bound = BINARY_SEARCH_LOWER(sorted_data, q_projection - R)
  upper_bound = BINARY_SEARCH_UPPER(sorted_data, q_projection + R)

 // Iterate through the candidate subset in parallel
 results = empty_list
 PARALLEL FOR each candidate IN range(lower_bound, upper_bound):
  original_index = candidate.index
   dot_xy = all_dot_products[original_index]
   dist_sq = candidate.norm_sq + q_norm_sq - 2 * dot_xy
   IF dist_sq <= R*R:
    // Synchronization is needed for concurrent writes
    CRITICAL_SECTION:
     APPEND original_index TO results
 RETURN results
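A C++/OpenMP sketch of the final filtering in Algorithm 6, using a critical section for result collection as in the pseudocode (the Candidate layout is an assumption for this example; per-thread buffers merged at the end would be a lower-contention alternative):

#include <cstddef>
#include <omp.h>
#include <vector>

struct Candidate {
    double projection;
    std::size_t original_index;
    double norm_sq;
};

// Exact distance check over the candidate range found by binary search.
// dist^2 = ||p||^2 + ||q||^2 - 2<p, q>, using the precomputed dot products.
std::vector<std::size_t> filter_candidates(const std::vector<Candidate>& sorted_data,
                                           int lo, int hi,
                                           double q_norm_sq, double R,
                                           const std::vector<double>& all_dot_products) {
    std::vector<std::size_t> results;
    const double r_sq = R * R;
    #pragma omp parallel for
    for (int k = lo; k < hi; ++k) {
        const Candidate& c = sorted_data[k];
        const double dist_sq = c.norm_sq + q_norm_sq - 2.0 * all_dot_products[c.original_index];
        if (dist_sq <= r_sq) {
            #pragma omp critical
            results.push_back(c.original_index);
        }
    }
    return results;
}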

3.2.3. Parallel SNN Approach (CUDA-Based): Indexing Phase

  • Step 1: Initialization and data transfer to GPU
This is the initial stage, where the data is moved from the computer’s main memory (Host) to the GPU memory (Device). Memory on the GPU is allocated for all required structures, and then the initial dataset is copied to the device in a single large block. The pseudocode for this step is provided in Algorithm 7.
Algorithm 7. GPU Memory Initialization and Data Transfer
FUNCTION initialize_on_gpu(h_data[n][d]):
 // Allocate memory on Device for all data structures
 d_data = CUDA_MALLOC(n*d)
 d_mean = CUDA_MALLOC(d)
 d_first_pc = CUDA_MALLOC(d)
 d_projections = CUDA_MALLOC(n)
  d_norms_sq = CUDA_MALLOC(n)
 d_indices = CUDA_MALLOC(n)
 // Copy main data from Host to Device
 CUDA_MEMCPY(d_data, h_data, HostToDevice)
  • Step 2: Mean calculation and data centering on GPU
These operations are executed in parallel over all data. First, a specialized CUDA kernel performs a parallel reduction to compute the mean vector μ. Then, another kernel (subtract_mean_kernel_global) is launched with a thread grid, where each thread handles one element of the data matrix and performs centering by subtracting the corresponding mean element. The pseudocode is given in Algorithm 8.
Algorithm 8. Mean Calculation and Data Centering on GPU
FUNCTION center_data_on_gpu(d_data[n][d]):
 // Calculate mean vector on GPU using a parallel reduction kernel
  d_mean = LAUNCH KERNEL(parallel_reduction_mean, d_data)
  // Center data matrix by launching a kernel
  LAUNCH KERNEL(subtract_mean_kernel_global, d_data, d_mean, n, d)
  • Step 3: Computing the first principal component v1 on GPU
The power iteration method is executed entirely on the GPU. In the iterative loop, the main computational load—matrix-vector multiplication and normalization—is performed using high-performance cuBLAS functions. The pseudocode is given in Algorithm 9.
Algorithm 9. Power Iteration for First Principal Component on GPU
FUNCTION compute_first_pc_on_gpu(d_data[n][d]):
  d_v1 = random_vector_on_device()
  d_temp_vec = CUDA_MALLOC(n)
  FOR iter FROM 0 TO max_iterations:
   // All calls are to cuBLAS library running on GPU
   CUBLAS_CALL(gemv, d_data, d_v1, d_temp_vec)   // d_temp_vec = X * v1
   CUBLAS_CALL(gemv, d_data_transposed, d_temp_vec, d_v1) // v1 = X^T * d_temp_vec
  norm = CUBLAS_CALL(nrm2, d_v1)
  scale = 1.0/norm
  CUBLAS_CALL(scal, d_v1, scale)
RETURN d_v1
  • Step 4: Projections, norms, and indices calculation on GPU
At this stage, all data required for sorting are prepared. Projections are computed in a single cuBLAS gemv call. To compute squared norms, a CUDA kernel is launched where each thread computes the norm for one vector. Simultaneously, the Thrust library is used to generate an array of initial indices. The pseudocode is given in Algorithm 10.
Algorithm 10. Projections, Norms, and Indices Calculation
FUNCTION prepare_for_sort(d_data, d_v1):
  // Calculate all projections using one cuBLAS call
  d_projections = CUBLAS_CALL(gemv, d_data, d_v1)
  // Calculate all norms in parallel with a custom kernel
  LAUNCH KERNEL(compute_row_norms_sq_kernel, d_data, d_norms_sq)
  // Create a sequence of indices (0, 1, 2,...) on the GPU
  THRUST_CALL(sequence, d_indices)
  • Step 5: Parallel sorting on GPU
For sorting, the thrust::sort_by_key function is called, which sorts the projection array in parallel. The same permutation is simultaneously applied to the index and squared-norm arrays to maintain correct correspondence. The pseudocode is given in Algorithm 11.
Algorithm 11. Parallel Sorting on GPU
FUNCTION sort_on_gpu(d_projections, d_indices, d_norms_sq):
  // Sort d_indices and d_norms_sq arrays based on the keys in d_projections
 THRUST_CALL(sort_by_key,
       keys: d_projections,
       values: {d_indices, d_norms_sq})
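In Thrust, sorting one key array while permuting two value arrays in lockstep, as Algorithm 11 describes, can be expressed with a zip iterator; a sketch assuming device vectors already filled by the previous steps (names are illustrative):

#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/tuple.h>

// Sort projections (keys) and apply the same permutation to indices and squared norms.
void sort_index_on_gpu(thrust::device_vector<double>& d_projections,
                       thrust::device_vector<int>&    d_indices,
                       thrust::device_vector<double>& d_norms_sq) {
    thrust::sequence(d_indices.begin(), d_indices.end());  // 0, 1, 2, ...
    auto values_first = thrust::make_zip_iterator(
        thrust::make_tuple(d_indices.begin(), d_norms_sq.begin()));
    thrust::sort_by_key(d_projections.begin(), d_projections.end(), values_first);
}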

3.2.4. Parallel SNN Approach (CUDA-Based): Query Phase

The query phase consists of sequential steps executed entirely in GPU memory for efficient neighbor search.
  • Step 1: Query transfer and preparation
The query vector q is copied from host to device. All subsequent operations—centering, projection, and norm computation—are performed on the GPU using cuBLAS functions (see Algorithm 12).
Algorithm 12. Prepare Query on GPU
FUNCTION prepare_query_on_gpu(h_query_vec[d]):
  d_query_vec = CUDA_MALLOC(d)
  CUDA_MEMCPY(d_query_vec, h_query_vec, HostToDevice)
  // Center the query vector on GPU
  LAUNCH KERNEL(center_query_kernel, d_query_vec, d_mean)
  // Calculate projection and norm, results remain on device/host registers
  q_projection = CUBLAS_CALL(dot, d_first_pc, d_query_vec)
  q_norm_sq = CUBLAS_CALL(dot, d_query_vec, d_query_vec)
  RETURN q_projection, q_norm_sq, d_query_vec
  • Step 2: Candidate range search
A narrow range of potential candidates is determined using parallel binary search via thrust::lower_bound and thrust::upper_bound (see Algorithm 13).
Algorithm 13. Find Candidate Range on GPU
FUNCTION find_candidates_on_gpu(d_projections, q_projection, R):
  // Perform parallel binary search on the GPU
 range_start_iterator = THRUST_CALL(lower_bound, d_projections, q_projection - R)
 range_end_iterator = THRUST_CALL(upper_bound, d_projections, q_projection + R)
  // Calculate the start index and the number of candidates
  first_candidate_index = range_start_iterator - d_projections
  num_candidates = range_end_iterator - range_start_iterator
  RETURN first_candidate_index, num_candidates
  • Step 3: Final filtering with CUDA kernel
A specialized CUDA kernel is launched where each thread processes a single candidate in the identified range. Each thread reads candidate data from global memory, computes the exact Euclidean distance, and, if it satisfies the radius condition, prepares to write the result.
  • Step 4: Atomic Result Write and Copy to Host
An atomic counter is used to avoid write conflicts. After the kernel’s execution, a compact array of found indices is copied back to the host (see Algorithm 14).
Algorithm 14. Final Filtering and Result Collection
FUNCTION filter_and_get_results(first_candidate_index, num_candidates, ...):
 // Allocate buffer for results on GPU
 d_results = CUDA_MALLOC(num_candidates) // Max possible size
 d_result_count = CUDA_MALLOC_AND_ZERO(1) // Atomic counter
 // Launch the kernel to perform final distance check in parallel
 LAUNCH KERNEL(filter_candidates_kernel,
         first_candidate_index,
         num_candidates,
         //... other necessary data pointers ...
         d_results,
         d_result_count
        )
 // Copy only the valid results back to the Host
 num_found = CUDA_MEMCPY_FROM_DEVICE(d_result_count)
  h_results = CUDA_MALLOC_HOST(num_found)
 CUDA_MEMCPY(h_results, d_results, num_found, DeviceToHost)
 RETURN h_results
// KERNEL DEFINITION
KERNEL filter_candidates_kernel(...):
 thread_id = ...
 IF thread_id >= num_candidates THEN RETURN
 candidate_index = first_candidate_index + thread_id
 // ... read candidate data (norm, original_index) ...
 // ... calculate dot_product with query vector ...
 dist_sq = candidate.norm_sq + query.norm_sq - 2 * dot_product
 IF dist_sq <= R*R:
   // Get a unique position in the output array and write the result
   write_position = ATOMIC_ADD(d_result_count, 1)
   d_results[write_position] = candidate.original_index
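A compact CUDA sketch of the filtering kernel described above; the argument list and memory layout are assumptions made for this example, not the authors' exact implementation:

// One thread per candidate in [first_candidate, first_candidate + num_candidates).
// Uses dist^2 = ||p||^2 + ||q||^2 - 2<p, q>, with the dot product computed on the fly.
__global__ void filter_candidates_kernel(const double* d_data,      // centered data, row-major n x d
                                         const double* d_query,     // centered query, length d
                                         const double* d_norms_sq,  // squared norms in sorted order
                                         const int*    d_indices,   // original indices in sorted order
                                         int first_candidate, int num_candidates,
                                         int d, double q_norm_sq, double r_sq,
                                         int* d_results, int* d_result_count) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= num_candidates) return;

    int pos = first_candidate + t;          // position in the sorted arrays
    int original_index = d_indices[pos];

    double dot = 0.0;
    for (int j = 0; j < d; ++j)
        dot += d_data[(size_t)original_index * d + j] * d_query[j];

    double dist_sq = d_norms_sq[pos] + q_norm_sq - 2.0 * dot;
    if (dist_sq <= r_sq) {
        int write_pos = atomicAdd(d_result_count, 1);  // unique output slot
        d_results[write_pos] = original_index;
    }
}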

3.3. Computational Complexity of the Proposed Parallel Approaches

To formally assess the efficiency of the parallel approaches to the SNN method proposed in this work, we perform a theoretical analysis of their computational complexity.
For the analysis, we introduce the following notations:
  • n—the total number of vector representations (data points) in the database P.
  • d—the dimensionality of the feature space, i.e., the length of each vector.
  • P—the number of parallel computational units (for example, CPU cores) involved in the computations.
  • $k_{iter}$—the number of iterations performed in the power iteration method to find the first principal component.
  • $N_c$—the number of candidate vectors selected at the filtering stage after searching in the sorted array of projections.
The goal of this analysis is to derive estimates for the execution times of the sequential ($T_1$) and parallel ($T_p$) algorithms, which will allow us to theoretically justify the expected speedup and efficiency of parallelization.

3.3.1. Analysis of the Indexing Phase

  • Step 1: Computing the mean vector and data centering
Regarding the sequential complexity $T_1$, computing the mean vector requires summing the values across each of the d dimensions for all n data points, which has a complexity of O(nd). The subsequent data centering, which involves subtracting the mean vector from each of the n vectors, also requires O(nd) operations. Thus, the total sequential complexity of this step is $T_1(\text{centering}) = O(nd)$.
Regarding the parallel complexity $T_p$, both operations are examples of element-wise computations and are easily parallelizable. The summation for the mean can be performed using parallel reduction, while centering can be achieved by distributing the n vectors among P computational units. In the ideal case, the complexity reduces to $T_p(\text{centering}) = O(nd/P)$. In practice, the performance of this step is often limited by memory bandwidth, since the number of arithmetic operations per byte of data is low.
  • Step 2: Computing the first principal component v1
In terms of sequential complexity $T_1$, instead of using the classical PCA approach based on singular value decomposition (SVD), which has a high computational cost (for example, $O(nd^2)$), an iterative power iteration method is applied. This method is significantly more efficient when only the first principal component is needed. The primary operation in each of the $k_{iter}$ iterations is multiplying the $n \times d$ centered data matrix by a vector, which has a complexity of O(nd). Therefore, the total sequential complexity for computing $v_1$ is $T_1(v_1) = O(k_{iter}\,nd)$.
For the parallel complexity $T_p$, the gemv operation can be efficiently parallelized using modern libraries. The computation is distributed over the rows of the matrix, reducing the complexity to $T_p(v_1) = O(k_{iter}\,nd/P)$.
  • Step 3: Data projection and norm computation
For the sequential complexity $T_1$, the projection of all n centered vectors onto the principal component $v_1$ is computed with gemv ($X v_1$), which has a complexity of O(nd). In addition, the squared Euclidean norm of each of the n vectors is computed, which also requires O(nd) operations (d multiplications and additions for each of the n vectors). Therefore, the total complexity of this step is $T_1(\text{projection}) = O(nd)$.
For the parallel complexity $T_p$, similar to the previous steps, the gemv call is parallelized to O(nd/P). The computation of norms is an independent task for each vector, and thus this process is also perfectly parallelizable, achieving a complexity of O(nd/P). Consequently, the overall parallel complexity is $T_p(\text{projection}) = O(nd/P)$.
  • Step 4: Parallel sorting
For the sequential complexity $T_1$, sorting the n projection values using an efficient sorting algorithm has an average computational complexity of $T_1(\text{sorting}) = O(n \log n)$.
For the parallel complexity $T_p$, this work employs highly optimized parallel libraries: Concurrency::parallel_sort for the CPU and thrust::sort_by_key for the GPU. The theoretical complexity of such algorithms approaches $T_p(\text{sorting}) = O((n \log n)/P)$. It is important to note that actual performance depends on the communication overhead between computational units and the balance of workload distribution.
Summing the complexities of all the steps, we obtain the following:
For the sequential case: $T_1(\text{indexing}) = O(nd) + O(k_{iter}\,nd) + O(nd) + O(n \log n)$.
The dominant terms are those related to computing the principal component and sorting, so the overall complexity is $T_1(\text{indexing}) = O(k_{iter}\,nd + n \log n)$.
For the parallel case: $T_p(\text{indexing}) = O(nd/P) + O(k_{iter}\,nd/P) + O(nd/P) + O((n \log n)/P)$.
Similarly, the total parallel complexity is $T_p(\text{indexing}) = O(k_{iter}\,nd/P + (n \log n)/P)$.

3.3.2. Analysis of the Query Phase

  • Step 1: Query preparation and precomputation of scalar products
For the sequential complexity $T_1$, this step includes centering the query vector (O(d)) and computing its projection and norm (O(d)). However, the key optimization described in the algorithm is the precomputation of scalar products between the centered query vector and all n centered vectors from the database. This operation is performed with a single gemv call and has a complexity of O(nd). This step is therefore the dominant one, leading to $T_1(\text{preparation}) = O(nd)$. Such an approach shifts the main computational workload to the beginning of the query phase, which significantly simplifies the final distance verification.
For the parallel complexity $T_p$, the gemv call for computing scalar products is parallelized to $T_p(\text{preparation}) = O(nd/P)$.
  • Step 2: Candidate range search
For both the sequential and parallel implementations, the range of potential neighbors is determined by performing a binary search (using std::lower_bound and std::upper_bound) on the sorted array of n projections. The complexity of binary search is $O(\log n)$ comparisons. In the parallel implementation, the complexity of this step remains $T_p(\text{search}) = O(\log n)$. This step represents a potential bottleneck that may limit the overall speedup according to Amdahl’s law.
  • Step 3: Final filtering
For the sequential complexity $T_1$, this stage verifies the exact Euclidean distance for the $N_c$ candidates identified in the previous step. Due to the precomputation of scalar products in Step 1, checking the condition $\|q - p_i\|_2^2 \le r^2$ has a complexity of O(1) for each candidate. Therefore, the total complexity of filtering is $T_1(\text{filtering}) = O(N_c)$.
For the parallel complexity $T_p$, the verification of each candidate is an independent operation, making this loop perfectly parallelizable. Each thread (CPU) or CUDA thread can process its own candidate. The complexity of this stage is $T_p(\text{filtering}) = O(N_c/P)$. The efficiency of parallelization here directly depends on the number of candidates $N_c$: when the candidate set is small, the overhead of launching the parallel region may outweigh the performance gain.
The final complexity of the query phase is as follows:
Sequential: $T_1(\text{query}) = O(nd) + O(\log n) + O(N_c)$. Since in the general case $nd \gg N_c$ and $nd \gg \log n$, the dominant term is the first one: $T_1(\text{query}) \approx O(nd)$.
Parallel: $T_p(\text{query}) = O(nd/P) + O(\log n) + O(N_c/P)$. Here, the dominant terms are the parallelized part and the sequential binary search: $T_p(\text{query}) = O(nd/P + \log n)$.

3.3.3. Theoretical Speedup $S_p$ and Efficiency $E_p$

Speedup is defined as the ratio of the execution time of the sequential algorithm to that of the parallel one: $S_p = T_1 / T_p$. Efficiency reflects how well computational resources are utilized: $E_p = S_p / P$.
For the indexing phase, the speedup is given by $S_p(\text{indexing}) = \dfrac{O(k_{iter}\,nd + n \log n)}{O(k_{iter}\,nd/P + (n \log n)/P)} \approx P$.
In the ideal case, the speedup approaches P, while the efficiency approaches one, $E_p \to 1$. In practice, however, efficiency will be lower than unity due to overheads such as thread creation, synchronization, and load imbalance. For the query phase, the speedup can be expressed as $S_p(\text{query}) = \dfrac{O(nd)}{O(nd/P + \log n)}$.
This formula shows that the speedup for a single query is fundamentally limited. As the number of processors P increases, the term nd/P tends toward zero, while the sequential component $O(\log n)$ remains constant. This is a classical demonstration of Amdahl’s law: the maximum speedup cannot exceed $T_1 / T_{\text{sequential part}}$, which in this case is $O(nd/\log n)$. Consequently, achieving high efficiency on massively parallel systems (such as GPUs) requires shifting to batch query processing. By handling a batch of m queries simultaneously, it becomes possible to use vectorized search functions (for instance, thrust::lower_bound for an array of values), thereby amortizing the cost of the sequential part and more effectively utilizing computational resources.
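One way the batched strategy could be realized for the dot-product precomputation is to replace m separate gemv calls with a single gemm; a hedged C++ sketch with illustrative names (not the authors' implementation):

#include <cblas.h>
#include <cstddef>
#include <vector>

// Dot products between m centered queries (Q, m x d) and n centered database
// vectors (X, n x d) in one call: D = Q * X^T, an m x n matrix.
// This amortizes the per-query sequential overhead identified by the Amdahl analysis.
std::vector<double> batched_dot_products(const std::vector<double>& Q, std::size_t m,
                                         const std::vector<double>& X, std::size_t n,
                                         std::size_t d) {
    std::vector<double> D(m * n, 0.0);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                static_cast<int>(m), static_cast<int>(n), static_cast<int>(d),
                1.0, Q.data(), static_cast<int>(d),
                X.data(), static_cast<int>(d),
                0.0, D.data(), static_cast<int>(n));
    return D;
}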
In addition to theoretical limitations, practical factors significantly affect real efficiency, including overheads for data transfer between host and device memory via cudaMemcpy, kernel launch delays in CUDA, and thread synchronization through atomic operations or critical sections. The analysis confirms the validity of the proposed parallel approaches, which demonstrate high potential for speedup through effective parallelization of computationally intensive operations. Key advantages include linear scalability of the indexing phase with respect to data dimensionality and an efficient precomputation strategy in the query phase. At the same time, a fundamental limitation for single-query processing is identified, indicating batch processing as a primary direction for further research and optimization.
The described parallel SNN mechanisms directly explain the experimental outcomes. An exact recall of 1.0 is achieved due to precise data centering, projection onto the first principal component, and final distance verification. The observed speedup for both CPU (OpenMP) and GPU (CUDA) implementations results from efficiently parallelized operations, such as matrix-vector multiplication, norm computation, and sorting. The theoretical analysis presented in Section 3.3 aligns with these results, confirming the impact of parallelization. The limitation imposed by sequential components, as predicted by Amdahl’s law, manifests in single-query speedup saturation, highlighting the advantage of batch query processing to achieve higher efficiency.

4. Results

This section presents the results of the experimental study on the efficiency of the proposed parallel approaches to SNN-based nearest neighbor search in the context of face recognition. The analysis covers the accuracy, scalability, and correctness of the algorithm’s implementation in both sequential and parallel (OpenMP and CUDA) execution.

4.1. Experimental Environment and Settings

4.1.1. LFW Dataset and Preparation

The experiments were conducted on the widely used Labeled Faces in the Wild (LFW) dataset [37], containing 13,233 images of 5749 individuals, with 1680 subjects having at least two images. Images (250 × 250 JPG) were preprocessed using the YuNet [38] face detector from the OpenCV DNN module [6], which provided bounding boxes for face cropping. Detection errors were negligible in number, and the affected images were excluded during preparation.

4.1.2. Embedding Generation

Two open models were employed to generate face embeddings:
  • FaceNet512—512-dimensional vectors, achieving up to 98.4% accuracy on LFW [6,39].
  • VGG-Face—4096-dimensional vectors, with 96.7–98.78% accuracy [6,39,40].
The dataset size (n) and embedding dimensionality (d) were documented for performance analysis.

4.1.3. Augmentation for Scalability Evaluation

To assess scalability with respect to dataset size, embeddings were augmented by replicating vectors with added Gaussian noise (σ = 0.01), producing controlled dataset variants:
  • n_original = 13,233;
  • n_aug1 = 26,466;
  • n_aug2 = 52,932;
  • n_aug3 = 105,864.
This procedure preserved statistical properties while enabling evaluation of SNN’s performance under increasing load.
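For reproducibility, the augmentation step can be sketched in C++ as follows; the function name, the use of std::mt19937, and the fixed seed are illustrative assumptions, while the noise parameter (σ = 0.01) follows the procedure described above.

#include <random>
#include <vector>
#include <cstddef>

// Doubles the dataset: each original embedding is copied once with additive
// Gaussian noise N(0, sigma^2), preserving the overall statistical properties.
std::vector<std::vector<double>> augment_once(const std::vector<std::vector<double>>& embeddings,
                                              double sigma = 0.01, unsigned seed = 42)
{
    std::mt19937 gen(seed);
    std::normal_distribution<double> noise(0.0, sigma);

    std::vector<std::vector<double>> out = embeddings;   // keep the originals
    out.reserve(2 * embeddings.size());
    for (const auto& v : embeddings) {
        std::vector<double> noisy(v.size());
        for (std::size_t j = 0; j < v.size(); ++j)
            noisy[j] = v[j] + noise(gen);                // replicate with noise
        out.push_back(std::move(noisy));
    }
    return out;                                          // size 2n: n_aug1 from n_original
}

Applying the same step to the already augmented set again yields n_aug2 and n_aug3.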

4.1.4. Hardware and Software

The experiments were carried out in a test environment with the following specifications: CPU—AMD Ryzen 3 1200 (4 threads), GPU—NVIDIA GeForce GTX 1050 Ti with 4 GB memory, and 16 GB DDR4 RAM.
The embedding datasets were generated using two models. The FaceNet512 model produced 512-dimensional embeddings, while the VGG-Face model generated 4096-dimensional embeddings. For both models, the original dataset contained n = 13,233 embeddings. To evaluate scalability, three augmented datasets were created through replication with the addition of Gaussian noise (σ = 0.01), resulting in n = 26,466 (n_aug1), n = 52,932 (n_aug2), and n = 105,864 (n_aug3).
To ensure numerical stability and address potential roundoff errors in high-dimensional calculations, all floating-point vector operations in our implementation were performed using double precision (float64). The facial embeddings were not L2-normalized. Given the high precision of float64, standard accumulation was employed for operations like dot products, as it is sufficient to prevent significant errors, making specialized techniques like Kahan summation unnecessary. This approach ensures that distance calculations were numerically stable and that the radius-based decisions for boundary samples were accurate, thereby preserving the exactness guarantee of the SNN algorithm.
To prevent thread oversubscription between our OpenMP code and the underlying BLAS library, and to ensure reproducible performance, the BLAS library was configured to run in a sequential mode by setting the environment variable OPENBLAS_NUM_THREADS = 1. This gave our OpenMP directives full control over the parallel execution.
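One hedged way to enforce this configuration programmatically, rather than through the environment variable alone, is shown below; it assumes OpenBLAS is the BLAS backend and that its openblas_set_num_threads interface is available, and it is a sketch rather than the exact setup script used in our experiments.

// Provided by OpenBLAS; forcing a single BLAS thread leaves all parallelism
// under the control of our OpenMP directives and avoids thread oversubscription.
extern "C" void openblas_set_num_threads(int num_threads);

int main()
{
    // Comparable in effect to exporting OPENBLAS_NUM_THREADS=1 before launch.
    openblas_set_num_threads(1);

    // ... build the SNN index and run queries; OpenMP regions now own all threads ...
    return 0;
}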

4.2. Verification of SNN Algorithm’s Correctness

4.2.1. Verification Method

The accuracy of the SNN implementations (sequential, OpenMP, CUDA) was verified against a brute-force search baseline, which guarantees retrieval of all neighbors within a given radius R. The sets of neighbors obtained by SNN were compared with brute-force results; a full match confirmed correctness.
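A minimal sketch of this verification step is given below; it assumes both searches return index sets for the same query and radius, and that recall is computed as the fraction of brute-force neighbors contained in the SNN result (names such as snn_result and bruteforce_result are illustrative).

#include <algorithm>
#include <iterator>
#include <vector>
#include <cstddef>

// Recall of an SNN result against the brute-force ground truth for one query:
// |SNN ∩ ground truth| / |ground truth|. For an exact method the value is 1.0.
double recall(std::vector<std::size_t> snn_result,
              std::vector<std::size_t> bruteforce_result)
{
    if (bruteforce_result.empty()) return 1.0;           // nothing to find
    std::sort(snn_result.begin(), snn_result.end());
    std::sort(bruteforce_result.begin(), bruteforce_result.end());

    std::vector<std::size_t> common;
    std::set_intersection(snn_result.begin(), snn_result.end(),
                          bruteforce_result.begin(), bruteforce_result.end(),
                          std::back_inserter(common));
    return static_cast<double>(common.size()) /
           static_cast<double>(bruteforce_result.size());
}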

4.2.2. Recall Results

Recall was used as the main metric, defined as [10]
Recall = (number of true neighbors found within radius R) / (total number of true neighbors within radius R).
For an exact algorithm such as SNN, the expected value is a recall of 1.0. Tests confirmed this result for all implementations, embedding types, dataset sizes, and radius values. The experimental results are summarized in Table 3.
The radius values R presented in Table 3 are distance thresholds used to distinguish between “genuine pairs” (vectors from the same identity) and “impostor pairs” (vectors from different identities) in the Euclidean space. These thresholds are typically determined empirically on large-scale benchmark datasets by analyzing the distance distributions for both genuine and impostor pairs. The selection strategy aims to find an optimal operating point on the Receiver Operating Characteristic (ROC) curve, often corresponding to the Equal Error Rate (EER), where the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR). This point represents a balanced trade-off between system security (minimizing false matches) and usability (minimizing false rejections). Since the facial embeddings used in this study are not L2-normalized, the Euclidean distance was chosen over cosine distance. While cosine distance is often used for normalized vectors where the angle (direction) is the primary measure of similarity, the Euclidean distance considers both direction and magnitude, which can contain useful information in the feature space constructed by these models. A threshold R for Euclidean distance is therefore not interchangeable with a cosine similarity threshold, which would require separate empirical validation to determine its optimal value. The specific values used in this study are the established default thresholds adopted from the DeepFace [6] open-source framework, which have been validated to provide robust and balanced performance for their respective models.
For all verification experiments, the dataset size was n = 13,233, and brute-force search served as the comparison method. The results confirm that SNN is an exact and deterministic FRNNS method, ensuring 100% accuracy across all configurations and providing a solid basis for the subsequent performance and scalability analysis.

4.3. Performance Analysis and Comparison with Baseline Algorithms

The performance of the SNN algorithm’s implementation was evaluated across various hardware and software platforms, and a comparison was conducted with classical nearest neighbor search algorithms, including k-d tree, ball tree, and brute force (both sequential and parallel variants). The experiments were performed for different numbers of vectors N and for two embedding dimensionalities: d = 512 and d = 4096.

4.3.1. SNN Scalability Analysis

For the OpenMP implementation of the SNN algorithm, experiments were carried out with varying numbers of threads. Execution time, speedup, and parallel efficiency were measured for both the indexing and query phases. The optimal performance was observed using 2–4 threads.
Indexing with a dimensionality of 512 required approximately 6.3 s, while for 4096 it took about 45.6 s. Query times were approximately 21 milliseconds for 512 and about 171 milliseconds for 4096. The results are presented in Figure 1.
Table 4 presents the indexing times for embeddings with dimensionality d = 4096. As the dataset size N increases, a significant growth in indexing time is observed across all methods. The fastest implementation is SNN on CUDA, whose times are considerably lower than all other approaches (ranging from 2.11 s for N = 13,233 to 21.06 s for N = 105,864). The sequential implementation without BLAS did not complete for a large N (denoted by “–”), indicating its inefficiency. The use of BLAS considerably accelerates the sequential implementation, but it remains slower than both OpenMP and CUDA. The k-d tree and ball tree algorithms exhibit substantially longer indexing times, with ball tree being faster than k-d tree, yet both lag behind SNN implementations.
Table 5 shows the indexing times for the smaller embedding dimensionality d = 512. The overall trend remains: the CUDA implementation of SNN is the fastest (ranging from 0.38 s to 1.70 s), followed by OpenMP and sequential with BLAS. The sequential implementation without BLAS again shows the worst performance, although it successfully completes for all dataset sizes. The k-d tree and ball tree methods show lower indexing times compared to the d = 4096 case, but they are still significantly slower than SNN.
Thus, the analysis of the tables confirms that SNN’s implementation on CUDA consistently provides the highest efficiency for indexing regardless of embedding dimensionality. The sequential implementation with BLAS also demonstrates better performance compared to classical structures k-d tree and ball tree. Moreover, k-d tree and ball tree scale poorly with increasing dimensionality d—their indexing times increase much faster than those of SNN, making them less suitable for high-dimensional data processing.

4.3.2. Comparison with Other Methods (Indexing)

Table 4 and Table 5 present the indexing results (index construction time) for SNN with different implementations (sequential without BLAS, sequential with BLAS, OpenMP, CUDA), as well as for k-d tree and ball tree.

4.3.3. Comparison with Other Methods (Query)

Table 6 and Table 7 present the neighbor search times (query), including implementations using BLAS, GPU, and brute-force methods.
The CUDA implementation of SNN demonstrates the highest performance in all tested cases. The brute-force implementation on the CPU lags several times behind, even when compared to the OpenMP variant. The k-d tree and ball tree algorithms prove to be practically inefficient at high dimensionality (d = 4096). For lower dimensions (d = 512), all methods perform relatively quickly, but SNN remains the fastest.
We propose two parallel approaches for implementing the SNN method based on OpenMP and CUDA technologies. The SNN implementations utilizing BLAS in combination with these technologies achieve the best performance among all tested variants. Classical data structures such as k-d tree and ball tree lose efficiency at high dimensionalities. The GPU implementation exhibits nearly linear scaling with respect to both the number of vectors N and the embedding dimensionality d. Consequently, parallel versions of the SNN method are suitable for both real-time and batch processing, especially for high-dimensional datasets.
It is also worth noting that the obtained results are consistent with the key findings of [10] regarding the advantages of the SNN algorithm: guaranteed accuracy, high speed, absence of hyperparameters (except for the radius R), and relatively low indexing time. This study significantly extends the understanding of SNN’s potential by presenting and thoroughly evaluating the efficiency of its parallel approaches on a CPU (OpenMP) and GPU (CUDA), which were not the primary focus of the original publication. This study also confirms the well-known observation that traditional tree-based structures (k-d tree, ball tree) exhibit substantially reduced efficiency when handling high-dimensional data due to the “curse of dimensionality,” in agreement with numerous previous studies. A key aspect of this work is the demonstration that parallelization, particularly using GPUs, is a critical factor in unlocking the full potential of the SNN algorithm in large-scale systems. This effectively transforms SNN from a theoretically interesting approach into a practically implementable, fast, and accurate FRNNS method.

5. Discussion

5.1. Comparative Performance Analysis

This section presents a quantitative analysis of the performance of the Sorting-based SNN nearest neighbor search algorithm and its parallel implementations, in comparison with classical exact methods. The analysis has a twofold structure. First, the single-threaded SNN algorithm presented in [10] is evaluated to establish a baseline performance. Then, the performance of the new parallel implementations developed in this study is analyzed, allowing for a quantitative assessment of the speedup achieved using OpenMP and CUDA.
To ensure an objective and meaningful comparison of the results obtained across different hardware platforms and datasets, relative percentage speedup is used as the main metric. This approach minimizes the impact of platform-specific performance characteristics, enabling a direct evaluation of the efficiency gains resulting specifically from the SNN methodology and the applied parallelization strategies.
The original SNN algorithm proposed by Chen & Güttel demonstrated significant performance advantages over traditional tree-based methods. For a quantitative assessment of the method’s efficiency, the query performance data from the original SNN study were analyzed. Table 8 below is calculated based on the data from Table 6 in [10]. The speedup is presented as the ratio of the execution time of the baseline method to that of the optimized method.
The results presented in Table 8 confirm the significant performance advantage of the SNN algorithm. For example, on the F-MNIST dataset, SNN demonstrates a speedup of 9.91× to 14.20× compared to ball tree and 14.61× to 18.83× compared to k-d tree for the tested radii. Similarly, for the GIST dataset, the speedup ranges from 6.68× to 7.67× relative to ball tree and from 9.91× to 11.17× relative to k-d tree.
The performance gap between SNN and tree-based methods increases with the dimensionality of the data. This is not only due to implementation specifics but is a fundamental consequence of their different algorithmic approaches to search space reduction. Experimental data obtained in this study show that for dimensionality d = 512, ball tree remains relatively competitive; however, for d = 4096, its performance drops catastrophically, becoming even slower than some brute-force variants. This fully aligns with the “curse of dimensionality” theory. Tree-based methods rely on the assumption that points that are close in one projection are likely to be close in the full space. This assumption loses validity in high dimensions. In contrast, candidate pruning in SNN, based on a global data property (the first principal component), does not suffer from this local separation problem. Consequently, SNN’s performance is more stable and robust to increasing dimensionality, while tree-based methods exhibit catastrophic degradation, explaining why the SNN speedup is so substantial in high-dimensional contexts.
Building on the reliable foundation of single-threaded SNN, two parallel implementations were developed and evaluated to leverage modern hardware architectures: one for the CPU using OpenMP and one for the GPU using CUDA (see Table 9). Testing was performed on datasets consisting of high-dimensional facial embeddings, specifically VGG-Face (d = 4096) and FaceNet512 (d = 512), with a database size of N = 105,864 vectors.
The obtained results can be analyzed from three main perspectives.
First, performance on the CPU using the sequential BLAS-based implementation and the parallel OpenMP version. The optimized sequential BLAS implementation provides a solid baseline for comparison. For data with dimensionality d = 4096, it operates 1.43 times faster than ball tree and 4.15 times faster than k-d tree. This confirms the findings of Chen & Güttel regarding the effectiveness of using optimized linear algebra libraries to enhance performance. The OpenMP version further improves the execution speed by parallelizing key computations. For d = 4096, this version achieves a speedup of 2.86× compared to ball tree and 8.28× compared to k-d tree.
Second, GPU acceleration using CUDA demonstrates a significant performance jump. For the d = 4096 dataset, the CUDA implementation runs 20.2× faster than ball tree and 58.3× faster than k-d tree.
Third, the impact of data dimensionality (comparison of d = 512 and d = 4096) reveals important trends. Although the CUDA version is the fastest in both cases, its relative speedup over the CPU versions (OpenMP and sequential BLAS) is more pronounced at higher dimensionality. For d = 512, the CUDA version runs approximately 8.7× faster than the sequential BLAS implementation, whereas for d = 4096 this speedup increases to about 14× (see Table 9 and Tables 6 and 7). At d = 512, the task is less computationally intensive, so fixed overheads such as data transfer between CPU and GPU and kernel launch account for a larger portion of execution time, somewhat reducing the GPU advantage in this case.
To contextualize the obtained results, it is important to compare them with state-of-the-art approximate nearest neighbor search (ANN) and FRNNS methods reported in recent studies. Although these results were obtained on different datasets and under varying experimental conditions, they provide a useful reference point for evaluating the efficiency of the proposed parallel SNN implementations. Table 10 presents a comparative analysis of recent methods.
As shown in Table 10, LeaFi [21] achieves up to a 32× speedup through the use of learned filters, while RadiK [36], leveraging GPU parallelization, delivers up to a 4.8× improvement in batch queries. Methods such as VSAG [23] and DFSANNS [24] also demonstrate substantial acceleration, although direct comparison is limited due to differences in datasets and evaluation protocols. Nevertheless, the results indicate that the parallel SNN approach presented in this study achieves performance competitive with recent ANN techniques, particularly when using GPU acceleration. It should be emphasized that VGG-Face and FaceNet512 in this table serve as models for generating embeddings, rather than as datasets, which should be considered when interpreting these comparisons.

5.2. Analysis of the Total System Response Time in the Full Recognition Cycle

For a comprehensive assessment of the practical efficiency of the proposed approach, it is essential to consider not only the query time but also the total system response time, which begins from the moment the input image is processed. This full cycle includes the computationally intensive preprocessing stage, namely face detection and subsequent generation of its vector representation. Our measurements, conducted on a central processing unit (AMD Ryzen 3 1200), show that this step takes 0.46 s to generate a 4096-dimensional embedding using the VGG-Face model and 0.38 s to generate a 512-dimensional embedding using the FaceNet512 model.
For a proper analysis on a CPU platform, these values are compared with the query time achieved by our OpenMP SNN implementation. In the most demanding scenario (a database of 105,864 vectors with dimensionality d = 4096), the best query time on the CPU was approximately 0.171 s (using four threads). Thus, the total processing time for a single identification on the CPU was 0.46 s + 0.171 s ≈ 0.631 s. In this case, although embeddings’ generation remains the primary bottleneck, the optimized query stage still accounts for a significant portion—about 27%—of the total time. This emphasizes that effective parallelization of the search on the CPU is critically important, as it allows a substantial reduction in the system response time, especially compared to traditional methods (such as k-d tree, with a query time of ~1.46 s), which would take considerably longer than the feature generation stage itself.
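The share of the query stage in the total response time follows directly from these measurements, writing t_embedding and t_query as shorthand for the two measured components:

\[
\frac{t_{\text{query}}}{t_{\text{embedding}} + t_{\text{query}}}
  = \frac{0.171}{0.46 + 0.171}
  = \frac{0.171}{0.631}
  \approx 0.271 \approx 27\%.
\]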

6. Conclusions

In this study, efficient parallel approaches for exact SNN-based nearest neighbor search in high-dimensional facial vector spaces were proposed and implemented. Experiments confirmed high accuracy (a recall of 1.0) and significant speedup, particularly for the GPU implementation, which achieved up to a 58-fold runtime reduction compared to classical methods. The algorithm demonstrated stability with increasing data volume and dimensionality, highlighting its suitability for large-scale real-time face recognition. Future research will focus on scaling the parallel implementations to massive datasets, supporting incremental index updates, exploring hybrid approaches that combine the accuracy of exact search with the speed of approximate methods, and developing distributed implementations using MPI in combination with OpenMP or CUDA. Additionally, adaptive selection of projections and distance metrics, as well as analysis of ethical considerations such as bias and robustness, will broaden applicability and ensure responsible deployment.

Author Contributions

Conceptualization, L.M. and R.K.; methodology, L.M. and R.K.; software, R.K.; validation, L.M.; formal analysis, L.M.; investigation, R.K.; resources, L.M.; data curation, R.K.; writing—original draft preparation, L.M.; writing—review and editing, L.M. and R.K.; visualization, L.M.; supervision, R.K.; project administration, L.M.; funding acquisition, L.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data are available in publicly accessible repositories, as referenced in [37].

Acknowledgments

This work is supported by the Department of Artificial Intelligence Systems at Lviv Polytechnic National University. The authors would like to express their gratitude to the reviewers for their constructive and concise recommendations, which helped improve the presentation of the materials, as well as to the Department of Artificial Intelligence Systems for its support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Santoso, W.; Safitri, R.; Samidi, S. Integration of Artificial Intelligence in Facial Recognition Systems for Software Security. Sinkron 2024, 8, 1208–1214.
  2. Gupta, A. Advancements and Challenges in Face Recognition Technology. Int. J. Comput. Trends Technol. 2024, 72, 92–104.
  3. R, V.C.; Asha, V.; Saju, B.; Suma, N.; Reddy, T.R.M.; Sumanth, M.K. Face Recognition and Identification Using Deep Learning. In Proceedings of the 2023 Third International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), Bhilai, India, 5 January 2023; IEEE: Piscataway, NJ, USA; pp. 1–5.
  4. Deng, N.; Xu, Z.; Li, X.; Gao, C.; Wang, X. Deep Learning and Face Recognition: Face Recognition Approach Based on the DS-CDCN Algorithm. Appl. Sci. 2024, 14, 5739.
  5. Li, L. Face Recognition Model Based on Deep Learning Method. Sci. Technol. Eng. Chem. Environ. Prot. 2025, 3, 1–6.
  6. Serengil, S.; Özpınar, A. A Benchmark of Facial Recognition Pipelines and Co-Usability Performances of Modules. Bilişim Teknol. Derg. 2024, 17, 95–107.
  7. Li, Z.; Li, Z.; Li, X. Facial Recognition Leveraging Generative Adversarial Networks. arXiv 2025, arXiv:2505.11884.
  8. Ding, H.; Wu, J.; Zhao, W.; Matinlinna, J.P.; Burrow, M.F.; Tsoi, J.K.H. Artificial Intelligence in Dentistry—A Review. Front. Dent. Med. 2023, 4, 1085251.
  9. Chen, Z.; Zhang, R.; Zhao, X.; Cheng, X.; Zhou, X. Exploring the Meaningfulness of Nearest Neighbor Search in High-Dimensional Space. In Lecture Notes in Computer Science; Springer Nature Singapore: Singapore, 2025; pp. 181–194. ISBN 978-981-9612-41-3.
  10. Chen, X.; Güttel, S. Fast and Exact Fixed-Radius Neighbor Search Based on Sorting. PeerJ Comput. Sci. 2024, 10, e1929.
  11. Yang, S.; Xie, J.; Liu, Y.; Yu, J.X.; Gao, X.; Wang, Q.; Peng, Y.; Cui, J. Revisiting the Index Construction of Proximity Graph-Based Approximate Nearest Neighbor Search. arXiv 2024, arXiv:2410.01231.
  12. Chen, P.; Chang, W.-C.; Jiang, J.-Y.; Yu, H.-F.; Dhillon, I.; Hsieh, C.-J. FINGER: Fast Inference for Graph-Based Approximate Nearest Neighbor Search. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April 2023; ACM: New York, NY, USA; pp. 3225–3235.
  13. Gupta, D.; Loane, R.; Gayen, S.; Demner-Fushman, D. Medical Image Retrieval via Nearest Neighbor Search on Pre-Trained Image Features. Knowl.-Based Syst. 2023, 278, 110907.
  14. Aghazadeh, A.; Amirmazlaghani, M. A Distributed Approximate Nearest Neighbor Method for Real-Time Face Recognition. arXiv 2020, arXiv:2005.05824.
  15. Li, M.; Wang, Y.-G.; Zhang, P.; Wang, H.; Fan, L.; Li, E.; Wang, W. Deep Learning for Approximate Nearest Neighbour Search: A Survey and Future Directions. IEEE Trans. Knowl. Data Eng. 2023, 35, 8997–9018.
  16. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA; pp. 815–823.
  17. Parkhi, O.M.; Vedaldi, A.; Zisserman, A. Deep Face Recognition. In Proceedings of the British Machine Vision Conference 2015, Swansea, UK, 7–10 September 2015; British Machine Vision Association: Swansea, UK, 2015; pp. 41.1–41.12.
  18. Nielsen, B.M.G.; Hansen, L.K. Hubness Reduction Improves Sentence-BERT Semantic Spaces. arXiv 2023, arXiv:2311.18364.
  19. Xiao, B.; Biros, G. Parallel Algorithms for Nearest Neighbor Search Problems in High Dimensions. SIAM J. Sci. Comput. 2016, 38, S667–S699.
  20. Renga Bashyam, K.G.; Vadhiyar, S. Fast Scalable Approximate Nearest Neighbor Search for High-Dimensional Data. In Proceedings of the 2020 IEEE International Conference on Cluster Computing (CLUSTER), Kobe, Japan, 13 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 294–302.
  21. Wang, Q.; Ileana, I.; Palpanas, T. LeaFi: Data Series Indexes on Steroids with Learned Filters. Proc. ACM Manag. Data 2025, 3, 1–27.
  22. Zhao, Y.; Zhang, J. Research on Knn Algorithm Based on Kmeans Clustering and Collaborative Filtering Hybrid Algorithm in AI Teaching. In Proceedings of the 2023 8th International Conference on Information Systems Engineering (ICISE), Dalian, China, 23 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 453–456.
  23. Zhong, X.; Li, H.; Jin, J.; Yang, M.; Chu, D.; Wang, X.; Shen, Z.; Jia, W.; Gu, G.; Xie, Y.; et al. VSAG: An Optimized Search Framework for Graph-Based Approximate Nearest Neighbor Search. arXiv 2025, arXiv:2503.17911.
  24. Luo, J.; Zhang, M.; Chen, K.; Liao, X.; Shan, Y.; Jiang, J.; Wu, Y. Efficient Graph-Based Approximate Nearest Neighbor Search: Achieving Low Latency Without Throughput Loss. arXiv 2025, arXiv:2504.20461.
  25. Fu, C.; Xiang, C.; Wang, C.; Cai, D. Fast Approximate Nearest Neighbor Search with the Navigating Spreading-out Graph. Proc. VLDB Endow. 2019, 12, 461–474.
  26. Yang, M.; Li, W.; Wang, W. Fast High-Dimensional Approximate Nearest Neighbor Search with Efficient Index Time and Space. arXiv 2024, arXiv:2411.06158.
  27. Wang, Z.; Xiong, H.; Wang, Q.; He, Z.; Wang, P.; Palpanas, T.; Wang, W. Dimensionality-Reduction Techniques for Approximate Nearest Neighbor Search: A Survey and Evaluation. arXiv 2024, arXiv:2403.13491.
  28. Mochurad, L.; Mirchuk, L.; Veretilnyk, A. Parallel Optimization of Dimensionality Reduction Methods for Disease Prediction: PCA and LDA with Dask-ML. CEUR Workshop Proc. 2024, 3777, 150–161.
  29. Mutinda, J.K.; Langat, A.K. Exploring the Role of Dimensionality Reduction in Enhancing Machine Learning Algorithm Performance. Asian J. Res. Comput. Sci. 2024, 17, 157–166.
  30. Zemouri, R.; Levesque, M.; Boucher, E.; Kirouac, M.; Lafleur, F.; Bernier, S.; Merkhouf, A. Recent Research and Applications in Variational Autoencoders for Industrial Prognosis and Health Management: A Survey. In Proceedings of the 2022 Prognostics and Health Management Conference (PHM-2022 London), London, UK, 22 May 2022; IEEE: London, UK, 2022; pp. 193–203.
  31. Khan, S.; Singh, S.; Simhadri, H.V.; Vedurada, J. BANG: Billion-Scale Approximate Nearest Neighbor Search Using a Single GPU. arXiv 2025, arXiv:2401.11324.
  32. El Fadel, N. Facial Recognition Algorithms: A Systematic Literature Review. J. Imaging 2025, 11, 58.
  33. Mochurad, L.; Shchur, G. Parallelization of Cryptographic Algorithm Based on Different Parallel Computing Technologies. CEUR Workshop Proc. 2021, 2824, 20–29.
  34. Dobson, M.; Blelloch, G. Parallel Nearest Neighbors in Low Dimensions with Batch Updates. arXiv 2021, arXiv:2111.04182.
  35. Aparício, G.; Blanquer, I.; Hernández, V. A Parallel Implementation of the K Nearest Neighbours Classifier in Three Levels: Threads, MPI Processes and the Grid. In High Performance Computing for Computational Science—VECPAR 2006; Daydé, M., Palma, J.M.L.M., Coutinho, Á.L.G.A., Pacitti, E., Lopes, J.C., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2007; Volume 4395, pp. 225–235. ISBN 978-3-540-71350-0.
  36. Li, Y.; Zhou, B.; Zhang, J.; Wei, X.; Li, Y.; Chen, Y. RadiK: Scalable and Optimized GPU-Parallel Radix Top-K Selection. In Proceedings of the 38th ACM International Conference on Supercomputing, Kyoto, Japan, 30 May 2024; ACM: New York, NY, USA; pp. 537–548.
  37. Huang, G.; Mattar, M.; Lee, H.; Learned-Miller, E. Learning to Align from Scratch. Adv. Neural Inf. Process. Syst. 2012, 25, 1–9.
  38. Wu, W.; Peng, H.; Yu, S. YuNet: A Tiny Millisecond-Level Face Detector. Mach. Intell. Res. 2023, 20, 656–665.
  39. Singh, A.; Kansari, J.; Kumar, V. Sinha Face Recognition Using Transfer Learning by Deep VGG16 Model. Int. J. Emerg. Technol. Innov. Res. 2022, 9, b121–b127.
  40. Wu, X.; He, R.; Sun, Z.; Tan, T. A Light CNN for Deep Face Representation with Noisy Labels. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2884–2896.
Figure 1. Dependence of execution time, speedup, and efficiency of the SNN implementation on the number of threads for the indexing and query phases at different embedding dimensionalities (OpenMP, N = 105,864).
Table 1. Comparison of the main exact FRNNS methods.
Methods | Year | Main Principle | Complexity (Index/Query) | Memory | Advantages | Limitations
Brute-force search | N/A | Compute all distances | O(1)/O(Nd) | O(Nd) | Exact, simple | Very high cost for large N or d
k-d tree | 1975 | Space partitioning by axis-aligned hyperplanes | O(NlogN)/O(logN) for low d, up to O(N) for high d | O(N) | Fast for low d | Performance drops for high d, data-sensitive
Ball tree | 1989 | Space partitioning into nested hyperspheres | O(NlogN)/O(logN) for low d, up to O(N) for high d | O(N) | Handles non-uniform data, metric-flexible | Curse of dimensionality; complex construction
SNN | 2022–2024 | Sort by first principal component; optimized distances | O(Nd^2 + NlogN)/fraction of candidates | O(N) | Exact, fast, flexible, no hyperparameters (except R) | Dependent on data distribution, requires SVD
Note. Here, N denotes the number of data points, and d the dimensionality of the feature space. The notation O(1) refers to constant complexity, O(Nd) to linear dependence on both dataset size and dimensionality, O(NlogN) to sorting or indexing complexity, and O(logN) to logarithmic dependence on dataset size.
Table 2. Overview of contemporary approaches to FRNNS optimization.
Category | Methods/Techniques | Principle | FRNNS Impact | Challenges
Dimensionality reduction | PCA, LDA, t-SNE, UMAP, Autoencoders | Reduce dimensionality while preserving info | Speed, possible accuracy loss, better scalability, lower memory | Critical info loss, method selection
Computational parallelization | CPU multithreading, GPU, MPI | Distribute workload across units | Faster processing, accuracy preserved, scalable, higher resource usage | Load balancing, communication overhead, synchronization
Innovative indexing/queries | Learned indexes (LeaFi), Auto-tuning (VSAG), Memory optimization, Adaptive distance (FINGER), Async architectures (AverSearch) | ML-based indexing, adaptive to data and queries | Faster, accuracy preserved, scalable, moderate resource use | ML model complexity, portability, adapting ANNS ideas to FRNNS
Table 3. Verification results of the SNN algorithm.
Embedding Type | Dimension (d) | Test Radius R | SNN Versions Tested | Recall
FaceNet512 | 512 | 23.56 | Sequential, OpenMP, CUDA | 1.0
VGG-Face | 4096 | 1.17 | Sequential, OpenMP, CUDA | 1.0
Table 4. Indexing time (in seconds) for d = 4096.
N | Sequential w/o BLAS | Sequential with BLAS | OpenMP | CUDA | k-d Tree | Ball Tree
13,233 | 65.24100 | 13.64686 | 5.34100 | 2.11000 | 29.15400 | 21.90300
26,466 | 364.23690 | 31.45103 | 12.31900 | 4.46800 | 66.82399 | 51.06899
52,932 | – | 56.11196 | 21.99000 | 8.71601 | 155.08673 | 117.40920
105,864 | – | 115.13030 | 45.06339 | 21.06080 | 339.89667 | 272.08081
Table 6. Query time (in seconds) for d = 4096.
N | Sequential w/o BLAS | Sequential with BLAS | OpenMP | GPU | Brute Seq | Brute | k-d Tree | Ball Tree
13,233 | 0.05401 | 0.04982 | 0.02499 | 0.00501 | 0.05700 | 0.06201 | 0.31700 | 0.07199
26,466 | 0.11151 | 0.09172 | 0.04600 | 0.00900 | 0.13300 | 0.13001 | 0.34600 | 0.13100
52,932 | – | 0.18045 | 0.09001 | 0.01899 | 0.33198 | 0.24499 | 0.91201 | 0.23901
105,864 | – | 0.35094 | 0.17623 | 0.02498 | 0.62202 | 0.38940 | 1.45793 | 0.50385
Table 5. Indexing time (in seconds) for d = 512.
N | Sequential w/o BLAS | Sequential with BLAS | OpenMP | CUDA | k-d Tree | Ball Tree
13,233 | 5.59042 | 1.49704 | 0.79000 | 0.37973 | 2.82899 | 2.04199
26,466 | 11.79047 | 2.82677 | 1.48400 | 0.45475 | 6.76700 | 4.90400
52,932 | 22.60518 | 6.55342 | 3.45500 | 0.88650 | 14.96499 | 11.50582
105,864 | 42.09300 | 12.47189 | 6.58500 | 1.70217 | 37.18500 | 29.52651
Table 7. Query time (in seconds) for d = 512.
N | Sequential w/o BLAS | Sequential with BLAS | OpenMP | GPU | Brute Seq | Brute | k-d Tree | Ball Tree
13,233 | 0.00739 | 0.00719 | 0.00400 | 0.00200 | 0.01200 | 0.01098 | 0.03101 | 0.01001
26,466 | 0.02383 | 0.01435 | 0.00800 | 0.00300 | 0.02101 | 0.01901 | 0.05601 | 0.01899
52,932 | 0.02925 | 0.02713 | 0.01500 | 0.00399 | 0.03399 | 0.03000 | 0.10902 | 0.07899
105,864 | 0.05201 | 0.05198 | 0.02900 | 0.00600 | 0.05300 | 0.05101 | 0.47701 | 0.08402
Table 8. Relative query performance of the baseline SNN algorithm (Chen & Güttel) on real datasets.
Dataset | Radius (R) | k-d Tree Time (ms) | Ball Tree Time (ms) | SNN Time (ms) | Speedup of SNN vs. k-d Tree (×) | Speedup of SNN vs. Ball Tree (×)
F-MNIST (d = 784) | 800 | 146.3 | 110.3 | 7.765 | 18.83× | 14.20×
F-MNIST (d = 784) | 1200 | 163.3 | 110.8 | 11.18 | 14.61× | 9.91×
GIST (d = 960) | 0.8 | 3144 | 2160 | 281.5 | 11.17× | 7.67×
GIST (d = 960) | 1 | 3237 | 2183 | 326.8 | 9.91× | 6.68×
Table 9. Relative query performance of parallel SNN variants on high-dimensional facial embeddings (N = 105,864).
Dimensionality (d) | SNN Implementation | Ball Tree Time (s) | k-d Tree Time (s) | SNN Time (s) | Speedup vs. Ball Tree (×) | Speedup vs. k-d Tree (×)
4096 | Sequential (BLAS) | 0.504 | 1.458 | 0.351 | 1.43× | 4.15×
4096 | OpenMP | 0.504 | 1.458 | 0.176 | 2.86× | 8.28×
4096 | CUDA | 0.504 | 1.458 | 0.025 | 20.16× | 58.32×
512 | Sequential (BLAS) | 0.084 | 0.477 | 0.052 | 1.62× | 9.17×
512 | OpenMP | 0.084 | 0.477 | 0.021 | 4.00× | 22.71×
512 | CUDA | 0.084 | 0.477 | 0.006 | 14.00× | 79.50×
Table 10. Comparative analysis of recent FRNNS and ANN methods.
Method | Main Idea/Technology | Dataset(s) | Reported Speedup/Accuracy
SNN (2020) [10] | Sorting-based pruning for nearest neighbors | F-MNIST, GIST | 10–18× vs. k-d tree
LeaFi (2025) [21] | Learned filters for efficient similarity search | Large-scale image datasets | 3–5× vs. FAISS
RadiK (2022) [36] | GPU-parallel radix top-K selection | Text and vision datasets | 2–3× vs. HNSW
VSAG (2023) [23] | Graph-based ANN with optimized vector storage | ImageNet subsets | 2–4× vs. IVF
DFSANNS (2025) [24] | Depth-first adaptive ANN search | Mixed | 5–10× vs. HNSW
This work (SNN OpenMP/CUDA) | Parallelization on CPU (OpenMP) and GPU (CUDA) | VGG-Face, FaceNet512 | 20–58× vs. k-d tree, 14–80× vs. ball tree
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
