Parallel Approaches for SNN-Based Nearest Neighbor Search in High-Dimensional Embedding Spaces: Application to Face Recognition
Abstract
1. Introduction
- Introducing a methodological adaptation of the exact SNN algorithm for high-dimensional facial embeddings, effectively addressing challenges such as the curse of dimensionality and hubness while maintaining exact recall.
- Proposing a precomputation strategy in the query phase that improves efficiency without compromising accuracy, offering a scalable alternative to tree-based exact search methods.
- Conducting a comprehensive performance and stability analysis on large-scale datasets, demonstrating the algorithm’s robustness to increasing dimensionality and data volume.
- Providing an end-to-end evaluation of a face recognition pipeline, highlighting integration into real-time systems with high reliability and demonstrating practical scalability limits.
2. State-of-the-Art Review
2.1. Classification and Detailed Overview of Existing FRNNS Methods
2.1.1. Exact Search Method
- K-d trees are among the earliest and most well-known structures of this type. They implement recursive partitioning of the space using hyperplanes parallel to the coordinate axes, changing the splitting axis at each tree level [10]. This approach demonstrates high efficiency in low-dimensional spaces (d < 20), but as dimensionality increases, its performance drops sharply due to the so-called “curse of dimensionality” [19].
- Ball trees organize data in the form of nested hyperspheres. Each node of the tree corresponds to a hypersphere covering a certain subset of points. Compared to k-d trees, ball trees handle non-uniformly distributed data more effectively; however, in high-dimensional spaces, they also become inefficient [10].
- Other tree-based structures include Vantage Point trees (VP-trees), which partition the space relative to a chosen reference point based on a distance threshold [10]; Random Projection trees (RP-trees), which employ random projections [20]; and Cover trees, which provide theoretical guarantees for search time, although their implementation is complex [10]. Regardless of the specific type, the efficiency of tree-based methods significantly decreases as the dimensionality d approaches log N, where N is the number of points in the dataset, and performance approaches that of brute-force search [10].
- SNN is an exact algorithm for FRNNS, developed in 2022–2024 by Chen and Güttel, which demonstrates superiority over traditional methods [10]. It is based on three key principles: (1) candidate exclusion using a sorting criterion, which restricts the search space; (2) the use of precomputed scalar products to reduce arithmetic complexity; and (3) reformulation of the computations as matrix operations, enabling the use of high-performance BLAS libraries. The indexing stage includes data centering, computation of the first principal component via SVD, sorting of the points by their projection onto this component, and storage of the norms of the centered vectors. The query stage involves a binary search on the sorted projections followed by filtering according to the distance criterion. The key advantages of SNN are the following: no hyperparameter tuning (apart from the radius R), guaranteed accuracy, high speed (often outperforming both tree-based algorithms and optimized brute force), and flexibility, as rapid index updates allow the method to be applied to streaming data.
- K-means for k-NN (kMkNN) is a method that applies preliminary clustering using k-means to partition the data, after which the search is carried out only within a limited number of relevant clusters. The use of the triangle inequality allows the number of distance computations to be reduced. Although this method is primarily designed for the k-NN task [22], its principles can be partially transferred to FRNNS; a minimal sketch of this cluster-level pruning follows this list.
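To make the triangle-inequality pruning concrete, here is an illustrative C++ sketch (ours, not the kMkNN implementation): if the query is farther from a cluster centre than that cluster's radius plus R, no member of the cluster can lie within R of the query, so the whole cluster is skipped without computing any per-point distances. The Cluster record and function names are our own assumptions.

```cpp
#include <cmath>
#include <vector>

struct Cluster {
    std::vector<double> centre;  // k-means centroid
    double radius;               // distance from centre to its farthest member
    std::vector<int> members;    // indices of the points assigned to it
};

static double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t j = 0; j < a.size(); ++j) s += (a[j] - b[j]) * (a[j] - b[j]);
    return std::sqrt(s);
}

// By the triangle inequality, dist(q, x) >= dist(q, centre) - radius for every
// member x, so dist(q, centre) > radius + R rules the whole cluster out.
std::vector<const Cluster*> surviving_clusters(const std::vector<Cluster>& clusters,
                                               const std::vector<double>& q, double R) {
    std::vector<const Cluster*> keep;
    for (const Cluster& c : clusters)
        if (euclidean(q, c.centre) <= c.radius + R)
            keep.push_back(&c);
    return keep;
}
```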
2.1.2. Approximate Search Methods
2.2. Modern Approaches to Optimizing FRNNS Algorithms
2.2.1. Dimensionality Reduction Methods
2.2.2. Parallelization Technologies
- First, it is important to ensure the correct distribution of data among threads or computational nodes to avoid redundant computations and minimize inter-process communication overheads.
- Second, in cases of uneven data distribution or varying query complexity, load balancing mechanisms should be implemented to prevent overloading certain resources and inefficient idling of others.
- Additionally, communication costs should be optimized, which is especially relevant in cluster systems or in CPU–GPU interactions, where data transfer often becomes a bottleneck.
- Finally, the need for synchronization—such as coordinating access to shared structures or merging results—can introduce additional overheads and significantly complicate the implementation of an efficient parallel algorithm.
2.2.3. Innovative Indexing and Query Processing Strategies
3. Materials and Methods
3.1. Data Representation
3.2. Sorting with Neighborhood Pruning (SNN)
- Data centering is essential, as the projection is onto the principal component of the centered data, which ensures the validity of the distance bound.
- The guarantee holds specifically for the Euclidean distance.
- Sorting is not required for the distance bound itself to hold, but ordering the data by projection value is the critical step that enables an efficient binary search to identify the candidate set in O(log n) time. A small numeric check of the underlying bound is given after this list.
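The pruning rule rests on a Cauchy–Schwarz argument: for any unit vector v, |⟨x, v⟩ − ⟨q, v⟩| = |⟨x − q, v⟩| ≤ ‖x − q‖, so every point within radius R of q must project into [⟨q, v⟩ − R, ⟨q, v⟩ + R]. The following standalone C++ check (ours, not taken from the reference implementation) verifies the inequality on random data; choosing v1 as the first principal component only tightens the candidate window, it is not needed for correctness.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

int main() {
    const int d = 512, trials = 10000;
    std::mt19937 rng(1);
    std::normal_distribution<double> g;

    // Build an arbitrary unit direction v.
    std::vector<double> v(d);
    double norm = 0.0;
    for (double& x : v) { x = g(rng); norm += x * x; }
    norm = std::sqrt(norm);
    for (double& x : v) x /= norm;

    for (int t = 0; t < trials; ++t) {
        std::vector<double> x(d), q(d);
        for (int j = 0; j < d; ++j) { x[j] = g(rng); q[j] = g(rng); }
        double proj_gap = 0.0, dist_sq = 0.0;
        for (int j = 0; j < d; ++j) {
            proj_gap += (x[j] - q[j]) * v[j];          // <x - q, v>
            dist_sq  += (x[j] - q[j]) * (x[j] - q[j]); // ||x - q||^2
        }
        // |<x, v> - <q, v>| <= ||x - q|| must always hold.
        assert(std::abs(proj_gap) <= std::sqrt(dist_sq) + 1e-9);
    }
    return 0;
}
```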
3.2.1. Parallel SNN Approach (OpenMP-Based): Indexing Phase
- Step 1: Computing the mean vector and data centering
Algorithm 1. Mean Calculation and Centering

    // Calculate mean vector in parallel
    FUNCTION compute_mean(data[n][d]):
        mean_vector[d] = {0}
        PARALLEL FOR dimension j FROM 0 TO d-1:
            sum_j = 0
            FOR point i FROM 0 TO n-1:
                sum_j += data[i][j]
            mean_vector[j] = sum_j / n
        RETURN mean_vector

    // Center data matrix in parallel
    FUNCTION center_data(data[n][d], mean_vector[d]):
        PARALLEL FOR point i FROM 0 TO n-1 AND dimension j FROM 0 TO d-1 (using collapse(2)):
            data[i][j] -= mean_vector[j]
        RETURN data
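A compilable C++/OpenMP rendering of Algorithm 1, under our own layout assumptions (row-major data in a flat std::vector; compile with -fopenmp):

```cpp
#include <cstddef>
#include <vector>

// Column means of an n x d row-major matrix; one thread handles one dimension,
// so no synchronization is needed on mean[j].
std::vector<double> compute_mean(const std::vector<double>& data, int n, int d) {
    std::vector<double> mean(d, 0.0);
    #pragma omp parallel for
    for (int j = 0; j < d; ++j) {
        double sum_j = 0.0;
        for (int i = 0; i < n; ++i)
            sum_j += data[(std::size_t)i * d + j];
        mean[j] = sum_j / n;
    }
    return mean;
}

// Subtract the mean from every row; collapse(2) lets OpenMP parallelize the
// full n*d iteration space rather than only the outer loop.
void center_data(std::vector<double>& data, const std::vector<double>& mean,
                 int n, int d) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < d; ++j)
            data[(std::size_t)i * d + j] -= mean[j];
}
```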
- Step 2: Computing the first principal component v1
Algorithm 2. Power Iteration for First Principal Component

    FUNCTION compute_first_pc(centered_data[n][d]):
        v1[d] = random_vector()
        FOR iter FROM 0 TO max_iterations:
            // BLAS calls can be internally multi-threaded
            temp_vec = BLAS_CALL(matrix_vector_mult, centered_data, v1)
            v1 = BLAS_CALL(matrix_transpose_vector_mult, centered_data, temp_vec)
            v1 = BLAS_CALL(vector_normalize, v1)
        RETURN v1
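A sketch of Algorithm 2 in the same style, with plain OpenMP loops standing in for the two BLAS products (in a real build these would be gemv calls); the iteration count and seed are arbitrary choices of ours:

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Power iteration on X^T X: the dominant eigenvector of the covariance is the
// first principal component of the (already centered) data X.
std::vector<double> compute_first_pc(const std::vector<double>& X, int n, int d,
                                     int max_iterations = 100) {
    std::mt19937 rng(42);  // arbitrary seed
    std::uniform_real_distribution<double> u(-1.0, 1.0);
    std::vector<double> v1(d), t(n);
    for (double& x : v1) x = u(rng);

    for (int iter = 0; iter < max_iterations; ++iter) {
        // t = X * v1 (an n-vector); one BLAS gemv in the real implementation
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int j = 0; j < d; ++j) s += X[(std::size_t)i * d + j] * v1[j];
            t[i] = s;
        }
        // v1 = X^T * t (a d-vector), then normalize to unit length
        #pragma omp parallel for
        for (int j = 0; j < d; ++j) {
            double s = 0.0;
            for (int i = 0; i < n; ++i) s += X[(std::size_t)i * d + j] * t[i];
            v1[j] = s;
        }
        double norm = 0.0;
        for (double x : v1) norm += x * x;
        norm = std::sqrt(norm);
        for (double& x : v1) x /= norm;
    }
    return v1;
}
```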
- Step 3: Data projection and norms calculation
Algorithm 3. Projections and Norms Calculation

    FUNCTION compute_projections_and_norms(centered_data[n][d], v1[d]):
        // Calculate all projections at once using BLAS
        projections[n] = BLAS_CALL(matrix_vector_mult, centered_data, v1)
        // Calculate norms in parallel
        results[n] = empty_array_of_tuples
        PARALLEL FOR point i FROM 0 TO n-1:
            norm_sq_i = BLAS_CALL(dot_product, centered_data[i], centered_data[i])
            // Store projection, original index, and norm together
            results[i] = (projections[i], i, norm_sq_i)
        RETURN results
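Algorithm 3 in the same sketch form; the Entry record bundling projection, original index, and squared norm is our naming, not the paper's:

```cpp
#include <cstddef>
#include <vector>

// One record per point: projection onto v1, original index, squared norm.
struct Entry { double proj; int index; double norm_sq; };

std::vector<Entry> projections_and_norms(const std::vector<double>& X,
                                         const std::vector<double>& v1,
                                         int n, int d) {
    std::vector<Entry> out(n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        const double* row = &X[(std::size_t)i * d];
        double p = 0.0, nsq = 0.0;
        for (int j = 0; j < d; ++j) {
            p   += row[j] * v1[j];   // projection onto the first PC
            nsq += row[j] * row[j];  // squared Euclidean norm of the row
        }
        out[i] = {p, i, nsq};
    }
    return out;
}
```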
- Step 4: Parallel Sorting
Algorithm 4. Parallel Sorting

    FUNCTION sort_by_projection(results[n]):
        // parallel_sort partitions the data and sorts sub-arrays in parallel
        PARALLEL_SORT(results, compare_by_first_element)
        RETURN results
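For Algorithm 4, one portable option is the C++17 parallel STL rather than a custom partition-and-merge sort; whether this matches the authors' PARALLEL_SORT is an assumption on our part:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

struct Entry { double proj; int index; double norm_sq; };

// Sort the (projection, index, norm) records by projection value. The C++17
// parallel STL stands in for PARALLEL_SORT (with libstdc++, link with -ltbb).
void sort_by_projection(std::vector<Entry>& results) {
    std::sort(std::execution::par, results.begin(), results.end(),
              [](const Entry& a, const Entry& b) { return a.proj < b.proj; });
}
```

std::sort with std::execution::par is a drop-in way to obtain a multi-threaded comparison sort; OpenMP-based merge sorts or GNU libstdc++ parallel mode are equally valid realizations of the same step.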
3.2.2. Parallel SNN Approach (OpenMP-Based): Query Phase
- Step 1: Query preparation and dot product computation
Algorithm 5. Preparation and Dot Product Calculation

    FUNCTION prepare_query(query_vec[d], data[n][d], v1[d], mean[d]):
        // Center the query vector
        centered_q[d]
        // This small loop can be parallelized conditionally
        PARALLEL FOR IF (d > 100) dimension j FROM 0 TO d-1:
            centered_q[j] = query_vec[j] - mean[j]
        // Calculate projection and norm of the query vector using BLAS
        q_projection = BLAS_CALL(dot_product, v1, centered_q)
        q_norm_sq = BLAS_CALL(dot_product, centered_q, centered_q)
        // Pre-calculate all dot products between database points and the query vector
        all_dot_products[n] = BLAS_CALL(matrix_vector_mult, data, centered_q)
        RETURN centered_q, q_projection, q_norm_sq, all_dot_products
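Algorithm 5, sketched with explicit loops (the all_dot_products pass is the gemv-style product the pseudocode delegates to BLAS); the QueryPrep bundle is our own convenience type:

```cpp
#include <cstddef>
#include <vector>

// Everything the query phase needs, computed once up front.
struct QueryPrep {
    std::vector<double> centered_q;       // q - mean
    std::vector<double> all_dot_products; // <x_i, q - mean> for every row x_i
    double q_projection = 0.0;            // <v1, q - mean>
    double q_norm_sq = 0.0;               // ||q - mean||^2
};

QueryPrep prepare_query(const std::vector<double>& q,
                        const std::vector<double>& X,  // centered data, n x d
                        const std::vector<double>& v1,
                        const std::vector<double>& mean, int n, int d) {
    QueryPrep r;
    r.centered_q.resize(d);
    // Small O(d) loops stay sequential here; the pseudocode parallelizes them
    // only when d is large enough to amortize the thread start-up cost.
    for (int j = 0; j < d; ++j) r.centered_q[j] = q[j] - mean[j];
    for (int j = 0; j < d; ++j) {
        r.q_projection += v1[j] * r.centered_q[j];
        r.q_norm_sq    += r.centered_q[j] * r.centered_q[j];
    }
    // The O(nd) matrix-vector product (one BLAS gemv call) dominates the cost.
    r.all_dot_products.resize(n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int j = 0; j < d; ++j)
            s += X[(std::size_t)i * d + j] * r.centered_q[j];
        r.all_dot_products[i] = s;
    }
    return r;
}
```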
- Step 2: Candidate search and final filtering
Algorithm 6. Filtering

    FUNCTION filter_query(sorted_data[n], R, q_projection, q_norm_sq, all_dot_products[n]):
        // Find candidate range using binary search (sequential)
        lower_bound = BINARY_SEARCH_LOWER(sorted_data, q_projection - R)
        upper_bound = BINARY_SEARCH_UPPER(sorted_data, q_projection + R)
        // Iterate through the candidate subset in parallel
        results = empty_list
        PARALLEL FOR each candidate IN range(lower_bound, upper_bound):
            original_index = candidate.index
            dot_xy = all_dot_products[original_index]
            dist_sq = candidate.norm_sq + q_norm_sq - 2 * dot_xy
            IF dist_sq <= R*R:
                // Synchronization is needed for concurrent writes
                CRITICAL_SECTION:
                    APPEND original_index TO results
        RETURN results
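Algorithm 6 rendered in C++/OpenMP. One deliberate deviation, flagged as such: instead of a critical section around every append, each thread fills a private buffer that is merged once at the end, which yields the same result set while avoiding per-append serialization:

```cpp
#include <algorithm>
#include <vector>

struct Entry { double proj; int index; double norm_sq; };

// Binary search on the sorted projections, then an exact distance check
// over the candidate window only.
std::vector<int> filter_query(const std::vector<Entry>& sorted_data, double R,
                              double q_projection, double q_norm_sq,
                              const std::vector<double>& all_dot_products) {
    // Candidate window [q_projection - R, q_projection + R] (sequential)
    auto lo = std::lower_bound(sorted_data.begin(), sorted_data.end(),
                               q_projection - R,
                               [](const Entry& e, double v) { return e.proj < v; });
    auto hi = std::upper_bound(sorted_data.begin(), sorted_data.end(),
                               q_projection + R,
                               [](double v, const Entry& e) { return v < e.proj; });
    const long first = lo - sorted_data.begin();
    const long last  = hi - sorted_data.begin();

    std::vector<int> results;
    #pragma omp parallel
    {
        std::vector<int> local;  // private buffer; no lock on the hot path
        #pragma omp for nowait
        for (long k = first; k < last; ++k) {
            const Entry& c = sorted_data[k];
            // ||x - q||^2 = ||x||^2 + ||q||^2 - 2<x, q>, with <x, q> precomputed
            const double dist_sq =
                c.norm_sq + q_norm_sq - 2.0 * all_dot_products[c.index];
            if (dist_sq <= R * R) local.push_back(c.index);
        }
        #pragma omp critical
        results.insert(results.end(), local.begin(), local.end());
    }
    return results;
}
```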
3.2.3. Parallel SNN Approach (CUDA-Based): Indexing Phase
- Step 1: Initialization and data transfer to GPU
Algorithm 7. GPU Memory Initialization and Data Transfer

    FUNCTION initialize_on_gpu(h_data[n][d]):
        // Allocate memory on Device for all data structures
        d_data = CUDA_MALLOC(n*d)
        d_mean = CUDA_MALLOC(d)
        d_first_pc = CUDA_MALLOC(d)
        d_projections = CUDA_MALLOC(n)
        d_norms_sq = CUDA_MALLOC(n)
        d_indices = CUDA_MALLOC(n)
        // Copy main data from Host to Device
        CUDA_MEMCPY(d_data, h_data, HostToDevice)
- Step 2: Mean calculation and data centering on GPU
Algorithm 8. Mean Calculation and Data Centering on GPU

    FUNCTION center_data_on_gpu(d_data[n][d]):
        // Calculate mean vector on GPU using a parallel reduction kernel
        d_mean = LAUNCH KERNEL(parallel_reduction_mean, d_data)
        // Center data matrix by launching a kernel
        LAUNCH KERNEL(subtract_mean_kernel_global, d_data, d_mean, n, d)
- Step 3: Computing the first principal component v1 on GPU
Algorithm 9. Power Iteration for First Principal Component on GPU

    FUNCTION compute_first_pc_on_gpu(d_data[n][d]):
        d_v1 = random_vector_on_device()
        d_temp_vec = CUDA_MALLOC(n)
        FOR iter FROM 0 TO max_iterations:
            // All calls are to the cuBLAS library running on the GPU
            CUBLAS_CALL(gemv, d_data, d_v1, d_temp_vec)             // d_temp_vec = X * v1
            CUBLAS_CALL(gemv, d_data_transposed, d_temp_vec, d_v1)  // v1 = X^T * d_temp_vec
            norm = CUBLAS_CALL(nrm2, d_v1)
            scale = 1.0 / norm
            CUBLAS_CALL(scal, d_v1, scale)
        RETURN d_v1
- Step 4: Projections, norms, and indices calculation on GPU
Algorithm 10. Projections, Norms, and Indices Calculation

    FUNCTION prepare_for_sort(d_data, d_v1):
        // Calculate all projections using one cuBLAS call
        d_projections = CUBLAS_CALL(gemv, d_data, d_v1)
        // Calculate all norms in parallel with a custom kernel
        LAUNCH KERNEL(compute_row_norms_sq_kernel, d_data, d_norms_sq)
        // Create a sequence of indices (0, 1, 2, ...) on the GPU
        THRUST_CALL(sequence, d_indices)
- Step 5: Parallel sorting on GPU
Algorithm 11. Parallel Sorting on GPU

    FUNCTION sort_on_gpu(d_projections, d_indices, d_norms_sq):
        // Sort d_indices and d_norms_sq arrays based on the keys in d_projections
        THRUST_CALL(sort_by_key, keys: d_projections, values: {d_indices, d_norms_sq})
3.2.4. Parallel SNN Approach (CUDA-Based): Query Phase
- Step 1: Query transfer and preparation
Algorithm 12. Prepare Query on GPU

    FUNCTION prepare_query_on_gpu(h_query_vec[d]):
        d_query_vec = CUDA_MALLOC(d)
        CUDA_MEMCPY(d_query_vec, h_query_vec, HostToDevice)
        // Center the query vector on GPU
        LAUNCH KERNEL(center_query_kernel, d_query_vec, d_mean)
        // Calculate projection and norm; results remain in device/host registers
        q_projection = CUBLAS_CALL(dot, d_first_pc, d_query_vec)
        q_norm_sq = CUBLAS_CALL(dot, d_query_vec, d_query_vec)
        RETURN q_projection, q_norm_sq, d_query_vec
- Step 2: Candidate range search
Algorithm 13. Find Candidate Range on GPU

    FUNCTION find_candidates_on_gpu(d_projections, q_projection, R):
        // Perform parallel binary search on the GPU
        range_start_iterator = THRUST_CALL(lower_bound, d_projections, q_projection - R)
        range_end_iterator = THRUST_CALL(upper_bound, d_projections, q_projection + R)
        // Calculate the start index and the number of candidates
        first_candidate_index = range_start_iterator - d_projections
        num_candidates = range_end_iterator - range_start_iterator
        RETURN first_candidate_index, num_candidates
- Step 3: Final filtering with CUDA kernel
- Step 4: Atomic Result Write and Copy to Host
Algorithm 14. Final Filtering and Result Collection

    FUNCTION filter_and_get_results(first_candidate_index, num_candidates, ...):
        // Allocate buffer for results on GPU
        d_results = CUDA_MALLOC(num_candidates)   // Max possible size
        d_result_count = CUDA_MALLOC_AND_ZERO(1)  // Atomic counter
        // Launch the kernel to perform final distance check in parallel
        LAUNCH KERNEL(filter_candidates_kernel,
            first_candidate_index, num_candidates,
            // ... other necessary data pointers ...
            d_results, d_result_count)
        // Copy only the valid results back to the Host
        num_found = CUDA_MEMCPY_FROM_DEVICE(d_result_count)
        h_results = CUDA_MALLOC_HOST(num_found)
        CUDA_MEMCPY(h_results, d_results, num_found, DeviceToHost)
        RETURN h_results

    // KERNEL DEFINITION
    KERNEL filter_candidates_kernel(...):
        thread_id = ...
        IF thread_id >= num_candidates THEN RETURN
        candidate_index = first_candidate_index + thread_id
        // ... read candidate data (norm, original_index) ...
        // ... calculate dot_product with query vector ...
        dist_sq = candidate.norm_sq + query.norm_sq - 2 * dot_product
        IF dist_sq <= R*R:
            // Get a unique position in the output array and write the result
            write_position = ATOMIC_ADD(d_result_count, 1)
            d_results[write_position] = candidate.original_index
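The kernel's atomicAdd-based compaction has a direct portable analogue, sketched here with std::atomic so the pattern can be read without the CUDA toolkit; the function and parameter names are ours, and out is assumed to be pre-sized to the candidate count:

```cpp
#include <atomic>
#include <vector>

// Every iteration that passes the distance test reserves a unique slot via an
// atomic counter, exactly the role atomicAdd(d_result_count, 1) plays on GPU.
void compact_matches(const std::vector<double>& dist_sq,
                     const std::vector<int>& original_index, double R,
                     std::vector<int>& out, std::atomic<int>& count) {
    #pragma omp parallel for
    for (long k = 0; k < (long)dist_sq.size(); ++k) {
        if (dist_sq[k] <= R * R) {
            const int pos = count.fetch_add(1);  // unique write position
            out[pos] = original_index[k];
        }
    }
}
```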
3.3. Computational Complexity of the Proposed Parallel Approaches
- n—the total number of vector representations (data points) in the database P.
- d—the dimensionality of the feature space, i.e., the length of each vector.
- P—the number of parallel computational units (e.g., CPU cores) involved in the computations.
- the number of iterations performed in the power iteration method to find the first principal component.
- the number of candidate vectors selected at the filtering stage after searching in the sorted array of projections.
3.3.1. Analysis of the Indexing Phase
- Step 1: Computing the mean vector and data centering
- Step 2: Computing the first principal component v1
- Step 3: Data projection and norm computation
- Step 4: Parallel sorting
3.3.2. Analysis of the Query Phase
- Step 1: Query preparation and precomputation of scalar products
- Step 2: Candidate range search
- Step 3: Final filtering
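Reading the costs directly off Algorithms 1–6, with the notation above and writing t for the power-iteration count and c for the candidate count (our placeholder symbols for the two unnamed quantities in the notation list), the per-step bounds on P parallel units are, as a sketch:

```latex
% Indexing phase (per step, ideal parallelization over P units)
T_{\text{mean+center}} = O\!\left(\tfrac{nd}{P}\right), \quad
T_{\text{power iter.}} = O\!\left(\tfrac{t\,nd}{P}\right), \quad
T_{\text{proj.+norms}} = O\!\left(\tfrac{nd}{P}\right), \quad
T_{\text{sort}} = O\!\left(\tfrac{n \log n}{P}\right)

% Query phase: the dot-product precomputation dominates;
% the binary search and the filtering of c candidates are cheap
T_{\text{prep.}} = O\!\left(\tfrac{nd}{P}\right), \quad
T_{\text{search}} = O(\log n), \quad
T_{\text{filter}} = O\!\left(\tfrac{c}{P}\right)
```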
3.3.3. Theoretical Speedup and Efficiency
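For reference, the standard definitions behind these two quantities, with T_1 the best sequential time and T_P the time on P units:

```latex
S(P) = \frac{T_1}{T_P}, \qquad E(P) = \frac{S(P)}{P}
```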
4. Results
4.1. Experimental Environment and Settings
4.1.1. LFW Dataset and Preparation
4.1.2. Embedding Generation
4.1.3. Augmentation for Scalability Evaluation
- n_original = 13,233;
- n_aug1 = 26,466;
- n_aug2 = 52,932;
- n_aug3 = 105,864.
4.1.4. Hardware and Software
4.2. Verification of SNN Algorithm’s Correctness
4.2.1. Verification Method
4.2.2. Recall Results
4.3. Performance Analysis and Comparison with Baseline Algorithms
4.3.1. SNN Scalability Analysis
4.3.2. Comparison with Other Methods (Indexing)
4.3.3. Comparison with Other Methods (Query)
5. Discussion
5.1. Comparative Performance Analysis
5.2. Analysis of the Total System Response Time in the Full Recognition Cycle
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Santoso, W.; Safitri, R.; Samidi, S. Integration of Artificial Intelligence in Facial Recognition Systems for Software Security. Sinkron 2024, 8, 1208–1214.
- Gupta, A. Advancements and Challenges in Face Recognition Technology. Int. J. Comput. Trends Technol. 2024, 72, 92–104.
- R, V.C.; Asha, V.; Saju, B.; Suma, N.; Reddy, T.R.M.; Sumanth, M.K. Face Recognition and Identification Using Deep Learning. In Proceedings of the 2023 Third International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), Bhilai, India, 5 January 2023; IEEE: Piscataway, NJ, USA; pp. 1–5.
- Deng, N.; Xu, Z.; Li, X.; Gao, C.; Wang, X. Deep Learning and Face Recognition: Face Recognition Approach Based on the DS-CDCN Algorithm. Appl. Sci. 2024, 14, 5739.
- Li, L. Face Recognition Model Based on Deep Learning Method. Sci. Technol. Eng. Chem. Environ. Prot. 2025, 3, 1–6.
- Serengil, S.; Özpınar, A. A Benchmark of Facial Recognition Pipelines and Co-Usability Performances of Modules. Bilişim Teknol. Derg. 2024, 17, 95–107.
- Li, Z.; Li, Z.; Li, X. Facial Recognition Leveraging Generative Adversarial Networks. arXiv 2025, arXiv:2505.11884.
- Ding, H.; Wu, J.; Zhao, W.; Matinlinna, J.P.; Burrow, M.F.; Tsoi, J.K.H. Artificial Intelligence in Dentistry—A Review. Front. Dent. Med. 2023, 4, 1085251.
- Chen, Z.; Zhang, R.; Zhao, X.; Cheng, X.; Zhou, X. Exploring the Meaningfulness of Nearest Neighbor Search in High-Dimensional Space. In Lecture Notes in Computer Science; Springer Nature Singapore: Singapore, 2025; pp. 181–194. ISBN 978-981-9612-41-3.
- Chen, X.; Güttel, S. Fast and Exact Fixed-Radius Neighbor Search Based on Sorting. PeerJ Comput. Sci. 2024, 10, e1929.
- Yang, S.; Xie, J.; Liu, Y.; Yu, J.X.; Gao, X.; Wang, Q.; Peng, Y.; Cui, J. Revisiting the Index Construction of Proximity Graph-Based Approximate Nearest Neighbor Search. arXiv 2024, arXiv:2410.01231.
- Chen, P.; Chang, W.-C.; Jiang, J.-Y.; Yu, H.-F.; Dhillon, I.; Hsieh, C.-J. FINGER: Fast Inference for Graph-Based Approximate Nearest Neighbor Search. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April 2023; ACM: New York, NY, USA; pp. 3225–3235.
- Gupta, D.; Loane, R.; Gayen, S.; Demner-Fushman, D. Medical Image Retrieval via Nearest Neighbor Search on Pre-Trained Image Features. Knowl.-Based Syst. 2023, 278, 110907.
- Aghazadeh, A.; Amirmazlaghani, M. A Distributed Approximate Nearest Neighbor Method for Real-Time Face Recognition. arXiv 2020, arXiv:2005.05824.
- Li, M.; Wang, Y.-G.; Zhang, P.; Wang, H.; Fan, L.; Li, E.; Wang, W. Deep Learning for Approximate Nearest Neighbour Search: A Survey and Future Directions. IEEE Trans. Knowl. Data Eng. 2023, 35, 8997–9018.
- Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA; pp. 815–823.
- Parkhi, O.M.; Vedaldi, A.; Zisserman, A. Deep Face Recognition. In Proceedings of the British Machine Vision Conference 2015, Swansea, UK, 7–10 September 2015; British Machine Vision Association: Swansea, UK, 2015; pp. 41.1–41.12.
- Nielsen, B.M.G.; Hansen, L.K. Hubness Reduction Improves Sentence-BERT Semantic Spaces. arXiv 2023, arXiv:2311.18364.
- Xiao, B.; Biros, G. Parallel Algorithms for Nearest Neighbor Search Problems in High Dimensions. SIAM J. Sci. Comput. 2016, 38, S667–S699.
- Renga Bashyam, K.G.; Vadhiyar, S. Fast Scalable Approximate Nearest Neighbor Search for High-Dimensional Data. In Proceedings of the 2020 IEEE International Conference on Cluster Computing (CLUSTER), Kobe, Japan, 13 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 294–302.
- Wang, Q.; Ileana, I.; Palpanas, T. LeaFi: Data Series Indexes on Steroids with Learned Filters. Proc. ACM Manag. Data 2025, 3, 1–27.
- Zhao, Y.; Zhang, J. Research on Knn Algorithm Based on Kmeans Clustering and Collaborative Filtering Hybrid Algorithm in AI Teaching. In Proceedings of the 2023 8th International Conference on Information Systems Engineering (ICISE), Dalian, China, 23 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 453–456.
- Zhong, X.; Li, H.; Jin, J.; Yang, M.; Chu, D.; Wang, X.; Shen, Z.; Jia, W.; Gu, G.; Xie, Y.; et al. VSAG: An Optimized Search Framework for Graph-Based Approximate Nearest Neighbor Search. arXiv 2025, arXiv:2503.17911.
- Luo, J.; Zhang, M.; Chen, K.; Liao, X.; Shan, Y.; Jiang, J.; Wu, Y. Efficient Graph-Based Approximate Nearest Neighbor Search Achieving Low Latency Without Throughput Loss. arXiv 2025, arXiv:2504.20461.
- Fu, C.; Xiang, C.; Wang, C.; Cai, D. Fast Approximate Nearest Neighbor Search with the Navigating Spreading-out Graph. Proc. VLDB Endow. 2019, 12, 461–474.
- Yang, M.; Li, W.; Wang, W. Fast High-Dimensional Approximate Nearest Neighbor Search with Efficient Index Time and Space. arXiv 2024, arXiv:2411.06158.
- Wang, Z.; Xiong, H.; Wang, Q.; He, Z.; Wang, P.; Palpanas, T.; Wang, W. Dimensionality-Reduction Techniques for Approximate Nearest Neighbor Search: A Survey and Evaluation. arXiv 2024, arXiv:2403.13491.
- Mochurad, L.; Mirchuk, L.; Veretilnyk, A. Parallel Optimization of Dimensionality Reduction Methods for Disease Prediction: PCA and LDA with Dask-ML. CEUR Workshop Proc. 2024, 3777, 150–161.
- Mutinda, J.K.; Langat, A.K. Exploring the Role of Dimensionality Reduction in Enhancing Machine Learning Algorithm Performance. Asian J. Res. Comput. Sci. 2024, 17, 157–166.
- Zemouri, R.; Levesque, M.; Boucher, E.; Kirouac, M.; Lafleur, F.; Bernier, S.; Merkhouf, A. Recent Research and Applications in Variational Autoencoders for Industrial Prognosis and Health Management: A Survey. In Proceedings of the 2022 Prognostics and Health Management Conference (PHM-2022 London), London, UK, 22 May 2022; IEEE: London, UK, 2022; pp. 193–203.
- Khan, S.; Singh, S.; Simhadri, H.V.; Vedurada, J. BANG: Billion-Scale Approximate Nearest Neighbor Search Using a Single GPU. arXiv 2025, arXiv:2401.11324.
- El Fadel, N. Facial Recognition Algorithms: A Systematic Literature Review. J. Imaging 2025, 11, 58.
- Mochurad, L.; Shchur, G. Parallelization of Cryptographic Algorithm Based on Different Parallel Computing Technologies. CEUR Workshop Proc. 2021, 2824, 20–29.
- Dobson, M.; Blelloch, G. Parallel Nearest Neighbors in Low Dimensions with Batch Updates. arXiv 2021, arXiv:2111.04182.
- Aparício, G.; Blanquer, I.; Hernández, V. A Parallel Implementation of the K Nearest Neighbours Classifier in Three Levels: Threads, MPI Processes and the Grid. In High Performance Computing for Computational Science—VECPAR 2006; Daydé, M., Palma, J.M.L.M., Coutinho, Á.L.G.A., Pacitti, E., Lopes, J.C., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2007; Volume 4395, pp. 225–235. ISBN 978-3-540-71350-0.
- Li, Y.; Zhou, B.; Zhang, J.; Wei, X.; Li, Y.; Chen, Y. RadiK: Scalable and Optimized GPU-Parallel Radix Top-K Selection. In Proceedings of the 38th ACM International Conference on Supercomputing, Kyoto, Japan, 30 May 2024; ACM: New York, NY, USA; pp. 537–548.
- Huang, G.; Mattar, M.; Lee, H.; Learned-Miller, E. Learning to Align from Scratch. Adv. Neural Inf. Process. Syst. 2012, 25, 1–9.
- Wu, W.; Peng, H.; Yu, S. YuNet: A Tiny Millisecond-Level Face Detector. Mach. Intell. Res. 2023, 20, 656–665.
- Singh, A.; Kansari, J.; Kumar Sinha, V. Face Recognition Using Transfer Learning by Deep VGG16 Model. Int. J. Emerg. Technol. Innov. Res. 2022, 9, b121–b127.
- Wu, X.; He, R.; Sun, Z.; Tan, T. A Light CNN for Deep Face Representation with Noisy Labels. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2884–2896.
Methods | Year | Main Principle | Complexity (Index/Query) | Memory | Advantages | Limitations
---|---|---|---|---|---|---
Brute-force search | N/A | Compute all distances | O(1)/O(Nd) | O(Nd) | Exact, simple | Very high cost for large N or d
k-d tree | 1975 | Space partitioning by axis-aligned hyperplanes | O(N log N)/O(log N) for low d, up to O(N) for high d | O(N) | Fast for low d | Performance drops for high d; data-sensitive
Ball tree | 1989 | Space partitioning into nested hyperspheres | O(N log N)/O(log N) for low d, up to O(N) for high d | O(N) | Handles non-uniform data; metric-flexible | Curse of dimensionality; complex construction
SNN | 2022–2024 | Sort by first principal component; optimized distance computation | O(Nd² + N log N)/proportional to the fraction of candidates | O(N) | Exact, fast, flexible; no hyperparameters (except R) | Dependent on data distribution; requires SVD
Category | Methods/Techniques | Principle | FRNNS Impact | Challenges |
---|---|---|---|---|
Dimensionality reduction | PCA, LDA, t-SNE, UMAP, Autoencoders | Reduce dimensionality while preserving info | Speed, possible accuracy loss, better scalability, lower memory | Critical info loss, method selection |
Computational parallelization | CPU multithreading, GPU, MPI | Distribute workload across units | Faster processing, accuracy preserved, scalable, higher resource usage | Load balancing, communication overhead, synchronization |
Innovative indexing/queries | Learned indexes (LeaFi), Auto-tuning (VSAG), Memory optimization, Adaptive distance (FINGER), Async architectures (AverSearch) | ML-based indexing, adaptive to data and queries | Faster, accuracy preserved, scalable, moderate resource use | ML model complexity, portability, adapting ANNS ideas to FRNNS |
Embedding Type | Dimension (d) | Test Radius R | SNN Versions Tested | Recall |
---|---|---|---|---|
FaceNet512 | 512 | 23.56 | Sequential, OpenMP, CUDA | 1.0 |
VGG-Face | 4096 | 1.17 | Sequential, OpenMP, CUDA | 1.0 |
N | Sequential w/o BLAS | Sequential with BLAS | OpenMP | CUDA | k-d Tree | Ball Tree |
---|---|---|---|---|---|---|
13,233 | 65.24100 | 13.64686 | 5.34100 | 2.11000 | 29.15400 | 21.90300 |
26,466 | 364.23690 | 31.45103 | 12.31900 | 4.46800 | 66.82399 | 51.06899 |
52,932 | - | 56.11196 | 21.99000 | 8.71601 | 155.08673 | 117.40920 |
105,864 | - | 115.13030 | 45.06339 | 21.06080 | 339.89667 | 272.08081 |
N | Sequential w/o BLAS | Sequential with BLAS | OpenMP | GPU | Brute Seq | Brute | k-d Tree | Ball Tree |
---|---|---|---|---|---|---|---|---|
13,233 | 0.05401 | 0.04982 | 0.02499 | 0.00501 | 0.05700 | 0.06201 | 0.31700 | 0.07199 |
26,466 | 0.11151 | 0.09172 | 0.04600 | 0.00900 | 0.13300 | 0.13001 | 0.34600 | 0.13100 |
52,932 | - | 0.18045 | 0.09001 | 0.01899 | 0.33198 | 0.24499 | 0.91201 | 0.23901 |
105,864 | - | 0.35094 | 0.17623 | 0.02498 | 0.62202 | 0.38940 | 1.45793 | 0.50385 |
N | Sequential w/o BLAS | Sequential with BLAS | OpenMP | CUDA | k-d Tree | Ball Tree |
---|---|---|---|---|---|---|
13,233 | 5.59042 | 1.49704 | 0.79000 | 0.37973 | 2.82899 | 2.04199 |
26,466 | 11.79047 | 2.82677 | 1.48400 | 0.45475 | 6.76700 | 4.90400 |
52,932 | 22.60518 | 6.55342 | 3.45500 | 0.88650 | 14.96499 | 11.50582 |
105,864 | 42.09300 | 12.47189 | 6.58500 | 1.70217 | 37.18500 | 29.52651 |
N | Sequential w/o BLAS | Sequential with BLAS | OpenMP | GPU | Brute Seq | Brute | k-d Tree | Ball Tree |
---|---|---|---|---|---|---|---|---|
13,233 | 0.00739 | 0.00719 | 0.00400 | 0.00200 | 0.01200 | 0.01098 | 0.03101 | 0.01001 |
26,466 | 0.02383 | 0.01435 | 0.00800 | 0.00300 | 0.02101 | 0.01901 | 0.05601 | 0.01899 |
52,932 | 0.02925 | 0.02713 | 0.01500 | 0.00399 | 0.03399 | 0.03000 | 0.10902 | 0.07899 |
105,864 | 0.05201 | 0.05198 | 0.02900 | 0.00600 | 0.05300 | 0.05101 | 0.47701 | 0.08402 |
Dataset | Radius (R) | k-d Tree Time (ms) | Ball Tree Time (ms) | SNN Time (ms) | Speedup of SNN vs. k-d Tree (×) | Speedup of SNN vs. Ball Tree (×) |
---|---|---|---|---|---|---|
F-MNIST (d = 784) | 800 | 146.3 | 110.3 | 7.765 | 18.83× | 14.20× |
F-MNIST (d = 784) | 1200 | 163.3 | 110.8 | 11.18 | 14.61× | 9.91× |
GIST (d = 960) | 0.8 | 3144 | 2160 | 281.5 | 11.17× | 7.67× |
GIST (d = 960) | 1 | 3237 | 2183 | 326.8 | 9.91× | 6.68× |
Dimensionality (d) | Implementation SNN | Ball Tree Time (s) | k-d Tree Time (s) | SNN Time (s) | Speedup vs. Ball Tree (×) | Speedup vs. k-d Tree (×) |
---|---|---|---|---|---|---|
4096 | Sequential (BLAS) | 0.504 | 1.458 | 0.351 | 1.43× | 4.15× |
4096 | OpenMP | 0.504 | 1.458 | 0.176 | 2.86× | 8.28× |
4096 | CUDA | 0.504 | 1.458 | 0.025 | 20.16× | 58.32× |
512 | Sequential (BLAS) | 0.084 | 0.477 | 0.052 | 1.62× | 9.17× |
512 | OpenMP | 0.084 | 0.477 | 0.021 | 4.00× | 22.71× |
512 | CUDA | 0.084 | 0.477 | 0.006 | 14.00× | 79.50× |
Method | Main Idea/Technology | Dataset(s) | Reported Speedup/Accuracy |
---|---|---|---|
SNN (2022–2024) [10] | Sorting-based pruning for nearest neighbors | F-MNIST, GIST | 10–18× vs. k-d tree
LeaFi (2025) [21] | Learned filters for efficient similarity search | Large-scale image datasets | 3–5× vs. FAISS |
RadiK (2022) [36] | GPU-parallel radix top-K selection | Text and vision datasets | 2–3× vs. HNSW |
VSAG (2023) [23] | Graph-based ANN with optimized vector storage | ImageNet subsets | 2–4× vs. IVF |
DFSANNS (2025) [24] | Depth-first adaptive ANN search | Mixed | 5–10× vs. HNSW |
This work (SNN OpenMP/CUDA) | Parallelization on CPU (OpenMP) and GPU (CUDA) | VGG-Face, FaceNet512 | 20–58× vs. k-d tree, 14–80× vs. ball tree |