# k-NN Query Optimization for High-Dimensional Index Using Machine Learning


## Abstract


## 1. Introduction

- A distributed indexing scheme for high-dimensional data: We propose a Spark-based distributed indexing scheme that processes large volumes of high-dimensional data efficiently.
- A distributed query allocation method: Efficient distributed processing requires an efficient query allocation method. We propose a distributed query allocation method based on query information.
- Three k-NN query optimization techniques: We propose three optimization techniques for efficient k-NN query processing on the high-dimensional distributed index. They are based on density, query processing cost, and deep learning using index information. We verified the validity of the proposed techniques through performance evaluations.

## 2. Related Work

## 3. Proposed Distributed, in-Memory, Index-Based k-NN Optimization Techniques

#### 3.1. Overall Structure

#### 3.2. Distributed in-Memory Index Structure

#### 3.3. k-NN Query Processing Procedure

#### 3.4. k-NN Query Processing Optimization

Algorithm 1. Optimized k-NN.

Step | Operation |
---|---|
Input | Query point (qp), number of neighbors (k), optimization type (optType: DENSITY, COST, LEARNING) |
Output | Query results |
1 | results = {} |
2 | range = calculateQueryRange(qp, k, optType) |
3 | for slave in slaves do |
4 | dk = calculateDifferentK(qp, k, slave, range) |
5 | if dk > 0 then |
6 | results += processRangeQuery(qp, dk, slave, range) |
7 | else |
8 | do not perform the k-NN query on this slave |
9 | end for |
10 | return results |
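The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal, single-process illustration and not the Spark implementation: each slave is modeled as an in-memory list of points, `calculate_query_range` is a fixed-radius placeholder, and the behavior of `calculate_different_k` (capping k by the number of local points inside the range) is an assumption made only so the example runs.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def calculate_query_range(qp: Point, k: int, opt_type: str) -> float:
    """Placeholder for Algorithm 2: a fixed radius, for illustration only."""
    return 2.0


def calculate_different_k(qp: Point, k: int, partition: List[Point],
                          search_range: float) -> int:
    """Assumed behavior: cap k by the number of local points inside the range."""
    in_range = sum(1 for p in partition if distance(qp, p) <= search_range)
    return min(k, in_range)


def process_range_query(qp: Point, dk: int, partition: List[Point],
                        search_range: float) -> List[Tuple[float, Point]]:
    """Return the dk closest local points inside the range as (distance, point)."""
    hits = sorted((distance(qp, p), p) for p in partition
                  if distance(qp, p) <= search_range)
    return hits[:dk]


def optimized_knn(qp: Point, k: int, opt_type: str,
                  slaves: Dict[str, List[Point]]) -> List[Tuple[float, Point]]:
    results: List[Tuple[float, Point]] = []
    search_range = calculate_query_range(qp, k, opt_type)
    for partition in slaves.values():
        dk = calculate_different_k(qp, k, partition, search_range)
        if dk > 0:
            results += process_range_query(qp, dk, partition, search_range)
        # dk == 0: no candidates on this slave, so no query is issued there
    return results


# Hypothetical data: three slaves holding 2-dimensional points.
slaves = {
    "slave-1": [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0)],
    "slave-2": [(8.0, 8.0), (7.0, 9.0)],
    "slave-3": [(0.5, 0.0), (3.0, 3.0)],
}
candidates = optimized_knn((0.0, 0.0), 2, "DENSITY", slaves)
top_k = [p for _, p in sorted(candidates)[:2]]
```

In this toy run `slave-2` has no point inside the range and is skipped entirely, which is the saving that lines 5 to 8 of Algorithm 1 aim for. The final sort that reduces the merged candidates to `top_k` is an added step for the example; Algorithm 1 itself only returns the merged results.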

#### 3.4.1. Density-Based Optimization

Algorithm 2. Calculation of the Query Range.

Step | Operation |
---|---|
Input | Query point (qp), number of neighbors (k), optimization type (optType: DENSITY, COST, LEARNING) |
Output | Optimized query range |
1 | search_range = 0 |
2 | if optType is DENSITY then |
3 | dpo = dataNum/maxDistance * dimension |
4 | search_range = k/dpo |
5 | else if optType is COST then |
6 | search_range = costBasedSearchRange(k) |
7 | else |
8 | search_range = DNNBasedSearchRange(qp, k) |
9 | return search_range |
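The dispatch in Algorithm 2 can be sketched as below. The DENSITY branch follows lines 3 and 4 literally, reading `dataNum/maxDistance * dimension` with the usual left-to-right operator precedence; the source does not parenthesize the expression, so that grouping is an assumption. The cost-based and DNN-based estimators are passed in as callables, and the stand-ins used in the example are hypothetical.

```python
from typing import Callable, Sequence


def density_based_range(k: int, data_num: int, max_distance: float,
                        dimension: int) -> float:
    """DENSITY branch of Algorithm 2 (lines 3-4), read literally."""
    # Assumption: evaluated as (data_num / max_distance) * dimension.
    dpo = data_num / max_distance * dimension
    return k / dpo


def calculate_query_range(
    qp: Sequence[float],
    k: int,
    opt_type: str,
    data_num: int,
    max_distance: float,
    dimension: int,
    cost_based: Callable[[int], float],
    dnn_based: Callable[[Sequence[float], int], float],
) -> float:
    """Pick the initial search range according to the optimization type."""
    if opt_type == "DENSITY":
        return density_based_range(k, data_num, max_distance, dimension)
    if opt_type == "COST":
        return cost_based(k)
    # Any other value is treated as LEARNING, as in the else branch.
    return dnn_based(qp, k)


# Hypothetical index statistics and stand-in estimators, for illustration only.
settings = dict(
    data_num=1000,
    max_distance=500.0,
    dimension=4,
    cost_based=lambda k: k / 4,       # stand-in for costBasedSearchRange
    dnn_based=lambda qp, k: 1.0,      # stand-in for DNNBasedSearchRange
)
qp = (0.0, 0.0, 0.0, 0.0)
density_range = calculate_query_range(qp, 10, "DENSITY", **settings)
cost_range = calculate_query_range(qp, 10, "COST", **settings)
learned_range = calculate_query_range(qp, 10, "LEARNING", **settings)
```

With these made-up statistics, `dpo` is 1000 / 500 × 4 = 8, so the density-based range for k = 10 is 10 / 8 = 1.25.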

#### 3.4.2. Cost-Based Optimization

#### 3.5. DNN-Based Optimization

The input to the model is a vector of the form $[dp_0, dp_1, \ldots, dp_d, k]$, where $dp_n$ is the coordinate value of the query point in the n-th dimension and $k$ is the number of target objects to be found by the k-NN query. The hidden layers used in this study are structured as shown in Equations (1) and (2). $H$ denotes a hidden layer; each hidden layer is the result of applying the activation function, $\sigma$, to the weighted sum of the previous hidden layer's output, computed with the current hidden layer's weight ($W$) and bias ($b$). In this study, we used ReLU as the activation function. The symbol $l$ represents the number of hidden layers; the optimal number was derived from the performance evaluations in this study. The final hidden layer outputs the result value, $\widehat{y}$, which is the predicted initial search range for the k-NN search. The model is trained using the search range value, $y$, from the log that records the results of previously processed k-NN queries. The MSE error function used to train the model is expressed in Equation (3), where $M$ denotes the total number of queries recorded in the log and $y_m$ denotes the search range value of the m-th k-NN query. The error value is therefore the accumulated squared difference between the predicted and actual search ranges, divided by the number of queries. The learning model minimizes the MSE.
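The forward pass and loss described above can be sketched in plain Python. The layer form $H_l = \sigma(W_l H_{l-1} + b_l)$ and the loss $\frac{1}{M}\sum_{m=1}^{M}(y_m - \widehat{y}_m)^2$ are the standard forms that match the description, assumed here because the equations themselves are not reproduced in this text. The weights, biases, and logged ranges are made-up values, and no training step is shown.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Sequence[float]) -> Vector:
    """ReLU activation applied element-wise."""
    return [max(0.0, x) for x in v]


def affine(W: Matrix, x: Sequence[float], b: Sequence[float]) -> Vector:
    """Compute W x + b for one layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def predict_range(features: Sequence[float], weights: List[Matrix],
                  biases: List[Vector]) -> float:
    """Forward pass: H_l = relu(W_l H_{l-1} + b_l); the last layer yields y_hat."""
    h = list(features)
    for W, b in zip(weights, biases):
        h = relu(affine(W, h, b))
    return h[0]


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error over the M logged queries."""
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)


# Hypothetical input [dp_0, dp_1, k] for a 2-dimensional query point and k = 3.
features = [1.0, 2.0, 3.0]
weights = [
    [[1.0, 0.0, 1.0], [0.0, -1.0, 0.0]],  # first layer: 3 inputs -> 2 units
    [[0.5, 1.0]],                         # final layer: 2 units -> 1 output
]
biases = [[0.0, 0.0], [1.0]]

y_hat = predict_range(features, weights, biases)

# Hypothetical logged ranges y_m compared with predictions for two queries.
loss = mse([2.0, 4.0], [y_hat, y_hat])
```

Applying ReLU on the final layer follows the statement that the final hidden layer outputs $\widehat{y}$; it also keeps the predicted range non-negative. In practice the weights would be fitted by minimizing `mse` over the query log with a gradient-based optimizer.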

## 4. Performance Evaluations

## 5. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest


Name | Value |
---|---|
CPU | Intel(R) Core(TM) i5-6400 CPU @ 2.7 GHz × 4 |
Memory | 48 GB |
Partitions | 2 per server |
Platform | Spark 2.3.1, TensorFlow 2.0.0 |
Number of nodes | 4 |

Feature | Value |
---|---|
Data type | Image feature data |
Data size | 1,000,000 (skewed) |
Dimensions of data | 128 |
Data dimension value range | 0 to 255 |


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Choi, D.; Wee, J.; Song, S.; Lee, H.; Lim, J.; Bok, K.; Yoo, J.
k-NN Query Optimization for High-Dimensional Index Using Machine Learning. *Electronics* **2023**, *12*, 2375.
https://doi.org/10.3390/electronics12112375
