# k-NN Query Optimization for High-Dimensional Index Using Machine Learning

## Abstract

## 1. Introduction

- A distributed indexing scheme for high-dimensional data: We propose a Spark-based distributed indexing scheme for processing large, high-dimensional data efficiently.
- A distributed query allocation method: Efficient distributed processing requires an efficient query allocation method. We propose a distributed query allocation method based on query information.
- Three k-NN query optimization techniques: We propose three optimization techniques for efficient k-NN query processing on the high-dimensional distributed index. The techniques are based on density, query processing cost, and deep learning using index information. We verified their validity through performance evaluations.

## 2. Related Work

## 3. Proposed Distributed, in-Memory, Index-Based k-NN Optimization Techniques

#### 3.1. Overall Structure

#### 3.2. Distributed in-Memory Index Structure

#### 3.3. k-NN Query Processing Procedure

#### 3.4. k-NN Query Processing Optimization

Algorithm 1. Optimized k-NN.

Input: Query point (qp), number of neighbors (k), optimization type (optType: DENSITY, COST, or LEARNING)

Output: Query results

Step | Statement |
---|---|
1 | results = {} |
2 | range = calculateQueryRange(qp, k, optType) |
3 | for slave in slaves do |
4 | dk = calculateDifferentK(qp, k, slave, range) |
5 | if dk > 0 then |
6 | results += processRangeQuery(qp, dk, slave, range) |
7 | else |
8 | skip this slave (no k-NN query is performed) |
9 | end if |
10 | end for |
11 | return results |
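The control flow of Algorithm 1 can be sketched in a few lines of Python. This is a minimal single-process illustration, not the Spark implementation: each slave is modelled as a plain list of points, `range_fn` stands in for `calculateQueryRange`, and the bodies of `calculate_different_k` and `process_range_query` are assumptions (the text does not define them). Here `dk` is assumed to be the number of in-range objects on a partition, capped at k.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def calculate_different_k(qp, k, partition, search_range):
    """Assumed behaviour: objects within search_range of qp, capped at k."""
    in_range = sum(1 for p in partition if euclidean(qp, p) <= search_range)
    return min(k, in_range)

def process_range_query(qp, dk, partition, search_range):
    """Return up to dk (distance, point) pairs within range, nearest first."""
    hits = sorted((euclidean(qp, p), p) for p in partition)
    return [h for h in hits if h[0] <= search_range][:dk]

def optimized_knn(qp, k, partitions, range_fn):
    """Follow Algorithm 1: query only partitions that can contribute."""
    results = []
    search_range = range_fn(qp, k)
    for partition in partitions:
        dk = calculate_different_k(qp, k, partition, search_range)
        if dk > 0:
            results += process_range_query(qp, dk, partition, search_range)
        # dk == 0: the partition is skipped and no query is issued
    return results

# Three toy partitions; the second has no object near the query point.
partitions = [
    [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)],
    [(9.0, 9.0), (8.0, 8.0)],
    [(0.0, 1.0)],
]
candidates = optimized_knn((0.0, 0.0), 2, partitions, lambda qp, k: 1.5)
top_k = sorted(candidates)[:2]  # final merge step, added for illustration
```

The final merge into `top_k` is not part of the listing; it is included only to show how per-partition candidates might be reduced to k results.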

#### 3.4.1. Density-Based Optimization

Algorithm 2. Calculation of the Query Range

Input: Query point (qp), number of neighbors (k), optimization type (optType: DENSITY, COST, or LEARNING)

Output: Optimized query range

Step | Statement |
---|---|
1 | search_range = 0 |
2 | if optType is DENSITY then |
3 | dpo = dataNum/maxDistance * dimension |
4 | search_range = k/dpo |
5 | else if optType is COST then |
6 | search_range = costBasedSearchRange(k) |
7 | else |
8 | search_range = DNNBasedSearchRange(qp, k) |
9 | end if |
10 | return search_range |
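The dispatch in Algorithm 2 can be sketched as follows. This is an illustration under stated assumptions: the density expression is read literally, left to right, as written in the listing (the text does not say whether `dimension` belongs in the denominator), and `cost_based` and `dnn_based` are hypothetical callables standing in for `costBasedSearchRange` and `DNNBasedSearchRange`, whose bodies are not given here.

```python
from enum import Enum

class OptType(Enum):
    DENSITY = "density"
    COST = "cost"
    LEARNING = "learning"

def calculate_query_range(qp, k, opt_type, data_num=None, max_distance=None,
                          dimension=None, cost_based=None, dnn_based=None):
    """Return the initial search range for a k-NN query (sketch of Algorithm 2)."""
    if opt_type is OptType.DENSITY:
        # Literal reading of "dataNum/maxDistance * dimension" (an assumption).
        dpo = data_num / max_distance * dimension
        return k / dpo
    if opt_type is OptType.COST:
        return cost_based(k)       # hypothetical cost-model callable
    return dnn_based(qp, k)        # hypothetical learned-model callable

# Toy numbers chosen for easy arithmetic, not taken from the evaluation.
density_range = calculate_query_range(
    qp=(0.0, 0.0), k=10, opt_type=OptType.DENSITY,
    data_num=1000, max_distance=100.0, dimension=2,
)
```

With these toy values, `dpo` is 20 and the returned range is 0.5. Under the alternative grouping `dataNum / (maxDistance * dimension)`, the result would differ, so the reading should be checked against the original publication before reuse.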

#### 3.4.2. Cost-Based Optimization

#### 3.5. DNN-Based Optimization

The input of the model is the vector $[dp_{0}, dp_{1}, \ldots, dp_{d}, k]$, where $dp_{n}$ represents the coordinate value corresponding to the n-th dimension and k represents the number of target objects to be found by the k-NN query. The hidden layers used in this study are structured as shown in Equations (1) and (2). H denotes a hidden layer; each hidden layer is the result of applying the activation function, $\sigma$, to the previous hidden layer's output multiplied by the current layer's weight (W), plus the bias (b). We used ReLU as the activation function. The symbol l represents the number of hidden layers; the optimal number was derived from the performance evaluations in this study. The final hidden layer outputs $\widehat{y}$, the predicted initial search range for the k-NN search.

The model is trained on the search range value, y, taken from the log that records previous k-NN query processing results. The MSE error function used for training is given by Equation (3), where M denotes the total number of queries recorded in the log and $y_{m}$ denotes the search range value of the m-th k-NN query. The error is therefore the accumulated squared difference between the predicted and actual search ranges, divided by the number of queries. The learning model minimizes the MSE.
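Equations (1) to (3) are not reproduced in this text. One form consistent with the definitions given, offered as a reconstruction rather than the authors' exact notation, is

$$H^{(1)} = \sigma\left(W^{(1)} x + b^{(1)}\right), \qquad H^{(i)} = \sigma\left(W^{(i)} H^{(i-1)} + b^{(i)}\right) \ \text{for } i = 2, \ldots, l,$$

$$\mathrm{MSE} = \frac{1}{M} \sum_{m=1}^{M} \left(\widehat{y}_{m} - y_{m}\right)^{2},$$

with $x = [dp_{0}, dp_{1}, \ldots, dp_{d}, k]$ and $\widehat{y}$ read from the last layer. The sketch below evaluates this form in plain Python. The layer sizes and weights are made-up toy values, the function names are hypothetical, and training (which the study performs with TensorFlow) is not shown.

```python
def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]

def dense(weights, bias, inputs):
    """Affine map W h + b; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def predict_range(qp, k, layers):
    """Forward pass: input [dp_0, ..., dp_d, k], ReLU after every layer."""
    h = list(qp) + [float(k)]
    for weights, bias in layers:
        h = relu(dense(weights, bias, h))
    return h[0]  # the last layer is assumed to have a single unit

def mse(predicted, actual):
    """Mean squared error over the M logged queries."""
    return sum((p - y) ** 2 for p, y in zip(predicted, actual)) / len(actual)

# Toy network: 3 inputs -> 2 hidden units -> 1 output (illustrative weights).
layers = [
    ([[0.5, 0.0, 0.5], [-1.0, 0.0, 0.0]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.5]),
]
y_hat = predict_range((1.0, 2.0), 3, layers)
error = mse([y_hat, 1.0], [2.0, 1.0])
```

For this toy network the first layer yields [2.0, 0.0] after ReLU, so `y_hat` is 2.5 and `error` is 0.125.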

## 4. Performance Evaluations

## 5. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

Name | Value |
---|---|
CPU | Intel(R) Core(TM) i5-6400 CPU @ 2.7 GHz × 4 |
Memory | 48 GB |
Partitions | 2 per server |
Platform | Spark 2.3.1, TensorFlow 2.0.0 |
Number of nodes | 4 |

Feature | Value |
---|---|
Data type | Image feature data |
Data size | 1,000,000 (skewed) |
Dimensions of data | 128 |
Data dimension value range | 0 to 255 |

