k-Nearest Neighbor Learning with Graph Neural Networks

Abstract: k-nearest neighbor (kNN) is a widely used algorithm for supervised learning tasks. In practice, the main challenge when using kNN is its high sensitivity to its hyperparameter setting, including the number of nearest neighbors k, the distance function, and the weighting function. To improve robustness to hyperparameters, this study presents a novel kNN learning method based on a graph neural network, named kNNGNN. Given training data, the method learns a task-specific kNN rule in an end-to-end fashion by means of a graph neural network that takes the kNN graph of an instance to predict the label of the instance. The distance and weighting functions are implicitly embedded within the graph neural network. For a query instance, the prediction is obtained by performing a kNN search from the training data to create a kNN graph and passing it through the graph neural network. The effectiveness of the proposed method is demonstrated using various benchmark datasets for classification and regression tasks.


Introduction
The k-nearest neighbor (kNN) algorithm is one of the most widely used learning algorithms in machine learning research [1,2]. The main concept of kNN is to predict the label of a query instance based on the labels of the k closest instances in the stored data, assuming that the label of an instance is similar to those of its kNN instances. kNN is simple and easy to implement, yet very effective in terms of prediction performance. kNN makes no specific assumptions about the distribution of the data. Because it is an instance-based learning algorithm that requires no training before making predictions, incremental learning can be easily adopted. For these reasons, kNN has been actively applied to a variety of supervised learning tasks, including both classification and regression.
The procedure for kNN learning is as follows. Suppose a training dataset D = {(x_t, y_t)}_{t=1}^N is given for a supervised learning task, where x_t and y_t are the input vector and the corresponding label vector of the t-th instance. y_t is assumed to be a one-hot vector in the case of a classification task and a scalar value in the case of a regression task. In the training phase, the dataset D is simply stored without any explicit learning. In the inference phase, for each query instance x, a kNN search is performed to retrieve the kNN instances N(x) = {(x^(1), y^(1)), ..., (x^(k), y^(k))} that are closest to x based on a distance function d. Then, the predicted label ŷ is obtained as a weighted combination of the labels y^(1), ..., y^(k) based on a weighting function w along with the distance function d as follows:

ŷ = ∑_{i=1}^{k} w(d(x, x^(i))) y^(i) / ∑_{i=1}^{k} w(d(x, x^(i))).

The difficulty in using kNN lies in determining the hyperparameters. The three main hyperparameters are the number of neighbors k, the distance function d, and the weighting function w [3]. First, in terms of k, a small k captures a specific local structure in the data, so the outcome can be sensitive to noise, whereas a large k concentrates more on the global structure of the data and suppresses the effect of noise. Second, the distance function d determines how the distance between the input vectors of a pair of instances is calculated, under the assumption that nearby instances are highly relevant. Popular examples of this function for kNN are the Manhattan, Euclidean, and Mahalanobis distances. Third, the weighting function w determines how much each kNN instance contributes to the prediction. The standard kNN assigns the same weight to each kNN instance (i.e., w(d) = 1/k). It is known to be better to assign larger/smaller weights to closer/farther kNN instances based on their distances to the query instance x using a non-uniform weighting function (e.g., w(d) = 1/d); a kNN instance with a larger weight then contributes more to the prediction.
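The weighted prediction rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the epsilon guard against division by zero and the explicit normalization of the weights are added implementation choices.

```python
import numpy as np

def knn_predict(x, X_train, y_train, k,
                d=lambda a, b: float(np.linalg.norm(a - b)),
                w=lambda dist: 1.0 / (dist + 1e-12)):
    """Distance-weighted kNN prediction for a single query instance x."""
    dists = np.array([d(x, xt) for xt in X_train])
    nn = np.argsort(dists)[:k]              # indices of the k nearest instances
    weights = np.array([w(dists[i]) for i in nn])
    weights = weights / weights.sum()       # normalize the weights to sum to 1
    return weights @ y_train[nn]            # weighted combination of neighbor labels
```

Passing w(d) = 1/d (the default above) recovers the weighted variant, while passing a constant weighting function recovers the standard uniform kNN average.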
The performance of kNN is known to be highly sensitive to the hyperparameters, the best setting of which depends on the characteristics of the data [3,4]. Thus, the hyperparameters must be chosen appropriately to improve the prediction performance. Since this is a challenging issue, considerable research effort has been devoted to hyperparameter optimization for kNN, which is introduced briefly in Section 2. Compared to related work, the main aim of this study is end-to-end kNN learning that is more robust to the hyperparameter setting and can make predictions for new data without additional optimization procedures.
This study presents a novel end-to-end kNN learning method, named kNN graph neural network (kNNGNN), which learns a task-specific kNN rule from the training dataset in an end-to-end fashion based on a graph neural network. For each instance in the training dataset and its kNN instances, a kNN graph is constructed with nodes representing the label information of the instances and edges representing the distance information between the instances. Then, a graph neural network is built to consider the kNN graph of an instance to predict the label for the instance. The graph neural network can be regarded as a data-driven implementation of implicit weight and distance functions. By doing so, the prediction performance of kNN can be improved without careful consideration of its hyperparameter setting. The proposed method is applicable to any type of supervised learning task, including classification and regression. Furthermore, the proposed method does not require any additional optimization procedure when making predictions for new data, which is advantageous in terms of computational efficiency. To investigate the effectiveness of the proposed method, experiments are conducted using various benchmark datasets for classification and regression tasks.

Related Work
This section discusses related work on hyperparameter optimization for the kNN algorithm, which has been actively studied by many researchers. As previously mentioned, kNN learning involves three main hyperparameters: the number of neighbors k, the distance function d, and the weighting function w. A different dataset requires a different hyperparameter setting, and no specific setting can universally be the best for every application, as indicated by the no-free-lunch theorem [5]. Thus, the proper choice of these hyperparameters is critical for obtaining high prediction performance. In practice, the best hyperparameter setting for a given dataset is usually determined by a cross-validation procedure that searches over possible hyperparameter candidates. Various search strategies are applicable, such as grid search, random search [6], and Bayesian optimization [7], but they are time-consuming and costly, especially for large-scale datasets. Previous research efforts have therefore focused on choosing the hyperparameters of kNN in more intelligent ways based on heuristics or extra optimization procedures for each query instance.
There are two main research approaches regarding the number of neighbors k. The first approach is to assign different k values to different query instances based on their local neighborhood information instead of a fixed k value [8][9][10][11][12]. The second approach is to employ non-uniform weighting functions to reduce the effect of k on the prediction performance.
For the distance function d, one research approach is to learn task-specific distance functions directly from data to improve the prediction performance, which is referred to as distance metric learning [13,14]. Many methods for this approach were developed for use in the classification settings [15][16][17][18][19], while some were developed for use in the regression settings [20][21][22]. Another approach is to adjust the distance function in an adaptive manner for each query instance [23][24][25][26][27]. This requires an extra optimization procedure, as well as a kNN search when making a prediction for each query instance.
For the weighting function w, existing methods have focused on designing nonuniform weighting functions that decay smoothly as the distance increases [4]. One main research approach is to assign adaptive weights to the kNN instances of each query instance by performing an extra optimization procedure [23,[25][26][27][28], which also helps to reduce the effect of k. Another approach is to develop fuzzy versions of the kNN algorithm [29][30][31].
The three hyperparameters affect each other: the optimal choice of one hyperparameter depends on the others. Therefore, they must be considered simultaneously rather than independently. Moreover, methods involving costly extra optimization procedures when making predictions for query instances are computationally expensive, which is undesirable in practice. In addition, the majority of existing methods focus on specific settings, primarily classification tasks. A universal method that is efficient and applicable to various tasks would be beneficial. To address these concerns, this study proposes to jointly learn a distance function and a weighting function using a graph neural network in an end-to-end manner, which aims to make the prediction performance robust to the choice of k and is applicable to both classification and regression tasks.

Graph Representation of Data
Suppose that a training set D = {(x_t, y_t)}_{t=1}^N is given, where x_t ∈ R^p is the t-th input vector for the input variables and y_t is the corresponding label vector for the output variable. For a classification task with c classes, y_t is a c-dimensional one-hot vector in which the element corresponding to the target class is set to 1 and all the remaining elements are set to 0. For a regression task with a single output, y_t is a scalar representing the target value.
The proposed method uses a transformation function g that transforms each input vector x_t into a graph G_t such that G_t = g(x_t; D). Two hyperparameters need to be determined: the number of nearest neighbors k and the distance function d. They are used only to operate the transformation function g for the kNN search from D; they are not used explicitly in the learning procedure in Section 3.2. For each x_t, let x_t^(0) denote x_t itself and let {(x_t^(i), y_t^(i))}_{i=1}^k denote its kNN instances in D. The kNN graph of x_t is constructed as a fully connected undirected graph with k + 1 nodes and k(k + 1)/2 edges:

G_t = (V_t, E_t),  V_t = {v_t^i}_{i=0}^{k},  E_t = {e_t^{i,j}}_{0 ≤ i < j ≤ k},

where each node feature vector v_t^i ∈ R^{c+1} and edge feature vector e_t^{i,j} ∈ R^p are defined as:

v_t^0 = [0, ..., 0, 1],  v_t^i = [y_t^(i); 0] for i = 1, ..., k,
e_t^{i,j} = |x_t^(i) − x_t^(j)|,

where |·| denotes the elementwise absolute value. The number c is set to the number of classes in the case of classification and to 1 in the case of regression.
In the graph G_t, the 0-th node corresponds to x_t, and the other nodes correspond to the kNN instances of x_t. Each node feature vector v_t^i represents the label information with the last element set to zero, except that v_t^0 does not contain the label information and has the last element set to one. Each edge feature vector e_t^{i,j} consists of the elementwise absolute differences between the input vectors x_t^(i) and x_t^(j). Thus, G_t represents the labels of the kNN instances and the pairwise distances between the instances. It should be noted that G_t does not contain y_t because it must remain unknown when making a prediction in a supervised learning setting.
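As a concrete illustration, the graph construction g described above can be sketched in NumPy. The Euclidean distance for the kNN search and the array layout (edge features stored in a dictionary keyed by node pairs) are assumptions made here; the paper leaves d as a hyperparameter.

```python
import numpy as np

def build_knn_graph(x, X_train, Y_train, k):
    """Sketch of the transformation g: build the kNN graph of an input x.

    Rows of Y_train are one-hot label vectors (classification) or length-1
    arrays (regression), so c = Y_train.shape[1]. Euclidean distance is
    assumed for the kNN search.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dists)[:k]                       # indices of the k nearest instances
    xs = [x] + [X_train[i] for i in nn]              # node 0 corresponds to the query
    c = Y_train.shape[1]
    V = np.zeros((k + 1, c + 1))
    V[0, -1] = 1.0                                   # query node: no label, last element 1
    for i, idx in enumerate(nn, start=1):
        V[i, :c] = Y_train[idx]                      # neighbor nodes carry their labels
    E = {(i, j): np.abs(xs[i] - xs[j])               # elementwise absolute differences
         for i in range(k + 1) for j in range(i + 1, k + 1)}
    return V, E
```

The returned graph has k + 1 node feature vectors in R^{c+1} and k(k + 1)/2 edge feature vectors in R^p, matching the definition above.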

k-Nearest Neighbor Graph Neural Network
Here, the proposed method, named kNNGNN, is introduced, which implements kNN learning in an end-to-end manner. It adapts the message-passing neural network architecture [32], which can handle general node and edge features with invariance to graph isomorphism, to build a graph neural network for kNN learning. To learn a kNN rule from the training dataset D, it builds a graph neural network that operates on the graph representation G = g(x; D) of an input vector x to predict the corresponding label vector y as ŷ = f(G) = f(g(x; D)).
The model architecture used in this study is as follows. Each node feature vector v^i is first embedded into an initial node representation vector using an embedding function φ:

h^(0),i = φ(v^i).

A message-passing step on the graph G is then performed using two main functions: a message function M and an update function U. At each step l = 1, ..., L, the node representation vectors h^(l),i are updated as:

m^(l),i = ∑_{j ≠ i} M(e^{i,j}) h^(l−1),j,
h^(l),i = U(h^(l−1),i, m^(l),i).

After L steps of message passing, a set of node representation vectors {h^(l),i}_{l=0}^{L} is obtained for each node. The set for the 0-th node, {h^(l),0}_{l=0}^{L}, is then processed with a readout function r to obtain the final prediction of the label y:

ŷ = r({h^(l),0}_{l=0}^{L}).

The component functions φ, M, U, and r are parameterized as neural networks, mostly following the idea presented in Gilmer et al. [32]. The function φ is a two-layer fully connected neural network with p tanh units in each layer. The function M is a two-layer fully connected neural network whose first layer consists of 2m tanh units and whose second layer outputs an m × m matrix. The function U is modeled as a recurrent neural network with gated recurrent units (GRUs) [33], which takes the previous hidden state h^(l−1),i and the current input m^(l),i to derive the current hidden state h^(l),i at each step l. The function r is a two-layer fully connected neural network whose first layer consists of p tanh units and whose second layer outputs ŷ using softmax units in the case of classification and a linear unit in the case of regression. Different types of supervised learning tasks can thus be addressed by changing the units in the last layer of r.
The model defined above is denoted as the function f. The model makes a prediction from the input vector x and its kNN instances in D, i.e., ŷ = f(g(x; D)). The model differs from conventional neural networks in that it does not directly learn the relationship between the input and output variables. In terms of kNN learning, the distance and weighting functions are embedded implicitly in the function f. Therefore, the function f can be regarded as an implicit representation of a kNN rule, in which the functions M and U work as implicit distance and weighting functions, respectively.
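A minimal NumPy sketch of the message-passing computation may clarify the flow of information through the model. Here the component functions phi, M, U, and r are plain callables supplied by the caller, whereas the paper parameterizes them as neural networks (with U a GRU); the edge dictionary keyed by sorted node pairs is an assumption of this sketch.

```python
import numpy as np

def mpnn_forward(V, E, L, M, U, phi, r):
    """Minimal sketch of the message-passing forward pass on a kNN graph.

    phi embeds a node feature vector, M maps an edge feature vector to an
    m x m matrix, U updates a node state from its aggregated message, and
    r reads out the stacked states of node 0 (the query node). E maps an
    edge (i, j) with i < j to its feature vector.
    """
    n = V.shape[0]
    h = [phi(V[i]) for i in range(n)]               # h^(0),i
    history = [[hi.copy() for hi in h]]
    for _ in range(L):
        msgs = [sum(M(E[tuple(sorted((i, j)))]) @ h[j]
                    for j in range(n) if j != i)    # m^(l),i over all other nodes
                for i in range(n)]
        h = [U(h[i], msgs[i]) for i in range(n)]    # h^(l),i
        history.append([hi.copy() for hi in h])
    return r([step[0] for step in history])         # readout over node 0's states
```

All nodes are updated simultaneously from the previous step's states, and only node 0's trajectory is passed to the readout, mirroring the equations above.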

Learning from Training Data
Given the training dataset D = {(x_t, y_t)}_{t=1}^N, the proposed method learns a task-specific kNN rule from D in the form of ŷ = f(g(x; D)). The prediction model f is trained based on the graph representation g using the following objective function J:

J = (1/N) ∑_{t=1}^{N} L(y_t, f(g(x_t; D))),

where L is the loss function, the choice of which depends on the target task. The typical choices of the loss function are cross-entropy and squared error for the classification and regression tasks, respectively.
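The objective and the two typical loss functions can be sketched as follows. The function and variable names are hypothetical, and the training dataset D is assumed to be captured inside graph_g, so that graph_g(x) plays the role of g(x; D).

```python
import numpy as np

def objective(model_f, graph_g, X, Y, loss):
    """Sketch of J: the average loss of the model over the training
    instances, each represented by its kNN graph graph_g(x) = g(x; D)."""
    return float(np.mean([loss(y, model_f(graph_g(x))) for x, y in zip(X, Y)]))

def squared_error(y, y_hat):
    """Loss for regression tasks."""
    return float(np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2))

def cross_entropy(y, y_hat, eps=1e-12):
    """Loss for classification tasks; y is one-hot, y_hat a probability vector."""
    return float(-np.sum(np.asarray(y) * np.log(np.asarray(y_hat) + eps)))
```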

Prediction for New Data
Once the prediction model f is trained, it can be used to predict unknown labels for new data. The prediction procedure is illustrated in Figure 1. Given a query instance x* whose label y* is unknown, its kNN instances N(x*) are searched from the training dataset D based on the distance function d. Then, the corresponding graph G* = g(x*; D) is generated. The prediction of y*, denoted by ŷ*, is computed using the model f as: ŷ* = f(G*) = f(g(x*; D)).
The proposed method does not require additional optimization procedures when making predictions. The prediction for a query instance is obtained simply by performing a kNN search to identify its kNN instances and then processing them with the model. This is advantageous in terms of computational efficiency.
As the proposed method learns the kNN rule itself, incremental learning can be implemented efficiently. This is a major advantage of the kNN algorithm over other learning algorithms, especially when additional training data are collected over time after the model is trained. When new labeled data are added to the training dataset D, the prediction performance can improve without updating the model.
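The incremental property can be illustrated with a plain 1NN lookup standing in for the learned rule f(g(x; D)): appending a labeled instance to the stored data D changes subsequent predictions without any retraining. The toy data below are purely illustrative.

```python
import numpy as np

def predict(x, D, k=1):
    """kNN lookup standing in for f(g(x; D)): the stored data D, not any
    trained parameters, determines the prediction."""
    X = np.array([xt for xt, _ in D])
    y = np.array([yt for _, yt in D])
    nn = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return float(y[nn].mean())

D = [(np.array([0.0]), 0.0), (np.array([2.0]), 2.0)]
before = predict(np.array([0.9]), D)   # nearest stored instance is x = 0
D.append((np.array([1.0]), 1.0))       # new labeled instance added, no retraining
after = predict(np.array([0.9]), D)    # nearest stored instance is now x = 1
```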

Datasets
The effectiveness of the proposed method was investigated through experiments on various benchmark datasets. Twenty classification datasets and twenty regression datasets were collected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/ (accessed on 10 January 2021)) and the StatLib Datasets Archive (http://lib.stat.cmu.edu/datasets/ (accessed on 10 January 2021)). The datasets used for the classification tasks were annealing, balance, breastcancer, carevaluation, ecoli, glass, heart, ionosphere, iris, landcover, movement, parkinsons, seed, segment, sonar, vehicle, vowel, wine, yeast, and zoo. The datasets used for the regression tasks were abalone, airfoil, appliances, autompg, bikesharing, bodyfat, cadata, concretecs, cpusmall, efficiency, housing, mg, motorcycle, newspopularity, skillcraft, spacega, superconductivity, telemonitoring, wine-red, and wine-white. Each dataset had a different number of instances and a different dimensionality. For each dataset, one thousand instances were randomly sampled if the size of the dataset was greater than 1000. All numeric variables were normalized into the range of [−1, 1]. The details of the datasets are listed in Tables 1 and 2.

Compared Methods
Three kNN methods with different weighting schemes w were compared in the experiments: uniform kNN, weighted kNN, and the proposed kNNGNN. The uniform kNN and weighted kNN used the following weighting functions, respectively:

w(d) = 1/k (uniform),  w(d) = 1/d (weighted).

For kNNGNN, the weighting function is embedded implicitly in the model. For each method, the hyperparameter settings were varied to examine their effects. The candidates for the distance function d were the Manhattan, Euclidean, and Mahalanobis distances:

d(x, x′) = ∑_{j=1}^{p} |x_j − x′_j|,
d(x, x′) = (∑_{j=1}^{p} (x_j − x′_j)^2)^{1/2},
d(x, x′) = ((x − x′)^T S^{−1} (x − x′))^{1/2},

where S is the covariance matrix of the input variables calculated from the training dataset.
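The candidate distance and weighting functions can be written out directly. This is a straightforward transcription of the formulas above; the epsilon guard in the inverse-distance weight is an added implementation detail to avoid division by zero.

```python
import numpy as np

def manhattan(x1, x2):
    return float(np.sum(np.abs(x1 - x2)))

def euclidean(x1, x2):
    return float(np.sqrt(np.sum((x1 - x2) ** 2)))

def mahalanobis(x1, x2, S):
    """S: covariance matrix of the input variables from the training set."""
    diff = x1 - x2
    return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))

def uniform_weight(dist, k):
    return 1.0 / k               # every neighbor contributes equally

def inverse_distance_weight(dist, eps=1e-12):
    return 1.0 / (dist + eps)    # closer neighbors contribute more
```

Note that with S equal to the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.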
Accordingly, a total of nine combinations of distance and weighting functions were compared in the experiments, as summarized in Table 3. None of the methods used any additional optimization procedures when making predictions. For kNNGNN, the distance function was used explicitly only for the kNN search to generate the graph representations of the data. For each combination, the effect of k on the prediction performance was investigated by varying its value over 1, 3, 5, 7, 10, 15, 20, and 30.

Experimental Settings
In the experiments, the performance of each method was evaluated using a two-fold cross-validation procedure. In this procedure, the original dataset was divided into two disjoint subsets. Then, two iterations were conducted, each of which used one subset as the training set and the other as the test set. As performance measures, the misclassification error rate and the root mean squared error (RMSE) were used for the classification and regression tasks, respectively. Given a test set denoted by D = {(x_t, y_t)}_{t=1}^N, the performance measures are calculated as:

Error rate = (1/N) ∑_{t=1}^{N} I(argmax ŷ_t ≠ argmax y_t),
RMSE = ((1/N) ∑_{t=1}^{N} (y_t − ŷ_t)^2)^{1/2},

where ŷ_t is the prediction for x_t and I is the indicator function.

For the proposed method, each prediction model was built with the following configurations. In the objective function J, the loss function L was set to cross-entropy for the classification tasks and squared error for the regression tasks. The hyperparameter L was set to 3, as Gilmer et al. [32] reported that any L ≥ 3 worked well. The hyperparameter p was selected from {10, 20, 50} by holdout validation. In the training phase, dropout was applied to the function r with a dropout rate of 0.1 for regularization [34]. During training, 80% and 20% of the training set were used to train and validate the model, respectively. The model parameters were updated using the Adam optimizer with a batch size of 20. The learning rate was set to 10^{-3} at the first training epoch and was reduced by a factor of 0.1 if no improvement in the validation loss was observed for 10 consecutive epochs. The training was terminated when the learning rate reached 10^{-7} or the number of epochs reached 500. In the inference phase, for each query instance, 30 different outputs were obtained by performing stochastic forward passes through the trained model with dropout turned on [35]. The average of these outputs was used as the predicted label for the instance.
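The two performance measures can be sketched as follows. The assumption here is that, for classification, rows of Y are one-hot label vectors and rows of Y_hat are predicted class-probability vectors, so that argmax recovers the class index on both sides.

```python
import numpy as np

def error_rate(Y, Y_hat):
    """Misclassification error rate; rows of Y are one-hot label vectors
    and rows of Y_hat are predicted class-probability vectors."""
    return float(np.mean(np.argmax(Y, axis=1) != np.argmax(Y_hat, axis=1)))

def rmse(y, y_hat):
    """Root mean squared error for regression."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))
```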
All baseline methods were implemented using the scikit-learn package in Python. The proposed method was implemented based on GPU-accelerated TensorFlow in Python. All experiments were performed 10 times independently with different random seeds, and the average performance over the repetitions was compared. For each of the three weighting functions w, the summary statistics of the performance over different settings of the distance function d and the number of neighbors k are reported.

Results and Discussion

Figure 2 shows the error rates of the baseline and proposed methods with varying hyperparameter settings on the 20 classification datasets. Compared to the baseline methods, kNNGNN yielded lower error rates at various values of k on most datasets. The average, standard deviation, and best error rate for each dataset under different hyperparameter settings are summarized in Table 1. kNNGNN yielded the lowest average and standard deviation of the error rate over different hyperparameters on most datasets, indicating that its performance was less sensitive to the hyperparameter settings. In particular, kNNGNN was superior to the baseline methods when k was large.

Figure 3 compares the baseline and proposed methods in terms of the RMSE with varying hyperparameter settings on the 20 regression datasets. As shown in the figure, the performance curves of kNNGNN flattened as k increased on most datasets, whereas the RMSE of the baseline methods tended to increase at large k on some datasets. Table 2 shows the average, standard deviation, and best RMSE under different hyperparameter settings for each dataset. The behavior of kNNGNN was similar to that on the classification tasks: kNNGNN showed stable performance against changes in the hyperparameter settings and yielded the lowest average and standard deviation of the RMSE for the majority of datasets.
In summary, the experimental results demonstrated the effectiveness of kNNGNN in improving the prediction performance for both classification and regression tasks. Although kNNGNN failed to yield the lowest error on some datasets, it was highly robust to its hyperparameters. This indicates that kNNGNN can provide comparable performance without careful hyperparameter tuning; thus, it may be preferred in practice considering the difficulty of choosing the optimal hyperparameter setting. Because the performance curve of kNNGNN flattened at large k values on most datasets, setting a moderate k value of around 15∼20 would be reasonable considering the trade-off between performance and computational cost.

Conclusions
This study presented kNNGNN, which learns a task-specific kNN rule from data in an end-to-end fashion. The proposed method constructs the kNN rule in the form of a graph neural network, in which the distance and weighting functions are embedded implicitly. The graph neural network takes the kNN graph of an instance as input to predict the label of the instance. Owing to the flexibility of neural networks, the method can be applied to any type of supervised learning task, including classification and regression. It does not require any extra optimization procedure when making predictions for new data, which is beneficial in terms of computational efficiency. Moreover, as the method learns the kNN rule instead of the explicit relationship between the input and output variables, incremental learning can be implemented efficiently.
The effectiveness of the proposed method was demonstrated through experiments on benchmark classification and regression datasets. The results showed that the proposed method yields comparable prediction performance with less sensitivity to the choice of its hyperparameters, allowing more robust kNN learning without careful hyperparameter tuning. The use of a graph neural network for kNN learning still has room for improvement and thus merits further investigation. One practical concern is the high time and space complexity of the graph neural network, which increases with k; the network cannot be trained in a reasonable amount of time without using a GPU. Alleviating this complexity to improve learning efficiency will be an avenue for future work.