Locally-Scaled Kernels and Confidence Voting

Abstract: Classification, the task of discerning the class of an unlabeled data point using information from a set of labeled data points, is a well-studied area of machine learning with a variety of approaches. Many of these approaches are closely linked to the selection of metrics or the generalizing of similarities defined by kernels. These metrics or similarity measures often require their parameters to be tuned in order to achieve the highest accuracy for each dataset. For example, an extensive search is required to determine the value of K or the choice of distance metric in K-NN classification. This paper explores a method of kernel construction that, when used in classification, performs consistently over a variety of datasets and does not require the parameters to be tuned. Inspired by dimensionality reduction techniques (DRT), we construct a kernel-based similarity measure that captures the topological structure of the data. This work compares the accuracy of K-NN classifiers, computed with specific operating parameters that obtain the highest accuracy per dataset, to a single trial of the here-proposed kernel classifier with no specialized parameters on standard benchmark sets. The here-proposed kernel used with simple classifiers has comparable accuracy to the 'best-case' K-NN classifiers without requiring the tuning of operating parameters.


Introduction
Discerning the label $y^*$ for an unlabeled data point $x^*$ using information from data points with known labels $(x_j, y_j)$ is a question fundamentally based on 'similarity'. It is expected that points of the same class are similar, so it is reasonable to map similar data points to the same class; thus, the label of $x^*$ could be determined using the class information of similar labeled data points. One way to interpret similarity is through a distance metric; if two points are 'close' by some distance metric then they are similar, and if they are far, then they are dissimilar. This leads to the concept of the K-nearest neighbors (K-NN) and the K-NN classifier.
The choice of distance metric has a significant impact on the accuracy of traditional K-NN classifiers. This impact has been explored in the literature, with [1][2][3][4][5][6] reporting the success of a variety of distance metrics used with a K-NN classifier on diverse sets of high-dimensional datasets. No single distance metric performs best across all datasets; rather, the best-performing distance metric is specific to each dataset.
The problem statement is thus: when presented with a labeled dataset $(x_j, y_j) \in X$ ($|X| = J$), where $x_j \in \mathbb{R}^N$ with associated class label $y_j \in \{1, \ldots, C\}$, construct a function $\mathrm{class}(x_j)$ that maps $x_j$ to its label $y_j$. Using this, the class of a 'new' unlabeled point $x^*$ can be discerned. This function should perform universally well across all datasets without requiring the tuning of its operating parameters.
A distance metric that, when used in classification, performs universally well across all datasets should adjust to the data on a local scale. Ideally, the distance between data points can be measured according to the topological structure of the data. From our previous experience in data mining, we understand the importance of feature extraction in classification [7]: even the best classifier will be unable to make meaningful classifications if the features of the data are indistinguishable. No distance metric can measure meaningful similarity if the feature space of the data does not separate data in a meaningful way. In machine learning, kernels are a tool used to pull the data to a new feature space called $H$; this space is designed to better capture the similarities between points of the same class and the dissimilarities between points of different classes [8]. Conveniently, the inner product (which induces a distance metric) in this space is a measure of similarity [9], making the kernel a natural extension of distance-based methods of classification.
Many approaches to classification only depend on the pairwise distance/similarity/correlation of the input data. As such, these algorithms can be kernelized, meaning the algorithm operates on the values of the kernel. This is further motivation for the use of a kernel to map to a new feature space that better captures similarity. Furthermore, kernel methods provide an advantage when working with high-dimensional datasets, as the size of the Gram (kernel) matrix depends only on the number of data points $J$; once it is computed, we no longer have to perform any calculations in the space of $\mathbb{R}^N$.
The success of traditional and weighted K-NN classifiers also depends on how confident $x_j$ is of its label $y_j$. It can be challenging to discern the label of $x^*$ using information from the surrounding neighbors when the neighbors are outliers or mislabeled data points, which is not uncommon in real-world datasets [10]. Thus, a secondary goal is to design a method of classification that diverges from the traditional K-NN classifier and introduces some measure of confidence to each $x_j$. This allows a trade-off between supervised classification (i.e., classification based on the known labels) and semi-supervised classification (i.e., classification based on some inferred class structure of the data).
To evaluate the proposed approach, we perform tests using standard benchmark datasets from the University of California Irvine (UCI) Machine Learning Repository [11]. Using a leave-one-out testing strategy, we compare classifiers based on the here-proposed similarity kernel with other weighted and non-weighted K-NN approaches. In contrast to approaches that study how to obtain the most accurate tuning for a classifier on each dataset individually, we are interested in designing a similarity kernel that, when used for classification, requires one tuning that works well for many/all datasets.

Background
We begin by giving two motivations for the use of kernels in classification, one based on arguments generalizing K-NN, and the other based on mathematical arguments.
Classification is a well-studied problem in machine learning, with myriad tools existing for a diverse array of applications. One popular classifier is K-nearest neighbors [12], which uses the $K$ nearest labeled points (by some distance metric $d$) of $x^*$ to draw conclusions about the label of $x^*$. For ease of notation, $n_K(x_j, x_k)$ is a binary function that returns 1 if $x_k$ is one of the $K$ nearest neighbors of $x_j$ and 0 if it is not. A $J \times C$ matrix $G$ called the 'class matrix' stores the class information of the labeled samples such that $G_{j,c} = 1$ if $x_j$ belongs to class $c$ and zero otherwise. Classification can be performed as follows:
$$\mathrm{class}(x^*) = \arg\max_{c} \sum_{j=1}^{J} n_K(x^*, x_j)\, G_{j,c}.$$
A common choice for the distance metric is the Euclidean metric $d_{EUC}$. However, there are many possible choices of distance metrics for K-NN classification. An in-depth exploration of metrics can be found in the work of Abu Alfeilat et al. [1], which measures the classification accuracy, precision, and recall for a 1-NN classifier using 27 distance metrics from 13 distance metric families across 23 benchmark datasets from the UCI Machine Learning Repository [11]. From this study, it was apparent that the distance metric with the highest accuracy depends highly on the dataset, i.e., there is no single distance metric that proves to be the 'best' across all benchmark datasets. Some of the top-performing distance metrics from this work can be found below in Table 1.

Table 1. Top-performing distance metrics from [1] (columns: Name, Equation).

The study in [1] concludes that the Hassanat distance [2] is the best-performing distance metric in general.
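The K-NN vote described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name `knn_classify`, the toy data, and the use of the Euclidean metric are assumptions chosen for the example.

```python
import math

def knn_classify(x_star, X, G, K):
    """Return the class index with the most votes among the K nearest labeled points."""
    # Euclidean distance from x_star to every labeled point
    dists = [math.dist(x_star, x) for x in X]
    # Indices of the K nearest neighbors (n_K = 1 for these, 0 for all others)
    nearest = sorted(range(len(X)), key=lambda j: dists[j])[:K]
    # Sum the rows of the class matrix G over the neighborhood
    n_classes = len(G[0])
    votes = [sum(G[j][c] for j in nearest) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: votes[c])

# Toy data (hypothetical): three points of class 0 and two points of class 1
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (1.0, 1.0), (0.9, 1.0)]
G = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
label = knn_classify((0.05, 0.05), X, G, K=3)
```

Swapping `math.dist` for another metric changes only the first line of the function, which is why the choice of metric and of `K` are the two operating parameters that must be searched per dataset.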
Notably, in the study by Abu Alfeilat et al., the authors only perform classification with a 1-NN classifier. This experimental design choice could have been influenced by the well-reported deterioration of K-NN classifiers as the number of neighbors increases [13]. For an unknown dataset, testing must be performed to find the most suitable distance metric and value of $K$. This motivated us to find a classification technique that is metric-insensitive and stable with regard to operating parameters such as $K$.
Similar work in [9] found that computing the weights of a K-NN classifier on a local scale can temper the classifier's sensitivity to $K$.
Aside from exploring the effect of the choice of distance metric on K-NN classifier accuracy, another method of improving classification accuracy concerns changing the influence of the neighbors' votes, following the idea that some neighbors might be more relevant to the classification of $x^*$ and therefore their vote should be weighted more heavily. Thus, weights $w$ are introduced. The choice of weights can be guided using either combinatorial considerations or metric/topological arguments.
An optimal weighting scheme based on a combinatorial argument is presented by Samworth in [13]. This technique compares the risk of a classification using weights $w_{n,m}$ to the optimal Bayesian classifier, and uses an asymptotic expansion to derive an optimality criterion. Samworth first determines the optimal number of nearest neighbors $K^*$ based on the dimension of the data $N$ and the number of samples $J$: $K^* = \lfloor B^* J^{4/(N+4)} \rfloor$. The weight for the $k$th nearest neighbor is then
$$w_k = \frac{1}{K^*}\left[1 + \frac{N}{2} - \frac{N}{2 (K^*)^{2/N}}\left(k^{1+2/N} - (k-1)^{1+2/N}\right)\right], \quad k = 1, \ldots, K^*,$$
with $w_k = 0$ for $k > K^*$. Note that this weighting, while proven combinatorially optimal, is not local to each neighborhood, as all neighborhoods use the same weight for their $k$th nearest neighbor.
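To make the combinatorial weighting concrete, the following sketch computes $K^*$ and the weight sequence, assuming the closed form for the optimal weights given in [13]. The constant $B^*$ depends on the underlying distributions and is generally unknown, so `B_star=1.0` below is a placeholder assumption, and the function name `samworth_weights` is hypothetical.

```python
import math

def samworth_weights(J, N, B_star=1.0):
    """Return (K*, weights) for J samples in N dimensions.

    B_star is a placeholder; in [13] it depends on the data distribution.
    """
    # K* = floor(B* J^(4/(N+4))), clamped so that 1 <= K* <= J
    K = max(1, min(J, math.floor(B_star * J ** (4.0 / (N + 4)))))
    p = 1.0 + 2.0 / N
    scale = N / (2.0 * K ** (2.0 / N))
    # Weight of the k-th nearest neighbor, identical for every neighborhood
    w = [(1.0 + N / 2.0 - scale * (k ** p - (k - 1) ** p)) / K
         for k in range(1, K + 1)]
    return K, w

K_star, w = samworth_weights(J=1000, N=4)
```

The weights sum to one and decrease with the neighbor rank, but they depend only on the rank `k`, never on the actual distances, which is the sense in which this scheme is not local.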
Kernels can be used as a generalization of distance metrics. We are operating in an inner product space (Hilbert space), with some scalar product $\langle \cdot, \cdot \rangle_H$. The scalar product is seen as a measure of similarity between two data points; it is 0 if the data points are dissimilar, usually called orthogonal, and it increases as the data 'aligns'. The scalar product induces a norm in this space as $\|x\| = \langle x, x \rangle^{1/2}$, and the norm induces a distance metric $d(x_j, x_k) = \|x_k - x_j\|$. Now, given a possibly non-linear feature map $\Phi$ that maps data into a new feature space, we define $k(x_j, x_k) = \langle \Phi(x_j), \Phi(x_k) \rangle$, meaning the kernel gives the value of the scalar product. The mathematical machinery, with the representer theorem and Mercer's theorem at its center, is extensively studied in reference [8]. Conveniently, we only need to define a good kernel, and the similarity measure is implied. The task of choosing the correct kernel is still regarded as an open problem [14].
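As a short worked step, the distance induced in the feature space can be written entirely in terms of kernel values by expanding the norm with the bilinearity of the scalar product:

```latex
\begin{align*}
d_H(x_j, x_k)^2
  &= \|\Phi(x_k) - \Phi(x_j)\|^2 \\
  &= \langle \Phi(x_k), \Phi(x_k) \rangle
     - 2\,\langle \Phi(x_j), \Phi(x_k) \rangle
     + \langle \Phi(x_j), \Phi(x_j) \rangle \\
  &= k(x_k, x_k) - 2\,k(x_j, x_k) + k(x_j, x_j).
\end{align*}
```

Here $d_H$ is notation introduced only for this step. For a kernel with $k(x, x) = 1$, this reduces to $d_H(x_j, x_k)^2 = 2\,(1 - k(x_j, x_k))$, so a larger similarity corresponds directly to a smaller distance.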
Weights chosen following a metric argument are computed from a function of the distance $d_{jk}$ between $x_j$ and its $k$th nearest neighbor $x_k$, following the idea that points that measure closer to $x_j$ should have a more influential vote. Some examples of weights $w_{j|k}$ that are associated with each $j$-neighbor-$k$ pair are outlined in Table 2. The use of the $|$ symbol in $w_{j|k}$ is to emphasize that these weights are not symmetric, as they are all locally scaled in some way with respect to the defining point of the neighborhood $x_j$. The weights are applied in the classification function as follows:
$$\mathrm{class}(x_j) = \arg\max_{c} \sum_{k=1}^{J} n_K(x_j, x_k)\, w_{j|k}\, G_{k,c}.$$
Following the ideas of weights and kernels, we can construct a measure of similarity. Points that have a high measure of similarity are more likely to belong to the same class, and the inverse is true as well. To construct a measure of similarity, we look to dimensionality reduction algorithms such as UMAP [15], Laplacian Eigenmaps [16], and t-SNE [17]. These techniques are designed to capture a measure of similarity in high-dimensional space and preserve that measure of similarity in an embedding of the data in a lower-dimensional space, primarily for the purpose of visualization. However, the measure of similarity can also be used for classification, following the idea that similar data points should belong to the same class.

Table 2. Examples of weights $w_{j|k}$ for weighted K-NN classification (columns: Name, Equation).

The 'go-to' choice for similarity measures in the literature is Gaussian measures [17,20]. Interestingly, the dimensionality reduction community (in techniques such as UMAP [15] and Laplacian Eigenmaps [16]) uses Laplacian measures of the form $w_{j|k} = e^{a\,d(x_j, x_k)}$. These techniques are based on the manifold hypothesis [21], which assumes that the data are distributed on a manifold $(M, g)$ embedded in the input space. Thus, the similarity measure, which measures similarity based on some distance, should use the geodesic distance $g$, as it best represents the underlying structure of the data.
However, as the manifold is not explicitly defined, [15,17,22] explore different ways of approximating (or in the case of [15], capturing) that geodesic distance. Specifically, UMAP works on the assumption that the dataset is distributed uniformly upon a Riemannian manifold, allowing for the geodesic distance to be measured by the Euclidean distance in sufficiently small local neighborhoods. This allows UMAP to 'capture' the geodesic distance instead of approximating it.
Finally comes the question of evaluating the success of the classifier.When evaluating a method of classification with a benchmark dataset, one must partition the data into a training set and a testing set.The training set is used for training the classifier.The classifier will make predictions on the testing set, and those predictions are compared back to the ground truth labels.From this, metrics such as accuracy can be derived.In the literature, when documenting the efficacy of methods of classification, data are usually divided using a train-test partition, or the data are divided into n-folds where each fold takes a 'turn' being the test set and all other folds are combined to create the training set.In both these methods, the success metrics (e.g., accuracy) are averaged across all the folds, or in the case of the test-train partition, averaged across a few trials of random partitions.Some common train-test split ratios are 66:33 [1], 70:30 [2,3], and 80:20 [23], and some common folds are 5-fold [24] and 10-fold [5,6,25,26].Regardless of the ratio of the partition, the randomness of how the data points are divided affects the performance of the classifier, which impacts the repeatability of the experiment [27].Abu Alfeilat et al. report the common performance measures of accuracy, precision, and recall.The method of measuring multi-class accuracy varies across different sources [28-30], which is reviewed in Section 5.2 with Equations ( 8) and ( 9).We have found that studies of classifiers on benchmark datasets do not always publish their method of measuring accuracy, and the ambiguity of the success metrics and train-test split creates a challenge when comparing the results from different researchers.

The Similarity Kernel
A kernel is a positive semi-definite mapping. Specifically, a similarity kernel $k(x_j, x_k)$ maps two elements of the dataset to a value between zero and one (inclusive); the closer the value is to one, the more similar the data points are considered. However, unlike the weights from Table 2, kernels are required by Mercer's theorem [8] to be symmetric, meaning $k(x_j, x_k) = k(x_k, x_j)$. Another way to view the kernel approach is to consider Hilbert spaces and metrics. A kernel is the generalized idea of a distance metric. Given a mapping $\Phi$ from the input space $X$ to a new space (often called the feature space $X'$), one automatically obtains a similarity measure by defining $k(x_j, x_k) = \langle \Phi(x_j), \Phi(x_k) \rangle$, by Mercer's theorem.
By the representer theorem [8], the functions of interest can be expressed as a linear combination of kernel functions, and classification functions are no exception. Thus, the classifier $\mathrm{class}(x_j)$ can be constructed using the similarity measurement. The question then becomes: what kernel best captures the similarities (and dissimilarities) of the data's classes?

Theoretical Framework-Preserving the Topology
Inspired by the work of dimensionality reduction techniques (DRT) such as UMAP [15], Laplacian Eigenmaps [16], and t-SNE [17], a similarity measure is designed to capture the local and global structure of the dataset, thereby preserving the topology of the data. This stands in contrast to much of the existing research, for example [13], which designs optimal weights for K-NN classifiers based on combinatorics, or [1,2], which select/design best-performing weights based on experimental success instead of using a mathematical framework for justifying their design choices. The similarity measures in UMAP [15] and Laplacian Eigenmaps [16] use a Laplacian distribution, in contrast to t-SNE [17], which uses a Gaussian distribution; a Laplacian distribution has a longer tail, which is necessary for capturing the distances between points in high-dimensional space [31]. The following describes a topological justification for the choice of similarity kernel $k$, which preserves the topology of the dataset.
1. It is assumed the data points $x_j$ are distributed uniformly on a Riemannian manifold $(M, g)$ embedded in the vector space $\mathbb{R}^N$. This manifold represents the underlying natural structure of the data.
Then, naturally, the geodesic distance $g(x_j, x_m)$ along the manifold should be used to measure the closeness of neighbors, but $g$ is not explicitly defined for real-world datasets. However, a Riemannian manifold is locally Euclidean. A lemma presented in [15] outlines a strategy for measuring geodesic distance on a 'local neighborhood' scale. The lemma states:
2. If $g$ is locally constant in an open neighborhood $U$, then within a ball $B \subseteq U$ of radius $r$ and volume $\frac{\pi^{N/2} r^N}{\Gamma(N/2+1)}$ centered at point $x_j$, the geodesic distance from $x_j$ to any point $x_k \in B$ is $\frac{1}{r} d(x_j, x_k)$, where $d$ is the Euclidean distance in $\mathbb{R}^N$. This is because the geodesic distance in the neighborhood is bounded by the Euclidean distance on the tangent plane at $x_j$.
Around each point $x_j$ on the Riemannian manifold, there exists a tangent space that is spanned by the tangent vectors. The neighbors of $x_j$ can be projected onto the tangent space. The geodesic distances between $x_j$ and its neighbors are bounded by the Euclidean metrics implied by the scalar product of the tangent vectors on the tangent space. If the data are contained on a Riemannian surface, the local neighborhood as defined by the geodesic distance is also the local neighborhood determined using Euclidean distance. Thus, for sufficiently small neighborhoods, the local metric can be captured by the Euclidean distance metric.
The question that then remains is: how can this be used to compute similarity? The solution is to capture the topology of the data with a graph. The nodes of the graph represent the data points, and the edges, weighted with neighborhood-local measurements computed using Lemma 1 from [15], connect neighbors together.
3. A simplex is a generalized triangle. From algebraic topology, it is known that a simplicial approximation of a manifold can be used to capture the topological aspects of that manifold [32].
We use the topology captured by the graph to make inferences about similarity based on the geodesic distance of the manifold. This follows the principle that the geodesic distance is the most intuitive metric to use when measuring similarity, allowing for the measure of similarity to be based on geodesic distance without explicitly knowing $g(x_j, x_m)$.

4. Similarity is a symmetric property.
The result is a non-directed graph that can be used as a tool to represent the topology of the dataset. A kernel that captures similarity based on the topology of the data can then be constructed from the graph, the technical details of which are outlined in Section 3.2.

The Locally Scaled Symmetric Laplacian Diffusion Kernel
Based on points 1, 2, and 3 above, a weight is first constructed. The weight $w_{j|k}$ is based on a Laplacian distribution that has been locally scaled between $x_j$ and $K$ of its nearest neighbors. The scaling factor $\sigma_j$ is local to the neighborhood of $x_j$; $d_{j1}$, the distance from $x_j$ to its first nearest neighbor, shifts the weights such that they all decay relative to the first nearest neighbor of the neighborhood; and $\sigma_j$ is chosen such that $\sum_{k=1}^{K} w_{j|k} = \alpha$. Natural choices for the parameter $\alpha$ are 2, $\log_2(K)$, which is used by UMAP following combinatorial arguments, or $\sqrt{K}$. Table 7 in Section 6 quantifies the effects of $\alpha$ on classification accuracy. These weights are non-symmetric because the selection of $K$ nearest neighbors is non-symmetric. The selection and order of the neighbors do not change under this local metric when compared to the Euclidean distance metric.
Furthermore, following point 4 above, the weights are symmetrized using
$$\tilde{k}_{jk} = w_{j|k} + w_{k|j} - w_{j|k}\, w_{k|j}.$$
Symmetrizing $w_{j|k}$ is necessary to obtain a proper similarity kernel $\tilde{k}_{jk}$. Note that this method of symmetrization follows the fuzzy set union (probabilistic sum), as described in UMAP [15], and not the arithmetic mean as in t-SNE [17]. Practically, this means that if $w_{i|j} = 1$ and $w_{j|i} = 0$, symmetrizing results in $\tilde{k}_{ij} = 1$. Without this, points in a tightly knit cluster might not be able to capture their location in the larger structure of the data, thereby losing important global structure information.
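A few worked values show how the symmetrization rule $w_{j|k} + w_{k|j} - w_{j|k} w_{k|j}$ differs from the arithmetic mean. The input pairs are illustrative only:

```latex
\begin{align*}
(w_{j|k}, w_{k|j}) = (1.0,\ 0.0) &: \quad 1.0 + 0.0 - 0.00 = 1.00, & \text{mean} &= 0.5, \\
(w_{j|k}, w_{k|j}) = (0.9,\ 0.1) &: \quad 0.9 + 0.1 - 0.09 = 0.91, & \text{mean} &= 0.5, \\
(w_{j|k}, w_{k|j}) = (0.5,\ 0.5) &: \quad 0.5 + 0.5 - 0.25 = 0.75, & \text{mean} &= 0.5.
\end{align*}
```

In each case the result is at least as large as the larger of the two directed weights, so a strong connection seen from either side survives symmetrization, whereas the mean would halve it.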
Finally, the symmetrized weights are normalized to give $k_{jk}$. A kernel matrix $K$ is constructed where $K_{j,k} = k_{jk}$ and $K_{j,j} = 1$. This is the locally scaled symmetric Laplacian diffusion (LSSLD) kernel $k(x_j, x_k)$.
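The construction steps can be sketched as follows. This is a sketch under stated assumptions rather than the implementation used in this work: the weight form $w_{j|k} = \exp(-(d_{jk} - d_{j1})/\sigma_j)$ is assumed from the description above (shift by the first-neighbor distance, local scale $\sigma_j$), bisection is used in place of Newton's method to solve for $\sigma_j$, the final normalization step is omitted because its formula is not reproduced here, and the names `local_weights` and `lssld_kernel` are hypothetical.

```python
import math

def local_weights(dists, alpha, iters=60):
    """Weights exp(-(d_jk - d_j1) / sigma_j) whose sum over K neighbors equals alpha.

    dists: ascending distances from x_j to its K nearest neighbors.
    """
    if not 1.0 < alpha < len(dists):
        raise ValueError("alpha must lie strictly between 1 and K")
    shifted = [d - dists[0] for d in dists]  # decay relative to the first neighbor

    def total(sigma):
        return sum(math.exp(-d / sigma) for d in shifted)

    lo, hi = 1e-12, 1.0
    while total(hi) < alpha:      # widen the bracket until it contains the root
        hi *= 2.0
    for _ in range(iters):        # bisection: total() is increasing in sigma
        mid = 0.5 * (lo + hi)
        if total(mid) < alpha:
            lo = mid
        else:
            hi = mid
    sigma = 0.5 * (lo + hi)
    return [math.exp(-d / sigma) for d in shifted]

def lssld_kernel(X, K, alpha):
    """Return (directed weights W, symmetrized kernel matrix) for points X."""
    J = len(X)
    W = [[0.0] * J for _ in range(J)]
    for j in range(J):
        order = sorted((k for k in range(J) if k != j),
                       key=lambda k: math.dist(X[j], X[k]))[:K]
        w = local_weights([math.dist(X[j], X[k]) for k in order], alpha)
        for k, wk in zip(order, w):
            W[j][k] = wk
    # Symmetrize: w_j|k + w_k|j - w_j|k * w_k|j, then set the diagonal to 1
    Kmat = [[W[j][k] + W[k][j] - W[j][k] * W[k][j] for k in range(J)]
            for j in range(J)]
    for j in range(J):
        Kmat[j][j] = 1.0
    return W, Kmat

# Toy data (hypothetical): two small clusters in the plane
X = [(0.0, 0.0), (0.1, 0.0), (0.25, 0.05), (1.0, 1.0), (1.2, 1.1), (0.9, 1.3)]
alpha = math.log2(3)
W, Kmat = lssld_kernel(X, K=3, alpha=alpha)
```

Because the first neighbor always receives weight 1, the constraint is only solvable for `alpha` strictly between 1 and `K` when the neighbor distances are distinct, which the sketch checks up front.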

Kernel Computational Complexity
Note that time complexity is not the time required to execute code, but rather the number of steps needed to solve the problem. Regardless of whether one is computing a covariance matrix, distance matrix, or kernel, all pairs of data points need to be generated and all dimensions visited. This results in a complexity of $O(J^2 N)$, where $J$ is the number of data points and $N$ is the dimension of the input space. Computing the LSSLD kernel requires a normalization for each data point, and for this, we use Newton's method, which is a constant-time operation, to solve the non-linear equation.

Applications of Kernels in Classification
Many, if not most, of the commonly used machine learning algorithms (SVM, regression, ridge regression, random forest, etc.) can be kernelized; the algorithm takes as input the values of the kernel evaluated on all pairs or, mathematically, operates in a dual space. We use the LSSLD kernel defined in Section 3.2 in the following methods of classification. Specifically, this research considers two methods of classification: the first is a blended model of weighted K-NN and confidence voting (of which we consider three ratios of 'blending'), and the second is kernel ridge regression.
The simplest approach to classification is a weighted K-NN classifier that utilizes the LSSLD kernel as weights.

Confidence Voting and the Blended-Model
Going one step further, the kernel could be used to measure the confidence of a point $x_j$ regarding its own class. Perhaps $x_j$ is an outlier; this can be assessed by measuring the similarity of $x_j$ to points of the same class. Confidence voting is closely linked with outlier detection [33]; if $x_j$ is similar to points that mostly belong to a different class, $x_j$ could be an outlier. To prevent misclassification of $x^*$, outliers should not be weighted heavily. Thus, confidences $c_{j,c}$ are assigned to each sample $x_j$, where $c_{j,c} = \sum_{k=1}^{J} G_{k,c}\, k(x_j, x_k)$, such that the confidence reflects how certain $x_j$ is of belonging to class $c$. From this, a confidence matrix $C$, again a $J \times C$ matrix, can be constructed. Confidence voting can be used to curtail the negative effects of outliers. It introduces a validation for each of the training points to determine if the labels implied by the similarity kernel are equivalent to the labels given in the training set, and the decision-making is based on these implied labels. As can be seen in Table 8, datasets known to contain outliers [34] (Heart, Wine, Balance, Australian, Ionosphere, Rice, Haberman) perform better with confidence voting.
A parameter $\beta$ represents the 'trade-off' between the proportion of the classification that is based on $G_{j,c}$ (the known labels of the class, or ground truth) and the proportion of the classification that is based on $C_{j,c}$ (the underlying structure of the data), leading to blended-model voting. Note that when $\beta = 0$, the blended model is equivalent to K-NN using the LSSLD weights, and when $\beta = 1$, the blended model is equivalent to confidence voting.
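A minimal sketch of confidence voting and the blended model follows. It assumes a linear blend of the form $\sum_k k(x^*, x_k)\,[(1-\beta)\,G_{k,c} + \beta\,C_{k,c}]$, chosen because it reproduces the two limiting cases stated above; the row normalization of the confidence matrix is likewise an assumption, and the function names and the toy kernel values are hypothetical.

```python
def confidence_matrix(Kmat, G):
    """c_{j,c} = sum_k G[k][c] * K[j][k], with each row scaled to sum to 1 (assumption)."""
    J, n_classes = len(G), len(G[0])
    conf = [[sum(G[k][c] * Kmat[j][k] for k in range(J)) for c in range(n_classes)]
            for j in range(J)]
    return [[v / sum(row) for v in row] for row in conf]

def blended_vote(k_star, G, conf, beta):
    """Classify a point whose kernel values to the labeled points are k_star."""
    J, n_classes = len(G), len(G[0])
    scores = [sum(k_star[k] * ((1.0 - beta) * G[k][c] + beta * conf[k][c])
                  for k in range(J))
              for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: scores[c])

# Toy kernel (hypothetical values): point 3 carries label 1 but is similar
# mostly to points of class 0, so it behaves like a mislabeled point.
Kmat = [[1.0, 0.8, 0.7, 0.6],
        [0.8, 1.0, 0.7, 0.6],
        [0.7, 0.7, 1.0, 0.6],
        [0.6, 0.6, 0.6, 1.0]]
G = [[1, 0], [1, 0], [1, 0], [0, 1]]
conf = confidence_matrix(Kmat, G)
k_star = [0.1, 0.1, 0.1, 0.9]           # the new point sits next to point 3
vote_labels_only = blended_vote(k_star, G, conf, beta=0.0)
vote_confidence = blended_vote(k_star, G, conf, beta=1.0)
```

With `beta=0.0` the new point inherits the suspicious label of its closest neighbor, while with `beta=1.0` that neighbor's low confidence in its own label shifts the vote to the surrounding class.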

Regularization
This trade-off leads to the concept of regularization, which effectively prevents over-fitting. The similarity kernel matrix $K$ that was defined in Section 3.2 fits naturally into a regularization framework. A regularization approach proposed in [35,36], based on kernel ridge regression (KRR), uses a locally scaled Gaussian kernel. The regularization framework in [35] learns a global classification in terms of the kernel matrix $K$ and the identity matrix $I$, with a regularization parameter $0 < \gamma < 1$ that is assigned a value of 0.99.

Classification Computational Complexity
For the classification complexity, we assume the weights or kernel have been precomputed, and their complexity is outlined in Section 3.3. A weighted K-NN has a complexity of $O(J)$, independent of the dimension of the data $N$. Confidence voting requires the computation of the predicted class for each point, resulting in a complexity of $O(J^2)$. Ridge regression is the most expensive method of classification, as it requires the inversion of a matrix and has complexity $O(J^3)$.

Evaluation Methods
The following outlines our method of evaluating the LSSLD kernel-based classifiers. We could not directly compare the accuracy of the LSSLD kernel-based classifier to the accuracies of the classifiers in [1,18,19], as we found it challenging to reproduce their results. Some of the details surrounding their evaluation methods were ambiguous. Thus, we computed our own performance baseline of existing methods from [1,18,19] to compare the LSSLD kernel-based classifier against. Additionally, we describe our evaluations in detail below such that our results can be reproduced.

Datasets
This research uses 23 benchmark datasets from the UCI machine learning repository [11], as outlined in Table 3. These datasets are used as benchmarks in many areas of classification research [1,23-26,37,38] and vary in the number of points $J$, number of input dimensions $N$, and number of classes $C$. There are some notable remarks about the datasets that must be stated for the purpose of repeatability. To begin, for the Ionosphere dataset ($J = 351$, $N = 33$), either the 1st or 34th attribute can be interpreted as the label. This research uses the 34th attribute as the label, and the 1st attribute is included in the feature vector for $x$. The smallest dataset, Vehicle, has $J = 94$, which prevents exploring values of $K > 94$ in Tables 4 and 6 in Section 6, as $K$ cannot exceed $J$ in these classification methods. Additionally, some datasets such as Glass suffer from extremely unbalanced classes. In the case of Glass, classes 0 and 1 have approximately 70 elements, while classes 2 and 4 have fewer than 20. For class 5 with 9 elements, after partitioning the train-test sets, the training set has an insufficient 5 elements. In these cases, specifically, the train-test partition must be carefully selected to ensure all classes are represented in both sets.

Accuracy
The question is, what is considered a correct classification? In the two-class case, the answer is obvious and well agreed upon. The number of 'correct' classifications in the two-class case is the sum of the two diagonal entries in the confusion matrix (true positives and true negatives). This is independent of the class one considers as 'positive', and thus accuracy is also independent of class:
$$A = \frac{TP + TN}{TP + TN + FP + FN}.$$
However, if there are more than two classes, what is considered a 'correct classification' varies depending on the source. Following the work in [29,30], this work considers both true positives and true negatives as 'correct' in the multi-class case; thus, the accuracy $A_c$ depends on the class:
$$A_c = \frac{TP_c + TN_c}{TP_c + TN_c + FP_c + FN_c}.$$
In this research, to produce a single accuracy measurement, the accuracies are averaged across all classes.
As stated, not all sources agree on this method of calculating multi-class accuracy. For example, ref. [28] does not include true negatives as 'correct classifications' and thus, when measuring accuracy, only considers entries on the diagonal of the confusion matrix as 'correct'. This results in an entirely different measure of accuracy. In examples using UCI datasets where the method of calculating accuracy is ambiguous [1,24,25,37], it is not possible to compare the accuracy of a new method of classification to their published baseline. A consistent and clear definition of accuracy is essential for establishing a repeatable baseline. Therefore, the method of calculating accuracy in this research is made explicitly clear in Equation (9).
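The gap between the two conventions is easy to see on a small example. The sketch below computes both from a confusion matrix whose rows are true classes and whose columns are predicted classes; the matrix values and function names are hypothetical.

```python
def per_class_accuracy(cm):
    """(TP_c + TN_c) / total for each class c, counting true negatives as correct."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accs = []
    for c in range(n_classes):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                                # class c predicted as another class
        fp = sum(cm[r][c] for r in range(n_classes)) - tp   # another class predicted as c
        tn = total - tp - fn - fp
        accs.append((tp + tn) / total)
    return accs

def diagonal_accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[c][c] for c in range(len(cm))) / total

cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 2, 8]]
class_averaged = sum(per_class_accuracy(cm)) / len(cm)   # 74/90, about 0.822
diagonal_only = diagonal_accuracy(cm)                    # 22/30, about 0.733
```

The same predictions score roughly 0.82 under the class-averaged convention and 0.73 under the diagonal-only convention, a difference far larger than the one or two percent that often separates competing classifiers.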

Train-Test Split
The question of how to partition the data into a training set and testing set is also an important consideration.In the literature, a common approach is to split the benchmark datasets with a 66:33 [1], 70:30 [2,3] or 80:20 [23] train-test partition.The success of the classifier is partially dependent on which data are in the training set and which data are in the testing set, especially in small datasets or datasets with under-represented classes.When the partitions are not published, it can be difficult to reproduce the results.
The histogram in Figure 2a illustrates classification accuracy on 100 random 60:40 partitions of the 'smaller' dataset Heart.From this histogram, it is clear that the choice of partition can significantly impact the accuracy.Here, the accuracy varies by 20%.To obtain more objective, repeatable, and accurate metrics of success, n-fold testing is often used, which divides the data into n sets and iterates through the train-test process n times, each time reserving one of the n sets for testing and using the others for training, before averaging the success metrics of all n iterations.Averaging across n folds improves the consistency of the success metrics but still relies on a random partitioning of the n sets that could lead to inconsistencies when the experiment is reproduced.Figure 2b illustrates the distribution of accuracy for 100 randomly selected fold partitions for folds-of-10 trials.The accuracy varies by 6% even with the 10-fold method.In many cases of comparing two or more classifiers, the reported gain in accuracy is only 1 or 2 percent, well within this range of variance.
Additionally, for both the train-test partitioning and the n-fold testing, the effect of the distribution of the classes between the train-test partition can have a very significant impact on accuracy, as picking 10 samples at random in a multi-class problem does not ensure each class is represented appropriately in the test and training sets.We illustrate that the problems we encounter with the train-test partition or n-fold method are not related to the ratios of the partitions themselves but to the randomness in determining which data points belong to which set.It was observed during experimentation that it is very easy to overestimate the success of a classifier by selecting partitions that favor higher accuracy results.
To create an objective and repeatable baseline, this research uses leave-one-out testing, which means each sample is tested against all other samples, resulting in a $(J-1)$:1 train-test ratio. It is commonly agreed in the validation literature that leave-one-out testing, while computationally more expensive, has a consistent performance [39]. Leaving every sample in the dataset out exactly once results in a consistent metric of success. Leave-one-out testing is efficient to compute for blended-model classification, as the kernel is only computed once for all data points, and the matrix $G$, the known labels, is updated in place for each test.
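The leave-one-out loop with a precomputed kernel can be sketched as below. For brevity the sketch uses a plain kernel-weighted vote and reports the fraction of held-out points predicted correctly, which is a simplification of the accuracy measure used in this work; the function name and the toy kernel values are hypothetical.

```python
def leave_one_out_accuracy(Kmat, labels, n_classes):
    """Hold out each point once; the kernel matrix is reused, only G changes."""
    J = len(labels)
    G = [[1 if labels[j] == c else 0 for c in range(n_classes)] for j in range(J)]
    correct = 0
    for j in range(J):
        saved = G[j]
        G[j] = [0] * n_classes            # hide the label of the held-out point
        scores = [sum(Kmat[j][k] * G[k][c] for k in range(J))
                  for c in range(n_classes)]
        pred = max(range(n_classes), key=lambda c: scores[c])
        correct += int(pred == labels[j])
        G[j] = saved                      # restore the label in place
    return correct / J

# Toy kernel (hypothetical values) and labels
Kmat = [[1.0, 0.8, 0.7, 0.6],
        [0.8, 1.0, 0.7, 0.6],
        [0.7, 0.7, 1.0, 0.6],
        [0.6, 0.6, 0.6, 1.0]]
labels = [0, 0, 0, 1]
loo = leave_one_out_accuracy(Kmat, labels, n_classes=2)
```

No randomness enters the procedure, so repeated runs on the same data give the same result, which is the property that motivates the choice of leave-one-out testing here.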

Pre-Processing
Finally, the method of preparing the data must be addressed in order to reproduce these results. All datasets were scaled dimension-wise to the range [0, 1]. No feature extraction or reduction (e.g., PCA) is performed on the data in this experiment.
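Dimension-wise scaling to [0, 1] can be sketched as follows. Mapping a constant column to 0.0 is an assumption made here to avoid division by zero, and the function name is hypothetical.

```python
def min_max_scale(X):
    """Scale every column (dimension) of X to the range [0, 1] independently."""
    columns = list(zip(*X))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0   # constant column -> 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in X]

X_raw = [[1.0, 10.0, 5.0],
         [2.0, 20.0, 5.0],
         [3.0, 40.0, 5.0]]
X_scaled = min_max_scale(X_raw)
```

Scaling each dimension separately keeps a feature with a large numeric range from dominating the Euclidean distances used to select neighbors.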

Experimental Results
We implemented a classification system using C++ with the Eigen library [40], and an additional system was implemented in Python to verify consistency across languages. The system performs classification on datasets using methods including K-NN with distances from Table 1, weighted K-NN with weights from Table 2, and the LSSLD kernel techniques outlined in Section 4.
In traditional K-NN classifiers, distances can be computed on demand during or prior to testing. However, using an LSSLD kernel requires the kernel to be computed prior to testing, as it has a local scaling factor that must be precomputed. The local scaling factor computation requires solving a non-linear equation with Newton's method, which is efficient and converges quickly. Specifically, the solver ran for a maximum of 10 iterations or until a tolerance of ±0.001 was reached. In addition to computing the nearest neighbors of each data point, confidence voting requires the normalization of each row and symmetrizing the kernel weights, both constant in the number of data points.
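The stopping rule of the solver can be sketched with a generic Newton iteration. The non-linear equation for the local scaling factor is not restated here, so the example below uses a hypothetical function f(x) = x^2 - 2; applying the tolerance to the size of the Newton step is likewise an assumption.

```python
def newton_solve(f, df, x0, max_iter=10, tol=1e-3):
    # Generic Newton iteration with the limits quoted in the text:
    # at most 10 iterations, stopping once the update is within the tolerance.
    x = x0
    for iteration in range(1, max_iter + 1):
        step = f(x) / df(x)  # assumes df(x) is non-zero along the path
        x -= step
        if abs(step) <= tol:
            return x, iteration, True
    return x, max_iter, False

# Hypothetical example: solve x^2 - 2 = 0 starting from x0 = 1.
root, iterations, converged = newton_solve(lambda x: x * x - 2.0,
                                           lambda x: 2.0 * x, 1.0)
```

In this example the iteration stops well inside the 10-iteration limit, which is in line with the quick convergence noted above.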
Instead of using a boldface font to indicate the highest accuracy for each dataset, we have chosen to present the results with a fuchsia gradient. The highest accuracy for each dataset is highlighted at 100% saturation, and accuracies within 0.02 of the best are shaded with correspondingly decreasing saturation, down to zero percent.

Baselines
To establish a baseline, we ran a K-NN classifier using four of the distance metrics from Table 1: the Euclidean distance, the Hassanat distance (as it was presented to be the best option in general, as reported in [1]), the cosine distance, and the Manhattan distance. Their accuracy from leave-one-out testing is presented in Table 4. Note that the optimal K is presented in brackets for each metric, as searched for in the range of [1, 99]. This search is computationally expensive and creates a tough baseline for the LSSLD kernel classifier to compare with. In a few cases (e.g., EEG), the classification accuracy is consistent across all metrics; however, for most datasets, there is at least one metric that outperforms the others. In some cases, choosing an ill-suited distance metric for a particular dataset can result in a greater than 20% loss of accuracy. Furthermore, only using 1-NN for any one distance metric (such as in [1]) could result in a 10% loss of accuracy or greater. It is evident from the last column that the accuracy of the classifier varies greatly with the choice of metric, and either significant computational effort should be made to determine the best operating parameters for each dataset or a method that achieves consistent accuracy regardless of operating parameters should be used. Figure 3 shows, as is well known, that the accuracy of a K-NN classifier is very sensitive to the value of K and in most cases degrades as K increases, varying substantially in the process. In contrast, the proposed LSSLD KRR and LSSLD kernel confidence vote classifiers have a nearly flat curve: the dependence on K is minimal, and the accuracy is consistent with increasing K.
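The search for the best K described above can be sketched as follows, using leave-one-out accuracy as the selection criterion. This is an illustrative sketch with assumed function names, restricted to the Euclidean distance; it sorts the neighbors of every point once and reuses that ordering for every candidate K. Preferring the smallest K on ties is an assumption.

```python
import math
from collections import Counter

def neighbor_orders(xs):
    # For each point, list all other points sorted by Euclidean distance.
    orders = []
    for i, xi in enumerate(xs):
        others = [j for j in range(len(xs)) if j != i]
        others.sort(key=lambda j: math.dist(xi, xs[j]))
        orders.append(others)
    return orders

def loo_accuracy_per_k(xs, ys, k_values):
    orders = neighbor_orders(xs)  # computed once, shared by every K
    results = {}
    for k in k_values:
        correct = 0
        for i, order in enumerate(orders):
            votes = Counter(ys[j] for j in order[:k])
            if votes.most_common(1)[0][0] == ys[i]:
                correct += 1
        results[k] = correct / len(xs)
    return results

def best_k(xs, ys, k_values):
    acc = loo_accuracy_per_k(xs, ys, k_values)
    # Highest accuracy wins; among equal accuracies the smallest K is kept.
    k_star = max(acc, key=lambda k: (acc[k], -k))
    return k_star, acc[k_star]
```

Even with the shared neighbor ordering, the search must be repeated for every distance metric and every dataset, which is the tuning cost the proposed kernel is meant to avoid.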
Table 5 presents the baseline accuracy results of four popular weighted K-NN techniques. It is evident that most datasets tend to achieve good accuracy with the WKNN/WDKNN weights or the EXPKNN/NORKNN weights but usually not both. Interestingly, WDKNN [19], which was presented as an improvement on WKNN, reports identical accuracy to WKNN under leave-one-out testing. The authors interpret this as support for the use of leave-one-out testing to obtain evaluation metrics that consistently capture the accuracy of the classifier.

One Tuning for All
One key objective was to design a kernel-based classifier for which one value for each operating parameter (in this case K, α, and β) would work well for all datasets. Without this insensitivity, the operating parameters would have to be selected using trial and error, which is a computationally expensive task and not practical in application.
Table 6 presents the accuracy of three LSSLD classifiers: K-NN weighted with LSSLD weights (i.e., blended-model with β = 0), confidence voting (i.e., blended-model with β = 1), and KRR. As in Table 5, the accuracies of the kernel-based classifiers are quite stable with respect to K but do vary across the three methods of classification. Notably, the most computationally expensive method, KRR, does not significantly outperform the other, less computationally expensive methods; for most datasets, it does not outperform the other methods at all. Finally, Table 8 reports the accuracy of the blended-model classifier. The trend seems to indicate that the kernel-based classifier has higher accuracy on some datasets with β = 0, indicating that the given labels of the data are preferred for making successful classifications, whereas it has higher accuracy on other datasets with β = 1, indicating that the labels inferred by the kernel are preferred. One interpretation of this phenomenon is that datasets that report a higher accuracy with β = 1 might have more outliers or mislabeled points. From Table 8, a β of 0.5 produces the most consistently high accuracy.

Conclusions
Determining the best choice of operating parameters, the first step in many methods of classification, is critical for the accuracy of the classifier used. The dimensionality reduction community has developed methods aimed at capturing or learning the topological structure of the training data. We present a kernel approach, using a Laplacian kernel with local scaling.
Experiments using benchmark datasets demonstrate that the LSSLD kernel, using one single set of parameters for all datasets, achieves leave-one-out prediction accuracy comparable to that of weighted and non-weighted K-NN classifiers whose metrics or weights and neighborhood sizes were tuned for each dataset individually. Confidence voting, which represents a trade-off between the given labels and the labels implied by the kernel, further confirms that the presented kernel is a promising step towards a 'one tuning for all' classifier.

Future Work
Inspired by the success of the LSSLD kernel, we want to further research methods of kernel construction that capture similarity. The accuracy of the LSSLD kernel was satisfactory, but we believe accuracy could be improved further, especially on datasets that did not satisfy the manifold hypothesis (Vehicle, Australian, Balance, Haberman, Monks), as mentioned in Section 6.3. One way the classification accuracy might be improved is by utilizing the label information of the dataset in the construction of a kernel. As kernels can be linearly combined to construct 'new' kernels, the LSSLD kernel could be enhanced with 'supervised' kernels constructed with the information provided by the labels to create 'semi-supervised' kernels. We could also continue to explore the relationship between functions of classification and kernels, and how the classifier decisions could be encapsulated in the kernel.

Figure 1 is an example of a multi-class confusion matrix. Accuracy is a common metric of classifier success. Generally, accuracy A can be described as

$A = \frac{\text{Number of correct classifications}}{\text{Total number of test samples}}$.

Figure 1. Example confusion matrix color-coded for measuring the success metrics pertaining to class c. True positives (TP_c) are in the green cell, false negatives (FN_c) are the sum of the red cells, false positives (FP_c) are the sum of the orange cells, and true negatives (TN_c) are the sum of the blue cells.
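The per-class counts and the overall accuracy can be read off a confusion matrix as in the sketch below. The orientation (rows are true classes, columns are predicted classes) and the function names are assumptions made for this illustration; the example matrix is invented and is not taken from the experiments.

```python
def class_counts(confusion, c):
    # Assumption: confusion[i][j] counts samples of true class i predicted as j.
    n = len(confusion)
    tp = confusion[c][c]
    fn = sum(confusion[c][j] for j in range(n) if j != c)  # rest of row c
    fp = sum(confusion[i][c] for i in range(n) if i != c)  # rest of column c
    total = sum(sum(row) for row in confusion)
    tn = total - tp - fn - fp  # every cell outside row c and column c
    return tp, fn, fp, tn

def accuracy(confusion):
    # Correct classifications lie on the diagonal.
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Invented 3-class example.
example = [[5, 1, 0],
           [2, 6, 1],
           [0, 3, 7]]
```

For class 1 of this example, class_counts returns TP = 6, FN = 3, FP = 4, and TN = 12, and the overall accuracy is 18/25.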

Figure 2. Two histograms illustrating the distributions of accuracy calculated from random train-test partitions of the dataset Heart, classified using Euclidean distance by a 1-nearest neighbor classifier. (a) shows 100 trials of 60:40 random train-test partitioning, whereas (b) shows 100 trials from random folds-of-10. Classes were represented proportionally in both the train and test sets.

Figure 3. The accuracy of a traditional K-NN classifier compared with the accuracy of a confidence voting and ridge regression (KRR) classifier using the LSSLD kernel on the Heart dataset. The accuracy of the traditional K-NN classifier varies significantly depending on the value of K, degrading as K increases. In contrast, confidence voting and ridge regression using the LSSLD kernel report consistent accuracy across all values of K.

Table 1. A selection of distance metrics.

Table 2. A selection of K-NN weights.

Table 3. A summary of the datasets used in this research.

Table 4. Accuracy of 4 popular distance metrics for a K-NN classifier from Equation (1). The metrics for both K = 1 and the best K (i.e., the K that results in the best accuracy) are presented, with the best K in brackets. The final column is the range that the accuracies from different distance metrics span, considering only the four best K columns presented here.

Table 5. Accuracy of a weighted K-NN classifier using popular choices of weights from Table 2 and best K from K = 5, 10, 20, 50.