Kernelized Manifold-Optimized Linear KNN for Nonlinear Data Classification

Zhang, Jin; Bian, Zekang; Zhang, Liang; Wang, Feng

doi:10.3390/electronics15122572


¹

School of Computer Engineering, Suzhou Polytechnic University, Suzhou 215104, China

²

School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai 200240, China

³

School of AI & Computer Science, Jiangnan University, Wuxi 214122, China

^*

Author to whom correspondence should be addressed.

Electronics 2026, 15(12), 2572; https://doi.org/10.3390/electronics15122572

Submission received: 30 March 2026 / Revised: 18 May 2026 / Accepted: 2 June 2026 / Published: 10 June 2026

(This article belongs to the Special Issue Multimodal Learning for Multimedia Content Analysis and Understanding)


Abstract

In sparse representation learning-based linear k-nearest neighbors methods, the linear representation assumption frequently fails when applied to nonlinear distributed data, leading to degraded generalization and a loss of physical interpretability. To address this, we propose the Kernelized Manifold-Optimized Linear Nearest Neighbor (KMOLNN) method. Methodologically, KMOLNN projects the data into a high-dimensional kernel space to capture the nonlinear relationships, while introducing an adaptive manifold-preserving regularization term—via an adaptive Laplacian matrix—to dynamically preserve the local geometric structures. Theoretically, this study provides a mathematical proof of the nearest neighbor group effect for the kernel framework and reveals that its weight optimization behavior implicitly implements the Bayesian decision rule. Furthermore, we derive a rigorous generalization error bound using Rademacher complexity to validate its theoretical robustness. Empirically, we evaluate KMOLNN on 15 small-to-medium-scale benchmark datasets against eight comparative methods, including recent variants. The results demonstrate significant numeric superiority, with KMOLNN achieving an average accuracy of 90.76% and a Macro F1-score of 88.62% across the evaluated datasets. Finally, we present a comprehensive runtime analysis, explicitly acknowledging that these gains in generalization capability and theoretical interpretability present a practical trade-off, requiring increased computational runtime due to the iterative alternating optimization process.

Keywords:

linear KNN; kernelized weight optimization strategy; group effect; Bayesian decision rule; manifold-capturing regularization

1. Introduction

Due to its simplicity and interpretability, k-nearest neighbors (KNN) [1,2,3], a fundamental machine learning method, predicts the label of a testing sample by majority voting over the labels of its nearest neighbors, and has been successfully applied in many fields, such as pattern recognition [4] and recommendation systems [5]. However, the computational complexity of traditional KNN increases linearly with the scale of the dataset, resulting in significant challenges in both time and space consumption when handling large-scale datasets. Additionally, because of its simple mathematical foundation, traditional KNN cannot be directly improved by modifying its mathematical objective function, e.g., by introducing regularization terms to incorporate prior information.

To address these issues, and inspired by sparse representation learning, linear KNN methods [6,7,8,9,10,11,12] have been developed to learn a sparse linear representation of each testing sample from its nearest neighbors, which can reduce the burden of naive distance-based searches in some settings and allows traditional KNN to be enhanced through its objective function. However, when handling datasets containing nonlinearly distributed samples (e.g., financial market prediction datasets [13] and bioinformatics datasets [14]), the testing samples cannot be accurately represented by the training samples, thereby weakening the generalization capability. That is, assume a testing sample

y

is reconstructed through a linear combination of training samples

X = [x_{1}, \dots, x_{n}]

, i.e.,

y \approx X w

, where

w

is the learned representation vector. If these samples do not satisfy the linear assumption, the reconstruction error

{‖y - X w‖}_{2}^{2}

will significantly increase, causing the representation

w

to fail to accurately reflect the relationships between these samples. Additionally, the learned linear representation cannot provide accurate physical interpretability (e.g., sample/feature importance-based interpretability [15]).

To intuitively illustrate the motivation behind the proposed KMOLNN method, Figure 1 visualizes the differences in reconstruction capabilities between linear and nonlinear models under varying data distribution structures. In the figure, the blue, red, and green dots represent the training samples

X

, the true test sample

y_{test}

, and the reconstructed sample

\hat{y}

calculated by the models, respectively.

As shown in Figure 1a, for linearly distributed data, the test sample lies within the subspace spanned by the training samples. A precise approximation is achieved via the linear combination

\hat{y} \approx \sum w_{i} x_{i}

, resulting in a negligible reconstruction error. However, for nonlinear manifold structures—such as the quadratic function (

f (x) = x^{2}

) in Figure 1b or the sine function (

f (x) = \sin (x)

) in Figure 1c—the linear assumption fails. Geometrically, a linear combination of points on a curve forms a secant that deviates from the manifold itself. This causes a significant geometric mismatch between the reconstruction

\hat{y}

and the true sample

y_{test}

. To address this, Figure 1d demonstrates the efficacy of kernel mapping. By mapping data into a high-dimensional feature space

H

, the manifold is “straightened” to satisfy linear representation conditions, allowing

\hat{y}

to accurately converge to

y_{test}

. This validates the necessity of the proposed KMOLNN method for resolving nonlinear generalization challenges.
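The secant argument above can be checked numerically. The following is a minimal sketch, not the authors' experiment: it assumes two neighbors and weights that sum to one, so the reconstruction is restricted to the line through the neighbors. The helper names `affine_reconstruction` and `sample_curve` and the sample values are hypothetical.

```python
import numpy as np

def affine_reconstruction(neighbors, y):
    """Best reconstruction of y as w1*a + w2*b with w1 + w2 = 1.

    With the sum-to-one constraint, the reconstruction is the orthogonal
    projection of y onto the line (secant) through the two neighbors.
    """
    a, b = neighbors
    direction = b - a
    s = np.dot(y - a, direction) / np.dot(direction, direction)
    y_hat = a + s * direction
    return y_hat, float(np.sum((y - y_hat) ** 2))

def sample_curve(f, ts):
    """Return 2-D points (t, f(t)) for each t in ts."""
    return [np.array([t, f(t)]) for t in ts]

# Linearly distributed data, f(x) = 2x: the test point lies on the secant.
lin_neighbors = sample_curve(lambda t: 2.0 * t, [1.0, 3.0])
_, err_linear = affine_reconstruction(lin_neighbors, np.array([2.0, 4.0]))

# Quadratic data, f(x) = x^2: the secant through (1, 1) and (3, 9)
# misses the true point (2, 4), so the error is strictly positive.
quad_neighbors = sample_curve(lambda t: t ** 2, [1.0, 3.0])
y_hat_quad, err_quad = affine_reconstruction(quad_neighbors, np.array([2.0, 4.0]))

print(err_linear, err_quad)
```

In this toy setting the error vanishes for the linear case and stays positive for the quadratic one, which is the geometric mismatch described for Figure 1b.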

In summary, when handling a nonlinear distributed dataset, the linear KNN method still has the following challenges that should be seriously addressed:

Due to its reliance on the linear representation assumption, when a dataset contains nonlinear relationships, the representation learned by the linear KNN method cannot represent the testing sample accurately and loses its physical meaning, thereby weakening the generalization capability.
The simplified formulation of the linear KNN method may neglect complex structures in the data, further weakening its generalization capability.

These challenges are particularly evident in real-world scenarios. For example, in financial market prediction, complex nonlinear relationships exist between stock price fluctuations and various factors, such as market sentiment and political events. Recent studies [16] have further demonstrated that deep learning-based frameworks can effectively capture these intricate patterns in high-stakes environments, such as federated credit card fraud detection. In bioinformatics, gene expression data exhibit intricate patterns, and their associations with diseases are often nonlinear. Similarly, in image recognition tasks, the relationships among an object’s shape, texture, and position are frequently nonlinear as well. In such domains, the nonlinearity present in the data can undermine the physical interpretability of the representation weight vector obtained through linear KNN optimization. When data deviate from linear assumptions—that is, when nonlinear or other complex relationships exist—the physical meaning of the optimized representation weight vector may be compromised in the following ways:

In linear models, each element of the weight vector represents the degree of influence of the corresponding feature on the target variable. However, if the data do not conform to the linear assumption, nonlinear relationships can make the physical meaning of the weights ambiguous, so this way of interpreting the weights becomes ineffective.
In nonlinear relationships, the interactions between different features lead to bias in the weights. Linear models cannot capture these complex interactions, so the optimized weights may become distorted and fail to accurately reflect the relationships between the features.
If nonlinear relationships exist in the data, using linear models for prediction may result in performance degradation. Linear models cannot accurately fit nonlinear relationships, potentially leading to significant errors when predicting new data.

To address the above challenges, this study proposes the Kernelized Manifold-Optimized Linear Nearest Neighbor (KMOLNN) method, aiming to overcome the limitations of linear KNN when handling nonlinear data while maintaining a practical runtime on the evaluated (small-to-medium scale) datasets. By introducing kernelization techniques, KMOLNN maps data onto a high-dimensional feature space to capture the nonlinear relationships, thereby avoiding the curse of dimensionality that would arise from directly increasing the dimensions of the original features. Furthermore, we optimize the weight allocation mechanism, mathematically prove the group effect of the nearest neighbors, and connect the resulting prediction behavior to the Bayesian decision rule, which enhances generalization capability and robustness on nonlinear datasets. In summary, the primary contribution of this work is algorithmic and empirical, supported by a robust theoretical motivation. Specifically, the main contributions of this paper are summarized as follows:

By incorporating a manifold-preserving regularization term with an adaptive Laplacian matrix, KMOLNN ensures that the data retain their original local structure after high-dimensional mapping, significantly enhancing its adaptability to nonlinear datasets and outperforming traditional weighted KNN methods.
This study provides, for the first time, a mathematical proof of the group effect of the nearest neighbors for the kernel LLK method, revealing the mechanism by which the model assigns higher weights to the training samples close to the testing sample, thereby offering greater theoretical depth compared to existing kernel KNN approaches.
Through mathematical derivation, KMOLNN implicitly implements the Bayesian decision rule in weight optimization, approximating the maximum a posteriori probability estimation, which enhances the robustness and interpretability of KMOLNN on nonlinear datasets.
The experimental results for all adopted datasets confirm that KMOLNN demonstrates enhanced generalization capability and acceptable runtime under our experimental protocol compared to the existing KNN variants.

The paper is organized as follows: Section 2 reviews the related work and preliminaries; Section 3 details the objective function of the proposed KMOLNN method and its optimization process; Section 4 conducts the experimental analysis on the adopted datasets; and Section 5 presents the conclusion. To facilitate a better understanding of the subsequent mathematical derivations, the key notations used throughout this paper are summarized in Table 1.

2. Related Work and Preliminaries

2.1. Related Work

As a variant of the classical k-nearest neighbors method [1], the linear KNN method has received increasing attention in the field of machine learning. Recently, many linear KNN methods [17] have been developed to achieve an enhanced generalization capability by introducing new learning strategies into the traditional linear KNN method. According to [4,17], the existing linear KNN methods can be divided into the following categories:

Some linear KNN methods [6,7] focus on introducing regularization terms to learn an enhanced representation. For instance, Tibshirani et al. [6] proposed the Least Absolute Shrinkage and Selection Operator (LASSO) to learn a sparse representation for each testing sample by introducing an $L_{1}$ regularization constraint on the representation vector. McDonald et al. [18] proposed Ridge regression to prevent overfitting and enhance the generalization capability by introducing an $L_{2}$ regularization constraint on the representation vector. Based on the LASSO, Zhong et al. [7] proposed the Elastic Net to learn a sparse and robust representation of each testing sample by introducing an $L_{1} + L_{2}$ regularization constraint on the representation vector. Wang et al. [7] proposed the Locality-constrained Linear Coding (LLC) method, which establishes a robust connection between the linear representation weights and distance metrics, facilitating the acquisition of accurate weights and enabling precise classification. Then, Liu et al. [9,19] further developed the LLC method into the Local Linear KNN method (LLKNN), establishing a more stable relationship between the representation weights and distance metrics. Additionally, LLKNN provides a solid theoretical foundation and demonstrates that the Bayesian decision rule can interpret its prediction behavior.
Some linear KNN methods [4,10] focus on weighting the neighbors to achieve enhanced classification performance. For instance, Xu et al. [10] proposed the Weighted Local Linear KNN method (WLLKNN), which applies weighting to the $L_{1}$ -norm regularization term to obtain prior weights of the nearest neighbors by weighting the linear representation vector. Gou et al. [4] proposed the Weighted Local Mean Representation-based k-nearest neighbor method (WLMRKNN). This approach utilizes the weights of k-local mean vectors from each class to constrain the representation coefficients. These weights provide more information for reconstructing the testing samples, thereby enabling a more accurate representation of testing samples. This method demonstrates reduced sensitivity to the choice of k-value, contributing to the enhanced robustness of the KNN algorithm. Zhao et al. [20] proposed the Local Centroid Distance-Constrained Representation-based KNN (LCDR-KNN), which enhances the KNN classifier accuracy by selecting the k-nearest training samples for each class, using their centroid distance as a constraint, and combining it with a collaborative representation to calculate the weights for each neighbor in the classification.
Some linear KNN methods [21,22] focus on adaptively selecting the k-value for each testing sample to achieve enhanced classification performance. For instance, Chen et al. [21] demonstrated that employing different k-values for different classes yields better generalization performance than using a fixed k-value for all classes. Mullick et al. [22] utilized neural networks to learn the density information around testing samples from the training data, thereby determining the appropriate k-values. Zhang et al. [17] proposed using a decision tree method to predict the optimal k-value for testing samples. First, during the training phase, the optimal k-value for all samples is learned through sparse modeling. Then, using the training samples and the optimal k-value, a k-tree is constructed to rapidly predict the optimal k-value for the testing samples during prediction. Wang et al. [23] proposed adjusting the local k-value of testing samples based on confidence intervals. Manocha et al. [24] used Bayesian optimization to adaptively select and infer the optimal k-value for testing samples from training samples. Cheng et al. [25] proposed a sparse learning-based KNN (S-KNN), which introduces a correlation matrix between the training and testing samples. Based on the S-KNN, Zhang et al. [11] proposed the Graph Sparse KNN (GS-KNN) method, which reduces the impact of noise on the prediction labels of testing samples by introducing sparsity into the linear expression matrix. Additionally, Zhang et al. [12] proposed a One-Step KNN, which transforms the k-nearest neighbor search in a linear KNN into matrix operations, thereby computing the adaptive k-value. This approach aims to enhance the prediction accuracy and efficiency by simplifying the computational processes. Recent research has further extended these ideas. For instance, Amer et al. [12] proposed an efficient k-nearest neighbor model that introduces three KNN variants (PRKNN, EPRKNN, and WPRKNN) by combining preprocessing techniques and weighting schemes, significantly improving performance for big data classification. Zhang et al. [26] introduced a shared-style linear k-nearest neighbors classification method, emphasizing the integration of style information into multi-view data to optimize the classification accuracy. Furthermore, Fan et al. [27] proposed a multi-view adaptive k-nearest neighbors classification, which further enhances the model’s adaptability to heterogeneous data by dynamically adjusting the k-value for multi-view data. These recent advances highlight the potential of kernelization and adaptive mechanisms for improving KNN performance, providing a solid foundation for the proposed KMOLNN method’s kernel mapping and manifold optimization.

While the aforementioned methods have advanced the linear KNN framework, a clear research gap remains. The existing approaches typically focus on either linear sparsity (e.g., LASSO, LLC) or fixed graph structures (e.g., GS-KNN), often failing to capture the complex nonlinear manifolds in real-world data. Methods such as the multi-view adaptive KNN (MVAKNN) [27] incorporate kernelization but do not explicitly link the weight optimization to the group effect or Bayesian decision theory. KMOLNN addresses these limitations simultaneously by integrating nonlinear kernel mapping with an adaptive manifold-preserving term, thereby providing both improved generalization and theoretical interpretability. A comprehensive comparison of the key characteristics and theoretical bases between KMOLNN and these related linear-KNN variants is summarized in Table 2.

2.2. Preliminaries

2.2.1. Linear KNN Method

To address the high time and space consumption of a traditional KNN on large-scale datasets, the linear KNN (L-KNN) was developed to improve efficiency by simplifying the computational process, specifically by introducing linear functions or low-dimensional mappings to approximate the nonlinear decision boundaries of traditional KNN, thereby reducing the time and space consumption. The linearization process can be described by the following formula:

y \approx X w + b

(1)

where

w

and

b

denote the representation weight and bias, respectively;

X

and

y

denote the training samples and a testing sample, respectively.

To further optimize the weight allocation and enhance the generalization capability, some regularization terms were introduced into L-KNN, including the Least Absolute Shrinkage and Selection Operator (LASSO) [6], Ridge regression, and Elastic Net [28]. Their objective functions are defined as follows:

LASSO:

L (w) = {‖y - X w‖}_{2}^{2} + λ {‖w‖}_{1} .

(2)

Ridge regression:

L (w) = {‖y - X w‖}_{2}^{2} + λ {‖w‖}_{2}^{2} .

(3)

Elastic Net:

L (w) = {‖y - X w‖}_{2}^{2} + λ_{1} {‖w‖}_{1} + λ_{2} {‖w‖}_{2}^{2} .

(4)

where

λ

,

λ_{1}

and

λ_{2}

denote the regularization coefficients used to balance model complexity and fitting capability.
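Of the three objectives, Ridge regression in Equation (3) has a closed-form minimizer, which makes it convenient for illustration; LASSO and Elastic Net generally require iterative solvers because of the $L_{1}$ term. The sketch below is a minimal illustration under stated assumptions: training samples are stored as the columns of X, and the function name `ridge_representation`, the random data, and the values of λ are hypothetical.

```python
import numpy as np

def ridge_representation(X, y, lam):
    """Minimize ||y - X w||_2^2 + lam * ||w||_2^2 (Equation (3)).

    Setting the gradient to zero gives (X^T X + lam * I) w = X^T y.
    X holds one training sample per column, so w has one weight per sample.
    """
    n_samples = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_samples), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))          # 5 features, 8 training samples (hypothetical)
w_true = np.zeros(8)
w_true[[1, 4]] = [0.7, 0.3]          # the test sample mixes two training samples
y = X @ w_true

w_small = ridge_representation(X, y, lam=1e-3)   # weak regularization
w_large = ridge_representation(X, y, lam=10.0)   # strong regularization

print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

A larger λ shrinks the representation vector toward zero, trading reconstruction accuracy for a smaller norm, which is the balance between fitting capability and model complexity described above.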

2.2.2. Graph Theory and Laplacian Matrix

Graph Theory

Graph theory is a significant branch of mathematics that primarily studies the complex network structures formed by nodes (or vertices) and the edges connecting these nodes. In such a structure, the nodes represent individuals in the network, while the edges describe the relationships between them. Among the core tools of graph theory are the degree matrix

D

and adjacency matrix

A

, both of which serve as fundamental elements for understanding and analyzing graph structures. The degree matrix is a diagonal matrix where each element on the diagonal represents the degree of the corresponding node, indicating the total number of edges directly connected to that node. The adjacency matrix records the direct connections between the nodes, where an element is 1 when an edge exists between nodes

i

and

j

, and is 0 otherwise. These two matrices collectively provide a systematic approach to analyzing and understanding the structure and properties of graphs.

Laplacian Matrix

The Laplacian matrix

L

is a core concept in spectral graph theory, serving as a discrete operator that captures the intrinsic structure of a graph. It is formally defined as

L = D - W

, where

W

denotes the weighted adjacency matrix encoding the similarity relationships or edge weights between nodes (data samples), and

D

is the diagonal degree matrix, where each diagonal element

D_{ii} = \sum_{j} W_{ij}

represents the total connectivity of node

i

. Mathematically, the Laplacian matrix acts similarly to the Laplace operator in continuous space, measuring the local smoothness of a function defined on a graph. Its spectral properties, particularly its eigenvalues and eigenvectors, provide critical insights into the connectivity and cluster structures within the data, forming the theoretical foundation for spectral clustering and dimensionality reduction techniques. In the context of the proposed KMOLNN method, the Laplacian matrix is instrumental for constructing the manifold-preserving regularization term. By incorporating

L

into the objective function, the model enforces a smoothness constraint that penalizes differences in representation weights between adjacent nodes. This ensures that the training samples sharing high similarity in the local manifold structure are assigned consistent weights, thereby preserving the original geometrical relationships of the nonlinear data in the learned representation.
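The smoothness interpretation rests on a standard identity for symmetric weights: the quadratic form p^T L p equals half the weighted sum of squared differences (p_i - p_j)^2 over all node pairs. The sketch below checks this numerically on a small graph; the weight matrix and the two weight vectors are hypothetical values chosen only for illustration.

```python
import numpy as np

def graph_laplacian(W):
    """L = D - W, where D is the diagonal degree matrix with D_ii = sum_j W_ij."""
    return np.diag(W.sum(axis=1)) - W

def quadratic_form(L, p):
    """Smoothness penalty p^T L p."""
    return float(p @ L @ p)

def pairwise_form(W, p):
    """Equivalent form: 0.5 * sum_ij W_ij * (p_i - p_j)^2 (W symmetric)."""
    diff = p[:, None] - p[None, :]
    return 0.5 * float(np.sum(W * diff ** 2))

# Nodes 0 and 1 are strongly connected; node 2 is only weakly attached.
W = np.array([[0.0, 0.9, 0.1],
              [0.9, 0.0, 0.2],
              [0.1, 0.2, 0.0]])
L = graph_laplacian(W)

p_smooth = np.array([0.4, 0.4, 0.2])   # similar nodes receive similar weights
p_rough = np.array([0.7, 0.1, 0.2])    # similar nodes receive very different weights

print(quadratic_form(L, p_smooth), quadratic_form(L, p_rough))
```

The penalty is much smaller for `p_smooth` than for `p_rough`, since the large edge weight between nodes 0 and 1 punishes any gap between their representation weights.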

3. The Objective Function and Prediction Function of the Proposed KMOLNN Method

As mentioned above, KMOLNN aims to address the limitations of the traditional linear KNN method on complex nonlinear datasets by introducing kernel mapping and a manifold-preserving regularization term, thereby enhancing its generalization capability when handling nonlinear data. Specifically, the kernel mapping projects the nonlinear data onto a high-dimensional feature space within the linear KNN framework, overcoming the traditional method’s difficulty in capturing complex data structures while maintaining computational efficiency. The manifold-preserving regularization term ensures the preservation of local neighborhood relationships during the mapping process, further enhancing the generalization capability and the model’s perception of the true manifold structure of the data. This section presents the general form of the kernelized objective function and classification function, and elucidates the transformation process from the original linear form to the kernelized form.

Here, we first present the framework diagram of the proposed KMOLNN method in Figure 2. As illustrated in Figure 2, the framework of the proposed KMOLNN method consists of three main phases: kernel mapping, manifold-regularized optimization, and classification. Initially, the original nonlinear training and testing samples are projected onto a high-dimensional feature space via a kernel mapping function to uncover the linear relationships hidden in the original space. Subsequently, the core optimization module learns the linear representation coefficient vector by minimizing an objective function that integrates the reconstruction error with manifold-preserving regularization. This phase employs an alternating optimization strategy to dynamically update the probability weights and the adaptive Laplacian matrix, ensuring the preservation of the data’s local manifold structure. Finally, the prediction module aggregates the learned similarity probabilities for each class and determines the label of the testing sample based on the maximum accumulated similarity, implicitly adhering to a Bayesian decision rule.

3.1. Objective Function of the KMOLNN Method

The traditional linear KNN method is built on the assumption that the testing sample can be linearly represented by several of its nearest training samples, where the elements of the learned representation vector reflect the importance of the corresponding samples in reconstructing the testing sample. Therefore, when employing the linear KNN method, two conditions should be satisfied: (1) the training samples and the testing samples should be linearly correlated, and (2) the elements of the representation vector should possess a well-defined physical meaning. However, this assumption is often violated when handling datasets containing nonlinear samples; thus, the traditional linear KNN method fails to accurately learn the representation and to capture the manifold structure of the data, both of which are beneficial for classification performance.

To address the above limitation, we first apply the kernel method to map the original nonlinear data onto a high-dimensional space to maintain the above assumption, where the mapped data are linearly correlated. Therefore, based on the training samples

X

, the testing sample

y

can be reconstructed by

ϕ (y) \approx ϕ (X) w + b

(5)

where

ϕ (\cdot)

denotes the feature mapping function induced by the kernel function,

w

denotes the learned representation vector, and

b

denotes the bias. The kernel mapping can project the data onto a high-dimensional feature space and restore the linear relationship, yet another challenge remains: the elements of the learned representation vector in the mapped high-dimensional feature space lose the intuitive physical meaning they had in the original feature space, i.e., the learned representation vector loses its original interpretability.

To guarantee that the learned linear representation has an intuitive and explicit physical meaning, and inspired by graph theory [29], a Laplacian matrix is commonly constructed to capture the manifold structure of the data. Specifically, a Laplacian regularization term is introduced into Equation (5) to penalize differences between the elements of the representation vector that correspond to adjacent nodes in the graph, thereby encouraging similar or proximate samples to receive similar elements in the learned representation vector. Additionally, the regularization term helps to discover the robust relationships underlying the structure of the data and preserves the manifold structure of the data after mapping, so that the learned representation retains the intuitive physical meaning of the original feature space. The objective function is therefore defined as follows:

\min_{w, A} {‖ϕ (y) - \sum_{i = 1}^{m} p_{i} ϕ (x_{i})‖}^{2} + λ_{1} \sum_{i = 1}^{m} p_{i} \log p_{i} + λ_{2} \sum_{i = 1}^{m} {‖p_{i} - β d_{i}‖}^{2} + λ_{3} \sum_{i, j = 1}^{m} p^{T} L (A) p

(6)

where

p_{i} = \frac{e^{w_{i}}}{\sum_{j = 1}^{m} e^{w_{j}}}

represents the probability weight of the

i

-th training sample

x_{i}

,

ϕ (\cdot)

denotes the feature mapping function induced by the kernel function,

x_{i}

denotes the

i

-th (

i = 1, 2, \dots, m

) training sample,

m

represents the number of all training samples, and

y

denotes the testing sample. Parameters

λ_{1}

,

λ_{2}

and

λ_{3}

are trade-off parameters that control the contributions of the corresponding terms. The vector

d = [d_{1}, d_{2}, \dots, d_{m}]

represents the similarity measure between the testing sample and the training samples, where the

i

-th element

d_{i}

(

i = 1, 2, \dots, m

) can be calculated by

d_{i} = \exp (- γ {‖ϕ (y) - ϕ (x_{i})‖}^{2})

(7)

where

γ

denotes a positive hyperparameter that controls the width of the kernel function.
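Equation (7) can be evaluated without forming ϕ explicitly, because the squared feature-space distance expands into kernel values: ‖ϕ(y) − ϕ(x_i)‖² = k(y, y) − 2k(y, x_i) + k(x_i, x_i). The sketch below assumes a Gaussian kernel; the source does not fix a kernel at this point, so the choice, the bandwidth `sigma`, the function names, and the sample values are all hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel; sigma is a hypothetical bandwidth."""
    return float(np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2)))

def feature_space_sq_dist(y, x, kernel):
    """||phi(y) - phi(x)||^2 = k(y, y) - 2 k(y, x) + k(x, x) (kernel trick)."""
    return kernel(y, y) - 2.0 * kernel(y, x) + kernel(x, x)

def similarity_vector(y, X, gamma, kernel=rbf_kernel):
    """Equation (7): d_i = exp(-gamma * ||phi(y) - phi(x_i)||^2), one entry per row of X."""
    return np.array([np.exp(-gamma * feature_space_sq_dist(y, x, kernel)) for x in X])

# Three training samples (rows) and one testing sample close to the second one.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
y_test = np.array([0.9, 1.1])

d = similarity_vector(y_test, X_train, gamma=1.0)
print(d)
```

The nearest training sample receives the largest similarity, and a larger γ makes the similarity decay faster with feature-space distance.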

In the last term of Equation (6),

L = D - A

, where

A

denotes the adjacency matrix of the graph. For a weighted graph,

A_{ij}

denotes the weight of the edge between node

i

and node

j

.

D

denotes the degree matrix, and its diagonal elements

D_{ii}

equal the degree of node

i

(also known as the number of connections, which is either the number of edges directly connected to node

i

or the total weight). Fixed Laplacian matrices are typically built from predefined similarity measures (such as Euclidean distance), which often fail to capture the highly complex intrinsic structure of the data. In real-world data analysis tasks, data frequently originate from multiple sources and exhibit high heterogeneity and a dynamic nature. Therefore, we further improve the objective function in Equation (6) by using an adaptive Laplacian matrix [30], which provides enhanced adaptability to diverse data sources and dynamic environments. By learning the edge weights directly from the data distribution during optimization, rather than relying on a static similarity measure, the adaptive Laplacian matrix can more accurately capture the actual relationships between data points, particularly in high-dimensional data with nonlinear relationships. This keeps the model sensitive to the intrinsic geometry of nonlinear manifolds, in contrast to the predefined graph constraints used in prior kernelized variants. Finally, the objective function in Equation (6) can be transformed into

\min_{p, A} {‖ϕ (y) - \sum_{i = 1}^{m} p_{i} ϕ (x_{i})‖}^{2} + λ_{1} \sum_{i = 1}^{m} p_{i} \log p_{i} + λ_{2} \sum_{i = 1}^{m} {‖p_{i} - β d_{i}‖}^{2} + λ_{3} [\sum_{i, j = 1}^{m} p^{T} L (A) p + φ \sum_{i, j} a_{i j} {‖x_{i} - x_{j}‖}^{2}] + λ_{4} \sum_{i, j} a_{i j}^{2}

(8)

The first term

{‖ϕ (y) - \sum_{i = 1}^{m} p_{i} ϕ (x_{i})‖}^{2}

is the kernelized reconstruction error term, where

ϕ (x_{i})

denotes the function mapping input data

x

to a high-dimensional (or infinite-dimensional) feature space, and

ϕ (y)

denotes the testing sample

y

mapped to the same feature space.

p_{i}

is the

i

-th element of the weight vector

p

, representing the importance of the training sample

x_{i}

in the linear combination that reconstructs the testing sample

y

in the mapped feature space. The objective of this term is to minimize the reconstruction error of the testing sample in the mapped feature space, so that the selected training samples reconstruct the testing sample in the high-dimensional feature space as accurately as possible.
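Because ϕ may be infinite-dimensional, this term is typically evaluated through kernel values only. As a worked expansion, assuming a kernel k(a, b) = ⟨ϕ(a), ϕ(b)⟩, and writing K for the matrix with entries K_{ij} = k(x_i, x_j) and k_y for the vector with entries k(x_i, y) (the symbols k, K, and k_y are introduced here for illustration and are not part of the original notation), the reconstruction error can be written as:

```latex
\begin{aligned}
\Bigl\| ϕ(y) - \sum_{i=1}^{m} p_{i}\, ϕ(x_{i}) \Bigr\|^{2}
&= k(y, y) - 2 \sum_{i=1}^{m} p_{i}\, k(x_{i}, y)
   + \sum_{i=1}^{m} \sum_{j=1}^{m} p_{i}\, p_{j}\, k(x_{i}, x_{j}) \\
&= k(y, y) - 2\, p^{T} k_{y} + p^{T} K p .
\end{aligned}
```

In this form the term is a quadratic function of p, and since a valid kernel matrix K is positive semidefinite, the quadratic part is convex in p.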

The second term

\sum_{i = 1}^{m} p_{i} \log p_{i}

is the probability regularization term, which is introduced to increase the entropy of the probability distribution, thereby enhancing its uniformity and uncertainty. By encouraging a more uniform distribution, entropy regularization prevents the model from over-relying on or over-adapting to individual training samples, effectively reducing the risk of overfitting. Moreover, this term enhances fairness across different training samples during linear representation, promotes balanced weight allocation, and implicitly incorporates a Bayesian prior, thereby strengthening the model’s generalization capability.
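Since the sum of p_i log p_i is the negative Shannon entropy of p, minimizing it pushes p toward the uniform distribution. The sketch below illustrates this with the softmax parameterization of p_i used in Equation (6); the function names, the small constant `eps`, and the example values of w are hypothetical.

```python
import numpy as np

def softmax(w):
    """p_i = exp(w_i) / sum_j exp(w_j), shifted by max(w) for numerical stability."""
    z = np.exp(w - np.max(w))
    return z / z.sum()

def neg_entropy(p, eps=1e-12):
    """Second term of the objective: sum_i p_i * log(p_i); eps guards against log(0)."""
    return float(np.sum(p * np.log(p + eps)))

p_uniform = softmax(np.zeros(4))                     # equal weights on 4 samples
p_peaked = softmax(np.array([5.0, 0.0, 0.0, 0.0]))   # nearly all weight on one sample

print(neg_entropy(p_uniform), neg_entropy(p_peaked))
```

The uniform distribution attains the minimum value −log(m), here −log 4, while the peaked distribution gives a value closer to zero and is therefore penalized more heavily by this term.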

The third term

\sum_{i = 1}^{m} {‖p_{i} - β d_{i}‖}^{2}

serves as the locality constraint term, where

d_{i}

denotes the similarity measure between the testing sample and the $i$-th training sample (e.g., a Gaussian kernel-based similarity). By minimizing this term, the constraint encourages the probability weight

p_{i}

to correlate with the similarity of training samples in the original space, thus prioritizing training samples that are closer or more relevant to the testing sample for representation. This reinforces the local neighborhood principle, prevents the model from relying on distant samples, and improves the reconstruction accuracy and local adaptability of the model.

The fourth term

[\sum_{i, j = 1}^{m} p^{T} L (A) p + φ \sum_{i, j} a_{i j} {‖x_{i} - x_{j}‖}^{2}]

is the manifold-preserving regularization term. This term introduces the Laplacian matrix

L

(based on the adaptive adjacency matrix

A

) to ensure that the low-dimensional manifold structure of the data is preserved when the training samples and testing samples are mapped onto the high-dimensional space, thereby guaranteeing the physical interpretation of the representation weights (e.g., weight continuity for similar samples). Furthermore, given the complexity of the intrinsic structure in some data, a fixed manifold is clearly insufficient for fully representing the dynamic relationships within the data. Therefore, we employ adaptive edge weights to dynamically capture these structures, enhancing the model’s perception of nonlinear manifolds.

The fifth term

\sum_{i, j} a_{i j}^{2}

is used to prevent edge weight overfitting. This term controls the magnitude of edge weights in the adaptive adjacency matrix

A

through the

L_{2}

regularization norm, preventing the edge weights from becoming too divergent or large, which could lead to overfitting. It encourages sparsity and smoothness of the graph structure; ensures stability during the optimization process; and maintains the model’s generalization capability on complex data, especially nonlinear-distributed datasets.

3.2. Prediction Function of the Proposed KMOLNN Method

After the proposed KMOLNN method is trained by optimizing the above objective function, its prediction function can be defined as follows:

c^{*} = \arg \max_{c} \sum_{b_{i} \in B_{c}} p_{i} k (x, y_{i}),

(9)

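where $p_{i}$ represents the learned representation weight (or coefficient) of the $i$-th training sample, $B_{c}$ denotes the set of training samples belonging to the $c$-th class, with $b_{i}$ indexing its members, and $k (x, y_{i})$ represents the kernel function (defined as the Gaussian kernel) that measures the similarity between the testing sample $x$ and the training sample $y_{i}$ in the mapped high-dimensional feature space. Note that, in Equation (9), $x$ denotes the testing sample and $y_{i}$ a training sample, which is the reverse of the notation used in Equation (8).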

According to the prediction function in Equation (9), the proposed KMOLNN method predicts the testing sample by evaluating its similarity with training samples of each class in the mapped high-dimensional feature space. The physical meaning of the proposed KMOLNN method is primarily manifested in two aspects: high-dimensional mapping achieved through kernel functions and classification decisions based on similarity probability.

The kernel function allows us to implicitly map the data onto a high-dimensional or even infinite-dimensional feature space by computing the kernel functions in the original feature space rather than directly calculating the mapped data. The purpose of this mapping is to reveal the intrinsic structure of data in the new feature space, making the linearly inseparable data in the original feature space linearly separable in the mapped feature space. For example, through the radial basis function (RBF) kernel, data can be mapped onto a higher-dimensional feature space where similar samples are closer together and dissimilar samples are more dispersed.
The proposed KMOLNN method utilizes similarity probabilities computed by the kernel function for prediction. For a given testing sample, it calculates the sum of its similarity to the training samples in each class. The similarity is computed directly in the high-dimensional feature space via the kernel function, reflecting the proximity probability of the testing sample to the training samples of each class in the mapped high-dimensional feature space. Finally, the proposed KMOLNN method assigns the testing sample to the class with the highest sum of similarity, indicating that the testing sample is most similar to the training samples of that class in the high-dimensional feature space.
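The class-wise decision rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `gaussian_kernel` and `predict_class`, the kernel width `sigma`, and the toy data are assumptions, and the weights are taken as already learned.

```python
import numpy as np


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel value between two feature vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))


def predict_class(test_sample, train_X, train_labels, weights, sigma=1.0):
    """Pick the class whose training samples give the largest weighted kernel similarity."""
    scores = {}
    for x_i, label, p_i in zip(train_X, train_labels, weights):
        # Accumulate p_i * k(test, x_i) over the training samples of each class.
        scores[label] = scores.get(label, 0.0) + p_i * gaussian_kernel(test_sample, x_i, sigma)
    best = max(scores, key=scores.get)
    return best, scores


# Toy data: two well-separated classes and uniform weights (illustrative values only).
train_X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 2.9]])
train_labels = [0, 0, 1, 1]
weights = np.full(4, 0.25)
predicted, class_scores = predict_class(np.array([0.05, 0.05]), train_X, train_labels, weights)
```

With uniform weights the rule reduces to comparing summed kernel similarities per class; learned weights shift the decision toward the training samples that reconstruct the testing sample best.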

3.3. Theoretical Analysis of Generalization Capability

The proposed KMOLNN method utilizes kernel mapping and manifold-preserving regularization to address the limitations of traditional linear KNN on nonlinear distributed data. Although kernel mapping can capture complex data structures in high-dimensional feature space, it introduces potential overfitting risks due to increased model complexity. To rigorously evaluate the generalization capability of the proposed KMOLNN method, we employ Rademacher complexity [31,32] to quantify its capability to generalize to unseen data. This section derives the generalization error bound, analyzes the impact of regularization and kernel parameters, and validates the theoretical findings through experimental results.

3.3.1. Definition of Rademacher Complexity

To rigorously analyze the generalization performance of the proposed KMOLNN algorithm, we utilize the framework of Rademacher complexity, which measures the richness or capacity of a hypothesis class by its ability to fit random noise.

Let

S = {x_{1}, \dots, x_{n}}

be a dataset of

n

samples drawn independently and identically distributed (i.i.d.) from an unknown distribution

D

. The empirical Rademacher complexity of a hypothesis class

H

with respect to the sample set

S

is defined as

{\hat{R}}_{S} (H) = E_{σ} [\sup_{f \in H} \frac{1}{n} \sum_{i = 1}^{n} σ_{i} f (x_{i})]

(10)

where

σ = {σ_{1}, \dots, σ_{n}}

are independent Rademacher random variables uniformly taking values in

{- 1, + 1}

. The expected Rademacher complexity over all samples of size

n

is given by

R_{n} (H) = E_{S} [{\hat{R}}_{S} (H)]

.

In standard kernel-based learning (e.g., standard kernel

k

-NN or SVMs), the hypothesis class is typically bounded solely by the Reproducing Kernel Hilbert Space (RKHS) norm, denoted as

H_{K} = {f \in H : ‖ f ‖_{K}^{2} \leq C}

. However, in the proposed KMOLNN framework, the integration of adaptive manifold regularization intrinsically restricts the hypothesis class to a much tighter, data-dependent subspace. Specifically, the KMOLNN hypothesis class

H_{K M}

is defined as

H_{K M} = \{f \in H_{K} : ‖ f ‖_{K}^{2} + γ f^{T} L f \leq C\}

(11)

where

γ > 0

is the manifold regularization parameter,

f = {[f (x_{1}), \dots, f (x_{n})]}^{T}

represents the evaluations of the function on the sample set

S

, and

L

is the adaptive Laplacian matrix dynamically learned during the alternating optimization process.

The functional penalty term

‖ f ‖_{I}^{2} = f^{T} L f

enforces smoothness along the intrinsic geodesic directions of the data distribution. By heavily penalizing high-frequency components over the graph’s Laplacian

L

, the supremum in the Rademacher complexity

{\hat{R}}_{S} (H_{K M})

is no longer dominated by the ambient dimension of the full RKHS. Instead, it is constrained by an “effective dimension” governed by the spectral decay of the combined operator. As we will demonstrate, the optimized adaptive Laplacian

L

accelerates the decay of the eigenvalues

λ_{i} (L)

, thereby significantly reducing the Rademacher radius of

H_{K M}

and yielding a tighter generalization bound.

3.3.2. Generalization Error Bound

To derive the generalization error bound, we apply standard results from Statistical Learning Theory. For any

δ \in (0, 1)

, under the random sampling of the training set

S

, with probability of at least

1 - δ

, the expected risk

E [L (f)] = E_{(x, y) ~ D} [l (f (x), y)]

satisfies

E [L (f)] \leq \hat{L} (f) + 2 R_{n} (F) + \sqrt{\frac{\log (1 / δ)}{2 n}}

(12)

where

\hat{L} (f) = \frac{1}{n} \sum_{i = 1}^{n} l (f (x_{i}), y_{i})

denotes the empirical risk, and

l

denotes the loss function (e.g., the 0–1 loss for classification). This bound decomposes the generalization error into the empirical risk, model complexity (i.e.,

R_{n} (F)

), and a confidence term that decreases as the sample size

n

increases.

3.3.3. Rademacher Complexity Bound for the Proposed KMOLNN Method

To bound

R_{n} (F)

, we consider the prediction form of the proposed KMOLNN method,

f (y) = \sum_{i = 1}^{n} p_{i} K (y, x_{i})

, for classification. Since

p \in Δ_{n}

, and assuming the kernel is bounded (e.g.,

| K (y, x_{i}) | \leq M

, typically for the Gaussian kernel

M = 1

), we analyze the empirical Rademacher complexity by

{\hat{R}}_{n} (F) = E_{σ} [\sup_{p \in Δ_{n}} \frac{1}{n} \sum_{i = 1}^{n} σ_{i} \sum_{j = 1}^{n} p_{j} K (x_{i}, x_{j})]

(13)

Using the fact that

p

lies in the probability simplex, we apply Hölder’s inequality:

|\sum_{i = 1}^{n} σ_{i} \sum_{j = 1}^{n} p_{j} K (x_{i}, x_{j})| \leq \sum_{j = 1}^{n} p_{j} |\sum_{i = 1}^{n} σ_{i} K (x_{i}, x_{j})| \leq \max_{j} |\sum_{i = 1}^{n} σ_{i} K (x_{i}, x_{j})|

(14)

Because $\sum_{j} p_{j} = 1$, dividing by $n$ and taking the expectation over $σ$ gives

{\hat{R}}_{n} (F) \leq \frac{1}{n} E_{σ} [\max_{j} |\sum_{i = 1}^{n} σ_{i} K (x_{i}, x_{j})|]

(15)

For each $j$, $\sum_{i = 1}^{n} σ_{i} K (x_{i}, x_{j})$ is a sum of independent, zero-mean, bounded random variables ($| σ_{i} K (x_{i}, x_{j}) | \leq M$). By Hoeffding’s inequality, for a fixed $j$ we have

P (|\sum_{i = 1}^{n} σ_{i} K (x_{i}, x_{j})| \geq t) \leq 2 \exp (- \frac{t^{2}}{2 n M^{2}})

(16)

Taking the maximum over the $n$ indices $j$ (that is, over $2 n$ signed sums once the absolute value is accounted for) and applying the maximal inequality that follows from the union bound, we obtain

\frac{1}{n} E_{σ} [\max_{j} |\sum_{i = 1}^{n} σ_{i} K (x_{i}, x_{j})|] \leq M \sqrt{\frac{2 \log (2 n)}{n}}

(17)

Since this holds for every sample set, the Rademacher complexity is bounded as

R_{n} (F) \leq M \frac{1}{\sqrt{n}} \sqrt{2 \log (2 n)}

(18)

This bound indicates that the complexity decreases at a rate of

O (\sqrt{\log n / n})

, suggesting that the proposed KMOLNN method maintains controlled complexity as the sample size

n

increases.
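The decay rate can be checked numerically with a small Monte Carlo sketch. The code below estimates the empirical Rademacher complexity of the simplex-constrained kernel class on synthetic data and compares it with a Massart-type bound of order $M \sqrt{\log n / n}$. The function names, the synthetic Gaussian data, the kernel width, and the exact constant inside the logarithm are assumptions of this sketch, not values taken from the paper.

```python
import numpy as np


def gaussian_kernel_matrix(X, sigma=1.0):
    """Gaussian kernel matrix; its entries lie in (0, 1], so M = 1."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def empirical_rademacher_simplex(K, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[sup_p (1/n) sum_i sigma_i sum_j p_j K_ij]."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    signs = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # A linear function of p over the simplex is maximized at a vertex,
    # so the supremum equals max_j sum_i sigma_i K_ij.
    sup_values = (signs @ K).max(axis=1) / n
    return float(sup_values.mean())


def massart_type_bound(n, M=1.0):
    """Upper bound M * sqrt(2 log(2n) / n); the constant 2n is an assumption of this sketch."""
    return float(M * np.sqrt(2.0 * np.log(2.0 * n) / n))


rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))          # synthetic data, for illustration only
K = gaussian_kernel_matrix(X)
estimate = empirical_rademacher_simplex(K)
bound = massart_type_bound(K.shape[0])
```

On data of this kind the estimate typically sits well below the bound, which reflects the fact that the bound uses only the boundedness of the kernel and not its spectrum.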

3.4. Proof of the Nearest Neighbor Group Effect

In this subsection, we introduce the second characteristic of the regularization term

R (w)

, namely the learned linear probability weight vector

p

, which exhibits the k-nearest neighbors grouping effect. This effect requires that, when two training samples with the same class label are close to a testing sample, the elements in the linear representation weight vector

p

corresponding to these two training samples should be similar. The k-nearest neighbors grouping effect in Equation (9) can be described by Theorem 1. Before presenting the formal theoretical analysis, it is important to establish that our formulations operate under the standard manifold assumption: the high-dimensional data distribution is presumed to be concentrated on or near a lower-dimensional intrinsic manifold, which allows the local linear reconstruction to hold true in the localized region.

Theorem 1.

For the objective function, assuming the adjacency matrix

A

is fixed, the optimization variable is the weight vector

p

. For any two training samples

x_{i}

and

x_{j}

within the same class

k

, if their similarity to the testing sample

y

is close, i.e.,

d_{i} \approx d_{j}

(where

d_{i}

denotes the similarity to

y

), and samples

x_{i}

and

x_{j}

are highly correlated in the graph structure (i.e., in the adjacency matrix

A

,

a_{i j}

is large), then the optimized weights $p_{i}^{*}$ and $p_{j}^{*}$ satisfy

| p_{i}^{*} - p_{j}^{*} | \leq ε

(19)

where $ε$ denotes a small positive number that depends on the regularization parameters $λ_{1}$, $λ_{2}$ and $λ_{3}$ and on the correlation between the samples.

Proof.

The relevant proof of Theorem 1 is provided in Appendix A. □

The empirical results from the ablation study further validate this theoretical proof; the removal of the manifold term (

γ = 0

) leads to a decline in the classification stability, confirming that the nearest neighbor group effect is a critical factor for achieving superior generalization.

3.5. Optimization of the Objective Function for the Proposed Method

The optimization of the objective function in the proposed KMOLNN method typically involves complex dependencies among multiple variables, such as the probability weight vector

p

and the adaptive adjacency matrix

A

, which makes obtaining a direct solution challenging. Therefore, we decompose the optimization problem into several subproblems and progressively approach the optimal solution through alternating optimization [33,34]. This section describes the parameter initialization strategies and the specific implementation steps of the alternating optimization.

3.5.1. Parameter Initialization

The parameter initialization significantly affects the convergence rate of the optimal solution and the numerical stability of computations. In this subsection, we present two initialization methods:

For the initialization of the probability weight vector $p$ , each element of $p$ is assigned an identical value. This choice rests on a reasonable assumption: in the absence of any prior information, every training sample is equally likely to contribute to the representation of a testing sample. It also ensures that ${‖p‖}_{1} = 1$ , so that $p$ is a valid probability distribution.
For the adaptive adjacency matrix $A$ , we employ the k-means clustering method [35] to partition the data points $x_{i}$ into $k$ clusters, where the number of clusters $k$ typically depends on the characteristics of the problem or is determined through a criterion such as the elbow method. The connection weights within the same cluster are then differentiated from those between clusters:

a_{i j} = \begin{cases} \exp (- \frac{{‖x_{i} - x_{j}‖}^{2}}{2 σ^{2}}), & \mathrm{label} (x_{i}) = \mathrm{label} (x_{j}) \\ ε, & \mathrm{label} (x_{i}) \neq \mathrm{label} (x_{j}) \end{cases}

(20)

where

ε

denotes a small positive number representing the minimal level of connection between classes.
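The two initialization rules can be sketched as follows. This is a minimal illustration under stated assumptions: the cluster labels are passed in as an argument (for example, from any k-means implementation) instead of being computed here, and the function names, `sigma`, and `eps` are hypothetical choices for the sketch.

```python
import numpy as np


def init_weights(m):
    """Uniform probability weights: every training sample starts with weight 1/m."""
    return np.full(m, 1.0 / m)


def init_adjacency(X, cluster_labels, sigma=1.0, eps=1e-3):
    """Adjacency per Equation (20): Gaussian weight within a cluster, eps across clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(cluster_labels)
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    same_cluster = labels[:, None] == labels[None, :]
    return np.where(same_cluster, np.exp(-d2 / (2.0 * sigma ** 2)), eps)


# Toy data with two obvious clusters; the labels stand in for a k-means result.
X = np.array([[0.0, 0.0], [0.2, 0.0], [3.0, 3.0], [3.2, 3.0]])
cluster_labels = [0, 0, 1, 1]
p0 = init_weights(len(X))
A0 = init_adjacency(X, cluster_labels)
```

Taken literally, Equation (20) also assigns $a_{i i} = 1$ on the diagonal; the sketch keeps that behavior rather than zeroing self-loops.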

3.5.2. Alternating Optimization

When implementing an alternating optimization [15], we decompose the original optimization problem into several subproblems, each of which is responsible for optimizing one variable in the original problem. When optimizing the probability weight vector

p

, we hold the adaptive adjacency matrix

A

fixed; conversely, when optimizing the adjacency matrix

A

, we need to hold the probability weight vector

p

fixed. The repeated application of this strategy ensures that each optimization step approaches the optimal solution. Next, we provide a detailed explanation of the specific implementation methods and corresponding technical details for each optimization step.

1. When the adjacency matrix $A$ is fixed, optimize the probability weight vector $p$ .

When

A = A^{(k)}

is held fixed, the objective function in Equation (8) reduces to the following sub-objective function:

J_{1} (p) = \min_{p} {‖ϕ (y) - \sum_{i = 1}^{m} p_{i} ϕ (x_{i})‖}^{2} + λ_{1} \sum_{i = 1}^{m} p_{i} \log p_{i} + λ_{2} \sum_{i = 1}^{m} {‖p_{i} - β d_{i}‖}^{2} + λ_{3} \sum_{i, j = 1}^{m} p^{T} L (A) p

(21)

Equation (21) is a convex optimization problem in $p$ provided that the kernel matrix and the Laplacian $L (A)$ are positive semidefinite, because the entropy and locality terms are convex; it is optimized using the gradient descent algorithm [15]. Note that the first term contains the unknown mapping function $ϕ (\cdot)$. The kernel trick allows us to operate in the feature space without explicitly computing this mapping, thereby significantly reducing the computational complexity.

First, construct the kernel matrix $K$ and the vector $k_{y}$ to prepare for the gradient calculation and parameter updates. Here, $K \in ℝ^{m \times m}$ has entries $K_{i, j} = K (x_{i}, x_{j})$, and $k_{y} \in ℝ^{m}$ is the vector whose $j$-th element is $K (y, x_{j})$. We transform the linear reconstruction term

{‖ϕ (y) - \sum_{i = 1}^{m} p_{i} ϕ (x_{i})‖}^{2}

into

{‖ϕ (y) - \sum_{i = 1}^{m} p_{i} ϕ (x_{i})‖}^{2} = K (y, y) - 2 \sum_{i = 1}^{m} p_{i} K (y, x_{i}) + \sum_{i = 1}^{m} \sum_{j = 1}^{m} p_{i} p_{j} K (x_{i}, x_{j})

(22)
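The identity in Equation (22) can be verified numerically for a kernel whose feature map is known explicitly. The sketch below uses the linear kernel, for which $ϕ (x) = x$, purely as an assumption that makes the left-hand side computable; the function name and the random data are illustrative.

```python
import numpy as np


def kernel_reconstruction_error(K_yy, k_y, K, p):
    """Right-hand side of Equation (22): K(y,y) - 2 p^T k_y + p^T K p."""
    return float(K_yy - 2.0 * (p @ k_y) + p @ K @ p)


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # five training samples in three dimensions
y = rng.normal(size=3)        # one testing sample
p = rng.random(5)
p /= p.sum()                  # weights on the probability simplex

# Linear kernel: phi(x) = x, so the feature-space error can be computed directly.
explicit_error = float(np.sum((y - p @ X) ** 2))
kernel_error = kernel_reconstruction_error(y @ y, X @ y, X @ X.T, p)
```

For the Gaussian kernel the explicit side is unavailable because the feature space is infinite-dimensional, which is why only the kernel form is used in the optimization.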

We then express the sub-objective function as follows:

\min_{p} K (y, y) - 2 \sum_{i = 1}^{m} p_{i} K (y, x_{i}) + \sum_{i = 1}^{m} \sum_{j = 1}^{m} p_{i} p_{j} K (x_{i}, x_{j}) + λ_{1} \sum_{i = 1}^{m} p_{i} \log p_{i} + λ_{2} \sum_{i = 1}^{m} {‖p_{i} - β d_{i}‖}^{2} + λ_{3} \sum_{i, j = 1}^{m} p^{T} L (A) p

(23)

Based on the gradient descent method, the gradient of the objective function with respect to

p_{i}

is calculated by

\frac{\partial J (p)}{\partial p_{i}} = - 2 K (y, x_{i}) + 2 \sum_{j = 1}^{m} p_{j} K (x_{i}, x_{j}) + λ_{1} (\log p_{i} + 1) + 2 λ_{2} (p_{i} - β d_{i}) + 2 λ_{3} {(L (A^{(k)}) p)}_{i}

(24)
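A finite-difference check is a simple way to confirm the gradient in Equation (24) against the sub-objective in Equation (23). The sketch below assumes a symmetric kernel matrix and a symmetric Laplacian (so that the manifold term differentiates to $2 λ_{3} L p$), uses the kernel vector as the similarity vector $d$, and picks arbitrary parameter values; all names are illustrative.

```python
import numpy as np


def objective(p, K, k_y, K_yy, d, L, lam1, lam2, lam3, beta):
    """Sub-objective of Equation (23) in kernel form."""
    recon = K_yy - 2.0 * (p @ k_y) + p @ K @ p
    entropy = lam1 * np.sum(p * np.log(p))
    locality = lam2 * np.sum((p - beta * d) ** 2)
    manifold = lam3 * (p @ L @ p)
    return float(recon + entropy + locality + manifold)


def gradient(p, K, k_y, d, L, lam1, lam2, lam3, beta):
    """Gradient of Equation (24), valid for symmetric K and L and strictly positive p."""
    return (-2.0 * k_y + 2.0 * (K @ p) + lam1 * (np.log(p) + 1.0)
            + 2.0 * lam2 * (p - beta * d) + 2.0 * lam3 * (L @ p))


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = rng.normal(size=2)
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2) / 2.0)
k_y = np.exp(-np.sum((X - y) ** 2, axis=1) / 2.0)
A = rng.random((6, 6))
A = (A + A.T) / 2.0                      # symmetric adjacency (assumption)
L = np.diag(A.sum(axis=1)) - A
p = rng.random(6) + 0.1
p /= p.sum()
params = dict(lam1=0.1, lam2=0.2, lam3=0.3, beta=1.0)

analytic = gradient(p, K, k_y, k_y, L, **params)
h = 1e-6
numeric = np.zeros_like(p)
for i in range(p.size):
    e = np.zeros_like(p)
    e[i] = h
    # Central difference along coordinate i.
    numeric[i] = (objective(p + e, K, k_y, 1.0, k_y, L, **params)
                  - objective(p - e, K, k_y, 1.0, k_y, L, **params)) / (2.0 * h)
max_gap = float(np.max(np.abs(analytic - numeric)))
```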

We then update $p$ with a gradient descent step:

p^{(t + 1)} = p^{(t)} - lr \frac{\partial J (p^{(t)})}{\partial p}

(25)

where $lr$ denotes the learning rate. We monitor the change in the objective function value to decide when to terminate the optimization. Specifically, we define the change as

Δ J = |J_{k + 1} - J_{k}|

(26)

where

J_{k}

and

J_{k + 1}

denote the objective function values of two consecutive iterations, respectively. If this change amount

Δ J

is smaller than a preset threshold, we can consider the optimization process converged.

Next, we analyze the computational complexity of Algorithm 1 step by step. In Step 1, constructing the kernel matrix

K \in ℝ^{n \times n}

requires evaluating the Gaussian kernel for all

n^{2}

pairs of training samples; since each kernel evaluation involves computing a distance in

d

dimensions, this step costs

O (n^{2} d)

. In the same step, computing the similarity vector

k \in ℝ^{n}

between the testing sample and all training samples requires

n

kernel evaluations and thus costs

O (n d)

. In Steps 2–10, the algorithm enters an iterative procedure; suppose it runs for

T

iterations. In each iteration, Steps 3–5 loop over

n

coefficients, and computing the required quantity for each coefficient typically involves an

O (n)

inner product with the precomputed kernel matrix, resulting in

O (n^{2})

time per iteration. In Step 6, the update is dominated by dense matrix–vector operations, which also cost

O (n^{2})

. In Steps 8–10, projecting

p

onto the probability simplex can be implemented in, at most,

O (n \log n)

time (e.g., via sorting-based projection), which is lower than

O (n^{2})

for a moderate-to-large

n

. Therefore, the per-iteration cost in Steps 2–10 is dominated by

O (n^{2})

, and after

T

iterations, the total time complexity of Algorithm 1 is

O (n^{2} d) + T (O (n^{2}) + O (n \log n)) = O (n^{2} d + T n^{2})

, where the

O (n^{2} d)

term comes from the kernel construction and the

O (T n^{2})

term dominates the iterative phase in practice.

Algorithm 1: Optimization process in Equation (21).

Input: Training sample matrix $X$; testing sample $y$; adjacency matrix $A$; distance vector $d$; and regularization parameters $λ_{1}$, $λ_{2}$ and $λ_{3}$.

Output: Probability weight vector $p$ of the testing sample.

Procedure

Step 1: Calculate $K (x_{i}, x_{j})$ and $K (y, x_{j})$.

Step 2: While $|J_{k + 1} - J_{k}| > ε$, do the following.

Step 3: For $i = 1$ to $n$, do the following.

Step 4: Calculate $\frac{\partial J (p)}{\partial p_{i}} = - 2 K (y, x_{i}) + 2 \sum_{j = 1}^{m} p_{j} K (x_{i}, x_{j}) + λ_{1} (\log p_{i} + 1) + 2 λ_{2} (p_{i} - β d_{i}) + 2 λ_{3} {(L (A^{(k)}) p)}_{i}$.

Step 5: Add $\frac{\partial J (p)}{\partial p_{i}}$ to $grad_{p}$.

Step 6: Update $p$.

Step 7: $p \leftarrow p - lr \cdot grad_{p}$.

Step 8: Project $p$ to make it a probability distribution.

Step 9: $p \leftarrow \max (p, 0)$.

Step 10: $p \leftarrow p / \mathrm{sum} (p)$.

Step 11: Return $p$.
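The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function name `optimize_weights`, the default parameter values, the iteration cap `max_iter`, and the small `floor` used to keep the logarithm finite at zero weights are all assumptions, and the clip-and-renormalize step mirrors Steps 9 and 10 rather than an exact Euclidean projection onto the simplex.

```python
import numpy as np


def optimize_weights(K, k_y, K_yy, d, L, lam1=0.01, lam2=0.1, lam3=0.1, beta=1.0,
                     lr=0.05, eps=1e-10, max_iter=2000, floor=1e-10):
    """Gradient descent on p with clipping and renormalization (Algorithm 1 sketch)."""
    m = K.shape[0]
    p = np.full(m, 1.0 / m)  # uniform start

    def J(p):
        q = np.maximum(p, floor)  # guard the logarithm at zero weights
        return (K_yy - 2.0 * (p @ k_y) + p @ K @ p + lam1 * np.sum(q * np.log(q))
                + lam2 * np.sum((p - beta * d) ** 2) + lam3 * (p @ L @ p))

    prev = J(p)
    for _ in range(max_iter):
        q = np.maximum(p, floor)
        # Steps 3-5: gradient of Equation (24), computed for all coordinates at once.
        grad = (-2.0 * k_y + 2.0 * (K @ p) + lam1 * (np.log(q) + 1.0)
                + 2.0 * lam2 * (p - beta * d) + 2.0 * lam3 * (L @ p))
        p = p - lr * grad                  # Steps 6-7
        p = np.maximum(p, 0.0)             # Step 9
        total = p.sum()
        p = p / total if total > 0 else np.full(m, 1.0 / m)  # Step 10
        cur = J(p)
        if abs(cur - prev) <= eps:         # Step 2 stopping rule
            break
        prev = cur
    return p


# Toy problem: the testing sample lies next to the first two training samples.
X = np.array([[0.0, 0.0], [0.2, 0.0], [3.0, 3.0], [3.2, 3.0]])
y = np.array([0.05, 0.0])
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2) / 2.0)
k_y = np.exp(-np.sum((X - y) ** 2, axis=1) / 2.0)
A = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
p = optimize_weights(K, k_y, 1.0, k_y, L)
```

Vectorizing the gradient replaces the explicit loop over coefficients while keeping the same $O (n^{2})$ cost per iteration, since the dominant operation is still the product of the kernel matrix with $p$.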

2. When the probability weight vector $p$ is fixed, optimize the adjacency matrix $A$ .

When fixing the probability weight vector

p

, the objective function in Equation (8) degenerates to a sub-objective function related only to the adjacency matrix

A

, as follows:

J_{A} (A) = \min_{A} λ_{3} [p^{T} L (A) p] + λ_{3} φ [\sum_{i, j = 1}^{m} a_{i j} {‖x_{i} - x_{j}‖}^{2}] + λ_{4} \sum_{i, j} a_{i j}^{2}

(27)

For the first term $λ_{3} [p^{T} L (A) p]$ in Equation (27), the Laplacian matrix is $L (A) = D - A$, where $D$ denotes the degree matrix with diagonal elements $D_{i i} = \sum_{j} a_{i j}$. Expanding the quadratic form gives $p^{T} (D - A) p = \sum_{i, j} a_{i j} (p_{i}^{2} - p_{i} p_{j})$, because $a_{i j}$ enters $D$ only through $D_{i i}$, with $\frac{\partial D_{i i}}{\partial a_{i j}} = 1$. Treating the entries of $A$ as independent variables, we obtain

\frac{\partial}{\partial a_{i j}} λ_{3} [p^{T} L (A) p] = λ_{3} (p_{i}^{2} - p_{i} p_{j})

(28)

The second term in the sub-objective function, $λ_{3} φ [\sum_{i, j = 1}^{m} a_{i j} {‖x_{i} - x_{j}‖}^{2}]$, is linear in $a_{i j}$, with $φ$ acting as a scalar coefficient as in Equation (27). Its derivative is

\frac{\partial}{\partial a_{i j}} [λ_{3} φ \sum_{i, j} a_{i j} {‖x_{i} - x_{j}‖}^{2}] = λ_{3} φ {‖x_{i} - x_{j}‖}^{2}

(29)

For the third term in the sub-objective function, $λ_{4} \sum_{i, j} a_{i j}^{2}$, the derivative with respect to $a_{i j}$ is

\frac{\partial}{\partial a_{i j}} [λ_{4} \sum_{i, j} a_{i j}^{2}] = 2 λ_{4} a_{i j}

(30)

Therefore, we obtain the gradient of Equation (27) as follows:

\frac{\partial J}{\partial a_{i j}} = λ_{3} (p_{i}^{2} - p_{i} p_{j}) + λ_{3} φ {‖x_{i} - x_{j}‖}^{2} + 2 λ_{4} a_{i j}

(31)

We then update $a_{i j}$ with a gradient descent step:

a_{i j} \leftarrow a_{i j} - lr \frac{\partial J}{\partial a_{i j}}

(32)

We ensure non-negativity by projecting $a_{i j}$ onto the non-negative half-line:

a_{i j} \leftarrow \max (a_{i j}, 0)

(33)

We present the details for optimizing Equation (27) in Algorithm 2.

Algorithm 2: Optimizing Equation (27).

Input: Training samples $X$; testing sample probability vector $p$; and regularization parameters $λ_{3}$, $λ_{4}$ and $φ$.

Output: Optimal adjacency matrix $A$.

Procedure

Step 1: Randomly initialize adjacency matrix $A$.

Step 2: While $|J_{k + 1} - J_{k}| > ε$, do the following.

Step 3: Compute $grad_{A}$, whose $(i, j)$-th entry is the gradient $\frac{\partial J}{\partial a_{i j}}$ given in Equation (31).

Step 4: Update $A \leftarrow A - lr \cdot grad_{A}$.

Step 5: Project $A \leftarrow \max (A, 0)$ to ensure non-negativity.

Step 6: Return $A$.
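A minimal Python sketch of Algorithm 2 is given below. It is not the authors' implementation. The entrywise gradient is derived directly from Equation (27) under two assumptions made for this sketch: $φ$ is a scalar coefficient, and the entries $a_{i j}$ are treated as independent variables, so that differentiating $p^{T} (D - A) p$ gives $p_{i}^{2} - p_{i} p_{j}$. Function names, default parameters, and the toy data are illustrative.

```python
import numpy as np


def objective_A(A, p, D2, lam3, lam4, phi):
    """Sub-objective of Equation (27); D2 holds squared pairwise distances."""
    L = np.diag(A.sum(axis=1)) - A
    return float(lam3 * (p @ L @ p) + lam3 * phi * np.sum(A * D2) + lam4 * np.sum(A ** 2))


def gradient_A(A, p, D2, lam3, lam4, phi):
    """Entrywise gradient of objective_A with independent entries a_ij."""
    # d/da_ij [p^T (D - A) p] = p_i^2 - p_i p_j
    laplacian_part = p[:, None] ** 2 - np.outer(p, p)
    return lam3 * laplacian_part + lam3 * phi * D2 + 2.0 * lam4 * A


def optimize_adjacency(X, p, lam3=1.0, lam4=1.0, phi=0.1, lr=0.05,
                       eps=1e-12, max_iter=1000, seed=0):
    """Projected gradient descent on A; returns A and the objective history."""
    X = np.asarray(X, dtype=float)
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    A = np.random.default_rng(seed).random((X.shape[0], X.shape[0]))  # Step 1
    history = [objective_A(A, p, D2, lam3, lam4, phi)]
    for _ in range(max_iter):
        A = A - lr * gradient_A(A, p, D2, lam3, lam4, phi)  # Steps 3-4
        A = np.maximum(A, 0.0)                               # Step 5
        history.append(objective_A(A, p, D2, lam3, lam4, phi))
        if abs(history[-1] - history[-2]) <= eps:            # Step 2 stopping rule
            break
    return A, history


X = np.array([[0.0, 0.0], [0.2, 0.0], [3.0, 3.0], [3.2, 3.0]])
p = np.array([0.4, 0.3, 0.2, 0.1])
A_opt, history = optimize_adjacency(X, p)
```

Because the sub-objective is a convex quadratic in $A$ whose gradient has Lipschitz constant $2 λ_{4}$, a step size below $1 / (2 λ_{4})$ keeps the projected iteration monotonically non-increasing, which the returned history makes easy to inspect.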

Next, we analyze the computational complexity of Algorithm 2 step by step. Let

n

denote the number of nodes (training samples), let

A \in ℝ^{n \times n}

be the adaptive adjacency matrix to be learned, and let

d

be the dimension of each node feature vector. In Step 1,

A

is randomly initialized by assigning values to all

n^{2}

entries, which costs

O (n^{2})

. In Step 2, the while loop repeatedly updates

A

until convergence. Suppose it runs for

T_{2}

iterations. In each iteration, the dominant computation is the double loop over all index pairs

(i, j)

, i.e., over the

n^{2}

entries of

A

. In Step 3, for each pair

(i, j)

, computing the squared Euclidean distance between node features (e.g.,

{‖x_{i} - x_{j}‖}_{2}^{2}

) requires a subtraction and a sum of squares in

d

dimensions, costing

O (d)

. In Step 4, updating the corresponding entry of

A

involves only a constant number of arithmetic operations, costing

O (1)

. In Step 5, enforcing the non-negativity constraint (e.g., via projection onto the non-negative domain) is also a constant-time operation per entry, costing

O (1)

. Therefore, one full pass of Steps 3–5 over all

(i, j)

pairs costs

O (n^{2} d)

. After completing one iteration, the convergence test compares the updated value with its previous value across all

n^{2}

elements, which costs

O (n^{2})

and is dominated by

O (n^{2} d)

when

d \geq 1

. Consequently, the total time complexity of Algorithm 2 after

T_{2}

iterations is

O (n^{2}) + T_{2} (O (n^{2} d) + O (n^{2})) = O (T_{2} n^{2} d)

, where the

O (n^{2} d)

term typically dominates in practice.

We summarize the optimization process of the objective function in the proposed KMOLNN method in Algorithm 3.

Algorithm 3: Alternating optimization process of the objective function in the proposed KMOLNN method.

Input: Training samples $X = \{x_{1}, \dots, x_{n}\}$; testing sample $y$; kernel function $K (\cdot, \cdot)$; parameters $λ$, $μ$, $γ$ and $β$; kernel width $σ$; maximum iterations $max\_iter$; and convergence threshold tol.

Output: Optimized weight vector $p$ and adjacency matrix $A$.

Procedure

Step 1: Initialization: $p \leftarrow \frac{1}{n}$, $p = p / \mathrm{sum} (p)$.

Step 2: Initialize $A$ using k-means clustering; if $x_{i}$ and $x_{j}$ are in the same cluster, set $A_{i j} = 1$, otherwise $A_{i j} = 0$.

Step 3: Compute similarity vector $d$, where $d_{i} = \exp (- \frac{{‖y - x_{i}‖}^{2}}{2 σ^{2}})$ for $i = 1, \dots, n$.

Step 4: Compute kernel matrices $K_{X X} = K (X, X)$, $K_{y X} = K (y, X)$ and $K_{y y} = K (y, y)$.

Step 5: Iterative optimization.

Step 6: For $iter = 1$ to $max\_iter$, do the following.

Step 7: Fix $A$; compute the Laplacian $L = D - A$, where $D_{i i} = \sum_{j} A_{i j}$, and optimize the weight vector $p \leftarrow$ Algorithm 1 $(K_{X X}, K_{y X}, K_{y y}, d, L, λ, μ, γ)$.

Step 8: Fix $p$; optimize adjacency matrix $A \leftarrow$ Algorithm 2 $(p, γ, β)$.

Step 9: Update $D$, where $D_{i i} = \sum_{j} a_{i j}$.

Step 10: If ${‖p_{new} - p_{old}‖}_{2} + {‖A_{new} - A_{old}‖}_{F} < tol$, then

Step 11: Break.

Step 12: $p_{old} \leftarrow p$, $A_{old} \leftarrow A$.

Step 13: Return $p$, $A$.
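The alternating scheme of Algorithm 3 can be sketched end to end as follows. This is a compact illustration under several assumptions, not the authors' implementation: the parameters are named after Equation (8) (`lam1` to `lam4`, `phi`, `beta`) because the mapping to the symbols $λ$, $μ$, $γ$ and $β$ in the algorithm listing is not spelled out; the inner solvers run a fixed number of gradient steps; the cluster labels are supplied by the caller in place of a k-means run; and the adjacency gradient treats the entries of $A$ as independent, so the manifold gradient uses $(L + L^{T}) p$ to remain valid when $A$ becomes asymmetric.

```python
import numpy as np


def sq_dists(U, V):
    """Pairwise squared Euclidean distances between rows of U and rows of V."""
    d2 = np.sum(U ** 2, axis=1)[:, None] + np.sum(V ** 2, axis=1)[None, :] - 2.0 * (U @ V.T)
    return np.maximum(d2, 0.0)


def update_p(p, K, k_y, d, L, lam1, lam2, lam3, beta, lr, steps):
    """Inner loop for p: gradient step, clip at zero, renormalize."""
    for _ in range(steps):
        q = np.maximum(p, 1e-10)  # keep the logarithm finite
        grad = (-2.0 * k_y + 2.0 * (K @ p) + lam1 * (np.log(q) + 1.0)
                + 2.0 * lam2 * (p - beta * d) + lam3 * ((L + L.T) @ p))
        p = np.maximum(p - lr * grad, 0.0)
        total = p.sum()
        p = p / total if total > 0 else np.full_like(p, 1.0 / p.size)
    return p


def update_A(A, p, D2, lam3, lam4, phi, lr, steps):
    """Inner loop for A: gradient step followed by projection onto A >= 0."""
    base = lam3 * (p[:, None] ** 2 - np.outer(p, p)) + lam3 * phi * D2
    for _ in range(steps):
        A = np.maximum(A - lr * (base + 2.0 * lam4 * A), 0.0)
    return A


def kmolnn_weights(X, y, cluster_labels, sigma=1.0, lam1=0.01, lam2=0.1, lam3=0.1,
                   lam4=1.0, phi=0.1, beta=1.0, lr=0.05, inner_steps=50,
                   max_iter=20, tol=1e-6):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    D2 = sq_dists(X, X)
    K = np.exp(-D2 / (2.0 * sigma ** 2))                          # Step 4
    k_y = np.exp(-sq_dists(y[None, :], X)[0] / (2.0 * sigma ** 2))
    d = k_y.copy()                                                # Step 3
    p = np.full(n, 1.0 / n)                                       # Step 1
    labels = np.asarray(cluster_labels)
    A = (labels[:, None] == labels[None, :]).astype(float)        # Step 2
    for _ in range(max_iter):                                     # Step 6
        p_old, A_old = p, A
        L = np.diag(A.sum(axis=1)) - A                            # Step 7
        p = update_p(p, K, k_y, d, L, lam1, lam2, lam3, beta, lr, inner_steps)
        A = update_A(A, p, D2, lam3, lam4, phi, lr, inner_steps)  # Step 8
        # Step 10: np.linalg.norm is the 2-norm for vectors and Frobenius for matrices.
        if np.linalg.norm(p - p_old) + np.linalg.norm(A - A_old) < tol:
            break
    return p, A


X = np.array([[0.0, 0.0], [0.2, 0.0], [3.0, 3.0], [3.2, 3.0]])
y = np.array([0.05, 0.0])
p, A = kmolnn_weights(X, y, cluster_labels=[0, 0, 1, 1])
```

The kernel matrix and pairwise distances are computed once outside the loop, which matches the complexity analysis in which kernel construction contributes a one-off $O (n^{2} d)$ term.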

Next, we analyze the computational complexity of Algorithm 3 step by step, using the same notations as in the algorithm. Let

n

be the number of training samples,

d

be the feature dimension,

max\_iter

be the maximum number of outer iterations, and let

T_{1}

and

T_{2}

denote the numbers of inner iterations required by Algorithm 1 and Algorithm 2, respectively. In Step 1, initializing the probability weight vector

p

assigns

n

values and therefore costs

O (n)

. In Step 2, initializing the adjacency matrix

A

via

k

-means clustering costs

O (n d k t)

(with

k

and

t

treated as constants in practice), and explicitly constructing/writing

A \in ℝ^{n \times n}

costs at most

O (n^{2})

; hence, this step is dominated by

O (n^{2})

. In Step 3, computing the similarity vector

d

between the testing sample and all training samples requires

n

distance/kernel evaluations in

d

-dimensional space, costing

O (n d)

. In Step 4, computing the kernel objects (notably the kernel matrix

K \in ℝ^{n \times n}

and the kernel vector

k

) requires evaluating the kernel for all

n^{2}

pairs of training samples (each typically

O (d)

), leading to

O (n^{2} d)

time. In Steps 6–11, the algorithm enters the outer loop, which runs for at most

max\_iter

iterations. In each outer iteration, Step 7 fixes

A

and computes the graph’s Laplacian

L = D - A

(with

D

being the degree matrix), which takes

O (n^{2})

, and then invokes Algorithm 1 to optimize

p

, costing

O (T_{1} n^{2})

, since each inner iteration is dominated by dense matrix–vector operations involving

K

. Next, Step 8 fixes

p

and invokes Algorithm 2 to optimize

A

; this requires updating all

n^{2}

entries of

A

, and each update typically performs a

d

-dimensional distance computation plus a constant-time arithmetic/projection, yielding

O (n^{2} d)

per inner iteration and

O (T_{2} n^{2} d)

in total. Therefore, the cost per outer iteration is

O (T_{1} n^{2} + T_{2} n^{2} d)

, and the overall time complexity of Algorithm 3 is

$O (n^{2} d) + max\_iter \cdot (O (T_{1} n^{2}) + O (T_{2} n^{2} d)) = O (n^{2} d + max\_iter \cdot (T_{1} n^{2} + T_{2} n^{2} d))$. In typical settings where $d$ is moderate, the dominant terms are $O (n^{2} d)$ for kernel construction and $O (max\_iter \cdot T_{2} n^{2} d)$ for updating $A$, while the early-stopping condition controlled by the threshold tol may reduce the actual number of outer iterations in practice.

4. Experimental Studies

To evaluate the generalization capability and running speed of the proposed KMOLNN method, we compared its classification performance and execution times with alternative methods on the adopted datasets. The experiments were implemented in Python 3.9 on a Windows 10 platform (Intel Core i7-8700 CPU, 32 GB RAM). Crucially, since KMOLNN is a lazy learning method requiring no explicit global training, the reported runtimes denote the end-to-end inference cost per test sample, specifically covering kernel matrix construction and alternating optimization.

4.1. Comparative Methods and Parameter Settings

In the experiments, eight KNN-based methods were selected as the comparative methods. They can be briefly summarized as follows:

k-Nearest Neighbors (KNN): KNN employs the Euclidean distance to measure the similarity between all training samples and the testing sample. The k training samples closest to the testing sample are selected, and the label of the testing sample is determined from the labels of these k training samples. The main parameter of the KNN method is the k-value. According to the recommendation in the literature [1], the k-value is typically selected through cross-validation by traversing values from 1 to n.

Kernel $k$-Nearest Neighbors (KKNN) [2]: As a direct nonlinear extension of traditional KNN, KKNN maps the input data into a high-dimensional feature space via a kernel function before performing nearest neighbor voting. This baseline is included to validate the advantage of our manifold-optimized reconstruction over simple kernelized voting.

Fuzzy k-Nearest Neighbors (FKNN) [36]: FKNN performs classification prediction by considering the similarity between testing samples and different classes. It employs fuzzy membership functions to quantify the relationship between testing samples and each class in the training samples, assigning weights based on proximity. The primary parameter of FKNN is the k-value. According to the recommendation in the literature [36], the k-value is typically selected through cross-validation by traversing values from 1 to n to determine the optimal k-value.

Proximal Ratio-based k-Nearest Neighbors (PRKNN) [37]: PRKNN is a KNN variant based on the Proximal Ratio (PR), which optimizes classification decisions by identifying the noise points and overlapping samples to address class imbalance and nonlinear data distribution. It achieves locally adaptive neighborhood selection by minimizing the contribution of overlapping points and incorporating weighting schemes (EPRKNN and WPRKNN). According to the literature recommendations, the main parameters include the number of neighbors

k

and the PR threshold. The k-value is selected via cross-validation within the

{1, 3, 5, \dots, 9}

range, with the PR threshold search range being

{0.1, 0.5}

. The objective function is to minimize the sum of PR loss and weighted distance.

Elastic Net [7]: Elastic Net is a regularization technique that combines a Lasso regression and Ridge regression, and is designed to handle data with highly correlated features. The parameter setting primarily involves two key parameters,

λ_{1}

and

λ_{2}

, to determine the weight ratio between the Lasso (

L 1

norm) and Ridge (

L_{2}

norm). Typically, these two parameters are selected through cross-validation, such as using a grid search to find the optimal combination within a given

\{0.001, 0.01, 0.1, 1, 10, 100\}

parameter range, achieving the best trade-off between bias and variance to enhance the model’s predictive power and generalization capability.

Weighted Multi-Local Mean Representation-based KNN (WLMRKNN) [4]: There are two parameters involved in WLMRKNN, i.e., the neighborhood size

k

and the regularization parameter

γ

. According to [4],

k

is varied from 1 to 15 with a step size of 1 (i.e.,

k \in \{1, 2, \dots, 15\}

). This sweep is used for most datasets. For face databases,

k

is set from 1 to

n_{t}

(the number of training samples per class) with a step size of 1. In addition,

γ

is the regularization coefficient in the objective function, which controls the strength of the locality-constrained term

{‖W_{j} s^{j}‖}_{2}^{2}

.

Local Centroid Distance-Constrained Representation-based KNN method (LCDR-KNN) [20]: LCDR-KNN selects the k-nearest training samples from each class and uses the distances to their local centroids as constraints within a collaborative representation, which yields the weight of each neighbor used for classification. According to [20], the parameter

λ_{1}

is searched in the set

\{0.001, 0.01, 0.1, 1, 10, 100\}

, and the parameter

λ_{2}

is searched in the set

\{0.001, 0.01, 0.1, 1, 10, 100\}

.

Shared Style k-Nearest Neighbors (SSL-KNN) [26]: SSL-KNN has five parameters to be determined. According to [26], the parameter

λ_{1}

controls the strength of the linear expression weights’ sparsification, which is searched iteratively in the set

\{0.001, 0.01, 0.1, 1, 10, 100\}

. The parameter

λ_{2}

controls the strength of the

L_{2}

-norm regularization term, which is traversed in the set

\{0.001, 0.01, 0.1, 1, 10, 100\}

. The parameter

λ_{3}

controls the strength of the testing sample style membership optimization and is iterated through the set

\{0.005, 0.01, 0.05, 0.1, 0.5, 1, 10\}

. The parameter

λ_{4}

is used to control the degrees of freedom of the style matrix: the larger its value, the closer the style matrix is to the identity matrix. It is traversed in the set

\{0.005, 0.01, 0.05, 0.1, 0.5, 1, 10\}

. According to [26], the last parameter

σ

is the Gaussian kernel bandwidth, which is traversed in the set

\{0.1, 1, 5\}

.

Multi-View Adaptive k-Nearest Neighbor (MVAKNN) [27]: MVAKNN has two main trade-off parameters to be determined. According to [27], the parameter

ρ_{1}

controls the strength of the

L_{1}

-norm regularization term

{‖W‖}_{1}

(i.e., enforcing sparsity of the correlation matrix W), and its search space is

ρ_{1} \in \{10^{- 5}, \dots, 10^{1}\}

. The parameter

ρ_{2}

controls the strength of the Laplacian regularization term

Tr (W^{T} X^{T} L X W)

, and its search space is

ρ_{2} \in \{10^{- 5}, \dots, 10^{- 1}\}

. In addition,

σ

appears as the Gaussian kernel bandwidth in the weight matrix

S_{i j}

(e.g.,

S_{i j} = \exp (- {‖x_{i} - x_{j}‖}^{2} / (2 σ^{2}))

) when

x_{j}

is a neighbor of

x_{i}

, but [27] does not explicitly provide a corresponding grid search range for

σ

in the experimental parameter settings.

The proposed KMOLNN method: We conducted a systematic grid search for hyperparameter selection. The regularization parameters

λ_{1}, λ_{2}, λ_{3},

and

λ_{4}

were tuned within the logarithmic range of

[10^{- 3}, 10^{3}]

. Simultaneously, the kernel width

h

was optimized within the range

[0.1 σ_{0}, 2.0 σ_{0}]

, where

σ_{0}

denotes the median of the pairwise distances between training samples. This expansive search space ensures that the model’s sensitivity to structural regularization and kernel mapping is thoroughly evaluated across diverse data distributions.
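
The search space described above can be sketched in Python as follows. The median heuristic for σ_0 and the two ranges follow the text; the number of grid points, the log spacing for the regularization parameters, the linear spacing for the kernel width, and the function names are assumptions, since the exact grid resolution is not stated.

```python
import numpy as np

def median_pairwise_distance(X):
    """sigma_0: median Euclidean distance over all distinct pairs of training samples."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt(np.sum(diff**2, axis=-1))
    upper = np.triu_indices(len(X), k=1)  # each unordered pair once, no self-pairs
    return float(np.median(dist[upper]))

def build_search_grid(X, n_lambda=7, n_width=5):
    """Candidate values for the regularization parameters and the kernel width h."""
    sigma0 = median_pairwise_distance(X)
    lambdas = np.logspace(-3, 3, n_lambda)             # 10^-3 ... 10^3 (assumed 7 points)
    widths = np.linspace(0.1, 2.0, n_width) * sigma0   # 0.1*sigma0 ... 2.0*sigma0 (assumed 5 points)
    return lambdas, widths
```

Tying the kernel width to σ_0 makes the grid scale with each dataset, so the same multipliers can be reused across benchmarks with different feature ranges.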

During the testing phase, to ensure fairness in comparison, we adopted the prediction rules provided in the relevant literature for predicting testing samples, as shown in Table 3. Additionally, a comprehensive summary of these comparative baseline methods, including their key characteristics and theoretical bases, is provided in Table 4.

4.2. The Benchmark Datasets

In this subsection, we use 15 standard benchmark datasets from the KEEL and UCI repositories. The dataset characteristics are summarized in Table 5. Following the experimental protocol in [26], we adopt a repeated random hold-out evaluation: for each dataset, we randomly split the data into 80% for training and 20% for testing. This procedure is repeated 10 times using different random seeds, and we report the mean of the performance over the 10 runs.
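
The repeated random hold-out protocol can be sketched as below. This is a minimal illustration under stated assumptions: the seeds, the unstratified permutation split, and the helper names `repeated_holdout_indices` and `mean_score` are hypothetical and are not specified in the text or in [26].

```python
import numpy as np

def repeated_holdout_indices(n_samples, n_repeats=10, test_fraction=0.2, base_seed=0):
    """Return a list of (train_idx, test_idx) pairs, one per random seed."""
    n_test = int(round(test_fraction * n_samples))
    splits = []
    for r in range(n_repeats):
        rng = np.random.default_rng(base_seed + r)  # a different seed for every repetition
        perm = rng.permutation(n_samples)
        splits.append((perm[n_test:], perm[:n_test]))
    return splits

def mean_score(score_fn, n_samples, **kwargs):
    """Average score_fn(train_idx, test_idx) over all hold-out repetitions."""
    scores = [score_fn(tr, te) for tr, te in repeated_holdout_indices(n_samples, **kwargs)]
    return float(np.mean(scores))
```

Here `score_fn` stands for any routine that trains on the training indices and returns a metric on the test indices, so the same splits can be reused for every compared method.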

To strictly account for the variance inherent in iterative graph updates and to avoid reporting biased point estimates, all performance metrics (e.g., accuracy and Macro F1) are reported as the mean results over 10 independent random hold-out runs. Furthermore, to mitigate sensitivity to initialization choices, the adaptive adjacency matrix is consistently initialized using a k-means-based strategy, ensuring stable and deterministic convergence across all repeated runs.

4.3. Evaluation Metrics

In this study, two commonly used evaluation metrics, i.e., the mean accuracy (ACC) and mean Macro F1-score, are adopted to evaluate the classification performance of all adopted methods on all adopted datasets; they can be calculated, respectively, as follows:

A C C = \frac{T P + T N}{T P + T N + F P + F N}

(34)

where TP denotes True Positives, representing the number of samples correctly identified as positive cases by the classifier; TN denotes True Negatives, representing the number of samples correctly identified as negative cases by the classifier; FP denotes False Positives, representing the number of samples incorrectly identified as positive cases by the classifier; and FN denotes False Negatives, representing the number of samples incorrectly identified as negative cases by the classifier.

F 1 - s c o r e = \frac{2 \cdot p r e c i s i o n \cdot r e c a l l}{p r e c i s i o n + r e c a l l}

(35)

where

p r e c i s i o n = \frac{T P}{T P + F P}

and

r e c a l l = \frac{T P}{T P + F N}

. For multi-class classification, the Macro F1-score is calculated by taking the arithmetic mean of the F1-scores for each individual class:

M a c r o F 1 = \frac{1}{| C |} \sum_{i \in C} F 1_{i}

(36)

where

C

denotes the set of classes, and

F 1_{i}

is the F1-score for the

i

-th class. In addition, to statistically analyze the differences between the proposed KMOLNN method and the comparative methods, the Friedman test [3,38] and Bonferroni–Dunn test [39] are adopted in the experiments.
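
For concreteness, the two metrics can be computed from label vectors as in the following sketch. For multi-class data the accuracy is taken as the fraction of correctly classified samples, which coincides with Eq. (34) in the binary case. The per-class F1 uses the identity F1 = 2TP / (2TP + FP + FN), which is algebraically equivalent to Eq. (35). Taking the class set C from the true labels and assigning F1 = 0 to a class with no true or predicted samples are assumptions of this sketch.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def macro_f1(y_true, y_pred):
    """Unweighted mean of the one-vs-rest F1-scores over the classes in y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1_scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(f1_scores))
```

Because every class contributes equally to the Macro F1-score, a method that neglects a small class is penalized more heavily than it is under accuracy.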

4.4. Generalization Capability Analysis

In this subsection, we evaluate the generalization ability of the proposed KMOLNN method and the comparative methods on the adopted datasets. Each method is run 10 times, and we report its mean accuracy and mean Macro F1-score.

To ensure a rigorous and modern benchmark, our comparison set includes state-of-the-art (SOTA) variants published in 2024 (SSL-KNN and MVAKNN) and 2025 (PRKNN). As summarized in Table 4, while recent methods like MVAKNN integrate kernelization and manifold learning, KMOLNN uniquely employs an adaptive Laplacian optimization framework combined with a theoretical link to Bayesian decision rules. This allows KMOLNN to consistently outperform these strong alternatives, especially on datasets with intricate nonlinear manifolds. All the experimental results are presented in Table 6 and Table 7.

To provide a more intuitive and diagnostic comparison across the entire benchmark suite, we have included a per-dataset Win/Tie/Loss (W/T/L) summary at the bottom of both Table 6 and Table 7. As shown in the W/T/L statistics, KMOLNN exhibits a consistent and overwhelming advantage, strictly losing to or tying with the competing baselines in only a marginal fraction of the datasets. This diagnostic analysis directly confirms the robust superiority of our algorithm across varying data distributions.

From Table 6 and Table 7, we can draw the following conclusions:

KMOLNN consistently outperformed the other methods in terms of both accuracy and Macro F1-score across the majority of datasets. For instance, on datasets with complex structures, such as Sonar and Ionosphere, the proposed KMOLNN method achieved remarkable performance. Specifically, on the Sonar dataset, KMOLNN achieved an accuracy of 90.71%, surpassing the runner-up MVAKNN (90.47%) and significantly outperforming the traditional KNN (82.61%). On the Pendigits dataset (with 16 dimensions and 10 classes), KMOLNN achieved an accuracy of 99.49%, outperforming MVAKNN’s 99.09% and Elastic Net’s 97.79%. This highlights the efficacy of kernelized manifold optimization in capturing complex data manifolds and enhancing generalization capability.
While advanced methods like MVAKNN and SSL-KNN demonstrated strong competitiveness, KMOLNN maintained the leading position. MVAKNN emerged as the strongest competitor, achieving the best results on datasets such as Wine and Vowel. However, KMOLNN provided a more robust mechanism overall, particularly on larger or more diverse datasets. For example, on the Letter Recognition dataset (sample size 20,000), KMOLNN attained an accuracy of 98.30%, distinctly higher than that of MVAKNN (97.80%) and Elastic Net (95.10%). In terms of overall mean accuracy across all 15 datasets, KMOLNN achieved approximately 90.76%, whereas the traditional KNN scored 84.23%, an improvement of about 6.5 percentage points.
The Macro F1-score results further confirm the balanced precision–recall capabilities of KMOLNN. On the datasets with potential multi-class challenges, such as Ecoli (8 classes) and Yeast (10 classes), KMOLNN maintained superior Macro F1-scores. For instance, on the Yeast dataset, KMOLNN achieved a Macro F1-score of 63.6%, which was the highest among all methods, surpassing MVAKNN (62.5%) and significantly outperforming KNN (50.5%). The Macro F1-score of KMOLNN across all datasets reached 88.62%, suggesting that the implicit Bayesian decision rule in our method effectively handles class separability even in difficult scenarios.

Finally, the grouped analysis validates its adaptability. For high-dimensional datasets like Sonar (60 dimensions), KMOLNN’s accuracy of 90.71% suggests that the manifold-preserving regularization helps mitigate the curse of dimensionality. For the larger datasets, such as Pendigits and Letter Recognition, KMOLNN consistently ranked first in accuracy, indicating that its advantage holds as the sample size grows. Overall, these results support the practical advantages of KMOLNN for nonlinear classification tasks across repeated trials and diverse benchmarks, and they provide the basis for the subsequent statistical tests, such as the Friedman test.

To go beyond narrative explanations to empirically understand why KMOLNN achieves these results, it is critical to analyze the dataset-wise patterns. On datasets with highly complex, nonlinear decision boundaries (such as Sonar and Ionosphere), standard linear methods like LLKNN often fail. LLKNN assumes local linearity in the original ambient space, which leads to assigning high weights to “false neighbors” that cross the folded nonlinear manifold.

In KMOLNN, first, the kernelization unfolds the nonlinear data into a higher-dimensional RKHS where the local linear reconstruction assumption becomes valid. Second, rather than relying on a static Euclidean graph, our adaptive Laplacian matrix dynamically recalculates similarities during optimization. This allows KMOLNN to actively prune those false neighbors and preserve only the true geodesic neighborhood, translating our theoretical manifold preservation directly into the practical accuracy gains observed in Table 6.

4.5. Runtime Analysis

In this section, we conduct a comprehensive evaluation of the computational efficiency of the proposed KMOLNN method against eight comparative methods. Table 8 summarizes the average running times of all adopted methods on all adopted datasets.

A clear pattern in Table 8 is that the simple neighbor-based classifiers (e.g., KNN/FKNN/PRKNN) are consistently the fastest, because their computation is dominated by distance evaluation and local voting, with no expensive training or iterative parameter updates. The methods that incorporate regularization or additional learning modules (e.g., Elastic Net, LCDR-KNN, WLMRKNN) typically require more time due to extra optimization steps, but they still remain moderate on most datasets.

In contrast, the slowest methods in Table 8 are SSL-KNN and KMOLNN. The table indicates that SSL-KNN has the largest runtime overall, while KMOLNN is usually the second most time-consuming, and the gap becomes especially visible on larger datasets (e.g., Pendigits, Satimage, Letter Recognition). This trend suggests that KMOLNN’s cost scales more sharply with the number of samples than the lighter baselines.

The runtime overhead of KMOLNN mainly comes from: (1) kernel-related computations, where building and using kernel similarities can be expensive (often near quadratic in sample size); (2) constructing and updating the adaptive neighborhood graph and its Laplacian regularization, which is typically performed iteratively; and (3) alternating optimization procedures (such as repeated updates of weights/structures), which accumulate time across iterations. As datasets grow, these steps increase both the computation and memory demand, explaining the stronger slowdown on large-scale benchmarks.

Although KMOLNN is slower, its extra cost supports its nonlinear representation ability (via kernel mapping) and manifold/structure preservation (via graph regularization), which are designed to improve the classification quality. For better scalability, future work could adopt kernel approximations (e.g., Nyström or random Fourier features), reduce the number of optimization iterations, or use parallel/GPU acceleration for the graph and matrix operations.
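
To make the Nyström suggestion concrete, the sketch below builds low-rank features Z such that Z Zᵀ approximates the full Gaussian kernel matrix using only m landmark samples. It is an illustrative sketch and not part of KMOLNN as evaluated here: uniform landmark sampling, the eigenvalue cutoff of 1e-10, and the function names are assumptions.

```python
import numpy as np

def rbf_kernel(a, b, h):
    """Gaussian kernel between the rows of a and the rows of b with width h."""
    sq = np.sum(a**2, axis=1)[:, None] - 2.0 * a @ b.T + np.sum(b**2, axis=1)[None, :]
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * h**2))

def nystrom_features(X, n_landmarks, h, seed=0):
    """Return Z (n x r, r <= m) with Z @ Z.T approximating the n x n kernel matrix."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_landmarks, replace=False)  # uniform landmarks (assumption)
    C = rbf_kernel(X, X[idx], h)   # n x m block between all samples and the landmarks
    W = C[idx]                     # m x m block among the landmarks
    vals, vecs = np.linalg.eigh(W)
    keep = vals > 1e-10            # drop near-zero directions (pseudo-inverse of W)
    # Z Z^T = C V diag(1/vals) V^T C^T = C W^+ C^T
    return C @ (vecs[:, keep] / np.sqrt(vals[keep]))
```

Only the n x m block is ever formed, so storage drops from O(n^2) to O(nm); the approximation quality depends on how well the landmarks cover the data.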

The experimental results present a clear trade-off: KMOLNN consistently ranks among the most accurate methods but has a higher computational cost compared to the linear baselines. We argue that this extra complexity is essential for capturing the ‘straightened’ manifold structure in high-dimensional feature spaces. Specifically, for datasets like Sonar (60 dimensions), the manifold-preserving regularization effectively mitigates the curse of dimensionality, a gain that justifies the

O (n^{2})

complexity. However, the scalability remains a critical constraint; as the sample size increases to 20,000 (e.g., Letter Recognition), the alternating optimization cycles accumulate significant overhead. Therefore, KMOLNN is particularly recommended for complex nonlinear classification tasks where accuracy and interpretability outweigh strict real-time constraints.

4.6. Statistical Analysis

To statistically analyze the proposed KMOLNN method, the Friedman test is first adopted to evaluate the difference among all adopted methods, and then the Bonferroni–Dunn test is further adopted to evaluate the differences between the proposed KMOLNN method and the comparative methods. The average ranking values of all adopted methods across the datasets, which serve as the basis for these statistical tests, are presented in Table 9.

According to [38], the Friedman test is employed to determine whether there exists a significant difference in generalization capability between the proposed KMOLNN method and all comparative methods across the 15 benchmark datasets. The null hypothesis of this test posits that all methods exhibit no statistically significant difference in their generalization capability. The calculation formula for the Friedman statistic is as follows:

χ_{F}^{2} = \frac{12 N}{k (k + 1)} (\sum_{j = 1}^{k} R_{j}^{2} - \frac{k {(k + 1)}^{2}}{4})

(37)

where

R_{j}

denotes the average accuracy ranking of the

j

-th method across all benchmark datasets,

k

represents the number of methods (here,

k = 9

), and

N

denotes the number of benchmark datasets (here,

N = 15

). Based on the calculations,

χ_{F}^{2} = 98.71

. A further transformation yields the F-distribution statistic:

F_{F} = \frac{(N - 1) χ_{F}^{2}}{N (k - 1) - χ_{F}^{2}} = 64.91

(38)

This statistic follows an F-distribution with degrees of freedom

(k - 1, (k - 1) (N - 1))

, that is,

(8, 112)

. At the significance level

α = 0.05

, the critical value for the F-distribution is approximately 1.95. Since

F_{F} = 64.91 > 1.95

, we reject the null hypothesis, indicating that there exist statistically significant differences in the generalization capability among these methods.
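
The arithmetic of Eqs. (37) and (38) can be checked with a few lines of Python. The sketch below implements both formulas and reproduces the reported F_F from the reported χ_F^2 = 98.71; the function names are illustrative, and the average ranks of Table 9 are not reproduced here.

```python
def friedman_chi2(avg_ranks, n_datasets):
    """Friedman statistic of Eq. (37) from the average rank R_j of each method."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)

def f_statistic(chi2, n_datasets, k):
    """F-distributed transformation of Eq. (38)."""
    return (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)

# Values stated in the text: chi2 = 98.71, N = 15 datasets, k = 9 methods.
f_stat = f_statistic(98.71, n_datasets=15, k=9)   # 14 * 98.71 / (120 - 98.71), about 64.91
dof = (9 - 1, (9 - 1) * (15 - 1))                 # (8, 112)
```

As a sanity check on Eq. (37), identical average ranks give a statistic of zero, while perfectly consistent rankings 1, ..., k give the maximum value N(k - 1).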

According to the literature [39], the Bonferroni–Dunn test is employed for post hoc comparison, treating the proposed KMOLNN method as the control method and performing pairwise comparisons with the other methods. The most crucial statistic in this test is the critical difference (CD), calculated as follows:

C D = q_{α} \sqrt{\frac{k (k + 1)}{6 N}}

(39)

where

q_{α} = 2.724

denotes the critical value of the statistic for

k = 9

and

α = 0.05

. The calculation yields

C D = 2.724

. When the absolute difference in average ranks between two methods exceeds the CD, it signifies a statistically significant difference in generalization capability; otherwise, the difference is not significant. We present the absolute differences in average ranks between the proposed KMOLNN method and the remaining 8 comparison methods in Table 10 and compare them with the critical value

C D = 2.724

.
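
The critical difference equals q_α here because k(k + 1) / (6N) = 90 / 90 = 1 for k = 9 and N = 15. The short sketch below evaluates Eq. (39) and applies the decision rule to the average ranks quoted in this section for KMOLNN (1.83) and KNN (8.4); the function name is illustrative.

```python
import math

def critical_difference(q_alpha, k, n_datasets):
    """Bonferroni-Dunn critical difference of Eq. (39)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

cd = critical_difference(2.724, k=9, n_datasets=15)   # sqrt(90 / 90) = 1, so CD = 2.724
knn_gap = abs(8.4 - 1.83)                              # rank difference KNN vs. KMOLNN
is_significant = knn_gap > cd                          # True: 6.57 exceeds 2.724
```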

Based on the rankings in Table 9, we can derive a robust statistical conclusion regarding the proposed KMOLNN’s performance.

First, the Friedman test confirms that the performance differences among the nine methods are statistically significant and not due to random chance. The calculated statistic (

F_{F} = 64.91

) far exceeds the critical threshold of 1.95. This result firmly establishes that the methods perform differently across the benchmark datasets, justifying the need for pairwise comparisons. The Bonferroni–Dunn post hoc test further clarifies KMOLNN’s position relative to its peers, revealing a distinct two-tier separation among the comparative methods:

Contrary to the assumption that simple linear models might suffice, KMOLNN demonstrates a statistically significant advantage over KNN, FKNN, PRKNN, Elastic Net, and LCDR-KNN. Notably, the performance gap is widest between KMOLNN (average rank 1.83) and the standard KNN (average rank 8.4), with a rank difference of 6.57—well above the critical difference (CD) of 2.7243. This statistically validates that our kernelization and manifold-preserving strategies successfully overcome the limitations of traditional linear and neighbor-based methods.
When compared to the most advanced methods (MVAKNN, SSL-KNN, and WLMRKNN), the rank differences fall below the critical difference, so the null hypothesis of equal performance is not rejected. It is particularly noteworthy that KMOLNN ties with MVAKNN for the top rank (both at 1.83). This indicates that while KMOLNN is not statistically superior to these top-tier methods, it successfully reaches the state-of-the-art level, offering a highly competitive alternative that matches the performance of complex multi-kernel and style-based approaches.

In summary, the statistical evidence indicates that KMOLNN is a statistically significant improvement over the standard baselines and performs on par with the strongest competing methods.

4.7. Limitations and Future Work

While KMOLNN demonstrates significant benefits for capturing nonlinear manifolds and providing theoretical interpretability, several limitations must be acknowledged. First, the computational complexity of

O (n^{2})

arises from the iterative kernel and Laplacian updates, which poses challenges for real-time processing on large-scale datasets compared to simpler baselines. Second, the model’s performance exhibits sensitivity to the choice of kernel parameters and the initial quality of the adaptive graph. Furthermore, although the alternating optimization converges effectively in our experiments, its numerical stability can depend on the initialization strategy.

Future research will focus on improving the inference efficiency by incorporating kernel approximation methods or algorithm unrolling techniques. Additionally, we aim to extend the evaluation of KMOLNN to more complex real-world scenarios, such as multi-modal medical imaging, to further validate its generalization capability beyond standard benchmark suites.

4.8. Qualitative Analysis of Interpretability

To further provide a concrete understanding of the learned representation weights and the manifold-preserving mechanism, we conduct a qualitative analysis on a representative case from the Iris dataset.

Figure 3 visualizes the internal mechanisms of KMOLNN for this case. Figure 3a displays the heatmap of the learned adaptive adjacency matrix

A

, with the training samples sorted by class. The distinct block-diagonal structure confirms that our manifold-preserving optimization successfully captures the intrinsic geometric structure of the kernel space, maintaining strong connections (dark blue) strictly among the intra-class samples. Furthermore, Figure 3b illustrates the learned probability weights

w

for a specific test sample. The model assigns similar and elevated weights (red bars) to a group of neighbors belonging to the correct class, while suppressing others (gray bars). This visual evidence is consistent with our mathematical proof of the nearest neighbor group effect and indicates that the optimized weights in KMOLNN carry a concrete interpretation linked to the Bayesian decision rule.

4.9. Ablation Study

To move beyond theoretical intuition and provide a direct empirical decomposition of the proposed framework, we design an ablation study. This analysis explicitly justifies why the specific joint integration of kernel mapping and adaptive manifold-preserving regularization is practically preferable to simpler alternatives, such as relying on kernel mapping or graph regularization alone.

Furthermore, to directly address the necessity of our proposed framework against baseline kernel methods, this ablation study isolates the standard kernel k-nearest neighbors [2] (KKNN) approach as a specific comparative variant. We evaluate the performance of the full KMOLNN method against three degraded variants on three representative nonlinear datasets (Sonar, Ionosphere, and Wine):

KMOLNN (linear): The kernel mapping function

ϕ (\cdot)

is removed. The reconstruction is performed in the original input space, reducing the model to a locally linear KNN. Its objective function degrades to

\min_{w_{i}, A} {‖y_{t e s t} - \sum_{i = 1}^{n} w_{i} x_{i}‖}_{2}^{2} + η \sum_{i = 1}^{n} w_{i} \ln w_{i} + λ \sum_{i = 1}^{n} d_{i}^{2} w_{i} + γ w^{T} L w + θ {‖A‖}_{F}^{2}

(40)

KMOLNN (w/o manifold)/KKNN: The manifold-preserving regularization term is disabled by setting the parameter

γ = 0

. By retaining the high-dimensional kernel mapping but omitting the manifold optimization, this variant is functionally equivalent to the standard kernel k-nearest neighbors baseline. This allows us to directly evaluate the specific performance gain achieved by transitioning from a basic kernelized approach to our manifold-optimized framework. Its objective function degrades to

\min_{w_{i}} {‖ϕ (y_{t e s t}) - \sum_{i = 1}^{n} w_{i} ϕ (x_{i})‖}_{2}^{2} + η \sum_{i = 1}^{n} w_{i} \ln w_{i} + λ \sum_{i = 1}^{n} d_{i}^{2} w_{i}

(41)

KMOLNN (fixed graph): The adaptive graph learning mechanism is removed. This variant uses a static, predefined KNN graph based on standard Euclidean distances instead of dynamically updating the adjacency matrix

A

. Consequently, the graph regularization over

A

is omitted:

\min_{w_{i}} {‖ϕ (y_{t e s t}) - \sum_{i = 1}^{n} w_{i} ϕ (x_{i})‖}_{2}^{2} + η \sum_{i = 1}^{n} w_{i} \ln w_{i} + λ \sum_{i = 1}^{n} d_{i}^{2} w_{i} + γ w^{T} L_{f i x e d} w

(42)
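
The fixed-graph regularizer in Eq. (42) can be illustrated with the following sketch, which builds a static Euclidean KNN graph, forms its Laplacian L_fixed = D - A, and evaluates the quadratic form wᵀ L_fixed w. Binary edge weights, symmetrization by taking the union of neighbor relations, and the function names are assumptions of this sketch, since the text does not specify how the predefined graph is weighted.

```python
import numpy as np

def knn_graph_laplacian(X, n_neighbors=3):
    """Static Euclidean KNN graph and its unnormalized Laplacian L = D - A."""
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.sum(diff**2, axis=-1)
    np.fill_diagonal(d2, np.inf)            # a sample is never its own neighbor
    A = np.zeros((n, n))
    for i in range(n):
        A[i, np.argsort(d2[i])[:n_neighbors]] = 1.0
    A = np.maximum(A, A.T)                  # keep an edge if either endpoint selects it
    return np.diag(A.sum(axis=1)) - A, A

def smoothness_penalty(w, L):
    """w^T L w, equal to 0.5 * sum_ij A_ij (w_i - w_j)^2 for a symmetric A."""
    return float(w @ L @ w)
```

The penalty is small when connected samples receive similar weights, which is how the graph term encourages neighbors on the same manifold to be weighted as a group; the adaptive variant differs in that A is updated during optimization rather than fixed in advance.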

The performance of these variants, measured by both the accuracy and Macro F1-score, is summarized in Table 11.

As observed in Table 11, the progressive integration of each component steadily improves both the accuracy and the Macro F1-score.

First, mapping the data onto a high-dimensional feature space is essential. The KMOLNN (linear) variant suffers the most significant performance drop, particularly on the highly nonlinear Sonar dataset (accuracy drops to 81.25%). This validates that kernelization is crucial for handling complex nonlinear distributions.

Second, the comparison between the KMOLNN (w/o manifold)/KKNN baseline and our full method provides clear evidence of the necessity of the manifold-preserving regularization. While the basic KKNN approach improves upon the linear variant, its performance remains suboptimal (e.g., 86.54% on Sonar) because it lacks the geometric constraints needed to maintain local neighborhood structures in the kernel space. By enabling manifold regularization, the full KMOLNN method re-establishes the nearest neighbor grouping effect, boosting the accuracy to 90.71% and ensuring much more stable weight distributions (as reflected by the higher Macro F1-scores).

Finally, comparing the full method with the KMOLNN (fixed graph) variant demonstrates the limitations of predefined similarity measures. By dynamically updating the adjacency matrix, the adaptive Laplacian matrix filters out the irrelevant connections and precisely captures the dynamic manifold, yielding the highest classification performance.

Crucially, these empirical observations provide direct grounding for our theoretical claims. The significant performance drop in the KKNN variant empirically validates the necessity of the nearest neighbor grouping effect; without this adaptive geometric constraint, the model fails to maintain local smoothness. Furthermore, the fact that KMOLNN consistently outputs sparse, valid probability weights that translate into state-of-the-art accuracy across these diverse datasets serves as strong empirical confirmation that our optimized objective effectively approximates the optimal Bayesian decision rule in practice.

4.10. Parameter Sensitivity and Robustness

To evaluate the stability and robustness of the KMOLNN algorithm, a systematic parameter sensitivity analysis is conducted on the representative Sonar dataset. Based on the overall objective function, we focus on two critical sets of hyperparameter interactions: the relationship between the entropy regularization parameter

λ_{1}

and the locality constraint parameter

λ_{2}

, and the interaction between the adaptive manifold-preserving coefficient

λ_{3}

and the Gaussian kernel width

h

. During this bivariate analysis, the adjacency matrix regularization parameter

λ_{4}

and the distance penalty factor

φ

are kept fixed at their pre-determined optimal values.

4.10.1. Interaction of $λ_{1}$ and $λ_{2}$

Figure 4a presents the 3D surface plot of classification accuracy as the entropy regularization

λ_{1}

and the locality constraint

λ_{2}

vary simultaneously across a wide logarithmic range from

10^{- 3}

to

10^{3}

. The visualization reveals a broad, stable “plateau” of high performance, where the classification accuracy remains consistently above 88% across most parameter combinations. The prominent peak observed around

λ_{1} = λ_{2} = 10^{0}

indicates that the model achieves its optimal balance between weight sparsity (driven by entropy) and localized reconstruction stability (driven by the distance penalty) in this central region. The absence of sharp, localized spikes empirically demonstrates that KMOLNN is highly robust and not overly sensitive to precise hyperparameter tuning within this subspace, significantly reducing the computational burden typically required for a fine-grained grid search.

4.10.2. Interaction of $λ_{3}$ and $h$

Figure 4b provides a 2D heatmap illustrating the sensitivity of accuracy with respect to the adaptive manifold coefficient

λ_{3}

and the Gaussian kernel width

h

. A striking structural pattern is the vertical consistency of the accuracy scores. For a fixed kernel width (e.g.,

h = 1.0 σ_{0}

), the classification accuracy remains virtually identical—varying narrowly between 90.5% and 91.1%—even as

λ_{3}

scales across six orders of magnitude from

10^{- 3}

to

10^{3}

. This stability firmly demonstrates the intrinsic robustness of the adaptive manifold optimization mechanism; the model reliably learns the dynamic graph structure

A

regardless of the global penalty weight assigned to the structural terms. Conversely, the performance exhibits higher sensitivity to the kernel width

h

. As shown by the prominent vertical dark-blue band, the optimal performance is strictly concentrated around

h = 1.0 σ_{0}

. Extreme deviations, such as an excessively small (

0.1 σ_{0}

) or large (

2.0 σ_{0}

) kernel width, result in a severe performance decline. This behavior perfectly aligns with the theoretical expectations, as

h

directly dictates the resolution of the nonlinear mapping

ϕ (x)

; a value well matched to the default data distribution scale

σ_{0}

is essential for maintaining a discriminative neighborhood structure in the kernel space. In summary, the sensitivity analysis confirms that KMOLNN maintains superior and stable classification performance within an expansive hyperparameter search space, proving its reliability for complex real-world classification tasks where prior knowledge of optimal parameters may be limited.

5. Conclusions

This study proposes the Kernelized Manifold-Optimized Linear k-Nearest Neighbors method to address the limitations of traditional linear KNN on nonlinear distributed data. By integrating kernel mapping with adaptive manifold-preserving regularization, the model effectively captures complex data structures while restoring the physical interpretability of representation weights. The theoretical analysis confirms the existence of a nearest neighbor group effect and demonstrates the model’s intrinsic connection to the Bayesian decision rule. The experimental results on all adopted datasets confirm that the proposed KMOLNN method achieves an enhanced generalization capability compared to the comparative methods.

Further work related to this study may focus on the following aspects: (1) The proposed KMOLNN method has a high time consumption because it relies on alternating optimization; new optimization methods will be developed to achieve a faster optimization speed. (2) The iterative alternating optimization provides interpretability but limits inference speed; future work could explore Deep Unfolding Networks [30] to map the iterations of the proposed optimization algorithm into layers of a deep neural network. In addition, while the proposed model provides improved transparency through its learned weight distribution, its interpretability is primarily geometric (reflecting manifold neighborhoods) rather than providing semantic feature-level explanations.

Author Contributions

J.Z.: Conceptualization, Validation, Methodology and Software. Z.B.: Methodology and Writing—review and editing. L.Z.: Conceptualization, Methodology, Writing—original draft, and Software. F.W.: Methodology and Writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62306126, and the Jiangsu Province Youth Science and Technology Talent Support Project, grant number JSTJ2024283.

Data Availability Statement

Publicly available datasets were analyzed in this study. These datasets can be found in the UCI Machine Learning Repository (https://archive.ics.uci.edu, accessed on 1 June 2026) and the KEEL Dataset Repository (https://sci2s.ugr.es/keel/datasets.php, accessed on 1 June 2026).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could appear to have influenced the work reported on in this paper.

Appendix A

Proof of Theorem 1.

We aim to prove that the objective function possesses the grouping effect; that is, for two samples $x_{i}$ and $x_{j}$ belonging to the same class, when their similarities to the testing sample $y$ are close (i.e., $d_{i} \approx d_{j}$), the optimized weights satisfy $p_{i}^{*} \approx p_{j}^{*}$.

With a fixed $A$, the objective function is

$$J (p) = {‖ϕ (y) - \sum_{i = 1}^{m} p_{i} ϕ (x_{i})‖}^{2} + λ_{1} \sum_{i = 1}^{m} p_{i} \log p_{i} + λ_{2} \sum_{i = 1}^{m} {(p_{i} - β d_{i})}^{2} + λ_{3} p^{T} L (A) p \tag{A1}$$

Taking the partial derivative of $J$ with respect to $p_{i}$, we obtain

$$\frac{\partial J}{\partial p_{i}} = 2 \left(\sum_{j = 1}^{m} K (x_{i}, x_{j}) p_{j} - K (y, x_{i})\right) + λ_{1} (\log p_{i} + 1) + 2 λ_{2} (p_{i} - β d_{i}) + 2 λ_{3} {[L (A) p]}_{i} \tag{A2}$$

At the optimal solution, $\frac{\partial J}{\partial p_{i}^{*}} = 0$. For $x_{i}$ and $x_{j}$ of the same class, when $K (y, x_{i}) \approx K (y, x_{j})$ and $d_{i} \approx d_{j}$, we calculate the difference between their gradients at the optimal weights $p^{*}$:

$$\frac{\partial J}{\partial p_{i}^{*}} - \frac{\partial J}{\partial p_{j}^{*}} = 0 \tag{A3}$$

Writing $k_{i}$ for the $i$-th column of the kernel matrix, so that ${k_{i}}^{T} p = \sum_{l = 1}^{m} K (x_{i}, x_{l}) p_{l}$, and $k_{y i} = K (y, x_{i})$, we get

$$2 {(k_{i} - k_{j})}^{T} p^{*} + 2 (k_{y j} - k_{y i}) + λ_{1} \log \frac{p_{i}^{*}}{p_{j}^{*}} + 2 λ_{2} (p_{i}^{*} - p_{j}^{*} - β (d_{i} - d_{j})) + 2 λ_{3} ({[L (A) p^{*}]}_{i} - {[L (A) p^{*}]}_{j}) = 0 \tag{A4}$$

Assuming $k_{i} \approx k_{j}$, $k_{y i} \approx k_{y j}$, and $d_{i} \approx d_{j}$, and neglecting the difference between the Laplacian terms (since strongly connected intra-class samples have negligible penalty differences), we obtain

$$λ_{1} \log \frac{p_{i}^{*}}{p_{j}^{*}} + 2 λ_{2} (p_{i}^{*} - p_{j}^{*}) \approx 0 \tag{A5}$$

Let $u = p_{i}^{*} / p_{j}^{*}$; then $p_{i}^{*} - p_{j}^{*} = p_{j}^{*} (u - 1)$. Substituting this into (A5) gives

$$λ_{1} \log u + 2 λ_{2} p_{j}^{*} (u - 1) \approx 0 \tag{A6}$$

When $u$ is close to 1 ($u \approx 1$), we can apply the first-order Taylor approximation $\log u \approx u - 1$. Therefore,

$$(u - 1) (λ_{1} + 2 λ_{2} p_{j}^{*}) \approx 0 \tag{A7}$$

Since the coefficient $λ_{1} + 2 λ_{2} p_{j}^{*} > 0$, we must have $u \approx 1$, namely $p_{i}^{*} \approx p_{j}^{*}$. The same conclusion holds without the linearization: for $u > 0$, $\log u$ and $u - 1$ always share the same sign, so for positive $λ_{1}$, $λ_{2}$, and $p_{j}^{*}$ the left-hand side of (A6) is strictly increasing in $u$ and vanishes only at $u = 1$.

Hence, the regularization term $λ_{2} \sum_{i = 1}^{m} {(p_{i} - β d_{i})}^{2}$ ensures that when $d_{i} \approx d_{j}$, the assigned weights satisfy $p_{i}^{*} \approx p_{j}^{*}$. Furthermore, the Laplacian term $λ_{3} p^{T} L (A) p$ enhances this effect by explicitly encouraging connected samples to have similar weights. Therefore, the objective function exhibits a strong grouping effect. □
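The limiting argument in (A5)–(A7) can be checked numerically. The following is a minimal sketch, not the authors' implementation: it keeps the $β (d_{i} - d_{j})$ term from (A4), drops the kernel and Laplacian differences as in (A5), and solves the resulting scalar equation $λ_{1} \log u + 2 λ_{2} p_{j}^{*} (u - 1) = 2 λ_{2} β (d_{i} - d_{j})$ for the weight ratio $u$ by bisection. The function names and the parameter values are illustrative assumptions.

```python
import math

def lhs(u, lam1, lam2, p_j):
    """Left-hand side of (A6): lam1*log(u) + 2*lam2*p_j*(u - 1)."""
    return lam1 * math.log(u) + 2.0 * lam2 * p_j * (u - 1.0)

def solve_ratio(lam1, lam2, p_j, beta, d_gap, lo=1e-6, hi=1e6, iters=200):
    """Solve lhs(u) = 2*lam2*beta*d_gap for u > 0 by bisection.

    lhs is strictly increasing in u when lam1, lam2, p_j > 0,
    so the root is unique. d_gap stands for d_i - d_j.
    """
    rhs = 2.0 * lam2 * beta * d_gap
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint suits the wide bracket
        if lhs(mid, lam1, lam2, p_j) > rhs:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)

# Hypothetical parameter values, chosen only for illustration.
lam1, lam2, p_j, beta = 0.1, 1.0, 0.05, 1.0
for d_gap in (0.0, 0.001, 0.01):
    u = solve_ratio(lam1, lam2, p_j, beta, d_gap)
    # First-order prediction from the linearization used in (A7).
    u_lin = 1.0 + 2.0 * lam2 * beta * d_gap / (lam1 + 2.0 * lam2 * p_j)
    print(f"d_i - d_j = {d_gap}: u = {u:.6f}, linearized u = {u_lin:.6f}")
```

Under these assumptions the ratio $u$ equals 1 when $d_{i} = d_{j}$ and moves away from 1 continuously as the gap grows, which is the behavior the grouping effect describes.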

References

1. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
2. Altman, N.S. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. Am. Stat. 1992, 46, 175–185.
3. Jiang, J.; Wu, J.; Luo, J.; Meng, X.; Qian, L.; Li, K. KATSA: KNN Ameliorated Tree Seed Algorithm for complex optimization problems. Expert Syst. Appl. 2025, 280, 127465.
4. Gou, J.; Qiu, W.; Yi, Z.; Shen, X.; Zhan, Y.; Ou, W. Locality constrained representation-based K-nearest neighbor classification. Knowl. Based Syst. 2019, 167, 38–52.
5. He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval; ACM: New York, NY, USA, 2020.
6. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 1996, 58, 267–288.
7. Zhong, L.W.; Kwok, J.T. Efficient sparse modeling with automatic feature grouping. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1436–1447.
8. Wang, J.; Yang, J.; Yu, K.; Lv, F.; Huang, T.; Gong, Y. Locality-constrained linear coding for image classification. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2010; pp. 3360–3367.
9. Liu, Q.; Liu, C. A novel locally linear KNN method with applications to visual recognition. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2010–2021.
10. Xu, Y.-L.; Chen, S.; Luo, B. A weighted locally linear KNN model for image recognition. In Proceedings of the CCF Chinese Conference on Computer Vision; Springer: Singapore, 2017; pp. 567–578.
11. Zhang, S.; Zong, M.; Sun, K.; Liu, Y.; Cheng, D. Efficient kNN algorithm based on graph sparse reconstruction. In Proceedings of the International Conference on Advanced Data Mining and Applications; Springer: Cham, Switzerland, 2014; pp. 356–369.
12. Zhang, S.; Li, J. KNN classification with one-step computation. IEEE Trans. Knowl. Data Eng. 2021, 35, 2711–2723.
13. Cao, J.; Li, Z.; Li, J. Financial time series forecasting model based on CEEMDAN and LSTM. Phys. A Stat. Mech. Its Appl. 2019, 519, 127–139.
14. Min, S.; Lee, B.; Yoon, S. Deep learning in bioinformatics. Brief. Bioinform. 2017, 18, 851–869.
15. Bian, Z.; Zhang, J.; Chung, F.L.; Wang, S. Residual Sketch Learning for a Feature-Importance-Based and Linguistically Interpretable Ensemble Classifier. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 10461–10474.
16. Venkata Krishna Reddy, V.; Vijaya Kumar Reddy, R.; Siva Krishna Munaga, M.; Karnam, B.; Maddila, S.K.; Sekhar Kolli, C. Deep learning-based credit card fraud detection in federated learning. Expert Syst. Appl. 2024, 255, 124493.
17. Zhang, S.; Li, X.; Zong, M.; Zhu, X.; Wang, R. Efficient kNN Classification With Different Numbers of Nearest Neighbors. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 1774–1785.
18. McDonald, G.C. Ridge regression. Wiley Interdiscip. Rev. Comput. Stat. 2010, 1, 93–100.
19. Liu, Q.; Liu, C. A novel locally linear KNN model for visual recognition. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2015; pp. 1329–1337.
20. Zhao, Y.; Liu, Y.; Liu, X.; Zou, E. Local Centroid Distance Constrained Representation-Based K-Nearest Neighbor Classifier. In Proceedings of the China Conference on Wireless Sensor Networks; Springer: Singapore, 2020.
21. Li, B.; Chen, Y.W.; Chen, Y.Q. The Nearest Neighbor Algorithm of Local Probability Centers. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2008, 38, 141–154.
22. Mullick, S.S.; Datta, S.; Das, S. Adaptive Learning-Based k-Nearest Neighbor Classifiers With Resilience to Class Imbalance. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5713–5725.
23. Wang, J.; Neskovic, P.; Cooper, L.N. Neighborhood size selection in the k-nearest-neighbor rule using statistical confidence. Pattern Recognit. 2006, 39, 417–423.
24. Manocha, S.; Girolami, M.A. An empirical analysis of the probabilistic K-nearest neighbour classifier. Pattern Recognit. Lett. 2007, 28, 1818–1824.
25. Cheng, D.; Zhang, S.; Deng, Z.; Zhu, Y.; Zong, M. kNN algorithm with data-driven k value. In Proceedings of the International Conference on Advanced Data Mining and Applications; Springer: Cham, Switzerland, 2014; pp. 499–512.
26. Zhang, J.; Bian, Z.; Wang, S. Shared style linear k nearest neighbor classification method. Expert Syst. Appl. 2024, 241, 122702.
27. Fan, Z.; Huang, Y.; Xi, C.; Liu, Q. Multiview Adaptive K-Nearest Neighbor Classification. IEEE Trans. Artif. Intell. 2024, 5, 1221–1234.
28. Zhang, Z.; Lai, Z.; Xu, Y.; Shao, L.; Wu, J.; Xie, G.S. Discriminative Elastic-Net Regularized Linear Regression. IEEE Trans. Image Process. 2017, 26, 1466–1481.
29. Ortega, A.; Frossard, P.; Kovačević, J.; Moura, J.M.F.; Vandergheynst, P. Graph Signal Processing: Overview, Challenges, and Applications. Proc. IEEE 2018, 106, 808–828.
30. Monga, V.; Li, Y.; Eldar, Y.C. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. IEEE Signal Process. Mag. 2021, 38, 18–44.
31. Bousquet, O.; Elisseeff, A. Stability and generalization. J. Mach. Learn. Res. 2002, 2, 499–526.
32. Bartlett, P.L.; Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 2002, 3, 463–482.
33. Sultana, T.; Dumitrescu, S. Globally optimal max-min rate joint channel and power allocation for hybrid NOMA-OMA downlink systems. IEEE Trans. Signal Process. 2025, 73, 1674–1690.
34. Bian, Z.; Chung, F.-L.; Wang, S. Enhanced fuzzy random forest by using doubly randomness and copying from dynamic dictionary attributes. IEEE Trans. Fuzzy Syst. 2022, 30, 4369–4383.
35. Xie, J.; Girshick, R.; Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2016; pp. 478–487.
36. Keller, J.M.; Gray, M.R.; Givens, J.A. A fuzzy K-nearest neighbor algorithm. IEEE Trans. Syst. Man Cybern. 1985, SMC-15, 580–585.
37. Amer, A.A.; Ravana, S.D.; Habeeb, R.A.A. Effective k-nearest neighbor models for data classification enhancement. J. Big Data 2025, 12, 86.
38. Li, G.; Jung, J.J. Deep learning for anomaly detection in multivariate time series: Approaches, applications, and challenges. Inf. Fusion 2023, 91, 93–102.
39. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30.

Figure 1. Visualization of the linear and nonlinear models on nonlinearly distributed data. (a) Reconstruction of linearly distributed data using a linear model. (b) Failure of linear reconstruction on a quadratic manifold. (c) Failure of linear reconstruction on a sine manifold. (d) Accurate reconstruction of nonlinear data via kernel mapping.

Figure 2. Framework diagram of the proposed KMOLNN method.

Figure 3. Visualization of KMOLNN’s interpretability on the Iris dataset. (a) Heatmap of the learned adaptive adjacency matrix $A$, showing the preserved block-diagonal manifold structure. (b) Learned probability weights $w$ for a test sample, demonstrating the grouping effect where same-class neighbors receive similar and higher weights.

Figure 4. Parameter sensitivity analysis of the KMOLNN algorithm on the Sonar dataset. (a) Accuracy surface with respect to reconstruction weight and entropy regularization. (b) Heatmap of accuracy varying with manifold weight and kernel width.

Table 1. Summary of key notations.

Symbol	Description
$X = \{x_{1}, \dots, x_{n}\}$	The set of training samples
$y$	The testing sample
$w = {[w_{1}, \dots, w_{n}]}^{T}$	The representation weight vector for the testing sample
$ϕ (\cdot)$	The nonlinear mapping function (kernel mapping)
$K (x_{i}, x_{j})$	The kernel function (e.g., Gaussian kernel)
$L \in ℝ^{n \times n}$	The Laplacian matrix
$A \in ℝ^{n \times n}$	The adaptive adjacency matrix
$D \in ℝ^{n \times n}$	The degree matrix of the graph
$λ, η, γ$	Regularization coefficients for different terms
$h$	The hyperparameter controlling the kernel width

Table 2. Comparative analysis of KMOLNN and related linear KNN variants.

Method	Objective Function	Kernel Usage	Graph/Manifold Reg.	Theoretical Basis
LLKNN	Distance constrained	No	No	Bayesian rule
WLMRKNN	Local mean repr.	No	No	Weighted neighbors
GS-KNN	Sparse repr.	No	Sparse graph	Correlation matrix
MVAKNN	Multi-view adaptive	Yes	Laplacian reg.	Multi-view learning
KMOLNN (Ours)	Kernel-based error	Yes	Adaptive Laplacian	Group effect & Bayesian

Table 3. Objective functions and parameter settings of all 9 adopted methods.

Methods	Objective Functions	Prediction Rules
KNN	-	-
FKNN [36]	-	-
PRKNN [37]	-	-
Elastic Net [7]	$\min_{w} {‖X w - y‖}^{2} + λ_{1} {‖w‖}_{1} + λ_{2} {‖w‖}^{2}$	$l^{'} = \underset{c_{j} \in C}{\arg \max} \sum_{i = 1}^{k} w_{i}^{(c_{j})}$
WLMRKNN [4]	$\min_{w^{c_{j}}} {‖{\bar{X}}_{k N}^{c_{j}} w^{c_{j}} - y‖}_{2}^{2} + λ {‖T^{c_{j}} w^{c_{j}}‖}_{2}^{2}$	$\begin{array}{l} {\bar{r}}_{k}^{c_{j}} (y) = {‖y - {\bar{X}}_{k N}^{c_{j}} w^{c_{j}}‖}_{2}^{2} \\ l^{'} = \underset{c_{j} \in C}{\arg \min} ({\bar{r}}_{k}^{c_{j}} (y)) \end{array}$
LCDR-KNN [20]	$\min_{S} {‖\tilde{X} S - y‖}_{2}^{2} + λ_{1} {‖S‖}_{2}^{2} + λ_{2} \sum_{j = 1}^{N} s_{j}^{2} {‖y - u_{j}‖}_{2}$	$l^{'} = \underset{c_{i} \in C}{\arg \max} (\sum_{h = 1}^{k} s_{h}^{(c_{i})})$
SSL-KNN [26]	$\begin{array}{l} \min_{W, μ, S_{k}} \sum_{p = 1}^{r} {‖\sum_{k = 1}^{c} μ_{k} X_{k} S_{k} w_{k}^{p} - y_{p}‖}^{2} + λ_{1} \sum_{p = 1}^{r} \sum_{k = 1}^{c} {‖w_{k}^{p}‖}_{1} + λ_{2} \prod_{p = 1}^{r} (\sum_{k = 1}^{c} {‖w_{k}^{p} - β d_{k}^{p}‖}^{2}) \\ + λ_{3} μ^{T} (1 - μ) + λ_{4} \sum_{k = 1}^{c} {‖S_{k} - I‖}_{F}^{2} \\ s . t . \sum_{k = 1}^{c} μ_{k} = 1, μ_{k} > 0 \end{array}$	$l^{'} = \arg \underset{k = 1}{\max^{c}} \{μ_{k} \sum_{p = 1}^{r} 1^{T} S_{(k)} w_{k}^{p}\}$
MVAKNN [27]	$\min_{W} {‖X W - X‖}_{F}^{2} + ρ_{1} {‖W‖}_{1} + ρ_{2} Tr (W^{T} X^{T} L X W)$	$f = \arg \max_{i} p_{t i}^{v_{1} + \dots + v_{λ}}$
KMOLNN	$\begin{array}{l} \min_{p, A} {‖ϕ (y) - \sum_{i = 1}^{m} p_{i} ϕ (x_{i})‖}^{2} + λ_{1} \sum_{i = 1}^{m} p_{i} \log p_{i} + λ_{2} \sum_{i = 1}^{m} {‖p_{i} - β d_{i}‖}^{2} \\ + λ_{3} [\sum_{i, j = 1}^{m} p^{T} L (A) p + φ \sum_{i, j} a_{i j} {‖x_{i} - x_{j}‖}^{2}] + λ_{4} \sum_{i, j} a_{i j}^{2} \end{array}$	$l^{'} = \arg \max_{c} \sum_{b_{i} \in B_{c}} p_{i} k (x, y_{i})$

Table 4. Summary of the comparative baseline methods.

Method	Year	Key Idea	Kernelization	Manifold Learning	Adaptive Weighting
KNN [1]	1967	Majority voting based on Euclidean distance	No	No	No
FKNN [36]	1985	Fuzzy membership-based weighted voting	No	No	No
Elastic Net [7]	2012	Sparse modeling with $L_{1}$ and $L_{2}$ regularization	No	No	No
WLMRKNN [4]	2019	Local mean representation with weighted neighbors	No	No	Yes
LCDR-KNN [20]	2020	Local centroid distance-constrained representation	No	No	Yes
SSL-KNN [26]	2024	Shared style multi-view linear classification	Yes	No	Yes
MVAKNN [27]	2024	Multi-view adaptive $k$ -nearest neighbor classification	Yes	Yes	Yes
PRKNN [37]	2025	Proximal ratio-based noise and overlap handling	No	No	No
KMOLNN	2026	Kernelized manifold-optimized linear KNN	Yes	Yes	Yes

Table 5. Summary of the 15 benchmark datasets used in the experiments.

No.	Datasets	Number of Samples	Number of Dimensions	Number of Classes
1	Sonar	208	60	2
2	Wine	178	13	3
3	Iris	150	4	3
4	Breast Cancer Wisconsin	569	30	2
5	Pima Indians Diabetes	768	8	2
6	Glass Identification	214	9	6
7	Ionosphere	351	34	2
8	Heart Disease (Cleveland)	303	13	2
9	Vowel	990	10	11
10	Ecoli	336	7	8
11	Yeast	1484	8	10
12	Pendigits	10,992	16	10
13	Satimage	6435	36	6
14	Vehicle Silhouettes	846	18	4
15	Letter Recognition	20,000	16	26

Table 6. Average testing accuracies of all adopted methods on all adopted datasets.

Dataset	KNN	FKNN	PRKNN	Elastic Net	LCDR-KNN	WLMRKNN	SSL-KNN	MVAKNN	KMOLNN
Sonar	0.8262	0.8548	0.7643	0.8429	0.8857	0.8786	0.8929	0.9048	0.9071
Wine	0.9667	0.9861	0.9944	0.9833	0.9889	0.9917	0.9889	0.9944	0.9917
Iris	0.9600	0.9733	0.9533	0.9667	0.9733	0.9733	0.9800	0.9867	0.9733
Breast Cancer Wisconsin	0.9561	0.9754	0.9684	0.9719	0.9781	0.9807	0.9798	0.9851	0.9754
Pima Indians Diabetes	0.7247	0.7597	0.7649	0.7779	0.7682	0.7721	0.7753	0.7818	0.7948
Glass Identification	0.7047	0.7442	0.6209	0.7372	0.7512	0.7605	0.7651	0.7860	0.7977
Ionosphere	0.8629	0.9057	0.8857	0.8914	0.9186	0.9257	0.9357	0.9414	0.9529
Heart Disease	0.8148	0.8475	0.8377	0.8443	0.8623	0.8754	0.8705	0.8902	0.9115
Vowel	0.9051	0.9449	0.8942	0.9212	0.9621	0.9581	0.9601	0.9848	0.9747
Ecoli	0.8119	0.8507	0.7851	0.8448	0.8657	0.8716	0.8687	0.8910	0.9045
Yeast	0.5650	0.6051	0.5822	0.5980	0.6209	0.6350	0.6300	0.6519	0.6721
Pendigits	0.9720	0.9810	0.8950	0.9780	0.9920	0.9890	0.9900	0.9910	0.9950
Satimage	0.9060	0.9200	0.8420	0.9150	0.9350	0.9380	0.9420	0.9390	0.9470
Vehicle Silhouettes	0.7249	0.7852	0.6479	0.7621	0.8012	0.8148	0.8201	0.8450	0.8337
Letter Recognition	0.9420	0.9580	0.7650	0.9510	0.9650	0.9700	0.9720	0.9780	0.9830
W/T/L (Ours vs. competing)	15/0/0	13/2/0	14/0/1	15/0/0	13/1/1	12/2/1	13/0/2	10/0/5	-

Note: The best results for each dataset are highlighted in bold.
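The win/tie/loss (W/T/L) counts in the last row of Table 6 can be recomputed directly from the accuracy columns. The sketch below is an illustration rather than the authors' evaluation script; the helper name `win_tie_loss` and the tie tolerance are assumptions, and the two lists copy the KMOLNN and MVAKNN columns of Table 6 in dataset order.

```python
def win_tie_loss(ours, other, tol=1e-9):
    """Count datasets where `ours` is higher, equal (within tol), or lower."""
    wins = sum(a > b + tol for a, b in zip(ours, other))
    ties = sum(abs(a - b) <= tol for a, b in zip(ours, other))
    losses = len(ours) - wins - ties
    return wins, ties, losses

# Accuracy columns from Table 6 (Sonar ... Letter Recognition).
kmolnn = [0.9071, 0.9917, 0.9733, 0.9754, 0.7948, 0.7977, 0.9529, 0.9115,
          0.9747, 0.9045, 0.6721, 0.9950, 0.9470, 0.8337, 0.9830]
mvaknn = [0.9048, 0.9944, 0.9867, 0.9851, 0.7818, 0.7860, 0.9414, 0.8902,
          0.9848, 0.8910, 0.6519, 0.9910, 0.9390, 0.8450, 0.9780]

print(win_tie_loss(kmolnn, mvaknn))  # (10, 0, 5), matching the MVAKNN column
```

Equal tabulated accuracies are counted as ties here, which is the same convention that produces the fractional ranks in Table 9.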

Table 7. Average testing Macro F1-scores of all adopted methods on all adopted datasets.

Dataset	KNN	FKNN	PRKNN	Elastic Net	LCDR-KNN	WLMRKNN	SSL-KNN	MVAKNN	KMOLNN
Sonar	0.8143	0.8476	0.7571	0.8357	0.8786	0.8714	0.8857	0.9000	0.9000
Wine	0.9639	0.9833	0.9944	0.9833	0.9889	0.9889	0.9861	0.9861	0.9944
Iris	0.9567	0.9733	0.9533	0.9667	0.9733	0.9733	0.9800	0.9867	0.9867
Breast Cancer Wisconsin	0.9482	0.9693	0.9623	0.9649	0.9719	0.9763	0.9754	0.9807	0.9860
Pima Indians Diabetes	0.6851	0.7299	0.7240	0.7448	0.7383	0.7448	0.7500	0.7578	0.7669
Glass Identification	0.6256	0.6953	0.5070	0.6814	0.7093	0.7256	0.7302	0.7558	0.7323
Ionosphere	0.8457	0.8914	0.8714	0.8786	0.9057	0.9143	0.9286	0.9357	0.9286
Heart Disease	0.7951	0.8328	0.8197	0.8279	0.8443	0.8607	0.8574	0.8754	0.8393
Vowel	0.8848	0.9318	0.9220	0.9081	0.9551	0.9480	0.9520	0.9798	0.9712
Ecoli	0.7522	0.8104	0.6851	0.7955	0.8254	0.8343	0.8299	0.8627	0.8761
Yeast	0.5051	0.5650	0.4822	0.5519	0.5801	0.6051	0.5949	0.6249	0.6360
Pendigits	0.9680	0.9790	0.8820	0.9750	0.9900	0.9860	0.9880	0.9890	0.9730
Satimage	0.8850	0.9050	0.8150	0.8980	0.9200	0.9250	0.9300	0.9280	0.9470
Vehicle Silhouettes	0.7018	0.7680	0.6148	0.7450	0.7852	0.8018	0.8083	0.8320	0.8278
Letter Recognition	0.9380	0.9540	0.7420	0.9480	0.9610	0.9680	0.9700	0.9760	0.9670

Note: The best results for each dataset are highlighted in bold.

Table 8. Average running times (seconds) of all adopted methods on all adopted datasets.

Dataset	KNN	FKNN	PRKNN	Elastic Net	LCDR-KNN	WLMRKNN	SSL-KNN	MVAKNN	KMOLNN
Sonar	0.0012	0.0015	0.0030	0.0255	0.0312	0.0280	4.9155	0.0595	2.1886
Wine	0.0010	0.0013	0.0027	0.0210	0.0291	0.0255	3.1367	0.0528	2.5487
Iris	0.0079	0.0011	0.0023	0.0207	0.0239	0.0211	3.8646	0.0492	1.9123
Breast Cancer	0.0015	0.0018	0.0037	0.0297	0.0368	0.0340	5.0191	0.0712	2.9516
Pima Indians	0.0018	0.0021	0.0046	0.0363	0.0440	0.0407	5.7688	0.0807	3.5123
Glass	0.0011	0.0014	0.0028	0.0221	0.0307	0.0282	4.9450	0.0550	2.0859
Ionosphere	0.0013	0.0016	0.0031	0.0274	0.0333	0.0304	4.0524	0.0621	2.3719
Heart Disease	0.0014	0.0018	0.0033	0.0267	0.0347	0.0333	4.0825	0.0702	2.3566
Vowel	0.0052	0.0038	0.0079	0.0411	0.0491	0.0469	6.7292	0.0936	4.1845
Ecoli	0.0016	0.0020	0.0038	0.0333	0.0402	0.0380	7.7151	0.0794	3.6233
Yeast	0.0054	0.0058	0.0118	0.0510	0.0641	0.0575	9.4203	0.1163	5.6277
Pendigits	0.0148	0.0209	0.0405	0.1009	0.1235	0.1155	19.9106	0.2368	10.3476
Satimage	0.0129	0.0149	0.0304	0.0904	0.1168	0.1033	16.9335	0.2273	8.7759
Vehicle	0.0011	0.0027	0.0057	0.0443	0.0581	0.0545	8.1484	0.1087	4.3475
Letter Rec.	0.0174	0.0204	0.0420	0.1474	0.1911	0.1722	40.7081	0.3631	17.2134

Table 9. Ranking values of all adopted methods on all adopted datasets.

Dataset	KNN	FKNN	PRKNN	Elastic Net	LCDR-KNN	WLMRKNN	SSL-KNN	MVAKNN	KMOLNN
Sonar	8	6	9	7	4	5	3	2	1
Wine	9	7	1.5	8	5.5	3.5	5.5	1.5	3.5
Iris	8	4.5	9	7	4.5	4.5	2	1	4.5
Breast Cancer Wisconsin	9	5.5	8	7	4	2	3	1	5.5
Pima Indians Diabetes	9	8	7	3	6	5	4	2	1
Glass Identification	8	6	9	7	5	4	3	2	1
Ionosphere	9	6	8	7	5	4	3	2	1
Heart Disease	9	6	8	7	5	3	4	2	1
Vowel	8	6	9	7	3	5	4	1	2
Ecoli	8	6	9	7	5	3	4	2	1
Yeast	9	6	8	7	5	3	4	2	1
Pendigits	8	6	9	7	2	5	4	3	1
Satimage	8	6	9	7	5	4	2	3	1
Vehicle Silhouettes	8	6	9	7	5	4	3	1	2
Letter Recognition	8	6	9	7	5	4	3	2	1
Average ranking	8.4	6.07	8.1	6.8	4.6	3.93	3.43	1.83	1.83
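The fractional entries in Table 9 arise because methods with equal accuracy share the mean of the rank positions they occupy. A minimal sketch of this tie-aware ranking is given below, using the Wine row of Table 6 as input; the function name `average_ranks` is an assumption made for illustration.

```python
def average_ranks(scores):
    """Rank scores in descending order (1 = best).

    Tied scores share the mean of the positions they occupy.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2.0  # mean of 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

# Wine accuracies from Table 6, in column order:
# KNN, FKNN, PRKNN, Elastic Net, LCDR-KNN, WLMRKNN, SSL-KNN, MVAKNN, KMOLNN
wine = [0.9667, 0.9861, 0.9944, 0.9833, 0.9889, 0.9917, 0.9889, 0.9944, 0.9917]
print(average_ranks(wine))
# [9.0, 7.0, 1.5, 8.0, 5.5, 3.5, 5.5, 1.5, 3.5]
```

The output reproduces the Wine row of Table 9. Averaging such rows over the 15 datasets gives the final "Average ranking" row.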

Table 10. The experimental results for the Bonferroni–Dunn test between the proposed KMOLNN method and all comparative methods.

Method	Average Rank Difference	CD	Null Hypothesis
KNN/KMOLNN	6.57	2.724	Reject
FKNN/KMOLNN	4.24	2.724	Reject
PRKNN/KMOLNN	6.27	2.724	Reject
Elastic Net/KMOLNN	4.97	2.724	Reject
LCDR-KNN/KMOLNN	2.77	2.724	Reject
WLMRKNN/KMOLNN	2.10	2.724	Not rejected
SSL-KNN/KMOLNN	1.60	2.724	Not rejected
MVAKNN/KMOLNN	0.00	2.724	Not rejected
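The critical difference (CD) in Table 10 follows the Bonferroni–Dunn formula described in [39], $CD = q_{α} \sqrt{k (k + 1) / (6 N)}$, where $k$ is the number of methods and $N$ the number of datasets. The sketch below is a hedged illustration: the value $q_{α} = 2.724$ is assumed because it reproduces the tabulated CD for $k = 9$ and $N = 15$, and the average ranks are copied from the last row of Table 9.

```python
import math

def bonferroni_dunn_cd(q_alpha, k, n):
    """Critical difference for comparing k methods over n datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# Average ranks from Table 9.
avg_rank = {
    "KNN": 8.4, "FKNN": 6.07, "PRKNN": 8.1, "Elastic Net": 6.8,
    "LCDR-KNN": 4.6, "WLMRKNN": 3.93, "SSL-KNN": 3.43, "MVAKNN": 1.83,
}
kmolnn_rank = 1.83

# q_alpha = 2.724 is an assumption chosen to match the CD reported in Table 10.
cd = bonferroni_dunn_cd(2.724, k=9, n=15)

# The null hypothesis is rejected when the rank gap exceeds the CD.
rejected = [name for name, r in avg_rank.items() if r - kmolnn_rank > cd]
print(f"CD = {cd:.3f}")
print("Rejected:", rejected)
```

With $k = 9$ and $N = 15$ the square-root factor equals 1, so the CD coincides with $q_{α}$. Only the first five methods exceed it, which agrees with the decisions listed in Table 10.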

Table 11. Ablation study results isolating the impact of kernelization, manifold regularization (KKNN baseline), and the adaptive Laplacian matrix.

Dataset	Metric	KMOLNN (Linear)	KMOLNN (w/o Manifold)/KKNN	KMOLNN (Fixed Graph)	KMOLNN (Full)
Sonar	Accuracy	0.8125	0.8654	0.881	0.9071
Sonar	Macro F1	0.795	0.851	0.875	0.900
Ionosphere	Accuracy	0.8714	0.9125	0.931	0.9529
Ionosphere	Macro F1	0.858	0.895	0.91	0.9286
Wine	Accuracy	0.9611	0.9722	0.9833	0.9917
Wine	Macro F1	0.958	0.975	0.985	0.9944

Share and Cite

MDPI and ACS Style

Zhang, J.; Bian, Z.; Zhang, L.; Wang, F. Kernelized Manifold-Optimized Linear KNN for Nonlinear Data Classification. Electronics 2026, 15, 2572. https://doi.org/10.3390/electronics15122572
