Article

Kernelized Manifold-Optimized Linear KNN for Nonlinear Data Classification

1
School of Computer Engineering, Suzhou Polytechnic University, Suzhou 215104, China
2
School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai 200240, China
3
School of AI & Computer Science, Jiangnan University, Wuxi 214122, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(12), 2572; https://doi.org/10.3390/electronics15122572
Submission received: 30 March 2026 / Revised: 18 May 2026 / Accepted: 2 June 2026 / Published: 10 June 2026
(This article belongs to the Special Issue Multimodal Learning for Multimedia Content Analysis and Understanding)

Abstract

In sparse representation learning-based linear k-nearest neighbors methods, the linear representation assumption frequently fails when applied to nonlinear distributed data, leading to degraded generalization and a loss of physical interpretability. To address this, we propose the Kernelized Manifold-Optimized Linear Nearest Neighbor (KMOLNN) method. Methodologically, KMOLNN projects the data into a high-dimensional kernel space to capture the nonlinear relationships, while introducing an adaptive manifold-preserving regularization term—via an adaptive Laplacian matrix—to dynamically preserve the local geometric structures. Theoretically, this study provides a mathematical proof of the nearest neighbor group effect for the kernel framework and reveals that its weight optimization behavior implicitly implements the Bayesian decision rule. Furthermore, we derive a rigorous generalization error bound using Rademacher complexity to validate its theoretical robustness. Empirically, we evaluate KMOLNN on 15 small-to-medium-scale benchmark datasets against eight comparative methods, including recent variants. The results demonstrate significant numeric superiority, with KMOLNN achieving an average accuracy of 90.76% and a Macro F1-score of 88.62% across the evaluated datasets. Finally, we present a comprehensive runtime analysis, explicitly acknowledging that these gains in generalization capability and theoretical interpretability present a practical trade-off, requiring increased computational runtime due to the iterative alternating optimization process.

1. Introduction

Due to its simplicity and interpretability, k-nearest neighbors (KNN) [1,2,3] is a fundamental machine learning method that predicts the label of a testing sample by majority voting over the labels of its nearest neighbors, and it has been successfully applied in many fields, such as pattern recognition [4] and recommendation systems [5]. However, the computational complexity of traditional KNN increases linearly with the scale of the dataset, resulting in significant challenges in both time and space consumption when handling large-scale datasets. Additionally, because of its simple mathematical foundation, traditional KNN cannot be directly improved by modifying a mathematical objective function, e.g., by introducing regularization terms to incorporate prior information.
To address these issues, and inspired by sparse representation learning, the linear KNN method [6,7,8,9,10,11,12] was developed to learn a linear sparse representation of each testing sample from its nearest neighbors, which can reduce the burden of naive distance-based searches in some settings and enable objective function-based improvements. However, when handling datasets containing nonlinearly distributed samples (e.g., financial market prediction datasets [13] and bioinformatics datasets [14]), the testing samples cannot be accurately represented by the training samples, which weakens the generalization capability. Specifically, assume a testing sample $y$ is reconstructed through a linear combination of training samples $X = [x_1, \ldots, x_n]$, i.e., $y \approx Xw$, where $w$ is the learned representation vector. If these samples do not satisfy the linear assumption, the reconstruction error $\|y - Xw\|_2^2$ will significantly increase, causing the representation $w$ to fail to accurately reflect the relationships between these samples. Additionally, the learned linear representation cannot provide accurate physical interpretability (e.g., sample/feature importance-based interpretability [15]).
To intuitively illustrate the motivation behind the proposed KMOLNN method, Figure 1 visualizes the differences in reconstruction capabilities between linear and nonlinear models under varying data distribution structures. In the figure, the blue, red, and green dots represent the training samples $X$, the true test sample $y_{test}$, and the reconstructed sample $\hat{y}$ calculated by the models, respectively.
As shown in Figure 1a, for linearly distributed data, the test sample lies within the subspace spanned by the training samples. A precise approximation is achieved via the linear combination $\hat{y} \approx \sum_i w_i x_i$, resulting in a negligible reconstruction error. However, for nonlinear manifold structures, such as the quadratic function $f(x) = x^2$ in Figure 1b or the sine function $f(x) = \sin(x)$ in Figure 1c, the linear assumption fails. Geometrically, a linear combination of points on a curve forms a secant that deviates from the manifold itself. This causes a significant geometric mismatch between the reconstruction $\hat{y}$ and the true sample $y_{test}$. To address this, Figure 1d demonstrates the efficacy of kernel mapping. By mapping data into a high-dimensional feature space $\mathcal{H}$, the manifold is “straightened” to satisfy linear representation conditions, allowing $\hat{y}$ to accurately converge to $y_{test}$. This validates the necessity of the proposed KMOLNN method for resolving nonlinear generalization challenges.
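The secant argument can be checked numerically. The following is a minimal sketch, not the authors' experiment: it assumes two hypothetical training points on $f(x) = x^2$ and weights constrained to sum to one (a convex combination, as probability weights would be), and it measures how far the reconstruction lands from the true test point.

```python
# Illustration only: reconstruct a test point on f(x) = x^2 from two neighbors.
# The points and the sum-to-one weight constraint are assumptions of this sketch.

def f(x):
    return x * x

# Two training samples on the parabola and a test sample between them.
x1, x2 = (1.0, f(1.0)), (3.0, f(3.0))
y_test = (2.0, f(2.0))

# Weights w1 + w2 = 1 chosen to match the first coordinate exactly.
w1 = (x2[0] - y_test[0]) / (x2[0] - x1[0])
w2 = 1.0 - w1

# The weighted combination lies on the secant through x1 and x2.
y_hat = (w1 * x1[0] + w2 * x2[0], w1 * x1[1] + w2 * x2[1])

# Squared reconstruction error ||y_test - y_hat||^2.
err = (y_test[0] - y_hat[0]) ** 2 + (y_test[1] - y_hat[1]) ** 2
print(y_hat, err)
```

The reconstruction sits at height 5 on the secant while the curve passes through height 4, so the error cannot be removed by any choice of convex weights over these two neighbors.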
In summary, when handling a nonlinear distributed dataset, the linear KNN method still has the following challenges that should be seriously addressed:
  • Because it relies on the linear representation assumption, the linear KNN method cannot represent the testing sample accurately when a dataset contains nonlinear relationships; the learned representation then loses its physical meaning, which weakens the generalization capability.
  • The simplification of the linear KNN method may neglect the complex structures in the data, weakening its generalization capability.
These challenges are particularly evident in real-world scenarios. For example, in financial market prediction, complex nonlinear relationships exist between stock price fluctuations and various factors, such as market sentiment and political events. Recent studies [16] have further demonstrated that deep learning-based frameworks can effectively capture these intricate patterns in high-stakes environments, such as federated credit card fraud detection. In bioinformatics, gene expression data exhibit intricate patterns, and their associations with diseases are often nonlinear. Similarly, in image recognition tasks, the relationships among an object’s shape, texture, and position are frequently nonlinear as well. In such domains, the nonlinearity present in the data can undermine the physical interpretability of the representation weight vector obtained through linear KNN optimization. When data deviate from linear assumptions—that is, when nonlinear or other complex relationships exist—the physical meaning of the optimized representation weight vector may be compromised in the following ways:
  • In linear models, each element of the weight vector represents the degree of influence of the corresponding feature on the target variable. If the data do not conform to the linear assumption, the physical meaning of the weights becomes ambiguous, and this way of interpreting them is no longer valid.
  • In nonlinear relationships, the interactions between different features lead to bias in the weights. Linear models cannot capture these complex interactions, so the optimized weights may become distorted and fail to accurately reflect the relationships between the features.
  • If nonlinear relationships exist in the data, using linear models for prediction may result in performance degradation. Linear models cannot accurately fit nonlinear relationships, potentially leading to significant errors when predicting new data.
To address the above challenges, this study proposes a Kernelized Manifold-Optimized Linear k-Nearest Neighbors method, aiming to overcome the limitations of linear KNN when handling nonlinear data while maintaining a practical runtime on the evaluated (small-to-medium scale) datasets. By introducing kernelization techniques, KMOLNN maps data onto a high-dimensional feature space to capture the nonlinear relationships, thereby avoiding the curse of dimensionality that would arise from directly increasing the dimensions of the original features. Furthermore, we optimize the weight allocation mechanism by mathematically proving the existence of the group effect of the nearest neighbors and its prediction behavior, connecting it to the Bayesian decision rule, which enhances its generalization capability and robustness on nonlinear datasets. In summary, the primary contribution of this work is algorithmic and empirical, supported by a robust theoretical motivation. Specifically, the main contributions of this paper are summarized as follows:
  • By incorporating a manifold-preserving regularization term with an adaptive Laplacian matrix, KMOLNN ensures that the data retain their original local structure after high-dimensional mapping, significantly enhancing its adaptability to nonlinear datasets and outperforming traditional weighted KNN methods.
  • This study provides, for the first time, a mathematical proof of the group effect of the nearest neighbors for the kernel LLK method, revealing the mechanism by which the model assigns higher weights to the training samples close to the testing sample, thereby offering greater theoretical depth compared to existing kernel KNN approaches.
  • Through mathematical derivation, KMOLNN implicitly implements the Bayesian decision rule in weight optimization, approximating the maximum a posteriori probability estimation, which enhances the robustness and interpretability of KMOLNN on nonlinear datasets.
  • The experimental results for all adopted datasets confirm that KMOLNN demonstrates enhanced generalization capability and acceptable runtime under our experimental protocol compared to the existing KNN variants.
The paper is organized as follows: Section 2 reviews the related work and preliminaries; Section 3 details the objective function of the proposed KMOLNN method and introduces the optimization process of the proposed KMOLNN method; Section 4 conducts the experimental analysis on the adopted datasets; and Section 5 presents the conclusion. To facilitate a better understanding of the subsequent mathematical derivations, the key notations used throughout this paper are summarized in Table 1.

2. Related Work and Preliminaries

2.1. Related Work

As a variant of the classical k-nearest neighbors method [1], the linear KNN method has received more and more attention in the field of machine learning. Recently, with the development of machine learning, many linear KNN methods [17] have been developed to achieve an enhanced generalization capability by introducing new learning strategies into the traditional linear KNN method. According to [4,17], the existing linear KNN methods can be divided into the following categories:
  • Some linear KNN methods [6,7] focus on introducing regularization terms to learn an enhanced representation. For instance, Tibshirani et al. [6] proposed the Least Absolute Shrinkage and Selection Operator (LASSO) to learn a sparse representation for each testing sample by introducing an $L_1$ regularization constraint on the representation vector. McDonald et al. [18] proposed Ridge regression to prevent overfitting and enhance the generalization capability by introducing an $L_2$ regularization constraint on the representation vector. Based on the LASSO, Zhong et al. [7] proposed the Elastic Net to learn a sparse and robust representation of each testing sample by introducing a combined $L_1 + L_2$ regularization constraint on the representation vector. Wang et al. [7] proposed the Locality-constrained Linear Coding (LLC) method, which establishes a robust connection between the linear representation weights and distance metrics, facilitating the acquisition of accurate weights and enabling precise classification. Then, Liu et al. [9,19] further developed the LLC method into the Local Linear KNN method (LLKNN), establishing a more stable relationship between the representation weights and distance metrics. Additionally, LLKNN provides a solid theoretical foundation and demonstrates that the Bayesian decision rule can interpret prediction behaviors.
  • Some linear KNN methods [4,10] focus on weighting the neighbors to achieve enhanced classification performance. For instance, Xu et al. [10] proposed the Weighted Local Linear KNN method (WLLKNN), which applies weighting to the $L_1$-norm regularization term to obtain prior weights of the nearest neighbors by weighting the linear representation vector. Gou et al. [4] proposed the Weighted Local Mean Representation-based k-nearest neighbor method (WLMRKNN). This approach utilizes the weights of k-local mean vectors from each class to constrain the representation coefficients. These weights provide more information for reconstructing the testing samples, thereby enabling a more accurate representation of testing samples. This method demonstrates reduced sensitivity to the choice of k-value, contributing to the enhanced robustness of the KNN algorithm. Zhao et al. [20] proposed the Local Centroid Distance-Constrained Representation-based KNN (LCDR-KNN), which enhances the KNN classifier accuracy by selecting the k-nearest training samples for each class, using their centroid distance as a constraint, and combining it with a collaborative representation to calculate the weights for each neighbor in the classification.
  • Some linear KNN methods [21,22] focus on adaptively selecting the k-value for each testing sample to achieve enhanced classification performance. For instance, Chen et al. [21] demonstrated that employing different k-values for different classes yields a better generalization performance than using a fixed k-value for all classes. Mullick et al. [22] utilized neural networks to learn the density information around testing samples from the training data, thereby determining the appropriate k-values. Zhang et al. [17] proposed using a decision tree method to predict the optimal k-value for testing samples. First, during the training phase, the optimal k-value for all samples is learned through sparse modeling. Then, using the training samples and the optimal k-value, a k-tree is constructed to rapidly predict the optimal k-value for the testing samples during prediction. Wang et al. [23] proposed adjusting the local k-value of testing samples based on confidence intervals. Manocha et al. [24] used Bayesian optimization to adaptively select and infer the optimal k-value for testing samples from training samples. Cheng et al. [25] proposed a sparse learning-based KNN (S-KNN), which introduces a correlation matrix between the training and testing samples. Based on the S-KNN, Zhang et al. [11] proposed the Graph Sparse KNN (GS-KNN) method, which reduces the impact of noise on the prediction labels of testing samples by introducing sparsity into the linear expression matrix. Additionally, Zhang et al. [12] proposed a One-Step KNN, which transforms the k-nearest neighbor search in a linear KNN into matrix operations, thereby computing the adaptive k-value. This approach aims to enhance the prediction accuracy and efficiency by simplifying the computational processes. Recent research has further extended these ideas. For instance, Amer et al. [12] proposed an efficient k-nearest neighbor model that introduces three KNN variants (PRKNN, EPRKNN, and WPRKNN) by combining preprocessing techniques and weighting schemes, significantly improving performance for big data classification. Zhang et al. [26] introduced a shared-style linear k-nearest neighbors classification method, which integrates style information across multi-view data to optimize the classification accuracy. Furthermore, Fan et al. [27] proposed a multi-view adaptive k-nearest neighbors classification, which further enhances the model’s adaptability to heterogeneous data by dynamically adjusting the k-value for multi-view data. These recent advances highlight the potential of kernelization and adaptive mechanisms for improving KNN performance, providing a solid foundation for the proposed KMOLNN method’s kernel mapping and manifold optimization.
While the aforementioned methods have advanced the linear KNN framework, a clear research gap remains. The existing approaches typically focus on either linear sparsity (e.g., LASSO, LLC) or fixed graph structures (e.g., GS-KNN), often failing to capture the complex nonlinear manifolds in real-world data. Methods such as the multi-view adaptive KNN (MVAKNN) [27] incorporate kernelization but do not explicitly link the weight optimization to the grouping effect or Bayesian decision theory. KMOLNN addresses these limitations simultaneously by integrating nonlinear kernel mapping with an adaptive manifold-preserving term, thereby providing both improved generalization and theoretical interpretability. A comprehensive comparison of the key characteristics and theoretical bases between KMOLNN and these related linear-KNN variants is summarized in Table 2.

2.2. Preliminaries

2.2.1. Linear KNN Method

To address the high time and space consumption of a traditional KNN on large-scale datasets, the linear KNN (L-KNN) was developed to improve efficiency by simplifying the computational process, specifically by introducing linear functions or low-dimensional mappings to approximate the nonlinear decision boundaries of traditional KNN, thereby reducing the time and space consumption. The linearization process can be described by the following formula:
$$y \approx Xw + b \tag{1}$$
where w and b denote the representation weight and bias, respectively; X and y denote the training samples and a testing sample, respectively.
To further optimize the weight allocation and enhance the generalization capability, some regularization terms were introduced into L-KNN, including the Least Absolute Shrinkage and Selection Operator (LASSO) [6], Ridge regression, and Elastic Net [28]. Their objective functions are defined as follows:
LASSO:
$$L(w) = \|y - Xw\|_2^2 + \lambda \|w\|_1. \tag{2}$$
Ridge regression:
$$L(w) = \|y - Xw\|_2^2 + \lambda \|w\|_2^2. \tag{3}$$
Elastic Net:
$$L(w) = \|y - Xw\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2, \tag{4}$$
where $\lambda$, $\lambda_1$, and $\lambda_2$ denote the regularization coefficients used to balance model complexity and fitting capability.
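To make the difference between the three penalties concrete, the sketch below evaluates each objective for a fixed representation vector. It is an illustration under assumed toy values (the matrix, vector, and coefficients are hypothetical), and it only evaluates the objectives; it does not solve for the minimizer.

```python
# Evaluate the LASSO, Ridge, and Elastic Net objectives for a given w.
# X is stored as a list of rows; all numeric values are illustrative.

def residual_sq(X, y, w):
    """Squared reconstruction error ||y - X w||_2^2."""
    return sum((yi - sum(xij * wj for xij, wj in zip(row, w))) ** 2
               for row, yi in zip(X, y))

def l1(w):
    return sum(abs(wj) for wj in w)

def l2_sq(w):
    return sum(wj * wj for wj in w)

def lasso(X, y, w, lam):
    return residual_sq(X, y, w) + lam * l1(w)

def ridge(X, y, w, lam):
    return residual_sq(X, y, w) + lam * l2_sq(w)

def elastic_net(X, y, w, lam1, lam2):
    return residual_sq(X, y, w) + lam1 * l1(w) + lam2 * l2_sq(w)

X = [[1.0, 0.0],
     [0.0, 2.0]]
y = [1.0, 2.0]
w = [1.0, 0.5]

print(lasso(X, y, w, 0.1), ridge(X, y, w, 0.1), elastic_net(X, y, w, 0.1, 0.1))
```

All three share the same reconstruction error and differ only in how the size of $w$ is charged, which is why the Elastic Net value is the sum of the other two penalties on top of the common residual.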

2.2.2. Graph Theory and Laplacian Matrix

Graph Theory
Graph theory is a significant branch of mathematics that primarily studies the complex network structures formed by nodes (or vertices) and the edges connecting these nodes. In such a structure, the nodes represent individuals in the network, while the edges describe the relationships between them. Among the core tools of graph theory are the degree matrix D and adjacency matrix A , both of which serve as fundamental elements for understanding and analyzing graph structures. The degree matrix is a diagonal matrix where each element on the diagonal represents the degree of the corresponding node, indicating the total number of edges directly connected to that node. The adjacency matrix records the direct connections between the nodes, where an element is 1 when an edge exists between nodes i and j , and is 0 otherwise. These two matrices collectively provide a systematic approach to analyzing and understanding the structure and properties of graphs.
Laplacian Matrix
The Laplacian matrix $L$ is a core concept in spectral graph theory, serving as a discrete operator that captures the intrinsic structure of a graph. It is formally defined as $L = D - W$, where $W$ denotes the adjacency matrix encoding the similarity relationships or edge weights between nodes (data samples), and $D$ is the diagonal degree matrix, where each diagonal element $D_{ii} = \sum_j W_{ij}$ represents the total connectivity of node $i$. Mathematically, the Laplacian matrix acts similarly to the Laplace operator in continuous space, measuring the local smoothness of a function defined on a graph. Its spectral properties, particularly its eigenvalues and eigenvectors, provide critical insights into the connectivity and cluster structures within the data, forming the theoretical foundation for spectral clustering and dimensionality reduction techniques. In the context of the proposed KMOLNN method, the Laplacian matrix is instrumental for constructing the manifold-preserving regularization term. By incorporating $L$ into the objective function, the model enforces a smoothness constraint that penalizes differences in representation weights between adjacent nodes. This ensures that the training samples sharing high similarity in the local manifold structure are assigned consistent weights, thereby preserving the original geometrical relationships of the nonlinear data in the learned representation.
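The smoothness interpretation follows from the standard identity $p^{T} L p = \tfrac{1}{2} \sum_{i,j} W_{ij} (p_i - p_j)^2$ for a symmetric $W$. The sketch below builds $L = D - W$ for a small hypothetical graph and checks the identity numerically; the graph and the weight vector are assumptions chosen for illustration.

```python
# Build L = D - W and compare the quadratic form with the pairwise penalty.

def laplacian(W):
    """Return L = D - W for a symmetric weight matrix W (list of lists)."""
    n = len(W)
    deg = [sum(row) for row in W]  # D_ii = sum_j W_ij
    return [[(deg[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def quad_form(L, p):
    """Return p^T L p."""
    n = len(p)
    return sum(p[i] * L[i][j] * p[j] for i in range(n) for j in range(n))

def pairwise_penalty(W, p):
    """Return 0.5 * sum_ij W_ij (p_i - p_j)^2."""
    n = len(p)
    return 0.5 * sum(W[i][j] * (p[i] - p[j]) ** 2
                     for i in range(n) for j in range(n))

# A 3-node path graph with edge weights 1 and 2 (illustrative values).
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 2.0],
     [0.0, 2.0, 0.0]]
p = [0.5, 0.3, 0.2]

L = laplacian(W)
print(quad_form(L, p), pairwise_penalty(W, p))
```

Both quantities agree, so minimizing $p^{T} L p$ is the same as asking strongly connected samples to receive similar weights.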

3. The Objective Function and Prediction Function of the Proposed KMOLNN Method

As mentioned above, KMOLNN aims to address the limitations of the traditional linear KNN method on complex nonlinear datasets by introducing kernel mapping and manifold-preserving regularization terms, thereby enhancing its generalization capability when handling nonlinear data. Specifically, the kernel mapping regularization term maps the nonlinear data onto a high-dimensional space and integrates it into the linear KNN framework, effectively overcoming the traditional method’s difficulty in capturing complex data structures while maintaining computational efficiency. The manifold-preserving regularization term ensures the preservation of local neighborhood relationships during the mapping process, further enhancing the generalization capability and perception of the true manifold structure of the data. This section presents the general form of the kernelized objective function and classification function, and elucidates the transformation process from the original linear form to the kernelized form.
Here, we first present the framework diagram of the proposed KMOLNN method in Figure 2. As illustrated in Figure 2, the framework of the proposed KMOLNN method consists of three main phases: kernel mapping, manifold-regularized optimization, and classification. Initially, the original nonlinear training and testing samples are projected onto a high-dimensional feature space via a kernel mapping function to uncover the linear relationships hidden in the original space. Subsequently, the core optimization module learns the linear representation coefficient vector by minimizing an objective function that integrates the reconstruction error with manifold-preserving regularization. This phase employs an alternating optimization strategy to dynamically update the probability weights and the adaptive Laplacian matrix, ensuring the preservation of the data’s local manifold structure. Finally, the prediction module aggregates the learned similarity probabilities for each class and determines the label of the testing sample based on the maximum accumulated similarity, implicitly adhering to a Bayesian decision rule.

3.1. Objective Function of the KMOLNN Method

The traditional linear KNN method is built on the assumption that the testing sample can be linearly represented by several of its nearest training samples, where each element of the learned representation vector reflects how much the corresponding sample contributes to the reconstruction of the testing sample. Therefore, when employing the linear KNN method, two conditions should be satisfied: (1) the training samples and the testing samples should be linearly correlated, and (2) the elements of the representation vector should possess a well-defined physical meaning. However, this assumption is often violated when handling datasets containing nonlinear samples; in that case, the traditional linear KNN method fails to accurately learn the representation and to capture the manifold structure of the data, both of which are needed to enhance the classification performance.
To address the above limitation, we first apply the kernel method to map the original nonlinear data onto a high-dimensional space to maintain the above assumption, where the mapped data are linearly correlated. Therefore, based on the training samples X , the testing sample y can be reconstructed by
$$\phi(y) \approx \phi(X) w + b \tag{5}$$
where $\phi(\cdot)$ denotes the mapping function induced by the kernel function, $w$ denotes the learned representation vector, and $b$ denotes the bias. The kernel function maps the data onto the high-dimensional feature space and restores the linear relationship, yet it leaves another challenge: the elements of the learned representation vector in the mapped high-dimensional feature space lose the intuitive physical meaning they had in the original feature space, i.e., the learned representation vector loses its original interpretability.
To guarantee that the learned linear representation has an intuitive and explicit physical meaning, and inspired by graph theory [29], a Laplacian matrix is commonly constructed to capture the manifold structure of the data. Specifically, a Laplacian regularization term is introduced into Equation (5) to penalize differences between the elements of the representation vector that correspond to adjacent nodes within the graph structure, thereby encouraging similar or proximate samples to receive similar elements in the learned representation vector. Additionally, the regularization term helps to discover the robust relationships underlying the structures of the data, and it also preserves the manifold structure of the data after mapping the original data, so that the learned representation preserves the intuitive physical meaning of the original feature space. The objective function is therefore defined as follows:
$$\min_{w, A} \left\| \phi(y) - \sum_{i=1}^{m} p_i \phi(x_i) \right\|^2 + \lambda_1 \sum_{i=1}^{m} p_i \log p_i + \lambda_2 \sum_{i=1}^{m} (p_i - \beta d_i)^2 + \lambda_3 \sum_{i,j=1}^{m} p^{T} L(A) p \tag{6}$$
where $p_i = e^{w_i} / \sum_{j=1}^{m} e^{w_j}$ represents the probability weight of the $i$-th training sample $x_i$, $\phi(\cdot)$ denotes the mapping function induced by the kernel function, $x_i$ denotes the $i$-th ($i = 1, 2, \ldots, m$) training sample, $m$ represents the number of training samples, and $y$ denotes the testing sample. Parameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ control the contributions of the corresponding terms. The vector $d = [d_1, d_2, \ldots, d_m]$ represents the similarity measure between the testing sample and the training samples, where the $i$-th element $d_i$ ($i = 1, 2, \ldots, m$) can be calculated by
$$d_i = \exp\left( -\gamma \left\| \phi(y) - \phi(x_i) \right\|^2 \right) \tag{7}$$
where $\gamma$ denotes a positive hyperparameter that controls the width of the kernel function.
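The feature-space distance inside $d_i$ does not require the mapping itself, because $\|\phi(y) - \phi(x_i)\|^2 = k(y, y) - 2k(y, x_i) + k(x_i, x_i)$ for any kernel $k$. The sketch below computes the similarity vector this way. The Gaussian kernel, its width parameter `sigma`, the function names, and the sample values are assumptions of this illustration rather than settings taken from the paper.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))

def feature_space_sq_dist(y, x, kernel):
    """||phi(y) - phi(x)||^2 expressed through kernel evaluations only."""
    return kernel(y, y) - 2.0 * kernel(y, x) + kernel(x, x)

def similarity(y, X, gamma, kernel=gaussian_kernel):
    """d_i = exp(-gamma * ||phi(y) - phi(x_i)||^2) for every training sample."""
    return [math.exp(-gamma * feature_space_sq_dist(y, x, kernel)) for x in X]

# Three training samples at increasing distance from the testing sample.
X = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
y = [0.0, 0.0]

d = similarity(y, X, gamma=1.0)
print(d)
```

A training sample identical to the testing sample receives similarity 1, and the value decays as the samples move apart in the mapped space.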
In the last term of Equation (6), $L = D - A$, where $A$ denotes the adjacency matrix of the graph. For a weighted graph, $A_{ij}$ denotes the weight of the edge between node $i$ and node $j$. $D$ denotes the degree matrix, and its diagonal element $D_{ii}$ equals the degree of node $i$ (either the number of edges directly connected to node $i$ or their total weight). In many practical applications, fixed Laplacian matrices are built from predefined similarity measures (such as Euclidean distance), which fail to adequately capture the dynamics and complexity of data whose intrinsic structure is highly complex. In real-world data analysis tasks, data often originate from multiple sources and are heterogeneous and dynamic. Therefore, we further improve the objective function in Equation (6) by using an adaptive Laplacian matrix [30], which provides enhanced adaptability to diverse data sources and dynamic environments. Rather than relying on a static similarity measure, the adaptive Laplacian matrix in KMOLNN learns the edge weights directly from the data distribution during optimization, so it can more accurately capture the actual relationships between data points, particularly in high-dimensional data with nonlinear relationships. This keeps the model sensitive to the intrinsic geometry of nonlinear manifolds, in contrast to the predefined graph constraints used in prior kernelized variants. Finally, the objective function in Equation (6) can be transformed into
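$$\min_{p, A} \left\| \phi(y) - \sum_{i=1}^{m} p_i \phi(x_i) \right\|^2 + \lambda_1 \sum_{i=1}^{m} p_i \log p_i + \lambda_2 \sum_{i=1}^{m} (p_i - \beta d_i)^2 + \lambda_3 \sum_{i,j=1}^{m} p^{T} L(A) p + \varphi \sum_{i,j} a_{ij} \left\| x_i - x_j \right\|^2 + \lambda_4 \sum_{i,j} a_{ij}^2 \tag{8}$$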
The first term $\left\| \phi(y) - \sum_{i=1}^{m} p_i \phi(x_i) \right\|^2$ is the kernelized reconstruction error term, where $\phi(x_i)$ denotes the training sample $x_i$ mapped to a high-dimensional (or infinite-dimensional) feature space, and $\phi(y)$ denotes the testing sample $y$ mapped to the same feature space. $p_i$ is the $i$-th element of the weight vector $p$ and indicates the importance of the training sample $x_i$ in the linear combination that represents the testing sample $y$ in the mapped feature space. The objective of this term is to minimize the reconstruction error of the testing sample in the mapped feature space, so that the selected training samples reconstruct the testing sample as accurately as possible in the high-dimensional feature space.
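Since the mapping is usually available only through the kernel, this term is evaluated with kernel values. A short expansion, assuming $k(a, b) = \langle \phi(a), \phi(b) \rangle$ and using the shorthand $K_{ij} = k(x_i, x_j)$ and $(k_y)_i = k(y, x_i)$ (both symbols are introduced here for illustration and do not appear in the original text), reads:

```latex
\begin{aligned}
\Big\| \phi(y) - \sum_{i=1}^{m} p_i \phi(x_i) \Big\|^2
&= \langle \phi(y), \phi(y) \rangle
 - 2 \sum_{i=1}^{m} p_i \langle \phi(y), \phi(x_i) \rangle
 + \sum_{i=1}^{m} \sum_{j=1}^{m} p_i p_j \langle \phi(x_i), \phi(x_j) \rangle \\
&= k(y, y) - 2\, p^{T} k_y + p^{T} K p .
\end{aligned}
```

The error is therefore a quadratic function of $p$ that depends on the data only through the Gram matrix and the kernel values between the testing sample and the training samples.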
The second term $\sum_{i=1}^{m} p_i \log p_i$ is the probability regularization term, which is introduced to increase the entropy of the probability distribution, thereby enhancing its uniformity and uncertainty. By encouraging a more uniform distribution, entropy regularization prevents the model from over-relying on or over-adapting to individual training samples, effectively reducing the risk of overfitting. Moreover, this term enhances fairness across different training samples during linear representation, promotes balanced weight allocation, and implicitly incorporates a Bayesian prior, thereby strengthening the model’s generalization capability.
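The effect of this term can be seen by comparing a uniform and a peaked weight distribution. The sketch below computes softmax weights from hypothetical scores and evaluates $\sum_i p_i \log p_i$; the score values and function names are assumptions made for the example.

```python
import math

def softmax(w):
    """p_i = exp(w_i) / sum_j exp(w_j), shifted by max(w) for stability."""
    shift = max(w)
    exps = [math.exp(wi - shift) for wi in w]
    total = sum(exps)
    return [e / total for e in exps]

def neg_entropy(p):
    """sum_i p_i log p_i; lower values mean a more uniform distribution."""
    return sum(pi * math.log(pi) for pi in p if pi > 0.0)

p_uniform = softmax([0.0, 0.0, 0.0, 0.0])
p_peaked = softmax([5.0, 0.0, 0.0, 0.0])

print(neg_entropy(p_uniform), neg_entropy(p_peaked))
```

The uniform distribution attains the smallest possible value, $-\log m$, so minimizing this term pulls the weights away from concentrating on a single training sample.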
The third term $\sum_{i=1}^{m} (p_i - \beta d_i)^2$ serves as the locality constraint term, where $d_i$ denotes the similarity measure between the testing sample and the $i$-th training sample (e.g., Gaussian kernel-based distance). By minimizing this term, the constraint encourages the probability weight $p_i$ to correlate with the similarity of training samples in the original space, thus prioritizing training samples that are closer or more relevant to the testing sample for representation. This reinforces the local neighborhood principle, prevents the model from relying on distant samples, and improves the reconstruction accuracy and local adaptability of the model.
The fourth term $\sum_{i,j=1}^{m} p^{T} L(A) p + \varphi \sum_{i,j} a_{ij} \left\| x_i - x_j \right\|^2$ is the manifold-preserving regularization term. This term introduces the Laplacian matrix $L$ (based on the adaptive adjacency matrix $A$) to ensure that the low-dimensional manifold structure of the data is preserved when the training samples and testing samples are mapped onto the high-dimensional space, thereby guaranteeing the physical interpretation of the representation weights (e.g., weight continuity for similar samples). Furthermore, given the complexity of the intrinsic structure in some data, a fixed manifold is insufficient for fully representing the dynamic relationships within the data. Therefore, we employ adaptive edge weights to dynamically capture these structures, enhancing the model’s perception of nonlinear manifolds.
The fifth term, $\sum_{i,j} a_{ij}^2$, is used to prevent edge weight overfitting. This term controls the magnitude of edge weights in the adaptive adjacency matrix $A$ through the $L_2$ regularization norm, preventing the edge weights from becoming too divergent or large, which could lead to overfitting. It encourages sparsity and smoothness of the graph structure, ensures stability during the optimization process, and maintains the model’s generalization capability on complex data, especially nonlinear-distributed datasets.

3.2. Prediction Function of the Proposed KMOLNN Method

After the proposed KMOLNN method is trained by optimizing the above objective function, its prediction function can be defined as follows:
$$c^{*} = \arg\max_{c} \sum_{b_i \in B_c} p_i \, k(x, y_i), \tag{9}$$
where $b_i$ represents the learned representation weight (or coefficient) of the $i$-th training sample, $B_c$ represents the bias term corresponding to the $c$-th class, and $k(x, y_i)$ represents the kernel function (defined as the Gaussian kernel) that measures the similarity between the testing sample $x$ and the training sample $y_i$ in the mapped high-dimensional feature space.
According to the prediction function in Equation (9), the proposed KMOLNN method predicts the testing sample by evaluating its similarity with training samples of each class in the mapped high-dimensional feature space. The physical meaning of the proposed KMOLNN method is primarily manifested in two aspects: high-dimensional mapping achieved through kernel functions and classification decisions based on similarity probability.
  • The kernel function allows us to implicitly map the data onto a high-dimensional or even infinite-dimensional feature space by computing the kernel functions in the original feature space rather than directly calculating the mapped data. The purpose of this mapping is to reveal the intrinsic structure of data in the new feature space, making the linearly inseparable data in the original feature space linearly separable in the mapped feature space. For example, through the radial basis function (RBF) kernel, data can be mapped onto a higher-dimensional feature space where similar samples are closer together and dissimilar samples are more dispersed.
  • The proposed KMOLNN method utilizes similarity probabilities computed by the kernel function for prediction. For a given testing sample, it calculates the sum of its similarity to the training samples in each class. The similarity is computed directly in the high-dimensional feature space via the kernel function, reflecting the proximity probability of the testing sample to the training samples of each class in the mapped high-dimensional feature space. Finally, the proposed KMOLNN method assigns the testing sample to the class with the highest sum of similarity, indicating that the testing sample is most similar to the training samples of that class in the high-dimensional feature space.
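As a rough illustration of the decision rule in Equation (9), the sketch below sums the weighted Gaussian-kernel similarities per class and returns the class with the largest sum. The function names, the bandwidth `sigma`, and the uniform toy weights `p` are assumptions made for this example only; in the method itself, the weights come from the alternating optimization.

```python
import numpy as np

def gaussian_kernel(x, Y, sigma=1.0):
    """Gaussian kernel between one testing sample x and each row of Y."""
    sq_dist = np.sum((Y - x) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def predict_class(x, Y_train, labels, p, sigma=1.0):
    """Pick the class with the largest sum of p_i * k(x, y_i) over its samples."""
    sims = p * gaussian_kernel(x, Y_train, sigma)
    classes = np.unique(labels)
    scores = np.array([sims[labels == c].sum() for c in classes])
    best = classes[np.argmax(scores)]
    return best, dict(zip(classes.tolist(), scores.tolist()))

# Toy data (hypothetical): two training samples per class, uniform weights.
Y_train = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])
labels = np.array([0, 0, 1, 1])
p = np.full(4, 0.25)
pred, scores = predict_class(np.array([0.05, 0.0]), Y_train, labels, p)
```

Here the testing sample lies next to the class-0 samples, so the class-0 score dominates and `pred` is 0.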

3.3. Theoretical Analysis of Generalization Capability

The proposed KMOLNN method utilizes kernel mapping and manifold-preserving regularization to address the limitations of traditional linear KNN on nonlinear distributed data. Although kernel mapping can capture complex data structures in high-dimensional feature space, it introduces potential overfitting risks due to increased model complexity. To rigorously evaluate the generalization capability of the proposed KMOLNN method, we employ Rademacher complexity [31,32] to quantify its capability to generalize to unseen data. This section derives the generalization error bound, analyzes the impact of regularization and kernel parameters, and validates the theoretical findings through experimental results.

3.3.1. Definition of Rademacher Complexity

To rigorously analyze the generalization performance of the proposed KMOLNN algorithm, we utilize the framework of Rademacher complexity, which measures the richness or capacity of a hypothesis class by its ability to fit random noise.
Let $S = \{x_1, \ldots, x_n\}$ be a dataset of $n$ samples drawn independently and identically distributed (i.i.d.) from an unknown distribution $\mathcal{D}$. The empirical Rademacher complexity of a hypothesis class $\mathcal{H}$ with respect to the sample set $S$ is defined as
$$\hat{R}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\left[ \sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right] \tag{10}$$
where $\sigma = \{\sigma_1, \ldots, \sigma_n\}$ are independent Rademacher random variables uniformly taking values in $\{-1, +1\}$. The expected Rademacher complexity over all samples of size $n$ is given by $R_n(\mathcal{H}) = \mathbb{E}_S[\hat{R}_S(\mathcal{H})]$.
In standard kernel-based learning (e.g., standard kernel $k$-NN or SVMs), the hypothesis class is typically bounded solely by the Reproducing Kernel Hilbert Space (RKHS) norm, denoted as $\mathcal{H}_K = \{ f \in \mathcal{H} : \|f\|_K^2 \le C \}$. However, in the proposed KMOLNN framework, the integration of adaptive manifold regularization intrinsically restricts the hypothesis class to a much tighter, data-dependent subspace. Specifically, the KMOLNN hypothesis class $\mathcal{H}_{KM}$ is defined as
$$\mathcal{H}_{KM} = \left\{ f \in \mathcal{H}_K : \|f\|_K^2 + \gamma \, \mathbf{f}^{T} L \, \mathbf{f} \le C \right\} \tag{11}$$
where $\gamma > 0$ is the manifold regularization parameter, $\mathbf{f} = [f(x_1), \ldots, f(x_n)]^{T}$ represents the evaluations of the function on the sample set $S$, and $L$ is the adaptive Laplacian matrix dynamically learned during the alternating optimization process.
The functional penalty term $\|f\|_I^2 = \mathbf{f}^{T} L \, \mathbf{f}$ enforces smoothness along the intrinsic geodesic directions of the data distribution. By heavily penalizing high-frequency components over the graph’s Laplacian $L$, the supremum in the Rademacher complexity $\hat{R}_S(\mathcal{H}_{KM})$ is no longer dominated by the ambient dimension of the full RKHS. Instead, it is constrained by an “effective dimension” governed by the spectral decay of the combined operator. As we will demonstrate, the optimized adaptive Laplacian $L$ accelerates the decay of the eigenvalues $\lambda_i(L)$, thereby significantly reducing the Rademacher radius of $\mathcal{H}_{KM}$ and yielding a tighter generalization bound.

3.3.2. Generalization Error Bound

To derive the generalization error bound, we apply standard results from Statistical Learning Theory. For any $\delta \in (0, 1)$, under the random sampling of the training set $S$, with probability of at least $1 - \delta$, the expected risk $\mathbb{E}[L(f)] = \mathbb{E}_{(x, y) \sim \mathcal{D}}[l(f(x), y)]$ satisfies
$$\mathbb{E}[L(f)] \le \hat{L}(f) + 2 R_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}} \tag{12}$$
where $\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i)$ denotes the empirical risk, and $l$ denotes the loss function (e.g., the 0–1 loss for classification). This bound decomposes the generalization error into the empirical risk, model complexity (i.e., $R_n(\mathcal{F})$), and a confidence term that decreases as the sample size $n$ increases.
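To get a feel for how the three parts of the bound in Equation (12) combine, the following sketch evaluates its right-hand side for given inputs. The function names and the numeric inputs (empirical risk 0.10, Rademacher complexity 0.05, $\delta = 0.05$) are illustrative assumptions, not values reported in the paper.

```python
import math

def confidence_term(n, delta):
    """Confidence term sqrt(log(1/delta) / (2n)) of the bound."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def risk_upper_bound(empirical_risk, rademacher, n, delta=0.05):
    """Right-hand side: empirical risk + 2 * complexity + confidence term."""
    return empirical_risk + 2.0 * rademacher + confidence_term(n, delta)

# Same (assumed) empirical risk and complexity, two sample sizes.
bound_small = risk_upper_bound(0.10, 0.05, n=100)
bound_large = risk_upper_bound(0.10, 0.05, n=10000)
```

With the other inputs held fixed, only the confidence term changes with $n$, so `bound_large` is smaller than `bound_small`.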

3.3.3. Rademacher Complexity Bound for the Proposed KMOLNN Method

To bound $R_n(\mathcal{F})$, we consider the prediction form of the proposed KMOLNN method, $f(y) = \sum_{i=1}^{n} p_i K(y, x_i)$, for classification. Since $p \in \Delta_n$, and assuming the kernel is bounded (e.g., $|K(y, x_i)| \le M$, with $M = 1$ for the Gaussian kernel), we analyze the empirical Rademacher complexity by
$$\hat{R}_n(\mathcal{F}) = \mathbb{E}_{\sigma}\left[ \sup_{p \in \Delta_n} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \sum_{j=1}^{n} p_j K(x_i, x_j) \right] \tag{13}$$
Using the fact that $p$ lies in the probability simplex, we apply Hölder’s inequality:
$$\sum_{i=1}^{n} \sigma_i \sum_{j=1}^{n} p_j K(x_i, x_j) \le \sum_{j=1}^{n} p_j \sum_{i=1}^{n} \sigma_i K(x_i, x_j) \le \max_{j} \sum_{i=1}^{n} \sigma_i K(x_i, x_j) \tag{14}$$
Because $\sum_{j} p_j = 1$, taking the expectation over $\sigma$ gives
$$\mathbb{E}_{\sigma}\left[ \max_{j} \sum_{i=1}^{n} \sigma_i K(x_i, x_j) \right] \tag{15}$$
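For each $j$, $\sum_{i=1}^{n} \sigma_i K(x_i, x_j)$ is the sum of bounded random variables ($|K(x_i, x_j)| \le M$). Each summand $\sigma_i K(x_i, x_j)$ lies in $[-M, M]$, an interval of length $2M$, so by Hoeffding’s inequality, for a fixed $j$ we have
$$P\left( \left| \sum_{i=1}^{n} \sigma_i K(x_i, x_j) \right| \ge t \right) \le 2 \exp\left( -\frac{t^2}{2 n M^2} \right) \tag{16}$$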
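Taking the maximum over $n$ terms, applying the union bound, and restoring the $1/n$ normalization of the empirical Rademacher complexity, we obtain
$$\frac{1}{n} \, \mathbb{E}_{\sigma}\left[ \max_{j} \sum_{i=1}^{n} \sigma_i K(x_i, x_j) \right] \le M \sqrt{\frac{2 \log n}{n}} \tag{17}$$
Therefore, the Rademacher complexity is bounded as
$$R_n(\mathcal{F}) \le M \frac{1}{\sqrt{n}} \sqrt{2 \log n} \tag{18}$$
This bound indicates that the complexity decreases at a rate of $O(\sqrt{\log n / n})$, suggesting that the proposed KMOLNN method maintains controlled complexity as the sample size $n$ increases.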
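The bound can be sanity-checked numerically. The sketch below draws random Rademacher vectors, estimates $\frac{1}{n}\mathbb{E}_{\sigma}[\max_j \sum_i \sigma_i K(x_i, x_j)]$ by Monte Carlo for a Gaussian kernel matrix, and compares it with $M\sqrt{2 \log n / n}$ for $M = 1$. The synthetic data, the bandwidth, and the number of draws are assumptions chosen only for illustration.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    sq_dist = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def empirical_rademacher_simplex(K, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[max_j (1/n) sum_i sigma_i K_ij].

    A linear function over the simplex is maximized at a vertex,
    which is why the supremum over p reduces to a maximum over j.
    """
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    return float(np.mean(np.max(sigma @ K, axis=1)) / n)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))           # synthetic samples (assumed)
K = gaussian_gram(X)
n = K.shape[0]
estimate = empirical_rademacher_simplex(K)
bound = 1.0 * np.sqrt(2.0 * np.log(n) / n)  # M = 1 for the Gaussian kernel
```

On data like this the estimate should sit well below the bound, since the bound only uses $|K(x_i, x_j)| \le M$ and ignores how quickly the Gaussian kernel decays with distance.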

3.4. Proof of the Nearest Neighbor Group Effect

In this subsection, we introduce the second characteristic of the regularization term $R(w)$, namely that the learned linear probability weight vector $p$ exhibits the k-nearest neighbors grouping effect. The k-nearest neighbors grouping effect requires that when two training samples with the same class label are close to a testing sample, the elements in the linear expression weight vector $p$ corresponding to these two training samples should be similar. The k-nearest neighbors grouping effect in Equation (9) can be described by Theorem 1. Before presenting the formal theoretical analysis, it is important to establish that our formulations operate under the standard manifold assumption: the high-dimensional data distribution is presumed to be concentrated on or near a lower-dimensional intrinsic manifold, which allows the local linear reconstruction to hold in the localized region.
Theorem 1.
For the objective function, assuming the adjacency matrix $A$ is fixed, the optimization variable is the weight vector $p$. For any two training samples $x_i$ and $x_j$ within the same class $k$, if their similarity to the testing sample $y$ is close, i.e., $d_i \approx d_j$ (where $d_i$ denotes the similarity to $y$), and samples $x_i$ and $x_j$ are highly correlated in the graph structure (i.e., in the adjacency matrix $A$, $a_{ij}$ is large), then the optimized weights $p_i^{*}$ and $p_j^{*}$ satisfy
$$| p_i^{*} - p_j^{*} | \le \varepsilon \tag{19}$$
where $\varepsilon$ denotes a small positive number that depends on the regularization parameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ and on the correlations between samples.
Proof. 
The relevant proof of Theorem 1 is provided in Appendix A. □
The empirical results from the ablation study further validate this theoretical proof; the removal of the manifold term ($\gamma = 0$) leads to a decline in the classification stability, confirming that the nearest neighbor group effect is a critical factor for achieving superior generalization.

3.5. Optimization of the Objective Function for the Proposed Method

The optimization of the objective function in the proposed KMOLNN method typically involves complex dependencies among multiple variables, such as the probability weight vector $p$ and the adaptive adjacency matrix $A$, which makes obtaining a direct solution challenging. Therefore, we decompose the optimization problem of the objective function into multiple sub-objective problems and progressively approach the global optimal solution through alternating optimization [33,34]. This section elaborates on the strategies for parameter initialization and the specific implementation steps for the alternating optimization.

3.5.1. Parameter Initialization

The parameter initialization significantly affects the convergence rate of the optimal solution and the numerical stability of computations. In this subsection, we present two initialization methods:
  • For the initialization of the probability weight vector $p$, each element of $p$ is assigned an identical value. This approach is based on a reasonable assumption: in the absence of any prior information, the testing sample is equally likely to be associated with each training sample. At the same time, it ensures that $\|p\|_1 = 1$, which satisfies the normalization requirement of a probability distribution.
  • For the adaptive adjacency matrix $A$, we employ the k-means clustering method [35] to cluster the data points $x_i$ into an appropriate number of classes $k$, which typically depends on the characteristics of the problem or is determined through certain criteria (such as the elbow method). We then differentiate the connection weights within the same class from those between classes:
$$a_{ij} = \begin{cases} \exp\left( -\dfrac{\| x_i - x_j \|^2}{2 \sigma^2} \right), & label(x_i) = label(x_j) \\ \varepsilon, & label(x_i) \ne label(x_j) \end{cases} \tag{20}$$
where $\varepsilon$ denotes a small positive number representing the minimal level of connection between classes.
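The two initializations can be sketched as follows. The cluster labels are taken as given here (in the paper they come from k-means); the function name `init_adjacency`, the bandwidth `sigma`, and the value `eps = 1e-3` are assumptions made for the example.

```python
import numpy as np

def init_adjacency(X, cluster_labels, sigma=1.0, eps=1e-3):
    """Gaussian weight inside a cluster, small constant eps across clusters."""
    X = np.asarray(X, dtype=float)
    cluster_labels = np.asarray(cluster_labels)
    sq = np.sum(X ** 2, axis=1)
    sq_dist = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    same = cluster_labels[:, None] == cluster_labels[None, :]
    return np.where(same, np.exp(-sq_dist / (2.0 * sigma ** 2)), eps)

# Toy data (hypothetical) with labels that k-means might return.
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
cluster_labels = np.array([0, 0, 1, 1])
A0 = init_adjacency(X, cluster_labels)

# Uniform initialization of the probability weight vector.
p0 = np.full(X.shape[0], 1.0 / X.shape[0])
```

Note that the diagonal entries equal 1 under this formula, because a sample always shares a cluster with itself; whether self-loops should be kept is a design choice the text does not specify.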

3.5.2. Alternating Optimization

When implementing an alternating optimization [15], we decompose the original optimization problem into several subproblems, each of which is responsible for optimizing one variable in the original problem. When optimizing the probability weight vector p , we hold the adaptive adjacency matrix A fixed; conversely, when optimizing the adjacency matrix A , we need to hold the probability weight vector p fixed. The repeated application of this strategy ensures that each optimization step approaches the optimal solution. Next, we provide a detailed explanation of the specific implementation methods and corresponding technical details for each optimization step.
  • When the adjacency matrix A is fixed, optimize the probability weight vector p .
When $A = A^{(k)}$ is held fixed, the objective function in Equation (8) reduces to the following sub-objective function:
$$J_1(p) = \min_{p} \left\| \phi(y) - \sum_{i=1}^{m} p_i \phi(x_i) \right\|^2 + \lambda_1 \sum_{i=1}^{m} p_i \log p_i + \lambda_2 \sum_{i=1}^{m} (p_i - \beta d_i)^2 + \lambda_3 \, p^{T} L(A) \, p \tag{21}$$
Equation (21) is usually a convex optimization problem (it is convex whenever $L(A)$ is positive semidefinite), and it is optimized using the gradient descent algorithm [15]. It is particularly important to note that the first term contains an unknown mapping function. We can use the kernel trick to further simplify the computations. The kernel trick allows us to operate in the feature space without explicitly computing the mapping $\phi(\cdot)$, thereby significantly reducing the computational complexity.
First, construct the kernel matrix $K$ and the vector $k_y$ to prepare for the next step of gradient calculation and parameter updates. Here, $K \in \mathbb{R}^{m \times m}$ has $(i, j)$-th element $K_{i,j} = K(x_i, x_j)$, and the $j$-th element of $k_y$ is $K(y, x_j)$. We transform the linear reconstruction term $\left\| \phi(y) - \sum_{i=1}^{m} p_i \phi(x_i) \right\|^2$ into
$$\left\| \phi(y) - \sum_{i=1}^{m} p_i \phi(x_i) \right\|^2 = K(y, y) - 2 \sum_{i=1}^{m} p_i K(y, x_i) + \sum_{i=1}^{m} \sum_{j=1}^{m} p_i p_j K(x_i, x_j) \tag{22}$$
We then express the sub-objective function as follows:
$$\min_{p} \; K(y, y) - 2 \sum_{i=1}^{m} p_i K(y, x_i) + \sum_{i=1}^{m} \sum_{j=1}^{m} p_i p_j K(x_i, x_j) + \lambda_1 \sum_{i=1}^{m} p_i \log p_i + \lambda_2 \sum_{i=1}^{m} (p_i - \beta d_i)^2 + \lambda_3 \, p^{T} L(A) \, p \tag{23}$$
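The kernel expansion of the reconstruction term can be checked numerically when the feature map is available in closed form. The Gaussian kernel used in the paper has no finite-dimensional map, so the sketch below substitutes the homogeneous quadratic kernel $k(a, b) = (a^{T} b)^2$, whose map is $\phi(a) = \mathrm{vec}(a a^{T})$. The kernel choice and the random toy data are assumptions made purely for this check.

```python
import numpy as np

def phi(a):
    """Explicit feature map of the quadratic kernel: vec(a a^T)."""
    return np.outer(a, a).ravel()

def k(a, b):
    """Homogeneous quadratic kernel, equal to <phi(a), phi(b)>."""
    return float(np.dot(a, b) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # five toy training samples
y = rng.normal(size=3)        # toy testing sample
p = rng.random(5)
p /= p.sum()                  # weights on the probability simplex

# Left-hand side: squared reconstruction error in the feature space.
recon = sum(p_i * phi(x_i) for p_i, x_i in zip(p, X))
explicit = float(np.sum((phi(y) - recon) ** 2))

# Right-hand side: the same quantity from kernel evaluations only.
k_y = np.array([k(y, x_i) for x_i in X])
K = np.array([[k(a, b) for b in X] for a in X])
via_kernel = k(y, y) - 2.0 * float(p @ k_y) + float(p @ K @ p)
```

Both quantities agree up to floating-point error, which is the identity that lets the optimization work with $K$ and $k_y$ instead of $\phi(\cdot)$.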
Based on the gradient descent method, the gradient of the objective function with respect to $p_i$ is calculated by
$$\frac{\partial J(p)}{\partial p_i} = -2 K(y, x_i) + 2 \sum_{j=1}^{m} p_j K(x_i, x_j) + \lambda_1 (\log p_i + 1) + 2 \lambda_2 (p_i - \beta d_i) + 2 \lambda_3 \left( L(A^{(k)}) \, p \right)_i \tag{24}$$
We then update $p$ with a gradient descent step:
$$p^{(t+1)} = p^{(t)} - lr \cdot \nabla_{p}^{(t)} \tag{25}$$
where $lr$ denotes the learning rate and $\nabla_{p}^{(t)}$ denotes the gradient with respect to $p$ at iteration $t$. We determine convergence of the optimization process by monitoring the changes in the objective function value to decide when to terminate optimization. Specifically, we define the change amount as
$$\Delta J = | J_{k+1} - J_{k} | \tag{26}$$
where $J_{k}$ and $J_{k+1}$ denote the objective function values of two consecutive iterations, respectively. If this change amount $\Delta J$ is smaller than a preset threshold, we can consider the optimization process converged.
Next, we analyze the computational complexity of Algorithm 1 step by step. In Step 1, constructing the kernel matrix $K \in \mathbb{R}^{n \times n}$ requires evaluating the Gaussian kernel for all $n^2$ pairs of training samples; since each kernel evaluation involves computing a distance in $d$ dimensions, this step costs $O(n^2 d)$. In the same step, computing the similarity vector $k \in \mathbb{R}^{n}$ between the testing sample and all training samples requires $n$ kernel evaluations and thus costs $O(nd)$. In Steps 2–10, the algorithm enters an iterative procedure; suppose it runs for $T$ iterations. In each iteration, Steps 3–5 loop over $n$ coefficients, and computing the required quantity for each coefficient typically involves an $O(n)$ inner product with the precomputed kernel matrix, resulting in $O(n^2)$ time per iteration. In Step 6, the update is dominated by dense matrix–vector operations, which also cost $O(n^2)$. In Steps 8–10, projecting $p$ onto the probability simplex can be implemented in, at most, $O(n \log n)$ time (e.g., via sorting-based projection), which is lower than $O(n^2)$ for a moderate-to-large $n$. Therefore, the per-iteration cost in Steps 2–10 is dominated by $O(n^2)$, and after $T$ iterations, the total time complexity of Algorithm 1 is $O(n^2 d) + T \, (O(n^2) + O(n \log n)) = O(n^2 d + T n^2)$, where the $O(n^2 d)$ term comes from the kernel construction and the $O(T n^2)$ term dominates the iterative phase in practice.
Algorithm 1: Optimization process in Equation (21).
Input: Training sample matrix $X$; testing sample $y$; adjacency matrix $A$; distance vector $d$; and regularization parameters $\lambda_1$, $\lambda_2$, and $\lambda_3$.
Output: Probability weight vector $p$ of the testing sample.
Procedure
Step 1: Calculate $K(x_i, x_j)$ and $K(y, x_j)$.
Step 2: While $| J_{k+1} - J_{k} | > \varepsilon$ do the following.
Step 3:   For $i = 1$ to $n$ do the following.
Step 4:    Calculate $\frac{\partial J(p)}{\partial p_i} = -2 K(y, x_i) + 2 \sum_{j=1}^{m} p_j K(x_i, x_j) + \lambda_1 (\log p_i + 1) + 2 \lambda_2 (p_i - \beta d_i) + 2 \lambda_3 ( L(A^{(k)}) \, p )_i$.
Step 5:    Add $\frac{\partial J(p)}{\partial p_i}$ to $grad_p$.
Step 6:   Update $p$.
Step 7:     $p \leftarrow p - lr \cdot grad_p$.
Step 8:   Project $p$ to make it a probability distribution.
Step 9:     $p \leftarrow \max(p, 0)$.
Step 10:   $p \leftarrow p / \mathrm{sum}(p)$.
Step 11: Return $p$.
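A compact, vectorized sketch of Algorithm 1 is given below. It follows the listed steps (gradient step, clipping at zero, renormalization, stop when the objective change is small), but the fixed learning rate, the iteration cap, the small constant inside the logarithm, and the toy inputs are assumptions added for numerical safety and illustration; they are not specified in the paper.

```python
import numpy as np

def optimize_p(K, k_y, d, L, lam1, lam2, lam3, beta=1.0,
               lr=0.01, tol=1e-10, max_iter=1000):
    """Projected gradient descent on the sub-objective in p (A fixed)."""
    n = K.shape[0]
    p = np.full(n, 1.0 / n)     # uniform start
    tiny = 1e-12                # assumed guard so log(0) never occurs

    def objective(q):
        # K(y, y) is constant in p and is omitted.
        return (-2.0 * k_y @ q + q @ K @ q
                + lam1 * np.sum(q * np.log(q + tiny))
                + lam2 * np.sum((q - beta * d) ** 2)
                + lam3 * q @ L @ q)

    prev = objective(p)
    for _ in range(max_iter):
        grad = (-2.0 * k_y + 2.0 * K @ p
                + lam1 * (np.log(p + tiny) + 1.0)
                + 2.0 * lam2 * (p - beta * d)
                + 2.0 * lam3 * (L @ p))
        p = np.maximum(p - lr * grad, 0.0)   # Steps 7 and 9
        p = p / p.sum()                      # Step 10
        cur = objective(p)
        if abs(cur - prev) <= tol:           # Step 2 stopping rule
            break
        prev = cur
    return p

def gaussian(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2.0 * sigma ** 2))

# Toy problem (hypothetical): the testing sample sits next to samples 0 and 1.
X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])
y = np.array([0.05, 0.0])
K = gaussian(X[:, None, :], X[None, :, :])
k_y = gaussian(X, y)
A = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
p = optimize_p(K, k_y, d=k_y, L=L, lam1=0.01, lam2=0.1, lam3=0.1)
```

Clipping followed by renormalization is the projection written in Steps 9 and 10; a sorting-based Euclidean projection onto the simplex would be an alternative, at the $O(n \log n)$ cost mentioned in the complexity analysis.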
  • When the probability weight vector p is fixed, optimize the adjacency matrix A.
When fixing the probability weight vector $p$, the objective function in Equation (8) degenerates to a sub-objective function related only to the adjacency matrix $A$, as follows:
$$J_A(A) = \min_{A} \; \lambda_3 \, p^{T} L(A) \, p + \lambda_3 \varphi \sum_{i,j=1}^{m} a_{ij} \| x_i - x_j \|^2 + \lambda_4 \sum_{i,j} a_{ij}^2 \tag{27}$$
For the first term $\lambda_3 \, p^{T} L(A) \, p$ in Equation (27), the Laplacian matrix is $L(A) = D - A$, where $D$ denotes the degree matrix with diagonal elements $D_{ii} = \sum_{j} a_{ij}$. Taking the derivative with respect to $a_{ij}$ transforms it into $\frac{\partial}{\partial a_{ij}} p^{T} (D - A) \, p$. When $i = j$, the partial derivative of $D$ with respect to $a_{ij}$ is equal to 1; otherwise, it is 0. Therefore, since $\frac{\partial D_{ii}}{\partial a_{ij}} = \delta_{ij}$, where $\delta_{ij}$ denotes the Kronecker delta function, we obtain
$$\frac{\partial}{\partial a_{ij}} \lambda_3 \, p^{T} L(A) \, p = \lambda_3 \, p_i^2 \, \delta_{ij} - 2 \lambda_3 \, p_i p_j$$
For the second term in the sub-objective function, $\lambda_3 \varphi \sum_{i,j=1}^{m} a_{ij} \| x_i - x_j \|^2$, we provide its derivative with respect to $a_{ij}$ directly as follows:
$$\frac{\partial}{\partial a_{ij}} \lambda_3 \varphi \Big( \sum_{i,j} a_{ij} \| x_i - x_j \|^2 \Big) = \lambda_3 \varphi \Big( \sum_{i,j} a_{ij} \| x_i - x_j \|^2 \Big) \| x_i - x_j \|^2$$
For the third term in the sub-objective function, $\lambda_4 \sum_{i,j} a_{ij}^2$, we directly provide its derivative with respect to $a_{ij}$ as follows:
$$\frac{\partial}{\partial a_{ij}} \lambda_4 \sum_{i,j} a_{ij}^2 = 2 \lambda_4 \, a_{ij}$$
Therefore, we obtain the gradient of Equation (27) as follows:
$$\frac{\partial J}{\partial a_{ij}} = \lambda_3 ( p_i^2 - 2 p_i p_j ) + \lambda_3 \varphi \Big( \sum_{i,j} a_{ij} \| x_i - x_j \|^2 \Big) \| x_i - x_j \|^2 + 2 \lambda_4 \, a_{ij}$$
We then update $a_{ij}$ with a gradient descent step:
$$a_{ij} \leftarrow a_{ij} - lr \, \frac{\partial J}{\partial a_{ij}}$$
We ensure non-negativity by projecting $a_{ij}$ onto the non-negative values:
$$a_{ij} \leftarrow \max( a_{ij}, 0 )$$
We present the details for optimizing Equation (27) in Algorithm 2.
Algorithm 2: Optimizing Equation (27).
Input: Training samples $X$; testing sample probability vector $p$; and regularization parameters $\lambda_3$, $\lambda_4$, and $\varphi$.
Output: Optimal adjacency matrix $A$.
Procedure
Step 1: Randomly initialize adjacency matrix $A$.
Step 2: While $| J_{k+1} - J_{k} | > \varepsilon$ do the following.
Step 3:   $grad_A = \lambda_3 ( p_i^2 - 2 p_i p_j ) + \lambda_3 \varphi ( \sum_{i,j} a_{ij} \| x_i - x_j \|^2 ) \| x_i - x_j \|^2 + 2 \lambda_4 \, a_{ij}$.
Step 4:   Update $A \leftarrow A - lr \cdot grad_A$.
Step 5:   Projection $A \leftarrow \max(A, 0)$ to ensure non-negativity.
Step 6: Return $A$.
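The loop structure of Algorithm 2 (random start, gradient step, clipping at zero, stop on a small objective change) can be sketched as below. As a loudly stated assumption, the gradient here is obtained by differentiating Equation (27) directly with $\varphi$ treated as a scalar weight and $L(A) = D - A$, which gives $\lambda_3 (p_i^2 - p_i p_j) + \lambda_3 \varphi \|x_i - x_j\|^2 + 2 \lambda_4 a_{ij}$; this need not match the gradient printed in Step 3 term by term. The learning rate, tolerance, and toy data are also assumptions.

```python
import numpy as np

def optimize_A(X, p, lam3, lam4, phi, lr=0.1, tol=1e-14,
               max_iter=1000, seed=0):
    """Projected gradient descent on the sub-objective in A (p fixed)."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    # d/da_ij of p^T (D - A) p, with D_ii = sum_j a_ij (assumed derivation).
    lap_grad = (p ** 2)[:, None] - np.outer(p, p)
    const = lam3 * lap_grad + lam3 * phi * D2   # part that does not depend on A

    def objective(A):
        # The Laplacian and distance terms are linear in A; the last is a ridge.
        return float(np.sum(const * A) + lam4 * np.sum(A ** 2))

    A = np.random.default_rng(seed).random((n, n))   # Step 1
    prev = objective(A)
    for _ in range(max_iter):
        grad = const + 2.0 * lam4 * A                # Step 3 (assumed form)
        A = np.maximum(A - lr * grad, 0.0)           # Steps 4 and 5
        cur = objective(A)
        if abs(cur - prev) <= tol:                   # Step 2 stopping rule
            break
        prev = cur
    return A

# Toy inputs (hypothetical).
X = np.array([[0.0], [0.1], [0.2], [3.0]])
p = np.array([0.1, 0.4, 0.4, 0.1])
lam3, lam4, phi = 1.0, 1.0, 0.01
A = optimize_A(X, p, lam3, lam4, phi)

# Under these assumptions each entry has a closed-form minimizer to compare with.
sq = np.sum(X ** 2, axis=1)
D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
closed = np.maximum(-(lam3 * ((p ** 2)[:, None] - np.outer(p, p))
                      + lam3 * phi * D2) / (2.0 * lam4), 0.0)
```

Because the assumed objective is separable across entries, the iterates converge to the clipped closed-form solution, which makes the sketch easy to verify; the resulting matrix is generally not symmetric.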
Next, we analyze the computational complexity of Algorithm 2 step by step. Let $n$ denote the number of nodes (training samples), let $A \in \mathbb{R}^{n \times n}$ be the adaptive adjacency matrix to be learned, and let $d$ be the dimension of each node feature vector. In Step 1, $A$ is randomly initialized by assigning values to all $n^2$ entries, which costs $O(n^2)$. In Step 2, the while loop repeatedly updates $A$ until convergence. Suppose it runs for $T_2$ iterations; in each iteration, the dominant computation is the double loop over all index pairs $(i, j)$, i.e., over the $n^2$ entries of $A$. In Step 3, for each pair $(i, j)$, computing the squared Euclidean distance between node features (e.g., $\| x_i - x_j \|_2^2$) requires a subtraction and a sum of squares in $d$ dimensions, costing $O(d)$. In Step 4, updating the corresponding entry of $A$ involves only a constant number of arithmetic operations, costing $O(1)$. In Step 5, enforcing the non-negativity constraint (e.g., via projection onto the non-negative domain) is also a constant-time operation per entry, costing $O(1)$. Therefore, one full pass of Steps 3–5 over all $(i, j)$ pairs costs $O(n^2 d)$. After completing one iteration, the convergence test compares the updated value with its previous value across all $n^2$ elements, which costs $O(n^2)$ and is dominated by $O(n^2 d)$ when $d \ge 1$. Consequently, the total time complexity of Algorithm 2 after $T_2$ iterations is $O(n^2) + T_2 \, (O(n^2 d) + O(n^2)) = O(T_2 n^2 d)$, where the $O(n^2 d)$ term typically dominates in practice.
We summarize the optimization process of the objective function in the proposed KMOLNN method in Algorithm 3.
Algorithm 3: Alternating optimization process of the objective function in the proposed KMOLNN method.
Input: Training samples $X = \{x_1, \ldots, x_n\}$; testing sample $y$; kernel function $K(\cdot, \cdot)$; parameters $\lambda$, $\mu$, $\gamma$, and $\beta$; kernel width $\sigma$; maximum iterations $max\_iter$; and convergence threshold $tol$.
Output: Optimized weight vector $p$ and adjacency matrix $A$.
Procedure
Step 1: Initialization: $p \leftarrow \mathbf{1}_n$, $p = p / \mathrm{sum}(p)$.
Step 2: Initialize $A$ using k-means clustering; if $x_i$ and $x_j$ are in the same cluster, set $A_{ij} = 1$; otherwise, $A_{ij} = 0$.
Step 3: Compute similarity vector $d$, where $d_i = \exp\left( -\frac{\| y - x_i \|^2}{2 \sigma^2} \right)$ for $i = 1, \ldots, n$.
Step 4: Compute kernel matrices $K_{XX} = K(X, X)$, $K_{yX} = K(y, X)$, and $K_{yy} = K(y, y)$.
Step 5: Iterative optimization.
Step 6: For $iter = 1$ to $max\_iter$ do the following.
Step 7:    Fix $A$; compute Laplacian $L = D - A$, where $D_{ii} = \sum_{j} A_{ij}$; optimize weight vector $p \leftarrow \mathrm{Algorithm\ 1}(K_{XX}, K_{yX}, K_{yy}, d, L, \lambda, \mu, \gamma)$.
Step 8:    Fix $p$; optimize adjacency matrix $A \leftarrow \mathrm{Algorithm\ 2}(p, \gamma, \beta)$.
Step 9:    Update $D$, where $D_{ii} = \sum_{j} a_{ij}$.
Step 10:    If $\| p_{new} - p_{old} \|_2 + \| A_{new} - A_{old} \|_F < tol$.
Step 11:       Break.
Step 12:    $p_{old} \leftarrow p$, $A_{old} \leftarrow A$.
Step 13: Return $p$, $A$.
Next, we analyze the computational complexity of Algorithm 3 step by step, using the same notations as in the algorithm. Let $n$ be the number of training samples, $d$ be the feature dimension, $max\_iter$ be the maximum number of outer iterations, and let $T_1$ and $T_2$ denote the numbers of inner iterations required by Algorithm 1 and Algorithm 2, respectively. In Step 1, initializing the probability weight vector $p$ assigns $n$ values and therefore costs $O(n)$. In Step 2, initializing the adjacency matrix $A$ via $k$-means clustering costs $O(ndkt)$ (with $k$ and $t$ treated as constants in practice), and explicitly constructing/writing $A \in \mathbb{R}^{n \times n}$ costs at most $O(n^2)$; hence, this step is dominated by $O(n^2)$. In Step 3, computing the similarity vector $s$ between the testing sample and all training samples requires $n$ distance/kernel evaluations in $d$-dimensional space, costing $O(nd)$. In Step 4, computing the kernel objects (notably the kernel matrix $K \in \mathbb{R}^{n \times n}$ and the kernel vector $k$) requires evaluating the kernel for all $n^2$ pairs of training samples (each typically $O(d)$), leading to $O(n^2 d)$ time. In Steps 6–11, the algorithm enters the outer loop, which runs for at most $max\_iter$ iterations. In each outer iteration, Step 7 fixes $A$ and computes the graph’s Laplacian $L = D - A$ (with $D$ being the degree matrix), which takes $O(n^2)$, and then invokes Algorithm 1 to optimize $p$, costing $O(T_1 n^2)$, since each inner iteration is dominated by dense matrix–vector operations involving $K$. Next, Step 8 fixes $p$ and invokes Algorithm 2 to optimize $A$; this requires updating all $n^2$ entries of $A$, and each update typically performs a $d$-dimensional distance computation plus a constant-time arithmetic/projection, yielding $O(n^2 d)$ per inner iteration and $O(T_2 n^2 d)$ in total.
Therefore, the cost per outer iteration is $O(T_1 n^2 + T_2 n^2 d)$, and the overall time complexity of Algorithm 3 is $O(n^2 d) + max\_iter \cdot (O(T_1 n^2) + O(T_2 n^2 d)) = O(n^2 d + max\_iter \cdot (T_1 n^2 + T_2 n^2 d))$. In typical settings where $d$ is moderate, the dominant terms are $O(n^2 d)$ for kernel construction and $O(max\_iter \cdot T_2 n^2 d)$ for updating $A$, while the early-stopping condition controlled by $tol$ may reduce the actual number of outer iterations in practice.

4. Experimental Studies

To evaluate the generalization capability and running speed of the proposed KMOLNN method, we compared its classification performance and execution times with alternative methods on the adopted datasets. The experiments were implemented in Python 3.9 on a Windows 10 platform (Intel Core i7-8700 CPU, 32 GB RAM). Crucially, since KMOLNN is a lazy learning method requiring no explicit global training, the reported runtimes denote the end-to-end inference cost per test sample, specifically covering kernel matrix construction and alternating optimization.

4.1. Comparative Methods and Parameter Settings

In the experiments, eight KNN-based methods were selected as the comparative methods. They can be briefly summarized as follows:
k-Nearest Neighbors (KNN): KNN employs the Euclidean distance to measure the similarity between all training samples and the testing sample. Among all the calculated Euclidean distances, the k-training samples closest to the testing sample are selected. The label of the testing sample is determined based on the labels of these k-training samples. The main parameter of the KNN method is the k-value. According to the recommendation in the literature [1], the k-value is typically selected through cross-validation by traversing values from 1 to n to determine the optimal k-value.
Kernel k -Nearest Neighbors (KKNN) [2]: As a direct nonlinear extension of traditional KNN, KKNN maps the input data into a high-dimensional feature space via a kernel function before performing nearest neighbor voting. This baseline is included to validate the advantage of our manifold-optimized reconstruction over simple kernelized voting.
Fuzzy k-Nearest Neighbors (FKNN) [36]: FKNN performs classification prediction by considering the similarity between testing samples and different classes. It employs fuzzy membership functions to quantify the relationship between testing samples and each class in the training samples, assigning weights based on proximity. The primary parameter of FKNN is the k-value. According to the recommendation in the literature [36], the k-value is typically selected through cross-validation by traversing values from 1 to n to determine the optimal k-value.
Proximal Ratio-based k-Nearest Neighbors (PRKNN) [37]: PRKNN is a KNN variant based on the Proximal Ratio (PR), which optimizes classification decisions by identifying the noise points and overlapping samples to address class imbalance and nonlinear data distribution. It achieves locally adaptive neighborhood selection by minimizing the contribution of overlapping points and incorporating weighting schemes (EPRKNN and WPRKNN). According to the literature recommendations, the main parameters include the number of neighbors $k$ and the PR threshold. The $k$-value is selected via cross-validation within the $\{1, 3, 5, \ldots, 9\}$ range, with the PR threshold search range being $\{0.1, 0.5\}$. The objective function is to minimize the sum of PR loss and weighted distance.
Elastic Net [7]: Elastic Net is a regularization technique that combines Lasso regression and Ridge regression, and is designed to handle data with highly correlated features. The parameter setting primarily involves two key parameters, $\lambda_1$ and $\lambda_2$, to determine the weight ratio between the Lasso ($L_1$ norm) and Ridge ($L_2$ norm). Typically, these two parameters are selected through cross-validation, such as using a grid search to find the optimal combination within a given $\{0.001, 0.01, 0.1, 1, 10, 100\}$ parameter range, achieving the best trade-off between bias and variance to enhance the model’s predictive power and generalization capability.
Weighted Multi-Local Mean Representation-based KNN (WLMRKNN) [4]: There are two parameters involved in WLMRKNN, i.e., the neighborhood size $k$ and the regularization parameter $\gamma$. According to [4], for most datasets $k$ is varied from 1 to 15 with a step size of 1 (i.e., $k \in \{1, 2, \ldots, 15\}$). For face databases, $k$ is set from 1 to $n_t$ (the number of training samples per class) with a step size of 1. In addition, $\gamma$ is the regularization coefficient in the objective function, which controls the strength of the locality-constrained term $\| W_j s_j \|_2^2$.
Local Centroid Distance-Constrained Representation-based KNN method (LCDR-KNN) [20]: LCDR-KNN improves the accuracy of KNN classifiers by selecting the k-nearest training samples for each class using the center-of-mass distances of these samples as constraints in conjunction with collaborative representation to compute the weight of each neighbor in the classification. According to [20], the parameter $\lambda_1$ is searched in the set $\{0.001, 0.01, 0.1, 1, 10, 100\}$, and the parameter $\lambda_2$ is searched in the set $\{0.001, 0.01, 0.1, 1, 10, 100\}$.
Shared Style k-Nearest Neighbors (SSL-KNN) [26]: There are 5 parameters to be determined in the SSL-KNN method. According to [26], the parameter $\lambda_1$ controls the strength of the linear expression weights’ sparsification, which is searched iteratively in the set $\{0.001, 0.01, 0.1, 1, 10, 100\}$. The parameter $\lambda_2$ controls the strength of the $L_2$-norm regularization term, which is traversed in the set $\{0.001, 0.01, 0.1, 1, 10, 100\}$. The parameter $\lambda_3$ controls the strength of the testing sample style membership optimization and is iterated through the set $\{0.005, 0.01, 0.05, 0.1, 0.5, 1, 10\}$. The parameter $\lambda_4$ is used to control the degrees of freedom of the style matrix. The larger the value of this parameter, the more the style matrix converges to the identity matrix, and the search is traversed in the set $\{0.005, 0.01, 0.05, 0.1, 0.5, 1, 10\}$. According to [26], the last parameter $\sigma$ is the Gaussian kernel bandwidth, which is traversed in the set $\{0.1, 1, 5\}$.
Multi-View Adaptive k-Nearest Neighbor (MVAKNN) [27]: There are 2 main trade-off parameters to be determined in the proposed MVAKNN method. According to [27], the parameter ρ 1 controls the strength of the L 1 -norm regularization term W 1 (i.e., enforcing sparsity of the correlation matrix W), and its search space is ρ 1 { 10 5 , , 10 1 } . The parameter ρ 2 controls the strength of the Laplacian regularization term Tr ( W T X T L X W ) , and its search space is ρ 2 { 10 5 , , 10 1 } . In addition, σ appears as the Gaussian kernel bandwidth in the weight matrix S i j (e.g., S i j = exp ( x i x j 2 / ( 2 σ 2 ) ) ) when x j is a neighbor of x i , but [27] does not explicitly provide a corresponding grid search range for σ in the experimental parameter settings.
The proposed KMOLNN method: We conducted a systematic grid search for hyperparameter selection. The regularization parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ were tuned within the logarithmic range of $[10^{-3}, 10^{3}]$. Simultaneously, the kernel width $h$ was optimized within the range $[0.1 \sigma_0, 2.0 \sigma_0]$, where $\sigma_0$ denotes the median of the pairwise distances between training samples. This expansive search space ensures that the model’s sensitivity to structural regularization and kernel mapping is thoroughly evaluated across diverse data distributions.
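One way such a search grid could be laid out is sketched below. The grid resolutions (seven logarithmically spaced values per regularization parameter and five kernel widths) and the toy training matrix are assumptions; the text only specifies the ranges and the median-distance heuristic for $\sigma_0$.

```python
import itertools
import numpy as np

def median_pairwise_distance(X):
    """Median Euclidean distance over all distinct pairs of rows of X."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=2))
    upper = np.triu_indices(X.shape[0], k=1)
    return float(np.median(dists[upper]))

# Regularization grid over [1e-3, 1e3]; seven points is an assumed resolution.
lambda_grid = np.logspace(-3, 3, num=7)

# Toy training data (hypothetical) to illustrate the median heuristic.
X_train = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
sigma0 = median_pairwise_distance(X_train)

# Kernel widths spanning [0.1 * sigma0, 2.0 * sigma0]; five points assumed.
width_grid = np.linspace(0.1 * sigma0, 2.0 * sigma0, num=5)

# Every (lambda_1, lambda_2, lambda_3, lambda_4, h) combination.
configs = list(itertools.product(lambda_grid, lambda_grid,
                                 lambda_grid, lambda_grid, width_grid))
```

A full Cartesian grid grows quickly with the resolution, so in practice the per-parameter resolution has to be balanced against the runtime of the alternating optimization.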
During the testing phase, to ensure fairness in comparison, we adopted the prediction rules provided in the relevant literature for predicting testing samples, as shown in Table 3. Additionally, a comprehensive summary of these comparative baseline methods, including their key characteristics and theoretical bases, is provided in Table 4.

4.2. The Benchmark Datasets

In this subsection, we use 15 standard benchmark datasets from the KEEL and UCI repositories. The dataset characteristics are summarized in Table 5. Following the experimental protocol in [26], we adopt a repeated random hold-out evaluation: for each dataset, we randomly split the data into 80% for training and 20% for testing. This procedure is repeated 10 times using different random seeds, and we report the mean of the performance over the 10 runs.
To strictly account for the variance inherent in iterative graph updates and to avoid reporting biased point estimates, all performance metrics (e.g., accuracy and Macro F1) are reported as the mean results over 10 independent random hold-out runs. Furthermore, to mitigate sensitivity to initialization choices, the adaptive adjacency matrix is consistently initialized using a k-means-based strategy, ensuring stable and deterministic convergence across all repeated runs.
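A minimal sketch of the repeated random hold-out protocol follows. The function names and the `evaluate` callback are hypothetical placeholders for training and scoring a classifier; the standard deviation is returned in addition to the mean only as an optional diagnostic and is not part of the reported protocol.

```python
import random
import statistics

def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices with a fixed seed and split them into train/test parts."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def repeated_holdout(n_samples, evaluate, n_runs=10):
    """Run `evaluate(train_idx, test_idx)` for n_runs seeds; return mean and std of the scores."""
    scores = []
    for seed in range(n_runs):
        train_idx, test_idx = holdout_split(n_samples, seed=seed)
        scores.append(evaluate(train_idx, test_idx))
    return statistics.mean(scores), statistics.stdev(scores)

# Placeholder evaluator (an assumption): returns the test fraction instead of a real accuracy.
mean_score, std_score = repeated_holdout(150, lambda tr, te: len(te) / 150)
```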

4.3. Evaluation Metrics

In this study, two commonly used evaluation metrics, i.e., the mean accuracy (ACC) and mean Macro F1-score, are adopted to evaluate the classification performance of all adopted methods on all adopted datasets; they can be calculated, respectively, as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP denotes True Positives, representing the number of samples correctly identified as positive cases by the classifier; TN denotes True Negatives, representing the number of samples correctly identified as negative cases by the classifier; FP denotes False Positives, representing the number of samples incorrectly identified as positive cases by the classifier; and FN denotes False Negatives, representing the number of samples incorrectly identified as negative cases by the classifier.
$$\mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
where $\mathrm{precision} = \frac{TP}{TP + FP}$ and $\mathrm{recall} = \frac{TP}{TP + FN}$. For multi-class classification, the Macro F1-score is calculated by taking the arithmetic mean of the F1-scores for each individual class:
$$\mathrm{Macro\ F1} = \frac{1}{|C|} \sum_{i \in C} \mathrm{F1}_i$$
where $C$ denotes the set of classes, and $\mathrm{F1}_i$ is the F1-score for the $i$-th class. In addition, to statistically analyze the differences between the proposed KMOLNN method and the comparative methods, the Friedman test [3,38] and Bonferroni–Dunn test [39] are adopted in the experiments.
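The two metrics can be computed directly from label lists, as in the sketch below. Two conventions here are assumptions rather than statements from the text: classes are taken from the true labels only, and a class with an undefined precision or recall contributes an F1-score of 0.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the one-vs-rest F1-scores over the classes in y_true."""
    f1_scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for 0/0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy labels (hypothetical): per-class F1 is 0.5, 0.8, and 2/3.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```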

4.4. Generalization Capability Analysis

In this subsection, we evaluate the generalization ability of the proposed KMOLNN method and the comparative methods on the adopted datasets. Each method is run 10 times, and the average accuracy and mean Macro F1-score are compared.
To ensure a rigorous and modern benchmark, our comparison set includes state-of-the-art (SOTA) variants published in 2024 (SSL-KNN and MVAKNN) and 2025 (PRKNN). As summarized in Table 4, while recent methods like MVAKNN integrate kernelization and manifold learning, KMOLNN uniquely employs an adaptive Laplacian optimization framework combined with a theoretical link to Bayesian decision rules. This allows KMOLNN to consistently outperform these strong alternatives, especially on datasets with intricate nonlinear manifolds. All the experimental results are presented in Table 6 and Table 7.
To provide a more intuitive and diagnostic comparison across the entire benchmark suite, we have included a per-dataset Win/Tie/Loss (W/T/L) summary at the bottom of both Table 6 and Table 7. As shown in the W/T/L statistics, KMOLNN exhibits a consistent and overwhelming advantage, strictly losing to or tying with the competing baselines in only a marginal fraction of the datasets. This diagnostic analysis directly confirms the robust superiority of our algorithm across varying data distributions.
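A Win/Tie/Loss summary of this kind is a simple per-dataset count. The sketch below shows the bookkeeping; the tie tolerance and the score values are hypothetical and are not taken from Table 6 or Table 7.

```python
def win_tie_loss(ours, baseline, tol=1e-9):
    """Count datasets where `ours` is better than, equal to, or worse than `baseline`."""
    wins = ties = losses = 0
    for a, b in zip(ours, baseline):
        if abs(a - b) <= tol:
            ties += 1
        elif a > b:
            wins += 1
        else:
            losses += 1
    return wins, ties, losses

# Hypothetical per-dataset accuracies (in %), for illustration only.
ours = [90.0, 85.5, 70.0, 99.0]
baseline = [88.0, 85.5, 72.5, 98.0]
summary = win_tie_loss(ours, baseline)
```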
From Table 6 and Table 7, we can draw the following conclusions:
  • KMOLNN outperformed the other methods in terms of both accuracy and Macro F1-score on the majority of datasets. For instance, on the Sonar dataset, which has a complex structure, KMOLNN achieved an accuracy of 90.71%, surpassing the runner-up MVAKNN (90.47%) and clearly outperforming the traditional KNN (82.61%). On the Pendigits dataset (with 16 dimensions and 10 classes), KMOLNN achieved an accuracy of 99.49%, outperforming MVAKNN’s 99.09% and Elastic Net’s 97.79%. This highlights the efficacy of kernelized manifold optimization at capturing complex data manifolds and enhancing generalization capability.
  • While advanced methods like MVAKNN and SSL-KNN demonstrated strong competitiveness, KMOLNN maintained the leading position. MVAKNN emerged as the strongest competitor, achieving the best results on datasets such as Wine and Vowel. However, KMOLNN provided a more robust mechanism overall, particularly on larger or more diverse datasets. For example, on the Letter Recognition dataset (sample size 20,000), KMOLNN attained an accuracy of 98.30%, higher than that of MVAKNN (97.80%) and Elastic Net (95.10%). In terms of overall mean accuracy across all 15 datasets, KMOLNN achieved approximately 90.76%, whereas the traditional KNN scored 84.23%, representing a substantial improvement.
  • The Macro F1-score results further confirm the balanced precision–recall capabilities of KMOLNN. On the datasets with potential multi-class challenges, such as Ecoli (8 classes) and Yeast (10 classes), KMOLNN maintained superior Macro F1-scores. For instance, on the Yeast dataset, KMOLNN achieved a Macro F1-score of 63.6%, which was the highest among all methods, surpassing MVAKNN (62.5%) and clearly outperforming KNN (50.5%). The Macro F1-score of KMOLNN across all datasets reached 88.62%, suggesting that the implicit Bayesian decision rule in our method handles class separability well even in difficult scenarios.
Finally, the grouped analysis supports its adaptability. For high-dimensional datasets like Sonar (60 dimensions), KMOLNN’s accuracy of 90.71% suggests that the manifold-preserving regularization helps mitigate the curse of dimensionality. For the larger datasets, such as Pendigits and Letter Recognition, KMOLNN consistently ranked first in accuracy, although at a higher runtime (see Section 4.5). Overall, these results, obtained through repeated trials across diverse benchmarks, support the practical advantages of KMOLNN for nonlinear classification tasks and provide the basis for the subsequent statistical tests, such as the Friedman test.
To go beyond narrative explanations to empirically understand why KMOLNN achieves these results, it is critical to analyze the dataset-wise patterns. On datasets with highly complex, nonlinear decision boundaries (such as Sonar and Ionosphere), standard linear methods like LLKNN often fail. LLKNN assumes local linearity in the original ambient space, which leads to assigning high weights to “false neighbors” that cross the folded nonlinear manifold.
In KMOLNN, first, the kernelization unfolds the nonlinear data into a higher-dimensional RKHS where the local linear reconstruction assumption becomes valid. Second, rather than relying on a static Euclidean graph, our adaptive Laplacian matrix dynamically recalculates similarities during optimization. This allows KMOLNN to actively prune those false neighbors and preserve only the true geodesic neighborhood, translating our theoretical manifold preservation directly into the practical accuracy gains observed in Table 6.
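To make the role of the Laplacian term concrete, the sketch below builds $L = D - A$ from a symmetric adjacency matrix and checks the standard identity $w^T L w = \frac{1}{2} \sum_{i,j} A_{ij} (w_i - w_j)^2$, which shows that the penalty is small when connected samples receive similar weights. The adjacency values and weight vectors are hypothetical, and the rule by which KMOLNN updates $A$ during optimization is not reproduced here.

```python
def laplacian(A):
    """Graph Laplacian L = D - A for a symmetric adjacency matrix given as nested lists."""
    n = len(A)
    degrees = [sum(row) for row in A]
    return [[(degrees[i] if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]

def quadratic_form(L, w):
    """Compute w^T L w."""
    n = len(w)
    return sum(w[i] * L[i][j] * w[j] for i in range(n) for j in range(n))

def pairwise_penalty(A, w):
    """Compute 0.5 * sum_ij A_ij (w_i - w_j)^2, which equals w^T L w for symmetric A."""
    n = len(w)
    return 0.5 * sum(A[i][j] * (w[i] - w[j]) ** 2 for i in range(n) for j in range(n))

# Hypothetical graph: samples 0 and 1 are strongly connected, 1 and 2 weakly.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.5],
     [0.0, 0.5, 0.0]]
L = laplacian(A)
smooth = quadratic_form(L, [0.4, 0.4, 0.2])  # connected samples share similar weights
rough = quadratic_form(L, [0.8, 0.1, 0.1])   # strongly connected samples disagree
```

In this toy case the smooth weight vector incurs a much smaller penalty than the rough one, which is the behavior the regularizer relies on.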

4.5. Runtime Analysis

In this section, we conduct a comprehensive evaluation of the computational efficiency of the proposed KMOLNN method against eight comparative methods. Table 8 summarizes the average running times of all adopted methods on all adopted datasets.
A clear pattern in Table 8 is that the simple neighbor-based classifiers (e.g., KNN/FKNN/PRKNN) are consistently the fastest, because their computation is dominated by distance evaluation and local voting, with no expensive training or iterative parameter updates. The methods that incorporate regularization or additional learning modules (e.g., Elastic Net, LCDR-KNN, WLMRKNN) typically require more time due to extra optimization steps, but they still remain moderate on most datasets.
In contrast, the slowest methods in Table 8 are SSL-KNN and KMOLNN. The table indicates that SSL-KNN has the largest runtime overall, while KMOLNN is usually the second most time-consuming, and the gap becomes especially visible on larger datasets (e.g., Pendigits, Satimage, Letter Recognition). This trend suggests that KMOLNN’s cost scales more sharply with the number of samples than the lighter baselines.
The runtime overhead of KMOLNN mainly comes from: (1) kernel-related computations, where building and using kernel similarities can be expensive (often near quadratic in sample size); (2) constructing and updating the adaptive neighborhood graph and its Laplacian regularization, which is typically performed iteratively; and (3) alternating optimization procedures (such as repeated updates of weights/structures), which accumulate time across iterations. As datasets grow, these steps increase both the computation and memory demand, explaining the stronger slowdown on large-scale benchmarks.
Although KMOLNN is slower, its extra cost supports its nonlinear representation ability (via kernel mapping) and manifold/structure preservation (via graph regularization), which are designed to improve the classification quality. For better scalability, future work could adopt kernel approximations (e.g., Nyström or random Fourier features), reduce the number of optimization iterations, or use parallel/GPU acceleration for the graph and matrix operations.
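As an illustration of the kernel approximation idea mentioned above, the following sketch uses random Fourier features to approximate a Gaussian kernel with an explicit finite-dimensional map. It assumes the kernel form $\exp(-\|x - y\|^2 / (2h^2))$; the function names, the feature count, and the test points are hypothetical, and this is not part of the KMOLNN implementation.

```python
import math
import random

def gaussian_kernel(x, y, h):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 h^2)) (assumed form)."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * h ** 2))

def make_rff(dim, n_features, h, seed=0):
    """Return a feature map z(.) such that z(x) . z(y) approximates the Gaussian kernel."""
    rng = random.Random(seed)
    # Frequencies are drawn from N(0, 1/h^2), the spectral density of the kernel above.
    freqs = [[rng.gauss(0.0, 1.0 / h) for _ in range(dim)] for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def features(x):
        return [scale * math.cos(sum(f_k * x_k for f_k, x_k in zip(f, x)) + b)
                for f, b in zip(freqs, phases)]

    return features

features = make_rff(dim=2, n_features=2000, h=1.0)
x, y = (0.2, -0.1), (0.5, 0.3)
exact = gaussian_kernel(x, y, h=1.0)
approx = sum(a * b for a, b in zip(features(x), features(y)))
```

With such a map, inner products of $D$-dimensional features replace the $n \times n$ kernel matrix, so the cost depends on $nD$ rather than $n^2$; whether the accuracy of KMOLNN is preserved under this approximation would need to be verified experimentally.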
The experimental results present a clear trade-off: KMOLNN consistently ranks among the most accurate methods but has a higher computational cost compared to the linear baselines. We argue that this extra complexity is essential for capturing the ‘straightened’ manifold structure in high-dimensional feature spaces. Specifically, for datasets like Sonar (60 dimensions), the manifold-preserving regularization effectively mitigates the curse of dimensionality, a gain that justifies the $O(n^2)$ complexity. However, the scalability remains a critical constraint; as the sample size increases to 20,000 (e.g., Letter Recognition), the alternating optimization cycles accumulate significant overhead. Therefore, KMOLNN is particularly recommended for complex nonlinear classification tasks where accuracy and interpretability outweigh strict real-time constraints.

4.6. Statistical Analysis

To statistically analyze the proposed KMOLNN method, the Friedman test is first adopted to evaluate the difference among all adopted methods, and then the Bonferroni–Dunn test is further adopted to evaluate the differences between the proposed KMOLNN method and the comparative methods. The average ranking values of all adopted methods across the datasets, which serve as the basis for these statistical tests, are presented in Table 9.
According to [38], the Friedman test is employed to determine whether there exists a significant difference in generalization capability between the proposed KMOLNN method and all comparative methods across the 15 benchmark datasets. The null hypothesis of this test posits that all methods exhibit no statistically significant difference in their generalization capability. The calculation formula for the Friedman statistic is as follows:
$$\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right]$$
where $R_j$ denotes the average accuracy ranking of the $j$-th method across all benchmark datasets, $k$ represents the number of methods (here, $k = 9$), and $N$ denotes the number of benchmark datasets (here, $N = 15$). Based on the calculations, $\chi_F^2 = 98.71$. A further transformation yields the F-distribution statistic:
$$F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2} = 64.91$$
This statistic follows an F-distribution with degrees of freedom $(k-1, (k-1)(N-1))$, that is, $(8, 112)$. At the significance level $\alpha = 0.05$, the critical value for the F-distribution is approximately 1.95. Since $F_F = 64.91 > 1.95$, we reject the null hypothesis, indicating that there exist statistically significant differences in the generalization capability among these methods.
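The Friedman computation can be reproduced with a short script. The sketch below computes average ranks (ties share the mean rank), the Friedman statistic, and its F-distributed transformation; the function names are hypothetical, and only the reported value $\chi_F^2 = 98.71$ with $N = 15$ and $k = 9$ is taken from the text.

```python
def average_ranks(scores):
    """scores[d][m] is the score of method m on dataset d; rank 1 is best, ties share the mean rank."""
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        order = sorted(range(n_methods), key=lambda m: -row[m])
        pos = 0
        while pos < n_methods:
            end = pos
            while end + 1 < n_methods and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2.0 + 1.0  # mean of the 1-based positions pos+1 .. end+1
            for m in order[pos:end + 1]:
                totals[m] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]

def friedman_chi2(avg_ranks, n_datasets):
    """Friedman statistic computed from the average ranks of k methods over N datasets."""
    k = len(avg_ranks)
    rank_term = sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0
    return 12.0 * n_datasets / (k * (k + 1)) * rank_term

def f_statistic(chi2, n_datasets, k):
    """F-distributed transformation of the Friedman statistic."""
    return (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)

# Reproduce F_F from the reported chi-square value.
f_stat = f_statistic(98.71, n_datasets=15, k=9)

# Toy score table (hypothetical) to exercise the ranking helper.
toy_ranks = average_ranks([[0.9, 0.8, 0.8], [0.7, 0.9, 0.6]])
```

Evaluating the last function with the reported inputs gives a value that rounds to 64.91.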
According to the literature [39], the Bonferroni–Dunn test is employed for post hoc comparison, treating the proposed KMOLNN method as the control method and performing pairwise comparisons with the other methods. The most crucial statistic in this test is the critical difference (CD), calculated as follows:
$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}$$
where $q_\alpha = 2.724$ denotes the critical value of the statistic for $k = 9$ and $\alpha = 0.05$. The calculation yields $CD = 2.724$. When the absolute difference in average ranks between two methods exceeds the CD, it signifies a statistically significant difference in generalization capability; otherwise, the difference is not significant. We present the absolute differences in average ranks between the proposed KMOLNN method and the remaining 8 comparison methods in Table 10 and compare them with the critical value $CD = 2.724$.
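The critical difference is easy to verify numerically: with $k = 9$ and $N = 15$, the factor under the square root equals 1, so CD coincides with $q_\alpha$. The sketch below also applies the decision rule to the average ranks quoted in this section for KNN (8.4) and KMOLNN (1.83); the function name is hypothetical.

```python
import math

def critical_difference(q_alpha, k, n_datasets):
    """Bonferroni-Dunn critical difference for k methods compared over N datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

cd = critical_difference(2.724, k=9, n_datasets=15)

# Average ranks quoted in the text: KNN = 8.4, KMOLNN = 1.83.
rank_gap = abs(8.4 - 1.83)
significant = rank_gap > cd
```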
Based on the rankings in Table 9, we can derive a robust statistical conclusion regarding the proposed KMOLNN’s performance.
First, the Friedman test confirms that the performance differences among the nine methods are statistically significant and not due to random chance. The calculated statistic ($F_F = 64.91$) far exceeds the critical threshold of 1.95. This result establishes that the methods perform differently across the benchmark datasets, justifying the need for pairwise comparisons. The Bonferroni–Dunn post hoc test further clarifies KMOLNN’s position relative to its peers, revealing a distinct two-tier separation among the comparative methods:
  • Contrary to the assumption that simple linear models might suffice, KMOLNN demonstrates a statistically significant advantage over KNN, FKNN, PRKNN, Elastic Net, and LCDR-KNN. Notably, the performance gap is widest between KMOLNN (average rank 1.83) and the standard KNN (average rank 8.4), with a rank difference of 6.57, well above the critical difference (CD) of 2.724. This statistically supports the claim that our kernelization and manifold-preserving strategies overcome the limitations of traditional linear and neighbor-based methods.
  • When compared to the most advanced methods (MVAKNN, SSL-KNN, and WLMRKNN), the rank differences fall below the critical difference, so the null hypothesis of equal performance cannot be rejected. It is particularly noteworthy that KMOLNN ties with MVAKNN for the top rank (both at 1.83). This indicates that while KMOLNN is not statistically superior to these top-tier methods, it reaches the state-of-the-art level, offering a highly competitive alternative that matches the performance of complex multi-kernel and style-based approaches.
In summary, the statistical evidence indicates that KMOLNN is a significant improvement over the standard baselines and performs on par with the best-performing algorithms in the comparison.

4.7. Limitations and Future Work

While KMOLNN demonstrates significant benefits for capturing nonlinear manifolds and providing theoretical interpretability, several limitations must be acknowledged. First, the computational complexity of $O(n^2)$ arises from the iterative kernel and Laplacian updates, which poses challenges for real-time processing on large-scale datasets compared to simpler baselines. Second, the model’s performance exhibits sensitivity to the choice of kernel parameters and the initial quality of the adaptive graph. Furthermore, although the alternating optimization converges effectively in our experiments, its numerical stability can depend on the initialization strategy.
Future research will focus on improving the inference efficiency by incorporating kernel approximation methods or algorithm unrolling techniques. Additionally, we aim to extend the evaluation of KMOLNN to more complex real-world scenarios, such as multi-modal medical imaging, to further validate its generalization capability beyond standard benchmark suites.

4.8. Qualitative Analysis of Interpretability

To provide a concrete understanding of the learned representation weights and the manifold-preserving mechanism, we conduct a qualitative analysis on a representative case from the Iris dataset, visualized in Figure 3. Figure 3a displays the heatmap of the learned adaptive adjacency matrix $A$, with the training samples sorted by class. The distinct block-diagonal structure indicates that our manifold-preserving optimization captures the intrinsic geometric structure of the kernel space, maintaining strong connections (dark blue) strictly among the intra-class samples. Furthermore, Figure 3b illustrates the learned probability weights $w$ for a specific test sample. The model assigns similar and elevated weights (red bars) to a group of neighbors belonging to the correct class, while suppressing others (gray bars). This visual evidence is consistent with our mathematical proof of the nearest neighbor group effect and indicates that the optimized weights in KMOLNN possess concrete interpretability linked to the Bayesian decision rule.

4.9. Ablation Study

To move beyond theoretical intuition and provide a direct empirical decomposition of the proposed framework, we design an ablation study. This analysis explicitly justifies why the specific joint integration of kernel mapping and adaptive manifold-preserving regularization is practically preferable to simpler alternatives, such as relying on kernel mapping or graph regularization alone.
Furthermore, to directly address the necessity of our proposed framework against baseline kernel methods, this ablation study isolates the standard kernel k-nearest neighbors [2] (KKNN) approach as a specific comparative variant. We evaluate the performance of the full KMOLNN method against three degraded variants on three representative nonlinear datasets (Sonar, Ionosphere, and Wine):
KMOLNN (linear): The kernel mapping function $\phi(\cdot)$ is removed. The reconstruction is performed in the original input space, reducing the model to a locally linear KNN. Its objective function degrades to
$$\min_{w_i, A} \left\| y_{test} - \sum_{i=1}^{n} w_i x_i \right\|_2^2 + \eta \sum_{i=1}^{n} w_i \ln w_i + \lambda \sum_{i=1}^{n} d_i^2 w_i + \gamma w^T L w + \theta \|A\|_F^2$$
KMOLNN (w/o manifold)/KKNN: The manifold-preserving regularization term is disabled by setting the parameter $\gamma = 0$. By retaining the high-dimensional kernel mapping but omitting the manifold optimization, this variant is functionally equivalent to the standard kernel k-nearest neighbors baseline. This allows us to directly evaluate the specific performance gain achieved by transitioning from a basic kernelized approach to our manifold-optimized framework. Its objective function degrades to
$$\min_{w_i} \left\| \phi(y_{test}) - \sum_{i=1}^{n} w_i \phi(x_i) \right\|_2^2 + \eta \sum_{i=1}^{n} w_i \ln w_i + \lambda \sum_{i=1}^{n} d_i^2 w_i$$
KMOLNN (fixed graph): The adaptive graph learning mechanism is removed. This variant uses a static, predefined KNN graph based on standard Euclidean distances instead of dynamically updating the adjacency matrix $A$. Consequently, the graph regularization over $A$ is omitted:
$$\min_{w_i} \left\| \phi(y_{test}) - \sum_{i=1}^{n} w_i \phi(x_i) \right\|_2^2 + \eta \sum_{i=1}^{n} w_i \ln w_i + \lambda \sum_{i=1}^{n} d_i^2 w_i + \gamma w^T L_{fixed} w$$
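The kernelized reconstruction term in these objectives never requires $\phi$ explicitly, because $\| \phi(y) - \sum_i w_i \phi(x_i) \|^2 = K(y, y) - 2 \sum_i w_i K(y, x_i) + \sum_{i,j} w_i w_j K(x_i, x_j)$. The sketch below evaluates the objective of the variant without the manifold term for a given weight vector. Several details are assumptions made for illustration: the Gaussian kernel form, taking $d_i$ as the input-space Euclidean distance between the test sample and $x_i$, the convention $0 \ln 0 = 0$, and all numeric values.

```python
import math

def gaussian_kernel(a, b, h):
    """Gaussian kernel exp(-||a - b||^2 / (2 h^2)); this exact form is an assumption."""
    return math.exp(-math.dist(a, b) ** 2 / (2.0 * h ** 2))

def objective_without_manifold(y_test, X, w, h, eta, lam):
    """Reconstruction error in kernel space plus entropy and locality terms (gamma = 0)."""
    n = len(X)
    k_y = [gaussian_kernel(y_test, x, h) for x in X]
    # Kernel trick: expand the squared norm using kernel evaluations only.
    recon = gaussian_kernel(y_test, y_test, h)
    recon -= 2.0 * sum(w[i] * k_y[i] for i in range(n))
    recon += sum(w[i] * w[j] * gaussian_kernel(X[i], X[j], h)
                 for i in range(n) for j in range(n))
    entropy = sum(wi * math.log(wi) for wi in w if wi > 0.0)  # 0 * ln 0 treated as 0
    # d_i is assumed to be the Euclidean distance from the test sample to x_i.
    locality = sum(math.dist(y_test, x) ** 2 * wi for x, wi in zip(X, w))
    return recon + eta * entropy + lam * locality

# Toy case (hypothetical): the test sample coincides with the first training sample.
X = [(0.0, 0.0), (1.0, 0.0)]
y_test = (0.0, 0.0)
near = objective_without_manifold(y_test, X, [1.0, 0.0], h=1.0, eta=0.1, lam=1.0)
far = objective_without_manifold(y_test, X, [0.0, 1.0], h=1.0, eta=0.1, lam=1.0)
```

Placing all weight on the coinciding sample gives a lower objective value than placing it on the distant one, as expected from the reconstruction and locality terms.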
The performance of these variants, measured by both the accuracy and Macro F1-score, is summarized in Table 11.
As observed in Table 11, the progressive integration of each component steadily improves both the accuracy and the Macro F1-score.
First, mapping the data onto a high-dimensional feature space is essential. The KMOLNN (linear) variant suffers the most significant performance drop, particularly on the highly nonlinear Sonar dataset (accuracy drops to 81.25%). This validates that kernelization is crucial for handling complex nonlinear distributions.
Second, the comparison between the KMOLNN (w/o manifold)/KKNN baseline and our full method provides clear evidence of the necessity of the manifold-preserving regularization. While the basic KKNN approach improves upon the linear variant, its performance remains suboptimal (e.g., 86.54% on Sonar) because it lacks the geometric constraints needed to maintain local neighborhood structures in the kernel space. By enabling manifold regularization, the full KMOLNN method re-establishes the nearest neighbor grouping effect, boosting the accuracy to 90.71% and ensuring much more stable weight distributions (as reflected by the higher Macro F1-scores).
Finally, comparing the full method with the KMOLNN (fixed graph) variant demonstrates the limitations of predefined similarity measures. By dynamically updating the adjacency matrix, the adaptive Laplacian matrix filters out the irrelevant connections and precisely captures the dynamic manifold, yielding the highest classification performance.
Crucially, these empirical observations provide direct grounding for our theoretical claims. The significant performance drop in the KKNN variant empirically validates the necessity of the nearest neighbor grouping effect; without this adaptive geometric constraint, the model fails to maintain local smoothness. Furthermore, the fact that KMOLNN consistently outputs sparse, valid probability weights that translate into state-of-the-art accuracy across these diverse datasets serves as strong empirical confirmation that our optimized objective effectively approximates the optimal Bayesian decision rule in practice.

4.10. Parameter Sensitivity and Robustness

To evaluate the stability and robustness of the KMOLNN algorithm, a systematic parameter sensitivity analysis is conducted on the representative Sonar dataset. Based on the overall objective function, we focus on two critical sets of hyperparameter interactions: the relationship between the entropy regularization parameter $\lambda_1$ and the locality constraint parameter $\lambda_2$, and the interaction between the adaptive manifold-preserving coefficient $\lambda_3$ and the Gaussian kernel width $h$. During this bivariate analysis, the adjacency matrix regularization parameter $\lambda_4$ and the distance penalty factor $\varphi$ are kept fixed at their pre-determined optimal values.

4.10.1. Interaction of $\lambda_1$ and $\lambda_2$

Figure 4a presents the 3D surface plot of classification accuracy as the entropy regularization $\lambda_1$ and the locality constraint $\lambda_2$ vary simultaneously across a wide logarithmic range from $10^{-3}$ to $10^{3}$. The visualization reveals a broad, stable “plateau” of high performance, where the classification accuracy remains above 88% across most parameter combinations. The prominent peak observed around $\lambda_1 = \lambda_2 = 10^{0}$ indicates that the model achieves its best balance between weight sparsity (driven by entropy) and localized reconstruction stability (driven by the distance penalty) in this central region. The absence of sharp, localized spikes suggests that KMOLNN is robust and not overly sensitive to precise hyperparameter tuning within this subspace, which reduces the computational burden typically required for a fine-grained grid search.

4.10.2. Interaction of $\lambda_3$ and $h$

Figure 4b provides a 2D heatmap illustrating the sensitivity of accuracy with respect to the adaptive manifold coefficient $\lambda_3$ and the Gaussian kernel width $h$. A striking structural pattern is the vertical consistency of the accuracy scores. For a fixed kernel width (e.g., $h = 1.0\sigma_0$), the classification accuracy remains virtually identical, varying narrowly between 90.5% and 91.1%, even as $\lambda_3$ scales across six orders of magnitude from $10^{-3}$ to $10^{3}$. This stability demonstrates the robustness of the adaptive manifold optimization mechanism; the model reliably learns the dynamic graph structure $A$ regardless of the global penalty weight assigned to the structural terms. Conversely, the performance exhibits higher sensitivity to the kernel width $h$. As shown by the prominent vertical dark-blue band, the optimal performance is concentrated around $h = 1.0\sigma_0$. Extreme deviations, such as an excessively small ($0.1\sigma_0$) or large ($2.0\sigma_0$) kernel width, result in a severe performance decline. This behavior aligns with the theoretical expectations, as $h$ directly dictates the resolution of the nonlinear mapping $\phi(x)$; a value well matched to the default data distribution scale $\sigma_0$ is essential for maintaining a discriminative neighborhood structure in the kernel space. In summary, the sensitivity analysis indicates that KMOLNN maintains stable classification performance within a wide hyperparameter search space, provided the kernel width is chosen close to $\sigma_0$, which supports its use in complex real-world classification tasks where prior knowledge of optimal parameters may be limited.

5. Conclusions

This study proposes the Kernelized Manifold-Optimized Linear k-Nearest Neighbors method to address the limitations of traditional linear KNN on nonlinear distributed data. By integrating kernel mapping with adaptive manifold-preserving regularization, the model effectively captures complex data structures while restoring the physical interpretability of representation weights. The theoretical analysis confirms the existence of a nearest neighbor group effect and demonstrates the model’s intrinsic connection to the Bayesian decision rule. The experimental results on all adopted datasets confirm that the proposed KMOLNN method achieves an enhanced generalization capability compared to the comparative methods.
Further work related to this study may focus on the following aspects: (1) The proposed KMOLNN method relies on an iterative alternating optimization process, which provides interpretability but leads to high time consumption and limits inference speed; new optimization methods will be developed to achieve faster optimization. In particular, future work could explore Deep Unfolding Networks [30] to map the iterations of the proposed optimization algorithm into layers of a deep neural network. (2) While the proposed model provides improved transparency through its learned weight distribution, its interpretability is primarily geometric (reflecting manifold neighborhoods) rather than semantic at the feature level, which remains to be addressed.

Author Contributions

J.Z.: Conceptualization, Validation, Methodology and Software. Z.B.: Methodology and Writing—review and editing. L.Z.: Conceptualization, Methodology, Writing—original draft, and Software. F.W.: Methodology and Writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62306126, and the Jiangsu Province Youth Science and Technology Talent Support Project, grant number JSTJ2024283.

Data Availability Statement

Publicly available datasets were analyzed in this study. These datasets can be found in the UCI Machine Learning Repository (https://archive.ics.uci.edu, accessed on 1 June 2026) and the KEEL Dataset Repository (https://sci2s.ugr.es/keel/datasets.php, accessed on 1 June 2026).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could appear to have influenced the work reported on in this paper.

Appendix A

Proof of Theorem 1.
We aim to prove that the objective function possesses the grouping effect; that is, for two samples $x_i$ and $x_j$ belonging to the same class, when their similarities to the testing sample $y$ are close (i.e., $d_i \approx d_j$), the optimized weights satisfy $p_i^* \approx p_j^*$.
With a fixed $A$, the objective function is
$$J(p) = \left\| \phi(y) - \sum_{i=1}^{m} p_i \phi(x_i) \right\|^2 + \lambda_1 \sum_{i=1}^{m} p_i \log p_i + \lambda_2 \sum_{i=1}^{m} (p_i - \beta d_i)^2 + \lambda_3 p^T L(A) p$$
Taking the partial derivative of $J$ with respect to $p_i$, we obtain
$$\frac{\partial J}{\partial p_i} = 2 \left( \sum_{j=1}^{m} K(x_i, x_j) p_j - K(y, x_i) \right) + \lambda_1 (\log p_i + 1) + 2 \lambda_2 (p_i - \beta d_i) + 2 \lambda_3 (L(A) p)_i$$
At the optimal solution, $\partial J / \partial p_i^* = 0$. For $x_i$ and $x_j$ of the same class, when $K(y, x_i) \approx K(y, x_j)$ and $d_i \approx d_j$, we calculate the difference between their gradients at the optimal weights $p^*$:
$$\frac{\partial J}{\partial p_i^*} - \frac{\partial J}{\partial p_j^*} = 0$$
We get
$$2 (k_i - k_j)^T p^* + 2 (k_{yj} - k_{yi}) + \lambda_1 \log \frac{p_i^*}{p_j^*} + 2 \lambda_2 \left[ (p_i^* - p_j^*) - \beta (d_i - d_j) \right] + 2 \lambda_3 \left[ (L(A) p^*)_i - (L(A) p^*)_j \right] = 0$$
where, matching terms with the gradient above, $k_i$ collects the kernel values $K(x_i, x_j)$, $j = 1, \ldots, m$, and $k_{yi} = K(y, x_i)$.
Assuming $k_i \approx k_j$, $k_{yi} \approx k_{yj}$, and $d_i \approx d_j$, and simplifying the Laplacian term (since strongly connected intra-class samples have negligible penalty differences), we obtain
$$\lambda_1 \log \frac{p_i^*}{p_j^*} + 2 \lambda_2 (p_i^* - p_j^*) \approx 0$$
Let $u = p_i^* / p_j^*$, then $(p_i^* - p_j^*) = p_j^* (u - 1)$. Substitute this into the equation
$$\lambda_1 \log u + 2 \lambda_2 p_j^* (u - 1) = 0$$
When $u$ is close to 1 ($u \approx 1$), we can apply the Taylor approximation $\log u \approx u - 1$. Therefore,
$$(u - 1) \left( \lambda_1 + 2 \lambda_2 p_j^* \right) \approx 0$$
Since the parameter $(\lambda_1 + 2 \lambda_2 p_j^*) > 0$, we must have $u \approx 1$, namely $p_i^* \approx p_j^*$.
This property of the regularization term $\lambda_2 \sum_{i=1}^{m} (p_i - \beta d_i)^2$ ensures that when $d_i \approx d_j$, the assigned weights satisfy $p_i^* \approx p_j^*$. Furthermore, the Laplacian term $\lambda_3 p^T L(A) p$ enhances this effect by explicitly encouraging connected samples to have similar weights. Therefore, the objective function exhibits a strong grouping effect. □
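As a side remark, the scalar equation in $u$ used in the proof can also be examined without the Taylor approximation. The following sketch assumes $\lambda_1 > 0$, $\lambda_2 \geq 0$, and $p_j^* > 0$ (sign conditions stated here as assumptions), and uses the auxiliary function $g$, which is introduced only for this remark:

$$g(u) = \lambda_1 \log u + 2 \lambda_2 p_j^* (u - 1), \qquad g(1) = 0, \qquad g'(u) = \frac{\lambda_1}{u} + 2 \lambda_2 p_j^* > 0 \quad \text{for } u > 0.$$

Because $g$ is strictly increasing on $(0, \infty)$ and vanishes at $u = 1$, this is its only root. Under these assumptions, a small residual on the right-hand side (arising from the neglected kernel, distance, and Laplacian differences) can therefore only move $u$ slightly away from 1, which is in line with the conclusion $p_i^* \approx p_j^*$.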

References

  1. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
  2. Altman, N.S. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. Am. Stat. 1992, 46, 175–185.
  3. Jiang, J.; Wu, J.; Luo, J.; Meng, X.; Qian, L.; Li, K. KATSA: KNN Ameliorated Tree Seed Algorithm for complex optimization problems. Expert Syst. Appl. 2025, 280, 127465.
  4. Gou, J.; Qiu, W.; Yi, Z.; Shen, X.; Zhan, Y.; Ou, W. Locality constrained representation-based K-nearest neighbor classification. Knowl. Based Syst. 2019, 167, 38–52.
  5. He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval; ACM: New York, NY, USA, 2020.
  6. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 1996, 58, 267–288.
  7. Zhong, L.W.; Kwok, J.T. Efficient sparse modeling with automatic feature grouping. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1436–1447.
  8. Wang, J.; Yang, J.; Yu, K.; Lv, F.; Huang, T.; Gong, Y. Locality-constrained linear coding for image classification. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2010; pp. 3360–3367.
  9. Liu, Q.; Liu, C. A novel locally linear KNN method with applications to visual recognition. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2010–2021.
  10. Xu, Y.-L.; Chen, S.; Luo, B. A weighted locally linear KNN model for image recognition. In Proceedings of the CCF Chinese Conference on Computer Vision; Springer: Singapore, 2017; pp. 567–578.
  11. Zhang, S.; Zong, M.; Sun, K.; Liu, Y.; Cheng, D. Efficient kNN algorithm based on graph sparse reconstruction. In Proceedings of the International Conference on Advanced Data Mining and Applications; Springer: Cham, Switzerland, 2014; pp. 356–369.
  12. Zhang, S.; Li, J. KNN classification with one-step computation. IEEE Trans. Knowl. Data Eng. 2021, 35, 2711–2723.
  13. Cao, J.; Li, Z.; Li, J. Financial time series forecasting model based on CEEMDAN and LSTM. Phys. A Stat. Mech. Its Appl. 2019, 519, 127–139.
  14. Min, S.; Lee, B.; Yoon, S. Deep learning in bioinformatics. Brief. Bioinform. 2017, 18, 851–869.
  15. Bian, Z.; Zhang, J.; Chung, F.L.; Wang, S. Residual Sketch Learning for a Feature-Importance-Based and Linguistically Interpretable Ensemble Classifier. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 10461–10474.
  16. Venkata Krishna Reddy, V.; Vijaya Kumar Reddy, R.; Siva Krishna Munaga, M.; Karnam, B.; Maddila, S.K.; Sekhar Kolli, C. Deep learning-based credit card fraud detection in federated learning. Expert Syst. Appl. 2024, 255, 124493.
  17. Zhang, S.; Li, X.; Zong, M.; Zhu, X.; Wang, R. Efficient kNN Classification With Different Numbers of Nearest Neighbors. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 1774–1785.
  18. Mcdonald, G.C. Ridge regression. Wiley Interdiscip. Rev. Comput. Stat. 2010, 1, 93–100.
  19. Liu, Q.; Liu, C. A novel locally linear KNN model for visual recognition. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2015; pp. 1329–1337.
  20. Zhao, Y.; Liu, Y.; Liu, X.; Zou, E. Local Centroid Distance Constrained Representation-Based K-Nearest Neighbor Classifier. In Proceedings of the China Conference on Wireless Sensor Networks; Springer: Singapore, 2020.
  21. Li, B.; Chen, Y.W.; Chen, Y.Q. The Nearest Neighbor Algorithm of Local Probability Centers. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2008, 38, 141–154.
  22. Mullick, S.S.; Datta, S.; Das, S. Adaptive Learning-Based k-Nearest Neighbor Classifiers With Resilience to Class Imbalance. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5713–5725.
  23. Wang, J.; Neskovic, P.; Cooper, L.N. Neighborhood size selection in the k-nearest-neighbor rule using statistical confidence. Pattern Recognit. 2006, 39, 417–423. [Google Scholar] [CrossRef]
  24. Manocha, S.; Girolami, M.A. An empirical analysis of the probabilistic K-nearest neighbour classifier. Pattern Recognit. Lett. 2007, 28, 1818–1824. [Google Scholar] [CrossRef]
  25. Cheng, D.; Zhang, S.; Deng, Z.; Zhu, Y.; Zong, M. kNN algorithm with data-driven k value. In Proceedings of the International Conference on Advanced Data Mining and Applications; Springer: Cham, Switzerland, 2014; pp. 499–512. [Google Scholar]
  26. Zhang, J.; Bian, Z.; Wang, S. Shared style linear k nearest neighbor classification method. Expert Syst. Appl. 2024, 241, 122702. [Google Scholar] [CrossRef]
  27. Fan, Z.; Huang, Y.; Xi, C.; Liu, Q. Multiview Adaptive K-Nearest Neighbor Classification. IEEE Trans. Artif. Intell. 2024, 5, 1221–1234. [Google Scholar] [CrossRef]
  28. Zhang, Z.; Lai, Z.; Xu, Y.; Shao, L.; Wu, J.; Xie, G.S. Discriminative Elastic-Net Regularized Linear Regression. IEEE Trans. Image Process. 2017, 26, 1466–1481. [Google Scholar] [CrossRef] [PubMed]
  29. Ortega, A.; Frossard, P.; Kovačević, J.; Moura, J.M.F.; Vandergheynst, P. Graph Signal Processing: Overview, Challenges, and Applications. Proc. IEEE 2018, 106, 808–828. [Google Scholar] [CrossRef]
  30. Monga, V.; Li, Y.; Eldar, Y.C. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. IEEE Signal Process. Mag. 2021, 38, 18–44. [Google Scholar] [CrossRef]
  31. Bousquet, O.; Elisseeff, A. Stability and generalization. J. Mach. Learn. Res. 2002, 2, 499–526. [Google Scholar]
  32. Bartlett, P.L.; Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 2002, 3, 463–482. [Google Scholar]
  33. Sultana, T.; Dumitrescu, S. Globally optimal max-min rate joint channel and power allocation for hybrid NOMA-OMA downlink systems. IEEE Trans. Signal Process. 2025, 73, 1674–1690. [Google Scholar] [CrossRef]
  34. Bian, Z.; Chung, F.-L.; Wang, S. Enhanced fuzzy random forest by using doubly randomness and copying from dynamic dictionary attributes. IEEE Trans. Fuzzy Syst. 2022, 30, 4369–4383. [Google Scholar] [CrossRef]
  35. Xie, J.; Girshick, R.; Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2016; pp. 478–487. [Google Scholar]
  36. Keller, J.M.; Gray, M.R.; Givens, J.A. A fuzzy K-nearest neighbor algorithm. IEEE Trans. Syst. Man Cybern. 1985, SMC-15, 580–585. [Google Scholar] [CrossRef]
  37. Amer, A.A.; Ravana, S.D.; Habeeb, R.A.A. Effective k-nearest neighbor models for data classification enhancement. J. Big Data 2025, 12, 86. [Google Scholar] [CrossRef]
  38. Li, G.; Jung, J.J. Deep learning for anomaly detection in multivariate time series: Approaches, applications, and challenges. Inf. Fusion 2023, 91, 93–102. [Google Scholar] [CrossRef]
  39. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
Figure 1. Visualization of linear and nonlinear models on nonlinearly distributed data. (a) Reconstruction of linearly distributed data using a linear model. (b) Failure of linear reconstruction on a quadratic manifold. (c) Failure of linear reconstruction on a sine manifold. (d) Accurate reconstruction of nonlinear data via kernel mapping.
Figure 2. Framework diagram of the proposed KMOLNN method.
Figure 3. Visualization of KMOLNN’s interpretability on the Iris dataset. (a) Heatmap of the learned adaptive adjacency matrix $A$, showing the preserved block-diagonal manifold structure. (b) Learned probability weights $w$ for a test sample, demonstrating the grouping effect where same-class neighbors receive similar and higher weights.
Figure 4. Parameter sensitivity analysis of the KMOLNN algorithm on the Sonar dataset. (a) Accuracy surface with respect to reconstruction weight and entropy regularization. (b) Heatmap of accuracy varying with manifold weight and kernel width.
Table 1. Summary of key notations.

| Symbol | Description |
|---|---|
| $X = \{x_1, \ldots, x_n\}$ | The set of training samples |
| $y$ | The testing sample |
| $w = [w_1, \ldots, w_n]^T$ | The representation weight vector for the testing sample |
| $\phi(\cdot)$ | The nonlinear mapping function (kernel mapping) |
| $K(x_i, x_j)$ | The kernel function (e.g., Gaussian kernel) |
| $L \in \mathbb{R}^{n \times n}$ | The Laplacian matrix |
| $A \in \mathbb{R}^{n \times n}$ | The adaptive adjacency matrix |
| $D \in \mathbb{R}^{n \times n}$ | The degree matrix of the graph |
| $\lambda, \eta, \gamma$ | Regularization coefficients for different terms |
| $h$ | The hyperparameter controlling the kernel width |
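To make the relationship between the symbols in Table 1 concrete, the following minimal Python sketch builds a Gaussian kernel matrix and a graph Laplacian on toy data. It rests on two assumptions that are not stated in this part of the article: the kernel is taken in the common form $\exp(-\lVert x_i - x_j \rVert^2 / (2h^2))$, and the Laplacian is the unnormalized $L = D - A$. The function names and toy values are hypothetical and serve only as illustration.

```python
import math

def gaussian_kernel(x, z, h):
    """Gaussian kernel with width h (assumed form: exp(-||x - z||^2 / (2 h^2)))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * h ** 2))

def kernel_matrix(X, h):
    """n x n matrix whose (i, j) entry is K(x_i, x_j)."""
    return [[gaussian_kernel(xi, xj, h) for xj in X] for xi in X]

def graph_laplacian(A):
    """Unnormalized Laplacian L = D - A (assumed), with D the diagonal degree matrix of A."""
    n = len(A)
    degrees = [sum(row) for row in A]
    return [[(degrees[i] if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]

# Toy training samples and a toy symmetric adjacency matrix (illustrative values only).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
A = [[0.0, 1.0, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.0, 0.0]]

K = kernel_matrix(X, h=1.0)
L = graph_laplacian(A)
```

Under these assumptions every row of `L` sums to zero and `K` is symmetric with a unit diagonal, which gives a quick sanity check on any implementation of the notation.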
Table 2. Comparative analysis of KMOLNN and related linear KNN variants.

| Method | Objective Function | Kernel Usage | Graph/Manifold Reg. | Theoretical Basis |
|---|---|---|---|---|
| LLKNN | Distance constrained | No | No | Bayesian rule |
| WLMRKNN | Local mean repr. | No | No | Weighted neighbors |
| GS-KNN | Sparse repr. | No | Sparse graph | Correlation matrix |
| MVAKNN | Multi-view adaptive | Yes | Laplacian reg. | Multi-view learning |
| KMOLNN (Ours) | Kernel-based error | Yes | Adaptive Laplacian | Group effect & Bayesian |
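Table 3. Objective functions and parameter settings of all 9 adopted methods.

| Methods | Objective Functions | Prediction Rules |
|---|---|---|
| KNN | - | - |
| FKNN [36] | - | - |
| PRKNN [37] | - | - |
| Elastic Net [7] | $\min_{w} \lVert Xw - y \rVert^2 + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert^2$ | $l = \arg\max_{c_j \in C} \sum_{i=1}^{k} w_i(c_j)$ |
| WLMRKNN [4] | $\min_{w^{c_j}} \lVert \bar{X}_{kN}^{c_j} w^{c_j} - y \rVert_2^2 + \lambda \lVert T^{c_j} w^{c_j} \rVert_2^2$ | $\bar{r}_k^{c_j}(y) = \lVert y - \bar{X}_{kN}^{c_j} w^{j} \rVert_2^2$, $l = \arg\max_{c_j \in C} \left( \bar{r}_k^{c_j}(y) \right)$ |
| LCDR-KNN [20] | $\min_{S} \lVert \tilde{X} S - y \rVert_2^2 + \lambda_1 \lVert S \rVert_2^2 + \lambda_2 \sum_{j=1}^{N} s_j^2 \lVert y - u_j \rVert^2$ | $l = \arg\max_{c_i \in C} \left( \sum_{h=1}^{k} s_h(c_i) \right)$ |
| SSL-KNN [26] | $\min_{W, \mu, S_k} \sum_{p=1}^{r} \lVert \sum_{k=1}^{c} \mu_k X_k S_k w_k^p - y^p \rVert^2 + \lambda_1 \sum_{p=1}^{r} \sum_{k=1}^{c} \lVert w_k^p \rVert_1 + \lambda_2 \sum_{p=1}^{r} \left( \sum_{k=1}^{c} \lVert w_k^p - \beta d_k^p \rVert^2 \right) + \lambda_3 \mu^T (1 - \mu) + \lambda_4 \sum_{k=1}^{c} \lVert S_k - I \rVert_F^2$, s.t. $\sum_{k=1}^{c} \mu_k = 1$, $\mu_k > 0$ | $l = \arg\max_{c} \sum_{k=1} \mu_k \sum_{p=1}^{r} \mathbf{1}^T S^{(k)} w_k^p$ |
| MVAKNN [27] | $\min_{W} \lVert XW - X \rVert_F^2 + \rho_1 \lVert W \rVert_1 + \rho_2 \mathrm{Tr}(W^T X^T L X W)$ | $f = \arg\max_{i} \, p_{t_i}^{v_1 + \cdots + v_\lambda}$ |
| KMOLNN | $\min_{p, A} \lVert \phi(y) - \sum_{i=1}^{m} p_i \phi(x_i) \rVert^2 + \lambda_1 \sum_{i=1}^{m} p_i \log p_i + \lambda_2 \sum_{i=1}^{m} (p_i - \beta d_i)^2 + \lambda_3 \sum_{i,j=1}^{m} p^T L(A) p + \varphi \sum_{i,j} a_{ij} \lVert x_i - x_j \rVert^2 + \lambda_4 \sum_{i,j} a_{ij}^2$ | $l = \arg\max_{c} \sum_{b_i \in B_c} p_i \, k(x, y_i)$ |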
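The first term of the KMOLNN objective in Table 3 is a reconstruction error in the kernel feature space. Such a term can be evaluated without forming $\phi$ explicitly, through the standard expansion $\lVert \phi(y) - \sum_i p_i \phi(x_i) \rVert^2 = k(y, y) - 2 \sum_i p_i k(x_i, y) + \sum_{i,j} p_i p_j k(x_i, x_j)$. The sketch below illustrates this identity only; it is not the authors' solver, and the function names, toy data, and the use of a linear kernel for the sanity check are assumptions made here for illustration.

```python
def linear_kernel(x, z):
    """Linear kernel k(x, z) = <x, z>; its feature map is the identity."""
    return sum(a * b for a, b in zip(x, z))

def kernel_reconstruction_error(X, y, p, kernel):
    """Evaluate ||phi(y) - sum_i p_i phi(x_i)||^2 using kernel evaluations only."""
    m = len(X)
    k_yy = kernel(y, y)
    cross = sum(p[i] * kernel(X[i], y) for i in range(m))
    quad = sum(p[i] * p[j] * kernel(X[i], X[j])
               for i in range(m) for j in range(m))
    return k_yy - 2.0 * cross + quad

# Toy neighbors, test sample, and weights on the probability simplex (illustrative only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [0.4, 1.0]
p = [0.2, 0.5, 0.3]

# With the linear kernel the feature map is explicit, so the kernelized value
# can be compared against the reconstruction error computed in input space.
recon = [sum(p[i] * X[i][d] for i in range(len(X))) for d in range(len(y))]
explicit = sum((yd - rd) ** 2 for yd, rd in zip(y, recon))
kernelized = kernel_reconstruction_error(X, y, p, linear_kernel)
```

Replacing `linear_kernel` with any other valid kernel function leaves `kernel_reconstruction_error` unchanged, which is the practical benefit of writing the term in kernel form.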
Table 4. Summary of the comparative baseline methods.

| Method | Year | Key Idea | Kernelization | Manifold Learning | Adaptive Weighting |
|---|---|---|---|---|---|
| KNN [1] | 1967 | Majority voting based on Euclidean distance | No | No | No |
| FKNN [36] | 1985 | Fuzzy membership-based weighted voting | No | No | No |
| Elastic Net [7] | 2012 | Sparse modeling with $L_1$ and $L_2$ regularization | No | No | No |
| WLMRKNN [4] | 2019 | Local mean representation with weighted neighbors | No | No | Yes |
| LCDR-KNN [20] | 2020 | Local centroid distance-constrained representation | No | No | Yes |
| SSL-KNN [26] | 2024 | Shared style multi-view linear classification | Yes | No | Yes |
| MVAKNN [27] | 2024 | Multi-view adaptive $k$-nearest neighbor classification | Yes | Yes | Yes |
| PRKNN [37] | 2025 | Proximal ratio-based noise and overlap handling | No | No | No |
| KMOLNN | 2026 | Kernelized manifold-optimized linear KNN | Yes | Yes | Yes |
Table 5. Summary of the 15 benchmark datasets used in the experiments.

| No. | Dataset | Number of Samples | Number of Dimensions | Number of Classes |
|---|---|---|---|---|
| 1 | Sonar | 208 | 60 | 2 |
| 2 | Wine | 178 | 13 | 3 |
| 3 | Iris | 150 | 4 | 3 |
| 4 | Breast Cancer Wisconsin | 569 | 30 | 2 |
| 5 | Pima Indians Diabetes | 768 | 8 | 2 |
| 6 | Glass Identification | 214 | 9 | 6 |
| 7 | Ionosphere | 351 | 34 | 2 |
| 8 | Heart Disease (Cleveland) | 303 | 13 | 2 |
| 9 | Vowel | 990 | 10 | 11 |
| 10 | Ecoli | 336 | 7 | 8 |
| 11 | Yeast | 1484 | 8 | 10 |
| 12 | Pendigits | 10,992 | 16 | 10 |
| 13 | Satimage | 6435 | 36 | 6 |
| 14 | Vehicle Silhouettes | 846 | 18 | 4 |
| 15 | Letter Recognition | 20,000 | 16 | 26 |
Table 6. Average testing accuracies of all adopted methods on all adopted datasets.

| Dataset | KNN | FKNN | PRKNN | Elastic Net | LCDR-KNN | WLMRKNN | SSL-KNN | MVAKNN | KMOLNN |
|---|---|---|---|---|---|---|---|---|---|
| Sonar | 0.8262 | 0.8548 | 0.7643 | 0.8429 | 0.8857 | 0.8786 | 0.8929 | 0.9048 | **0.9071** |
| Wine | 0.9667 | 0.9861 | **0.9944** | 0.9833 | 0.9889 | 0.9917 | 0.9889 | **0.9944** | 0.9917 |
| Iris | 0.9600 | 0.9733 | 0.9533 | 0.9667 | 0.9733 | 0.9733 | 0.9800 | **0.9867** | 0.9733 |
| Breast Cancer Wisconsin | 0.9561 | 0.9754 | 0.9684 | 0.9719 | 0.9781 | 0.9807 | 0.9798 | **0.9851** | 0.9754 |
| Pima Indians Diabetes | 0.7247 | 0.7597 | 0.7649 | 0.7779 | 0.7682 | 0.7721 | 0.7753 | 0.7818 | **0.7948** |
| Glass Identification | 0.7047 | 0.7442 | 0.6209 | 0.7372 | 0.7512 | 0.7605 | 0.7651 | 0.7860 | **0.7977** |
| Ionosphere | 0.8629 | 0.9057 | 0.8857 | 0.8914 | 0.9186 | 0.9257 | 0.9357 | 0.9414 | **0.9529** |
| Heart Disease | 0.8148 | 0.8475 | 0.8377 | 0.8443 | 0.8623 | 0.8754 | 0.8705 | 0.8902 | **0.9115** |
| Vowel | 0.9051 | 0.9449 | 0.8942 | 0.9212 | 0.9621 | 0.9581 | 0.9601 | **0.9848** | 0.9747 |
| Ecoli | 0.8119 | 0.8507 | 0.7851 | 0.8448 | 0.8657 | 0.8716 | 0.8687 | 0.8910 | **0.9045** |
| Yeast | 0.5650 | 0.6051 | 0.5822 | 0.5980 | 0.6209 | 0.6350 | 0.6300 | 0.6519 | **0.6721** |
| Pendigits | 0.9720 | 0.9810 | 0.8950 | 0.9780 | 0.9920 | 0.9890 | 0.9900 | 0.9910 | **0.9950** |
| Satimage | 0.9060 | 0.9200 | 0.8420 | 0.9150 | 0.9350 | 0.9380 | 0.9420 | 0.9390 | **0.9470** |
| Vehicle Silhouettes | 0.7249 | 0.7852 | 0.6479 | 0.7621 | 0.8012 | 0.8148 | 0.8201 | **0.8450** | 0.8337 |
| Letter Recognition | 0.9420 | 0.9580 | 0.7650 | 0.9510 | 0.9650 | 0.9700 | 0.9720 | 0.9780 | **0.9830** |
| W/T/L (Ours vs. competing) | 15/0/0 | 15/0/0 | 14/0/1 | 15/0/0 | 15/0/0 | 14/0/1 | 13/0/2 | 10/0/5 | - |

Note: The best results for each dataset are highlighted in bold.
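As a worked check of how the summary figures relate to Table 6, the short Python sketch below recomputes the average KMOLNN accuracy and the win/tie/loss count of KMOLNN against MVAKNN from the two corresponding columns. The helper name `win_tie_loss` and the tolerance used to detect ties are assumptions introduced here for illustration, not part of the authors' evaluation code.

```python
# Accuracy columns copied from Table 6, in the row order of the table.
mvaknn = [0.9048, 0.9944, 0.9867, 0.9851, 0.7818, 0.7860, 0.9414, 0.8902,
          0.9848, 0.8910, 0.6519, 0.9910, 0.9390, 0.8450, 0.9780]
kmolnn = [0.9071, 0.9917, 0.9733, 0.9754, 0.7948, 0.7977, 0.9529, 0.9115,
          0.9747, 0.9045, 0.6721, 0.9950, 0.9470, 0.8337, 0.9830]

def win_tie_loss(ours, theirs, tol=1e-9):
    """Count datasets where `ours` is better, equal (within tol), or worse."""
    wins = sum(o - t > tol for o, t in zip(ours, theirs))
    losses = sum(t - o > tol for o, t in zip(ours, theirs))
    ties = len(ours) - wins - losses
    return wins, ties, losses

mean_acc = sum(kmolnn) / len(kmolnn)   # about 0.9076, i.e., 90.76%
wtl = win_tie_loss(kmolnn, mvaknn)     # (10, 0, 5)
```

The mean agrees with the 90.76% average accuracy reported in the abstract, and the count agrees with the 10/0/5 entry in the MVAKNN column of the W/T/L row.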
Table 7. Average testing Macro F1-scores of all adopted methods on all adopted datasets.

| Dataset | KNN | FKNN | PRKNN | Elastic Net | LCDR-KNN | WLMRKNN | SSL-KNN | MVAKNN | KMOLNN |
|---|---|---|---|---|---|---|---|---|---|
| Sonar | 0.8143 | 0.8476 | 0.7571 | 0.8357 | 0.8786 | 0.8714 | 0.8857 | **0.9000** | **0.9000** |
| Wine | 0.9639 | 0.9833 | **0.9944** | 0.9833 | 0.9889 | 0.9889 | 0.9861 | 0.9861 | **0.9944** |
| Iris | 0.9567 | 0.9733 | 0.9533 | 0.9667 | 0.9733 | 0.9733 | 0.9800 | **0.9867** | **0.9867** |
| Breast Cancer Wisconsin | 0.9482 | 0.9693 | 0.9623 | 0.9649 | 0.9719 | 0.9763 | 0.9754 | 0.9807 | **0.9860** |
| Pima Indians Diabetes | 0.6851 | 0.7299 | 0.7240 | 0.7448 | 0.7383 | 0.7448 | 0.7500 | 0.7578 | **0.7669** |
| Glass Identification | 0.6256 | 0.6953 | 0.5070 | 0.6814 | 0.7093 | 0.7256 | 0.7302 | **0.7558** | 0.7323 |
| Ionosphere | 0.8457 | 0.8914 | 0.8714 | 0.8786 | 0.9057 | 0.9143 | 0.9286 | **0.9357** | 0.9286 |
| Heart Disease | 0.7951 | 0.8328 | 0.8197 | 0.8279 | 0.8443 | 0.8607 | 0.8574 | **0.8754** | 0.8393 |
| Vowel | 0.8848 | 0.9318 | 0.9220 | 0.9081 | 0.9551 | 0.9480 | 0.9520 | **0.9798** | 0.9712 |
| Ecoli | 0.7522 | 0.8104 | 0.6851 | 0.7955 | 0.8254 | 0.8343 | 0.8299 | 0.8627 | **0.8761** |
| Yeast | 0.5051 | 0.5650 | 0.4822 | 0.5519 | 0.5801 | 0.6051 | 0.5949 | 0.6249 | **0.6360** |
| Pendigits | 0.9680 | 0.9790 | 0.8820 | 0.9750 | **0.9900** | 0.9860 | 0.9880 | 0.9890 | 0.9730 |
| Satimage | 0.8850 | 0.9050 | 0.8150 | 0.8980 | 0.9200 | 0.9250 | 0.9300 | 0.9280 | **0.9470** |
| Vehicle Silhouettes | 0.7018 | 0.7680 | 0.6148 | 0.7450 | 0.7852 | 0.8018 | 0.8083 | **0.8320** | 0.8278 |
| Letter Recognition | 0.9380 | 0.9540 | 0.7420 | 0.9480 | 0.9610 | 0.9680 | 0.9700 | **0.9760** | 0.9670 |

Note: The best results for each dataset are highlighted in bold.
Table 8. Average running times (seconds) of all adopted methods on all adopted datasets.

| Dataset | KNN | FKNN | PRKNN | Elastic Net | LCDR-KNN | WLMRKNN | SSL-KNN | MVAKNN | KMOLNN |
|---|---|---|---|---|---|---|---|---|---|
| Sonar | 0.0012 | 0.0015 | 0.0030 | 0.0255 | 0.0312 | 0.0280 | 4.9155 | 0.0595 | 2.1886 |
| Wine | 0.0010 | 0.0013 | 0.0027 | 0.0210 | 0.0291 | 0.0255 | 3.1367 | 0.0528 | 2.5487 |
| Iris | 0.0079 | 0.0011 | 0.0023 | 0.0207 | 0.0239 | 0.0211 | 3.8646 | 0.0492 | 1.9123 |
| Breast Cancer | 0.0015 | 0.0018 | 0.0037 | 0.0297 | 0.0368 | 0.0340 | 5.0191 | 0.0712 | 2.9516 |
| Pima Indians | 0.0018 | 0.0021 | 0.0046 | 0.0363 | 0.0440 | 0.0407 | 5.7688 | 0.0807 | 3.5123 |
| Glass | 0.0011 | 0.0014 | 0.0028 | 0.0221 | 0.0307 | 0.0282 | 4.9450 | 0.0550 | 2.0859 |
| Ionosphere | 0.0013 | 0.0016 | 0.0031 | 0.0274 | 0.0333 | 0.0304 | 4.0524 | 0.0621 | 2.3719 |
| Heart Disease | 0.0014 | 0.0018 | 0.0033 | 0.0267 | 0.0347 | 0.0333 | 4.0825 | 0.0702 | 2.3566 |
| Vowel | 0.0052 | 0.0038 | 0.0079 | 0.0411 | 0.0491 | 0.0469 | 6.7292 | 0.0936 | 4.1845 |
| Ecoli | 0.0016 | 0.0020 | 0.0038 | 0.0333 | 0.0402 | 0.0380 | 7.7151 | 0.0794 | 3.6233 |
| Yeast | 0.0054 | 0.0058 | 0.0118 | 0.0510 | 0.0641 | 0.0575 | 9.4203 | 0.1163 | 5.6277 |
| Pendigits | 0.0148 | 0.0209 | 0.0405 | 0.1009 | 0.1235 | 0.1155 | 19.9106 | 0.2368 | 10.3476 |
| Satimage | 0.0129 | 0.0149 | 0.0304 | 0.0904 | 0.1168 | 0.1033 | 16.9335 | 0.2273 | 8.7759 |
| Vehicle | 0.0011 | 0.0027 | 0.0057 | 0.0443 | 0.0581 | 0.0545 | 8.1484 | 0.1087 | 4.3475 |
| Letter Rec. | 0.0174 | 0.0204 | 0.0420 | 0.1474 | 0.1911 | 0.1722 | 40.7081 | 0.3631 | 17.2134 |
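Table 9. Ranking values of all adopted methods on all adopted datasets.

| Dataset | KNN | FKNN | PRKNN | Elastic Net | LCDR-KNN | WLMRKNN | SSL-KNN | MVAKNN | KMOLNN |
|---|---|---|---|---|---|---|---|---|---|
| Sonar | 8 | 6 | 9 | 7 | 4 | 5 | 3 | 2 | 1 |
| Wine | 9 | 7 | 1.5 | 8 | 5.5 | 3.5 | 5.5 | 1.5 | 3.5 |
| Iris | 8 | 4.5 | 9 | 7 | 4.5 | 4.5 | 2 | 1 | 4.5 |
| Breast Cancer Wisconsin | 9 | 5.5 | 8 | 7 | 4 | 2 | 3 | 1 | 5.5 |
| Pima Indians Diabetes | 9 | 8 | 7 | 3 | 6 | 5 | 4 | 2 | 1 |
| Glass Identification | 8 | 6 | 9 | 7 | 5 | 4 | 3 | 2 | 1 |
| Ionosphere | 9 | 6 | 8 | 7 | 5 | 4 | 3 | 2 | 1 |
| Heart Disease | 9 | 6 | 8 | 7 | 5 | 3 | 4 | 2 | 1 |
| Vowel | 8 | 6 | 9 | 7 | 3 | 5 | 4 | 1 | 2 |
| Ecoli | 8 | 6 | 9 | 7 | 5 | 3 | 4 | 2 | 1 |
| Yeast | 9 | 6 | 8 | 7 | 5 | 3 | 4 | 2 | 1 |
| Pendigits | 8 | 6 | 9 | 7 | 2 | 5 | 4 | 3 | 1 |
| Satimage | 8 | 6 | 9 | 7 | 5 | 4 | 2 | 3 | 1 |
| Vehicle Silhouettes | 8 | 6 | 9 | 7 | 5 | 4 | 3 | 1 | 2 |
| Letter Recognition | 8 | 6 | 9 | 7 | 5 | 4 | 3 | 2 | 1 |
| Average ranking | 8.4 | 6.07 | 8.1 | 6.8 | 4.6 | 3.93 | 3.43 | 1.83 | 1.83 |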
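The fractional entries in Table 9 (for example 1.5, 3.5, and 5.5) are what one obtains when tied methods share the mean of the rank positions they occupy, as in the procedure described by Demšar [39]. The following Python sketch illustrates that ranking rule on the Wine row of Table 6; the function name `average_ranks` is hypothetical, and the sketch is an illustration of the rule rather than the authors' code.

```python
def average_ranks(scores):
    """Rank scores in descending order (rank 1 = best).

    Tied scores receive the mean of the rank positions they jointly occupy.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend `end` over the block of entries tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2.0  # mean of 1-based positions pos+1 .. end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks

# Wine accuracies from Table 6, in the column order KNN ... KMOLNN.
wine = [0.9667, 0.9861, 0.9944, 0.9833, 0.9889, 0.9917, 0.9889, 0.9944, 0.9917]
wine_ranks = average_ranks(wine)
# wine_ranks == [9.0, 7.0, 1.5, 8.0, 5.5, 3.5, 5.5, 1.5, 3.5]
```

The result reproduces the Wine row of Table 9. Averaging such rank vectors over the 15 datasets gives the "Average ranking" row.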
Table 10. The experimental results for the Bonferroni–Dunn test between the proposed KMOLNN method and all comparative methods.

| Method | Average Rank Difference | CD | Null Hypothesis |
|---|---|---|---|
| KNN/KMOLNN | 6.57 | 2.724 | Reject |
| FKNN/KMOLNN | 4.24 | 2.724 | Reject |
| PRKNN/KMOLNN | 6.27 | 2.724 | Reject |
| Elastic Net/KMOLNN | 4.97 | 2.724 | Reject |
| LCDR-KNN/KMOLNN | 2.77 | 2.724 | Reject |
| WLMRKNN/KMOLNN | 2.1 | 2.724 | Accept |
| SSL-KNN/KMOLNN | 1.6 | 2.724 | Accept |
| MVAKNN/KMOLNN | 0 | 2.724 | Accept |
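The critical difference (CD) in Table 10 can be related to the number of methods and datasets through the Bonferroni–Dunn formula given by Demšar [39], $CD = q_\alpha \sqrt{k(k+1)/(6N)}$. With $k = 9$ methods and $N = 15$ datasets the square-root factor equals 1, so the CD equals the critical value $q_\alpha$. The sketch below assumes $q_\alpha = 2.724$, a value inferred here from the CD reported in the table rather than stated in this section; the function and variable names are hypothetical.

```python
import math

def critical_difference(q_alpha, k, n_datasets):
    """Bonferroni-Dunn critical difference: q_alpha * sqrt(k (k + 1) / (6 N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

Q_ALPHA = 2.724  # assumed critical value, inferred from the CD reported in Table 10
cd = critical_difference(Q_ALPHA, k=9, n_datasets=15)  # sqrt(90 / 90) = 1, so cd = 2.724

# Average rank differences to KMOLNN, copied from Table 10.
rank_diffs = {
    "KNN": 6.57, "FKNN": 4.24, "PRKNN": 6.27, "Elastic Net": 4.97,
    "LCDR-KNN": 2.77, "WLMRKNN": 2.1, "SSL-KNN": 1.6, "MVAKNN": 0.0,
}

# The null hypothesis of equal performance is rejected when the difference exceeds CD.
decisions = {name: ("Reject" if diff > cd else "Accept")
             for name, diff in rank_diffs.items()}
```

Under this assumption the decisions match the last column of Table 10: the five baselines from KNN to LCDR-KNN exceed the CD, while WLMRKNN, SSL-KNN, and MVAKNN do not.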
Table 11. Ablation study results isolating the impact of kernelization, manifold regularization (KKNN baseline), and the adaptive Laplacian matrix.

| Dataset | Metric | KMOLNN (Linear) | KMOLNN (w/o Manifold)/KKNN | KMOLNN (Fixed Graph) | KMOLNN (Full) |
|---|---|---|---|---|---|
| Sonar | Accuracy | 0.8125 | 0.8654 | 0.881 | 0.9071 |
| Sonar | Macro F1 | 0.795 | 0.851 | 0.875 | 0.900 |
| Ionosphere | Accuracy | 0.8714 | 0.9125 | 0.931 | 0.9529 |
| Ionosphere | Macro F1 | 0.858 | 0.895 | 0.91 | 0.9286 |
| Wine | Accuracy | 0.9611 | 0.9722 | 0.9833 | 0.9917 |
| Wine | Macro F1 | 0.958 | 0.975 | 0.985 | 0.9944 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

MDPI and ACS Style

Zhang, J.; Bian, Z.; Zhang, L.; Wang, F. Kernelized Manifold-Optimized Linear KNN for Nonlinear Data Classification. Electronics 2026, 15, 2572. https://doi.org/10.3390/electronics15122572

