Article

Kernel-Based Optimal Subspaces (KOS): A Method for Data Classification

Department of Mathematics and Computer Science, College of Science and General Studies, Alfaisal University, P.O. Box 50927, Riyadh 11533, Saudi Arabia
Mach. Learn. Knowl. Extr. 2026, 8(2), 52; https://doi.org/10.3390/make8020052
Submission received: 24 September 2025 / Revised: 2 February 2026 / Accepted: 5 February 2026 / Published: 22 February 2026
(This article belongs to the Section Data)

Abstract

Support Vector Machine (SVM) is a popular kernel-based method for data classification that has demonstrated high efficiency across a wide range of practical applications. However, SVM suffers from several limitations, including the potential failure of the optimization process, especially in high-dimensional spaces; the inherently high computational cost; the lack of a systematic approach to multi-class classification; difficulties in handling imbalanced classes; and the prohibitive cost of real-time or dynamic classification. This paper proposes an alternative method, referred to as Kernel-based Optimal Subspaces (KOS), which belongs to the family of kernel subspace methods. Mathematically similar to Kernel PCA (KPCA), KOS achieves performance comparable to SVM while addressing the aforementioned weaknesses. The method is based on computing the minimum distance to optimal feature subspaces of the mapped data. Because no optimization process is required, KOS is robust, fast, and easy to implement. The optimal subspaces are constructed independently, enabling high parallelizability and making the approach well-suited for dynamic classification and real-time applications. Furthermore, the issue of imbalanced classes is naturally handled by subdividing large classes into smaller sub-classes, thereby creating appropriately sized sub-subspaces within the feature space.

1. Introduction

Machine Learning (ML), a branch of Artificial Intelligence (AI), is the science of designing algorithms that can automatically improve by learning from data (see, for instance, [1]). This multidisciplinary field has a wide range of applications, including handwriting recognition, automated disease detection, robotics, and others.
Among the core problems in AI is data classification, which relies heavily on ML techniques. Kernel methods are one of the most popular subsets of data classifiers, with Support Vector Machine (SVM) being among the most widely used. The SVM algorithm is grounded in statistical learning theory, originating from Vapnik’s work on the Structural Risk Minimization principle [2,3,4].
The fundamental idea of kernel methods is to map the original data, referred to as attributes, into a higher-dimensional space, known as the feature space, where the data can be more easily separated. While SVM was originally designed for linear separation, this mapping enables it to handle data that are non-linearly separable in the original space by achieving linear separation in the feature space. Note that the mapping function need not be defined explicitly, which would be a tedious task; instead, a suitable kernel function is used, provided that all operations can be expressed in terms of dot products. The existence of such kernels is guaranteed under certain conditions by Mercer’s theorem [4,5].
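As a standard textbook illustration of this idea (not specific to the present paper), consider the degree-two polynomial kernel on $\mathbb{R}^{2}$:
$K(x, y) = (x \cdot y)^{2} = \phi(x)\cdot\phi(y), \qquad \phi(x) = \big(x_1^{2},\; \sqrt{2}\,x_1 x_2,\; x_2^{2}\big),$
so the dot product in the three-dimensional feature space is evaluated directly from the kernel, without ever forming $\phi$ explicitly.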
SVM is widely recognized for its high performance (see [6,7,8,9,10,11,12,13] for recent improvements). However, despite significant advancements, it still has several limitations:
  • As an optimization problem, SVM may fail, particularly in high dimensions, even though the cost function is convex (quadratic).
  • Extending SVM to multi-class classification is not straightforward and becomes computationally expensive in high dimensions or when the number of classes is large. For instance, a common approach combining one-against-one and one-against-all strategies requires computing a total of $n + \frac{n(n-1)}{2}$ separating hyperplanes (i.e., solving $n + \frac{n(n-1)}{2}$ optimization problems), where $n$ is the number of classes.
  • Imbalanced classes pose another challenge. A popular approach to addressing this involves adding synthetic attributes to underrepresented classes. While effective, the process is not straightforward and can alter the structure of the original class data.
  • In dynamic classification, where classes may be added or removed, all separating hyperplanes must be recalculated, which can be prohibitively expensive for real-time applications.
To address these issues, a preliminary version of an alternative method, along with basic initial results, was introduced in a recent conference paper [14]. In this extended journal version, we provide the complete formulation of the method for centered data in the feature space, which offers more efficiency, along with a thorough theoretical analysis and corresponding proofs. Furthermore, we propose an effective strategy to tackle the inherent imbalanced class problem. Extensive experiments are conducted on large, multi-class datasets for a thorough evaluation of the method’s performance. In addition, a detailed analysis of time efficiency is included. Finally, when using the radial basis function (RBF) kernel, we describe a robust learning procedure for selecting the parameter σ .
In this work, the proposed method is referred to as Kernel-based Optimal Subspaces (KOS). This is a kernel subspace method [15,16,17] and is similar to Kernel PCA (KPCA) [18,19]. The approach relies, as in subspace methods, on computing a minimum distance between an unseen feature vector and a set of optimal subspaces, in a sense that will be defined, in the feature space. The unseen vector is then assigned to the appropriate attribute class. Proper orthogonal decomposition (POD) is used to build the feature subspaces, and their optimality is shown.
Fundamentally, the method requires the computation of the eigenpairs of the Mercer kernel matrix for each training class independently. No optimization is involved, making the method straightforward to implement, significantly faster than SVM (with linear complexity with respect to the number of classes), and robust.
KOS is highly parallelizable with respect to the number of classes and well-suited for dynamic and real-time classification, as only the eigenpairs for new classes need to be computed. If a class is removed, no further computation is required.
The issue of imbalanced classes is handled naturally by splitting large attribute classes into balanced sub-classes and then labeling all corresponding feature subspaces as belonging to a single class. An unseen feature vector is classified into a given class if it is close to any of the subspaces generated by that class’s sub-classes.
The performance of the proposed method is demonstrated through both 2D and high-dimensional, multi-class test cases. The results are compared against those of Support Vector Machine (SVM), using the popular LIBSVM software package [20]. The findings show that KOS performs comparably to SVM while offering several significant advantages, as discussed earlier, making it a strong alternative to SVM. Although the primary objective is to propose an alternative to SVM, the method is also compared with Kernel Neural Networks (KNNs) and Kernel PCA (KPCA).
The experiments use the radial basis function (RBF) kernel, which, as is well known in the context of SVM, involves tuning two key parameters: the standard deviation σ (in the Gaussian form) or γ (in the RBF form) and the penalty parameter C. The most widely used method for selecting these parameters is cross-validation-based grid search, as described in [21,22,23,24,25], including both automated and manual search approaches.
In contrast, the KOS method does not require the penalty parameter C. Only one parameter, σ or γ , needs to be learned. Through a class-wise re-scaling process, this parameter is learned in a robust and effective manner, contributing to the overall performance of the method.
In Section 2, a short overview of proper orthogonal decomposition (POD) is provided first, then details of the proposed KOS method are developed. Tests to validate the proposed method by comparison to SVM results are performed and reported in Section 3. Conclusions are drawn in Section 4.

2. KOS: Kernel-Based Optimal Subspaces Method

Before detailing the KOS method, we first present a brief overview of the proper orthogonal decomposition technique, which is a crucial component of the approach.

2.1. Proper Orthogonal Decomposition (POD): Overview

Proper orthogonal decomposition (POD) [26], also known as the Karhunen–Loève expansion (KLE) [26] and popular in the data analysis field, is a technique that aims to represent a large amount of data by a reduced number of basis elements built from the data. More precisely, if the data are stored in an $m \times N$ matrix $A$, with $N$ a large number, it is possible to represent the $N$ columns of $A$ by $p$ orthonormal vectors, with $p$ very small compared with $N$. The mathematical formulation can be summarized as follows: let $A = [Y_1, Y_2, \ldots, Y_N]$; the problem is to find $B = \{V_1, V_2, \ldots, V_p\}$, an orthonormal set of $p$ vectors that best represents the $N$ columns of $A$ in the following sense.
$V_1$ satisfies the optimization problem
$V_1 = \arg\max_{V \in \mathbb{R}^{m}} \sum_{i=1}^{N} |\langle V, Y_i\rangle|^{2} \quad \text{subject to } \|V\|^{2} = 1.$
$V_2$ satisfies
$V_2 = \arg\max_{V \in \mathbb{R}^{m}} \sum_{i=1}^{N} |\langle V, Y_i\rangle|^{2} \quad \text{subject to } \|V\|^{2} = 1 \text{ and } \langle V, V_1\rangle = 0.$
In general, for $k = 2, \ldots, p$,
$V_k = \arg\max_{V \in \mathbb{R}^{m}} \sum_{i=1}^{N} |\langle V, Y_i\rangle|^{2} \quad \text{subject to } \|V\|^{2} = 1 \text{ and } \langle V, V_1\rangle = \langle V, V_2\rangle = \cdots = \langle V, V_{k-1}\rangle = 0.$
The vectors $V_k$, $k = 1, \ldots, p$, are the eigenvectors of $AA^{T}$ associated with its largest eigenvalues. A simplified proof is provided in Appendix A to keep the paper self-contained. For more details, see [26].

POD Basis System

The POD basis system is built as follows:
1.
If $N \leq m$, let $M$ be the correlation matrix $M_{i,j} = (Y_i, Y_j) = (A^{T}A)_{i,j}$, where $(\cdot\,,\cdot)$ denotes an inner product. Let $V$ be the matrix of eigenvectors of $M$; then the basis functions $\Psi_i$, called modes, are given by
$\Psi_i = \sum_{j=1}^{N} v_j^{i}\, Y_j,$
where $v_j^{i}$ are the components of the $i$th eigenvector of $M$.
2.
If $N > m$, the POD basis is defined by the eigenvectors of the matrix $\bar{M}_{i,j} = (AA^{T})_{i,j}$.
This paper will use the first definition since the original data will be mapped into an infinite-dimension vector space.
Now calculate the inner product $(\Psi_i, \Psi_j)$:
$(\Psi_i, \Psi_j) = \Big(\sum_{l=1}^{N} v_l^{i}\, Y_l,\; \sum_{k=1}^{N} v_k^{j}\, Y_k\Big) = \sum_{l,k} v_l^{i} v_k^{j}\, (Y_l, Y_k) = \sum_{l,k} v_l^{i} v_k^{j}\, M_{l,k} = M V^{i}\cdot V^{j} = \lambda_i\, V^{i}\cdot V^{j}.$
This shows that the functions $\Psi_i$, $1 \leq i \leq N$, are orthogonal if and only if the eigenvectors $V^{i}$, $1 \leq i \leq N$, are, which is generally the case. For $i = j$ we obtain $\|\Psi_i\|^{2} = \lambda_i$, which shows that the norm of each mode represents part of the energy of the system. In conclusion, modes with low energy (small eigenvalues) can be neglected, reducing the POD basis to only the modes with significant energy. In practice, the ratio of the modeled energy to the total energy contained in the system, given by
$\varepsilon(l) = \frac{\sum_{i=1}^{l} \lambda_i}{\sum_{i=1}^{N} \lambda_i},$
is calculated to decide whether $l$ modes are enough to describe the system.
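As an illustration of this construction, the following sketch (a minimal example assuming NumPy; the synthetic data matrix and variable names are purely illustrative and not part of the paper’s implementation) builds the correlation matrix, its eigenpairs, the normalized modes, and the energy ratio $\varepsilon(l)$:
```python
import numpy as np

# Illustrative sketch: POD basis from the correlation matrix M = A^T A
# (the formulation used when the ambient dimension m exceeds the number of samples N).
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 40))            # m x N data matrix, columns Y_1, ..., Y_N

M = A.T @ A                                   # N x N correlation matrix
eigvals, eigvecs = np.linalg.eigh(M)          # eigenpairs, ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # reorder by decreasing energy
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Modes Psi_i = sum_j v_j^i Y_j, normalized so that each mode has unit norm
modes = A @ eigvecs / np.sqrt(np.maximum(eigvals, 1e-12))

# Energy ratio eps(l): fraction of the total energy captured by the first l modes
energy_ratio = np.cumsum(eigvals) / eigvals.sum()
l = int(np.searchsorted(energy_ratio, 0.99)) + 1   # smallest l capturing 99% of the energy
print(f"{l} modes capture {energy_ratio[l - 1]:.4f} of the energy")
```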

2.2. KOS: Global Description and Some Theoretical Aspects

Similar to SVM and other kernel methods, KOS first maps the original data into a high-dimensional feature space using a kernel function. For each class, an optimal subspace that best fits its feature vectors is constructed. An unseen sample is classified, as is common in subspace-based methods, by projecting its feature vector onto each class subspace and assigning the label corresponding to the minimum distance. Unlike SVM, KOS replaces separating hyperplanes, often a source of practical limitations, with optimal class subspaces that directly model the feature data, as illustrated in Figure 1. A summary of the method is provided in Algorithm 1.
Note that in some cases, different classes may be mapped to the same feature subspace, leading to misclassification. Therefore, it is desirable for the subspaces to be as distinct as possible to reduce this risk. In theory, as proved later in this paper, it is always possible to construct subspaces that are even completely disjoint. However, in practice, this cannot be guaranteed or controlled. For optimal performance, the kernel should be chosen such that the resulting feature space is of infinite dimension, which increases the likelihood of separability between class subspaces. It is well established, for example, that the RBF kernel meets this requirement.
Algorithm 1: Main KOS Steps
Let $C_k$, $k = 1, \ldots, P$, denote the $P$ attribute sets representing $P$ different classes used during the learning stage, and let $B_k$, $k = 1, \ldots, P$, denote the corresponding mapped sets in the feature space. The KOS classification procedure is as follows:
1.
Check whether the classes $C_k$ are imbalanced. If so, split the larger classes into smaller sub-classes according to the procedure described in Section 2.6.
2.
Select a Mercer kernel $K$, then compute the kernel matrices $K_k$ for each class $C_k$ (or sub-class, if applicable).
3.
Compute the POD correlation matrices $M_k$ for each class (or sub-class) $C_k$ using either the non-centered formulation (9) or the centered formulation (14).
4.
Compute the eigenvalues and eigenvectors of each matrix $M_k$, $k = 1, \ldots, P$.
5.
For each unseen vector $\hat{X}$, compute its coordinates $\alpha_i^{k}$, $i = 1, \ldots, \mathrm{size}(C_k)$, in each POD feature subspace using Formula (20) for the non-centered case or Formula (26) for the centered case.
6.
Compute the distance of $\hat{X}$ to each POD feature subspace using Formula (21) for the non-centered case or Formula (27) for the centered case.
7.
Decision: Assign $\hat{X}$ to the class corresponding to the minimum computed distance.
Theorem 1.
There exist Mercer kernels such that the attribute classes can be mapped into disjoint feature subspaces.
Proof. 
For simplicity, the proof is given for the two-class case; the result can then be extended recursively to the multi-class case. Let $\chi$ denote the attribute (original) space and let $\mathcal{H}$ be the feature space obtained through a mapping $\Phi$ associated with a Mercer kernel $K(x, y)$. Suppose that $K(x, y)$ is such that $\mathcal{H}$ is infinite-dimensional:
$\Phi : \chi \longrightarrow \mathcal{H}.$
Let $C_1$ and $C_2$ be two classes, and let $E$ and $F$ be two finite-dimensional subspaces of the feature space $\mathcal{H}$ such that $\Phi(C_1) \subset E$ and $\Phi(C_2) \subset F$. Let $E^{\perp}$ denote the orthogonal complement of $E$ in $\mathcal{H}$. Since $\mathcal{H}$ is infinite-dimensional and $E$ is finite-dimensional, $E^{\perp}$ is also infinite-dimensional.
Let $G \subset E^{\perp}$ be any subspace with the same dimension as $F$. Then there exists a one-to-one mapping $\phi : F \longrightarrow G$.
Now, define a new mapping ψ : χ H as follows:
$\psi(x) = \begin{cases} \phi(\Phi(x)) & \text{if } x \in C_2, \\ \Phi(x) & \text{otherwise.} \end{cases}$
We then have $\psi(C_1) \subset E$ and $\psi(C_2) \subset G$, with $E$ and $G$ orthogonal and hence disjoint. The kernel defined by $\bar{K}(x, y) = \psi(x)\cdot\psi(y)$ is a Mercer kernel by construction and satisfies the theorem, which completes the proof. □

2.3. POD Feature Subspaces

The POD technique is proposed to build the optimal subspaces in the feature vector space. Note that similar techniques, such as PCA (Principal Component Analysis) and KPCA (Kernel Principal Component Analysis), are commonly used in kernel theory for data classification, typically as a preprocessing step for feature extraction and enhancement [27]. The optimality of the POD feature subspaces will be discussed later.
Assume that a suitable kernel K ( x , y ) is given, and let ϕ be the associated mapping function. It is important to note that choosing a kernel such that the feature space H is of infinite dimension is highly recommended, as it helps map data into subspaces with minimal or no overlap.
Let $C_l$, $l = 1, \ldots, P$, denote the $P$ attribute classes. Denote by $X_i^{l}$ the $i$th element of class $C_l$ and by $Y_i^{l} = \phi(X_i^{l})$ its corresponding mapped feature vector. To construct the POD feature subspaces, we only need to build the POD basis, also known as the modes, as described in Section 2.1.
In general, it is well established that centering the data is recommended when applying the POD method. Therefore, we provide formulations of the POD basis for both non-centered and centered data.
To lighten the notation, the superscript $l$, which refers to a specific class, is omitted and will be reintroduced as needed. In the following, $K_{i,j}$ denotes the Mercer kernel matrix, namely $K_{i,j} = Y_i \cdot Y_j = \phi(X_i)\cdot\phi(X_j)$, and $K(X, Y)$ is used when $X$ or $Y$ does not belong to the training dataset.
1.
Non-centered data formulation: In this case, the data in the feature space are used without centering. Let $Y_i = \phi(X_i)$, $i = 1, \ldots, N$, be the mapped data points of a selected class of size $N$. First, we construct the correlation matrix:
$M(i, j) = Y_i \cdot Y_j = \phi(X_i)\cdot\phi(X_j) = K(i, j),$
where $K(i, j)$ denotes the kernel function evaluated at $X_i$ and $X_j$. As seen, the POD correlation matrix is equivalent to the kernel matrix derived from the mapping.
The normalized POD modes are then given by
$\psi_i = \frac{1}{\sqrt{\sigma_i}} \sum_{k=1}^{N} V_k^{i}\, Y_k,$
where $V^{i}$ is the $i$th eigenvector of the kernel matrix $M$ and $\sigma_i$ the associated eigenvalue.
2.
Centered data formulation: In this case, the feature vectors $Y_i$ are centered. Let $\bar{Y} = \frac{1}{N}\sum_{k=1}^{N} Y_k$ be the mean vector and $Z_i = Y_i - \bar{Y}$ the centered data. The correlation matrix is then given by:
$M(i, j) = Z_i \cdot Z_j = (Y_i - \bar{Y})\cdot(Y_j - \bar{Y}) = Y_i\cdot Y_j - Y_i\cdot\bar{Y} - Y_j\cdot\bar{Y} + \bar{Y}\cdot\bar{Y}.$
Each of the inner products can be expressed in terms of the kernel function:
$Y_i\cdot\bar{Y} = Y_i\cdot\frac{1}{N}\sum_{k=1}^{N} Y_k = \frac{1}{N}\sum_{k=1}^{N} Y_i\cdot Y_k = \frac{1}{N}\sum_{k=1}^{N} K(i, k),$
$\bar{Y}\cdot\bar{Y} = \frac{1}{N}\sum_{k=1}^{N} Y_k \cdot \frac{1}{N}\sum_{l=1}^{N} Y_l = \frac{1}{N^{2}}\sum_{k=1}^{N}\sum_{l=1}^{N} K(l, k).$
By substituting back into the expression for M ( i , j ) , the centered correlation matrix becomes:
$M(i, j) = K(i, j) - \frac{1}{N}\sum_{k=1}^{N}\big(K(i, k) + K(j, k)\big) + \frac{1}{N^{2}}\sum_{k=1}^{N}\sum_{l=1}^{N} K(l, k).$
As in the non-centered case, the correlation matrix can be expressed entirely in terms of the kernel matrix; a short computational sketch of this double-centering is given at the end of this subsection.
The normalized POD modes for centered data are then given by
$\psi_i = \frac{1}{\sqrt{\sigma_i}} \sum_{k=1}^{N} V_k^{i}\, (Y_k - \bar{Y}),$
where $V^{i}$ is the $i$th eigenvector of the correlation matrix $M$ defined by (14) and $\sigma_i$ the associated eigenvalue.
Note that for calculating the minimum distance in classification (as will be shown later), it is not necessary to explicitly compute the feature vectors $Y_k$ or the POD modes.
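As a concrete illustration of the two formulations above, the sketch below (assuming NumPy; rbf_kernel and centered_correlation are illustrative helper names, not the paper’s MATLAB code) computes an RBF kernel matrix and the centered correlation matrix of Equation (14) through the usual double-centering:
```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    """RBF kernel K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) for row-wise samples."""
    d2 = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def centered_correlation(K):
    """Centered correlation matrix of Equation (14): double-centering of the kernel matrix."""
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)
    return K - one @ K - K @ one + one @ K @ one
```
In the non-centered case, the kernel matrix itself plays the role of the correlation matrix, as noted in Equation (9).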

2.4. Optimality of the POD Feature Subspaces

The optimality of the feature subspaces is considered from both geometric and algebraic perspectives.
From a geometric viewpoint, the subspace basis should approximate the original feature vectors as closely as possible, thereby preserving the intrinsic structure of the data in the feature space.
From an algebraic viewpoint, the feature subspaces should be the smallest, in terms of dimension, subspaces that contain the feature vectors, thereby enhancing discrimination.
In essence, the POD method provides a set of orthonormal basis functions that span the most “informative” subspace, allowing for an efficient and discriminative representation of each class in the feature space.
  • Geometrical Optimality of POD Feature Spaces:
    As shown in [26], if $\Psi$ is the POD basis matrix constructed to represent the columns of a matrix $A = [Y_1, Y_2, \ldots, Y_N]$ and $\bar{\Psi}$ is any other orthonormal basis of the same size constructed for the same purpose, then the following inequality holds:
    $\|A - \Psi\Psi^{T}A\|_{F} \;\leq\; \|A - \bar{\Psi}\bar{\Psi}^{T}A\|_{F},$
    where $\|\cdot\|_{F}$ is the Frobenius norm, defined by
    $\|A\|_{F} = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{N} |A_{i,j}|^{2}} = \sqrt{\operatorname{Trace}(A^{T}A)}.$
    This result demonstrates the geometric optimality of the POD basis: among all possible orthonormal bases of the same dimension, the POD basis provides the best approximation (in the least-squares sense) of the original data matrix A. Therefore, the POD basis vectors are the closest possible representatives of the original feature vectors, ensuring maximum fidelity in capturing the data structure.
  • Algebraic Optimality of POD Feature Spaces:
Lemma 1.
The POD subspace is algebraically optimal; it is the smallest subspace containing the mapped data.
Proof. 
The proof is straightforward: it suffices to show that any subspace containing the mapped (feature) data must also contain the POD subspace.
Let $F$ be a subspace that contains the given feature data. Since the POD modes (i.e., the basis of the POD subspace) are linear combinations of the feature data and $F$ is closed under linear combinations, the modes necessarily lie in $F$. Hence, the POD subspace is contained in any subspace that contains the mapped data, proving its algebraic optimality. □

2.5. Decision Criterion

The decision criterion for an unseen sample $\hat{X}$ is, as in subspace-based methods, based on the minimum distance between its feature vector $\hat{Y} = \phi(\hat{X})$ and the POD feature subspaces. This requires computing the POD coordinates of $\hat{Y}$ within each subspace. The procedures for both non-centered and centered data are detailed below:
1.
Non-centered data formulation: The coordinates $\alpha_i$, $i = 1, \ldots, N$, are given by the projection of the feature vector onto the POD modes
$\alpha_i = \langle \psi_i, \hat{Y}\rangle,$
with
$\psi_i = \frac{1}{\sqrt{\sigma_i}} \sum_{k=1}^{N} V_k^{i}\, Y_k.$
Then
$\alpha_i = \frac{1}{\sqrt{\sigma_i}}\Big\langle \sum_{k=1}^{N} V_k^{i}\, Y_k,\; \hat{Y}\Big\rangle = \frac{1}{\sqrt{\sigma_i}}\sum_{k=1}^{N} V_k^{i}\, (Y_k\cdot\hat{Y}) = \frac{1}{\sqrt{\sigma_i}}\sum_{k=1}^{N} V_k^{i}\, K(X_k, \hat{X}).$
Using the Pythagorean rule, the distance of $\hat{Y}$ to the POD subspace $F$ is given by
$\mathrm{dis}(\hat{Y}, F) = \sqrt{\|\hat{Y}\|^{2} - \sum_{i=1}^{N}\alpha_i^{2}}.$
In the above,
$\|\hat{Y}\|^{2} = \hat{Y}\cdot\hat{Y} = \phi(\hat{X})\cdot\phi(\hat{X}) = K(\hat{X}, \hat{X}).$
2.
Centered data formulation: For the centered case, $\hat{Y}$ and $Y_i$ are replaced by $\hat{Y} - \bar{Y}$ and $Y_i - \bar{Y}$, respectively. That is,
$\alpha_i = \langle \psi_i,\; \hat{Y} - \bar{Y}\rangle,$
with
$\psi_i = \frac{1}{\sqrt{\sigma_i}} \sum_{k=1}^{N} V_k^{i}\, (Y_k - \bar{Y}).$
Then
$\alpha_i = \frac{1}{\sqrt{\sigma_i}}\Big\langle \sum_{k=1}^{N} V_k^{i}\,(Y_k - \bar{Y}),\; \hat{Y} - \bar{Y}\Big\rangle = \frac{1}{\sqrt{\sigma_i}}\Big[\sum_{k=1}^{N} V_k^{i}\,(Y_k\cdot\hat{Y}) - \sum_{k=1}^{N} V_k^{i}\,(Y_k\cdot\bar{Y}) - \sum_{k=1}^{N} V_k^{i}\,(\bar{Y}\cdot\hat{Y}) + \sum_{k=1}^{N} V_k^{i}\,(\bar{Y}\cdot\bar{Y})\Big].$
By substituting $\bar{Y} = \frac{1}{N}\sum_{j=1}^{N} Y_j$ we obtain
$\alpha_i = \frac{1}{\sqrt{\sigma_i}}\Big[\sum_{k=1}^{N} V_k^{i}\, K(X_k, \hat{X}) - \frac{1}{N}\sum_{k=1}^{N}\sum_{j=1}^{N} V_k^{i}\, K_{k,j} - \Big(\frac{1}{N}\sum_{j=1}^{N} K(X_j, \hat{X})\Big)\sum_{k=1}^{N} V_k^{i} + \Big(\frac{1}{N^{2}}\sum_{k=1}^{N}\sum_{j=1}^{N} K_{k,j}\Big)\sum_{k=1}^{N} V_k^{i}\Big].$
Then
$\mathrm{dis}(\hat{Y}, F) = \sqrt{\|\hat{Y} - \bar{Y}\|^{2} - \sum_{i=1}^{N}\alpha_i^{2}}.$
In the above,
$\|\hat{Y} - \bar{Y}\|^{2} = \|\hat{Y}\|^{2} - 2\,\hat{Y}\cdot\bar{Y} + \|\bar{Y}\|^{2} = K(\hat{X}, \hat{X}) - \frac{2}{N}\sum_{j=1}^{N} K(\hat{X}, X_j) + \frac{1}{N^{2}}\sum_{k=1}^{N}\sum_{j=1}^{N} K_{k,j}.$
The unseen attribute vector is assigned to the class that corresponds to the minimum distance calculated above.
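A minimal sketch of this decision rule for the non-centered case (Equations (20) and (21)) is given below; it reuses the rbf_kernel helper from the Section 2.3 sketch, and the names fit_class, distance_to_subspace, and classify are illustrative assumptions rather than the paper’s implementation:
```python
import numpy as np

def fit_class(Xc, sigma):
    """Eigenpairs of the class kernel matrix (non-centered formulation, Eq. (9))."""
    K = rbf_kernel(Xc, Xc, sigma)
    vals, vecs = np.linalg.eigh(K)
    keep = vals > 1e-10                       # drop near-zero (low-energy) modes
    return {"X": Xc, "sigma": sigma, "vals": vals[keep], "vecs": vecs[:, keep]}

def distance_to_subspace(model, x):
    """Distance of phi(x) to the class POD feature subspace, Eqs. (20)-(21)."""
    k = rbf_kernel(model["X"], x[None, :], model["sigma"]).ravel()      # K(X_k, x_hat)
    alphas = (model["vecs"].T @ k) / np.sqrt(model["vals"])             # coordinates alpha_i
    sq_norm = rbf_kernel(x[None, :], x[None, :], model["sigma"])[0, 0]  # K(x_hat, x_hat)
    return np.sqrt(max(sq_norm - np.sum(alphas**2), 0.0))

def classify(models, x):
    """Assign x to the class whose POD feature subspace is closest (minimum distance)."""
    return min(models, key=lambda label: distance_to_subspace(models[label], x))
```
For the centered case, the kernel matrix would first be double-centered and the centered formulas (26) and (27) used instead.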

2.6. Imbalanced Classes

Class imbalance is a common issue in many classifiers, including SVM. Popular approaches include adjusting class weights (as in LIBSVM) and the SMOTE algorithm [28], which generates synthetic samples for smaller classes. While both methods show acceptable effectiveness, they lack a systematic way to determine weight distributions or the placement of synthetic vectors, potentially altering the data structure.
The proposed KOS method addresses this naturally by subdividing large attribute classes into smaller sub-classes of similar sizes. Each sub-class retains the original class label, effectively representing large subspaces as a collection of smaller, more balanced subspaces.
Specifically, a reference size, often the size of the smallest class, is defined. Any class at least twice this size is split into subsets roughly equal to the reference size. For classification, if the minimum distance is obtained from a POD subspace of a subdivided class, the unseen sample is assigned to that class. This strategy is illustrated in Figure 2 and validated in the test section.
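A minimal sketch of this splitting strategy is given below, under the stated assumptions that the reference size is the smallest class size and that classes at least twice that size are divided into roughly equal parts; the chunking rule used here is an illustrative choice:
```python
import numpy as np

def split_imbalanced(classes):
    """Split large classes into sub-classes of roughly the reference (smallest-class) size.
    `classes` maps a label to its attribute matrix; returns a list of (subset, label) pairs,
    so every sub-class keeps the label of the class it came from."""
    ref = min(X.shape[0] for X in classes.values())
    subsets = []
    for label, X in classes.items():
        if X.shape[0] >= 2 * ref:
            n_parts = int(round(X.shape[0] / ref))      # roughly equal parts of size ~ref
            subsets.extend((chunk, label) for chunk in np.array_split(X, n_parts))
        else:
            subsets.append((X, label))
    return subsets
```
Each returned subset is then given its own POD feature subspace, and the decision rule of Section 2.5 assigns the label of whichever sub-subspace is closest.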

2.7. RBF Parameter Learning

As previously mentioned, when using the RBF kernel, only the parameter σ (equivalently γ ) needs to be learned, as in the case of hard-margin SVM. There is no need for a penalty parameter.
To enable a fast and efficient search algorithm, we propose scaling σ according to the class attribute diameter. This approach is motivated by multi-scale theory in image processing (see, for instance, [29,30]), where the Gaussian kernel is used to extract features at different scales corresponding to different values of σ . The latter determines the width of the Gaussian, allowing each object to be observed at an appropriate scale. Small values of σ correspond to fine image details, whereas larger values capture more global structures.
Applying this idea to each class and selecting a value of $\sigma$ proportional to the class diameter allows the entire class structure to be captured. Only minor adjustments are then required to avoid missing important details or overfitting. Moreover, vectors that are far apart may produce many entries of the RBF covariance matrix close to 0, which can lead to an ill-conditioned matrix. Choosing $\sigma$ proportional to the class diameter helps avoid this issue.
We decompose σ as follows:
$\sigma = \omega\, D,$
where D is the diameter of the set, computed as the maximum distance between any two elements in the set. The scaling factor ω is varied from 0.1 to 1 in increments of 0.1 . This strategy significantly reduces the complexity of the parameter search.
For comparison, in standard SVM, a grid search over approximately $250 \times 250$ candidate pairs $(\sigma, C)$ is typically required to determine the optimal parameters, where $C$ is the penalty parameter.
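A small NumPy-only sketch of the class-wise re-scaling (the helper names are illustrative) computes the class diameter $D$ by brute force over all pairs and generates the candidate values $\sigma = \omega D$:
```python
import numpy as np

def class_diameter(X):
    """Diameter D of a class: maximum pairwise Euclidean distance between its elements."""
    d2 = (np.sum(X**2, axis=1)[:, None]
          + np.sum(X**2, axis=1)[None, :]
          - 2.0 * X @ X.T)
    return float(np.sqrt(np.maximum(d2, 0.0).max()))

def candidate_sigmas(X, omegas=np.arange(0.1, 1.01, 0.1)):
    """Candidate kernel widths sigma = omega * D for one class, omega in {0.1, ..., 1.0}."""
    return omegas * class_diameter(X)
```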

2.8. Some Remarks

PCA and POD share the same mathematical foundation but differ in data interpretation. Since the original data are mapped into an infinite-dimensional space, where the dimensionality exceeds the number of samples, the POD modes are more appropriately computed using (10) or (15) and the eigenvectors of $Y^{T}Y$ [26], rather than the eigenvectors of $YY^{T}$ as in KPCA. This approach simplifies the projection process.
Additionally, POD treats the data as snapshots, interpreting the entire class as a dynamical system in which each element represents a temporal deformation of a standard reference. From this perspective, POD aims to optimize the system’s energy content. This distinction is meaningful, as will be shown in the test section along with the proposed strategy for handling imbalanced classes. Notably, the POD-snapshots approach is widely used in fluid dynamics for reduced-order modeling.

2.9. Summary of the Algorithm

The KOS method is summarized in Algorithm 1 (Section 2.2), and its main computational steps are described above to facilitate practical implementation.

3. Results: Validation and Discussion

The KOS method is verified and validated using the RBF kernel. Initial tests are conducted on 2D cases to visually assess its ability to handle non-linear classification. Subsequently, KOS is tested on selected real-world datasets from the LIBSVM [20] and OpenML [31] repositories, and the results are compared against those obtained using SVM implemented in the widely used LIBSVM software package. While the main objective is to propose an alternative to SVM, the performance and computational cost of the method are also compared with Kernel Neural Networks (KNNs) and Kernel PCA (KPCA). Accuracy and macro-precision are used as performance metrics.
It is important to note that no specific preprocessing is applied to the data for KOS, apart from re-scaling the input features to the interval $[-1, 1]$.

3.1. 2D Test Cases

To evaluate KOS’s capability to separate non-linearly separable data, three increasingly challenging scenarios are tested: (i) connected sets, (ii) non-connected sets, and (iii) a spiral configuration. Figure 3, Figure 4 and Figure 5 demonstrate that KOS successfully separates the data in all cases. A fixed kernel parameter value of σ = 1.2 is used for all three tests.

3.2. Higher-Dimension Test Cases

Two-class and multi-class test cases are selected from the LIBSVM [20] and OpenML [31] databases. For all experiments, accuracy and macro-precision are computed to evaluate performance.

3.2.1. KOS vs. SVM

KOS and SVM (using the LIBSVM software) are tested, and their performances compared. Note that for KOS, there is no penalty parameter $C$; only the standard deviation $\sigma$ is learned. This is done by splitting the training set into two parts: 2/3 for building the KOS model (eigenpairs of the correlation matrices) and 1/3 for validation. The scaling factor $\omega$ introduced in Section 2.7 (hence $\sigma = \omega D$) is varied from 0.1 to 1 in increments of 0.1. Once the optimal value is found, the entire training set is used to build the KOS model. The tests are briefly described in Table 1 and Table 2. Table 3 and Table 4 compare the performance of KOS and SVM. Overall, the tables show that KOS performs comparably to SVM while offering significant advantages, outlined in the next subsection.
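For concreteness, the sketch below mirrors the described selection procedure, reusing the fit_class, classify, and class_diameter helpers sketched in Section 2; it is an illustrative outline, not the MATLAB code used to produce the reported results:
```python
import numpy as np

def select_omega(classes, omegas=np.arange(0.1, 1.01, 0.1)):
    """Choose the scaling factor omega on a held-out third of each training class.
    `classes` maps labels to attribute matrices; fit_class, classify and class_diameter
    are the illustrative helpers sketched in Section 2."""
    train, val = {}, []
    for label, X in classes.items():
        n_fit = (2 * X.shape[0]) // 3                  # 2/3 to build the model
        train[label] = X[:n_fit]
        val.extend((x, label) for x in X[n_fit:])      # 1/3 for validation

    best_omega, best_acc = None, -1.0
    for omega in omegas:
        models = {lab: fit_class(Xc, omega * class_diameter(Xc))
                  for lab, Xc in train.items()}
        acc = np.mean([classify(models, x) == lab for x, lab in val])
        if acc > best_acc:
            best_omega, best_acc = omega, acc
    return best_omega
```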
The effectiveness of the proposed imbalanced-class strategy is demonstrated, for instance, in the Leukemia and Shuttle tests, which contain the most imbalanced classes. Specifically, the performance rates without splitting the classes were 67.64% and 80.6%, respectively. After applying the splitting strategy, these scores improved to 82.35% and 99.9%, respectively.

3.2.2. Comparison with KNN and KPCA

The Kernel Neural Network (KNN) used in this study is a Multilayer Perceptron (MLP) with two hidden layers of 64 and 32 neurons, respectively. Note that in this model, the kernel is used only for feature extraction and enhancement. Both the KNN and KPCA are implemented using the TensorFlow library [39]. Table 5 reports the obtained results, which indicate that KOS (and similarly SVM) generally outperforms the KNN, particularly on imbalanced datasets such as Leukemia. A similar conclusion holds for KPCA, even though KPCA is conceptually closer to KOS. This difference, as discussed in Section 2.8, can be attributed to the snapshot POD formulation. On one hand, this formulation makes the computation of distances to subspaces in the feature space straightforward; on the other hand, it treats each class as a dynamical system, where each vector is interpreted as a temporal deformation of a representative element, thereby reinforcing intra-class cohesion. Moreover, KPCA computes distances to subspaces through reconstruction via the inverse transform, which introduces an optimization step that negatively impacts both accuracy and computation time. The proposed imbalanced-class handling strategy further contributes to improving the performance of KOS on datasets that exhibit such imbalance.

3.2.3. Feature Space Orthogonality

We proved in Section 2.2 that from a theoretical perspective, it is always possible for feature subspaces to be distinct and even orthogonal. However, in practice, this is not guaranteed. Considering kernels whose feature subspaces are infinite-dimensional (such as the RBF kernel) increases the likelihood that orthogonality occurs, thereby improving the efficiency of the method. In this section, to quantify the separation between subspaces in the conducted experiments, the principal angles between subspaces and the chordal metric are computed.
Let $E$ and $F$ be two subspaces of dimensions $p$ and $m$, respectively, and let $\{\varphi_i^{E}\}_{i=1}^{p}$ and $\{\varphi_j^{F}\}_{j=1}^{m}$ be orthonormal bases of $E$ and $F$. The orthogonality between the two subspaces is quantified using the principal angles $\Theta_k$, for $k = 1, \ldots, \min(p, m)$, defined as follows.
First, compute the p × m correlation matrix between the two subspaces:
$K_{E,F}(i, j) = \langle \varphi_i^{E}, \varphi_j^{F}\rangle, \quad i = 1, \ldots, p, \quad j = 1, \ldots, m.$
Next, compute the singular value decomposition (SVD) of K E , F :
$K_{E,F} = U\,\Sigma\,V^{T}.$
The principal angles are obtained from the singular values as
$\cos(\Theta_k) = \sigma_k, \quad k = 1, \ldots, \min(p, m),$
where σ k denotes the k-th singular value of K E , F .
The distribution of the angles $\{\Theta_k\}$ provides a measure of the separation between the two subspaces. If all angles are equal to $\frac{\pi}{2}$, the subspaces are perfectly orthogonal. Conversely, if all angles are equal to 0, the two subspaces coincide.
One of the most popular metrics used to estimate the orthogonality between two subspaces based on principal angles is the chordal metric, defined as
$d_c^{2} = \sum_{k=1}^{\min(p, m)} \sin^{2}(\Theta_k).$
It is well established that $d_c^{2} \leq r = \min(p, m)$. Therefore, values of $d_c^{2}$ close to zero indicate weak orthogonality, whereas values of $d_c^{2}$ close to $r$ indicate strong orthogonality. In this study, $d_c^{2}$ values are normalized by the minimum dimension.
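The following sketch (assuming NumPy and basis matrices with orthonormal columns, e.g., obtained with np.linalg.qr) computes the principal angles and the normalized chordal distance reported in Table 6; the function names are illustrative:
```python
import numpy as np

def principal_angles(E, F):
    """Principal angles between the subspaces spanned by the columns of E and F.
    Both matrices are assumed to have orthonormal columns (e.g., from np.linalg.qr)."""
    s = np.linalg.svd(E.T @ F, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

def normalized_chordal(E, F):
    """Chordal distance d_c^2 = sum_k sin^2(Theta_k), normalized by min(p, m)."""
    theta = principal_angles(E, F)
    return float(np.sum(np.sin(theta) ** 2) / theta.size)
```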
Table 6 shows the average principal angles Θ k between subspaces; the corresponding standard deviations, which indicate how far the remaining angles deviate from the average; and the chordal distance metric. For multi-class problems, the average of the normalized chordal distance is reported.
The table shows that overall, there is a clear correlation between subspace orthogonality (separability) and the performance of the method. However, there are two exceptions. In the Madelon dataset, the subspaces exhibit high orthogonality, while the classification performance is poor. Conversely, the Breast Cancer dataset shows low orthogonality but high performance. These cases can be explained by the fact that orthogonality alone is not a decisive criterion for classification performance; the representativeness of the training data with respect to the real-world distribution is crucial.

3.2.4. CPU Time Discussion

The proposed method is implemented on the MATLAB R2018b platform. Note that LIBSVM is written in C++; therefore, a direct comparison of execution times is not meaningful, as C++ implementations are reported in many studies to be up to 100 times faster than MATLAB (see, for instance, [40]). For fairness, the MATLAB execution time was divided by 50. The KNN and KPCA methods are implemented mainly in Python, and for a fair comparison, their processing time was also divided by 50. All experiments were conducted on an HP Victus 15L Gaming Desktop (TG02-0xxx) running Ubuntu 24.04.3 LTS, equipped with an AMD Ryzen 7 5700G processor (16 threads) and 32 GB of RAM; only one CPU core was used. The results reported in Table 7 and Table 8 show that KOS is significantly faster than SVM, faster than the KNN, and comparable to KPCA. Notably, KOS remains faster than SVM even without re-scaling the CPU time.

3.2.5. Complexity Analysis

Let $d$ denote the attribute space dimension, $c$ the class size (assuming all classes have approximately the same size), and $n$ the number of classes. The SVM method requires solving $n + \frac{n(n-1)}{2} = O(n^{2})$ optimization problems in order to determine the separating hyperplanes. If Newton's method is employed for the optimization process, the computational complexity of each iteration is $O(d^{3})$. This results in an overall complexity of $O(k\, n^{2}\, d^{3})$, where $k$ denotes the average number of iterations required for each optimization problem.
For KOS, computing the eigenpairs of the kernel (Mercer) matrix for each class has a complexity of $O(c^{3})$. Since this computation is performed independently for each class, the total complexity is $O(n\, c^{3})$.
To achieve a comparable processing time, we require
$c = d\,\sqrt[3]{k\, n}.$
A typical accuracy of $\varepsilon = 10^{-6}$ is generally required, and since $k \sim 1/\varepsilon$, we obtain $k \sim 10^{6}$. This leads to
$c = d\,\sqrt[3]{10^{6}\, n}.$
This indicates that for KOS and SVM to have comparable processing times, the class size must be several orders of magnitude larger than the attribute space dimension. Such a situation is highly unlikely in practice, particularly in high-dimensional cases.
Finally, if both the feature space dimension d and the class size c are assumed constant and only the number of classes n increases, the computational complexities of SVM and KOS reduce to
$O(n^{2}) \quad \text{and} \quad O(n),$
respectively. This confirms that KOS scales linearly with respect to the number of classes, as stated in the introduction.

3.3. KOS: Main Advantages Compared to SVM

As shown by the tests, KOS performs similarly to SVM while offering several significant advantages:
1.
The KOS algorithm is easier to implement: it only requires computing the eigenpairs of the correlation matrix (Formula (9) or (14)), followed by a direct application of Formulas (20) and (21), or (26) and (27).
2.
Robustness: The KOS algorithm is robust and fast; it requires no optimization process, which in SVM can slow down the algorithm in high dimensions and can fail because of round-off errors even though the cost function is quadratic.
3.
Complexity: SVM complexity grows quadratically with the number of classes (requiring $n + \frac{n(n-1)}{2}$ hyperplanes), while KOS scales linearly, needing only $n$ sets of eigenpairs.
4.
Time efficiency: As a result of the two previous properties, KOS is significantly faster than SVM, up to more than 60 times faster in certain tests, as shown in Table 7.
5.
Parallelization: KOS is highly parallelizable with respect to the number of classes since all subspaces are independent.
6.
Dynamic classification: In the case of class creation or cancellation, SVM requires recalculating all feature space hyperplanes. By contrast, KOS only requires computing eigenpairs of the Mercer kernel matrix for the new class, with no update needed for canceled classes.
7.
Imbalanced classes: In KOS, this issue is naturally handled by subdividing large classes into smaller, balanced subclasses; see Section 2.6. This avoids the need for artificial attributes, which may distort the data structure, or the use of balancing weights, which often lack a clear and systematic procedure.

4. Conclusions

In this paper, we proposed an alternative to the SVM method, referred to as KOS (Kernel-based Optimal Subspaces), which belongs to the family of kernel subspace methods. The approach involves constructing optimal subspaces in the feature space and classifying an unseen attribute vector based on its minimum distance to these subspaces. Some theoretical foundations are presented, and proofs are provided. The POD subspaces are shown to be optimal in the sense that they are the smallest subspaces containing the feature classes, with basis systems that serve as the best representatives of the data. All necessary formulations for practical implementation are derived.
The issue of imbalanced classes is addressed using a simple and intuitive mechanism, eliminating the need for additional preprocessing or resampling techniques. Experimental results on both 2D and high-dimensional multi-class datasets from real-world scenarios demonstrate that KOS achieves performance comparable to optimized SVM. The method is also compared with KNNs and KPCA. Beyond accuracy, KOS offers several practical advantages over SVM: it is easier to implement, significantly faster, and more robust; systematically handles imbalanced classes; is suitable for dynamic classification tasks; and is highly parallelizable with respect to the number of classes. These strengths make KOS a compelling alternative to SVM across a wide range of applications.

Funding

This research study was funded by Alfaisal University grant IRG with reference IRG20411.

Institutional Review Board Statement

Not relevant to this study.

Informed Consent Statement

Not applicable. This study did not involve humans.

Data Availability Statement

All data used are publicly available at: https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/. Accessed on 5 July 2025.

Acknowledgments

This research study is supported by Alfaisal University grant IRG with reference IRG20411. During the preparation of this manuscript, the author used the free online version of ChatGPT to improve the clarity, grammar, and overall readability of the English text. The author has reviewed and edited the output and takes full responsibility for the content of this publication.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A. Proof of the POD Best Representatives

Let $Y^{k}$ denote the columns of matrix $A$; the upper index is used here instead of the lower index, which is reserved for the column components. $V_1$ satisfies the optimization problem
$V_1 = \arg\max_{V \in \mathbb{R}^{m}} \sum_{i=1}^{N} |\langle V, Y^{i}\rangle|^{2} \quad \text{subject to } \|V\|^{2} = 1.$
This constrained optimization problem can be solved using Lagrange multipliers as
$\mathcal{L}(V, \lambda) = \sum_{i=1}^{N} \langle V, Y^{i}\rangle^{2} + \lambda\,\big(1 - \|V\|^{2}\big), \quad (V, \lambda) \in \mathbb{R}^{m+1},$
and the solution is obtained by setting the gradient to zero:
$\nabla\mathcal{L}(V, \lambda) = 0 \quad \text{in } \mathbb{R}^{m}\times\mathbb{R}.$
Differentiating $\mathcal{L}$ with respect to $V_j$ gives
$\frac{\partial\mathcal{L}}{\partial V_j} = \frac{\partial}{\partial V_j}\Big[\sum_{i=1}^{N}\Big(\sum_{k=1}^{m} V_k Y_k^{i}\Big)^{2} + \lambda\Big(1 - \sum_{k=1}^{m} V_k^{2}\Big)\Big] = \sum_{i=1}^{N} 2\Big(\sum_{k=1}^{m} V_k Y_k^{i}\Big) Y_j^{i} - 2\lambda V_j = 2\sum_{k=1}^{m}\Big(\sum_{i=1}^{N} Y_k^{i} Y_j^{i}\Big) V_k - 2\lambda V_j = 2\sum_{k=1}^{m} \big(AA^{T}\big)_{k,j} V_k - 2\lambda V_j = 2\sum_{k=1}^{m} \big(AA^{T}\big)_{j,k} V_k - 2\lambda V_j,$
where the last equality uses the symmetry of $AA^{T}$ to permute the indexes.
Then
$\frac{\partial\mathcal{L}}{\partial V_j} = 0 \quad \text{implies} \quad \sum_{k=1}^{m} \big(AA^{T}\big)_{j,k} V_k = \lambda V_j,$
and then, by setting V = V 1 and λ = λ 1 ,
$AA^{T}\, V_1 = \lambda_1 V_1.$
This shows that $V_1$ is an eigenvector of the matrix $AA^{T}$ and $\lambda_1$ is the associated eigenvalue. Now we evaluate the maximum by substituting $V_1$:
$\sum_{i=1}^{N} \langle V_1, Y^{i}\rangle^{2} = \sum_{i=1}^{N}\Big(\sum_{k=1}^{m} V_k^{1} Y_k^{i}\Big)\Big(\sum_{j=1}^{m} V_j^{1} Y_j^{i}\Big) = \sum_{k=1}^{m}\sum_{j=1}^{m} V_k^{1} V_j^{1} \sum_{i=1}^{N} Y_k^{i} Y_j^{i} = \sum_{k=1}^{m}\sum_{j=1}^{m} V_k^{1} V_j^{1} \big(AA^{T}\big)_{k,j} = \langle AA^{T} V_1, V_1\rangle = \lambda_1 \langle V_1, V_1\rangle = \lambda_1.$
We want $V_2$ to be the second-best representative, with $V_2$ perpendicular to $V_1$; that is, $V_2$ lies in the orthogonal complement of the space spanned by $V_1$. This can be formulated as follows:
$V_2 = \arg\max_{V \in \mathbb{R}^{m}} \sum_{i=1}^{N} |\langle V, Y^{i}\rangle|^{2} \quad \text{subject to } \|V\|^{2} = 1 \text{ and } \langle V, V_1\rangle = 0.$
Note that the matrix $AA^{T}$ is symmetric and positive semi-definite; hence it has $m$ eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m \geq 0$ and $m$ corresponding eigenvectors $V_1, \ldots, V_m$ that can be chosen to be orthonormal; that is,
$\|V_i\| = 1 \text{ for } i = 1, \ldots, m \quad \text{and} \quad \langle V_i, V_j\rangle = 0 \text{ if } i \neq j.$
Since we look for a $V$ orthogonal to $V_1$, i.e., $V \in \operatorname{span}\{V_2, \ldots, V_m\}$, it can be expanded as
$V = \sum_{k=2}^{m} \langle V, V_k\rangle\, V_k.$
Now we estimate the quantity to maximize at such an arbitrary $V$:
$\sum_{i=1}^{N} \langle V, Y^{i}\rangle^{2} = \sum_{i=1}^{N}\Big(\sum_{k=2}^{m}\langle V, V_k\rangle\langle V_k, Y^{i}\rangle\Big)^{2} = \sum_{i=1}^{N}\sum_{k=2}^{m}\sum_{j=2}^{m}\langle V, V_k\rangle\langle V, V_j\rangle\langle V_k, Y^{i}\rangle\langle V_j, Y^{i}\rangle = \sum_{k=2}^{m}\sum_{j=2}^{m}\langle V, V_k\rangle\langle V, V_j\rangle \sum_{i=1}^{N}\langle V_j, Y^{i}\rangle\langle V_k, Y^{i}\rangle.$
We can check, by expanding in terms of components, that
$\sum_{i=1}^{N}\langle V_j, Y^{i}\rangle\langle V_k, Y^{i}\rangle = \langle AA^{T} V_j, V_k\rangle,$
and by orthogonality
$\langle AA^{T} V_j, V_k\rangle = 0 \text{ if } j \neq k, \quad \text{and} \quad \langle AA^{T} V_j, V_j\rangle = \lambda_j.$
Therefore
$\sum_{i=1}^{N}\langle V, Y^{i}\rangle^{2} = \sum_{j=2}^{m}\langle V, V_j\rangle^{2}\,\lambda_j \leq \lambda_2 \sum_{j=2}^{m}\langle V, V_j\rangle^{2} = \lambda_2\,\|V\|^{2} = \lambda_2.$
Now, if we substitute $V$ with $V_2$ and follow the same steps as in (A4), we obtain
$\sum_{i=1}^{N}\langle V_2, Y^{i}\rangle^{2} = \lambda_2,$
which proves that the maximum is reached at $V_2$. Repeating the same process shows that the $p$ vectors we are looking for are the eigenvectors of $AA^{T}$. For more details, see [26].

References

  1. Alpaydin, E. Introduction to Machine Learning, 4th ed.; MIT: Cambridge, MA, USA, 2020; pp. xix, 1–3, 13–18. [Google Scholar]
  2. Vapnik, V. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995. [Google Scholar]
  3. Vapnik, V. Statistical Learning Theory; Wiley: New York, NY, USA, 1998. [Google Scholar]
  4. Cortes, C.; Vapnik, V. Support vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  5. Mercer, J. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. R. Soc. 1909, 209, 415–446. [Google Scholar] [CrossRef]
  6. Wang, H.; Li, G.; Wang, Z. Fast SVM classifier for large-scale classification problems. Inf. Sci. 2023, 642, 119136. [Google Scholar] [CrossRef]
  7. Shao, Y.H.; Lv, X.J.; Huang, L.W.; Bai, L. Twin SVM for conditional probability estimation in binary and multiclass classification. Pattern Recognit. 2023, 136, 109253. [Google Scholar] [CrossRef]
  8. Wang, H.; Shao, Y. Fast generalized ramp loss support vector machine for pattern classification. Pattern Recognit. 2024, 146, 109987. [Google Scholar] [CrossRef]
  9. Wang, B.Q.; Guan, X.P.; Zhu, J.W.; Gu, C.C.; Wu, K.J.; Xu, J.J. SVMs multi-class loss feedback based discriminative dictionary learning for image classification. Pattern Recognit. 2021, 112, 107690. [Google Scholar] [CrossRef]
  10. Borah, P.; Gupta, D. Functional iterative approaches for solving support vector classification problems based on generalized Huber loss. Neural Comput. Appl. 2020, 32, 1135–1139. [Google Scholar] [CrossRef]
  11. Gaye, B.; Zhang, D.; Wulamu, A. Improvement of Support Vector Machine Algorithm in Big Data Background. Hindawi Math. Probl. Eng. 2021, 2021, 5594899. [Google Scholar] [CrossRef]
  12. Tian, Y.; Shi, Y.; Liu, X. Advances on support vector machines research. Technol. Econ. Dev. Econ. 2012, 18, 5–33. [Google Scholar] [CrossRef]
  13. Ayat, N.E.; Cheriet, M.; Remaki, L.; Suen, C.Y. KMOD—A New Support Vector Machine Kernel with Moderate Decreasing for Pattern Recognition. Application to Digit Image Recognition. In Proceedings of the Sixth International Conference on Document Analysis and Recognition, Seattle, WA, USA, 10–13 September 2001; pp. 1215–1219. [Google Scholar]
  14. Remaki, L. Efficient Alternative to SVM Method in Machine Learning. In Intelligent Computing; Arai, K., Ed.; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2025; Volume 1426. [Google Scholar]
  15. Yoshikazu, W.; Nakayama, Y. Learning subspace classification using subset approximated kernel principal component analysis. IEICE Trans. Inf. Syst. 2016, 99, 1353–1363. [Google Scholar] [CrossRef]
  16. Jiang, W.; Chen, Y.; Wu, L.; Yu, P.S. Subspace learning for effective meta-learning. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 10177–10194. [Google Scholar]
  17. Cao, Y.-H.; Wu, J.X. Random subspace sampling for classification with missing data. J. Comput. Sci. Technol. 2024, 39, 472–486. [Google Scholar] [CrossRef]
  18. Schölkopf, B.; Smola, A.; Müller, K.R. Kernel principal component analysis. In Artificial Neural Networks—ICANN’97; Springer: Berlin/Heidelberg, Germany, 1997; pp. 583–588. [Google Scholar]
  19. Zhou, S.; Ou, Q.; Liu, X.; Wang, S.; Liu, L.; Wang, S. Multiple Kernel Clustering with Compressed Subspace Alignment. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 252–263. [Google Scholar] [CrossRef]
  20. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 1–27. [Google Scholar] [CrossRef]
  21. Lin, S.W.; Ying, K.C.; Chen, S.C.; Lee, Z.J. Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Syst. Appl. 2008, 35, 1817–1824. [Google Scholar] [CrossRef]
  22. Syarif, I.; Prugel-Bennett, A.; Wills, G. SVM Parameter Optimization Using Grid Search and Genetic Algorithm to Improve Classification Performance. TELKOMNIKA 2016, 14, 1502–1509. [Google Scholar] [CrossRef]
  23. Shekar, B.H.; Dagnew, G. Grid Search-Based Hyperparameter Tuning and Classification of Microarray Cancer Data. In Proceedings of the Second International Conference on Advanced Computational and Communication Paradigms (ICACCP), Gangtok, India, 25–28 February 2019. [Google Scholar]
  24. Hinton, E.; Osindero, S.; Teh, Y. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar] [CrossRef]
  25. Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
  26. Volkwein, S. Proper Orthogonal Decomposition: Theory and Reduced-Order Modelling; Lecture Notes; University of Konstanz: Konstanz, Germany, 2013; Volume 4. [Google Scholar]
  27. Wang, W.; Zhang, M.; Wang, D.; Jiang, Y. Kernel PCA feature extraction and the SVM classification algorithm for multiple-status, through-wall, human being detection. EURASIP J. Wirel. Commun. Netw. 2017, 2017, 151. [Google Scholar] [CrossRef]
  28. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. Smote: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  29. Remaki, L.; Cheriet, M. KCS—New kernel family with compact support in scale space: Formulation and impact. IEEE Trans. Image Process. 2000, 9, 970–981. [Google Scholar] [CrossRef]
  30. Koenderink, J.J. The structure of images. Biol. Cybern. 1984, 53, 363–370. [Google Scholar] [CrossRef]
  31. Vanschoren, J.; Van Rijn, J.N.; Bischl, B.; Torgo, L. OpenML: Networked science in machine learning. SIGKDD Explor. Newsl. 2014, 15, 49–60. [Google Scholar] [CrossRef]
  32. Golub, T.R.; Slonim, D.K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.P.; Coller, H.; Loh, M.L.; Downing, J.R.; Caligiuri, M.A.; et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 1999, 286, 531. [Google Scholar] [CrossRef] [PubMed]
  33. Hsu, C.W.; Chang, C.C.; Lin, C.J. A Practical Guide to Support Vector Classification; Technical Report; Department of Computer Science, National Taiwan University: Taipei, Taiwan, 2003. [Google Scholar]
  34. Available online: http://archive.ics.uci.edu/ml/index.php (accessed on 5 July 2025).
  35. Available online: https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/ (accessed on 5 July 2025).
  36. Guyon, I.; Gunn, S.; Ben Hur, A.; Dror, G. Result analysis of the NIPS 2003 feature selection challenge. Adv. Neural Inf. Process. Syst. 2004, 17, 545–552. [Google Scholar]
  37. Hsu, C.W.; Lin, C.J. A comparison of methods for multi-class support vector machines. IEEE Trans. Neural Netw. 2002, 13, 415–425. [Google Scholar] [PubMed]
  38. Hull, J.J. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 550–554. [Google Scholar] [CrossRef]
  39. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 16, Savannah, GA, USA, 2–4 November 2016; pp. 265–283. [Google Scholar]
  40. Andrews, T. Computation Time Comparison Between Matlab and C++ Using Launch Windows. Available online: https://digitalcommons.calpoly.edu/aerosp/78/ (accessed on 22 September 2024).
Figure 1. SVM vs. KOS explanatory illustration.
Figure 2. Imbalanced class strategy.
Figure 3. Connected sets case (Ref. [14]).
Figure 4. Non-connected sets case (Ref. [14]).
Figure 5. Spiral case (Ref. [14]).
Table 1. Test information (two classes).

| Test Name | No. of Classes | Description/Reference | Training Set Size / Class Sizes | Test Set Size / Feature Vector Size |
|---|---|---|---|---|
| Leukemia | 2 | Molecular classification of cancer [32] | 38 / size(C1) = 27, size(C2) = 11 | 34 / 7129 |
| svmguide1 | 2 | Astroparticle application [33] | 3089 / size(C1) = 200, size(C2) = 200 | 4000 / 4 |
| splice | 2 | Splice junctions in a DNA sequence [34] | 1000 / size(C1) = 483, size(C2) = 517 | 2175 / 60 |
| austrian | 2 | Credit Approval dataset [35] | 349 / size(C1) = 191, size(C2) = 158 | 341 / 14 |
| madelon | 2 | Analysis of the NIPS 2003 [36] | 200 / size(C1) = 1000, size(C2) = 1000 | 600 / 14 |
Table 2. Test information (multiple classes). Class sizes are listed in order C1 to Cn.

| Test Name | No. of Classes | Description/Reference | Training Set Size / Class Sizes | Test Set Size / Feature Vector Size |
|---|---|---|---|---|
| DNA | 3 | DNA [37] | 2000 / 464; 485; 1051 | 1186 / 180 |
| Satimage | 6 | Satellite images [37] | 4435 / 1072; 479; 961; 415; 470; 1038 | 2000 / 36 |
| USPS | 10 | Handwritten text recognition dataset [38] | 7291 / 1194; 1005; 731; 658; 652; 556; 664; 645; 542; 644 | 2007 / 256 |
| letter | 26 | Letter recognition dataset [38] | 15,000 / 789; 766; 736; 805; 768; 775; 773; 734; 755; 747; 739; 761; 792; 783; 753; 803; 783; 758; 748; 796; 813; 764; 752; 787; 786; 734 | 500 / 16 |
| shuttle | 7 | Space shuttle sensors [38] | 43,500 / 34,108; 37; 132; 6748; 2458; 6; 11 | 14,500 / 9 |
Table 3. KOS-SVM performance comparison on LIBSVM database.

| Name/Classes | Training / Test Set Size | SVM Accuracy / Precision | KOS Accuracy / Precision |
|---|---|---|---|
| Leukemia/2 | 38 / 34 | 82.35% / 86.42% | 82.35% / 82.58% |
| svmguide1/2 | 3089 / 4000 | 96.87% / 96.87% | 96.30% / 96.31% |
| splice/2 | 1000 / 2175 | 90.43% / 90.50% | 89.42% / 89.41% |
| austrian/2 | 349 / 341 | 85.04% / 85.01% | 85.63% / 85.36% |
| madelon/2 | 2000 / 600 | 61.16% / 59.51% | 61.50% / 61.55% |
| DNA/3 | 2000 / 1186 | 94.43% / 94.14% | 92.32% / 92.83% |
| Satimage/6 | 4435 / 200 | 91.85% / 91.68% | 91.35% / 90.40% |
| USPS/10 | 7291 / 2007 | 95.26% / 95.32% | 95.76% / 95.76% |
| letter/26 | 15,000 / 5000 | 97.9% / 97.90% | 97.42% / 97.44% |
| shuttle/7 | 43,500 / 14,500 | 99.9% / 99.93% | 99.91% / 99.91% |
Table 4. KOS-SVM performance comparison on OpenML database.

| Name/Classes | Training / Test Set Size | SVM Accuracy / Precision | KOS Accuracy / Precision |
|---|---|---|---|
| BreastCancer/2 | 398 / 171 | 97.07% / 97.07% | 97.07% / 97.01% |
| Zernike/10 | 1400 / 600 | 82.00% / 82.03% | 81.00% / 80.82% |
| Diabetes/2 | 537 / 231 | 74.026% / 74.40% | 74.46% / 72.20% |
| mfeat-morphological/10 | 1400 / 600 | 73.66% / 69.36% | 73.833% / 75.07% |
Table 5. KOS-KNN-KPCA performance comparison on LIBSVM and OpenML databases.

| Name/Classes | KNN Accuracy / Precision | KPCA Accuracy / Precision | KOS Accuracy / Precision |
|---|---|---|---|
| Leukemia/2 | 67.6% / 86.42% | 85.29% / 84.51% | 82.3% / 82.58% |
| svmguide1/2 | 96.5% / 96.5% | 88.98% / 88.94% | 96.30% / 96.31% |
| splice/2 | 86.3% / 86.3% | 83.49% / 83.50% | 89.42% / 89.41% |
| austrian/2 | 84.5% / 84.2% | 85.04% / 84.98% | 85.63% / 85.36% |
| madelon/2 | 70.5% / 71.2% | 58.33% / 58.33% | 61.50% / 61.55% |
| DNA/3 | 90.1% / 89.5% | 89.04% / 89.12% | 92.32% / 92.83% |
| Satimage/6 | 90.0% / 89.2% | 83.49% / 83.50% | 91.35% / 90.40% |
| USPS/10 | 92.9% / 95.32% | 66.62% / 60.49% | 95.76% / 95.76% |
| letter/26 | 90.50% / 92.3% | 61.56% / 61.12% | 97.42% / 97.44% |
| shuttle/7 | 93.8% / 93.3% | 84.78% / 85.50% | 99.91% / 99.91% |
| BreastCancer/2 | 97.1% / 96.3% | 93.57% / 93.53% | 97.076% / 97.01% |
| Zernike/10 | 80.8% / 80.2% | 72.50% / 72.25% | 81.00% / 80.82% |
| Diabetes/2 | 73.6% / 71.0% | 69.26% / 69.90% | 74.46% / 72.20% |
| mfeat-morphological/10 | 69.0% / 68.7% | 70.00% / 69.79% | 73.83% / 75.07% |
Table 6. Subspaces principal angles (PAs) and chordal distances (CDs).

| Name/Classes | PA Average | PA Standard Deviation | CD |
|---|---|---|---|
| Leukemia/2 | 0.9431607 π/2 | 0.1098443 | 0.9676548 × min(p, m) |
| svmguide1/2 | 0.8779066 π/2 | 0.3009425 | 0.8771604 × min(p, m) |
| splice/2 | 0.7883427 π/2 | 0.1908748 | 0.8458960 × min(p, m) |
| austrian/2 | 0.6836720 π/2 | 0.2971302 | 0.8458960 × min(p, m) |
| madelon/2 | 0.9539925 π/2 | 0.05631952 | 0.9878780 × min(p, m) |
| DNA/3 | 0.8459874 π/2 | 0.1555859 | 0.8984641 × min(p, m) |
| Satimage/6 | 0.9486068 π/2 | 0.1346845 | 0.9541381 × min(p, m) |
| USPS/10 | 0.9438224 π/2 | 0.09998060 | 0.9714052 × min(p, m) |
| letter/26 | 0.9431607 π/2 | 0.1098443 | 0.9676548 × min(p, m) |
| shuttle/7 | 0.8868192 π/2 | 0.09998060 | 0.9714052 × min(p, m) |
| BreastCancer/2 | 0.4914156 π/2 | 0.2680962 | 0.4980344 × min(p, m) |
| Zernike/10 | 0.6886542 π/2 | 0.2393689 | 0.7285724 × min(p, m) |
| Diabetes/2 | 0.2689528 π/2 | 0.2955982 | 0.2431952 × min(p, m) |
| mfeat-morphological/10 | 0.9094226 π/2 | 0.2617106 | 0.9099942 × min(p, m) |
Table 7. KOS-SVM processing time comparison.

| Name/Classes | Training / Test Set Size | SVM Processing Time | KOS Processing Time |
|---|---|---|---|
| Leukemia/2 | 38 / 34 | 0m8.646s | 0m0.005s |
| svmguide1/2 | 3089 / 4000 | 0m33.327s | 0m1.680854s |
| splice/2 | 1000 / 2175 | 0m29.190s | 0m0.262854s |
| austrian/2 | 349 / 341 | 0m1.721s | 0m0.025804s |
| madelon/2 | 2000 / 600 | 16m50.721s | 0m1.571s |
| DNA/3 | 2000 / 1186 | 4m24.300s | 0m0.614s |
| Satimage/6 | 4435 / 200 | 3m53.042s | 0m1.468212s |
| USPS/10 | 7291 / 2007 | 81m18.931s | 0m6.347s |
| letter/26 | 15,000 / 5000 | 44m11.996s | 0m21.330s |
| shuttle/7 | 43,500 / 14,500 | 62m28.783s | 0m55.077s |
| BreastCancer/2 | 398 / 171 | 0m1.696s | 0m0.037422s |
| Zernike/10 | 1400 / 600 | 0m48.875s | 0m0.126698s |
| Diabetes/2 | 537 / 231 | 0m6.959s | 0m0.062452s |
| mfeat-morphological/10 | 1400 / 600 | 0m23.421s | 0m0.1069s |
Table 8. KOS-KNN-KPCA processing time comparison.

| Name/Classes | KNN Processing Time | KPCA Processing Time | KOS Processing Time |
|---|---|---|---|
| Leukemia/2 | 0m0.0630s | 0m0.28780s | 0m0.005s |
| svmguide1/2 | 0m1.2287s | 0m0.8382s | 0m1.680854s |
| splice/2 | 0m0.5270s | 0m0.66s | 0m0.262854s |
| austrian/2 | 0m0.2475s | 0m0.1202s | 0m0.025804s |
| madelon/2 | 0m0.8191s | 0m0.541s | 0m1.571s |
| DNA/3 | 0m1.0193s | 0m0.6386s | 0m0.614s |
| Satimage/6 | 1m35.10s | 0m0.5332s | 0m1.468212s |
| USPS/10 | 0m24.9070s | 0m1.3262s | 0m6.347s |
| letter/26 | 2m47.025s | 0m4.1978s | 0m21.330s |
| shuttle/7 | 0m3.2707s | 0m56.142s | 0m55.077s |
| BreastCancer/2 | 0m0.3280s | 0m0.2946s | 0m0.037422s |
| Zernike/10 | 0m1.0684s | 0m0.5318s | 0m0.126698s |
| Diabetes/2 | 0m0.4419s | 0m0.3174s | 0m0.062452s |
| mfeat-morphological/10 | 0m0.743s | 0.3104s | 0m0.1069s |
