Informatics
  • Article
  • Open Access

24 December 2025

AIMarkerFinder: AI-Assisted Marker Discovery Based on an Integrated Approach of Autoencoders and Kolmogorov–Arnold Networks

The Artificial Intelligence Research Center, Novosibirsk State University, Pirogova Street 1, 630090 Novosibirsk, Russia
* Author to whom correspondence should be addressed.
This article belongs to the Section Machine Learning

Abstract

In modern bioinformatics, the analysis of high-dimensional data (genomic, metabolomic, etc.) remains a critical challenge due to the “curse of dimensionality,” where feature redundancy reduces classification efficiency and model interpretability. This study introduces a novel method, AIMarkerFinder (v0.1.0), for analyzing metabolomic data to identify key biomarkers. The method is based on a denoising autoencoder (DAE) with an attention mechanism, enabling the extraction of informative features and the elimination of redundancy. Experiments on glioblastoma and adjacent tissue metabolomic data demonstrated that AIMarkerFinder reduces dimensionality from 446 to 4 key features while improving classification accuracy. Using the selected metabolites (Malonyl-CoA, Glycerophosphocholine, SM(d18:1/22:0 OH), GC(18:1/24:1)), the Random Forest and Kolmogorov–Arnold Networks (KAN) models achieved accuracies of 0.904 and 0.937, respectively. The analytical formulas derived by the KAN provide model interpretability, which is critical for biomedical research. The proposed approach is applicable to genomics, transcriptomics, proteomics, and the study of the effects of exogenous factors on biological processes. The study’s results open new prospects for personalized medicine and early disease diagnosis.

1. Introduction

Recent advances in high-throughput technologies, such as next-generation sequencing (NGS), microarrays, and mass spectrometry (MS), have opened new opportunities for researchers in identifying the genetic causes of diseases [1,2]. Radiomics is a new technology in which medical imaging provides important information about tumor physiology [3]. Metabolomics, as part of systems biology, focuses on the comprehensive study of the metabolome—the collection of low-molecular-weight compounds (metabolites) involved in the biochemical processes of an organism. In recent years, this discipline has gained particular relevance due to its ability to reveal subtle mechanisms of cellular function regulation and predict changes in response to external and internal stimuli, such as diseases, drugs, or dietary changes. Metabolomic data provide a unique opportunity for the early diagnosis of pathologies, the development of personalized treatment methods, and a deeper understanding of the interplay between genetics, environment, and lifestyle [4].
These high-throughput representations suffer from the curse of dimensionality, so appropriate computational methods are required to extract knowledge from them [5]. Microarray data contain many heterogeneity factors because they include the expression of every possible gene in the genome. Scientific studies have proven that genes responsible for certain biological processes are interrelated, and some genes act as activators or inhibitors of others [6]. In high-dimensional data, such as microarray datasets, irrelevant features can interfere with true features, which in turn introduces heterogeneity into the data and generates dependencies between features. Statistical analysis loses its significance in the case of dependent features. Therefore, we must select features that play an important role in evaluation and are independent.
Identifying such independent genes (features) whose expression models have significant biological associations with phenotypic behavior is important for knowledge discovery. In microarray analysis, the goal for biologists is to detect a small number of features that explain microarray data behavior [7]. Selected significant biomarkers from microarray data are essential for patient stratification and the development of personalized medicine strategies [8]. From a machine learning perspective, controlling the number of features helps reduce overfitting, leading to better prediction of the target variable on previously unseen data. High dimensionality of the feature space complicates model construction and undermines the effectiveness of knowledge discovery. Therefore, a ratio of at least 10:1 samples per feature is recommended for creating reliable classifiers and predictive models [9].
The reason for feature selection is that classifiers trained on a reduced feature space are more robust and reproducible than those built on the original large feature space. In feature selection, particular attention is paid to correlated features, since they carry redundant information. Features that do not provide useful information are called irrelevant features, while features that do not provide additional information beyond already selected features are termed redundant features [10]. Features that are uncorrelated or unassociated with class variables are referred to as noise, which introduces bias into prediction and reduces classification efficiency. Therefore, noise must be removed to improve predictive performance, and this can be achieved through dimensionality reduction, either by feature extraction or by feature selection [11].
In feature extraction, new features are derived from the original input data by selecting a new basis for the data. Feature selection helps reduce the impact of high dimensionality on a dataset by identifying a subset of features that effectively define the data. Direct evaluation of all feature subsets is an NP-hard problem [12]. To address this, suboptimal procedures with tractable computations are used. Another important issue arises when a feature is related to the response variable rather than to the other predictors. Selecting a subset of features allows classifiers to focus on important features while ignoring potentially misleading ones. From a computational complexity perspective, an economical set of features is also beneficial because the cost of many learning algorithms grows rapidly with the number of features [13].
The best feature selection algorithm should always bring benefits such as data understanding, a better classifier model, improved generalization, and the identification of irrelevant features. It should also assist in understanding the relationships between features and target variables and reduce the computational requirements for solving a specific task. In addition, it should enable efficient dimensionality reduction for high-dimensional datasets in which the number of observations is smaller than the number of features, improve the performance of the predictor used for the task, and enhance efficiency in terms of cost and time. The feature selection process contributes to knowledge discovery, where the identified features can be directly used in future research. In bioinformatics, the identification of important features can indicate new metabolic pathways and help reveal hidden connections between specific cellular processes [13].
This paper describes the developed method AIMarkerFinder for highlighting important features, based on a denoising autoencoder (DAE) with an attention mechanism. The AIMarkerFinder method highlights important features in the data, enabling more accurate classifier models.

2. Materials and Methods

2.1. Metabolomics Data

Feature extraction was performed on a dataset of metabolomic profiles from glioblastoma and adjacent brain tissue from the study by Basov et al. [14]. The dataset contains 32 samples (16 glioblastoma and 16 adjacent tissues), with the concentration of 446 metabolites measured.

2.2. Data Preparation

During the training phase, the input data underwent preprocessing consisting of two steps: class balancing and noise injection.
To address class imbalance, the RandomOverSampler method from the imbalanced-learn (v0.13.0) library (abbreviated as imblearn) was used. This method increased the number of instances in underrepresented classes by randomly selecting and duplicating existing samples. If the dataset size after balancing was less than 3000 samples, each vector was duplicated repeatedly until the total dataset size exceeded 3000.
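A minimal sketch of this balancing and duplication step is shown below, assuming the feature matrix X and label vector y are NumPy arrays; the function and variable names are illustrative, not part of the released package.

```python
# Sketch of the class-balancing and duplication step described above (assumed names).
import numpy as np
from imblearn.over_sampling import RandomOverSampler

def balance_and_inflate(X, y, min_size=3000, random_state=0):
    # Randomly duplicate samples of the underrepresented class until classes are balanced.
    ros = RandomOverSampler(random_state=random_state)
    X_bal, y_bal = ros.fit_resample(X, y)

    # If the balanced set is still smaller than min_size, duplicate every vector
    # until the total number of samples exceeds min_size.
    while X_bal.shape[0] < min_size:
        X_bal = np.vstack([X_bal, X_bal])
        y_bal = np.concatenate([y_bal, y_bal])
    return X_bal, y_bal
```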
After class balancing, Gaussian noise was added to each component of the feature vector. The process is described by the formula:
C_noisy,n = C_original,n + ε,
where
C_noisy,n—the noisy value of the n-th component of the vector,
C_original,n—the original value of the n-th component,
ε ∼ N(0, σ_n·d)—noise sampled from a normal distribution with mean 0 and variance σ_n·d.
Parameters:
σ_n—the sample variance of the n-th component, calculated separately for each class,
d—a coefficient determining the noise level.
The noise model was chosen in accordance with the findings of Bishop [15], which justify that adding zero-mean Gaussian noise with controlled variance is a theoretically sound approach. The variance σn, calculated for each class and vector component, preserved the statistical characteristics of the original data. The coefficient d = 0.25 was selected empirically based on a sensitivity analysis of the variance of noisy data. For d > 0.4, a sharp increase in variance is observed—exceeding 10% relative to the original data—which may distort the statistical properties of the sample. In contrast, for d ≤ 0.25, the change in variance of the noisy data remains below 2%, preserving the structure of the original data while still providing sufficient variability for augmentation purposes.
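Under these definitions, the noise-injection step could be implemented roughly as follows; X and y denote the balanced data and labels, and all names are illustrative.

```python
# Sketch of class-conditional Gaussian noise injection with d = 0.25 (assumed names).
import numpy as np

def add_class_conditional_noise(X, y, d=0.25, random_state=0):
    rng = np.random.default_rng(random_state)
    X_noisy = X.astype(float).copy()
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        # Sample variance of each component, computed separately for this class.
        sigma = X[idx].var(axis=0, ddof=1)
        # epsilon ~ N(0, sigma * d): the standard deviation is sqrt(sigma * d).
        noise = rng.normal(loc=0.0, scale=np.sqrt(sigma * d),
                           size=(len(idx), X.shape[1]))
        X_noisy[idx] += noise
    return X_noisy
```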
This approach simultaneously addressed class balancing and increased the size of the training data by generating synthetic noisy samples, which improved the model’s generalization capability.

2.3. Implementation

The proposed denoising autoencoder (DAE) with an attention mechanism was implemented using the PyTorch (v0.2.6) framework. The Adam optimizer was used for training with a learning rate of 0.001. The model was trained for 10,000 epochs with a batch size of 2048. Early stopping was allowed if the error stabilized within 0.005 for 100 consecutive epochs. During training, model parameters achieving the best loss values on validation data were saved.
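A condensed sketch of a training loop matching these settings is given below; the model, the data loaders, and the compute_loss helper are placeholders rather than the released implementation.

```python
# Sketch of the DAE training loop: Adam (lr = 0.001), up to 10,000 epochs, batch size 2048,
# early stopping once the validation loss stays within 0.005 for 100 consecutive epochs,
# and checkpointing of the best parameters on validation data. Names are illustrative.
import copy
import torch

def train_dae(model, train_loader, val_loader, compute_loss,
              max_epochs=10_000, patience=100, tolerance=0.005, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_state, best_val = None, float("inf")
    stable_epochs, prev_val = 0, None

    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = compute_loss(model, batch)
            loss.backward()
            optimizer.step()

        # Validation loss drives checkpointing and early stopping.
        model.eval()
        with torch.no_grad():
            val_loss = sum(compute_loss(model, b).item() for b in val_loader) / len(val_loader)

        if val_loss < best_val:
            best_val = val_loss
            best_state = copy.deepcopy(model.state_dict())  # keep the best parameters

        # Stop if the loss has stabilized within the tolerance for `patience` epochs.
        if prev_val is not None and abs(val_loss - prev_val) < tolerance:
            stable_epochs += 1
        else:
            stable_epochs = 0
        if stable_epochs >= patience:
            break
        prev_val = val_loss

    model.load_state_dict(best_state)
    return model
```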
Training of the random forest and C4.5 models was performed using the scikit-learn (v1.5.1) library. For the random forest model, hyperparameter optimization was conducted with GridSearchCV using the following parameters (see the code sketch after these lists):
  • n_estimators: np.arange(10, 201, 20),
  • max_depth: np.arange(3, 15, 2),
  • random_state: [0].
For the C4.5 model, the GridSearchCV parameters were set as:
  • criterion: ['entropy'],
  • max_depth: np.arange(1, 21),
  • min_samples_split: np.arange(2, 11),
  • max_leaf_nodes: np.arange(3, 26),
  • max_features: ['sqrt'],
  • random_state: np.arange(0, 10).
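The two searches could be expressed roughly as follows. Since scikit-learn has no native C4.5 implementation, a DecisionTreeClassifier with the entropy criterion is assumed to serve as its stand-in; X and y denote the feature matrix and class labels.

```python
# Sketch of the hyperparameter searches described above (assumed data names X, y).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

rf_grid = GridSearchCV(
    RandomForestClassifier(),
    param_grid={
        "n_estimators": np.arange(10, 201, 20),
        "max_depth": np.arange(3, 15, 2),
        "random_state": [0],
    },
    cv=5,
)

c45_grid = GridSearchCV(
    DecisionTreeClassifier(),  # entropy-based tree used as a C4.5 stand-in
    param_grid={
        "criterion": ["entropy"],
        "max_depth": np.arange(1, 21),
        "min_samples_split": np.arange(2, 11),
        "max_leaf_nodes": np.arange(3, 26),
        "max_features": ["sqrt"],
        "random_state": np.arange(0, 10),
    },
    cv=5,
)

# rf_grid.fit(X, y); c45_grid.fit(X, y)  # accuracy evaluated with 5-fold cross-validation
```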
The accuracy of the C4.5 and Random Forest (RF) models was evaluated using 5-fold cross-validation.
The KAN method implementation was based on the pykan (v0.2.6) library [16]. The network structure consisted of 3 layers: the first and last layers had dimensions matching the input and output data, respectively, while the intermediate layer had a dimension one-third that of the input data. The model was trained with the following parameters:
  • grid = 5,
  • k = 3,
  • scale_base_mu = 1,
  • scale_base_sigma = 2,
  • noise_scale = 0.3.
The optimizer was LBFGS, and the loss function was CrossEntropyLoss. Training was repeated 100 times with different initial values of the seed parameter (0–99). The model with the highest accuracy was selected.
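A minimal sketch of this setup is shown below, assuming the pykan v0.2.x interface (the KAN constructor and model.fit); the number of LBFGS steps and the layout of the dataset dictionary are illustrative assumptions.

```python
# Sketch of the KAN training setup under the stated parameters (assumed pykan v0.2.x API).
import torch
from kan import KAN

def train_kan(dataset, n_features, n_classes, seed):
    # Three layers: input-sized layer, an intermediate layer one third of the input size,
    # and an output layer matching the number of classes.
    model = KAN(width=[n_features, max(1, n_features // 3), n_classes],
                grid=5, k=3, scale_base_mu=1, scale_base_sigma=2,
                noise_scale=0.3, seed=seed)
    # dataset is assumed to hold 'train_input'/'train_label' (and test splits) as tensors.
    model.fit(dataset, opt="LBFGS", steps=50, loss_fn=torch.nn.CrossEntropyLoss())
    return model

# Training is repeated with seed = 0..99, and the model with the highest accuracy is kept.
```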
It should be emphasized that the synthetic dataset generated through class-balancing and Gaussian noise injection was used only during the training phase of the denoising autoencoder (DAE) and Kolmogorov–Arnold Networks (KANs). All subsequent steps—including feature weighting via the attention mechanism, iterative selection of informative metabolites, and final model evaluation—were performed exclusively on the original, unmodified dataset (32 samples, 446 metabolites). This ensures that the biological interpretability of the selected features is grounded in real experimental measurements rather than artifacts of data augmentation.

2.4. Selection of the Most Significant Features

After training the model, a set of parameters with the best loss function performance on validation data is selected. The original data are input into the model, and the output includes latent layer vectors (embedding) and attention layer vectors (attention). Embedding vectors are used for visualization and can further be employed to build classification models. The attention layer vectors are averaged component-wise. The components of the resulting averaged attention vector (Matt) are sorted in descending order. The index of the component where the maximum change in values occurs is determined by the formula:
k_max = argmax_k ((M_att[k − 1] − M_att[k]) / M_att[k])
Attention components with indices not exceeding kmax are considered the most significant for the model. However, since the model simultaneously solves two tasks (compression and classification), the selected features include both classification-relevant features and those reflecting data structure. Therefore, in the next step, a Random Forest method was applied to select features with high classification ability. Training was conducted on a dataset containing only the previously selected features. As a result of Random Forest model training, a feature importance vector (FI) was obtained. The components of the FI vector are sorted in descending order and normalized by the median value. To account for model quality, all components of the vector are multiplied by the average accuracy of cross-validated models raised to the power of Alpha. The exponent Alpha is critical and determines the method’s sensitivity. At Alpha = 2, the model selects only features enabling high-accuracy classification, while at Alpha = 1, it allows lower-quality classification. In the implemented method, Alpha was evaluated as a function of the number of selected DAE features and the size of the DAE latent layer:
Alpha = (latent layer size + number of selected features)/latent layer size
For the obtained sorted and normalized vector, the inflection point of the graph is found using the Piecewise Linear Regression algorithm [17].
Normalized values greater than 1.0 are considered a necessary condition for the presence of classification-capable features. If no such values exist in the vector, the algorithm stops.
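The core of one selection iteration can be sketched as follows; attention denotes the matrix of per-sample attention vectors produced by the trained DAE, X and y the original data, and the final piecewise-linear inflection-point step is left as a placeholder. All names are illustrative.

```python
# Sketch of one AIMarkerFinder selection iteration (assumed names, simplified).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def select_by_attention(attention):
    m_att = attention.mean(axis=0)                  # component-wise average of attention vectors
    order = np.argsort(m_att)[::-1]                 # sort components in descending order
    m_sorted = m_att[order]
    # Relative change between consecutive sorted components: (M[k-1] - M[k]) / M[k].
    rel_drop = (m_sorted[:-1] - m_sorted[1:]) / m_sorted[1:]
    k_max = int(np.argmax(rel_drop)) + 1
    return order[:k_max]                            # features before the largest relative drop

def weight_by_classification(X, y, feature_idx, latent_size):
    rf = RandomForestClassifier(random_state=0).fit(X[:, feature_idx], y)
    acc = cross_val_score(rf, X[:, feature_idx], y, cv=5).mean()
    fi = np.sort(rf.feature_importances_)[::-1]     # sorted feature importance vector
    fi = fi / np.median(fi)                         # normalize by the median value
    alpha = (latent_size + len(feature_idx)) / latent_size
    weighted = fi * acc ** alpha                    # penalize low-quality classifiers
    if not np.any(weighted > 1.0):
        return None                                 # stop: no classification-capable features
    # The cut-off is then taken at the inflection point of this curve
    # (Piecewise Linear Regression in the paper); omitted here.
    return weighted
```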
Data structures can be complex, and the solution to the feature selection problem for classification may be ambiguous. To address this, AIMarkerFinder (v0.1.0) proposes iteratively extracting important features by removing already selected features from the data.
As a result of iterative algorithm runs, a list of important features is identified. For this list, a classifier is built using the Random Forest (RF) method, and a feature importance vector is computed. Feature weights are normalized by the median value, and the inflection point is determined using the Piecewise Linear Regression algorithm. This results in a reduced set of important features used in the classifier model.

2.5. Hardware

All computations were performed on a workstation equipped with an 8-core AMD Ryzen 7 5800X CPU (Zen 3 architecture, base clock frequency of 3.8 GHz; manufactured by AMD, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 4090 GPU with 24 GB of GDDR6X video memory (manufactured by NVIDIA, Santa Clara, CA, USA).

3. Results

3.1. Denoising Autoencoder

Autoencoders are a powerful tool for noise suppression tasks, allowing efficient extraction of clean signals from noisy data. However, to achieve high performance in real-world conditions, autoencoders must be trained on diverse types of noise. The study employs a denoising autoencoder (DAE) model with an attention mechanism for the classification task. The network architecture is based on the work of Guan et al. [18] and was modified to detect the most informative features.
The general scheme of the AIMarkerFinder method is presented in Figure 1. It consists of four main components: an attention module, an encoder, a decoder, and a classifier. The attention module automatically identifies the most significant components of the input vector. The encoder compresses the input data to the hidden layer dimension. The decoder restores the original data dimension from the compressed representation. The classifier predicts the object category.
Figure 1. General scheme of the denoising autoencoder with an attention mechanism used in the study. Blue rectangles schematically represent data vectors.
During training, the data first pass through the attention module for “weighting” the features. They are then fed into the encoder and projected into a low-dimensional hidden feature space. The classifier is trained on compressed representations using category labels, enhancing the discriminative ability of latent features. Simultaneous decoder training ensures the preservation of maximum information necessary for data reconstruction.
The attention module is implemented as a fully connected layer with a combination of Softmax and Sigmoid activation functions. The Sigmoid function provides a clearer boundary between important and secondary features. The obtained coefficients are used to weight the input data fed into the encoder.
The encoder consists of 2 layers, where the size of the first layer matches the input vector size, and the hidden layer size is determined by the number of principal components explaining 99% of the data variance. The decoder has a symmetric architecture to the encoder.
The classifier receives data from the encoder output. Its architecture also consists of 2 fully connected layers: the first layer is half the size of the autoencoder’s hidden layer, and the output layer corresponds to the number of recognizable classes.
All model components are trained jointly, enabling the encoder to find features that simultaneously separate data into classes and reflect data properties.
The loss function includes a weighted sum of three components: Mean Squared Error (MSE) and the coefficient of determination (R2) for evaluating data reconstruction quality, as well as Binary Cross-Entropy (BCE) for classification quality:
Loss = MSE + λ1 × (1 − R2) + λ2 × BCE.
Parameters λ1 and λ2 are set to 0.1 and 0.3, respectively.
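A minimal PyTorch sketch of this architecture and loss is given below. The exact way the Softmax and Sigmoid functions are combined in the attention module and the activations between layers are not fully specified above, so those details are assumptions; latent_dim would be set to the number of principal components explaining 99% of the variance.

```python
# Sketch of the attention-guided denoising autoencoder with a joint classifier (assumed details).
import torch
import torch.nn as nn

class AttentionDAE(nn.Module):
    def __init__(self, in_dim, latent_dim, n_classes=2):
        super().__init__()
        self.attention = nn.Linear(in_dim, in_dim)  # feature-weighting module
        self.encoder = nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(),
                                     nn.Linear(in_dim, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, in_dim), nn.ReLU(),
                                     nn.Linear(in_dim, in_dim))
        self.classifier = nn.Sequential(nn.Linear(latent_dim, latent_dim // 2), nn.ReLU(),
                                        nn.Linear(latent_dim // 2, n_classes))

    def forward(self, x):
        scores = self.attention(x)
        # Assumed reading of the "Softmax and Sigmoid" combination for feature weighting.
        w = torch.softmax(scores, dim=-1) * torch.sigmoid(scores)
        z = self.encoder(x * w)                     # compressed representation (embedding)
        return self.decoder(z), self.classifier(z), z, w

def dae_loss(x, x_hat, logits, labels_onehot, lam1=0.1, lam2=0.3):
    mse = nn.functional.mse_loss(x_hat, x)
    ss_res = ((x - x_hat) ** 2).sum()
    ss_tot = ((x - x.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot                      # coefficient of determination
    bce = nn.functional.binary_cross_entropy_with_logits(logits, labels_onehot.float())
    return mse + lam1 * (1.0 - r2) + lam2 * bce     # Loss = MSE + λ1(1 − R²) + λ2·BCE
```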

3.2. Data Classification

For building classifiers on the selected features, the following models were used: C4.5 [19], Random Forest (RF) [20], and Kolmogorov–Arnold Networks (KANs) [16]. The C4.5 method constructs a decision tree, generating a tree where internal nodes represent features (or attributes), and leaf nodes represent classes or labels. Each path from root to leaf represents a classification rule. The method is easily interpretable as it results in a set of logical rules.
Random Forest is an ensemble method that builds multiple decision trees and aggregates their predictions to improve accuracy and model stability. Each tree is trained on a random subsample of data (with replacement) and uses randomly selected features for splitting. The method has high accuracy, especially on complex data structures, and is robust to overfitting due to the ensemble approach.
The KAN model is an approach to constructing multidimensional functions based on the Kolmogorov–Arnold representation theorem. This theorem states that any continuous function of multiple variables can be represented as a superposition of continuous functions of one variable and the operation of addition. The KAN method uses this idea to build models that can produce analytical formulas describing dependencies.
To evaluate the quality of models built using C4.5 and RF, a 5-fold cross-validation procedure was employed. For the KAN model, training was conducted on the augmented (noisy) dataset, and testing was performed on the original data.

3.3. Analysis of Metabolomic Profile Samples

The data of metabolomic profiles of glioblastoma and adjacent brain tissue [14] have a complex structure. The results of Principal Component Analysis (PCA) on the original dataset are shown in Figure 2.
Figure 2. Results of PCA on the full sample of glioblastoma and adjacent brain tissue metabolomic profiles. The first three components explain 27%, 10%, and 9% of the total data variance, respectively. No linear separation between glioblastoma and adjacent tissue samples was observed.
The first three principal components explain 27%, 10%, and 9% of the total data variance, respectively.
We hypothesized that using vector representations of the original data for dimensionality reduction could improve data separation. To obtain vector representations of the original data, a DAE model with an attention mechanism was trained on data including all 446 features. The length of the vector representation was set to 30. Due to the limited sample size (n = 32), we performed class-conditional Gaussian noise augmentation to generate a training set of 3072 samples (1536 per class), as described in the Methods section. The results of PCA on the 30-dimensional vector representation obtained by the trained DAE model are presented in Figure 3.
Figure 3. Results of PCA on the 30-dimensional vector representation of data obtained by the trained DAE model with an attention mechanism. The first three components explain 44%, 25%, and 8% of the total data variance, respectively. Density estimation along the first principal component reveals no overlap between glioblastoma and adjacent brain tissue samples—the area of intersection of their distributions is effectively zero—demonstrating the strong class-separating power of the DAE-derived embedding.
While the DAE-based representation substantially increases the cumulative variance captured by the first three principal components (from 46% to 77%), we emphasize that the key benefit lies not merely in variance retention but in the reorganization of variance along directions that align with biological differences between glioblastoma and adjacent tissue. This is evidenced by the visibly enhanced linear separability in Figure 3, which we further quantified by training a linear SVM on the 30-dimensional DAE embeddings. The classifier achieved 100% accuracy (vs. 77.5% on raw data). However, the constructed model does not provide an interpretable result.
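The linear-SVM check could be reproduced roughly as follows; cross-validation is an assumption, since the exact evaluation protocol is not stated, and embeddings, X_raw, and labels are illustrative names.

```python
# Sketch of the linear-SVM separability check on the 30-dimensional DAE embeddings
# versus the raw 446-feature data (assumed names and evaluation protocol).
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

svm_acc_embed = cross_val_score(LinearSVC(), embeddings, labels, cv=5).mean()
svm_acc_raw = cross_val_score(LinearSVC(), X_raw, labels, cv=5).mean()
```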
To obtain an interpretable result, we applied the iterative AIMarkerFinder method to analyze the metabolomic profile sample, enabling the selection of the most important features from the original data. During the iterative feature selection process, 31 important features were identified (Table 1). Using the Piecewise Linear Regression algorithm, 4 out of the 31 most important features were identified (Figure 4).
Table 1. List of selected metabolites with their classification weights.
Figure 4. Heatmap built on the sample with the 4 selected features using the AIMarkerFinder method. Light blue indicates the glioblastoma tissue samples, while dark blue corresponds to adjacent tissue samples.
Classification accuracy estimates obtained by Random Forest, C4.5, and KAN methods using both the full dataset and the selected features are presented in Table 2.
Table 2. Results of classifier model accuracy evaluation using 5-fold cross-validation.
As seen in the table, reducing the number of considered features significantly increases the accuracy of classifiers obtained by the RF method. The C4.5 method did not show good separation on either the full dataset or the selected features. The C4.5 algorithm constructs a decision tree using a greedy partitioning of the feature space, which renders it sensitive to noise, feature redundancy, and nonlinear relationships in the data. The relatively poor performance of C4.5 on the selected-features dataset is likely attributable to the linear non-separability of the dataset. For the KAN model, no classifier was obtained for the full dataset, but for the selected features, the KAN provided analytical formulas for data classification, achieving an accuracy (0.937) higher than RF (0.904):
F1 = 12.835 × Malonyl-CoA + 5.169 × SM(d18:1/22:0 OH) − 5.076
F2 = 9.330 × Glycerophosphocholine − 0.004 × SM(d18:1/22:0 OH) − 0.001 × GC(18:1/24:1) + 1.389
If F1 > F2, the first class is selected; otherwise, the second class.
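Written out as code, the decision rule reads as follows; metabolite values are assumed to be on the same scale as used for model training, and the function name is illustrative.

```python
# The derived KAN decision rule, applied to a single sample (assumed input scaling).
def classify(malonyl_coa, glycerophosphocholine, sm_d18_1_22_0_oh, gc_18_1_24_1):
    f1 = 12.835 * malonyl_coa + 5.169 * sm_d18_1_22_0_oh - 5.076
    f2 = (9.330 * glycerophosphocholine - 0.004 * sm_d18_1_22_0_oh
          - 0.001 * gc_18_1_24_1 + 1.389)
    return "class 1" if f1 > f2 else "class 2"
```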

4. Discussion

The study demonstrated that the application of the denoising autoencoder with an attention mechanism effectively reduces the dimensionality of metabolomic data. In the case of analyzing metabolomic data with 446 features, the DAE with the attention mechanism produced 30-dimensional vector representations. The obtained vector representations revealed a structure allowing linear separation of groups (glioblastoma and adjacent tissue). However, these vector representations require additional analysis for interpretation. To address the task of obtaining interpretable features, the AIMarkerFinder method, based on the DAE with an attention mechanism, was employed. This method selected 4 metabolites (Malonyl-CoA, Glycerophosphocholine, SM(d18:1/22:0 OH), and GC(18:1/24:1)) from the 446 features.
Classification of metabolomic data using the Random Forest and Kolmogorov–Arnold Networks methods based on these 4 metabolites showed accuracies of 0.904 and 0.937, respectively. The analytical functions derived by the KAN from the selected metabolites not only provide high accuracy but also allow model interpretation through linear dependencies, which is critical for biomedical applications.
To compare AIMarkerFinder with known methods, the study by Chen et al. [21] was selected, which analyzed the quality of methods (varImp, Boruta, and Recursive Feature Elimination (RFE)) for selecting significant features in a dataset. The comparative analysis of feature selection and classifier accuracy conducted by the authors showed that the best results were achieved by the Recursive Feature Elimination (RFE) method combined with the RF classifier.
Therefore, this model was selected for comparison with AIMarkerFinder. The results of the analysis of the dependence of the RF model classification accuracy on the number of features selected by the Recursive Feature Elimination (RFE) method on the metabolomic profile dataset are presented in Figure 5.
Figure 5. Dependence of RF model accuracy on the number of features selected by RFE.
The highest accuracy scores were achieved with 1 and 8 RFE-selected features, reaching 0.876. As in the study by Chen et al. [21], recursive feature elimination (RFE) demonstrated superior performance relative to alternative feature selection methods in our experiments. Specifically, LASSO required 45 features to achieve the same classification accuracy as AIMarkerFinder (0.904), indicating substantially lower sparsity and reduced interpretability. Notably, when constrained to select only 8 features—the same number selected by RFE—LASSO attained a classification accuracy of merely 0.747, further underscoring its inferior performance under high-sparsity conditions. In contrast, the Boruta method identified 7 relevant features, but the resulting classification accuracy was only 0.804, significantly lower than both RFE and AIMarkerFinder.
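The RFE comparison could be reproduced roughly as follows; the range of feature counts swept is an assumption, and X, y denote the full 446-feature dataset and class labels.

```python
# Sketch of the RFE-based comparison underlying Figure 5 (assumed sweep range and names).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

accuracies = {}
for n_features in range(1, 31):
    rfe = RFE(RandomForestClassifier(random_state=0), n_features_to_select=n_features)
    X_sel = rfe.fit_transform(X, y)
    accuracies[n_features] = cross_val_score(
        RandomForestClassifier(random_state=0), X_sel, y, cv=5).mean()
```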
Unlike RFE, AIMarkerFinder automatically selects the number of important features. It is important to note that the RF accuracy on the 4 features selected by AIMarkerFinder was 0.904, which exceeded the accuracy on the 8 best RFE-selected features (0.876). Notably, 3 of the 4 metabolites chosen by AIMarkerFinder overlap with those selected by RFE, indicating strong consensus between the two approaches despite their different underlying principles. Critically, all 4 metabolites selected by AIMarkerFinder have been previously implicated in glioblastoma pathophysiology. Specifically, they are fully contained within the list of 22 dysregulated metabolites identified by Basov et al. [14].
It should be noted that the limited size of the original cohort constitutes a significant methodological limitation of the present study. Although controlled data augmentation was employed to stabilize the training of deep models, the biological interpretation of the identified metabolites—specifically Malonyl-CoA, Glycerophosphocholine, SM(d18:1/22:0 OH), and GC(18:1/24:1)—requires rigorous validation in independent cohorts with substantially larger sample sizes. Only prospective, multicenter studies confirming their diagnostic and prognostic relevance will allow these compounds to be considered reliable biomarkers of glioblastoma.
The developed approach can be applied to analyze high-dimensional data in various biological fields (genomic, transcriptomic, proteomic, and metabolomic data), medical data, and for studying the impact of environmental factors (e.g., environmental pollution, diet, physical activity, or chemical exposure).

Author Contributions

Conceptualization, V.A.I.; writing—original draft preparation, P.S.D.; software, P.S.D.; validation, T.V.I.; supervision, V.A.I. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant for research centers, provided by the Ministry of Economic Development of the Russian Federation in accordance with the subsidy agreement with the Novosibirsk State University dated 17 April 2025 No. 139-15-2025-006: IGK 000000C313925P3S0002.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

AIMarkerFinder (v0.1.0) is available at https://github.com/dps123/AIMarkerFinder (accessed on 15 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DAE	Denoising autoencoder
RF	Random forest
KAN	Kolmogorov–Arnold Network

References

  1. Mohammadi, M.; Sharifi Noghabi, H.; Abed Hodtani, G.; Rajabi Mashhadi, H. Robust and Stable Gene Selection via Maximum–Minimum Correntropy Criterion. Genomics 2016, 107, 83–87. [Google Scholar] [CrossRef] [PubMed]
  2. Taylor, A.; Steinberg, J.; Andrews, T.S.; Webber, C. GeneNet Toolbox for MATLAB: A Flexible Platform for the Analysis of Gene Connectivity in Biological Networks. Bioinformatics 2014, 31, 442–444. [Google Scholar] [CrossRef] [PubMed]
  3. Parmar, C.; Grossmann, P.; Bussink, J.; Lambin, P.; Aerts, H.J.W.L. Machine Learning Methods for Quantitative Radiomic Biomarkers. Sci. Rep. 2015, 5, 13087. [Google Scholar] [CrossRef] [PubMed]
  4. Nicholson, J.K.; Connelly, J.; Lindon, J.C.; Holmes, E. Metabonomics: A Platform for Studying Drug Toxicity and Gene Function. Nat. Rev. Drug Discov. 2002, 1, 153–161. [Google Scholar] [CrossRef] [PubMed]
  5. Hinrichs, A.; Prochno, J.; Ullrich, M. The Curse of Dimensionality for Numerical Integration on General Domains. J. Complex. 2019, 50, 25–42. [Google Scholar] [CrossRef]
  6. Perthame, É.; Friguet, C.; Causeur, D. Stability of Feature Selection in Classification Issues for High-Dimensional Correlated Data. Stat. Comput. 2015, 26, 783–796. [Google Scholar] [CrossRef]
  7. Kumar, A.P.; Valsala, P. Feature Selection for High Dimensional DNA Microarray Data Using Hybrid Approaches. Bioinformation 2013, 9, 824–828. [Google Scholar] [CrossRef] [PubMed]
  8. Huang, G.T.; Tsamardinos, I.; Raghu, V.; Kaminski, N.; Benos, P.V. T-RECS: Stable selection of dynamically formed groups of features with application to prediction of clinical outcomes. Pac. Symp. Biocomput. 2015, 20, 431–442. [Google Scholar] [PubMed]
  9. Kanal, L.; Chandraskekaran, B. On dimensionality and sample size in statistical pattern classification. Pattern Recognit. 1971, 3, 225–234. [Google Scholar] [CrossRef]
  10. Kumar, V. Feature Selection: A Literature Review. SmartCR 2014, 4, 211–229. [Google Scholar] [CrossRef]
  11. Drotár, P.; Gazda, J.; Smékal, Z. An Experimental Comparison of Feature Selection Methods on Two-Class Biomedical Datasets. Comput. Biol. Med. 2015, 66, 1–10. [Google Scholar] [CrossRef] [PubMed]
  12. Chandrashekar, G.; Sahin, F. A Survey on Feature Selection Methods. Comput. Electr. Eng. 2014, 40, 16–28. [Google Scholar] [CrossRef]
  13. Dunne, K.; Cunningham, P.; Azuaje, F. Solutions to Instability Problems with Sequential Wrapper-Based Approaches to Feature Selection; Technical Report; Trinity College Dublin: Dublin, Ireland, 2002. [Google Scholar]
  14. Basov, N.V.; Adamovskaya, A.V.; Rogachev, A.D.; Gaisler, E.V.; Demenkov, P.S.; Ivanisenko, T.V.; Venzel, A.S.; Mishinov, S.V.; Stupak, V.V.; Cheresiz, S.V.; et al. Investigation of Metabolic Features of Glioblastoma Tissue and the Peritumoral Environment Using Targeted Metabolomics Screening by LC-MS/MS and Gene Network Analysis. Vestn. VOGiS 2025, 28, 882–896. [Google Scholar] [CrossRef] [PubMed]
  15. Bishop, C.M. Training with Noise Is Equivalent to Tikhonov Regularization. Neural Comput. 1995, 7, 108–116. [Google Scholar] [CrossRef]
  16. Liu, Z.; Ma, P.; Wang, Y.; Matusik, W.; Tegmark, M. KAN 2.0: Kolmogorov-Arnold Networks Meet Science. arXiv 2024, arXiv:2408.10205. [Google Scholar] [CrossRef]
  17. Arin, P.; Minniti, M.; Murtinu, S.; Spagnolo, N. Inflection Points, Kinks, and Jumps: A Statistical Approach to Detecting Nonlinearities. Organ. Res. Methods 2021, 25, 786–814. [Google Scholar] [CrossRef]
  18. Guan, H.; Yue, L.; Yap, P.-T.; Xiao, S.; Bozoki, A.; Liu, M. Attention-Guided Autoencoder for Automated Progression Prediction of Subjective Cognitive Decline With Structural MRI. IEEE J. Biomed. Health Inform. 2023, 27, 2980–2989. [Google Scholar] [CrossRef] [PubMed]
  19. Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers: Burlington, MA, USA, 1993. [Google Scholar]
  20. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  21. Chen, R.C.; Dewi, C.; Huang, S.W.; Caraka, R.E. Selecting Critical Features for Data Classification Based on Machine Learning Methods. J. Big Data 2020, 7, 52. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
