1. Introduction
Recent advances in high-throughput technologies, such as next-generation sequencing (NGS), microarrays, and mass spectrometry (MS), have opened new opportunities for identifying the genetic causes of diseases [1,2]. Radiomics is an emerging field in which medical imaging provides important information about tumor physiology [3]. Metabolomics, as part of systems biology, focuses on the comprehensive study of the metabolome: the collection of low-molecular-weight compounds (metabolites) involved in the biochemical processes of an organism. In recent years, this discipline has gained particular relevance due to its ability to reveal subtle mechanisms of cellular regulation and to predict responses to external and internal stimuli, such as diseases, drugs, or dietary changes. Metabolomic data provide a unique opportunity for the early diagnosis of pathologies, the development of personalized treatment methods, and a deeper understanding of the interplay between genetics, environment, and lifestyle [4].
These high-throughput representations suffer from the curse of dimensionality, so appropriate computational methods are required to extract knowledge from them [5]. Microarray data contain many sources of heterogeneity because they include the expression of every possible gene in the genome. Studies have shown that genes responsible for certain biological processes are interrelated, with some genes acting as activators or inhibitors of others [6]. In high-dimensional data, such as microarray datasets, irrelevant features can mask true ones, which introduces heterogeneity into the data and generates dependencies between features. Statistical analysis loses its validity when features are dependent. Therefore, we must select features that play an important role in evaluation and are independent.
Identifying such independent genes (features) whose expression patterns have significant biological associations with phenotypic behavior is important for knowledge discovery. In microarray analysis, the biologist's goal is to detect a small number of features that explain the behavior of the data [7]. Significant biomarkers selected from microarray data are essential for patient stratification and the development of personalized medicine strategies [8]. From a machine learning perspective, controlling the number of features helps reduce overfitting, leading to better prediction of the target variable on unseen data. The high dimensionality of the feature space complicates model construction and undermines the effectiveness of knowledge discovery. A ratio of 10:1 samples per feature is therefore recommended for building reliable classifiers and predictive models [9].
The rationale for feature selection is that classifiers trained on a reduced feature space are more robust and reproducible than those built on the original large feature space. In feature selection, particular attention is paid to correlations among features. Features that provide no useful information are called irrelevant, while features that provide no information beyond the already selected features are termed redundant [10]. Features that are uncorrelated or unassociated with the class variable are referred to as noise; they introduce bias into prediction and reduce classification efficiency. Therefore, noise must be removed to improve predictive performance, which can be achieved through dimensionality reduction: either feature extraction or feature selection [11].
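To make the distinction concrete, the following toy sketch on synthetic data (an illustration, not part of AIMarkerFinder) flags irrelevant features by their weak correlation with the class label and redundant features by their strong correlation with an already selected feature; the 0.5 and 0.95 thresholds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n).astype(float)      # binary class labels

# Synthetic features: f0 tracks the class, f1 duplicates f0, f2 is pure noise.
f0 = y + 0.1 * rng.normal(size=n)            # relevant
f1 = 2.0 * f0                                # redundant (no new information)
f2 = rng.normal(size=n)                      # irrelevant (unrelated noise)
X = np.column_stack([f0, f1, f2])

def corr(a, b):
    """Absolute Pearson correlation between two vectors."""
    return abs(np.corrcoef(a, b)[0, 1])

selected, labels = [], []
for j in range(X.shape[1]):
    if corr(X[:, j], y) < 0.5:               # weak link to the class -> noise
        labels.append("irrelevant")
    elif any(corr(X[:, j], X[:, k]) > 0.95 for k in selected):
        labels.append("redundant")           # adds nothing beyond selected set
    else:
        selected.append(j)
        labels.append("selected")

print(labels)  # -> ['selected', 'redundant', 'irrelevant']
```

Real selection criteria are more elaborate (mutual information, model-based importance), but the same two tests, relevance to the class and novelty relative to the selected set, underlie most of them.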
In feature extraction, new features are derived from the original input data by choosing a new basis for the data. Feature selection reduces the impact of high dimensionality by identifying a subset of features that effectively describes the data. Direct evaluation of all feature subsets is an NP-hard problem [12]. To address this, suboptimal procedures with tractable computations are used. Another important issue arises when a feature depends on the response variable rather than on the other predictors. Selecting a subset of features allows classifiers to focus on important features while ignoring potentially misleading ones. From a computational complexity perspective, an economical set of features matters because the cost of many learning algorithms grows rapidly with each additional feature [13].
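The difference between the two routes can be shown in a few lines (a generic sketch with synthetic data; the column indices in the selection step are arbitrary placeholders for indices chosen by some criterion): extraction builds new features as mixtures of all original ones, while selection keeps a subset of the original columns unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))                # 50 samples, 10 original features
Xc = X - X.mean(axis=0)                      # center before PCA

# Feature extraction: derive a new basis (principal components) and keep 2.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_extracted = Xc @ Vt[:2].T                  # 2 new features, mixtures of all 10

# Feature selection: keep a subset of the original columns unchanged.
keep = [3, 7]                                # indices chosen by some criterion
X_selected = X[:, keep]

print(X_extracted.shape, X_selected.shape)   # (50, 2) (50, 2)
```

Both reduce the dimensionality from 10 to 2, but only the selected columns retain their original biological meaning, which is why selection is preferred when interpretability matters.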
A good feature selection algorithm should bring benefits such as better data understanding, an improved classifier model, better generalization, and the identification of irrelevant features. It should also help clarify the relationships between features and target variables, reduce the computational cost of a given task, enable efficient dimensionality reduction for high-dimensional datasets in which the number of observations is smaller than the number of features, improve the performance of the predictor, and increase efficiency in terms of cost and time. The feature selection process contributes to knowledge discovery, since the identified features can be used directly in future research. In bioinformatics, the identification of important features can point to new metabolic pathways and help reveal hidden connections between specific cellular processes [13].
This paper describes AIMarkerFinder, a developed method for identifying important features, based on a denoising autoencoder (DAE) with an attention mechanism. AIMarkerFinder highlights the important features in the data, enabling more accurate classifier models.
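As a schematic illustration of this architecture (not the authors' implementation: the layer sizes, the tanh activation, and the multiplicative form of the attention weighting are assumptions made for the sketch), a single forward pass of a denoising autoencoder with feature-wise attention can be written as follows:

```python
import numpy as np

rng = np.random.default_rng(42)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_features, hidden = 446, 30                 # sizes matching the paper's setting

# Randomly initialized (untrained) weights, for illustration only.
a = rng.normal(size=n_features) * 0.1        # attention scoring vector
W_enc = rng.normal(size=(n_features, hidden)) * 0.05
W_dec = rng.normal(size=(hidden, n_features)) * 0.05

def forward(x, noise_std=0.1):
    """One forward pass of a denoising autoencoder with feature attention."""
    x_noisy = x + rng.normal(scale=noise_std, size=x.shape)  # corrupt input
    attn = softmax(a * x_noisy)              # per-feature attention weights
    h = np.tanh((attn * x_noisy) @ W_enc)    # 30-dim latent representation
    x_hat = h @ W_dec                        # reconstruction of the clean input
    return attn, h, x_hat

x = rng.normal(size=n_features)
attn, h, x_hat = forward(x)
print(h.shape, x_hat.shape)                  # (30,) (446,)
```

In a trained model, the weights would be fitted by minimizing the reconstruction error between `x_hat` and the clean input; features that consistently receive high attention weights are then natural candidates for selection.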
4. Discussion
The study demonstrated that a denoising autoencoder with an attention mechanism effectively reduces the dimensionality of metabolomic data. For a metabolomic dataset with 446 features, the DAE with the attention mechanism produced 30-dimensional vector representations. These representations revealed a structure allowing linear separation of the groups (glioblastoma and adjacent tissue). However, they require additional analysis for interpretation. To obtain interpretable features, the AIMarkerFinder method, based on the DAE with an attention mechanism, was employed; it selected 4 metabolites (Malonyl-CoA, Glycerophosphocholine, SM(d18:1/22:0 OH), and GC(18:1/24:1)) from the 446 features.
Classification of the metabolomic data with the Random Forest and Kolmogorov–Arnold Network methods based on these 4 metabolites achieved accuracies of 0.904 and 0.937, respectively. The analytical functions derived by the KAN from the selected metabolites not only provide high accuracy but also allow model interpretation through linear dependencies, which is critical for biomedical applications.
To compare AIMarkerFinder with known methods, the study by Chen et al. [21] was selected, which evaluated methods for selecting significant features in a dataset (varImp, Boruta, and Recursive Feature Elimination (RFE)). The authors' comparative analysis of feature selection and classifier accuracy showed that the best results were achieved by RFE combined with the RF classifier. Therefore, this combination was selected for comparison with AIMarkerFinder. The dependence of RF classification accuracy on the number of RFE-selected features for the metabolomic profile dataset is presented in Figure 5.
The highest accuracy, 0.876, was achieved with 1 and with 8 RFE-selected features. As in the study by Chen et al. [21], RFE demonstrated superior performance relative to the alternative feature selection methods in our experiments. Specifically, LASSO required 45 features to achieve the same classification accuracy as AIMarkerFinder (0.904), indicating substantially lower sparsity and reduced interpretability. Notably, when constrained to select only 8 features, the same number selected by RFE, LASSO attained a classification accuracy of merely 0.747, further underscoring its inferior performance under high-sparsity conditions. In contrast, the Boruta method identified 7 relevant features, but the resulting classification accuracy was only 0.804, significantly lower than both RFE and AIMarkerFinder.
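The RFE baseline used here can be reproduced in outline with scikit-learn (a minimal sketch on synthetic data standing in for the metabolomic profile; the dataset parameters and forest size below are illustrative assumptions, not the study's data or settings):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

# Synthetic two-class data with 446 features, mimicking the problem's shape.
X, y = make_classification(n_samples=120, n_features=446, n_informative=8,
                           n_redundant=10, random_state=0)

# Recursively drop the least important 20% of features until 8 remain.
rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=8, step=0.2)
rfe.fit(X, y)
kept = np.flatnonzero(rfe.support_)          # indices of surviving features

# Score an RF classifier restricted to the selected subset.
acc = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                      X[:, kept], y, cv=5).mean()
print(len(kept), round(acc, 3))
```

Unlike AIMarkerFinder, `n_features_to_select` must be fixed in advance, which is exactly the tuning burden discussed below.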
Unlike RFE, AIMarkerFinder selects the number of important features automatically. It is important to note that the RF accuracy on the 4 features selected by AIMarkerFinder was 0.904, which exceeded the accuracy on the 8 best RFE-selected features (0.876). Notably, 3 of the 4 metabolites chosen by AIMarkerFinder overlap with those selected by RFE, indicating strong consensus between the two approaches despite their different underlying principles. Critically, all 4 metabolites selected by AIMarkerFinder have been previously implicated in glioblastoma pathophysiology; specifically, they are fully contained within the list of 22 dysregulated metabolites identified by Basov et al. [14].
It should be noted that the limited size of the original cohort constitutes a significant methodological limitation of the present study. Although controlled data augmentation was employed to stabilize the training of deep models, the biological interpretation of the identified metabolites—specifically Malonyl-CoA, Glycerophosphocholine, SM(d18:1/22:0 OH), and GC(18:1/24:1)—requires rigorous validation in independent cohorts with substantially larger sample sizes. Only prospective, multicenter studies confirming their diagnostic and prognostic relevance will allow these compounds to be considered reliable biomarkers of glioblastoma.
The developed approach can be applied to analyze high-dimensional data in various biological fields (genomic, transcriptomic, proteomic, and metabolomic data), medical data, and for studying the impact of environmental factors (e.g., environmental pollution, diet, physical activity, or chemical exposure).
Author Contributions
Conceptualization, V.A.I.; writing—original draft preparation, P.S.D.; software, P.S.D.; validation, T.V.I.; supervision, V.A.I. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by a grant for research centers, provided by the Ministry of Economic Development of the Russian Federation in accordance with the subsidy agreement with the Novosibirsk State University dated 17 April 2025 No. 139-15-2025-006: IGK 000000C313925P3S0002.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| DAE | Denoising autoencoder |
| RF | Random forest |
| KAN | Kolmogorov–Arnold Network |
References
- Mohammadi, M.; Sharifi Noghabi, H.; Abed Hodtani, G.; Rajabi Mashhadi, H. Robust and Stable Gene Selection via Maximum–Minimum Correntropy Criterion. Genomics 2016, 107, 83–87. [Google Scholar] [CrossRef] [PubMed]
- Taylor, A.; Steinberg, J.; Andrews, T.S.; Webber, C. GeneNet Toolbox for MATLAB: A Flexible Platform for the Analysis of Gene Connectivity in Biological Networks. Bioinformatics 2014, 31, 442–444. [Google Scholar] [CrossRef] [PubMed]
- Parmar, C.; Grossmann, P.; Bussink, J.; Lambin, P.; Aerts, H.J.W.L. Machine Learning Methods for Quantitative Radiomic Biomarkers. Sci. Rep. 2015, 5, 13087. [Google Scholar] [CrossRef] [PubMed]
- Nicholson, J.K.; Connelly, J.; Lindon, J.C.; Holmes, E. Metabonomics: A Platform for Studying Drug Toxicity and Gene Function. Nat. Rev. Drug Discov. 2002, 1, 153–161. [Google Scholar] [CrossRef] [PubMed]
- Hinrichs, A.; Prochno, J.; Ullrich, M. The Curse of Dimensionality for Numerical Integration on General Domains. J. Complex. 2019, 50, 25–42. [Google Scholar] [CrossRef]
- Perthame, É.; Friguet, C.; Causeur, D. Stability of Feature Selection in Classification Issues for High-Dimensional Correlated Data. Stat. Comput. 2015, 26, 783–796. [Google Scholar] [CrossRef]
- Kumar, A.P.; Valsala, P. Feature Selection for High Dimensional DNA Microarray Data Using Hybrid Approaches. Bioinformation 2013, 9, 824–828. [Google Scholar] [CrossRef] [PubMed]
- Huang, G.T.; Tsamardinos, I.; Raghu, V.; Kaminski, N.; Benos, P.V. T-RECS: Stable selection of dynamically formed groups of features with application to prediction of clinical outcomes. Pac. Symp. Biocomput. 2015, 20, 431–442. [Google Scholar] [PubMed]
- Kanal, L.; Chandrasekaran, B. On dimensionality and sample size in statistical pattern classification. Pattern Recognit. 1971, 3, 225–234. [Google Scholar] [CrossRef]
- Kumar, V. Feature Selection: A Literature Review. SmartCR 2014, 4, 211–229. [Google Scholar] [CrossRef]
- Drotár, P.; Gazda, J.; Smékal, Z. An Experimental Comparison of Feature Selection Methods on Two-Class Biomedical Datasets. Comput. Biol. Med. 2015, 66, 1–10. [Google Scholar] [CrossRef] [PubMed]
- Chandrashekar, G.; Sahin, F. A Survey on Feature Selection Methods. Comput. Electr. Eng. 2014, 40, 16–28. [Google Scholar] [CrossRef]
- Dunne, K.; Cunningham, P.; Azuaje, F. Solutions to Instability Problems with Sequential Wrapper-Based Approaches to Feature Selection; Tech. Rep.; Trinity College Dublin: Dublin, Ireland, 2002. [Google Scholar]
- Basov, N.V.; Adamovskaya, A.V.; Rogachev, A.D.; Gaisler, E.V.; Demenkov, P.S.; Ivanisenko, T.V.; Venzel, A.S.; Mishinov, S.V.; Stupak, V.V.; Cheresiz, S.V.; et al. Investigation of Metabolic Features of Glioblastoma Tissue and the Peritumoral Environment Using Targeted Metabolomics Screening by LC-MS/MS and Gene Network Analysis. Vestn. VOGiS 2025, 28, 882–896. [Google Scholar] [CrossRef] [PubMed]
- Bishop, C.M. Training with Noise Is Equivalent to Tikhonov Regularization. Neural Comput. 1995, 7, 108–116. [Google Scholar] [CrossRef]
- Liu, Z.; Ma, P.; Wang, Y.; Matusik, W.; Tegmark, M. KAN 2.0: Kolmogorov-Arnold Networks Meet Science. arXiv 2024, arXiv:2408.10205. [Google Scholar] [CrossRef]
- Arin, P.; Minniti, M.; Murtinu, S.; Spagnolo, N. Inflection Points, Kinks, and Jumps: A Statistical Approach to Detecting Nonlinearities. Organ. Res. Methods 2021, 25, 786–814. [Google Scholar] [CrossRef]
- Guan, H.; Yue, L.; Yap, P.-T.; Xiao, S.; Bozoki, A.; Liu, M. Attention-Guided Autoencoder for Automated Progression Prediction of Subjective Cognitive Decline With Structural MRI. IEEE J. Biomed. Health Inform. 2023, 27, 2980–2989. [Google Scholar] [CrossRef] [PubMed]
- Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers: Burlington, MA, USA, 1993. [Google Scholar]
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Chen, R.C.; Dewi, C.; Huang, S.W.; Caraka, R.E. Selecting Critical Features for Data Classification Based on Machine Learning Methods. J. Big Data 2020, 7, 52. [Google Scholar] [CrossRef]