Article

Acute Myeloid Leukemia Genome Characterization Study and Subtype Classification Employing Feature Selection and Bayesian Networks

1 Research & Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen 518057, China
2 Xi’an Key Laboratory of Stem Cell and Regenerative Medicine, Institute of Medical Research, Northwestern Polytechnical University, Xi’an 710072, China
3 School of Physics and Electronic Information, Yan’an University, Yan’an 716000, China
4 Yan’an Medical College, Yan’an University, Yan’an 716000, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Biomedicines 2025, 13(5), 1067; https://doi.org/10.3390/biomedicines13051067
Submission received: 18 March 2025 / Revised: 24 April 2025 / Accepted: 25 April 2025 / Published: 28 April 2025

Abstract

Background: The precise diagnosis and classification of acute myeloid leukemia (AML) have important implications for clinical management and medical research. Methods: We investigated the expression of protein-coding genes in blood samples from AML patients and controls using The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) databases. Subsequently, we applied the least absolute shrinkage and selection operator (LASSO) feature selection method to select the optimal gene subsets for distinguishing AML patients from controls and for distinguishing each French–American–British (FAB) subtype from the other AML subtypes. Results: Using the LASSO method, we identified a subset of 101 genes that could effectively distinguish between AML patients and control individuals; these genes included 70 up-regulated and 31 down-regulated genes in AML. Functional annotation and pathway analysis indicated the involvement of these genes in RNA-related pathways, which is consistent with the epigenetic changes observed in AML. Survival analysis revealed that several of these genes are correlated with overall survival in AML patients. Additionally, LASSO-based gene subset analysis revealed differences between certain AML subtypes, providing insights into subtype-specific molecular mechanisms and differentiation therapy. Conclusions: This study demonstrated the application of machine learning in genomic data analysis for identifying gene subsets relevant to AML diagnosis and classification, which could aid in improving the understanding of the molecular landscape of AML. The identification of survival-related genes and subtype-specific markers may lead to the identification of novel targets for personalized medicine in the treatment of AML.

1. Introduction

AML is the most prevalent acute leukemia in adults, accounting for approximately 80% of adult acute leukemia cases. It is an extremely aggressive and heterogeneous disorder characterized by the expansion of myeloid precursors in the bone marrow that are arrested in the early stages of development. It is thought to originate from the defective regulation of the differentiation and self-renewal programs of primitive multipotent hematopoietic stem cells (HSCs) or progenitor cells as a result of chromosomal translocations, genetic mutations, and changes at the molecular level [1,2]. According to the 5th edition of the World Health Organization (WHO) classification of hematolymphoid tumors, AML is classified into two families according to morphology, cytogenetics, molecular genetics, and immunological markers: AML with defining genetic abnormalities and AML defined by differentiation [3]. Although genetic abnormalities are among the most useful classification indicators in AML, about 50% of de novo AML patients have normal karyotypes; these patients lack the defining genetic abnormalities and are categorized according to the differentiation and maturity of the leukemia cells, an approach that resembles the French–American–British (FAB) classification and depends mainly on morphological and cytochemical criteria [3,4]. Prognostic stratification and treatment decisions for patients with a normal karyotype are difficult because of the high degree of clinical heterogeneity; thus, identifying new predictive molecular markers is necessary to improve the classification and prognosis of AML.
The use of genetic molecular biomarkers has the potential to revolutionize the classification and treatment approaches for AML [5,6]. These biomarkers can provide insights into the molecular underpinnings of the disease, enabling the identification of subtypes with different biological behaviors and treatment responses. Consequently, the integration of molecular diagnostics into standard care is crucial for personalizing therapy and improving the outcomes of AML patients. Notably, the complexity of genomic data in AML is immense due to the extensive variability in genetic mutations and chromosomal alterations that characterize the disease. The genome of each patient contains a unique combination of genetic information, which creates a high-dimensional data space that is difficult to interpret with traditional statistical methods. The intricate patterns of gene expression and their interactions contribute to the heterogeneity of AML, making it more difficult to identify precise biomarkers and develop targeted therapies [7]. Given this complexity, machine learning has emerged as a pivotal tool in the analysis of genomic data [8,9]. Machine learning methods, such as feature selection algorithms, have revolutionized the field of bioinformatics by enabling the analysis of large-scale genomic and proteomic data [10,11,12,13]. These methods have been extensively applied to classify cancer subtypes based on genomic or proteomic data, providing valuable insights into tumor heterogeneity and personalized medicine. For instance, support vector machines (SVMs) or random forests (RFs) can be trained on gene/protein expression data from known cancer subtypes to learn patterns distinguishing them [14,15]. Feature selection techniques are often employed to identify a subset of gene/protein biomarkers that significantly contribute to classification accuracy, which reduces the dimensionality of the data and improves the accuracy of the classification [16]. 
The least absolute shrinkage and selection operator (LASSO) is well suited to the high-dimensional nature of genomic data and can identify key biomarkers for AML classification owing to its simplicity, efficiency, and strong feature selection capability. Bayesian networks, in turn, are particularly useful for modeling the interdependencies among the features retained after dimensionality reduction and for updating predictions as new data become available.
In this study, we investigated the expression of protein-coding genes in blood samples from AML patients and controls using The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) databases (detailed information is provided in the Methods section). We then applied LASSO feature selection to identify the optimal gene subsets for distinguishing AML patients from controls and for distinguishing each FAB subtype from the other AML subtypes. Survival analysis showed that a number of the selected genes were correlated with overall survival in AML patients. Identifying and selecting pertinent gene features makes it possible to distinguish effectively between different AML subtypes. This approach not only allows for improved diagnostic and treatment strategies but also has the potential to further our understanding of cancer heterogeneity and to support the morphological classification of AML defined by differentiation. Finally, the results of this study may aid in developing more precise categorization systems and personalized treatment plans for AML patients.

2. Materials and Methods

2.1. Data Acquisition and Preprocessing

The GTEx database is a comprehensive resource that provides information on gene expression patterns across various tissues in the human body [17]. This resource allows researchers to explore and analyze gene expression data to gain insights into tissue-specific gene regulation and its implications for human health and disease. TCGA aims to understand the molecular basis of cancer through the analysis of genomic and clinical data [18,19]. In this study, we obtained gene expression data from the University of California-Santa Cruz (UCSC) Xena browser (https://xena.ucsc.edu/; accessed on 3 March 2024). Specifically, we extracted the gene expression RNAseq data of 337 blood samples (normal controls) and 151 samples of AML from the “TCGA TARGET GTEx” cohort (a combined cohort of TCGA, TARGET, and GTEx samples) in the UCSC Xena browser. The phenotypic information and survival data of the 151 AML samples were downloaded from the TCGA database (phenotype information: https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LAML.GDC_phenotype.tsv.gz; survival data: https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LAML.survival.tsv; both accessed on 5 March 2024). The basic clinical information about the 151 AML patients is shown in Table 1.
The original dataset contains the expression patterns of 60,000+ genes in 487 samples (337 control samples and 151 AML cases). We screened 19,580 protein-coding genes from the original dataset by using the “EnsDb.Hsapiens.v75” package in R. Afterwards, the protein-coding genes with missing values were deleted, and a final dataset including the expression of 9932 genes in 487 samples was used for further analysis.
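The two filtering steps described above (restricting to protein-coding genes, then discarding genes with missing values) can be sketched in a few lines. This is a minimal illustration in Python rather than the R workflow the study used; the gene identifiers, expression values, and the function name `filter_expression` are hypothetical and chosen only for the example.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing expression values."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def filter_expression(expr, protein_coding):
    """Keep protein-coding genes that have a value in every sample."""
    kept = {}
    for gene, values in expr.items():
        if gene not in protein_coding:
            continue  # drop non-coding genes
        if any(is_missing(v) for v in values):
            continue  # drop genes with at least one missing value
        kept[gene] = values
    return kept

# Toy input (hypothetical): gene -> one expression value per sample.
expr = {
    "gene_a": [5.1, 6.3, 7.0],
    "gene_b": [2.2, None, 3.1],          # missing value
    "noncoding_c": [0.4, 0.5, 0.6],      # not protein-coding
    "gene_d": [8.0, float("nan"), 7.5],  # missing value
    "gene_e": [4.4, 4.9, 5.2],
}
protein_coding = {"gene_a", "gene_b", "gene_d", "gene_e"}

filtered = filter_expression(expr, protein_coding)
print(sorted(filtered))
```

In the study, the protein-coding annotation came from the “EnsDb.Hsapiens.v75” R package; here it is stood in for by a plain set of identifiers.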

2.2. Using LASSO for Gene Feature Selection Between Different Groups

LASSO regression is an embedded feature selection method that performs regularization and feature selection simultaneously [20,21]. It applies a penalty term to the regression coefficients, forcing some of them to be exactly zero and thereby retaining only the most important features. This helps to reduce overfitting and improve model interpretability. In LASSO regression, 10-fold cross-validation was used to increase the robustness of the model results. In this study, we employed MATLAB 2020 to perform LASSO regression to screen for the optimal candidate gene subset for the classification of different groups. The LASSO regression model is given below, where $y_i$ is the dependent variable for sample $i$, $x_i \in X$ is the corresponding vector of independent variables, $N$ is the number of samples, $w$ is the vector of regression coefficients, $\lambda$ is the regularization parameter, and $\lambda \|w\|_1$ is the L1-norm penalty on $w$:
$$\hat{w} = \underset{w}{\operatorname{argmin}} \; \sum_{i=1}^{N} \left( y_i - w^{T} x_i \right)^{2} + \lambda \|w\|_1$$
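To make the selection mechanism concrete, the sketch below minimizes the objective stated above (squared error plus an L1 penalty weighted by λ) with cyclic coordinate descent and soft-thresholding. It is an illustrative Python implementation, not the MATLAB routine used in the study; it omits the intercept, feature standardization, and cross-validated choice of λ, and the toy data are hypothetical.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; anything within [-t, t] becomes exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize sum_i (y_i - w.x_i)^2 + lam * ||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(w[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            # Setting the subgradient of the objective in w_j to zero
            # gives a soft-threshold at lam / 2.
            w[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return w

# Toy data (hypothetical): y depends only on the first feature.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, -1.0], [4.0, 1.0]]
y = [2.0, 4.0, 6.0, 8.0]

w = lasso_coordinate_descent(X, y, lam=4.0)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print(w, selected)
```

The uninformative second feature receives a coefficient of exactly zero, while the informative first feature is kept with a slightly shrunken coefficient; the indices in `selected` play the role of the retained gene subset.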

2.3. Gene Functional Annotation

The functional annotation of candidate genes is a crucial step in understanding their biological roles and potential involvement in various cellular processes. In this study, we used the Database for Annotation, Visualization, and Integrated Discovery (DAVID) database to perform the functional annotation of the genes [22]. This comprehensive bioinformatics database was used to map our list of genes to corresponding functional categories, including Gene Ontology (GO) annotations for Biological Processes, Cellular Components, and Molecular Functions as well as Reactome Pathway mappings.

2.4. Classification Visualization

We performed t-distributed stochastic neighbor embedding (t-SNE), principal component analysis (PCA), and heatmap analysis to visualize the differences between the AML and control samples, as well as between each specific subtype and the other AML subtypes. t-SNE is a nonlinear dimensionality reduction technique that preserves local similarities between data points from the high-dimensional space. Unlike traditional linear dimensionality reduction techniques, t-SNE uses an optimization process to find the mapping that minimizes the Kullback–Leibler (KL) divergence between the pairwise similarity distributions in the high-dimensional and low-dimensional spaces, offering a complementary view of the structure and relationships within complex datasets. In the present study, the R package “Rtsne” was used to perform t-SNE to visualize the classification model. The R packages “FactoMineR” and “pheatmap” were applied for PCA and heatmap analysis, respectively.
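For reference, the quantity that t-SNE minimizes can be written out explicitly. The formulation below is the standard one from the t-SNE literature rather than anything specific to this study; $x_i$ denotes a sample in the original gene-expression space, $y_i$ its low-dimensional embedding, $N$ the number of samples, and $\sigma_i$ the per-sample bandwidth set by the perplexity parameter.

```latex
% High-dimensional similarities (Gaussian kernel), symmetrized over pairs
p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Low-dimensional similarities (Student-t kernel with one degree of freedom)
q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}

% Cost minimized over the embedding coordinates y_1, \dots, y_N
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Because the cost penalizes pairs with large $p_{ij}$ but small $q_{ij}$ most heavily, nearby samples in expression space are kept close in the embedding, which is why t-SNE emphasizes local rather than global structure.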

2.5. Construction of Bayesian Networks Using the PC-Stable Algorithm

As a probabilistic graphical model for representing causal relationships, a Bayesian network is a directed acyclic graph (DAG) that encodes a set of variables and their conditional dependencies [23]. Bayesian networks are well suited to visualizing causal relationships and inferring probabilistic relationships between genes and classification nodes. The PC-stable algorithm is a modification of the PC algorithm; the modified version is much simpler while preserving the soundness, completeness, and high-dimensional consistency of the original. Herein, we employed the R package “pcalg” (R version 4.0.2), implemented by Kalisch et al., to infer the causal relationships among the selected gene features.
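Constraint-based algorithms in the PC family decide whether to keep an edge between two variables by testing conditional independence. For continuous data a common choice is a Fisher z-test on partial correlations, sketched below in Python for a single conditioning variable. This is an illustration of that building block only, not the “pcalg” implementation or the full PC-stable procedure; which test the authors used is not stated in the text, and the toy vectors are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def fisher_z_pvalue(r, n, n_cond):
    """Two-sided p-value for H0: (partial) correlation is zero."""
    stat = math.sqrt(n - n_cond - 3) * math.atanh(r)
    return math.erfc(abs(stat) / math.sqrt(2))

# Toy data (hypothetical): z drives both x and y; e1 and e2 are
# mutually orthogonal "noise" vectors, so x and y are related only via z.
z = [1, 1, 1, 1, -1, -1, -1, -1]
e1 = [1, 1, -1, -1, 1, 1, -1, -1]
e2 = [1, -1, 1, -1, 1, -1, 1, -1]
x = [a + b for a, b in zip(z, e1)]
y = [a + b for a, b in zip(z, e2)]

r_marginal = pearson(x, y)
r_partial = partial_corr(x, y, z)
p_marginal = fisher_z_pvalue(r_marginal, len(x), 0)
p_conditional = fisher_z_pvalue(r_partial, len(x), 1)
print(r_marginal, r_partial, p_marginal, p_conditional)
```

Here x and y are correlated marginally, but the partial correlation given z vanishes, so a PC-style algorithm would remove the x–y edge once z is considered as a separating variable.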

2.6. Statistical Analysis

Statistical analysis was performed mainly using R 4.0.2. Differential gene expression between the AML patients and controls was assessed using the Wilcoxon test (p < 0.05 indicated a significant difference). The Kaplan–Meier (K–M) curves for AML patients were analyzed using the “survival” and “survminer” packages in R. Gene expression cut-off points were used to divide the samples into “low” and “high” expression groups, with p < 0.05 indicating a significant difference between the two groups in the K–M curves.
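The Kaplan–Meier estimate underlying these curves is a running product over event times: at each time with at least one death, the survival probability is multiplied by one minus the fraction of at-risk patients who died. The following Python sketch shows that computation on hypothetical follow-up data; it is not the “survival” package implementation, and it does not include the significance test used to compare the “low” and “high” groups.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time.

    times  : follow-up time per patient
    events : 1 if the patient died at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set.
        n_at_risk -= leaving
    return curve

# Hypothetical follow-up data: the patient at time 2 is censored.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print(curve)  # [(1, 0.75), (3, 0.375), (4, 0.0)]
```

In a two-group analysis, the same function would be applied separately to the “low” and “high” expression groups to obtain the two curves being compared.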

3. Results

3.1. Protein-Coding Gene Profiling Analysis of AML and Control Samples

Figure 1 illustrates the workflow of the current study. The original datasets utilized in this study included more than 60,000 genes from 337 control samples and 151 AML samples. After data processing, a total of 9932 protein-coding genes remained for further analysis. According to the PCA (Figure 2A) and heatmap (Figure 2B) results, the gene expression patterns of the AML and control samples already differed to some extent when all 9932 gene features were used. To sharpen this separation, we applied the LASSO feature selection algorithm to select the gene subset that best distinguished AML patients from controls. The LASSO algorithm returned a feature combination of 101 genes for discriminating between AML patients and controls. As displayed in Figure 2C,D, expression profiling of the 101 selected genes effectively distinguished between the two groups, with visibly better separation than that obtained with all 9932 genes. Thus, applying feature selection to high-dimensional data reduces dimensionality and improves model performance by retaining only the most relevant features, which provides insights into the significance of each gene in the diagnosis of AML.

3.2. Biological Analysis of the Selected Genes in AML

Having selected 101 genes that improved the classification of AML and control samples, we hypothesized that these gene features might provide insights into the mechanisms underlying AML. According to the Wilcoxon test results, the two groups showed distinct expression patterns, with 70 genes up-regulated and 31 genes down-regulated in the AML patients compared to the control individuals (Bonferroni-adjusted p < 0.0001) (Figure 2D). The detailed expression profiles of the 101 genes are displayed in Table S1. Using the DAVID gene annotation tool, we conducted a GO analysis of the 101 differentially expressed genes (DEGs). The DEGs were annotated for three subontologies, as indicated in Figure 3: Biological Process (BP), Cellular Component (CC), and Molecular Function (MF). For BPs, the DEGs were primarily enriched in the terms “mRNA splicing, via spliceosome”, “DNA replication”, and “protein stabilization” (Figure 3A). Regarding CCs, the DEGs were predominantly clustered in the nucleoplasm and nucleus (Figure 3B). In the MF category, these genes were most enriched in RNA binding, protein binding, and ribosome binding (Figure 3C). Subsequently, Reactome pathway analysis revealed that the identified DEGs were primarily implicated in the metabolism of RNA, RNA polymerase III abortive and retractive initiation, and RNA polymerase III transcription pathways (Figure 4). These findings aligned with the epigenetic alterations characteristic of AML [24,25,26]. Notably, specific genes such as RUNX Family Transcription Factor 1 (RUNX1) and RUNX Family Transcription Factor 2 (RUNX2) have been demonstrated to play significant roles in both normal hematopoiesis and the development of blood cancers [27,28,29]. For instance, pathway annotation revealed that RUNX1/RUNX2 regulate genes involved in the differentiation of myeloid cells.
Our study results indicated that, in comparison to controls, the expression levels of RUNX1 and RUNX2 in AML patients had fold changes (FC) of 6.30 and 6.23, respectively (Table S1). These data suggest a potential involvement of these genes in the pathogenesis of AML.
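The per-gene testing and multiple-testing correction described in this section can be illustrated with a small Python sketch: a two-sided Wilcoxon rank-sum test followed by a Bonferroni adjustment. The sketch uses the large-sample normal approximation without tie or continuity corrections, so its p-values can differ from those of R's `wilcox.test`, which may use an exact method for small samples. All input values are hypothetical.

```python
import math

def rank_sum_pvalue(a, b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    # Assign mid-ranks so tied values share the average of their positions.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2.0
        i = j
    n1, n2 = len(a), len(b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_a - n1 * (n1 + 1) / 2.0          # Mann-Whitney U for group a
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return math.erfc(abs(u - mean_u) / sd_u / math.sqrt(2))

def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

# Hypothetical expression values for one gene in controls vs. AML samples.
controls = [1, 2, 3, 4, 5]
aml = [6, 7, 8, 9, 10]
p = rank_sum_pvalue(controls, aml)
print(p)

# Adjusting a hypothetical pair of raw p-values.
print(bonferroni([0.01, 0.5]))  # [0.02, 1.0]
```

With 101 genes tested, the Bonferroni adjustment would multiply each raw p-value by 101, which makes the reported adjusted threshold considerably stricter than an unadjusted one.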

3.3. Survival Analysis of Candidate Genes in AML

We then investigated the prognostic value of the selected genes, and we discovered that several candidate genes were significantly correlated with overall survival (OS) in AML patients. For instance, our findings indicated that the expression levels of G Protein Nucleolar 3 Like (GNL3L), Ankyrin Repeat and MYND Domain Containing 1 (ANKMY1), Host Cell Factor C1 (HCFC1), Nucleolar Protein 9 (NOL9), Inosine Monophosphate Dehydrogenase 2 (IMPDH2), and RNA Polymerase III Subunit C (POLR3C) were significantly up-regulated in AML patients compared to controls (Table S1). Elevated expression levels of the abovementioned candidate genes were linked to decreased OS in AML patients (Figure 5). Conversely, reduced expression of Acyl-CoA Dehydrogenase Family Member 11 (ACAD11), Matrin 3 (MATR3), and Proteasome 20S Subunit Alpha 2 (PSMA2) was associated with poorer OS outcomes (Figure 6). The majority of these genes are associated with the initiation or progression of AML. For example, the IMPDH2 inhibitor has been shown to be an effective treatment for MLL-fusion leukemia [30,31]. Overall, these genes have the potential to be predictive markers for the survival outcomes of AML patients and could be valuable targets [32,33]. Further research is warranted to validate these findings and explore the underlying mechanisms by which these genes influence patient survival.

3.4. Using the LASSO Method to Select Gene Subsets for Classifying AML Subtypes

Classifying AML subtypes is crucial for implementing personalized medicine and advancing our understanding of subtype-specific molecular mechanisms in AML. To identify gene subsets useful for accurately categorizing AML subtypes, we again applied LASSO feature selection. Because of their small sample sizes and low incidence, the M6 and M7 subtypes were excluded from further analysis, and we focused on the remaining six subtypes, M0 through M5. Based on all 9932 gene expression profiles, t-SNE and PCA showed no distinct separation among the six subtypes, except for M3 (Figure S1). We then employed the LASSO method for gene feature selection to distinguish between the different AML subtypes. Gene subsets containing 11, 7, 10, 35, 9, and 12 gene features were selected for classifying M0 versus others, M1 versus others, M2 versus others, M3 versus others, M4 versus others, and M5 versus others, respectively. As shown in Figure 7 and Figure S2, the gene subsets retained by feature selection discriminated better between AML subtypes, particularly for M0, M3, M4, and M5. However, for subtypes with poorly differentiated blasts and mixed cell populations, such as M1 and M2, gene feature selection was unable to separate them effectively (Figure 7).
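The “one subtype versus others” design described above amounts to building one binary label vector per FAB subtype after excluding M6 and M7, and then running feature selection once per label vector. A minimal Python sketch of the labeling step follows; the function name and the sample annotations are hypothetical.

```python
def build_one_vs_rest_tasks(subtypes, excluded=("M6", "M7")):
    """Return kept sample indices and one binary label list per subtype.

    subtypes : FAB subtype annotation per sample, in expression-matrix order
    excluded : subtypes dropped before analysis
    """
    kept_idx = [i for i, s in enumerate(subtypes) if s not in excluded]
    kept = [subtypes[i] for i in kept_idx]
    tasks = {
        target: [1 if s == target else 0 for s in kept]
        for target in sorted(set(kept))
    }
    return kept_idx, tasks

# Hypothetical annotations for six samples.
subtypes = ["M0", "M3", "M3", "M6", "M5", "M1"]
kept_idx, tasks = build_one_vs_rest_tasks(subtypes)
print(kept_idx)      # [0, 1, 2, 4, 5]
print(tasks["M3"])   # [0, 1, 1, 0, 0]
```

The returned indices keep the labels aligned with the rows of the expression matrix, so each label vector can be passed, together with the corresponding rows, to a feature selection routine.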
To explore the interactions among the selected genes in each subset, a Bayesian network was constructed by defining the directed edges between gene nodes. As shown in Figure 8, the directionality of these edges represents causal relationships or effects between genes. By analyzing the direct effects of gene nodes within the Bayesian network, we can identify important genes that significantly impact the differentiation of one particular AML subtype from the others. The identification of these key genes is crucial for developing a better understanding of the pathogenesis of different subtypes of AML. Taken together, the abovementioned findings imply that the feature selection approach was successful in identifying key genes from the large-scale genomic dataset and that these selected genes may play critical roles in distinguishing between AML subtypes, potentially leading to more personalized and effective differentiation therapy strategies for AML patients. Further investigation of the functional importance of these genes could provide valuable insights into the underlying molecular mechanisms driving the development and progression of AML.

4. Discussion

Feature selection methods, as described in previous studies, play a pivotal role in identifying a subset of relevant genes that effectively discriminate between different cancer subtypes [34]. In this study, we utilized LASSO regression to screen out a subset of 101 genes for categorizing AML and control samples. Additionally, we identified gene subsets crucial for distinguishing a specific FAB subtype from other subtypes. This refined approach enhances our ability to distinguish between AML patients and healthy individuals, as well as between different AML subtypes, by recognizing and selecting relevant gene features. Furthermore, the selected gene subsets offer new prediction markers for AML patient prognosis, providing insights into the molecular-level pathogenic mechanisms underlying AML.
Compared to most other malignancies, AML exhibits a relatively low somatic mutation burden according to genome sequencing [1,35]. Epigenetic dysregulation, which is central to the pathophysiology of AML, is a distinguishing feature of this disease [24,36]. Our findings also showed that the genes differentially expressed in AML are primarily associated with the metabolism of RNA and RNA polymerase III (Pol III) transcription. Pol III is known to be highly specialized for the production of 5S rRNA, tRNAs, and U6 snRNA and is involved in epigenetic regulation [37,38]. These findings are consistent with the epigenetic changes observed in AML.
Genetic biomarkers play important roles in the early diagnosis, prognostic stratification, and surveillance of cancers, including AML. Current clinical guidelines for AML recognize three risk groups (favorable, intermediate, and poor) defined by recurrent cytogenetic abnormalities and gene mutations, such as t(8;21), t(15;17), inv(16), inv(3), t(6;9), 5q-, and mutations in NPM1, FLT3, TP53, and CEBPA [3,39]. Nonetheless, according to the guidelines, almost 50% of adult patients with AML have a normal karyotype and are considered to be at intermediate risk [39,40]. The best therapeutic approaches for these patients have not been fully elucidated, and the treatment outcomes are heterogeneous. There is increasing evidence that molecular analysis, including the detection of mutations in RUNX1, ASXL1, and MLL as well as altered expression of BAALC and MN1, can identify a subgroup of poor-risk patients among those with normal cytogenetic results [39,41,42,43]. Using LASSO regression, we identified a number of novel gene biomarkers related to the prognosis of AML in this study. For instance, we found that the conserved GTP-binding nucleolar protein GNL3L is linked to tumorigenesis and poor prognosis in patients with AML. This result is in line with recent research showing that GNL3L promotes AML cell proliferation and cytarabine resistance [32]. PSMA2, a component of the 20S core proteasome complex that degrades most intracellular proteins, is also associated with prognosis. It has been reported to form the PAN-PSMA2 fusion in myelodysplastic neoplasms (MDS) that progress to AML [44]. Furthermore, a prior study demonstrated that PSMA2 may indirectly affect tumor cells through the immune microenvironment [45,46].
MATR3, a nuclear matrix protein that is thought to stabilize certain messenger RNA types [47], has also been shown to participate in creating the KMT2A-MATR3 fusion and accelerating the onset of acute leukemia [48]. Moreover, the IMPDH inhibitor FF-10501-01 has a potent therapeutic effect on aggressive AML with MLL rearrangements by excessively activating the TLR-VCAM1 pathway [30,31]. More research is necessary to validate these results and explore promising therapeutic targets for AML patients.
The FAB classification of AML poses a significant challenge, particularly in patients with negative cytochemical staining and in distinguishing between M1 and M2 or M2 and M4 [49]. Using the LASSO method for gene feature selection, we developed an improved strategy for distinguishing the M0, M3, M4, and M5 subtypes. The key genes we identified, which significantly impact the differentiation of a particular AML subtype from others, are likely associated with differentiation due to the varying degrees of differentiation arrest among the subtypes. These genes may serve as candidates for differentiation therapy. Therefore, additional research is needed to comprehensively understand the crucial roles played by these genes.
This study has several limitations. Certain AML subtypes, such as M1 and M2, posed challenges in effective gene feature selection, which may have impacted the robustness of the findings. Additionally, while the FAB classification provided a framework for subtype categorization in this study, it is acknowledged that the WHO classification offers a more contemporary and clinically relevant classification. Unfortunately, the WHO classification data were not available in the datasets utilized. This study was only focused on AML defined by differentiation, and future studies incorporating WHO classification will further enhance the clinical relevance of such analyses. Finally, further validation and functional studies are warranted to confirm the clinical relevance of the identified genes and their potential as therapeutic targets.

5. Conclusions

Our study underscores the application of machine learning, specifically the LASSO regression algorithm, for analyzing the complex genomic data of AML patients. We successfully identified a subset of 101 genes that effectively distinguished between AML and control samples. Importantly, the identified genes not only are valuable for diagnosis but also demonstrate prognostic significance, with certain genes correlating with the overall survival outcomes of AML patients. Moreover, the genomic signature was further analyzed using LASSO to classify different AML subtypes, which may help to better understand the molecular heterogeneity within AML. Overall, our study provides insights for more precise categorization and personalized treatment strategies for AML.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biomedicines13051067/s1, Figure S1: Visualization of t-SNE and PCA plots. (A) t-SNE plot of the six subtypes of AML based on the total 9932 gene features. (B) PCA plot of the six subtypes of AML based on the total 9932 gene features; Figure S2: Visualization of PCA plots. The figure shows the PCA plots for (A) M0 and other subtypes of AML, (B) M1 and other subtypes of AML, (C) M2 and other subtypes of AML, (D) M3 and other subtypes of AML, (E) M4 and other subtypes of AML, and (F) M5 and other subtypes of AML; Table S1: Excel file containing the detailed expression profiling of the selected 101 genes in AML patients and controls, related to Figure 2.

Author Contributions

Z.L. wrote the primary manuscript; J.L. prepared the datasets and constructed the models; S.L. produced a few figures and tables; J.W. and Y.W. directed and revised the entire manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Shenzhen Science and Technology Program (No. JCYJ20230807145259025 to Z.L.) and the National Natural Science Foundation of China (No. 81900134 to Z.L.).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The gene expression RNAseq data were downloaded from the UCSC Xena browser (https://xena.ucsc.edu/). The phenotypic information and survival data of the AML samples were downloaded from the TCGA database (phenotype information: https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LAML.GDC_phenotype.tsv.gz; survival data: https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-LAML.survival.tsv). Both datasets were retrieved on 5 March 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AML: Acute myeloid leukemia
LASSO: Least absolute shrinkage and selection operator
HSCs: Hematopoietic stem cells
WHO: World Health Organization
FAB: French–American–British
SVM: Support vector machine
RF: Random forest
GTEx: Genotype-Tissue Expression
TCGA: The Cancer Genome Atlas
DAVID: Database for Annotation, Visualization, and Integrated Discovery
GO: Gene Ontology
t-SNE: t-distributed stochastic neighbor embedding
PCA: Principal component analysis
DEGs: Differentially expressed genes
BP: Biological process
CC: Cellular component
MF: Molecular function
Pol III: RNA polymerase III

References

  1. Cancer Genome Atlas Research Network; Ley, T.J.; Miller, C.; Ding, L.; Raphael, B.J.; Mungall, A.J.; Robertson, A.; Hoadley, K.; Triche, T.J., Jr.; Laird, P.W.; et al. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N. Engl. J. Med. 2013, 368, 2059–2074. [Google Scholar] [CrossRef] [PubMed]
  2. Dohner, H.; Weisdorf, D.J.; Bloomfield, C.D. Acute Myeloid Leukemia. N. Engl. J. Med. 2015, 373, 1136–1152. [Google Scholar] [CrossRef] [PubMed]
  3. Khoury, J.D.; Solary, E.; Abla, O.; Akkari, Y.; Alaggio, R.; Apperley, J.F.; Bejar, R.; Berti, E.; Busque, L.; Chan, J.K.C.; et al. The 5th edition of the World Health Organization Classification of Haematolymphoid Tumours: Myeloid and Histiocytic/Dendritic Neoplasms. Leukemia 2022, 36, 1703–1719. [Google Scholar] [CrossRef]
  4. Mo, Q.; Yun, S.; Sallman, D.A.; Vincelette, N.D.; Peng, G.; Zhang, L.; Lancet, J.E.; Padron, E. Integrative molecular subtypes of acute myeloid leukemia. Blood Cancer J. 2023, 13, 71. [Google Scholar] [CrossRef]
  5. Yin, P.Y.; Wang, R.W.; Jing, R.; Li, X.; Ma, J.H.; Li, K.M.; Wang, H. Research progress on molecular biomarkers of acute myeloid leukemia. Front. Oncol. 2023, 13, 1078556. [Google Scholar] [CrossRef]
  6. Prada-Arismendy, J.; Arroyave, J.C.; Röthlisberger, S. Molecular biomarkers in acute myeloid leukemia. Blood Rev. 2017, 31, 63–76. [Google Scholar] [CrossRef]
  7. Yang, X.; Wong, M.P.M.; Ng, R.K. Aberrant DNA Methylation in Acute Myeloid Leukemia and Its Clinical Implications. Int. J. Mol. Sci. 2019, 20, 4576. [Google Scholar] [CrossRef]
  8. König, I.R.; Auerbach, J.; Gola, D.; Held, E.; Holzinger, E.R.; Legault, M.A.; Sun, R.; Tintle, N.; Yang, H.C. Machine learning and data mining in complex genomic data—A review on the lessons learned in Genetic Analysis Workshop 19. BMC Genet. 2016, 17 (Suppl. 2), 1. [Google Scholar] [CrossRef]
  9. Smith, G.D.; Ching, W.H.; Cornejo-Páramo, P.; Wong, E.S. Decoding enhancer complexity with machine learning and high-throughput discovery. Genome Biol. 2023, 24, 116. [Google Scholar] [CrossRef]
  10. Pudjihartono, N.; Fadason, T.; Kempa-Liehr, A.W.; O’Sullivan, J.M. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Front. Bioinform. 2022, 2, 927312. [Google Scholar] [CrossRef]
  11. Wang, Y.; Gao, X.; Ru, X.; Sun, P.; Wang, J. Using feature selection and Bayesian network identify cancer subtypes based on proteomic data. J. Proteom. 2023, 280, 104895. [Google Scholar] [CrossRef] [PubMed]
  12. Yang, P.; Huang, H.; Liu, C. Feature selection revisited in the single-cell era. Genome Biol. 2021, 22, 321. [Google Scholar] [CrossRef] [PubMed]
  13. Wang, Y.; Gao, X.; Wang, J. Functional Proteomic Profiling Analysis in Four Major Types of Gastrointestinal Cancers. Biomolecules 2023, 13, 701. [Google Scholar] [CrossRef]
  14. Huang, S.; Cai, N.; Pacheco, P.P.; Narrandes, S.; Wang, Y.; Xu, W. Applications of Support Vector Machine (SVM) Learning in Cancer Genomics. Cancer Genom. Proteom. 2018, 15, 41–51. [Google Scholar] [CrossRef]
  15. Luo, J.; Feng, Y.; Wu, X.; Li, R.; Shi, J.; Chang, W.; Wang, J. ForestSubtype: A cancer subtype identifying approach based on high-dimensional genomic data and a parallel random forest. BMC Bioinform. 2023, 24, 289. [Google Scholar] [CrossRef]
  16. Hemphill, E.; Lindsay, J.; Lee, C.; Măndoiu, I.I.; Nelson, C.E. Feature selection and classifier performance on diverse bio- logical datasets. BMC Bioinform. 2014, 15 (Suppl. 13), S4. [Google Scholar] [CrossRef]
  17. Lonsdale, J.; Thomas, J.; Salvatore, M.; Phillips, R.; Lo, E.; Shad, S.; Hasz, R.; Walters, G.; Garcia, F.; Young, N.; et al. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 2013, 45, 580–585. [Google Scholar] [CrossRef]
  18. Wang, Z.; Jensen, M.A.; Zenklusen, J.C. A Practical Guide to The Cancer Genome Atlas (TCGA). In Statistical Genomics. Methods in Molecular Biology; Humana Press: New York, NY, USA, 2016; Volume 1418, pp. 111–141. [Google Scholar] [CrossRef]
  19. Hutter, C.; Zenklusen, J.C. The Cancer Genome Atlas: Creating Lasting Value beyond Its Data. Cell 2018, 173, 283–285. [Google Scholar] [CrossRef]
  20. Kang, J.; Choi, Y.J.; Kim, I.K.; Lee, H.S.; Kim, H.; Baik, S.H.; Kim, N.K.; Lee, K.Y. LASSO-Based Machine Learning Algorithm for Prediction of Lymph Node Metastasis in T1 Colorectal Cancer. Cancer Res. Treat. 2021, 53, 773–783. [Google Scholar] [CrossRef]
  21. Wang, T.; Dai, L.; Shen, S.; Yang, Y.; Yang, M.; Yang, X.; Qiu, Y.; Wang, W. Comprehensive Molecular Analyses of a Macrophage-Related Gene Signature with Regard to Prognosis, Immune Features, and Biomarkers for Immunotherapy in Hepatocellular Carcinoma Based on WGCNA and the LASSO Algorithm. Front. Immunol. 2022, 13, 843408. [Google Scholar] [CrossRef]
  22. Dennis, G., Jr.; Sherman, B.T.; Hosack, D.A.; Yang, J.; Gao, W.; Lane, H.C.; Lempicki, R.A. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003, 4, P3. [Google Scholar] [CrossRef]
  23. Wang, L. Mining causal relationships among clinical variables for cancer diagnosis based on Bayesian analysis. BioData Min. 2015, 8, 13. [Google Scholar] [CrossRef]
  24. Li, S.; Chen, X.; Wang, J.; Meydan, C.; Glass, J.L.; Shih, A.H.; Delwel, R.; Levine, R.L.; Mason, C.E.; Melnick, A.M. Somatic Mutations Drive Specific, but Reversible, Epigenetic Heterogeneity States in AML. Cancer Discov. 2020, 10, 1934–1949. [Google Scholar] [CrossRef] [PubMed]
  25. Ueda, K.; Steidl, U. Epigenetic Achilles’ heel of AML. Nat. Cancer 2021, 2, 481–483. [Google Scholar] [CrossRef] [PubMed]
  26. Rodrigues, C.P.; Shvedunova, M.; Akhtar, A. Epigenetic Regulators as the Gatekeepers of Hematopoiesis. Trends Genet. 2020, 37, P125–P142. [Google Scholar] [CrossRef]
  27. Desai, P.; Mencia-Trinchant, N.; Savenkov, O.; Simon, M.S.; Cheang, G.; Lee, S.; Samuel, M.; Ritchie, E.K.; Guzman, M.L.; Ballman, K.V.; et al. Somatic mutations precede acute myeloid leukemia years before diagnosis. Nat. Med. 2018, 24, 1015–1023. [Google Scholar] [CrossRef]
  28. Mill, C.P.; Fiskus, W.; DiNardo, C.D.; Birdwell, C.; Davis, J.A.; Kadia, T.M.; Takahashi, K.; Short, N.; Daver, N.; Ohanian, M.; et al. Effective therapy for AML with RUNX1 mutation by cotreatment with inhibitors of protein translation and BCL2. Blood 2022, 139, 907–921. [Google Scholar] [CrossRef]
  29. Wadugu, B.A.; Nonavinkere Srivatsan, S.; Heard, A.; Alberti, M.O.; Ndonwi, M.; Liu, J.; Grieb, S.; Bradley, J.; Shao, J.; Ahmed, T.; et al. U2af1 is a haplo-essential gene required for hematopoietic cancer cell survival in mice. J. Clin. Investig. 2021, 131, e141401. [Google Scholar] [CrossRef]
  30. Garcia-Manero, G.; Pemmaraju, N.; Alvarado, Y.; Naqvi, K.; Ravandi, F.; Jabbour, E.; De Lumpa, R.; Kantarjian, H.; Advani, A.; Mukherjee, S.; et al. Results of a Phase 1/2a dose-escalation study of FF-10501-01, an IMPDH inhibitor, in patients with acute myeloid leukemia or myelodysplastic syndromes. Leuk. Lymphoma 2020, 61, 1943–1953. [Google Scholar] [CrossRef]
  31. Liu, X.; Sato, N.; Yabushita, T.; Li, J.; Jia, Y.; Tamura, M.; Asada, S.; Fujino, T.; Fukushima, T.; Yonezawa, T.; et al. IMPDH inhibition activates TLR-VCAM1 pathway and suppresses the development of MLL-fusion leukemia. EMBO Mol. Med. 2023, 15, e15631. [Google Scholar] [CrossRef]
  32. Li, J.; Wu, Z.; Pan, Y.; Chen, Y.; Chu, J.; Cong, Y.; Fang, Q. GNL3L exhibits pro-tumor activities via NF-κB pathway as a poor prognostic factor in acute myeloid leukemia. J. Cancer 2024, 15, 4072–4080. [Google Scholar] [CrossRef] [PubMed]
  33. Han, Y.; Hu, A.; Qu, Y.; Xu, Q.; Wang, H.; Feng, Y.; Hu, Y.; He, L.; Wu, H.; Wang, X. Covalent targeting the LAS1-NOL9 axis for selective treatment in NPM1 mutant acute myeloid leukemia. Pharmacol. Res. 2023, 189, 106700. [Google Scholar] [CrossRef] [PubMed]
  34. Park, J.; Lee, J.W.; Park, M. Comparison of cancer subtype identification methods combined with feature selection methods in omics data analysis. BioData Min. 2023, 16, 18. [Google Scholar] [CrossRef] [PubMed]
  35. Papaemmanuil, E.; Gerstung, M.; Bullinger, L.; Gaidzik, V.I.; Paschka, P.; Roberts, N.D.; Potter, N.E.; Heuser, M.; Thol, F.; Bolli, N.; et al. Genomic Classification and Prognosis in Acute Myeloid Leukemia. N. Engl. J. Med. 2016, 374, 2209–2221. [Google Scholar] [CrossRef]
  36. Fennell, K.A.; Bell, C.C.; Dawson, M.A. Epigenetic therapies in acute myeloid leukemia: Where to from here? Blood 2019, 134, 1891–1901. [Google Scholar] [CrossRef]
  37. Yeganeh, M.; Hernandez, N. RNA polymerase III transcription as a disease factor. Genes Dev. 2020, 34, 865–882. [Google Scholar] [CrossRef]
  38. Bhargava, P. Epigenetic regulation of transcription by RNA polymerase III. Biochim. Biophys. Acta 2013, 1829, 1015–1025. [Google Scholar] [CrossRef]
  39. DiNardo, C.D.; Erba, H.P.; Freeman, S.D.; Wei, A.H. Acute myeloid leukaemia. Lancet 2023, 401, 2073–2086. [Google Scholar] [CrossRef]
  40. Döhner, H.; Wei, A.H.; Appelbaum, F.R.; Craddock, C.; DiNardo, C.D.; Dombret, H.; Ebert, B.L.; Fenaux, P.; Godley, L.A.; Hasserjian, R.P.; et al. Diagnosis and management of AML in adults: 2022 recommendations from an international expert panel on behalf of the ELN. Blood 2022, 140, 1345–1377. [Google Scholar] [CrossRef]
  41. Marjanovic, I.; Karan-Djurasevic, T.; Kostic, T.; Virijevic, M.; Vukovic, N.S.; Pavlovic, S.; Tosic, N. Prognostic significance of combined BAALC and MN1 gene expression level in acute myeloid leukemia with normal karyotype. Int. J. Lab. Hematol. 2021, 43, 433–440. [Google Scholar] [CrossRef]
  42. Yagi, T.; Morimoto, A.; Eguchi, M.; Hibi, S.; Sako, M.; Ishii, E.; Mizutani, S.; Imashuku, S.; Ohki, M.; Ichikawa, H. Identification of a gene expression signature associated with pediatric AML prognosis. Blood 2003, 102, 1849–1856. [Google Scholar] [CrossRef]
  43. Haferlach, C.; Kern, W.; Schindela, S.; Kohlmann, A.; Alpermann, T.; Schnittger, S.; Haferlach, T. Gene expression of BAALC, CDKN1B, ERG, and MN1 adds independent prognostic information to cytogenetics and molecular mutations in adult acute myeloid leukemia. Genes Chromosomes Cancer 2012, 51, 257–265. [Google Scholar] [CrossRef] [PubMed]
  44. Panagopoulos, I.; Gorunova, L.; Andersen, H.K.; Bergrem, A.; Dahm, A.; Andersen, K.; Micci, F.; Heim, S. PAN3-PSMA2 fusion resulting from a novel t(7;13)(p14;q12) chromosome translocation in a myelodysplastic syndrome that evolved into acute myeloid leukemia. Exp. Hematol. Oncol. 2018, 7, 7. [Google Scholar] [CrossRef]
  45. Wang, N. Analysis of prognostic biomarker models and immune microenvironment in acute myeloid leukemia by integrative bioinformatics. J. Cancer Res. Clin. Oncol. 2023, 149, 9609–9619. [Google Scholar] [CrossRef] [PubMed]
  46. Qi, J.; Hu, Z.; Liu, S.; Li, F.; Wang, S.; Wang, W.; Sheng, X.; Feng, L. Comprehensively Analyzed Macrophage-Regulated Genes Indicate That PSMA2 Promotes Colorectal Cancer Progression. Front. Oncol. 2020, 10, 618902. [Google Scholar] [CrossRef]
  47. Attig, J.; Agostini, F.; Gooding, C.; Chakrabarti, A.M.; Singh, A.; Haberman, N.; Zagalak, J.A.; Emmett, W.; Smith, C.W.J.; Luscombe, N.M.; et al. Heteromeric RNP Assembly at LINEs Controls Lineage-Specific RNA Processing. Cell 2018, 174, 1067–1081.e17. [Google Scholar] [CrossRef] [PubMed]
  48. Komatsu, K.; Sakaguchi, K.; Shimizu, D.; Yamoto, K.; Kato, F.; Miyairi, I.; Ogata, T.; Saitsu, H. Characterization of KMT2A::MATR3 fusion in a patient with acute lymphoblastic leukemia and monitoring of minimal residual disease by nanoplate digital PCR. Pediatr. Blood Cancer 2023, 70, e30120. [Google Scholar] [CrossRef]
  49. Angelescu, S.; Berbec, N.M.; Colita, A.; Barbu, D.; Lupu, A.R. Value of multifaced approach diagnosis and classification of acute leukemias. Maedica 2012, 7, 254–260. [Google Scholar]
Figure 1. Workflow of the research.
Figure 2. PCA plots and heatmaps. (A,B) PCA plot and heatmap for AML and control samples based on all 9932 gene features. (C,D) PCA plot and heatmap for AML and control samples based on the 101 gene features selected by LASSO. The R packages “FactoMineR” and “pheatmap” were used for the PCA and heatmap analyses, respectively.
Figure 3. Gene Ontology (GO) annotations for the 101 gene features selected by LASSO. The GO terms include (A) Biological Process, (B) Cellular Component, and (C) Molecular Function. The DAVID database was used to assess the GO annotations.
Figure 4. Reactome Pathway annotations for the 101 gene features selected by LASSO. The pathway annotations were analyzed using the DAVID database.
Figure 5. Prognostic analysis of candidate up-regulated genes. Shown here are the Kaplan–Meier curves for candidate genes up-regulated in AML compared to NC samples, including (A) ANKMY1, (B) GNL3L, (C) HCFC1, (D) IMPDH2, (E) NOL9, and (F) POLR3C. The Kaplan–Meier curves were generated using the “survival” and “survminer” packages in R.
Figure 6. Prognostic analysis of candidate down-regulated genes. Shown here are the Kaplan–Meier curves for candidate down-regulated genes in AML compared to NC samples, including (A) ACAD11, (B) MATR3, and (C) PSMA2.
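The Kaplan–Meier curves in Figures 5 and 6 were produced with R packages. As a rough illustration of what such a curve computes, the following is a minimal pure-Python sketch of the Kaplan–Meier product-limit estimator. It is not the authors' code: the function name `kaplan_meier`, the grouping into "high" and "low" expression, and all follow-up values are hypothetical and made up for the example.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return the (time, survival probability) steps of a Kaplan-Meier estimate.

    times  -- follow-up time per patient (e.g. days)
    events -- 1 if death was observed at that time, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        # The survival estimate only drops at times with observed deaths.
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set.
        n_at_risk -= removed
    return curve


# Hypothetical toy cohort (made-up values, not data from the study).
high_expr = kaplan_meier([100, 200, 300, 400, 500], [1, 1, 0, 1, 0])
low_expr = kaplan_meier([150, 450, 600, 800, 900], [0, 1, 0, 1, 0])

for label, curve in (("high", high_expr), ("low", low_expr)):
    print(label, [(t, round(s, 3)) for t, s in curve])
```

Censored patients (for example those still alive at last follow-up) reduce the number at risk without lowering the curve, which is why the estimate differs from a simple fraction of survivors. Comparing two such curves statistically would additionally require a test such as the log-rank test, which this sketch does not implement.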
Figure 7. Visualization of t-SNE plots. The figure shows the t-SNE plots for (A) M0 and the other subtypes of AML, (B) M1 and the other subtypes of AML, (C) M2 and the other subtypes of AML, (D) M3 and the other subtypes of AML, (E) M4 and the other subtypes of AML, and (F) M5 and the other subtypes of AML.
Figure 8. Bayesian networks (BN) of the selected gene features. Subsets of 11, 7, 10, 35, 9, and 12 genes were identified for distinguishing (A) M0, (B) M1, (C) M2, (D) M3, (E) M4, and (F) M5, respectively, from the other AML subtypes. Shown here are the BN structures of the selected genes in each classification model.
Table 1. Clinical characteristics of the AML cohort.
Clinical Characteristics    AML Cohort
Age
  N/Range                   151/(21–88)
  Average                   54.17 ± 16.07
Gender (N)
  Male                      83
  Female                    68
Race (N)
  Asian                     1
  Black                     13
  White                     135
  Not reported              2
Subtype (N)
  M0                        15
  M1                        35
  M2                        38
  M3                        15
  M4                        29
  M5                        15
  M6                        2
  M7                        1
  Not reported              1
OS_status (N)
  Alive                     52
  Dead                      80
  Not reported              19
OS_time (Days)
  Alive                     927.73 ± 730.63
  Dead                      414.50 ± 385.15
N: number of cases; OS: overall survival.

Li, Z.; Li, J.; Li, S.; Wang, Y.; Wang, J. Acute Myeloid Leukemia Genome Characterization Study and Subtype Classification Employing Feature Selection and Bayesian Networks. Biomedicines 2025, 13, 1067. https://doi.org/10.3390/biomedicines13051067
