A Bioinformatics View on Acute Myeloid Leukemia Surface Molecules by Combined Bayesian and ABC Analysis

“Big omics data” provoke the challenge of extracting meaningful information with clinical benefit. Here, we propose a two-step approach, an initial unsupervised inspection of the structure of the high dimensional data followed by supervised analysis of gene expression levels, to reconstruct the surface patterns on different subtypes of acute myeloid leukemia (AML). First, Bayesian methodology was used, focusing on surface molecules encoded by cluster of differentiation (CD) genes to assess whether AML is a homogeneous group or segregates into clusters. Gene expressions of 390 patient samples measured using microarray technology and 150 samples measured via RNA-Seq were compared. Beyond acute promyelocytic leukemia (APL), a well-known AML subentity, the remaining AML samples were separated into two distinct subgroups. Next, we investigated which CD molecules would best distinguish each AML subgroup against APL, and validated discriminative molecules of both datasets by searching the scientific literature. Surprisingly, a comparison of both omics analyses revealed that CD339 was the only overlapping gene differentially regulated in APL and other AML subtypes. In summary, our two-step approach for gene expression analysis revealed two previously unknown subgroup distinctions in AML based on surface molecule expression, which may guide the differentiation of subentities in a given clinical–diagnostic context.


Introduction
High-throughput molecular analyses are becoming increasingly affordable in clinical medicine, and have significantly improved our understanding of diseases such as cancer [1,2]. However, overwhelming amounts of "big omics data" still provoke the challenge of extracting meaningful information with clinical benefit. Deep molecular insights can provide a basis to shift organ/tissue-centered approaches to tumor diagnosis and therapy towards precision oncology, i.e., the integration of genetic and genomic data to estimate a patient's prognosis and guide treatment decisions [3]. In fact, molecular classification systems have been proposed already for some entities such as breast cancer [4] or acute myeloid leukemia (AML) [5]. However, large-scale sequencing or gene expression studies are of limited practical value in specific clinical settings concerning both diagnostics and therapy. Notably, specific low-complexity biomarkers can be determined in less time than system-wide profiles, and molecular therapeutics need precisely defined structures rather than superordinate patterns as their particular sites of action. Therefore, rapid diagnostics and e ffective molecular therapies require knowledge about "the important few" features that low-throughput testing procedures or targeted compounds can address.
Diagnosis of acute myeloid leukemia (AML) is of great medical importance in the field of hematology, as one of its subtypes, specifically, acute promyelocytic leukemia (APL), is associated with a substantial risk of early death from bleeding complications [6]. Although diagnosis of AML can be established in many cases by cytology, discrimination of APL and non-promyelocytic AML (np-AML) is sometimes challenging and requires further examination. Throughout the past two decades, flow cytometry has become an additional powerful tool to rapidly characterize AML and distinguish this entity from other acute leukemias or myeloid neoplasms based on the immunophenotyping of blasts [7,8]. Up to now, single distinct AML-specific cell surface markers have not been identified, thus the European Leukemia Net (ELN) guidelines recommend a broad antibody panel for diagnostic testing [9].
In this work, we applied an unbiased mathematical approach that combines Bayesian and computed ABC analysis [10] on gene expression data in order to identify the most informative cluster of differentiation (CD) genes on AML blasts that could distinguish APL from np-AML. Specifically, we analyzed microarray gene expression profiles for a series of 390 primary patient samples comprising 266 np-AML, 15 APL, and 109 control cases (healthy/non-leukemia patients) [11], and RNA-Seq-data from 135 np-AML and 15 APL patient samples from the TCGA LAML project [12].

Gene Expression Datasets
Two AML gene expression datasets were analyzed in this study. The first one (provided by T.H) was a subset of a previously published microarray dataset [11] and included gene expression profiles of 281 diagnostic AML samples (266 np-AML, 15 APL) and 109 nonleukemia/healthy individuals that had been obtained using Affymetrix HG-U133 Plus 2.0 microarrays. The ethics board approved the analysis of this dataset for the purpose of this study at the faculty of medicine, University of Marburg (No. 138/16). The second dataset comprised RNA-Seq data from the TCGA LAML project for 135 np-AML and 15 APL patients [12], and can be downloaded from the Broad GDAC Firehose platform (http://firebrowse.org/?cohort=LAML&download_dialog=true, accessed on 3 April 2017). It can also be downloaded from the GDC Data Portal for each case separately (https: //portal.gdc.cancer.gov/ (accessed on 3 April 2017). Clinical annotations for the LAML dataset included FAB subgroups as the most diverse established classification system for AML, while more recent classifications such as WHO 2016 [8] or ELN 2017/2022 classifications [7,13] were not provided. Thus, we chose FAB classes as the reference for our calculations, which we also considered appropriate given that our bioinformatic explorations were directed towards the cell surface. A list of 417 CD genes that were examined for expression in np-AML, APL, and non-leukemia samples (available only in the microarray dataset) was compiled manually using the list of CD molecules provided by the HUGO gene nomenclature committee (https://www.genenames.org/data/genegroup/ #!/group/471, accessed on 30 November 2019) and a query to the NCBI gene database (https://www.ncbi.nlm.nih.gov/gene, accessed on 2 November 2017; query term "CDxxx AND Homo Sapiens [Organism]").

Calculation of Bayesian Decision Borders for Expressed and Unexpressed Genes
Initial exploratory data analysis with techniques taken from [14] indicated that the distribution of the log-transformed expression data for each of the two complete sample sets followed a multimodal distribution. Mixture models are the standard statistical tool for such applications [15,16], and further preprocessing was not required. Gaussian mixture models (GMM) were fitted to the overall distribution of the log-transformed microarray and the RNA-Seq gene expression data and subjected to Bayesian analysis using the R package "AdaptGauss" available on the Comprehensive R Archive Network (CRAN) [15]. In brief, the procedure incorporated the following definitions and operations: In Gaussian Mixture models (GMM), the probability density of the measurements, GMM(x), is represented as a weighted sum of Gaussians where N(x|m i ,s i ) denotes Gaussian normal distributions with mean m i and standard deviation s i ; the weights, w i , which add up to 1, indicate the relative contribution of each component; and M denotes the number of components in the mixture. For each dataset, a suitable value for M was selected based on the Akaike information criterion [17], and the GMM was adapted to the data using an expectation-maximization algorithm. This approach resulted in a GMM with M = 2. Chi-square tests [18] were applied to estimate the probability that the GMM did not adequately describe the logarithmized gene expression data. The chi-square tests yielded p-values of p < 0.001 for both GMMs. Posteriors for each GMM were calculated as follows:

Identification of AML Subgroups
The subgroups G k k = 1, 2, 3 were identified with the Ward algorithm accessible in the R package "FCPS" available on CRAN [19]. As the algorithm was applied to the Bayes posteriors p(c i |x), no normalization was necessary. The inspection of the dendrogram indicated optimal clustering. Here, significant changes in fusion levels of the ultrametric portion of the Euclidean distance in the Ward algorithm (y-axis) indicated the best cut [20].

Selection of Differentially Expressed Genes (Deg)
In the next step, the probability p deg that gene a is differentially expressed is defined in Equation (4) as the difference of group average Bayes posteriors p G k = 1 N k ∑ N k u∈G k p u (c i |gene a ) with p u being the posterior value of a subject's case u ∈ G k of a subgroup G k with its cardinality N k = |G k |, as follows: with G k and G j (where by k = 1, 2, 3, j = 1, 2, 3, j = k) being the two groups of comparison (e.g., APL vs. AML) and the sign (+,−) defining the direction. A value of p deg (gene a ) = −2 indicates definitive underexpression, and a value of p deg (gene a ) = +2 denotes definitive overexpression in relation to the compared groups.
Differences in gene expression levels were analyzed for the dichotomies APL vs. AML1, APL vs. AML2, AML1 vs. AML2, APL vs. normal, AML1 vs. normal, and AML2 vs. normal in the microarray dataset, and APL vs. AML1, APL vs. AML2, and AML1 vs. AML2 in the RNA-Seq dataset, using Cohen's D with varying variances and group sizes [21] as a measure of the effect size. The most important effect sizes, i.e., the most significant differences in gene expression, were identified by computed ABC analysis [10] using the R package "ABCanalysis" available on CRAN (http://cran.r-project.org/web/packages/ ABCanalysis/index.html (accessed on 3 April 2017)). The ABC curve can visualize the data by graphically representing the cumulative distribution function closely related to the Lorenz curve. Using the ABC curve, the algorithm calculates the optimal limits by exploiting the mathematical properties pertaining to the distribution of analyzed items.
Recursively applying ABC analysis three times on all probabilities p deg yielded the used thresholds: genes were considered to be expressed if their difference exceeded the threshold of 0.3 for RNA-Seq (0.7 for microarray data), because genes with values above the threshold were in group A (A and B respectively) in the third computed ABC analysis, and considered to be the most important ones.

Results
At first, the gene expressions in both datasets were logarithmized. Distribution analysis [14] revealed multimodal distributions for the CD genes. Therefore, distributions were modeled with Gaussian mixtures as described in the methods section. The models allowed for calculating Bayesian posteriors, which were used for further analysis.

Subgroups of AML Based on the Expression of CD Genes
In the next step, hierarchical clustering, using the Ward algorithm, of the RNA-Seq data revealed three major clusters marked in the dendrogram in black in Figure 1, consisting solely of APL/AML M3 cases (Ward cluster 3). Beyond this separation, AML had two subgroups (magenta and red). In order to compare the RNA-expression-based AML clustering with the historical classification of the French-American-British group [22], both classifications were aligned in Table 1. Within the first cluster of Ward, M1, M2, and M0 were predominant, and within the second cluster, M5. M4 was equally distributed, and for M6 to M8, only a few cases were available. FAB M3 is historically defined as APL, and was only found in Ward cluster 3. Hence, the Ward dendrogram for the AML patients without APL of the microarray dataset was also investigated. In line with the clustering results from the RNA-Seq data, the non-APL group revealed a cluster structure of two subgroups, as depicted in Figure 2.   Table 1 which also depicts the colors of this figure. Clustering and dendrogram generation were computed with the R package "FCPS" available on the Comprehensive R Archive Network (CRAN) [19].  [22]. Abbreviations Table 1 which also depicts the colors of this figure. Clustering and dendrogram generation were computed with the R package "FCPS" available on the Comprehensive R Archive Network (CRAN) [19]. A dichotomy of np-AML is neither recognized in the present WHO classification [7] nor in the stratification for treatment strategies. The separation of np-AML to APL is in agreement with previous research [23]. In addition, it is mentioned in one prior publication that AML divides into one APL cluster and two AML clusters with principal compo- A dichotomy of np-AML is neither recognized in the present WHO classification [7] nor in the stratification for treatment strategies. The separation of np-AML to APL is in agreement with previous research [23]. In addition, it is mentioned in one prior publication that AML divides into one APL cluster and two AML clusters with principal component analysis using 37 microarray samples [24]. Alignment of the two clusters to FAB classification could either be interpreted as a myeloid vs. monocytic differentiation pattern or as an immature (M0-2 and M6/M7) vs. a more mature (M4-M5) biologic nature of the leukemia blasts, as suggested by Goardon et al. [25]. To corroborate the latter hypothesis, we performed a detailed inspection of specific antigen distribution patterns.

Differential Expression of CD Genes in AML Subtypes
To determine the CD genes that most accurately discriminate the subgroups of AML (APL, AML1, AML2) in the microarray analysis, we determined the differences in mean expression values for each dichotomy by calculating Cohen's D and then performed computed ABC analyses. In total, 23 genes significantly discriminated two AML subgroups in at least one dichotomy in the microarray dataset (Table 2). Moreover, 10 genes were differentially expressed in APL vs. AML1, and 19 genes in APL vs. AML2, with six co-regulated genes in APL vs. both groups. Only one gene (CD18) was differentially expressed in all three pairs of AML subgroups. Hence, a comparison of APL vs. a mixture of both AML groups would have underscored the expression of considerable many potential target genes. Of six genes that were correspondingly differentially expressed between APL and each AML1 and AML2, three (CD3D, CD98, CD339) were higher expressed in APL, and three (CD52, CD62L, CD83) were higher in AML1/2. In addition, the CD18 gene was also higher in APL vs. AML1/2, and differentially expressed in all three pairs of AML subgroups. Finally, one gene (CD354) discriminated AML1 from AML2, but not AML1 and AML2 from APL.
In the RNA-Seq dataset, 14 genes significantly discriminated between AML subgroups in at least one dichotomy (Table 3). Three genes (CD84, CD148, CD339) separated APL from each AML1 and AML2, with all of these genes showing higher expression in APL. In addition, three genes (CD91, CD197, CD261) discriminated between AML1 and AML2. However, CD197 did not discriminate in the microarray data. Of note, CD84, CD148, CD197, and CD261, as well as all other genes that significantly discriminated between two AML subgroups only in one dichotomy (APL vs. AML1, APL vs. AML2, AML1 vs. AML2), were only found in one dataset and showed a Cohen's D-value of 0 in the other and CD91 was not expressed in the microarray dataset (Table S1). Thus, overexpression of CD339 in APL was the only consistent finding in the two datasets. Together, the comparison of APL vs. AML2 revealed a diverse expression of wellknown differentiation marker molecules, such as CD11b, CD13, CD14, CD15, and CD24, whereas fewer surface molecules were differentially expressed between APL and AML1. The minor expression of specialized functional molecules inside the cell and on the surface is a feature of "stemness", hereby indicating that AML2 might represent a more differentiated type of AML, and that AML1 is more stem-cell-like. Moreover, only CD79a was overexpressed in AML1 vs. APL. CD79a is considered a B-cell antigen, but is often expressed in immature blast crisis acute leukemias following CML (chronic myeloid leukemia) [26].

Differential Expression of CD Genes in AML Compared to Normal Samples
To further substantiate the differential expression of specific CD genes in AML1, AML2, and APL, we compared these entities to normal samples separately. The results are presented in Table 4. As TCGA RNA-Seq data did not include healthy/normal controls, these analyses were performed only in the microarray dataset. ABC analyses revealed differential regulation between at least one AML subgroup and normal controls of 48 CD genes.

Validation of Target Genes by Literature Screening
In order to corroborate the target CD antigens with a focus on the distinction between APL and np-AML that were derived from both microarray and RNA-Seq datasets, we screened the literature for reports from other AML groups [ Table 5].
In sum, 36 genes (14 in the RNA-Seq, 23 in the microarray dataset) were expressed differentially in APL and np-AML ( Table 5). Overexpression of CD339/JAG1 in APL was the only consistent finding in the two datasets, in line with previous reports including both gene expression data from RNA-Seq and microarray analyses [27] or flow cytometry studies [28] Of note, differential expression in APL and other AML subtypes cannot be deduced from these earlier studies for all CD genes identified by our calculations, such as CD84 [29], CD88 [30], CD148 [23,31], CD210A/IL10RA, and CD227 [32,33]. However, all of these genes have been detected in cells from all hematopoietic lineages or AML blasts, and CD148 and CD210 have even been recently suggested as novel immunotherapeutic targets in AML [34], except forCD261/TNFRSF10A which we did not find in previous publications pointing to implications in AML or APL.
Therefore, current knowledge of CD gene expression in distinct subgroups of AML strongly supports the validity of our mathematical approach to extract clinically important information from gene expression data, specifically in the context of AML, for diagnostic purposes.   Table 5. Supporting evidence from the literature for differential expression of CD genes in APL and other AML subtypes. If differential expression could not be deduced from previous reports, references describing expression in AML are indicated. Table 2/Table 3

Discussion
Characteristic expression patterns of cell surface molecules provide the basis for rapid diagnosis of AML by flow cytometry [9]. Here, we used combined Bayesian and ABC analysis to identify subgroups of AML based on characteristic cell surface patterns reconstructed from gene expression data for CD genes. We explored associations of AML CD subgroups with an established morphologic classification system for AML, precisely, FAB classification. Moreover, we extracted the most important discriminating CD genes that distinguish the AML subgroups from each other and normal samples. As the focus of our work was more closely related to potential diagnostic applications, the following discussion primarily refers to current experimentally confirmed knowledge on CD gene expression in AML. Yet, clinical implications are clearly emerging as a future perspective, given the growing spectrum of monoclonal antibodies targeting AML and related myeloid diseases [71].
Intriguingly, using two independent AML expression datasets obtained by microarray or RNA-Seq analyses, we found that AML separated into only two natural clusters in both datasets based on the expression of CD genes. In the RNA-Seq dataset, we identified two subgroups of AML and were able to separate APL from AML. One of these clusters is APL, the only subclass of AML that has been previously defined, and is managed therapeutically differently from all other AML subclasses in the clinic [72]. Np-AML, on the other hand, is less diverse based on CD gene expression than, for example, based on morphology or genetic findings. Employing the Ward algorithm on 394 CD genes, AML clustered into two subgroups, which we here pragmatically designated AML1 and AML 2. In the RNA-Seq dataset, AML1 was associated with FAB classes M1 and M2, while AML2 corresponded to FAB class M5. Both AML subgroups contained FAB class M4. In the microarray dataset, the FAB classification was not accessible. This association supports the notion that the AML1 cluster refers to an immature immunophenotype, while the AML2 cluster characterizes leukemia cells with the expression of maturation markers, i.e., AML1 cells may derive from a pluripotent stem cell and AML2 from a progenitor cell, respectively.
In order to further investigate whether the classification of AML by CD gene expression may be exploited for AML diagnosis by flow cytometry in addition to previously recommended immunophenotype targets, we selected the most important discriminating genes for each subgroup compared to the other subgroups and to normal controls. Performing ABC analysis on the microarray and the RNA-Seq datasets predominantly yielded non-overlapping results. Only the CD339 gene, which corresponds to the Jagged-1 protein, was expressed higher in APL than in AML1 and AML2 in both datasets, and was also distinctive for APL compared to normal samples, which is in line with previous analyses of publicly available gene expression datasets [27]. However, CD339 is not included in previously reported antibody panels for the identification of APL by flow cytometry [41,42,73]. Therefore, it is tempting to speculate that our unbiased analysis of the AML surface can inform the design of improved diagnostic tools.
On the other hand, the only other gene that was somehow discriminating between AML subgroups, CD88, was underexpressed in AML2 compared to APL in the RNA-Seq dataset but overexpressed relative to both APL and AML1 in the microarray dataset. CD88 (C5AR1) has been reported to be expressed in AML cell lines and primary patient blasts and to contribute to cell motility [30], illustrating its potential role in this disease. We attribute the conflicting results for CD88 expression, as well as the overall lack of consistent observations, to the different molecular techniques for gene expression measurement [74], as Bayesian analysis based on GMMs has been shown to yield stable and consistent results on noisy data of various types, such as income, heat or cold pain, GDP time series or nitrate concentrations in streams [15,75,76]. Of note, the dataset of only 15 cases of APL samples is small, but in line with the reported frequency of this subgroup. Consequently, consolidation with a larger dataset would be desirable to strengthen the results reported here. Of non-overlapping genes that were differentially expressed between AML subgroups or AML and normal samples, several CDs are either known targets (such as CD33) or have already been reported to be implicated in AML.
Novel treatment options for AML cases with an unfavorable prognosis are highly demanded and monoclonal antibodies that target AML surface molecules are currently evaluated in clinical trials ( [71,77]). Although we do not have detailed information on cytogenetics or survival in both of our datasets, our study reveals potential target molecules, particularly if AML1 is indeed a stem cell near a subgroup with inferior prognosis.
In summary, we conclude that our unbiased mathematical approach was highly successful in deducing knowledge from complex biological datasets. "The important few" AML-specific CD genes identified by combined Bayesian and calculated ABC analysis are widely congruent with actual knowledge in the field, independent from the particular method of gene expression analysis. Thus, the work described here may serve as a blueprint for translating relevant information from high-dimensional biological data into clinical applications.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/bioengineering9110642/s1, Table S1: Overview of all significant differentially expressed CD genes in both datasets.

Informed Consent Statement: Not applicable.
Data Availability Statement: RNA-Seq data can be downloaded from the GDC Data Portal (https: //portal.gdc.cancer.gov/ (accessed on 3 April 2017). In addition, the RNA-seq data can be provided upon request. (accessed on 3 April 2017)). The microarray dataset was published by Haferlach et al. [11].