Machine Learning Based Analysis of Relations between Antigen Expression and Genetic Aberrations in Childhood B-Cell Precursor Acute Lymphoblastic Leukaemia

Flow cytometry technique (FC) is a standard diagnostic tool for diagnostics of B-cell precursor acute lymphoblastic leukemia (BCP-ALL) assessing the immunophenotype of blast cells. BCP-ALL is often associated with underlying genetic aberrations, that have evidenced prognostic significance and can impact the disease outcome. Since the determination of patient prognosis is already important at the initial phase of BCP-ALL diagnostics, we aimed to reveal specific genetic aberrations by finding specific multiple antigen expression patterns with FC immunophenotyping. The FC immunophenotype data were analysed using machine learning methods (gradient boosting, decision trees, classification rules). The obtained results were verified with the use of repeated cross-validation. The t(12;21)/ETV6-RUNX1 aberration occurs more often when blasts present high expression of CD10, CD38, low CD34, CD45 and specific low expression of CD81. The t(v;11q23)/KMT2A is associated with positive NG2 expression and low CD10, CD34, TdT and CD24. Hyperdiploidy is associated with CD123, CD66c and CD34 expression on blast cells. In turn, high expression of CD81, low expression of CD45, CD22 and lack of CD123 and NG2 indicates that none of the studied aberrations is present. Detecting aberrations in pediatric BCP-ALL, based on the expression of multiple markers, can be done with decent efficiency.


Introduction
Acute lymphoblastic leukaemia (ALL) is the most frequent leukaemia in children. There are two main types of ALL, one of them is B-cell precursor acute lymphoblastic leukaemia (BCP-ALL), comprising about 85% of all ALL cases. Abnormal leukemic cells (blasts) can be distinguished by assessing the immunophenotype i.e., the composition of specific antigens expressed on the cell surface or inside the cytoplasm with the use of flow cytometry technique (FC). There are several genetic aberrations related with BCP-ALL that have evidenced prognostic significance. Most common are translocations t(12;21), t(9;22), t (1;19) and translocations within the q23 region of chromosome 11 (t(11q23)), resulting in generation of fusion genes ETV6-RUNX1, BCR-ABL1, TCF3-PBX1 and KMT2A gene rearrangements, respectively [1,2]. Other types of aberrations manifest as increase or decrease in chromosome number corresponding to hyper-and hypodiploidy, respectively [3].
Genetic tests are generally considered expensive and have a long turnaround time from sample collection till result generation. On the contrary, FC technique appears to be more broadly available, affordable and the results can be obtained on the same working day.
It is already known that expression, or lack of certain antigens is associated with molecular aberrations. For instance, expression of NG2 and CD15 is associated with t(v;11q23)/KMT2A gene rearrangements [1,[4][5][6] and high expression of CD123 is often seen in patients with hyperdiploidy [3,7,8]. FC results are most commonly evaluated with the use of 2-dimensional dot plots representing expression pattern of 2 antigens at a time (Figures 1 and 2). Each dot plot shows a pair of antigens assessed in patients with BCP-ALL. Coloured dot populations represent individual patients with particular genetic aberration: red-t(v;11q23)/KMT2A; green-t(12;21)/ETV6-RUNX1 and blue-hyperdiploidy. The dot in the middle of each population represents the median of fluorescence intensity (MFI). One can see that, this 'classical' approach of immunophenotype assessment is not sufficient to infer the underlying genetic aberrations, even if the above mentioned phenotypic features are present.  To date, only a few papers describing potential association of quantitatively measured antigen expression with genetic aberrations exist in the literature [2,4]. However, this previous research covered only one-dimensional statistical analysis and the aberration prediction efficiency was examined only on the training set of examples. In addition, the authors did not make the analysed data sets available. Therefore, the results reported in these papers might not be optimally estimated.
This article presents an original approach of finding multiple (multi-dimensional) antigen expression patterns revealing possible genetic abnormalities underlying individual BCP-ALL cases. Our innovative approach exploits quantitative antigen expression measure and advanced multidimensional data analysis, including cross-validation and releasing the Shiny application for researchers to predict the occurrence of particular aberrations based on the expression of multiple markers by FC. As genetic aberrations have evidenced prognostic significance, success in finding immunophenotype-genotype correlation would allow to determine the patient prognosis already at the initial stage of diagnosis with FC immunophenotyping. The results show also the real specificity, positive and negative predictive values of the predictive models.
Machine learning methods, particularly data clustering [9][10][11] and outlier detection [12] were also previously used to visualize and interpret flow and mass cytometry data. These papers concern the stage of raw data processing.
Cell analysis and gating the leukemic blasts out of all events followed the exclusion of doublets and cellular debris based on the forward and side scatter parameters. Then, the expression of antigens on blasts was measured as median of fluorescence intensity (MFI). To normalize the antigen expression, raw MFI values were transformed into scores of a normalized (nMFI) scale, as reported in [15]. For each of the studied antigens, a negative and positive reference population (observable in healthy donor samples) was chosen. For example, for CD66c the reference positive population was neutrophils (because it has the highest expression of CD66c), while the negative population was erythroblasts (because it never expresses this antigen). The ranges of MFI between those cell types were divided into 10 equal antigen-specific steps (scores). Expression level corresponding to the negative control cell population was given 0 nMFI score, while 10 nMFI equalled the expression of positive reference cell population which in classical notation corresponds to bright expression. The expression (measured as MFI) of particular antigen on leukemic blasts was then compared against the nMFI scale obtained for normal cells and ascribed respective nMFI score. Therefore, expression level of 4-6 in classical notation could be described as "moderate" or "dim", while nMFI = 1 is considered low-positive or weak. Determination of the nMFI step width, enabled to extrapolate the scale over the normal reference MFI values, making possible to assess marker overexpression on leukemic blasts (nMFI > 10) which in classical notation is still defined as bright expression. The use of here proposed normalized scale of antigen expression provides more information than classical approach with the use descriptive notations which are often arbitrary. Gradual antigen expression assessment was chosen in order to associate the genetic abnormalities with particular levels of antigen(s) expression level determined quantitatively. For markers without identifiable positive control (e.g., NG2), binary type of expression was used The total raw data set contained 818 patients and 21 parameters-17 antigens with measured expression levels and 4 genetic aberrations (binary type attributes). Each decision attribute described one decision (classification) problem. Given the generally exclusive nature of genetic aberrations studied here, the paper formulated five decision problems: Initially, the dataset included other important recurrent genetic aberrations -BCR-ABL1, and TCF3-PBX1. However, there were only 7 cases of BCR-ABL and 6 of TCF3-PBX with many missing results. Therefore, these aberrations were discarded from the analysis.
During analysis each decision problem was considered separately. For 8 out of 17 antigens with measured expression levels the number of missing values exceeded 20% with maximum equal to 36% for CD13 (see Missing data section in the Supplementary material). For 6% observations the concurrent prevalence of missing values for attributes CD66c, cyIgM, CD33, CD13, CD22, CD9, CD81 was observed. Since the complete vector of antigen expression was not available for every case, the data set had to be preprocessed. In the first stage of preprocessing, cases with 50% or more missing values were removed. During the further analysis no imputation of missing values was carried out. Table 1 presents the descriptive statistics of conditional attributes. Table 2 shows detailed information about the data sets. One can notice that the data sets are imbalanced. The study was conducted from two perspectives: predictive and descriptive (explanatory). The purpose of the predictive analysis was to build classifiers resolving abovementioned decision problems. The aim of the explanatory analysis was to find important dependencies between decision attribute (e.g., aberration) and conditional attributes (i.e., antigen expression). Figure 3 presents the detailed workflow of the analysis.

Predictive Analysis
Predictive models were built using Gradient Boosting Machine (GBM) [16] method implemented in h2o R package. The main idea of GBM is inducing an ensemble of shallow trees in sequence with each tree learning and improving on the previous one. The decision of selection of GBM in this study has been made after running automl procedure [17] available in h2o R package. The GBM achieved the best prediction efficiency among other methods examined by automl i.e.,: Distributed Random Forest, Generalized Linear Model, Naïve Bayes Classifier and Support Vector Machine. The comparison of considered algorithms is available in supplementary material. During the experiments stratified 5-fold cross validation procedure repeated 10 times was applied. Thus, for each decision problem 50 classification models were built. As the loss function the harmonic mean of precision (PPV -Positive Predictive Value) and sensitivity (TPR-True Positive Rate) known as F 1 score (1) was used.
Additionally, in the literature of the subject there are also empirical research showing that GBM outperforms other classification methods [18].
Additionally variable importance was calculated using model-specific measure and model-agnostic approach based on permutation test with DALEX R package [19]. Moreover, in Shiny application, we used Shapley value [20] to assess the impact of feature in final prediction.

Descriptive Analysis
The main aim of the descriptive analysis was to discover multidimensional relations between antigen expression and molecular aberrations. Two strategies to find antigenaberration dependencies were applied: divide-and-conquer and separate-and-conquer. Utilise of two approaches allow on finding such relationships which will be impossible to obtain using only one of mentioned technique. The firs strategy utilises RPART decision tree induction algorithm [21], while the second one uses RuleKit, the decision rule induction package [22].
The descriptive analysis was carried out in two paths. The first path involves RPART decision tree induction, the second one involves RuleKit-based rule induction. Descriptive analysis was carried out on the entire set of examples. To avoid the data over-fitting and get decision tree with reasonable number of levels during the experiments the process of decision tree growing was stopped when Cohen's Kappa statistic (2) [23] calculated on the entire set of examples attains value 0.8. Kappa statistics is a measure of accuracy agreement between two classifiers, the evaluated classifier (ACC) and the random classifier which classifiers examples in accordance with class distribution (RAND). Kappa statistic is a special case of association that indicates an "association" of the elements of a classifier confusion matrix on its main diagonal.
In rule-based descriptive analysis, the induced rules can overlap each other. Each In our research, as the rule search heuristic in rule induction algorithm [22], the C2 (3) rule quality measure was used. During the induction, the form of a rule premise was optimised towards the C2 value maximisation.
where The C2 measure can be viewed as the weighted version of Kappa statistics. C2 evaluates each rule separately. During the rule induction, the form of a rule premise was optimised towards the C2 value maximisation.
For rule-based exploratory analysis the rule precision (p/(p + n)) rule coverage (p/P) and rule statistical significance (Fisher's exact test with p-value adjustment) were also reported. Figure 4 shows ROC curves-with 95% confidence intervals-for all considered decision problem. ROC curves were calculated on the independent test data sets. Table 3 presents values of sensitivity, specificity, PPV and NPV for all decision problems. In parenthesis standard deviations are presented. The overall classification accuracy of GBM classifier varied from 71% for no aberration to 98% for t(v;11q23)/KMT2A aberration. F 1 score takes values from 0.52 (t(12;21)/ETV6-RUNX1) to 0.81 (t(12;21)/ETV6-RUNX1 vs. hyperdiploidy).

Predictive Analysis
After model training the feature importance analysis was carried out. GBM algorithm allows to generate a feature importance ranking. The final feature importance was calculated as the average position of the feature importance in 50 classifiers (as was mentioned, experiments were carried out in 10 × 5CV mode).
The above results were obtained for classifiers trained on all available features. The purpose of the next experiment was to check whether is it possible to achieve similar classification accuracy with limited set of features. For this purpose, the simple strategy was applied. Basing on the feature importance ranking of all features, classifiers with iteratively extended feature set were trained. The first classifier worked only with the best feature, the second with two top-ranked features and so on. Consistently, all calculations were conducted in 10x5CV mode. The results of this experiment are presented in Figure 5.   For all decision/prediction problems one can observe that the F 1 value grows with the number of variables used in prediction model. In the vast majority of analysed cases the difference between the classifiers is not significant. It it difficult to indicate optimal number of features in classifier; however, one can observe that the F 1 value stabilises when the six most important features are used in the prediction.

Decision Trees
For each of the considered decision problem the RPART [24] decision tree was induced based on the entire available data set. The grid based hyper-parameter optimisation procedure was used to find the trees with high explanation efficiency. To get precise and transparent decision trees the induction process was stopped when during tree induction the Kappa coefficient exceeded value 0.8. Table 4 presents values of classification measures of induced trees. Induced trees characterised different complexity. The number of tree levels varied form 5 (ETV6-RUNX1) to 12 (no aberration), the number of leaves varied for 9 (ETV6-RUNX1) to 81 (no aberration). Based on the induced decision trees the feature importance rankings were determined. The ranking of the six most important features for each decision problem is as follows:  For hyperdiploidy the following positive and negative rules were induced:

Multi-Attribute Decision Rules
A sequential covering approach to the rule induction allowed to determine a few additional interesting rules. Below, there are two rules characterising high values of rule support and rule precision. The rules are also statistically significant The first rule has a form of: The above two rules were generated automatically. The next reported rules were induced within the user-driven mode to verify certain hypotheses about the co-occurrence of the considered aberrations.

Aberration t(12;21)/ETV6-RUNX1
In the first experiment, it was enforced that the elementary condition (CD9 > 1) has to occur in the rule premise. Two rules were obtained that covered a large group of examples representing patients without the ETV6-RUNX1 aberration: In the second experiment, it was checked whether the CD81 antigen is a prognostic factor in the ETV6-RUNX1 aberration. The occurrence of only this antigen was enforced in the premises of rules indicating this decision class. The following rule was achieved:

Aberration t(v;11q23)/KMT2A
In the case of the t(v;11q23)/KMT2A aberration we searched for the rules which described the "KMT2A" decision class and contained at least one of CD45, CD24 antigens. The following three strong rules were found:

Hyperdiploidy
In the user-driven rule induction for the hyperdiploidy decision problem it was checked whether the CD123 antigen is enough to precise distinguish patients with and without hyperdiploidy. For this purpose, there two rules: In the next experiment then, the user-driven rule induction was launched. It was enforced that at least the CD123 attribute has to occur in the rule premise. Three interesting rules indication the "hyperdiploidy" decision class were induced:

Discussion
Since the 'classical' approach of FC data analysis based on 2-dimensional dot plots examination is not sufficient to detect possible underlying genetic aberrations, we employed advanced analysis tools (including RPART decision tree induction and Gradient Boosting Machines) to find correlations between the entire immunophenotype of leukemic blasts, characterized quantitatively with 17 markers and genetic abnormalities of patients with BCP-ALL. We focused on identification of the three most frequent genetic abnormalities: ETV6-RUNX1, KMT2A rearrangements, hyperdiploidy and none of the mentioned aberration status. We also tried to identify specific differences in expression of the studied markers to discriminate between ETV6-RUNX1 and hyperdiploidy.
In the majority of previous studies in this field, expression of markers in terms of correlation with genetic results was described only qualitatively (as positive or negative) [3,6,25]. Different approach was proposed by Tsagarakis et al. [2] who used an original scoring system to estimate the chance of given aberration to occur based on expression of antigens presented as percentage of blasts positive for given marker.
We proposed to use a normalized (graded) scale of antigen expression based on median fluorescence intensity. Gradual antigen expression scale enabled to investigate for possible genetic abnormalities associated with certain level of expression of several markers simultaneously. We showed that, indeed, high expression levels of given markers is associated with occurrence of particular aberration, while moderate or low can indicate otherwise. Furthermore, also on the statistical field, modern and innovative approach was applied. Combination of the above allowed to find specific antigenic patterns, with gradually assessed expression, strongly associated with aberrations.

t(12;21)/ETV6-RUNX1
We have come to the conclusion that CD10 is the most important marker in both predictive and explanatory analysis and showed that very high expression of CD10 corre-lates with ETV6-RUNX1 occurrence, while moderate and low expression of CD10 is more frequent in patients without this aberration. De Zen et al. also previously proposed that CD10 increases probability of ETV6-RUNX1 [26].
Low expression of CD34 and CD45 was also important in both predictive and explanatory analysis. This was already observed by Tsagarakis et al. [2], who proposed that weak expression of CD34 (along with other conditions-see full rules in results section) is strongly associated with ETV6-RUNX1. De Zen et al. [26] have also proposed that the lack of CD45 and CD34 increases the chance of ETV6-RUNX1. On the other hand, Borowitz et al. suggested that this translocation might be related to high expression of CD45 however the specificity of their results was low [27]. Hrusak et al. observed that lack of CD66c antigen on blast cells is more frequent in patients with t(12;21) which can also be seen in our results as the expression of CD66c marker was one of the most important antigens associated with ETV6-RUNX1 in predictive analysis [28]. Our research also confirms previous findings that expression of CD24 is associated with ETV6-RUNX1 [3], however, the use of gradual expression scale pointed out more precisely that moderate expression of CD24 on blast cells has the strongest correlation. For the first time, we have demonstrated that low expression of CD38 could exclude the occurrence of ETV6-RUNX1 whereas low expression of TdT and specific expression of CD81 (along with other conditions-see full rules in results section) positively correlates with ETV6-RUNX1 occurrence.

t(v;11q23)/KMT2A
It is already well known that positive expression of ectopic NG2 marker on blast cells in BCP-ALL is strongly associated with different KMT2A gene rearrangements [1][2][3][4]6]. Since the presence of NG2 marker was from the statistical point of view strong enough to serve as a predictive marker of KMT2A rearrangement occurrence, we constructed the decision tree without NG2. Our results show that even without using NG2, the occurrence of KMT2A rearrangements can be indicated by the lack of CD10, CD34 and low TdT expression. This phenotype corresponds to pro-B-ALL subtype by EGIL (European Group for the Immunological Characterization of Leukemias). Association of the pro-B-ALL phenotype with KMT2A rearrangements was already confirmed in the past by other research groups [1,6]. Added value in KMT2A rearrangements prediction can also be assigned to low (or lack of) CD24 expression which is concordant to the findings of Hrusak et al. [3]. Tsagarakis et al. [2] proposed phenotype of blast cells: CD10-,CD34+/-, NG2+, CD15+, CD9+, CD38+, CD123+, CD20-, CD25-, cIgM-, CD27-, CD66c-as a phenotype strongly associated with KMT2A rearrangement. We have also confirmed that high expression of CD15 + CD65 markers and low cIgM can often be associated with this aberration.

Hyperdiploidy
CD123 marker was essential for prediction of hyperdiploidy in our analysis. Additional important markers in our both predictive and explanatory analysis were CD24, CD34, CD10 and CD9. Although there are some reports describing these markers, gradual assessment of expression allowed us to produce results with high specificity and decent odds ratio. CD123 is commonly known to be associated with hyperdiploidy [7,8]. Furthermore, other studies suggest that lack of CD45 [3,29] and high expression of CD66c and CD34 on blast cells is often seen in hyperdiploid BCP-ALL cases. In our results, the overexpression of CD34 stands out and confirms previous findings. We also associate positive CD66c expression and low CD45 with hyperdiploidy.

No Aberration
We have proposed "no aberration" class to test what antigenic patterns can exclude the occurrence of the three aberrations analyzed in this study. The following phenotype: CD123-/NG2-/CD45 low+/CD22 low+/CD81 high+ indicates that none of the studied aberrations is present, which was not reported previously. Notably, high expression level of CD81 is of added value to exclude the common aberrations which is a novelty in this matter.
Absence of NG2 and CD123 is not surprising, as these markers are strongly associated with KMT2A and hyperdiploidy, respectively.

t(12;21)/ETV6-RUNX1 vs. Hyperdiploidy
We have observed some overlap between phenotypic features associated with ETV6-RUNX1 and hyperdiploidy. Therefore, we decided to test what parameters could distinguish one aberration from another. The most useful discriminative markers were CD66c and CD123. Even low expression of these two antigens is more frequent in patients without ETV6-RUNX1 but with hyperdiploidy. Another useful marker is CD10-high expression (especially with conditions described above) indicates hyperdiploidy while low positive would be more frequent in patients with ETV6-RUNX1.

Conclusions
With the use of the cross-validation methodology, we presented the efficiency of detecting aberrations based on the antigen expression level ( Table 4). The efficiency values reported in Table 4 are slightly lower than those reported for statistical analysis without independent test data set isolation [2]. However, the values are still high, particularly in Positive Predictive Value and Negative Predictive Value.
Furthermore, by using advanced methods of data analysis, we have expanded and specified the meaning of particular expression levels of these markers by using gradual expression assessment rather than percentage positivity or qualitative annotations (positive, negative, dim, heterogeneous etc.). In a descriptive analysis we have shown that a multidimensional analysis based on rule induction makes it possible to identify-within the groups of patients with (or without) a given aberration-subgroups of patients with significantly higher chance of a given aberration occurrence (no occurrence) (Rules 1-10). Particularly, it was demonstrated for the so called positive rules (1,3,5,7,9) (Figure 4) how adding successive conditions increases this chance.
Finding immunophenotype-genotype correlations in pediatric ALL patients would be useful to predict the prognosis at the initial stages of diagnostics.
We have also found new associations, i.e., high levels of CD81 might be useful for exclusion of the common aberrations (ETV-RUNX1 and KMT2A gene rearrangements as well as numerical aberration-hyperdiploidy) occurrence which was not reported before.
Our research confirms previous findings about antigens expression associated with particular genetic aberrations, such as CD10, CD22, CD24, CD34, CD66c, CD45, NG2, CD123. A potential constraint of the study could be the lack of cell viability staining. However, in our opinion this might have only a minor influence on the MFI of particular antigens.
Additionally, we created (and made available) an application for users to predict the occurrence of aberrations studied herein, based on the expression of multiple markers evaluated in a normalised fashion by flow cytometry. The application allows: • to recalculate the antigen expression level to the nMFI scale; • to make a decision about the aberrations occurrence based on the antigen expression (it is possible to classify a single example or a set of examples); • to explain the reasons of classifier-made decisions for each of the considered decision problems.
The full utility of the application requires following strict laboratory sample handling procedures and flow cytometer settings (i.e., EuroFlow or other standardized protocols). This applies to both patient and normal bone marrow samples, which are necessary for construction of normalized marker expression scale, as described in methods section.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/jcm11092281/s1. Machine learning based analysis of relations between antigen expression and genetic aberrations in childhood acute lymphoblastic leukaemia-Supplementary Material.