Deep Learning Feature Extraction Approach for Hematopoietic Cancer Subtype Classification

Hematopoietic cancer is a malignant transformation of immune system cells. It is characterized by the cells in which it arises, so its heterogeneities within the hematopoiesis process are usually difficult to distinguish. Traditional approaches to cancer subtyping use statistical techniques. Moreover, in the case of a minor cancer, there is not enough sample material to build a classification model, which leads to overfitting on small samples. Therefore, we propose not only to build a classification model for five major subtypes using two kinds of losses, namely reconstruction loss and classification loss, but also to extract suitable features using a deep autoencoder. Furthermore, to address the data imbalance problem, we apply an oversampling algorithm, the synthetic minority oversampling technique (SMOTE). To validate our proposed autoencoder-based feature extraction approach for hematopoietic cancer subtype classification, we compared it with traditional feature selection algorithms (principal component analysis, non-negative matrix factorization) and classification algorithms combined with the SMOTE oversampling approach. Additionally, we used the Shapley Additive exPlanations (SHAP) interpretation technique in our model to explain which genes/proteins are important for hematopoietic cancer subtype classification. Furthermore, we compared five widely used classification algorithms: logistic regression, random forest, k-nearest neighbor, artificial neural network and support vector machine.
The autoencoder-based feature extraction approaches showed good performance. The best result was obtained by the SMOTE-oversampled support vector machine combined with the autoencoder (AE) feature extraction approach that considers both focal loss and reconstruction loss as the loss function, which produced 97.01% accuracy, 92.68% precision, 94.60% recall, 99.52% specificity, 93.54% F1-measure, 97.87% G-mean and 95.46% index of balanced accuracy as subtype classification performance measures.


Introduction
Many bioinformatics techniques have been developed over the past several decades for the detection and diagnosis of incurable diseases such as cancer [1]. However, dealing with cancer patients remains challenging. The development of appropriate classification models using genes extracted from patients is useful for the early diagnosis of both patients and healthy people. Cancer is a major cause of death and involves abnormal cell differentiation [2]. It has many causes, but the majority of cancers (90~95%) are due to genetic mutations arising from lifestyle factors such as smoking, obesity, alcohol and so on. The remaining 5~10% are caused by inherited genes [3].
A hematopoietic malignancy is a neoplasm arising from hematopoietic cells in the bone marrow, lymph nodes, peripheral blood and lymphatic system, which comprise the organs of the hematopoietic system. Furthermore, it can be found in other organs such as the gastrointestinal system and the central nervous system [4]. Given that hematopoietic malignancies occur in the hematopoietic system, they are called liquid tumors. While uncommon in other cancers, chromosomal translocations are a common cause of these diseases. Hematopoietic cancer accounts for 8~10% of all cancer diagnoses, and its share of cancer mortality is similar [5].
Historically, hematopoietic cancer is divided by whether the malignant location is in the blood or the lymph nodes. However, in 2001, the World Health Organization (WHO) introduced the WHO classification of tumors of hematopoietic and lymphoid tissue as a standard and updated it in 2008 and 2016 [6]. This WHO classification criterion focused on cell linkage rather than the location of the occurrence. According to the WHO classification, hematopoietic malignancies are mainly divided into leukemia, lymphoma and myeloma [7].
Leukemia is one type of hematopoietic cancer that results from genetic changes in hematopoietic cells in the blood or bone marrow. If an abnormality occurs in the bone marrow, the abnormally generated blood cells mix with blood in the body and spread widely through the blood stream [8]. Most leukemia cases are diagnosed in adults aged over 65 years, but leukemia is also commonly observed in children under the age of 15 [7]. The American Cancer Society (ACS) estimated in 2020 that the United States would see about 60,530 new cases of and 23,100 deaths from leukemia [9]. Lymphoma is usually found in distinct stationary masses of lymphocytes, such as the lymph node, thymus or spleen. Like leukemia, lymphoma can also travel through the whole body by the blood stream. Commonly, lymphoma cases are divided into Hodgkin lymphoma, non-Hodgkin lymphoma, acquired immune deficiency syndrome (AIDS)-related lymphoma and primary central nervous system (CNS) lymphoma [8]. For 2020, the ACS estimated approximately 85,720 new cases of and 20,910 deaths from lymphoma [9]. Myeloma is a tumor that occurs in plasma cells, which are differentiated from bone marrow, blood or other tissue. Plasma cells generate antibodies that protect against disease or infection, but when they develop abnormally, antibody generation is disturbed, causing confusion in the human immune system [8]. The ACS estimated 32,270 new cases and 12,830 deaths from myeloma for 2020 [9].
Over the years, various data mining approaches have been applied in many cancer research studies. In particular, deep learning methods have been applied in this area [19][20][21][22][23]. Ahmed M et al. [19] developed a breast cancer classification model using deep belief networks in an unsupervised part for learning input feature statistics; in the supervised part, they adopted the conjugate gradient and Levenberg-Marquardt algorithms.
Furthermore, there are several studies on cancer subtype classification. One study [20] used a deep learning approach for kidney cancer subtype classification using miRNA data from The Cancer Genome Atlas (TCGA), which contained five different subtypes; it employed neighborhood component analysis for feature extraction and a long short-term memory (LSTM)-based classifier. The DeepCC [22] architecture, using the gene set enrichment analysis (GSEA) method and an artificial neural network, generated deep cancer subtype classification frameworks and compared machine learning algorithms such as support vector machine, logistic regression and gradient boosting on a colon cancer dataset from TCGA containing 14 subtype labels. The Deeptype [23] framework was built for cancer subtype classification based on PAM50 [24]; it used a multi-layer neural network structure to project data into a representation space.
Traditional approaches [10][11][12][13][14][15][16][17][18] have used statistical techniques for cancer classification. Sun et al. [17] used an entropy-based approach for feature extraction in a cancer classification model. However, this method has a disadvantage: multiple classes cannot be handled at once, because each cancer is classified one by one through binary classification. In the case of Deeptype [23], clustering is established using a specific gene set called PAM50, previously known for breast cancer.
However, since this PAM50 indicator is an already known indicator, subtypes can be classified using valid prior information about breast cancer. Above all, for other cancers, including hematopoietic cancer, there is no such gene set for subtype classification. In view of this, our work differs in performing feature extraction and subtype classification only from the gene expression data of hematologic cancer. To overcome the drawback of the multi-class classification task and the limitations due to the absence of a gene set, we applied an autoencoder-based feature extraction method from among deep learning methods. In addition, we propose a subtype classification method in which the reconstruction error generated by the autoencoder and the classification error generated by the classification model are merged and used as the loss function, referring to the loss function design of Deeptype [23].
The goal of this work is to develop an autoencoder-based feature extraction approach for hematopoietic cancer subtype classification. We not only focus on the five subtypes of hematopoietic cancer and classify them by applying deep learning techniques, but also perform feature extraction and the calculation of two kinds of errors: reconstruction error and classification error. In this process, a deep autoencoder is first used to extract features suited to building a classification model, and then the reconstruction error and classification error (cross-entropy loss and focal loss) are calculated to account for the data imbalance problem when building the classification model using the extracted features. To validate the deep autoencoder-based classification model, we compared other traditional feature selection algorithms and classification algorithms with an oversampling approach. We compared five widely used classification algorithms, including logistic regression (LR), random forest (RF), k-nearest neighbor (KNN), artificial neural network (ANN) and support vector machine (SVM).
We compared our proposed method with traditional cancer classification and cancer subtype classification methods, such as data mining and machine learning approaches, which were not end-to-end. Our end-to-end approach comprises multiple steps, including feature engineering, data imbalance handling and a classification task. The objectives of this study are to extract features with a deep learning-based approach on gene expression data for predicting hematopoietic cancer subtypes and to develop an end-to-end deep learning-based classification model. The major contributions of this study are, briefly, as follows:

•	We propose an end-to-end approach without any manual engineering, which classifies hematopoietic cancer subtypes;
•	We adopt a non-linear transformation step by using a deep autoencoder to select deep features from gene expression data in hematopoietic cancer by adopting a deep learning architecture-based feature engineering task;
•	We implement a mixed loss function for the proposed deep learning model, considering both the compression of knowledge representation and the data imbalance problem.
The remainder of this paper is organized as follows: Section 2 introduces the hematopoietic cancer gene expression dataset from TCGA. Furthermore, the proposed deep autoencoder-based approach is explained in detail. In Section 3, the experimental results are provided. Finally, Section 4 discusses the experimental results with our conclusion.

Dataset
TCGA is a repository that contains plenty of information and data related to human cancer. As of 2020, it covers 47 types of cancer, and for each cancer various kinds of data, such as gene expression, clinical data and methylation data, are provided from large numbers of patients [25]. Although some raw data, which include original sequence information, are treated as controlled data whose use in experiments must be approved by TCGA, most data are freely accessible to researchers. We collected TCGA gene expression data from 2457 patients with hematopoietic cancer. The collected gene expression data cover five subtypes of hematopoietic cancer: lymphoid leukemia, myeloid leukemia, leukemia not otherwise specified (NOS), mature B-cell leukemia and plasma cell neoplasm. The sample sizes of the hematopoietic subtypes are 550 lymphoid leukemia cases, 818 myeloid leukemia, 104 leukemia NOS, 113 mature B-cell leukemia and 860 plasma cell neoplasm. Furthermore, these data contain 60,453 exons' information with one gene expression profiling measurement. The level of gene expression is the fragments per kilobase per million mapped fragments (FPKM) measure [26], which can be calculated as FPKM = (number of fragments mapped to the gene × 10^9)/(total number of mapped fragments × exon length in base pairs). FPKM is a normalized estimation of gene expression based on RNA-seq data considering both the number of reads and the length of the exon, measured in kilobases. That is, a large FPKM means a large amount of expression per unit length, so the FPKM of a certain gene reflects its relative expression level.
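As an illustration, the FPKM normalization described above can be computed as follows (a minimal sketch; the function name and arguments are illustrative, not part of the TCGA pipeline):

```python
def fpkm(read_count, exon_length_bp, total_mapped_reads):
    """FPKM = (read_count * 10^9) / (total_mapped_reads * exon_length_bp).

    The 10^9 factor combines per-kilobase (10^3) and per-million (10^6)
    normalization, so expression is comparable across genes and libraries.
    """
    return read_count * 1e9 / (total_mapped_reads * exon_length_bp)
```

For example, 1000 fragments mapped to a 2 kb exon model in a library of one million mapped fragments gives an FPKM of 500.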
The statistics of hematopoietic cancer are shown in Table 1. In a preprocessing step, we eliminated noisy and non-valued instances. These preprocessed data were used for the subtype classification in this experiment; they were divided into 80% for training and 20% for testing. However, as introduced above, the dataset was considerably imbalanced. Due to this data imbalance problem, we applied a cost function for the classification and feature extraction as well as an oversampling method. We also used an autoencoder-based model to extract the highly related gene expression features and compared this algorithm with other traditional dimension reduction algorithms.

Proposed Autoencoder-Based Approach
In the experiment, we propose a deep learning-based hematopoietic cancer subtype classification approach. Figure 1 shows the proposed approach which inputs the hematopoietic cancer gene expression data from TCGA and outputs the subtype classification result. This approach consists of an autoencoder feature extraction part and a machine learning-based cancer subtype classifier.
Figure 1. Overview of proposed autoencoder-based classification approach. We used hematopoietic cancer gene expression data from The Cancer Genome Atlas (TCGA). The deep autoencoder (DAE) model was used to extract deep features from these gene expression data as a lower-dimensional vector. In this study, we use an autoencoder (AE) and a variational autoencoder (VAE) as DAEs. The classifier is used to classify hematopoietic cancer subtypes. We summed the reconstruction loss on the DAE and the classification loss in the cost function. LR: logistic regression; RF: random forest; SVM: support vector machine; RBF: radial basis function; KNN: k-nearest neighbor; ANN: artificial neural network.
In the DAE structure, we employed the mean squared error (MSE) for measuring deep learning reconstruction loss when training the training set and adopted focal loss [27] as a measurement of the classification error in the classifier. Focal loss (FL) is an updated version of cross-entropy loss, which was used for class imbalance encountered during the model training. Therefore, our proposed autoencoder-based hematopoietic cancer subtype classification approach used the sum of both MSE as reconstruction loss and FL as classification loss as a cost function for this approach.
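The combined cost function described above, summing MSE reconstruction loss and focal classification loss, can be sketched in PyTorch as follows (an illustrative sketch; the function and variable names are assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def total_loss(x, x_hat, logits, targets, gamma=2.0):
    """Sum of reconstruction loss (MSE) and classification loss (focal loss)."""
    # Reconstruction loss: mean squared error between input and AE output.
    recon = F.mse_loss(x_hat, x)
    # Focal loss: cross-entropy down-weighted for well-classified examples.
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                        # probability of the true class
    focal = ((1.0 - p_t) ** gamma * ce).mean()
    return recon + focal
```

A well-classified batch with perfect reconstruction yields a loss near zero, while hard, misreconstructed examples dominate the gradient.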
We performed this experiment on an Intel Xeon E3 1231 v3 processor with 32G memory and RTX 2060 (Gigabyte, New Taipei City, Taiwan). Additionally, we used Python 3.7 for parsing the data and analysis by implementing deep learning and machine learning libraries. The whole process of this experiment and the methodologies including the machine learning and deep learning approaches performed are explained in detail in the next section.


Feature Extraction using Deep Learning Approach on Gene Expression Data
In this research, we used a DAE-based feature selection approach. The autoencoder structure has a strong point in non-linear feature selection and transformation. Additionally, we compared this DAE-based approach with traditional statistical feature selection approaches, namely Principal Component Analysis (PCA) [28] and Non-negative Matrix Factorization (NMF) [29]. PCA is one of the most popular statistical techniques, relating factor analysis to multivariate analysis. This algorithm aims to represent the characteristics of a dataset as a small set of factors that keeps the important information. Furthermore, NMF is also available for multivariate analysis. This algorithm, based on linear algebra, decomposes complex feature information into smaller non-negative matrices. Generally, PCA tends to group both positively and negatively correlated components; on the other hand, NMF divides factors into positive vectors. These kinds of statistical factor analyses are subject to a linearity constraint, so we applied DAE techniques with non-linear computations to obtain better classification results.
We applied two DAE models. One was a standard autoencoder model (AE) [30,31] and the other was a variational autoencoder model (VAE) [32,33]. Both autoencoder models were constructed using the Pytorch deep learning library in Python [34]. The structure of an AE is divided into two main parts: encoder and decoder. The encoder has an input layer with three fully connected hidden layers with 2000, 500 and 100 nodes. The decoder is comprised of two hidden layers, which are fully connected. The details of the autoencoder are explained in Appendix A.
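A minimal PyTorch sketch of the AE layer sizes described above (the input dimension and activation choices are assumptions here; the paper's exact configuration is in its Appendix A):

```python
import torch
import torch.nn as nn

class DeepAutoencoder(nn.Module):
    """Encoder 2000-500-100 and a mirrored fully connected decoder."""
    def __init__(self, input_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 2000), nn.ReLU(),
            nn.Linear(2000, 500), nn.ReLU(),
            nn.Linear(500, 100),
        )
        self.decoder = nn.Sequential(
            nn.Linear(100, 500), nn.ReLU(),
            nn.Linear(500, 2000), nn.ReLU(),
            nn.Linear(2000, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)            # 100-dimensional deep features
        return self.decoder(z), z      # reconstruction and features
```

The 100-dimensional bottleneck vector `z` is what is passed to the downstream subtype classifier.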
Furthermore, similar to the AE structure, VAEs also have an encoder and a decoder. The VAE is commonly used in semi-supervised learning models nowadays. Additionally, it can learn approximate inference and be trained using the gradient descent method. The main approach of the VAE is to use a probability distribution and obtain samples which match that distribution. The details of the variational autoencoder are explained in Appendix B.
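A minimal sketch of the VAE encoder idea, sampling a latent vector from the learned distribution via the reparameterization trick so that gradient descent still works (layer sizes and names are illustrative assumptions; the paper's details are in Appendix B):

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Maps input to mean/log-variance and samples a latent vector."""
    def __init__(self, input_dim, latent_dim=100):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, 500), nn.ReLU())
        self.mu = nn.Linear(500, latent_dim)
        self.logvar = nn.Linear(500, latent_dim)

    def forward(self, x):
        h = self.hidden(x)
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)   # differentiable sampling
        return z, mu, logvar
```

The sampled `z` plays the same role as the AE bottleneck features, while `mu` and `logvar` feed the KL-divergence term of the VAE loss.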

Hematopoietic Cancer Subtype Classification Model
For the subtype classification of hematopoietic cancer, we generated a feed-forward neural network (NN) which has one input layer, one hidden layer with 100 nodes and one output layer. The features extracted by the DAE were used as the input of the NN. Here, we adopt ReLU and sigmoid as non-linear activation functions. This NN has two loss functions for measuring the classification error, which is the error between the real value and the predicted value. The details of the NN are shown in Appendix C.
The two loss functions are cross-entropy loss (CE) and focal loss (FL). CE is the most widely used loss function in classification models. However, when there are many class labels with an imbalanced distribution, it incurs a loss of non-trivial magnitude. If there are many examples that are easy to classify, which indicates a large class imbalance in the dataset, the CE contribution of the major class labels will be larger than that of the minor classes.
For handling this class imbalance problem, a modulating factor is added to CE, defining focal loss (FL). This modulating factor in focal loss is (1 − p_i)^γ. Here, γ ≥ 0 is a focusing parameter that can be tuned. Due to this property, using FL can prevent the overfitting problem which can accompany class imbalance. The details of CE and FL are described in Appendix D.
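The effect of the modulating factor can be seen numerically. In the sketch below, `p_true` is the predicted probability of the true class, and setting γ = 0 recovers plain cross-entropy:

```python
import math

def focal_loss_term(p_true, gamma=2.0):
    """FL = -(1 - p)^gamma * log(p) for true-class probability p."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

# An easy example (p = 0.95) is down-weighted by (0.05)^2, so it
# contributes far less than a hard example (p = 0.3):
easy = focal_loss_term(0.95)
hard = focal_loss_term(0.30)
```

This down-weighting is why FL keeps abundant, easily classified majority-class examples from dominating the total loss.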

Training the Models
The loss functions of the autoencoder models are calculated from the difference between the input and the output; the reconstruction loss is the mean squared error MSE = (1/n) Σᵢ (xᵢ − x̂ᵢ)², where xᵢ is the input and x̂ᵢ its reconstruction. In this experiment, we adopted the Adam optimizer [35] for updating the weights and biases iteratively based on the training data. This approach has several merits. One is that rescaling of the gradient does not affect the step size. Another is that Adam can refer to previous gradients when updating the step size. We performed several trials to define the learning rate and set it to 0.0001. The batch size of the training set was 128, with the maximum epoch size set to 3000 and an early stopping approach for finding the optimal epoch number.
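The training setup above (Adam, learning rate 0.0001, up to 3000 epochs with early stopping) can be sketched as follows; the patience value, validation split and full-batch updates are simplifying assumptions for brevity (the paper uses mini-batches of 128):

```python
import copy
import torch

def train_with_early_stopping(model, loss_fn, train_x, val_x,
                              lr=1e-4, max_epochs=3000, patience=50):
    """Train with Adam, keeping the weights with the best validation loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state, since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        opt.zero_grad()
        loss = loss_fn(model, train_x)
        loss.backward()
        opt.step()
        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model, val_x).item()
        if val_loss < best_loss:
            best_loss, best_state = val_loss, copy.deepcopy(model.state_dict())
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:   # stop when validation loss stalls
                break
    model.load_state_dict(best_state)
    return model
```

Restoring the best checkpoint at the end is one common way to realize the "optimal epoch number" the early stopping search looks for.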

Results of Feature Selection
We extracted key features from hematopoietic cancer gene expression data by using DAE approaches. To compare this result with traditional feature selection algorithms, we used the PCA and NMF methods. For this comparison, we coded the PCA and NMF algorithms using Scikit-learn [36] and the DAE model using a Python deep learning library, Pytorch [34].
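A minimal Scikit-learn sketch of the PCA and NMF baselines (the component count and the random input matrix are illustrative placeholders for the gene expression matrix):

```python
import numpy as np
from sklearn.decomposition import PCA, NMF

# Hypothetical expression matrix: rows = samples, columns = genes (FPKM >= 0).
rng = np.random.default_rng(0)
X = rng.random((20, 50))

pca_features = PCA(n_components=10).fit_transform(X)   # linear components
nmf_features = NMF(n_components=10, init="nndsvda",
                   max_iter=500).fit_transform(X)      # non-negative factors
```

Note that NMF requires a non-negative input, which FPKM values satisfy, and returns only non-negative factor loadings, whereas PCA components mix positive and negative weights.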
For a fair comparison, we used t-Distributed Stochastic Neighbor Embedding (tSNE) [37], which converts high-dimensional data into a low-dimensional embedding for visualization. It converts high-dimensional Euclidean distances between data points into conditional probabilities for mapping into a low-dimensional space and adopts KL-divergence [38] to minimize mismatch in the low-dimensional data representation. Using this technique, researchers can acquire a more interpretable visualization of high-dimensional data. In this experiment, we used the tSNE technique to map the DAE-selected features onto a two-dimensional plane (dimensions X, Y), and the visualization of the extracted features of each approach is shown in Figure 2.
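A minimal sketch of the tSNE mapping described above (the perplexity and input sizes are illustrative assumptions; in the experiment the inputs are the extracted features):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.random((60, 100))      # stand-in for 100-d DAE features

# Map to a 2-D plane (dimensions X, Y) for visualization.
embedding = TSNE(n_components=2, perplexity=20,
                 random_state=0).fit_transform(features)
```

The resulting two columns are the X and Y coordinates plotted in figures such as Figure 2.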
Among the visualizations of the selected gene expression data using several approaches, we can see that the result of the DAE-based AE model is the most clearly distinguished compared with the other results.


Results of DAE Training Process
We extracted key features on hematopoietic cancer gene expression data by using DAE approaches which consist of AE and VAE. Each DAE approach was trained with 3000 epochs for each iteration. We also calculated the loss functions MSE, CE and FL. During the feature extraction process, we calculated the MSE from the AE model, and classification loss (CE and FL) was calculated on the model training class. Furthermore, the total loss was generated by merging MSE and classification loss. Figures 3 and 4 show the loss function graphs for the cancer classification using MSE as the reconstruction error, CE and FL as the classification error and total error as the merged MSE and CE on AE and VAE, respectively. Figures 5 and 6 show loss function graphs using MSE as the reconstruction error, FL as the classification error and total error as the calculated sum of MSE and FL on AE and VAE, respectively.


Hematopoietic Cancer Subtype Classification Model Interpretation
The Shapley Additive exPlanations (SHAP) method [39], which is based on game theory [40] and local explanations [41], is often used to describe a model's output. The top 20 important bio-markers for hematopoietic cancer subtype classification, which were derived from the autoencoder, are shown in Table 2 and in Figure 7 as a SHAP summary plot.
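SHAP attributes a model's output to its input features using Shapley values from cooperative game theory. As an illustration of that underlying quantity (a from-scratch sketch, not the SHAP library's API), exact Shapley values for a small game can be computed directly:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values for a cooperative game.

    players: list of player names; value: function mapping a frozenset
    of players to the coalition's payoff. Each player's Shapley value is
    its average marginal contribution over all coalition orderings.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):
            for coal in combinations(others, r):
                s = frozenset(coal)
                # weight = |S|! * (n - |S| - 1)! / n!
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi
```

For model interpretation, the "players" are features (here, genes) and the payoff is the model's prediction; SHAP approximates these values efficiently, since the exact sum above is exponential in the number of features.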
As shown above in Figure 7, the most important bio-marker for hematopoietic cancer subtype classification is Ring Finger Protein 130 (RNF130). Including RNF130, the top 20 bio-markers are well known as oncogenes, and some bio-markers are directly related to hematopoietic processes. For example, RNF130 is usually related to pathways of the innate immune system and Class I MHC (Major Histocompatibility Complex)-mediated antigen processing and presentation. This bio-marker is related to growth factor withdrawal-induced apoptosis of myeloid precursor cells [42]. Another example is Breast Cancer Anti-Estrogen Resistance Protein 1, Crk-Associated Substrate (BCAR1). Overexpression of BCAR1 is usually detected in many cancers such as breast cancer, lung cancer, anaplastic large cell lymphoma and chronic myelogenous leukemia [43].


Hematopoietic Cancer Subtype Classification Model Evaluation
For evaluating the hematopoietic cancer subtype classification, we used six classification performance metrics: accuracy (Acc), precision (Pre), recall (Rec), the harmonic mean of precision and recall, which is called F1-measure (F1), geometric mean (G-mean, GM) and index of balanced accuracy (IBA, α = 0.1). Furthermore, we generated a confusion matrix and a precision-recall curve (PR-curve), which were used for evaluating the imbalanced data classification performance. The equations below indicate the classification performance measurements:

Accuracy (Acc) = (TP + TN)/(TP + TN + FP + FN)
Precision (Pre) = TP/(TP + FP)
Recall (Rec) = TP/(TP + FN)
Specificity (Spe) = TN/(TN + FP)
F1-measure (F1) = 2 × Pre × Rec/(Pre + Rec)
Geometric mean (GM) = √(Rec × Spe)
Index of balanced accuracy (IBA) = (1 + α × (Rec − Spe)) × GM²

where TP, TN, FP and FN are the acronyms of true positive, true negative, false positive and false negative, respectively. TP and TN are the numbers of subtypes correctly classified into the positive class or negative class, respectively; FP represents the number of instances incorrectly classified into the positive class. Similarly, FN is the number of instances incorrectly classified into the negative class.
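The metrics above can be computed from one-vs-rest confusion matrix counts as follows (a minimal sketch; the function name is illustrative):

```python
import math

def subtype_metrics(tp, tn, fp, fn, alpha=0.1):
    """Binary (one-vs-rest) versions of the evaluation metrics."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)          # sensitivity
    spe = tn / (tn + fp)          # specificity
    f1 = 2 * pre * rec / (pre + rec)
    gm = math.sqrt(rec * spe)
    iba = (1 + alpha * (rec - spe)) * gm ** 2   # index of balanced accuracy
    return {"Acc": acc, "Pre": pre, "Rec": rec, "Spe": spe,
            "F1": f1, "GM": gm, "IBA": iba}
```

G-mean and IBA are preferred under class imbalance because, unlike accuracy, they penalize models that score well on the majority class while missing the minority class.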
The overall designed flowchart of this experiment is shown in Figure 8. Tables 3-7 represent the results of all combinations of the experiments, which include statistical approaches (PCA, NMF) and deep learning approaches (AE, VAE). Each results table includes SMOTE for handling class imbalance on the dataset. The CE, FL, RE, TOC and TOF keywords in the loss function column represent cross-entropy, focal loss, reconstruction error, cross-entropy + reconstruction error and focal loss + reconstruction error respectively.
Furthermore, for verifying the DAE-based cancer subtype classification models, we compared these models with traditional statistics and machine learning-based classification algorithms such as logistic regression (LR) [44,45], random forest (RF) [46,47], k-nearest neighbor (KNN) [48], artificial neural network (ANN) [49] and support vector machine (SVM) [50]. For the class imbalance problem in the classification task, we used an oversampling algorithm named the synthetic minority oversampling technique (SMOTE) [51,52].
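The core SMOTE idea, creating synthetic minority samples by interpolating between a minority sample and one of its k nearest minority neighbors, can be sketched as follows (an illustrative NumPy sketch, not a production oversampling implementation):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from minority-class matrix X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to every minority sample (self included).
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]        # skip self at index 0
        j = rng.choice(neighbors)
        gap = rng.random()                        # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled class occupies the same feature region rather than simply duplicating instances.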
The overall designed flowchart of this experiment is shown in Figure 8. Tables 3-7 represent the results of all combinations of the experiments, which include statistical approaches (PCA, NMF) and deep learning approaches (AE, VAE). Each results table includes SMOTE for handling class imbalance in the dataset. The CE, FL, RE, TOC and TOF keywords in the loss function column represent cross-entropy, focal loss, reconstruction error, cross-entropy + reconstruction error and focal loss + reconstruction error, respectively.
The results of hematopoietic cancer subtype classification using the logistic regression classification algorithm are shown in Table 3. The result of AE using merged loss, which combined cross-entropy loss and reconstruction loss, with SMOTE shows the highest results on accuracy (96.33%), recall (91.73%) and F1-measure (91.99%). In addition, the result of AE using reconstruction error with SMOTE shows the highest results on G-mean (95.33%) and IBA (95.24%). Table 4 shows the results of hematopoietic cancer subtype classification using the k-nearest neighbor classification algorithm. The result of AE using merged loss, which combined cross-entropy loss and reconstruction loss, with SMOTE shows the highest results on accuracy (96.06%), recall (93.82%) and F1-measure (91.31%), and the result of AE using merged loss, which combined focal loss and reconstruction error, with SMOTE shows the highest results on specificity (99.12%), G-mean (96.59%) and IBA (92.84%).
The results of hematopoietic cancer subtype classification using the random forest classification algorithm are shown in Table 5. The result of AE using merged loss, which combined cross-entropy loss and reconstruction loss, with SMOTE shows the highest results on accuracy (96.60%) and F1-measure (91.82%), and the result of AE using reconstruction error with SMOTE shows the highest results on G-mean (99.12%) and IBA (90.80%).
In Table 6, the results of hematopoietic cancer subtype classification using the support vector machine classification algorithm are shown. The result of AE using merged loss, which combined focal loss and reconstruction loss, with SMOTE shows the best results on the evaluation metrics of accuracy (97.01%), precision (92.68%), recall (94.60%), specificity (99.52%), F1-measure (93.54%), G-mean (97.87%) and IBA (95.46%). In Table 7, the results of hematopoietic cancer subtype classification using the artificial neural network classification algorithm are shown. The result of AE using merged loss, which combined cross-entropy loss and reconstruction loss, with SMOTE shows the highest results on accuracy (96.74%), precision (93.62%) and F1-measure (92.47%), and the result of AE using reconstruction error without the oversampling method shows the highest results on G-mean (95.39%) and IBA (90.32%).
Combining all of the above results, the top 10 results based on F1-measure are summarized in Table 8. As a summary of this experiment, we found that the result of autoencoder feature selection based on the support vector machine classification algorithm with the total loss, which combined focal loss and reconstruction loss, with SMOTE showed the best performance in accuracy (97.01%), recall (94.60%), specificity (99.52%), F1-measure (93.53%), G-mean (97.87%) and IBA (95.46%). Figure 9, below, shows the top six PR-curves and Figure 10 shows the top six confusion matrices among the results shown in Table 8.

Figure 10. Confusion matrices for the top six hematopoietic cancer subtype classification results by F1-measure. (a) represents the confusion matrix for the autoencoder using FL + RE with SMOTE on the SVM classifier; (b) the autoencoder using CE + RE with SMOTE on the SVM classifier; (c) the autoencoder using CE + RE with SMOTE on the ANN classifier; (d) the autoencoder using FL + RE with SMOTE on the ANN classifier; (e) the autoencoder using CE + RE with SMOTE on the LR classifier; (f) the autoencoder using CE + RE with SMOTE on the RF classifier.

Discussion
In this paper, we suggested an autoencoder-based feature extraction approach for hematopoietic cancer subtype classification. The five major hematopoietic cancer subtypes were selected to create experimental data based on gene expression level. We also compared our approach with traditional feature extraction algorithms, PCA and NMF, which are widely used in cancer classification based on gene expression data. In addition, in consideration of the class imbalance problem occurring in multi-label classification, we applied the SMOTE oversampling algorithm.
In the experimental results, the traditional feature selection approaches, NMF and PCA, showed good performance, but our proposed DAE-based approach for subtype classification showed better performance. For example, in the results of the SVM classifier using the SMOTE oversampling method, the PCA and NMF feature extraction approaches showed 90.63% and 90.22% accuracy, respectively, while the AE-based feature extraction approaches with cross-entropy error (CE), reconstruction error (RE) and merged error (CE + RE) showed 93.34%, 96.06% and 96.88% classification accuracy, respectively. The pattern was the same when focal loss was applied instead of cross-entropy loss: the accuracy of the AE-based feature extraction approach with focal loss (FL), reconstruction error (RE) and merged error (FL + RE) was 90.08%, 95.65% and 97.01%, respectively. Although SVM showed its best result with the merged error (FL + RE) using focal loss, in the other cases we found that the merged error (CE + RE) using cross-entropy error showed the best performance among the feature extraction approaches. Overall, subtype classification using the DAE-based feature extraction approach showed better performance than traditional statistics and machine learning feature extraction approaches.
Furthermore, as shown in Table 8, when all of the results were summarized, we found that the AE-based feature extraction approach shows better performance than the other feature extraction methods. In addition, when comparing the loss functions, applying both the classification error (FL/CE) and the reconstruction error (RE) together showed better performance than a single loss function, and the sampling comparison also showed better results when the SMOTE oversampling technique was applied.
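To make the merged objective concrete, the NumPy sketch below combines a cross-entropy classification term with an MSE reconstruction term into one total loss. The helper names and the weighting factor `lam` are illustrative assumptions; the paper does not state how the two terms are weighted.

```python
import numpy as np

def cross_entropy(p_pred, y_onehot, eps=1e-12):
    """Mean cross-entropy between predicted probabilities and one-hot labels."""
    return -np.mean(np.sum(y_onehot * np.log(p_pred + eps), axis=1))

def reconstruction_error(x, x_hat):
    """MSE between the original input and the decoder's reconstruction."""
    return np.mean((x - x_hat) ** 2)

def total_loss(p_pred, y_onehot, x, x_hat, lam=1.0):
    """Merged objective (CE + RE): classifier head and decoder are
    trained jointly; lam is an assumed weighting, not from the paper."""
    return cross_entropy(p_pred, y_onehot) + lam * reconstruction_error(x, x_hat)

# Tiny illustrative batch of one sample.
p = np.array([[0.9, 0.1]]); y = np.array([[1.0, 0.0]])
x = np.array([[1.0]]); x_hat = np.array([[0.5]])
loss = total_loss(p, y, x, x_hat)
```

Swapping `cross_entropy` for a focal-loss term yields the FL + RE (TOF) variant reported in the tables.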

Conclusions
In this paper, we focused on the autoencoder-based feature extraction method to extract biological information from complex cancer data such as gene expression, clinical data and methylation data. We evaluated the proposed method on TCGA data samples from 2457 patients with hematopoietic cancer: lymphoid leukemia, myeloid leukemia, leukemia NOS (not otherwise specified), mature B-cell leukemia and plasma cell neoplasm. To the best of our knowledge, there is no other research work on hematopoietic cancer using deep learning-based feature extraction techniques. We compared the proposed autoencoder-based method to the widely used traditional algorithms PCA and NMF, as well as another generative deep learning technique, the VAE. We provide comprehensive experimental results that show the efficiency of our proposed method.
As shown in the experimental results, the proposed method shows higher performance than the other compared techniques in terms of various evaluation metrics. The proposed method with TOF loss, paired with the SVM classifier trained on data oversampled by SMOTE, achieved the highest accuracy (97.01%), precision (92.68%), recall (94.60%), specificity (99.52%), F1-measure (93.53%), G-mean (97.87%) and index of balanced accuracy (95.46%). The learned representations contain rich, valuable information related to hematopoietic cancer which can also be used for other downstream tasks such as regression, classification and survival analysis. We also applied the SHAP feature interpretation technique to our pre-trained model to explain the black box and show the importance of each biomarker. By extracting biomarkers using deep learning structures, this study is expected to help enable gene-specific treatment of patients. Furthermore, it is expected that this model will contribute to public healthcare through its extensibility, as it can be applied not only to cancer but also to various diseases.
In conclusion, we found that our autoencoder-based feature extraction approach for hematopoietic cancer subtype classification showed good classification performance in multiclass classification, and our suggested approach showed better performance than PCA and NMF, which are widely used feature extraction methods for cancer classification. Furthermore, the problem of imbalanced data can be alleviated by applying the SMOTE method.

Figure A1. The procedure of the autoencoder (AE) feature selection method. The encoder consists of three fully connected hidden layers which contain 2000, 500 and 100 nodes, and the decoder consists of two fully connected hidden layers which contain 500 and 2000 nodes. This autoencoder structure is evaluated by reconstruction error using MSE measurement.
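The 2000-500-100 layout described for Figure A1 can be sketched in NumPy as below. The weights are random placeholders and the input dimension (`n_in = 4000`) is an illustrative assumption, so the sketch demonstrates the layer shapes and the MSE reconstruction check rather than a trained model; the tied (transposed) decoder weights follow the paper's description.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
n_in = 4000                              # illustrative input dimension
sizes = [n_in, 2000, 500, 100]           # encoder layout from Figure A1

# Random placeholder weights; a trained AE would learn these.
Ws = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(sizes, sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

def encode(x):
    for W, b in zip(Ws, bs):
        x = relu(x @ W + b)
    return x                              # 100-d latent feature

def decode(z):
    # Decoder mirrors the encoder (500 and 2000 nodes), reusing the
    # transposed encoder weights (biases omitted for brevity).
    for W in reversed(Ws):
        z = relu(z @ W.T)
    return z

x = rng.standard_normal((1, n_in))
z = encode(x)                             # extracted feature vector
x_hat = decode(z)
mse = np.mean((x - x_hat) ** 2)           # reconstruction error (MSE)
```

The 100-dimensional `z` is what the downstream classifiers (LR, RF, KNN, ANN, SVM) consume as input features.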


Appendix B. Variational Autoencoder
Similar to the AE structure, the VAE also has an encoder and a decoder. The VAE is widely used in semi-supervised learning models. Furthermore, it can learn approximate inference and be trained using the gradient descent method. The main approach of VAEs is to use a probability distribution to obtain samples which match that distribution. The VAE calculates its objective function using reconstruction loss and Kullback-Leibler divergence (KL-divergence) [38].
The VAE objective function, known as the variational lower bound, is

L = E[log P(X|Z)] − D_KL[Q(Z|X) || P(Z)] (A3)

where L is the VAE objective function, X is the input data and Z is a latent variable. The left term is the reconstruction loss: it makes the decoder reconstruct the input. The right term is the KL-divergence, which minimizes the difference between the encoder's distribution Q(Z|X) and the prior distribution P(Z). For calculating this distribution, the encoder generates a mean and a standard deviation. The goal of this objective function is to maximize the variational lower bound through the maximization of data generation and the minimization of the KL-divergence. In this paper, the VAE shared its construction with the AE; however, because the VAE considers the distribution of the input, it applies the distribution-calculating part before generating the latent space. Almost the same as the AE, the VAE transposes the weights of the encoding and decoding using the below equations:

hidden_encode_1 = ReLU(W_1 × input + b_1)
hidden_encode_2 = ReLU(W_2 × hidden_encode_1 + b_2)
hidden_encode_3 = Z = µ + σ ⊙ ε, ε ∼ N(0, I)
hidden_decode_1 = ReLU(W_3 × hidden_encode_3 + b_3)
hidden_decode_2 = ReLU(W_2 × hidden_decode_1 + b_2)
reconstructed_input = sigmoid(W_1 × hidden_decode_2 + b_1) (A4)

where W_1, W_2 and W_3 are the weight vectors between each layer. Assuming that the input size is N, the sizes of W_1 and W_2 are N × 2000 and 2000 × 500, respectively. In the third layer of the encoder, the distribution of the input is computed and the mean (µ) and standard deviation (σ) of the input are generated for calculating the latent space (Z). The b_1, b_2 and b_3 are the bias terms for each layer. In the VAE, ReLU and sigmoid functions are also used as non-linear activation functions. The result hidden_encode_3, called the latent space (Z), is used as the extracted feature of the VAE. For measuring reconstruction loss, MSE is used as a loss function between the original data and the reconstructed data, the same as for the AE. This sequence of variational autoencoder processes is shown in Figure A2.
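The distribution-calculating part can be sketched as follows: a minimal NumPy example of the reparameterized sampling of Z and the closed-form KL term for a diagonal Gaussian encoder against a standard normal prior. Function names, shapes and the log-variance parameterization are illustrative conventions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample Z = mu + sigma * eps with eps ~ N(0, I), so gradients
    can flow through the encoder (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    """Closed-form KL[Q(Z|X) || N(0, I)] per sample for a diagonal
    Gaussian encoder distribution."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1)

# One sample with a 100-d latent space, matching the encoder's 100 nodes.
mu = np.zeros((1, 100))
log_var = np.zeros((1, 100))
z = reparameterize(mu, log_var)
kl = kl_divergence(mu, log_var)   # zero: encoder already matches the prior
```

Training minimizes reconstruction MSE plus this KL term, which is exactly the negated lower bound in Eq. (A3) up to the choice of reconstruction likelihood.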

Figure A2. The procedure of the variational autoencoder (VAE) feature selection method. The encoder consists of three fully connected hidden layers which contain 2000, 500 and 100 nodes, and the decoder consists of two fully connected hidden layers which contain 500 and 2000 nodes. Unlike the autoencoder, this method uses Kullback-Leibler divergence (KL-divergence) in addition to the reconstruction loss for the variational autoencoder.

Appendix C. Subtype Classification Neural Network
For the NN for the subtype classification of hematopoietic cancer, we generated a feed-forward neural network (NN), which has one input layer, one hidden layer with 100 nodes and one output layer. The features extracted from the DAE were utilized as the input of the NN.
hidden_layer = ReLU(W_4 × hidden_encode_3 + b_4)
output = sigmoid(W_5 × hidden_layer + b_5) (A5)

where W_4 and W_5 indicate the weight matrices between the layers, of size 100 × 100 and 100 × C, respectively; C indicates the number of class labels (in this case, 5), and b_4 and b_5 are the biases of each NN layer's nodes.
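Eq. (A5) maps directly onto a few lines of NumPy. The random weights below are illustrative placeholders (a trained model would learn them), and the sigmoid output follows the paper; a softmax would be the more common choice for mutually exclusive classes.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def classify(z, W4, b4, W5, b5):
    """Forward pass of the subtype classifier: 100-d latent feature ->
    100-node hidden layer (ReLU) -> C-way output (sigmoid), per Eq. (A5)."""
    hidden = relu(z @ W4 + b4)
    return sigmoid(hidden @ W5 + b5)

rng = np.random.default_rng(0)
C = 5                                   # five hematopoietic cancer subtypes
W4 = rng.standard_normal((100, 100)) * 0.01
b4 = np.zeros(100)
W5 = rng.standard_normal((100, C)) * 0.01
b5 = np.zeros(C)

z = rng.standard_normal((1, 100))       # latent feature from the DAE encoder
out = classify(z, W4, b4, W5, b5)       # per-class scores in (0, 1)
```

The predicted subtype is simply `out.argmax(axis=1)` for each sample.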

Appendix D. Cross-Entropy Loss and Focal Loss
For calculating classification loss, we used two loss functions: cross-entropy loss (CE) and focal loss (FL). CE is the most widely used loss function in classification models. The equation of CE is shown below:

CE(y, S) = − Σ_{i=1}^{N} y_i log P(i|S) (A6)

where y is the class label of the instance, N is the number of classification labels in the dataset and P(i|S) is the predicted probability. If the model correctly predicts an instance as y_i, P(i|S) approaches the true distribution of y_i. As a result, the loss function decreases.
For simplicity, if p_i is defined as the estimated probability of the model when the class label y_i = 1, CE can be written as the below equation:

CE(p, y) = CE(p_i) = − log(p_i) (A7)

However, if there are many class labels in an imbalanced state, this incurs a loss of non-trivial magnitude. If there are many easy examples to classify, which means there is a large class imbalance in the dataset, the CE of the major class labels will be larger than that of the minor classes.
For handling this class imbalance problem, a modulating factor is added to CE, defined as focal loss (FL). This modulating factor in focal loss is (1 − p_i)^γ, where γ is a focusing factor, a tunable parameter with γ ≥ 0. Therefore, the FL can be formulated as the below equation:

FL(p_i) = −(1 − p_i)^γ log(p_i) (A8)

For instance, if the focusing factor is set to γ = 2 and a sample has probability p_i = 0.9, the loss contribution is 100 times lower than with CE. In other words, the modulating factor lowers the loss contribution of examples that are easy to classify, compared with traditional CE. Due to this advantage, using FL can prevent the overfitting problem which can accompany class imbalance.
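The 100× claim can be verified numerically. The sketch below compares FL and CE for a well-classified example with p_i = 0.9 and γ = 2; since the log terms cancel in the ratio, the down-weighting is exactly the modulating factor (1 − p_i)^γ.

```python
import numpy as np

def ce_loss(p):
    """Cross-entropy for the true-class probability p: -log(p)."""
    return -np.log(p)

def focal_loss(p, gamma=2.0):
    """Focal loss FL(p) = -(1 - p)^gamma * log(p): the modulating
    factor down-weights easy, well-classified examples."""
    return -(1.0 - p) ** gamma * np.log(p)

p = 0.9                                # an "easy", well-classified example
ratio = focal_loss(p) / ce_loss(p)     # equals (1 - 0.9)^2 = 0.01
```

For a hard example (say p_i = 0.1) the factor is (0.9)^2 ≈ 0.81, so hard examples keep almost their full loss while easy majority-class examples are suppressed, which is how FL rebalances the gradient signal.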