Molecular Classification Models for Triple Negative Breast Cancer Subtype Using Machine Learning

Triple negative breast cancer (TNBC) lacks well-defined molecular targets and is highly heterogenous, making treatment challenging. Using gene expression analysis, TNBC has been classified into four different subtypes: basal-like immune-activated (BLIA), basal-like immune-suppressed (BLIS), mesenchymal (MES), and luminal androgen receptor (LAR). However, there is currently no standardized method for classifying TNBC subtypes. We attempted to define a gene signature for each subtype, and to develop a classification method based on machine learning (ML) for TNBC subtyping. In these experiments, gene expression microarray data for TNBC patients were downloaded from the Gene Expression Omnibus database. Differentially expressed genes unique to 198 known TNBC cases were identified and selected as a training gene set to train in seven different classification models. We produced a training set consisting of 719 DEGs selected from uniquely expressed genes of all four subtypes. The highest average accuracy of classification of the BLIA, BLIS, MES, and LAR subtypes was achieved by the SVM algorithm (accuracy 95–98.8%; AUC 0.99–1.00). For model validation, we used 334 samples of unknown TNBC subtypes, of which 97 (29.04%), 73 (21.86%), 39 (11.68%) and 59 (17.66%) were predicted to be BLIA, BLIS, MES, and LAR, respectively. However, 66 TNBC samples (19.76%) could not be assigned to any subtype. These samples contained only three upregulated genes (EN1, PROM1, and CCL2). Each TNBC subtype had a unique gene expression pattern, which was confirmed by identification of DEGs and pathway analysis. These results indicated that our training gene set was suitable for development of classification models, and that the SVM algorithm could classify TNBC into four unique subtypes. Accurate and consistent classification of the TNBC subtypes is essential for personalized treatment and prognosis of TNBC.


Introduction
Triple negative breast cancer (TNBC) is a subset of breast cancer which lacks the expression of estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor-2 (HER2). TNBC accounts for 10-20% of breast cancers, and primarily affects women under 40 years old [1]. Due to the heterogeneity of the disease, and the absence of molecular targets, TNBC patients are not sensitive to endocrine or HER2 targeted therapy. Chemotherapy remains a standard treatment for patients with TNBC. This cancer has a poor prognosis and a high rate of relapse and metastasis leading to tumor recurrence [2,3]. Hence, it is important to explore therapeutic targets to improve the outcomes of TNBC. Advances in gene expression microarray analysis have facilitated comprehensive molecular profiling, which can be used to classify TNBC into distinct subtypes [4][5][6][7][8]. According to gene expression signatures, Burstein et al. classified TNBC into four subtypes: basal-like immune-activated (BLIA), basal-like immune-suppressed (BLIS), mesenchymal (MES), and luminal androgen receptor (LAR) [8]. Previous studies have found that each TNBC subtype has different characteristics and responses to neoadjuvant chemotherapy [9,10]. Thus, TNBC subtyping is of value for prioritizing patients for personalized medicine. However, a laboratory tool for classification of TNBC subtype is still under investigation, and has not yet been implemented in the clinic.
Over the past decade, research groups studied the gene signatures of TNBC subtypes using different techniques. Advancement in gene microarray technology have produced datasets with a very large number of genes (features), but a small number of samples. This high dimensionality is a major challenges to the development of classification methods [11,12]. To address these challenges, machine learning (ML) approaches have been used. ML is the study of computer algorithms which improve automatically through experience. It learns from previous data to create the classification, prediction or identify processes [13]. To date, ML-based cancer classification models have been used to predict death outcomes [14], seek new drug mechanisms [15] and identify genes to differentiate TNBC from non-TNBC [16]. However, no reported studies have proposed ML-based classification schemes for classifying TNBC subtypes using gene expression data. The aim of this study was to investigate the potential application of ML to the classification of TNBC subtypes using microarray data derived from the public Gene Expression Omnibus (GEO) database [17]. In the present study, we analyzed a dataset consisting of 198 TNBC patients, to identify a set of upregulated differentially expressed genes (DEGs) among TNBC subtypes. We subsequently used this gene set to develop classification models using seven different ML algorithms: Support Vector Machines (SVM), K-nearest neighbor (KNN), Naïve Bayes (NB), Decision Tree (DT), Ensemble, Linear Discriminant, and Logistic Regression.

Identification of DEGs and Feature Selection
In this study, 198 known TNBC cases used as the training set were classified into four TNBC subtypes: BLIA, BLIS, MES, and LAR, containing 54, 60, 47, and 37 cases, respectively. The samples contained the expression profiles of 20,186 genes, making the dataset very high dimensional. Using a large number of genes to train the ML model takes a long time, and may reduce the efficiency of ML. Thus, to train the model and identify the best classifiers, we extracted the genes with upregulated expressed in each TNBC subtype according to the p-value and log 2 fold change cut-offs. We identified 80, 80, 400, and 197 upregulated DEGs in BLIA, BLIS, MES, and LAR, respectively ( Figure 1A). The top 20 upregulated DEGs of each subtype are shown in Table 1. We also used Venn diagrams to check the overlap between upregulated DEGs of each subtype, and found 73, 75, 385, and 186 genes which were expressed only in BLIA, BLIS, MES, and LAR, respectively ( Figure 1B) (Supplementary Table S1). There were few overlapping upregulated DEGs among the four subtypes. Hence, the 719 DEGs which were only expressed in each subtype were selected as the training gene set for training the classification models.

GO Term and KEGG Pathway Enrichment Analysis of Unique Upregulated DEGs in Each TNBC Subtype
GO function and KEGG pathway enrichment analysis were performed using MetaScape [18] to explore the biological functions of unique upregulated DEGs in each TNBC subtype. DEGs which were only upregulated in the BLIA subtype were significantly enriched in organelle fission, nuclear division, cell cycle phase transition, and immune regulation pathways. These genes were different from those of the BLIS subtype, which were significantly downregulated in immune regulation pathways [5]. In our study, functional enrichment analyses of upregulated DEGs in the BLIS subtype which is one of two basal-like clusters, showed significant enrichment in epithelial cell differentiation, tissue morphogenesis, chordate embryonic development. and the Wnt signaling pathway. In the MES subtype, signal transduction pathways associated with the naba core matrisome, which is an ensemble of genes encoding core extracellular matrix elements including ECM glycoproteins, collagens, and proteoglycans. The upregulated DEGs of the LAR subtype were mainly significantly enriched in estrogen-dependent gene expression, metabolism of lipids, and organic acid catabolic processes. Thus, our results indicated that each TNBC had a unique pattern of gene expression and signaling pathways ( Figure 2) (Supplementary Table S2).

Modeling Prediction and Performance Evaluation
In this section, all of the unique upregulated DEGs of each TNBC subtype were selected as the training set for classification and prediction using the MATLAB 2020a environment [19]. To choose the best model, we trained a selection of models, including SVM, KNN, NB, DT, Ensemble, Linear Discriminant, and Logistic Regression. We trained all of the models using five-fold cross validation, to protect against overfitting. The training data in our experiment were divided into five sets of similar size, and four of them were used in turn as the training set. One set was used as the test set to evaluate the model. In terms of model accuracy, the prediction models ranged from 61.5% for logistic regression, to 98.8% for SVM. The SVM algorithm was the best classification model, with the highest average accuracy of 95.7, 95.6, 95.0, and 98.8% for the BLIA, BLIS, MES, and LAR subtypes, respectively. The experimental results are presented in Table 2 (Supplementary Table S3).
Since the number of TNBC patients in each subtype was not balanced, the accuracy may not reflect the performance of the ML algorithms. Therefore, to prevent misleading interpretation of our results, the F1 score, which is the harmonic mean of precision and recall, was also considered. The BLIA, BLIS, MES, and LAR models exhibited F1-scores of 0.91, 0.91, 0.90, and 0.97, respectively. The LAR model was able to predict non-LAR patients correctly, resulting in a recall (sensitivity) of 1.00 (100%) ( Figure 3, Table 3). The area under the ROC curve (AUC) value was also used to evaluate the model performance. The SVM model of LAR exhibited the highest AUC, of 1.00, compared to BLIA, BLIS, and MES, with an AUC of 0.99 ( Figure 4). Our result showed that a training gene set and ML algorithms could classify TNBC with high accuracy.

Testing on Independent Cohorts Demonstrated the Generalizability of the Classification Model
The model was then evaluated on independent datasets of TNBC subtypes that had never been used in the training process. In model validation of 334 unknown TNBC samples, 97 (29.04%), 73 (21.86%), 39 (11.68%) and 59 (17.66%) were predicted to be BLIA, BLIS, MES, and LAR, respectively. In addition, 66 TNBC samples (19.76%) could not be predicted as any subtype, and were defined as unclassified samples ( Figure 5) (Supplementary table S4). Then, the upregulated DEGs of the test set were compared with those of the training set, to confirm the accuracy of each model. Our result showed that the DEG comparisons of the individual subtypes were quite similar between the training and the test set ( Figure 6) (Supplementary Table S5). The unclassified samples included only three upregulated genes, including EN1, PROM1, and CCL2 among all TNBC subtypes (Table  4).

Testing on Independent Cohorts Demonstrated the Generalizability of the Classification Model
The model was then evaluated on independent datasets of TNBC subtypes that had never been used in the training process. In model validation of 334 unknown TNBC samples, 97 (29.04%), 73 (21.86%), 39 (11.68%) and 59 (17.66%) were predicted to be BLIA, BLIS, MES, and LAR, respectively. In addition, 66 TNBC samples (19.76%) could not be predicted as any subtype, and were defined as unclassified samples ( Figure 5) ( Supplementary Table S4). Then, the upregulated DEGs of the test set were compared with those of the training set, to confirm the accuracy of each model. Our result showed that the DEG comparisons of the individual subtypes were quite similar between the training and the test set ( Figure 6) (Supplementary Table S5). The unclassified samples included only three upregulated genes, including EN1, PROM1, and CCL2 among all TNBC subtypes (Table 4).

Discussion
TNBC is a more aggressive and highly heterogeneous disease than other type of breast cancer [20,21]. TNBC patients do not benefit from targeted therapies such as endocrine therapy or trastuzumab, due to the absence of ER, PR, or HER2. TNBC patients have poorer survival and prognosis than other breast cancer types after chemotherapy [22,23]. Due to its high heterogeneity, TNBC can be classified into different subtypes [6][7][8]24]. Burstein et al. divided TNBC into four subtypes, BLIA, BLIS, MES, and LAR [8]. Some studies have found that TNBC subtypes have different prognosis and responses to neoadjuvant chemotherapy. LAR patients achieved the lowest pathologic complete response (pCR), but showed the best overall survival rate and delayed recurrence when compared with the other subtypes [9]. Patients with Basal-like 1 subtype (Lehmann subtyping) exhibited the highest pCR to carboplatin containing regimens [25]. TNBC subtyping can be used as a predictor of pCR, and may impact decision pertaining to treatment of TNBC. Thus, classification tools are needed to classify TNBC subtypes. However, the classification of TNBC subtypes has not been routinely used in clinical practice. The main goal of our study was to use public gene expression data to develop a tool for TNBC subtyping, using ML.
In this study, microarray gene expression data was downloaded from the GEO database. To enhance the efficacy of ML, gene signatures were selected from genes expressed only in each subtype, as identified using a cutoff of FC > 2 and p-value < 0.05. There were a few overlapping DEGs among subtypes, because they shared some similar characteristics. For example, BLIA and BLIS displayed basal-like characteristics. The training set 719 DEGs was used to train the prediction models. The highest average accuracy of a classifier for BLIA, BLIS, MES, and LAR subtype was the SVM algorithm (accuracy 95-98.8%; AUC 0.99-1.00). The performance was evaluated based on five-fold cross validation. SVM is a supervised ML algorithm based on the idea of maximizing the margins between different classes. Our results were consistent with those of previous studies. Asri et al. found that SVM showed the highest accuracy (97.13%) and lowest error rate in the classification of breast cancer, using the Wisconsin Breast Cancer datasets [26]. Wu et al. also found that the SVM algorithm could accurately classify breast cancer into TNBC and non-TNBC, and had fewer misclassification errors than the other ML algorithms [27]. Nindrea et al. confirmed that the SVM algorithm produced better accuracy of breast cancer risk calculation than other ML algorithms [28].
To ensure accurate subtype prediction, we compared the upregulated DEGs of each subtype between the training and test sets. We observed that predicted TNBC samples displayed upregulated DEGs that corresponded with those in the training set and in other reports [8]. The different TNBC subtypes exhibited different unique gene expression and signaling pathways. These data could be used to guide therapeutic decisions. The BLIA subtype showed high expression of genes related to the immune system, and therefore may be sensitive to immune checkpoint inhibitor treatments for BLIA. For the MES subtype, upregulated DEGs were associated with extracellular structure, extracellular matrix organization, growth factors, and blood vessel development. Therefore, patients with the MES subtype might be susceptible to anti-angiogenic therapy [29]. The LAR subtype showed significantly upregulated DEGs enriched in estrogen-dependent gene expression, including androgen receptors (AR). AR was expressed at a lower rate in other TNBC subtypes. The LAR model had the highest accuracy, 98.8%, and a recall of 1.0. AR could therefore be used as a novel therapeutic target for the LAR subtype. The use of enzalutamide, an androgen receptor inhibitor, is currently being explored in TNBC patients who express the androgen receptor (NCT01889238) [30][31][32].
The unclassified samples, which were not predicted as any subtype, had only three upregulated genes. EN1, PROM1, and CCL2 were found as upregulated DEGs of BLIS, BLIA&BLIS, and MES, respectively. These patients did not express the unique gene pattern; thus the classification model was unable to identify them into any subtypes. These findings indicated that the training gene set could discriminate between TNBC subtypes. However, further study is needed to investigate the unclassified subtypes. The integration of gene expression analysis with genomic, epigenetic, and microRNA data may lead to improvement of the efficacy of ML classification tools. TNBC subtyping identified the unique patterns of gene expression for each subtype, and could be used for guiding therapeutic choices, and also for the development of potential therapeutic targets for TNBC patients.
Our discovery phase, the training gene set consisted of 719 DEGs which were high number to train the prediction model. Minimize the number of unique genes in training gene set with high accuracy is challenging. Gene signatures selection will be considered on gene (feature) importance score based on how useful they are in classification model. This will be a practical method to select a few gene signatures for further validation in clinical samples using qPCR. It would be beneficial to utilize the RNAseq data with our ML models in the future.

Data Sources and Preprocessing
The overall design and execution strategy used in this study is presented in Figure 7. We downloaded the seven microarray gene expression profile datasets (GSE76124, GSE95700, GSE48390, GSE76275, GSE19697, GSE 20,711 and GSE21653) from the GEO database (https://www.ncbi.nlm.nih.gov/geo/, accessed on 1 May 2021). Our experimental dataset consisted of 532 TNBC cases, of which 198 TNBC cases were assigned as a training cohort and 334 TNBC cases were assigned as a test cohort. All seven datasets were based on platform GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array) and are freely available online. No additional ethics review was required for the in silico analysis of these data sets, because this study did not involve any experiments on humans performed by any of the authors. The raw Affymetrix cell intensity files (.CEL files) of all datasets were processed for normalization, background correction, and log 2 -transformation using robust multi-array average from the R/Bioconductor package affy [33].

Identification of DEGs and Feature Selection
The volcano plot which was generated by mavolcanoplot in MATLAB was used to identify DEGs with the most predictive power among TNBC subtypes. In this experiment, known TNBC cases from a previous study [8] were assigned to four TNBC subtypes: BLIA, BLIS, MES, and LAR. These cases were used as training data. We considered p-value < 0.05 and a log 2 (fold change) >2 to indicate statistically significant upregulated DEGs of each TNBC subtype. All upregulated DEGs were then plotted as a Venn diagram, to check for overlapping genes among subtypes. For feature selection, all upregulated DEGs which were expressed in only one of the four subtypes were selected as the gene set for training the prediction models.

Functional and Pathway Enrichment Analysis
A functional enrichment analysis of the unique upregulated DEGs of each TNBC subtype was performed using the Metascape software (http://metascape.org/, accessed on 1 May 2021) [18]. Functional enrichment was performed using three categories of GO terms: biological process, molecular function and cellular component (CC). In addition, KEGG pathways, Reactome Gene Sets, and CORUM were used as sources of pathway, gene network, and process enrichment analysis [34][35][36]. Terms with a p-value of <0.01, a minimum count of 3, and an enrichment factor of >1.5 were collected and grouped into clusters based on their membership similarities.

Identification of DEGs and Feature Selection
The volcano plot which was generated by mavolcanoplot in MATLAB was used to identify DEGs with the most predictive power among TNBC subtypes. In this experiment, known TNBC cases from a previous study [8] were assigned to four TNBC subtypes: BLIA, BLIS, MES, and LAR. These cases were used as training data. We considered pvalue < 0.05 and a log2 (fold change) > 2 to indicate statistically significant upregulated DEGs of each TNBC subtype. All upregulated DEGs were then plotted as a Venn diagram, to check for overlapping genes among subtypes. For feature selection, all upregulated DEGs which were expressed in only one of the four subtypes were selected as the gene set for training the prediction models.

Functional and Pathway Enrichment Analysis
A functional enrichment analysis of the unique upregulated DEGs of each TNBC subtype was performed using the Metascape software (http://metascape.org/) [18]. Functional enrichment was performed using three categories of GO terms: biological process, molecular function and cellular component (CC). In addition, KEGG pathways, Reactome Gene Sets, and CORUM were used as sources of pathway, gene network, and process enrichment analysis [34][35][36]. Terms with a p-value of <0.01, a minimum count of 3, and an enrichment factor of >1.5 were collected and grouped into clusters based on their membership similarities.

Model Construction
Seven different classification models, SVM, KNN, NB, DT, Ensemble, Linear Discriminant, and Logistic Regression, were used to generate the classification model. The detail of each model is presented in Supplementary Table S6. Here we used all of classification models, which are available from the Classification Learner app in the applications toolbox in MATLAB [19]. A training gene set was used to train all of models within the

Model Construction
Seven different classification models, SVM, KNN, NB, DT, Ensemble, Linear Discriminant, and Logistic Regression, were used to generate the classification model. The detail of each model is presented in Supplementary Table S6. Here we used all of classification models, which are available from the Classification Learner app in the applications toolbox in MATLAB [19]. A training gene set was used to train all of models within the training set (N = 198) which were divided into four TNBC subtypes: BLIA (n = 54), BLIS (n = 60), MES (n = 47) and LAR (n = 37). In this study, we performed five-fold cross validation to evaluate the model performance. To identify the best model, the performance analysis of each model was measured in terms of accuracy, sensitivity (recall), specificity, Precision (positive predictive value) (PPV), negative predictive value (NPV), F1 score, and AUC [37]. These performance indicators were defined and computed as follows: Finally, the classification model which gave the best performance indicators for each TNBC subtype was selected for model generation and evaluation.

Modeling Prediction and Performance Evaluation
After model training, 334 unknown TNBC cases were used to evaluate the prediction ability of the best performing model. There is no standard method to classify TNBC, so after making the predictions we identified the upregulated DEGs using mavolcanoplot in MATLAB for each TNBC subtype, to compare with the training set.

Conclusions
We proposed a new ML model to distinguish the four subtypes of TNBC using subtypespecific gene signatures based on gene expression data. Our finding confirmed that the SVM model offered the best potential classifier for TNBC classification. The utilization of a training gene set could be beneficial for TNBC subtyping and the development of personalized treatment for TNBC patients.