Development of Gene Expression-Based Random Forest Model for Predicting Neoadjuvant Chemotherapy Response in Triple-Negative Breast Cancer

Simple Summary Only 20–50% of patients with triple negative breast cancer achieve a pathological complete response from neoadjuvant chemotherapy, a strong indicator of patient survival. Therefore, there is an urgent need for a reliable predictive model of the patient’s pathological complete response prior to actual treatment. The purpose of this study was to develop such a model based on random forest recursive feature elimination and to benchmark the performance of the proposed model against existing predictive models. Our study suggests that an 86-gene-based random forest model associated to DNA repair and cell cycle mechanisms can provide reliable predictions of neoadjuvant chemotherapy response in patients with triple negative breast cancer. Abstract Neoadjuvant chemotherapy (NAC) response is an important indicator of patient survival in triple negative breast cancer (TNBC), but predicting chemosensitivity remains a challenge in clinical practice. We developed an 86-gene-based random forest (RF) classifier capable of predicting neoadjuvant chemotherapy response (pathological Complete Response (pCR) or Residual Disease (RD)) in TNBC patients. The performance of pCR classification of the proposed model was evaluated by Receiver Operating Characteristic (ROC) curve and Precision Recall (PR) curve. The AUROC and AUPRC of the proposed model on the test set were 0.891 and 0.829, respectively. At a predefined specificity (>90%), the proposed model shows a superior sensitivity compared to the best performing reported NAC response prediction model (69.2% vs. 36.9%). Moreover, the predicted pCR status by the model well explains the distance recurrence free survival (DRFS) of TNBC patients. In addition, the pCR probabilities of the proposed model using the expression profiles of the CCLE TNBC cell lines show a high Spearman rank correlation with cyclophosphamide sensitivity in the TNBC cell lines (SRCC =0.697, p-value =0.031). Associations between the 86 genes and DNA repair/cell cycle mechanisms were provided through function enrichment analysis. Our study suggests that the random forest-based prediction model provides a reliable prediction of the clinical response to neoadjuvant chemotherapy and may explain chemosensitivity in TNBC.


Introduction
Triple negative breast cancer (TNBC) is a particularly difficult form of breast cancer to treat because of its rapid growth and high recurrence rate [1]. It accounts for about 15% of all invasive breast cancers and usually has a high histological grade with a low long term survival rate [2]. Compared to the successful application of targeted therapies for other types of breast cancer, targeted therapy for TNBC is difficult due to the lack of major receptors in breast cancer such as ER/PR/HER2 receptors [3,4]. Several targeted therapies for TNBC patients have recently been introduced [5][6][7], but due to the limited applicability of these therapies [8], cytotoxic chemotherapy remains the mainstay of treatment [9].
Due to the aggressive tumor growth in TNBC, NeoAdjuvant Chemotherapy (NAC), which refers to preoperative chemotherapy, is often considered as a first-line treatment for TNBC patients. The main treatment goal of NAC is to achieve a pathological complete response (pCR) and pCR is defined as the complete disappearance of all invasive carcinoma cells in the breast and axillary lymph nodes, which is assessed pathologically in the resected tissue after neoadjuvant chemotherapy [10]. Patients who do not achieve pCR are refer to as having Residual Disease (RD).
The pCR by NAC has been demonstrated to be a strong prognostic factor for TNBC patients [11][12][13]. Therefore, pCR by NAC generally is considered as an appropriate surrogate endpoint for TNBC patients [12] and clinical studies have shown improved survival in TNBC patients who achieved pCR by NAC [14]. Unfortunately, not all patients receiving NAC could achieve pCR and in fact, only about 20-50% of patients with TNBC achieve pCR by NAC [11,12,14,15]. In other words, some patients are experiencing unnecessary side effects that can lead to loss of alternative treatment due to an inadequate and ineffective treatment. Therefore, there is an urgent need for a prediction model of the NAC response in TNBC.
Several previous studies have proposed gene expression-based predictive models to address this issue [16][17][18][19][20][21]. For example, Liedtke et al. reported that a high GGI (Genomic Grade Index) was found to be correlated with increased sensitivity to neoadjuvant chemotherapy [22]. Hess et al. [23], proposed DLDA30 to predict the pCR of patients who received preoperative weekly paclitaxel and fluorouracil-doxorubicincyclophosphamide (T/FAC) chemotherapy based on Diagonal Linear Discriminant Analysis (DLDA). Hatzis et al. [18] proposed a prediction model for the NAC response by Taxane + Anthracycline based on the Threshold Gradient Directed Regularization (TGDR) method [24,25]. Masanori Oshi et al. [19], proposed a three-gene-based predictive biomarker for pCR prediction after NAC in TNBC. The same authors also proposed a five-gene-based predictive biomarker for pCR prediction after NAC in ER+/HER2-breast cancer [20]. Fu et al. [21] proposed an immune-associated genomic signature to predict pCR after Neoadjuvant paclitaxel and anthracycline based chemotherapy in breast cancer. Fournier et al. [26] proposed a two-step classification model based on backward regression general linear modeling (BRGLM). Table 1 summarizes the reported studies about NAC response prediction for TNBC and other types of breast cancer. It is noteworthy that most of the reported studies are based on a combination of feature selection and relatively simple models such as Multivariate Logistic Regression (MLR) or Diagonalized Linear Discriminant Analysis (DLDA). However, because there is no clear correlation between the expression of a gene and a NAC response vector (mean PCC = −0.038 ± 0.103), linear models may not be the appropriate choice for classifying the NAC response [28]. In addition, some studies have utilized prior marker information to restrict the search space of optimal markersets, but these studies only utilized the associated genes of predefined mechanisms such as the immune response [29] and the E2F pathway [19,20], which may not fully represent the mechanisms of the NAC response. Moreover, there are only two studies specifically targeting TNBC patients in which pCR prediction by NAC is expected to improve the clinical prognosis.
The Random Forest (RF) is a machine learning model based on an ensemble of decision trees generated by random feature sets [30]. Because it utilizes nonlinear proximity based on bagging of random trees, it can classify samples partitioned by nonlinear decision boundary. It also provides an estimate of the importance of variables in the classification; therefore, if the user wants to reduce the number of variables, it can guide this process according to the importance of the variables. Therefore, it is suitable for developing a gene expression based classification model such as a NAC response prediction model.
To address the aforementioned problems, herein, we propose a novel NAC prediction model based on random forest (RF). We combined prior knowledge of breast cancer disease markers collected from various resources such as the literature, disease databases, drug target databases, and existing marker panels and applied several marker optimization strategies to develop an 86-gene-based RF model that outperforms the reported NAC predictive models in TNBC. To verify data specificity, we performed differential expression analysis and included both prior markers and differentially expressed genes as candidate markers. In addition, the predicted pCR status by the model well explains the distance recurrence free survival (DRFS) of TNBC patients. We also found the associations between the 86 genes and the disease mechanisms of breast cancer such as DNA repair and cell cycle.
There are several issues with reporting a novel NAC response prediction model as follows. First, it is difficult to objectively compare the performance of the existing model and the newly proposed model because there is no definite guideline to select a metric for the performance evaluation. Second, there is no consensus on which model performs best for a given task, because there are no benchmark studies that provide an objective assessment of the performance of NAC response prediction models. Third, it is difficult to replicate the reported studies due to the lack of data and codes used in the reported studies and insufficient reporting on the model development process. Fourth, it was impossible to compare the performance of our model with similarly constructed models in the literature because we could not find any random forest model proposed to predict the NAC response in breast cancer. Therefore, here, we implemented all listed NAC response prediction models in Table 1 and performed a comparative analysis on the same test set.

Collection of Datasets
All gene expression datasets were collected from the GEO database. Table 2 summarizes the number of patients and NAC regimens for each dataset used in this study. As shown in the Table 2, T/FAC (Taxol + Fluorouracil, Anthracycline and cyclophosphamide) and T/FEC (Taxol + Fluorouracil, Epirubicin and cyclophosphamide) were the most used NAC regimens. Here, taxol stands for paclitaxel or docetaxel, and anthracycline stands for doxorubicin or epirubicin, which can be substituted for each other in chemotherapy regimens. The GSE32646 data set contains only patients who received the T/FEC regimen, and all other data sets contain more patients who received the T/FAC regimen. All data included different types of breast cancer, so patients were selected based on receptor status (ER-, PR-, and HER2-). Because GSE25066 had the highest number of TNBC patients, we considered this dataset as the development dataset and the three other datasets as the independent validation datasets (GSE20271, GSE20194, and GSE32646).

Preprocessing
Raw intensity files from the Affymetrix HG U133A or HG U133 Plus 2.0 platforms were processed using the mas5 function in the affy R package [34] to normalize to an average array intensity of 600 and to generate expression values of the probe set level as in Hatzis et al. [18]. Some gene transcripts are recognized as more than one probe set because probe sets represent a set of (perfect match and mismatch) pairs of probes for multiple regions of a single gene transcript sequence. To avoid confusion, we used the average intensity of each probe set that maps to the same gene.

Collection of Prior Breast Cancer Markers
The statistical significance of certain markers in development data sets may not be reproduced due to their low association with breast cancer. It has been reported that prior knowledge approaches can improve the robustness and true biological relevance of selected markers in gene expression datasets [35]. Therefore, in this study, we not only utilized data specific markers such as differentially expressed genes but also included markers associated with breast cancer as candidate markers. To acquire such a prior knowledge markers, we collected breast cancer marker information from the existing literature [36][37][38][39][40][41][42], databases [43][44][45][46], and gene expression marker panels [23,27,[47][48][49][50][51][52][53]. We first gathered information on disease genes primarily listed in existing breast cancer reviews through PubMed and google searches. We selected seven papers that provide comprehensive summaries of disease genes associated with breast cancer, as shown in Table 3. Second, we considered the disease gene database DisGeNet [43] and three drug databases (Drug Central [44], Drug Bank [45] and DGIDB [46]) to collect genes and drug targets associated with breast cancer. We also considered genes in existing breast cancer prognostic/responsive marker panels. We considered genes in nine gene expression marker panels including Oncotype Dx [47], Mammaprint [48], Prosigna [27], Endopredict [49], Breast Cancer Index [50], CureBest [51], GenesWellBCT [52], Genomic Grade Index (GGI) [53], and DLDA30 [23]. The gene symbols collected from all resources are standardized by HGNC Multi Symbol checker (https://www.genenames.org/tools/multi-symbol-checker/, accessed on 15 November 2021) and mapped to NCBI Gene IDs (https://www.ncbi.nlm.nih.gov/gene, accessed on 15 November 2021). In summary, we collected 1079 breast cancer marker genes through this process.

Differential Expression Analysis
We performed a differential expression analysis using the development dataset (GSE25066) to find the most discriminating genes between pCR vs. RD patients. To avoid false positives due to the relatively small sample number and the skewness of individual gene expression levels in the training dataset, we performed bootstrap t-test using boot.t.test function in the MKinfer R package [54] . We utilized bootstrapped t-test under unequal variance proposed by Janssen et al. [55], and the number of bootstrap replicates was 9999. We adjusted the p-value with the Benjamini-Hotchoberg (BH) method [56] and selected genes with a FDR < 0.05 as differentially expressed genes. To avoid randomness of the bootstrap procedure, we iterated the process 10 times and selected genes that showed FDR < 0.05 in at least one trial.

Model Training
Because the number of markers is of practical importance for developing cost effective marker panels, all developed marker panels have a relatively small number of genes compared to the number of genes available in the microarray dataset. Therefore, as discussed earlier, we limited the entire search space to the union of prior maker genes (total 1079) and differentially expressed genes (total 52). For Recursive Feature Elimination (RFE), an N of 1000 times was used to derive an optimal set of genes for the RF model (See Supplementary Figure S1). We used 70% of the development dataset as the training set to optimize the marker set. For each trial, we partitioned the training dataset into 3 equal sized folds and trained the RF classifier using 2 folds of the data along with all genes derived from the union of the prior marker genes and the differentially expressed genes. We evaluated the performance of the trained classifier on the remaining fold and evaluated the significance of the feature by the mean decrease of accuracy. After three training and test runs, we derived feature ranks for all genes based on the variable importance and removed the least significant genes from the list of genes. We performed this procedure from the maximum number of genes (1124) to 2 genes and reported the best performing subset. To reduce the search space, here, we searched for the best subset with 10 equally spaced grids, which iteratively search for the best subset when the number of genes in the optimal subset exceeds 100. That is, the trial ends when the number of genes in the optimal subset is less than 100. We repeated this process (N = 1000 times) to obtain candidate models. All models were evaluated on the rest of the test set in the development dataset to find the best performing marker set and RF model. However, because the best model on the Affymetrix U133A dataset did not perform well on the Affymetrix U133 PLUS 2.0 dataset (GSE32646), we considered 30% of the GSE32646 data set as an additional test set to find a robust model for both platforms. To minimize the effect of imbalance in the pCR vs. RD samples, we used Synthetic Minority Oversampling TEchnique (SMOTE) [57] based on the SMOTE function in the DMwR R package [58]. The accuracy was used as a metric to optimize the RF model. In this process, the markers showing the highest accuracy for the test dataset were selected as the marker set for the final model. We used the rfe function in the Caret [59] R package.

Selection of an Optimal Threshold
Due to the unequal misclassification cost of the pCR prediction, a naive threshold that applies equal probabilities to both pCR and RD may not be an appropriate choice for a pCR prediction model [60]. Because of the unequal misclassification cost, it is recommended to control the type I error (FPR) below a certain level (α) [61]. Here, positive means the desired outcome, i.e., pCR. In other words, when misclassifying RD patients as pCR patients, the cost of the misclassification is higher than vice versa because alternative treatment options for RD patients may be lost. On the other hand, even if pCR is misclassified as RD, the risk of misclassification can be minimized by subsequently reassessing the patient for appropriate treatment options. We therefore selected an optimal threshold based on a predefined false positive rate, α, of ≤10%, which is comparable to an ultrasound (US) image-based pCR prediction [62].

Model Validation
As previously discussed, we used three independent test data sets to validate the proposed RF model. Various binary classification performance metrics were used to evaluate the test performance of the proposed model including Accuracy (ACC), Balanced Accuracy (BACC), Sensitivity (TPR), Specificity (TNR), Precision (PPV), Negative Predictive Value (NPV), F1 score, Metthew Correlation Coefficient (MCC), and Yoden's Index. Because we predefined α ≤ 10%, the specificity or TNR of the model was about 90%. We also evaluated the AUROC and AUPRC of the proposed model for each test data set and visualized the ROC and PR curves.

Comparative Analysis
We performed comparative analysis of the proposed model with existing NAC response prediction models. We considered ROR-S [27], GGI [53], DLDA30 [23], Hatzis [18], BA100 [26], three-gene [19], five-gene [20], and Immune associated signatures [21] as comparable models for the NAC response prediction, as discussed earlier. To derive the ROR-S and GGI score, we used the ggi and rorS function in the genefu R package [63]. For the DLDA30 score, we reimplemented the DLDA30 score function available at the author's web site (http://bioinformatics.mdanderson.org/pubdata.html, accessed on 15 November 2021). For the Hatzis model, we used adaptive LASSO based on the genes of the NAC response for ER-breast cancer patients proposed by the authors [18] because model coefficients and a training algorithm (TGDR) were not available. We newly implemented the BA100 [26], three-gene [19], five-gene [20], and immune associated signature [21] models as described in the original papers. Because the BA100 uses a two step classifier, we evaluated the performance of both classifiers separately. We compared the ROC and PR curves, confusion matrices with prespecified FPR and performance metrics of the binary classification.

Survival Analysis
A survival analysis was performed to determine if the proposed model could also predict Distant Recurrence Free Survival (DRFS) in patients receiving NAC. Since only one dataset (GSE25066) provides survival information, we investigated whether the predicted pCR and RD classes had differential survival curves in the training and test datasets. We used the survfit function from the survminer R package [64] to fit the survival curves of the pCR versus RD on the training and test sets and visualized the survival curves with the ggsurvplot function in the same package. The statistical significance of the survival difference was calculated by log-rank test.

Chemosensitivity Analysis
Some projects, such as GDSC [65], CCLE [66], CTRP [67], and PRISM [68] provide TNBC cell line sensitivity to drugs used for T/FAC-or T/FEC-based NAC, i.e., Paclitaxel, Docetaxel, 5-Fluorouracil, Doxorubicin, Epirubicin, and Cyclophosphamide [9]. Because the GDSC project provides the most complete set of sensitivity observations for TNBC cell lines, GDSC profiled cell line sensitivity (AUC of the dose response curves) data were used to investigate the relationship between the pCR probability of the proposed model and the chemosensitivity of TNBC cell lines. To obtain a list of TNBC cell lines, we referred to the study of Chavez et al [69]. Here, we investigated the Spearman rank correlation between the pCR probability and the sensitivity of the TNBC cell lines to these drugs (1 − Area Under the dose response curve, 1 − AUC). CCLE microarray data were preprocessed in the same manner as described in Section 2.2 to obtain the pCR scores of the TNBC cell lines. Among 27 TNBC cell lines listed by [69], 16 cell lines are available in the CCLE dataset (BT20, BT549, Cal51, HCC38, HCC1143, HCC1187, HCC1395, HCC1599, HCC1806, HCC1937, HCC2157, Hs578T, MDA-MB-157, MDA-MB-231, MDA-MB-436, and MDA-MB-468). Some drug responses for a particular cell line could be missing; thus, the Spearman's correlation was calculated by a different number of data points for each drug. For example, only 10 cell lines have a cyclophosphamide drug response, so we calculated the correlation between the pCR probability and drug response from these 10 TNBC cell lines.

Function Enrichment Analysis
To investigate the association between the predictors of our model and cellular function, we performed a functional enrichment analysis [70] based on the Fisher exact test in our in-house module database. Briefly, we collected known gene sets from Gene Ontology [71], MSigDB [72], and Enrichr [73] databases. After removing ambiguous and redundant gene collections from these databases, we built a gene module database of 86,769 unique sets of genes with 29,107 unique genes. An ambiguous gene set here refers to a set of genes that have less direct significance in cellular functions, such as chromosomal location, genome browser PWM, NIH grant related gene sets, etc. [73].

Visualize Error Matrix and Score Distributions
To understand the performance improvement in the proposed 86-gene-based RF model compared to the 8 published NAC response prediction models, we performed several analyses as follows. First, we visualized the sample error matrix according to the prediction of the published models and the proposed model. Second, we visualized the score distribution of the published models and the proposed model to see whether the proposed model has better calibration. Because the score range of the published models differ from each other, we performed min-max normalization as follows: The discriminative power of each model was assessed by t-test. Table 4 shows the characteristics of the patients with TNBC in the development and independent validation datasets. There were no statistically significant differences between the pCR and RD patients for all datasets. Two datasets (GSE20271 and GSE20194) were missing the tumor stage information, and one dataset (GSE32646) was missing the N stage. Stages T2, N0, and Grade 3 had the highest prevalence in the TNBC population in all datasets.

Differential Expression Analysis
As previously discussed, the bootstrap t-test was repeated 10 times to minimize the randomness of the bootstrap procedure. A gene with a FDR ≤ 0.05 in at least one trial was called a differentially expressed gene (DEG). Figure 1 shows a volcano plot of the differentially expressed genes (DEGs) in the development dataset (GSE25066). We found 23 up regulated genes (HAT1, TFG, ABT1, PDCL3, ILF2, TMEM14B, DEK, PDCL3P4,  SEC13, HACD1, MCM3, RANBP6, ITGA6, NOL7, FBXO16, SMARCA2, ZNF395, FN3KRP,  DCTN3, TMEM258 Table 5 summarizes the differentially expressed genes (DEGs) in the development dataset. The column count indicates how many times a gene was called a DEG in the 10 trials. The Fold Change (FC) and False Discovery Rate (FDR) were averaged from 10 trials and placed in the FC and FDR columns of the table. The number of common genes between the prior marker gene and DEGs was only 7.

Optimal Marker Set and Model Selection by RF-RFE
Random Forest Recursive Feature Elimination (RF-RFE) was used as described in Section 2.5 to derive the optimal combination of markers for the model training from 1124 genes obtained from the union of prior marker genes (1079 genes) and DEGs (52 genes). Here, we used 70% of the development dataset (GSE25066) as the training set (40 pCR vs. 80 RD) and 30% of the data as the test set (17 pCR vs. 33 RD). Figure 2 shows the Receiver Operating Characteristic (ROC) and Precision Recall (PR) curves of the RF-RFE optimized model for the test set. Figure 2A,B show the ROC and PR curves of all 1000 models. It can be seen in the figure that all trained models are working well for the test set (mean BACC = 0.848, AUROC = 0.908 ± 0.019, AUPRC = 0.891 ± 0.026). Figure 2C,D show the ROC and PR curves of the best model consisting of 86 genes (See Supplementary Table S1). The AUROC and AUPRC of the best model were 0.918 and 0.902, respectively. Table 6 shows the binary classification performance metrics for the best model at the predefined threshold of the False Positive Rate (FPR, α ≤ 10%). Here, positive indicates the desired outcome pCR, and negative indicates the undesired outcome RD.

Validation of the Final Model Using Independent Test Datasets
The final model was validated on three independent data sets (GSE20271, GSE20194, and GSE32646).

Comparative Analysis
We compared the performance of the proposed RF model with eight published NAC response prediction models, as discussed earlier. Figure 4 shows the ROC and PR curves of proposed RF and all other prediction models. Figure 4A,B show the ROC and PR curves of the test set in the development dataset (GSE25066) and Figure 4C,D show the ROC and PR curves of all the test sets in development and three independent test sets. The proposed model shows a better performance than all the other published models in terms of the AUROC/AUPRC (0.918/0.902 in development test set and 0.891/0.829 in all the test sets). Table 7 shows a comparison of the binary classification performance metrics for the proposed and eight published models at a predefined FPR threshold (α ≤ 10%). For all the test sets, the proposed model shows a superior performance compared to the published models.

Survival Analysis
We performed survival analysis to determine if the pCR predicted by our model could predict a longer survival of the patients as the actual pCR. Figure 5 shows the survival curves for the actual pCR (A) and predicted pCR in (B) all, (C) training set and (D) test set of the development data (GSE25066). The predicted pCR showed a longer survival compared to the predicted RD patients in all cases. Although the statistical significance of the test set was relatively low due to the small sample size, the pCR predicted by the proposed RF model explains the differences in survival among the TNBC patients well.

Relation to Chemosensitivity in TNBC Cell Models
Here, we investigated the relationship between the pCR score of the proposed model and chemosensitivity of the TNBC cell lines. As shown in Figure 6, the pCR score of the proposed model shows the highest Spearman rank correlation for cyclophosphamide sensitivity in the TNBC cell line compared to all other drugs used for the T/FAC or T/FEC treatment (SRCC = 0.697, p-value = 0.031). A high AUC of the dose response curve means a low sensitivity of the cells to the target drug; thus, 1 − AUC can be interpreted as the sensitivity score of the drug treated cell line. That is, the proposed model predicts a higher pCR probability when the TNBC cell line has a high sensitivity score to cyclophosphamide. We also investigated the SRCC between the pCR score and the cyclophosphamide chemosensitivity calculated by seven published models; however, as shown in Figure 7, none of the models showed a high positive correlation between the pCR score and the chemosensitivity of the TNBC cell lines. Because the HS578T cell data point appears to be an outlier, we recalculated the SRCC excluding the HS578T data point. As a result, the SRCC increased in all models, but the proposed model (SRCC = 0.7167, p-value = 0.037) showed the highest SRCC. In the case of the BA100 model, the increase in SRCC values was very large when HS578T cells were excluded (SRCC = 0.200, 0.030 vs. 0.3500, 0.4167).

Cellular Functions Associated with the 86 Genes of the Proposed Model
In addition to the performance of the predictive models, a mechanistic understanding of the genes is important for clinical response prediction. To investigate the association between the 86 genes of our model and cellular function, we performed a function enrichment analysis [70] based on Fisher's exact test on our in house module database. Table 8 shows the result of the function enrichment analysis. We excluded phenotypic terms, which may cause confusion. Among the terms including five or more genes (N gene ≥ 5), cases in which 30% or more of the genes of the corresponding term were mapped to the query genes were selected (Ratio mapped ≥ 0.3). The results show an enrichment of DNA repair and cell cycle gene modules reported to be involved in the NAC response of TNBC [74,75]. Fisher's Exact test p-value. Figure 8A shows the error matrix of the published NAC response prediction models and proposed model. Here, yellow represents correctly classified samples and red represents misclassified samples for each model. Interestingly, all models are correctly classifying 79 RD patients and the proposed model correctly classifies the 18 pCR samples, which were misclassified by all 8 models. Figure 8B shows the min-max normalized score distribution of the NAC response prediction models for all test samples. t-test was used to calculate the p-value of the score difference between pCR and RD. This analysis reconfirms that the proposed model is most discriminative compared to all other models (t-test p-value < 2.2 × 10 −16 ) In other words, the proposed model provides a better score distribution, which indicates a better calibration of the model that can separate pCR from RD patients more effectively.

Association for Metabolic Pathway Based Subtypes
Recently, Gong et al. proposed metabolic-pathway based subtyping (MPS) of triple negative breast cancer to reveal potential therapeutic targets [76]. We stratified our test samples based on the MPS signatures suggested by Gong (39/81). Figure 9A,B shows the ROC and PR curves for MPS specific prediction by the proposed RF model. The proposed model showed relatively high performance in MPS2 and MPS3 subtypes compared to MPS1 subtype. Table 9 shows the binary classification performance of the proposed model for each MPS subtypes. The BACC of MPS1, MPS2, and MPS3 were 0.646, 0.747, and 0.841, respectively. These results may imply that the TNBC patients who have MPS 2 subtype can have higher probability of pCR with T/FAC or T/FEC based NAC.

Discussion
In this study, we developed a gene expression based NAC response prediction model using random forest recursive feature elimination (RF-RFE). Extensive characterization of the model performance using ROC, PR curves, and binary classification metrics at a predefined FPR threshold of 10% showed that the proposed 86 gene-based RF model outperforms the existing 8 NAC response predictive models. In addition, we investigated the relationship between the proposed model and drug sensitivity in the TNBC cell models. Our model shows a high Spearman's rank correlation with cyclophosphamide sensitivity in TNBC cell lines. Through function enrichment analysis, we found that genes in our model are associated with DNA repair and cell cycle mechanisms.
Compared to prognostic models for ER+/HER2-patients currently used in clinical practice to evaluate the benefits of additional adjuvant chemotherapy [77], neoadjuvant chemotherapy response prediction models for TNBC patients are less convincing for use in clinical practice. The major concerns for these predictive models for NAC response is the misclassification of RD patients as pCR. This is because it not only guides overtreatment but also eliminates alternative treatment options for those patients. In other words, there is an unequal classification risk between the type I and type II errors, and in order to objectively evaluate the NAC response prediction models, it is necessary to control the type I error (FPR) below a certain level. We set this level as less than or equal to 10% and set the threshold based on the test set in the development dataset (GSE25066). The level selected in this study is similar to other types of NAC response prediction models, such as clinical variable based models [78], Ultrasound [79,80], and MRI imaging based [81] NAC response prediction models.
We also performed a literature survey for the association between NAC response and predictors of our model, in terms of their enriched mechanisms. Specifically, the genes mapped to the DNA repair term were MSH2 and MSH6. These genes have been reported to be associated with the NAC response in breast cancer [82][83][84]. Knockdown of MYCN has been reported to be associated with the NAC response of TNBC. [85]. Loss of ENO1 has been reported to correlate with clone forming unit (CFU) potential in MDA-MB-231 and BT-549 cell models associated with susceptibility to paclitaxel [86]. RAD51 [87] has been reported to be associated with homologous recombination deficiency (HRD) of TNBC, which affects the NAC response. Inhibition of CDK2 has been reported to be associated with the chemosensitivity of TNBC cell lines [88]. Inhibition of CDK2 has also been reported to be associated with growth inhibition in TNBC bearing mice [89] and reduced cell migration in TNBC cell lines [90]. E2F3 has been reported to be associated with doxorubicin responses [91] and EMT mechanisms in TNBC [92]. MCM2 and MCM3 have also been reported to have a role in DNA repair mechanisms in breast cancer [93], and these genes have also been reported as predictors of other NAC response predictive models [19,20]. The survey may indicate a possible mechanism of the NAC response in TNBC as DNA repair and cell cycle mechanisms by investigating the relationship between genes in the proposed model and the NAC response mechanism in TNBC.
Recently, Gong et al. proposed stratification of the TNBC patients using metabolic pathway enrichment based subtpying (MPS) [76]. They proposed three distinct subtypes, so called MPS1 (Lipogenic), MPS2 (Glycolytic) and MPS3 (Mixed) and validated their results through xenograft and organoid experiments. To derive the subtype, they performed geneset variation analysis (GSVA) for 86 metabolic pathway gene sets and consensus clustering on the GSVA score matrix. The proposed model best explained MPS2 model, which contains a large number of pCR patients compared to other two MPS subtypes. Because the proposed model is associated with DNA Repair mechanisms, the high performance of the proposed model in MPS2 subtype characterized by upregulation of carbohydrate and nucleotide metabolic pathways can be considered as consistent results with the report of Gong et al. Although this study shows the utility of the proposed RF model, there are some limitations. First, we only utilized Affymetrix microarray datasets, so the proposed model may not be applicable to other platforms such as other microarray platforms and RNA sequencing datasets. Second, due to the limited number of available samples, the performance evaluated by the current study may be inconclusive. However, our results showed fairly consistent results across multiple cohorts. Third, the relationship between the chemosensitivity of TNBC cell lines and pCR probability drawn by the proposed model was only validated by some of the TNBC cell lines available in the GDSC datasets. Because drug sensitivity profiling projects are continuously expanding, we expect validation of the current results with updated drug profiling of the TNBC cell lines in near future.

Conclusions
In conclusion, the proposed 86 gene-based random forest model provides an accurate prediction of pCR after NAC in triple negative breast cancer and may be associated with DNA repair and cell cycle mechanisms. Our study suggests that the random forest based prediction model can provide a reliable prediction of the clinical response to neoadjuvant chemotherapy and may explain chemosensitivity in TNBC.

Supplementary Materials:
The following supporting information can be downloaded at: https: