Prognostic Models Using Machine Learning Algorithms and Treatment Outcomes of Occult Breast Cancer Patients

Background: Occult breast cancer (OBC) is an uncommon malignant tumor, and its prognosis and treatment remain controversial. Currently, there exists no accurate prognostic clinical model for OBC, and the treatment outcomes of chemotherapy and surgery in its different molecular subtypes are still unknown. Methods: The data analyzed in this study were obtained from the SEER database (2010–2019). To identify the prognostic variables for patients with OBC, we conducted Cox regression analysis and constructed prognostic models using six machine learning algorithms to predict overall survival (OS) of OBC patients. A series of validation methods, including calibration curves and the area under the receiver operating characteristic (ROC) curve (AUC), were employed to validate the accuracy and reliability of the logistic regression (LR) models. The clinical applicability of the predictive models was validated using decision curve analysis (DCA). We also investigated the role of chemotherapy and surgery in OBC patients with different molecular subtypes using Kaplan-Meier (K-M) survival analysis and propensity score matching, and these results were further validated by subgroup Cox analysis. Results: The LR models performed best, with high precision and applicability, and predicted the OS of OBC patients most accurately (test set: 1-year AUC = 0.851, 3-year AUC = 0.790 and 5-year AUC = 0.824). Interestingly, we found that the N1 and N2 stage OBC patients had more favorable prognosis than N0 stage patients, but the N3 stage was similar to the N0 stage (OS: N0 vs. N1, HR = 0.6602, 95%CI 0.4568–0.9542, p < 0.05; N0 vs. N2, HR = 0.4716, 95%CI 0.2351–0.9464, p < 0.05; N0 vs. N3, HR = 0.96, 95%CI 0.6176–1.5844, p = 0.96). Age >80 years and distant metastases were also independent prognostic factors for OBC.
In terms of treatment, our multivariate Cox regression analysis showed that surgery and radiotherapy were both independent protective variables for OBC patients, but chemotherapy was not. We also found that chemotherapy significantly improved both OS and breast cancer-specific survival (BCSS) only in the HR−/HER2+ molecular subtype (OS: HR = 0.15, 95%CI 0.037–0.57, p < 0.01; BCSS: HR = 0.027, 95%CI 0.027–0.81, p < 0.05). However, surgery improved prognosis only in the HR−/HER2+ and HR+/HER2− subtypes. Conclusions: We analyzed the clinical features and prognostic factors of OBC patients and constructed machine learning prognostic models with high precision and applicability to predict their overall survival. The treatment results in different molecular subtypes suggested that primary surgery might improve the survival of HR+/HER2− and HR−/HER2+ subtypes; however, only the HR−/HER2+ subtype could benefit from chemotherapy. The necessity of surgery and chemotherapy needs to be carefully considered for OBC patients with other subtypes.


Introduction
Occult breast cancer (OBC) is a rare malignant tumor, with an estimated incidence of 0.1% to 1% of all breast cancers. It is distinguished by the presence of pathological breast carcinoma in local lymph nodes or distal metastatic organs (usually axillary lymphadenopathy), but clinical or imaging examination fails to demonstrate the primary breast tumor [1][2][3]. Since OBC was initially described by Halsted in 1907 [4], its prognosis and management have been a matter of debate [1,5,6].
So far, the prognosis of OBC is still debatable. OBC patients have been shown to have a lower risk of mortality than non-OBC patients in some studies [7,8], whereas others have revealed comparable outcomes [9] or even significantly worse prognoses [10]. In addition, studies suggest that the prognostic factors of OBC patients vary greatly with different clinical features [3,5,11]. Moreover, nearly all these studies analyzed only patients with stage I to III disease, so the effect of distant metastasis on OBC patients is still unclear. Therefore, there is an urgent need for prognostic prediction models that can accurately address the survival concerns of all OBC patients and help optimize their management.
So far, two studies have built several nomograms to predict the breast cancer-specific survival (BCSS) of OBC patients [3,12]; however, one model can only be used for patients who have undergone surgery [3] and the other can only be used for early-stage patients [12]. Moreover, the accuracy of these models is far from sufficient (AUC value or C-index is only around 0.7), and neither assessed overall survival (OS). Consequently, a more widely applicable and accurate model is needed. Nowadays, with the advent of machine learning, analyzing the vast, multi-dimensional and multimodal data generated by a clinical database has become easier [13,14]. Machine learning can also help us construct artificial intelligence (AI) prognostic models, significantly improving their accuracy [14][15][16]. Our previous study successfully used a machine learning algorithm to predict the prognosis of breast cancer patients with initial bone metastases and greatly improved the accuracy [15]. However, no one has utilized machine learning to create prognostic models for OBC patients. Thus, we used six machine learning algorithms to create prognostic models and found that the LR algorithm performed best. Moreover, it has been challenging to conduct randomized controlled trials and standardize management because this particular type of breast cancer is extremely rare. Many retrospective studies have focused on the effect of different treatment methods on the prognosis of all OBC patients [7,17–21], but none has analyzed it in relation to different molecular subtypes; hence, further inquiry is needed.
Our study explores the factors influencing the prognosis of OBC patients using the most up-to-date Surveillance, Epidemiology and End Results (SEER) database and is the first to create high-precision AI models to predict the 1-, 3- and 5-year OS of OBC patients. We are the first to investigate the role of the N0 stage, family income and months from diagnosis to therapy in OBC patients. Additionally, we investigated the treatment outcomes of surgery and chemotherapy in different molecular subtypes, which have never been reported before, and found that primary surgery might improve the survival of OBC patients with HR+/HER2− and HR−/HER2+ subtypes; however, only the HR−/HER2+ subtype could benefit from chemotherapy. The necessity of surgery and chemotherapy needs to be carefully considered for OBC patients with other subtypes. This work provides insight into the prognosis of OBC patients and is helpful for their prognostic prediction and clinical management.

Data Source and Study Design
The workflow for the design and analyses of this study is illustrated in Figure 1. The data used for analysis in our study were obtained from the SEER database [SEER research plus data, 17 Regs, November 2021 Sub (2000-2019); version 8.4.0], which is openly accessible. As information about distant metastasis and molecular subtypes has been collected only since 2010, and taking account of the data update, we analyzed only the data from 2010-2019. From this database, data on females with OBC were obtained. Inclusion criteria were as follows: (1) breast cancer was the sole primary cancer the patient had been diagnosed with; (2) histopathological and structural evidence was consistent with the International Classification of Diseases for Oncology, Third Edition (ICD-O-3); (3) age ≥18 years; and (4) T0 stage cancer according to the American Joint Committee on Cancer (AJCC). Exclusion criteria were as follows: (1) two or more primary cancers; (2) an implausible TNM stage, such as T0N0M0; and (3) unknown survival time. Follow-up continued until the patient died, was lost to follow-up or until 31 December 2019.

Machine Learning Models
For feature selection, patients were randomly divided into train and test sets in a 7:3 ratio. In the train set, characteristics that were statistically significant in the multivariate Cox analysis, including age at diagnosis, molecular subtype, N stage, surgery, and bone, lung, liver and brain metastases, were included in our machine learning models to predict 1-, 3- and 5-year overall survival of OBC patients. These analyses were conducted before excluding patients who were still alive but had survived less than 1, 3 or 5 years at the follow-up cut-off date. A response variable for the survival information was defined before training, with 1 denoting survival and 0 denoting death. On the test data, we compared the area under the curve (AUC value) of logistic regression (LR), random forest (RF), support vector machine (SVM), decision tree (ID3), k-Nearest Neighbor (kNN) and extreme gradient boosting (XGBoost). The LR model was further assessed by calibration curve and decision curve analysis.
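The label construction and random split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code (the study was run in R): the record layout (follow-up months, death flag), the helper names `horizon_label` and `split_train_test`, the seed and the toy records are all hypothetical, and the exclusion rule reflects our reading of the text.

```python
import random

def horizon_label(months, dead, horizon_months):
    """Return 1 (alive at the horizon), 0 (died before it) or None (excluded).

    Patients who are still alive but were followed for less than the
    horizon have an unknown outcome at that horizon, so they are excluded.
    """
    if months >= horizon_months:
        return 1
    if dead:
        return 0
    return None

def split_train_test(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and split it into train and test lists."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy records: (follow-up in months, died during follow-up)
patients = [(70, False), (8, True), (20, False), (40, True), (5, False),
            (90, True), (14, True), (36, False), (61, False), (2, True)]

train, test = split_train_test(patients)
labels_3y = [horizon_label(m, d, 36) for m, d in patients]
```

Separate label vectors would be built in the same way for the 12- and 60-month horizons, which is why the number of usable patients can differ between the 1-, 3- and 5-year models.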

LR
Logistic regression, also known as log-odds regression, is a classification algorithm [22]. Under the assumption that the outcome variable follows a probability distribution, logistic regression models the log odds of each patient experiencing the outcome as a linear function of the features. The log odds are converted to probabilities by means of a "sigmoid" function. Logistic regression is a highly interpretable algorithm and a hallmark of classical predictive modeling.
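As a small illustration of the log-odds-to-probability step, the sketch below evaluates a linear log-odds model and passes it through the sigmoid. The intercept and coefficients are made-up values chosen for the example, not estimates from this study, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_survival_probability(intercept, coefficients, features):
    """Linear model on the log-odds scale, followed by the sigmoid transform."""
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, features))
    return sigmoid(log_odds)

# Hypothetical coefficients for two binary features (not fitted to any data)
p = predict_survival_probability(1.0, [-0.8, 0.5], [1, 0])
```

Because the model is linear on the log-odds scale, each coefficient can be read as the change in log odds associated with a one-unit change in the corresponding feature, which is the source of the interpretability mentioned above.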

SVM
Support vector machines locate the hyperplane that divides data points into two groups by mapping input vectors to a higher-dimensional feature space [23]. The hyperplane is chosen to maximize the margin, i.e., the distance between the decision hyperplane and the instances nearest to it. The identified hyperplane is the decision boundary between the two classes, and the resulting classifier has considerable generalization power.
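To make the margin idea concrete, the following sketch computes the margin of two candidate hyperplanes on a toy data set; a support vector machine would prefer the one with the larger margin. The points, labels and candidate hyperplanes are hypothetical, and the code only evaluates given hyperplanes rather than solving the SVM optimization problem.

```python
import math

def signed_distance(w, b, x):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points, labels):
    """Smallest label-signed distance; positive only if every point is on its correct side."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Two hypothetical separating hyperplanes: x1 + x2 - 2 = 0 and x1 - 0.5 = 0
margin_a = margin((1.0, 1.0), -2.0, points, labels)
margin_b = margin((1.0, 0.0), -0.5, points, labels)
```

Both hyperplanes separate the toy classes, but the first leaves more room between the boundary and the nearest points.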

ID3 and RF
The tree-structured classification method used by ID3, one of the earliest and most prevalent machine learning architectures, uses nodes to represent input factors and leaves to represent decision outcomes [24]. Decision trees (DTs) are easy to interpret and fast to train. Building on decision trees, a random forest is generated by repeatedly drawing k bootstrap samples at random, with replacement, from the original training set, N, and then fitting one classification tree to each bootstrap sample; the k trees together form the random forest.
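The resampling and voting steps of a random forest can be sketched as below. Tree fitting itself is omitted; the stand-in training set, the seed, the value k = 5 and the example tree outputs are assumptions made for illustration only.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items at random with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Return the most common class label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
training_set = list(range(10))  # stand-in for 10 patient records
samples = [bootstrap_sample(training_set, rng) for _ in range(5)]  # k = 5

# Hypothetical class predictions of the 5 trees for one patient
vote = majority_vote([1, 0, 1, 1, 0])
```

Each bootstrap sample has the same size as the training set but typically repeats some records and omits others, which is what makes the individual trees differ.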

XGBoost Model
The XGBoost algorithm modifies the gradient boosting algorithm by expanding the loss function to second order with a Taylor expansion, adding a regularization term to the loss function, and solving for the extreme values of the loss function using Newton's method [25]. In addition, XGBoost employs a technique called "feature subsampling", in which each tree is trained on a subset of all features (similar to a random forest), to make the trees more diverse, improve the generalization capability of the model and prevent overfitting.
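The second-order step can be illustrated for a single leaf. With the first and second derivatives of the loss summed over the samples in a leaf (G and H) and an L2 penalty λ, the regularized second-order approximation G·w + ½(H + λ)·w² is minimized at w = −G / (H + λ). The sketch below applies this to the logistic loss; the three samples, the starting probability of 0.5 and λ = 1 are assumptions chosen for the example, and the function names are hypothetical.

```python
def logistic_grad_hess(y, p):
    """First and second derivative of the logistic loss with respect to the raw score."""
    return p - y, p * (1.0 - p)

def optimal_leaf_weight(grads, hessians, reg_lambda=1.0):
    """Newton step for one leaf: w = -G / (H + lambda)."""
    g_sum = sum(grads)
    h_sum = sum(hessians)
    return -g_sum / (h_sum + reg_lambda)

# Three samples in one leaf, each with a current predicted probability of 0.5
labels = [1, 1, 0]
derivs = [logistic_grad_hess(y, 0.5) for y in labels]
weight = optimal_leaf_weight([g for g, _ in derivs], [h for _, h in derivs])
```

The leaf weight is positive because two of the three samples are positive, and the regularization term shrinks it toward zero.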

kNN
The kNN algorithm is founded on the premise that, if most of the k nearest neighbors of a sample in the feature space fall under a category, the sample also falls under that category and shares the same traits [26]. In making the classification decision, the technique assigns the sample to be classified solely on the basis of the categories of its few nearest neighbors.
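A minimal sketch of this voting rule, assuming Euclidean distance and k = 3 (the source does not state the distance metric or the value of k used), with hypothetical two-dimensional toy data:

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
train_labels = [0, 0, 0, 1, 1, 1]

pred_low = knn_predict(train_points, train_labels, (0.5, 0.5))
pred_high = knn_predict(train_points, train_labels, (5.5, 5.5))
```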

Statistical Analysis
To explore the connection between diverse pathological and clinical traits and patient survival, we applied univariate Cox regression models. To evaluate patient mortality risk and determine independent prognostic factors, multivariate Cox analysis was then carried out. To examine the effect of chemotherapy and surgery on the outcome of patients with OBC, patients receiving each therapy were matched 1:1 with patients not receiving it by propensity score matching (PSM), using the variables in the univariate Cox regression [27]. On the PSM-adjusted population, we also conducted Kaplan-Meier (K-M) survival analysis [28] stratified by molecular subtype. Finally, we performed subgroup univariate and multivariate Cox analyses in OBC patients according to molecular subtype. R software (version 4.0.2) was employed to conduct all the statistical analyses in this study. A two-sided p value of less than 0.05 was considered statistically significant.
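One common way to carry out 1:1 matching is greedy nearest-neighbor matching on the propensity score without replacement, optionally with a caliper. The sketch below illustrates that approach; the source does not say which matching algorithm or caliper was used, so the caliper of 0.05, the function name and the toy scores are assumptions.

```python
def match_one_to_one(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on propensity score, without replacement.

    treated, controls: dicts mapping patient id -> propensity score.
    Returns a list of (treated_id, control_id) pairs; treated patients with
    no control within the caliper stay unmatched.
    """
    available = dict(controls)
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda cid: abs(available[cid] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs

# Hypothetical propensity scores
treated = {"t1": 0.30, "t2": 0.62, "t3": 0.95}
controls = {"c1": 0.28, "c2": 0.60, "c3": 0.33, "c4": 0.50}
pairs = match_one_to_one(treated, controls)
```

In this toy example the third treated patient has no sufficiently similar control and is left unmatched, which shows how a matched cohort can end up smaller than the treated group.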

Clinical Characteristics of OBC Patients
In total, we obtained information on 906 eligible OBC patients from the SEER database (2010 to 2019). The clinicopathological traits of OBC patients are displayed in Table 1 and summarized below. The patients' median age was 62 years; 142 (15.67%) patients were younger than 50 years, and 92 (10.15%) patients were older than 80 years. In total, 449 (49.56%) patients began therapy immediately following diagnosis, whereas 377 (41.61%) patients began therapy more than 1 month after diagnosis.
For the molecular subtypes, HR+/HER2− made up 41.50%, followed by HR−/HER2− (16.11%), HR+/HER2+ (11.81%) and HR−/HER2+ (8.39%). In terms of ethnicity, 80.68% of the patients were white, and the most prevalent histopathological subtype was invasive ductal carcinoma (IDC; 30.57%). Regarding marital status, 49.89% of the patients were married and 14.79% were single. The proportions of stages N0 to N3 were 18.98%, 53.20%, 8.28% and 11.48%, respectively. Only 0.66% of the patients had grade I tumors, compared to 14.24% who developed grade III or IV. About 31.90% of the patients were found to have a decent annual family income of more than US$750,000. In the treatment field, only 24.50% of patients received surgery, 43.71% received radiotherapy and 64.13% received chemotherapy. Bone, lung, liver and brain metastases, respectively, accounted for 29.80%, 9.93%, 8.94% and 4.30% of all patients.

Univariate and Multivariate Cox Regression Analysis
To uncover significant factors influencing BCSS, as well as overall survival (OS) of OBC patients, we conducted univariate Cox regression analysis, including age at diagnosis, time from diagnosis to therapy, histological type, molecular subtype, marital status, N stage, race, grade, median family income (inflation-adjusted), distant metastases and information about treatment (Table 2).
Furthermore, we carried out multivariate Cox regression analysis to eliminate confounding factors and uncover independent factors correlated with BCSS and OS (Table 2). It showed that age >80 years and distant metastases were significantly linked to inferior BCSS and OS. The HR−/HER2− subtype showed worse BCSS and OS compared with HR+/HER2− patients, whereas the HR+/HER2+ and HR−/HER2+ subtypes did not exhibit any difference. Patients at N1 and N2 stages had more favorable prognosis than those at the N0 stage, but the N3 stage was similar to the N0 stage. In terms of treatment, only primary tumor surgery prolonged both OS and BCSS according to multivariate Cox regression analysis; radiotherapy improved OS but not BCSS, and chemotherapy was not an independent prognostic factor. Additionally, social variables such as family income and marital status were analyzed; however, they were not independent prognostic factors for OBC.

Constructing and Assessing Predictive Models for the Estimation of OBC Patients' Prognosis
In light of the above findings, patients were randomly divided into train and test sets in a 7:3 ratio (Supplementary Table S1), and univariate and multivariate Cox analyses were repeated on the train set (Supplementary Table S2). Eight independent prognostic factors were chosen as model features, and prognostic models were created with six machine learning algorithms to assess the OS of OBC patients at 1, 3 and 5 years (Table 3). Then, the accuracy of our LR models was further assessed using calibration curves [29]. According to the calibration curves of the train and test sets (Figure 3A-F), the predicted values of the LR models were in close agreement with the observed values, indicating that the LR models had high accuracy. After determining the accuracy of the prediction models, we further analyzed their clinical applicability via decision curve analysis (DCA) [30]. The results showed that the LR models had a wide threshold probability range and a good net benefit in predicting 1-year, 3-year and 5-year OS rates for OBC (Figure 4A-F). Overall, our models performed well.
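Two of the quantities used in this evaluation have simple definitions that can be sketched directly. The AUC is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case, and the net benefit in decision curve analysis at threshold probability p_t is TP/n − (FP/n) × p_t / (1 − p_t). The labels and predicted probabilities below are hypothetical toy values, not data from this study, and the function names are our own.

```python
def auc(y_true, y_prob):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on every prediction at or above the threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Reference strategy in which every patient is classified as positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.3, 0.7, 0.2, 0.1]

auc_value = auc(y_true, y_prob)
nb_model = net_benefit(y_true, y_prob, 0.5)
nb_all = net_benefit_treat_all(y_true, 0.5)
```

A decision curve repeats the net-benefit calculation over a range of thresholds and compares the model with the "treat all" and "treat none" (net benefit 0) strategies.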

Benefits of Chemotherapy in OBC Patients Subdivided by Molecular Subtype
Unexpectedly, chemotherapy was not an independent prognostic factor for OBC patients in our multivariate Cox regression analysis (Table 2). Hence, we took a further look at how chemotherapy affected OBC patient prognosis. We compared the baseline features of patients who received chemotherapy with those of patients who did not (Table 4). These two groups differed at baseline. Therefore, the observed disparity was adjusted with the help of propensity score matching (PSM). After PSM adjustment, there were no discernible differences in baseline characteristics (Table 4). According to the PSM-adjusted data, the chemotherapy group's overall risk of death was reduced by about 28% (p = 0.013, HR: 0.72; 95% CI: 0.56-0.93) (Figure 5A), whereas there was no difference in the risk of breast cancer-related death (p = 0.17, HR: 0.81; 95% CI: 0.6-1.09) (Figure 5B). According to the stratified K-M survival analysis, only the HR−/HER2+ subgroup benefited substantially from chemotherapy in terms of OS and BCSS (Figure 6C,G); chemotherapy did not show any benefit for the OS and BCSS of the other three subtypes (Figure 6A,B,D-F,H). To further validate these results, we divided all 906 eligible OBC patients into four groups, according to molecular subtype, and performed univariate and multivariate Cox analyses again (Supplementary Table S3). These analyses showed that only the HR−/HER2+ subtype could benefit from chemotherapy, which is consistent with our results for the PSM-adjusted K-M survival analysis.
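The "about 28%" figure follows directly from the hazard ratio, since the relative reduction in hazard is (1 − HR) × 100. As a rough plausibility check, a Wald-type p value can also be recovered from the reported 95% CI by assuming normality on the log-hazard scale. The helper names below are hypothetical, and the reported p value may come from a different test (for example, a log-rank test), so a small discrepancy is to be expected.

```python
import math

def risk_reduction_percent(hr):
    """Relative reduction in hazard implied by a hazard ratio below 1."""
    return (1.0 - hr) * 100.0

def wald_p_from_ci(hr, lower, upper, z_crit=1.96):
    """Approximate two-sided p value from a 95% CI, assuming normality of log(HR)."""
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(hr) / se
    return math.erfc(abs(z) / math.sqrt(2.0))

# Values reported for OS in the PSM-adjusted chemotherapy comparison
reduction = risk_reduction_percent(0.72)
p_approx = wald_p_from_ci(0.72, 0.56, 0.93)
```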

Benefits of Surgery for OBC Patients Subdivided by Molecular Subtype
In view of the above results, we looked further into the influence of surgery on the prognosis of OBC patients with different subtypes. After applying the same PSM method, there were no significant differences in baseline characteristics between patients receiving surgical treatment and those without surgery (Table 5).
According to the PSM-adjusted data, the surgery group's overall risk of death was reduced by around 56% (p = 0.001, HR: 0.44; 95% CI: 0.27-0.73) (Figure 7A), with the risk of breast cancer-related death reduced by approximately 51% (p = 0.012, HR: 0.49; 95% CI: 0.27-0.87) (Figure 7B). The stratified K-M survival analysis showed that surgical treatment significantly improved OS in the HR+/HER2− and HR−/HER2+ subtypes (Figure 8A,C). However, there was no significant difference in the HR+/HER2+ and HR−/HER2− subtypes (Figure 8B,D). In addition, the effect of surgical treatment on BCSS in patients with all subtypes was similar (Figure 8E-H). To further validate these results, we divided all 906 eligible OBC patients into four groups, according to molecular subtype, and performed univariate and multivariate Cox analyses again (Supplementary Table S3). These analyses showed that surgical intervention was an independent prognostic factor only for the HR+/HER2− and HR−/HER2+ subtypes, supporting our findings from the PSM-adjusted K-M survival analysis.
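For readers unfamiliar with the K-M method used in these comparisons, the product-limit estimator can be sketched in a few lines: at each observed death time, the running survival probability is multiplied by one minus the fraction of at-risk patients who died. The follow-up times and event flags below are hypothetical toy values, and the sketch handles right-censored data only.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    times: follow-up durations; events: 1 = death observed, 0 = censored.
    """
    survival = 1.0
    at_risk = len(times)
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve

# Hypothetical follow-up (months) and event indicators for one subgroup
times = [5, 8, 8, 12, 20, 20, 30]
events = [1, 1, 0, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Stratified analysis amounts to computing one such curve per treatment arm within each molecular subtype and comparing the arms.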

Discussion
OBC is an unusual clinical entity and represents a therapeutic challenge for doctors [31]. Since this type of breast cancer is quite rare, its prognosis remains debatable, and standardized management of OBC is still difficult [1,6]. Some large-sample retrospective studies using SEER could help solve the problem of rare cases, but most such cases in previous studies have a large time span [7,20,21] and some cases that were not OBC might have been considered so in the past because of the limitations of imaging technology [6,32,33]. The present study, as far as we are aware, is the most up-to-date one to examine the clinical traits and prognosis of OBC patients. In two recent investigations, several nomogram prediction models for OBC patients were created using SEER populations [3,12]; however, their models could not predict OS and could only be used for patients who had undergone surgery [3] or were at an early-stage [12]. Moreover, the accuracy of their models is far from sufficient. Thus, our research is also the first to develop AI prognostic models for OBC patients, and our LR models are the most widely available and are more accurate in predicting the OS of OBC patients.
This study identified several independent factors associated with poor prognosis, including age ≥80, triple negative molecular subtype, N0/N3 stage, and distant metastasis. Some studies have shown that OBC patients aged ≥70 are more likely to have worse OS [20,21], whereas other studies have claimed that age was not a risk factor [3,5,12]. We looked at a wider range of age categories and found worse OS for people aged ≥80. Compared to the HR+/HER2− subtype, only the HR−/HER2− subtype showed poorer survival, and some studies also showed that ER+ was an independent favorable factor [3,5], which implies the importance of endocrine therapy for HR+ OBC patients. In contrast, several studies have indicated that OBC patients of different subtypes showed no difference in terms of survival [7,12,20], which could be attributed to the diverse enrolled populations. Interestingly, compared with the N0 stage, OBC patients at N1 and N2 stages showed better OS and BCSS, but there was no difference between the N0 and N3 stages. Perhaps the main reason for this is that N0 stage OBC must be accompanied by distant metastasis, which is itself an unfavorable independent prognostic factor; thus, OBC patients at N0 and N3 stages had the worst prognosis. Some previous studies have shown that N2+ is an unfavorable independent prognostic factor of OBC [3,12], but all of them used the N1 stage as the reference; in other words, we are the first to have investigated the role of the N0 stage in OBC patients. We also examined the role of family income and months from diagnosis to therapy, which have never been reported in OBC patients, although neither was a prognostic factor in OBC.
In terms of treatment, surgery and radiotherapy were both independent protective variables for OBC patients, according to our multivariate Cox regression analysis of the data, whereas chemotherapy was not. Many studies have focused on the therapeutic effects of different surgical methods, such as mastectomy or breast-conserving treatment, combined with radiotherapy and indicated that breast conservation can be considered in patients with OBC [17,[19][20][21]34,35]. Surprisingly, previous studies have also reported that chemotherapy was not an independent prognostic factor in OBC patients [3,11,20]. However, no one had investigated the role of chemotherapy and surgery in OBC patients with different subtypes, thus, we further explored this issue. We found that chemotherapy significantly improved both OS and BCSS only in OBC patients with the HR−/HER2+ subtype, suggesting that anti-HER2-targeted therapy combined with chemotherapy may prolong the survival of OBC patients and that endocrinotherapy is more important in the HR+ subtype than chemotherapy. We also found that surgery appeared to be an independent prognostic factor only when it comes to HR+/HER2− and HR−/HER2+ subtypes, indicating that comprehensive endocrinotherapy and surgical treatment is very important for the HR+/HER2− subtype and that multimodal treatment, involving chemotherapy, surgical treatment and anti-HER2-targeted therapy, could benefit the HR−/HER2+ subtype. For OBC patients with other subtypes, the necessity of surgery and chemotherapy needs to be carefully considered.
Our study has several limitations despite its promising findings. First, for systemic therapy, the current database contains no detailed information on, for example, the dosage of each drug or the chemotherapy regimen; hence, we were unable to examine the relationships between various chemotherapy regimens and the survival of patients. Moreover, the most recent version of the SEER database does not contain any information about endocrine therapy. Second, the SEER database represents the general situation well, but due to ethnic differences, it may not always apply to Asian, and especially Chinese, patients. Third, owing to the limited number of cases, not all patients could be matched in PSM, so selection bias might have occurred.

Conclusions
We analyzed the clinical features of OBC patients and constructed three machine-learning prognostic models with high precision and applicability to predict their survival. According to our analysis of possible prognostic variables for OBC patients, patients with the HR−/HER2+ subtype may benefit from chemotherapy, whereas patients with the HR+/HER2− and HR−/HER2+ subtypes may benefit from primary surgery.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jcm12093097/s1, Table S1: Baseline characteristics of occult breast cancer (OBC) patients in train set and test set; Table S2: Univariate and multivariate Cox analysis of OBC characteristics extracted from train data; Table S3: Univariate and multivariate Cox analysis of OBC characteristics (stratified by molecular subtype).
Institutional Review Board Statement: Ethical review and approval were waived for this study due to the fact that the data are fully de-identified and no intervention on patients was performed.

Informed Consent Statement: Not applicable.
Data Availability Statement: All data here are publicly available in the SEER database (https://seer.cancer.gov/ (accessed on 15 April 2022)).