Sentinel Lymph Node Metastasis on Clinically Negative Patients: Preliminary Results of a Machine Learning Model Based on Histopathological Features

The reported incidence of node metastasis at sentinel lymph node biopsy is generally low, so that the majority of women undergo unnecessary invasive axillary surgery. Although the sentinel lymph node biopsy is time consuming and expensive, it is still the intra-operative exam with the highest performance; however, surgery is sometimes performed without a clear diagnosis and carries the risk of serious complications. In this work, we developed a machine learning model to predict sentinel lymph node positivity in clinically negative patients. Clinical and immunohistochemical breast cancer features of 907 patients characterized by a clinically negative lymph node status were collected. We trained different machine learning algorithms on the retrospectively collected data and selected an optimal subset of features through a sequential forward procedure. We found comparable performances for different classification algorithms: on the hold-out training set, the logistic regression classifier with seven features, i.e., tumor diameter, age, histologic type, grading, multiplicity, in situ component and Her2-neu status, reached an AUC value of 71.5% and showed a better trade-off between sensitivity and specificity (69.4 and 66.9%, respectively) compared to the other two classifiers. On the hold-out test set, the performance dropped by five percentage points in terms of accuracy. Overall, the histological characteristics alone did not allow us to develop a support tool suitable for actual clinical application, but the analysis showed the maximum informative power they contain for the resolution of the clinical problem. The proposed study represents a starting point for the future development of predictive models that estimate the probability of lymph node metastases by combining histopathological features with features of a different nature.


Introduction
The prediction of lymph node involvement in breast cancer represents an important task which could reduce unnecessary surgery and improve the definition of oncological therapies [1][2][3][4][5][6]. The research of a trade-off strategy among time-consuming, expensive and invasive methodologies is the scientific goal of several research studies [7,8], especially with reference to breast cancer patients with clinically negative sentinel lymph nodes.
The aim of our work is to develop other less invasive and preferably cheaper diagnostic tools without compromising the quality of diagnosis and patient care. In clinically negative patients, the availability of a new tool able to provide an accurate probabilistic estimation could make surgical actions unnecessary, and this would result in an immediate improvement in both the effectiveness and the quality of care [19].
In our previous work [20], we generalized an open-source classification algorithm by adding other prognostic factors to those used in the original work [21][22][23]. In this study, we report the results of a multivariate analysis aimed at developing a predictive model of sentinel lymph node status for patients characterized by a clinically negative lymph node status. In particular, we evaluated the predictive power of different clinical and immunohistochemical features of breast cancer. With respect to the previous work, we introduced new prognostic factors, such as Her2/neu, multiplicity and an in situ component, and trained three different state-of-the-art machine learning algorithms on the retrospectively collected data. Moreover, we implemented a sequential forward procedure to select an optimal subset of features, and we evaluated the results obtained in hold-out cross-validation.

Experimental Data
The dataset used in our analysis is composed of the histological outcomes of 907 patients, registered in the period 2015-2018 and referred to Istituto Tumori "Giovanni Paolo II" in Bari (Italy), who were negative at both clinical and instrumental examination and had undergone the one-step nucleic acid amplification (OSNA) procedure. This procedure is time consuming and expensive, but it is still the intra-operative exam with the highest performance (it currently has a sensitivity of 87.5-100% and a specificity of 90.5-100%) [12][13][14].
We considered patients with clinically negative lymph nodes who did not have suspicious signs on axillary ultrasound, which is a routine examination during the presurgical staging of the axilla, or patients who tested negative at a fine needle aspiration biopsy performed after the identification of axillary changes on instrumental examination.
For each patient, we collected several prognostic factors characterizing the tumor, evaluated on postoperative specimen pathology. The retrospective observational study was approved by the Scientific Board of the Istituto Tumori "Giovanni Paolo II" and carried out in accordance with the Declaration of Helsinki. Based on our regulation on retrospective studies, all patients who gave consent to use the data for scientific purposes were recruited.
The tumor grade G was defined by the Elston-Ellis modification of the Scarff-Bloom-Richardson grading system on a three-grade scale, i.e., grade G1 (low grade), G2 (intermediate grade) or G3 (high grade), where a lower grade indicates a better prognosis [24]. The histological exam was performed through multiple biopsy sampling with 14-16 G core under ultrasound guidance. Immunohistochemical expression is categorized according to the following molecular subtypes on the basis of the St. Gallen convention [25], using a threshold equal to 20% for Ki67 [26]: luminal A (ER+ and/or PR+, HER2- and low Ki67), luminal B (ER+ and/or PR+, and HER2+ or HER2- with high value of Ki67), HER2 positive (ER/PR- and HER2+) and triple negative (ER-, PR- and HER2-).
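The subtype rules listed above can be written as a small decision function. The following is only an illustrative sketch, not the authors' implementation: the function name, its arguments and the choice to treat a Ki67 value exactly at the 20% threshold as "high" are assumptions, and the example patients are hypothetical.

```python
def molecular_subtype(er_pos, pr_pos, her2_pos, ki67_percent, ki67_threshold=20.0):
    """Map receptor status and Ki67 (%) to one of the four subtypes in the text."""
    hormone_pos = er_pos or pr_pos  # "ER+ and/or PR+"
    # Assumption: a Ki67 value exactly at the threshold counts as "high".
    high_ki67 = ki67_percent >= ki67_threshold
    if hormone_pos:
        if not her2_pos and not high_ki67:
            return "luminal A"
        return "luminal B"  # HER2+, or HER2- with high Ki67
    return "HER2 positive" if her2_pos else "triple negative"


# Hypothetical patients, for illustration only.
examples = [
    molecular_subtype(True, False, False, 10.0),
    molecular_subtype(True, True, False, 35.0),
    molecular_subtype(False, False, True, 40.0),
    molecular_subtype(False, False, False, 60.0),
]
```

With these inputs, `examples` contains one patient for each of the four subtypes, in the order in which they are listed in the text.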

Statistical Analysis
In order to evaluate the association between each clinical feature and the sentinel lymph node status, we used the Mann-Whitney test for the age feature measured on an interval scale, whereas we used the Chi-square or Spearman test for all the other features that were measured on a nominal or ordinal scale, respectively. A result was considered statistically significant when the p-value was less than 0.05.
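As an illustration of the association test used for nominal features, the sketch below computes the Pearson Chi-square statistic and its p-value for a 2x2 contingency table using only the Python standard library. It is a minimal example under stated assumptions: no continuity correction is applied, the counts are invented for illustration and are not taken from the study, and the closed-form p-value is valid only for one degree of freedom.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts (NOT study data):
# rows = feature absent / present, columns = SLN negative / positive.
toy_table = [[40, 10], [25, 25]]
stat, p_value = chi_square_2x2(toy_table)
significant = p_value < 0.05  # same significance level as in the text
```

For features with more than two categories the statistic is computed in the same way, but the p-value requires the general Chi-square distribution, which a statistics package would normally provide.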
The Partitioning Around Medoids (PAM) algorithm was employed to perform unsupervised clustering into k groups. Specifically, the PAM algorithm tries to minimize the mean square error by reducing the distance between the points of a cluster and the point that, among the observed data, is located most centrally, called the medoid. For the optimal estimate of the k medoids (and therefore of the k clusters), we used the silhouette analysis, which allows us to graphically visualize the quality of the clustering. The silhouette index is generally used to identify the optimal number of groups in a hierarchical cluster and as a synthetic indicator to evaluate the overall quality of clustering [27]. Its advantages are the low computational complexity and the simple rules of interpretation.
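To make the silhouette index concrete, the sketch below computes the mean silhouette of a given partition. It does not implement PAM itself; it only scores a clustering that is already available. The use of the Hamming distance for categorical features and the toy points are assumptions made for illustration, and at least two clusters are required.

```python
def hamming(u, v):
    """Fraction of positions in which two categorical vectors differ."""
    return sum(x != y for x, y in zip(u, v)) / len(u)


def mean_silhouette(points, labels, dist=hamming):
    """Average silhouette s(i) = (b - a) / max(a, b) over all points."""
    n = len(points)
    clusters = sorted(set(labels))
    total = 0.0
    for i in range(n):
        # Mean distance from point i to every cluster (excluding i itself).
        mean_to = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_to[c] = sum(dist(points[i], points[j]) for j in members) / len(members)
        own = labels[i]
        if own not in mean_to:
            continue  # singleton cluster: silhouette is taken as 0
        a = mean_to[own]                                        # cohesion
        b = min(v for c, v in mean_to.items() if c != own)      # separation
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / n


# Toy categorical data (hypothetical): two obvious groups.
toy_points = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]
good = mean_silhouette(toy_points, [0, 0, 1, 1])
bad = mean_silhouette(toy_points, [0, 1, 0, 1])
```

The partition that keeps similar points together obtains the higher value; repeating the computation for several values of k and keeping the k with the highest mean silhouette is the selection rule described in the text.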
In order to show the results of the cluster analysis in a bivariate space, we applied a Multiple Correspondence Analysis for dimensionality reduction.

Classification Models
Three different classification models were trained to predict sentinel lymph node positivity. We used well-known machine learning methods: Random Forest (RF), logistic regression, and Naive Bayes.
Random Forest is a well-known ensemble machine learning classifier, which generally provides good performance with low over-fitting [28]. RF provides an embedded method for feature selection: it performs feature selection and classification at the same time. There are two measures of importance for each feature: the first measures how much the accuracy decreases when a feature is excluded from the forest; the second measures the decrease in Gini impurity when a feature is chosen to split a node of a tree. In our work, we used the second measure. A standard configuration of RF was adopted, with 100 trees and 20 features (as described in Breiman (24)) randomly selected at each split, because more complicated architectures did not give any significant classification improvement. Moreover, in order to control the over-fitting risk, we fixed a small number of observations per tree leaf, namely five.
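The second importance measure can be illustrated on a single split. The sketch below computes the Gini impurity of a binary-labelled node and the impurity decrease produced by splitting it; in a forest, this decrease is accumulated over all nodes and trees in which a feature is used. The function names and the toy labels are hypothetical and the code is not the MATLAB routine used in the study.

```python
def gini(labels):
    """Gini impurity of a node with binary labels (0 = negative, 1 = positive)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_pos = sum(labels) / n
    return 1.0 - p_pos ** 2 - (1.0 - p_pos) ** 2


def gini_decrease(parent, left, right):
    """Impurity decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Toy node (hypothetical labels): three negative and three positive cases.
parent = [0, 0, 0, 1, 1, 1]
perfect_split = gini_decrease(parent, [0, 0, 0], [1, 1, 1])
useless_split = gini_decrease(parent, [0, 1], [0, 0, 1, 1])
```

A split that separates the classes perfectly removes all the impurity of the parent node, whereas a split whose children have the same class mix as the parent yields no decrease.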
A logistic regression prediction model measures the underlying relationships between features and patient outcomes existing within the data [29]. The accuracy of a logistic regression model is mainly judged by considering discrimination and calibration. Discrimination is the model capability to correctly assign a higher risk of an outcome to the patients who are truly at higher risk, whereas calibration is the model capability to assign the correct average absolute level of risk, i.e., to accurately estimate the probability of the patient outcome.
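The difference between discrimination and calibration can be shown with a few lines of code. In this sketch, discrimination is measured by the AUC computed as the fraction of positive-negative pairs ranked correctly, and calibration by the gap between the mean predicted probability and the observed event rate ("calibration-in-the-large"). Both helper names and the toy predictions are assumptions made for illustration.

```python
def auc_pairwise(y_true, scores):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def calibration_gap(y_true, probs):
    """Mean predicted probability minus observed event rate (0 is ideal)."""
    return sum(probs) / len(probs) - sum(y_true) / len(y_true)


# Toy predictions (hypothetical).
y = [0, 0, 1, 1]
probs = [0.10, 0.40, 0.35, 0.80]
halved = [p / 2 for p in probs]  # same ranking, systematically too low

auc_original, auc_halved = auc_pairwise(y, probs), auc_pairwise(y, halved)
gap_original, gap_halved = calibration_gap(y, probs), calibration_gap(y, halved)
```

Halving every probability leaves the ranking, and therefore the AUC, unchanged, while the calibration gap becomes markedly worse: a model can discriminate equally well and still misestimate the absolute level of risk.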
The Naive Bayes classifier is a probabilistic machine learning model that is used to fulfill classification tasks based on the Bayes theorem [30]. It requires a small amount of training data to estimate the necessary parameters; despite their apparently over-simplified assumptions, Naive Bayes classifiers worked well in many real-world situations.
Default parameters were used for the logistic regression and Naive Bayes classifiers. The considered classifiers take different approaches to the classification task. An RF classifier is an ensemble technique that combines several decision trees to calculate the predicted class, so that the predictions of the individual trees, which may be inaccurate on their own, improve performance and reduce over-fitting when combined. The Naive Bayes classifier takes a probabilistic approach to the classification problem, whereas the logistic regression model estimates the classification score by means of a linear combination of the features.
Feature importance techniques and classification models were implemented using MATLAB R2018a (Mathworks, Inc., Natick, MA, USA).

Performance Evaluation
We performed the hold-out cross-validation procedure in order to evaluate the performance of each learning model in predicting lymph node status. Specifically, we carried out the feature selection analysis and developed the classification models on a randomly selected 70% of the samples (training set). Then, we evaluated the obtained results on the remaining 30% of the 907 clinically negative patients (test set).
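A random 70/30 hold-out partition of this kind can be sketched as follows. The seed, the function name and the rounding rule are assumptions; the text does not state how the split was drawn beyond being random.

```python
import random


def holdout_split(n_samples, train_fraction=0.7, seed=0):
    """Randomly partition sample indices into training and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(train_fraction * n_samples)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, test_idx = holdout_split(907)
sizes = (len(train_idx), len(test_idx))
```

With 907 patients and rounding to the nearest integer, the partition contains 635 training and 272 test samples, which agrees with the set sizes reported in the results.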
On the hold-out training set, in order to identify a subset of features with higher diagnostic power, we developed a forward stepwise feature selection. The sequential forward selection algorithm identified the subset of features that best predicted the expected result by sequentially adding, at each step, the feature that most improved the classification performance over 100 ten-fold cross-validation rounds. Specifically, the selection of features according to their importance is driven by the AUC index: in each series of cross-validation rounds, we appended to the sequence of prognostic factors the one with the highest median value of the AUC distribution. Then, we evaluated the new distribution for each sequence obtained by adding one of the remaining features and compared the associated medians, repeating the procedure until no prognostic factors remained.
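The selection loop described above can be sketched independently of the classifier. In the following example, `cv_scores` stands for any routine that returns the AUC values obtained over the cross-validation rounds for a given feature subset; here it is replaced by a hypothetical toy scorer (`toy_cv_scores`, with invented gains in `TOY_GAIN`), so the numbers carry no clinical meaning.

```python
from statistics import median


def sequential_forward_selection(features, cv_scores):
    """Greedy forward selection driven by the median AUC.

    cv_scores(subset) must return a list of AUC values,
    one per cross-validation round, for the given feature subset.
    """
    selected, remaining, history = [], list(features), []
    while remaining:
        best_feature, best_median = None, float("-inf")
        for candidate in remaining:
            m = median(cv_scores(selected + [candidate]))
            if m > best_median:
                best_feature, best_median = candidate, m
        selected.append(best_feature)
        remaining.remove(best_feature)
        history.append((list(selected), best_median))
    return history


# Hypothetical scorer: each feature adds (or removes) a fixed amount of AUC.
TOY_GAIN = {"diameter": 0.10, "age": 0.05, "grading": 0.04, "noise": -0.02}


def toy_cv_scores(subset):
    base = 0.5 + sum(TOY_GAIN[f] for f in subset)
    return [base - 0.01, base, base + 0.01]  # three pretend CV rounds


history = sequential_forward_selection(list(TOY_GAIN), toy_cv_scores)
best_subset, best_auc = max(history, key=lambda step: step[1])
```

The loop runs until every feature has been added, and the subset retained is the one with the highest median AUC along the path; in the toy case the uninformative `noise` feature is added last and lowers the score, so it is excluded from `best_subset`.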
The performance of each classification model on both the training and test hold-out sets was assessed in terms of the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve, as well as F1 score, accuracy, sensitivity and specificity, calculated by choosing the optimal threshold through Youden's index [31].
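Youden's index is J = sensitivity + specificity - 1, and the optimal threshold is the one that maximizes J along the ROC curve. A minimal sketch follows, in which a case is called positive when its score is greater than or equal to the threshold (an assumed convention) and the labels and scores are toy values rather than study data.

```python
def youden_threshold(y_true, scores):
    """Return (J, threshold, sensitivity, specificity) maximizing Youden's J."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = (float("-inf"), None, None, None)
    for t in sorted(set(scores)):
        # Convention (assumption): predict positive when score >= t.
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        sensitivity = tp / n_pos
        specificity = tn / n_neg
        j = sensitivity + specificity - 1.0
        if j > best[0]:  # ties keep the lowest threshold
            best = (j, t, sensitivity, specificity)
    return best


# Toy labels and classifier scores (hypothetical).
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.60, 0.65, 0.80, 0.90]
j, threshold, sensitivity, specificity = youden_threshold(y_true, scores)
```

Once the threshold is fixed, accuracy and F1 score follow from the same confusion-matrix counts.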
The performance measures at each feature selection step on the hold-out training set were estimated on 100 ten-fold cross-validation rounds and summarized in terms of the median, first and third quartile.

Statistical Analysis Results
The samples' characteristics are summarized in Table 1. Nine hundred and seven patients aged between 23 and 92 years (with a first quartile, median and third quartile of 51, 61 and 70 years, respectively) were involved in the study. In total, 296 patients were positive on histological examination of the sentinel lymph nodes, whereas 611 patients were negative.
The association test between the sentinel lymph node status and each prognostic factor is essential to understand whether there is a relationship between the two variables. Our experimental results showed a relationship between the sentinel lymph node status and tumor diameter, Ki67, grading, histological type, multiplicity and also age (p-value < 0.05). None of the other variables was significantly associated with sentinel lymph node status.

Classification Results on Hold-Out Training Set
The training set of the hold-out validation sampling was made up of 635 clinically negative patients, with 421 negative sentinel lymph node cases (control cases) and 214 positive cases at postoperative histological survey. Figure 1 shows the performance of the three classifiers in terms of AUC value (median, first and third quartiles) as the number of features selected on the hold-out training set increases (Table 2). The solid line stands for the median yielded by 100 ten-fold cross-validation rounds, while the shaded region corresponds to the interquartile range. The logistic regression classifier consistently achieves slightly higher performance than the others, although it exceeds the Naive Bayes classifier by only a few percentage points. The best performance was reached with seven features for the Naive Bayes and logistic regression classifiers, whereas eight features were needed for the RF classifier.
In Table 3, we report the performances obtained for each classifier by using the subset of features that maximized performance. Although the performances were comparable in terms of accuracy, the logistic regression classifier reached an AUC of 71.5% and showed a better trade-off between sensitivity and specificity than the other two classifiers (69.4 and 66.9%, respectively) when seven features, i.e., tumor diameter, age, histologic type, grading, multiplicity, in situ component and Her2-neu status, were considered.
Table 3. Median performance of the best models calculated on the hold-out training set by means of the sequential forward feature selection evaluated on 100 ten-fold cross-validation rounds for each of the used classification algorithms. The prediction performances are summarized in terms of median, 1st and 3rd quartile.

Cluster Analysis on Hold-Out Training Set
The classification accuracy does not exceed 68% with any of the used classifiers. In order to highlight any profile characterizing the cases not correctly identified by the classifiers, we implemented an unsupervised hierarchical analysis of the groups. On the whole dataset, on the basis of the relative silhouette index, we identified the optimal number of groups into which to divide the samples (Figure 2). In particular, in Figure 3, we represent the three groups with respect to the first two main components. For the sake of simplicity, we separated the clusters of control cases from those made up of patients who tested positive for the sentinel lymph node and, in each of them, we reported the cases (24 false negatives and 25 false positives) that were never correctly classified in any validation round by any of the three implemented classifiers. Evidently, such cases are not characterized by a particular pattern of the used features.
False negatives are mostly related to group 2, characterized by patients with mildly aggressive tumors, i.e., small T1b and T1c tumors, low grading (G1 or G2), negative Ki67 and an absent in situ component. On the contrary, the false positives belong to group 1 or 3, characterized by T1c and T2 tumors, positive Ki67, medium-high grading (G3 or G2) and a present in situ component (Table 4).

Classification Results on Hold-Out Test Set
In order to evaluate the robustness of the obtained results, we calculated the classification performance of the three best models on the hold-out test set, which was made up of 272 patients: 190 control cases with negative sentinel lymph nodes and 82 cases with positive sentinel lymph nodes.
Each classifier loses about five percentage points of accuracy (Table 5). However, the logistic regression model still shows a sensitivity higher than 68% and a specificity of about 60%, thus losing about six percentage points compared to the results observed on the hold-out training set.

Discussion
In patients with clinically node-negative breast cancer, current international guidelines require sentinel lymph node biopsy (SLNB) [11][12][13][14]. Although it is currently the most performed exam, it is a time-consuming and expensive procedure, is also extremely invasive, and could result in a number of side effects [7,18,32]. However, in patients with early stage breast cancer the incidence of axillary metastases is low, so SLNB may be an unnecessarily invasive procedure.
The aim of our work was to develop a predictive model of sentinel lymph node status for clinically negative patients that could replace the SLNB. Starting from a set of 907 patients, we collected for each of them several prognostic features, such as age, tumor size, histological subtype, estrogen receptor expression, progesterone receptor expression, histological grade, cellular marker for proliferation, human epidermal growth factor receptor-2, multiplicity and in situ component, as well as the sentinel lymph node status. Subsequently, we trained three different classifiers combined with a sequential forward feature selection algorithm on the hold-out training set to select the feature subset that reached the highest value of AUC. Then, we evaluated the three previously selected models on the hold-out test set and compared their performances in terms of accuracy, sensitivity and specificity.
On the hold-out training set, the best classification model was the logistic regression algorithm on a subset of seven features, i.e., tumor size, age, histologic type, grading, multiplicity, in situ component and Her2-neu status. This model reached a median AUC equal to 71.5%, a median accuracy of 67.9%, a sensitivity equal to 69.4% and a specificity of 66.9%.
Furthermore, on the same training set we performed a cluster analysis, in order to identify the positive or negative contribution of different feature subsets for classification purposes. The cluster analysis highlighted the correlation between some feature subsets and the occurrence of either false-positive or false-negative cases. Indeed, we observed that false negatives were associated with features such as small T1b and T1c tumors, low grading (G1 or G2), negative Ki67 and an absent in situ component; instead, false positives were related to features such as T1c and T2 tumors, positive Ki67, medium-high grading (G3 or G2) and a present in situ component.
Finally, we observed that the performance of the best logistic regression model, as well as the performances of the other models, was lower on the hold-out test set. Even though the best model in terms of accuracy and specificity was the RF classifier on a subset of eight features, the only model with a sensitivity still greater than 68% was the logistic regression algorithm, which was also the best model on the hold-out training set.
The state of the art is characterized by many works that propose non-sentinel lymph node status predictive models based on features of a different nature [33][34][35][36][37]. On the contrary, only a small number of studies aimed at developing a sentinel lymph node status predictive model through the analysis of histological features [20,[38][39][40][41]. Thus far, the nomogram developed by the researchers of Memorial Sloan-Kettering Cancer Center (MSKCC) (a, b, c) is the most widely used model to predict the likelihood of SLN metastasis. The baseline model (a) reached an AUC value of 75.4%. In subsequent validation studies, comparable performances were achieved (b, c). However, these models were evaluated on cohorts of patients who were not clinically negative and who therefore differ from the sample studied here.
The model developed in [41] for prediction of sentinel lymph nodes metastasis reached an AUC value equal to 88.3% by considering some histological features, such as tumor size, and lymph vascular invasion in ER-positive and HER2-negative (ER+/HER2−), but also genetic ones.
In our previous work [20], we evaluated the usefulness of the CancerMath tool to predict the sentinel lymph node status for clinically negative patients referred to our institute. In this validation study, by using tumor size, age, histologic type, grading, expression of estrogen receptor and progesterone receptor, Ki67 and HER2 on the independent test set, the model showed an accuracy of 53.8%, which is much lower than the accuracy achieved with the analysis presented here.
The current work differs from the other mentioned research studies because of the use of a machine learning approach for the analysis. Although the model can be further optimized by evaluating other feature selection techniques or classification algorithms on a very extensive experimental database, histological features alone did not allow us to develop a support instrument suitable for actual clinical application. What emerges, and what is important to underline for an audience of biomedical data scientists or clinicians with a translational field of interest, is that the maximum informative power contained in the considered characteristics alone for the prediction of the sentinel lymph node status does not exceed 62-63% on the independent test set. For this reason, encouraged by recent studies with improved results in the prediction of the lymph node metastasis probability thanks to the joint use of histopathologic and radiomic features [6,[40][41][42][43][44][45][46][47][48][49][50][51][52][53][54], in our future works we will also involve radiomic features extracted from first-level radiological examinations, such as ultrasound and mammography.

Conclusions
In patients with clinically negative lymph nodes, the incidence of lymph node metastasis is generally low. Although for this type of patient the sentinel lymph node biopsy (SLNB) is the intra-operative exam with the highest performance, it often represents a useless invasive procedure with possible serious complications. Therefore, an important clinical task is to develop a procedure that could replace SLNB without compromising the quality of care. In this work, we presented a preliminary model to predict the sentinel lymph node status. We trained different machine learning algorithms on tumor histopathology features, but the performances evaluated on a hold-out test set reached a classification accuracy of only about 64%. Therefore, the model trained on histopathologic features is not yet suitable for clinical use for the prediction of metastatic lymph nodes in clinically negative patients. However, in future works we will explore other information sources, such as radiomics and genetics, in order to use them jointly with the clinical data considered in this study. By reaching high levels of accuracy, such a support system would have a high clinical impact, either by avoiding the sentinel lymph node procedure or by reducing the time and cost of surgical interventions and the unnecessary axillary dissections with their related comorbidities.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available because they are the property of Istituto Tumori 'Giovanni Paolo II'-Bari, Italy.