Receiver Operating Characteristic Prediction for Classification: Performances in Cross-Validation by Example

Abstract: The stability of the receiver operating characteristic (ROC) under random splits into development and validation sets, as compared to the full models, was investigated for three inflammatory ratios (the neutrophil-to-lymphocyte (NLR), derived neutrophil-to-lymphocyte (dNLR) and platelet-to-lymphocyte (PLR) ratios) evaluated as predictors of metastasis in patients with colorectal cancer. Data belonging to patients admitted with a diagnosis of colorectal cancer from January 2014 until September 2019 in a single hospital were used. There were 1688 patients eligible for the study, 418 in the metastatic stage. All investigated inflammatory ratios proved to be significant classification models on both the full models and the cross-validations (AUCs significantly different from 0.5). High variability of the cut-off values was observed in the unrestricted and restricted splits (full models: 4.255 for NLR, 2.745 for dNLR and 255.56 for PLR; random splits: cut-offs from 3.215 to 5.905 for NLR, from 2.625 to 3.575 for dNLR and from 134.67 to 335.9 for PLR), but with no effect on the models' characteristics or performances. The investigated biomarkers proved of limited value as predictors of metastasis (AUCs < 0.8), with highly variable sensitivity and specificity (from 33.3% to 79.2% for the full model and from 29.1% to 82.7% in the restricted splits). Our results showed that a simple random split of observations in a ROC analysis, whether or not the patients with and without metastasis are weighted, assures performances similar to the full model if at least 70% of the available population is included in the study.


Introduction
The receiver operating characteristic (ROC) was introduced in medicine as a methodological tool for the investigation of signal detection [1][2][3]. The method plots the probability of correct detection (as a binary outcome) against the probability of a false positive outcome. The ROC detection

Dataset
The dataset was represented by a retrospective collection of the absolute neutrophil, lymphocyte, leucocyte and platelet counts of patients admitted to the Third Surgical Clinic, "Prof. Dr. Octavian Fodor" Regional Institute of Gastroenterology and Hepatology Cluj-Napoca, with a diagnosis of colorectal cancer from January 2014 until September 2019. The presence of metastasis was the outcome variable in this study. The metastatic stage was documented by imaging explorations, namely computed tomography (CT), magnetic resonance imaging (MRI) or contrast-enhanced ultrasound. Raw data belonging to the patients with a histopathological diagnosis of colorectal cancer were evaluated. Records with incomplete data (e.g., laboratory results; TNM classification of the disease: T = size or extent of the primary tumor, N = degree of spread to regional lymph nodes, M = presence of distant metastasis) were excluded.
The absolute counts were the input data for the calculation of the following ratios, used as predictors of metastasis in this study:

NLR = neutrophils/lymphocytes (1)

dNLR = neutrophils/(leucocytes − neutrophils) (2)

PLR = platelets/lymphocytes (3)

where NLR = neutrophil-to-lymphocyte ratio, dNLR = derived neutrophil-to-lymphocyte ratio and PLR = platelet-to-lymphocyte ratio.
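The ratio calculations can be sketched in Python as follows (a minimal illustration assuming the standard definitions of these ratios; the function names are ours, not the paper's):

```python
def nlr(neutrophils: float, lymphocytes: float) -> float:
    """Neutrophil-to-lymphocyte ratio."""
    return neutrophils / lymphocytes

def dnlr(neutrophils: float, leucocytes: float) -> float:
    """Derived NLR: neutrophils over (leucocytes minus neutrophils)."""
    return neutrophils / (leucocytes - neutrophils)

def plr(platelets: float, lymphocytes: float) -> float:
    """Platelet-to-lymphocyte ratio."""
    return platelets / lymphocytes
```

All three take absolute counts (in the same unit, e.g., ×10⁹/L) and return dimensionless ratios.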

Methods
The receiver operating characteristic (ROC) analysis was conducted under three scenarios to test the classification abilities of the predictive models incorporating the studied ratios, presented in Equations (1)-(3), with the presence of metastasis as the outcome (Figure 1):

• First scenario: the whole sample was used to generate the classification model (the control model).

• Second scenario: thirteen random sets were generated from the 1688 patients by specifying the desired percentage of subjects, and were used to identify the classification models. No restriction was imposed on the percentage of patients with and without metastasis, so the generated sets were named unrestricted (URunx, where 1 ≤ x ≤ 13). The following percentages were used: 70% for URun01, 65% for URun02, 60% for URun03, 55% for URun04, 50% for URun05, 45% for URun06, 40% for URun07, 35% for URun08, 30% for URun09, 25% for URun10, 20% for URun11, 15% for URun12 and 10% for URun13.

• Third scenario: five sets, each with a development and a validation group, were randomly generated by weighting the percentage of patients with and without metastasis. These sets were named restricted (RRuny, 1 ≤ y ≤ 5; 70% in the development set and 30% in the validation set). The model was generated using the development set (Se = sensitivity, Sp = specificity, Acc = accuracy) and tested on the validation set (PPV = positive predictive value and NPV = negative predictive value).

The performances of the models were evaluated whenever the AUC proved statistically significant and the Gini index (GI) was higher than 0 (a GI equal to 0 indicates a random model). Ten metrics (see Figure 1) were used to assess the accuracy of a classification model. Sensitivity and specificity are important metrics widely used to express model accuracy [4,5,11,14,51,52]. The F1-score (2 × ((Se × PPV)/(Se + PPV))) was calculated by combining the performances in the development set with those in the validation set in the restricted runs.
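The two sampling schemes in the second and third scenarios can be sketched as follows (an illustrative Python sketch, not the SPSS procedure actually used in the study; the function names and seeding scheme are assumptions):

```python
import random

def unrestricted_sample(patients, pct, seed=0):
    """Second scenario: draw pct% of all patients with no
    constraint on the proportion with and without metastasis."""
    rng = random.Random(seed)
    k = round(len(patients) * pct / 100)
    return rng.sample(patients, k)

def restricted_split(patients, has_metastasis, dev_pct=70, seed=0):
    """Third scenario: development/validation split that preserves
    the proportion of patients with and without metastasis."""
    rng = random.Random(seed)
    dev, val = [], []
    for outcome in (True, False):
        group = [p for p in patients if has_metastasis(p) == outcome]
        rng.shuffle(group)
        k = round(len(group) * dev_pct / 100)
        dev.extend(group[:k])   # development (training) portion
        val.extend(group[k:])   # validation (test) portion
    return dev, val
```

With `dev_pct=70`, the restricted split reproduces the 70%/30% development/validation design while keeping the metastasis prevalence equal in both sets.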
Matthews correlation coefficient (MCC = (TP × TN − FP × FN)/√((TP + FP) × (TP + FN) × (TN + FP) × (TN + FN))) gives a more suitable balance to all four confusion matrix categories (TP = true positive, TN = true negative, FP = false positive, FN = false negative) and is considered a more informative score [53] than model accuracy or the F1-score, which are overoptimistic estimators [54].
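The confusion-matrix metrics discussed above can be computed as in the following sketch (illustrative only; the dictionary keys are ours):

```python
import math

def metrics(tp, tn, fp, fn):
    """Compute Se, Sp, PPV, accuracy, F1 and MCC from the confusion matrix."""
    se = tp / (tp + fn)                    # sensitivity
    sp = tn / (tn + fp)                    # specificity
    ppv = tp / (tp + fp)                   # positive predictive value
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    f1 = 2 * se * ppv / (se + ppv)         # F1-score
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Se": se, "Sp": sp, "PPV": ppv, "Acc": acc, "F1": f1, "MCC": mcc}
```

Note that a perfectly balanced random classifier yields MCC = 0, whereas F1 and accuracy stay at 0.5, which illustrates why MCC is the less optimistic score.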
The ROC analysis was conducted with SPSS v. 26 (trial version) under non-parametric assumptions. The cut-off value of each biomarker in each run was used to create a derived dichotomous variable (presence of metastasis for values equal to or higher than the cut-off). Youden's index (J = max(Se + Sp − 1)), which maximizes the distance from the diagonal (the random classification model) [55], was used to identify the cut-off values. The tangent method (d = √((1 − Se)² + (1 − Sp)²), d = min) [56] was also used to identify the cut-off values in the control model. The observed confusion matrix was generated, and the true positive (TP), false positive (FP), false negative (FN) and true negative (TN) values were used to calculate the performances of the models using the following online resource: https://statpages.info/ctab2x2.html (accessed 3 August 2020). The clinical utility index (CUI) was calculated for the control model using the online resource available at http://www.clinicalutility.co.uk/ (accessed 30 September 2020). A fair utility for case-finding (+CUI) or screening (−CUI) is indicated by 0.49 ≤ CUI < 0.69, a good utility by 0.69 ≤ CUI < 0.81 and an excellent utility by CUI ≥ 0.81 [57].
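The two cut-off identification criteria can be illustrated as below, assuming a list of ROC coordinate points of the form (cut-off, Se, Sp); this is a sketch, not the SPSS implementation used in the study:

```python
import math

def youden_cutoff(points):
    """Youden's index: pick the cut-off maximizing J = Se + Sp - 1."""
    return max(points, key=lambda p: p[1] + p[2] - 1)[0]

def tangent_cutoff(points):
    """Closest-to-(0,1) criterion: minimize d = sqrt((1-Se)^2 + (1-Sp)^2)."""
    return min(points, key=lambda p: math.hypot(1 - p[1], 1 - p[2]))[0]
```

On the same ROC coordinates the two criteria can select different cut-offs, which is exactly the behaviour discussed for the control models in the Findings Discussion.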

First Scenario: Full Predictive Models
A significant contribution to identifying the presence of metastasis (AUC significantly different from 0.5) was found for all investigated ratios when the whole sample was used (Table 1). The performance metrics of the control models showed modest or low classification abilities of the investigated ratios (Table 2), NLR being the most performant biomarker for metastasis. NLR = neutrophil-to-lymphocyte ratio; dNLR = derived neutrophil-to-lymphocyte ratio; PLR = platelet-to-lymphocyte ratio; TP = true positive; TN = true negative; FP = false positive; FN = false negative; Se = sensitivity; Sp = specificity; Acc = accuracy; LR = likelihood ratio; MCR = misclassification rate; DOR = diagnostic odds ratio; NND = number needed to diagnose.
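For reference, the likelihood-based metrics abbreviated above follow their standard definitions, sketched here (an illustration; the paper's tables were computed with the cited online resource):

```python
def likelihood_metrics(se, sp):
    """Standard definitions of the likelihood-based metrics
    (+LR, -LR, DOR, NND) from sensitivity and specificity."""
    pos_lr = se / (1 - sp)        # +LR: positive likelihood ratio
    neg_lr = (1 - se) / sp        # -LR: negative likelihood ratio
    dor = pos_lr / neg_lr         # diagnostic odds ratio
    nnd = 1 / (se - (1 - sp))     # number needed to diagnose = 1/Youden's J
    return pos_lr, neg_lr, dor, nnd
```

A test with Se = 0.8 and Sp = 0.6 gives +LR = 2, −LR ≈ 0.33, DOR = 6 and NND = 2.5, the kind of "low +LR, high −LR" profile described for these ratios below.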

Second Scenario: Unrestricted Random Samples
At least one unrestricted random sample retrieved the same Youden's cut-off value for each investigated ratio as the full predictive model (URun01 for NLR and dNLR; URun02 for dNLR; URun05 for PLR; see Tables 1 and 3). Table 3. Number of total subjects in the run, distribution among those with and without metastasis, cut-off values and model characteristics for the unrestricted random samples.

The obtained cut-off values were lower than the values reported in Table 1 for NLR (8/13, 61.5%; Table 3). Changes in the significance of the AUC were observed only for PLR, for unrestricted random samples with 16% of the subjects in the development set (Table 3).

Third Scenario: Restricted Random Samples
The classification models in cross-validation, with 70% of the available data in the development set and 30% in the validation set, showed similar AUC performances for NLR and dNLR (p-values in the same range), but AUCs close to 0.5 for PLR (in all RRuny, Table 4). The true positive, true negative, false positive and false negative counts in the restricted random samples varied in the same way in the development and validation sets (Table 5). These values varied between runs by different amounts, ranging, in the case of true positives, from a maximum-to-minimum difference of 3.4% observed for PLR to 46.2% observed for AGR (Table 5).
The cross-validation ROC analysis applied to the restricted random samples identified NLR and dNLR as the most potent markers for metastasis according to the highest F1-score (Table 6). However, better performances were obtained when the whole sample was investigated for NLR and dNLR while, conversely, better performances were obtained in cross-validation for PLR, all of which shows the instability of the classification model dictated by the input data (Tables 2 and 6). NLR = neutrophil-to-lymphocyte ratio; dNLR = derived neutrophil-to-lymphocyte ratio; PLR = platelet-to-lymphocyte ratio; Se = sensitivity; Sp = specificity; Acc = accuracy; +LR = positive likelihood ratio; −LR = negative likelihood ratio; PPV = positive predictive value; NPV = negative predictive value; MCR = misclassification rate; DOR = diagnostic odds ratio; NND = number needed to diagnose/predict; MCC = Matthews correlation coefficient; dev = development set; val = validation set.
The visual representation of the AUCs for each scenario relative to the full model for NLR, dNLR and PLR showed higher variability in the case of the unrestricted random samples (Figure 2). The AUC representation of each run as compared to the full model is available in the Supplementary Material (Figures S1-S3).


Summary of Main Findings
Cross-validation with a random split of the dataset under restrictions, namely respecting the proportion of those with and without metastasis, showed that the ROC classification models of the investigated ratios had low variability relative to the full model. The variability of the ROC classification models in the unrestricted samples is very similar to the full model when almost 70% of cases are in the development set, with an identical cut-off for 4/5 ratios (excepting PLR). For all other unrestricted samples, higher variability of the AUC was observed as compared to the restricted samples (Figure 2, Figures S1-S3).

Findings Discussion
The cut-off values differ when different methods are used (Table 1), as expected, and this affects the absolute frequencies in the confusion matrix. Youden's index maximizes the sum of sensitivity and specificity and retrieves higher cut-off values for our dataset, as compared to minimization of the distance to the point with (0,1) coordinates. The number of false positives increases and the number of false negatives decreases when the cut-off values are identified by the tangent method as compared with Youden's index, which is reflected in both Se (increases) and Sp (decreases), but without significant effects on the LRs or CUIs (Table 2). Even if the tangent method has shown superiority in identification of the true cut-off value [58], this was not observed in our sample. Regardless of the method used to define the threshold values, the ratios have very poor utility for case-finding. NLR and dNLR are fair for screening, while PLR is poor for screening (Table 2). On this basis, they are not recommended as biomarkers for metastasis in patients with colorectal cancer.
As expected, different cut-off values were observed, with few exceptions, in both the restricted and unrestricted cross-validation approaches (Tables 1, 3 and 4). Generally, not only for the unrestricted but also for the restricted cross-validation, the cut-off values for NLR were lower than those of the full model, while exactly the opposite was seen for dNLR and PLR. The different cut-off values reported in cross-validation (Tables 3 and 4) indicate a variability most probably caused by the characteristics of the included observations (the input data), which has also been observed in previously reported studies on patients with colorectal cancer [47][48][49][50].
The characteristics of the models were similar for both scenarios, with the AUC values confined, without exception, within the confidence interval of the AUC of the full model, and with the AUC of the full model confined within the confidence interval of each individual classification model, regardless of the approach applied (Tables 1, 3 and 4). Opposite significances of the AUCs in unrestricted cross-validation, in comparison to the full model, were observed for PLR (2/13 classification models, one that included 35.4% of the subjects from the cohort and one that included 15.4%). The consistency of the significance across different samples indicates the stability of the classification models for the investigated ratios with the presence of metastasis as the outcome.
The cut-off values for 4/5 ratios were identical to those of the full model for the first unrestricted run, with almost 70% of subjects randomly selected for the development set. The relative deviation of a cut-off value from that of the full model (((Cut-Off_model − Cut-Off_full-model)/Cut-Off_full-model) × 100) was generally lower for the restricted models as compared to the unrestricted models, which can be explained by the small differences for those runs with ~70% of observations in the development sets. A pattern is observed when the cut-off values are investigated: in the majority of cases, the cut-off values are lower for NLR (61.5% of the unrestricted and 80% of the restricted runs) and higher for dNLR (61.5% of the unrestricted and 100% of the restricted runs) or PLR (53.8% of the unrestricted and 80% of the restricted runs). The cut-off values are reflected in the numbers of TP, TN, FP and FN, and thus in the performances of the classification model (Tables 2 and 6). The evaluation of the full model performances showed that none of the investigated ratios is a good predictor of metastasis in patients with colorectal cancer (Table 2), with low +LR, high −LR, a relatively high misclassification rate and MCC values of less than 0.25. The low performances of the classification models were also observed in restricted cross-validation (Table 6), supporting the result reported by Martens et al. [59], namely that small changes in AUCs bring small changes in the performances of the classification model. The ROC analysis is used to overcome this shortcoming, namely the poor correlation of a single predictor with a certain outcome, in order to find the threshold for classification and, based on the threshold, to subsequently identify a multivariable predictive model [42][43][44][45][46][47][48][49][50]60,61]. A small number of articles have reported thresholds for the investigated inflammatory ratios on patients with colorectal cancer with metastasis as the outcome.
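The relative deviation formula above can be illustrated with the NLR values reported in the abstract (4.255 for the full model and 3.215 for the lowest random split); this is a worked example, not an analysis from the paper:

```python
def relative_cutoff_diff(cutoff_model, cutoff_full):
    """((Cut-Off_model - Cut-Off_full-model) / Cut-Off_full-model) x 100."""
    return (cutoff_model - cutoff_full) / cutoff_full * 100

# Lowest NLR split cut-off vs. the full model: about -24.4%.
deviation = relative_cutoff_diff(3.215, 4.255)
```

A negative value indicates a run whose cut-off fell below the full-model cut-off, matching the pattern reported for NLR.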
Anuk and Yıldırım reported a cut-off value for PLR equal to 194.7 for liver metastasis (Se = 74.5%, Sp = 72.7%) and of 163.95 (Se = 56.8%, Sp = 56.3%) for lymph node tumour cell invasion [62] on a sample of 152 patients. The thresholds for PLR reported in this study are higher than those reported by Anuk and Yıldırım [62]; nonetheless, the specificity is higher at the expense of sensitivity (Tables 2 and 6). In our study we did not divide the analysis by the type of metastasis, but the differences in thresholds and Se [63]. The optimal cut-off values reported in this study are different from those previously reported in the scientific literature. However, larger studies are needed to find the explanation for the reported thresholds and model performances.
The use of the area under the ROC curve (AUC) in the assessment of classification model performances is known to be biased, especially for small samples [64][65][66][67][68], and different cross-validation approaches have been proposed for model validation. Cross-validation approaches are usually used when computer algorithms are applied to identify the best performing classification model [64][65][66][67][68][69]. Moreover, leave-pair-out (LPO) cross-validation has shown low bias in AUC estimation [70]. In our study, we applied cross-validation outside a computer algorithm to identify the classification model, and the unrestricted random sample with 70% of observations in the development set proved to have performances similar to the restricted cross-validation, also with 70% of observations in the development set (Tables 1-3 and 5, Figures S1-S3). Our results thus support the conclusion that, for large samples, an appropriate random sample including 70% of eligible subjects will closely reflect the target available population. Likewise, for large samples, the experimental design is the key factor for a valid ROC classification model, so the dictum "garbage in, garbage out" holds even for large datasets [71].
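Leave-pair-out cross-validation, cited above as a low-bias AUC estimator [70], can be sketched generically as follows (illustrative only; `fit` and `score` are placeholder callables, not part of the study's workflow):

```python
from itertools import product

def lpo_auc(xs, ys, fit, score):
    """Leave-pair-out cross-validated AUC: for every (positive, negative)
    pair, refit the model on the remaining observations and count how
    often the positive case receives the higher score (ties count 0.5)."""
    pos = [i for i, y in enumerate(ys) if y == 1]
    neg = [i for i, y in enumerate(ys) if y == 0]
    wins = 0.0
    for i, j in product(pos, neg):
        rest = [(x, y) for k, (x, y) in enumerate(zip(xs, ys)) if k not in (i, j)]
        model = fit(rest)  # refit without the held-out pair
        si, sj = score(model, xs[i]), score(model, xs[j])
        wins += 1.0 if si > sj else 0.5 if si == sj else 0.0
    return wins / (len(pos) * len(neg))
```

For a univariate marker scored by its own value (no real fitting), this reduces to the usual probabilistic interpretation of the AUC as the chance that a positive case outranks a negative one.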

Study Limitations
Despite a rigorous experimental design, we need to list some limitations of our study. The first limitation refers to the cross-validation methods used. The unrestricted cross-validation method was applied using the options of the program and, unfortunately, the split was not saved, so the performances of the models could not be reported. Simple random sampling has proved its ability to appropriately split a dataset into development (training) and validation (test) sets in the context of a predictive linear multivariate regression model [72]. Furthermore, only two cross-validation methods were applied, and the use of other cut-off identification methods (e.g., minimum of the p-value on the confusion matrix [73], max(Se × Sp) [74], min(|Se − Sp|) [75]) could retrieve different cut-off values and thus different performances of the classification models. However, since the simple random split into development and validation sets with ~70% of the observations in the development set and the restricted cross-validation (which also included 70% of the observations in the development sets) perform closely to the full model, no significant changes in terms of clinical utility are expected. The second limitation refers to the number of runs for each scenario, a higher number of runs being able to better reflect reality. As can be observed, we decided to report the performances of individual runs, not only to present an average of the performances, which is a common practice, but also to closely monitor to what extent the performances differ from each other. Nevertheless, since the classification models showed stability in both the restricted and unrestricted cross-validations, we do not expect significant gains from an increase in the number of runs. The third limitation is related to the considered outcome.
We did not separately evaluate the different types of colorectal cancer metastasis (e.g., liver, lung, brain, peritoneum, distant lymph nodes), and it can be expected that the thresholds of the investigated ratios vary with the type of metastasis. However, for such an evaluation we would need to expand the time frame of the retrospective evaluation in order to assure a sufficient number of observations for each type of metastasis. The fourth limitation is related to the investigated inflammatory ratios. Investigation of the behaviour of other inflammatory ratios, such as the lymphocyte-to-monocyte ratio, systemic immune-inflammation index or prognostic nutritional index, is also of interest. All these markers are being taken into consideration by our research team and will be tested for prediction abilities and proof of clinical utility.
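The alternative cut-off criteria mentioned among the limitations (max(Se × Sp) [74] and min(|Se − Sp|) [75]) can be sketched on ROC coordinate points of the form (cut-off, Se, Sp) as follows (an illustration; these criteria were not applied in the study):

```python
def cutoff_max_product(points):
    """Alternative criterion: maximize the product Se x Sp."""
    return max(points, key=lambda p: p[1] * p[2])[0]

def cutoff_equal_se_sp(points):
    """Alternative criterion: minimize |Se - Sp| (equal error weighting)."""
    return min(points, key=lambda p: abs(p[1] - p[2]))[0]
```

As with Youden's index versus the tangent method, the two criteria can disagree on the same ROC coordinates, which is why different cut-off values, and hence different model performances, would be expected.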

Conclusions
The investigated ratios proved of low clinical utility in predicting metastasis in patients with colorectal cancer. The full models showed fair clinical utility for screening, but the values of the positive and negative likelihood ratios did not support their application in clinical settings. The classification models identified in the development sets, regardless of the use of an unrestricted or restricted (weighting the percentage of patients with and without metastasis) random split, showed characteristics and performances similar to the full models. Our results showed that a simple random split of observations in a ROC analysis, whether or not the patients with and without metastasis are weighted, assures performances similar to the full models when at least 70% of the available population is included in the evaluation. Moreover, the sample size of the development set in the ROC classification analysis of the investigated inflammatory ratios, considered as predictors of metastasis in patients with colorectal cancer, had, in most cases, little influence on either the model characteristics or the model performances. The variability of the cut-off values across the investigated scenarios is probably explained by the characteristics of the input data and is most likely linked to the percentages of correct and false classifications, but without a significant impact on the model characteristics or performances. This behaviour supports the stability of the classification models in the context of a proper experimental design.