Clinical Prediction Models for Recurrence in Patients with Resectable Grade 1 and 2 Sporadic Non-Functional Pancreatic Neuroendocrine Tumors: A Systematic Review

Simple Summary

Risk prediction for tumor recurrence after surgery for non-functional pancreatic neuroendocrine tumors is a major unmet clinical need. Accurate recurrence risk prediction could pave the way for tailor-made follow-up protocols in the future, as well as help select suitable patients for cancer treatment trials. Multiple prediction models have been developed; however, none are currently incorporated into international guidelines. This systematic review found 13 original models, of which 3 were validated outside the patient group in which they were developed. Without this validation, the effectiveness of a prediction model in the wider population is not proven; its absence thus hinders progress toward clinical use. We propose testing the included models in a large, multinational database to compare their performance. We recommend that all authors developing a prediction model perform the minimally required tests of model development and implement the model by creating an online calculator.

Abstract

Recurrence after resection in patients with non-functional pancreatic neuroendocrine tumors (NF-pNET) has a considerable impact on overall survival. Accurate risk stratification would allow follow-up strategies to be tailored. This systematic review assessed the available prediction models, including their quality, and followed the PRISMA and CHARMS guidelines. PubMed, Embase, and the Cochrane Library were searched up to December 2022 for studies that developed, updated, or validated prediction models for recurrence in resectable grade 1 or 2 NF-pNET. Studies were critically appraised. After screening 1883 studies, 14 studies with 3583 patients were included: 13 original prediction models and 1 prediction model validation. Four models were developed for preoperative and nine for postoperative use. Six models were presented as scoring systems, five as nomograms, and two as staging systems. The c statistic ranged from 0.67 to 0.94. The most frequently included predictors were tumor grade, tumor size, and lymph node positivity. Critical appraisal deemed all development studies to be at high risk of bias and the validation study at low risk of bias. This systematic review identified 13 prediction models for recurrence in resectable NF-pNET, with external validations for 3 of them. External validation of prediction models improves their reliability and stimulates their use in daily practice.


Introduction
Survival of patients with pancreatic neuroendocrine tumors (pNET) after resection largely depends on the development of tumor recurrence [1]. In contrast to other pancreatic neoplasms, pNET displays heterogeneous behavior, with survival depending on tumor size, tumor grade, and treatment outcome. Patients with small (<2 cm) nonmetastatic grade 1 pNET have 5-year survival outcomes approaching 100%, whereas patients with larger, metastasized grade 2 pNET have a 5-year survival rate of less than 40% [2,3]. Most patients are diagnosed incidentally with a non-functional pNET (NF-pNET), and only a subset of these patients eventually develops recurrence after tumor resection [4]. Surgical resection is the only curative-intent treatment for localized pNET [5,6]. Improvements in the therapeutic armamentarium for advanced-stage pNET have broadened the treatment options for patients with recurrence after surgery. Systemic therapies such as somatostatin analogs and chemotherapy appear to provide favorable outcomes in patients with metastases, while peptide receptor radionuclide therapy (PRRT) can be offered to a select group of patients [6][7][8][9].
Identifying patients at high risk for recurrence after resection with curative intent is a challenge faced by clinicians treating patients with a pNET. Currently, no tool is used to tailor standardized postoperative follow-up regimens to individual patients [10]. Follow-up programs could be adjusted according to the risk of recurrence: low-risk patients could undergo less frequent follow-up, while high-risk patients could undergo intensive surveillance. Furthermore, adjuvant therapy could be tested in high-risk patients to prevent disease spread, although no such therapy is offered yet. Various studies have explored the risk factors associated with recurrence in patients with pNET [11][12][13]. Combining these factors into a prediction model is needed to make such tailor-made treatment possible. Multiple prediction models have been developed, but none have yet been incorporated into the European Neuroendocrine Tumor Society (ENETS) guidelines or into clinical practice [7,14,15].
Accurate patient selection for follow-up intensity and for (neo)adjuvant treatment are both unmet needs according to the ENETS guidelines, warranting a clinical prediction model [16]. The aim of this systematic review was to identify and evaluate currently available prediction models for recurrence in resected grade 1 and 2 NF-pNET and to select models for future use in clinical practice and clinical trials.

Materials and Methods
The systematic review followed the Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS) guideline [17]. The study was reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [18]. The study was registered at PROSPERO (CRD42022380671).

Search Strategy
A systematic search was performed in MEDLINE via PubMed, Embase, and the Cochrane Library for studies published from inception until December 2022. The search included synonyms for [pancreatic neuroendocrine tumors] combined with [prognostic/predictive/prediction models] and [recurrence] (Appendix A). After the removal of duplicates, titles/abstracts and full-text articles were screened independently by two authors (J.W.C. and C.M.H.). Differences in opinion were resolved through discussion; if necessary, a third author (E.J.M.N.v.D.) was consulted. References of included articles were checked for other potentially eligible studies.

Eligibility
Studies developing, updating, or validating a clinical prediction model for recurrence in patients with grade 1 or 2 NF-pNET undergoing resection were included. Models designed for use in either the preoperative or postoperative setting were included. Studies providing an absolute probability estimate and those stratifying patients into risk categories were included, as both strategies might serve the clinical aim of providing more tailored treatment. To allow evaluation and comparison of model performance, at least one of the following performance outcomes had to be reported: c statistic, area under the curve (AUC), R2, Brier score, sensitivity, specificity, calibration plots, or calibration statistics.
Review papers, abstracts, case studies, studies with patients aged less than 18 years, and non-human studies were excluded. Articles including patients with diseases other than pancreatic neuroendocrine tumors (such as pancreatic ductal adenocarcinoma or intraductal papillary mucinous neoplasm) or with neuroendocrine tumors at other localizations were excluded. Studies were excluded if more than 20% of their population consisted of patients with WHO grade 3 tumors, a genetic (hereditary) background, or metastatic or functional pNET. If these proportions were not reported, the studies were included but marked as at risk of bias. Studies validating a staging system (i.e., TNM, WHO, AJCC) without any modification were not included, because the limited discriminative strength of these systems was one of the reasons for developing better models [19][20][21].

Data Extraction and Analysis
Data extraction and critical appraisal were performed independently by two authors (J.W.C. and C.M.H.). Articles were categorized into development/model update studies and validation studies. A data extraction sheet was used for all studies and included: first author, year of publication, pre- or postoperative setting, sample size, study interval, source of data, countries of inclusion, number of centers, predicted outcome, included predictors, model performance, information on validation, and number of citations. For development/update studies, the model development method, number of prognostic factors screened, and final model presentation were additionally extracted. Predictors included in regression analyses were collected and scored for statistical significance. If multiple models were developed or tested, the prediction model with the highest c statistic or the model proposed by the authors was included. Meta-analysis was not possible due to heterogeneity in prediction model variables and outcomes. Critical appraisal was performed following the CHARMS guidelines [22]. The PROBAST critical appraisal tool was used to assess methodological quality [23]. A clinical epidemiologist (S.v.D.) was consulted for this review.

Definitions and Terminology
A prediction model was defined as "a formal combination of multiple predictive factors from which risks of a specific endpoint can be calculated for individual patients" [24]. Clinical prediction models should "discriminate between individual patients who do and do not experience a specific event (discrimination), should make accurate predictions (calibration), perform well across different patient populations (generalizability) and be readily interpretable" [25,26].
Model performance can be assessed by a model's ability to discriminate and calibrate. Discrimination (i.e., whether patients who experience the outcome receive a higher predicted risk than those who do not) can be quantified through measures such as sensitivity, specificity, the area under the receiver operating characteristic curve (AUC), or the concordance statistic (c statistic) [27,28]. A c statistic of 1.0 represents a perfect model, while 0.5 indicates that the model predicts no better than random selection [29]. For binary outcomes, the AUC is equal to the c statistic. Calibration, in turn, is important for model performance since it compares the predicted probability of the outcome (i.e., recurrence) with the observed outcome. It is most often visualized using calibration plots or assessed as goodness of fit, which can be quantified using the Hosmer-Lemeshow test (p < 0.050 indicates poor calibration).
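As an illustration of the concordance calculation described above (a toy sketch with hypothetical data, not taken from any of the reviewed studies), the c statistic for a binary outcome can be computed by checking, for every event/non-event pair of patients, whether the patient who experienced the event received the higher predicted risk:

```python
from itertools import combinations

def c_statistic(risks, outcomes):
    """Concordance: the fraction of event/non-event patient pairs in which
    the patient with the event received the higher predicted risk.
    Ties count as half. For binary outcomes this equals the AUC."""
    pairs = concordant = 0.0
    for (r_i, o_i), (r_j, o_j) in combinations(zip(risks, outcomes), 2):
        if o_i == o_j:
            continue  # pairs with the same outcome are uninformative
        pairs += 1
        r_event, r_none = (r_i, r_j) if o_i == 1 else (r_j, r_i)
        if r_event > r_none:
            concordant += 1.0
        elif r_event == r_none:
            concordant += 0.5
    return concordant / pairs

# Hypothetical predicted 5-year recurrence risks and observed recurrence (1 = yes)
risks    = [0.05, 0.10, 0.20, 0.40, 0.70, 0.90]
outcomes = [0,    0,    1,    0,    1,    1]
print(round(c_statistic(risks, outcomes), 2))  # -> 0.89
```

In this toy cohort, 8 of the 9 informative pairs are ranked correctly, giving a c statistic of 8/9 ≈ 0.89; a value of 1.0 would mean every event patient outranked every non-event patient.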
Model validation is imperative and can be performed using several techniques: internal validation (reproducibility) followed by external validation (generalizability). Internal validation can be performed through split-sample validation (separate training and test sets), cross-validation (developing the model in random segments of the population, testing it in the remaining segment, and repeating this process), and bootstrapping (drawing random samples of the same size with replacement) [24,29]. Ideally, all prediction models should undergo not only internal but also external validation. Geographical validation (a new cohort from a different center), temporal validation (the same center, but patients from a different time interval), and fully independent validation (a new research group at a different center) are the three most important external validation options [25,29].
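The bootstrapping procedure described above can be sketched as follows (illustrative only; the cohort data below are hypothetical): draw repeated samples of the same size with replacement, recompute the performance measure in each resample, and summarize the spread across resamples.

```python
import random

def c_stat(risks, outcomes):
    # Fraction of event/non-event pairs ranked correctly (ties count as 0.5)
    pairs = conc = 0.0
    for i in range(len(risks)):
        for j in range(len(risks)):
            if outcomes[i] == 1 and outcomes[j] == 0:
                pairs += 1
                conc += 1.0 if risks[i] > risks[j] else (0.5 if risks[i] == risks[j] else 0.0)
    return conc / pairs

random.seed(42)
# Hypothetical cohort: (predicted recurrence risk, observed recurrence status)
cohort = [(0.05, 0), (0.10, 0), (0.15, 1), (0.30, 0), (0.35, 0),
          (0.55, 1), (0.60, 0), (0.75, 1), (0.85, 1), (0.90, 1)]

boot_stats = []
for _ in range(1000):
    sample = [random.choice(cohort) for _ in cohort]  # draw n patients with replacement
    risks, outs = zip(*sample)
    if len(set(outs)) == 2:  # the c statistic needs both events and non-events
        boot_stats.append(c_stat(risks, outs))

boot_stats.sort()
lo = boot_stats[int(0.025 * len(boot_stats))]
hi = boot_stats[int(0.975 * len(boot_stats))]
print(f"apparent c = {c_stat(*zip(*cohort)):.2f}, 95% bootstrap interval ~ ({lo:.2f}, {hi:.2f})")
```

In a real internal validation, the model would be refitted in each bootstrap sample and tested on the original cohort to estimate optimism; this sketch only shows the resampling mechanics.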
The term tumor grade was used for both the World Health Organization (WHO) definition of tumor grade and the Ki-67 index, unless specified otherwise. Tumor grade was defined according to the WHO 2017 classification: grade 1 (Ki-67 index < 3%), grade 2 (Ki-67 index 3-20%), and grade 3 (Ki-67 index > 20%) [20]. Histological grade was defined as the grading of tumor cell differentiation into well, intermediately, or poorly differentiated.
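The WHO 2017 cut-offs above translate directly into a small helper function (a sketch for illustration only; the handling of the boundaries at exactly 3% and 20% follows the definitions given here):

```python
def who_grade(ki67_percent: float) -> int:
    """Map a Ki-67 index (in %) to WHO 2017 tumor grade."""
    if ki67_percent < 3:
        return 1   # grade 1: Ki-67 index < 3%
    if ki67_percent <= 20:
        return 2   # grade 2: Ki-67 index 3-20%
    return 3       # grade 3: Ki-67 index > 20%

print([who_grade(k) for k in (1.5, 3.0, 20.0, 35.0)])  # -> [1, 2, 2, 3]
```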

Preoperative Prediction Models
Four out of thirteen prediction models were designed for preoperative use [31,33,36,37], with c statistics ranging from 0.78 to 0.94. Sun et al. used preoperative MRI imaging variables to calculate the recurrence risk [31]. Tumors exhibiting hypoenhancement and low apparent diffusion coefficient values were associated with worse recurrence-free survival (RFS) after curative resection. Fisher et al. included serum chromogranin A > 5 times the upper limit of normal and the presence of a recurrent tumor as recurrence predictors [33]. Their sensitivity analysis [36]. However, in their study, C-reactive protein was not associated with inferior 5-year RFS (72.2% vs. 70.4%, p = 0.89) or an increased recurrence rate (25.6% vs. 21.2%, p = 0.34). The model by Zhou et al. was designed for preoperative prediction of both recurrence and survival, but only the AUC for overall survival was reported (0.83) [37]. Their study analyzed the predictive strength of the "gamma-glutamyl transferase/lymph-node ratio" as a preoperative predictor, but the predictive strength of adding this predictor to the AJCC staging system (

Discrimination
Discrimination of the prediction models, expressed as the c statistic or AUC, was described in all studies (Table 2). However, Zhou et al. only provided the c statistic for overall survival (OS), despite also presenting their model as a predictor of recurrence [37]. The c statistic for predicting RFS ranged between 0.67 and 0.94, and the c statistic for predicting OS ranged from 0.69 to 0.83 [36,37,41]. Primavesi et al. tested their model for three different outcomes (RFS, disease-specific survival [DSS], and OS), of which it worked best for the prediction of DSS (77.3 [67.2-87.5]). However, it had the lowest discrimination score of the included models, with an AUC of 66.5 [36]. The highest discrimination was reported for the prediction model of Zou et al., with a c statistic of 0.94 for 5-year RFS [40].

Calibration
Calibration was reported by seven studies [11,30,31,34,39,40,43]. Although all studies concluded that their model showed good calibration, differences were present. The calibration curves presented by Pulvirenti et al. and Wei et al. showed a relatively strong tendency to underpredict the risk of recurrence compared to the other models [11,39]. Genç et al. did not present calibration curves but did provide a Hosmer-Lemeshow test (Chi-square 11.25, p = 0.258), indicating good calibration [34].
Four studies performed an internal validation using bootstrapping [11,30,35,39]. One study performed a split-sample validation and reported a median c statistic of 0.78 (range 0.74-0.84) [33]. Two studies reported the c statistic both before and after internal validation [30,41]. Three out of thirteen original prediction models were externally validated [11,34,39].

Critical Appraisal
Methodological assessment of the included studies showed poor overall quality, primarily in the Analysis domain (Table S1). All development studies were rated as having a high risk of bias. Most studies did not reach the minimally required number of patients with an event (recurrence) in the development cohort to prevent overfitting. The models by Dong et al. and Fisher et al. were the only two with enough events per variable (66 and 60 events, respectively) [30,33]. However, Fisher et al. developed their preoperative prediction model using postoperatively determined tumor grade [33]. Only the models reported by Genç et al., Pulvirenti et al., and Wei et al. were externally validated [11,34,39]. The models developed by Wei et al. and Viúdez et al. were flagged for concerns of applicability, because highly specific predictors (the Immunoscore and the Immunohistochemistry Prognostic Score, respectively) are needed to use these models in daily clinical practice [39,41].

Discussion
This systematic review, the first to evaluate existing prediction models for recurrence in patients with resectable grade 1 and 2 NF-pNET, found 13 model development studies and one validation study. Most models were presented as scoring systems, followed by nomograms and modified staging systems. The most frequently incorporated risk factor was tumor grade, which also had the highest rate of significant association with recurrence in regression analyses (91%). However, 10/13 (76.9%) models were not externally validated, hindering progress towards clinical implementation. The studies that did perform an external validation also performed the minimally required tests of prediction model development, i.e., discrimination, calibration, internal validation, and external validation.
Taking the outcomes of the critical appraisal into account, the results of the development models by Genç et al., Pulvirenti et al., and Wei et al. appear to be the most reliable, since these are the only studies that performed the minimally required tests [11,34,39]. The predictors used by these three models were tumor grade/Ki-67 index, positive lymph nodes, tumor size, vascular and/or perineural invasion, metastasis, and the Immunoscore. Tumor grade was incorporated in all three models. The incorporated predictors are in line with the significant predictors described in the literature [45]. In contrast to Pulvirenti et al., the models by Genç et al. and Wei et al. used dichotomized versions of continuous predictors [34,39]. Dichotomizing a continuous variable leads to a loss of predictive ability due to the assumption of a constant level of risk above and below the threshold [46]. A separate study by Genç et al. showed that variations within the margins of WHO grade 2 (Ki-67 index 3-20%) led to significant differences in recurrence rate [12]. Lopez-Aguiar et al. reported that such significant changes in prognosis even occur within the margins of grade 1 (Ki-67 index 0-2.99%) [47]. On the other hand, dichotomized variables are more user-friendly than continuous variables and will likely result in more frequent use of a prediction model. For instance, frequently used international models, such as the CHA2DS2-VASc model for calculating stroke risk in patients with atrial fibrillation [48] and the Wells' criteria for predicting the risk of pulmonary embolism [49], also use dichotomized variables. As such, the loss of predictive ability does not outweigh the benefit of dichotomization, since one of the major pitfalls of prediction models is that they are seldom used in clinical practice.
Wei et al. developed a unique predictor by quantifying the immune response in the tumor microenvironment into the Immunoscore [39]. A low Immunoscore, i.e., a pattern of low peritumoral inflammatory activity and high intratumoral CD8+ activity, was associated with better RFS. Several studies have reported significant associations between immune response patterns in the tumor microenvironment and NET prognosis [50][51][52]. Takahashi et al. found that certain immune patterns were complementary to the WHO 2017 grading system, suggesting a possible augmenting effect when combined [53]. Immunoscores could be a powerful addition in the future; however, the scarcity of supporting evidence in the literature argues against recommending this model for immediate implementation. Moreover, the highly specific tests required to determine immune response patterns, and the risk of interobserver variation, make it less likely that the model would be used internationally in daily clinical practice.
In contrast to the other models, Pulvirenti et al. preserved the continuous form of the numeric predictors in their nomogram, i.e., the Ki-67 index, tumor size, and positive lymph nodes [11]. A possible weakness is that they combined vascular invasion and perineural invasion into a single predictor, although these variables were associated with different hazard ratios (8.55 [95% CI 5.14-14.21] and 5.91 [3.72-9.40], respectively). This risks underestimating the recurrence risk when only vascular invasion is present, overestimating it when only perineural invasion is present, and losing predictive strength when both are present. Additionally, the calibration of the nomogram by Pulvirenti et al. showed a tendency to underpredict when the 5-year RFS was between 55 and 80%. In clinical practice, however, their model could be used to identify patients with low-risk tumors (>80% chance of 5-year RFS) who would benefit from low-frequency postoperative follow-up.
Selecting the correct population is particularly important in models predicting outcomes in pNET, as these tumors exhibit heterogeneous behavior, which challenges accurate risk stratification [54]. Current guidelines recommend watchful waiting for asymptomatic NF-pNET ≤2 cm due to its favorable prognosis [7,14,55]. Yet, a total of 39.8% of patients from 8 of the development cohorts in this review had a tumor of ≤2 cm [31,34,35,[37][38][39][40][41]. The model by Genç et al. was additionally externally validated by Heidsma et al. for pNET >2 cm, which resulted in better discrimination [34,43]. The risk of recurrence is likely to be underestimated by a prediction model if a large proportion of indolent tumors is included. Future studies of resectable cases should therefore exclude these small, low-grade tumors to increase the relevance of the study population, or perform a separate analysis for this group.
The main practical advantage of using a prediction model is its ability to separate patients at low risk for recurrence from those at high risk. The recurrence risk threshold below which pNET can be considered low-risk differed among the studies in this review, ranging from 3.1% to 19.9% [11,31,33,35,[37][38][39][40]43]. Prospective studies using recurrence prediction models are needed to determine the optimal cut-off value, as well as the optimal timing of postoperative imaging.
The results of this review should be interpreted in light of several limitations. First, the studies included in this systematic review are prone to publication bias, since poorly performing prediction models are rarely published. Second, a frequent source of bias was the low number of events per variable due to small sample sizes. As a rule of thumb, at least 10 events are required per screened predictor to limit bias, and preferably more [56][57][58]. However, the studies developing a prediction model for recurrence screened a median of 11 predictors, while the median number of events would have allowed fewer than 2. The selected predictors are therefore susceptible to overfitting. Third, predictor selection based on univariable analysis could result in omitting potentially relevant predictors or in inclusion on the basis of chance associations. For example, as seen in Table 3, tumor grade shows a strong association with RFS. The omission of this predictor by Sun et al., because it did not reach statistical significance in univariable analysis, probably resulted in a less effective prediction model [31]. Preferably, predictors should be chosen based on their relationship to the predicted outcome and not solely on the basis of statistical significance [59]. However, using univariable analysis to choose predictors for the multivariable analysis is common practice and, if handled correctly, should not compromise the reliability of a model. Fourth, the most important bias-introducing factor was the lack of external validation. The stated accuracy of a prediction model cannot be confirmed without it, thus limiting the applicability of the results of this systematic review. This provides a strong argument for multicenter collaboration when developing and validating prediction models for patients with an NF-pNET.
It is otherwise nearly impossible to obtain sufficient events (i.e., recurrence) to create a reliable prediction model.
The main strength of the current systematic review is the strict list of eligibility criteria that was used to create a homogeneous study population, thereby limiting confounding factors that influence the risk of recurrence. This study also provides an overview of the current landscape of developed prediction models as well as an overview of the predictive strength of the analyzed and included predictors. This information could be used for the selection of predictors for multivariable regression analysis when developing a prediction model.

Conclusions and Future Directions
Despite the development of several prediction models for recurrence of NF-pNET after surgery, their performance is rarely evaluated in other participant data. To judge the true performance of a prediction model, it must be externally validated to evaluate model overfitting or deficiencies in the statistical modeling [29,60]. None of the models have undergone impact analysis in prospective studies, which may be one reason why none are incorporated into (international) guidelines. Preferably, a model should be investigated in multiple external validation studies and later, ideally, also in a (randomized) controlled trial [61]. We propose testing the included prediction models in a large, multinational cohort to compare their performance. The best-performing prediction model could thereafter be applied in both clinical and trial settings to determine recurrence risk in pNET. Models presented as online calculators may be the most suitable for regular use, since they are easily accessible for clinicians and insightful for patients. The only model currently available online is that of Genç et al. [34]. We highly recommend that all authors developing a prediction model assess discrimination and calibration, validate their results, and implement the model by creating an online calculator.