Artificial Intelligence and Machine Learning in the Diagnosis and Management of Gastroenteropancreatic Neuroendocrine Neoplasms: A Scoping Review

Abstract: Neuroendocrine neoplasms (NENs) and tumors (NETs) are rare neoplasms that may affect any part of the gastrointestinal system. In this scoping review, we attempt to map existing evidence on the role of artificial intelligence, machine learning and deep learning in the diagnosis and management of NENs of the gastrointestinal system. After implementation of inclusion and exclusion criteria, we retrieved 44 studies with 53 outcome analyses. We then classified the papers according to the type of studied NET (26 Pan-NETs, 59.1%; 3 metastatic liver NETs, 6.8%; 2 small intestinal NETs, 4.5%; colorectal, rectal, non-specified gastroenteropancreatic and non-specified gastrointestinal NETs, 1 study each, 2.3%). The most frequently used AI algorithms were Support Vector Classification/Machine (14 analyses, 29.8%), Convolutional Neural Network and Random Forest (10 analyses each, 21.3%), Random Forest (9 analyses, 19.1%), Logistic Regression (8 analyses, 17.0%), and Decision Tree (6 analyses, 12.8%). There was high heterogeneity in the description of the prediction model, the structure of datasets, and the performance metrics, and the majority of studies did not report any external validation set. Future studies should adopt a uniform structure in accordance with existing guidelines for the sake of reproducibility and research quality, which are prerequisites for integration into clinical practice.


Introduction
Neuroendocrine neoplasms (NENs) of the gastrointestinal tract and the pancreas (GEP-NENs) are rare tumors that tend to be diagnosed incidentally, but with increasing frequency [1,2]. GEP-NENs arise from the neural crest and may be located in the stomach, the small intestine, the appendix, the colon, the rectum, the pancreas, the ampulla of Vater, and the extrahepatic bile ducts, as well as the liver in the form of metastases. For the purposes of this review, we will focus on the former group of organs. For purposes of systematization, NENs can be divided into well-differentiated neuroendocrine tumors (NETs) and poorly differentiated neuroendocrine carcinomas (NECs), the latter representing 10-20% of NENs [3]. This classification is not arbitrary, as NETs and NECs represent two genetically and biologically separate entities. NETs may be further classified into those arising from the gastrointestinal tract (GI-NETs, also known as carcinoids; ~50% of GEP-NETs) and those affecting the pancreas (Pan-NETs; ~30% of GEP-NETs). NENs may or may not be functional. Nonfunctioning NENs are usually asymptomatic (especially early-stage ones), but may cause gastrointestinal bleeding and anemia, as well as obstructive effects that may present as jaundice, small bowel obstruction, intussusception, appendicitis or a palpable abdominal mass, depending on their anatomic location. Functioning GI-NENs may cause [...]. Given on the one hand that NENs are relatively rare entities and on the other hand that AI, ML and DL are novel in the field of medicine, we deemed this a rather uncharted area of interest and opted for a scoping review.

Materials and Methods
This review was performed according to the PRISMA extension for scoping reviews [8]. We performed a literature search of the PubMed database in January 2021. The combined search terms were [artificial intelligence; machine learning; deep learning] AND [neuroendocrine; NET; NEN; carcinoid; insulinoma; glucagonoma; gastrinoma; VIPoma] AND [gastrointest*; GI; small intest*; appendi*; colon*; rect*; colorect*; stomach; gastric; duoden*; pancrea*; biliary; bile duct; Vater; ampulla; liver; hepa*]. There was no chronological restriction. Included articles had to have study populations with diagnosed NEN, or NEN had to be included in the differential diagnosis. They also had to apply at least 1 ML/DL algorithm to the processing of their data, irrespective of the study design. The presence of a comparison group (external validation) was desired but not mandatory. Similarly, the report of at least one benchmarking metric among accuracy, F1-score, area under the receiver operating characteristic curve (AUROC) or area under the precision-recall curve (AUPRC) was desired but not mandatory. Table 1 summarizes the eligibility criteria. Only full-text publications were considered. Articles not in the English language or not providing full text were excluded. Data extraction was performed by two independent researchers (A.G.P., P.A.P.) using a predefined template with the eligibility and exclusion criteria. In case of disagreement, a third researcher (D.P.L.) decided whether to include the article. For the collection of relevant data we consulted the Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research [9]. We collected data on year of publication, country of origin, DOI number, study design (prospective vs. retrospective), classification vs. regression, NEN type studied, dataset (number of patients or samples), input (predictors), output (outcomes), tested AI algorithm(s), training set, test set, internal and external validation sets, cross-validation method, accuracy, F1-score, AUROC (with 95% CI, if available) and AUPRC (with 95% CI, if available).
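The search strategy above combines three bracketed groups of terms with AND. As a minimal sketch, and assuming that the semicolon-separated terms within each bracket are alternatives joined with OR (the exact string submitted to PubMed is not reported here), such a query could be assembled as follows. The helper names `or_group` and `build_query` are hypothetical, and whether PubMed accepts a given quoted, truncated phrase should be checked against its own search documentation.

```python
# Term groups as listed in the search strategy. Assumption: terms inside a
# group are alternatives (OR) and all three groups must match (AND).
AI_TERMS = ["artificial intelligence", "machine learning", "deep learning"]
NEN_TERMS = ["neuroendocrine", "NET", "NEN", "carcinoid", "insulinoma",
             "glucagonoma", "gastrinoma", "VIPoma"]
SITE_TERMS = ["gastrointest*", "GI", "small intest*", "appendi*", "colon*",
              "rect*", "colorect*", "stomach", "gastric", "duoden*",
              "pancrea*", "biliary", "bile duct", "Vater", "ampulla",
              "liver", "hepa*"]


def or_group(terms):
    """Join one group of synonyms with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(*groups):
    """Hypothetical helper: AND together the OR-joined groups."""
    return " AND ".join(or_group(g) for g in groups)


query = build_query(AI_TERMS, NEN_TERMS, SITE_TERMS)
print(query)
```

Keeping the term groups as plain lists makes the strategy easy to report verbatim and to rerun at a later date.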
Numerical variables are presented as mean ± standard deviation (SD). Categorical variables are presented using frequencies and percentages. Calculations and statistical analysis were carried out using the online tool Prism® (GraphPad Software, San Diego, CA, USA).

Results
The literature search across PubMed yielded 1327 articles. In addition, 9 articles were retrieved through other sources (Google® search, screening of the articles' reference lists). After screening of titles and abstracts, removal of duplicates, and implementation of eligibility criteria, 44 unique articles were included in the final analysis (Figure 1).
In order to identify the prediction problem of each study, we collected data on study design, nature of the prediction, and continuity of the target variable, as per Luo et al. [9]. Consequently, there were 19 prospective (42.2%) and 26 retrospective (57.8%) analyses. Notably, one study had 2 stages, one prospective and one retrospective [13], hence the discrepancy between the total number of studies (44) and the sum of analyses based on prospective-retrospective study design (45). Regarding the nature of the prediction, we dichotomized the studies into diagnostic vs. prognostic, depending on whether the prediction referred to healthy subjects or subjects with already diagnosed NET, respectively [55]. The analysis yielded 24 diagnostic (54.5%) and 20 prognostic (45.5%) studies. Finally, all studies but one [24] addressed classification. The prediction characteristics of each study are summarized in Table 2.
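The proportions reported above follow directly from the counts. The short check below is only an illustration of the arithmetic, including the two different denominators (45 design-based analyses vs. 44 studies); the variable and function names are illustrative.

```python
def pct(count, total):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100 * count / total, 1)


n_studies = 44
n_design_analyses = 45  # study [13] had one prospective and one retrospective stage

design = {"prospective": 19, "retrospective": 26}
nature = {"diagnostic": 24, "prognostic": 20}

# The counts must add up to their respective denominators.
assert sum(design.values()) == n_design_analyses
assert sum(nature.values()) == n_studies

for label, count in design.items():
    print(f"{label}: {count}/{n_design_analyses} = {pct(count, n_design_analyses)}%")
for label, count in nature.items():
    print(f"{label}: {count}/{n_studies} = {pct(count, n_studies)}%")
```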

Discussion
This scoping review deals with the current applications of artificial intelligence in the diagnosis and management of gastroenteropancreatic neuroendocrine neoplasms (GEP-NENs). GEP-NENs are inherently rare neoplasms; as such, an empirical approach to their management would be unreliable. One of the advantages of AI, applied through machine learning and deep learning, is that it can integrate a vast amount of data collected anywhere in the world (big data) and then render it applicable to clinical practice in an individualized manner.
Despite the rarity of NENs, our search yielded a total of 44 relevant studies, the vast majority of which have been published over the last three years. On the one hand, this is in line with the general trend of incremental accumulation of pertinent evidence in medicine [54,56]; on the other hand, it may reflect an increasing diagnosis rate of NENs, as documented by the SEER registry [2]. In any case, this growing body of evidence may pave the way for future research.
Nevertheless, available studies have several limitations. First, a major restriction is the small size of the datasets in the majority of the studies. Only 3 of them used data from large databases, with populations of 13,830 [48], 10,580 [22] and 9,663,315 [27] patients, whereas the rest of the studies had populations of 50-361 individuals. Another serious point is that most of the studies did not provide clear information on the structure of the prediction problem (i.e., study design, prognostic vs. diagnostic, classification vs. regression); this information therefore had to be derived through close reading of the text. Most importantly, there is a non-negligible number of studies with poorly defined training and test sets. Another area of confusion is the lack of universal nomenclature for the discrete datasets (i.e., training, validation and test). Some studies use the terms "test set" and "validation set" interchangeably, whereas others are structured around all three datasets. Future studies should also present their findings on AI algorithm performance in a robust way, including accuracy, F1-score, AUROC and AUPRC, because each one measures a different aspect of performance and may be a better indicator than the others under certain circumstances [57]. Such quantification will also pave the way for meta-analyses. Furthermore, the ultimate goal of AI is the implementation of the findings of relevant studies in clinical practice. This can be achieved only if the performance of AI algorithms is benchmarked against established tests. Given the small number of studies with an external validation dataset, there is plenty of room for improvement in the field. As mentioned earlier, future endeavors in the field should follow a universal structure as per the existing guidelines, for purposes of both reproducibility and quality [9,58].
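To illustrate why the four metrics named above can diverge, the following minimal sketch computes them on a small synthetic, imbalanced example. The labels, scores and the 0.5 decision threshold are invented for illustration and do not come from any included study; the AUPRC is approximated here by average precision, and the function names are our own.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1_score(y_true, y_pred):
    tp, fp, tn, fn = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve; assumes no tied scores."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    tp, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            total += tp / rank  # precision at each recalled positive
    return total / sum(y_true)


# Synthetic example: 2 positives among 10 subjects.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.15, 0.05, 0.25, 0.4, 0.35]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print("accuracy:", accuracy(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("AUROC:", auroc(y_true, scores))
print("AUPRC (average precision):", round(average_precision(y_true, scores), 3))
```

In this toy case, accuracy is dominated by the many true negatives, whereas the F1-score reflects that one of the two positives was missed, which is the kind of discrepancy that argues for reporting the metrics together.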
Proceeding from the structure to the content of the relevant studies, the most popular topics we documented are tumor type identification and grading, tumor detection, 5-year survival, cell segmentation, disease progression, disease recurrence and Ki-67 scoring. In a recent review, Yang et al. showed similar applications of AI with satisfactory prediction accuracy in the diagnosis, risk stratification and prognosis of small intestinal tumors [59]. Interestingly, that review shares 3 studies with the present review [14,21,33], which is not surprising given the rarity of small intestinal tumors and the major share of NENs among them. Kim et al. performed a similar analysis of the usefulness of AI in gastric neoplasms [60].
The combination of radiomics, i.e., the multitude of features and technical parameters that can be extracted from imaging studies, with the capability of big data processing offered by AI has opened new frontiers and has led to an exponential burst of pertinent literature. The fundamental steps in transforming an imaging study into data that can be processed by an AI algorithm are image acquisition, segmentation (i.e., selection of a region of interest in two dimensions), preprocessing (which allows data homogenization), data extraction, data selection and modeling. Given the routine performance of a constellation of imaging studies in clinical practice, this concept could contribute to the prompt diagnosis of NENs even at a preclinical stage. Promising evidence from imaging of pancreatic tumors with CT and MRI shows that this technology could find more widespread application in the field of NENs [61]. Partouche et al. performed a systematic review and meta-analysis of 161 studies on AI and imaging for Pan-NETs [62]. In accordance with our review, they documented wide heterogeneity of practices, poor procedural compliance with international guidelines, and poor reporting of clinical protocols. They conclude that standardization and homogenization are key to future research if AI is to enter clinical practice as a standard of care. In another recent review on the role of radiomics in Pan-NETs, Bezzi et al. also acknowledge the need for further validation before widespread clinical adoption; nevertheless, this discipline has great potential in decision-making regarding diagnosis and management [63].
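The middle steps of the radiomics workflow described above (segmentation, preprocessing and data extraction) can be sketched on a toy example. The 4x4 "image" and the binary mask below are synthetic placeholders, the z-score normalization and the handful of first-order features are assumptions chosen for brevity, and feature selection and modeling are deliberately left out; real pipelines typically rely on dedicated imaging libraries.

```python
from statistics import mean, pstdev

# Toy 4x4 "image" (arbitrary intensity units) and a binary mask marking the
# region of interest (ROI). Both are synthetic placeholders.
image = [
    [10, 12, 11, 10],
    [11, 40, 42, 12],
    [10, 41, 45, 11],
    [12, 11, 10, 10],
]
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]


def normalize(img):
    """Preprocessing: z-score the whole image to put scans on a common scale."""
    flat = [v for row in img for v in row]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(v - mu) / sigma for v in row] for row in img]


def roi_values(img, roi_mask):
    """Apply the segmentation: keep only the pixels inside the ROI."""
    return [v for row, mrow in zip(img, roi_mask)
            for v, m in zip(row, mrow) if m]


def first_order_features(values):
    """Data extraction: a few simple first-order descriptors of the ROI."""
    return {
        "n_pixels": len(values),
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
    }


features = first_order_features(roi_values(normalize(image), mask))
print(features)
```

The resulting dictionary stands in for the feature vector that would then pass through feature selection and into a classifier.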
In a process similar to data extraction from imaging studies, histology images can be processed with the aid of AI algorithms, following a pipeline that runs from whole slide images (WSIs) through segmentation into tiles and biomarker visualization to classification. Kuntz et al. recently published a review of 16 studies that used CNNs to analyze gastrointestinal cancer histology images and showed good performance metrics with external validation, but none of them has yet been implemented clinically [64].
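The tiling step of the histology pipeline mentioned above can be illustrated with a small coordinate generator. This is a sketch under stated assumptions: the slide dimensions and tile size are arbitrary, tiles do not overlap by default, and partial tiles at the right and bottom edges are simply discarded; the function name `tile_coordinates` is hypothetical and no slide-reading library is involved.

```python
def tile_coordinates(width, height, tile_size, stride=None):
    """Yield (x, y) top-left corners of the full tiles that fit in the slide.

    Assumption: partial tiles at the right and bottom edges are dropped.
    """
    stride = stride or tile_size  # non-overlapping tiles by default
    for y in range(0, height - tile_size + 1, stride):
        for x in range(0, width - tile_size + 1, stride):
            yield x, y


# Illustrative dimensions only; real WSIs are far larger.
tiles = list(tile_coordinates(width=1000, height=600, tile_size=256))
print(len(tiles), "tiles:", tiles)
```

Each coordinate pair would identify one image patch to be passed to a classifier such as a CNN, with the per-tile predictions then aggregated at the slide level.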
The main limitation of the present review is the heterogeneity of the included studies in terms of methodology, dataset allocation and performance benchmarking, which did not allow for a meta-analysis. Structured publications are consequently necessary in order to produce reproducible evidence of high quality. Another limitation of our study stems from the heterogeneity of NENs themselves, which may raise methodological issues. Nevertheless, given the exploratory nature of our research, an inclusive search strategy was inevitable. Future reviews could focus on specific histologic neuroendocrine types or disease stages.

Conclusions
To our knowledge, this is the first attempt to systematize existing evidence on the applications of AI in the field of NENs. Published studies focus mostly on diagnosis (tumor detection, tumor identification and tumor grading) rather than management and decision-making, mainly with the use of imaging studies and histology samples. Future studies should adhere to the reporting and quality requirements set by existing guidelines.