Risk Prediction of Barrett’s Esophagus in a Taiwanese Health Examination Center Based on Regression Models

Determining the target population for the screening of Barrett’s esophagus (BE), a precancerous condition of esophageal adenocarcinoma, remains a challenge in Asia. The aim of our study was to develop risk prediction models for BE using logistic regression (LR) and artificial neural network (ANN) methods. Their predictive performances were compared. We retrospectively analyzed 9646 adults aged ≥20 years undergoing upper gastrointestinal endoscopy at a health examinations center in Taiwan. Evaluated by using 10-fold cross-validation, both models exhibited good discriminative power, with comparable area under curve (AUC) for the LR and ANN models (Both AUC were 0.702). Our risk prediction models for BE were developed from individuals with or without clinical indications of upper gastrointestinal endoscopy. The models have the potential to serve as a practical tool for identifying high-risk individuals of BE among the general population for endoscopic screening.


Introduction
Esophageal adenocarcinoma (EAC) has a poor prognosis, with a 5-year survival rate of <15% [1]. Over the last three decades, the incidence of EAC has risen sharply in many countries around the world [2]. In the United States, the incidence of EAC increased from 0.40 cases per 100,000 individuals in 1975 to 2.58 cases per 100,000 individuals in 2009, a more than 6-fold increase [3]. Upward trends in the incidence of EAC have also been observed in Asian countries [4,5]. Barrett's esophagus (BE) is a well-documented precancerous condition of EAC [1]. Although the prevalence of BE has been lower in Asian populations than in Western populations, the rate has increased from 0.8% in 1991-1999 to 2.2% in 2010-2014 [6]. A study reported that BE is probably under diagnosed in most parts of Asia [7], which may affect the prognosis of EAC. A recent meta-analysis demonstrated that the development of endoscopic therapy for BE and early-stage EAC, targeted BE screening, and endoscopic surveillance of patients with BE could provide better survival outcomes [8].
For now, the diagnostic tools for BE are upper gastrointestinal (UGI) endoscopy and histological confirmation [9,10]. Due to their invasiveness and cost, however, the methods' usefulness in BE screening is limited. Identifying target populations for BE screening therefore remains a challenge. The current BE screening guidelines recommend that patients with gastroesophageal reflux disease (GERD) with several risk factors including the male sex, an age >50 years, central obesity, and a history of smoking should be considered for screening with UGI endoscopy [9,10]. However, these guidelines do not provide quantitative data to stratify the risk of BE while combining multiple risk factors. A number of risk prediction models for other diseases are widely employed to help clinicians make individualized medical decisions for their patients [11][12][13][14]; however, most of the published BE prediction models were constructed for Western populations and have yet to be verified for Asian populations [11,[15][16][17][18][19][20], a relevant issue given that BE presents different patterns for Asian and Western populations [21].
Logistic regression (LR) and machine learning are two common methods for constructing risk prediction models [12,22]. LR is a classical statistical method for building prediction models, while artificial neural networks (ANNs) are state-of-the-art learning machines in the machine learning area for modeling a complex system, such as modeling, voice recognition, and image classification [23]. An ANN consists of multi-layered processing units called neurons that resemble the structure and behavior of biological neurons. The class of ANNs has the universal approximation property in the sense that they can approximate any reasonable function to any desired degree of accuracy. Traditional ANNs are referred to as "shallow" because usually only one hidden layer is employed. Recent studies have employed numerous hidden layers with different variations, which are known as deep neural networks. Srinivas et al. compared the performance between LR and ANN models for BE in patients with GERD [15]. The authors demonstrated that the ANN model was superior to the LR model, with a slightly larger area under the curve (AUC). Nevertheless, no further studies were conducted on the BE prediction models using both LR and ANN methods, especially in Asian populations.
We therefore conducted this retrospective cross-sectional study by analyzing a database of information collected during physical examinations at a health examination center in southern Taiwan. The study population underwent UGI endoscopy as part of their health examination with or without clinical indications. The aim of this study was to develop LR and ANN risk prediction models for BE in an Asian population and to comprehensively compare the predictive performance of the LR and ANN models.

Study Population and Data Collection
A total of 11,879 adults aged ≥20 years with or without gastrointestinal symptoms underwent UGI endoscopy during physical examinations at the Kaohsiung Veterans General Hospital, Taiwan, between 1 January 2016 and 31 December 2018. Their demographic data, history of comorbidities (including hypertension and diabetes mellitus), history of smoking (number of packs, frequency and duration), alcohol intake (number, frequency and alcohol percentage per week), exercise habits (frequency and duration per week), and GERD symptoms (e.g., heart burn, regurgitation and dysphagia) in the past three months were recorded as medical records during the pre-endoscopic examination interview. Their weight, height, and waist circumferences (WC) were measured routinely by trained examiners during health examination at our center and also recorded as medical records.
Six experienced physicians performed the endoscopic examinations at our hospital. In accordance with the American College of Gastroenterology clinical guidelines, we diagnosed BE if salmon-colored mucosa was observed extending 1 cm or more above the gastroesophageal junction, with histological confirmation of intestinal metaplasia [10]. Biopsies were obtained from four quadrants (2 cm apart) of the circumferential part of the endoscopically suspected esophageal metaplasia (ESEM). For mucosal tongues of non-circumferential ESEM, the sites and biopsies were evaluated under narrowband imaging and then decided upon by the physicians.
Of these 11,879 adults, 2233 were excluded because one had a history of esophageal cancer and 2232 had missing data in the dataset. Therefore, 9646 adults were included. The sources of the dataset were from the medical records of our hospital and the data were anonymous in the dataset. The study was approved by the Ethics Committee of the Kaohsiung Veterans General Hospital (VGHKS21-CT3-12). Participants' consents were not required in the study because this is a retrospective study, and all data were analyzed in anonymity.
To select the variables, we employed statsmodels (version 0.9.0, [24] in Python to construct a generalized linear model from the twelve variables (setting link function = logistic and family = binomial in statsmodels) [24] and checked each variable to determine its statistical significance (p < 0.05). The most significant variables will then be employed to construct the LR and ANN models for comparisons. Given the different measurement scales between the categorical and continuous variables, we employed the z-score formula to standardize all continuous variables including age, height, weight, BMI, and WC. With x as one of the input variables, the standardization process for x is performed through where µ x is the mean value and σ x is the standard deviation (SD) of the variable x. Note that in the future predicting phase, numerical type variables need to be standardized before applying the model. The LR models were built with the Python package scikit-learn (version 0.20.3) [25] and validated by stratified 10-fold cross-validation. To construct the ANN models, we employed the Python package Keras (version 2.2.4) [26]. The architecture of our proposed ANN model, presenting four hidden layers with LeakyReLU or ReLU activation functions, is shown in Figure 1. To prevent overfitting, we employed the dropout regularization mechanism [27]. Dropout is a technique which drops out the neuron connection randomly. We also adapted the batch normalization by re-centering and re-scaling the layers' inputs for each mini batch to reduce the covariate shift between the layers in the neural network. Furthermore, the residual connection, which injects the outputs of earlier layers directly into the inputs of latter layers, is used to decrease the likelihood of representational bottleneck and vanishing gradient [28,29].
We employed AUC as a metric to measure model performance and validated the LR and ANN models using stratified 10-fold cross-validation. We also compared the sensitivity, specificity, and accuracy under different threshold settings of the LR and ANN models.

Characteristics of the Subjects and Variables Selection
The simulation programs were coded using the Python programming language (version 3.5.4) running on Microsoft Windows 10 on an Intel Core i5 CPU with 8 GB of RAM. Table 1 shows the baseline characteristics of the 9646 participants, which included 5020 men and 4384 women (mean age, 50.41 years old; SD, 11.75 years), 242 of whom (2.51%) had BE. We first constructed the generalized linear model using statsmodels and analyzed the twelve variables one by one to check their statistical significance. The corresponding p-values are shown in Table 2. We included the four statistically significant variables (p < 0.05) in the successive simulations of the LR and ANN models: age (odds ratio [

LR and ANN Models Development and Their Predictive Performance Comparisons
To better assess the trained models, we employed a stratified 10-fold cross-validation. Table 3 shows the mean performance of the stratified 10-fold cross-validation for the two models in the sampling threshold settings. With a threshold setting of 90% sensitivity, the specificity of the LR and ANN models was 31% and 20%, respectively. At 90% specificity, the sensitivity of the LR and ANN models was 30% and 28%, respectively. The point (0,1) is the best cutoff point for the perfect model (AUC = 1) on the receiver operating characteristic (ROC) curve. We therefore assumed a "closest to (0,1)" criterion to find the closest point to the top left corner (0,1) on the ROC curve, which serves as a method for finding the optimal cutoff value [30]. The values for the performance of the closest point to point (0,1) for the LR and ANN models, which are similar, are shown in Table 3. We computed the AUC and SD for the AUCs across folds for the LR and ANN models.

Final Mean LR Model
We obtained the coefficients, which included the intercept of the final mean LR model, by taking the mean of the relevant coefficients across the ten folds in the cross-validation (Table 4). Assume the variables xj, j= 1,2,3,4,5 have been standardized using Equation (1) and that y denotes the estimated probability of BE, the final mean LR model through cross-validation was described using the following Equation (2): where x 1 is the age x 2 is 1 if the sex is male, otherwise 0 x 3 is 1 if the patient has presented GERD symptoms in the past 3 months, otherwise 0 x 4 is 1 if the patient's cumulative smoking exposure is >0 but ≤20 pack-years, otherwise 0 x 5 is 1 if the patient's cumulative smoking exposure is >20 pack-years, otherwise 0 We provided the sampling threshold settings using the final mean LR model in Table 4. Based on a threshold setting of 90% sensitivity and of 90% specificity, a cutoff point was 0.33 and 0.67, respectively. Based on the "closest to (0,1)" criteria, a cutoff point was 0.52. Users can choose a suitable cutoff point according to their clinical considerations. In addition, the code of the ANN model is provided in the Supplementary Materials (see Supplementary Code S1).

Discussion
This is a large-scale study for building a risk prediction model for BE. Age, sex, GERD symptoms, and smoking were significantly associated with BE and employed to construct models. In the case of different ethnicity with different pattern of BE, our LR model had comparable performance with those of previously published models [11,[15][16][17][18][19]. To our knowledge, only one published model used ANNs. Compared with that, the study had larger sample sizes and our ANN model had better performance [15]. In addition, different from previous studies, the participants in our study were not only from physicians' referral [17,18,20] and those with subjective symptoms [15], but also from individuals without symptoms. Rubenstein et al. developed a prediction model from participants without a referral for a clinical indication; however, the study population was limited to older males (50-79 years) [11]. Our prediction models constructed by those who aged over 20 years and were not restricted by clinical indications of UGI endoscopy have the potential to be applied to the general population.
ANNs can capture nonlinear relationships between inputs (or variables) and outputs (or responses) which is considered as a powerful method of for model development.
When constructing the ANNs employed in this study, we used dropout regularization techniques to avoid overfitting the training data, batch normalization to reduce the covariate shift between network layers, and residual connection to decrease the likelihood of representational bottleneck and vanishing gradient [28,29]. These methods are state-of-the-art techniques employed in deep learning and were adopted for our proposed neural network model. However, the performance of the ANN model, which was constructed by using four hidden layers, was not superior to that of the LR model in the study. The low number of significant variables for the predictive model could be a contributing factor for the result because the strength of the neural network for nonlinear mapping among these variables is limited [31].
The four most significant variables (age, sex, presence of GERD symptoms and a history of smoking), which were included as predictors in the construction of the LR and ANN models, are compatible to previous studies as risk factors [9,10]. In clinical practice, information on these four variables can be easily obtained through interviews, and the prediction process can be started with only a few variables. The stratified 10-fold cross-validation was performed in our study to minimize the variability of the models' performance. This study demonstrated the performance characteristics of the final mean LR model for various cutoff points (Table 4). A sensitivity and specificity close to 90% having differing clinical objectives. High sensitivity is better suited to personal physical examinations to lower the proportion of undetected cases. Given the low prevalence of BE in Asian populations, a high specificity may be taken as a reference from the public health perspective. Thrift et al. indicated that, given the low risk of progression to cancer for patients with BE, the higher threshold setting for the cutoff point may be considered [16].
As a practical example for the final mean LR model, consider a 55-year-old man with GERD symptoms in the past three months but no history of smoking. After standardizing the numerical predictors in the data preprocessing step, the age variable should be adjusted to 55-50.4 ([grand mean (age)])/11.8 [SD (age)]. The formula to apply would therefore be the following (Equation (3) 11.8 [standard deviation (age)] = 0.39 x 2 is 1, because the patient is male. x 3 is 1, because of the GERD symptoms in the past 3 months. x 4 is 0, because the cumulative smoking exposure is 0. x 5 is 0, because the cumulative smoking exposure is 0.
The resulting BE probability is therefore 0.66. According to the cutoff points listed in Table 4, 0.66 is greater than 0.33, which is the cutoff point for 90% sensitivity. Endoscopic BE screening can be considered if we set cutoff point at 90% sensitivity from the personal health examination standpoint. However, the BE probability is <0.67, which is the cutoff point for 90% specificity. Endoscopic BE screening may not be suggested from the public health perspective if we set cutoff point at 90% specificity.
Our study had some limitations. First, the study enrolled participants from the center for out-of-pocket physical examinations, which may enroll individuals with a higher socioeconomic status. However, the role played by socioeconomic status in the risk of BE has not been established [9,10]. Further studies about association between socioeconomic status and BE may be needed. Second, a number of potential factors (e.g., hiatal hernia and H. pylori infection) were not included in this analysis [32,33], which might have improved the model. However, this information would limit the study population to those who underwent UGI endoscopy prior to the study and complicate the data collection from interviews. Third, although the stratified 10-fold cross-validation decreases the overfitting problem and provides low variability for the prediction results, external validation is still needed to confirm the model's generalizability.

Conclusions
Both LR and ANN models exhibit good discriminative power. The ANN model did not exhibit superior performance to the LR model in the study. LR may be a preferable method for model development while low number variables employed because of its comparable performance with ANNs and interpretability in the clinical settings. However, careful interpretation of the association between independent and dependent variables is crucial due to the concerns of collapsibility and exchangeability for LR. Further studies are needed to validate the models and to evaluate the cost benefit of BE screening in different clinical settings.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/ijerph18105332/s1, Code S1: The code for ANN model is provided in the file. Data S2: The dataset used in this study is provided in the file.  Informed Consent Statement: Patient consent was waived because this is a retrospective study, and all data were analyzed in anonymity.