A Lesion-Based Convolutional Neural Network Improves Endoscopic Detection and Depth Prediction of Early Gastric Cancer

In early gastric cancer (EGC), tumor invasion depth is an important factor for determining the treatment method. However, endoscopic ultrasonography has limitations when measuring the exact depth in a clinical setting, and endoscopists often depend on gross findings and personal experience. The present study aimed to develop a model optimized for EGC detection and depth prediction, and we investigated factors affecting artificial intelligence (AI) diagnosis. We employed a visual geometry group (VGG)-16 model for the classification of endoscopic images as EGC (T1a or T1b) or non-EGC. To induce the model to activate EGC regions during training, we proposed a novel loss function that simultaneously measured classification and localization errors. We experimented with 11,539 endoscopic images (896 T1a-EGC, 809 T1b-EGC, and 9834 non-EGC). The areas under the receiver operating characteristic curves for EGC detection and depth prediction were 0.981 and 0.851, respectively. Among the factors affecting AI prediction of tumor depth, only histologic differentiation was significantly associated, where undifferentiated-type histology exhibited a lower AI accuracy. Thus, the lesion-based model is an appropriate training method for AI in EGC. However, further improvements and validation are required, especially for undifferentiated-type histology.


Introduction
Accurate staging is the basis for determining an appropriate treatment plan for suspected early gastric cancer (EGC) based on endoscopy or biopsy findings. As the indications for endoscopic resection (ER) and minimally invasive surgery are usually decided by the T-stage, tumor invasion depth is crucial for determining the treatment modality [1-3].
EGC is categorized as tumor invasion of the mucosa (T1a) or that of the submucosa (T1b). Endoscopic ultrasonography (EUS) is useful for T-staging of gastric cancer because it can delineate each gastric wall layer [4,5]. However, EUS is not superior to conventional endoscopy for T-staging of EGC, having a low accuracy of approximately 70% [6,7]. Therefore, there has been increasing interest in the field of medical imaging regarding modalities for predicting EGC depth.
Recently, deep learning-based artificial intelligence (AI) has shown remarkable progress across multiple medical fields. Diagnostic imaging is currently the most prominent and efficient application of AI-based analyses in medicine [8,9]. AI using endoscopic images has been applied to diagnose neoplasms in the gastrointestinal tract [10,11]. Deep convolutional neural networks (CNNs) are a type of deep learning model widely used for image analysis [12]. However, EGC depth prediction differs from general image classification because the differences in EGC depth in endoscopic images are subtler and more difficult to discern. Therefore, more sophisticated image classification methods are required.
Although easily distinguishable visual features and large-scale datasets play key roles in improving the performance of natural image classification models, these conditions are difficult to meet for EGC detection and EGC depth prediction models. Although the invasion depths in EGC are defined differently, features such as textures, shapes, and colors are visually similar. In addition, each degree of invasion depth may not have sufficient training images to cover all types of visual features, because of the fine-scale granularity. Therefore, models for EGC detection and depth prediction may come to focus on other visually distinguishable patterns rather than on EGC. For example, model weights may initially be tuned to find a tiny particle appearing on most images in a T1b-EGC training set rather than extracting features from homogeneous regions. Therefore, it is critical to guide the model to learn the visual features of EGC regions rather than those of other gastric textures.
The present study aims to develop a model and training method optimized for EGC depth prediction, evaluate its diagnostic performance, and investigate factors affecting AI diagnosis.

Patients
This study included 800 patients (538 men and 262 women; age: 26-92 years; mean age: 62.6 years) with an endoscopic diagnosis of EGC at the Gangnam Severance Hospital, Yonsei University College of Medicine, Seoul, Korea, between January 2012 and March 2018. EGC was suspected based on endoscopy findings and all patients underwent a curative treatment by either operation or ER for gastric cancer. The invasion depth was confirmed pathologically through specimens obtained after the treatment. This study was approved by the Institutional Review Board of Gangnam Severance Hospital (no. 3-2017-0365).

Data Preparation (Endoscopic Image Collection)
Endoscopy was performed for screening or preoperative examinations. Images were captured using standard endoscopes (GIF-Q260J, GIF-H260, and GIF-H290; Olympus Medical Systems, Co., Ltd., Tokyo, Japan). The images of a lesion should include both a close-up and a distant view so that the size and position of the lesion can be identified. Additionally, the amount of gas insufflation should be adjusted appropriately to reflect the condition of the lesion and its surrounding area.
We collected 11,686 endoscopic images, including 1097 T1a-EGC, 1005 T1b-EGC, and 9834 non-EGC images. The non-EGC images were endoscopic images of gastric mucosa without EGC, including chronic gastritis, chronic atrophic gastritis, intestinal metaplasia, and erosion. Poor-quality images were filtered out. The inclusion criteria were white-light images and images showing the whole lesion; images with motion blur, poor focus, halation, or poor air insufflation were excluded. Finally, 11,539 images (896 T1a-EGC, 809 T1b-EGC, and 9834 non-EGC) were selected. To prepare the image dataset for the models, the selected images were randomly organized into five folds to assess how well the trained model generalized while avoiding overfitting and test-set selection bias [13]. The five folds were used to train and evaluate the deep learning models. All the folds were independent, and the training:validation:testing dataset ratio at each fold was 3:1:1 (Supplementary Table S1). All images from a given patient were assigned to the same fold; therefore, the number of images differed slightly between folds (Supplementary Table S2). The validation set, a fold fully independent of the training folds, was used to monitor the training status during training. After training the model, the remaining independent fold was used as a testing set to evaluate the model performance. For example, cross validation-group 1 of Supplementary Table S1 used the first three folds (A, B, and C) as the training set, the fourth fold (D) as the validation set, and the remaining fold (E) as the testing set.
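The patient-level fold assignment and the 3:1:1 rotation described above can be illustrated with a minimal sketch. The function names, the random seed, and the rotation order for groups 2 to 5 are assumptions made for illustration (only group 1, with A, B, and C for training, D for validation, and E for testing, is stated in the text); this is not the authors' code.

```python
import random

def assign_patient_folds(image_patient_ids, n_folds=5, seed=0):
    """Map each patient ID to one fold index, so that all images of a
    patient inherit the same fold (hypothetical helper)."""
    patients = sorted(set(image_patient_ids))
    rng = random.Random(seed)          # assumed seed, for reproducibility only
    rng.shuffle(patients)
    return {patient: i % n_folds for i, patient in enumerate(patients)}

def rotation_groups(fold_names=("A", "B", "C", "D", "E")):
    """Return (training folds, validation fold, testing fold) per group.
    Group 1 matches the text; the cyclic rotation for the other groups
    is an assumption."""
    n = len(fold_names)
    groups = []
    for g in range(n):
        rotated = [fold_names[(g + k) % n] for k in range(n)]
        groups.append((rotated[:3], rotated[3], rotated[4]))
    return groups

# One patient ID per image; images of the same patient share an ID.
image_patient_ids = ["p1", "p1", "p2", "p3", "p3", "p4", "p5", "p6"]
patient_to_fold = assign_patient_folds(image_patient_ids)
image_folds = [patient_to_fold[p] for p in image_patient_ids]
```

Because folds are assigned per patient rather than per image, the number of images per fold can differ slightly, as noted for Supplementary Table S2.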

Convolutional Neural Network and Training
Detailed descriptions of the neural network architectures, novel loss function methods, training methods, and algorithms are presented in the Supplement. A short summary is provided below. We used two networks with two training methods to evaluate which one allowed the CNN to be better oriented to EGC regions. The two network models were based on a transfer learning method with the visual geometry group (VGG)-16 network pre-trained on ImageNet, which is a large-scale dataset published for the image classification task, to effectively initialize and train network weights [14,15]. The first model was a typical method that computed the loss between the real and predicted classes of input data. The second was a novel method that used the weighted sum of gradient-weighted class activation mapping (Grad-CAM) and cross-entropy losses. To let the model focus on the fine-grained features of EGC regions, we employed a novel loss function by adding Grad-CAM [16,17]. Although most existing visualization methods require an additional module to generate visual explanations, Grad-CAM can visualize activation statuses as they gradually change during training while keeping the initial architecture [18,19]. The gradually activated EGC regions of the input RGB image, which were produced by passing trained layers, are shown in Supplementary Figure S1. The blue and red colors on Grad-CAM indicate lower and higher activation values, respectively.
The proposed novel method allows the training procedure to optimize an objective that simultaneously minimizes not only the classification error (real classes-predicted classes) but also the localization error (real lesion mask-activated Grad-CAM). The real lesion mask is the part of an endoscopic image that the endoscopist identified as the actual EGC area. We named this novel method "lesion-based VGG-16." An overview of the proposed algorithm for the computer-aided diagnosis (CAD) of EGC is shown in Supplementary Figure S2.
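The combined objective can be sketched as follows. The exact loss is given only in the Supplement, so the mean-squared form of the localization term, the weight `alpha`, and all function names here are assumptions; the Grad-CAM step follows the standard definition (channel weights from globally averaged gradients, followed by a ReLU). The sketch shows only the forward computation in NumPy; an actual training setup would compute it inside an automatic-differentiation framework.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map for one convolutional layer.
    activations, gradients: arrays of shape (K, H, W), where gradients
    holds d(class score)/d(activations)."""
    weights = gradients.mean(axis=(1, 2))                 # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)      # weighted sum over channels
    cam = np.maximum(cam, 0.0)                            # ReLU
    peak = cam.max()
    return cam / peak if peak > 0 else cam                # scale to [0, 1]

def cross_entropy(probs, label):
    """Classification error for one image given class probabilities."""
    return -np.log(probs[label] + 1e-12)

def lesion_based_loss(probs, label, cam, lesion_mask, alpha=0.5):
    """Weighted sum of classification and localization errors.
    The MSE localization term and alpha=0.5 are illustrative assumptions."""
    classification = cross_entropy(probs, label)
    localization = np.mean((lesion_mask - cam) ** 2)
    return classification + alpha * localization

# Toy example: two channels on a 2x2 feature map.
activations = np.array([[[1.0, 0.0], [0.0, 0.0]],
                        [[0.0, 0.0], [0.0, 1.0]]])
gradients = np.array([np.ones((2, 2)), -np.ones((2, 2))])
cam = grad_cam(activations, gradients)
lesion_mask = np.array([[1.0, 0.0], [0.0, 0.0]])
loss = lesion_based_loss(np.array([0.5, 0.5]), 0, cam, lesion_mask)
```

In this toy case the activated region coincides with the lesion mask, so the localization term vanishes and only the cross-entropy term contributes; a map activated away from the lesion would raise the loss.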

Evaluation
To evaluate the performance of EGC detection and depth prediction models, we measured the sensitivity (%), specificity (%), positive predictive value (PPV) (%), negative predictive value (NPV) (%), and area under the curve (AUC) of receiver operating characteristic (ROC) curves by summing all cross-validation folds. Because the number of test images differed between cross-validation folds, the raw folds were not sufficient for evaluating generalized performance. Therefore, we randomly selected a fixed number of images from each class of the test dataset. The test set of EGC depth prediction included 300 images, comprising 150 T1a-EGC and 150 T1b-EGC images. The EGC detection model test set included 660 images consisting of 330 EGC and 330 non-EGC images. Additionally, 90 EGC images not included in the cross-validation datasets were also tested. A total of 1590 and 3390 images were evaluated for predicting EGC depth and detecting EGC, respectively.
Since the network was trained by activating EGC regions, activated regions extracted by Grad-CAM at the last convolutional layer could be considered as suspected cancer regions when an endoscopy image was fed to the network. To assess whether the activated regions can be used to localize EGC regions, we evaluated the EGC localization performances. Our method for localizing EGC regions from Grad-CAM and the evaluation metrics are described in the supplementary materials.
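The reported metrics follow from the confusion matrix and the ranking of prediction scores. The sketch below uses hypothetical function names and toy labels; it assumes binary labels (1 = positive class) and that every class and every predicted label occurs at least once. The image totals quoted above are consistent with five test folds plus the 90 additional images (5 × 300 + 90 = 1590 and 5 × 660 + 90 = 3390).

```python
def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, and NPV from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive scores higher than
    a random negative (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data, not study results.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.1, 0.2, 0.7]
metrics = diagnostic_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)

# Consistency check of the evaluated image totals.
depth_total = 5 * 300 + 90       # 1590
detection_total = 5 * 660 + 90   # 3390
```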
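One common way to turn an activation map into a suspected region and score it against the endoscopist's lesion mask is thresholding followed by intersection over union (IoU). The study's actual localization procedure and metrics are given only in the supplementary materials, so the threshold of 0.5, the IoU metric, and the function names below are assumptions used purely for illustration.

```python
import numpy as np

def cam_to_region(cam, threshold=0.5):
    """Binarize a Grad-CAM map scaled to [0, 1] (assumed threshold)."""
    return np.asarray(cam) >= threshold

def intersection_over_union(pred_region, lesion_mask):
    """Overlap between the activated region and the lesion mask."""
    pred = np.asarray(pred_region, dtype=bool)
    mask = np.asarray(lesion_mask, dtype=bool)
    union = np.logical_or(pred, mask).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, mask).sum() / union)

# Toy 2x2 map: two pixels are activated, one of them lies on the lesion.
cam = np.array([[0.9, 0.6], [0.2, 0.1]])
lesion_mask = np.array([[1, 0], [0, 0]])
iou = intersection_over_union(cam_to_region(cam), lesion_mask)
```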

Statistical Analysis
Chi-squared and Fisher's exact tests were used to evaluate the associations among various categorical variables. Univariable and multivariable logistic regression analyses were performed to identify factors significantly affecting the AI accuracy. Odds ratios (ORs) and relevant 95% confidence intervals (CIs) were calculated. Analyses were performed using SAS version 9.4 (SAS Institute, Cary, NC, USA) or IBM SPSS Statistics for Windows, version 23.0 (IBM Co., Armonk, NY, USA) and p-values < 0.05 indicated statistical significance.
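For a single binary factor, the odds ratio from univariable logistic regression equals the cross-product ratio of the 2×2 table, and its Wald 95% CI uses the standard error of the log odds ratio. The sketch below illustrates this calculation; the function name and the counts are hypothetical and are not study data, and the analyses themselves were run in SAS or SPSS as stated above.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald CI from a 2x2 table.
    a: factor present, AI incorrect    b: factor present, AI correct
    c: factor absent,  AI incorrect    d: factor absent,  AI correct
    All four counts must be nonzero."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts for illustration only.
odds_ratio, lower, upper = odds_ratio_ci(a=20, b=30, c=10, d=60)
```

A CI that excludes 1 corresponds to a significant association at the 0.05 level for that factor; multivariable models require a full regression fit rather than this closed form.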
Subsequently, we evaluated the lesion-based VGG-16 on the same test image set. The sensitivity and specificity for EGC detection were 91.0% and 97.6%, respectively, and the PPV and NPV were 97.5% and 91.1%, respectively. The overall AUC was 0.981. The sensitivity and specificity of the prediction of tumor depth in the lesion-based VGG-16 were 79.2% and 77.8%, respectively, and the PPV and NPV were 79.3% and 77.7%, respectively. The overall AUC was 0.851 (Supplementary Table S3 and Figure 1).

Localization Ability of the Activated Regions
We compared the localization ability of the activated regions on the last convolutional layer of the VGG-16 with and without using the Grad-CAM loss. Supplementary Figure S3A Figure S4) did not precisely cover the actual EGC regions (first two columns of Supplementary Figure S4), and in some cases, deviated from the EGC regions (last two columns of Supplementary Figure S4). In contrast, the lesion-based VGG-16 (last row) attempted to completely activate and reach the EGC regions. In depth prediction, lesion-based VGG-16 reflected the actual EGC regions more accurately, as shown in Supplementary Figure S4B. Figure 2 shows some examples of the correctly classified (first two rows) and misclassified (last row) images of the lesion-based VGG-16. Although the model misclassified the presence or depth of EGCs in some cases, the EGC region was accurately activated.

Factors Associated with the Accuracy of Tumor Detection by AI
EGCs with a flat morphology had a significantly lower accuracy for EGC detection than other gross types (p = 0.038) (Table 2). Relatively small size (1-13 mm) (p = 0.002) and T1a-EGC (p = 0.001) were significantly associated with tumor detection accuracy. In multivariable analysis, small size (1-13 mm) (p = 0.006) and T1a-EGC (p = 0.019) showed significantly lower accuracies. The accuracies of EGC detection for tumors ≤5 and ≤10 mm were 88.4% and 89.4%, respectively. The EGC detection accuracy did not differ significantly according to the histologic differentiation and location.

Factors Associated with the Accuracy of T-Staging by AI
Undifferentiated-type histology was the only factor significantly associated with a lower accuracy for T-stage prediction in both univariable and multivariable analyses (p = 0.001 and 0.033, respectively) (Table 3). The accuracy did not differ significantly according to the size. The factors associated with T-stage prediction were reanalyzed in undifferentiated-type histology. Only T1b was significantly associated with a lower T-stage prediction accuracy (p = 0.015) (Table 4). Thus, factors associated with T-staging in undifferentiated-type histology were investigated. Relatively large size (≥14 mm) (p = 0.003) and poorly differentiated adenocarcinoma (p < 0.001) were significantly associated with T1b in undifferentiated-type histology (Table 5). Among undifferentiated-type EGCs, flat and elevated morphologies were more common in T1a and T1b, respectively.
Table 4. Factors affecting the accuracy of T-staging in undifferentiated-type adenocarcinoma.

Discussion
Although previous studies have reported the clinical efficacy of EUS in T-staging of EGC, the results are conflicting [7,20-22]. Some studies have reported that conventional endoscopy is comparable to EUS for the T-staging of EGC [6,23]. Various morphologic features, such as an irregular surface and submucosal tumor-like marginal elevation, have been proposed as predictors of tumor invasion depth [24]. Identification and verification of additional morphological features of deep invasion in large datasets would allow a more complete depth prediction.
The sensitivity and overall AUC of EGC detection in the present study were 91.0% and 0.981, respectively, comparable to those in a previous report [10]. The overall AUC of T-staging by our lesion-based VGG-16 system was 0.851, which is higher than that previously reported for EUS prediction [6,22]. Unlike other studies, the present study also analyzed the factors affecting AI diagnosis [10,25]. The diagnostic accuracy of AI for T-staging was significantly affected by histopathologic differentiation. Undifferentiated-type histology was more frequently associated with an incorrect invasion depth diagnosis by AI. By reanalyzing only undifferentiated-type histology, T1b-EGC was significantly associated with an incorrect EGC invasion depth diagnosis by the AI. Interestingly, this finding was similar to that for the analysis in EUS. Previous studies have reported that the accuracy of EUS for depth prediction is poor in undifferentiated-type EGC or T1b-EGC [6,26]. Undifferentiated-type histology and T1b-EGC are two important factors for the decision to perform an extended ER. Therefore, these results can provide important directions for the development of an AI for EGC.
As it is critical that AI is properly trained, we performed extensive experimentation and discussion. There are some challenges in applying the loss function designed to train the classification model for a natural-image dataset to AI for EGC without modification. To overcome these difficulties, we proposed a novel loss function that computed a weighted sum of typical classification and Grad-CAM losses. By applying the proposed loss function to EGC detection and EGC depth prediction models, the optimizer simultaneously minimized classification and localization losses in the activated Grad-CAM regions. Although there was no significant performance improvement in predicting the depth of EGCs between VGG-16 (AUC = 0.844) and lesion-based VGG-16 (AUC = 0.851), the trained lesion-based VGG-16 predicted the depth of EGCs by automatically activating EGC regions, whereas VGG-16 did not. The classification performance of VGG-16 trained by cross-entropy loss alone is still debatable regarding dataset bias, where the model considered non-EGC regions to optimize the objective. To the best of our knowledge, this is the first study to use a novel loss function that allows the optimizer to determine an optimum by simultaneously considering EGC depth prediction and localization losses of the activated regions. This model uses the proposed method to simultaneously provide prediction and localization.
To determine whether the proposed loss function made the CNN focus on the EGC region regardless of the network, we trained an 18-layer residual network (ResNet-18) as a CNN-based EGC depth prediction model [27]. We fine-tuned all weights of a ResNet-18 pre-trained on the ImageNet dataset. The activation results of ResNet-18 are shown in Supplementary Figure S5. As with VGG-16, ResNet-18 was also trained using two types of loss functions [27]. As shown in the last two columns of Supplementary Figure S5, the lesion-based ResNet-18 activated the EGCs more accurately than ResNet-18.
The present study has several limitations. First, we did not analyze the accuracy of EGC detection according to the background mucosa. The background mucosa of the stomach is often accompanied by chronic inflammatory changes such as chronic atrophic gastritis and intestinal metaplasia, which are important features that complicate EGC diagnosis. However, to overcome differences in accuracy related to the features of the background mucosa, more than 9800 non-EGC endoscopic images were used for training. Second, the number of undifferentiated-type histology cases was smaller than that of differentiated-type histology cases. AI performance is related to the amount of data, and this imbalance may thus have played an important role in the accuracy of EGC depth prediction. It is also possible that the growth patterns or biological characteristics of undifferentiated-type histology affected the results. Similar findings were reported in previous EUS studies. Third, we did not compare the diagnostic accuracy of lesion-based VGG-16 to that of endoscopists for all images of the study, although endoscopists predicted the invasion depth for subsets of images in this study, with a sensitivity of 76% and overall accuracy of 73% (data not shown). Therefore, the proposed method may be a good tool for predicting the depth of EGC invasion. Finally, this is a retrospective study. Standardization of images is a very important part of research involving image analysis. The images used were of good quality, and they appropriately characterized the lesions. However, they were not completely standardized by numerical analysis. To overcome the aforementioned limitations, we plan to perform research using endoscopic video in the future.

Conclusions
In conclusion, AI may be a good tool for not only EGC diagnosis but also for the prediction of invasion depth, especially in differentiated-type EGC. To maximize the clinical usefulness, it is important to choose an appropriate method for AI application. The lesion-based model is the most appropriate training method for AI in EGC. EGCs with undifferentiated-type histology and T1b-EGCs are more frequently associated with an incorrect prediction of EGC invasion depth by AI. The development of a well-trained AI for undifferentiated-type histology and T1b-EGC is warranted. Further study is also necessary to understand the operating principles of AI and to validate these findings.
Supplementary Materials: The following are available online at http://www.mdpi.com/2077-0383/8/9/1310/s1. Figure S1: Examples of the Grad-CAM output extracted from each convolutional layer of the trained lesion-based VGG-16; Figure S2: Overview of the VGG-16-based model; Figure S3: Localization performances of activated regions extracted using the Grad-CAM method; Figure S4: Example of RGB color input images and their Grad-CAM results extracted from the last convolutional layer of VGG-16 and the lesion-based VGG-16; Figure S5: Comparisons of Grad-CAMs extracted from the last convolutional layer when a network was trained using two types of loss functions; Table S1: Groups for 5-fold cross validation; Table S2: Composition of the five folds cross validation dataset; Table S3: Diagnostic accuracy of the VGG-16 and lesion-based VGG-16.