Deep Learning-Based Computer-Aided Diagnosis System for Gastroscopy Image Classification Using Synthetic Data

Abstract: Gastric cancer has a high mortality rate worldwide, but it can be prevented with early detection through regular gastroscopy. Herein, we propose a deep learning-based computer-aided diagnosis (CADx) system that applies data augmentation to help doctors classify gastroscopy images as normal or abnormal. Improving the performance of deep learning requires a large amount of training data. However, the collection of medical data, owing to their nature, is highly expensive and time consuming. Therefore, data were generated through a deep convolutional generative adversarial network (DCGAN), and 25 augmentation policies optimized for the CIFAR-10 dataset were implemented through AutoAugment to augment the data. Accordingly, gastroscopy images were augmented, only high-quality images were selected through an image quality-measurement method, and gastroscopy images were classified as normal or abnormal through the Xception network. We compared the performances of the original training dataset, which was not augmented, the dataset generated through the DCGAN, the dataset augmented through the augmentation policies of CIFAR-10, and the dataset combining the two methods. The dataset combining the two methods delivered the best performance in terms of accuracy (0.851) and achieved an improvement of 0.06 over the original training dataset. We confirmed that augmenting data through the DCGAN and CIFAR-10 augmentation policies is most suitable for the classification model for normal and abnormal gastric endoscopy images. The proposed method not only alleviates the shortage of medical data but also improves the accuracy of gastric disease diagnosis.


Introduction
According to the statistics released by the Global Cancer Observatory in 2018, gastric cancer is the fifth most frequently diagnosed cancer and the third leading cause of cancer deaths worldwide, as shown in Figure 1 [1].
To increase the survival rate of patients with gastric cancer, it is important to detect and treat it early through gastroscopy. A previous study showed that the survival rate of patients with gastric cancer who underwent gastric endoscopy was 2.24 times higher than that of those who did not [2]. Precancerous lesions that cause gastric cancer include gastritis, gastric ulcer, and gastric bleeding. Most of these gastric diseases are difficult to detect because they are asymptomatic until they develop into gastric cancer. Therefore, gastric cancer can be prevented through the early detection of lesions that develop into gastric cancer with regular gastroscopy. As the importance of gastroscopy increases, the number of gastroscopy examinees is expected to increase. In addition, as imaging technology continually develops and the number of medical images rapidly increases, the fatigue experienced by specialists who perform the diagnosis by relying on the naked eye increases, and differences in diagnosis occur depending on the skill of the specialist. Therefore, the need for a computer-aided diagnosis (CADx) system to assist specialists in performing the diagnosis is increasing. A CADx system assists a doctor in performing the diagnosis by detecting and analyzing lesions, reduces the manual work of endoscopy specialists, and improves the accuracy of gastric-disease diagnosis. Currently, CADx systems using deep learning are being actively studied. According to research results, the latest deep-learning technologies can consistently deliver superior performance even if the data contain some errors, as long as there are enough data [3]. However, if the system does not have enough training data to train its parameters, then overfitting problems that hinder its performance occur more easily. Therefore, deep learning-based research requires a large amount of quality data.
However, the collection of medical data, owing to their nature, requires approval from institutional review boards for the protection of personal information of patients; further, it is highly expensive and time consuming because of the involved verification process that includes medical examinations and biopsy tests. To solve this problem, many studies have proposed methods of generating data similar to actual data through data augmentation. Frid-Adar et al. [4] introduced data-augmentation methods through rotation, flipping, translation, and scaling and a generative adversarial network (GAN) for computed tomography (CT) images of the liver. Zafar et al. [5] proposed a method of augmenting data and classifying melanoma by applying random image-brightness and color-contrast values to skin-lesion images. Shin et al. [6] proposed a method of generating synthetic abnormal magnetic resonance imaging (MRI) images with brain tumors using a GAN from brain MRI images. Dai et al. [7] produced segmented images of the lungs and heart through a trained GAN from chest X-ray images. Zhao et al. [8] improved the performance of a method classifying malignant and benign pulmonary nodules by generating various lung CT images through a forward GAN and improving the image quality through a backward GAN. In addition, Gomes Ataide et al. [9] flipped, rotated, and blurred thyroid nodule ultrasound images to augment data, extract features, and classify them as either benign or malignant through a random forest classifier. Lyu et al. [10] classified different types of lung nodule malignancies through a multilevel cross-residual convolutional neural network (CNN). These studies [4][5][6][7][8][9][10] were conducted by applying a basic augmentation method or a GAN for lesions such as lung, skin, and brain lesions but not gastric lesions.
Asperti et al. [11] increased the amount of data by randomly applying rotation, width shift, height shift, shear, and zoom methods within a certain range to classify gastroscopy images as normal or abnormal. Togo et al. [12] improved the performance of the model by generating X-ray gastritis images using a loss function-based conditional progressive growing GAN. Nguyen et al. [13] proposed a method of classifying in vivo endoscopy images as normal or abnormal through an ensemble of VGG-, DenseNet-, and Inception-based networks. These studies [11][12][13] used gastroscopy and X-ray images of the stomach, but data were augmented by applying either basic augmentation methods or a GAN. In contrast, in the present study, we mixed data generated through a deep convolutional generative adversarial network (DCGAN) with data augmented using the CIFAR-10 augmentation policies proposed by AutoAugment. In addition, we attempted to improve the performance of the model by applying an image quality-measurement method to the generated images.
Related studies have proposed various augmentation methods for medical images, such as lung CT images, brain MRIs, and endoscopy images, together with classification networks based on machine learning and deep learning. In this study, two methods were used to supplement the insufficient amount of data required for learning. In our method, gastroscopy images are generated through a DCGAN, which uses CNNs with excellent image-processing performance, and data are augmented by applying the 25 policies optimized for the CIFAR-10 dataset suggested by AutoAugment. An image quality-measurement method based on the Xception network (a deep learning-based image-classification network) is then applied to the augmented data, and only images that are similar to real images are selected and added to the training dataset. Through this image quality-measurement process, the augmented images were verified, and only data that could improve the quality of learning were selected. The proposed method is expected to improve the accuracy of gastric-disease diagnosis and help doctors perform the diagnosis. The following sections describe the classification network, data-augmentation method, and image quality-measurement method employed in this study.

Dataset
The gastrointestinal endoscopy images used in this study were white-light endoscopy images obtained from the Department of Gastroenterology at Gyeongsang National University Hospital, and their use was approved by the Institutional Review Board. All endoscopy images were acquired using an OLYMPUS GIF-HQ290. They were verified through an examination and a biopsy test conducted by a gastroenterologist. The data used in the experiment were obtained from 150 patients and randomly divided into a training dataset and a test dataset. The training and test datasets were also confirmed by a gastrointestinal endoscopy specialist (i.e., a gastroenterologist). As shown in Table 1, the training dataset of the actual gastroscopy images consisted of 655 normal and 655 abnormal images. The test dataset consisted of 164 normal and 164 abnormal images. Lesions with abnormalities include gastritis, gastric submucosal tumors (SMT), early gastric cancer, polyps, gastric ulcer, and bleeding. Figure 2 presents the normal and abnormal gastroscopy images from the dataset.

Classification Method
Xception is a model based on Inception. Whereas the Inception module in GoogLeNet reduces the connections between nodes, Xception goes further and fully separates the search for relationships between channels from the search for local spatial information [14]. It can therefore be regarded as an extreme version of the Inception module. As shown in Figure 3, after a 1 × 1 convolutional layer is applied to the input, all the channels are separated, and a 3 × 3 convolution is applied to each channel individually. Xception uses a depth-wise separable convolution, which is obtained by modifying this operation: the convolution is first performed for each channel, and a 1 × 1 convolution is then applied. As shown in Figure 4, the standard convolution creates one feature map by considering all the channel and area information at once. Conversely, the depth-wise separable convolution first performs a depth-wise convolution that creates one feature map for each channel and then adjusts the number of output feature maps with a 1 × 1 convolution, called a point-wise convolution.
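The practical effect of splitting a convolution into depth-wise and point-wise steps can be illustrated by counting weights. The following is a minimal sketch, not taken from the paper: the function names and the layer sizes (a 3 × 3 kernel, 128 input channels, 256 output channels) are illustrative assumptions, and bias terms are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored).

    Every output channel looks at every input channel over a k x k window.
    """
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depth-wise separable convolution (bias ignored).

    Depth-wise step: one k x k filter per input channel.
    Point-wise step: a 1 x 1 convolution mapping c_in channels to c_out.
    """
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    # Illustrative layer sizes only; these are not values from the paper.
    k, c_in, c_out = 3, 128, 256
    std = standard_conv_params(k, c_in, c_out)
    sep = separable_conv_params(k, c_in, c_out)
    print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.1f}x")
```

For these assumed sizes, the separable layer needs roughly one-ninth of the weights of the standard layer, which is the main reason the factorization is attractive when training data are limited.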

Generating Synthetic Gastroscopy Images
A large amount of high-quality training data is required to improve the performance of deep learning. However, medical data are difficult to collect because the process is expensive and takes considerable time to specify the ground truth of the lesion. This section describes the 25 policies of the CIFAR-10 dataset and DCGAN used to improve the performance of the model and an image quality-measurement method used to select high-quality data from the augmented data.

DCGAN
A GAN is a deep neural network architecture composed of two neural networks, namely generator and discriminator networks [15]. The generator aims to produce new data similar to the real data from a randomly generated vector of numbers drawn from a latent space. The discriminator distinguishes between real data and the synthetic data produced by the generator. As shown in Figure 5, the two neural networks are trained against each other by repeating the generation and discrimination processes. During the training, each of the two neural networks attempts to optimize its own objective function. Equation (1) expresses the final objective function of the GAN.
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right] \qquad (1)$$
where D is the discriminator, G is the generator, x is a real sample, and z is the latent vector.
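To make the adversarial objective concrete, the sketch below estimates the standard GAN value function from [15], V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], from a handful of discriminator outputs. The function name `gan_value` and the sample values are assumptions introduced for illustration; no networks are trained here.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


if __name__ == "__main__":
    # Hypothetical discriminator outputs, for illustration only.
    confused = gan_value([0.5, 0.5], [0.5, 0.5])    # D cannot tell real from fake
    confident = gan_value([0.9, 0.95], [0.1, 0.05]) # D separates them well
    print(confused, confident)
```

The discriminator tries to push this value up, whereas the generator tries to push it down; when the discriminator outputs 0.5 everywhere, the value equals -log 4, which is the equilibrium value in the original GAN analysis.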

The DCGAN used in this study is an improved GAN [16]. Because the original GAN uses fully connected layers, the generation of high-resolution images by the generator is limited, and learning is not stable. The DCGAN addresses this limitation by using convolutional layers in both subnetworks. In the DCGAN, the discriminator classifies images as real or fake using a dense classification layer. The generator takes a random noise vector from a uniform distribution and transforms it until it produces a final image. Figure 6 shows the structure of a generator that generates a 128 × 128 image. The generator takes one tensor with the shape of (batch size, 100) and outputs one tensor with the shape of (batch size, 128 × 128 × 3).
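The shape bookkeeping of such a generator can be traced without a deep-learning framework. The sketch below assumes, as is common in DCGAN implementations but not confirmed for Figure 6, that the 100-dimensional latent vector is first projected to a 4 × 4 feature map and that each transposed convolution doubles the height and width; the helper name `upsample_path` is hypothetical.

```python
def upsample_path(start=4, target=128, stride=2):
    """Spatial sizes visited when every transposed convolution multiplies
    the height and width by `stride` ('same' padding assumed)."""
    sizes = [start]
    while sizes[-1] < target:
        sizes.append(sizes[-1] * stride)
    if sizes[-1] != target:
        raise ValueError("target is not reachable from start with this stride")
    return sizes


if __name__ == "__main__":
    path = upsample_path()
    print(path)                      # spatial resolution after each stage
    print(len(path) - 1)             # number of stride-2 upsampling layers
    print(128 * 128 * 3)             # values per generated RGB image
```

Under these assumptions, five stride-2 upsampling stages take a 4 × 4 map to 128 × 128, and each generated RGB image contains 49,152 values.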

AutoAugment
Augmentation is one of the techniques used to secure enough data to train deep-learning models. It refers to a methodology for obtaining new training data by applying artificial changes to a small amount of training data. The goal is to create data that are similar to the real data, for example by flipping or cropping the image. However, it is expensive and takes considerable effort to find an augmentation technique suitable for the data. Therefore, to solve this problem, we applied AutoAugment, developed by Google.
The AutoAugment method employed in this study is an algorithm that automatically finds the most appropriate augmentation policy for an image dataset through reinforcement learning. The method, presented by the Google Brain team at the Conference on Computer Vision and Pattern Recognition 2019, provides optimal augmentation policies for validated datasets such as ImageNet, Street View House Number (SVHN), and CIFAR-10 [17]. The CIFAR-10 dataset contains 50,000 training images, including cats, birds, and airplanes; Google used 4000 images randomly selected out of the 50,000 and called them the "reduced CIFAR-10 dataset". The ImageNet dataset consists of approximately 1.4 million images in 21,841 classes including people, animals, and musical instruments, and SVHN is a dataset composed by cropping house number images from Google Street View. As shown in Figure 7, a recurrent neural network (RNN) acts as the controller that determines the augmentation policy, and the child network created by the controller is trained on the dataset with the sampled augmentation policies applied. In this process, the accuracy R of the child network is obtained and fed back to the controller as the update signal to find the best policy.
In the search space, 16 image operations, 11 probability values, and 10 magnitude values (n(M) = 10) were used. Because each sub-policy consists of two operations and each policy consists of five sub-policies, a total of ((16 × 11 × 10)^2)^5 ≈ 2.9 × 10^32 candidate policies for image augmentation were defined in AutoAugment. In the learning process, these policies were randomly selected and applied to the training data, and the classification was repeated to find a policy with improved performance and consequently determine the optimal policy. Accordingly, the optimal augmentation method was suggested according to the data characteristics.
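The size of the candidate space quoted above can be checked with a few lines of arithmetic. This sketch assumes the structure described in the AutoAugment paper [17], namely 16 operations, 11 probability values, and 10 magnitude values per operation, two operations per sub-policy, and five sub-policies per policy; the function name is hypothetical.

```python
def autoaugment_search_space(n_ops=16, n_prob=11, n_mag=10,
                             ops_per_subpolicy=2, subpolicies_per_policy=5):
    """Number of candidate policies in the AutoAugment search space."""
    per_operation = n_ops * n_prob * n_mag               # choices for one operation
    per_subpolicy = per_operation ** ops_per_subpolicy   # two operations in sequence
    return per_subpolicy ** subpolicies_per_policy       # five sub-policies per policy


if __name__ == "__main__":
    total = autoaugment_search_space()
    print(f"{total:.1e}")  # prints 2.9e+32
```

A space of this size cannot be enumerated, which is why the search is driven by a learned controller rather than by exhaustive evaluation.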

Image Quality Measurement Method
The quality of data generated through a GAN is not measured automatically and must be inspected with the naked eye, which makes objective judgment difficult. Therefore, to use only high-quality data for learning, a quantitative criterion for evaluating the generated data is required. Many studies have proposed criteria to measure GAN performance. The inception score described by Barratt [18] is the most widely used scoring algorithm for GANs. It measures the quality and diversity of the generated images by passing them through the pretrained Inception V3 neural network; the higher the inception score, the better the quality of the model. The inception score can be calculated using Equation (2).
$$\mathrm{IS}(G) = \exp\left(\mathbb{E}_{x \sim p_g}\left[D_{\mathrm{KL}}\left(p(y \mid x) \,\|\, p(y)\right)\right]\right) \qquad (2)$$
where p(y|x) is the conditional class distribution, x is the generated image, y is the label, and p_g is the distribution of generated images. p(y) is the marginal class distribution and can be calculated using Equation (3).
$$p(y) = \int_x p(y \mid x)\, p_g(x)\, dx \qquad (3)$$
If the generated images are diverse, then p(y) approaches a uniform distribution. However, measuring the quality based on the inception score also encounters problems. If the model generates only one image per class, p(y) may be close to a uniform distribution even though the diversity is low, which results in a misleadingly high score.
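The weakness described above is easy to reproduce numerically. The sketch below computes the inception score from a list of class distributions p(y|x), using the standard definition IS = exp(mean over images of KL(p(y|x) || p(y))) with p(y) estimated as the average of p(y|x). The toy two-class distributions are invented for illustration and do not come from Inception V3.

```python
import math


def inception_score(probs):
    """Inception score from per-image class distributions.

    probs: list of lists; probs[i][c] is p(y = c | x_i) and each row sums to 1.
    """
    n = len(probs)
    n_classes = len(probs[0])
    # Marginal p(y): average of the conditional distributions over all images.
    p_y = [sum(p[c] for p in probs) / n for c in range(n_classes)]
    # Mean KL(p(y|x) || p(y)); classes with zero probability contribute nothing.
    kl_sum = 0.0
    for p in probs:
        kl_sum += sum(pc * math.log(pc / p_y[c]) for c, pc in enumerate(p) if pc > 0)
    return math.exp(kl_sum / n)


if __name__ == "__main__":
    # One confidently classified image per class: maximal score for 2 classes,
    # even though the "generator" produced only two distinct images.
    print(inception_score([[1.0, 0.0], [0.0, 1.0]]))
    # Every image is ambiguous: minimal score.
    print(inception_score([[0.5, 0.5], [0.5, 0.5]]))
```

The first case reaches the maximum score for two classes with only one image per class, which shows why a high inception score alone does not guarantee diversity within a class.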
Therefore, in this study, quality was evaluated by applying the method proposed by Shmelkov [19] to objectively evaluate the generated images. In this method, a deep-learning model is trained with a training dataset consisting only of real data and is then tested on the newly created data; among the correctly classified images, only those with a prediction score of 0.8 or higher are selected. Through this process, the quality of not only the images generated through the GAN but also the data augmented by the 25 policies of the CIFAR-10 dataset was evaluated. As shown in Figure 8, the Xception model was trained with actual gastroscopy images, and the data expanded through the 25 policies of CIFAR-10 and the DCGAN were used as a test dataset and classified as normal or abnormal. A correctly classified image is considered similar to real images, and only high-quality images were retained by keeping images with a classification prediction score of at least 0.8. Through this method, image quality was judged based on a quantitative standard rather than subjective evaluation.
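The selection rule can be sketched as a simple filter. The code below is an illustrative assumption about how the rule might be implemented, not the authors' code: it assumes a binary classifier that outputs the probability of the abnormal class, labels encoded as 0 (normal) and 1 (abnormal), and a hypothetical function name `select_high_quality`.

```python
def select_high_quality(predictions, threshold=0.8):
    """Keep images that are classified correctly with high confidence.

    predictions: iterable of (image_id, true_label, prob_abnormal), where
    true_label is 0 (normal) or 1 (abnormal) and prob_abnormal is the
    classifier's predicted probability of the abnormal class.
    """
    selected = []
    for image_id, true_label, prob_abnormal in predictions:
        predicted = 1 if prob_abnormal >= 0.5 else 0
        # Confidence is the probability assigned to the predicted class.
        confidence = prob_abnormal if predicted == 1 else 1.0 - prob_abnormal
        if predicted == true_label and confidence >= threshold:
            selected.append(image_id)
    return selected


if __name__ == "__main__":
    # Hypothetical classifier outputs for five synthetic images.
    preds = [
        ("a", 1, 0.93),  # correct and confident: kept
        ("b", 1, 0.65),  # correct but below the threshold: dropped
        ("c", 0, 0.10),  # correct (normal) with confidence 0.90: kept
        ("d", 0, 0.85),  # misclassified as abnormal: dropped
        ("e", 0, 0.30),  # correct but confidence 0.70: dropped
    ]
    print(select_high_quality(preds))
```

Only images that pass both conditions would be added to the training set, so a synthetic image that the real-data classifier finds ambiguous or mislabels is discarded.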
Appl. Sci. 2021, 11, x FOR PEER REVIEW
Figure 8. Process of the image quality measurement method.

Results
We propose a classification method for normal and abnormal gastroscopy images through CADx. Figure 9a presents the architecture of a basic deep-learning model using a dataset consisting of only real gastroscopy images. Figure 9b presents the architecture of a deep-learning model using training data that combines DCGAN and AutoAugment. In a previous study [20], we applied the 25 augmentation policies optimized for each of the CIFAR-10, ImageNet, and SVHN datasets to gastroscopy images. It was confirmed that the policies of the CIFAR-10 dataset are the most suitable for gastroscopy image classification. Therefore, in this study, the existing data were augmented 25 times by applying the optimized augmentation policies of the CIFAR-10 dataset. The operations Equalize, AutoContrast, Color, and Brightness were selected most often in these policies, and most of the policies involve color-based conversion. Moreover, we found that Xception delivered the best results in gastroscopy image classification among four different deep-learning models, namely Xception, Inception-V3, Resnet-101, and Inception-Resnet-V2. Based on these results, we selected the Xception network for this study.
The DCGAN and the 25 augmentation policies of the CIFAR-10 dataset were implemented to augment the training data. After selecting data using the Xception-based image quality-measurement method, the augmented data were trained and tested. Among the collected training data, 655 normal and 655 abnormal images were used to generate 200 normal and 200 abnormal images through the DCGAN, which were increased by 25 times through the CIFAR-10 dataset, and a total of 11,744 normal and 11,744 abnormal images were selected. The selection criterion of the data was an accuracy In a previous study [20], we applied 25 augmentation policies to optimize gastroscopy images in the CIFAR-10, ImageNet, and SVHN datasets. It was confirmed that the policies of the CIFAR-10 dataset are the most suitable for gastroscopy image classification. Therefore, in this study, the existing data were augmented 25 times by applying the optimized augmentation policies of the CIFAR-10 dataset. The augmentation policies-Equalize, AutoContrast, Color, and Brightness-were generally selected, and most of them chose color-based conversion. Moreover, we found that the results of Xception in the gastroscopy medical image classification were the best among four different deep-learning models, namely Xception, Inception-V3, Resnet-101, and Inception-Resnet-V2. Based on the results, we selected the Xception network for this study.
The DCGAN and the 25 augmentation policies of the CIFAR-10 dataset were used to augment the training data. After the data were selected with the Xception-based image quality-measurement method, the model was trained and tested on the augmented data. From the collected training data, 655 normal and 655 abnormal images were used to generate 200 normal and 200 abnormal images through the DCGAN, the images were then increased 25-fold through the CIFAR-10 policies, and a total of 11,744 normal and 11,744 abnormal images were selected. The selection criterion was an accuracy of at least 0.8. To verify whether the image quality-measurement method is effective, we compared the performances of the model with and without it. We used the receiver operating characteristic (ROC) curve as the evaluation index and compared the performances using the Az value, the area under the ROC curve. Figure 10 presents the classification performance of the models based on the ROC curve. As shown in Figure 10b, the Az values after adding 400 images through the DCGAN and after adding 23,488 images through the CIFAR-10 policies were 0.882 and 0.884, respectively. After applying both the DCGAN and the CIFAR-10 policies, the Az value was 0.9, which was the highest. The Az value was significantly (p ≤ 0.01) higher for DCGAN + CIFAR-10 than for the DCGAN alone or CIFAR-10 alone. However, the difference between DCGAN + CIFAR-10 and the original did not reach statistical significance (p ≥ 0.01). The results confirmed that data augmentation improved the performance in every case compared to the original data. Moreover, the model with the image quality-measurement method outperformed the model without it.
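The area under the ROC curve used above can be illustrated with a short sketch. The text does not state how the Az value was estimated, so the code below assumes the empirical (nonparametric) estimate: the fraction of abnormal/normal image pairs in which the abnormal image receives the higher score, with ties counted as one half. The function name and the example scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Empirical area under the ROC curve.

    labels: 1 for abnormal (positive), 0 for normal (negative).
    scores: classifier scores, higher meaning "more likely abnormal".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # abnormal image ranked above normal image
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical scores for two abnormal and two normal images.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(example_auc)  # 0.75: three of the four abnormal/normal pairs are ordered correctly
```

The pairwise form is quadratic in the number of images, which is acceptable for a sketch; a rank-based computation gives the same value more efficiently on large test sets.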
In addition, based on the confusion matrix of the predictions for normal and abnormal images, the performance improved in terms of accuracy, precision, recall, and F1 score, as computed using Equations (4)-(7).
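In terms of the confusion-matrix counts TP, FN, FP, and TN defined in the following paragraph, the four metrics take their standard forms. The block below is a reconstruction from those definitions, and it assumes that the equation numbers follow the order in which the metrics are listed in the text.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + FN + FP + TN} \tag{4}\\
\text{Precision} &= \frac{TP}{TP + FP} \tag{5}\\
\text{Recall}    &= \frac{TP}{TP + FN} \tag{6}\\
\text{F1 score}  &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7}
\end{align}
```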
A true positive (TP) is an abnormal image correctly classified as abnormal, a false negative (FN) is an abnormal image incorrectly classified as normal, a false positive (FP) is a normal image incorrectly classified as abnormal, and a true negative (TN) is a normal image correctly classified as normal. Precision is the ratio of correctly predicted abnormal images to all images predicted as abnormal, and recall is the ratio of correctly predicted abnormal images to all actual abnormal images. Accuracy is the proportion of correctly classified images among all cases, and the F1 score is the harmonic mean of precision and recall. After the data were increased using both the DCGAN and CIFAR-10, the model achieved an accuracy of 0.851 and an F1 score of 0.841, an improvement of approximately 0.06 over the model without data augmentation, as shown in Table 2. The values in brackets in the table are the results of the model without the image quality-measurement method. Thus, the results confirmed that the model with the image quality-measurement method delivered better overall performance than the model without it.
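The metric definitions above can be worked through in a few lines. The sketch below is illustrative only: the function names and the confusion-matrix counts are hypothetical, since the per-class counts behind Table 2 are not given in the text. The last step checks that the precision (0.896) and recall (0.793) reported for the combined model are consistent with its reported F1 score.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (assumes their sum is non-zero)."""
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts.

    "Abnormal" is the positive class; denominators are assumed non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }

# Hypothetical counts: 10 abnormal and 10 normal test images.
metrics = classification_metrics(tp=8, fn=2, fp=1, tn=9)
print(metrics)

# Consistency check using the precision and recall reported for the combined model.
reported_f1 = round(f1_score(0.896, 0.793), 3)
print(reported_f1)  # 0.841
```

Because the F1 score is a harmonic mean, it sits closer to the lower of the two values, which is why a precision of 0.896 paired with a recall of 0.793 yields 0.841 rather than their arithmetic mean.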

Discussion
In this study, the data required for learning were secured by applying an image-generation method based on the DCGAN and an automated augmentation method (AutoAugment) that uses a CNN and an RNN. After selection with the Xception-based image quality-measurement method, the training data were approximately 18 times larger than the original training data. Comparing the models trained on the augmented data, AutoAugment with the CIFAR-10 policies is more suitable for classifying actual gastroscopy images because it delivers better performance than the DCGAN. Because the DCGAN requires a larger amount of data to generate diverse images, it could not produce data of better quality than those obtained with the CIFAR-10 policies. Moreover, many of the AutoAugment policies of the CIFAR-10 dataset are color-based, which seems to yield superior classification results on gastroscopy images. Our proposed model, which augments data using both the DCGAN and the CIFAR-10 policies, delivered the best performance: the accuracy, Az value, and F1 score were 0.851, 0.9, and 0.841, respectively. Precision increased to 0.896, whereas recall decreased to 0.793. Higher values of both indicators indicate a better model. However, the two tend to trade off, so an increase in precision is often accompanied by a decrease in recall. The findings of this study confirm that the classification model trained on data generated through the DCGAN and the CIFAR-10 policies, with the addition of the image quality-measurement method, performs the best. Moreover, securing sufficient learning data by augmenting data through the DCGAN and CIFAR-10, as proposed herein, and selecting data through an image quality-measurement method are effective in improving the performance.
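The selection step discussed above can be sketched as a simple threshold filter. This is a minimal sketch under stated assumptions, not the procedure used in this study: it assumes that the quality score of an augmented image is the probability a trained classifier assigns to the image's known class and that images scoring at least 0.8 are kept. The function name, the tuple layout, and the example values are hypothetical.

```python
def select_high_quality(candidates, threshold=0.8):
    """Keep augmented images whose score for their known class meets the threshold.

    candidates: iterable of (image_id, true_label, prob_abnormal), where
    prob_abnormal is the classifier's predicted probability of "abnormal".
    """
    kept = []
    for image_id, true_label, prob_abnormal in candidates:
        # Score the image by the probability assigned to its own class.
        score = prob_abnormal if true_label == "abnormal" else 1.0 - prob_abnormal
        if score >= threshold:
            kept.append(image_id)
    return kept

# Hypothetical classifier outputs for four augmented images.
candidates = [
    ("img_a", "abnormal", 0.95),  # confidently abnormal: kept
    ("img_b", "abnormal", 0.60),  # lesion no longer clear: dropped
    ("img_c", "normal", 0.10),    # confidently normal: kept
    ("img_d", "normal", 0.50),    # ambiguous: dropped
]
selected = select_high_quality(candidates)
print(selected)  # ['img_a', 'img_c']
```

Raising the threshold trades dataset size for label reliability, so the cutoff controls how much of the augmented pool survives selection.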
This paper proposed a method for mitigating the performance deterioration of deep learning caused by insufficient data. The proposed method augments the data required for learning and selects the augmented data through an image quality-evaluation process, thereby addressing the lack of data. Data augmentation is being used to address data shortages not only in the medical field but also in various other areas, such as defect inspection in smart factories, pest classification, and distracted driving detection [21][22][23]. The method proposed herein is expected to be applicable not only to medical images but also to other areas where data are insufficient. Future studies will be required to evaluate the adaptability of our method to other modalities.
A limitation of generating data through the DCGAN is that the diversity and quality of the generated images depend on the amount of training data: the larger the amount of data, the more diverse and high-quality the generated data. Therefore, in the future, we plan to conduct research on generating data through a GAN after first applying different data-augmentation methods. In addition, we plan to compare model performance by generating images with types of GAN other than the DCGAN and by applying the method to other CNN models.

Conclusions
Deep neural networks are effective when trained with a large supervised dataset. However, acquiring such a dataset for a CADx system is difficult. Herein, we proposed a computer-aided diagnosis system that generates data through a DCGAN and increases the amount of data by applying the augmentation policies of the CIFAR-10 dataset. An image quality-measurement method was used to select accurate data from the augmented data. We evaluated the proposed model in terms of accuracy, precision, recall, Az value, and F1 score. The model that used both methods, DCGAN and CIFAR-10, delivered an Az value 5% higher than the model that used neither, and its accuracy and F1 score were about 6% better than those of the existing method. Therefore, the DCGAN and the CIFAR-10 dataset policies, combined with the image quality-measurement method proposed herein, are suitable for addressing the difficulty of acquiring large datasets for training deep neural networks.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author, subject to additional IRB approval.