COVID-CGAN: Efficient Deep Learning Approach for COVID-19 Detection Based on CXR Images Using Conditional GANs

COVID-19, a novel coronavirus infectious disease, has spread around the world, resulting in a large number of deaths. Due to a lack of physicians, emergency facilities, and equipment, medical systems have been unable to treat all patients in many countries. Deep learning is a promising approach for providing solutions to COVID-19 based on patients' medical images. As COVID-19 is a new disease, its related dataset is still being collected and published. Small COVID-19 datasets may not be sufficient to build powerful deep learning detection models. Such models are often over-fitted, and their prediction results cannot be generalized. To fill this gap, we propose a deep learning approach for accurately detecting COVID-19 cases based on chest X-ray (CXR) images. For the proposed approach, named COVID-CGAN, we first generated a larger dataset using generative adversarial networks (GANs). Specifically, a customized conditional GAN (CGAN) was designed to generate the target COVID-19 CXR images. The expanded dataset, which contains 84.8% generated images and 15.2% original images, was then used for training five deep detection models: InceptionResNetV2, Xception, SqueezeNet, VGG16, and AlexNet. The results show that the use of the synthetic CXR images, which were generated by the customized CGAN, helped all deep learning models to achieve high detection accuracies. In particular, the highest accuracy was achieved by the InceptionResNetV2 model, which was 99.72% accurate with only ten epochs. All five models achieved kappa coefficients between 0.81 and 1, which is interpreted as an almost perfect agreement between the actual labels and the detected labels. Furthermore, the experiment showed that some models were faster yet smaller compared to the others but could still achieve high accuracy. For instance, SqueezeNet, which is a small network, required only three minutes and achieved comparable accuracy to larger networks such as InceptionResNetV2, which needed about 143 minutes. Our proposed approach can be applied to other fields with scarce datasets.


Introduction
The coronavirus (COVID-19) outbreak has resulted in a worldwide crisis; globally, the number of cases has risen rapidly. The World Health Organization (WHO) has reported 131,020,967 confirmed COVID-19 cases and 2,850,521 deaths worldwide as of 6 April 2021 [1]. To investigate COVID-19 symptoms, chest X-ray (CXR) imaging is considered a first-line diagnostic examination. However, medical providers may lack sufficient experience to make accurate diagnoses from medical images such as CXR and computed tomography (CT) because of the rapid and vast changes in the COVID-19 virus. Therefore, artificial intelligence (AI), especially deep learning techniques, can be applied as powerful tools to support practitioners' efforts to automate diagnostic procedures for
• A customized CGAN was designed to generate CXR images that can be used for COVID-19 detection studies. This includes the architectures of generator and discriminator networks as well as parameter configurations;
• An expanded COVID-19 CXR dataset that involves 3290 images that can be used to build COVID-19 detection models was generated;
• COVID-19 detection using the synthetic CXR images generated by the CGAN was demonstrated.
This paper is structured as follows: Section 2 presents the related work. Section 3 describes the materials and methods used in our experimental work. Section 4 presents the results, and Section 5 discusses the main findings and concludes the paper.

Related Work
This section provides an overview of the state-of-the-art studies in relation to COVID-19 detection based on CXR images. The section also provides a summary of previous studies that applied GANs to overcome the problem posed by the small COVID-19 dataset size. Table 1 presents a summary of COVID-19-related studies in terms of image type used, dataset size, GAN architecture applied to expand the dataset in cases where GANs are used, classification method, and classification model performance. From Table 1, we can draw the following observations:
• The majority of the studies used small datasets to feed their classification models. The maximum size of these datasets was 423 images. Some of these studies used internal datasets that have not been published for public use;
• Although most of the models achieved high accuracies, the prediction results based on them cannot be generalized due to the limited samples on which the models were trained;
• There are no studies that applied CGANs to CXR. To the best of our knowledge, one study applied a CGAN to CT but did not achieve more than 81.41% accuracy.
The studies shown in Table 1 applied several machine learning methods such as support vector machines (SVM) and random forest (RF) as well as deep learning methods such as AlexNet, GoogleNet, SqueezeNet, and ResNet18. Most of these methods achieved high performance in detecting COVID-19 cases. Moreover, only a few studies [6][7][8][17] applied GANs to expand the COVID-19 datasets. Specifically, these studies applied traditional GANs or other GAN architectures, including CGAN, auxiliary classifier GAN (ACGAN), and conditional variational auto-encoder GAN (CVAE-GAN).
In the following, we provide a summary of the related studies. Maghdid et al. [9] developed a simple convolutional neural network (CNN) and a modified AlexNet model to classify CXR and CT scan images. The results of the study showed that the two applied models provided accuracies of 94.1% and 98%, respectively. Wang and Wong [10] introduced a customized convolutional neural network, i.e., COVID-Net, for COVID-19 cases diagnosis and made the used dataset available for public use. The model achieved an accuracy of 92.4%.
Narin et al. [11] proposed three different convolutional network models to detect COVID-19 patients, namely, ResNet50, InceptionV3, and Inception-ResNetV2. The results showed that the ResNet50 model obtained the highest recognition performance, with 98% accuracy, compared to the other two proposed models.
Zhang et al. [12] used 100 CXR images of COVID-19 patients and another 1431 CXR images of other pneumonia cases. Based on this dataset, the authors developed a new deep learning model for detection, which achieved a sensitivity of 96%. Hall et al. [13] used pre-trained ResNet50 and VGG16 models plus their own small CNN, trained on 102 COVID-19 cases and 102 other pneumonia cases with 10-fold cross-validation. Their model achieved an accuracy of 94.4%.
Farooq and Hafeez [14] proposed a public dataset for COVID-19 cases and other pneumonia cases to be used by the public. The authors also proposed a novel model, i.e., COVID-ResNet, that was tested to reduce training time. The model achieved an accuracy of 96.23%.
Hammoudi et al. [16] proposed a number of deep learning models, i.e., ResNet34, ResNet50, DenseNet169, VGG-19, Inception ResNetV2, and RNN, to detect pneumonia infection cases, notably viral cases. They claimed that cases of viral pneumonia diagnosed during the COVID-19 period are extremely likely to be COVID-19 infections. The DenseNet169 architecture reached the best performance with an average classification accuracy of 95.72% on the CXR Images Pneumonia dataset.
Nour et al. [17] proposed a CNN-based hybrid model for detecting COVID-19. The model used a CNN as a feature extractor, and the extracted features were applied as the input to classical machine learning techniques, specifically k-nearest neighbor, support vector machine (SVM), and decision tree, to perform the classification task. The CNN-SVM model achieved 98.97% accuracy, 89.39% sensitivity, 99.75% specificity, and 96.72% F1-score.
Neural architecture search network (NASNet) is a deep convolutional network architecture developed by the Google Brain team in 2018. NASNet effectively shows improved performance on classification tasks. Martínez et al. [18] proposed a COVID-19 detection model based on NASNet. The model achieved 97% in both accuracy and sensitivity. Alazab et al. [19] proposed a model based on VGG-16 that detects COVID-19 from CXR images. The model achieved a 99% F1 score. They also proposed a model that predicted the cases of COVID-19 confirmations, recoveries, and deaths over the next seven days in Jordan and Australia using three architectures: the prophet algorithm (PA), the autoregressive integrated moving average (ARIMA) model, and the long short-term memory neural network (LSTM). The empirical results showed that PA achieved an average accuracy of 94.8% and 88.43% in predicting the cases in Australia and Jordan, respectively.
To overcome the scarcity of data related to COVID-19 symptoms, Chowdhury et al. [20] used transfer learning techniques and data augmentation to train pre-trained deep CNN models to detect COVID-19 from CXR images. They used CheXNet, a CNN model that consists of 121 layers trained on CXR images to detect different respiratory diseases, in addition to MobileNetv2, SqueezeNet, ResNet18, InceptionV3, ResNet101, VGG19, and DenseNet201. The empirical evaluation showed that all the models achieved high performance in detecting COVID-19, and the highest performance was achieved by both CheXNet and DenseNet201 when the data augmentation technique was applied. Sethy et al. [21] introduced two types of hybrid models that use SVM as a classifier in detecting COVID-19 based on CXR images. Each model of the first type consists of a deep pre-trained CNN model, specifically AlexNet, VGG16, VGG19, GoogleNet, ResNet18, ResNet50, ResNet101, InceptionV3, InceptionResNetV2, DenseNet201, XceptionNet, MobileNetV2, and ShuffleNet, as the feature extractor, and the extracted features were then fed to SVM to perform the classification task. The second type used traditional image classification methods, e.g., local binary patterns (LBP), a histogram of oriented gradients (HOG), and a gray level co-occurrence matrix (GLCM), with an SVM classifier. The experiment showed that the ResNet50-based hybrid model achieved the best performance with an accuracy of 95.33%.
To determine the effect of preprocessing on improving the detection of COVID-19 in CXR images, Togaçar et al. [22] focused on preprocessing the images using fuzzy and stacking techniques. To evaluate the preprocessing step's effectiveness, two deep learning models, MobileNetV2 and SqueezeNet, with SVM as a classifier, were trained with the original dataset and the preprocessed dataset. The experiment results proved that the preprocessing step improves the classification models.
Brunese et al. [23] introduced a three-stage architecture that detects and highlights the lung area infected by COVID-19 from CXR images. The model uses VGG-16 in the first and second stages to differentiate between healthy and COVID-19 cases. If the image is classified as COVID-19, the gradient-weighted class activation mapping (Grad-CAM) algorithm is used to generate an activation map that visually illustrates the infected areas. Minaee et al. [24] prepared a dataset of 5000 CXR images, called COVID-Xray-5k, that includes 184 CXR images of COVID-19. They compared the performance of four pretrained convolutional models trained on the COVID-Xray-5k dataset, specifically ResNet18, ResNet50, SqueezeNet, and DenseNet-121. Khan et al. [25] proposed a deep convolutional neural network model, called CoroNet, based on Xception, that is able to distinguish between three types of pneumonia: viral pneumonia, bacterial pneumonia, and COVID-19, from CXR images. Pereira et al. [26] comparatively studied different feature extraction approaches with different classification algorithms. To train the models, they prepared a dataset of CXR images called RYDLS-20 with an unbalanced distribution of the classes to reflect the real-world distribution. The dataset has CXR images that belong to normal, majority, and six other pneumonia classes, including COVID-19. Ozturk et al. [27] proposed DarkCovidNet, a deep learning model based on the DarkNet model that works as a binary classifier and multi-class classifier to classify CXR images into COVID or no findings, and COVID, pneumonia, or no findings. Civit-Masot et al. [28] proposed a multi-class classification model that uses VGG-16 to classify CXR images into COVID-19, pneumonia, or healthy.
To effectively train a deep network for detecting COVID-19 with a limited dataset, Oh et al. [29] introduced a patch-based convolutional neural network algorithm consisting of a segmentation network and a classification network. The model extracts the lung area from the CXR image, and different patches of the extracted area are classified with a classification network that consists of a network of ResNet. Mohammed et al., in their benchmarking paper [30], stated that the large number of machine learning models proposed for detecting COVID-19 from CXR images complicates the task for health organizations in selecting an appropriate model. Therefore, they proposed a benchmarking methodology using a multi-criteria decision-making (MCDM) model that evaluated 12 machine learning classification algorithms on ten time-consuming and accuracy parameters, e.g., precision and recall. To conduct the experiment, they constructed a dataset of 50 CXR images, including 25 COVID-19 CXR images, and the InceptionV3 model was used as a feature extractor. Rajaraman et al. in [31] evaluated the performance of a custom CNN and eight different pre-trained CNN-based models, e.g., VGG-16 and Inception-V3, in detecting COVID-19 from CXR images. To reduce model complexity, and improve robustness and generalization, they used modality-specific transfer learning, iterative pruning, and ensemble strategies.
Abdul Waheed et al. [6] introduced an ACGAN-based model for generating CXR images of COVID-19. They compared the performance of a VGG16 model trained on the original dataset to another VGG16 model trained on a dataset augmented by the ACGAN model. Based on CT scan images, Loey et al. [32] applied CGANs to differentiate COVID-19 and non-COVID-19 images.
In this study, we propose using CGANs to generate COVID-19 CXR images and then using them along with the original images to detect COVID-19 cases. Although Loey et al. [32] applied a CGAN, they used it with CT images, not CXR images.
Loey et al. [7] applied a GAN to CXR images to detect respiratory diseases from the images. The original dataset contained 306 CXR images in four classes, 97 of which were COVID-19 CXR images. The best-performing model was GoogleNet, which achieved 100% accuracy in the discrimination of normal and COVID-19 CXR images compared to 52.8% before using the GAN. Based on 16 classes, Albahli [8] empirically evaluated the performance of several deep-learning-based models that detect and classify different respiratory diseases, including COVID-19, from CXR images. The results showed that the best-performing model was ResNet, which achieved 85 to 87% validation accuracy.
Figure 1 depicts an overview of COVID-CGAN. As shown in the figure, a dataset of CXR images was collected. The dataset includes normal and COVID-19 images. To ensure that our smart detector is able to accurately recognize COVID-19 images, pneumonia images were added to the dataset. The collected dataset was augmented using generative models that synthesize images from the dataset. Specifically, a CGAN was designed to generate the images. The expanded dataset resulting from this phase was then used to detect COVID-19 patients based on their CXR images. In particular, the generated images were used to train multiple deep learning models based on certain configurations of the model's parameters. The trained model was later used to predict the presence of COVID-19 based on testing data.

Conditional Generative Adversarial Networks (CGANs)
GANs are structured to train generative models. They contain two networks: the generator (G) and discriminator (D). These two networks work against each other to produce convincing yet false images. GANs generate samples from the random noise that is provided as the input to the generator. CGANs are designed to include extra information about images (y) for both the generator (G) and the discriminator (D), which control the class of the generated output (Figure 2). The details of the image generation phase, i.e., dataset expansion and the deep learning approach to COVID-19 detection, are presented in the following two subsections.


The Original Dataset
There are three classes in the dataset: normal, COVID-19, and pneumonia. The dataset contains around 500 CXR images for each class. As COVID-19 images are not widely available, they were collected from various datasets [33][34][35]. The normal and pneumonia CXR images were collected from a dataset that was published in 2018 [36], and the number of images was chosen to be balanced with the COVID-19 images. A sample of each of the three classes is shown in Figure 3. All CXR images were preprocessed to have the same color system, type, and size. Specifically, the images were unified to be RGB of PNG type. Horizontal flipping was also applied: each image was reflected horizontally with 50% probability.
In addition, some of the images were cropped to remove any details that do not belong to the main CXR image, such as the header and footer. All annotations and arrows generated by X-ray devices were also removed from the images. Additionally, all the images in the dataset were resized to 128 × 128.
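The flipping and resizing steps described above can be illustrated with a minimal pure-Python sketch. This is not the authors' pipeline (the experiments were run in MATLAB): the function names are hypothetical, images are represented as plain lists of pixel rows, and nearest-neighbour interpolation is an assumption, since the text does not state which resizing method was used.

```python
import random


def flip_horizontal(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def random_flip(image, rng, probability=0.5):
    """Flip horizontally with the given probability; otherwise return a copy."""
    if rng.random() < probability:
        return flip_horizontal(image)
    return [list(row) for row in image]


def resize_nearest(image, out_height, out_width):
    """Resize with nearest-neighbour sampling (assumed interpolation method)."""
    in_height, in_width = len(image), len(image[0])
    return [
        [image[r * in_height // out_height][c * in_width // out_width]
         for c in range(out_width)]
        for r in range(out_height)
    ]


# Toy 2 x 3 "image"; real CXR pixels would be RGB triples instead of integers.
toy_image = [[1, 2, 3], [4, 5, 6]]
rng = random.Random(0)  # fixed seed so the augmentation is reproducible
augmented = resize_nearest(random_flip(toy_image, rng), 128, 128)
```

Applying the flip before the resize keeps the sketch simple; because both operations act on pixel positions only, the order does not change which source pixel ends up where.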

Based on the theories of the CGAN presented earlier, we constructed a customized CGAN that uses CXR images from the original dataset to generate new images. The loss scores were calculated from two discriminator outputs: D(x|y), the discriminator's estimate of the probability that a real data instance (x) is real for a given class (y), and D(G(z|y)), the discriminator's estimate of the probability that a fake instance is real for a given class (y).
The specific architectures of the CGAN generator and discriminator networks are shown in Figures 4 and 5, respectively. As shown in Figure 4, the generator (G) contains a total of 18 layers: an input layer (noise), a project and reshape layer (proj), an embedding layer (embed), a concatenation layer (concat), five transposed convolutional layers (tconv1-tconv5), four batch normalization layers (bnorm1-bnorm4) [37], four rectified linear unit (ReLU) layers (relu1-relu4), and a hyperbolic tangent layer (tanh). In particular, the network converts a noise vector of 100 samples into 4 × 4 × 256 arrays using project and reshape and then up-scales the resulting arrays to 128 × 128 × 3 using a series of transposed convolution layers with batch normalization and ReLU layers.
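For reference, the two discriminator outputs D(x|y) and D(G(z|y)) used for the loss scores are the terms of the standard conditional GAN minimax objective. The form below is that standard objective, given here as an assumption about the intended equation; the symbols p_data (distribution of real images) and p_z (noise distribution) are introduced only for this illustration.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \big[ \log D(x \mid y) \big]
  + \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z \mid y)) \big) \big]
```

Under this objective, the discriminator is rewarded when D(x|y) is close to 1 and D(G(z|y)) is close to 0, while the generator is rewarded when D(G(z|y)) is close to 1.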
Notably, the number of transposed convolutional layers, as well as the filter number and sizes, i.e., 5, were chosen carefully so that the generated images were 128 × 128. If the desired size is different, then the number of transposed convolutional layers will differ as well. For example, if the desired size is 64 × 64, then there will be four transposed convolutional layers instead of five.
The network converts a noise vector of 100 samples into 4 × 4 × 512 arrays using the project and reshape layer. Then, it converts the categorical labels to embedding vectors and reshapes them to a 4 × 4 array. Afterward, the network concatenates the resulting arrays from the two inputs along the channel dimension to a 4 × 4 × 512 array. Next, it up-scales the resulting arrays to 128 × 128 × 3 using a series of transposed convolutional layers with batch normalization and ReLU layers.
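The relationship between the output size and the number of transposed convolutional layers can be checked with a short calculation. The sketch below assumes that each transposed convolution doubles the spatial size (stride 2 with matching padding), which is consistent with the counts stated in the text but is not spelled out there; the function names are hypothetical.

```python
def upsampling_layers_needed(target_size, start_size=4, factor=2):
    """Count the layers needed to grow start_size to target_size.

    Assumes every transposed convolution multiplies the spatial size by
    `factor` (an assumption; stride 2 is typical for this kind of generator).
    """
    layers = 0
    size = start_size
    while size < target_size:
        size *= factor
        layers += 1
    if size != target_size:
        raise ValueError(
            f"{target_size} is not reachable from {start_size} by repeated x{factor}"
        )
    return layers


def spatial_sizes(num_layers, start_size=4, factor=2):
    """List the spatial size after the projection and after each layer."""
    return [start_size * factor ** i for i in range(num_layers + 1)]


layers_for_128 = upsampling_layers_needed(128)  # 4 -> 8 -> 16 -> 32 -> 64 -> 128
layers_for_64 = upsampling_layers_needed(64)    # 4 -> 8 -> 16 -> 32 -> 64
```

Starting from the 4 × 4 projection, five doublings reach 128 × 128 and four reach 64 × 64, matching the layer counts given above.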
The discriminator (D), as shown in Figure 5, contains 24 layers: an input layer (in), an embedding layer (embed), a concatenation layer (concat), six convolutional layers (tconv1-tconv6), six dropout layers (drop, drop1-drop5) [38], four batch normalization layers (bnorm2-bnorm5), and five leaky rectified linear unit layers (lrelu1-lrelu5).
Figure 5. The proposed architecture of the CGAN discriminator. The discriminator learns from real images as well as image labels and down-samples the images produced by the generator using six convolutional layers to obtain the classification.
The addition of the dropout layers in the discriminator networks helps to reduce the capacity of the network during training and to avoid overfitting. The CGAN was trained with the original dataset described in Section 3.2.2 using the configurations shown in Table 2. The number of epochs was chosen based on the experiment so that we obtained reasonable image quality. To increase the stability of the CGAN, the generator and discriminator learning rates were not equal.
To validate the images generated by the CGAN, the train on synthetic, test on real (TSTR) method [40] was applied. In this method, the generated images are used to train a deep learning model, VGG16, and then the model is tested on the real images. The training dataset contains the CXR images of the three classes: COVID-19, pneumonia, and normal. The training parameters' configurations used in the training process are as follows:
To measure the performance of the VGG16 model, recall, precision, and F1-score metrics were used. The full descriptions of the model as well as the performance measures are available in Sections 3.3.1 and 3.3.2, respectively.
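The TSTR protocol can be illustrated with a minimal sketch. The nearest-centroid classifier below is only a toy stand-in for VGG16, and the feature vectors, labels, and function names are hypothetical; the only part taken from the text is the protocol itself, i.e., fitting on generated samples and scoring on real ones.

```python
def nearest_centroid_fit(features, labels):
    """Compute one mean feature vector (centroid) per class label."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def nearest_centroid_predict(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=squared_distance)


def tstr_accuracy(synthetic_x, synthetic_y, real_x, real_y):
    """Train on synthetic samples only, then report accuracy on real samples."""
    centroids = nearest_centroid_fit(synthetic_x, synthetic_y)
    correct = sum(
        nearest_centroid_predict(centroids, x) == y
        for x, y in zip(real_x, real_y)
    )
    return correct / len(real_y)


# Hypothetical 2-D features standing in for generated and real CXR images.
synthetic_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
synthetic_y = ["normal", "normal", "covid", "covid"]
real_x = [[0.2, 0.4], [5.5, 4.8]]
real_y = ["normal", "covid"]
score = tstr_accuracy(synthetic_x, synthetic_y, real_x, real_y)
```

A high TSTR score suggests that the synthetic samples carry the class structure of the real data, which is the reasoning behind using the method to validate generated images.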
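To ensure that the CGAN generated distinct images, the mean squared error (MSE) and structural similarity index measure (SSIM) [41] were used. The two metrics were calculated using the following two equations:
$$\mathrm{MSE} = \frac{1}{NM} \sum_{n=1}^{N} \sum_{m=1}^{M} \left[ \hat{g}(n, m) - g(n, m) \right]^2$$
where $\hat{g}(n, m)$ and $g(n, m)$ represent image1 and image2, respectively, and $N \times M$ is the image size.
$$\mathrm{SSIM}(X, Y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
where $\mu$ and $\sigma$ denote the average and standard deviation of the original image X and the test image Y, respectively; $\sigma_{xy}$ is the covariance of X and Y; and $C_1$ and $C_2$ are constants that prevent numerical instabilities.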
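The standard definitions of MSE and SSIM can be sketched in a few lines of pure Python. This is an illustration under stated assumptions rather than the implementation used in the experiments: SSIM is computed once over the whole image (practical implementations usually average it over local windows), variances use the population form, and the constants follow the conventional choice C1 = (0.01 L)^2 and C2 = (0.03 L)^2 for dynamic range L, none of which is specified in the text.

```python
def _flatten(image):
    """Turn a list of pixel rows into a flat list of floats."""
    return [float(p) for row in image for p in row]


def mse(image_a, image_b):
    """Mean squared error between two equally sized grayscale images."""
    a, b = _flatten(image_a), _flatten(image_b)
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def global_ssim(image_x, image_y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM over the full image (assumed simplification)."""
    x, y = _flatten(image_x), _flatten(image_y)
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov_xy = sum((p - mu_x) * (q - mu_y) for p, q in zip(x, y)) / n
    c1 = (k1 * dynamic_range) ** 2  # stabilises the luminance term
    c2 = (k2 * dynamic_range) ** 2  # stabilises the contrast/structure term
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


checker = [[0, 255], [255, 0]]
inverted = [[255, 0], [0, 255]]
same_mse = mse(checker, checker)
same_ssim = global_ssim(checker, checker)
inverted_ssim = global_ssim(checker, inverted)
```

Identical images give an MSE of zero and an SSIM of one, while the inverted checkerboard has negative covariance with the original and therefore a negative SSIM.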

The Deep Learning Models
To detect COVID-19 based on CXR images, five deep transfer learning models were used: InceptionResNetV2, Xception, SqueezeNet, VGG16, and AlexNet. These models were selected to range in size, i.e., the number of layers and size on disk. In the following, we provide a brief description of each model.

• InceptionResNetV2 is a convolutional neural network that is 164 layers deep with an image input size of 299 × 299. The architecture of InceptionResNetV2 is formulated based on a combination of the Inception structure and a residual network (ResNet) connection. The usage of a ResNet connection not only eliminates degradation issues in deep structures but also reduces the training time. The InceptionResNetV2 architecture consists of a stem block that contains three standard convolutional layers and two 3 × 3 max-pooling layers. Multiple convolutional and max-pooling layers follow the stem block with different sizes and different orders using ReLU and SoftMax functions [42]. The InceptionResNetV2 architecture is depicted in Figure 6a.
• Xception is a convolutional neural network that was adapted from the Inception network, where the Inception modules are replaced with depthwise separable convolutions. The network has an image input size of 299 × 299 and is 71 layers deep. Figure 6b shows the architecture of Xception, which consists of multiple convolutions with 1 × 1 size and depthwise separable convolutions with 3 × 3 size using the batch normalization, ReLU, and SoftMax functions [43].
• SqueezeNet is a small convolutional neural network that is 18 layers deep. It was designed to reduce the number of parameters to fit into computer memory or be easily transmitted over computer networks. SqueezeNet begins with a standard convolutional layer followed by eight fire modules, ending with a final convolutional layer and the SoftMax function. It performs max-pooling after the first standard convolutional layer, Fire4, Fire8, and the last standard convolutional layer [44]. Figure 6c shows the architecture of SqueezeNet.
• VGG16: The most straightforward method to improve deep neural networks' performance is by increasing the network's size. For this reason, the visual geometry group (VGG) network was created with three fully connected layers, 13 convolutional layers, and smaller size filters (2 × 2 and 3 × 3) using ReLU and SoftMax functions. It performs max-pooling twice with size 2 × 2 [45]. The architecture of VGG16 is depicted in Figure 6d.
• AlexNet is a convolutional neural network that is eight layers deep. It contains five convolution layers, three max-pooling layers, and three fully-connected layers using ReLU and SoftMax functions. The input image size is 227 × 227 [46]. Figure 6e shows the architecture of AlexNet.

The five models were trained on the expanded dataset, which contains the real images from the original dataset and the images generated by the proposed CGAN. To enable fair comparison among the different models, all the models were trained with the same parameters' configurations. Table 3 shows the parameter configurations of the training process.

Performance Metrics
The most common metrics used for evaluating deep learning models are accuracy, precision (specificity), recall (sensitivity), and F1-score [47]. In addition, macro-recall, macro-precision, and macro-F are preferred for evaluating multi-label models [48]. The kappa coefficient is used to determine the degree of agreement between the reference data and the classified map. It controls for those instances that may have been correctly classified by chance [49]. Accordingly, these metrics were chosen for use in this study. All the metrics are based on the numbers of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) cases. In addition, the confusion matrix [50] is used to present the multi-class classification. The formulas of the metrics mentioned above are presented below:
$$\text{Macro-F} = \frac{1}{N} \sum_{i=1}^{N} \text{F1Score}_i \quad (10)$$
$$\text{Kappa} = \frac{\text{Accuracy} - \text{RandomAccuracy}}{1 - \text{RandomAccuracy}} \quad (12)$$
where N is the number of classes.
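As an illustration of how these metrics follow from a confusion matrix, the sketch below uses the standard textbook definitions of precision, recall, F1-score, their macro averages, and the kappa coefficient. The function names and the example matrix are hypothetical, and RandomAccuracy is computed as the chance agreement from the row and column totals, which is the usual Cohen's kappa convention and an assumption about the definition intended here.

```python
def per_class_metrics(confusion):
    """Precision, recall, and F1 per class.

    confusion[i][j] is the number of samples of actual class i that were
    predicted as class j.
    """
    n = len(confusion)
    results = []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp
        fp = sum(confusion[r][i] for r in range(n)) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"precision": precision, "recall": recall, "f1": f1})
    return results


def summary_metrics(confusion):
    """Accuracy, macro-averaged metrics, and the kappa coefficient."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total
    per_class = per_class_metrics(confusion)
    macro = {key: sum(m[key] for m in per_class) / n
             for key in ("precision", "recall", "f1")}
    # Chance agreement: sum over classes of (actual total x predicted total).
    random_accuracy = sum(
        sum(confusion[i]) * sum(confusion[r][i] for r in range(n))
        for i in range(n)
    ) / total ** 2
    kappa = (accuracy - random_accuracy) / (1 - random_accuracy)
    return {"accuracy": accuracy, "macro": macro, "kappa": kappa}


# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
example = [[50, 0, 0], [0, 45, 5], [0, 5, 45]]
summary = summary_metrics(example)
```

For the example matrix, 140 of 150 samples lie on the diagonal and the chance agreement is one third, so the kappa coefficient is 0.9, which falls in the 0.81 to 1 band described as almost perfect agreement.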

Experimental Results
The experimental work for this study was implemented using the deep learning toolbox in MATLAB 2020b. All the experiments were conducted using a computer device with a GPU (NVIDIA GeForce RTX 2060 8GB). The values of the training time and the accuracy may vary when different hardware and mini-batch sizes are used. The results of the image generation and detection are presented in the following two subsections.

Image Generation Results
The CGAN model was trained on the original dataset described in Section 3.2.2, which contains three classes: COVID-19, pneumonia, and normal. Figure 7 shows the CGAN training process in terms of the loss scores of the generator and the discriminator. The training process required about 16 h. The number of epochs, i.e., 2000, was chosen based on the quality of the generated images: we started with 500 epochs and increased the number until we obtained good-quality images. The generator and discriminator compete in an adversarial manner; the generator tries to reduce the loss function to fool the discriminator, and the discriminator tries to increase the loss function to distinguish between real and fake images. Our proposed CGAN generated 2790 images for all three classes, COVID-19, pneumonia, and normal, producing 930 images for each. To obtain an overview of the images generated by the CGAN, we present samples of the three classes in Figure 8. The CGAN is trained on the whole dataset containing the three classes and then generates images for a given class (given label).
Figure 7. Generator and discriminator scores of the CGAN through the training process.
To validate the generated images, the VGG16 model was trained on the generated images and tested on the real images only, i.e., the Train on Synthetic, Test on Real (TSTR) method. Table 4 shows the performance metrics of the VGG16 for the three classes in terms of recall, precision, and F1-score. The confusion matrix is shown in Figure 9. Focusing on the COVID-19 class, which is the main target of this study, the high values of the VGG16 performance measures indicate that the quality of the generated images is good enough and that the CGAN was able to produce images that are, to an extent, similar to the real images. Thus, the generated images were combined with the real images in the training and testing datasets in the following phase to perform COVID-19 detection.
To ensure that the generated images were distinct from one another, an experiment was conducted using two measures: the mean squared error (MSE) and the structural similarity index measure (SSIM). To this end, 500 images were randomly selected to calculate these two measures. A value of zero for MSE indicates perfect similarity; larger values imply less similarity and continue to grow as the average difference between pixel intensities increases. The SSIM value can vary between -1 and 1, where the latter indicates perfect similarity.
In total, the comparison experiment required 124,750 comparisons between these images, i.e., one for every pair among the 500 images. The average MSE and SSIM were 5593.47 and 0.22677, respectively. Since the value of MSE is greater than one and the SSIM value is near zero, the CGAN was able to generate distinct images. To obtain an overview of the results, Figure 10 shows samples of the MSE and SSIM values between one image and the remaining images.
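The pairwise comparison can be sketched as follows. This is a minimal illustration, not the code used in the study: images are represented as flat lists of pixel intensities, the helper names (`mse`, `global_ssim`, `average_pairwise`) are hypothetical, and `global_ssim` is a simplified SSIM computed over the whole image at once, whereas standard SSIM implementations average the same formula over local windows. The constants use the conventional values K1 = 0.01 and K2 = 0.03 with a dynamic range of 255.

```python
from itertools import combinations
from math import comb


def mse(a, b):
    """Mean squared error between two equally sized flat images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def global_ssim(a, b, dynamic_range=255.0):
    """Simplified SSIM over the whole image (assumption: no local windows)."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2   # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2   # stabilizes the contrast/structure term
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


def average_pairwise(images):
    """Average MSE and SSIM over every unordered pair of images."""
    pairs = list(combinations(images, 2))
    avg_mse = sum(mse(a, b) for a, b in pairs) / len(pairs)
    avg_ssim = sum(global_ssim(a, b) for a, b in pairs) / len(pairs)
    return len(pairs), avg_mse, avg_ssim


# Number of unordered pairs among 500 images, matching the reported comparison count.
print(comb(500, 2))

# Toy 4-pixel "images" (hypothetical data, for illustration only).
toy_images = [
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [0, 255, 0, 255],
]
n_pairs, avg_mse, avg_ssim = average_pairwise(toy_images)
print(n_pairs, avg_mse, avg_ssim)
```

Iterating over unordered pairs avoids comparing an image with itself and counting each pair twice, which is why 500 images yield 124,750 comparisons rather than 250,000.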

COVID-19 Detection Results
For detection, the deep models mentioned in Section 3.3.1 were trained and tested using the expanded dataset, which consists of 500 real images and 2790 generated images for each class. Generated images make up 84.8% of the expanded dataset, as shown in Figure 11; that is, there are more than five times as many generated images as real images.

Figure 11. The percentage of real images and generated images.
Figure 12 shows the number of real and generated images in the training and testing process. As shown in the figure, training was conducted on 77.81% of the dataset (400 real images and 2160 generated images per class), and testing was conducted on 22.19% of the dataset (100 real images and 630 generated images per class). To ensure that the expanded dataset was still balanced, the same number of images was generated for each class, as shown in Table 5.
Table 5. Number of real and generated images per class in the training and testing sets.
Class       Training (real)   Training (generated)   Testing (real)   Testing (generated)
COVID-19    400               2160                   100              630
Normal      400               2160                   100              630
Pneumonia   400               2160                   100              630
The kappa coefficient, which is one of the classifier's performance metrics, was used to measure the degree of agreement between the actual and predicted labels and to control for instances where an image may have been correctly classified by chance [49]. The kappa coefficient can be calculated using both the observed accuracy and the random accuracy as in Equations (11) and (12). According to [51], a kappa coefficient between 0.81 and 1 indicates an almost perfect agreement. Figure 13 shows that all five models obtained an almost perfect agreement between the actual labels and the predicted labels.
Table 6 presents the performance of the five detection models based on accuracy, macro-precision, macro-recall, and macro-average F-score. As shown in the table, InceptionResNetV2 was the best-performing model for all metrics, whereas SqueezeNet was the worst-performing model, with no apparent difference among the metrics. To obtain a better overview of the testing accuracy per class, we present the confusion matrices of the five deep transfer learning models in Figure 14.
Choosing a suitable model is a tradeoff among different parameters such as speed, size, and accuracy. Figure 15 shows the five selected models and their corresponding parameters. The area of each diamond in the figure represents the model's size on disk: VGG16 is the largest model at 515 MB, whereas SqueezeNet is the smallest at only 4.5 MB. All models achieved high accuracy, between 98.86% and 99.72%, but with different training times. The figure also shows that some models were faster or smaller than others but could still achieve high accuracy. For instance, SqueezeNet, which is a small network, required only three minutes and achieved accuracy comparable to that of larger networks such as InceptionResNetV2, which needed more time (143 min) to run.
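The dataset proportions reported in this section can be checked with a few lines of arithmetic. The sketch below is only a worked check, with hypothetical variable names. It takes the per-class counts of Table 5 (one partition of 400 real and 2160 generated images, and one of 100 real and 630 generated images) and assumes that the larger partition is the training set, in line with the stated 77.81% training share.

```python
# Per-class counts taken from Table 5; which partition is "training" is an
# assumption based on the reported 77.81% / 22.19% split.
train = {"real": 400, "generated": 2160}
test = {"real": 100, "generated": 630}

real_total = train["real"] + test["real"]                      # 500 real images per class
generated_total = train["generated"] + test["generated"]       # 2790 generated images per class
class_total = real_total + generated_total                     # 3290 images per class

generated_share = generated_total / class_total
train_share = (train["real"] + train["generated"]) / class_total
test_share = (test["real"] + test["generated"]) / class_total
generated_to_real = generated_total / real_total

print(f"generated share: {100 * generated_share:.1f}%")
print(f"training share:  {100 * train_share:.2f}%")
print(f"testing share:   {100 * test_share:.2f}%")
print(f"generated-to-real ratio: {generated_to_real:.2f}")
```

Since every class has the same counts, these per-class shares are also the shares of the whole expanded dataset.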

Conclusions and Future Work
COVID-19 remains a serious disease all around the world. To improve COVID-19 detection, we proposed COVID-CGAN, an approach to extend the existing small CXR COVID-19 datasets using a CGAN and to detect the disease based on the extended datasets. We proposed a customized design for the CGAN with 18 layers for the generator and 24 layers for the discriminator.
Based on the results presented in the previous section, we derived the following findings:
1. A CGAN has a simple and straightforward architecture, yet can produce images similar to real ones. Other GAN architectures, such as the least-squares generative adversarial network (LSGAN) [52] and the information maximizing GAN (InfoGAN) [53], may produce better-quality images, but they have large computational budgets and generating images with them is time-consuming. CGANs, in contrast, are simpler and do not require long computation times, and they can synthesize good-quality images from the original dataset.
2. Some deep learning models are better than others in terms of detecting COVID-19 patients based on their CXR images. The experimental results showed that InceptionResNetV2 outperformed other models in detecting COVID-19 based on CXR images. This model can be investigated by other researchers to detect COVID-19 based on datasets other than CXR images.
3. Some deep learning models are small in size and thus provide fast predictions, yet can achieve good results in detecting COVID-19. As shown in Figure 15, SqueezeNet, which is a small network, required only three minutes to achieve accuracy comparable to that of larger networks.
Based on the present study, we suggest the following directions for future research:
• Design generative model architectures other than CGANs and compare them in terms of their ability to synthesize high-quality images that are similar to real images.
• Include patient information related to COVID-19 other than CXR, such as symptom datasets, in the diagnostic process.
