COVID-19 Chest X-ray Classification and Severity Assessment Using Convolutional and Transformer Neural Networks

The coronavirus pandemic started in Wuhan, China in December 2019 and put millions of people in a difficult situation. This fatal virus spread to over 227 countries, and the number of infected patients rose to over 400 million cases, causing over 6 million deaths worldwide. Because of the serious consequences of this virus, it is necessary to develop a detection method that can respond quickly to prevent the spread of COVID-19. Using chest X-ray images to detect COVID-19 is one of the promising techniques; however, with a large number of new COVID-19 cases every day, the number of radiologists available to read chest X-ray images is not sufficient. A computer-aided system is needed to help doctors determine COVID-19 cases instantly and automatically. With the recent emergence of deep learning methods in medical and biomedical applications, convolutional neural networks and transformers applied to chest X-ray images can supplement COVID-19 testing. In this paper, we attempt to classify three types of chest X-ray, normal, pneumonia, and COVID-19, using deep learning methods on a customized dataset. We also carry out an experiment on the COVID-19 severity assessment task using a tailored dataset. Five deep learning models were used to conduct our experiments: DenseNet121, ResNet50, InceptionNet, Swin Transformer, and Hybrid EfficientNet-DOLG neural networks. The results indicated that chest X-ray and deep learning could be reliable methods for supporting doctors in COVID-19 identification and severity assessment tasks.


Introduction
The novel coronavirus disease, known as COVID-19 and caused by the SARS-CoV-2 virus, initiated a global pandemic that influenced the lives of billions of people worldwide [1][2][3]. The number of infected cases and the rapidly increasing death rate indicated that this is a serious and challenging disease. The rapid spread of COVID-19 is one of the most daunting obstacles to controlling this viral pneumonia. The early symptoms of coronavirus infection include dyspnea, dry cough, fever, myalgia, and headache [4][5][6]; however, in some cases there are no clear signs, which makes the disease more dangerous to public health. It is therefore necessary to have a quick, reliable, safe, and simple method for COVID-19 detection and diagnosis.
The main method for detecting COVID-19 is the reverse transcriptase-polymerase chain reaction (RT-PCR) [7], which detects SARS-CoV-2 ribonucleic acid (RNA) in respiratory specimens (collected with nasopharyngeal swabs) and is now the gold standard for identifying COVID-19. However, RT-PCR has many limitations [8]: the screening procedure is time-consuming, laborious, and complicated, and it is constrained by a shortage of supplies. Some patients, including those highly suspected of COVID-19, have false negative and [...]
• Chest X-ray diagnosis allows rapid COVID-19 classification and can be carried out in parallel with RT-PCR testing to deal with high volumes of patients.
• Chest X-ray images can be obtained at many clinical sites and are readily available in most health care centers.
• Portable chest X-ray imaging systems help doctors isolate the image capturing process from other people, which can reduce the risk of spreading COVID-19.
Although chest X-ray diagnosis has many advantages, it still faces obstacles arising from the idiosyncratic characteristics of the new pandemic disease. The most cumbersome obstacles are the lack of experienced radiologists and the error-prone nature of human visual assessment. Computer-aided diagnosis can help radiologists reach a faster and more accurate COVID-19 diagnosis and serves as a crucial adjunct to reduce workload and enhance patient safety [18][19][20].
Recently, with the emergence of deep learning, many studies have analyzed the potential of chest X-ray and CT images for COVID-19 detection, especially the application of deep learning for automatic COVID-19 detection [21][22][23][24][25]. Deep learning for COVID-19 detection could help address the shortage of radiology specialists while producing reliable performance [26][27][28]. However, almost all developed artificial intelligence (AI) systems are closed, and their resources are not available to the research community. Little open-source code and few open datasets are available for conducting thorough research on the subject. Recently, there have been significant efforts to open access to resources and AI source code for detecting COVID-19 using chest X-ray images; some of the notable research can be found in [29][30][31][32][33][34]. In one study, a tailored convolutional neural network architecture, COVID-NET [35], was created to classify normal, pneumonia, and COVID-19 cases. Unlike other studies, the authors of this study used a large dataset containing 13,800 chest X-ray images from 13,645 patients. The authors reported an accuracy of 92.4% for COVID-19 classification.
Severity assessment using chest X-ray is not an easy task, even for experienced radiologists; computer-aided clinical diagnosis could help doctors with this daunting task. There are several works related to COVID-19 severity assessment [36,37], including the deep learning works of Liang et al. [38] and COVID-Gram [39], in which the authors investigated X-ray abnormalities to detect COVID-19. In the work of Colombi et al. [40], the extent of lung pneumonia was diagnosed to assess the severity of the disease. Another notable work was COVID-NET-S [41], one of the early COVID-19 severity assessment studies, in which the authors designed a deep neural network to predict extent scores from chest X-ray images.
In this paper, we study the application of deep learning for detecting COVID-19 from chest X-ray images. The experimental results show that artificial intelligence methods based on deep neural networks could aid doctors and radiologists with high accuracy and reliable performance. Furthermore, we study the assessment of COVID-19 severity from classified chest X-ray images. Patient severity is divided into level1 and level2, which indicate the seriousness of the illness and can aid doctors in deciding on a treatment response. We collected training images from various open dataset sources, cleaned the input data by removing low-quality images, and rebalanced the original dataset into balanced sets for efficient training. We trained five deep learning models on the customized datasets and evaluated model performance on three metrics: precision, recall, and F1-score.
The rest of the paper is organized as follows. Section 2 presents the materials and methods, namely the dataset collection process and the description of the deep learning architectures. Section 3 describes the experimental results and gives a detailed analysis of each method's performance on the COVID-19 detection and severity assessment tasks. We also discuss the limitations of the methods used and review other potential deep learning methods that can be applied to X-ray imagery. Section 4 summarizes our study and discusses future work.

Materials and Methods
We created two new datasets and ran experiments on five deep learning models, which are currently the state of the art for medical image classification tasks. The datasets and the models are described in detail in Sections 2.1 and 2.2.

Dataset
Two datasets were used in our experiments. The COVID-19 classification dataset was collected from COVID CXR [42] and the Chest X-ray Images (Pneumonia) dataset [43], both of which were published freely on the internet for research and education purposes. The severity assessment dataset was collected from two sources, the RICORD dataset [44] and the RALO dataset [45].

The Customized COVID-19 Classification Dataset
We collected the COVID-19 classification dataset from two open datasets. The first is the COVID CXR dataset, which contains 30,128 images in total: 16,488 images labeled as COVID-19, 5555 images labeled as pneumonia, and 8085 images labeled as normal. The whole dataset was collected from five open-source datasets which are currently freely available. The second is the Chest X-ray Images (Pneumonia) dataset, which was published on Kaggle for classifying pneumonia and normal chest X-ray images. We mixed the two datasets together and removed a portion of the COVID-19 labeled images from the COVID CXR dataset to create a more balanced dataset. Details about the dataset contribution are given in Table 1, in which the COVID CXR data was collected from 5 published datasets [46][47][48][49][50] that account for 80% of the mixed dataset. We cleaned the combined dataset by removing all low-quality images and keeping only the high-quality ones; data augmentation was then applied before feeding images into the deep learning models.
Table 1. Detailed description of the COVID-19 classification dataset from the Chest X-ray and COVIDX-CXR-3 datasets.
The Chest X-ray Images (Pneumonia) dataset contains a total of 5856 images grouped into two categories: 1583 images labeled as normal and 4273 images labeled as pneumonia. The images were selected from Guangzhou Women and Children's Medical Center, Guangzhou, and were collected as part of the routine clinical care of patients suffering from pneumonia. The chest X-ray images were cleaned to make sure their quality was acceptable for feeding into deep neural models. They were classified by two expert radiologists and double-checked by a third radiologist. All the normal and pneumonia images were then mixed with the normal and pneumonia images from the COVID CXR dataset. In the resulting dataset, the total number of COVID-19 images is 9446, normal is 9668, and pneumonia is 9828. The three classes of chest X-ray images, normal, pneumonia, and COVID-19, are illustrated in Figure 1.

The Customized COVID-19 Severity Assessment Dataset
The second dataset, used for the severity assessment experiment, was collected from two public datasets, the RICORD dataset and the RALO dataset. Details about the contribution of the two datasets are presented in Table 2.
RICORD stands for the RSNA International COVID-19 Open Radiology Database. The dataset was published by the Radiological Society of North America (RSNA) with the purpose of providing free access to research and education resources for the machine learning community. The segmentation was performed by a thoracic specialist, and the labeling process was coordinated with other international medical imaging organizations. There are a total of 909 images, which we separated into train, validation, and test sets. The train set contains 140 level1 and 467 level2 images. The validation set contains a total of 152 images, with 35 images of level1 and 117 of level2. The test set contains 52 level1 images and 98 level2 images, which together make up the test images for the severity assessment.
The RALO (Radiographic Assessment of Lung Opacity Score) dataset was captured and scored by Stony Brook Medicine to provide researchers with a standard COVID-19 dataset. The dataset contains 2373 chest X-ray images, which were scored by two expert radiologists for further COVID-19 severity analysis. We separated the RALO dataset into train and validation sets only, with 1899 and 474 images, respectively. The training set contains 845 level1 and 1054 level2 images, and the validation set contains 211 level1 and 263 level2 images. Examples of level1 and level2 severity chest X-ray images are shown in Figure 2.

Neural Networks Architecture
We used five neural networks to conduct our experiments: three convolutional and two transformer-based models. An overview of each model architecture is shown in Table 3, and the details are presented in the following sections.
Table 3. Detailed description of the five deep learning models used in this paper.

DenseNet121
The DenseNet121 [51] model won the CVPR 2017 Best Paper Award and was developed by researchers from Cornell University, Tsinghua University, and Facebook AI Research. The authors observed that convolutional networks can be deeper, more efficient, and more accurate if they contain shorter connections between layers close to the input and layers close to the output. Based on this observation, Gao Huang et al. [51] introduced DenseNet, which connects each layer to every other layer in a feed-forward fashion. For each layer, the feature maps of all preceding layers are used as inputs, and its own feature maps are used as inputs to all subsequent layers. The DenseNet architecture has several strong points: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.
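The dense connectivity described above can be made concrete with a small sketch. Because each layer receives the concatenated feature maps of all preceding layers, the number of input channels grows linearly with depth inside a block. The function name is hypothetical, and the values used (64 initial channels, a growth rate of 32, 6 layers) are illustrative assumptions rather than settings reported in this paper.

```python
def dense_block_input_channels(initial_channels, growth_rate, num_layers):
    """Return the number of input channels seen by each layer of a dense block."""
    channels = []
    current = initial_channels
    for _ in range(num_layers):
        # This layer sees the concatenation of all preceding feature maps.
        channels.append(current)
        # Its own output (growth_rate channels) is concatenated for later layers.
        current += growth_rate
    return channels


# Illustrative values only (assumed, not taken from the paper).
print(dense_block_input_channels(64, 32, 6))  # [64, 96, 128, 160, 192, 224]
```

Each layer only produces `growth_rate` new channels, which is one way to see why feature reuse keeps the parameter count comparatively low.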

ResNet50
The ResNet architecture was proposed by Kaiming He et al. [52] to solve the problems that arise when training very deep neural networks. Earlier convolutional neural networks often suffer from vanishing gradients when a large number of layers are stacked. Gradient vanishing appears when the network is too deep: the gradient of the loss function shrinks toward zero as it is multiplied through many chain-rule steps. As a result, the weights are barely updated and the model learns almost nothing from the training process. The main concept of ResNet is the skip connection mechanism, which reduces gradient vanishing in two ways. First, it establishes a shortcut through which the gradient can pass across many layers without being attenuated. Second, it allows the model to learn an identity function, which ensures that the higher layers do not perform worse than the lower ones. In this paper, we use ResNet50, a variant of ResNet with 50 layers, including 48 convolution layers, 1 max-pooling layer, and 1 average-pooling layer.
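A toy calculation can illustrate why the skip connection helps. The sketch below is a simplification under stated assumptions: every layer is treated as a scalar function with the same small derivative, so the gradient through a plain stack is a product of small numbers, while the gradient through residual blocks y = x + f(x) is a product of factors 1 + f'(x). The function names and the values 0.01 and 50 are hypothetical and chosen only for illustration.

```python
def plain_gradient(layer_derivative, depth):
    """Gradient through `depth` stacked layers y = f(x), each with derivative f'(x)."""
    grad = 1.0
    for _ in range(depth):
        grad *= layer_derivative
    return grad


def residual_gradient(layer_derivative, depth):
    """Gradient through `depth` residual blocks y = x + f(x).

    By the chain rule each block contributes a factor 1 + f'(x), so the
    identity path keeps the product from collapsing to zero.
    """
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + layer_derivative
    return grad


depth = 50                # assumed depth, for illustration only
small_derivative = 0.01   # assumed per-layer derivative

print("plain stack:   ", plain_gradient(small_derivative, depth))
print("residual stack:", residual_gradient(small_derivative, depth))
```

With a zero derivative the residual stack reduces to the identity mapping, which corresponds to the second benefit mentioned in the text.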

InceptionNet
Before InceptionNet, convolutional neural networks mainly focused on increasing the depth of the network to extract features and improve the learning ability of the model. The creators of InceptionNet [53], however, pioneered the scaling of both the depth and the width of the model while keeping the hardware usage constant. The principal idea behind the InceptionNet model is that neurons which extract the same features should learn together. Furthermore, the InceptionNet architecture focuses on parallel processing and on extracting different feature maps simultaneously. This is the key innovation that distinguished InceptionNet from the convolutional neural networks before it. However, the InceptionNet architecture also has some disadvantages. For example, large models that use InceptionNet are prone to overfitting, especially when the amount of labeled input data is limited, and the model becomes biased toward the category that has more labels than the others.

Swin Transformer
Having won best paper honors, the Swin Transformer [54] was among the priority choices for our experiments. The model solved many problems that earlier vision transformers experienced, and it marks a significant shift in applying transformers to vision tasks. Applying transformers from natural language processing (NLP) to computer vision is a substantial challenge because of the inherent differences between the two domains; for example, the number of pixels in a high-resolution image far exceeds the number of words in a text document, which makes transformers for vision tasks more computationally expensive than transformers for NLP. To solve this problem, the creators of Swin Transformer proposed a hierarchical transformer architecture whose representation is computed with shifted windows. This hierarchy provides the flexibility to model at different scales and has linear computational complexity with respect to image size. Therefore, it can be used as the backbone for other vision tasks such as classification and dense prediction.
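The claim of linear complexity can be checked numerically with the cost estimates given in the Swin Transformer paper [54] for a feature map of h × w patches with C channels: global self-attention costs roughly 4hwC² + 2(hw)²C, while window-based self-attention with window size M costs roughly 4hwC² + 2M²hwC. The sketch below only evaluates these two expressions; the function names and the example sizes (56 × 56 patches, C = 96, M = 7) are assumptions used for illustration.

```python
def global_msa_flops(h, w, c):
    """Approximate cost of global multi-head self-attention (quadratic in h*w)."""
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c


def window_msa_flops(h, w, c, m):
    """Approximate cost of window-based self-attention (linear in h*w for fixed m)."""
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c


# Assumed example sizes, for illustration only.
h, w, c, m = 56, 56, 96, 7

small_global = global_msa_flops(h, w, c)
small_window = window_msa_flops(h, w, c, m)

# Doubling the resolution multiplies the number of patches by 4.
large_global = global_msa_flops(2 * h, 2 * w, c)
large_window = window_msa_flops(2 * h, 2 * w, c, m)

print("global growth factor:", large_global / small_global)
print("window growth factor:", large_window / small_window)
```

For a fixed window size, the window-based cost grows by the same factor as the number of patches, whereas the global cost grows considerably faster.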

Hybrid EfficientNet and DOLG
The Hybrid EfficientNet and DOLG model [55] won the Google Landmark Competition 2021 with the highest recognition performance over more than 200,000 classes. The authors implemented the model by enhancing the original DOLG [56] with some adjustments to improve its recognition capability. First, they used EfficientNet [57], pre-trained on the ImageNet dataset, as an encoder. They then added a local branch after the third EfficientNet block and extracted 1024-dimensional local features using three dilated convolutions whose parameters differed for each model. The output of the fourth EfficientNet block was projected to 1024 dimensions, and the fused features were aggregated by average pooling before being fed into the fully connected layers. The model used sub-center ArcFace with dynamic margins as the loss function for predicting thousands of classes. An overview of the Hybrid EfficientNet and DOLG architecture is shown in Figure 3, with EfficientNet-B5 as the feature extractor and DOLG as the classifier.

Data Augmentation
Recently, convolutional neural networks and transformers have performed excellently on many vision tasks such as classification and segmentation. However, these networks need large amounts of input data to prevent overfitting, which causes the model to fail to generalize. A model overfits when its learned weights perform well on the training set but badly on the test set. Unfortunately, many application domains of deep learning do not have access to big data. In the medical and biomedical domains in particular, input data are scarce because of the high cost of labeling and the scarcity of image sources. Labeling requires experienced radiologists, pathologists, and specialists in medical image analysis, which makes it very expensive. Furthermore, much real-life medical data cannot be made available because patient information must be kept private.
One of the common techniques for increasing the amount of input image data for deep learning models is data augmentation. Many works that apply deep learning to COVID-19 detection from chest X-rays use data augmentation to increase the input data. In COVID-NET, Wang et al. [35] applied horizontal flip, intensity shift, translation, zoom, and rotation. Bassi et al. [58] applied flipping, rotation, and translation to improve deep neural network performance. Another work by Nishio et al. [59] used a mixture of data augmentation techniques, such as rotating, flipping, shifting, and mix-up, to improve the model's performance. In this paper, we applied various image transformations using ImageDataGenerator from keras.preprocessing.image to augment our input data. The augmentation operations include height_shift_range, rotation_range, horizontal_flip, brightness_range, width_shift_range, and rescale. All these data augmentation techniques are illustrated in Figure 4.
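As a rough illustration of how the augmentation operations listed above could be configured, the following sketch passes them to ImageDataGenerator. Only the parameter names come from the text; every value, the helper name `build_train_generator`, the image size, and the batch size are placeholder assumptions, since the paper does not report them. Note also that ImageDataGenerator is marked as deprecated in recent TensorFlow releases, so this sketch assumes a TensorFlow 2.x environment in which it is still available.

```python
# Hypothetical augmentation settings: the parameter names come from the text,
# the values are placeholders chosen for illustration only.
AUGMENTATION_KWARGS = {
    "rescale": 1.0 / 255,            # map pixel values to [0, 1]
    "rotation_range": 10,            # random rotation, in degrees
    "width_shift_range": 0.1,        # horizontal shift, fraction of width
    "height_shift_range": 0.1,       # vertical shift, fraction of height
    "brightness_range": (0.8, 1.2),  # random brightness factor
    "horizontal_flip": True,
}


def build_train_generator(directory, target_size=(224, 224), batch_size=32):
    """Create an augmenting iterator over a class-per-folder image directory."""
    # Imported lazily so the settings above can be inspected without TensorFlow.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(**AUGMENTATION_KWARGS)
    return datagen.flow_from_directory(
        directory,
        target_size=target_size,
        batch_size=batch_size,
        class_mode="sparse",  # integer labels, matching a sparse cross-entropy loss
    )
```

Augmentation of this kind would normally be applied to the training split only, with the validation and test images merely rescaled.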

Hardware and Hyperparameter Settings
We trained the deep learning models on an NVIDIA GeForce RTX 2070 GPU with 8 GB of memory, in a computer with an Intel(R) Core(TM) i7-8700K CPU @ 3.70 GHz and 16 GB of RAM. We used Callback, ModelCheckpoint, LearningRateScheduler, TensorBoard, EarlyStopping, and ReduceLROnPlateau from tensorflow.keras.callbacks. We set the maximum number of epochs to 120 with a patience of 10, starting with a learning rate of 0.0001, with a minimum learning rate of 0.0001 and a maximum learning rate of 0.0005. We utilized the Adam optimizer [60] from tensorflow_addons, which is an extension of stochastic gradient descent and has been used frequently for vision and natural language processing.
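To make the roles of the maximum epoch count and the patience value concrete, here is a minimal pure-Python sketch of the early-stopping rule: training stops once the monitored validation loss has failed to improve for `patience` consecutive epochs, or when the epoch limit is reached. This is a simplified illustration of the idea behind the EarlyStopping callback, not the authors' training code; the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=10, max_epochs=120):
    """Return the 1-based epoch at which training would stop.

    Stops after `patience` consecutive epochs without a strictly lower
    validation loss, or at `max_epochs` if that never happens.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement in this epoch
            if wait >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Hypothetical loss curve: improves for 3 epochs, then plateaus.
history = [1.0, 0.9, 0.8] + [0.85] * 30
print(early_stopping_epoch(history))  # 13
```

In this example the best loss occurs at epoch 3, and the ten following epochs bring no improvement, so training halts at epoch 13.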
We used Sparse Categorical Cross Entropy as the loss function for our training pipeline, which is a loss function for multi-categorical classification. Two loss functions are usually applied to multi-categorical classification tasks: categorical cross entropy and sparse categorical cross entropy. The two loss functions share the same formula; the only difference is the encoding of the ground-truth labels. Sparse categorical cross entropy uses integer-encoded labels such as {1}, {2}, and {3}, whereas categorical cross entropy uses one-hot encoding, such as {1, 0, 0}, {0, 1, 0}, and {0, 0, 1}. The multi-categorical classification scheme is illustrated in Figure 5, in which the feature maps pass through a Softmax layer before being passed to Sparse Categorical Cross Entropy.
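The equivalence of the two losses can be verified with a few lines of plain Python. The sketch computes the cross entropy for a single sample from softmax probabilities, once with an integer label and once with the corresponding one-hot vector. The function names and the probability values are made up for illustration, and the class indices here are 0-based (an assumption that mirrors the usual Keras convention).

```python
import math


def sparse_categorical_cross_entropy(class_index, probabilities):
    """Cross entropy for an integer-encoded label (0-based class index)."""
    return -math.log(probabilities[class_index])


def categorical_cross_entropy(one_hot, probabilities):
    """Cross entropy for a one-hot encoded label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probabilities))


# Hypothetical softmax output for one image over three classes.
probs = [0.7, 0.2, 0.1]

loss_sparse = sparse_categorical_cross_entropy(0, probs)
loss_one_hot = categorical_cross_entropy([1, 0, 0], probs)

print(loss_sparse, loss_one_hot)
```

Both calls return the same value because only the term of the true class survives the sum; the sparse form simply avoids materializing the one-hot vector.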

Evaluation Metrics
In our experiments, we used the evaluation metrics from classification_report in sklearn.metrics, which include precision, recall, and F1-score for each category, together with the macro-average and micro-average. Details about the metrics are presented in the following sections.

Precision Metric
This evaluation metric calculates the ratio of the cases correctly predicted as positive to all cases predicted as positive.

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$

The precision metric gives an intuition of the model's ability to return only the cases that are relevant in the dataset. The precision metric relates to direct costs; a large number of false positives raises the cost of obtaining each true positive.

Recall Metric
This evaluation metric calculates the ratio of the cases correctly predicted as positive to all actual positive cases.

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$

The recall metric gives an intuition of the model's ability to find all the relevant cases in the dataset. This metric relates to opportunity costs; every false negative is a lost opportunity.

F1-Score Metric
The F1-score metric combines precision and recall into a single value, the harmonic mean of the precision and recall of the model.

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

When we have two classifiers, the first with higher precision and the second with higher recall, we should use the F1-score to compare the two models.
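The three metrics defined so far can be computed directly from the counts of true positives, false positives, and false negatives. The following sketch uses hypothetical counts for a single class; the function names are illustrative and are not part of sklearn.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts for one class (not results from this paper).
tp, fp, fn = 90, 10, 30

p = precision(tp, fp)
r = recall(tp, fn)
print(p, r, f1_score(p, r))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot obtain a high F1-score by maximizing only precision or only recall.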

Macro-Average Metric
The macro-average calculates the arithmetic mean of a metric (precision, recall, or F1-score) computed separately for each class in the dataset.

$$\text{Macro average of precision} = \frac{P_1 + P_2}{2}, \qquad \text{Macro average of recall} = \frac{R_1 + R_2}{2}, \qquad \text{Macro average of F1-score} = \frac{F1_1 + F1_2}{2}$$

For example, in the formulas above we have two classes, with $P_1$ as the precision of the first class and $P_2$ as the precision of the second class. After calculating a precision score for every class, the macro-average is computed by taking the average of these scores. The macro-averages for recall and F1-score are computed in the same way.
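The difference between the two averaging schemes is easiest to see with unbalanced classes. In the sketch below, the macro-average takes the unweighted mean of the per-class precision values, while the micro-average pools the counts of all classes before computing a single precision. The micro-average definition used here is the standard one and is stated as an assumption, since the text above only defines the macro-average; the counts and function names are hypothetical.

```python
def macro_average(per_class_scores):
    """Unweighted mean of a metric computed separately for each class."""
    return sum(per_class_scores) / len(per_class_scores)


def micro_precision(tp_per_class, fp_per_class):
    """Precision computed from counts pooled over all classes."""
    total_tp = sum(tp_per_class)
    total_fp = sum(fp_per_class)
    return total_tp / (total_tp + total_fp)


# Hypothetical counts for two classes of very different size.
tp = [90, 10]
fp = [10, 10]

per_class_precision = [t / (t + f) for t, f in zip(tp, fp)]  # 0.9 and 0.5

print("macro:", macro_average(per_class_precision))
print("micro:", micro_precision(tp, fp))
```

The macro-average treats every class equally regardless of its size, whereas the micro-average is dominated by the larger class.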

Results and Discussions
We ran experiments on two tasks. The first task was the classification of normal, pneumonia, and COVID-19 images, which is a multi-categorical classification task. The second task was the COVID-19 severity assessment, which involves two levels of severity and is therefore a binary classification task. Details of the experimental results for the two tasks are presented in Sections 3.1 and 3.2.

COVID-19 Classification Results
We trained three convolution-based and two transformer-based models on the customized COVID-19 classification dataset. The outputs were evaluated with three metrics, precision, recall, and F1-score, reported for the COVID-19, normal, and pneumonia classes as well as for the macro-average and micro-average. Each of the tables below contains the numerical results for one metric and is described in detail. The two models that showed the best results were DenseNet121 and Hybrid EfficientNet-DOLG.
Table 4 compares the performance of the five deep learning models based on the precision score. The numerical results are given for each category and for the macro- and micro-average. For the COVID-19 precision score, the best model was Swin Transformer with a score of 0.99; for normal images, Hybrid EfficientNet-DOLG achieved the top score of 0.93; and for pneumonia, the DenseNet121 model produced the highest result (0.94). For the macro-average and micro-average, Hybrid EfficientNet-DOLG led both with scores of 0.95 and 0.96, respectively. The hierarchical architecture of Swin Transformer computes image representations at different scales, which works best on COVID-19 images. On the other hand, the concatenation mechanism of DenseNet boosted its pneumonia detection. The hybrid architecture helped Hybrid EfficientNet-DOLG achieve the best results on normal images, the macro-average, and the micro-average.
As shown in Table 5, Hybrid EfficientNet-DOLG performed outstandingly on recall, producing the highest scores on pneumonia, macro-average, and micro-average, with 0.95, 0.96, and 0.96, respectively. For the COVID-19 category, DenseNet121 had the best recall score of 0.98, and the second highest score, 0.97, belonged to Hybrid EfficientNet-DOLG, a gap of only 0.01. For normal images, Swin Transformer had the best score of 0.97, followed by DenseNet121 and Hybrid EfficientNet-DOLG with 0.94. The other models, ResNet50 and InceptionNet, also produced comparable results. With respect to recall, the dense architecture of DenseNet worked best on COVID-19 images and, unlike for the precision metric, the hierarchical architecture of Swin Transformer achieved the best results on normal images. The combination of EfficientNet and DOLG took first place on pneumonia, macro-average, and micro-average.
Table 5. Comparing results between the five deep learning models on the COVID-19 classification task with respect to recall metric.

The third metric was the F1-score. Here, Hybrid EfficientNet-DOLG produced the best results on all scores except the normal category, where the top result belonged to DenseNet with a score of 0.95. For COVID-19, pneumonia, macro-average, and micro-average, Hybrid EfficientNet-DOLG produced the highest results with scores of 0.99, 0.94, 0.95, and 0.96, respectively; the second-best model was DenseNet121, with scores of 0.98, 0.93, 0.94, and 0.95, respectively. In contrast to precision and recall, DenseNet performed best on normal images with respect to the F1-score. This might be because the dense connections performed well for normal cases, while the combination of EfficientNet and DOLG was more effective on the other categories.
From Table 6, we can conclude that Hybrid EfficientNet-DOLG and DenseNet are the best models for the COVID-19 classification task on our customized dataset. Figures 6 and 7 show the confusion matrices of the inference results on the test set and the training histories of the Hybrid EfficientNet-DOLG and DenseNet models.
Table 6. Comparing results between the five deep learning models on the COVID-19 classification task with respect to F1-score metric.
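As a minimal sketch of how the metrics reported in Tables 4-6 are conventionally derived, the following Python code computes per-class precision, recall, and F1-score, together with macro- and micro-averages, from a confusion matrix. The function name `per_class_and_averages` and the example counts are hypothetical and chosen only for illustration; they are not the evaluation code or the results of this study.

```python
from typing import Dict, List


def per_class_and_averages(confusion: List[List[int]],
                           labels: List[str]) -> Dict[str, Dict[str, float]]:
    """Per-class precision/recall/F1 plus macro- and micro-averages.

    confusion[i][j] counts samples whose true class is i and whose
    predicted class is j (single-label classification is assumed).
    """
    def safe_div(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a class has no samples.
        return num / den if den else 0.0

    def f1_score(p: float, r: float) -> float:
        # F1 is the harmonic mean of precision and recall.
        return safe_div(2 * p * r, p + r)

    k = len(labels)
    per_class: Dict[str, Dict[str, float]] = {}
    tp_all = fp_all = fn_all = 0
    for c in range(k):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(k)) - tp  # predicted c, true other
        fn = sum(confusion[c]) - tp                       # true c, predicted other
        p, r = safe_div(tp, tp + fp), safe_div(tp, tp + fn)
        per_class[labels[c]] = {"precision": p, "recall": r, "f1": f1_score(p, r)}
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn

    # Macro-average: unweighted mean of the per-class scores.
    macro = {m: sum(s[m] for s in per_class.values()) / k
             for m in ("precision", "recall", "f1")}
    # Micro-average: pool the counts over all classes, then compute the score.
    micro_p = safe_div(tp_all, tp_all + fp_all)
    micro_r = safe_div(tp_all, tp_all + fn_all)
    micro = {"precision": micro_p, "recall": micro_r, "f1": f1_score(micro_p, micro_r)}

    report = dict(per_class)
    report["macro-average"] = macro
    report["micro-average"] = micro
    return report


# Hypothetical counts for illustration only; these are NOT the paper's results.
demo_labels = ["COVID-19", "normal", "pneumonia"]
demo_confusion = [[95, 3, 2], [4, 90, 6], [1, 7, 92]]
for name, scores in per_class_and_averages(demo_confusion, demo_labels).items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

In single-label classification, every false positive for one class is a false negative for another, so the micro-averaged precision, recall, and F1-score coincide with the overall accuracy. The macro-average, by contrast, weights every class equally regardless of how many images it contains, which is why the two averages can rank models differently on an imbalanced test set.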

Severity Assessment Results
For the COVID-19 severity assessment task, we also trained five models on the customized COVID-19 severity dataset: DenseNet121, ResNet50, InceptionNet, Swin Transformer, and Hybrid EfficientNet-DOLG. The dataset contains images of two categories, level1 and level2. Level1 indicates that the patient's severity is normal and the patient can self-quarantine at home without requiring a further treatment response. Level2 indicates that the patient needs further support and must go to the hospital for a treatment response because the extent of the pneumonia damage caused by COVID-19 is large and severe.
After training the models for several hours, we obtained the output results shown in Tables 7-9. As in the COVID-19 classification task, we evaluated the severity assessment results on three metrics: precision, recall, and F1-score, with respect to level1, level2, macro-average, and micro-average. A detailed analysis of the deep learning models' performance on COVID-19 severity assessment is presented for each metric.
From Table 7, we can see that for level1, DenseNet121 had the top precision score of 0.76, and for level2, Hybrid EfficientNet-DOLG produced the top precision score of 0.87. The Swin Transformer led the macro-average with a precision score of 0.81, and Hybrid EfficientNet-DOLG produced the highest micro-average score of 0.82. The dense connections of DenseNet help to reduce vanishing gradients, which improved the precision of DenseNet on level1 chest X-ray images. The Swin Transformer performed best on the macro-average because its scores on both level1 and level2 are high, and Hybrid EfficientNet-DOLG surpassed the other neural nets on the micro-average because its precision on level2 is the highest and its precision on level1 is almost equal to that of DenseNet.
Table 7. Comparing results between the five deep learning models on COVID-19 severity assessment task with respect to precision metric.
When comparing the recall scores, we could see that Hybrid EfficientNet-DOLG outperformed the other models in three categories: level1, macro-average, and micro-average, with scores of 0.75, 0.80, and 0.82, respectively. DenseNet121 produced the highest level2 score (0.89) on the recall metric. The dense connection mechanism of DenseNet is sensitive to level2 chest X-ray images, which gave DenseNet the highest score on the recall metric. The EfficientNet encoder of the hybrid network produced strong overall performance on level1, macro-average, and micro-average. Swin Transformer and the other two convolution-based neural nets, ResNet50 and InceptionNet, also achieved results comparable with those of DenseNet and Hybrid EfficientNet-DOLG.
Table 9. Comparing results between the five deep learning models on the COVID-19 severity assessment task with respect to F1-score metric.
The last metric we analyzed was the F1-score. On this metric, Hybrid EfficientNet-DOLG outperformed the other models in all four categories: level1, level2, macro-average, and micro-average, with scores of 0.74, 0.86, 0.80, and 0.82, respectively. The top F1-score results of Hybrid EfficientNet-DOLG are based on the robustness of EfficientNet as the encoder of the structure, combined with the global and local descriptors of DOLG. DenseNet also produced high results, with scores of 0.71 on level1, 0.86 on level2, 0.79 on macro-average, and 0.81 on micro-average. The DenseNet scores were lower than those of Hybrid EfficientNet-DOLG by only small gaps on level1, macro-average, and micro-average, and equal on level2. The other neural nets also achieved F1-score results comparable with those of DenseNet and Hybrid EfficientNet-DOLG.
Overall, we can conclude that Hybrid EfficientNet-DOLG and DenseNet are the two models that produce the highest inference results on the COVID-19 severity assessment task, which is the same pattern as on the COVID-19 classification task. Figures 8 and 9 show the confusion matrices and training histories of DenseNet and Hybrid EfficientNet-DOLG after inference and training on the COVID-19 severity assessment dataset.
There are many deep learning architectures and deep transfer learning techniques that could be applied to X-ray imagery as potential methods for COVID-19 detection and severity assessment. These architectures include Wide Residual Networks [61] (WRNs) and Visual Geometry Group [62] (VGG) networks. Wide Residual Networks are variants of ResNet that increase the width and decrease the depth of residual networks, creating lightweight models with high performance. With only 16 layers, such a network can outperform other convolutional neural networks with over 1000 layers on the CIFAR and ImageNet datasets. The second network is VGG, which placed first and second in the localization and classification tracks, respectively, of the ImageNet Challenge 2014. VGG16 and VGG19 denote variants with 16 and 19 layers, respectively. In general, the VGG architecture includes input layers, convolutional layers, hidden layers, and fully connected layers, and the number of layers depends on the variant used. Deep transfer learning has also been studied, including the work of Naushad et al. [63], in which the authors efficiently implemented deep transfer learning techniques for land use and land cover classification based on WRN and VGG pre-trained models. Another study by Das et al. [64] applied deep transfer learning to automatically detect COVID-19 from chest X-ray images.
This study has some limitations. First, we collected data from many open-source datasets, which might affect the model accuracy to some extent: X-ray images obtained from different machines vary in image quality, color channels, and resolution, and these factors have a significant impact on the model training pipeline. Another shortcoming of this study is the number of severity assessment levels; for a more precise treatment response, having many classes of severity is better than having few. In this study, we focused on only two classes, level1 and level2; more levels would give a more detailed severity diagnosis, from which a more appropriate treatment response could be designed. The last limitation of this paper is that we did not propose a deep learning model customized for our chest X-ray dataset. We only used built-in models from available libraries, which are not as efficient as a model designed specifically for X-ray imagery could be. In future work, we aim to build a model that is lightweight, robust, and has a lower computational complexity for X-ray image classification tasks. The five deep neural networks we used have the following computational complexity: Swin Transformer (1038 Giga FLOPs), InceptionNet (24.57 Giga FLOPs), Hybrid EfficientNet-DOLG (9.9 Giga FLOPs), DenseNet121 (5.69 Giga FLOPs), and ResNet50 (3.8 Giga FLOPs). We will design a network with a computational complexity of approximately 10 Giga FLOPs to obtain an efficient and robust model.
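To make the FLOPs figures above more concrete, the sketch below estimates the cost of a single convolutional layer by counting multiply-accumulate operations (MACs). This is only an illustration under stated assumptions: the helper names are hypothetical, the example layer (a 7 x 7, stride-2 convolution on a 224 x 224 RGB input) is a common stem configuration rather than a layer taken from our models, and the text does not specify the input resolution or whether one MAC is counted as one or two FLOPs, so both conventions are shown.

```python
def conv2d_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv2d_macs(in_channels: int, out_channels: int, kernel: int,
                out_height: int, out_width: int) -> int:
    """Multiply-accumulate count of a dense 2D convolution (bias ignored).

    Every output element needs in_channels * kernel * kernel multiply-adds.
    """
    return out_height * out_width * out_channels * in_channels * kernel * kernel


# Hypothetical example layer: 7x7 convolution, stride 2, padding 3,
# 3 input channels, 64 output channels, 224x224 input image.
out_size = conv2d_output_size(224, kernel=7, stride=2, padding=3)
macs = conv2d_macs(in_channels=3, out_channels=64, kernel=7,
                   out_height=out_size, out_width=out_size)

print(f"output size: {out_size} x {out_size}")
print(f"cost counting 1 MAC as 1 FLOP:  {macs / 1e9:.3f} GFLOPs")
print(f"cost counting 1 MAC as 2 FLOPs: {2 * macs / 1e9:.3f} GFLOPs")
```

Summing such per-layer estimates over a whole network gives a total that grows with the square of the input resolution, so complexity figures are only comparable between models when the input size and the counting convention are the same.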

Conclusions
In this work, we have shown the benefits of using chest X-ray images as an early COVID-19 screening method that offers faster and safer COVID-19 detection. We also created new chest X-ray datasets from available open datasets and cleaned the input data by removing low-quality images to make our training dataset more balanced. The first dataset, used for COVID-19, pneumonia, and normal classification, was collected from the COVID CXR and Chest X-ray Pneumonia datasets and contains a total of 36,384 images. The second dataset, used for COVID-19 severity classification, was collected from the RICORD and RALO datasets and contains a total of 3282 images. We ran experiments on five deep learning models, comprising convolution-based and transformer-based models. The results indicated that using chest X-ray images to detect COVID-19 and assess its severity is a promising method, since it produces reliable inference performance. We could also see the pattern that the transformer-based models performed better than convolution-based models on all three metrics: precision, recall, and F1-score.
In future work, we will apply more data augmentation techniques, such as GAN [65], to augment the input data and train more accurate models. We will also consider customizing deep learning models for the chest X-ray dataset to create more robust and stable models and improve performance on the COVID-19 detection and COVID-19 severity assessment tasks. Our machine learning model currently performs well on a research scale but is not ready as a production solution. We hope that in the future we can gather more real-case datasets to apply our machine learning system to practical diagnosis in clinical settings.