Let AI Perform Better Next Time—A Systematic Review of Medical Imaging-Based Automated Diagnosis of COVID-19: 2020–2022

Abstract: The COVID-19 pandemic has caused millions of infections and enormous social and economic losses worldwide. Due to the false-negative rate and the time-consuming nature of Reverse Transcription Polymerase Chain Reaction (RT-PCR) tests, diagnosis based on X-ray and Computed Tomography (CT) images has been widely adopted as a complement to RT-PCR testing. Since the very beginning of the pandemic, researchers in the artificial intelligence community have proposed a large number of automatic diagnosis models, hoping to assist radiologists and improve diagnostic accuracy. However, after two years of development, few models can actually be applied in real-world scenarios, and numerous problems have emerged in the research on the automated diagnosis of COVID-19. In this paper, we present a systematic review of these diagnosis models, covering a total of 179 proposed models. First, we compare the medical image modalities (CT or X-ray) for COVID-19 diagnosis from both the clinical and the artificial intelligence perspectives. Then, we classify existing methods into two types: image-level diagnosis (i.e., classification-based methods) and pixel-level diagnosis (i.e., segmentation-based methods). For both types, we define universal model pipelines and analyze in detail the techniques that have been applied in each step. In addition, we review some commonly adopted public COVID-19 datasets. More importantly, we present an in-depth discussion of the existing automated diagnosis models and identify three significant problems: biased model performance evaluation, inappropriate implementation details, and low reproducibility, reliability and explainability. For each problem, we give recommendations on how to avoid the same mistakes and let AI perform better in the next pandemic.


Introduction
The SARS-CoV-2 (COVID-19) pandemic began in the spring of 2020. In the battle between humans and the novel coronavirus, medical professionals and scientists contributed enormously. At the same time, a common reaction from other research communities, including the AI community, was "how can we help?". In practice, the diagnosis of COVID-19 can be based on two methods: the Reverse Transcription Polymerase Chain Reaction (RT-PCR) test and radiographic imaging (X-ray or CT scan). Current AI technologies can hardly help with RT-PCR tests, but they can assist the imaging-based diagnosis of COVID-19 from a computer vision perspective. In virus-stricken areas, radiologists bear a heavy burden in analyzing massive numbers of scans, so automated diagnosis models have great potential value in supporting medical decisions. The AI community responded very quickly on this point. The first automated diagnosis model (from Wang et al. [1]) was posted on medRxiv on 17 February 2020, when the total number of confirmed cases was only around 64,000 (current global confirmed cases: more than 400,000,000; data retrieved from https://covid19.who.int, accessed on 11 February 2022), nearly a month before the World Health Organization (WHO) declared COVID-19 a global pandemic (11 March 2020). As shown in Figure 1, following the work of Wang et al. [1], a large number of researchers dived into this field, making the automated diagnosis of COVID-19 a noticeable research hotspot in the AI community. Meanwhile, an unprecedented number of papers have been presented. As of this writing, there are more than 1320 manuscripts on arXiv (Google Scholar search: COVID-19 ["CT" OR "X-ray"] ["machine learning" OR "deep learning" OR "artificial intelligence"] site:arxiv.org) and 535 manuscripts on medRxiv (Google Scholar search: COVID-19 ["CT" OR "X-ray"] ["machine learning" OR "deep learning" OR "artificial intelligence"] site:medrxiv.org).
As shown in Table 1, at least 13 special issues on this topic have been launched by various journals. Moreover, to the best of our knowledge, more than 20 review or survey papers [2] have been presented. The research on the automated diagnosis of COVID-19 is highly interdisciplinary, and a good study should meet both clinical and AI standards. From the clinical perspective, a model should avoid mistakes in data collection, data preprocessing and data augmentation. From the AI perspective, a model should have a rigorous experimental design, sufficient robustness and good generalization ability. Unfortunately, of the thousands of studies that appeared in these two years, very few qualify on both sides. On 27 March 2020, a group from Europe presented an early review of the diagnosis models [3,4]. Among hundreds of covered studies, they found that "all models were rated at high or unclear risk of bias" according to the results obtained from PROBAST [5] (a prediction model risk of bias assessment tool). On 13 November 2020, Summers [6] described AI for COVID-19 imaging as "A Hammer in Search of a Nail". He called for moving beyond studies that repeatedly show that AI can detect COVID-19; reliable diagnosis models that can meet real-world clinical requirements are more urgently needed. However, this need was not met by subsequent studies. On 15 March 2021, a group from the UK [7] published an article pointing out that "our review finds that none of the models identified are of potential clinical use due to methodological flaws and/or underlying biases". Later, a review paper [8] noted that "The vast majority of manuscripts were found to be deficient regarding potential use in clinical practice".
The problems in the research on the automated diagnosis of COVID-19 are not limited to insufficient practical value. A group from the University of Cambridge reviewed studies of AI models for COVID-19 diagnosis and found an "apparent deterioration in standards of research" [9]. Hryniewska et al. [10] listed numerous mistakes made in the research on the automated diagnosis of COVID-19. Cruz et al. [11] analyzed public COVID-19 datasets and found that most of them have significant problems that lead to a high risk of model bias.
Review and survey papers are also important. A good review can help researchers quickly understand the current status and background of a research field, reducing the risk of conducting repetitive work. In June 2020, we contributed a conference paper [12] that reviewed automated diagnosis models of COVID-19. In that paper, we pointed out that the literature coverage of existing reviews is insufficient, as demonstrated in Table 2; meanwhile, they lack proper organization, performance comparisons and in-depth analyses of shortcomings. Specifically, several reviews divide diagnosis models into CT-based and X-ray-based models, even though the two share many similarities in preprocessing, feature extraction, classification and evaluation. Many early reviews contain repetitive restatements of previous papers, failing to provide their own insights and new perspectives, and failing to point out clear directions for how AI can be used in practical medicine in the future. These points are partially confirmed by a recent paper [2] that systematically analyzes the methodological quality of COVID-19 review papers. Therefore, based on our previous conference paper, we present a systematic review of the two-year (2020-2022) development of the medical imaging-based automated diagnosis of COVID-19. We classify existing methods into two types, classification and segmentation, then define universal model pipelines and analyze the techniques that have been used in each step. We also review the existing public datasets. More importantly, we summarize the problems that have emerged in this field and provide comments and suggestions on how AI can perform better in the next pandemic. The rest of this paper is organized as follows. We first discuss the input modalities of automated diagnosis models in Section 2.
In Sections 3 and 4, we present a systematic review of existing methods and datasets, respectively. Section 5 discusses the limitations of existing methods and presents corresponding recommendations, and Section 6 concludes this paper.

Input Modalities: CT or X-ray

Clinical Perspective
An early study (March 2020) pointed out that fever, cough, myalgia, fatigue, expectoration and dyspnea are the main clinical symptoms of COVID-19 patients [24]. These symptoms can be used to diagnose COVID-19. However, later research shows that many infected patients (estimated at around 20%) can be asymptomatic and that asymptomatic patients play an important role in the transmission of COVID-19 [25][26][27]. Real-time RT-PCR has proven to be a much more effective way to diagnose COVID-19, but it still carries the risk of false-negative and false-positive results [28].
To solve this issue, medical imaging (CT or X-ray)-based diagnosis can be used as a complementary method to correct false-negative RT-PCR tests. In February 2020, a group from China [29] found that some patients with positive chest CT findings may present with negative results of RT-PCR tests for COVID-19. CT patterns of COVID-19 include Ground-Glass Opacities (GGO), vascular enlargement, bilateral abnormalities, lower lobe involvement and posterior predilection [30]. Fang et al. [31] reported that the sensitivity of CT-based diagnosis was greater than that of RT-PCR (98% vs. 71%). Meanwhile, chest X-ray imaging has also proven to be an effective tool for diagnosing COVID-19 [32]. Cozzi et al. [33] showed that there are several commonly observed features in chest X-ray images, including lung consolidations, GGO, nodules and reticular-nodular opacities. Rousan et al. [32] showed that nearly half of COVID-19 patients have abnormal chest X-ray findings and that GGO in the lower lobes is the most common finding. Some typical examples of COVID-19 CT and X-ray features can be found in Figure 2.
Figure 2. Typical chest CT (left) [30] and chest X-ray (right) [32] images of COVID-19-infected patients; subfigures a-f on the right represent different cases (please refer to [32] for more details). In short: CT images are accurate but expensive, whereas X-ray images are less accurate but cheaper.
At present, a (repeated) RT-PCR test is the most commonly adopted COVID-19 diagnosis approach due to its scalability and relatively high sensitivity, and medical imaging-based diagnosis is used as a complementary method. Comparing the two imaging modalities, CT-based diagnosis is much more accurate (even better than RT-PCR) but much more expensive, whereas X-ray-based diagnosis is less accurate but less expensive [34][35][36].

Artificial Intelligence Perspective
From the perspective of artificial intelligence algorithms, the most significant difference between CT and X-ray images is the input shape. As shown in Figure 3, a CT image contains a series of slices, so the input has multiple channels. However, some methods ignore the multiple channels and treat each slice independently, leading to a loss of spatial contextual information. Comparatively, X-ray images only have a single channel.

Despite the difference in data dimension, there are not many other differences between CT and X-ray images from the artificial intelligence perspective, since most existing methods do not utilize any explicit domain knowledge. Therefore, among the papers covered in this review, the numbers of CT-based and X-ray-based methods are approximately equal (49% vs. 53%; some methods are compatible with both CT and X-ray images). However, we found that the proportions of classification and segmentation methods differ between the two: CT-based segmentation methods account for 20%, whereas X-ray-based segmentation methods account for only 3%. The reason might be the difficulty of the X-ray-based segmentation task. It can also be partially explained by the imbalance in dataset types: segmentation datasets make up a larger proportion of CT datasets than of X-ray datasets (40% vs. 29%).
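To make the dimensionality difference concrete, the sketch below contrasts a CT volume with an X-ray image as numpy arrays. The shapes are illustrative only, not taken from any cited dataset:

```python
import numpy as np

# A CT scan is a stack of slices: (slices, height, width).
ct_volume = np.zeros((64, 512, 512), dtype=np.float32)

# A chest X-ray is a single 2D image: (height, width).
xray_image = np.zeros((1024, 1024), dtype=np.float32)

# Treating CT slices independently discards inter-slice context:
independent_slices = [ct_volume[i] for i in range(ct_volume.shape[0])]
```

Methods that keep the full 3D volume can exploit inter-slice context, while slice-wise methods reduce the CT input to the same 2D form as an X-ray.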

Automated Diagnosis of COVID-19
Current automated diagnosis methods of COVID-19 can be divided into two categories: image-level diagnosis and pixel-level diagnosis. Image-level diagnosis refers to methods that predict a label (e.g., COVID-19/normal) from a given medical image. Pixel-level diagnosis can further provide the location of the lesion by predicting a segmentation mask. Since image-level and pixel-level methods are mostly based on different neural network architectures (i.e., image classification networks, such as ResNet [37], and image segmentation networks, such as U-Net [38]), we review each type of method separately in this section. Image-level diagnosis methods are first introduced in Section 3.1, and pixel-level methods are presented in Section 3.2.
The image-level diagnosis of COVID-19 can be formalized as follows: suppose we have a dataset with N samples, where X_i and Y_i are the ith input image and the class label of the corresponding image. Our goal is to learn a function f, which is usually a Convolutional Neural Network (CNN) [39], to predict the label from a given image accurately. In other words, the classification result can be written as Ŷ_i = f(X_i), and we expect a low prediction error d_i = Ŷ_i − Y_i for each i. This formulation is adopted by most researchers. Here, we first define a commonly adopted pipeline that consists of several steps, then introduce the techniques that have been used in each step in the following sections (Sections 3.1.2-3.1.5). As shown in Figure 4, this pipeline is a combination of the following steps: first, the lung scans (CT or X-ray) are preprocessed by data augmentation or lung segmentation; then, feature vectors are extracted by a Convolutional Neural Network (CNN) or another feature extractor; finally, the classifier predicts the label that corresponds to the input image.
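The pipeline above can be sketched as a simple function composition. The stage functions here are hypothetical stand-ins for illustration, not any cited model:

```python
def diagnose(image, preprocess, extract_features, classify):
    """Generic image-level pipeline: preprocessing -> feature extraction -> classification."""
    x = preprocess(image)            # e.g., augmentation or lung segmentation
    features = extract_features(x)   # e.g., a CNN backbone such as ResNet-50
    return classify(features)        # e.g., a softmax classifier

# Toy stand-ins for each stage:
label = diagnose(
    [0.2, 0.8],
    preprocess=lambda img: img,
    extract_features=lambda img: sum(img),
    classify=lambda f: "COVID-19" if f > 0.5 else "normal",
)
```

Any concrete model in this survey instantiates the three stages differently, but the overall composition is the same.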

ResNet-50 Backbone
In Figure 4, we use ResNet-50 as the feature extraction backbone, which is adopted by most researchers according to our statistics in Section 3.1.3. The ResNet-50 backbone can be divided into four stages, which contain three, four, six and three blocks, respectively. The detailed structure of a block is shown on the right side of Figure 4. Its identity shortcut connection is an important mechanism for ensuring that performance does not degrade as the number of network layers increases.
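The residual idea can be illustrated with a minimal fully-connected analogue. This is a sketch of the shortcut mechanism only, not the actual convolutional ResNet-50 block:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy residual block: the identity shortcut adds the input back,
    so the layers only need to learn the residual F(x) = H(x) - x."""
    out = relu(x @ w1)    # first transformation
    out = out @ w2        # second transformation
    return relu(out + x)  # identity shortcut, then activation
```

With all-zero weights the block reduces to relu(x), i.e., an (activated) identity, which is why stacking more such blocks does not inherently degrade the signal.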

Preprocessing
In the existing literature, researchers mainly use three types of preprocessing methods: data augmentation, image equalization and lung segmentation. Data augmentation can enlarge the dataset and prevent overfitting, equalization improves the image quality, and lung segmentation preserves only the region of interest (ROI), avoiding undesired interference from areas outside the lung.
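Basic data augmentation of the kind discussed here (flips, right-angle rotations, brightness scaling) can be sketched in numpy. The transformation set and parameter ranges are illustrative, not taken from any cited paper:

```python
import numpy as np

def augment(img, rng):
    """Apply a random flip, a random 90-degree rotation and brightness scaling."""
    if rng.random() < 0.5:
        img = np.fliplr(img)                             # horizontal flip
    img = np.rot90(img, k=int(rng.integers(0, 4)))       # rotate 0/90/180/270 degrees
    img = img * rng.uniform(0.8, 1.2)                    # brightness adjustment
    return img
```

Each training epoch then sees a slightly different version of every image, which increases variation without collecting new data.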
To avoid overfitting and address data imbalance, data augmentation is the most widely adopted method in the preprocessing stage. Rotating, flipping, scaling, cropping and brightness and contrast adjusting [40][41][42][43][44][45][46][47][48][49][50][51][52][53][54] are the simplest and most common data augmentation methods. For simplicity, in Table 3, we summarize the basic transformation-based data augmentation methods used by COVID-19 diagnosis models, along with the total number of papers that adopt each type of augmentation in the last row. It can be seen that rotating or flipping and scaling or cropping are the most widely adopted techniques. However, their augmentation strength is limited; for example, in the comparative experiment of Mizuho et al. [55], conventional data augmentation improved the diagnosis performance by only 4%. Therefore, researchers have applied other, more advanced data augmentation methods. To balance imbalanced data, Rahul Kumar et al. [56] applied the Synthetic Minority Oversampling Technique (SMOTE). Mehmet et al. [52] performed Zero-phase Component Analysis (ZCA) whitening to remove redundant information from input scans. Nour et al. [57] and Arvan et al. [58] used a Generative Adversarial Network (GAN) [59] and a Conditional Generative Adversarial Network (CGAN) [60], respectively, to generate virtual samples for data augmentation. Generative models can significantly increase the dataset size, but the quality of the generated samples is difficult to guarantee. The purpose of data augmentation is to prevent overfitting by increasing variation, but, in these virtual-sample methods, the discriminative lesion patterns might be lost or distorted if the model increases the variation too much.

Table 3. Basic transformation-based data augmentation methods. The last row summarizes the total number of papers that adopt the corresponding approach.

Paper | Rotating or Flipping | Scaling or Cropping | Brightness Adjusting | Contrast Adjusting
[47,49,52,61-71] | √ | - | - | -
[40,42-44,50,53,72-80] | √ | √ | - | -
[45,51,81] | √ | ...

In addition to the issues of insufficient or imbalanced data, there are also large image variations caused by different types of scanners. As shown in Figure 5, we can observe significant image variation across different CT scanners. To address this issue, Md et al. [95] and Oh et al. [96] performed histogram equalization on the images. However, histogram equalization can potentially harm image details or introduce unexpected noise. Md et al. [95] eliminated the noise by introducing a Perona-Malik Filter (PMF) [97], whereas other researchers [93,94,98,99] addressed the problem by applying Contrast Limited Adaptive Histogram Equalization (CLAHE).

Lung segmentation aims to preserve only the lung area. This is motivated by prior domain knowledge: COVID-19 is a type of viral pneumonia, and evidence of infection cannot lie outside of the lung. Lung segmentation can be conducted using pretrained lung segmentation models. Importantly, we note that, although segmentation models are used and segmentation masks are predicted in this step, this is fundamentally different from pixel-level diagnosis: lung segmentation is independent of identifying COVID-19 and only requires identifying the lung area, whereas pixel-level diagnosis needs to locate the exact area of the COVID-19 lesion.
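Histogram equalization, mentioned above as a remedy for scanner-induced intensity variation, can be sketched in numpy. This is the global variant; CLAHE additionally clips the histogram and operates on local tiles:

```python
import numpy as np

def equalize_histogram(img):
    """Global histogram equalization for an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    # Map each intensity so the output CDF is approximately uniform.
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[img]
```

An image whose intensities occupy a narrow band is stretched to use the full 0-255 range, which reduces contrast differences between scanners.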

Feature Extraction
As previously demonstrated in Section 2 and shown in Figure 2, scanning images of COVID-19 have certain characteristic manifestations, such as Ground-Glass Opacity (GGO) and a crazy-paving pattern distributed in certain zones of the lungs [16]. Feature extraction detects those discriminative lesion patterns. Most COVID-19 diagnosis models adopt a Convolutional Neural Network (CNN) for feature extraction, and most of them use existing network structures, such as ResNet [37], GoogLeNet [100], DenseNet [101], VGG [102], MobileNet [103], SqueezeNet [104], AlexNet [105], Capsule [106], etc. We summarize some popular CNN structures that have been used by COVID-19 diagnosis models in Table 4. Some researchers also proposed automatic network structure design methods to identify the best network structure for lung feature extraction. Wang et al. [155] used a generative synthesis approach to identify the optimal network architecture. Dalia et al. [46] applied a Gravitational Search Algorithm (GSA) to determine the best network architecture hyperparameters. Sivaramakrishnan et al. [137] developed an iterative pruning strategy to identify the optimal network structure. Model ensembling can also improve the overall performance. Lawrence et al. [118] and Umut et al. [45] performed model ensembling by voting and feature fusion. Md et al. [95] applied Softmax Class Posterior Averaging (SCPA) and Prediction Maximization (PM) for the model ensemble, and Rodolfo et al. [156] combined seven traditional feature extraction models with Inception-v3 to obtain better results. Mahesh et al. [145] assembled different CNNs using a stacked generalization approach [115] to further improve model performance. These models assume that different sub-models learn nonlinear discriminative features and semantic image representations at different levels; therefore, the combined model will be more robust and accurate.
In the beginning, trying existing CNNs is fast and convenient. However, these networks are designed for general image classification tasks, such as the ImageNet challenge, whereas radiologists diagnose COVID-19 by finding distinguishing local patterns. Some researchers therefore designed local methods to extract more discriminative features. For example, Umut et al. [45] and Oh et al. [96] used local patches to train the CNN feature extractor. To analyze local textural features, Chirag Goel et al. [157] used the Grey Level Co-occurrence Matrix (GLCM). However, lung infection areas may vary significantly in size, and local methods with a fixed patch size are unable to extract features of larger targets. Hu et al. [158] proposed multi-scale learning to overcome this deficiency: the network aggregates features from different layers to make the final decision. Similarly, Ying et al. [108] and Tan et al. [108] integrated ResNet or DenseNet with the Feature Pyramid Network (FPN) [159], a pyramidal hierarchy network structure for multi-scale feature extraction. In addition, a COVID-19 lesion in the lung is a 3D object, and slice-wise contextual information in CT images is lost by a conventional 2D feature extractor. To address this, Zheng et al. [41], Xi et al. [151], Wang et al. [61] and Chih-Chung Hsu et al. [78] proposed CNN structures with 3D convolution units to detect COVID-19. Han et al. [160] proposed an Attention-based Deep 3D Multiple Instance Learning (AD3D-MIL) algorithm that can predict the infection from multiple CT slices. Compared to conventional 2D methods that predict the infection from a single CT slice, 3D methods can make the diagnosis more accurate. In addition, some researchers also proposed methods for the post-processing of the extracted features.
For example, the authors of [151] used PCA to find the most influential features, and Jin et al. [154] used the ReliefF algorithm to rank the extracted features.
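The patch-based local methods mentioned above split each scan into fixed-size patches before feature extraction. A minimal numpy sketch, with illustrative patch size and stride:

```python
import numpy as np

def extract_patches(img, patch=64, stride=64):
    """Split a 2D image into square patches (non-overlapping by default)."""
    h, w = img.shape
    patches = [
        img[i:i + patch, j:j + patch]
        for i in range(0, h - patch + 1, stride)
        for j in range(0, w - patch + 1, stride)
    ]
    return np.stack(patches)
```

A CNN trained on such patches sees many local views per image, which emphasizes local texture over global layout; a stride smaller than the patch size would yield overlapping patches instead.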
In practice, radiologists also need to consider information such as epidemiology and clinical manifestations for diagnosis. Therefore, some methods combine auxiliary external information with visual features to improve the model. Wang et al. [161] combined clinical features, including age, sex and co-morbidity, with CNN features. Similarly, since the infected area usually lies near the edge of the lung, Xu et al. additionally provided the distance-from-edge information [107] of the local patch to the network. Shi et al. [162] and Sun et al. [163] calculated human-designed features, including the volume, number of infection lesions, histogram distribution, surface area and radiomics information.

Classification
Classification is used to produce a diagnostic prediction (such as COVID-19/non-COVID-19) from the extracted features. Most existing COVID-19 diagnosis models use a CNN as the feature extractor, and most of them use softmax as the classifier. Some researchers proposed improvements to the CNN-softmax scheme. For example, Wang et al. [1] combined softmax, decision tree and AdaBoost algorithms, and Zhang et al. [113] simultaneously performed softmax loss-based classification and contrastive loss-based anomaly detection to make the final decision. However, these deep models are black boxes and usually need large-scale training sets. In [45,109,164], researchers developed non-end-to-end models and took the Support Vector Machine (SVM) as the classifier. Comparative experiments on classification algorithms, including SVM, logistic regression, k-Nearest Neighbors (k-NN), Multi-Layer Perceptron (MLP), decision tree, AdaBoost, random forest, LightGBM [165] and bagging classifiers, have been carried out in [119,156,162,163]. Among them, the classifiers in [119,156] are for visual feature classification, whereas the classifiers in [162,163] are for hand-crafted clinical feature classification, in which the Least Absolute Shrinkage and Selection Operator (LASSO) [166] and deep forest [167] algorithms were used for feature selection.
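The softmax classifier mentioned above maps feature-derived logits to class probabilities; a numerically stable numpy sketch (the class names are illustrative):

```python
import numpy as np

def softmax(logits):
    """Stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict(logits, classes=("COVID-19", "non-COVID-19")):
    """Return the class with the highest softmax probability."""
    probs = softmax(np.asarray(logits, dtype=float))
    return classes[int(probs.argmax())]
```

Subtracting the maximum logit before exponentiation avoids overflow without changing the resulting probabilities.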

Evaluation
Researchers evaluated their proposed models with several metrics. The most commonly used metrics are accuracy and the Area Under Curve (AUC). Accuracy is the ratio of correctly classified samples. AUC is the area under the ROC curve, which plots the true positive rate against the false positive rate. The average accuracy and AUC of diagnosis models based on X-ray scans are 94.76% and 96.94, whereas those of CT-based models are 90.13% and 94.76. Theoretically, 3D CT scans contain more information than 2D X-ray scans, and CT can also avoid the occlusion of the ribs. However, X-ray-based models achieve better performance. We consider the reason to be that the large size of X-ray training sets helps these models, whereas CT scans are relatively more difficult to collect: the average training set size of X-ray-based models is 4185, whereas that of CT-based models is only 1417.
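AUC can be computed without plotting the ROC curve at all, via the equivalent Mann-Whitney rank statistic: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. A sketch:

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as P(score_pos > score_neg), counting ties as 1/2."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

This pairwise formulation also makes clear why AUC is insensitive to class imbalance in a way accuracy is not: it only compares positives against negatives, regardless of how many of each there are.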
Although the performance of existing models is relatively high (average accuracy of 93.59% and average AUC of 95.75), the size of the test sets is worth noting; in some models, the test set contains only a few COVID-19 samples. The average COVID-19/total ratio of test sets is 0.274:1, which is highly imbalanced (the ratio for training sets is 0.3:1, which is also imbalanced). In Table 4, we also color the table cells according to the number of samples in the training and testing datasets (green and red correspond to higher and lower than the average, respectively, and the saturation is correlated with the difference from the average). Some researchers reproduced the experiments with different datasets but achieved a significantly lower performance than originally reported [192]. The reason might be model overfitting and the lack of appropriate control of patients and ground truth labels. Moreover, the models in Table 4 are evaluated on different datasets, most of which are private or combined datasets. We think a proper benchmark test set is vital for further research in this area; experiments on the same benchmark test set would also help hospitals select diagnosis models. In addition, how to combine accuracy, AUC and other evaluation criteria, such as precision, recall and time complexity, to choose the best model for practical applications is still an open question [193].

Overview
The pixel-level diagnosis of COVID-19 can be formalized as follows: suppose we have a dataset with N samples, where X_i and Y_i are the ith input scanning image (CT/X-ray) and the corresponding binary lung lesion annotation. For each pixel of Y_i, zero represents the background, whereas one represents a lung lesion instance. The goal here is to learn a function f that predicts Y_i from X_i accurately. Similar to classification-based models, lung lesion detection can be written as Ŷ_i = f(X_i), and we expect a low prediction error d_i = Ŷ_i − Y_i for each i.
We present a typical segmentation-based COVID-19 diagnosis model in Figure 7. This model uses a U-Net for segmentation, which mainly consists of two parts: an encoder network and a decoder network. The encoder network has four downsample stages with convolution and pooling layers that analyze the contextual pixel information in the image to obtain semantic features. In each stage, the input tensor first goes through two convolutional layers with ReLU activation. The output of the convolutional layers is max-pooled with a kernel size of 2 × 2, reducing the spatial resolution by half. Let conv(·) denote the two convolutional layers, let pool(·) represent the max-pooling layer, and let L be the total number of stages. The output feature map of stage l of the encoder E can be formulated as follows:

E_l = pool(conv(E_{l−1})), l = 1, ..., L,

where E_0 is the input image. The decoder network consists of four upsample stages and recovers the resolution of the given input image. In each stage, a skip connection is built with the corresponding stage of the encoder: the output tensor of the previous decoder stage is upsampled and concatenated along the channel axis with the output tensor of the corresponding encoder stage. The upsampling is performed by nearest-neighbour interpolation. Then, two 3 × 3 convolutions with ReLU activation and same padding are applied. Let upsample(·) denote the upsampling operation and ⊕ the concatenation operation; the output tensor of stage l of the decoder D can be formulated as follows:

D_l = conv(upsample(D_{l−1}) ⊕ E_{L−l}), l = 1, ..., L,

where D_0 = E_L.
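As a sanity check on the shapes involved in the encoder and decoder, the following is a minimal numpy sketch of 2 × 2 max-pooling and nearest-neighbour upsampling (single channel, no learned convolutions):

```python
import numpy as np

def max_pool2(x):
    """2x2 max-pooling: halves both spatial dimensions."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour upsampling: doubles both spatial dimensions."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)
```

After L encoder stages, an H × W input is reduced to H/2^L × W/2^L; L decoder stages recover H × W, which is why each skip connection can concatenate encoder and decoder tensors of matching spatial size.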

Preprocessing
Similar to classification-based models, two preprocessing methods are mainly used: data augmentation and lung segmentation.
Data augmentation, such as random clipping, left-right and up-down flipping, mirroring, rotation, scaling, etc. [71,107,123,194], is of vital importance for training neural networks with high generalizability. In addition to these simple and common methods, Bo et al. [42] used a cubic interpolation approach for image normalization, and Chen et al. [194] minimized the influence of various random noises (e.g., words) on the segmentation. To deal with the imbalanced distribution of infection region sizes between COVID-19 and CAP, Xi et al. [123] developed a dual-sampling strategy to mitigate the imbalanced learning.
As in classification-based models, lung segmentation can reduce the interference from areas outside the lung and therefore boost model performance. Chen et al. [168] trained UNet++ to extract valid areas in CT images. Md et al. [47] employed an inception residual recurrent convolutional neural network with a Transfer Learning (TL) approach for COVID-19 detection and a NABLA-N network model for segmenting the regions infected by COVID-19. Shuo et al. [161] used a fully automatic DL model (DenseNet121-FPN) to segment lung areas in chest CT images; however, they found that some inflammatory tissues attached to the lung wall may be falsely excluded by the model. In addition, there are many other lung segmentation methods, such as VB-Net [123], U-Net [195], ANN [71], FCN-8s, V-Net and 3D U-Net++ [42]. Among them, VB-Net replaces the conventional convolutional layers in the up and down blocks with bottlenecks and achieves good and efficient segmentation results. U-Net is a fully convolutional network that uses skip connections to fuse the information of multi-resolution layers. V-Net uses a volumetric, fully convolutional neural network to achieve three-dimensional image segmentation.

Segmentation
There are two different segmentation tasks in COVID-19 diagnosis models: lung region segmentation and lung lesion segmentation. Lung region segmentation separates the whole lung region from the background, whereas lung lesion segmentation distinguishes the lesion region from other lung regions. The former is usually performed as a preprocessing step, as discussed above. Here, we focus only on the methods that have been used for lung lesion segmentation.
The V-Net-based segmentation model VNET-IR-RPN17 [107] was trained for pulmonary tuberculosis purposes, but it was verified to be good enough to separate candidate patches of viral pneumonia. Chen et al. [194] used aggregated residual transformations to learn a robust and expressive feature representation and applied a soft attention mechanism to achieve the automated segmentation of multiple COVID-19 infection regions. Wu et al. [196] trained a JCS system, which includes a segmentation branch trained with accurately annotated CT images to perform fine-grained lesion segmentation. Fan et al. [197] proposed a novel COVID-19 Lung Infection Segmentation Deep Network (Inf-Net) that can automatically identify infected regions from chest CT slices. Xi et al. [123] proposed a novel online attention module with a 3D Convolutional Network (CNN) to focus on the infection regions in the lungs when making diagnostic decisions. Rohit Lokwani et al. [195] built a 2D segmentation model using the U-Net architecture, which outputs the marked region of infection. Gao et al. [198] developed a Dual-branch Combination Network (DCN) for COVID-19 diagnosis that can simultaneously achieve individual-level classification and lesion segmentation. Yang et al. [199] proposed federated semi-supervised learning for COVID-19 region segmentation in 3D chest CT; the framework is designed to leverage unlabeled data for federated learning.

Evaluation
The most commonly used evaluation metrics for segmentation-related models are the accuracy and Area Under Curve (AUC). The average accuracy and AUC of diagnosing models based on X-ray scanning are 98.74% and 99.00%, while those of CT-based models are 91.68% and 95.37%. In fact, compared with X-ray, CT can examine lesions at a specific position in an organ in more detail. Nevertheless, the models based on X-ray scanning achieve a better performance here. We believe there are two reasons for this phenomenon. First, COVID-19 diagnosing models based on segmentation are far fewer than those based on classification, and among them X-ray-based models are fewer still, which makes the results less representative. Second, X-ray images are easier to collect than CT images: the average training set size of X-ray-based models is much larger than that of CT-based models (3843 vs. only 2723).
The performance of existing segmentation-related models is relatively high (average accuracy of 92.97% and average AUC of 95.89%). In the models we have collected, the average number of COVID-19 images in the test sets is 507 and the average total number of test images is 829, giving an average COVID-19/total ratio of 0.611:1. This ratio is relatively balanced; the more balanced the ratio, the better the model that can be trained and evaluated. Among these models, by comparing MiniSeg with state-of-the-art image segmentation methods, Qiu et al. [200] showed that MiniSeg not only achieves the best performance (an accuracy of 99.15%) but is also highly efficient. Wang et al. [42] proposed a "3D Unet++-ResNet-50" combined model, which achieved the best Area Under the Curve (AUC) of 0.991 among the four combined models mentioned in the text. In addition to the accuracy and AUC, sensitivity and specificity are also informative evaluation metrics: sensitivity/specificity measures the fraction of positives/negatives that are correctly identified as positive/negative, and is therefore also known as the true positive/negative rate.
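The metrics discussed above follow directly from the confusion-matrix counts. A minimal NumPy sketch (the labels below are made up for illustration):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate) and specificity
    (true negative rate) from binary labels, as defined in the text."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # positives found
    tn = np.sum((y_true == 0) & (y_pred == 0))  # negatives found
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false alarms
    fn = np.sum((y_true == 1) & (y_pred == 0))  # missed positives
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # fraction of positives correctly identified
        "specificity": tn / (tn + fp),  # fraction of negatives correctly identified
    }

# Hypothetical labels: 1 = COVID-19, 0 = non-COVID-19
m = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0],
                   [1, 1, 0, 0, 0, 0, 1, 0])
print(m)  # accuracy 0.75, sensitivity 2/3, specificity 0.8
```

For an imbalanced test set, reporting sensitivity and specificity together is more informative than accuracy alone, since accuracy can be dominated by the majority class.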

Datasets
Having sufficient and high-quality annotated training data is important for the design, implementation and evaluation of COVID-19 diagnosis models. In this section, we discuss existing datasets of COVID-19 scanning images. Tables 6 and 7 provide overviews of ten classification datasets and five segmentation datasets, respectively, including the size, scanning type, number of COVID-19 samples and total samples, data annotations and other categories besides COVID-19.

Classification Datasets
• SARS-CoV-2 CT-scan Dataset [201]. These data have been collected from real patients in hospitals in Sao Paulo, Brazil. The aim of this dataset is to encourage the research and development of artificial intelligence methods able to identify whether a person is infected by SARS-CoV-2 through the analysis of his/her CT scans. There are 2482 images in total, including gender information;
• COVID-CT-Dataset [202]. The COVID-CT-Dataset is a radiologist-confirmed CT image dataset. The images are collected from 760 COVID-19-related preprint PDFs in medRxiv and bioRxiv. The labels are decided according to the associated figure captions, while other information, such as age and gender, is also extracted;
• COVID-CT Dataset [63]. This dataset contains the full original CT scans of 377 persons, including other information, such as age and sex. It was gathered from Negin radiology, located in Sari, Iran, between 5 March and 23 April 2020. There are 15,589 and 48,260 CT scan images belonging to 95 COVID-19 and 282 normal persons, respectively. The exported radiology images are in a 16-bit grayscale DICOM format with a 512 × 512 pixel resolution;
• CT Scans for COVID-19 Classification [203]. Data were collected from two hospitals: Union Hospital (HUST-UH) and Liyuan Hospital (HUST-LH). There are a total of 39,370 CT images in JPG format with a resolution of 512 × 512;
• Large COVID-19 CT Scan Slice Dataset [204]. The dataset also contains information such as gender, age and country;
• COVIDx Dataset [155]. The COVIDx Dataset is a combined dataset. Its X-ray images are collected from more than five different data repositories, which include COVID-
• Augmented COVID-19 X-ray Images Dataset [206]. The Augmented COVID-19 X-ray Images Dataset is modified from two datasets, the Covid-Chestxray-Dataset and Chest-Xray-Pneumonia. There are a total of 3532 X-ray images in PNG format.
The images are augmented by basic augmentation methods, such as rotating, flipping, scaling and cropping;
• Covid-Chestxray-Dataset [207]. Data were largely compiled from public databases on websites such as Radiopaedia.org, the Italian Society of Medical and Interventional Radiology and the Hannover Medical School. Both X-ray and CT images are involved, with 930 images in total in JPG format. However, 43 of the 45 CT images are COVID-19-positive; this imbalance makes the dataset unsuitable to be used alone. The dataset not only provides lung bounding boxes, but is also annotated with other information, such as sex, age, location, survival, etc.

Segmentation Datasets
• [208]. This dataset also includes other information, such as age and gender;
• COVIDGR Dataset [211]. The COVIDGR Dataset is a balanced X-ray dataset that covers all levels of severity of illness, from normal with positive RT-PCR through mild and moderate to severe. Data were collected by an expert radiologist team of the Hospital Universitario San Cecilio, and there are 852 X-ray images in total;
• BIMCV COVID-19+ [212]. BIMCV COVID-19+ is a 389.27 GB annotated dataset that consists of both X-ray and CT images. Data were collected from public sources, including the COVID-CT-Dataset, the COVID-19 dataset and the COVID-19 RADIOGRAPHY DATABASE, as well as from some private datasets. There are 23,527 images in total, 23 of which were annotated by a team of expert radiologists. Two types of objects are annotated: ground-glass opacities (in green) and consolidation (in purple). Images are stored at a high resolution, and entities are localized with anatomical labels in a Medical Imaging Data Structure (MIDS) format.

COVID-19 CT Lung and Infection Segmentation Dataset
The dataset also contains other information, such as sex, age, diagnostics, survival, etc.

Discussion
Existing automated COVID-19 diagnosis methods have reported extraordinary performances. To demonstrate this point, we show histograms of these reported model performances in Figure 8. For the two most commonly used evaluation metrics, the accuracy and AUC, we calculate the proportion of reported values falling within each interval. The histograms are obtained independently for classification-based models and segmentation-based models. As we can see, the performance reported by most models lies in the range of (95%, 100%], regardless of the model type (classification or segmentation) and the metric used (accuracy or AUC). Such a performance surpasses that of human radiologist professionals by a large margin. However, after two years of development in this field, there are still very few models that can actually be applied in real-world clinical scenarios. From this perspective, the academic research on COVID-19 automated diagnosis has not been satisfying enough in this pandemic. So, what went wrong? In this section, we summarize the problems that have appeared in the research of applying AI to the automated diagnosis of COVID-19. We hope that if there is a need for emergent AI research and application in the future again, by avoiding these problems, the AI community can perform better next time.
Figure 8. We calculate the proportion of the reported model accuracy and AUC metrics that falls within each interval. Models with very high claimed performance (e.g., >95%) account for a high proportion.
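The interval proportions behind histograms like Figure 8 can be computed with a one-line binning step. The accuracy values below are hypothetical placeholders, not the actual values extracted from the surveyed papers (note that NumPy's bins are half-open on the left rather than the (a, b] intervals used in the text, which only matters for values exactly on an edge):

```python
import numpy as np

# Hypothetical reported accuracies (%); the real values come from the 179 surveyed papers.
reported_acc = np.array([88.0, 93.5, 96.1, 97.2, 98.4, 99.0, 99.3, 94.8, 96.7, 99.6])

bins = [80, 85, 90, 95, 100]                 # interval edges
counts, _ = np.histogram(reported_acc, bins=bins)
proportions = counts / counts.sum()          # fraction of models per interval

print(dict(zip(["80-85", "85-90", "90-95", "95-100"], proportions.round(2))))
```

With these placeholder values, 70% of the "models" fall in the top interval, mirroring the skew the figure illustrates.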

Biased Model Performance Evaluation
We think that the major problem in COVID-19 diagnosis model research is the lack of benchmarks. There is no well-recognized benchmark testing set (e.g., something like ImageNet [213] for image classification, MS-COCO [214] for object detection or Cityscapes [215] for semantic segmentation) that can be used to compare COVID-19 diagnosis models. As a result, different papers used different datasets, making their results not directly comparable. In addition, this gap has led to many papers that repeatedly benchmark existing deep models on different datasets. Most importantly, it makes it difficult for researchers to assess the effectiveness of the numerous newly proposed methods.

• Recommendation: Well-recognized institutions should establish benchmarks (with baseline models, a high-quality dataset and reproducible released code) as soon as possible.
In addition to the lack of a benchmark, each individual performance evaluation report is not fully reliable. Many papers used no validation set (which means that the authors directly tuned the hyperparameters on the test set), leading to the problem of data leakage. Moreover, in many studies, the testing set is too small. In Figure 9, we show the distributions of sample numbers (COVID-19 samples vs. total samples) in the testing sets of COVID-19 diagnosis models. For papers that did not explicitly declare the sizes of the training and test sets, we calculate them from the corresponding train-test split ratio. Unfortunately, most studies used a very small testing set: in more than 50% of the studies, the number of COVID-19 samples is smaller than 50. Several methods even used fewer than 10 COVID-19 samples for evaluation [111,116,153]. We think this is one of the core reasons for the over-optimistic reports. • Recommendation: The testing set should not be used for validation. In addition, the testing set should be sufficiently large; otherwise, it cannot give an accurate estimation of the model performance.
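A disjoint three-way split is the simplest guard against tuning on the test set. A minimal sketch (the sample count and fractions are arbitrary):

```python
import numpy as np

def train_val_test_split(n_samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Disjoint train/validation/test split by index. Hyperparameters are tuned
    on the validation set only, so the test set is touched exactly once.
    In medical imaging the indices should ideally be patients, not slices,
    so that slices of one patient never leak across splits."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    # first n_test indices -> test, next n_val -> validation, rest -> train
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]

train, val, test = train_val_test_split(1000)
print(len(train), len(val), len(test))  # → 700 150 150
```

Fixing the random seed also makes the split itself reproducible, which matters for the reproducibility issues discussed later in this section.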
Beyond the scale of COVID-19 diagnosis datasets, their quality is also problematic to some extent. Regarding data acquisition, several datasets are a combination of different datasets, which can lead to data leakage and thus an over-optimistic estimation of the model performance. Some datasets are collected from figures in the PDF files of published papers, while others convert the original DICOM files into PNG or JPG format; both practices degrade image quality [9]. Moreover, in many datasets, different classes of samples are collected from different sources, which leads to a high risk of model bias [11].

• Recommendation: When making a new dataset public, researchers should guarantee its quality and provide as much detailed information as possible.

Inappropriate Implementation Details
The major issue in the research of the automated diagnosis of COVID-19 is the lack of data, especially in the early stage (e.g., the first several months). In addition, many COVID-19 datasets are highly imbalanced. The most popular solution to this issue is data augmentation. However, as shown in Figure 10, some data augmentation techniques that work well for images of general objects are too aggressive for medical images. In addition, several studies tried to apply GAN-based data augmentation. Modern GANs have achieved impressive performances in generating realistic general images (e.g., objects, faces, etc.), but this does not mean that they are naturally a perfect choice for generating COVID-19 images. Moreover, there are few theoretical guarantees about the effectiveness of GAN-based data augmentation for medical images, and many open questions need in-depth investigation. • Recommendation: When addressing the lack of training data with data augmentation, the 'safety' of the selected data augmentation techniques should be considered.
Figure 10. Some data augmentation techniques that work well for images of general objects are too aggressive for medical images.
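One way to keep augmentation 'safe' is to restrict it to perturbations that preserve anatomy. The sketch below is an illustrative numpy-only pipeline under that assumption, not a recipe drawn from any of the surveyed papers; it applies only small translations and mild noise, and deliberately omits horizontal flips and large rotations, which can destroy anatomically meaningful structure (e.g., heart laterality in chest images):

```python
import numpy as np

def safe_augment(image, max_shift=4, noise_std=0.01, seed=None):
    """Conservative augmentation for chest scans: a small random translation
    plus mild Gaussian intensity noise. Inputs are assumed scaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    # np.roll wraps around the border; a real pipeline would pad instead.
    shifted = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
    noisy = shifted + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep intensities in the valid range

img = np.random.default_rng(1).random((64, 64))  # toy stand-in for an X-ray slice
aug = safe_augment(img, seed=42)
print(aug.shape, float(aug.min()) >= 0.0, float(aug.max()) <= 1.0)
```

Whether a given transform is safe ultimately depends on the modality and the diagnostic features involved, which is another point where clinician input helps.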
Another typical solution to the lack of a massive amount of training data is transfer learning. Among the existing works, 34 models adopted the transfer learning scheme: they pretrained a CNN on a larger image dataset (mostly ImageNet) and then fine-tuned the model with X-ray or CT scanning images. However, ImageNet contains images of general objects, which makes the convolutional filters learn patterns that will not appear in scanning images. Instead, transferring a model pretrained on a lung cancer dataset [161] or a conventional pneumonia dataset [140] can lead to a better performance. • Recommendation: When addressing the lack of training data with transfer learning, it is better to select the pretraining dataset carefully, or to consider using domain adaptation algorithms [169,216].
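The core idea of pretrain-then-fine-tune can be reduced to a warm start: initialize the target model with weights learned on a related source task instead of random (or zero) values. The deliberately tiny sketch below uses a logistic regression on synthetic data in place of a CNN, just to make the mechanism concrete; all datasets and dimensions are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, w_init, lr=0.5, epochs=300):
    """Plain gradient-descent logistic regression; `w_init` lets training start
    from weights learned on a related source task (the transfer-learning idea)."""
    w = w_init.copy()
    for _ in range(epochs):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])  # shared structure between source and target tasks

def make_task(n):
    X = rng.normal(size=(n, 2))
    return X, (X @ w_true > 0).astype(float)

Xs, ys = make_task(500)   # "source" task: plenty of related data
Xt, yt = make_task(20)    # "target" task: scarce COVID-19-like data
Xe, ye = make_task(500)   # held-out target evaluation data

w_src = train_logreg(Xs, ys, np.zeros(2))             # pretraining on the source
w_scratch = train_logreg(Xt, yt, np.zeros(2), epochs=30)
w_finetune = train_logreg(Xt, yt, w_src, epochs=30)   # fine-tuning from w_src

acc = lambda w: np.mean(((Xe @ w) > 0) == ye)
print(round(float(acc(w_scratch)), 3), round(float(acc(w_finetune)), 3))
```

The warm start only helps to the extent that the source task shares structure with the target, which is exactly the argument above for pretraining on pneumonia or lung cancer data rather than ImageNet.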

Low Reproducibility, Reliability and Explainability
Artificial Intelligence (AI) has made significant progress in the last decade. Open-source deep learning frameworks and public implementations of state-of-the-art methods make deep learning models easily accessible to the community. If newly designed deep neural network architectures, loss functions and pretrained model weights for COVID-19 diagnosis could be conveniently utilized, the technical barrier would be lowered and the research cycle accelerated. Unfortunately, among the 179 studies we covered, only 48 provide an official implementation (about 27%). This poses difficulties for researchers who come to this field and try to establish baselines for comparison. • Recommendation: If possible, upload clean code accompanying the posted papers. Prepare easy-to-follow documents that describe how to re-implement the proposed method.
If no official code is released, researchers can still reimplement the proposed methods according to the paper. Unfortunately, a proportion of papers miss important implementation details, such as the detailed data preprocessing procedure, neural network architectures, hyperparameter settings, the learning rate, batch size, number of total epochs, etc. Without these details, other researchers face great difficulty in reimplementing the methods and performing a fair comparison. Moreover, readers and reviewers cannot identify potential mistakes without these details.

• Recommendation: Authors should provide sufficient technical details of their proposed methods in order to guarantee reproducibility.
In existing classification-based methods, Class Activation Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM) [217] are adopted by many researchers [40,50,87,95,96,98,110,113-115,117,120,137,139,140,155,158,160,169,218] to output heatmaps that explain the final result and give an intuitive understanding of which area the model is focusing on. Ideally, these heatmaps can provide radiologists with more useful information and further help them. However, we found that most papers only present the output heatmaps but provide no analysis of them. One possible reason is that the authors lack radiologist experts to verify the correctness of the heatmaps. As already illustrated in [9], working as a multidisciplinary team and respecting the opinions of clinicians are important in automated COVID-19 diagnosis research. Otherwise, some AI researchers may fail to realize, from the heatmaps, the mistakes made by the model [10].
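The CAM computation itself is just a channel-weighted sum: each final convolutional feature map is weighted by the classifier weight that the target class assigns to its globally-averaged activation. A minimal NumPy sketch, with hypothetical feature maps and classifier weights in place of a real network (Grad-CAM generalizes this by deriving the channel weights from gradients, so it does not require a global-average-pooling head):

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Class Activation Mapping: weight each final conv feature map by the
    classifier weight of the target class and sum over channels.
    feature_maps: (C, H, W); fc_weights: (num_classes, C)."""
    cam = np.tensordot(fc_weights[class_idx], feature_maps, axes=([0], [0]))
    cam = np.maximum(cam, 0)          # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize to [0, 1] for display
    return cam

rng = np.random.default_rng(0)
fmaps = rng.random((8, 7, 7))         # hypothetical final conv features
w = rng.normal(size=(2, 8))           # hypothetical 2-class classification head
heatmap = class_activation_map(fmaps, w, class_idx=1)
print(heatmap.shape)                  # → (7, 7); upsampled to image size for overlay
```

Note that the heatmap is only as trustworthy as the model behind it, which is why expert review of the highlighted regions matters.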

• Recommendation: Work as a multidisciplinary team. Opinions from domain experts are valuable for evaluating the correctness of AI models.

Conclusions
In this paper, we review 179 medical imaging-based automatic diagnosis models of COVID-19. We first discuss the two types of input modalities and compare their differences from both the clinical perspective and the artificial intelligence perspective. Then, we divide existing methods into classification-based models and segmentation-based models, and define a universal pipeline for each. For each step in the pipeline, we review and analyze the adopted techniques in detail. Furthermore, a total of 10 COVID-19 datasets for classification-based models and 5 datasets for segmentation models are reviewed. Finally, we summarize three significant problems that emerged in the research of the automated diagnosis of COVID-19: a biased model performance evaluation; inappropriate implementation details; and a low reproducibility, reliability and explainability. We hope that, based on our corresponding recommendations, AI can play a better role next time.