A Literature Review on the Use of Artificial Intelligence for the Diagnosis of COVID-19 on CT and Chest X-ray

A COVID-19 diagnosis is primarily determined by RT-PCR or rapid lateral-flow testing, although chest imaging has been shown to detect manifestations of the virus. This article reviews the role of imaging (CT and X-ray) in the diagnosis of COVID-19, focusing on published studies that have applied artificial intelligence with the purpose of detecting COVID-19 or reaching a differential diagnosis between various respiratory infections. In this study, ArXiv, MedRxiv, PubMed, and Google Scholar were searched for studies using the search terms ‘deep learning’, ‘artificial intelligence’, ‘medical imaging’, ‘COVID-19’ and ‘SARS-CoV-2’. The identified studies were assessed using a modified version of the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement. Twenty-three studies fulfilled the inclusion criteria for this review. Of those selected, 11 papers evaluated the use of artificial intelligence (AI) for chest X-ray and 12 for CT. The size of datasets ranged from 239 to 19,250 images, with sensitivities, specificities and AUCs ranging from 0.789–1.00, 0.843–1.00 and 0.850–1.00, respectively. While AI demonstrates excellent diagnostic potential, broader application of this method is hindered by the lack of relevant comparators in studies, sufficiently sized datasets, and independent testing.


Introduction
Coronaviruses are a group of RNA viruses that give rise to respiratory-tract and intestinal infections [1]. The family gained high-pathogenicity status during the severe acute respiratory syndrome (SARS-CoV) outbreak of 2002-2003, and a new coronavirus emerged in Wuhan, Hubei province, China in December 2019 [2]. The virus was named 'SARS-CoV-2' and the disease it causes 'COVID-19'; as a result of its rapid spread, it was declared a pandemic by the World Health Organization (WHO) in March 2020 [3]. As of 3 January 2022, there have been a total of 291,721,552 cases worldwide, a figure that continues to increase each day [4]. The most commonly used diagnostic test is the nasopharyngeal swab for reverse-transcriptase polymerase chain reaction (RT-PCR). However, RT-PCR has lower than optimal sensitivity. At day 1 after exposure, RT-PCR has a false-negative rate of 100%; by day 4 this falls to 67%, and it reaches 38% by the time of symptom onset [5]. More recently, as mass testing has emerged, rapid lateral-flow tests have been used to detect COVID-19. The sensitivity of these tests depends on the skill of the individual performing them: laboratory scientists achieve a sensitivity of 79%, whereas self-trained members of the public achieve 58% [6].
It is pivotal that a diagnostic test demonstrates a high sensitivity rate, particularly for COVID-19, so that the infected individual is directed to self-isolate, thereby reducing transmission [7]. At present, RT-PCR testing is the only approved method to detect COVID-19 [8]. It has been reported that medical imaging can also be used to detect manifestations of the virus.
As an official framework with which to assess AI studies is yet to be published, each study in this review was assessed using a modified version of the Transparent Reporting of a Multivariable Prediction Model (TRIPOD) statement [14]. This statement allows for the reporting of studies that develop, validate, or update a predictive model for diagnostic or prognostic purposes. TRIPOD assesses the quality of a study in six areas (title and abstract, introduction, methods, results, discussion, other information). This includes adequate reporting of the study context, purpose (e.g., validation or development), source of data, information about participants, sample size, handling of missing data, and statistical analysis. Further, the adequate reporting of model development, performance, and validation, as well as limitations and study funding, is assessed.
The modified TRIPOD statement used in this review assesses 12 of the 22 items that are most relevant to AI studies [15]. This modified statement applied the following items: title, background and objectives, source of data, participants, outcome, sample size, model performance, interpretation, implications, supplementary material, and funding. The outcome is summarized in Tables 1 and 2.
In addition to the TRIPOD assessment, an additional clinical-relevance score was applied to all the included studies [16]:
2. Potentially clinically relevant but not evaluated against a relevant comparator and lacks independent testing.
3. Potentially clinically relevant and has demonstrated value against a relevant comparator but lacks independent testing.
4. Potentially clinically relevant and has demonstrated value against a relevant comparator and has been independently tested.
5. Criteria 1-4 fulfilled and ready for implementation.
Only papers with a score of 2 or higher were included in this study.

Study Selection
The electronic search resulted in 312 studies; when duplicates were removed, this number became 309. A total of 192 studies were excluded as irrelevant based on the title and abstract evaluation, and the remaining 117 papers were assessed in full for inclusion, from which a further 94 were excluded (See Figure 1: PRISMA flow chart detailing exclusion criteria). Once the evaluation was complete, 23 studies remained.

General Characteristics
Tables 1 and 2 detail the general characteristics of all 23 studies included in this review, and Tables 3 and 4 summarize the findings of the studies. The majority of the studies received no external funding (9/23), or funding was not disclosed (9/23); of those studies that did receive external funding, four were funded by national health bodies and one was commercially funded.

Aim and Methodology
All of the studies applied deep learning with image input for diagnosing COVID-19. The studies could be further divided according to the following objectives:
1. The detection of and screening for COVID-19 (binary classification).
2. Forming a differential diagnosis between COVID-19 pneumonia and other viral or bacterial pneumonia (multiclass classification).

Reference Standard and Comparator
The reference standard for the studies was varied; 15/23 studies used a ground truth label based on a radiologist's annotation, 2/23 used RT-PCR test results to assign labels, and 6/23 used a mix of both radiologist review and RT-PCR results.
Out of the 23 studies, four compared the performance of AI with a relevant comparator, i.e., a radiologist with varying years of experience.

Validation and Testing
Validation methods are used to assess the robustness of a proposed model. Internal validation utilizes data from the original training source, while external validation tests the performance of the model on a dataset from a new, independent source. All studies applied internal validation, with the majority operating in a train-and-test format that divided the dataset in two; in addition, some studies performed k-fold cross-validation. Out of the 23 studies, seven performed external validation using an independent dataset.
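The two internal-validation schemes described above are straightforward to express in code. The following is an illustrative sketch using generic index-based splitting in plain Python; it is not code taken from any of the reviewed studies:

```python
import random

def train_test_split(items, test_frac=0.2, seed=0):
    """Shuffle a dataset and divide it into train and test partitions."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def k_fold_indices(n, k=5, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation.

    Each sample appears in exactly one validation fold, so every image
    is used for both training and validation across the k rounds.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val
```

In a k-fold scheme the model is trained k times, each time holding out a different fold, and the k validation scores are averaged; this makes better use of a small dataset than a single train-and-test split, which is one reason several of the reviewed studies adopted it.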

Clinical Relevance and Main Findings
All 23 studies were scored using the previously mentioned clinical-relevance score. A relevance score of 2 was the most common, as most of the studies lacked a relevant clinical comparator and compared the AI performance only against other algorithms.
Out of the 23, 11 studies used chest X-rays to assess for COVID-19; all of the chest-X-ray studies scored 2 in clinical relevance due to the lack of a relevant comparator, and only 1/11 [17] performed external validation on an independent dataset.
The remaining 12/23 studies assessed the use of chest CT, of which 4/12 [9] included a relevant comparator; these four papers therefore scored higher than 2 on the clinical-relevance score. Of these four, three papers [30] were allocated a score of 4 as they also included independent testing.
Studies using X-ray found overall good detection rates of ground-glass opacities (GGOs) and lobular consolidation, from which a diagnosis of COVID-19 could be deduced. The studies demonstrated high diagnostic accuracy, with a sensitivity range of 79-100% and a specificity range of 92-99%. However, none of the X-ray studies included a relevant comparator, so it cannot be assessed whether the algorithms were on par with the diagnostic ability of a radiologist. Among the studies using CT, four included a relevant comparator [9]. The results of these studies are described below and in Table 7.
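For reference, the sensitivity and specificity figures quoted throughout this review follow directly from the binary confusion matrix. A minimal sketch, using hypothetical labels rather than data from any included study:

```python
def confusion_counts(y_true, y_pred):
    """Tally the binary confusion matrix (1 = COVID-19 positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """Proportion of true COVID-19 cases the model flags (true-positive rate)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of non-COVID-19 cases the model clears (true-negative rate)."""
    return tn / (tn + fp)
```

Both metrics are properties of the model alone, independent of how many positives appear in the test set; this distinction matters when interpreting the pooled figures reported below.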
In the four studies that included a relevant comparator, the AI algorithm generally outperformed the radiologist. Only [34] reported one instance in which the radiologist outperformed the algorithm, namely in the binary classification of pneumonia versus non-pneumonia. The average experience of the radiologists in these four studies ranged from 6-11 years, with a mean of 8.6 years. The use of an algorithm generally demonstrated an increase in both sensitivity and specificity. Two studies [9] supplied information about the time taken for the algorithm to assess a dataset or image versus that of the relevant comparator; both reported much faster evaluation times, with the algorithm performing up to 142 times quicker than a human.
When model performance was compared between the two imaging modalities using medians and interquartile ranges, the performance of models using X-ray was more consistent (see Figure 2, a boxplot demonstrating the smaller interquartile range). This may be due to the relative ease of 2D image processing and the absence of a need for segmentation layers in the network or for slice selection. A test of the significance of the difference between the performance of the model and that of the radiologist was only provided by [34], where no significant difference was reported for the model used in the study.

Discussion
The aim of this study was to provide an overview and evaluation of the literature published thus far on the utilization of AI on CT/chest X-ray for the diagnosis of COVID-19.
The selected papers utilized deep-learning techniques with transfer learning, which assisted in making up for small-dataset limitations. Those studies that utilized transfer-learning methods were able to achieve high accuracies in diagnosis by employing previous image-analysis algorithms. All studies were assessed for potential bias, scored for clinical relevance, and evaluated using a modified TRIPOD assessment [15].
Compared to most AI studies in chest X-ray/CT, the datasets of those included in this review had significantly smaller populations; the average pathological dataset (in this case, COVID-19-positive images) for chest X-ray was 340 images (range: 120-500, median: 360). Similarly, for chest CT the average dataset size was 985 images (range: 181-3084, median: 820). With fewer data available, the algorithms may not be trained or tested as thoroughly as in other AI studies. The average proportion of COVID-19 images in the datasets was 34%, meaning that most of each dataset, 66%, comprised non-pathogenic or alternative-pathology images. This can potentially lead to an overestimate of the sensitivity and positive predictive value of the algorithms, as the proportion of positive cases in a typical clinical setting is likely to be much smaller.
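The effect of prevalence on positive predictive value (PPV) can be made concrete with Bayes' theorem. The sketch below uses the approximate average sensitivity and specificity reported in this review (93% and 94%) and a hypothetical community prevalence of 5%, which is an assumed figure for illustration, not one taken from the included studies:

```python
def ppv(sens, spec, prev):
    """Positive predictive value, P(disease | positive test),
    for a test with given sensitivity/specificity at a given prevalence."""
    true_pos = sens * prev                    # expected true positives
    false_pos = (1 - spec) * (1 - prev)       # expected false positives
    return true_pos / (true_pos + false_pos)

# At the ~34% prevalence typical of the reviewed datasets:
print(round(ppv(0.93, 0.94, 0.34), 2))  # 0.89
# At a hypothetical 5% community prevalence, the same model's PPV drops:
print(round(ppv(0.93, 0.94, 0.05), 2))  # 0.45
```

Sensitivity and specificity are prevalence-independent, but PPV is not, which is why a dataset enriched with positives flatters a model's apparent predictive value.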
All the included studies were retrospective. There is a lack of prospective testing, and no clinical trials were identified in the search. This is likely due to the recency of the COVID-19 pandemic rather than the newness of deep learning, as clinical studies are more prevalent in other applications of DL. One challenge common to all the studies was the small datasets available. As COVID-19 has only existed since the latter part of 2019, there are relatively few images of COVID-19 patients available at individual institutions and in public databases. A few of the studies even used the same databases (Table 6). This is a weakness, as algorithms trained on a certain dataset may not perform equally well when applied to different data [36]. For example, when data come from only one demographic region, a model may not perform as well on different demographics, which emphasizes the need for independent testing. This risk of bias is further enhanced by the lack of external validation among the papers reviewed.
Smaller datasets make it difficult to assess the reproducibility of algorithm performance. While some studies pooled data from several public datasets, this decreases transparency with regard to the origin and character of the image data. Another review of the application of AI to diagnose COVID-19 reported that the high risk of bias in most of its papers was a result of the small sample size of COVID-19 images [22]. However, small datasets are not exclusive to AI studies in COVID-19; an AI study on pulmonary nodules assessed images from an average of only 186 patients [16].
Images included within the studies have been sourced from numerous public repositories and taken from publications [32]. It is likely that these images show extreme and interesting cases of COVID-19 that may be easier for the algorithm to detect. Further, in several studies datasets were expanded via image augmentation, generating additional transformed copies of existing images. Out of the 23 studies, only 6 performed independent testing with an external dataset (26%). Externally validated studies achieved an average sensitivity and specificity of 93%, while the averages in studies that did not externally validate their model were 92% and 94%, respectively. Thus, the externally validated models seem to work equally well on hitherto unseen data, which is reassuring. This lack of external validation has also been reported in another recent review on the topic [37], in which only 13/62 (21%) of the studies assessed their algorithms on independent datasets. High performance in external testing suggests that a model is generalizable to other patient populations; in its absence, it is difficult to tell how the model will perform if transitioned into clinical practice.
Yousefzadeh et al. performed external validation using data that the model had previously classified into multiple classes, but with the images redivided into binary groups [9]. There was one example where the model's generalizability was tested by feeding a dataset of exclusively asymptomatic patients through the model. This selection of images was far more likely to be representative of those found in the community [25], permitting a sound assessment of the model when reviewing images of less-extreme cases of COVID-19. Validation of a model on a set of low-quality images from recently published papers was also performed in order to assess the stability of the model [32]; such a model could potentially be used with poorer-quality machines or with images captured at poor resolution or contrast.
Most studies lacked a relevant comparator. In this instance, the relevant comparator is not a PCR test but a human comparator (i.e., a radiologist) assessing the same image. Omitting a human comparator can cause the performance of the AI model to appear better than it is. Thus, it is important to contrast the performance of a new AI system with current practice prior to implementation, in order to assess how the model may best serve clinical practice. Only four of the 23 studies included a human comparator. In each of these studies the radiologists' experience varied, which consequently influenced the degree of success the algorithm was perceived to have. Comparators with more experience rivaled the performance of the AI more closely, whereas comparisons against junior radiologists may inflate the perceived capability of the model. It is important for these studies to establish whether the system is intended to aid trainees, non-experts, or specialists. The average reader experience in the study by Mei et al. was just over five years greater than that in the paper by Liu et al., yet both yielded similar diagnostic accuracies. This suggests that the experience of the radiologist may not influence the ability to diagnose COVID-19, as it is a new pathology with new disease manifestations.
Studies will often aim for their algorithm to outperform a radiologist; however, an algorithm can still be of use even if it does not outperform an experienced radiologist. AI can still lower the clinical burden by performing tasks with similar accuracy at faster speeds. Studies by Yousefzadeh [9] and Jin [35] both assessed the speed at which their algorithms could perform, and both were much quicker than human analysis. In general, the included studies tended to pit AI against human interpretation, when perhaps a synergistic approach would have yielded greater benefit. Incorporating AI into a COVID-19 diagnosis could mean faster, more accurate diagnoses that incorporate various pieces of clinical information.
There are several limitations to this review. Although a comprehensive search was performed for papers assessing the use of AI in reaching a COVID-19 diagnosis, it is possible that not all relevant papers were included. As publications on COVID-19 are recent and further studies are published at high speed, this review cannot claim to be fully up to date. In addition, a number of the papers were still pre-prints at the time of writing.
This review highlights a number of biases present in the literature, e.g., small sample sizes, the potential for image duplication, differing image quality, and the inclusion of extreme cases in datasets, as also discussed by Roberts [37]; these biases limit the ability to translate AI into clinical practice.
As COVID-19 continues to pose a significant threat to health, more people require both screening and testing. RT-PCR remains the current 'gold standard' for diagnosis; however, it is limited by turnaround time, with test results taking anywhere between 3 h and 72 h depending on the price paid or the priority assigned. Rapid testing in the form of lateral-flow tests can bridge limitations in turnaround time and PCR supply, but it is unreliable as a basis for diagnosis in primary care. A diagnosis in health care must be accurate in order to direct isolation protocols and triage. AI programs have the potential to serve as an accurate and rapid diagnostic aid.
AI can be developed to analyze the same findings that experienced radiologists extract. In addition, AI can detect manifestations of disease that may not be obvious to the human eye, in turn increasing the sensitivity of image review. Once the limitations of small datasets, the lack of relevant comparators, and the absence of clear standard reporting have been overcome [16], the use of AI can be extended beyond formulating a diagnosis: it can also make predictions about the course and severity of the disease. Some AI models can match new presentations with similar cases they have previously assessed and share information about the disease course and outcome. It is essential that the sensitivity rates of these AI models are high in all instances, in order to ensure that there are no false negatives and that everyone needing to self-isolate is informed to do so. AI is also able to monitor the long-term manifestations of lung diseases. If AI can be implemented alongside 'traditional' methods of diagnosis, then perhaps faster, definitive, and accurate instructions can be determined for self-isolation protocols and for identifying patients at high risk.

Conclusions
This review summarizes the published research on AI systems for the detection of COVID-19 on CT and chest X-ray. The presented studies report promising results for the automated diagnosis of COVID-19 by both modalities using deep-learning methods. However, while AI shows promising diagnostic potential, this area of research suffers from small datasets as well as the lack of a relevant clinical comparator and external validation, giving rise to a high risk of bias that limits transferability into clinical practice. Thus, future research should include relevant clinical comparison and external validation in order to increase the likelihood of new AI systems being deployed where they are of the greatest benefit to patients.