Comprehensive Survey of Using Machine Learning in the COVID-19 Pandemic

Since December 2019, the global population has faced the rapid spread of coronavirus disease (COVID-19). With the accelerating number of infected cases, the World Health Organization (WHO) has reported COVID-19 as an epidemic that puts a heavy burden on healthcare sectors in almost every country. The potential of artificial intelligence (AI) in this context is difficult to ignore. AI companies have been racing to develop innovative tools that help arm the world against this pandemic and minimize the disruption that it may cause. The main objective of this study is to survey the decisive role of AI as a technology used to fight the COVID-19 pandemic. Five significant applications of AI for COVID-19 were found: (1) COVID-19 diagnosis using various data types (e.g., images, sound, and text); (2) estimation of the possible future spread of the disease based on the current confirmed cases; (3) association between COVID-19 infection and patient characteristics; (4) vaccine development and drug interaction; and (5) development of supporting applications. This study also presents a comparison of current COVID-19 datasets. Based on the limitations of the current literature, this review highlights the open research challenges that could inspire future applications of AI in COVID-19.


Introduction
The first coronavirus was detected among humans in 1960 and was known as a human coronavirus (HCoV) [1]. It caused mild diseases of the lower and upper respiratory tract that led to acute respiratory failure in some cases [2]. The situation became more serious in 2003 with the appearance of severe acute respiratory syndrome coronavirus (SARS-CoV) in China [3]. At that time, nearly 1 million people were affected by SARS-CoV, with a mortality rate of 9.5%. The spread of this virus was stopped by isolating the infected people and detecting the causes of infections. Subsequent experiments in wild animals have shown that SARS-CoV exists in cats and bats [4]. Therefore, it was believed that the virus spread to humans from bats and cats, and then spread from human to human [5]. The situation has remained stable

• We discuss the detailed characteristics of COVID-19 symptoms, behaviors, and patterns.
• We investigate the role of automated analysis and diagnosis of COVID-19 based on the WHO statistics worldwide.
• We propose a taxonomy for using AI, big data, and statistics in COVID-19 diagnosis, prediction, and treatment. Based on this taxonomy, a comprehensive survey of current AI literature is provided.
• We collect the details about all available COVID-19 datasets (i.e., textual data, medical images, and speech data).
• We explore the limitations of the current literature of AI applications in the COVID-19 domain and draw the directions for future improvements that could handle these challenges.
The rest of the article is organized as follows. Section 2 introduces the taxonomy of using AI for classifying COVID-19. Section 3 presents a survey of the literature for using AI in a COVID-19 context. Section 3 shows a comparison among COVID-19 datasets. Section 4 is a discussion of the results discovered from studying the literature. Limitations of the current solutions and future directions are introduced in Section 5, and the paper is concluded in Section 6. Table 1 includes all terms and their abbreviations.

Figure 1. Taxonomy of using AI in COVID-19.

Diagnosis Using Medical Images
Although medical images, such as those from CT scans and X-rays, could provide valuable pathological information, only the qualitative assessment is written in the radiological report. This is due to the lack of computerized tools that measure the infected areas and their changes. Therefore, the changes across the medical images are often ignored. On the other hand, contouring the infected areas in the CT scan is recommended for quantitative evaluation. Unfortunately, manual contouring is time-consuming, tedious, and may lead to discrepancies in the assessment. With this in mind, fast and automated contouring tools for COVID-19 medical images are urgently needed to face the fast-growing COVID-19 pandemic. The following subsections survey the ML and DL models used for automatic contouring, segmentation, and classification of COVID-19 medical images for disease diagnosis.

Diagnosis Using CT Chest Scans
Several studies have developed DL models for COVID-19 identification and diagnosis, mainly based on CT chest images, with promising results [19]. For example, the authors in [20] proposed a DL model to extract visual features from CT chest scan images. The study used the extracted features to differentiate between COVID-19 and other pneumonia diseases. However, the proposed system was not able to characterize the progression of COVID-19 disease. Ahuja et al. [21] developed a CNN model to analyze and detect COVID-19. The model depended on extracting and specifying opacities in the lung images, and it achieved a sensitivity of 92.21% and a specificity of 98.50%. The system is considered robust in terms of pixel spacing. Jaiswal et al. [22] provided a DL model for CT segmentation and detection of COVID-19 infection. Xue et al. [23] performed a similar task by developing a classification model to discriminate COVID-19 from other non-pneumonia cases, with an accuracy of 86.30%. In [24], Ozturk et al. proposed a 3D CNN model to distinguish COVID-19 patients from normal ones using chest CT images and other images of viral pneumonia. First, the infected regions were segmented from a CT chest scan using the 3D CNN model. Then, these separated images were categorized using the location attention model. Finally, the noisy-OR Bayesian function was used to calculate the confidence score.
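The noisy-OR aggregation mentioned for [24] can be illustrated with a minimal sketch. It assumes the standard noisy-OR form, in which a scan is positive unless every region is independently negative; the function name and the per-region scores are hypothetical and are not taken from [24].

```python
from math import prod


def noisy_or(probabilities):
    """Combine per-region infection probabilities into one scan-level score.

    Standard noisy-OR: P(scan positive) = 1 - product of (1 - p_i).
    """
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
    # The scan is negative only if every region is independently negative.
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical per-region scores for one CT scan (illustrative values only).
region_scores = [0.10, 0.30, 0.50]
scan_score = noisy_or(region_scores)
```

A single high-probability region dominates the result, which is why this rule suits scans where one infected region is enough to flag the case.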
Due to the limited access to COVID-19 datasets, several studies reported that pre-trained models and transfer learning became the most effective techniques to build diagnosis and prediction models for COVID-19 [25][26][27]. For example, Jaiswal et al. [22] utilized deep transfer learning to build a classification model for chest CT scans using the DenseNet201 pre-trained model. A total of 1260 CT images from COVID-19 patients and 1232 CT chest images from healthy patients were used to train and test the DenseNet201 model. The proposed model achieved promising results in terms of precision, recall, F-measure, and accuracy, at 96.20%, 96.20%, 96.20%, and 96.21%, respectively. In [28], Pathak et al. also used transfer learning and the ResNet50 pre-trained model to build a 2D classification model that separates infected CT chest images from normal images. The proposed model achieved a training accuracy of 96.32% and a testing accuracy of 93.11%. However, the model takes a long time to train. Wang et al. [29] proposed a segmentation and classification model for CT scans, with a pipeline divided into two main steps. First, the segmentation step was based on DL models (i.e., U-Net, 3D U-Net++, and V-Net). Second, the classification step used pre-trained models (i.e., ResNet-50 and DPN-92). The model was evaluated using CT chest scans of 732 cases and resulted in a classification model with an AUC of 99.01%. In [30], Weng et al. developed a model that analyzed the changes in the CT chest images of infected patients. They developed a CNN model that utilized an Inception pre-trained model and the transfer learning technique to build an effective model for diagnosis. This model achieved an accuracy of 89.66% and saved time.
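The metrics quoted throughout this subsection (precision, recall or sensitivity, specificity, F-measure, and accuracy) all derive from the four cells of a binary confusion matrix. The sketch below shows those standard definitions; the counts are hypothetical and do not reproduce any surveyed study.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)          # also called sensitivity
    specificity = safe_div(tn, tn + fp)
    f_measure = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical counts: 105 COVID-19 scans and 95 normal scans.
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
```

Reporting sensitivity and specificity alongside accuracy matters here because many of the surveyed datasets are imbalanced.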
Other studies tried to overcome the shortage of CT datasets by training the model on various types of pneumonia. For example, in [31], Cheng et al. proposed a multiclass deep CNN model. The system was built on more than ten thousand CT chest images from four categories: influenza, non-viral pneumonia, COVID-19, and non-pneumonia subjects. The proposed system was evaluated on 1940 samples, with an AUC, sensitivity, and specificity of 95.76%, 90.10%, and 97.16%, respectively. The same procedure has been followed by [26,28,32-36].
In [37], Farid et al. proposed a model that predicted recurrences in both COVID-19 and SARS cases. They composed a hyper-feature extraction technique from four main filters, namely, a Gabor filter, an MPEG-7 histogram filter, fuzzy-64, and a local binary histogram. Then, they built a hybrid classification technique of CNN and ML models to achieve a high prediction accuracy. The proposed model enhanced the performance and reduced the false-positive rate after applying feature optimization techniques. The model was evaluated using only 51 images extracted from the Kaggle benchmark dataset. Evaluating the model on such a small dataset does not guarantee its generalization ability.
Some studies tried to examine the relationship between CT scans and symptoms. For example, Brahmin et al. [38] analyzed 121 CT chest images of positive COVID-19 cases. They found that the prevalence of symptoms and signs of disease increased with time from onset. In [39], Xueyan et al. proposed a COVID-19 prediction system that integrated CT chest scan images, patient demographics (e.g., age, weight, and sex), clinical symptoms (e.g., fever, cough, and sputum), and laboratory tests (e.g., WBC, lymphocytes, and neutrophils). The authors reported that including patient symptoms and laboratory tests gave the classification model a better performance, with a sensitivity of 84.34% (confidence interval [CI] = 77.1%, 90.0%; p = 0.662), compared to a CNN model that used CT chest images only, which achieved a sensitivity of 82.6% (CI = 76.2%, 89.4%; p = 1). The p-value clarifies the significant difference with respect to the integrated model. Table 2 lists more COVID-19 classification models based on CT images. Chest CT scan-based detection of COVID-19 is considered difficult, as patients need to be moved to the CT room with a danger of radiation, and machines need a high level of cleaning after each use. Therefore, a CT chest scan is not recommended as the main identification tool for COVID-19.
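A confidence interval for a sensitivity figure, like those quoted from [39], can be computed from the number of correctly detected positives and the total number of positives. The sketch below uses the Wilson score interval. This is an assumption for illustration only, since the source does not state which interval method [39] used, and the counts are hypothetical.

```python
from math import sqrt


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")
    p_hat = successes / total
    denom = 1.0 + z * z / total
    center = (p_hat + z * z / (2.0 * total)) / denom
    half_width = (
        z * sqrt(p_hat * (1.0 - p_hat) / total + z * z / (4.0 * total * total))
        / denom
    )
    return center - half_width, center + half_width


# Hypothetical example: 84 of 100 true COVID-19 cases detected.
low, high = wilson_interval(84, 100)
```

The interval narrows as the number of positive cases grows, which is one reason results from very small test sets should be read with caution.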
From Table 2, we could notice the following: (1) 60% of the studies built binary classification models, 34% built multiclass classification models, and 6% used object detection techniques to detect COVID-19; (2) 48% used transfer learning to overcome the shortage of data; (3) studies that built binary classification models achieved better results than those that built multiclass classification models; and (4) 66% of the studies used DL models for COVID-19 classification, whereas 34% used conventional ML models (i.e., SVM, RF, DT, etc.). The best results were achieved when combining a GAN with the ResNet pretrained model [47]. This is because the pretrained model is used to fine-tune the network parameters, while the GAN provides a more robust model and overcomes the overfitting problem.
Diagnosis Using X-ray
COVID-19 radiological analysis is a common and cost-effective technique for COVID-19 detection, especially in the intermediate stage of the disease. Medical experts in [48] reported that X-rays of COVID-19 patients presented no change in the early stages of the disease. However, with disease progression, patchy infiltrates in the lower and upper zones of the lungs are commonly observed in X-ray images. Moreover, digital X-ray images do not need any physical transportation from the point of acquisition to the point of analysis, making the diagnostic operation extremely fast. In addition, portable X-ray machines allow testing within an isolation ward. These machines minimize the need for additional personal protective equipment and the risk of hospital-acquired infections for patients. Therefore, several recent studies utilized X-ray in COVID-19 diagnosis. The main goal of this subsection is to discuss the state of the art of COVID-19 diagnosis and detection based on X-ray images.
Several recent studies applied different ML and DL techniques to diagnose COVID-19 based on radiographic imagery. For example, Elisha et al. [49] provided an ML model for COVID-19 diagnosis. The model was used to examine patients' similarities according to the X-ray images. It was trained using 1384 COVID-19 patients with ages ranging from 18 to 63 and tested using 350 images. The accuracy reached 89.7%, and the AUC reached 94.0%. Other researchers utilized pre-trained models to improve the model performance. For example, in [50], the authors provided a diagnostic model for COVID-19 using transfer learning. Thirteen pre-trained models, such as VGG, AlexNet, and ResNet, were used to extract features from 380 X-ray images; then, SVM was used for the classification. The authors reported that ResNet with SVM gave the highest accuracy of 95.33% in 22 independent executions. In [51], Shi et al. provided a diagnosis model called infection size aware random forest (ISARF). This model was built based on 1685 X-ray images from COVID-19 patients and 1027 from patients with pneumonia. They used VB-net to identify the lesion size and categorized the images into four main groups. Finally, RF was used to provide the final classification decisions. The model provided an accuracy, specificity, and sensitivity of 87.9%, 83.3%, and 90.6%, respectively. Kiran et al. [52] presented a multi-image augmented model using a CNN. This model enhanced the COVID-19 detection process based on chest X-ray and chest CT scan images. The main objective of this study was to provide medical experts with a more accurate diagnosis system, as the integration of X-rays and CT scans eases the detection of changes in human lungs with zero false-positive and false-negative rates. The model was trained on 19 COVID-19 cases and 50 non-COVID-19 cases. The classification accuracy reached 99.44% for X-ray images and 95.38% for CT scan images.
Kevser and Ferhat [53] presented a DL transfer learning technique for detecting COVID-19 based on chest X-ray images. The authors utilized various pre-trained models, such as VGG19, VGG16, ResNet, DenseNet, and InceptionV3. They reported that VGG16 gave the highest classification accuracy, 80%, compared with the other four models. The same procedure has been followed in [35,54]. The authors in [54] used five pre-trained models, including ResNet50, InceptionResNetV2, and Xception. These models were trained on 5857 chest X-rays and 767 chest CT images. The classification accuracy was 84% for X-ray and 75% for CT scan. Table 3 lists more classification models based on X-ray images.

From the previous table, we could notice the following: (1) 33% of the studies built binary classification models, 46% built multiclass classification models, and 8% used anomaly detection for COVID-19 classification; (2) 68% used transfer learning to fine-tune the network parameters for a limited-size dataset; (3) ML models were used in 46% of the studies, whereas 44% used DL models, and using feature optimization techniques with DL models enhances the detection and classification process [64]; and (4) using data augmentation increases the size of the available dataset and therefore enhances the classification accuracy [52,65]. Using both X-rays and CT scans increases the performance of the classification model. The best performance is obtained when using object detection with a pretrained darknet model [24].

Diagnosis Using Ultrasound
Ultrasound (US) identification is an indoor positioning system (IPS) that is utilized to automatically detect and define the location of objects in real time with high accuracy. This is done by attaching nodes to the surface of persons and objects, which then transmit an ultrasound signal to communicate their locations to microphone sensors [66]. Ultrasound is already used for various lung diseases, such as pneumonia and lung cancer [67]. The authors in [68] presented a survey of the ultrasound findings from many research studies. Ultrasound has been suggested as an effective method for diagnosis, especially in low-income countries with limited resources. Therefore, US has started to become the first-line examination instead of X-ray for COVID-19. However, the literature on the applicability of US in COVID-19 diagnosis is still limited. For example, the approach proposed in [69] utilized lung US to identify suspected COVID-19 patients. The essential goal was to investigate the identification of COVID-19 during the initial outbreak. In the outcome, 41% of patients were COVID-19 positive, 67% of whom were diagnosed with CP. The approach achieved 95%, 61%, and 90% in terms of accuracy, specificity, and sensitivity, respectively. In [70], the authors used 2,392,963 frames extracted from 64 videos. These videos were grouped into three different categories (COVID-19, healthy, and pneumonia). The VGG-16 pre-trained model was used, followed by hidden layers (dense, dropout, batch normalization, and an output layer with a SoftMax activation function), to identify COVID-19. The study resulted in a classification model with an accuracy of 89% and a sensitivity of 96%. Lung US is also used to specify the duration of symptoms. In [71], the authors used data from 28 patients (14 male and 14 female) with a positive COVID-19 infection to investigate the utilization of US in specifying symptom duration and disease severity. They reported that a thickening of the pleural line was observed more often in patients with a long duration of the disease than in those with a shorter disease duration. Pulmonary consolidation was also more commonly observed in critical-case patients than in moderate-case patients. One of the main challenges in using US in COVID-19 diagnosis is the quality of the US frames. This is due to the low penetration of the sound waves, which may result in noisy and low-resolution frames. This limitation motivated researchers to develop techniques that help improve the quality of US images, such as noise filtering, wavelet deconvolution [72], and contrast-limited adaptive histogram equalization (CLAHE) [73]. More classification models based on US images are listed in Table 4. From this table, we could notice the following: (1) only a few studies utilized ultrasound for COVID-19 detection; (2) 80% of the studies used DL and pretrained models to classify the images; (3) studies extracted image frames from ultrasound videos; and (4) the best performance was obtained when using the pretrained VGG model followed by hidden layers trained on a large number of frames [70].

Diagnosis Using Respiratory Data
Respiratory data in conjunction with ML and DL could help in detecting and diagnosing COVID-19 through three main approaches [75,76]: (1) using cough sounds to classify positive and negative COVID-19 cases; (2) screening COVID-19 patients using breathing sounds and breathing rates; and (3) using patient speech to detect COVID-19 symptoms, including stress, anxiety, and fatigue. These speech datasets could also be used in remote diagnosis, monitoring, and screening of COVID-19 patients through telemedicine applications [27,77]. Kranthi et al. [78] provided a comprehensive survey of using respiratory data for COVID-19 diagnosis.
Using the cough sound in COVID-19 diagnosis was motivated by several key findings, including the following: (1) Several studies have shown that cough sounds from different diseases have distinct features, which could be used to train sophisticated AI models for diagnosis and detection [36][37][38][39][40]. This finding was confirmed by the meta-analysis in [41], which reported that COVID-19 sound data include unique features that could be used in COVID-19 diagnosis and do not overlap with those of other respiratory infections. In [76], the authors confirmed that the chest data they aggregated through stethoscope examination were used for COVID-19 diagnosis. (2) The WHO [79] reported that coughing is a common symptom among 67.7% of COVID-19 patients and is considered to be the main source of infection.
Based on these findings, recent studies explored how the cough sound is collected from patients via various devices and used these data for COVID-19 diagnosis. For example, the authors in [80] provided an early effort in creating a breathing sound dataset for COVID-19. These data include the sounds of coughing, breathing, and voice. The sounds were collected using website applications to enable sound-based diagnosis of COVID-19. In [81], Dunne et al. utilized three different datasets for diagnosis: (1) Google's Audioset (http://archive.is/MZMRJ) (Last access date: 17 February 2021), aggregated from YouTube videos (non-COVID-19); (2) the Coswara dataset (COVID-19); and (3) data collected at Stanford University (https://github.com/virufy/covid) (Last access date: 17 February 2021). In [82], the authors developed a mobile application that analyzed the patient's cough sound and provided COVID-19 identification within 2 min. They built a DL model based on 328 cough sounds aggregated from 150 patients in four categories (bronchitis, asthma, COVID-19, and healthy). The model was able to differentiate between COVID-19 cough sounds and the other sounds with an accuracy of 98%. In [83], the authors depended on cough samples aggregated over the mobile phone from 3620 COVID-19-positive cases and built an application for COVID-19 diagnosis (known as . The study explored transfer learning techniques to overcome the shortage of COVID-19 cough training data. It utilized the ResNet18 pre-trained model to build a classification model and achieved promising results (AUC = 97.0%, specificity = 94.6%, and sensitivity = 98.5%). In [76], Brown et al. reported that respiratory sounds can be used to distinguish COVID-19 respiratory sounds from normal sounds. They used a simple binary classifier and achieved an AUC of 80%. Speech recordings from COVID-19 patients have been analyzed to categorize a patient's health status [84]. Faezipour et al. [77] depended on sound data aggregated through web and Android interfaces to build breathing tests for COVID-19 diagnosis. They reported that this would be effective, especially with the rapid increase in the number of required diagnostic tests. In [76], the authors used both cough and breathing sounds to distinguish between COVID-19 and healthy sounds. They built three binary classifiers: one for separating COVID-19-positive cases from healthy individuals, one for distinguishing COVID-19-positive cases from asthma cases, and one for classifying COVID-19-positive and healthy cases who have a cough. They achieved an AUC of 82%, 81%, and 80% for these classification tasks, respectively.
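The AUC values reported for these binary classifiers can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes AUC directly from that pairwise definition. The labels and scores are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier outputs for six cough recordings (1 = COVID-19).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking, so values near 80% indicate useful but far from perfect separation.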

Estimation of Disease Spread
Since the first confirmed case in 2019, the number of confirmed COVID-19 cases worldwide has increased rapidly, reaching 86.7 million cases, including 1.87 million deaths, by January 2021. Determining the future severity of the outbreak is considered one of the main keys to planning against this pandemic [85,86]. In this subsection, we survey the studies that analyze the epidemic status and measure the reproduction number and exponential growth using statistical and DL models. Such studies help prepare for the potential spread and reveal the significance of strict health measures to manage the COVID-19 pandemic.
Compartmental models are the most common models used for studying the spread of pandemics [85]. In these models, the population is assigned to specific labels, such as susceptible-infected-recovered (SIR) [87], susceptible-exposed-infected-susceptible (SEIS), etc. [88]. Such models used stochastic frameworks to forecast specific measures, such as the total number of infected people, the infection rate, and estimated epidemiologic parameters (i.e., the reproduction number), and to show how public health strategies impact the epidemic outcome. For example, in the SIR model [89], the susceptible population is assumed to be the whole population of the region minus the people that were previously infected by the disease. The infection rate is a function that uses both the number of infections and the rate of transmission to estimate the infected population in each period. The SIR model has been used in several studies to estimate the expected growth of COVID-19. For example, the authors in [90] used the SIR model to measure the effect of social distancing in reducing the spread of infection. They tested the model with different levels of social distancing to estimate the expected spread after reopening. Another study, conducted at the beginning of the pandemic [91], used the susceptible-exposed-infected-confirmed-removed (SEIQR) model, which was built upon the SIR model, to estimate the growth of COVID-19 in Wuhan, China. This study reported that the lockdown in China would help limit the spread in the rest of the world. Similarly, in [91], the authors reported that travel restrictions help in reducing the spread of infection from Wuhan to the rest of the world. Hazhir et al. [92] used the susceptible-exposed-infected-recovered (SEIR) model to estimate the transmission of COVID-19 in 84 different countries. This model tracked the infection transmission rate due to the travel network for each country. SEIR was also used to forecast the pandemic peak in Japan [93].
The SIR and SEIR models were used in [94] to compute the transmission rate from person to person, from animal to person, and vice versa. Another study [95] was conducted in Egypt to predict the time of the peak and study the changes in Egyptian behavior during Ramadan based on the SIR and SEIR models. The study measured the spread of the infection. In [96], the authors used a DL model to estimate the risk of COVID-19 spreading outside China. In [97], the authors utilized the logistic growth model to estimate the time and size of the COVID-19 peak in South Korea and China.
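To make the SIR mechanics described above concrete, the following sketch integrates a deterministic SIR model with a simple Euler scheme. It is a textbook formulation, not the stochastic implementation of any surveyed study, and the parameter values (transmission rate `beta`, recovery rate `gamma`, population size) are hypothetical.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, dt=0.1):
    """Euler integration of the deterministic SIR model.

    Returns a list of (day, susceptible, infected, recovered) tuples.
    """
    # Susceptible = whole population minus those already infected.
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(0, s, i, r)]
    steps_per_day = round(1.0 / dt)
    for day in range(1, days + 1):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((day, s, i, r))
    return history


# Hypothetical scenarios: baseline contact rate vs. reduced contact rate
# (a crude stand-in for social distancing).
baseline = simulate_sir(1_000_000, 10, beta=0.30, gamma=0.10, days=300)
distanced = simulate_sir(1_000_000, 10, beta=0.15, gamma=0.10, days=300)
peak_baseline = max(i for _, _, i, _ in baseline)
peak_distanced = max(i for _, _, i, _ in distanced)
```

In this formulation the basic reproduction number is `beta / gamma`, so halving `beta` lowers it from 3 to 1.5 and flattens the infection peak.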
Other studies tried to estimate the future spread based on the basic and effective reproduction numbers (R0, Re) only. In epidemiology, the basic reproduction number R0 is the expected number of cases that are directly infected, on average, by one confirmed case [98] in a population where all individuals are susceptible to infection. On the other hand, the effective reproduction number (Re) is the number of infected cases at a specific time and in a specific environment; therefore, it is also known as Rt (Rtime) [99]. In [100], Salihu et al. estimated the expected growth and reproduction rate (R0) in Africa. Africa is considered one of the most affected regions with coronavirus in the Middle East. Trade relations with China have played a major role in aggravating the risk of African countries' exposure to infections and the spread of COVID-19 in a way that is difficult to counteract, especially given their reputation for having fragile state health systems. Salihu et al. [100] analyzed the epidemic between 1 March and 12 April 2020 using the growth estimation function [101]. This estimated the exponential growth per day at 0.22 (95% CI: 0.20-0.24) and the reproduction number at 2.37 (95% CI: 2.22-2.51). In [102], the authors depended on SEIR data of susceptible, exposed, infected, and recovered stocks that summarized the population groups and the changes in screening, diagnosis, and contact rate to measure the expected growth. The study resulted in a reproduction number of 2.6. In [103], the authors used Markov Chain Monte Carlo (MCMC) to estimate the reproduction number and rate based on the number of confirmed cases and deaths. The estimation resulted in an Re of 3.36 (94% CI: 3.20-3.64). In [104], the authors studied the correlation between weather and COVID-19 spread in Indonesia. Abdallah et al. [105] tried to estimate the epidemic spread in Kuwait using stochastic modeling, and the same procedure has been applied in Iraq [106,107] and Egypt [108,109].
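The link between a daily exponential growth rate and a reproduction number can be sketched as follows. The code fits the growth rate by least squares on log case counts and then converts it with the SIR relation R = 1 + r·D, where D is the mean infectious period. Both the SIR assumption and the 6-day infectious period are illustrative choices; this is not the estimator used in [100] or [101], and the case counts are synthetic.

```python
from math import exp, log


def exponential_growth_rate(daily_counts):
    """Least-squares slope of log(count) against day index."""
    if len(daily_counts) < 2:
        raise ValueError("need at least two observations")
    if any(c <= 0 for c in daily_counts):
        raise ValueError("counts must be positive to take logarithms")
    xs = list(range(len(daily_counts)))
    ys = [log(c) for c in daily_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


def reproduction_number_sir(growth_rate, infectious_period_days):
    """SIR relation between growth rate r and R: R = 1 + r * D."""
    return 1.0 + growth_rate * infectious_period_days


# Synthetic counts growing at exactly 0.22 per day (illustrative only).
counts = [10.0 * exp(0.22 * day) for day in range(14)]
rate = exponential_growth_rate(counts)
r_number = reproduction_number_sir(rate, infectious_period_days=6.0)
```

Because the conversion depends on the assumed infectious period and model structure, the same growth rate can map to noticeably different reproduction numbers across studies.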
DL models were also used to track the spread of COVID-19 infection in terms of time and space. First, some studies utilized respiratory patterns to predict tachypnea, as it is the first diagnostic feature that could be common among large numbers of COVID-19 patients. In [110], Yunlu et al. used a bidirectional gated recurrent unit (GRU) to predict tachypnea based on smartphone data. Second, researchers used DL models to predict the risk level. In [111], Yanfang et al. introduced an AI system (known as α-satellite) to provide hierarchical geographic risk assessment at different community levels. A DNN was applied to large-scale real-time data aggregated via smartphone sensors to estimate the risk level [112]. The aggregated data were then used in the development of an effective strategy to combat the rapid increase of the pandemic. An LSTM model was used to predict the pandemic trend in Canada [113]. Shawni et al. [114] used a combined technique of LSTM and GRU to measure the negative and positive of the release and death cases of COVID-19.
Despite the importance of such studies in facing the COVID-19 pandemic, the risk of underestimation is still high for several reasons [85]: (1) the symptoms of the disease overlap with those of other diseases, so a large number of people with mild symptoms (similar to flu or a cold) are not identified, and thus some who have died due to COVID-19 infection will not be recognized; (2) the variation in the number of tests across countries results in imprecise estimations; and (3) population density, interaction, and lifestyle result in variations in reproduction numbers. Therefore, estimation should depend not only on statistical approximations, such as R0 and Re, but also on other factors such as socioeconomic status, population behavior and awareness, and the quality of the healthcare system in each country.

Blood Type
The susceptibility of specific blood types to viral infections has been previously studied for various diseases. For example, Hepatitis and Norwalk were confirmed to be related to specific blood groups [122,123]. On that basis, researchers studied the relationship between blood type and the risk of COVID-19 infection. In [116], the authors analyzed the relationship between ABO blood type and the risk of COVID-19 infection. The ABO blood type denotes the presence of the A and B antigens on erythrocytes. This study surveyed the blood tests of 23,386 patients in Wuhan, China. The results showed that group A was correlated with a higher risk of infection than the other blood types. The association was assessed with statistical tests (e.g., the Chi-squared test) and reported with a 95% confidence interval. The same results were reached in [124]. A few studies have analyzed the association between Rh (positive and negative) and COVID-19 disease [115,125,126]. In [126], the authors reported that a positive Rh is more protected against latent toxoplasmosis.
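As an illustration of the kind of test mentioned above, the following sketch computes a Pearson Chi-squared statistic and its p-value for a 2x2 table of blood group (A versus non-A) against infection status. The counts are hypothetical and are not taken from [116]; the function name is also an assumption made for this example, and no continuity correction is applied.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson Chi-squared test (1 degree of freedom, no continuity correction)
    for the table [[a, b], [c, d]]; returns (statistic, p_value)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the Chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: rows are infected / control, columns are group A / other groups.
stat, p_value = chi2_2x2(40, 60, 30, 70)
print(f"chi2 = {stat:.3f}, p = {p_value:.3f}")
```

With these made-up counts the difference in the share of group A between the two rows is not statistically significant at the 5% level, which shows why large samples such as the one in [116] are needed to detect modest associations.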

Age
In the current pandemic, the association between patient age and the risk of COVID-19 infection and death has been the subject of much speculation. Most articles reported that older age is one of the main risk factors for infection and mortality [127]. In [128], the authors analyzed data from 20 European countries and reported that the R² value ranged from 0.766 to 0.803 for patients above 75. Another study measured the infection rate and case fatality rate (CFR) among the population [129] and observed that Italy had the highest CFR of 9.3, followed by the Netherlands with a CFR of 7.4, for patients more than 70 years old. The study concluded that there is a strong relationship between age and fatality rate among COVID-19 patients. The same conclusion was reached by [130,131]. Table 5 shows the COVID-19 statistics according to patient age [130].

Gender
The biological (sex) differences between men's and women's bodies influence the risk of COVID-19 infection and the death rate. To characterize and address these differences, several studies analyzed the distribution of infections by gender. In [132], the authors reported that there is gender inequality among COVID-19 infections. These differences may be due to biological differences (e.g., comorbidities and immunity) or sociocultural factors (e.g., the number of tests for males and females, the timeliness of medical support, etc.). In [133], the authors reported that the proportion of deaths due to COVID-19 is significantly higher in males than in females. In [134], the authors reported that a patient's gender might influence the risk of infection, and that the immune response led to worse results in terms of recovery from infection. Figure 2 shows the statistics for males and females in terms of infections, hospitalizations, admissions, and deaths. These statistics were built from the dataset available at https://globalhealth5050.org (https://globalhealth5050.org/the-sex-gender-and-covid-19-project/dataset/, access date: 10 February 2021).

Obesity
Obesity is an indicator of high risk for various diseases (e.g., diabetes and heart disease) [135]. It has been associated with COVID-19 severity, admissions, and fatality rates [136]. An analytical study [137] of 16,000 COVID-19 patients conducted in the UK reported that obesity is associated with COVID-19 death, with a hazard ratio (HR) of 1.33. In [138], the authors analyzed data from 6000 COVID-19 patients and found a J-shaped curve between obesity and mortality. Another study, conducted in Latin America [139], reported a higher risk of infection for people with a body mass index (BMI) > 30 kg/m². This risk increased among lower-income people, who already have a higher risk of complications due to healthcare shortages.

Smoking
Smoking damages the lungs and weakens the immune system [140], which makes it harder for smokers to fight off respiratory diseases such as COVID-19 [141]. According to a WHO scientific report [142], around 9.7% of COVID-19 patients are active smokers or have a smoking history. Giving up smoking gives the lungs a chance to clear and repair themselves, improving the prospects of a faster recovery. In [143], the authors surveyed the association between smoking, history of smoking, and COVID-19 severity. The study analyzed 16 articles that address this relation. They concluded that there is a positive association between a history of smoking and COVID-19 infection (odds ratio (OR) = 1.51; 94% CI: 1.11-2.04; p < 0.008), and between active smoking and COVID-19 infection (OR = 2.18; 94% CI: 1.27-3.45; p < 0.001). In another study [112], the authors compared different smoking histories (active smokers, non-smokers, and former smokers). They reported that 19.07% of COVID-19 patients are smokers.
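For readers unfamiliar with how an odds ratio and its confidence interval are obtained from raw counts, the following is a minimal sketch using the standard Wald interval on the log scale. The counts are hypothetical and unrelated to [143], which pooled several studies in a meta-analysis; the function name and the choice of z = 1.96 (a 95% interval) are assumptions for illustration.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: exposed with outcome,    b: exposed without outcome,
    c: unexposed with outcome,  d: unexposed without outcome.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR).
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts: smokers (30 severe, 70 non-severe), non-smokers (20 severe, 80 non-severe).
odds_ratio, lower, upper = odds_ratio_ci(30, 70, 20, 80)
print(f"OR = {odds_ratio:.2f} (CI: {lower:.2f}-{upper:.2f})")
```

An interval that includes 1, as in this made-up example, would not support an association; the intervals reported in [143] lie entirely above 1.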

Medical Comorbidities
Many reports found a high association between COVID-19 and other severe diseases, such as diabetes, hypertension, acute kidney injury, etc. In [133], Wang et al. conducted a meta-analysis including 1570 patients with COVID-19 infection. The study indicated that patients with severe illness were more likely to have respiratory diseases (OR = 3.42 (1.89 to 6.11)), hypertension (OR = 2.66 (1.46 to 3.82)), and cardiovascular disease (OR = 3.44 (1.44 to 3.82)). Another study [144] analyzed the risk factors of death among COVID-19 patients. The study identified other chronic diseases as negative markers in COVID-19 infections, such as diabetes (33.31%), hypertension (35.16%), chronic kidney disease (17.87%), and diseases of the circulatory system (22.53%). They also compared the death rate among COVID-19 patients with that of other chronic disease patients. They reported a mortality rate 22 times higher for kidney disease patients, 10 times higher for patients with hypertension, and 14 times higher for patients with diabetes. Table 6 shows the correlation between medical comorbidities and the risk of COVID-19 infection according to the WHO reports [145]. Other researchers focused on analyzing organ complications due to COVID-19 infection. For example, in [146], the authors surveyed studies of organ complications and showed that about 3.75% of COVID-19 patients had abnormalities in liver enzymes, 10% developed acute kidney injury, and 23% were afflicted with heart problems. Researchers in [147] developed a DL model to analyze the relationship between mortality and other medical comorbidities. They concluded that medical comorbidities are highly associated with mortality, with percentages of 2.56%, 10.3%, 41.0%, and 6% for heart rate problems, respiratory disease, hypertension, and diabetes, respectively; the same trend was found in [148][149][150][151][152].
More details about the correlation between comorbidities and severe diseases are available in [153,154].

Environmental Factors
Several studies addressed the relationship between environmental factors and the spread of COVID-19 infection. For example, Aabed et al. [155] investigated the impact of weather, population density, and intra-provincial traffic. They found a positive correlation between infection rate and population density and a negative correlation with social isolation and temperature. The same results were found in [156]. Others investigated the effect of building operation factors and found that most infections occurred in indoor environments [157]. Another critical factor that influences the spread and course of the disease is rapid access to diagnosis. Difficulties in access may be found in developing countries and in urban areas with high population densities, where the use of public transport and prolonged stays in indoor environments lead to the spread of contagion. These scenarios of inadequate health coverage have been mapped by comparing the quality of access to care with the general development conditions of the territory [158].

Using DL in Developing Vaccines
Since the outbreak of COVID-19, clinicians and virologists worldwide have been urged to fight this pandemic by searching for precise and accurate drugs or vaccines. The situation worsened with the significant increase in infections [159]. Unfortunately, drug discovery using traditional technologies is a complex process known to take many years. AI techniques can reinforce and improve traditional technologies by accelerating drug discovery, screening, and validation. AI can also speed up the pace by extracting useful data for drug repurposing [160]. The following subsections detail the role of AI in drug repurposing, drug discovery, and vaccine discovery.

Drug Repurposing
Drug repurposing, which relies on previously approved drugs, is an effective way to mitigate pandemics. It speeds up the response to the pandemic and accelerates clinical trials [161]. Therefore, it is considered the best route to an effective and faster drug against COVID-19 [162]. Several studies [163][164][165] utilized ML and DL techniques, including LSTM, CNN, etc., to search for active antivirals among previously known drugs. Four main approaches, namely, docking simulation, ligand prediction, gene expression, and biomedical knowledge graphs (BKGs), have been developed to achieve this goal. The following subsections discuss these four approaches in detail. Figure 3 shows the general method of using AI in drug repurposing.

Biomedical Knowledge Graph
BKG is a basic technique that is used to aggregate data from heterogeneous resources [166][167][168]. It is also used to capture the relations between entities, such as viral proteins and drugs, pairs of genes, etc. For example, Richard et al. [169] utilized a BKG to identify Baricitinib. Baricitinib is a drug used in arthritis therapy and is considered a promising treatment for COVID-19, because it inhibits the protein kinase enzyme, which makes it difficult for the virus to infect host cells. Recent studies showed two main techniques for graph construction. First, in [170], the authors utilized a pipeline consisting of a three-part neural network and a tree search approach to understand the interactions between all molecules. Second, in [171], the authors utilized a BKG to describe the relations between gene-disease pairs. Others, in [172], utilized ML and statistical analysis techniques to integrate and mine many BKGs, showing the relations between viral proteins, human proteins, and previously known drugs. These graphs have been used to predict effective drug candidates against COVID-19.
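The idea of capturing relations between entities can be sketched with a toy knowledge graph stored as (head, relation, tail) triples and a breadth-first search that looks for a chain of relations from a drug to a disease. The triples below loosely paraphrase the Baricitinib example in the text; the entity names, relation labels, and the extra `drug_x` entry are hypothetical, and real BKGs such as those in [169-172] are far larger and are mined with learned models rather than a plain graph search.

```python
from collections import defaultdict, deque

# Toy triples (head, relation, tail); names and relation labels are illustrative only.
TRIPLES = [
    ("baricitinib", "inhibits", "protein_kinase"),
    ("protein_kinase", "facilitates", "viral_entry"),
    ("viral_entry", "leads_to", "covid19_infection"),
    ("baricitinib", "treats", "arthritis"),
    ("drug_x", "treats", "arthritis"),  # hypothetical drug with no link to the virus
]


def build_graph(triples):
    """Index triples by head entity as lists of (relation, tail)."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search; returns [entity, relation, entity, ...] or None."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [relation, nxt]))
    return None


graph = build_graph(TRIPLES)
path = find_path(graph, "baricitinib", "covid19_infection")
no_path = find_path(graph, "drug_x", "covid19_infection")
print(" -> ".join(path))
```

A drug that reaches the disease node through a short chain of relations is a candidate for closer inspection, whereas `drug_x` has no such chain in this toy graph.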
In [173], the authors extracted 2045 human proteins that are known drug targets from DrugBank. A multitask ML model was then used to determine the relationship between the known drug targets (KDTs) and the COVID-19 circuits that conform to the disease. The results showed that 380 KDTs have a direct relation with the COVID-19 circuits. In [174], the authors used a deep graph neural network to extract candidate drug representations according to biological interactions. They demonstrated that combining DNNs with extensive interaction data could facilitate the identification of candidate drugs. In [175], the authors utilized an integrative DL model named CoV-KGE to discover candidate drugs. First, the authors built a knowledge graph of 15 million edges covering 39 types of relationships, which were extracted from 24 million PubMed publications. They concluded that CoV-KGE had a high performance in identifying repurposable drugs, with an AUROC of 0.85.
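Because several of these studies report performance as an AUROC, a short sketch of how that number can be computed may help. The function below uses the rank-based definition: the AUROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The scores and labels are made up for illustration and have no relation to the results in [175].

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores for six candidate drugs; 1 marks a known effective drug.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
value = auroc(scores, labels)
print(f"AUROC = {value:.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of positives from negatives.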

Protein-Ligand Prediction
Ligands are molecules that bind to target proteins. In [176], the authors used multitask neural networks to predict affinities based on a database of 4600 various drugs; the developed model identified 10 promising drugs together with their affinity scores. In similar research [177], the authors used a CNN model to identify inhibitors of the 3C-like protease (the main protease in coronavirus) based on the BindingDB (BDB) database [178], with the aim of finding an effective treatment targeting this protein. In [179], the authors also developed a template model of the 3C-like protease, and then applied a mathematical DL model to identify its inhibitors. This model relied on two different datasets (84 SARS inhibitors from the ChEMBL database and 15,843 protein affinities from BindingDB) [178]. The study identified a list of promising COVID-19 drugs from the DrugBank database.

Molecular Docking (Docking Simulation)
Docking is another approach that has been used for drug repurposing, in which each ligand interacts with all proteins in different conformations and orientations. This results in the generation of several poses (known as binding modes). These poses are then utilized to predict the ligand's affinity [178]. Since docking simulation techniques are computationally expensive, some studies tried to narrow the pool of candidates that need to be docked using ML and DL techniques. For example, in [180], the authors trained a neural network on 3 million candidates (3C-like protease inhibitors) extracted from 1 billion compounds in the ZINC database using a deep docking platform. Then, the authors docked the results and presented only the first 1000. In another study, Btra et al. [181] trained a random forest model on the SMILES dataset (https://2019-ncovgroup.github.io/data/, access date: 10 February 2021) and applied docking simulation, which resulted in identifying 187 molecules for the coronavirus S-protein. In [182], the authors proposed an ML framework to predict viral protein activity. This was done by developing an ensemble model that ranks drugs according to their ability to inhibit the SARS-CoV-2 proteases. The developed model helped in identifying 19 drugs (7 antiviral, 3 antibodies, 6 anticancer, 1 antifungal, and 2 antimalarial). They then used molecular docking to evaluate the binding ability. They concluded that antiviral and antimalarial drugs have higher binding energy with the 3CL pro protease than anticancer and antibiotic drugs.

Gene Expression Signature
Other studies searched for therapies whose impact is similar to that of previously known treatments, based on gene expression signatures. Avaachuv et al. [165] utilized this approach to find drugs with a gene expression signature similar to that of COPB2, limiting COVID-19 replication. The study resulted in 20 promising drugs, many of which have been previously used as antivirals [183]. Since all these drugs have already received clinical approval, they may facilitate the discovery of an effective treatment.

Drug Discovery
Another role of AI in COVID-19 treatment is to discover new chemical compounds; ML and DL models have been used, for instance, to identify baricitinib to tackle COVID-19 [161]. For example, Zahavorkov et al. [180] tried to find inhibitors for the 3C-like protease. They used three main inputs, including co-crystal ligands, a crystal protein structure, and the protein homology model. In total, 28 different models were trained for each input (e.g., generative adversarial networks and generative autoencoders [180]). The authors then used reinforcement learning with reward functions to evaluate the drugs according to different factors (e.g., novelty, diversity, etc.), in order to select the most suitable molecules and thus increase the chance of finding a novel drug. Reinforcement learning has also been used in another study for drug discovery [184], where the authors used a list of 183 molecules known as inhibitors for SARS, breaking these molecules into 315 fragments. Deep Q-learning was then used to combine the fragments through fragment-based drug design (ADQN-FBDD). This design scored the discovered molecules on three criteria (drug-likeness, the existence of known pharmacophores, and the presence of predetermined fragments). The 4900 generated molecules were filtered using a heuristic filter to choose the promising compounds [180]. Similarly, in [180], the authors used 1.6 million molecules extracted from the ChEMBL dataset [185] and generated 33 candidate inhibitors. Other researchers took a different path to discover a new drug for COVID-19, one that depends on the immune response. In the human body, B-cells produce antibodies that attack the virus by binding to its antigens. As such, researchers tried to discover new drugs by searching for antigen-neutralizing antibodies. For example, in [180], the authors created a dataset of 1933 antigen sequences from similar diseases (SARS, HIV, and Ebola); then, they trained an XGBoost model (a classification model) to predict the antibody that will affect the antigen.
Other researchers [186] tried to predict effective antibodies against future generations of COVID-19. They mutated the SARS antibody sequence and generated 2900 antibody sequences. Then, these mutations were filtered to choose the stable variants and propose the effective antibodies.

Vaccine Discovery
From the medical side, the human body attacks viruses in two ways: (1) via B-cells that produce antibodies (as described above); and (2) via T-cells. T-cells include small cells called memory cells, which can recognize the antigen quickly and then activate more T-cells to attack the virus directly [187]. Part of the immune system is the complex of proteins (MHC I and MHC II) that presents the binding areas of the antigens (known as epitopes); these proteins are encoded by Human Leukocyte Antigen (HLA) genes and vary from human to human [187,188]. On this basis, vaccine design should identify suitable epitopes and ensure that these epitopes can be presented by the MHC I and II proteins generated from different HLA genes [189]. Altman et al. [190] identified 405 T-cell epitopes that could be presented by MHC I and II proteins. They utilized a previously trained neural network to predict the T-cell epitopes that could be presented by MHC proteins. To ensure that the chosen epitopes are promising, the authors examined 68 genetic variants of the SARS-CoV virus to analyze its mutations and identify the areas of the virus that are more or less likely to mutate [191,192]. They concluded that the S-protein is the most suitable part for the vaccine, as it does not include too many such mutations. In another study [193], the authors used an XGBoost model to predict the best protein that could serve as an effective vaccine. They reported that six proteins (i.e., nsp3, nsp4, nsp5, nsp6, nsp7, and nsp8) are also promising for vaccine development, in addition to the S protein. As far as we know, the developers of three different clinically approved vaccines reported that they used ML in their development process [189]. However, it is discouraging that the developing companies published minimal information about their methodologies and how they integrate ML into the vaccine development pipeline.

Applications of AI to Support COVID-19 Patients
ML and DL have been extensively used in various critical healthcare applications, such as predicting brain age [194], diagnosing liver diseases [195], and many other diseases [196,197]. In the current pandemic, governments and healthcare organizations are in critical need of support and decision-aid tools, which may help provide timely and efficient support to avoid virus spread. AI tries to provide professional solutions that mimic human intelligence and has resulted in various significant applications that could be used in screening, diagnosing, and tracking the disease. This section concentrates on the AI applications that gained much interest and raised the world's hope in the fight against COVID-19. AI is used to track patients through smart devices, such as mobile phones, cameras, and other wearable sensors [198,199]. These devices could be used for diagnosing, screening, and continuous monitoring [200]. Based on data aggregated from these devices, AI could provide useful information for the decision-making process, such as prioritizing the need for respiratory support as well as intensive care unit (ICU) admission [58,201].
Several AI applications have been developed to lighten the burden on medical experts as well as healthcare workers. This is done by automating procedures in a way that minimizes their direct contact with patients, as follows. (1) AI is used to analyze patients' data (e.g., symptoms, clinical reports, etc.) and to classify patients into different categories, such as mild, moderate, and severe. Accordingly, different therapy plans can be adopted for patients efficiently. (2) AI telemedicine applications could help reduce frequent visits to hospitals by providing continuous monitoring for patients with mild symptoms [202]. (3) Another application that supports both patients and healthcare staff is the AI-based medical chatbot (e.g., Clara chatbot 44). A chatbot is an AI service that incorporates ML and DL models (e.g., feature extraction, NLP, etc.) to assist patients with instant answers, providing continuous guidance on how to deal with potential problems. From the healthcare organizations' side, chatbots could help triage patients so that they flow smoothly, automate primary care, and allow medical experts to focus on critical and dire cases [203][204][205]. (4) AI is used as the core of service robots that could assist in several tasks, such as cleaning, disinfecting, delivering food, and treatment [206][207][208]. Moreover, using AI to understand population awareness of COVID-19 through social media could help in specifying the correct strategy for mitigating this pandemic. ML and DL were utilized to perform sentiment analysis of the strategies followed, recognize trends, and determine the origin of misinformation and rumors [35,209,210]. AI could also help analyze updated information, such as the recovery rate and therapeutic results, which may help medical experts resolve panic and fear towards this pandemic [131].
More applications that utilize AI techniques to support or monitor COVID-19 patients are listed in Table 7.

Table 7. Applications of using AI techniques in supporting COVID-19 patients.

Ref. | Application | Type of Data | AI Technique | Challenge
[203][204][205] | Chatbots to support COVID-19 patients and their relatives | Guidelines and information from a medical expert | NLP (i.e., information extraction, text summarization, and classification), speech recognition, and automated question-answering tools | Require a large amount of data to handle questions related to an unsaved query; the challenge related to using various language expressions (i.e., language slang)
[35,209,210] | Mining text to understand the community's response towards governmental and health strategies (i.e., social distance, lockdown)

COVID-19 Datasets
The lack of accurate and sufficient data is one of the key problems in COVID-19 research, as the number of tests carried out is small, and thus numerous deaths and infected cases are left unreported. No country worldwide has succeeded in offering reliable and accurate datasets on the virus's prevalence among its population. However, research in this context cannot stop. Therefore, information fusion has a significant role in combining information from multiple sources. Information fusion is used to integrate data from various resources to provide valuable information for the characterization, identification, and detection of a specific entity [233]. Given that a large dataset plays a key role in developing ML and DL models with high prediction accuracy, the COVID-19 datasets were categorized into three main groups: (1) textual data; (2) medical images; and (3) speech. Most COVID-19 image datasets were taken from screening tools that belong to three main classes, namely, X-ray, ultrasound, and CT chest scans. As the kits used in the PCR test are time-consuming, limited, and costly, medical images are considered an adequate alternative that lowers the burden on PCR tests.

Medical Images Datasets
Medical images, such as X-ray and CT chest scans, were used to develop automated models for disease diagnosis. Datasets often need preprocessing steps, such as segmentation and augmentation [25]. Image segmentation isolates portions of the image (regions of interest). Image augmentation includes transformation and filtering to increase the size of the dataset [42]. These steps help ML and DL models achieve better accuracy and avoid overfitting. The following subsections discuss the available medical image datasets for COVID-19.
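The two preprocessing steps can be sketched on a tiny grayscale image stored as a list of rows. The example crops a rectangular region of interest and produces a few transformed copies (horizontal flip and rotations) of the kind commonly used to enlarge a dataset. This is only an illustration under simplifying assumptions: the function names are hypothetical, real pipelines in the surveyed studies operate on full-size scans with dedicated libraries, and whether flips or rotations are clinically appropriate depends on the imaging task.

```python
def crop(image, top, left, height, width):
    """Return the rectangular region of interest as a new list of rows."""
    return [row[left:left + width] for row in image[top:top + height]]


def hflip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rot90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def augment(image):
    """Original image plus three transformed copies."""
    return [image, hflip(image), rot90(image), rot90(rot90(image))]


# A 2x3 toy "scan"; pixel values are arbitrary.
scan = [[0, 1, 2],
        [3, 4, 5]]

roi = crop(scan, top=0, left=1, height=2, width=2)
augmented = augment(scan)
print(roi)
print(len(augmented), "images after augmentation")
```

Each transformed copy keeps the label of the original image, so a dataset of N images yields 4N training samples with this particular set of transformations.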

CT Chest-Scan Dataset
Owing to the rapid progression of the COVID-19 disease, a follow-up CT scan every 2-4 days is required to evaluate the progression and therapeutic effect. Figure 2 shows the gradual changes in the CT chest images of a COVID-19 patient [28,35,234]. Initially, there is a slight change in the chest CT images, but as the infection increases day by day, bilateral differences appear. Chest CT images clearly show the growth of pneumonia with linear opacity in the subpleural area [235]. Figure 4 shows the progression in the patient's status.

A pioneering effort in collecting public CT scan datasets was made in [236]. The dataset consists of 125 chest CT scans and includes images of several classes (COVID-19, SARS, MERS, and ARDS). It was collected from several websites and publications, which may affect the image quality and even the performance of the ML model [237]. Another published CT chest dataset is in [25]. It includes 275 images of positive COVID-19 CT scans extracted from 760 COVID-19 preprints. The dataset has been used in various studies and is continuously updated in the online repository. To overcome the shortage of COVID-19 datasets, several studies use augmentation and segmentation techniques to increase the size of the dataset. Segmentation is a preprocessing step used to crop the region of interest (the infected region). For example, in [34], the authors used a 3D CNN model to segment the infected regions from the CT chest scan dataset [236]. The system performed auto-contouring to estimate the shape and percentage of the infected region, achieving a recognition accuracy of 90%.
Other segmented datasets are listed in [238], consisting of 20 labeled COVID-19 datasets categorized into left and right infected lungs. Another COVID-19 online dataset is available at http://medicalsegmentation.com/covid19/ (access date: 10 February 2021); its segmented images were obtained from the Italian Society of Medical and Interventional Radiology (SIRM) (https://www.sirm.org/en/category/articles/covid-19-database/, access date: 7 February 2021; https://coronacases.org, access date: 10 February 2021) and are categorized into three classes (consolidation, pleural effusion, and ground glass). Another effort for collecting a COVID-19 dataset is at https://coronacases.org/ (access date: 10 February 2021). The British Society of Thoracic Imaging in the UK developed an online portal for COVID-19-positive CT-scan images (https://www.bsti.org.uk/training-and-education/covid-19-bsti-imaging-database/, access date: 10 February 2021). Each case is stored with its characteristics, such as gender, age, and PCR test result. The same procedure was followed to collect the dataset at https://www.sirm.org/en/category/articles/covid-19-database/ (access date: 10 February 2021). Several studies utilized these datasets in their research [18,239]. To perform binary classification for COVID-19 identification and diagnosis, several studies use non-COVID-19 CT chest-scan images as negative training examples, such as the following: (1) the MedPix (https://medpix.nlm.nih.gov/home, access date: 10 February 2021) medical image dataset that includes 5900 images for 1200 patients; (2) the LUNA (https://luna16.grand-challenge.org/) dataset for lung cancer patients that includes 888 CT chest scans for 888 subjects; and (3) the Radiopaedia online repository (https://radiopaedia.org/articles/covid-19-4?lang=us, access date: 10 February 2021) that includes 366,558 CT scan images.

X-ray Images Dataset
A chest radiograph (X-ray) is a common way to diagnose patients with respiratory diseases. A chest X-ray image can appear normal at the early stages of COVID-19, but it gradually changes in a way that may resemble other respiratory diseases such as pneumonia or acute respiratory distress syndrome (ARDS). Two common changes that arise in the COVID-19-infected lung are (1) accumulation of tissue or fluid in a way that prevents gas exchange; and (2) the appearance of nodular shadowing. Figure 5 shows the progression of X-ray images for a 45-year-old patient.
An earlier effort to develop an X-ray dataset for COVID-19 patients was made in [240]. It includes 13,800 images for 13,000 patients collected from several online repositories. Wang et al. [240] collected this dataset to develop a CONVNET model for COVID-19 diagnosis, resulting in a classification model with an accuracy of 93.11%. Another dataset, collected from online repositories by Cohen et al. [236], is continuously updated at https://github.com/ieee8023/covid-chestxray-dataset (access date: 12 February 2021). Several researchers utilized the Cohen X-ray image dataset in their studies. For example, Hemdan et al. [58] utilized Cohen et al.'s dataset [236] to develop a CNN model for COVID-19 diagnosis. They developed five different DL models based on transfer learning to overcome the shortage of data. Other researchers merged Cohen's dataset [236] with other datasets to increase the size of the resulting dataset, enhance performance, and avoid overfitting. For example, in [241], the authors merged the Kaggle pneumonia dataset (https://www.kaggle.com/andrewmvd/convid19-X-rays, access date: 14 February 2021) with the Cohen dataset [236] to train a CNN model using pre-trained models, including VGG19, Inception, Xception, MobileNet V2, and ResNet V2. The results show that MobileNet V2 outperformed the other models in terms of accuracy, specificity, and sensitivity. The authors extended their study in [215] by merging Cohen's dataset [236] with SIRM and RSNA [241] data, obtaining a total of 455 images for all classes. This research demonstrated that building the CNN model from scratch on a sufficient dataset outperformed transfer learning. In another study [24], Cohen's dataset [236] was merged with the Kaggle dataset (https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia, access date: 14 February 2021), resulting in a dataset of 100 images divided into two balanced classes (50 normal and 50 positive).
Apostolopoulos et al. [53] used the same dataset and merged it with the Kaggle dataset (https://www.kaggle.com/andrewmvd/convid19-X-rays, access date: 10 February 2021). This resulted in 127 images from pneumonia and COVID-19 cases. In [213], the authors applied augmentation techniques to Cohen's dataset [236] to address COVID-19 data scarcity. The same was done in [242], where the authors applied data augmentation techniques to COVID-19 and non-COVID-19 X-ray images. They obtained around 17,000 X-ray images from 4044 positive images and 5500 negative images. Similarly, in [243], the authors utilized both the Cohen dataset and the Kaggle dataset (https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia, access date: 16 February 2021). The authors used data augmentation techniques and obtained 2500 images (1340 viral pneumonia and 190 COVID-19 images). The augmented data are available at https://www.kaggle.com/tawsifurrahman/covid19-radiography-database (access date: 14 February 2021). In [244], Signoroni et al. collected 4707 X-ray images of COVID-19-positive subjects from an Italian hospital. To maintain a robust dataset, the authors collected images from two different modalities (direct X-ray (DX) and computed radiography (CR)) for patients in various conditions (i.e., supine, standing, and with or without life support systems).
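The augmentation step that the studies above rely on can be sketched with two elementary geometric transforms. This is an illustrative sketch only: the surveyed papers do not specify their exact transforms here, so the choice of flips and 90-degree rotations, the helper names, and the representation of an image as a list of pixel rows are all assumptions.

```python
def horizontal_flip(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90(image):
    """Rotate a 2D image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return a copy of the original image plus three simple geometric variants."""
    return [
        [list(row) for row in image],   # unchanged copy
        horizontal_flip(image),         # mirrored
        rotate_90(image),               # rotated by 90 degrees
        rotate_90(rotate_90(image)),    # rotated by 180 degrees
    ]


# Toy 2x2 "image"; real X-ray images would be much larger pixel arrays.
image = [[1, 2], [3, 4]]
for variant in augment(image):
    print(variant)
```

Each source image yields four training samples in this sketch. For chest images, transforms such as mirroring change the apparent side of anatomical structures, so whether a given transform is appropriate is a modelling decision that should be checked against the diagnostic task.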
Notwithstanding the importance of X-ray in the diagnosis of COVID-19, chest X-ray images are unreliable at the early stages of the disease [245]. In other words, the reliability of X-ray findings mainly depends on the time between the first symptoms and the imaging procedure. An Italian study, conducted in April 2020 on 72 COVID-19 patients [246], reported that the disease is visible on an X-ray image within the first 4 days after the onset of the initial symptoms, such as a dry cough, fever, etc.

Ultrasound Dataset
Lung ultrasound (US) has been reported to correctly diagnose COVID-19 in 96% of people with COVID-19. However, few US datasets are available. For example, in [70], the authors aggregated a dataset of 64 videos that were divided into 39 videos of COVID-19, 15 videos of pneumonia, and 12 videos of healthy patients. Another dataset, available at https://tinyurl.com/yckfqrcg (access date: 17 February 2021) and https://pocovidscreen.org/ (access date: 16 February 2021), includes 1101 ultrasound images categorized as 650 images for COVID-19, 276 for bacterial pneumonia, and 171 for healthy cases. These images were extracted from different videos published in research works. Figure 6 shows the progression of a US image for a COVID-19 patient.

Sound Dataset
The main challenge in developing such models is the shortage of available datasets. The earliest and most noteworthy dataset was developed in [80,247] and is known as the Coswara dataset (https://coswara.iisc.ac.in/, last access date: 16 March 2021). Coswara is a public dataset collected via public media interviews. At the time of writing, Coswara included 102 records of breathing and deep cough sounds aggregated from COVID-19-positive patients. The collected data include shallow and deep cough sounds and slow and fast breathing sounds. Gender, age, health status (i.e., infected, cured, or exposed), and geographical information are also stored for each patient. Another cough dataset [248], known as SACRO (SARS COVID-19 South Africa) (https://datahub.io/core/covid-19, last access date: 22 March 2021), was collected in South Africa. SACRO is a small dataset collected from 21 cases (8 COVID-19 cases and 12 healthy cases) through a smartphone. Cough sounds were collected and then sampled at 44.1 kHz. Age, gender, county, COVID-19 lab test result (positive or negative), and symptoms were also recorded in addition to the cough sound. Due to the imbalance in the SACRO dataset, the authors used the synthetic minority oversampling technique (SMOTE) [249] to balance the data before utilizing them in detection and classification processes. In [250], the authors collected 260 sound samples from 52 COVID-19-positive cases via the WeChat app. For each patient, they recorded five sentences one after the other via the mobile app. These sentences were analyzed to determine the degree of anxiety, fatigue, sleep quality, breath rate, etc. In another dataset [76], the authors collected 7000 sound samples that included 200 confirmed COVID-19 subjects.
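The idea behind SMOTE [249], which the authors of the South African cough dataset used to balance their classes, can be sketched as follows. This is a simplified illustration in plain Python, not the code used in [248]: each synthetic sample is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The function name `smote`, its parameters, and the toy feature vectors are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by interpolation.

    minority: list of feature vectors (lists of floats) from the minority class.
    k: number of nearest minority neighbours considered for each base sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = [s for s in minority if s is not base]
        others.sort(key=lambda s: math.dist(base, s))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2D feature vectors, invented for illustration only.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
for point in smote(minority, n_new=4):
    print(point)
```

Because every synthetic point is a convex combination of two real minority samples, the generated data stay inside the region spanned by the observed minority class. With very few minority cases, as in small cough datasets, the synthetic samples add little genuinely new information, which is one reason results on oversampled data should be validated carefully.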

Text Dataset
Since the onset of the COVID-19 pandemic, various textual datasets have been developed with different targets. They can be categorized as follows: (1) reporting and visualizing COVID-19 cases in time-series formats; (2) measuring community transmission; (3) correlating the effect of mobility on virus transmission; (4) evaluating the impact of non-pharmaceutical interventions (NPI) on COVID-19 cases; and (5) analyzing COVID-19 scholarly publications for semantics. The categorization of the textual datasets is shown in Figure 7. The earliest dataset developed to aggregate summary COVID-19 statistics (number of infected, recovered, and deceased cases grouped by county) can be found in [251]. It was developed by Johns Hopkins University, where a real-time dashboard (https://www.arcgis.com/apps/opsdashboard/index.html, last access date: 16 March 2021) was built to aggregate data. These data are publicly available at https://datahub.io/core/covid-19 (last access date: 16 March 2021). The main objective of this dataset is to provide health authorities and researchers with statistical data that can be used to analyze, track, and predict the spread of the COVID-19 pandemic. The Chinese Center for Disease Prevention and Johns Hopkins University developed another time-series dataset, which includes the number of recovered and infected cases, the time of infection, and the origin county. Other researchers [252,253] provided an epidemiological dataset about COVID-19 cases in China. This dataset includes personal and laboratory information, such as demographic data, disease onset date, admission date, last travel date, etc. It is updated continuously to guide public health authorities in the decision-making process.
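Time-series datasets of this kind typically report cumulative counts, whereas tracking and prediction usually start from daily new cases. The following minimal sketch shows that preprocessing step under stated assumptions: the input is a plain list of cumulative counts for one region, the helper names are hypothetical, and the numbers are invented for illustration rather than taken from any of the datasets above.

```python
def daily_from_cumulative(cumulative):
    """Convert cumulative case counts into daily new cases."""
    daily = []
    previous = 0
    for total in cumulative:
        # Clamp at zero: a cumulative series can be revised downward.
        daily.append(max(total - previous, 0))
        previous = total
    return daily


def moving_average(values, window=7):
    """Trailing moving average; early entries use the days available so far."""
    averages = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Toy cumulative counts, invented for illustration only.
cumulative = [1, 3, 6, 6, 10]
daily = daily_from_cumulative(cumulative)   # [1, 2, 3, 0, 4]
smoothed = moving_average(daily, window=3)
print(daily)
print(smoothed)
```

Smoothing with a trailing window is a common way to damp reporting artifacts such as weekend effects before the series is passed to a forecasting model; the window length is a choice that depends on the reporting rhythm of the data source.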
In [254], the authors provided a textual dataset that includes four time-series datasets: (1) the daily infected cases in Wuhan; (2) the daily internationally exported cases; (3) the daily infected cases in China; and (4) the percentage of infected cases on vacation flights. This study aimed to estimate the transmission of infection, the virus outbreak, and the effect of travel bans on infection transmission. In the same manner, in [255], the authors utilized daily case reports to evaluate the impact of travel restrictions on the spread of COVID-19, whereas in [256], the authors used case reports collected from location-based systems (e.g., WeChat). In [257], the authors analyzed the effect of mobility and travel restrictions on the spread of COVID-19 in China. The authors developed a dataset that includes real-time and historical data aggregated in Wuhan, China, in addition to the list of cases inside and outside Hubei, available at https://github.com/Emergent-Epidemics/covid19_npi_china (last access date: 16 March 2021). This study found a high correlation between the spatial distribution of COVID-19 and mobility. Another study utilized an epidemiological dataset extracted from government websites and official sources [258] to evaluate the effect of travel restrictions on limiting the spread of infection.
Another research interest is the effect of NPI restrictions. NPIs comprise a wide range of rules and restrictions applied by governments to fight the COVID-19 pandemic (e.g., social distancing, travel limits and bans, contact reduction, etc.). NPI datasets are essential to show the effect of applying NPIs on infection transmission. At Oxford University, a team of academic researchers started the Oxford COVID-19 government response tracker (OxCGRT) project, which includes data from various countries in the Stringency Index [259]. The Stringency Index consists of 17 indicators, such as local and international travel bans, contact tracing, cancellation of all public events, etc. These indicators are utilized to compare government responses, public awareness, and the effect on the transmission rate. The aggregated data are available in a GitHub repository (https://github.com/OxCGRT/covid-policy-tracker, last access date: 16 March 2021). Another dataset, aggregated by a group of volunteers, can be found at https://www.kaggle.com/davidoj/covid19-national-responses-dataset (last access date: 16 March 2021). The main objective of this dataset is to analyze the effect of NPI regulations in 117 countries, regardless of economic factors. Unfortunately, the authors reported that the data might be biased toward some countries, as some countries are not well covered by the documentation, and their actual implementation may differ from the basic reports.
In this global crisis, it is essential to understand the public's emotional responses and worries regarding the COVID-19 pandemic. The earliest effort in this regard was in [260], wherein the authors asked participants to report their emotions and developed a dataset of tweets (short and long tweets) aggregated from 2500 participants. The authors also asked the participants to rate their feelings on a nine-point scale to gauge the anxiety, anger, relaxation, happiness, and sadness they felt. In another large-scale tweet dataset, the authors used the Twitter streaming API to aggregate tweets that include specific keywords (e.g., COVID-19, pandemic, SARSCOV, etc.) [261]. They aggregated 434 million tweets. The Twitter streaming API was also used to collect a dataset of Arabic tweets [262]. These data were aimed at analyzing the behavior of Arab countries towards the pandemic, and the authors collected 2,433,660 tweets in addition to the geolocation of each tweet.

Genome Sequence Dataset
Genome sequencing is critical for determining the order of the chemical bases inside DNA molecules and identifying virus gene expression [1]. Virologists utilize these sequence data in vaccine development and mutation recognition. During the early outbreak of the pandemic, only a very limited number of genome datasets were available from Wuhan, China. The lack of genome transfer data made virus analysis more challenging and raised doubts about virus recombination and phylogenetic network results. With the rapid increase of COVID-19 in different countries, several studies reported that the virus had accumulated several alterations in its genome sequence, which have been seen in the spread of viral strains [163]. To date, more than 66,000 viral genome sequences have been shared through the global initiative on sharing avian influenza data (GISAID) (https://www.gisaid.org/, last access date: 16 March 2021) [263]. The availability of mutated genome sequences raises the chance of discovering new drugs and vaccines. Several datasets have been developed for this purpose. In [1], the authors developed a stream of virus sequence datasets that included two types of data (raw data and processed data). The raw data had 1557 instances of the SARS-COV-2 virus genome collected from NCBI and 11,540 collected from another virus host, in addition to three other virus sequences (bat-SL-COVZC45, bat-SL-COVZC22, and RAT13). These viruses had a large similarity with the SARS-COV virus. The processed part consists of various types of data stream representations (DSRs), including direct mapping and k-mers mapping with Chaos Game Representation (CGR). Another centralized repository of virus sequences, which includes the original coronavirus sequence, is available at https://registry.opendata.aws/ncbi-covid-19/ (last access date: 16 March 2021).
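The two sequence representations named above, k-mer counting and Chaos Game Representation, can be illustrated with a short sketch. This is a generic illustration of the two techniques rather than the pipeline used in [1]: the corner assignment for the four nucleotides follows a commonly used convention and is an assumption here, as are the function names and the toy sequence.

```python
from collections import Counter

# Corner assignment is a commonly used CGR convention; other layouts exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def chaos_game_points(sequence):
    """Map a nucleotide sequence to CGR points in the unit square."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in sequence:
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the base's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


sequence = "ATGCGA"  # toy sequence, not taken from any dataset
print(kmer_counts(sequence, 2))
print(chaos_game_points(sequence)[:2])
```

Both representations turn a variable-length sequence into numeric features: k-mer counts give a fixed-length frequency vector for a chosen k, while the CGR points can be binned into a 2D image that a CNN can consume.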
Other projects were developed to aggregate virus mutations. For example, the VIPR project [264] was a pathogen platform that provided the ability to search and download information about virus mutations. However, it lacked the information connecting virus mutation, country, and time of occurrence, which is essential for analyzing the transmission path. The main objective of such projects was to give users the chance to analyze virus mutations from different perspectives. Table 8 summarizes all the COVID-19 datasets from different angles.

Discussion
The dramatic spread of COVID-19 and the consequent increase in the number of medical examinations place a heavy burden on healthcare organizations. This is due to the shortage of medical expertise and test kits. That is why AI is considered a forefront tool for facing the COVID-19 outbreak. Recently, several papers have surveyed the COVID-19 state of the art from different perspectives. For example, in [267], the authors surveyed the usefulness of prediction models for COVID-19 diagnosis. In [268] and [242], the authors briefly summarized the deep learning applications that were developed to combat COVID-19. Similarly, in [269], the authors summarized the state of the art in medical image processing and its significant role in the COVID-19 domain. Another survey focused on the role of transfer learning. The main differences between our study and other COVID-19 surveys are the following: (1) it investigates the role of AI in the COVID-19 pandemic; (2) it covers all applications, starting from diagnosis using various medical datasets; (3) it covers understanding the current spread of the pandemic and predicting its future spread; (4) it specifies the correlation between COVID-19 infection and other healthcare factors; and (5) it surveys the role of AI in developing drugs and vaccines. Table A1 shows the distribution of gender, age, and death rate among various countries. Figure A1 shows this distribution graphically. We tried to analyze how the progress of deep learning contributes to combating the coronavirus by developing effective solutions.
First, we compared studies concerned with using AI in COVID-19 diagnosis through medical images. Based on this comparison, we observed that (i) a large number of studies have utilized CT scans and X-rays [243,270,271], whereas few studies utilized lung US [55,66,272]; (ii) although chest X-rays are considered less sensitive than PCR tests in detecting COVID-19 at the early stages, they are recommended for monitoring and evaluating the progression of a patient's status, especially in critical cases [215]; (iii) segmentation techniques, which are used to detect the infected region, are primarily applied to CT scans [273]; (iv) augmentation techniques, which are used to increase the size of the dataset, are commonly applied to X-ray datasets [274]; (v) the majority of COVID-19 studies utilized CNNs in their classification process [52,275], and some of them integrated CNNs and transfer learning to overcome the shortage of available data and increase the accuracy of the model [32,201,276]; (vi) a small number of studies combined CNNs with random forests and support vector machines (SVMs) for feature extraction and classification [277,278]; (vii) higher accuracy was reported by studies that combined CNNs, transfer learning, and SVMs, whereas CNN and DL models were reported to overfit in some studies due to the shortage of available datasets [37,162]; (viii) the diagnostic accuracy obtained with X-rays is approximately equal to that obtained with chest CT scans; (ix) the sensitivity of X-ray in diagnosis is highly correlated with the time between the initial symptoms and the imaging procedure: it was not more than 55% at 2 days after the initial symptoms and increased to 79% at 11 days after symptom onset [147]; (x) VGG, MobileNet, and ResNet are the most commonly employed pre-trained models for classification tasks [21,52]; (xi) explainability methods have rarely been used to clarify the results of CNN models [57]; and (xii) most of the studies reported accuracies of more than 90% for binary classification tasks (i.e., COVID-19 vs. non-COVID-19) [218,279] and accuracies higher than 80% for three-class classification tasks (i.e., normal, viral pneumonia, and COVID-19) [216,280]. Tables 2-4 present summaries of the many studies that used medical images in COVID-19 diagnosis.
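The accuracy, sensitivity, and specificity figures compared above are all derived from the confusion matrix of a binary classifier. The following minimal sketch shows the calculation; the label encoding (1 for COVID-19, 0 for non-COVID-19), the function name, and the toy predictions are assumptions made for illustration and do not come from any surveyed study.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity for 0/1 labels (1 = COVID-19)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Sensitivity: share of COVID-19 cases that the model detects.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Specificity: share of non-COVID-19 cases correctly ruled out.
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Toy labels and predictions, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

On small or imbalanced datasets, accuracy alone can look high even when sensitivity is poor, which is why reported accuracies above 90% are easier to interpret when sensitivity and specificity are given alongside them.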
Second, we concentrated on using AI techniques in COVID-19 diagnosis based on respiratory sounds. Accordingly, we made the following observations: (i) a cough sound has unique characteristics and could therefore be used to differentiate respiratory diseases in their early stages; AI models could effectively learn these features and discriminate between COVID-19 and non-COVID-19 cough sounds; (ii) the quantity and quality of respiratory sound datasets are the main challenges facing AI in providing robust predictions; and (iii) the majority of COVID-19 sound datasets have been crowdsourced from volunteers in the general population through mobile apps and websites; therefore, prescreening tools are essential to build effective models.
Third, we focused on textual datasets and their role in fighting COVID-19. We observed that (i) textual datasets are used for several purposes, including reporting the number of infections in time-series format, correlating the effect of NPIs and lockdowns with virus spread, estimating the reproduction and mortality rates, and analyzing social media data for semantics [136,189,281,282]; (ii) extracting human emotions towards the pandemic and NPIs from articles and social media data has not been deeply investigated; (iii) most research on social media data did not consider the timeliness of the study, although such data become outdated quickly [242]; (iv) the application of contact tracing is very limited due to differences in privacy and security regulations across countries [246,254]; and (v) several papers were written in Chinese, especially papers published during the first stage of the COVID-19 pandemic; thus, they may not be accessible to many researchers.
Finally, we compared all available COVID-19 datasets and made several observations. First, regarding the medical image datasets, (i) several studies did not make their data and code publicly available; therefore, the results of research conducted with these data cannot be reproduced [264,265]; (ii) other studies aggregated data from several sources but did not host them in a new repository; and (iii) augmenting data may help solve the data scarcity issue, increase model performance, and avoid overfitting; however, the accuracy obtained with augmented data needs to be evaluated. Second, regarding textual datasets, we observe that real news is much longer than fake news in terms of the number of words per post or article. Table 7 summarizes all the COVID-19 datasets.

Limitations and Future Directions
This section highlights the most critical challenges in the literature and the possible research directions for future work.

•
Symptoms of COVID-19, pneumonia, and other respiratory diseases are very similar, therefore developing a suitable DL model that could detect COVID-19 with optimum accuracy remains a challenge [74].

•
The scarcity of a high-quality dataset for COVID-19 is a major challenge. This returns to different reasons, including (1) closed source and non-published datasets; (2) the distributed nature of COVID-19 datasets; and (3) privacy issues that limit data sharing [32]. Therefore, the collaboration between all medical organizations across the globe is essential to expand the existing dataset and accelerate AI research for COVID-19.

•
The variability in the testing process across different countries and hospitals is a critical concern that may lead to non-uniformity in the labeling process. • COVID-19 virus is rapidly mutated over different geographic areas. Therefore, data collected from one region may not be suitable to draw interferences on another region [226].

•
Medical staff are considered the first line of defense against this pandemic. Therefore, work on more contact-less screening and diagnosis tools is an urgent need to protect them from infections.

•
Most state-of-the-art DL models were trained in 2D images. However, most MRI and CT scan images are 3D, and hence adding an additional dimension is essential to optimize the impact of these images [40,44].

•
The non-standardized process when aggregating medical image datasets result in increasing data variety; thus, this raises the need to ensure the robustness of DLgenerated models.

•
Most of the available COVID-19 datasets are limited in size. Therefore, transfer learning is a future research direction that could help detect abnormalities in small datasets and yield robust predictions and remarkable results [241]. • Based on the literature, it is noticed that there is a correlation between COVID-19 infection and other medical comorbidities. Therefore, to provide a precise and accurate prediction model, a patient's history of other ailments (diabetes, liver, kidney, heart disease, etc.) must be taken into consideration in both the COVID-19 prediction and detection process [144][145][146]. • High computational resources are required to build complex DL models, processing, and interpreting big data, compared to working with IoT devices. Therefore, edge computing and fog computing could be effective in handling this challenge [199]. • Various preprocessing steps are required to enhance the interpreting data extracted from various sensors (i.e., data cleaning, outlier detection, quality improvement, etc.) [51,260,283,284]. • Current NLP applications have limited the benefit from such a diagnosis system. Therefore, working in algorithms that measure semantic textual similarity (STS) [285] is essential to translate performance to a specific domain environment (i.e., COVID-19). • Data fusion is a challenge because it integrates heterogeneous data [232]. However, it improves the performance of the resulting models. There are many fusion techniques in the literature. Therefore, adaptive multi-models are highly needed to handle data from multiple sensors [286]. • More sophisticated techniques are needed to optimize the performance of processing X-ray and sound data.

•
The explainability and interpretability of ML/DL techniques are a key challenge. An ML model should not be a black box. Medical experts must know which features are chosen to distinguish COVID-19 from non-COVID-19 cases [232]. Moreover, ML/DL research should investigate how to predict infections before the symptoms appear.

•
Several ML and DL models have shown promising results in COVID-19 screening, diagnosis, and prediction. However, most of these models have not been deployed in real environments (e.g., emergency services, hospitals) to demonstrate their capabilities in tackling the COVID-19 pandemic. Therefore, many challenges need to be addressed to deploy such diagnosis models, including (1) ensuring consistent network security to provide reliable communication and trusted data on the network; (2) adaptation of cloud, fog, and edge computing; and (3) handling security and privacy issues regarding patients' data.
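
As an illustration of the sensor-data preprocessing direction listed above (data cleaning and outlier detection), the following minimal sketch drops missing readings and flags outliers with the interquartile-range (IQR) rule. The IQR rule is one common choice assumed here for illustration and is not the method of the cited studies; the function name `clean_sensor_readings`, the parameter `k`, and the sample temperature values are hypothetical.

```python
import statistics


def clean_sensor_readings(readings, k=1.5):
    """Drop missing values, then split readings into kept values and outliers.

    Assumption: missing readings are encoded as None, and outliers are
    defined by the IQR rule (outside [Q1 - k*IQR, Q3 + k*IQR]).
    """
    # Data cleaning: discard missing entries.
    valid = [r for r in readings if r is not None]
    # Too few points to estimate quartiles meaningfully; keep everything.
    if len(valid) < 4:
        return valid, []
    # Outlier detection: compute quartiles and the IQR fences.
    q1, _, q3 = statistics.quantiles(valid, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [r for r in valid if low <= r <= high]
    outliers = [r for r in valid if r < low or r > high]
    return kept, outliers


# Hypothetical body-temperature readings (degrees Celsius) from a wearable sensor.
sample = [36.6, 36.8, None, 37.0, 36.7, 45.0, 36.9, 36.5]
kept, outliers = clean_sensor_readings(sample)
print("kept:", kept)
print("outliers:", outliers)
```

In practice, the threshold `k` and the handling of flagged values (removal, clipping, or manual review) would depend on the sensor and the clinical setting.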

Conclusions
COVID-19 is an ongoing pandemic that exceeds most communicable diseases in terms of death and infection rates. Therefore, medical experts as well as AI scientists are trying to fight against this pandemic and are searching for alternative techniques that could provide rapid tracking, screening, and development of drugs and vaccines. This paper surveyed recent studies that investigated AI solutions to combat the COVID-19 pandemic, covering AI solutions for diagnosis, estimation, treatment, and association. It also surveyed open-source datasets (medical images, speech dataset, test dataset, and genome structure dataset) and examined the challenges and limitations of the current AI literature. Finally, the paper discussed future directions in terms of data aggregation, data preprocessing, and ML and DL deployment in real environments. The study concludes that ML and AI have dramatically enhanced disease screening, diagnosis, monitoring, and drug/vaccine discovery for the COVID-19 pandemic and have minimized human intervention in a way that reduces the burden on the healthcare sector.