Applications of Big Data Analytics to Control COVID-19 Pandemic

The COVID-19 epidemic has caused a large number of human losses and havoc in economic, social, and health systems around the world. Controlling such an epidemic requires understanding its characteristics and behavior, which can be identified by collecting and analyzing the related big data. Big data analytics tools play a vital role in building the knowledge required for making decisions and taking precautionary measures. However, due to the vast amount of data available on COVID-19 from various sources, there is a need to review the roles of big data analysis in controlling the spread of COVID-19, to present the main challenges and directions of COVID-19 data analysis, and to provide a framework of the related existing applications and studies to facilitate future research on COVID-19 analysis. Therefore, in this paper, we conduct a literature review to highlight the contributions of several studies in the domain of COVID-19-based big data analysis. The study presents a taxonomy of several applications used to manage and control the pandemic. Moreover, this study discusses several challenges encountered when analyzing COVID-19 data. The findings of this paper suggest valuable future directions to be considered for further research and applications.


Introduction
On 30 January 2020, the World Health Organization (WHO) declared the COVID-19 outbreak a public health emergency of international concern. Afterward, the government of the Kingdom of Saudi Arabia urgently took several strict measures to limit the spread of the pandemic within its regions [1,2]. The Saudi Ministry of Health (MoH), like health authorities in many other countries, implemented WHO recommendations related to the identification and isolation of suspected COVID-19 cases.
Nevertheless, the pandemic has spread dramatically, with more than 82 million people infected and more than one million deaths [3]. The rapid spread of the pandemic, together with its continuously evolving patterns and the variation in its symptoms, makes it more difficult to control. Moreover, the pandemic has affected health systems and the availability of medical resources in several countries around the world, contributing to the high death rate [4].
A regular monitoring and remote detection system for individuals will assist in fast-tracking suspected COVID-19 cases. Moreover, such systems will generate a huge amount of data, which will provide many opportunities for applying big data analytics tools [5] that are likely to improve the level of healthcare services. A large number of open-source software packages are available, such as the big data components of the Apache project [6], which are designed to operate in cloud computing and distributed environments to assist in the development of big data-based solutions. Furthermore, big data has several key characteristics called the Six Vs [7], namely Value, Volume, Velocity, Variety, Veracity, and Variability. However, the original definition of the key characteristics of big data considers only three Vs, namely Volume, Velocity, and Variety [8].
The big data characteristics apply to data acquired from the healthcare sector, which increases the tendency to use big data analysis tools to improve sector services and performance. There are wide applications of big data analytics in the healthcare sector, including genomics [9], drug discovery and clinical research [10], personalized healthcare [11], gynecology [12], nephrology [13], oncology [9,12], and several other applications found in the literature. However, in this paper, we present the contributions of the most important review papers found in the literature that cover the field of big data in healthcare. We also investigate the opportunities and challenges for applying big data analytics tools to COVID-19 data and provide findings and future directions at the end of the paper.
Wearable technology is a promising candidate to become one of the primary sources of health information, given its widespread availability and acceptance by people. Based on a survey conducted in January 2020, 88% of the 4600 subjects included in the study indicated a willingness to use wearable technology to measure and track their vital signs, while 47% of chronically ill patients and 37% of non-chronically ill patients reported a willingness to blindly share their health information with healthcare research organizations. Of the same group, 59% said they would likely use artificial intelligence (AI)-based services to diagnose their health symptoms [14]. If people share such data routinely, the volume of data will increase greatly, which calls for planning the design and implementation of data analysis tools and models in this sector.
Several studies have used big data for sentiment analysis, such as Reference [15], which linked social media behavior to political views, opinions, and expressions. The study was based on a representative survey conducted on 62.5% of adults from Chile, and it showed the large effect of social media on changing people's opinions regarding political views and elections. Similarly, the authors of Reference [16] studied how management responses to online customer reviews affect customers' choice of facilities or hotels, and found a positive correlation between the response and customer satisfaction. The authors of Reference [17] reviewed classification techniques, including deep and convolutional approaches, for identifying a writer from their handwriting. They discussed several identification challenges related to language characteristics, scripts, and the lack of datasets. Also, the authors of Reference [18] reviewed and analyzed recent papers on the latest developments, capabilities, and benefits of big data analytics. Their study showed that big data can support business industries in many functions, including prediction, planning, managing, decision-making, and traceability. The limitation of their study is the data sources, which were hard to find due to privacy and the protection of the information. Moreover, the authors of Reference [19] surveyed numerous papers about mathematical models to improve the efficiency of detecting and predicting COVID-19. Their survey suggested using artificial intelligence to detect COVID-19 cases, big data to trace cases, and nature-inspired computing (NIC) to select suitable features to increase the accuracy of detection. Some surveys studied heart-related diseases and offered recommendations and guidelines, such as Reference [20], to help people understand the causes and symptoms of heart failure and the most affected groups. They stated that heart failure can worsen a patient's condition, especially for patients with serious illnesses.
Analyzing health data in real time with AI techniques will play a vital role in predictive and preventive healthcare. For example, it will help predict the sites of infection and the flow of the virus. It will also help in estimating the need for beds, healthcare specialists, and medical resources during pandemic crises, as well as in the diagnosis and characterization of the virus [21].
Several reviews in the literature have examined big data analytics in healthcare from various aspects. Table 1 summarizes a number of such studies. In this paper, we focus on identifying the applications of big data analytics for COVID-19 and the challenges that may hinder its utilization.

Table 1. Summary of surveys on big data analytics in the healthcare field.
Source | Publication Year | Domain | Key Contribution
[22] | 2017 | Healthcare security and privacy | Discussed healthcare data security and privacy issues, and the mechanisms and strategies available for healthcare data privacy, security, and user access
[23] | 2017 | Heart attack prediction and prevention | Identified the uses and technologies of big data analytics in this area, as well as challenges and concerns regarding patient privacy
[24] | 2018 | General healthcare | Defined the scope of big data analytics and its applications in healthcare, and provided strategies to overcome its challenges
[25]

The rest of this paper is organized as follows. Section 2 presents the current big data analytics applications for COVID-19. Section 3 shows several tools used for big data analytics. Section 4 discusses big data analytics in the healthcare sector from different aspects and analyzes the challenges that may hinder its application, then provides our future predictions in terms of using big data in the healthcare field, in addition to several recommendations. Finally, Section 5 concludes the paper.

Applications of Data Analytics in COVID-19
The spread of the global pandemic, COVID-19, has generated a huge and varied amount of data, which is increasing rapidly. This data can be exploited by applying big data analytics techniques in multiple areas, including diagnosis, risk score estimation or prediction, healthcare decision-making, and the pharmaceutical industry [38]. Figure 1 shows examples of potential application areas. In the following subsections, we present several examples of COVID-19 data utilization from the literature, with a primary focus on reviewing studies that have provided solutions to control the COVID-19 pandemic and fall within one of three areas, namely (1) diagnosis (Section 2.1), (2) estimating or predicting risk scores (Section 2.2), and (3) healthcare decision-making (Section 2.3). We also summarize the data analysis techniques and the data type used for each study in Table 2.

Diagnosis
Suspected COVID-19 cases are diagnosed using the Reverse Transcription-Polymerase Chain Reaction (RT-PCR) test. This test takes from around 24 h to several days, depending on several conditions. Many countries experienced increased demand for diagnosing suspected COVID-19 cases, which exceeded the available local testing capacity. Therefore, several researchers have proposed alternatives to the COVID-19 RT-PCR diagnosis test, including the following.
The authors in Reference [39] proposed a model to differentiate between COVID-19 and four other viral chest diseases. The model utilizes several body sensors to collect information and monitor the patient's health condition, including temperature, blood pressure, heart rate, respiratory monitoring, glucose detection, and others. The collected data is stored in a cloud database containing AI-enabled expert systems that help diagnose the symptoms of patients infected with or suspected of having COVID-19 and determine the appropriate procedure to deal with them. However, it is not clear how the patient's health information will be presented to the hospital staff. Moreover, the authors in Reference [19] surveyed numerous papers about mathematical models to improve the efficiency of detecting and predicting COVID-19. Their survey suggested using artificial intelligence to detect COVID-19 cases, big data to trace cases, and nature-inspired computing (NIC) to select suitable features to increase the accuracy of detection.
In Reference [40], the authors provided a flexible and low-cost design of a medical device that can be used to detect and track symptoms of COVID-19. It utilizes headphones and a mobile phone to detect breathing problems. The signals are collected and saved in an audio file format through the mobile app, after which the signals are analyzed using the MATLAB program to identify the respiratory symptoms associated with COVID-19.
Researchers [41] also developed a program to remotely monitor discharged COVID-19 patients. Each patient registered to the app is provided with a pulse oximeter and thermometer to self-report daily symptoms, O2 saturation, and temperature. The abnormal vital signs and symptoms are flagged to be assessed by a group of nurses. Depending on the evaluation outcome, the patient might be readmitted to the Emergency Department (ED). The program helps reduce ED utilization and provides scalable remote monitoring capabilities when a patient is discharged from the hospital.
The authors in Reference [42] found that smartwatches could be utilized in COVID-19 pre-symptomatic detection. They analyzed the physiological and activity data collected from smartwatches of the infected COVID-19 cases. They concluded that 63% of COVID-19 cases could be detected before symptoms appear by applying a two-level warning system based on severe elevations in resting heart rate relative to individual baseline. Moreover, they found that activity tracking and health monitoring using wearable devices can help in early detection of respiratory infections.
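To make the idea of a two-level warning concrete, the following is a minimal sketch of how daily resting heart rate could be compared against an individual baseline. It is not the algorithm used in Reference [42]: the z-score rule, the function name `rhr_alerts`, and the cut-offs `yellow_z` and `red_z` are assumptions chosen only for illustration.

```python
from statistics import mean, stdev


def rhr_alerts(baseline_rhr, daily_rhr, yellow_z=2.0, red_z=3.0):
    """Label each monitored day by how far its resting heart rate (bpm)
    sits above the person's own baseline.

    baseline_rhr: readings from a healthy reference period.
    daily_rhr: readings for the days being monitored.
    yellow_z, red_z: hypothetical z-score cut-offs, not values from [42].
    """
    mu = mean(baseline_rhr)
    sigma = stdev(baseline_rhr)
    if sigma == 0:
        raise ValueError("baseline has no variation; cannot compute z-scores")
    labels = []
    for bpm in daily_rhr:
        z = (bpm - mu) / sigma  # elevation relative to the individual baseline
        if z >= red_z:
            labels.append("red")      # second-level (severe) warning
        elif z >= yellow_z:
            labels.append("yellow")   # first-level warning
        else:
            labels.append("normal")
    return labels


baseline = [58, 60, 59, 61, 60, 62, 60]
alerts = rhr_alerts(baseline, [60, 63, 66])
print(alerts)
```

Using a per-person baseline rather than a fixed population threshold reflects the point made above that elevations are judged relative to the individual.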
Since the COVID-19 symptoms have not been fully identified and due to the changing nature of COVID-19, some studies have focused on identifying the medical characteristics and symptoms associated with positive COVID-19 cases. The study in Reference [43] focused on identifying the symptoms associated with the positive results of the COVID-19 examination, and it was conducted on a group of healthcare workers (HCWs). Initial screening was performed by phone, and a COVID-19 PCR test was also performed for each HCW to identify symptoms associated with each case. The study found that the most common symptoms of positive COVID-19 cases were fever, myalgia, and anosmia/ageusia, while the negative cases mostly have no symptoms, or the symptoms are limited to nasal congestion and sore throat.
The study in Reference [44] aimed to determine the clinical characteristics and outcomes of 5700 hospitalized patients with COVID-19 in the NY area. However, the study included non-critically ill patients and the follow-up time was limited.
Another study [45] proposed a website and Android app to separate a COVID-19 cough sound from other respiratory sounds with the aid of crowdsourced data from about 7000 unique users (more than 200 of whom reported a recent positive test for COVID-19). The proposed method employed Logistic Regression (LR), Gradient Boosting Trees, and Support Vector Machine (SVM) classifiers to distinguish the cough sound data based on gender, age, and symptoms. The classifiers also distinguish users based on other features, such as whether they are asthmatic patients, smokers, or healthy. The app asks the user to cough three to five times and then repeat the process every two days to update the user's health status. The method showed that a COVID-19 cough can be distinguished from the coughs of other lung diseases by combining the cough sound with the breathing sound. It achieved an Area Under the Curve (AUC) of 82% in identifying the cases that tested positive for COVID-19. The authors recommended more studies to specify further characteristics of a COVID-19 cough sound and make it more distinguishable from other respiratory sounds.
The authors in Reference [46] highlighted the importance of using complementary technologies such as on-body sensors for diagnosing and monitoring COVID-19 infections. They stated that clinical devices are more reliable and provide more functions than smartwatches, since these devices are distributed over different areas of the human body to detect different body signals. A thin, soft sensor with a high-bandwidth accelerometer and a precision temperature sensor placed on the neck is very important for recording respiratory activity, from cough frequency, intensity, and duration, to respiratory rate and effort, to high-frequency respiratory features associated with wheezing and sneezing. They also recommended machine learning and predictive algorithms to help diagnose and monitor COVID-19.
In Reference [47], researchers emphasized the importance of identifying the characteristics of COVID-19 among patients in Saudi Arabia for managing the pandemic. The study included 1519 cases, for which data related to age, gender, vital signs, public data, and clinical examinations were collected. Testing was based on the quantitative RT-PCR approach, which is the protocol established by the World Health Organization. After the data was gathered, it was entered into electronic sheets by distinct data collectors and analyzed with the Statistical Package for Social Sciences program, version 24 (SPSS-24). The statistics showed that the most common symptoms of COVID-19 are cough and fever, present in 89.4% and 85% of reported positive cases, respectively. The study also confirmed that the most affected demographics include elderly males, patients with severe cardiac conditions, and diabetic patients.
The authors in Reference [48] utilized machine learning techniques along with Spark-based linear models, Multilayer Perceptron (MLP), and Long Short-Term Memory (LSTM) in a two-stage cascading platform to enhance prediction accuracy on different datasets. They applied their method to two datasets, for cardiac arrhythmia and resource locator, on which their model achieved higher accuracy and lower computation time. Similarly, the authors in Reference [49] proposed a computer-aided classification model that analyzes retinal images of diabetic retinopathy to investigate its role in causing blindness among adults. They showed that the focused connection among the layers of the convolutional network improves the accuracy of the classification result.
The retrospective, observational study in Reference [50] conducted a statistical analysis to show the cardiovascular implications of COVID-19 for patients. The study was performed on 116 patients who tested positive for COVID-19. The data was collected clinically and examined to extract clinical symptoms and signs, chest computed tomography, treatment measures, and medical records. The statistical analysis revealed results similar to those reported by Reference [47]: the common symptoms were fever and dry cough, and elderly or middle-aged males, heart injury patients, hypertension patients, and diabetics were the most affected populations.
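Since the result above is reported as an AUC, a short sketch of how that metric can be computed may help readers interpret it. The code below uses the rank-based definition (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case). The labels and scores are made-up illustrative values, not data from Reference [45].

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a positive case, 0 for a negative case.
    scores: classifier outputs, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for six recordings (three positive, three negative).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a value of 82% indicates useful but imperfect discrimination.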

Estimate or Predict Risk Score
Estimating the risk score helps in determining the care level and priority for each patient and gives insight into the necessary proactive measures. In the following section, we present the studies that cover this area.
In Reference [51], the authors aimed to validate the hypothesis that COVID-19 infection could lead to serious cardiovascular diseases or worse outcomes. They performed a statistical analysis employing a multi-factorial logistic regression model to analyze COVID-19-related causes. The study was conducted on 54 patients of different ages, genders, and vital signs, of whom 39 were diagnosed as severe COVID-19 cases and 15 as critical COVID-19 cases. The data was collected clinically from the patients using attached vital sign measurement devices updated every four hours. Results showed that elderly males, diabetic patients, and hypotension patients are more likely to develop a serious heart-related condition and need more care. The study is limited by its small sample size, and the authors suggested a larger sample to conduct a more appropriate study and verify the results.
The authors in Reference [52] developed and validated a risk score to predict adverse events among patients suspected of having COVID-19. They conducted a retrospective cohort study of adult visits to the emergency department. The primary outcome was death or respiratory decompensation within 7 days. To derive the risk score, they used the Least Absolute Shrinkage and Selection Operator (LASSO) and Logistic Regression models. They concluded that the COVID-19 Acuity Score (COVAS) can assist in decision-making to discharge patients during the COVID-19 pandemic. They also reported the derivation and validation metrics for cohorts and subgroups with a pneumonia or COVID-19 diagnosis.
The authors in Reference [53] proposed an Internet of Things (IoT)-based system to discover unregistered COVID-19 patients, as well as contaminated places. This would help the responsible authorities to disinfect contaminated public places and quarantine the infected persons and their contacts even if they do not have any symptoms. Newly confirmed and recovered cases would be recorded in the system by the healthcare staff, while geolocation data would be collected automatically by Global Positioning System (GPS) technology in the IoT devices. The authors discussed how their proposed system could be utilized to apply three different mathematical prediction models, namely the θ-SEIHRD model, the Susceptible-Infected-Recovered (SIR) model, and the Susceptible-Exposed-Infectious-Removed (SEIR) model.
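As an illustration of how LASSO and logistic regression can be combined when deriving a risk score, the sketch below fits an L1-penalized logistic regression by proximal gradient descent. It is a generic textbook formulation, not the COVAS derivation from Reference [52]. The toy data, the `penalty` and `step` values, and all function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, amount):
    """Shrink a weight toward zero; this step is what lets LASSO drop features."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_l1_logistic(rows, outcomes, penalty=0.1, step=0.5, iterations=2000):
    """Proximal gradient descent for logistic regression with an L1 penalty.

    rows: feature vectors (assumed already standardized).
    outcomes: 1 if the adverse event occurred, else 0.
    The bias term is left unpenalized.
    """
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(iterations):
        # Prediction error for every patient under the current model.
        errors = [
            sigmoid(bias + sum(w * x for w, x in zip(weights, row))) - y
            for row, y in zip(rows, outcomes)
        ]
        bias -= step * sum(errors) / n
        for j in range(len(weights)):
            gradient = sum(e * row[j] for e, row in zip(errors, rows)) / n
            weights[j] = soft_threshold(weights[j] - step * gradient, step * penalty)
    return weights, bias


# Hypothetical data: feature 0 tracks the outcome, feature 1 is unrelated to it.
rows = [(-1.0, 1.0), (-1.0, -1.0), (1.0, 1.0), (1.0, -1.0)]
outcomes = [0, 0, 1, 1]
weights, bias = fit_l1_logistic(rows, outcomes)
print(weights, bias)
```

In this toy example the penalty keeps the weight of the uninformative feature at zero, which is the variable-selection behavior that makes LASSO attractive for building compact clinical scores.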
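For readers unfamiliar with compartmental models, the following is a minimal sketch of the standard SIR model mentioned above, integrated with a simple Euler scheme. The parameter values (`beta`, `gamma`, population size) are illustrative assumptions and are not taken from Reference [53].

```python
def simulate_sir(population, initial_infected, beta, gamma, days, dt=0.1):
    """Integrate the classic SIR equations with forward Euler steps.

    beta: transmission rate per day; gamma: recovery rate per day.
    Returns one (S, I, R) tuple per day, starting with day 0.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


# Illustrative run: basic reproduction number beta / gamma = 3.
trajectory = simulate_sir(
    population=1_000_000, initial_infected=10, beta=0.3, gamma=0.1, days=180
)
peak_day = max(range(len(trajectory)), key=lambda d: trajectory[d][1])
print(peak_day, round(trajectory[peak_day][1]))
```

The SEIR model extends this scheme with an exposed compartment between S and I, and the geolocation and case data collected by such a system would be used to estimate parameters like `beta` rather than fixing them by hand.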
Another study [54] demonstrated the possibility of transmitting the COVID-19 virus through indirect contact, like touching surfaces contaminated with the droplets of an infected person. Therefore, it was recommended that paying attention to personal hygiene and disinfection of public places could possibly reduce the incidence.
Furthermore, researchers [55] conducted a cross-sectional study to show the psychological impact of the COVID-19 outbreak. They found that fear of a COVID-19 outbreak can have significant psychological repercussions on people, which requires more attention from the relevant authorities to cope with this impact. Also, the authors in Reference [56] proposed a model that identifies the risk of being infected by tuberculosis based on several factors related to the tuberculin skin test, age, and a weak immune system. They stated that those factors can increase the risk of infection from 10% to 20%.
The authors in Reference [57] provided a model that predicts the course of the outbreak to help plan an efficient method of prevention. The model, named SIDARTHE, has eight stages: susceptible, infected, diagnosed, ailing, recognized, threatened, healed, and extinct. It discriminates between infected people based on whether they have been diagnosed and on the severity of their symptoms. The simulation results obtained by combining the model with the available data on the COVID-19 pandemic in Italy indicate that such prevention measures are urgently needed.

Healthcare Decision-Making
During the COVID-19 pandemic, the demand for emergency departments and medical equipment such as ventilators increased. Therefore, many studies have aimed to provide monitoring tools and models that help in making several medical decisions to mitigate potential risks, and these solutions include the following.
The authors in Reference [58] designed a prediction model, called the Conscious-based Susceptible-Exposed-Infective-Recovered (C-SEIR) model, to assess the usefulness of the lockdown and protective countermeasures in decreasing the impact of the pandemic in Wuhan city. The proposed model includes two groups, namely the quarantined suspected infection group (P) and the quarantined diagnosed infection group (Q). Its results are plotted as curves, with solid lines for daily patients and dashed lines for cumulative patients. The predictions showed that the number of patients drops or rises depending on the lockdown precautions applied in Wuhan. The authors also gave guidance for protection against COVID-19, such as being educated about the virus, social distancing, and lockdown.
In Reference [59], the authors developed a patient monitoring program that allows daily electronic checking of symptoms, provides advice and reminders via text messages, and provides care by phone. Patients registered in the system complete a daily questionnaire that evaluates 10 symptoms on a scale from 0 to 4 and records how much they feel the infection is affecting them, the number of analgesic/antipyretic tablets they take, and their measured temperature. Questionnaire responses are used to classify patients and specify the care needed. The study focused on three measures, namely the number of patients monitored over time, the daily symptoms score, and daily ED referrals.
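A daily questionnaire of this kind lends itself to a simple scoring rule. The sketch below sums ten symptom ratings (each from 0 to 4) and maps the total and the measured temperature to a follow-up category. The cut-offs, the category names, and the function name `triage` are hypothetical and are not the classification rules of Reference [59].

```python
def triage(symptom_scores, temperature_c, low_cutoff=10, high_cutoff=20, fever_c=38.0):
    """Turn one day's questionnaire into a total score and a follow-up category.

    symptom_scores: ten ratings, each an integer from 0 (absent) to 4 (severe).
    temperature_c: self-measured body temperature in degrees Celsius.
    low_cutoff, high_cutoff, fever_c: hypothetical thresholds for illustration.
    """
    if len(symptom_scores) != 10:
        raise ValueError("expected ratings for exactly 10 symptoms")
    if any(not 0 <= score <= 4 for score in symptom_scores):
        raise ValueError("each rating must be between 0 and 4")
    total = sum(symptom_scores)  # ranges from 0 to 40
    has_fever = temperature_c >= fever_c
    if total >= high_cutoff or (total >= low_cutoff and has_fever):
        return total, "refer for clinical assessment"
    if total >= low_cutoff or has_fever:
        return total, "phone follow-up"
    return total, "continue self-monitoring"


print(triage([1] * 10, 36.8))
print(triage([2, 3, 1, 0, 4, 2, 1, 0, 0, 1], 38.4))
```

In practice, the thresholds would need to be set and validated clinically. The point of the sketch is only that structured daily responses can be converted into a monitored score and a referral decision.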
Likewise, the authors in Reference [60] developed a mobile app to track the spread of COVID-19 symptoms in the UK by analyzing a set of data reported by patients registered in the app, including location, age, health risk factors, symptoms, healthcare visits, and COVID-19 test results. Survey data helped in determining patients' type and intensity, availability of personal protective equipment, and work-related stress and anxiety.
The study presented in Reference [61] was concerned with evaluating one of the COVID-19 applications in terms of user satisfaction and the possibility of using the data collected to support decision-makers and healthcare providers. The app collects information daily from patients, including symptoms, vital signs, and an assessment of their satisfaction with the services provided by the app. The data collected is distributed on an interactive map according to the postal code for each user, which helps in knowing the regional distribution of the spread of infection in addition to the percentage of healthcare consumption in each region.
Another study [62] provided an analytical model for predicting patient census and estimating ventilator needs for a given hospital during the COVID-19 pandemic. The study observed that the estimation of bed and ventilator needs is influenced by the length of hospital stay and the number of days of inpatient ventilator use. Also, there was no relationship between the age of hospitalized patients and the likelihood of needing a ventilator, or between inpatient gender and the length of stay. The authors recommended that each hospital rely on its internal data for accurate resource planning.
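The dependence of bed and ventilator needs on length of stay and ventilation duration can be illustrated with a small census projection. This is a deliberately simplified sketch and not the analytical model of Reference [62]: it assumes every patient stays exactly `length_of_stay` days, a fixed fraction `vent_fraction` is ventilated from the day of admission for `vent_days` days, and all numbers are invented.

```python
def project_census(daily_admissions, length_of_stay, vent_fraction, vent_days):
    """Expected occupied beds and ventilators in use on each day.

    daily_admissions: forecast number of admissions per day.
    length_of_stay: assumed fixed hospital stay, in days.
    vent_fraction: assumed share of inpatients needing mechanical ventilation.
    vent_days: assumed fixed duration of ventilation, in days.
    """
    horizon = len(daily_admissions)
    beds = [0.0] * horizon
    vents = [0.0] * horizon
    for day, admitted in enumerate(daily_admissions):
        # Each cohort occupies beds from its admission day until discharge.
        for t in range(day, min(day + length_of_stay, horizon)):
            beds[t] += admitted
        # A fraction of the cohort also occupies ventilators for vent_days.
        for t in range(day, min(day + vent_days, horizon)):
            vents[t] += admitted * vent_fraction
    return beds, vents


beds, vents = project_census(
    [10, 10, 10, 10, 10], length_of_stay=3, vent_fraction=0.2, vent_days=2
)
print(beds)
print(vents)
```

Doubling `length_of_stay` in this sketch roughly doubles the steady-state census, which is why the recommendation to estimate these durations from each hospital's own data matters for planning.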
Furthermore, the Institute for Health Metrics and Evaluation (IHME) COVID-19 health service utilization forecasting team conducted a study to predict the expected daily use of health services and the number of deaths due to COVID-19 for the next four months from the date of the study for each state in the US [63].
The authors in Reference [64] described the clinical characteristics of COVID-19 patients and identified factors that predict intensive care unit (ICU) admission. They found that the need for a COVID-19 patient to enter the ICU can be predicted by checking a set of medical parameters that can be easily obtained: age, fever, and tachypnea with or without respiratory crackles. They used the EHRead [65] technique, developed by Savana, to extract information from the medical records. Deep learning convolutional neural network classification methods were then used to classify the extracted data.
The authors in Reference [66] provided a data-driven framework to pre-assess the risks of the COVID-19 pandemic and to identify high-risk areas in Italy. The framework assesses the risk index using a function consisting of three criteria, namely disease risk, area exposure, and the vulnerability of its population. The twenty Italian regions are classified based on available historical data, which include population density, age, human mobility, air pollution, and winter temperature. The study showed a correlation between the risk index and the numbers of deaths, infected people, and patients in ICU. They also provided a policy model to assist authorities in making several decisions. Moreover, regional healthcare models have been developed to estimate the pandemic, like the Monte Carlo simulation approach developed at the University of Pennsylvania [67]. Such models can be used to manage facilities and plan for an anticipated increase in patient numbers, but not to estimate daily operational needs. Applying the Pennsylvania model to an individual hospital requires parameters that are unknown, like the proportion of the region's patients expected to visit that hospital and the percentage of the regional population isolated sufficiently to avoid infection.
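The three-criterion risk index described for Reference [66] can be sketched as follows. The multiplicative form (risk = hazard × exposure × vulnerability) is a common convention in risk assessment and is assumed here. The exact function used in Reference [66] is not given in this review, and the region names and values below are invented for illustration.

```python
def risk_index(hazard, exposure, vulnerability):
    """Combine three normalized criteria (each in [0, 1]) into one index.

    hazard: disease risk in the area.
    exposure: how exposed the area is (for example, density and mobility).
    vulnerability: how susceptible the population is (for example, age structure).
    The product form is an assumption, not the formula from [66].
    """
    for name, value in (("hazard", hazard), ("exposure", exposure),
                        ("vulnerability", vulnerability)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be normalized to the range [0, 1]")
    return hazard * exposure * vulnerability


# Hypothetical regions with (hazard, exposure, vulnerability) values.
regions = {
    "Region A": (0.8, 0.9, 0.7),
    "Region B": (0.4, 0.5, 0.6),
    "Region C": (0.6, 0.3, 0.9),
}
ranking = sorted(regions, key=lambda name: risk_index(*regions[name]), reverse=True)
print(ranking)
```

A product means that a region scores high only when all three criteria are elevated, whereas a weighted sum would let one criterion compensate for another. Which behavior is appropriate is a modeling choice.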

Big Data Analytics Tools
Enterprise systems that provide functions for big data applications are known as big data analytics platforms. They help companies reveal previously overlooked correlations, market trends, and valuable information from large amounts of data. Tables 3 and 4 show the most popular big data analytics platforms and data storage management tools, respectively.
Table 3 (excerpt). Azure: a cloud-based big data platform used for developing, analyzing, installing, and managing applications. It provides the following services: software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS). A free Azure account includes popular services free for 12 months.

Findings, Challenges, and Future Directions
This section is organized as follows. First, Section 4.1 provides our findings from the literature review conducted in Section 2. Section 4.2 discusses the key challenges that were faced when designing big data analytics solutions to address the COVID-19 pandemic. Section 4.3 presents several future directions to be considered by researchers and authorities.

Findings
This section is organized as follows. First, Section 4.1.1 introduces the type and source of data that can be used in healthcare solutions. Then, Section 4.1.2 introduces the type and source of COVID-19 data found in the literature.

Data Type and Source
Many kinds of data can be utilized in the medical health sector. As shown in Figure 2, medical data can be classified into six categories based on their type and source. Analyzing this data will assist in predicting future events, understanding the current situation, and making several decisions. Medical data can be obtained from many sources: it can be collected using sensors of wearable/mobile devices or medical devices [39,42,46,53], online questionnaires [55,59], websites or mobile apps [40,41,43,45,60,61], hospital records [50-52,62,64], local and international health systems [44,47,57,63,67], interviews and case study samples [54], and open databases or social media websites [58].

Data Used in COVID-19 Solutions
Many solutions have been designed to control the COVID-19 pandemic, including diagnosis, forecasting, and decision-making solutions. These solutions use many types of data, shown in Figure 3, which we will introduce in this section based on the survey conducted in Section 2.

Demographic data is useful in understanding the main characteristics of the population and can be used to classify study samples into several categories, such as males and females, to simplify the study of the sample. Social data is also used by solutions that study the impact of the repercussions of the COVID-19 pandemic on the human psychological state. Moreover, there are researchers who have been interested in investigating the possibility of benefiting from activity data and other indicators collected via smartwatches and wearables. Travel data is used to identify suspected COVID-19 cases that have come from countries where the pandemic has spread. Table 5 shows examples of each type of data discussed in this paragraph.

Medical data is widely used in studies aimed at controlling COVID-19; it makes it possible to determine the features of the disease that help in its diagnosis and in predicting its occurrence. COVID-19-specific data is also used, covering the number and status of cases and the results of the PCR COVID-19 test. Another type of data relies on sampling to detect virus incubators and contaminated places. Statistical data is used for resource management and risk prediction, such as anticipating full utilization of ICU capacity, in order to devise proactive solutions. Finally, environmental data, which some studies have considered, is used to assess the risk of the pandemic spreading and to determine the areas in which the population is more vulnerable to infection. Table 6 shows examples of each type of data discussed in this paragraph.

Medical data: Medical history [45,60]; Routinely taken medications [42,59]; Laboratory findings [47,50,51]; CT scans [39,48,50,54]; Required ICU [41]; ICU length of stay [41]; Readmission status [41]
COVID-19 data: Number of cases and status [42,44,47,50,53,58,63]; Test date [42,44]; Results (laboratory, outcome) [42,45,47,50,60]; Symptom onset date [42]; Incubation periods [47]; Treatment measures [50]; Infection feels [59]
Samples: Throat swabs [54]; Blood samples [54]; Aerosol and surface samples [54]
Statistical data: Healthcare visits [60]; Hospital capability and utilization [63,67]; Known regional injuries [67]; Percentages related to ICU [67]; Future daily admissions [62]; Percentage of inpatients requiring MV [62]; ICU lengths of stay [62]; Duration of MV [62]; App satisfaction assessment [61]; Hospital market share [67]; Population age and size [66,67]
Environmental data: Epidemiological data [57]; Air pollution [66]; Winter temperature [66]; Healthcare density [66]; Human mobility [66]; Housing concentration [66]
Note: ICU: intensive care unit, MV: mechanical ventilation.
Moreover, Table 7 summarizes the vital signs and outwardly measurable symptoms considered by the reviewed studies; the distributions of vital signs and symptoms across these studies are presented in Figures 4 and 5, respectively. Several techniques shown in Table are used to analyze the data presented in this section. Many other techniques have been used in healthcare, as summarized in References [24,71], and numerous other applications can be found in the literature. Based on the survey conducted in this paper, the main challenges in applying data analysis techniques to solutions for coping with the COVID-19 pandemic are the volume and variety of the data. For example, prediction models developed on data from a particular hospital may not provide the same accuracy when applied to data from a different source. Therefore, sharing data at the local and international levels will help improve the accuracy of data analysis solutions.

Key Challenges
Several challenges may hinder the beneficial application of big data analysis tools in the health sector, and they have been encountered when designing solutions to address the COVID-19 epidemic. These challenges are discussed in the following subsections.

Security and Privacy
Healthcare data security and patient privacy issues [22,72,73,74] are a concern for authorities and patients alike, and medical data is shared only under certain conditions, with specific specialists or researchers, and for specific purposes. It is therefore necessary to define the mechanisms, strategies, and regulations that govern and facilitate access to medical data without compromising patients' privacy or allowing the data to be exploited for unacceptable purposes. This need is especially pressing in critical conditions and during the spread of dangerous epidemics that require quick solutions, such as COVID-19.

Sharing Data
Variety and volume of data play a vital role in extracting useful information and in understanding various events when applying data analysis tools [51]. For example, the spread of COVID-19 in the city of Wuhan in China raised concerns in other countries about the characteristics of the virus and its impact, and about determining which countries were affected by the epidemic and whether travelers had visited them, so that preventive measures could be taken to limit the spread of infection. This challenge can be overcome by making use of Blockchain technology [75], which helps in sharing information securely on a large scale by anonymizing patients as well as the verified data.
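The idea of sharing tamper-evident, de-identified records can be illustrated with a minimal hash-chain sketch. This is not the scheme of Reference [75] and not a full blockchain (there is no network or consensus); the function names, the record layout, and the salt are hypothetical. Note also that a salted hash is pseudonymization rather than full anonymization.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder back-link for the first block


def pseudonymize(patient_id, salt):
    """Replace a patient identifier with a salted SHA-256 digest."""
    return hashlib.sha256((salt + patient_id).encode("utf-8")).hexdigest()


def _digest(body):
    """Hash a block body using a deterministic JSON serialization."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_record(chain, patient_id, payload, salt):
    """Append a pseudonymized record linked to the previous block's hash."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    body = {
        "patient": pseudonymize(patient_id, salt),
        "payload": payload,
        "previous_hash": previous_hash,
    }
    chain.append({**body, "hash": _digest(body)})
    return chain


def verify_chain(chain):
    """Return True if every block's hash and back-link are consistent."""
    previous_hash = GENESIS_HASH
    for block in chain:
        body = {key: value for key, value in block.items() if key != "hash"}
        if block["previous_hash"] != previous_hash or block["hash"] != _digest(body):
            return False
        previous_hash = block["hash"]
    return True
```

Because each block stores the hash of its predecessor, altering any shared record changes its digest and breaks every later link, which a recipient can detect with `verify_chain`.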

Information Correctness
Although the Internet and social media play a great role in transmitting information and facilitating communication, they are also among the main channels for false medical information and rumors, for example, about the disease, the effects of the virus, and the impact of the vaccine. Such misinformation hinders the efforts of government and health agencies to contain the spread of the virus and preserve human health, and it may also have negative psychological effects on society. Moreover, missing or incorrect study data may lead to biased study findings [44]. However, artificial intelligence and big data analytics tools can be used to check and filter information on the Internet, alert people to misinformation, and remove it from the network [76].

Patient Cooperation
The patient is the main source for understanding the nature and characteristics of new diseases. There is therefore an urgent need for patients to share part of their health information, for example, their medical history record, with research organizations. Sharing activity and physiological information gathered from wearables can also contribute to building predictive systems. However, many people are not willing to share their health information with others, nor other personal information such as gender and location [45]. For example, in a survey conducted in January 2020 [14], only 37% of 4600 individuals without severe disease indicated a willingness to blindly share their health information with healthcare research organizations. Therefore, people must be educated about the importance of blind data sharing. In addition, to increase people's confidence in the privacy of their data, the parties authorized to collect data must be identified, along with the regulations to which they adhere.

Future Directions
Most countries have made many efforts to contain the spread of COVID-19 and mitigate its repercussions, and they have faced various challenges, including the cost and limited capacity of COVID-19 testing. For instance, the Kingdom of Saudi Arabia signed a contract worth 995 million SR with China for 9 million Coronavirus test kits, to perform diagnostics with a capacity of 10,000 tests per day [77]. Another challenge is the lack of a mechanism to monitor the health status of individuals, especially those who are isolated in their homes. In the United Kingdom, this challenge has resulted in a number of people dying alone in their homes due to the coronavirus, with their deaths not discovered for up to two weeks [78]. Moreover, there is a lack of immediate data for proactive resource management, such as the distribution of medical staff between regions and the estimated number of ventilators required for each hospital, which depends on the expected numbers of patients and their different needs. Therefore, we recommend using big data analytics tools to assist stakeholders in making decisions and predicting the future. The following subsections outline several uses of big data analytics tools, organized by stakeholder level.

Government Level
Social media big data analysis can help spot misinformation about diseases, alert people, and prevent it from spreading. Analysis of international air travel data can help track the spread of the pandemic between countries so that proactive preventive measures can be taken. Moreover, big data science, including advanced machine learning techniques such as deep learning, mathematical and statistical models such as autoregressive integrated moving average (ARIMA), optimization techniques such as particle swarm optimization (PSO), and simulation models such as SEIR (Susceptible, Exposed, Infected, and Recovered states), can be used to accurately predict the development of outbreaks such as COVID-19. Such models help in forecasting, controlling epidemics, and measuring the impact of interventions and control measures that authorities have taken or plan to take. With the available data on COVID-19, these models can be used to describe the dynamics of the outbreak, predict its course early, and thus prepare the healthcare infrastructure to manage the impact of such pandemics.
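To make the SEIR model mentioned above concrete, the following is a minimal sketch that integrates the standard SEIR equations with a forward Euler scheme. The function name and all parameter values are illustrative assumptions; they are not taken from the reviewed studies and are not fitted to any COVID-19 dataset.

```python
def simulate_seir(population, initial_infected, beta, sigma, gamma, days, dt=0.1):
    """Integrate the SEIR equations with a forward Euler scheme.

    beta  : transmission rate per day
    sigma : rate of leaving the exposed state (1 / incubation period)
    gamma : recovery rate (1 / infectious period)
    Returns a list of (S, E, I, R) tuples, one per day, starting at day 0.
    """
    s = float(population - initial_infected)
    e = 0.0
    i = float(initial_infected)
    r = 0.0
    steps_per_day = round(1 / dt)
    history = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population * dt
            new_infectious = sigma * e * dt
            new_recovered = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


if __name__ == "__main__":
    # Illustrative parameters only (assumed, not estimated from data).
    trajectory = simulate_seir(population=1_000_000, initial_infected=10,
                               beta=0.5, sigma=1 / 5, gamma=1 / 10, days=200)
    peak_day = max(range(len(trajectory)), key=lambda d: trajectory[d][2])
    print(f"Peak of infectious individuals on day {peak_day}")
```

An intervention such as a lockdown could be approximated in this sketch by lowering `beta` and comparing the resulting trajectories, which is the kind of what-if analysis the paragraph describes.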
The use of social media has increased during this pandemic. Social media platforms give individuals an easy way to share their views and perceptions, and they can also be used to obtain up-to-date information about the pandemic. Governments can use these colossal amounts of data to track people's views on policies and their awareness of COVID-19. Several Natural Language Processing (NLP) and AI techniques can be used to track individual perceptions of the precautionary measures taken by the government. Some precautionary measures, such as lockdown, social distancing, remote work, and online education, have isolated people and, in some cases, may result in psychological health issues. Sentiment analysis and opinion mining techniques can help detect signs of depression in individuals pre-emptively. These techniques can also be used to track fake news and rumors related to the COVID-19 pandemic.

MoH Level
Analyzing big patient data helps in making proactive resource management decisions, such as deciding how to distribute medical staff and estimating the need for ventilators, as these depend on the expected number of patients in each city and their requirements. Big data models such as machine learning help to identify new disease patterns, symptoms, and disease course, as well as risk factors associated with the disease. This helps in developing strategies and proactive measures and in making decisions related to the allocation of medical resources.
Moreover, most smartwatches and wearable devices can measure most of the vital signs shown in Table 7, and collecting and analyzing such data has many benefits, including the following:

1. Large-scale data analysis of the general population and hospital patients would assist the MoH in identifying current health trends among the population and aid in the early prediction of emergencies and epidemics.

2. The monitoring of the vital signs of the general population can say a lot about their health and help in the gauging of stress levels and the overall health of different age groups, particularly the older population. This would help in the establishment of health drives and clinics raising awareness of the appropriate conditions among the population.

3. Analyzing respiratory rate and oxygen saturation data would help in the identification of respiratory problems among the population, including pollution-related respiratory problems among different cities, age groups, and genders.

4. The monitoring of various symptoms on a large scale will also help the MoH gauge in advance the health of the population in general and enable them to take proactive decisions.

5. Centralized real-time data visualization for the number of active and infected cases can help the MoH to identify the areas that contain huge numbers of COVID-19 patients. Furthermore, it can aid the health professionals and decision makers to provide more health facilities in the areas with huge numbers of COVID-19 patients. Similarly, the policy makers can impose strict precautionary measures, and this will reduce the risk of contamination. Big data analysis tools can provide very powerful data analysis and visualization techniques.
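The aggregation behind such a centralized view of active cases can be sketched as follows. This is a minimal illustration only: the record fields (`region`, `status`), the population table, and the function name are hypothetical and do not describe any MoH system.

```python
from collections import Counter


def active_cases_per_100k(case_records, populations):
    """Rank regions by active cases per 100,000 inhabitants.

    case_records : iterable of dicts with hypothetical keys "region" and "status"
    populations  : dict mapping region name to population size
    Returns a list of (region, rate) pairs sorted from highest to lowest rate.
    """
    active = Counter(
        record["region"] for record in case_records if record["status"] == "active"
    )
    rates = {
        region: 100_000 * active[region] / size  # Counter yields 0 for unseen regions
        for region, size in populations.items()
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

Normalizing by population is a design choice in this sketch: raw counts favor large cities, whereas a per-capita rate highlights areas where the burden is high relative to the number of residents.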

Hospital Level
The analysis of remote patient monitoring data can assist in estimating the number of patients in a specific area, so that hospitals can plan for any expected increase in patient numbers beyond their capacity. Moreover, health data is growing exponentially, making it difficult to use traditional representation methods such as tables. Employing artificial intelligence alongside data analytics tools helps address this challenge by extracting and representing data in real time; the Savana system [65] is an example.
The application of AI and ML techniques to the automated early diagnosis and prognosis of diseases in general, and of COVID-19 specifically, has shown significant outcomes. Similarly, a remote triaging system for COVID-19 patients allows patients to be monitored remotely. The emergence of non-invasive medical devices and the integration of sensors in smart devices and watches facilitate remote monitoring, and the data generated by the sensors can be used by AI and ML algorithms for diagnosis and prognosis. Given the huge number of COVID-19 patients and the risk of contamination, these applications allow doctors to monitor patients with mild conditions remotely.

Individuals/Patients Level
Real-time analysis of hospital data that are related to admitted patients, waiting lists, and hospital capacity helps individuals locate less crowded hospitals with earlier appointments and less waiting time. Also, linking patient data to maps can help identify areas of infection and provide warnings to people when they are in these areas to reduce the chance of infection. Moreover, the employment of advanced machine learning models such as deep learning can help classify many respiratory diseases by applying them to large samples of coughing and breathing sounds. Integrating such models into mobile apps helps provide a rapid mechanism for individuals to pre-diagnose respiratory symptoms and determine the need for diagnosis by clinicians.

Responsible Authorities Level
Analyzing mobile data helps in identifying contaminated public places for disinfection and in quarantining infected people and their contacts, even if they show no symptoms. Moreover, integrating mathematical models of the spread of infectious diseases with interactive maps and GPS technology can help determine the locations and paths of infected people, which allows a quarantine to be imposed only on infected people and not on others. In turn, this reduces the economic damage caused by suspending all activities for everyone during the quarantine period.
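As a rough illustration of how location data could be used to find contacts of infected people, the sketch below flags anyone whose GPS ping falls close in both space and time to a ping from an infected person. The ping layout `(person_id, timestamp_s, lat, lon)`, the function names, and the distance and time thresholds are assumptions made for this example; real contact tracing must also account for GPS error and privacy constraints.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def find_contacts(infected_pings, other_pings, max_distance_m=10.0, max_gap_s=900):
    """Return ids of people with a ping near an infected ping in space and time.

    Each ping is a tuple (person_id, timestamp_s, lat, lon).
    The default thresholds are illustrative assumptions, not clinical guidance.
    """
    contacts = set()
    for _, t_inf, lat_inf, lon_inf in infected_pings:
        for person_id, t, lat, lon in other_pings:
            close_in_time = abs(t - t_inf) <= max_gap_s
            if close_in_time and haversine_m(lat_inf, lon_inf, lat, lon) <= max_distance_m:
                contacts.add(person_id)
    return contacts
```

The nested loop compares every pair of pings, which is adequate for a sketch; at population scale a spatial index or grid bucketing would be needed to keep the comparison tractable.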

Conclusions
The volume of data increases dramatically over time, especially data generated on the global pandemic caused by COVID-19. Such a volume of data requires big data analytics tools, along with AI techniques, to make sense of the pandemic and control its spread in a timely manner. In this study, we presented a review of several data analysis applications for COVID-19 and provided a taxonomy that classifies the potential applications of COVID-19 data analysis into four categories, namely diagnosis, estimating or predicting risk score, healthcare decision-making, and pharmaceutical. The paper introduced several data analysis tools and explained the main features of each tool. We also provided insights on a number of challenges that might hinder the use of data analytics tools for COVID-19. These challenges include healthcare data security and patient privacy issues, the difficulty of sharing data with researchers, the absence of data validation in some studies, which may lead to biased results, and patients' cooperation in sharing part of their medical information. Finally, we highlighted and discussed a number of future directions that should be considered in further research and applications to assist stakeholders, such as governments, MoHs, hospitals, patients, and responsible authorities, in making decisions and predicting the future.