An Infodemiology and Infoveillance Study on COVID-19: Analysis of Twitter and Google Trends

Infodemiology uses web-based data to inform public health policymakers. This study aimed to examine the diffusion of Arabic-language discussions and analyze the nature of internet search behaviors related to the global COVID-19 pandemic through two platforms (Twitter and Google Trends) in Saudi Arabia. A set of Arabic Twitter data related to COVID-19 was collected and analyzed. Using Google Trends, internet search behaviors related to the pandemic were explored. Health and risk perceptions and information related to the adoption of COVID-19 infodemic markers were investigated. Moreover, Google mobility data was used to assess the relationship between different community activities and the pandemic transmission rate. The same data was used to investigate how changes in mobility could predict new COVID-19 cases. The results show that the top COVID-19-related terms for misinformation on Twitter were folk remedies from low-quality sources. The number of COVID-19 cases in different Saudi provinces has a strong, statistically significant negative correlation with COVID-19 search queries on Google Trends (Pearson r = −0.63, p < 0.05). The reduction of mobility is highly correlated with a decreased number of total cases in Saudi Arabia. Finally, the number of total cases is the most significant predictor of new COVID-19 cases.


Introduction
In late December 2019, news of a novel coronavirus began to emerge from Wuhan, China. The new virus, named by the World Health Organization (WHO) as Coronavirus Disease 2019 (COVID-19), causes severe acute respiratory infection, leading to death in many cases [1]. This virus has continued to spread aggressively around the world, causing global panic and posing serious challenges to public health and policy makers. Many countries have enforced rigorous protective measures to combat the pandemic, such as mandatory quarantine and massive closures. However, efforts to slow down the transmission rate have been undermined by the COVID-19 'infodemic' [2][3][4][5].
An important public health response to the COVID-19 infodemic is proactive communication, with the objectives of alleviating confusion, avoiding misunderstandings, minimizing adverse consequences, and saving lives [6]. In March 2020, WHO published a recommendation to provide guidance for countries on how to implement effective Risk Communication and Community Engagement (RCCE) strategies to improve public awareness and perception [7]. RCCE can help prevent infodemics by building trust in public health institutions and spokespersons, thereby increasing the chance that public health guidelines will be followed. Determining what people know, how they feel, and what they do to keep the outbreak under control is essential for public health workers to address the public's perception of risk and uncertainties. Thus, designing effective RCCE strategies to stop the further spread of misinformation and rumors must be based on public health surveillance. Social media and web searches have become essential sources of information about the public and are widely utilized for health-related information. The COVID-19 pandemic provides a starting point to discuss how social media and web searches can be leveraged to design beneficial RCCE to address this unprecedented crisis.
Sustainability 2021, 13, 8528
In the past decade, the internet has become an integral part of people's lives. Online sources that provide real time or near-real time data are becoming increasingly available, consequently changing the pattern of spread of health-related information [8]. This paradigm shift can be useful for understanding population health concerns and needs, the diffusion of health-related information/misinformation, and the public's reaction to health events [9]. There is a large body of studies utilizing social media and internet traffic to understand the prevalence and diffusion of information and misinformation about COVID-19, providing insights for public health surveillance [6] and health policy makers [10,11].
This use of the internet has formed two new concepts: infodemiology, defined as the science of distribution and determinants of information on the Internet, and infoveillance, defined as the longitudinal tracking of infodemiology metrics for surveillance and trend analysis [9]. Infodemiology and infoveillance have contributed to public health knowledge by analyzing a range of topics such as chronic diseases and influenza. More specifically, infodemiology provides the necessary tools for understanding health-related infodemics, which can make it hard for people to find reliable sources and trustworthy guidance during public health crises [12,13].
Twitter Analytics and Google Trends are two of the most effective infoveillance tools for assessing the dissemination of information on health issues and topics [14,15]. Twitter functions as a convenient source of information about COVID-19 [11]. Google Trends is a tool that provides both real-time and archived information on Google queries worldwide. Because these internet searches are performed anonymously, the information obtained from them enables the analysis and forecasting of health-related topics. The social media search index has been identified as a promising predictor of COVID-19 transmission rates [16,17]. Although Twitter has some limitations as a tool for disease prediction and containment, its potential for communicating people's stories and sharing news may profoundly impact public health outcomes. Both Google Trends and Twitter can serve as viable resources to understand people's perceptions and to monitor their reactions to the pandemic over time [14,18].
The use and roles of social media and web searches amid the COVID-19 pandemic have been widely studied [14,16,17]. However, to the best of our knowledge, no systematic research has yet been conducted on Arabic infodemiology and infoveillance. Given that 41% of the online population in Saudi Arabia uses social media, a higher percentage than any other country in the world, it is vital to conduct a study concerning the Arabic language [19]. Considering the impact that the spread of information and misinformation about COVID-19 may have upon the transmission rate, determining the public reactions to tweets and investigating the nature and diffusion of COVID-19-related information on the internet can provide important insights into the beliefs and concerns of Arabic users. Furthermore, analyzing the spread of information found in tweets may help decision makers intervene to limit the spread of misinformation or fake news, for example by setting laws to prevent the spread of unreliable information.
In this paper, the infodemiology of COVID-19 in terms of socially disseminated information is addressed. First, the magnitude of misinformation and the quality of information sources regarding the COVID-19 epidemic being spread on Twitter in the Arabic language were analyzed. Second, information prevalence indicators (search volume) about COVID-19 were analyzed using data from Google Trends to examine the correlation between information prevalence and daily new cases in different provinces in Saudi Arabia. Third, the relationship between mobility activity in Saudi Arabia and COVID-19 prevalence was examined [20]. Fourth, an infodemiology study with a multiple regression analysis was conducted to examine the relationship between new COVID-19 cases and various potential predictors, namely overall mobility, total confirmed cases, and information prevalence. Saudi Arabia was chosen as a case study since it is one of the Arab countries most highly affected by the pandemic [21]. Moreover, to the best of our knowledge, no study has examined infodemiology and infoveillance in Saudi Arabia. Figure 1 shows the status of the pandemic in Saudi Arabia according to the regions. The cumulative number of cases in Saudi Arabia reached 275,000 as of 3 June 2020, with confirmed cases in each Saudi province [22]. This study serves as a starting point for designing strategic messages for health campaigns and establishing an effective risk communication channel.

Materials and Methods
To answer the research questions of this study, data from Twitter [23], Google Trends [24], and Google COVID-19 community mobility data [20] were collected. In this section, the data collection process and analysis are explained.

Twitter
Tweets about the novel coronavirus that were written in Arabic and geolocated in Saudi Arabia were searched between 13 June 2020 and 12 July 2020 using Python 3.7.0, the Twitter standard search Application Programming Interface (API), and the Tweepy Python library. The search used a predefined set of terms, taken from Twitter trends, that were the most widely used media terms for novel coronavirus remedies. Table 1 shows the English name for each of the terms used in the search with a brief description. A set of 6541 Arabic tweets was collected. The data contains the text and metadata of the tweets, including tweet id, username, hashtags, and number of retweets. The data preprocessing involves three phases: data cleaning, normalization, and lemmatization. In the data cleaning phase, all non-Arabic terms, stop words, numbers, punctuation, emojis, hashtags, and URLs were removed programmatically. In the normalization phase, multiple forms of a letter were converted into one uniform letter, and numbers, spaces, repeated letters, and elongation were removed using the Tashaphyne Python library [25]. In the lemmatization phase, words were converted to their roots using the Farasa toolkit [26]. An example of the preprocessing phases is shown in Figure 2.
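The cleaning and normalization phases described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses only Python's standard `re` module in place of Tashaphyne, omits the Farasa lemmatization phase entirely, and the stop-word list, the letter mappings, and all function names are assumptions made for illustration.

```python
import re

# Hypothetical three-word stop list (fi, min, ala); a real run would use a full Arabic list.
STOP_WORDS = {"\u0641\u064a", "\u0645\u0646", "\u0639\u0644\u0649"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\S+")
# Arabic diacritics (U+064B..U+0652) and the elongation mark, tatweel (U+0640).
DIACRITIC_TATWEEL_RE = re.compile(r"[\u064B-\u0652\u0640]")
# Anything that is neither an Arabic letter nor whitespace (digits, Latin, emojis, punctuation).
NON_ARABIC_RE = re.compile(r"[^\u0621-\u064A\s]")
# Alef with madda, hamza above, or hamza below.
ALEF_VARIANTS_RE = re.compile(r"[\u0622\u0623\u0625]")
# A character repeated three or more times in a row.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def clean(text):
    """Cleaning phase: drop URLs, hashtags, non-Arabic symbols, and stop words."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    # Remove diacritics and tatweel first so they do not split words below.
    text = DIACRITIC_TATWEEL_RE.sub("", text)
    text = NON_ARABIC_RE.sub(" ", text)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def normalize(text):
    """Normalization phase: unify letter variants and collapse repeated letters."""
    text = ALEF_VARIANTS_RE.sub("\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064a")      # alef maqsura -> ya
    return REPEAT_RE.sub(r"\1", text)            # runs of 3+ letters -> one letter


def preprocess(text):
    return normalize(clean(text))
```

Collapsing only runs of three or more identical letters is a deliberate choice in this sketch, so that legitimately doubled letters are left untouched.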
A set of metrics was selected to analyze the collected tweets. Information prevalence was computed by counting the number of conversations mentioning 'كورونا' ('coronavirus') in combination with one or more of the other terms listed in Table 1 within the collected tweets. The information occurrence ratio was determined by dividing the information prevalence for each term by the total number of collected tweets. The information prevalence quality was found by examining the source of each tweet to determine its reliability. The accounts of Saudi Arabia's Ministry of Health, official TV channels, TV news channels, TV programs, official newspapers, and online newspapers were considered high-quality sources (HQS); personal accounts, personal groups, and unofficial accounts were coded as low-quality sources (LQS). Then, the information prevalence quality was calculated as the number of tweets retweeted from LQSs divided by the total number of tweets retweeted from both HQSs and LQSs. Information incidence was calculated by determining the number of conversations about each of the four defined terms combined with 'كورونا' ('coronavirus') in the collected tweets per unit of time, where the unit of time in this study is four weeks. To visualize the most frequent words in the conversations, word clouds were generated in RStudio Version 1.3.1056 [9,15].
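The three count-based metrics defined above (information prevalence, information occurrence ratio, and the share of retweets coming from LQS) can be sketched as follows. The tweet representation, a dict with a `text` field and an optional `retweeted_from` label of `"HQS"` or `"LQS"`, is a hypothetical structure chosen for illustration, as are the function names; the source does not describe its data model.

```python
from collections import Counter


def information_prevalence(tweets, anchor, terms):
    """Count tweets that mention the anchor word together with each term."""
    counts = Counter({term: 0 for term in terms})
    for tweet in tweets:
        text = tweet["text"]
        if anchor not in text:
            continue
        for term in terms:
            if term in text:
                counts[term] += 1
    return counts


def occurrence_ratio(counts, total_tweets):
    """Information prevalence of each term divided by the number of collected tweets."""
    return {term: n / total_tweets for term, n in counts.items()}


def lqs_retweet_share(tweets):
    """Retweets from LQS divided by retweets from HQS and LQS combined."""
    labels = [t["retweeted_from"] for t in tweets
              if t.get("retweeted_from") in ("HQS", "LQS")]
    if not labels:
        return None  # no retweets with a coded source
    return labels.count("LQS") / len(labels)
```

As a consistency check on the definition, the occurrence ratio applied to a prevalence of 3807 out of 6541 collected tweets gives 3807 / 6541 ≈ 58.2%.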

Google Trends Data
Google Trends is an open online tracking tool that provides real-time and archived information on internet search volumes [19]. It normalizes and scales search volume data to reflect search popularity relative to the total number of queries carried out on Google over time. The scale ranges from 0 (low) to 100 (highly popular) for a specific search term. The information prevalence indicator [19] in this study is represented by the popularity scale provided by Google Trends. To find the information prevalence in different provinces, the framework in [14] was followed. The framework provides a systematic way to extract information from Google Trends. In this study, the country was set to "Saudi Arabia", and the defaults "All categories" and "Web search" were selected. Since the Arabic term for "Coronavirus" is the official name used by the Saudi Ministry of Health, it was selected as the topic, being the most widely used term for the novel coronavirus. In addition, other popular search terms and phrases were added. The terms and phrases, along with the English name for each of them, are provided in Table 2. Note that only Arabic words were examined; no English words were included in this study. Google Trends data were collected as time series queries for the period from February to July 2020 and were retrieved in CSV format. Since there is a lag between a Google search and the confirmation of new cases, a delayed effect of 14 days was considered. The collected data were aligned, with a lag of 14 days, with official data on daily new COVID-19 cases and deaths per one million people for each province in Saudi Arabia during the period from May to July, all of which was retrieved from the Saudi Ministry of Health [27]. The data was then arranged as (date, province, demand prevalence indicator, number of new cases).
The aim is to examine the correlation between Google search queries for different Arabic cities and the daily increase in COVID-19 cases. Pearson correlation coefficients were used to examine the association between daily new COVID-19 cases and the search volume in each province using the same data. An autocorrelation function (ACF) analysis (the correlation of a parameter with itself over time) was also performed on the Google Trends data to test whether the number of cases on a specific day would impact the search volume the next day [28]. Both crude and partial autocorrelations were computed. All of the calculations were performed in a Python 3.7.0 environment. Note that for all analyses, an alpha level of 0.05 was used to determine statistical significance.
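The lag alignment, Pearson correlation, and crude autocorrelation described above can be sketched with the standard library alone. This is an illustration under assumptions rather than the authors' code: the function names are hypothetical, both series are assumed to be daily lists starting on the same date, and neither the p-value nor the partial autocorrelation is computed here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lagged_pairs(search, cases, lag=14):
    """Pair the search volume on day t with the new cases on day t + lag."""
    n = min(len(search), len(cases)) - lag
    if n <= 0:
        raise ValueError("series too short for the requested lag")
    return search[:n], cases[lag:lag + n]


def autocorrelation(series, lag):
    """Crude (sample) autocorrelation of a series with itself at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return num / denom
```

A typical call would be `pearson_r(*lagged_pairs(search_volume, new_cases, lag=14))` for one province, with `autocorrelation(search_volume, 1)` giving the next-day dependence.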

Google Mobility Data
Daily community mobility data provided by Google covers 130 countries starting from 15 February 2020 [20]. Saudi Arabia was picked as a case study to examine whether or not less mobility is associated with fewer total COVID-19 cases per one million people. The data are grouped by Google into five categories based on the most popular activities among people: work, grocery and pharmacy, parks, residential, and retail. To describe mobility, Google uses a percentage change relative to a baseline [20], as seen in Figure 3. The baseline is calculated by Google as the median value, for the same day of the week, during the five-week period between 3 January and 6 February 2020. For example, on 21 February 2020 in Saudi Arabia, compared to the baseline, grocery mobility decreased by 43%, park mobility decreased by 60%, retail and transit mobilities decreased by 100%, and residential mobility decreased by 100%. The government imposed a curfew to control the spread of the pandemic. During the curfew, most retail stores were closed and residential activities were prohibited. However, there were specific hours during the day when people were allowed to go to grocery stores, pharmacies, or parks. Note that a positive mobility score indicates increased mobility and a negative score indicates a reduction in mobility. Google collects data only from devices whose users allow their location to be used anonymously.
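The percentage change from baseline can be made concrete with a small sketch. Google publishes only the resulting percentages, so the raw daily visit counts used here are hypothetical, as are the function names. Google's documentation describes the baseline as the median for the corresponding day of the week over 3 January to 6 February 2020, which is why the sketch uses a median.

```python
from collections import defaultdict
from datetime import date
from statistics import median

BASELINE_START = date(2020, 1, 3)
BASELINE_END = date(2020, 2, 6)


def weekday_baseline(visits):
    """Median value per day of the week over the five-week baseline period.

    `visits` maps a date to a (hypothetical) raw activity count.
    """
    by_weekday = defaultdict(list)
    for day, value in visits.items():
        if BASELINE_START <= day <= BASELINE_END:
            by_weekday[day.weekday()].append(value)
    return {wd: median(values) for wd, values in by_weekday.items()}


def percent_change(visits, day):
    """Percentage change on `day` relative to the baseline for that weekday."""
    baseline = weekday_baseline(visits)[day.weekday()]
    return 100.0 * (visits[day] - baseline) / baseline
```

With a baseline of 100 visits on every Friday and 57 visits on Friday 21 February 2020, `percent_change` returns -43.0, matching the sign convention in which negative values indicate reduced mobility.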
To capture daily trends in movement patterns, the data used in this analysis covers the period from 16 February to 25 June 2020. Since the Saudi Arabian government ordered all workers to work from home throughout this period [16], the "work" category was excluded, and mobility was assessed based on the four remaining categories to find any association between new COVID-19 cases and people's mobility. These associations were investigated using Pearson correlations (if both variables entered into the correlation were normally distributed) or non-parametric Spearman rank correlations (if one or both of the variables were not normally distributed). The correlation coefficient specifies how the two variables (new COVID-19 cases and people's mobility) vary together.
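The choice between the two coefficients can be sketched as follows. The Spearman coefficient is computed here as the Pearson coefficient of the average ranks, which is its standard definition. The normality decision is passed in as a flag because the normality test itself is not implemented in this standard-library sketch; all function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def correlation(x, y, both_normal):
    """Pearson if both variables were judged normal, otherwise Spearman."""
    return pearson_r(x, y) if both_normal else spearman_rho(x, y)
```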
Overall mobility (i.e., the average mobility across all categories) from the same data set was used to conduct a multiple regression analysis to investigate how mobility in general could predict new coronavirus cases. This analysis also included total cases and information prevalence as predictors of new cases. The data was examined for the entire country and was not disaggregated by province. RStudio Version 1.3.1056 was used to conduct this analysis. Using the same data, a second multiple regression analysis was carried out to investigate whether patterns of travel to grocery and pharmacy, parks, residential, and retail areas could significantly predict new coronavirus cases (i.e., the four mobility categories were used as predictors in this regression, as well as total cases and information prevalence).
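The two regressions described above can be written out as linear models. The notation below is an assumption introduced for illustration: the symbols, the daily index t, and the ordinary-least-squares form are not given in the text, which also does not state whether lagged predictors were used.

```latex
% Model 1: overall mobility, total cases, and information prevalence as predictors
\mathrm{NewCases}_t = \beta_0 + \beta_1\,\mathrm{Mobility}_t
  + \beta_2\,\mathrm{TotalCases}_t + \beta_3\,\mathrm{InfoPrev}_t + \varepsilon_t

% Model 2: the four mobility categories replace overall mobility
\mathrm{NewCases}_t = \gamma_0 + \gamma_1\,\mathrm{Grocery}_t + \gamma_2\,\mathrm{Parks}_t
  + \gamma_3\,\mathrm{Residential}_t + \gamma_4\,\mathrm{Retail}_t
  + \gamma_5\,\mathrm{TotalCases}_t + \gamma_6\,\mathrm{InfoPrev}_t + \varepsilon_t
```

In this notation, a predictor is significant when the hypothesis that its coefficient equals zero is rejected at the stated alpha level of 0.05.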

Results

Twitter
During the study period, the total number of tweets about the most popular media terms for novel coronavirus remedies was 6541. Figure 4 shows the differences among the information prevalence, the information occurrence ratio, and the quality of the information prevalence for each of the identified terms. As seen, the majority of tweets, with an information prevalence of 3807 and an information occurrence ratio of 58.2%, were about 'ديكساميثازون' ('Dexamethasone'), 89% of which were retweeted. This was followed by 1694 (26%) tweets about the folk remedy 'القسط الهندي' ('Saussurea costus'), 64% of which were retweeted. The use of 'رمديسيفير' ('Remdesivir') comes third with 1021 (15.6%) tweets, 90% of which were retweeted. The folk remedy 'سماق' ('Sumac') was the least prevalent, with 19 tweets (0.3%), all of which were retweets. Figure 4 also shows that 100% of the 'سماق' ('Sumac') and 98% of the 'القسط الهندي' ('Saussurea costus') tweets were retweeted from LQS. These are considered high percentages, and the tweets may include added misinformation. They were followed by 'ديكساميثازون' ('Dexamethasone') and 'رمديسيفير' ('Remdesivir'), with 39% and 13% of tweets retweeted from LQS, respectively. Information spread from unofficial accounts is, by definition, not posted by HQS (e.g., WHO, Ministry of Health). It is noteworthy that information prevalence from LQS mostly concerns non-medical or folk remedies, as is evident from the high percentage of retweets from LQS for 'Sumac' and 'Saussurea costus'. Next, the conversation incidence of the defined terms over the study period was calculated. The results, as summarized in Figure 5, show how the conversation volume for each term changes over time. There was a high conversation incidence for 'ديكساميثازون' ('Dexamethasone') and 'القسط الهندي' ('Saussurea costus') during the first week of the study period, which then decreased through the rest of the study period. In contrast, the conversation incidence for 'رمديسيفير' ('Remdesivir') was low during the first week, high during the second week, and low again through the last two weeks. As for 'سماق' ('Sumac'), its conversation incidence was the lowest of the four terms from the beginning to the end of the study period.

Google Trends
Since words alone give limited insight into people's perceptions and topics of conversation, the pattern of information prevalence about COVID-19 was examined using an autocorrelation analysis to show how the prevalence of these terms changes over time. Search volume time series data from Google Trends [19] and their relation to the number of total cases were examined. The pattern of internet search volume shown in Figure 7 did not reveal a cyclic trend, as can be seen from the autocorrelation diagram in Figure 8. No seasonal trends were found (ACF = −0.2). The search volume remained constant from March to June 2020, apart from a peak in late March and early April.

COVID-19 cases and symptoms are the most searched coronavirus terms by Arabic users in the Saudi provinces. Our analysis revealed that provinces with a higher number of COVID-19 cases per 1 million people (e.g., Makkah, Riyadh, and the Eastern provinces) had lower Google search interest related to COVID-19. Figure 9 shows the information prevalence indicator represented by the popularity scale provided by Google Trends. The COVID-19-related search queries have a strong, statistically significant negative correlation with the average number of new COVID-19 cases per 1 million people in these provinces (Pearson r = −0.63, p < 0.05).

Google Mobility Data
Shapiro-Wilk tests found that all mobility data categories were normally distributed, with the exception of Parks. Since the COVID-19 virus has an incubation period of 5 to 14 days, mobility and potential exposure will have a delayed effect on newly confirmed cases. Thus, mobility data with a lag of 14 days was correlated with current new cases. Moreover, the Saudi government imposed a curfew on 23 March 2020. Thus, two separate analyses were conducted. The first covers mobility before the curfew, from 16 February to 23 March, and its impact on new cases from 1 March to 5 April. The second covers mobility after the curfew, from 24 March to 14 June, and its impact on new cases from 6 April to 25 June.

The analysis of Google mobility data before the curfew shows that the reduction of mobility in all categories (e.g., grocery & pharmacy, parks, retail and recreation, and residential) is highly and negatively correlated with the number of total cases in Saudi Arabia (see Table 3). In other words, the less the movement, the lower the number of new cases. However, after the curfew, the analysis shows a weak positive relationship between new cases and mobility in grocery & pharmacy and in retail and recreation, which indicates that the reduction in mobility has only a weak relationship with new cases. On the other hand, there is a negligible correlation between parks and residential mobility and new COVID-19 cases (Pearson r = 0.2). Notably, all these correlations are statistically significant (p < 0.05). It is also noteworthy that these correlations are all very large [29]. Figure 10 shows the scatterplots of these four correlations.
A multiple regression analysis was conducted to examine the relationship between new cases of COVID-19 and various potential predictors, namely overall mobility, total confirmed cases, and information prevalence. The results of the multiple regression indicated that the model explained 88.31% of the variance and that the model was a significant predictor of the number of new cases, F(3, 105) = 272.9, p < 0.001. While all three independent variables were significantly predictive of new cases (total cases B = 0.02, p < 0.001; mobility B = −12.62, p < 0.001; information prevalence B = −16.82, p < 0.001), the proportion of variance in new cases uniquely explained by each predictor varied substantially: total cases explained 44.32% of the variance, mobility explained 14.11% of the variance, and information prevalence explained 7.28% of the variance (22.60% of the explained variance was therefore not unique to one predictor, but shared between at least two). When this analysis was repeated with the four separate mobility dimensions as opposed to one average mobility score, the results were similar. The model explained 89.16% of the variance, and it was a significant predictor of the number of new cases, F(6, 102) = 149.0, p < 0.001 (see Table 4).
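One common way to obtain the unique variance of a predictor is to subtract the R² of the model without that predictor from the R² of the full model; whatever remains of the full R² after summing the unique parts is shared variance. The sketch below illustrates that decomposition with ordinary least squares in plain Python. It is an assumption that the study used this approach; the function names and the toy data are hypothetical. The last lines reproduce the arithmetic behind the 22.60% shared variance from the reported percentages.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(columns, y):
    """R^2 of an OLS fit of y on the given predictor columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    xty = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def unique_variance(predictors, y):
    """Return (full R^2, unique R^2 per predictor, shared R^2)."""
    full = r_squared(list(predictors.values()), y)
    unique = {}
    for name in predictors:
        others = [col for other, col in predictors.items() if other != name]
        unique[name] = full - r_squared(others, y)
    shared = full - sum(unique.values())
    return full, unique, shared


# Toy data (hypothetical, not the study's data): two correlated predictors.
toy = {
    "total_cases": [1, 2, 3, 4, 5, 6, 7, 8],
    "mobility": [2, 1, 4, 3, 6, 5, 8, 7],
}
toy_new_cases = [3, 5, 6, 9, 10, 14, 13, 17]
full_r2, unique_r2, shared_r2 = unique_variance(toy, toy_new_cases)

# Arithmetic check on the percentages reported in the text.
reported_shared = 88.31 - (44.32 + 14.11 + 7.28)  # percent of total variance
```

When predictors are uncorrelated, the unique parts add up to the full R² and the shared part is zero; the more the predictors overlap, the larger the shared part becomes.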
Total cases, information prevalence, grocery and pharmacy mobility, and parks mobility were all statistically significant predictors of the number of new cases, but retail and recreation mobility and residential mobility were not. This might be due to the impact of the curfew, since recreation and residential mobility have the highest percentage of reduction (80% and 100%, respectively) [20]. Therefore, they were not significant predictors. In contrast, the reduction in mobility in grocery and pharmacy and in parks is low (25% and 40%, respectively), which explains why these two are significant predictors. It should be noted, however, that 58.22% of the variance in the number of new cases was not unique to one predictor, likely because the four mobility variables all correlate very highly with each other (all rs ≥ 0.80). Therefore, the low percentages of unique variance explained are likely due to multicollinearity.
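To give a sense of scale for the multicollinearity noted above, the sketch below computes the variance inflation factor (VIF) for the simplest case of exactly two predictors, where VIF = 1/(1 − r²). This is an illustrative simplification, not a computation reported by the study: with four mobility variables, each VIF is based on the R² from regressing one predictor on all the others, which is at least as large as any squared pairwise correlation, so the two-predictor value is a lower bound. The function name `vif_two_predictors` is hypothetical.

```python
def vif_two_predictors(r):
    """Variance inflation factor when exactly two predictors correlate at r."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r * r)


# VIF grows quickly as the pairwise correlation approaches 1.
vif_table = {r: vif_two_predictors(r) for r in (0.80, 0.90, 0.95)}
```

At r = 0.80, the smallest pairwise correlation reported among the mobility variables, the coefficient variance is already inflated by a factor of about 2.8, which is consistent with the small unique-variance shares described above.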

Discussion
Social media and online platforms have become key distribution channels for information surrounding COVID-19. One of the main advantages of these data sources is that they can be obtained both anonymously and easily, and early in the epidemic at a low cost. Although governments and health organizations have used these platforms as communication channels to reach the public, populations can become overwhelmed with the propagation of misinformation and disinformation, as it is increasingly prevalent. Infoveillance can monitor how people react to the evolution of the pandemic over time, as well as identify common beliefs, concerns, or hopes regarding prevention, treatment, and vaccines.
COVID-19 is the first global pandemic of the digital era. Our results provide a means for understanding people's perceptions of risk and recognizing the prevalence of misinformation. They also provide insights into the public's information-seeking behaviors and the role of mobility in contributing to the number of cases. Designing an effective risk communication strategy based on digital health solutions can play a major part in the fight against COVID-19 [13]. Understanding the population's perception of and compliance with health guidelines can be utilized to develop more effective RCCE, digital health solutions, and health campaigns to improve public awareness and perception.
The retweet was the most common form of interaction. It is important to note that most of the popular folk remedies are shared from unreliable or low-quality sources. Nevertheless, it must be noted that most retweets are of information from reliable or high-quality sources. Twitter users are eager to notify their followers of any information about possible coronavirus treatments, whether they are mainstream medical treatments or folk remedies, and whether they come from reliable or unreliable sources.
The conversation volume for each remedy changes over time based on global news and trends. For example, retweets related to "Dexamethasone" spiked when it was announced as a treatment for COVID-19 [30]. This is an important finding, suggesting that people are aware of and up to date with news related to COVID-19. Thus, the use of Twitter as a source of information can cause misinformation to spread widely. These results build on existing evidence of internet search behavior and the extent of the results in [31,32].
The results of this study suggest a means through which public health officials can identify the most common forms of misinformation and indicate that the best remedy for the COVID-19 infodemic is to broadcast timely and correct information to Twitter audiences. On the other hand, people confronting the COVID-19 pandemic face unfamiliar circumstances and are anxious for any source of information. Therefore, it is important to deliver accurate and appropriate information through risk communication and digital health campaigns [33]. Identifying the top examples of misinformation in Arabic on Twitter helps government authorities and experts debunk misinformation and rumors. Specifically, Twitter analytics can enable them to identify which keywords and hashtags are more prevalent among the audience. For example, governments can effectively communicate accurate information on how to deal with the symptoms of coronavirus. These implications can be utilized to enhance government fact-checking services, risk communication, and health campaigns.
While this finding is encouraging, there are many more types of misinformation that need to be analyzed to understand the total impact of myths on the number of COVID-19 cases. It is crucial to monitor infodemics and build a system to limit the spread of misleading information and potentially harmful Arabic content [34]. Twitter and Google have updated their approaches to misleading information about COVID-19 by classifying its related content for English-speaking audiences [35]. However, it is important to consider filtering the Arabic language content as well. Therefore, there is a need for a strong system to detect rumors or misinformation posted on social media, possibly under the supervision of the government and health monitors such as the work in [21]. Systems such as the one described in [21] could help governments to set laws to prevent the spread of misinformation such as imposing penalties, fines, blocking accounts, or omitting the blogs that have misinformation related to health or health treatment [36].
The study provides new insight into the relationship between increased perception and the number of new cases. The results demonstrate how internet search volume can be used to measure people's perception of the risk of the pandemic. In line with the hypothesis, the Saudi provinces where people search less about COVID-19 tend to have more cases. These results build on existing evidence in [37]. This result emphasizes the importance of encouraging people to stay up to date about the pandemic. In addition, future digital health initiatives should focus on reaching people through different communication channels to make a greater impact on the epidemic, specifically by identifying digital communication channels and influencers with the potential to reach larger audiences.
A further finding in our analysis of public mobility and the spread of COVID-19 reveals that increased mobility leads to an increased number of cases. A significant strong negative correlation between the reduction in mobility and the number of daily new cases was found, especially within the grocery and residential categories. Specifically, a low mobility score means a greater reduction in mobility. Therefore, as people move about more, more cases arise. These results are consistent with those of the state-of-the-art methods used in [38] and [39]. This is an important finding for understanding the effectiveness of quarantining and social distancing during the pandemic. This result aligns with the findings in [40], which emphasize a high correlation between social distancing and daily new cases. In general, when enforcing new health guidelines to combat the pandemic, it is crucial to convey the reasoning behind, and aims of, such guidelines. This will help manage people's fear and increase the likelihood that they will adhere to the measures imposed on them during the crisis.
Among all COVID-19 predictors (e.g., total cases, mobility, and information prevalence), total cases is the strongest predictor, while information prevalence is the weakest. This result highlights the strongest predictor of the number of daily new cases. Our findings suggest that the transmission rate increases as the total number of cases increases, therefore emphasizing the importance of imposing more restrictive measures in places where the total number of cases is higher. Moreover, this result can be used to design customized risk mitigation measures for different provinces or cities based on mobility data and total cases [41]. This could contribute to controlling the pandemic and supporting economic activities without applying extreme restrictions in areas where they are unnecessary. Applying unnecessary precautions can delay the re-opening of the economy, which results in lower levels of economic activity and slower economic growth. Although this does not mean that information prevalence should be ignored, since it is statistically significant, total cases and mobility increase the transmission rate of the virus and thus better explain the results. In addition, among the four mobility categories, grocery and pharmacy and parks were significant predictors, since the mobility reduction in these two categories was lower than in the others. This result emphasizes the importance of curfews during the pandemic.

Study Limitations
This study of the spread of misinformation in tweets regarding COVID-19 has some limitations. Although multiple search terms can be used to collect COVID-19 related tweets such as "COVID-19", "COVID", "SARS-CoV2", "Wuhan virus", and "Chinese virus", the official Arabic name used by the Saudi Ministry of Health and the most widely used term for COVID-19 in Saudi Arabia is " " ('coronavirus'). Therefore, the latter term, together with a set of predefined search terms for COVID-19 remedies from Twitter trends, was used to collect the tweets. In addition, the study only analyzed tweets written in the Arabic language and geolocated in Saudi Arabia. Moreover, the Twitter standard search API does not allow for collecting tweets older than one week. Thus, tweets posted before 13 June 2020 could not be collected. Furthermore, the study could not collect tweets from private accounts. Thus, the results only represent organizations and people who use Twitter for trading news and information. Therefore, the results may not reflect the real number of tweets discussed by all kinds of users, which may include misinformation related to COVID-19. All these points may restrict the generalizability of the results of this study. Locally distributed information and socioeconomic status for different cities are important factors to be considered in the analysis. However, the Saudi Ministry of Health is the main source of information nationwide. Thus, there is no information dissemination from local medical resources in different cities or hospitals other than what the Ministry of Health distributes. This is a protective measure by the Saudi government to guarantee the correctness of the disseminated information. Moreover, there is no local data available about socioeconomic status for different cities. Therefore, the data used for the regression analysis are at a national level and do not consider socioeconomic status as a predictor.
It should also be noted that many of the results presented in this study are correlational and, as such, do not imply any directional associations.

Conclusions and Future Work
In this study, we used three different platforms to conduct an infodemiology and infoveillance survey of COVID-19 in Saudi Arabia. First, using Twitter, the study showed the prevalence of misinformation and folk remedies that might hinder people from following medical information from trusted sources. Second, using Google Trends, we investigated the relationship between information prevalence in each province in Saudi Arabia and the number of daily new cases. The results showed that there is a strong negative relationship, which indicates that literacy among people is an important factor in controlling the pandemic. Third, we used Google mobility data to investigate the impact of mobility on the number of daily new cases. The results showed that a reduction of mobility can decrease the number of daily new cases. Finally, the study examined the relationship between new cases of COVID-19 and various potential predictors, namely overall mobility, total confirmed cases, and information prevalence. The analysis showed that the number of total confirmed cases is the most significant predictor.
Governments, policymakers, and healthcare providers can use these results to design effective programs, awareness messages, and community campaigns to increase the perception and knowledge about COVID-19. Moreover, governments can apply customized restrictive measures for different places based on the total cases and mobility.
It is important to emphasize that this study focuses on the spread of misinformation regarding COVID-19. One direction of future work will focus on the opposite side of this study, namely the information prevalence on Twitter, to analyze public awareness and perception of COVID-19 in LQS versus HQS. Moreover, establishing directional associations between COVID-19 information distribution, mobility, and deaths would allow future research to investigate causal associations.