Forecasting and Surveillance of COVID-19 Spread Using Google Trends: Literature Review

The probability of future Coronavirus Disease (COVID)-19 waves remains high; thus, COVID-19 surveillance and forecasting remain important. Online search engines harvest vast amounts of data from the general population in real time and make these data publicly accessible via tools such as Google Trends (GT). Therefore, the aim of this study was to review the literature about the possible use of GT for COVID-19 surveillance and prediction of its outbreaks. We collected and reviewed articles about the possible use of GT for COVID-19 surveillance published in the first 2 years of the pandemic. The search resulted in 54 publications that were used in this review. The majority of the studies (83.3%) included in this review showed positive results of the possible use of GT for forecasting COVID-19 outbreaks. Most of the studies were performed in English-speaking countries (61.1%). The most frequently used keyword was “coronavirus” (53.7%), followed by “COVID-19” (31.5%) and “COVID” (20.4%). Many authors performed analyses in multiple countries (46.3%) and obtained the same results for the majority of them, thus showing the robustness of the chosen methods. Various methods, including long short-term memory (3.7%), random forest regression (3.7%), the AdaBoost algorithm (1.9%), autoregressive integrated moving average and neural network autoregression (1.9%), and vector error correction modeling (1.9%), were used for the analysis. Most of the publications with positive results (72.2%) used data from the first wave of the COVID-19 pandemic. Later, search volumes declined even though the incidence peaked. In most countries, the use of GT data proved beneficial for forecasting and surveillance of COVID-19 spread.


Introduction
Coronavirus Disease 2019 (COVID-19), caused by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is an infectious disease with high virulence and a high proportion of asymptomatic cases. Together with other factors, such as a long period from infection to the onset of symptoms, the similarity of the symptoms to those of a regular cold, and continuous social interactions, this led to a worldwide virus outbreak [1][2][3].
Early detection of COVID-19 outbreaks is crucial for multiple reasons: (i) to prepare hospitals and staff, including efficiently allocating protective gear and medical equipment [4], as well as setting up testing tents and IT infrastructure (electronic health information systems for patient registration and databases); (ii) to prepare governments for actions, such as imposing curfews, ordering equipment, and drawing up guidelines for businesses and events; (iii) to improve public messaging and warn people about the risks and their prevention; (iv) to prevent further spread of infection [4] by imposing local quarantine or taking other preventive measures. The probability of future COVID-19 waves remains high [5]; thus, COVID-19 surveillance and forecasting remain important.
Online search engines harvest vast amounts of data from the general population in real time. Importantly, many of them, including the most popular one, Google, make these data publicly accessible. This raises interest in using such data for surveillance and forecasting of disease outbreaks [6]. Among internet-based tools for analyzing search queries, the most acclaimed one is Google Trends (GT) [7,8]. As stated by other researchers, GT can be employed to address public health issues, as it provides valuable information about current concerns and health-related problems in general society, especially in the field of infectious diseases [7], and therefore could be used for prediction of upcoming disease waves.
GT has been used as a prediction tool for many different diseases in the past two decades, including influenza [9], Zika virus disease [10], Middle East Respiratory Syndrome (MERS) [11], and malaria [12]. These studies provided diverging results, which makes it difficult to draw generalized conclusions about the possibility of using GT for prediction and surveillance of infectious diseases. When it comes to COVID-19, it is important to assess GT's ability to detect changes in the numbers of people who possibly do not take COVID-19 tests but nonetheless feel symptoms, or who suspect that they had contact with an infected person and can infect others. This could be used for prediction of COVID-19 outbreaks. Therefore, the aim of this study was to review the literature about the possible use of GT for COVID-19 surveillance and prediction of its outbreaks.

Materials and Methods
This literature review included articles published within 2 years from the beginning of the pandemic until February 2022. The PubMed search engine was used to search for scientific publications.

Inclusion and Exclusion Criteria
The search phrases used for the search query were "Google Trends" AND "COVID-19". For the initial search of publications, we did not use any time, language, publication type, or other filters. The initial search yielded 301 results. All publications were reviewed against the following inclusion criteria: (i) primary original articles addressing the usage of GT for COVID-19 prediction and/or surveillance; (ii) articles available in full text through our institutional network. The exclusion criteria were: (i) publications that had only part of the search phrase in the title or the snippet of the abstract, which suggested that the publication was not about the usage of the GT tool; (ii) publications of the type review, letter, comment, correspondence, or presentation; (iii) publications written in any language other than English; (iv) publications analyzing data obtained from sources other than GT (e.g., WikiTrends, Twitter, etc.).
First, with respect to the inclusion and exclusion criteria, the publication titles were screened to determine whether each publication could fit the scope of this review, which ruled out 202 of the articles found during the initial search. As a second step, the abstracts of the remaining 99 publications were screened to verify their relevance, ruling out 33 publications. The full text was downloaded only if the abstract showed that the publication might be relevant to this review. Full texts of 66 articles were then analyzed to include only those articles which provided the results of an assessment of GT forecasting possibilities for COVID-19. In addition, the reference lists of the included publications were reviewed according to the same criteria to identify publications not uncovered by the initial search. After completing all these steps and removing duplicates, we concluded with 54 articles meeting all the criteria (Figure 1).
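The screening counts reported above can be checked with simple arithmetic. The following is a minimal sketch; the variable names are illustrative and not taken from the review. It only covers the title and abstract screening steps, since the text does not give the individual numbers for the full-text assessment and reference-list additions.

```python
# Counts reported in the text for each screening stage.
initial_hits = 301          # PubMed results for "Google Trends" AND "COVID-19"
excluded_by_title = 202     # ruled out during title screening
excluded_by_abstract = 33   # ruled out during abstract screening

# Publications remaining after each step.
abstracts_screened = initial_hits - excluded_by_title
full_texts_assessed = abstracts_screened - excluded_by_abstract

print(f"Abstracts screened: {abstracts_screened}")
print(f"Full texts assessed: {full_texts_assessed}")
```

The two derived values agree with the 99 abstracts and 66 full texts stated in the text.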
From each included publication, we extracted such data as year of publication, short description of the main findings, country where GT data were collected, keywords used by people in that country, period of data collection for GT analysis, and the statistical analysis method(s) used to analyze the data.

ES; 02 2020-05 2020; keywords: cansancio (fatigue); coronavirus, COVID 19, covid 19, and COVID19; diarrea (diarrhea); dolor de garganta (sore throat); fiebre (fever); neumonia (pneumonia; searched without an accent due to being more relevant); perdida de olfato (lost sense of smell; also searched without an accent); tos (cough)
Lippi, Mattiuzzi, Cervellin (2020) [40]: Significant correlations found between GT search data and newly diagnosed COVID-19 cases with a 3-week lag.
US; 10 2019-05 2020; keywords: diarrhea, nausea, vomiting, and abdominal pain. The terms fever and cough were included as positive controls. The term constipation was included as a negative control.
Xie, Tan, Li (2020) [58]: Monitoring internet search activity could prevent and control the epidemic and rumors around it.
CN; 01 2020-02 2020; keyword: Coronavirus
[47]: The dynamics of the correlations found between GT data and COVID-19 cases and deaths suggest that it would be possible to make predictions of COVID-19 cases and mortality rates up to 3 weeks in advance.

Time Periods
GT seemed to have a higher prediction capability during the first wave of the COVID-19 pandemic; most of the studies (72.2%) took GT data from 01 2020 to 05 2020. The majority of studies reviewed in this article used GT data obtained in 2020 (some starting in December 2019), with only four extending their GT data collection to previous years [23,30,52,64] for comparison.

More Complex Analysis Methods of GT Data
Some publications used more complex methods for statistical analysis of GT data (Table 2). Long short-term memory [20,53] (3.7%), random forest regression [22,25] (3.7%), the AdaBoost algorithm [21] (1.9%), autoregressive integrated moving average (ARIMA), error, trend and seasonality (ETS), and neural network autoregression (NNA) [23] (1.9%), and vector error correction modeling [50] (1.9%) were described as methods of analysis. The findings of those studies showed that GT significantly improved the predictive capability of the methods used in the analysis and could be used in the future with even higher predictability as more data become available [25,53].
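The percentages above can be related to numbers of publications. The sketch below assumes the denominator is the 54 publications stated in the abstract; the implied counts are inferred from the percentages rather than reported by the authors, and the dictionary labels and the helper `implied_count` are illustrative.

```python
TOTAL_PUBLICATIONS = 54  # assumption: total stated in the abstract

# Shares (in percent) as reported in the text for each method group.
reported_shares = {
    "long short-term memory [20,53]": 3.7,
    "random forest regression [22,25]": 3.7,
    "AdaBoost [21]": 1.9,
    "ARIMA and related time-series models [23]": 1.9,
    "vector error correction modeling [50]": 1.9,
}

def implied_count(percent, total=TOTAL_PUBLICATIONS):
    """Return the whole number of publications closest to `percent` of `total`."""
    return round(percent / 100 * total)

implied_counts = {name: implied_count(p) for name, p in reported_shares.items()}

for name, count in implied_counts.items():
    # Recompute the share from the inferred count as a consistency check.
    share = round(100 * count / TOTAL_PUBLICATIONS, 1)
    print(f"{name}: {count} of {TOTAL_PUBLICATIONS} ({share}%)")
```

Under this assumption, each inferred count matches the number of references cited for the corresponding method.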

Negative Results of GT Use for COVID-19 Prediction and Surveillance
Nine publications (Table 3) showed negative results of GT use in COVID-19 surveillance and/or prediction. Most of them [26,28,31,57,64] stated that the correlations between GT search queries and COVID-19 cases in those countries were present because of media coverage [31,57,64] or announcements by governments and/or WHO [26,28]. A high variation in correlations between COVID-19 incidence and internet searches was identified as well [27,29], showing that GT data are not a reliable source for COVID-19 prediction and surveillance.

Table 3. Publications with negative results of GT use for COVID-19 prediction and surveillance.

Author and Year | The Main Findings about Google Trends | Country | Period | Keywords
Szmuda, Ali, Hetzger, Rosvall, Słoniewski (2020) [26] | GT data did not correlate with COVID-19 incidence and mortality; however, they had a strong correlation with international WHO announcements. | 40 European countries. Appendix F. | 12 2019-04 2020 | Coronavirus
Asseo, Fierro, Slavutsky, Frasnelli, Niv (2020) [27] | The correlation between internet searches for symptoms and new COVID-19 cases varied significantly over time. High fluctuations show that relying only on GT data to monitor the spread of COVID-19 is not a viable strategy. | IT, US | 03 2020-04 2020 | taste loss, smell loss, sight loss (control), hearing loss (control), COVID symptoms (and the same in Italian)
Muselli, Cofini, Desideri, Necozione (2021) [28] | The volume of Google searches did not reflect the actual epidemiological situation. Official communications and government activity were seen to have more impact on public interest in the disease. | | |
Rovetta (2021) [29] | A large number of anomalies seen in multiple cities' relative search volumes (RSVs) made these data unusable for statistical inference. Furthermore, correlations varied greatly depending on the day RSVs were collected. | | |

Differences between Countries
One possible reason why more studies were performed in high-income countries than in low-income ones could be the lack of IT infrastructure: only 50% of individuals in low- and middle-income countries use the internet [65], as opposed to almost 90% in high-income countries [66], where people can therefore search for information easily. For example, even though India ranks second in the world by number of internet users, only 36% of its population uses the internet monthly [67], as opposed to over 90% in the USA [68,69] or 92% of households in Europe [70].

Time Periods
It was seen that most of the publications with positive results used data from the first wave of the COVID-19 pandemic. Later, the search volumes declined [33] even though the incidence peaked. This could be explained by people's initial fear and lack of knowledge about the disease: symptoms, as well as protection measures, were searched more during the first wave. Later, such information became more widely known; not only did people learn while searching themselves, but there were also plenty of announcements from governments as well as WHO. Naturally, people lost interest in following such news [71], in addition to getting "tired" of lockdowns.
The strong decline in public interest in COVID-19-related issues might pose a major public health challenge: distributing relevant information about the newest developments in disease treatment and prevention measures throughout the whole pandemic [33].

Risk Communication
Four publications [33,38,72,73] identified during the PubMed database search were not about prediction or surveillance of COVID-19 using GT data, but rather about public interest in the pandemic and risk communication during the outbreaks. Those studies showed an increased number of search queries after the first case announcement [33,73] and after events such as local COVID-19 transmission, approval and implementation of testing, social-distancing campaigns, face mask shortages, and announcements by WHO [72].
When people's interest peaks, it would be sensible to spread scientific information, promote preventative measures, and counter misinformation in that exact time period. It would be beneficial to target social media, where misinformation spreads the fastest and people feel properly informed while reading non-expert opinions and statements. In addition, a decline in interest should be met with informational campaigns to ensure that proper information spreads [38], as well as with guidance on where to search for information and how to distinguish facts discovered by scientists from non-expert opinions.

Language
Our study reviews publications from many different countries, which results in different search terms. Several studies [38] indicated the importance of 'related query analysis' prior to further analysis, since it can point out the most relevant search terms.
Furthermore, there were many multi-country studies where the search terms were translated, thus potentially resulting in lost nuances of meaning as well as some overlap [13].

Complex Analysis Methods of GT Data
Several studies (Table 2) incorporated GT data in their machine learning algorithms. The results of these studies show that such methods were able to successfully predict an increase in COVID-19 cases in a large number of countries 7 days in advance [22,25]. Furthermore, data on previous COVID-19 incidence and GT data were combined, which improved the performance of the prediction models compared to previous ones that used incidence data alone [23].
When conventional metrics (numbers of cases and deaths) were combined with interest-over-time values, the prediction ability of the models increased further [23]. Rabiolo et al. identified two principal components, which made it possible to reduce data dimensionality and summarize the information into two components, thus providing a flexible approach in which the variables of interest can change and the same models can be used to investigate different research questions in the future [23]. Moreover, an additional advantage is that the performance of these models can be further improved as more data become available over time and can reflect the current situation [25,53]. In addition, the models could have uses other than predicting COVID-19 numbers, e.g., assessing people's awareness and engagement, thus allowing health authorities to use these data for measuring the effectiveness of the information spread [53], which is crucial especially when information fatigue is present [69,72].
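Lead-time findings such as the 7-day horizon mentioned above rest on comparing search interest with case counts shifted in time. The following is a minimal sketch of a lagged Pearson correlation using only synthetic numbers; the helpers `lagged_pearson` and `best_lag` are hypothetical and do not reproduce the method of any reviewed study, which relied on more elaborate models.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def lagged_pearson(searches, cases, lag):
    """Correlate searches on day t with cases on day t + lag (lag >= 0)."""
    if lag == 0:
        return pearson(searches, cases)
    return pearson(searches[:-lag], cases[lag:])

def best_lag(searches, cases, max_lag):
    """Return the lag in 0..max_lag with the highest correlation."""
    return max(range(max_lag + 1),
               key=lambda k: lagged_pearson(searches, cases, k))

# Synthetic illustration (not real data): cases repeat the search curve
# 7 days later, scaled by a factor of 3.
searches = [0, 2, 5, 9, 14, 20, 27, 35, 44, 54, 60, 58, 50, 40,
            30, 22, 15, 10, 6, 3, 1, 0, 0, 0, 0, 0, 0, 0]
cases = [0] * 7 + [3 * s for s in searches[:-7]]

print(best_lag(searches, cases, max_lag=14))
```

In practice, a high lagged correlation alone does not establish predictive value, since media coverage can drive both series, as the studies with negative results point out.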

Negative Findings
A few studies showed that GT data could not be used for COVID-19 prediction and/or surveillance. According to the authors of those studies, WHO and/or local government announcements had a major influence on search trends [26,57], and GT is more efficient in tracking a new disease outbreak when media coverage of that disease is absent [26]. However, this was not possible to test during this pandemic, since WHO, as well as governments and officials, started communicating about the novel coronavirus even before WHO declared it a pandemic. Furthermore, the authors suggest that online searches simply coincided with the increase in COVID-19 cases and related deaths, since major media announcements are made at the same time as the incidence increases [26], or that the searches were a result of information-seeking curiosity [57].

Strengths
Many studies performed analyses in multiple countries and obtained the same results for the majority of them [8,13,37,38,61,62], thus showing the robustness of the chosen methods. Furthermore, Google search data are easier to obtain, more dynamic, and more readily available than traditional data sources, such as data from governmental institutions and health authorities; they also represent the current mood of the population and can be obtained for multiple periods [53]. As the relevant search terms can change over time, it is possible to investigate GT data repeatedly and incorporate the new terms and newly available data into the prediction models, thus improving the outcome. Even more improvement in prediction can be reached when search terms with higher correlation values are used for the analysis [20].

Limitations of the Possible Use of GT
One of the main limitations noted by the authors of the studies analyzed was the short timeframe taken for the analysis [25,38]. The positive results obtained from the first COVID-19 wave could have been due to the virus being new and interesting to society, including mass media. Possibly, these factors resulted in an increase in searches using Google and other search engines. Furthermore, such methods must account for misspellings and possible alternative search terms [38], as well as the fact that Google might not be the main search engine for different groups of people [23,25,26,38]. One more disadvantage lies in the data (incidence and death rate) against which the data obtained from Google are compared. Different countries have different testing policies, as well as death reporting practices, thus making it impossible to have a standardized number [26,38]. Moreover, COVID-19 reports in other countries and media coverage around the world, as well as people's curiosity, might have influenced the increase in searches [13,25]. It was not possible to take into account many of the social and demographic factors (gender, age, education level, literacy) of the searchers [13,26]. One could speculate that older people are not represented in the search volumes, even though they are one of the most vulnerable groups for COVID-19. They, together with children and people living in areas with poor internet connection, cannot be studied with this strategy, i.e., using GT data to make predictions, thus making it impractical for countries with large rural areas [23]. Similarly, the similarity of symptoms and prevention methods between COVID-19 and influenza might make it impossible to differentiate between the two [25,57], potentially inflating search volumes and influencing the predictions.
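The point about misspellings and alternative search terms can be illustrated with spelling variants of the kind that appear among the reviewed keywords ("COVID 19", "covid 19", "COVID19", and terms searched with or without an accent). The sketch below groups such variants under one normalized form; `normalize_term` and `group_variants` are hypothetical helpers, and this is not how GT itself treats queries.

```python
import unicodedata
from collections import defaultdict

def normalize_term(term):
    """Lowercase a term, strip accents, and drop spaces and punctuation."""
    decomposed = unicodedata.normalize("NFKD", term)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return "".join(ch for ch in without_accents.lower() if ch.isalnum())

def group_variants(terms):
    """Map each normalized form to the list of original spellings."""
    groups = defaultdict(list)
    for term in terms:
        groups[normalize_term(term)].append(term)
    return dict(groups)

# Illustrative list of spelling variants.
terms = ["COVID 19", "covid 19", "COVID19", "COVID-19", "neumonía", "neumonia"]
groups = group_variants(terms)

for normalized, variants in groups.items():
    print(normalized, variants)
```

A study could then decide whether to query each variant separately or to report the variants as a single group; the choice affects the relative search volumes and should be stated explicitly.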

Limitations of the Review
The limitations of this review include potentially missing results from relevant publications written in any language other than English. Furthermore, focusing only on Google Trends may have excluded other internet-based tools useful for COVID-19 prediction and surveillance. Similarly, we included only those articles that were accessible through our institutional network, which could have excluded some relevant studies from this review. In addition, although we used the name of the tool analyzed in this review (Google Trends) and the name of the disease (COVID-19) as keywords for the publication search, we could have missed some publications. Adding more keywords to the search query could possibly help find more publications, and this should be addressed in future reviews.

Conclusions
The majority of the studies analyzed in this paper reported positive findings regarding prediction and surveillance of COVID-19 cases using data obtained from Google Trends. Incorporating GT data into various COVID-19 forecasting algorithms could increase their prediction capabilities. Further analyses using data obtained during later time periods are needed to evaluate the forecasting capabilities of GT once mass media attention subsides.

Conflicts of Interest:
The authors declare no conflict of interest.