Infodemiologists Beware: Recent Changes to the Google Health Trends API Result in Incomparable Data as of 1 January 2022

In an increasingly online world, many Internet users seek information from online search engines such as Google. Accessing such search activity allows infodemiologists a glimpse into the collective online mind. Tools such as Google Trends and Google Health Trends (GHT) can be used to gauge search activity in key geographical regions and for specific periods of time. Recently, Google implemented changes to the GHT platform. This study provides an initial exploration of how this change affected the data obtained from GHT. When 177 weekly probabilities for short search sessions of 421 Freebase IDs in thirty geographies, extracted from GHT both before and after the implemented change, were compared, a low correlation (median of all Spearman ρ = 0.262 [IQR 0.04; 0.53]) between these data was observed for the year 2022. In general, the values extracted after the implemented change are higher than those extracted before it. Future research using the GHT API should not interpret increases in GHT data from 1 January 2022 onward as reflecting increased search activity for a specific keyword, but should rather attribute them to the implemented change to the GHT sampling strategy.


Introduction
Global access to the World Wide Web has increased remarkably, with just over 63% of the global population accessing the Internet in 2022 [1]. Although access to the search giant Google is limited in some regions (such as China and North Korea), it dominates search activity in most other territories (Table 1). Knowing what the world is searching for online gives researchers the opportunity to identify and respond to these trends in a timely manner. As such, gaining access to search activity has long been regarded as a holy grail for researchers, with different tools used in the assessment of such patterns.
Specific to Google's search engine, the unrestricted Google Trends platform (https://trends.google.com/, accessed on 22 September 2022) is open for all to explore how specific demographics searched for certain keywords. Google Trends returns a relative search volume (RSV), a metric ranging from 0 to 100 and based on the proportional popularity of the keyword in a specific geographic region for the selected period. Although this platform gives the user an indication of the dates or times at which a specific phrase was searched most frequently, it does not allow users to compare results from different periods [2]. For example, searching the same keyword for different time periods within the same geographical boundary yields different results (Figure 1). Since this is a scaled metric, based on the number of searches in the geographical region selected for the searched keyword, comparisons between regions are not possible with Google Trends data [3]. The extraction of Google Trends data can be automated, to some extent, using unofficial application programming interfaces (APIs), such as pytrends (v4.8.0) for Python [4]. For those interested in comparing search activity in different regions or time periods, Google offers limited access to the Google Health Trends (GHT) API. Access can be requested from https://bit.ly/3xpYFJo. GHT has been used to explore the search behavior of African Internet users related to the COVID-19 pandemic, as a prediction tool for dengue fever in Brazil, and to gauge interest in pre-exposure prophylaxis in the United States of America [5][6][7], with mixed reports on the effectiveness of this tool in infodemiology and epidemiology.
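To illustrate why RSV values from different periods cannot be compared directly, the following minimal sketch assumes a simplified model in which every value is scaled to the maximum within the requested window. The function name and the weekly values are hypothetical, and the exact normalization applied by Google Trends may differ from this assumption.

```python
def relative_search_volume(proportions):
    """Scale values to 0-100 relative to the largest value in the window.

    Assumption: a simplified stand-in for the RSV scaling, not Google's
    documented procedure.
    """
    peak = max(proportions)
    if peak == 0:
        return [0 for _ in proportions]
    return [round(100 * p / peak) for p in proportions]


# Hypothetical weekly searches for one keyword per 10,000 searches in a region.
weekly = [2, 4, 8, 4, 2]

full_window = relative_search_volume(weekly)       # all five weeks
short_window = relative_search_volume(weekly[:2])  # first two weeks only

print(full_window)
print(short_window)
```

Under this assumption, the first two weeks score 25 and 50 in the longer window but 50 and 100 in the shorter one, although the underlying search proportions are identical.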
Recently, Google announced via email that the GHT API "will be improved by providing higher precision responses by using a more comprehensive sample of search requests" (Supplementary Email S1). These changes were implemented on 18 July 2022, with all data from 1 January 2022 onward being altered to include this new comprehensive search request sample. Google also indicated that any changes in search interest dating from 1 January 2022 might be attributable to this change.
Such changes impact ongoing research, especially when future research efforts seek to compare periods before and after the implementation of such a change. The GHT documentation has not yet been updated to indicate that such a change was made, which risks erroneous conclusions being drawn in the future. Here, I present an investigation into whether this change implemented by Google indeed had an impact on the GHT data retrieved and provide the first evidence that future research using the GHT platform should refrain from comparing data obtained from 1 January 2022 onwards to dates before 2022.

Data Extraction from the Google Trends API
The use of Freebase IDs, or in the absence of a Freebase ID, the corresponding Google Knowledge Graph Identifiers (GKGIs), allows searching for specific terms regardless of the searcher's input language, since Google aggregates search values based on these identifiers. For example, searches conducted for 'watre' (sic), '水', 'l'eau', 'जलम्', 'metsi' or 'amanzi' would be categorized as a search for '/m/0838f', corresponding to the English word 'water'. Freebase IDs or GKGIs were identified using the Google Knowledge Graph Search API and used as search terms on the GHT API, according to the recommendation by Google. Using Freebase IDs, therefore, allows for comparable search data across linguistically different searches.
The presented study was based on two different datasets extracted from the GHT API. First, the probabilities of short search sessions of 421 Freebase IDs (Supplementary Table S1) were extracted for 30 countries (Table 1) before the recent update to the Google Trends sampling strategy. These extractions were carried out between 9 and 12 June 2022 for a different research project, the author having no prior knowledge of the pending change in the GHT random sampling strategy. A second extraction was performed on 22 July 2022, after the changes were made to the GHT API. Weekly probabilities for short search sessions were extracted for the period from 6 January 2019 to 22 May 2022, resulting in 177 weeks' worth of data extracted for each of the searched terms in all countries. The extractions were carried out using a Python script as per Google's guidelines [8]. The only modification was that the process was automated by including for loops to conduct the extractions for different countries.
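The looping structure described above can be sketched as follows. This is a minimal illustration and not the script used in the study: `fetch` is a hypothetical stand-in for the actual GHT API request, `fake_fetch` returns placeholder zeros instead of contacting any service, and the two country codes are only illustrative.

```python
from datetime import date, timedelta


def weekly_dates(start, end):
    """Return every date from start to end (inclusive) in 7-day steps."""
    n_steps = (end - start).days // 7
    return [start + timedelta(days=7 * i) for i in range(n_steps + 1)]


def extract_all(fetch, terms, countries, start, end):
    """Call fetch(term, country, start, end) for every country/term pair.

    `fetch` is a hypothetical callable standing in for the GHT API request;
    it is expected to return one value per week.
    """
    results = {}
    for country in countries:
        for term in terms:
            results[(term, country)] = fetch(term, country, start, end)
    return results


START, END = date(2019, 1, 6), date(2022, 5, 22)
weeks = weekly_dates(START, END)


def fake_fetch(term, country, start, end):
    # Placeholder: returns zeros rather than querying any API.
    return [0.0] * len(weekly_dates(start, end))


demo = extract_all(fake_fetch, ["/m/0838f"], ["IN", "CN"], START, END)
print(len(weeks), len(demo))
```

Stepping in seven-day increments from 6 January 2019 to 22 May 2022 yields the 177 weekly points reported for each series.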

Statistical Analyses
Statistical analyses were performed in R (R Core Team, v4.2.0, 2022), using RStudio Integrated Development for R. The raw data extracted were plotted as two separate time series, applying locally estimated scatterplot smoothing (LOESS) to visually identify potential trends. Spearman correlation coefficients between the data obtained from the two extractions were calculated and summarized. Thereafter, a new time series was constructed by calculating the difference between the values retrieved via the Google Trends API before and after the updates that occurred on 18 July 2022. These time series of differences were also plotted. Anomalies (data points that fall outside the normal fluctuation range of a time series) in the different time series were detected using the AnomalyDetection package for R [8], and the anomaly time series were plotted using the internal plotting functions of R, as well as ggplot2 [9].
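The analyses above were performed in R; the following Python sketch only illustrates the logic of the three steps on hypothetical values. The difference is assumed to be taken as "after minus before", which the text does not specify, and the anomaly rule is a simple median absolute deviation threshold swapped in for illustration. It is not the algorithm implemented in the AnomalyDetection package.

```python
from statistics import mean, median


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        return float("nan")  # undefined for a constant series
    return cov / (sx * sy)


def difference_series(before, after):
    """Assumed direction: value after the change minus value before it."""
    return [a - b for b, a in zip(before, after)]


def flag_anomalies(series, threshold=3.5):
    """Flag points far from the median (illustrative MAD rule only)."""
    med = median(series)
    mad = median([abs(v - med) for v in series])
    if mad == 0:
        return [False] * len(series)
    return [abs(v - med) / (1.4826 * mad) > threshold for v in series]


# Hypothetical weekly values for one term in one country.
before = [10, 12, 11, 13, 12, 11, 12, 13]
after = [11, 12, 10, 14, 12, 12, 24, 27]

rho = spearman_rho(before, after)
diff = difference_series(before, after)
flags = flag_anomalies(diff)
print(round(rho, 3), diff, flags)
```

In this toy series only the last two weeks, where the second extraction is markedly higher, are flagged.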

Results
In total, 12,630 time series were extracted both before and after the implemented change to the Google Trends API, plotted with the application of LOESS and visually inspected for potential trends. The difference was then calculated and plotted for the extracted data. These figures are made available publicly at: https://doi.org/10.25415/ujhb.20424693.v1.
Since 177 data points (corresponding to weekly search activity) were extracted from the Google Trends API for each time series, a total of 2,235,510 data points were included in this study, of which ~7.42% (165,953) were identified as anomalies using the AnomalyDetection package for R. Plots of data points identified as anomalies in the difference plots are made publicly available here: https://doi.org/10.25415/ujhb.20430924.v1. Anomalies in a constructed difference time series occur due to Google's daily updates of the uniformly distributed random sample of searches from which the data are extracted. As such, some variance is expected, as was the case in anomalies detected for 2019-2021 (Table 2). However, most (79.40%) of the anomalies detected in the collected data originated in 2022. The median values of these anomalies between the two extractions were similar for 2019-2021, while the median for 2022 was double that of previous years.
Within the 30 countries included in this investigation, all returned an increased number of anomalies in the 2022 data, ranging between 46.09% (China) and 96.18% (India) of anomalies in these time series (Table 3).
Table 3. Anomalies observed in the time series comparing the difference in Google Trends API data collected before and after the implemented changes to the data set.
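The totals reported in this section follow directly from the study design, as the short check below shows. All four input values are taken from the text; only the variable names are introduced here.

```python
TERMS = 421          # Freebase IDs searched
COUNTRIES = 30       # countries included
WEEKS = 177          # weekly data points per time series
ANOMALIES = 165_953  # data points flagged as anomalies

series = TERMS * COUNTRIES
points = series * WEEKS
anomaly_share = 100 * ANOMALIES / points

print(series)                   # number of time series
print(points)                   # number of data points
print(round(anomaly_share, 2))  # percentage flagged as anomalies
```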

Discussion
The Google Trends API gives researchers the ability to access search trends from most countries around the world. Little is known regarding the sampling strategy that Google implements to construct the GHT database, apart from the statement in the GHT API Getting Started Guide: "Numbers are calculated on a uniformly distributed random sample of Google web searches done since 2004, updated once a day, thus there may be some variance between similar requests" [10].
As such, fluctuations in data retrieved on different extraction days are expected. Although such variance can affect data for a specific search term on a specific day, general trends in time series have a high correlation between data extracted on different days. From the two data sets extracted before and after the changes were made to the Google sampling strategy, a high degree of correlation was observed for the data extracted for 2019-2021 (Figure 3). This is in line with the notification received on the changes made to the sampling strategy. In its email, Google indicated that the changes to the sampling strategy would only affect data from 1 January 2022 onward (Supplementary Email S1).
These changes in the sampling strategy resulted in a greater range of correlation values between older and newer data sets for the year 2022 to date (Figure 3A,E), as well as a lower median correlation value. The low similarity between the data extracted before and after the change in sampling strategy reflects the implemented change to the sample from which the Google Trends data are drawn. By detecting anomalies in the difference between these two time series, it was possible to show that the changes implemented to the GHT sampling strategy mostly increased the returned values (Table 1), with the median value of these unexpected differences in 2022 being double the value of previous years. In the 30 countries investigated, the majority of unexpected data points from the differenced time series occurred in 2022 (Table 2). On visual inspection of the plotted time series, most search terms showed an increasing trend during the first months of 2022.
Since this newly implemented change to the sampling strategy results in predominantly higher search volume being returned, data extracted prior to 18 July 2022 can no longer be compared to data extracted after this date for the year 2022. However, the high level of correlation for previous years is indicative that, in most cases, comparative studies focused on dates prior to 1 January 2022 could still be accurate considering the minor variance introduced by Google's daily updates to the sample data set. As mentioned elsewhere [11], caution should be exercised in the interpretation of single extractions of GHT API data, which may be falsely interpreted as changes in search trends. Therefore, it is advised that the extractions of the GHT API data be repeated on different dates and analyzed accordingly.
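One way to analyze repeated extractions is to summarize each week across the extraction dates. The sketch below is only one possible approach and was not used in this study; the function name and the values are hypothetical.

```python
from statistics import median


def summarize_extractions(extractions):
    """Per-week median and range across repeated extractions of one series.

    `extractions` holds one list of weekly values per extraction date,
    all covering the same weeks.
    """
    summary = []
    for week_values in zip(*extractions):
        summary.append({
            "median": median(week_values),
            "spread": max(week_values) - min(week_values),
        })
    return summary


# Hypothetical values: three extraction dates, three weeks each.
runs = [
    [10, 12, 11],
    [11, 12, 13],
    [10, 13, 12],
]

summary = summarize_extractions(runs)
print(summary)
```

A week with a large spread across extraction dates points to sampling variance rather than a genuine change in search activity.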
The presented study was not without limitations. Owing to the short timeframe between the announcement that the GHT sampling strategy would be changing and the date of implementation of these changes, only data from a single extraction prior to the implemented change could be analyzed. It is also uncertain which increases were due to the changes made to GHT and which were attributable to chance variation in the GHT sample dataset on the day of extraction. Although this limits the quantification of the changes made to the sampling algorithm, the results indicate that the changes impacted the data obtained from the service, that search probability increased for most search terms after 1 January 2022, and that comparative studies using data extracted after the implemented changes should be interpreted with caution.

Conclusions
This is the first report that the recent changes to the sampling strategy implemented by Google impacted the comparability of the GHT API data, particularly for comparisons of search trends from before and after 1 January 2022. Although the improved sampling strategy may result in a more accurate representation of search trends, caution should be exercised in interpreting any increased search trends observed from 1 January 2022 onward in data extracted after 18 July 2022. Furthermore, it would be impossible to determine whether such changes indeed gave a more representative view of the use of the Google Search Engine by individuals. Although such changes may impact current research activities involving the GHT API, the improved sensitivity that may arise from this change and the benefits of having an improved GHT API may, in the future, result in better predictions, which could be especially useful when using the Google Trends API for public health monitoring.
Funding: This research received no external funding.

Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/xxx/s1, Table S1: List of terms searched included in this investigation.
Institutional Review Board Statement: No ethical review was required, as the data used in this research were extracted from publicly available resources.