A Large-Scale Dataset of Search Interests Related to Disease X Originating from Different Geographic Regions

The World Health Organization added Disease X to their shortlist of blueprint priority diseases to represent a hypothetical, unknown pathogen that could cause a future epidemic. During different virus outbreaks of the past, such as COVID-19, Influenza, Lyme Disease, and Zika virus, researchers from various disciplines utilized Google Trends to mine multimodal components of web behavior to study, investigate, and analyze the global awareness, preparedness, and response associated with these respective virus outbreaks. As the world prepares for Disease X, a dataset on web behavior related to Disease X would be crucial to contribute towards the timely advancement of research in this field. Furthermore, none of the prior works in this field have focused on the development of a dataset to compile relevant web behavior data, which would help to prepare for Disease X. To address these research challenges, this work presents a dataset of web behavior related to Disease X, which emerged from different geographic regions of the world, between February 2018 and August 2023. Specifically, this dataset presents the search interests related to Disease X from 94 geographic regions. The dataset was developed by collecting data using Google Trends. The relevant search interests for all these regions for each month in this time range are available in this dataset. This paper also discusses the compliance of this dataset with the FAIR principles of scientific data management. Finally, an analysis of this dataset is presented to uphold the applicability, relevance, and usefulness of this dataset for the investigation of different research questions in the interrelated fields of Big Data, Data Mining, Healthcare, Epidemiology, and Data Analysis with a specific focus on Disease X.


Introduction
In the recent past, several viruses, such as COVID-19 [1], the plague [2], the Spanish flu [3], HIV [4], Ebola [5], and MPox [6], have rampaged unopposed across different countries, infecting and leading to the demise of people, resulting in the destruction of political regimes, affecting various sectors of the global economy, and causing financial and psychosocial burdens, the likes of which the world has not witnessed in centuries [7]. As a response to this, various organizations and policy-making bodies on a global scale have begun investigating approaches to learn from such virus outbreaks, with an aim to not repeat the mistakes of the past during future virus outbreaks. "Disease X" is a placeholder name that was adopted by the World Health Organization (WHO) in February 2018 in their shortlist of blueprint priority diseases to represent a hypothetical, unknown pathogen that could cause a future epidemic [8,9]. The WHO used the placeholder term "Disease X" to make sure that its planning (such as relevant tests, expanded vaccinations, and production capabilities for vaccines) was robust, versatile, and equipped to deal with an unidentified virus [10]. The idea of Disease X, according to Anthony Fauci (the director of the US National Institute of Allergy and Infectious Diseases at that time), was to motivate the WHO's investigations on entire classes of viruses rather than just specific strains of certain viruses, with an aim to strengthen the WHO's preparedness for dealing with such outbreaks [11].
Thus, it is crucial to plan and adopt a holistic approach to predict and prevent a new pandemic in the future. Prior works [12][13][14][15] in this field have discussed various means by which Disease X might start. For instance, deadly pathogens released from melting glaciers could start a new pandemic. Alternatively, with the continual increase of global warming and climate change, viruses dormant at present may become active, mutate, and lead to the next pandemic. Furthermore, human and animal contact has become increasingly common, and the lack of proper protocols in this regard has led to the outbreak of zoonotic viruses in the past. A well-known example of this would be H1N1, which contained genetic material of human, avian, and swine origin, involving wildlife, pig farming, animal movement, and farm workers [16]. Therefore, in the last couple of years, works in this field have also focused on predicting what type of pathogen might be responsible for Disease X, with an aim to create, implement, and evaluate countermeasures that would help control the potential pandemic at a faster rate than previous pandemics, such as COVID-19 [17][18][19]. Simpson et al. [20] stated that Disease X is likely to occur due to one or more of these risk factors: human interactions with wildlife, the production of goods derived from animals with minimal oversight of workers and an unclear supply chain, bug and tick vectors, extremely high population densities, and limited surveillance and laboratory capacities. This work by Simpson et al. [20] also states that Disease X will probably be caused by the zoonotic spread of a highly infectious RNA virus from a region where the confluence of risk factors and population dynamics will lead to prolonged person-to-person transmission.
A widely agreed-upon aspect of Disease X within the research community in this field is that the world is currently not prepared with the applicable countermeasures, policies, and procedures that would be necessary to control and contain this virus. There are multiple factors that need to be taken into consideration when creating new response, control, and preparation measures, including vaccine development and distribution, country and state responses, political stances, and cultural and environmental factors. It is crucial that there is global preparation, coordination, and communication such that each of these factors is considered and managed in coordination with other factors to allow for controlling and containing a new pandemic [21].
One of the overarching issues that was observed on a global scale when attempting to handle the COVID-19 pandemic was the lack of efficiency, coordination, agreement, and organization related to the production and distribution of vaccines and COVID-19 tests in a timely manner [22]. While various organizations and labs were working to create a vaccine, it seemed that different countries were scrambling to even put up testing centers and mass produce enough COVID-19 tests. It took far longer than ideal to ensure easy access to COVID-19 tests, which allowed the COVID-19 virus to continue to spread at an alarming rate because symptoms were not guaranteed to be noticeable in all population groups [23,24]. Testing is one of the first lines of defense against viruses because the threshold of the spread can be determined, and suitable actions can be taken depending on the positive cases that are reported [25]. This was an issue with the supply chain and communication across agencies during the outbreak and rapid spread of COVID-19. The same supply chain issues were reported during the rollout of the COVID-19 vaccines, which proceeded at an even slower rate than the rollout of the tests. Research labs in different geographic regions seemed less prepared to mass produce and distribute the vaccines, which slowed down response rates and did not contain the spread of COVID-19 in a timely manner [26,27]. Another issue associated with the COVID-19 pandemic was the lack of coordination and cooperation between countries in their responses [28]. During the outbreak of COVID-19, some countries implemented measures (such as partial or complete lockdowns) immediately, while others did not implement such measures at the same pace [29,30]. Finally, a major issue specifically seen during the COVID-19 pandemic was political stances standing in the way of scientific progress. A lot of misinformation circulated over the course of the pandemic, concerning the effectiveness of vaccines, the safety of the vaccines, the accuracy of the test results, approaches for treatment, and the severity of the virus [31,32].
During the outbreak of COVID-19 and similar viruses in the past, Google Trends attracted a significant amount of attention from researchers across different disciplines, such as Big Data [33,34], Data Mining [35,36], Healthcare [37,38], Epidemiology [39,40], Information Retrieval [41,42], and Data Analysis [43,44], as Google Trends helps to mine, analyze, and obtain real-time insights related to web behavior, and its features offer advantages over traditional surveys [45]. However, none of the prior works in this field have focused on mining and analyzing web behavior data from Google Trends related to Disease X. Addressing this research gap serves as the main motivation for this work.
The rest of this paper is organized as follows. In Section 2, a review of recent works in this field is presented. Section 3 discusses the methodology that was followed for mining relevant web behavior data from Google Trends to develop the proposed dataset. The description of the dataset is presented in Section 4. Section 4 also highlights how the dataset complies with the FAIR principles of scientific data management. A brief analysis of the dataset and potential applications of this dataset are presented in Section 5. The conclusion is presented in Section 6, which is followed by references.

Literature Review
In today's "Internet of Everything" style of living, the internet is integrated into our everyday routines more than ever before [46]. The web affords people across the world opportunities for internet-mediated engagement, and many everyday activities are rapidly moving online. Therefore, mining, analysis, and modeling of web behavior hold substantial importance across various disciplines, particularly for enhancing recommender systems [47], collaborative filtering mechanisms [48], user behavior clustering [49], customization of technology [50], modeling user trust and acceptance towards emerging technologies [51], enhancing webpage transitions [52], user personality detection [53], user interest analysis [54], monitoring virus outbreaks [55], and forecasting epidemics [56], just to name a few.
There have been multiple methodologies developed, implemented, and applied for web behavior mining, modeling, and analysis. However, in the last few years, Google Trends has become increasingly popular amongst researchers from different disciplines for research works related to studying web behavior [57]. In the area of web behavior monitoring and analysis associated with different virus outbreaks, Google Trends has had a wide range of applications and use cases [58][59][60][61]. Ginsberg et al. [62] discussed the significance of seasonal influenza and the potential threat of a pandemic caused by a new strain of the influenza virus using Google Trends. The work proposed a method to enhance early disease detection by monitoring Google Search queries, which reflected health-seeking behavior. By analyzing Google Search queries, the researchers accurately estimated weekly cases of influenza in different regions of the United States, allowing for rapid detection and response to influenza with only a one-day reporting lag. The work by Kapitány-Fövény et al. [63] focused on analyzing Google Search volumes using Google Trends to forecast Lyme disease incidences. By integrating Google Trends data into a seasonal autoregressive integrated moving average (SARIMA) model, the researchers compared their predictions with the actual reported values for Lyme disease incidence in Germany. The objective of the work done by Verma et al. [64] was to assess the potential of using Google Trends data for predicting disease outbreaks. Focusing on diseases like malaria, dengue fever, chikungunya, and enteric fever in two regions in India (Chandigarh and Haryana), the research compared Google Search trends with Integrated Disease Surveillance Programme (IDSP) data. The analysis revealed a temporal correlation between the two datasets, particularly with a lag of 2 to 3 weeks for chikungunya and dengue fever, indicating the feasibility of utilizing Google Trends for predicting disease outbreaks at both local and regional levels.

Young et al. [65] explored the potential of using relevant Google Search queries from Google Trends to monitor and predict syphilis cases at a state level. The study investigated the relationship between weekly reported syphilis cases and online search activity related to risk factors. By employing linear mixed models, the study established associations between search query data and syphilis cases, achieving accurate predictions for a significant number of weeks. The results indicated a strong correlation between web behavior and reported syphilis cases, suggesting the feasibility of integrating such data into public health monitoring systems for disease surveillance and prediction. Another work by Young et al. [66] focused on utilizing Google Search data to monitor and predict new HIV diagnosis cases in the United States. They collected HIV-related search volume data and state-level new HIV diagnoses data using Google Trends. Thereafter, they developed a predictive model using significant predictor keywords identified through LASSO and combined this data with actual HIV case reports from the CDC. The model demonstrated strong predictive capabilities, achieving an average R² value of 0.99 and an average root-mean-square error (RMSE) of 108.75 when comparing predicted and actual HIV cases. Morsy et al. [67] focused on predicting Zika virus cases using Google Search queries from Google Trends. The researchers developed a prediction model based on time-series regression (TSR) that utilized Zika search volume from Google Trends to anticipate confirmed Zika cases in Brazil and Colombia. The model, with a 1-week lag of Zika query and a 1-week lag of Zika cases as a control for autocorrelation, was found to be the most effective in predicting Zika cases. The results demonstrated the potential to forecast Zika cases a week ahead of outbreaks, offering healthcare authorities an early indicator for outbreak evaluation and precautionary measures. Using Google Trends, Ortiz-Martínez et al. [68] showed that there was a high correlation between the COVID-19 incidence in Colombia and Google searches on COVID-19 in Colombia (R² = 0.8728 and p < 0.0001).

In addition to the above, in the last few years, Google Trends has also had a wide range of interdisciplinary applications related to the understanding and analysis of public health concerns [69][70][71], societal problems [72][73][74], emerging technologies [75][76][77], human behavior analysis [78][79][80][81], assistive technologies [82][83][84][85], humanitarian issues [86][87][88][89], and smart technologies [90][91][92][93]. Therefore, it may be concluded that prior works in this field have used Google Trends for the mining, analysis, and investigation of multimodal components of web behavior for a wide range of applications and use cases, with a specific focus on studying and analyzing web behavior during various virus outbreaks. However, these works have multiple limitations. The following is a summary of these limitations and an overview of the public health needs in the context of Disease X:

• In the last few years, a significant amount of research related to the mining and analysis of web behavior, associated with different virus outbreaks such as COVID-19 [40,68,84,92], influenza [62], Lyme disease [63], malaria [64], dengue [64], chikungunya [64], syphilis [65], HIV [66], and Zika virus [67], has been conducted. Even though Disease X features in the shortlist of blueprint priority diseases of the WHO, no prior work in this field has focused on Disease X. Therefore, it is crucial to perform the mining and analysis of web behavior related to Disease X.

• The works that analyzed relevant data from Google Trends during virus outbreaks of the past have focused on web behavior originating from a very limited number of geographic regions. For example, the work by Verma et al. [64] focused on the analysis of the web behavior from two regions in India, the work by Young et al. [66] focused on web behavior analysis from the United States, and the work of Morsy et al. [67] focused on the web behavior analysis from Brazil and Colombia. Similar to the virus outbreaks of the past, which were not localized in one or two geographic regions, the outbreak of Disease X is expected to have a global impact. Therefore, the need of the hour is to mine and analyze the web behavior data related to Disease X emerging from different geographic regions.
To address these limitations and public health needs, and to contribute to the timely advancement of research and development in this field, this work presents a dataset that comprises web behavior data related to Disease X that emerged from 94 regions between February 2018 and August 2023. These 94 regions were selected for the development of this dataset as all these regions recorded a significant level of interest towards Disease X during this timeframe. This dataset was developed by collecting data from Google Trends. This paper also presents an analysis of this dataset and discusses potential applications of the same in the context of public health needs related to Disease X. The methodology that was followed for the development of this dataset is presented in Section 3.

Methodology
Google Trends [94], a tool developed by Google, allows the mining and analysis of real-time and historical information associated with Google Search queries, enabling researchers to uncover valuable insights into the interests of individuals across different domains and topics [95]. Google Trends analyzes search behavior by considering searches on Google and can thus provide unique insights associated with web behavior. This feature is particularly valuable in health informatics, where understanding public engagement and interests in health-related topics and predicting disease outbreaks is of paramount importance [96].
The real-time data availability of Google Trends gives it an advantage over traditional survey methods, and it is also far less time-consuming. Additionally, as the web behavior data available via Google Trends is anonymous, it allows researchers to explore different forms of data analysis that might have been otherwise difficult due to privacy concerns of the general public [96]. Google Trends presents several significant advantages over traditional survey methods, positioning it as a potent tool for research and analysis of the multimodal characteristics of web behavior. The foremost advantage lies in the cost-effectiveness of utilizing Google Trends. Unlike traditional surveys, which frequently entail significant expenses for participant recruitment, data collection, and analysis, Google Trends operates as a cost-free resource. This allows researchers to channel resources into more focused areas of investigation or allocate them toward enhancing the research process itself. Another key advantage centers around the breadth and diversity of the data captured by Google Trends. Conducting regular surveys on a global scale is a logistical challenge, often constrained by geographic and demographic limitations. However, Google Trends aggregates web behavior data on a global scale, which can be used for in-depth study and analysis. This global perspective of Google Trends enhances the generalizability of findings and facilitates cross-cultural comparisons, making it a valuable resource for understanding the intricacies of web behavior across different geographic regions.

Moreover, the near real-time nature of data availability on Google Trends is a game changer. Google Trends offers almost immediate access to search trends as they unfold, providing researchers with timely access to evolving interests and trends. This swift access to information enables timely analysis, decision-making, and trend detection, making it particularly advantageous in fields that require quick response, such as public health and policy formulation. In contrast, traditional surveys often grapple with time delays, influenced by the labor-intensive nature of participant recruitment and adherence to inclusion criteria. The delays inherent in survey-based research can hinder the ability to capture real-time insights, potentially impacting the accuracy and relevancy of the findings. The instant accessibility of Google Trends data addresses this limitation, empowering researchers with the agility to adapt and react promptly to emerging trends or shifts in user interests related to a topic as evidenced by relevant web behavior.
Google Trends presents the frequency at which a specific search term is input into Google's search engine relative to the overall search volume on the site during a specific timeframe. Mathematically, if n(q,l,t) represents the number of searches for the query q in the location l during the period t, the relative popularity (RP) of the query is computed as shown in Equation (1). In Equation (1), Q(l,t) is the set of all the queries made from location l at time t, and Π(n(q,l,t) > τ) is a dummy term with a value of 1 when n(q,l,t) > τ (query is popular) and 0 otherwise. The resulting numbers are then scaled within the range of 0 to 100 based on the proportion of the topic relative to the total number of search topics. This defines the Google Trends Index (GTI), as shown in Equation (2) [96].
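Equations (1) and (2) are not reproduced in this text, so the following is only a minimal sketch of the computation described above. It assumes that RP is the share n(q,l,t) divided by the sum of n over Q(l,t), multiplied by the threshold indicator, and that the GTI rescales RP so that the period with the highest RP receives a value of 100. The function names, the threshold value, and the monthly counts are hypothetical and chosen only for illustration.

```python
from typing import Dict


def relative_popularity(n_q: int, n_total: int, tau: int) -> float:
    """RP(q, l, t) under the assumed form: n(q,l,t) / total searches in (l, t),
    multiplied by an indicator that is 1 when n(q,l,t) > tau and 0 otherwise."""
    if n_total == 0 or n_q <= tau:
        return 0.0
    return n_q / n_total


def trends_index(rp_by_period: Dict[str, float]) -> Dict[str, int]:
    """GTI under the assumed form: rescale RP so the peak period maps to 100."""
    peak = max(rp_by_period.values(), default=0.0)
    if peak == 0:
        return {period: 0 for period in rp_by_period}
    return {period: round(100 * rp / peak) for period, rp in rp_by_period.items()}


# Hypothetical counts for one location: (searches for q, all searches) per month.
counts = {
    "2023-06": (40, 100_000),
    "2023-07": (5, 120_000),    # below the threshold, so treated as 0
    "2023-08": (300, 150_000),
}
tau = 10  # hypothetical popularity threshold

rp = {month: relative_popularity(n_q, n_all, tau) for month, (n_q, n_all) in counts.items()}
gti = trends_index(rp)
print(gti)
```

Under these assumptions, a month whose query count does not exceed the threshold receives an index of 0, which matches the behavior described for terms with low search volumes.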
These index values can be generated by Google Trends starting from 1 January 2004, up to 36 hours prior to the present search. Google Trends excludes searches made by very few users and highlights popular search topics while assigning 0 to terms with low search volumes [97]. The following is an overview of the features of Google Trends:

• Search Term Trends: This feature allows users to see how the popularity of a specific search term or keyword has changed over time. Google Trends provides a graphical representation to highlight these trends.

• Related Queries: Google Trends displays related queries that are frequently searched alongside the user's primary search term. This can help identify related topics or terms relevant for data analysis.

For developing this dataset, the web behavior data in terms of search interests related to Disease X (as a topic) was collected using Google Trends from February 2018 to August 2023. To perform this task, the global search trends related to Disease X (as a topic) during this timeframe (February 2018 to August 2023) were mined using Google Trends. The following represents the specific steps that were followed in this regard:

• Navigate to the "Explore" tab on Google Trends.

• Set the search query as Disease X (Topic).

• Set the geolocation to "Worldwide".

• Set "All Categories" in the categories option.

• Select "Web Search" for the type of search.
February 2018 was selected as the start time, as the WHO added Disease X to their shortlist of blueprint priority diseases in February 2018. 8 August 2023 was the most recent date at the time of data collection. The search query was set as Disease X (Topic) to mine the search interests related to Disease X as a topic on a global scale. This selection ensured that different search queries on Google related to Disease X were included in the search interest values being computed, regardless of their specific focus, such as how to prepare for Disease X, policies to reduce the spread of Disease X, treatments for Disease X, medications for Disease X, available vaccines for Disease X, effect of Disease X on the education industry, effect of Disease X on the global economy, impact of Disease X on the healthcare industry, effect of Disease X on stock markets, and Disease X and stay-at-home guidelines, just to name a few. In the categories option on Google Trends, different web search categories are available. However, for the data collection, the "All Categories" option was selected to take into account all the different search categories on Google in the context of web searches about Disease X. These web search categories include Arts and Entertainment, Autos and Vehicles, Beauty and Fitness, Books and Literature, Business and Industrial, Computers and Electronics, Finance, Food and Drink, Games, Health, Hobbies and Leisure, Home and Garden, Internet and Telecom, Jobs and Education, Law and Government, News, Online Communities, People and Society, Pets and Animals, Real Estate, Reference, Science, Shopping, Sports, and Travel. The result provided by Google Trends is shown in Figure 1. Thereafter, by using the "Regional Interest" feature of Google Trends, the list of regions that recorded significant search interests related to Disease X during this timeframe was compiled and exported. This list of regions is shown in Table 1.

List of Regions
Singapore, Haiti, Honduras, El Salvador, Madagascar, Panama, Bolivia, Reunion, Guatemala, Cuba, United Arab Emirates, Paraguay, Nicaragua, Hong Kong, Macao, Qatar, United Kingdom, Brunei, Ecuador, Uruguay, Oman, Bahrain, Ireland, Kuwait, Costa Rica, Argentina, India, Puerto Rico, Venezuela, France, St. Helena, Brazil, Mexico, Côte d'Ivoire, Peru, Canada, Australia, Zimbabwe, Colombia, United States, Luxembourg, Lebanon, Ghana, Algeria, New Zealand, Portugal, Malaysia, Myanmar (Burma), Ethiopia, Dominican Republic, China, Chile, Nepal, Belgium, Iraq, Taiwan, South Africa, Tunisia, Sri Lanka, Thailand, Switzerland, Spain, Bangladesh, Saudi Arabia, Kenya, South Korea, Germany, Norway, Pakistan, Indonesia, Hungary, Morocco, Austria, Israel, Nigeria, Bulgaria, Philippines, Netherlands, Denmark, Greece, Italy, Jordan, Egypt, Sweden, Finland, Czechia, Romania, Poland, Iran, Türkiye, Russia, Vietnam, Ukraine, Japan

Thereafter, by utilizing Google Trends as the data source, search interests related to Disease X (as a topic) for all these 94 regions between February 2018 and August 2023 were collected and exported as .CSV files. For other geographic regions, for instance, Yemen, Tajikistan, Namibia, and Fiji, there was not a significant number of Google searches related to Disease X on a monthly basis between February 2018 and August 2023. As a result, Google Trends did not provide any value for search interests related to Disease X from such regions, and they were not included in the dataset development. To consolidate the 94 .CSV files into one workbook in Microsoft Excel, the Power Query interface in Excel was employed. The Power Query tool used each individual file as a data source and imported each file's data into the Excel workbook. Each region's search interest for "Disease X" is present as a different sheet in this file, which was uploaded to IEEE Dataport [98] as a dataset. The flowchart in Figure 2 shows the step-by-step process that was followed for the development of this dataset. This dataset is described in Section 4.
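The consolidation described above was performed with Power Query in Excel. As an illustrative alternative only, the following sketch shows how per-region export files could be parsed and gathered into one structure with one entry per region, mirroring the one-sheet-per-region layout. It assumes that each export begins with a short preamble (a "Category: All categories" line and a blank line) followed by a header row starting with "Month", and that very small values may appear as "<1"; both are assumptions about the export format rather than details stated in this paper. The sample file contents and the function names are hypothetical.

```python
import csv
import io
from typing import Dict, List, Tuple

Row = Tuple[str, int]


def parse_trends_export(text: str) -> Tuple[str, List[Row]]:
    """Parse one exported file into (value-column label, [(month, interest), ...])."""
    lines = text.splitlines()
    # Skip any preamble: the data starts at the header row beginning with "Month".
    start = next(i for i, line in enumerate(lines) if line.startswith("Month,"))
    reader = csv.reader(io.StringIO("\n".join(lines[start:])))
    header = next(reader)
    rows: List[Row] = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        month, value = row[0], row[1]
        # Assumption: values such as "<1" are mapped to 0 for simplicity.
        rows.append((month, 0 if value.startswith("<") else int(value)))
    return header[1], rows


def consolidate(exports: Dict[str, str]) -> Dict[str, dict]:
    """Build one entry per region, analogous to one sheet per region."""
    workbook = {}
    for region, text in exports.items():
        label, rows = parse_trends_export(text)
        workbook[region] = {"columns": ["Month", label], "rows": rows}
    return workbook


# Hypothetical file contents for two regions (values are made up).
exports = {
    "Singapore": "Category: All categories\n\nMonth,Disease X: (Singapore)\n2018-02,0\n2018-03,<1\n2018-04,100\n",
    "Honduras": "Category: All categories\n\nMonth,Disease X: (Honduras)\n2018-02,7\n2018-03,0\n",
}
workbook = consolidate(exports)
print(sorted(workbook))
```

Mapping "<1" to 0 is a simplification; an analysis that needs to distinguish very low interest from no recorded interest would keep such values separate.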

Data Description
This section describes the dataset, which is available at https://dx.doi.org/10.21227/ht7f-rx42. This dataset contains one Microsoft Excel workbook that comprises 94 different sheets, where each sheet presents the search interests related to Disease X (as a topic) for a different region between February 2018 and August 2023. The search interest data for all the regions stated in Table 1 is available in this dataset. For each region, this dataset presents the search interests related to Disease X (as a topic) for each month in this timeframe, i.e., from February 2018 to August 2023. Table 2 presents the description of the different attributes present in this dataset. It is worth mentioning here that these attributes are present across different sheets in the dataset, where each sheet is named after its region. For instance, in the sheet named "Singapore", the two attributes present are "Month" and "Disease X: (Singapore)". Similarly, in the sheet named "Honduras", the two attributes present are "Month" and "Disease X: (Honduras)". There are 94 such sheets in this dataset, which are named as per these regions. To avoid presenting a table with 95 rows, not all the attributes from all the sheets are listed in Table 2.

In the remainder of this section, the compliance of this dataset with the FAIR principles of Scientific Data Management [99] is explained. The FAIR principles include four key aspects of scientific data management, namely Findability, Accessibility, Interoperability, and Reusability. The components of the FAIR Principles are interrelated, yet each remains independent and distinct. These principles delineate specific factors to be taken into account in modern data publication settings, particularly in relation to facilitating both human and computerized methods of depositing, exploring, accessing, collaborating, and reusing data. In the last few years, numerous works have emerged, primarily focused on specific domains, advocating for enhancements in the handling of data and data archival practices. However, FAIR stands apart by presenting succinct, domain-agnostic, overarching principles that can be universally applied to diverse research results. The principles delineate the essential attributes that modern data resources, tools, vocabularies, and infrastructures should possess in order to facilitate the process of identification and enable the reuse of such resources by others. The principles may be followed in various configurations and progressively, as data providers' publication settings progress towards higher levels of 'FAIRness'. Furthermore, the flexibility of the principles, together with their clear differentiation between data and metadata, provides explicit support for a diverse array of scenarios. The FAIR Guiding Principles, which are of a higher level, come before the selection of implementation options and do not endorse any particular technology, standard, or implementation approach. It is important to note that these principles do not constitute a standard or a specification in and of themselves. These guidelines serve as a reference for data publishers and stewards, aiding them in assessing the effectiveness of the resulting decisions in ensuring that their research artifacts are defined by the principles of Findability, Accessibility, Interoperability, and Reusability. These overarching principles facilitate a diverse array of integrated and investigative behaviors, grounded in a vast selection of technological options and applications.
The FAIR principles have the potential to benefit a wide range of stakeholders. These include researchers who seek to distribute, receive recognition for, and utilize one another's data and solutions. Essentially, the FAIR principles endeavor to cultivate a more cooperative and transparent research landscape, facilitating the exchange of knowledge and bolstering the lasting influence of scientific investigations related to database development and database management [99]. Several prior works in the field of dataset development have discussed how developed datasets, such as the human metabolome database for 2022 [100], WikiPathways dataset [101], datasets of Tweets about COVID-19 [102,103], a dataset of Tweets about MPox [104], computational 2D materials database (C2DB) [105], the open reaction database [106], RCSB Protein Data Bank [107], and PHI-base (pathogen-host interactions database) [108], just to name a few, comply with the FAIR principles of scientific data management.

This dataset, available at https://dx.doi.org/10.21227/ht7f-rx42, is findable, as it has a unique and permanent DOI assigned by IEEE Dataport. This DOI can be used by researchers from any discipline to find this dataset online. This dataset satisfies the accessibility property, as it can be accessed by any user on the internet using any device via the DOI, as long as the user's device is connected to the internet and is functioning properly. The dataset is interoperable, as the data in this dataset is available in a standard format (.xlsx file) that can be downloaded, read, and analyzed across different computer systems, frameworks, and applications. Finally, this dataset satisfies the reusability property as the data can be re-used any number of times for the study and investigation of different research questions that focus on the analysis of search interests related to Disease X.
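As a small illustration of reuse, the sheet layout described in this section (one sheet per region with the attributes "Month" and "Disease X: (Region)", covering each month from February 2018 to August 2023) can be checked programmatically before analysis. The sketch below is not part of the dataset or its tooling. It assumes that months are labeled in a "YYYY-MM" form, which may differ from how a spreadsheet application displays them, and all function names are hypothetical.

```python
from typing import List


def month_range(start: str, end: str) -> List[str]:
    """Inclusive list of 'YYYY-MM' labels from start to end."""
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    months = []
    while (year, month) <= (end_year, end_month):
        months.append(f"{year:04d}-{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


# Timeframe covered by the dataset: February 2018 to August 2023.
EXPECTED_MONTHS = month_range("2018-02", "2023-08")


def expected_columns(region: str) -> List[str]:
    """Attribute names described for the sheet of a given region."""
    return ["Month", f"Disease X: ({region})"]


def check_sheet(region: str, columns: List[str], months: List[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the sheet looks as described."""
    problems = []
    if columns != expected_columns(region):
        problems.append(f"{region}: unexpected columns {columns}")
    if months != EXPECTED_MONTHS:
        problems.append(f"{region}: expected {len(EXPECTED_MONTHS)} months, found {len(months)}")
    return problems


print(len(EXPECTED_MONTHS))
print(check_sheet("Singapore", ["Month", "Disease X: (Singapore)"], EXPECTED_MONTHS))
```

With this timeframe, each sheet is expected to hold one row per month, and a sheet that deviates from the described layout is reported rather than silently accepted.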

Data Analysis and Potential Applications
This section presents the results obtained from a brief analysis of this dataset. It concludes by highlighting a few potential applications of this dataset. The data present in this dataset can be analyzed to obtain the trends in search interests during this timeframe for each of these 94 regions. For instance, the analysis of this data for the United States is presented in Figure 3. In this figure, the X-axis represents the months, and the Y-axis represents the search interest related to Disease X on a scale of 0 to 100. This analysis of the data for the United States shows that the search interest related to Disease X was the highest in August 2023. Similar trends and insights associated with search interests for Disease X emerging from different geographic regions can be obtained from analysis of the search interest data for that region as available in this dataset. The data present in this dataset also reveal how global search patterns about Disease X have evolved. For instance, Figure 4 shows a world-map-based representation of different regions that recorded a significant number of Google searches about Disease X on 8 August 2023. As can be seen from this figure, these regions were Fiji, Oman, Ethiopia, the United Kingdom, Uganda, Uruguay, El Salvador, Puerto Rico, Nepal, Canada, Ecuador, Venezuela, Bangladesh, Morocco, Bolivia, Singapore, Bulgaria, South Africa, Ireland, the United States, the United Arab Emirates, Australia, the Netherlands, Hong Kong, Israel, Nigeria, Austria, Belgium, India, Pakistan, Portugal, Colombia, Egypt, Argentina, France, Poland, Thailand, Malaysia, Spain, Germany, and Russia. Figure 5 shows a world-map-based representation of different regions that recorded a significant number of Google searches about Disease X on 8 July 2023. As can be seen from Figure 5, these regions were Maldives, Cuba, Kuwait, Morocco, the Dominican Republic, Bulgaria, Costa Rica, Uruguay, New Zealand, Israel, Pakistan, Peru, the United Arab Emirates, Greece, Philippines, Argentina, Thailand, Spain, Türkiye, Brazil, and the United States. These two figures illustrate that the global landscape, in terms of interest in Disease X, changed significantly over a period of just one month, between 8 July 2023 and 8 August 2023. Figures 4 and 5 also indicate that within a short period of time, people from several regions of the world started proactively searching for Disease X on Google. In a similar manner, the evolution of global interest related to Disease X over different time periods can be mapped, analyzed, and interpreted using this dataset.
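The comparison between the two snapshots can be made concrete with a small set-based sketch. The two region lists below are transcribed from the text above (with the article "the" dropped from region names); the function name `compare_snapshots` and the output keys are illustrative assumptions and are not part of the dataset.

```python
# Regions listed in the text for the two snapshots (Figures 4 and 5).
AUGUST_2023 = {
    "Fiji", "Oman", "Ethiopia", "United Kingdom", "Uganda", "Uruguay",
    "El Salvador", "Puerto Rico", "Nepal", "Canada", "Ecuador", "Venezuela",
    "Bangladesh", "Morocco", "Bolivia", "Singapore", "Bulgaria",
    "South Africa", "Ireland", "United States", "United Arab Emirates",
    "Australia", "Netherlands", "Hong Kong", "Israel", "Nigeria", "Austria",
    "Belgium", "India", "Pakistan", "Portugal", "Colombia", "Egypt",
    "Argentina", "France", "Poland", "Thailand", "Malaysia", "Spain",
    "Germany", "Russia",
}
JULY_2023 = {
    "Maldives", "Cuba", "Kuwait", "Morocco", "Dominican Republic", "Bulgaria",
    "Costa Rica", "Uruguay", "New Zealand", "Israel", "Pakistan", "Peru",
    "United Arab Emirates", "Greece", "Philippines", "Argentina", "Thailand",
    "Spain", "Türkiye", "Brazil", "United States",
}


def compare_snapshots(earlier, later):
    """Split regions into those in both snapshots, only the earlier, or only the later."""
    return {
        "both": sorted(earlier & later),      # regions present on both dates
        "dropped": sorted(earlier - later),   # present only in the earlier snapshot
        "new": sorted(later - earlier),       # present only in the later snapshot
    }


result = compare_snapshots(JULY_2023, AUGUST_2023)
print(len(result["both"]), len(result["dropped"]), len(result["new"]))  # 10 11 31
```

With the lists as transcribed here, 10 regions appear in both snapshots, 11 appear only in the July snapshot, and 31 appear only in the August snapshot, which is consistent with the observation that the set of regions changed considerably within one month.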
During the development of this dataset, it was observed that online searches on Google related to Disease X during this timeframe (February 2018 to August 2023) had several related queries. The 'rising' keywords associated with these related queries were collected using the "Related Queries" feature of Google Trends, as described in Section 2, and are represented as a word cloud in Figure 6.
Thereafter, a comprehensive analysis of the search interests associated with Disease X from all 94 regions between February 2018 and August 2023 was performed to explore and investigate the underlying trends. These are presented in Figures 7-15. Multiple graphical representations were prepared, primarily to ensure the readability of the compared trends. Each of these graphs presents the trends in search interests about Disease X for about 10 distinct countries. For instance, from Figure 7 it can be inferred that the number of Google Searches about Disease X in Brunei in August 2023 was much higher than the number of Google Searches about
Thereafter, a country-specific analysis of the highest search interests related to Disease X was performed. For readability, the results are shown as two different graphs in Figures 16 and 17, respectively. Multiple novel insights related to the volume of Google Searches originating from different parts of the world can be inferred from Figures 16 and 17. For instance, a high number of Google Searches about Disease X (indicated by the search interest value being 100) was observed in several countries for August 2023. These specific regions were Bolivia, the United Kingdom, Oman, Ireland, Canada, Kuwait, Australia, the United States, the Dominican Republic, Algeria, Malaysia, Portugal, New Zealand, South Africa, Bangladesh, Pakistan, Norway, the Netherlands, and Nigeria. This is a significant increase compared to recent historical data on the number of regions recording a search interest value of 100 for Disease X. For instance, in June 2023, only one country, Luxembourg, recorded a high number of Google Searches about Disease X (indicated by the search interest value being 100). Similarly, in April 2023, only one country, Réunion, recorded a high number of Google Searches about Disease X (indicated by the search interest value being 100). These findings indicate that there has been a considerable increase in Google Searches about Disease X from different regions of the world.
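The month-by-month comparison of regions reaching a search interest value of 100 can be sketched as follows. This is a minimal illustration, assuming the data has been loaded into a mapping from region to monthly values; the in-memory layout, the name `regions_at_peak`, and every value other than the 100s reported in the text are placeholders and do not come from the dataset.

```python
from collections import defaultdict

# Hypothetical excerpt: region -> {month: search interest on a 0-100 scale}.
# Only the values of 100 reflect statements in the text; all other numbers
# are placeholders chosen for illustration.
SEARCH_INTEREST = {
    "Luxembourg": {"2023-04": 0, "2023-06": 100, "2023-08": 40},
    "Réunion": {"2023-04": 100, "2023-06": 0, "2023-08": 0},
    "Ireland": {"2023-04": 0, "2023-06": 10, "2023-08": 100},
    "Canada": {"2023-04": 5, "2023-06": 0, "2023-08": 100},
}


def regions_at_peak(data, peak=100):
    """Map each month to the sorted list of regions whose value equals `peak`."""
    by_month = defaultdict(list)
    for region, series in data.items():
        for month, value in series.items():
            if value == peak:
                by_month[month].append(region)
    # Sort months and region names so the output is deterministic.
    return {month: sorted(regions) for month, regions in sorted(by_month.items())}


peaks = regions_at_peak(SEARCH_INTEREST)
for month, regions in peaks.items():
    print(month, len(regions), regions)
```

Applied to the full dataset, the same grouping would make it possible to count how many regions reached a value of 100 in each month.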
This dataset is expected to contribute towards the investigation of a wide range of research questions in different disciplines, such as Healthcare, Epidemiology, Big Data, Data Science, and Data Analysis, with a specific focus on Disease X. Given the potential correlation between the volume of internet searches and the information needs of users [109,110], this dataset could be utilized by public health organizations at both local and global levels to obtain insights into the specific needs of the general public pertaining to Disease X in various geographic regions. The use of Google Trends as a supplementary tool to conventional surveillance systems for monitoring epidemics has proven to be effective, as demonstrated by multiple prior works in this field [111,112]. So, if an epidemic due to Disease X were to start, this dataset is expected to serve as a framework for public health organizations for the development of a surveillance system for Disease X. Previous studies in this field have utilized data acquired from Google Trends to identify instances of "panic-induced searching", where media coverage of a specific outbreak amplified web-search activity [113]. For example, during the avian influenza outbreak that occurred between 2005 and 2006 [114], starting in China and spreading to Turkey, there were notable spikes in the US search volume index for the term "bird flu", despite no confirmed cases of avian flu in the US. So, the data available in the dataset could be analyzed to infer whether the spikes in search interests related to Disease X emerging from different regions were caused by media coverage about Disease X or whether people in those regions were genuinely concerned about Disease X. The internet serves as a platform for public health organizations to efficiently and affordably distribute healthcare information. However, it is crucial that reliable news is disseminated to the general public [115-117]. In the last few weeks, many public health organizations have disseminated information pertaining to Disease X, with the aim of reaching a wide audience [118-121]. The dataset presented in this work could be helpful for public health organizations to build a framework to understand the translation gap of such information about Disease X, i.e., the gap between what people need to know and what most people believe they know. Furthermore, this dataset, along with relevant social media data about Disease X, may be utilized by both global and local health organizations to identify the information requirements related to Disease X, barriers to preventing infection stemming from social and behavioral factors, and instances of misinformation about Disease X in different regions of the world. Finally, the investigation of the temporal pattern of query rates about Disease X from this dataset, in tandem with their geographical dispersion and primary search themes, may provide a measurable and meaningful indicator of the level of public interest and information requirements regarding Disease X.
The work presented in this paper has a few limitations. First, the data collected by Google Trends is limited to the search patterns emerging from only a subset of the global population, specifically those who have the ability to access the internet and who choose to use Google as their preferred search engine, as opposed to other search engines. Second, a significant constraint of Google Trends is the lack of thorough information about the methodology used by Google for generating search interest data and the algorithms utilized for its analysis. Third, the data from Google Trends available in this dataset represents relative search volumes and not absolute values of the number of Google Searches, as Google Trends only provides the relative search volume data. Finally, there is a lack of documentation on the past developments related to the design and functionalities of Google Trends. This absence of documentation may result in fluctuations in search results and therefore impact the outcomes of research studies, depending on when they were conducted.
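The third limitation can be illustrated with a deliberately simplified model of relative search volume. The sketch below assumes that values are scaled so that the largest value in the selected window becomes 100; since the exact methodology is not fully documented, this is an assumption and not a description of how Google Trends actually computes its values. The raw counts and the function name `to_relative_interest` are hypothetical.

```python
def to_relative_interest(counts):
    """Scale raw counts to 0-100 relative to the largest value in the window."""
    peak = max(counts.values())
    if peak == 0:
        # No searches at all: every month maps to 0.
        return {month: 0 for month in counts}
    return {month: round(100 * value / peak) for month, value in counts.items()}


# Hypothetical raw query counts; absolute numbers are not available from Google Trends.
partial = {"2023-06": 200, "2023-07": 400, "2023-08": 800}   # August counted up to 8 August
full = {"2023-06": 200, "2023-07": 400, "2023-08": 4000}     # August counted up to 31 August

print(to_relative_interest(partial))  # {'2023-06': 25, '2023-07': 50, '2023-08': 100}
print(to_relative_interest(full))     # {'2023-06': 5, '2023-07': 10, '2023-08': 100}
```

Under this assumed model, the same underlying activity in June and July maps to different relative values depending on how much of August is included, which is why relative values collected at different times should be compared with care.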

Conclusions
The World Health Organization (WHO) added "Disease X" to their shortlist of blueprint priority diseases to represent a hypothetical, unknown pathogen that could cause a future epidemic. Since then, several works in this field have analyzed virus outbreaks of the past to propose approaches, methodologies, principles, and guidelines for better awareness, preparedness, and response towards Disease X. Many of these works that focused on analyzing virus outbreaks of the past, such as COVID-19, Influenza, Lyme disease, and Zika virus, utilized Google Trends to mine and analyze multimodal components of web behavior. However, two primary limitations exist in these works. First, these works did not specifically focus on Disease X. Second, many of these works focused on the analysis of Google Trends data originating from a very limited number of geographic regions. To address these limitations and to contribute towards the timely advancement of research in this field, this work presents a dataset of search interests related to Disease X (as a topic) originating from 94 regions of the world between February 2018 and August 2023. These 94 regions were selected for the development of this dataset as all these regions recorded a significant level of search interest towards Disease X during this timeframe. The dataset is available at https://dx.doi.org/10.21227/ht7f-rx42. In this dataset, for every region, the search interest related to Disease X is available for each month during this timeframe. The dataset complies with the FAIR principles of scientific data management. This paper also presents a brief analysis of this dataset to uphold its relevance and usefulness for the investigation of different research questions in the interrelated fields of Big Data, Data Mining, Healthcare, Epidemiology, Information Retrieval, and Data Analysis with a specific focus on Disease X. To the best of the authors' knowledge, no similar work in this field has been done so far. Future work in this area would involve analyzing the specific trends of search interests related to Disease X across different geographic regions to determine and investigate specific similarities or dissimilarities of those trends.

Figure 1. Trends in search interests related to Disease X (as a topic) on a global scale between February 2018 and August 2023.

Table 1. List of 94 regions that recorded significant search interests related to Disease X (as a topic) between February 2018 and August 2023.

Figure 2. A flowchart to represent the step-by-step process of the development of this dataset.

Figure 3. Trends in search interests related to Disease X (as a topic) for the United States between February 2018 and August 2023.

Figure 4. A world-map-based analysis of the significant number of Google searches related to Disease X from different regions of the world on 8 August 2023.

Figure 5. A world-map-based analysis of the significant number of Google searches related to Disease X from different regions of the world on 8 July 2023.
Figure 6 shows a word-cloud-based representation of the 'rising' queries related to Disease X during this timeframe (February 2018 to August 2023). In this context, it is worth mentioning that the mining of the data from Google Trends for the development of this dataset was performed on 8 August 2023. Google Trends provided the search interest for August 2023 for each of the 94 regions by taking into account the relevant Google Searches recorded from 1 August 2023 to 8 August 2023. So, if the data collection is performed once again at the end of August 2023 or at a later date using Google Trends, it is possible that the search interest for August 2023 for some of these regions might change, as Google Trends would then report the search interest value for August 2023 by taking into account all relevant Google Searches recorded from 1 August 2023 to 31 August 2023.

Figure 8. A graphical analysis of search interests (monthly) related to Disease X in Bolivia, Hong Kong, Bahrain, Paraguay, Ecuador, the United Kingdom, Macao, Costa Rica, Reunion, and Oman between February 2018 and August 2023.

Figure 9. A graphical analysis of search interests (monthly) related to Disease X in Ireland, Myanmar (Burma), Uruguay, Venezuela, Argentina, India, Puerto Rico, France, Qatar, and St. Helena between February 2018 and August 2023.

Figure 10. A graphical analysis of search interests (monthly) related to Disease X in Madagascar, Brazil, Peru, Mexico, Cambodia, Canada, Ethiopia, Luxembourg, Colombia, and Kuwait between February 2018 and August 2023.

Figure 11. A graphical analysis of search interests (monthly) related to Disease X in Australia, the United States, Ghana, Nepal, Lebanon, the Dominican Republic, Algeria, Malaysia, Portugal, and New Zealand between February 2018 and August 2023.

Figure 12. A graphical analysis of search interests (monthly) related to Disease X in Iraq, China, Taiwan, Belgium, South Africa, Switzerland, Tunisia, Côte d'Ivoire, Bangladesh, Chile, and Thailand between February 2018 and August 2023.

Figure 13. A graphical analysis of search interests (monthly) related to Disease X in Spain, Saudi Arabia, Morocco, South Korea, Germany, Pakistan, Norway, Indonesia, Greece, Bulgaria, and Jordan between February 2018 and August 2023.

Figure 14. A graphical analysis of search interests (monthly) related to Disease X in Hungary, Kenya, Philippines, Israel, Austria, Netherlands, Denmark, Egypt, Italy, Sri Lanka, and Sweden between February 2018 and August 2023.

Figure 15. A graphical analysis of search interests (monthly) related to Disease X in Nigeria, Finland, Romania, Czechia, Ukraine, Poland, Türkiye, Vietnam, Iran, Russia, and Japan between February 2018 and August 2023.

Figure 16. A graphical representation of the specific months between February 2018 and August 2023 when the highest search interests related to Disease X were recorded in Singapore, Honduras, Haiti, Nicaragua, Guatemala, El Salvador, Brunei, Panama, Cuba, the United Arab Emirates, Bolivia, Hong Kong, Bahrain, Paraguay, Ecuador, the United Kingdom, Macao, Costa Rica, Reunion, Oman, Ireland, Myanmar (Burma), Uruguay, Venezuela, Argentina, India, Puerto Rico, France, Qatar, St. Helena, Madagascar, Brazil, Peru, Mexico, Cambodia, Canada, Ethiopia, Luxembourg, Colombia, Kuwait, Australia, the United States, Ghana, Nepal, Lebanon, the Dominican Republic, and Algeria.
• Regional Interest: Users can view the geographical regions where a specific search term is most popular using Google Trends. Google Trends provides insights into regional differences in search interests for search terms.
• Trending Searches: This feature of Google Trends highlights the current and popular search queries or topics, providing real-time insights into what people are searching for on Google.
• Year in Search: Google Trends often releases a "Year in Search" report summarizing the top search queries from the past year. This report offers an overview of significant events and trends.
• Category Comparison: Users can compare the search interests of different categories or topics on Google using Google Trends. This can be useful for understanding the relative popularity of various topics.
• Time Period Selection: Google Trends allows users to specify the time period for which they wish to query and analyze the data. This can range from a few hours to multiple years.

Table 2. Description of different attributes present in the dataset.