Global Research on Syndromic Surveillance from 1993 to 2017: Bibliometric Analysis and Visualization

Abstract: Syndromic Surveillance aims at analyzing medical data to detect clusters of illness or forecast disease outbreaks. Although research in this field is flourishing in terms of publications, an overview of the global research output has been lacking. This paper analyzes the global scientific output of the research from 1993 to 2017 using bibliometric analysis and visualization. In particular, a data processing framework was proposed based on citation datasets collected from Scopus and Clarivate Analytics' Web of Science Core Collection (WoSCC). The bibliometric method and CiteSpace were used to analyze the institutions, countries, and research areas as well as the current hotspots and trends. The preprocessed dataset includes 14,680 citation records. The analysis uncovered USA, England, Canada, France, and Australia as the top five most productive countries publishing about Syndromic Surveillance. Among academic institutions, the US Centers for Disease Control and Prevention (CDC) ranks first. The reference co-citation analysis uncovered the common research venues, and further analysis of keyword co-occurrence revealed the most trending topics. The findings of this research will help enrich the field with a comprehensive view of the status and future trends of the research on Syndromic Surveillance.


Introduction
Syndromic Surveillance, part of public health surveillance [1], is defined as the analysis of medical data to detect clusters of illness or forecast disease outbreaks [2]. Depending on how the data are collected, surveillance can be broadly classified into passive surveillance, where medical data are collected routinely, and active surveillance, where health data are actively gathered during outbreaks [3]. Syndromic surveillance is, therefore, crucial to the safety of the population. As such, a number of emergency department-based syndromic surveillance systems and early warning systems for early detection of adverse disease events have been implemented since 2000. Thereafter, numerous countries started developing their own systems. The Korea Centers for Disease Control and Prevention (KCDC) have also implemented an emergency department-based syndromic surveillance system [4,5]. Up to now, numerous journals have published articles on syndromic surveillance; however, to the best of the authors' knowledge, a comprehensive qualitative and quantitative evaluation of the research output has not been done. Bibliometric analysis was performed based on a novel framework to map the research in terms of collaborations, publications, and trends. When performing the search, the results can be divided into two: the main result, which includes records about articles related to the keyword used in the search process, and extended records or citing articles, which are articles that cite the ones reported in the main results.
In Figure 1, the search results of Q1 in WoSCC were 3805 citation records and 14,397 extra records as extended datasets. Similarly, the results of Q2 in Scopus were 118 records and 1177 citing articles. Since some journals are indexed in both WoSCC and Scopus, a preprocessing step was needed. The purpose of the preprocessing step, which is the last stage of the framework shown in Figure 1, is to remove duplicate citation records, merge the different datasets, and handle inconsistencies. For example, the datasets from Scopus were in RIS format while the data from WoSCC were in plain text format. Therefore, the Scopus data were converted into plain text format for consistency. Thereafter, the two datasets were merged into one final dataset, which was used for the analysis step. This dataset contains 14,680 records.
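The deduplication and merging step can be illustrated with a minimal sketch. It is not the authors' actual procedure: the matching rule (DOI first, then normalized title plus year), the field names, and the toy records with placeholder DOIs are all assumptions made for illustration.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace for matching."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def record_key(record):
    """Use the DOI as identity when present; otherwise normalized title plus year."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return ("doi", doi)
    return ("title", normalize_title(record["title"]), record.get("year"))


def merge_records(*sources):
    """Merge record lists, keeping the first occurrence of each key."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record_key(record), record)
    return list(merged.values())


# Hypothetical toy records; they are not taken from the study's datasets.
wos = [
    {"title": "Syndromic Surveillance: A Review", "year": 2004, "doi": "10.1000/abc"},
    {"title": "Outbreak detection in emergency departments", "year": 2006, "doi": ""},
]
scopus = [
    {"title": "Syndromic surveillance - a review", "year": 2004, "doi": "10.1000/ABC"},
    {"title": "Outbreak Detection in Emergency Departments.", "year": 2006, "doi": None},
    {"title": "Web queries as a surveillance signal", "year": 2009, "doi": "10.1000/xyz"},
]

merged = merge_records(wos, scopus)
print(len(merged))  # 3: two duplicates removed, one Scopus-only record kept
```

One limitation of this rule is that a record carrying a DOI in one source and none in the other would not be matched; a real pipeline would need an additional title-based pass for such cases.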

Bibliometric Analysis and Tools
Bibliometric analysis is used to map research areas according to researchers, publications, institutions and trends. It has been applied to analyze research output in different fields such as Middle East Respiratory Syndrome [6], Bacterial Meningitis [7], T-cell [8] and Medical Big Data [9].
Various tools are available for bibliometric analysis. In this research, CiteSpace [10] was used for the analysis, especially for creating networks of document co-citation, terms, country, and institutional collaborations. Aside from the geographic distribution map, which was created with the StatPlanet [11] software, all visualizations were generated with CiteSpace. The impact factors of the journals were retrieved according to the 2016 list.

Results
The major types of bibliometric analysis were performed, namely, collaboration networks between countries and institutions, reference co-citation, and keyword co-occurrence analysis. Additionally, a basic exploratory analysis was performed to understand the distribution of the citation dataset. The citation data are shown in Figure 2. The graph shows a steady increase in articles published in the field of syndromic surveillance since 2005. This increase in high-quality publications shows not only the importance of the topic for the safety and security of nations, but also the need to analyze the data and gain quantitative and qualitative insights into the research in this field.

Countries and Institutions
The 14,680 articles considered for analysis in this paper were produced by 73 countries. A country collaboration network was created using CiteSpace, and the network data were used to map the global distribution of these countries. The StatPlanet tool was used to generate a global view of the countries publishing about syndromic surveillance, along with their proportion of the publications from 1993 to 2017. The color-coded map is shown in Figure 3. The network data were also used to rank the countries according to their publications. Table 1 shows the top ten countries. USA, England, Canada, France, and Australia were the top five, in that order. Table 1 also lists ten institutions ranked according to their publication output. Institutions located in the United States of America dominated the top of the list: the US Centers for Disease Control and Prevention (CDC), Harvard University, and Johns Hopkins University were the top three research institutions, followed by the Canadian University of Toronto and the British London School [...]. Table 2, on the other hand, presents the top ten most cited publications produced by these institutions.
Table 2. Most cited publications (excerpt).
Freq | Reference | Journal | ID
165 | Eysenbach [14] | Journal of Medical Internet Research | 2
163 | Carneiro [15] | Clinical Infectious Diseases | 2
158 | Frumkin [16] | American Journal of Public Health | 18
112 | Heffernan [17] | Emerging Infectious Diseases | 0
110 | East [18] | Gastroenterology Clinics | 5
108 | Mandl [19] | American Medical Informatics Association | 0
102 | Tarpey [20] | Nature Genetics | 23
Note: Freq, Frequency; ID, Cluster Identification Number.
The colors in Figure 3 represent absolute counts of papers. For instance, >25 refers to countries that have published 25 or more articles in the specified period of our dataset. The black circles indicate the exact number of publications, with the size of each circle representing the number of articles.
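A ranking such as the one described for countries can be derived by counting each country once per publication. The sketch below is only an illustration under assumptions: the study derived its ranking from CiteSpace network data, whereas the `countries` field and the toy records here are hypothetical.

```python
from collections import Counter


def rank_countries(records, top_n=10):
    """Count each country once per record and return the top_n (country, count) pairs.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter()
    for record in records:
        counts.update(set(record["countries"]))
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]


# Hypothetical toy records; author countries per publication.
records = [
    {"countries": ["USA", "Canada"]},
    {"countries": ["USA"]},
    {"countries": ["England", "USA"]},
    {"countries": ["England"]},
]

ranking = rank_countries(records)
print(ranking)  # [('USA', 3), ('England', 2), ('Canada', 1)]
```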

Reference Co-Citation Analysis
To get an overview of the publication landscape, a CiteSpace-based document co-citation network was generated. This analysis can reveal the most influential papers, which form the intellectual base from which most publications were generated. CiteSpace summarizes the co-citation relationship between documents in terms of nodes and edges. In network terminology, nodes are drawn as dots, and the lines connecting these dots are called edges. Accordingly, CiteSpace generates a co-citation network in which nodes represent documents (i.e., papers), and two nodes are connected by an edge if the two documents have been cited together in another paper. The network is then decomposed into clusters, i.e., groups of strongly connected nodes. CiteSpace assesses these clusters with a silhouette measure, which quantifies the extent to which the nodes in a cluster are actually homogeneous. Afterwards, each cluster is summarized with a selected term that frequently occurs within its documents. This task is accomplished with measures from information retrieval and text mining, such as Term Frequency-Inverse Document Frequency (TF*IDF), Mutual Information (MI), and Log-Likelihood Ratio (LLR) [21]. According to Chen [10], labelling using LLR is preferred over the other two measures. The resulting network consisted of 708 nodes and 2252 edges. A timeline visualization of the network is depicted in Figure 4, which also shows the labels of the clusters. The clusters are labeled using the terms appearing in the titles of the citing articles according to the three measures, i.e., TF*IDF, MI, and LLR.
The figure is composed of three parts: an ID, which identifies the cluster; a label, which is the label given to the cluster; and a graphical representation of the clusters showing their evolution over time. For instance, the first cluster, with ID 0, is labeled "Olympic Winter Games" [22], indicating that the phrase was common in the papers included in this cluster. The second cluster was labeled "ESSENCE II", which stands for Electronic Surveillance System for the Early Notification of Community-Based Epidemics, a regional system that supports advanced surveillance within the National Capital Region (NCR), developed by the Johns Hopkins University Applied Physics Laboratory (JHU/APL) and the Division of Preventive Medicine at the Walter Reed Army Institute of Research [23]. The third cluster, with ID 2, was labeled "Google Flu Trend" and is concerned with articles that use Google Trends as a source for conducting disease surveillance [15]. Following that, "Pandemic Influenza" [24] is the major topic discussed by the articles of the cluster with ID 4. The next cluster is concerned with research regarding "Returning Traveler" [12]. In essence, the major concern in this research is returning ill travelers, who might act as seeds for spreading an epidemic.
The last four clusters, with IDs 12, 18, 19, and 30, discuss issues of "Notifiable Disease" [25], "Public Health Response" [16,26], "Syndromic Surveillance" [2,4,19,27], and "Local Perspective" [28], respectively.
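The co-citation rule described above (two references are linked whenever a third paper cites both) can be sketched in a few lines. This is a minimal illustration, not CiteSpace's implementation; the paper and reference identifiers are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_edges(reference_lists):
    """Map each unordered pair of references to the number of papers citing both."""
    edges = Counter()
    for refs in reference_lists:
        # Deduplicate and sort so that each pair has one canonical form.
        for a, b in combinations(sorted(set(refs)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical citing papers and their reference lists.
citing_papers = {
    "P1": ["A", "B", "C"],
    "P2": ["A", "B"],
    "P3": ["B", "C", "D"],
}

edges = cocitation_edges(citing_papers.values())
nodes = {ref for pair in edges for ref in pair}

print(len(nodes), len(edges))  # 4 5
print(edges[("A", "B")])       # 2: A and B are cited together by P1 and P2
```

The edge weight is the co-citation count; a network tool would typically prune or normalize these weights before clustering.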
A summary of the three largest clusters is shown in Table 3. The largest cluster (ID 0) has 98 members and a silhouette value of 0.827. It is labeled Olympic Winter Games by LLR, Syndromic Surveillance by TF*IDF, and Disease Surveillance by MI. The most active citer of the cluster is Gesteland's paper [27]. The timeline view shows this cluster as the oldest. The second largest cluster (ID 1) has 48 members and a silhouette value of 0.902. It is labeled ESSENCE II by LLR, Syndromic Surveillance by TF*IDF, and EWMA Control Chart by MI. The most active citer of the cluster is Abrams' article [29]. The third largest cluster (ID 2) has 42 members and a silhouette value of 0.965. It is labeled Google Flu Trend by LLR, Social Media by TF*IDF, and Crowd-Sourced by MI. Milinovich's paper [25] was the most active citer of the references in this cluster. In the paper, the authors reviewed studies that have exploited Internet use and search trends to monitor two particular diseases: influenza and dengue. This cluster is also the newest, indicating that it is an active research area.
The document co-citation network was also used to find the most cited references. The top ten most cited articles are listed in Table 2. The most cited article is a paper authored by Freedman DO [12], published in the New England Journal of Medicine. In the paper, the authors discussed the ten most cited articles. The second paper, by Ginsberg J [13], is published in Nature. The third, by Kulldorff M [30], is published in PLoS Medicine. The fourth, by Eysenbach G [14], is published in the Journal of Medical Internet Research. The fifth, by Carneiro HA [15], is published in Clinical Infectious Diseases. All of the articles listed in Table 2 have made significant contributions to advancing the global research in syndromic surveillance.
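The silhouette values quoted for the clusters can be read against the standard silhouette definition: for each item, a is its mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the score is (b - a) / max(a, b), so values near 1 indicate a homogeneous, well-separated cluster. The sketch below implements that textbook definition. How CiteSpace measures distance between co-cited documents is not described here, so the absolute-difference distance and the one-dimensional toy points are assumptions.

```python
def silhouette_scores(points, labels, distance):
    """Return the silhouette coefficient of every point (needs at least two clusters)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(distance(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(distance(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return scores


# Hypothetical toy data: two tight, well-separated groups on a line.
points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
scores = silhouette_scores(points, labels, lambda x, y: abs(x - y))

cluster_0 = [s for s, lab in zip(scores, labels) if lab == 0]
print(round(sum(cluster_0) / len(cluster_0), 3))  # 0.9
```

A cluster-level value such as those reported in Table 3 corresponds to the mean of the member scores, as computed in the last two lines.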
For further insights into the field, articles with citation bursts have also been identified, i.e., articles that received a sharp increase in citations within a particular period. Such articles indicate the focus of the research within that period. The top 25 articles with citation bursts, as generated by CiteSpace, are listed in Table 4. The table also shows the strength of each citation burst (Column 4) along with its start and end years.
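To make the notion of a burst period concrete, the following sketch flags runs of years whose counts exceed the series mean by a chosen number of standard deviations. This is a deliberately simplified threshold heuristic, not the burst detection that CiteSpace performs, and it produces no burst strength; the function name, the threshold, and the yearly counts are hypothetical.

```python
from statistics import mean, pstdev


def burst_periods(yearly_counts, threshold=1.0):
    """Return (start_year, end_year) runs where the count exceeds mean + threshold * std.

    Assumes the dictionary covers consecutive years.
    """
    years = sorted(yearly_counts)
    values = [yearly_counts[year] for year in years]
    cutoff = mean(values) + threshold * pstdev(values)

    spans = []
    start = end = None
    for year in years:
        if yearly_counts[year] > cutoff:
            if start is None:
                start = year  # a new run begins
            end = year
        elif start is not None:
            spans.append((start, end))  # the run ended in the previous year
            start = None
    if start is not None:
        spans.append((start, end))  # run still open at the end of the series
    return spans


# Hypothetical yearly citation counts for one article or term.
counts = {2010: 2, 2011: 3, 2012: 2, 2013: 3, 2014: 2, 2015: 4, 2016: 15, 2017: 18}
print(burst_periods(counts))  # [(2016, 2017)]
```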

Keyword Cooccurrence Analysis
Although a network of keyword co-occurrence can be used to track the evolution of a particular research field, in this paper it is used to highlight hot and trending research issues.
A network of 469 nodes and 1905 edges was created from the noun phrases occurring in the citation datasets. A list of the burst phrases was generated from the network (Table 5). At the top of the list, with the highest citation burst, is the term Zika Virus. Since its appearance in 2015, the Zika virus (ZIKV) epidemic has motivated the Americas to enhance their surveillance systems [54]. Table 5 shows that this term has had a citation burst from 2016 until now, suggesting it is a trending research topic. Other phrases such as Big Data [55], Social Media [56], and Google Trends [15] were also among the top keywords with recent citation bursts. Taken together, the proliferation of social media, the massive amount of real-time data associated with it, and the increasing advancement in big data technologies suggest an emerging direction towards new surveillance systems that can use Google Trends in detecting and forecasting outbreaks.

Discussion
In this paper, a bibliometric analysis of the global research on Syndromic Surveillance between 1993 and 2017 has been presented. The citation datasets were collected from two major sources: Scopus and Web of Science. The analysis was performed to evaluate the volumes of publications, the major venues, and the active institutions, countries, and authors, as well as to assess the intellectual bases and highlight the research fronts. Some findings are as follows: (1) The publications follow an almost linear increase from 2005 onwards. While this pattern has been observed in numerous bibliometric analyses of well-established research fields [8,9,57], it indicates the global efforts to combat epidemics [58-60] and pandemics [61] as they occur. (2) The top three productive countries were USA, England, and Canada, respectively. Table 1 and Figure 3 show the complete list. (3) The study has uncovered three major clusters of research, namely Disease Surveillance, EWMA Control Charts, and Crowd-Sourced. The papers within each cluster represent the intellectual base of a subfield, which can be named after the cluster label. For example, a careful investigation of Table 6 shows that the second largest cluster (i.e., research field) has 48 papers as its intellectual base. This cluster is concerned with the applications of Statistical Process Control (SPC) methods for the purpose of disease surveillance. The papers which cite elements of a cluster can be viewed as research fronts. For example, the work in [62] can be considered a current research front which builds on the intellectual base of Crowd-Sourced methods for disease surveillance.
Table 6 (excerpt). 2 | Crowd-Sourced | 42 | [13-15,25,55,56,59,60,...] | 34 | [29,30,62,183,192,197,200,216,...]
This study has several strong points: (1) It can serve as a guide to new researchers in the field as well as help policy makers direct the research. (2) It is not only the first bibliometric review applied to research output in the field of Syndromic Surveillance, but it also considered two major databases: Scopus and Web of Science's Core Collection. (3) Although the framework was applied here to the research output in Syndromic and Disease surveillance, it can be generalized to other fields. (4) To the best of the authors' knowledge, it is the first study not only to put forward the major clusters of publications, but also to reveal their associated intellectual bases and research fronts, as shown in Table 6.
Despite its major contributions, the study has some limitations: (1) Although this study combined datasets from two of the major sources, it might have missed papers not indexed in Scopus or Web of Science. It also ignored research papers written in other languages as well as technical reports. (2) Similar to the majority of bibliometric analysis tools, CiteSpace relies on the citation count as a measure of importance and impact. Thus, it might have overlooked recent articles whose impact is not yet reflected in their citation count. (3) Similar to some other works in bibliometric analysis, the clusters and their associated labels have not been validated by subject matter experts.

Conclusions
This paper has provided a first-of-its-kind qualitative and quantitative view of the global research output on syndromic surveillance. In essence, the paper made the following contributions. It provides a general overview of the research landscape in the last 24 years (1993-2017), in which the most productive countries and the institutions with the most publications have been identified. The paper also identified a list of the most significant articles. Furthermore, the current research themes and the emerging trends of research have been identified. In general, this article not only provides a bird's-eye view to researchers and policy makers working in the field of syndromic surveillance, but also serves as technical guidance for systematic reviews.