1. Introduction
A novel coronavirus, COVID-19 (formerly known as 2019-nCoV), has emerged over the last few weeks since its outbreak in Wuhan City, China [1,2,3,4,5]. This severe acute respiratory syndrome (SARS)-like virus has infected over 75,000 people and killed over 2000 in China [1,2,3,4,5]. Case diagnoses have been confirmed in 26 countries, and 14 deaths have been reported outside of mainland China [1,2,3,4,5]. Currently, COVID-19 is spreading rapidly in South Korean communities, with almost 200 confirmed cases [1]. Little is known regarding this virus, aside from a possible incubation period of 2 to 14 days and a mortality rate of approximately 2.2% [5]. Increasing numbers of cases have also been reported in other countries across all continents except Antarctica, and the rate of new cases outside of China has outpaced the rate in China. These cases initially occurred mainly among travelers from China and those who had contact with travelers from China [6,7]. However, ongoing local transmission has driven smaller outbreaks in some locations outside of China, including South Korea, Italy, Iran, and Japan, and infections elsewhere have been identified in travelers from those countries [8]. In the United States, clusters of COVID-19 with local transmission have been identified throughout most of the country [6,7].
COVID-19 is of critical concern for public health [9,10]. Health care providers should be kept up to date on public health developments and COVID-19 outbreaks affecting their communities so that they can promptly make correct decisions [10,11]. This would enable them to offer improved services in an efficient manner, which is crucial in the current situation [10]. Most health care providers depend on the Center for Disease Control and Prevention (CDC) to be informed of disease outbreaks or to be notified of new infectious diseases such as COVID-19 [10]. However, infectious diseases, especially the novel COVID-19, are still not under control [12]. Numerous researchers are attempting to gain an improved understanding of the evolution of COVID-19 and the causes of the disease [13,14,15]. This knowledge may help predict COVID-19 infections, which would allow more targeted identification of at-risk populations. Recently, social media search indices (SMSIs) have been shown to correlate with the transmission of infectious diseases [16,17,18]. Studies have demonstrated that specific word searches in social networks may be a predictor of the transmission of influenza [18], SARS [17], dengue fever [19], and Middle East respiratory syndrome [16]. Nevertheless, choosing keywords for an SMSI is difficult, even though they have a considerable effect on the performance of a prediction model. Since people continuously learn new terminology and change the search keywords they use, keywords should be updated regularly to maintain prediction performance [20]. As in the case of Google Flu Trends, such a system can fail to predict disease outbreaks correctly [21]. Therefore, the proposed digital surveillance system should be used with caution, or as a complementary method.
This study investigated the correlation between the number of new cases of COVID-19 and the search index for a popular social network in China, Baidu search index (BSI), as the reference SMSI. The aim of this study was to create an effective and affordable model to predict new cases, which would enable prompt and correct decision-making regarding public policies to limit the spread of COVID-19.
3. Results
We report the positive correlation between the series of new suspected COVID-19 cases and the lagged series of five keywords in BSI (Table 1). In addition, the correlation between the lagged BSI series and new suspected COVID-19 cases was significant and positive, which revealed that changes in SMSI behaviors occurred 6–9 days earlier than the confirmation of COVID-19 infection cases (Figure 1 and Figure 2). The correlation between the number of new suspected COVID-19 cases and the lagged SMSI values was statistically significant (Table 1). In our study, the SMSI was a predictor of new suspected COVID-19 cases, and the signal could be detected 6–9 days before the confirmation of new COVID-19 infection cases.
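The lagged correlation described above can be illustrated with a minimal sketch. It assumes the analysis amounts to a Pearson correlation between the search index on day t − lag and the case count on day t; the function names and the toy series are hypothetical and are not taken from the study's data or code.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def lagged_correlation(search_index, new_cases, lag):
    """Correlate the search index on day t - lag with new cases on day t."""
    if lag == 0:
        return pearson(search_index, new_cases)
    return pearson(search_index[:-lag], new_cases[lag:])


def best_lag(search_index, new_cases, max_lag):
    """Return (lag, r) with the highest correlation among lags 0..max_lag."""
    scores = [(lag, lagged_correlation(search_index, new_cases, lag))
              for lag in range(max_lag + 1)]
    return max(scores, key=lambda pair: pair[1])


# Toy data, invented for illustration: cases echo searches three days later.
searches = [5, 9, 14, 30, 22, 41, 35, 60, 52, 80, 71, 95]
cases = [0, 0, 0] + [2 * s for s in searches[:-3]]

print(best_lag(searches, cases, max_lag=5))
```

In this toy series the three-day shift is built in by construction; with real data, the lag with the highest correlation would be read off in the same way for each keyword.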
Moreover, we summarize the accuracy metrics for five methods (Table 2). Among these methods, subset selection had the lowest RMSE, MAE, and MAPE and the highest correlation and correlation of increment, which indicated that it was the optimal method for explaining the data. The subset selection method only selected 10 of the 50 predictors.
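For reference, the accuracy metrics named above can be sketched as follows. RMSE, MAE, and MAPE follow their standard definitions; the "correlation of increment" is assumed here to be the Pearson correlation between the day-to-day changes of the true and predicted series, which is our reading rather than a definition given in this section. The example values are invented.

```python
from math import sqrt


def rmse(actual, pred):
    """Root mean square error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))


def mae(actual, pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)


def mape(actual, pred):
    """Mean absolute percentage error; assumes no actual value is zero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def increments(series):
    """Day-to-day changes of a series."""
    return [b - a for a, b in zip(series, series[1:])]


def correlation_of_increment(actual, pred):
    """Assumed definition: correlation between the increments of both series."""
    return pearson(increments(actual), increments(pred))


# Invented example values, not results from the study.
actual = [100, 120, 150, 130]
pred = [110, 115, 155, 125]
print(rmse(actual, pred), mae(actual, pred), mape(actual, pred))
print(pearson(actual, pred), correlation_of_increment(actual, pred))
```

Lower RMSE, MAE, and MAPE and higher correlations indicate a closer fit, which is the sense in which the methods are ranked here.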
Figure 3 illustrates the prediction of the number of new COVID-19 cases and the error term. The prediction was close to the true series, and the error term was random and very small along the time axis, which confirmed that the subset selection method captured most of the relationship between search behaviors and the number of new COVID-19 cases.
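To make the idea of subset selection concrete, the following is a minimal sketch of exhaustive best-subset selection with ordinary least squares on a tiny invented example. It is not the authors' implementation: the predictor names, the data, and the use of residual sum of squares as the selection criterion are all assumptions made for illustration.

```python
from itertools import combinations


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(columns, y):
    """Least squares with an intercept via the normal equations: (coefs, rss)."""
    design = [[1.0] * len(y)] + [list(col) for col in columns]  # one list per feature
    gram = [[sum(u * v for u, v in zip(f, g)) for g in design] for f in design]
    moment = [sum(u * t for u, t in zip(f, y)) for f in design]
    coefs = solve(gram, moment)
    fitted = [sum(c * f[i] for c, f in zip(coefs, design)) for i in range(len(y))]
    rss = sum((t - p) ** 2 for t, p in zip(y, fitted))
    return coefs, rss


def best_subset(predictors, y, size):
    """Exhaustively pick the `size` predictors whose OLS fit has the lowest RSS."""
    best = None
    for names in combinations(sorted(predictors), size):
        _, rss = fit_ols([predictors[name] for name in names], y)
        if best is None or rss < best[1]:
            best = (names, rss)
    return best


# Hypothetical lagged keyword series; y depends only on fever and pneumonia.
predictors = {
    "fever_lag6": [1, 4, 2, 8, 5, 7],
    "pneumonia_lag6": [3, 1, 6, 2, 9, 4],
    "dizziness_lag6": [5, 5, 4, 6, 5, 4],
}
y = [1 + 2 * f + 3 * p
     for f, p in zip(predictors["fever_lag6"], predictors["pneumonia_lag6"])]

print(best_subset(predictors, y, size=2))
```

An exhaustive search is only practical for small problems: choosing 10 of 50 predictors would mean about ten billion candidate subsets, so a stepwise or otherwise approximate search would normally be used at that scale.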
Furthermore, we used the subset selection method, identified above as optimal, to examine the correlation with new confirmed COVID-19 cases (Table 3).
Table 1 reports significant correlations between SMSI on lag day 10 and new confirmed COVID-19 cases. The correlation between SMSI and new confirmed COVID-19 cases (nearly 50%) was lower than the correlation with new suspected COVID-19 cases (>80%; Table 1). The five specific keywords on lag day 10 were significantly correlated with new confirmed COVID-19 cases. The highest significant correlations on lag day 10 were, in order, chest distress, fever, pneumonia, coronavirus, and dry cough. We also replaced the keywords with unrelated features (angina pectoris, difficulty urinating, impotence, urinary incontinence, and dizziness) and compared the results with those of the original model to illustrate the sensitivity of the model (Table A1). Because early symptoms of COVID-19 do not include angina pectoris, difficulty urinating, impotence, urinary incontinence, or dizziness, we observed no correlations between the lagged Baidu Index series of these keywords and new COVID-19 cases. The overall estimation performance was worse with these non-specific keywords. This indicates that our prediction results are stable.
Figure 1 and Figure 2 also demonstrate that the SMSI could serve as a predictor, detecting COVID-19 cases 10–12 days before they were confirmed.
We also identified similar patterns in SMSI and the series of new suspected and confirmed COVID-19 cases. Furthermore, the patterns appeared earlier in SMSI than in the series of new suspected and confirmed COVID-19 cases.
4. Discussion
Web and social media platforms have seen a rapid rise in user numbers across both the developed and developing world [26]. Every day, millions of people self-report their symptoms online through social media by using terms such as “fever,” “cough,” or “sore throat” [27]. Increasingly, people are using the Internet to search for information regarding their health [28]. An estimated 80% of all Internet users search for health information [29]. For instance, the number of tweets and searches related to influenza-like illness increases during flu season. These anonymized data can help to track outbreaks across populations, almost instantaneously, and with geographically linked information [30]. Yahoo and Google have demonstrated that searches can detect outbreaks up to two weeks earlier than traditional disease surveillance [31]. The present study is the first to use BSI as the source of SMSI data in relation to COVID-19 epidemiology and to investigate potential predictors of new suspected or confirmed COVID-19 infection (Table 1 and Table 3). Tracking web data could allow a larger proportion of the population to be assessed, compared with traditional health surveillance methods [32].
Symptoms are not a diagnosis, and diseases can share common symptoms [33]. Therefore, accurate diagnosis or prediction of the underlying infectious agent remains the cornerstone of early warning systems, because it informs correct interventions [34]. SMSI-based models could serve as earlier, rapid, and affordable advanced sensing systems [35], which detect new suspected or confirmed COVID-19 infections with specificity (Table 1, Table 3, Figure 1, Figure 2, Figure 4, and Figure 5) and in real time, enabling rapid and effective public health interventions.
Predicting new suspected or confirmed COVID-19 cases is crucial for developing targeted antiviral drugs, vaccines, or effective public health interventions to prevent a future outbreak of COVID-19 [36]. In Table 1, the correlation between new suspected COVID-19 case numbers and the lagged SMSI values was statistically significant. Changes in the SMSI could predict new suspected COVID-19 cases 6–9 days earlier. Moreover, our SMSI-based predictive method was also significantly correlated with new confirmed COVID-19 cases 10–12 days earlier (Table 3, Figure 4, and Figure 5). The correlation between the lagged SMSI values and new suspected COVID-19 cases was more than 80%, and that with new confirmed COVID-19 cases was nearly 50%. In Table 1, the correlations of coronavirus and pneumonia searches in social media were 0.8325 and 0.8130 (p < 0.0001 for both), respectively, nine days prior to the reporting of new suspected COVID-19 cases. Furthermore, dry cough, fever, coronavirus, and pneumonia searches were positively correlated with new suspected COVID-19 infections eight days earlier (lag day 8; Table 1). The five keywords were all significantly correlated with new suspected COVID-19 cases, with correlation coefficients of 0.8288, 0.8896, 0.8396, 0.8301, and 0.8886 for dry cough, fever, chest distress, coronavirus, and pneumonia, respectively. The SMSI keyword search patterns occurred seven days before new suspected COVID-19 infection. Searches for fever and pneumonia preceded new suspected COVID-19 cases by six days, with over 90% correlation (Table 1). This SMSI could potentially be used to predict the areas and populations at risk of an outbreak of COVID-19. The SMSI in our study could be a predictor of COVID-19 infection, which would allow government health departments to formulate public health policies earlier and limit the spread of COVID-19 infection.
SMSI could be an effective and affordable tool for predicting emerging infectious diseases, and our findings in COVID-19 are compatible with studies on other emerging infectious diseases [35,37]. In Figure 1, Figure 2, Figure 4, and Figure 5, SMSI appeared to predict COVID-19 diagnosis a week early. Early prediction of COVID-19 infection benefits public health policies by revealing specific infectious outbreak areas and at-risk populations, allowing governments to implement health policies to prevent the epidemic from expanding, as was the case with SARS [38]. Health authorities can educate highly susceptible populations in suspected infectious outbreak areas [38]. Public health policies may include the following: ensuring triage, early recognition, and source control (isolating patients with suspected COVID-19 infection); applying standard precautions for all patients; implementing empiric additional precautions (droplet, contact, and, when necessary, airborne precautions) for suspected cases of COVID-19 infection; implementing administrative controls; using environmental and engineering controls; and instructing the population not to eat raw eggs and to wash their hands with soap. The government should apply standard precautions for people who mention the five keywords (discomfort within 14 days) in SMSI. Standard precautions include hand and respiratory hygiene, the use of appropriate personal protective equipment, risk assessments, injection safety practices, safe waste management, proper linens, environmental cleaning, and sterilization of patient-care equipment [39]. Respiratory hygiene measures include ensuring that all patients cover their nose and mouth with a tissue or elbow when coughing or sneezing, offering medical masks to patients with suspected COVID-19 infection while they are waiting in public areas or in cohort rooms, and exercising proper hand hygiene after contact with respiratory secretions [39]. If people have a history of long-term contact with birds, we suggest that they receive an influenza vaccine. We also recommend certain precautions for people who are highly susceptible to COVID-19 infection: consuming a balanced diet and exercising; not eating poultry eggs or products; never smuggling or purchasing meat from unknown birds; never touching or feeding migratory birds; never releasing or discarding birds; not mixing breeding birds with other poultry; and avoiding crowded places or places with no air circulation (such as traditional markets or hospitals, unless necessary). Moreover, SMSI may make COVID-19 screening more accurate by focusing it on areas and populations under high suspicion; government health departments would thus not need to screen without specific targets, saving time, labor, and money.
Table 2 summarizes the accuracy metrics of the different estimation methods, including the correlation and incremental correlation. The last column of Table 2 presents the number of predictors after the application of the selection method. The number presented for ridge regression is 50. We included the constant as a variable by mistake when calculating the number of variables, and corrected it in our manuscript. It does not mean that for each predictor the method relies on only two observations. Although the number of observations is smaller than the number of predictors, the application of these methods is correct, as they can handle the classical high-dimensional case. In our predictive model, subset selection was the optimal method for explaining the data. The subset selection method only selected 10 of the 50 possible predictors. Furthermore, the subset selection prediction of new suspected COVID-19 cases and the error term are displayed in Figure 3. The prediction in Figure 3 is close to the true series; the error term is random and very small along the time axis, which suggests that the subset selection method can capture most of the relationship between people’s search behavior and the number of new suspected COVID-19 cases. In our study, the highest correlation and incremental correlation in the subset selection model were 0.9996 and 0.9963, respectively. The intra-class correlation coefficient (ICC) is a robust correlation measure for cross-sectional data, but our study is based on time series, so the ICC may not be applicable. The highest correlation and incremental correlation were high enough to support our model. Therefore, the subset selection method was optimal in our current predictor model, and our findings are compatible with those of previous studies [40,41].
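The remark that these methods can handle more predictors than observations can be made explicit for ridge regression. As a sketch in our own notation (not taken from the study), let X be the n × p matrix of predictors, y the vector of new case counts, and λ > 0 the penalty weight:

```latex
\hat{\beta}_{\mathrm{ridge}}
  = \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
  = \left( X^{\top} X + \lambda I_p \right)^{-1} X^{\top} y,
  \qquad \lambda > 0 .
```

Because X^T X is positive semidefinite, X^T X + λI_p is positive definite and therefore invertible for any λ > 0, even when n < p, whereas ordinary least squares has no unique solution in that case. Ridge regression shrinks coefficients without setting them exactly to zero, which is consistent with it retaining all 50 predictors.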
Figure 1, Figure 2, Figure 4, and Figure 5 display the outcomes of descriptive statistics. Figure 1 and Figure 2 illustrate that the keywords fever and pneumonia were searched on social networks six days before new suspected COVID-19 cases. The earliest keyword searches with a positive correlation over 80% were coronavirus and pneumonia, which were searched for nine days before new suspected COVID-19 cases. Using an SMSI to predict the outbreak of COVID-19 is affordable and effective, and it could help counter the problem of people hiding symptoms because they are afraid to seek medical attention, which may, in turn, lead to outbreaks.
This study is the first to investigate the possibility of using SMSI to predict outbreaks of COVID-19 in people in affected areas. The SMSI employed exhibited a high association with new suspected and confirmed COVID-19 cases. SMSI could be an effective early predictor, which would enable government health departments to locate potential and high-risk outbreak areas. Therefore, government health departments could prepare in advance for epidemic prevention and formulate new public health policies earlier.
This study has some limitations. First, efforts have been made to improve the accuracy of big data methods, for instance, by developing tools to overcome some of the problems that Google Flu Trends has encountered, including surges in media interest, which distort the numbers of self-reported symptoms. COVID-19 is a novel infectious disease; thus, distortion in the numbers of self-reported symptoms may be unavoidable. Second, BSI is more widely used in China than Google or Twitter; thus, we have no other social network with which to validate our data. Therefore, the high usage rate of BSI in China is the principal corroborator of our conclusions. Third, statistically, early symptoms of COVID-19 are related to suspected patients but are not determining factors for new confirmed COVID-19 patients, who are determined by the nucleic acid test. In addition, other respiratory diseases with similar symptoms might introduce bias into the predictor model. Thus, the correlations between SMSI and new confirmed COVID-19 cases were lower than those between SMSI and new suspected COVID-19 cases. Therefore, although the association between SMSI and new confirmed COVID-19 cases was strong, SMSI might be a good reference for potential outbreaks of COVID-19, not a definitive tool for identifying new confirmed COVID-19 cases.