The traditional infectious disease detection process is being challenged by potential social media applications [1
]. The latest estimates released by the United States Centers for Disease Control and Prevention (US-CDC) revealed the severity of the illness. According to this authoritative report, the US-CDC estimates that between 1 October 2018 and 4 May 2019 there were approximately 37.4 million to 42.9 million flu infections in the population, of which 17.3 million to 20.1 million led to flu-related medical visits [3
]. Furthermore, 531,000 to 647,000 people required flu-related hospitalization, and influenza caused an estimated 36,400–61,200 deaths. This estimate is based on more recent data from a larger and more diverse group of countries, including lower middle-income countries, and this estimate excludes deaths from non-respiratory diseases. The statistics indicate a severe reality and a pressing challenge because influenza causes significant losses of human life and damage to property worldwide. As evidenced by the current flu season, influenza viruses can rapidly mutate, evading the most current vaccine formulations [4
]. Infectious diseases continue to be the leading cause of death worldwide, and they cause serious loss of life and property when they cannot be quickly and accurately assessed [5
]. The World Health Organization (WHO) announced on 15 August 2003 that by 7 August 2003, there were 8422 infectious cases of Severe Acute Respiratory Syndrome (SARS) worldwide, which involved 32 countries and regions [6
]. The number of deaths due to SARS in the world totaled 919, and the mortality rate was almost 11%. A handful of studies estimated the global macroeconomic impact of SARS at USD 30–100 billion, or approximately USD 3–10 million per case [7
]. However, almost two months elapsed between the first reported case on 15 December 2002 and the first epidemic announcement by the Chinese government on 11 February 2003. The consequence was a significant loss of life and property due to the lack of effective disease identification and monitoring methods for real-time epidemic control. According to the infectious disease epidemic reports issued by the National Health Commission of China, the numbers of influenza cases and deaths among Class C infectious diseases have risen sharply from year to year in recent years (see Figure 1
). The death toll for five months during 2019 was nearly two times that for all of 2018.
The fundamental cause of these serious outbreaks is summarized as follows. First, traditional disease prevention and control institutions mainly rely on a single channel for information monitoring and access. Specifically, data are exclusively sourced from clinical statistics. However, relying exclusively on clinical statistics has obvious disadvantages, such as being time consuming and creating high labor costs [8
]. The traditional detection methods cannot integrate multichannel infectious disease information, such as social media and search engine data. Second, social media data are a powerful and promising resource that has been applied to many research subjects, such as healthcare informatics [10
], sentiment analytics [12
], and disaster management [13
]. The remarkable value of social media has been widely recognized [16
]. In this paper, we refer to microblogs posted on social media platforms, particularly Sina Weibo, as weibos. Although disease prevention and control institutions have played a significant role in disease detection, they may face substantial difficulties in using social media data for disease detection and control because they lack analytic methods or accurate monitoring of the outbreaks and periods of infectious diseases. Social media data are still far from being fully utilized in disease prevention and control. Third, the flu is characterized by strong contagiousness and rapid spread, which makes it difficult to monitor in real time or to estimate its spread precisely. For example, the flu spreads easily through airborne droplets, contaminated items, and contact among people. The Centers for Disease Control and Prevention (CDC) publishes data on influenza-like illness (ILI) based on statistics and evaluation after a patient’s visit. It would be extremely difficult to obtain information and perform an analysis before the visit, and the resulting time lag could lead to delayed treatment. Fourth, Google Flu Trends (GFT) provided an estimate of more than double the proportion of ILI reported in the clinical data published by the CDC [17
]. The ILI was calculated based on surveillance reports from laboratories across the United States [18
]. Google used web search data to propose a GFT model for real-time monitoring. When patients are aware of their active flu period and search for flu-related keywords through search engines such as Google, the patients’ behavior is recorded by the search engine. Social media sensors, in contrast, show unique advantages for quick flu monitoring and reliable estimation and prediction. There has been a growing consensus that social media sensors can also perform real-time monitoring that is more accurate than GFT [8
]. Fifth, part of the ILI data cannot be collected and thus is not available if the patient does not go to the hospital, which makes disease monitoring information inaccurate and incomplete. Therefore, the severity and urgency of the disease as reflected by traditional statistics are often underestimated. However, social media can capture this part of the data because when bloggers catch a cold, they can post their own flu symptoms through social media. Even if bloggers do not realize that they have caught the flu, they may still post flu-related microblogs on social media. This part of the data can also be seen in advance of CDC-released reports [19
]. Sixth, the previous literature mainly focused on whether the bloggers are infected based on the social media platform [20
] and cannot accurately distinguish the various periods of the disease, which largely reduces the effectiveness and pertinence of the information for disease control measures.
This section provides an extensive review of the relevant literature in three parts: optimization of the infectious disease detecting process, social media utilization for disease detection and semantic analysis techniques based on social media data. Much of the current research focuses on social media to analyze short texts [21
], in which several papers predict flu trends by classifying flu-related social media data [19
]. However, these studies do not divide the flu periods any further. Speedily evolving infectious diseases, including SARS, Ebola and influenza, pose significant health threats throughout the world because of their rapidly changing status and complicated detection process [24
]. A considerable amount of work is devoted to forecasting disease outbreaks. A two-period model optimized the process of when and where to assign Ebola treatment units across geographic regions during the outbreak’s early phases [26
]. Chen et al. (2018) developed a mixed-integer programming (MIP)-based framework to systematically analyze a rich set of policies and to determine the optimal hepatocellular carcinoma surveillance policies that maximize the societal net benefit [27
]. It is observed that the surveillance policies should be adapted to different disease progression rates and states. Several studies in this area focused on finding optimal surveillance solutions for flu vaccine production and allocation [28
]. Another study considered the conditions of limited reporting and spatial aggregation on the optimization of influenza surveillance system design [30
]. Research in this area is mainly concerned with epidemiological inference and prediction based on clinical data collection, but few of the studies provide improved detecting measures using social media data, which represent an alternative source of passive traditional surveillance data that have a larger volume and fewer reporting delays.
Data from social networks show apparent advantages in several aspects, such as being real time and time-sharing, along with a broad scope of data coverage [31
]. Disease surveillance has been investigated based on the recent rise in the popularity and scale of social media data. Aiello et al. (2011) reviewed and addressed the use, promise, perils, and ethics of social media- and internet-based data collection for public health surveillance [35
]. Additionally, the infectious disease detection process is being challenged by the potential applications of social media [1
]. Multiple types of social media have become emerging and promising data sources of disease surveillance and shown advanced achievements in tracking health informatics in different areas all over the world. Raamkumar et al. (2020) examined the differences of COVID-19-related public responses on Facebook in the United States, England and Singapore and showed that social media analysis was capable of providing insights about the communication strategies during disease outbreaks [37
]. Lwin et al. (2018) and Vijaykumar et al. (2017) investigated how Facebook can be utilized and adapted in responding to the Zika epidemic in Singapore [38
]. Moreover, Dubey et al. (2014) identified and evaluated YouTube as a significant resource for providing and disseminating information on public health issues like West Nile virus infection [40
]. Davidson et al. (2015) constructed an empirical network to substantially improve performance in predicting infections one week into the future using CDC data and combining this with internet-based data in the U.S. [41
]. Chen et al. (2016) proposed two temporal topic models to capture hidden states of a flu-related user and get better flu-peak predictions by using Twitter data. In addition, they validated their approaches by modeling the flu using Twitter in multiple countries of South America [42
]. Lamb et al. (2013) demonstrated that the use of Twitter data leads to significant improvements in flu surveillance by discriminating those categories of flu tweets that reported infection from those that expressed concerned awareness of the flu as well as tweets about the authors versus those about others [43].
Sentiment classification [44
], feature extraction [46
] and public opinion monitoring [47
] are performed based on a social network dataset for sentiment analysis. Chen et al. (2020) introduced a novel approach of adding semantics as additional features into the training set for sentiment analysis, and they applied this approach to predict sentiment for three different Twitter datasets [49
]. The authors also investigated the real-time flu detection problems and proposed a flu detection model with emotional factors and semantic information [50
]. Adamopoulos et al. (2018) examined the effect of latent personality traits on consumers’ behavior and preferences, which originated from social media users’ levels of emotional range [51
]. The effect of social media advertising content on customer engagement was also studied via Facebook users’ humor and emotion [52
]. Although sentiment analysis has been widely applied to many fields, few researchers consider it to be a powerful tool to be used in determining a patient’s status when detecting infectious diseases. Furthermore, word embedding techniques, such as bag of words [53
] and word2vec [54
], have become effective means of text processing. High-quality word vector representations provide distributional information about words [55
], especially for word2vector, which appears to be outstanding at improving a model’s performance on a limited amount of data. The development of word2vector significantly improves the effectiveness of word representations by transforming sparse, discrete and high-dimensional vectors into dense, continuous and low-dimensional vectors [56
]. It is a foundation to transform word segments into fixed dimensional vectors, namely, word embedding, when studying user-generated content [57
]. In this paper, high-dimensional numerical vectors that represent words serve as semantic feature extractors or as the input variables of a neural network. Artificial Neural Network (ANN) techniques have been used widely in text classification. Hughes et al. (2017) used Convolutional Neural Networks (CNNs) for text processing and classification in online news, reviews and medical text [59
]. A large number of modified ANN techniques have emerged with technical advancements. Recurrent Neural Networks (RNNs) have also been an effective method for speech recognition and text sequence tagging [60
], and Long Short Term Memory (LSTM) networks [61
] perform well for sequence-based learning tasks. In addition, a large amount of text processing research has emerged, including the use of part-of-speech tagging [62
], lexicon approaches [63
], and other deep learning techniques [64
]. The extant literature has several key limitations. First, the abovementioned studies have mainly focused on a single aspect of text processing, sentiment classification or flu tracking [43
]. More importantly, these papers do not adequately consider the influence of sentiment polarity on the classification of flu-related weibos or on dividing the different periods of flu-related weibos [50
]. Therefore, this paper incorporates sentiment factors into flu surveillance research, and the period classifications of flu-related weibos are probed. Second, neural network techniques are used to process the weibos at the text level [61
]. LSTM networks can be used for sentiment analysis of film reviews, part-of-speech tagging, and other fields [60
]. Previous studies show that LSTM performs relatively well in text processing but has been rarely used for disease weibo analysis. To fill in this gap, this paper aims to investigate the relationship between sentiment polarity and the flu period at the word level and text level based on a weibo dataset.
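The contrast drawn above between sparse, discrete word vectors and dense, continuous ones can be made concrete with a small sketch. It is illustrative only: the five-word vocabulary, the embedding dimension of 3, and the random embedding table are assumptions standing in for a trained word2vec model, and the function names are hypothetical.

```python
import numpy as np

# Toy vocabulary; the words are illustrative and not taken from the paper's corpus.
vocab = ["fever", "cough", "recovered", "happy", "tired"]
word_to_id = {word: index for index, word in enumerate(vocab)}


def one_hot(word: str) -> np.ndarray:
    """Sparse, discrete representation: one dimension per vocabulary word."""
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec


# Dense embedding table with an assumed dimension of 3. A trained word2vec
# model would learn these values; here they are random placeholders.
rng = np.random.default_rng(seed=0)
embedding_dim = 3
embedding_table = rng.normal(size=(len(vocab), embedding_dim))


def embed(word: str) -> np.ndarray:
    """Dense, continuous representation: a row lookup in the embedding table."""
    return embedding_table[word_to_id[word]]


def embed_sequence(words: list[str]) -> np.ndarray:
    """Stack word vectors into a (sequence_length, embedding_dim) matrix,
    the input shape that a sequence model such as an LSTM consumes."""
    return np.stack([embed(word) for word in words])
```

Looking up a row of the table is equivalent to multiplying the one-hot vector by the table, which is why the embedding can be read as a learned compression of the sparse representation.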
The main research rationale of this study is straightforward, i.e., first, to investigate the relationship between the sentiment polarity and the flu period from social networks, and second, to optimize the disease detecting process by predicting the different periods of flu.
The shortcomings of traditional data are evident since they are manually collected and time-consuming, which leads to high labor costs [8
]. In addition, traditional methods based on clinical data make it challenging to shed light on the current situation and predict future developing trends [41
]. Along with the widespread use of the Internet, social networking data, including web-based epidemiological data, have had explosive growth. Data from social networks show apparent advantages in several respects, such as being real-time and having time-sharing, along with a broad scope of data coverage [31
]. In terms of the scale of users, Sina Weibo’s monthly active users reached 431 million on 30 June 2018, overtaking Twitter, which makes Sina Weibo the world’s largest independent social media platform [67
]. Sina Weibo is the most popular social media platform in China for the public to share opinions and disseminate information about emergencies and major social events. Weibo therefore has a far-reaching scope of dissemination and considerable social influence, and using the growing volume of web-based social media data for effective disease control and prevention is an imperative trend. Weibo messages carry rich and meaningful implications. Previous approaches to flu state detection through social media have yielded outstanding achievements but also have limitations. Most obviously, semantic information, which might be important for flu detection, was seldom considered [50].
However, to our knowledge, this area of study has serious limitations. For example, bloggers pass through different flu periods (latent, infectious, or recovered), and their sentiments correspond to these periods. In reality, most bloggers who are ill tend to be negative, while bloggers in the recovered period are active and optimistic. Therefore, omitting the flu period and failing to differentiate Weibo messages could lead to data contamination and misleading conclusions. It is important to investigate the relationship between the flu period and sentiment polarity because doing so can improve the accuracy of flu period classification, which in turn allows the number of patients in each flu period to be estimated accurately. This approach would also help the CDC to take early action for disease control and prevention.
This paper aims to detect the flu period with sentiment polarity at the word and text levels based on data from Sina Weibo, a web-based social media platform, and it proposes suggestions for optimizing the disease detection process. Several important findings are produced. (1) Social media is a promising and powerful data platform for detecting flu patients earlier than traditional medical data allow. Their periods can be further sorted into infectious and recovered, as mined from social media. (2) The semantic information varies across the weibo texts posted by patients in different flu periods. The interclass distance between the recovered period and positive sentiment is smaller than that between the recovered period and negative sentiment, and the interclass distance between the infectious period and negative sentiment is smaller than that between the infectious period and positive sentiment. Additionally, it was noted that the healthier the bloggers are, the more positive sentiments they have. The more serious the flu is, the more that bloggers are connected with negative emotions. (3) A multichannel disease detection model is developed in this study to evaluate and classify the flu period with an accuracy of up to 0.926 based on the LSTM network. Our optimized model effectively improves the classification accuracy of the flu period after adding the sentiment classification results.
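The interclass comparison in finding (2) can be illustrated with a minimal sketch of two common ways to measure the distance between groups of embedded points: the distance between class centroids and the average pairwise (between-group) distance. The two-dimensional coordinates below are made-up placeholders for dimension-reduced word vectors, and the function names are hypothetical; this is not the paper's data or implementation.

```python
import numpy as np


def centroid_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between the mean points (centroids) of two classes."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))


def between_group_linkage(a: np.ndarray, b: np.ndarray) -> float:
    """Average Euclidean distance over all cross-class pairs of points."""
    diffs = a[:, None, :] - b[None, :, :]          # shape (len(a), len(b), dim)
    pairwise = np.sqrt((diffs ** 2).sum(axis=-1))  # shape (len(a), len(b))
    return float(pairwise.mean())


# Made-up 2-D points for the four classes (illustration only).
recovered = np.array([[1.0, 1.0], [1.2, 0.8]])
positive = np.array([[1.1, 1.3], [0.9, 1.1]])
infectious = np.array([[-1.0, -1.0], [-1.2, -0.9]])
negative = np.array([[-0.9, -1.2], [-1.1, -1.0]])

for name, other in [("positive", positive), ("negative", negative)]:
    print(
        f"recovered vs {name}: "
        f"centroid={centroid_distance(recovered, other):.3f}, "
        f"linkage={between_group_linkage(recovered, other):.3f}"
    )
```

With these placeholder points, the recovered class lies closer to the positive class than to the negative class under both measures, which is the kind of pattern the finding describes.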
The research findings have important theoretical implications. (1) The previous literature investigates the sentiment and disease predictions separately. This paper examines the relationship between sentiment and disease detection. We found that by adding sentiment factors, the classification accuracy is improved remarkably, from 0.876 to 0.926. (2) This paper explores the relationship between the sentiment polarity and the flu-period at two levels of words and text, combining the methods of word2vector and LSTM, which have been used rarely for disease surveillance studies. (3) This research proposes a complete theoretical framework based on web-based social media data. The use of this model can be extended to many aspects, such as monitoring chronic and mental diseases.
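To put the reported improvement from 0.876 to 0.926 in perspective, the short calculation below expresses it as an absolute gain, a relative gain, and a relative reduction in the error rate. Only the two accuracy values come from the text; the function name and the choice of these three summaries are illustrative.

```python
def accuracy_gain(baseline: float, improved: float) -> dict[str, float]:
    """Summarize the change between two accuracy values."""
    absolute = improved - baseline
    return {
        "absolute_gain": absolute,
        # Gain relative to the baseline accuracy.
        "relative_gain": absolute / baseline,
        # Share of the baseline error (1 - accuracy) that was removed.
        "relative_error_reduction": absolute / (1.0 - baseline),
    }


gain = accuracy_gain(baseline=0.876, improved=0.926)
for key, value in gain.items():
    print(f"{key}: {value:.3f}")
```

The absolute gain is 0.050, which corresponds to a relative gain of about 5.7% and a reduction of the error rate from 0.124 to 0.074, that is, by roughly 40%.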
This study also has important practical implications. (1) This paper optimizes the disease detecting process and establishes multichannel surveillance measures for CDC decision making. (2) The approach can monitor a larger share of the infected population. Furthermore, it can identify in advance patients who are not aware of their disease. (3) The previous weibo text processing classifies only flu-related weibos and unrelated weibos. This paper further divides flu-related weibos into two periods: recovered and infectious. Research outcomes improve the reliability and accuracy for the prediction of flu trends. Both point (2) and point (3) not only help the CDC to detect disease information in real time but also provide a novel method for disease information management. (4) The conclusion supports the expansion of neural network training sets, eliminating some of the high cost of manual labeling. The classification results of the flu period can be fed back into the model to increase the amount of training data, which enables the LSTM neural network to learn more fully and to characterize the data better.
Timely and reliable flu monitoring is an important basis for successful control of the spread of disease and mitigation of the associated damage. However, due to its high contagiousness and rapid spread, the flu epidemic has caused great difficulties in prevention and surveillance. With the rapid development and popularity of web-based social media platform data, Sina Weibo, one of the world’s largest social media companies, has become an ideal data source to make real-time, low-cost surveillance possible as an early warning of outbreaks and an adjunct to traditional methods of investigation. According to the latest estimate by the United States Centers for Disease Control and Prevention (US-CDC), as many as 650,000 people worldwide die from seasonal flu-related respiratory diseases each year. It is evident that the flu imposes a heavy burden on the international community, and the flu’s global, social and economic costs are considerable. It is worth noting that improving the ability to monitor infectious disease is the key to further strengthening management capacity of the health system and organizing a massive flu outbreak response.
This paper explores the relationship between the flu-period and sentiment polarity from two levels based on Sina Weibo data. To be specific, at the word level, we used word2vector to create the flu-related weibos corpus and the t-SNE method to reduce the dimension. The centroid cluster and between-group linkage were jointly used to measure the distance between the four classes, thus visually showing the relationship between the sentiment polarity and flu-period. At the text level, the sentiment polarity and flu-period of flu-related weibos were classified by the LSTM networks, respectively. We counted the classification results as both belonging to the infectious and negative sentiment as well as to the recovered and positive sentiment, and we calculated the accuracy rate. We then compared the rate with the overall flu-period classification accuracy to observe the differences. This paper proposes an integrated conceptual framework and practical methods for optimizing the disease detection process with fast information, early discovery, added infected cases and high accuracy. These contributions are described in detail as follows:
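The text-level counting step described above can be sketched as follows. This is one possible reading of that step, not the authors' implementation: each weibo is assumed to carry one predicted flu-period label and one predicted sentiment label, and the share of weibos whose labels form a consistent pair (infectious with negative, or recovered with positive) is reported. The label strings and the function name are hypothetical.

```python
from collections import Counter

# Label pairs treated as consistent under the reading described above.
CONSISTENT_PAIRS = {("infectious", "negative"), ("recovered", "positive")}


def agreement_rate(periods: list[str], sentiments: list[str]) -> float:
    """Share of texts whose (flu period, sentiment) labels form a consistent pair."""
    if len(periods) != len(sentiments):
        raise ValueError("periods and sentiments must have the same length")
    if not periods:
        raise ValueError("at least one labeled text is required")
    pair_counts = Counter(zip(periods, sentiments))
    consistent = sum(pair_counts[pair] for pair in CONSISTENT_PAIRS)
    return consistent / len(periods)


# Made-up predictions for four weibos (illustration only).
example_periods = ["infectious", "infectious", "recovered", "recovered"]
example_sentiments = ["negative", "positive", "positive", "positive"]
print(agreement_rate(example_periods, example_sentiments))  # 0.75
```

In this toy example three of the four weibos fall into a consistent pair, so the rate is 0.75; a rate computed this way on real predictions could then be compared with the overall flu-period classification accuracy.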
First, in theory, this paper integrates various channels for detecting infectious diseases in real time with fast information. In addition to the clinical data and search engine data, the detecting data obtained through social media can also provide prompt and time-sharing disease information to the Centers for Disease Control and Prevention (CDC). The monitoring mechanisms operate in real time, which can help the CDC fully prepare for the next round of prevention and control.
Second, in practice, social media enables the early discovery of disease infection. The sooner the disease is diagnosed, the easier it is to treat and control properly. The CDC is committed to pursuing early detection of diseases. Through social media platforms, we can detect the spread and severity of a disease earlier than search engines and the CDC. When a disease breaks out, patients might not be aware of it but could still post on Twitter or Weibo. This behavior would be recorded by social media sensors. Based on human behavioral theory, the data possess unique value for detecting disease trends.
Third, social media is adept at tracking more patients than traditional clinic data. Larger infectious populations can be monitored by social media than with clinic data. Influenza-like illnesses (ILI) published by the CDC are measured according to outpatient statistics when fevers are higher than 38 degrees and are accompanied by a cough or sore throat. However, a considerable number of people often choose not to go to the hospital for treatment when they have the flu or might buy medicine from a pharmacy by themselves, which cannot be counted in ILI measurements. Social media can detect these patients, which could result in a larger amount of meaningful data being collected, and thus, these data could lead to more reliable prediction of disease outbreaks.
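The outpatient criterion described above can be written as a simple predicate, which also makes the coverage gap visible: people who never visit a clinic produce no record at all. The sketch below assumes that the 38 degrees are degrees Celsius and that the published figure is the share of ILI visits among all outpatient visits; the record type and function names are hypothetical.

```python
from dataclasses import dataclass

# Assumption: the "38 degrees" in the text are degrees Celsius.
FEVER_THRESHOLD_C = 38.0


@dataclass
class OutpatientVisit:
    """Hypothetical record of one outpatient visit."""
    temperature_c: float
    has_cough: bool
    has_sore_throat: bool


def is_ili(visit: OutpatientVisit) -> bool:
    """Fever above the threshold, accompanied by a cough or a sore throat."""
    has_fever = visit.temperature_c > FEVER_THRESHOLD_C
    return has_fever and (visit.has_cough or visit.has_sore_throat)


def ili_share(visits: list[OutpatientVisit]) -> float:
    """Share of recorded outpatient visits that meet the ILI criterion."""
    if not visits:
        raise ValueError("at least one visit is required")
    return sum(is_ili(visit) for visit in visits) / len(visits)


visits = [
    OutpatientVisit(38.6, True, False),   # fever and cough: counted as ILI
    OutpatientVisit(38.2, False, False),  # fever only: not counted
    OutpatientVisit(37.1, True, True),    # no fever: not counted
    OutpatientVisit(39.0, False, True),   # fever and sore throat: counted as ILI
]
print(ili_share(visits))  # 0.5
```

A patient who self-medicates at home never appears in `visits`, so no threshold choice can recover that case from clinical records alone.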
Fourth, this paper detects disease periods with observably high accuracy, which could directly result in significant differences in treatment and disease control measures. Targeting the disease period precisely helps clinical managers to improve the treatment effect and reduces the prevention cost by rationally allocating resources such as medical personnel and medicine. The approach can not only detect whether the patient has the flu but also classify the flu period as infectious or recovered, which lays the foundation for predicting future flu trends. It also provides another data source to assist the CDC in managing disease information.
Fifth, in terms of theoretical contributions, this paper investigates the relationship between sentiment polarity and the flu period at the word and text levels by combining the word2vector and LSTM methods, thereby carrying out interdisciplinary research in the fields of sentiment analytics and health informatics. In addition, this paper provides an effective alternative to manually labeling a training set. High-accuracy weibo texts can be used to boost the size of the training set, thus saving time and labor costs.
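The idea of enlarging the training set with confidently classified weibos can be sketched as a simple filtering step. The sketch assumes that the classifier outputs a confidence score for each predicted label; the 0.95 threshold, the function name, and the placeholder texts are hypothetical and not taken from the paper.

```python
def select_confident_predictions(
    texts: list[str],
    predicted_labels: list[str],
    confidences: list[float],
    threshold: float = 0.95,  # hypothetical cut-off, to be tuned on held-out data
) -> list[tuple[str, str]]:
    """Keep (text, label) pairs whose predicted confidence reaches the threshold."""
    if not (len(texts) == len(predicted_labels) == len(confidences)):
        raise ValueError("all inputs must have the same length")
    return [
        (text, label)
        for text, label, confidence in zip(texts, predicted_labels, confidences)
        if confidence >= threshold
    ]


# Manually labeled seed set and model predictions on unlabeled texts (placeholders).
training_set = [("placeholder text A", "infectious")]
new_examples = select_confident_predictions(
    texts=["placeholder text B", "placeholder text C", "placeholder text D"],
    predicted_labels=["recovered", "infectious", "recovered"],
    confidences=[0.98, 0.60, 0.95],
)
augmented_training_set = training_set + new_examples
print(len(augmented_training_set))  # 3
```

A stricter threshold adds fewer examples but lowers the risk of feeding mislabeled texts back into training, so the cut-off trades labeling savings against label noise.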
In future work, we need to study a wider range of data since the current data only cover two years, 2016 and 2017. Moreover, this paper compares the trend of official ILI data from the CDC and flu-related data from social media in 2016. We will examine more valuable disease information from social media-based data on a larger scale.