# Regional Influenza Prediction with Sampling Twitter Data and PDE Model

## Abstract

## 1. Introduction

## 2. Sampling Data

#### 2.1. Dataset

#### 2.2. Data Preprocessing

**Data sampling:** To characterize the impact of sampling on Twitter flu prediction, we simulated a simple yet widely used random sampling strategy to generate a sampled Twitter flu dataset. As illustrated in [20], under random sampling every tweet in the Twitter data stream is selected into the sample with the same probability $p$. The projected weekly volume of sampled flu tweets is therefore $n=[N\times p]$, where $N$ is the total weekly number of flu tweets in the full data and $[\cdot]$ denotes rounding to the nearest integer. For instance, [0.3] equals 0, [5.3] equals 5, and [5.5] equals 6. Intuitively, random sampling can change the underlying distribution of the flu tweets. Figure 3 shows that the sampled tweet volumes follow growth trends very different from those of the full data, which reflects the information lost during the simulated random sampling.
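As an illustration of the projection $n=[N\times p]$, the following minimal sketch computes the sampled weekly volume. The function names are hypothetical, and the sketch assumes that halves round up, which is consistent with the example [5.5] = 6 but is not stated explicitly in the text. Python's built-in `round` rounds halves to the nearest even integer, so it is deliberately avoided here.

```python
import math


def round_half_up(x: float) -> int:
    """Round to the nearest integer, with halves rounded up (assumed rule)."""
    return math.floor(x + 0.5)


def sampled_volume(full_volume: int, p: float) -> int:
    """Projected weekly sampled volume n = [N * p] for sampling probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("sampling probability p must lie in [0, 1]")
    return round_half_up(full_volume * p)


if __name__ == "__main__":
    # The three rounding examples given in the text.
    for value in (0.3, 5.3, 5.5):
        print(value, "->", round_half_up(value))
    # A hypothetical week with 1000 flu tweets under 1% sampling.
    print(sampled_volume(1000, 0.01))
```

At small sampling ratios many weekly volumes collapse to the same small integer, which is one way to see the information loss described above.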

**Data normalization:** To prevent input attributes with larger values from overwhelming those with smaller values, and to reduce the prediction errors, we first normalized all the known historical sampled tweet volumes to the range $[0,m]$ via linear scaling. Here $m$ is predefined; in the present study we chose $m=5$ to align with the official flu case levels 1–5 defined by the CDC. Specifically, we defined $M$ as the maximal flu tweet volume over the entire flu season and normalized the flu tweet counts by the linear transform $y=\frac{m}{M}x$, where $x$ is the real tweet volume and $y$ is the normalized tweet volume.
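The linear scaling $y=\frac{m}{M}x$ can be sketched as follows. The function name and the example volumes are hypothetical; only the transform itself and the choice $m=5$ come from the text.

```python
from typing import List, Sequence, Tuple


def normalize(volumes: Sequence[float], m: float = 5.0) -> Tuple[List[float], float]:
    """Scale tweet volumes linearly into [0, m]; also return the maximum M."""
    big_m = max(volumes)  # M: maximal volume over the season
    if big_m <= 0:
        raise ValueError("the maximal volume M must be positive")
    scale = m / big_m
    return [scale * x for x in volumes], big_m


if __name__ == "__main__":
    # Hypothetical sampled weekly volumes for one region.
    normalized, big_m = normalize([10, 20, 40])
    print(normalized, big_m)
```

Returning $M$ alongside the scaled values keeps the quantity needed to undo the normalization later.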

## 3. PDE Model

## 4. Predictive Modeling

**Flu tweet predictions**: Over the research time period, the model parameters vary with time while the structure of the PDE stays the same. To forecast the flu tweets for a given week, we first train the parameters of the PDE model and then solve the PDE for the prediction. Specifically, weeks 1–3, 2–4, …, 15–17 serve as the training data for predicting the flu tweets of weeks 4, 5, …, 18, respectively. The model therefore needs only the last three weeks of historical data for training, which is much less than the historical data expected in [2].
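The rolling training scheme can be enumerated with a short sketch. It only generates the week indices of each training window and its target week; the PDE training and solving steps are not shown, and the function name is hypothetical.

```python
from typing import List, Tuple


def rolling_windows(n_weeks: int = 18, window: int = 3) -> List[Tuple[List[int], int]]:
    """Pair each run of `window` consecutive training weeks with the next week."""
    pairs = []
    for start in range(1, n_weeks - window + 1):
        train_weeks = list(range(start, start + window))
        target_week = start + window
        pairs.append((train_weeks, target_week))
    return pairs


if __name__ == "__main__":
    for train_weeks, target_week in rolling_windows():
        print(f"train on weeks {train_weeks} -> predict week {target_week}")
```

With 18 weeks and a three-week window this yields 15 train/predict pairs, from weeks 1–3 predicting week 4 through weeks 15–17 predicting week 18.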

**Reversing the data normalizing and sampling process**: Once we obtain the predicted flu tweet volume from the PDE model, which is based on the sampled data, we transform it into a prediction of the full volume by reversing the data normalization and then the sampling process.
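A minimal sketch of this reversal is given below. Inverting the normalization as $x=\frac{M}{m}y$ follows directly from the transform $y=\frac{m}{M}x$. Inverting the sampling by dividing by the probability $p$ is an assumption made here for illustration, since the text does not spell out the formula, and the rounding applied during sampling cannot be undone exactly. All function names are hypothetical.

```python
def inverse_normalize(y: float, big_m: float, m: float = 5.0) -> float:
    """Undo y = (m / M) * x, recovering the sampled-scale volume x."""
    return y * big_m / m


def inverse_sample(n: float, p: float) -> float:
    """Estimate the full volume N from a sampled volume n (assumes N is about n / p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("sampling probability p must lie in (0, 1]")
    return n / p


def to_full_volume(y_pred: float, big_m: float, p: float, m: float = 5.0) -> float:
    """Map a normalized prediction back to an estimated full tweet volume."""
    return inverse_sample(inverse_normalize(y_pred, big_m, m), p)


if __name__ == "__main__":
    # Hypothetical values: normalized prediction 2.5, M = 40, 1% sampling.
    print(to_full_volume(2.5, big_m=40, p=0.01))
```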

**Measuring the prediction ability of the PDE-based model**: Lastly, we compare the predicted flu tweet volumes (after reversing the data normalization and sampling process) with the observed flu tweet volumes, i.e., the ground truth, to quantify the prediction accuracy. We measure it by the relative accuracy, $\mathrm{relative\ accuracy}=1-\frac{|{x}_{real}-{x}_{predict}|}{{x}_{real}}$, where ${x}_{real}$ is the full flu tweet volume at each data collection time point and ${x}_{predict}$ is the predicted tweet volume after inverse normalization and inverse sampling.
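The relative accuracy is straightforward to compute; the sketch below uses a hypothetical function name and made-up volumes purely to illustrate the formula.

```python
def relative_accuracy(x_real: float, x_predict: float) -> float:
    """Relative accuracy 1 - |x_real - x_predict| / x_real."""
    if x_real <= 0:
        raise ValueError("x_real must be positive")
    return 1.0 - abs(x_real - x_predict) / x_real


if __name__ == "__main__":
    # Hypothetical observed and predicted full tweet volumes.
    print(relative_accuracy(1000.0, 930.0))
```

The measure equals 1 for a perfect prediction and becomes negative when the prediction is off by more than the observed volume itself, so it is most informative when errors are small relative to ${x}_{real}$.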

## 5. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. Schmidt, C. Real-time Flu Tracking–by Monitoring Social Media, Scientists Can Monitor Outbreaks As They Happen, Nature, 2019. Available online: https://www.nature.com/articles/d41586-019-02755-6 (accessed on 18 September 2019).
2. Wang, F.; Wang, H.; Xu, K.; Raymond, R.; Chon, J.; Fuller, S.; Debruyn, A. Regional level influenza study with geo-tagged twitter data. J. Med. Syst. **2016**, 40, 189.
3. Overview of Influenza Surveillance in the United States, Centers for Disease Control and Prevention. Available online: https://www.cdc.gov/flu/weekly/overview.htm (accessed on 15 October 2019).
4. Vespignani, A. Multiscale mobility networks and the large scale spreading of infectious diseases. In APS March Meeting Abstracts; Boston University: Boston, MA, USA, 2010.
5. Ajelli, M.; Goncalves, B.; Balcan, D.; Colizza, V.; Hu, H.; Ramasco, J.J.; Merler, S.; Vespignani, A. Comparing large-scale computational approaches to epidemic modeling: Agent-based versus structured metapopulation models. BMC Infect. Dis. **2010**, 10, 190.
6. Colizza, V.; Barrat, A.; Barthelemy, M.; Valleron, A.J.; Vespignani, A. Modeling the worldwide spread of pandemic influenza: Baseline case and containment interventions. PLoS Med. **2007**, 4, e13.
7. Chen, Z.; Xu, Z. A delayed diffusive influenza model with two-strain and two vaccinations. Appl. Math. Comput. **2019**, 349, 439–453.
8. Bocharov, G.; Volpert, V.; Tasevich, A. Reaction–diffusion equations in immunology. Comput. Math. Math. Phys. **2018**, 58, 1967–1976.
9. Van den Broeck, W.; Gioannini, C.; Goncalves, B.; Quaggiotto, M.; Colizza, V.; Vespignani, A. The GLEaMviz computational tool, a publicly available software to explore realistic epidemic spreading scenarios at the global scale. BMC Infect. Dis. **2011**, 11, 37.
10. Yanez, A.; Duggan, J.; Hayes, C.; Jilani, M.; Connolly, M. PandemCap: Decision support tool for epidemic management. In Proceedings of the 2017 IEEE Workshop on Visual Analytics in Healthcare (VAHC), Phoenix, AZ, USA, 1 October 2017; pp. 24–30.
11. Broniatowski, D.A.; Paul, M.J.; Dredze, M. National and local influenza surveillance through twitter: An analysis of the 2012–2013 influenza epidemic. PLoS ONE **2013**, 8, e83672.
12. Smith, M.; Broniatowski, D.A.; Paul, M.J.; Dredze, M. Towards real-time measurement of public epidemic awareness: Monitoring influenza awareness through twitter. In AAAI Spring Symposium on Observational Studies through Social Media and Other Human-Generated Content; George Washington University: Washington, DC, USA, 2016.
13. Chen, L.; Hossain, K.T.; Butler, P.; Ramakrishnan, N.; Prakash, B.A. Syndromic surveillance of flu on twitter using weakly supervised temporal topic models. Data Min. Knowl. Discov. **2016**, 30, 681–710.
14. Hayate, I.; Wakamiya, S.; Aramaki, E. Forecasting word model: Twitter-based influenza surveillance and prediction. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers; Nara Institute of Science and Technology: Nara, Japan, 2016; pp. 76–86.
15. Lee, K.; Agrawal, A.; Choudhary, A. Forecasting influenza levels using real-time social media streams. In Proceedings of the 2017 IEEE International Conference on Healthcare Informatics (ICHI), Park City, UT, USA, 23–26 August 2017; pp. 409–414.
16. Du, B.; Lian, X.; Cheng, X. Partial differential equation modeling with dirichlet boundary conditions on social networks. Bound. Value Probl. **2018**, 2018, 50.
17. Wang, B.; Yin, P.H.; Bertozzi, A.L.; Brantingham, P.J.; Osher, S.J.; Xin, J. Deep learning for real-time crime forecasting and its ternarization. Chin. Ann. Math. Ser. B **2019**, 40, 949–966.
18. Wang, B.; Luo, X.Y.; Zhang, F.B.; Yuan, B.C.; Bertozzi, A.L.; Brantingham, P.J. Graph-based deep modelling and real time forecasting of sparse spatio-temporal data. arXiv preprint **2018**, arXiv:1804.00684.
19. Aiken, E.L.; Nguyen, A.T.; Santillana, M. Towards the use of neural networks for influenza prediction at multiple spatial resolutions. arXiv preprint **2019**, arXiv:1911.02673.
20. Xu, K.; Wang, F.; Jia, X.; Wang, H. The impact of sampling on big data analysis of social media: A case study on flu and ebola. In Proceedings of the 2015 IEEE Global Communications Conference (GLOBECOM), San Diego, CA, USA, 6–10 December 2015; pp. 1–6.
21. Wang, Y.; Callan, J.; Zheng, B. Should we use the sample? Analyzing datasets sampled from Twitter's stream API. ACM Trans. Web **2015**, 9, 13.
22. The Flu Season. Content source: Centers for Disease Control and Prevention, National Center for Immunization and Respiratory Diseases (NCIRD). Available online: https://www.cdc.gov/flu/about/season/flu-season.htm (accessed on 12 July 2018).
23. The streaming APIs. Available online: https://dev.twitter.com/streaming/public (accessed on 1 September 2018).
24. Brauer, F. Compartmental Models in Epidemiology. In Mathematical Epidemiology; Springer: Berlin/Heidelberg, Germany, 2008; pp. 19–79.
25. Atzberger, P.J. Introduction to Mathematical Biology; Wiley: Hoboken, NJ, USA, 1975.
26. Wang, F.; Wang, H.; Xu, K.; Wu, J.; Jia, X. Characterizing information diffusion in online social networks with linear diffusive model. In Proceedings of the 2013 IEEE 33rd International Conference on Distributed Computing Systems, Philadelphia, PA, USA, 8–11 July 2013; pp. 307–316.
27. Tang, S.; Yan, Q.; Shi, W.; Wang, X.; Sun, X.; Yu, P.; Wu, J.; Xiao, Y. Measuring the impact of air pollution on respiratory infection risk in China. Environ. Pollut. **2018**, 232, 477–486.
28. Gerald, C.F. Applied Numerical Analysis; Pearson Education India, 2004.
29. Nátr, L. Murray, J.D.: Mathematical Biology. I. An Introduction. Photosynthetica **2002**, 40, 414.
30. Friedman, A. Partial Differential Equations of Parabolic Type; Courier Dover Publications: Mineola, NY, USA, 2008.
31. Oseledets, I.V. Tensor-train decomposition. SIAM J. Sci. Comput. **2011**, 33, 2295–2317.
32. Lagarias, J.C.; Reeds, J.A.; Wright, M.H.; Wright, P.E. Convergence properties of the Nelder–Mead simplex method in low dimensions. SIAM J. Optim. **1998**, 9, 112–147.

**Figure 1.** National Center for Chronic Disease Prevention and Health Promotion regions 1–10 represent 10 different CDC regions in the United States.

**Figure 2.** Full flu tweets from the 40th week of 2018 (marked as 1 on the x-axis) to the 5th week of 2019 (marked as 18 on the x-axis). Lines with different colors represent different Centers for Disease Control and Prevention (CDC) regions, and the flu tweets volume on the y-axis is the flu tweet count on Twitter.

**Figure 3.** Flu tweet volumes from random sampling and from the full data. The x-axis represents the research time period of 18 weeks, from the 40th week of 2018 to the 5th week of 2019, marked 1–18. Each subfigure shows the flu tweets volume (flu tweet count) of one CDC region, from Region 1 (marked as R1) to Region 10 (marked as R10).

**Figure 4.** The relative accuracy in each CDC region (regions 1–10) from the 40th week of 2018 to the 5th week of 2019, which covers the early and middle phases of a flu season. Here the relative accuracy follows the conventional definition $1-\frac{|{x}_{real}-{x}_{predict}|}{{x}_{real}}$, where ${x}_{real}$ is the actual full flu tweet volume at each data collection time point and ${x}_{predict}$ is the predicted tweet volume after inverse normalization and inverse sampling.

**Figure 5.** The average relative accuracy of the 10 CDC regions with various sampling ratios during our data collection period. The x-axis represents the CDC regions from 1 to 10. Lines with different colors show the prediction accuracies obtained from data sampled at different ratios.

**Figure 6.** The predicted flu tweet volumes with different strengths of flu interventions. The x-axis represents the predicted time period of 15 weeks, from the 43rd week of 2018 to the 5th week of 2019, marked 4–18. Each subfigure shows the predicted flu tweets volume (flu tweet count) of one CDC region, from Region 1 (marked as R1) to Region 10 (marked as R10).

**Table 1.** The average relative accuracy of the 10 CDC regions (marked R1–R10 in the table) with various sampling ratios during our data collection period.

| | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9 | R10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1% Sampling | 93% | 93% | 93% | 92% | 92% | 92% | 95% | 89% | 91% | 94% |
| 0.1% Sampling | 91% | 92% | 88% | 91% | 93% | 90% | 87% | 76% | 90% | 87% |
| 0.01% Sampling | 93% | 90% | 93% | 88% | 90% | 91% | 93% | 75% | 90% | 93% |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Wang, Y.; Xu, K.; Kang, Y.; Wang, H.; Wang, F.; Avram, A.
Regional Influenza Prediction with Sampling Twitter Data and PDE Model. *Int. J. Environ. Res. Public Health* **2020**, *17*, 678.
https://doi.org/10.3390/ijerph17030678
