# Aggregating Twitter Text through Generalized Linear Regression Models for Tweet Popularity Prediction and Automatic Topic Classification


*Eur. J. Investig. Health Psychol. Educ.* **2021**, *11*(4), 1537-1554; https://doi.org/10.3390/ejihpe11040109

## Abstract


## 1. Introduction

## 2. Methods

#### 2.1. Data Collection and Cleansing

#### 2.2. Data Preparation

- The “Text” column of the dataset was read as a character vector, and the tweets were collected into a Corpus object with the R function tm::Corpus(VectorSource()) [10]. As mentioned above, a document corresponds to a tweet in the dataset.
- Punctuation was removed using the function tm::tm_map(x, removePunctuation), where x denotes the corpus of tweets.
- Then, tm::DocumentTermMatrix() was used to create a DTM, which lists every term appearing in any text together with the frequency with which each term appears in each document.
- Sometimes the dataset needed further cleaning because some irrelevant terms survived the steps above. In our study, for example, URLs were not deleted and remained as terms starting with “http”; such terms were removed from the dataset.
- Some terms were sparse, meaning they appeared in only a small number of tweets and therefore contributed little to the associations of interest. Those terms were removed from the DTM based on the number of documents that contained them.
- Finally, the DTM, which contains the frequency of each term in each document, was merged with the outcomes of interest from the original dataset for subsequent analysis.
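The preparation steps above were carried out in R with the tm package; as a rough, language-agnostic illustration, the following is a minimal pure-Python sketch that tokenizes tweets, strips punctuation and leftover URL terms, and drops sparse terms. The example tweets and the `min_doc_freq` threshold are hypothetical, not from the study.

```python
import re
from collections import Counter

def build_dtm(tweets, min_doc_freq=2):
    """Build a document-term matrix (one Counter per tweet):
    lowercase, drop URL fragments and punctuation, then remove
    sparse terms appearing in fewer than `min_doc_freq` documents."""
    docs = []
    for text in tweets:
        text = re.sub(r"http\S+", " ", text.lower())  # drop leftover URL terms ("http...")
        text = re.sub(r"[^\w\s]", " ", text)          # remove punctuation
        docs.append(Counter(text.split()))
    # document frequency: in how many tweets each term occurs
    doc_freq = Counter(term for doc in docs for term in doc)
    keep = {t for t, df in doc_freq.items() if df >= min_doc_freq}
    return [Counter({t: n for t, n in doc.items() if t in keep}) for doc in docs]

tweets = [
    "Autism awareness day! https://t.co/xyz",
    "Autism awareness matters.",
    "Completely unrelated tweet.",
]
dtm = build_dtm(tweets)
```

Each Counter plays the role of one DTM row; transposing the roles of terms and documents gives the TDM used later for the AIS calculation.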

#### 2.3. The Proposed Data Aggregation Procedures by Regression Models

#### 2.3.1. Case Study I: Influential Keyword Selection by Hurdle Negative-Binomial Model Using Retweet Count as the Outcome

- The RR of each term was calculated with the univariate Hurdle model:
$$f_{\mathrm{hurdle}}(y_i; x_i, z_i, \beta, \gamma) = \begin{cases} f_0(0; z_i, \gamma) & \text{if } y_i = 0 \\ \left(1 - f_0(0; z_i, \gamma)\right) \cdot \dfrac{f_{\mathrm{count}}(y_i; x_i, \beta)}{1 - f_{\mathrm{count}}(0; x_i, \beta)} & \text{if } y_i > 0 \end{cases}$$
- Each column (i.e., term) in the DTM dataset was treated as a predictor. Using “Retweet_Count” as the outcome of interest (i.e., ${y}_{i}$ = Retweet Count of the ${i}^{th}$ tweet), the univariate model was fit for each term (i.e., ${x}_{i}\text{}$= Frequency of the term in the ${i}^{th}$ tweet) to estimate the relative risk. A term with a higher slope estimate has a stronger association with the number of times a tweet has been retweeted. The RR was calculated by taking the exponential of the regression slopes, and an elbow plot was created to display the RRs visually.
- The AIS was calculated to measure the overall influence of each tweet, as follows:
- The DTM dataset was imported and transposed as a TDM. In the TDM, rows referred to terms, and columns referred to documents.
- The frequency of each term in a tweet was multiplied by the RR of that term to obtain a propensity score for the term.
- The summary statistics of the propensity scores of all terms in a corresponding tweet, such as the mean, median, or sum, were calculated. For example, the AIS-mean score is calculated as the sum of the product of the frequency of a term and the RR of that term across all terms in a tweet, divided by the total number of terms in that tweet. Below is a worked example of the calculation of AIS scores for a single tweet:

| Word | Father | Take | His | Child | with | to | See | Favorite | Band | Coldplay |
|---|---|---|---|---|---|---|---|---|---|---|
| Frequency in this specific tweet (${n}_{i}$) | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Frequency in all tweets | 360 | 1043 | 3 | 1659 | 2 | 3 | 1323 | 368 | 349 | 428 |

| Term | Child | See | Take |
|---|---|---|---|
| RR × Frequency | 3.6210 | 4.3122 | 4.5290 |
| Place | 1st | 2nd | 3rd |

- The final step evaluated the combined effect of all terms in a tweet on the popularity of the tweet in terms of retweet counts. To make inferences about this association, the univariate Negative-Binomial or Hurdle model was fitted again using the AIS as the predictor (X = AIS). The slope estimate ($\widehat{\beta}$) of the AIS score, based on Equation (1), indicates how strongly the content summarized by the AIS affected the retweet counts.
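The AIS calculation described above can be sketched briefly in Python (the original analysis used R). The RRs for “child”, “see”, and “take” are consistent with the RR × frequency products in the worked example (each term occurs once in that tweet); the RR for “father” is a hypothetical weight added for illustration.

```python
from statistics import mean, median

def ais_scores(term_freqs, rr):
    """Aggregated Influence Score for one tweet: summary statistics of
    (term frequency × term RR) over the terms appearing in the tweet."""
    scores = [n * rr[t] for t, n in term_freqs.items() if t in rr]
    return {"mean": mean(scores), "median": median(scores), "sum": sum(scores)}

# Per-term relative risks: weights estimated by the univariate models
rr = {"child": 3.6210, "see": 4.3122, "take": 4.5290, "father": 1.1000}
# Term frequencies of one tweet (one row of the DTM / column of the TDM)
tweet = {"father": 1, "child": 1, "see": 1, "take": 1}
scores = ais_scores(tweet, rr)
```

The resulting AIS-mean, AIS-median, and AIS-sum are the candidate predictors refitted against retweet counts in the final step.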

#### 2.3.2. Case Study II: Topic Classification by Logistic Regression Model Using ASD against Non-ASD Topics as the Binary Outcome

- Tweets from the ASD topic and the non-ASD topics, in this example, “influenza” and “violence against women”, served as case and control groups, respectively. Both groups were randomly selected with the same sample size. The new dataset combined the ASD and the non-ASD data with a new variable serving as the outcome of interest, which was an indicator of the tweet’s true classification status (case = 1, pertinent to ASD topic; control = 0, pertinent to non-ASD topics).$$\mathit{log}\left(\frac{\pi}{1-\pi}\right)={\beta}_{0}+{\beta}_{1}X$$
- ORs were then calculated using univariate logistic regression (i.e., Y = the tweet’s true classification status; X = frequency of the term), following a process similar to steps 1–2 of Case Study I. Here, ${\beta}_{1}$ obtained from the previous step is the estimated coefficient for each term, where $OR={e}^{{\beta}_{1}}$. This OR estimate serves as the weight of each term for calculating the aggregated score for each tweet. An elbow plot was also created to display the ORs visually. The AIS was then calculated for each tweet based on the ORs, following the same method as Case Study I, step 3. Specifically, the AIS was calculated by inputting the ORs estimated from Equation (5) in place of the RRs in Equations (2)–(4).
- The Kruskal-Wallis rank-sum test was conducted to examine the association between the AIS and the outcome of interest. The test evaluated whether the terms of tweets had an impact on the disease topic and provided information about whether the AIS could be used as an indicator to classify tweets between ASD and non-ASD topics. The ROC curve and area under the ROC curve (AUC) were obtained to assess the performance of the classification using the AIS, using the pROC package in R [16].
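The classification performance measure can be made concrete with a small sketch: the AUC equals the probability that a randomly chosen case outscores a randomly chosen control (the normalized Mann–Whitney statistic, built from the same ranks the Kruskal–Wallis test uses in the two-group case). The paper computed the ROC/AUC with the pROC package in R; the AIS values below are hypothetical.

```python
def auc_from_scores(case_scores, control_scores):
    """AUC as the probability that a random case scores higher than a
    random control, counting ties as half (Mann-Whitney U / (n1*n2))."""
    wins = ties = 0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1
            elif c == k:
                ties += 1
    return (wins + 0.5 * ties) / (len(case_scores) * len(control_scores))

case = [1.20, 1.10, 1.15, 0.95]  # hypothetical AIS-mean, ASD tweets
ctrl = [0.90, 1.00, 0.85, 1.05]  # hypothetical AIS-mean, non-ASD tweets
auc = auc_from_scores(case, ctrl)
```

An AUC near 1 indicates that the AIS separates the ASD and non-ASD groups well; an AUC near 0.5 indicates no discrimination.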

#### 2.4. Model Diagnostics and Evaluation

## 3. Results

#### 3.1. Case Study I

#### 3.2. Case Study II

## 4. Model Diagnostic and Evaluation

## 5. Discussion and Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Beykikhoshk, A.; Arandjelović, O.; Phung, D.; Venkatesh, S.; Caelli, T. Data-mining Twitter and the autism spectrum disorder: A pilot study. In Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), Beijing, China, 17–20 August 2014; pp. 349–356.
2. Fung, I.C.H.; Tse, Z.T.H.; Cheung, C.N.; Miu, A.S.; Fu, K.W. Ebola and the social media. Lancet **2014**, 384, 2207.
3. Hswen, Y.; Gopaluni, A.; Brownstein, J.S.; Hawkins, J.B. Using Twitter to detect psychological characteristics of self-identified persons with autism spectrum disorder: A feasibility study. JMIR mHealth uHealth **2019**, 7, e12264.
4. Moorhead, S.A.; Hazlett, D.E.; Harrison, L.; Carroll, J.K.; Irwin, A.; Hoving, C. A new dimension of health care: Systematic review of the uses, benefits, and limitations of social media for health communication. J. Med. Internet Res. **2013**, 15, e1933.
5. Zhang, D.Y.; Han, R.; Wang, D.; Huang, C. On robust truth discovery in sparse social media sensing. In Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 5–8 December 2016; pp. 1076–1081.
6. Liu, J.; Chen, S.; Zhou, Z.H.; Tan, X. Generalized low-rank approximations of matrices revisited. IEEE Trans. Neural Netw. **2010**, 21, 621–632.
7. Kim, H.; Howland, P.; Park, H.; Christianini, N. Dimension reduction in text classification with support vector machines. J. Mach. Learn. Res. **2005**, 6, 37–53.
8. Corley, C.D.; Cook, D.J.; Mikler, A.R.; Singh, K.P. Text and structural data mining of influenza mentions in web and social media. Int. J. Environ. Res. Public Health **2010**, 7, 596–615.
9. Yin, Z.; Sulieman, L.M.; Malin, B.A. A systematic literature review of machine learning in online personal health data. J. Am. Med. Inform. Assoc. **2019**, 26, 561–576.
10. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2017.
11. Wickham, H. Stringr: Modern, consistent string processing. R J. **2010**, 2, 38.
12. Feinerer, I. Introduction to the tm Package: Text Mining in R. 2013. Available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf (accessed on 22 November 2021).
13. Zeileis, A.; Kleiber, C.; Jackman, S. Regression models for count data in R. J. Stat. Softw. **2008**, 27, 1–25.
14. Cameron, A.C.; Trivedi, P.K. Regression Analysis of Count Data; Cambridge University Press: Cambridge, UK, 2013; Volume 35.
15. Jackman, S. pscl: Classes and Methods for R Developed in the Political Science Computational Laboratory, Stanford University; R Package Version 1.03.5; Department of Political Science, Stanford University: Stanford, CA, USA, 2010.
16. Robin, X.; Turck, N.; Hainard, A.; Tiberti, N.; Lisacek, F.; Sanchez, J.; Müller, M. pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform. **2011**, 12, 77.
17. Kleiber, C.; Zeileis, A. Visualizing count data regressions using rootograms. Am. Stat. **2016**, 70, 296–303.
18. Duvekot, J.; van der Ende, J.; Verhulst, F.C.; Slappendel, G.; van Daalen, E.; Maras, A.; Greaves-Lord, K. Factors influencing the probability of a diagnosis of autism spectrum disorder in girls versus boys. Autism **2017**, 21, 646–658.
19. Zerbo, K.R.; Mo, C. Identifying factors associated with autism spectrum disorder based on a comprehensive national survey. Int. J. Child Adolesc. Health **2018**, 11, 57–72.
20. Arnaud, É.; Elbattah, M.; Gignon, M.; Dequen, G. Deep learning to predict hospitalization at triage: Integration of structured data and unstructured text. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 4836–4841.
21. Goel, A.; Gautam, J.; Kumar, S. Real time sentiment analysis of tweets using Naive Bayes. In Proceedings of the 2016 2nd International Conference on Next Generation Computing Technologies (NGCT), Piscataway, NJ, USA, 14–16 October 2016; pp. 257–261.
22. Dey, L.; Chakraborty, S.; Biswas, A.; Bose, B.; Tiwari, S. Sentiment analysis of review datasets using naive bayes and k-nn classifier. arXiv **2016**, arXiv:1610.09982.
23. Gupte, A.; Joshi, S.; Gadgul, P.; Kadam, A.; Gupte, A. Comparative study of classification algorithms used in sentiment analysis. Int. J. Comput. Sci. Inf. Technol. **2014**, 5, 6261–6264.
24. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. **2003**, 3, 993–1022.
25. Adnan, M.M.; Yin, J.; Jackson, A.M.; Tse, Z.T.H.; Liang, H.; Fu, K.W.; Saroha, N.; Althouse, B.M.; Fung, I.C.H. World Pneumonia Day 2011–2016: Twitter contents and retweets. Int. Health **2019**, 11, 297–305.
26. Fung, I.C.H.; Yin, J.; Pressley, K.D.; Duke, C.H.; Mo, C.; Liang, H.; Fu, K.W.; Tse, Z.T.H.; Hou, S.I. Pedagogical demonstration of Twitter data analysis: A case study of World AIDS Day, 2014. Data **2019**, 4, 84.
27. Schaible, B.J.; Snook, K.R.; Yin, J.; Jackson, A.M.; Ahweyevu, J.O.; Chong, M.; Tse, Z.T.H.; Liang, H.; Fu, K.W.; Fung, I.C.H. Twitter conversations and English news media reports on poliomyelitis in five different countries, January 2014 to April 2015. Perm. J. **2019**, 23, 18–181.
28. Ormerod, M.; Del Rincón, J.M.; Devereux, B. Predicting semantic similarity between clinical sentence pairs using transformer models: Evaluation and representational analysis. JMIR Med. Inform. **2021**, 9, e23099.
29. Jiang, L.; Yu, M.; Zhou, M.; Liu, X.; Zhao, T. Target-dependent Twitter sentiment classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 151–160.
30. Agarwal, A.; Xie, B.; Vovsha, I.; Rambow, O.; Passonneau, R.J. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011), Portland, OR, USA, 23 June 2011; pp. 30–38.
31. Bifet, A.; Frank, E. Sentiment knowledge discovery in Twitter streaming data. In International Conference on Discovery Science; Springer: Berlin/Heidelberg, Germany, 2010; pp. 1–5.
32. Owoputi, O.; O’Connor, B.; Dyer, C.; Gimpel, K.; Schneider, N.; Smith, N.A. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA, 9–15 June 2013; pp. 380–390.
33. Barracliffe, L.; Arandjelovic, O.; Humphris, G. A pilot study of breast cancer patients: Can machine learning predict healthcare professionals’ responses to patient emotions? In Proceedings of the International Conference on Bioinformatics and Computational Biology, Honolulu, HI, USA, 20–22 March 2017; pp. 20–22.

**Figure 2.** The elbow plot of relative risks of 135 terms from the Hurdle model (**red**) and Negative-Binomial model (**blue**).

**Figure 3.** The word clouds of 135 terms (the size of each word is determined by its RR) from the (**a**) Hurdle model and (**b**) Negative-Binomial model.

**Figure 4.** The elbow plot of odds ratios of the significant terms estimated from the univariate logistic regression model. Notice that most terms are not useful for classification, which highlights the advantage of using the AIS of keywords (significant terms) for analysis.

**Figure 5.** The word clouds of the significant terms from univariate logistic regression in the (**a**) ASD group versus (**b**) non-ASD group. The font size displays the popularity of a particular term in either group: the larger the font size, the larger the value of the odds ratio and hence the greater the presence/popularity of a topic/term.

**Figure 6.** The rootograms from the (**a**) Hurdle model, (**b**) Negative-Binomial model, and (**c**) Poisson model.

**Figure 7.** The ROC curve for classifying ASD and non-ASD topics based on (**a**) AIS-mean and (**b**) AIS-median.

**Table 1.** Summary statistics from the final regression models evaluating the association of the proposed summary score AIS with retweet frequency.

| Model | AIS | Parameter Estimate | Standard Error ^{1} | RR | AIC |
|---|---|---|---|---|---|
| Negative-Binomial | Mean | 0.9340 | 0.0187 | 2.5447 | 325,978 |
| | Median | 0.2012 | 0.0216 | 1.2229 | 327,007 |
| | Sum | 0.1838 | 0.0052 | 1.2018 | 325,993 |
| Hurdle | Mean | 1.8552 | 0.0527 | 6.3930 | 314,887 |
| | Median | 2.7288 | 0.0834 | 15.3145 | 314,673 |
| | Sum | 0.4304 | 0.0091 | 1.5379 | 313,852 |

^{1} The standard errors are so small that the confidence intervals are very close to the estimated relative risks. All p-values for testing RR = 1 are less than 2 × 10^{−16}. Therefore, the AIS is strongly associated with the retweet frequency, thus offering a good summary measure of the text contents.
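As a check on Table 1, the RR column can be reproduced from the parameter estimates, since RR = exp(β̂), and the narrow confidence intervals noted in the footnote follow from an approximate 95% Wald interval exp(β̂ ± 1.96·SE). A short sketch:

```python
import math

# Table 1 slope estimates and standard errors: (beta_hat, SE)
estimates = {
    ("Negative-Binomial", "Mean"):   (0.9340, 0.0187),
    ("Negative-Binomial", "Median"): (0.2012, 0.0216),
    ("Negative-Binomial", "Sum"):    (0.1838, 0.0052),
    ("Hurdle", "Mean"):              (1.8552, 0.0527),
    ("Hurdle", "Median"):            (2.7288, 0.0834),
    ("Hurdle", "Sum"):               (0.4304, 0.0091),
}
for (model, ais), (beta, se) in estimates.items():
    rr = math.exp(beta)                                   # RR column of Table 1
    lo, hi = math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)
    print(f"{model:17s} {ais:6s} RR={rr:.4f} 95% CI=({lo:.4f}, {hi:.4f})")
```

Each computed RR matches the corresponding value reported in Table 1 to four decimal places.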

**Table 2.** Summary statistics from the final logistic regression models comparing the proposed summary score AIS across the resulting classified groups.

| Model | AIS | Mean of AIS (Non-ASD) | Mean of AIS (ASD) | Median of AIS (Non-ASD) | Median of AIS (ASD) |
|---|---|---|---|---|---|
| Logistic | Mean | 1.04 | 1.10 | 1.00 | 1.00 |
| | Median | 1.02 | 1.05 | 1.00 | 1.00 |
| | Sum | 5.15 | 3.75 | 4.00 | 3.00 |

All p-values are less than 2 × 10^{−16}. Therefore, the two groups are well classified.

**Table 3.** The summary statistics of classification accuracy for AIS scores based on mean, median, and sum.

| AIS Score | AUC | Sensitivity | Specificity | Cut-Off Point |
|---|---|---|---|---|
| Mean | 0.8228 | 0.9208 | 0.7246 | 1.000 |
| Median | 0.9396 | 0.9289 | 0.8786 | 1.000 |
| Sum | 0.3895 | 0.5601 | 0.3767 | 3.000 |
| Reciprocal of Sum | 0.6055 | 0.7185 | 0.5082 | 0.2065 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Mo, C.; Yin, J.; Fung, I.C.-H.; Tse, Z.T.H.
Aggregating Twitter Text through Generalized Linear Regression Models for Tweet Popularity Prediction and Automatic Topic Classification. *Eur. J. Investig. Health Psychol. Educ.* **2021**, *11*, 1537-1554.
https://doi.org/10.3390/ejihpe11040109

**AMA Style**

Mo C, Yin J, Fung IC-H, Tse ZTH.
Aggregating Twitter Text through Generalized Linear Regression Models for Tweet Popularity Prediction and Automatic Topic Classification. *European Journal of Investigation in Health, Psychology and Education*. 2021; 11(4):1537-1554.
https://doi.org/10.3390/ejihpe11040109

**Chicago/Turabian Style**

Mo, Chen, Jingjing Yin, Isaac Chun-Hai Fung, and Zion Tsz Ho Tse.
2021. "Aggregating Twitter Text through Generalized Linear Regression Models for Tweet Popularity Prediction and Automatic Topic Classification" *European Journal of Investigation in Health, Psychology and Education* 11, no. 4: 1537-1554.
https://doi.org/10.3390/ejihpe11040109