Pedagogical Demonstration of Twitter Data Analysis: A Case Study of World AIDS Day, 2014

Abstract: As a pedagogical demonstration of Twitter data analysis, a case study of HIV/AIDS-related tweets around World AIDS Day, 2014, was presented. This study examined whether Twitter users from countries with various income levels responded differently to World AIDS Day. The performance of support vector machine (SVM) models as classifiers of relevant tweets was evaluated. A manual coding of 1,826 randomly sampled HIV/AIDS-related original tweets from November 30 through December 2, 2014 was completed. Logistic regression was applied to analyze the association between the World Bank-designated income level of users' self-reported countries and Twitter contents. To identify the optimal SVM model, 1,278 (70%) of the 1,826 sampled tweets were randomly selected as the training set, and 548 (30%) served as the test set. Another 180 tweets were separately sampled and coded as the held-out dataset. Compared with tweets from low-income countries, tweets from the Organization for Economic Cooperation and Development countries had 60% lower odds of mentioning epidemiology (adjusted odds ratio, aOR = 0.404; 95% CI: 0.166, 0.981) and three times the odds of mentioning compassion/support (aOR = 3.080; 95% CI: 1.179, 8.047). Tweets from lower-middle-income countries had 79% lower odds than tweets from low-income countries of mentioning HIV-affected sub-populations (aOR = 0.213; 95% CI: 0.068, 0.664). The optimal SVM model was able to identify relevant tweets from the held-out dataset of 180 tweets with an F1 score of 0.72. This study demonstrated how students can be taught to analyze Twitter data using manual coding, regression models, and SVM models.


Introduction
Globally, 36.9 million people were living with human immunodeficiency virus (HIV) and 2 million people became newly infected with HIV in 2014 [1]. The annual World AIDS Day (WAD) promotes HIV awareness and advocates for HIV prevention, treatment, and community support for people living with HIV. According to the Centers for Disease Control and Prevention (CDC), more than 1.1 million people in the United States were estimated to be living with HIV in 2015, of whom approximately 15% were unaware of their HIV status [2]. Prevention of HIV infection and associated illnesses and deaths is one of the goals of the Healthy People 2020 initiatives in the United States [3].
Social media, such as Twitter and Facebook, has become increasingly popular as a data source and a tool in public health for both epidemiologic surveillance and communication surveillance [4]. Many public health agencies use social media to promote healthy lifestyles and disease prevention. For example, the CDC has specific Twitter profiles dedicated to HIV/AIDS prevention and control (@CDC_HIVAIDS and @talkHIV). Hence, prior studies examined using social media to deliver HIV prevention information [5]. Social media analysis revealed users' reactions to specific health promotion events [6]. Users from different countries might react to the same disease differently [7,8]. Research comparing Twitter contents of five different languages pertinent to the MERS outbreak in South Korea in 2015 found that users from different Asian countries had different concerns about the outbreak [7]. Research comparing English tweets with Chinese Weibo posts pertinent to Ebola also identified content topics specific to Chinese internet users [8]. The language used in tweets might reveal users' demographics, including their socio-economic status [9].
The potential of using Twitter as a tool for health communication and public health surveillance has long been recognized by researchers [4,10-12]. For example, researchers have explored Twitter's potential role in the surveillance of behaviors associated with increased risk of HIV infection [13]. The application of computer-enabled methods, such as keyword and hashtag analysis as well as supervised and unsupervised machine learning methods, may help improve digital surveillance [4,11] by scaling up content analysis of tweets pertinent to health topics, such as Ebola [8], pneumonia [14], polio [15], and Zika [16]. However, Twitter data analysis training for Master of Public Health (MPH) students largely remains an unmet need [17].
In the past few years, our team at Georgia Southern University has met our students' educational needs through various student projects [7,14,15,18-20]. Some projects focused on specific health topics, such as sentiment, contents, and retweets of vaccine-related tweets [7]. Others focused on a specific Twitter profile, such as the CDC's Office of Advanced Molecular Detection (@CDC_AMD) [18]. In many cases, mixed methods were used. Students were trained to manually code Twitter content. They were also trained to perform statistical analysis, such as regression models. Wherever appropriate, machine learning methods were introduced to the students.
In this paper, two related MPH student projects are used to provide readers with a pedagogical demonstration of statistical methods that can be used to analyze manually coded HIV-related Twitter data. The research presented here was conducted as part of the educational experience for three graduate students, KDP, CHD, and CM, who were mentored by two faculty members, ICHF and JY, at Georgia Southern University in consultation with the other co-authors of this paper.

Part I
In the first student project (MPH capstone project), our pedagogical objective was to train MPH students to perform manual coding of tweets and subsequent statistical analyses of the association between meta-data variables and content variables. It was hypothesized that Twitter users residing in countries of different income levels might express different concerns on Twitter regarding HIV/AIDS. It should be noted that a Twitter user's location is self-reported. Additionally, the country's income level as defined by the World Bank was chosen as the basis of categorization because, for our pedagogical purpose, it was relatively objective and uncontroversial. Through analyzing Twitter data around WAD 2014, this hypothesis was tested: H1: The contents of HIV-related tweets created around WAD 2014 (each content category as an outcome variable) vary by the country income level, as defined by the World Bank, of the users' self-reported location.

Part II
In the second student project (a Special Topics class project), our pedagogical objective was to train MPH students in the application of an established supervised machine learning method to categorize the contents of HIV-related tweets created around WAD 2014 into "relevant" and "irrelevant" tweets. More specifically, the aim was to train students to obtain the optimal support vector machine (SVM) model with the maximum F1 score and to evaluate its performance using a manually coded held-out dataset. The F1 score is a measure of accuracy and is the harmonic mean of sensitivity and positive predictive value [21]. The goal of the analysis was to obtain benchmark data for future studies using automated methods to categorize tweets carrying the words "HIV" or "AIDS" into tweets that were genuinely relevant to HIV and those that were not.
Our study serves as a small step towards a better understanding of how future public health practitioners can be trained to analyze Twitter data to address research questions pertinent to health communication.

Data Description
Our data consisted of HIV/AIDS-related Twitter messages (i.e., tweets) surrounding WAD 2014. Data was collected from publicly available, user-generated contents through Twitter Advanced Search [22]. The advanced search allowed the use of several filters. Three filters were used: the query filter, the language filter, and the time filter, which fell under the sections "All of these words", "Written in", and "From this date", respectively. With these filters, tweets matching the queried words "HIV OR AIDS" in English within a three-day time frame from November 30, 2014, through December 2, 2014 (UTC +00:00) were gathered. People and place filters were not used, so any public user worldwide who posted a tweet in English about HIV or AIDS within the three-day time frame was included in the original sample. The original search yielded 184,349 original tweets (i.e., all retweets were excluded).
As there is a limit to the number of tweets that can be retrieved using the Twitter search Application Programming Interface (API), a Python script (Supplementary Materials, Python script S1) was written to retrieve the search results in March 2015 through web crawling. The contents of the tweets and their unique identification numbers (i.e., tweet IDs) were retrieved, and the Twitter API was used to obtain additional information according to the tweet IDs. Information retrieved from the data collection included user ID number, username, user's location (available only if the user permitted), retweet count, message tweeted, time created, mention ID/IDs, and mention username. As geo-location data was only available for a minority of tweets (around 15% according to Liang, Sheng & Fu [23]), the self-reported locations of the Twitter users were used as a proxy. Among the 184,349 tweets, 131,407 had an entry in the self-reported location field, of which 103,928 (56.4% of all tweets) were identifiable (unpublished analysis).
Detailed descriptions of data retrieval are presented in Appendix A.

Manual Coding
From the 184,349 original tweets, a 1% random sample was extracted (n = 1826) for manual coding. The senior author developed the coding scheme after reading around 200 randomly selected tweets from our dataset. For the content analysis portion of the study, each of the sampled tweets was numerically coded based on 12 questions: 1, 2a-c, 3, 4a-b and 5a-e (Tables 1 and 2):
1. Language, written in English or not;
2. Reported location by: (a) country income level as described by the World Bank yearly revised gross national income (GNI) per capita classifications [24]
Results presented in Table 2 were based on coding to Questions 2b and 2c.
Using the coding scheme above, each tweet was grouped into respective categories. This study randomly selected 200 tweets (11%) to be coded by two coders independently. The inter-rater reliability was assessed by calculating the Cohen's kappa coefficient for the question variables (Appendix B, Table A1). Some of the values were low due to the extreme imbalance of the marginal totals. However, the observed proportion of agreement for each question was above 90%. Discrepancies were discussed and coding procedures were refined to address ambiguity in coding instructions and content meanings. The remaining tweets were divided and separately coded by the two coders.
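The remark that kappa can be low while observed agreement stays above 90% can be illustrated with a small sketch. The counts below are hypothetical and chosen only to mimic a rarely used content category; they are not taken from Appendix B, and the function name is ours.

```python
def agreement_and_kappa(both_yes, only_coder1, only_coder2, both_no):
    """Return (observed agreement, Cohen's kappa) for two coders and one yes/no code."""
    n = both_yes + only_coder1 + only_coder2 + both_no
    observed = (both_yes + both_no) / n
    # Chance agreement is derived from each coder's marginal "yes" proportion.
    coder1_yes = (both_yes + only_coder1) / n
    coder2_yes = (both_yes + only_coder2) / n
    expected = coder1_yes * coder2_yes + (1 - coder1_yes) * (1 - coder2_yes)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts for 200 double-coded tweets on a rarely mentioned category.
observed, kappa = agreement_and_kappa(
    both_yes=2, only_coder1=5, only_coder2=5, both_no=188
)
print(f"observed agreement = {observed:.2f}, kappa = {kappa:.2f}")
```

With these made-up counts the coders agree on 95% of tweets, yet kappa is only about 0.26, because nearly all of that agreement is already expected by chance when one category dominates the marginal totals.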
Furthermore, a separate corpus of 180 tweets, randomly drawn from the original dataset, was coded by a third coder for the relevance of each tweet.This corpus was used as the held-out dataset for supervised machine learning.
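One way to draw a coding sample and a held-out sample that cannot overlap is sketched below. This is an illustration only: the text does not describe the sampling code that was actually used, and the function name, the seed, and the use of row indices in place of tweet IDs are assumptions.

```python
import random

N_ORIGINAL = 184_349  # original tweets retrieved (from the text)
N_CODED = 1_826       # manually coded sample (from the text)
N_HELD_OUT = 180      # held-out sample (from the text)


def draw_disjoint_samples(n_total, n_first, n_second, seed=2014):
    """Draw two non-overlapping simple random samples of row indices."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for reproducibility
    # Sampling without replacement returns items in selection order, so any
    # slice of the result is itself a valid random sample.
    drawn = rng.sample(range(n_total), n_first + n_second)
    return drawn[:n_first], drawn[n_first:]


coded_rows, held_out_rows = draw_disjoint_samples(N_ORIGINAL, N_CODED, N_HELD_OUT)
print(len(coded_rows), len(held_out_rows))
```

Drawing both samples in a single call guarantees that no tweet appears in both the coded corpus and the held-out corpus.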

Part I: Statistical Analysis
Statistical analysis was performed with R versions 2.15.0 to 3.2.1 [25]. Logistic regression was performed for response variables (i.e., content categories) that are binary. The primary predictor of interest in these regression models was the country income level of self-reported locations, and all other related content categories were considered as confounders to be controlled for. In the regression analysis, Question 2b (United States versus non-United States) and Question 2c (if the United States, which state or territory) were not included due to their strong correlations with the country's income level. The latter also suffered from data sparsity (too few observations in one or more categories). Question 5a (HIV prevention information) was also excluded from the regression models due to data sparsity. Regarding Question 4b, levels 1, 2, 3, 4, 5, and 6 were combined into one level for mention of HIV sub-populations (i.e., Yes, for any sub-population) so that there would be enough data in each category. In addition, due to strong multicollinearity between some of the content categories and self-reported locations, stepwise model selection was used to determine the final regression models.
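For readers new to odds ratios, the simplest case can be sketched as follows: an unadjusted odds ratio with a Wald 95% confidence interval computed from a 2x2 table. This is not the adjusted, stepwise-selected model used in the study (which was fitted in R); the counts are hypothetical and the function name is ours.

```python
import math


def odds_ratio_wald_ci(group_yes, group_no, ref_yes, ref_no, z=1.96):
    """Unadjusted odds ratio (group vs reference) with a Wald confidence interval."""
    odds_ratio = (group_yes * ref_no) / (group_no * ref_yes)
    # Standard error of log(OR): square root of the summed reciprocal cell counts.
    se_log_or = math.sqrt(1 / group_yes + 1 / group_no + 1 / ref_yes + 1 / ref_no)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: tweets that do / do not mention a content category,
# for one income group versus the low-income reference group.
or_hat, ci_low, ci_high = odds_ratio_wald_ci(
    group_yes=20, group_no=180, ref_yes=10, ref_no=30
)
print(f"OR = {or_hat:.2f} (95% CI: {ci_low:.2f}, {ci_high:.2f})")
```

A multivariable logistic regression generalizes this calculation: each exponentiated coefficient is an odds ratio adjusted for the other variables in the model.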

Part II: Support Vector Machine Model
SVM models are binary classifiers that can be trained, using a manually coded dataset, to separate tweets into two groups (relevant versus irrelevant). In Part II, all content categories were collapsed into one relevant category (i.e., all tweets coded with at least one content category) and one irrelevant category. The manually coded dataset of tweets in Part I (n = 1826) was used to create SVM models, with the aim of determining the optimal SVM model to test on a separate held-out dataset of 180 tweets. In this study, 1278 (70%) of the 1826 tweets in the manually coded sample were randomly selected as the training set for the SVM models, and the remaining 548 (30%) were used as the test set. The held-out dataset was separately and randomly drawn from our data of 184,349 HIV/AIDS-related original tweets. The held-out dataset was distinct from the sample of 1826 tweets in Part I, and the 180 tweets were manually coded as "relevant" or "irrelevant". The Twitter messages were preprocessed, with URLs, digits, stop-words, and punctuation marks (except intra-word dashes) removed. Stemming was not performed to avoid the word "AIDS" being converted into "aid". Models were trained with variation in the sparse term threshold within the document term matrices. Positive predictive value, sensitivity, specificity, and F1 scores were calculated for each training and test set. The optimal SVM model was identified based on the F1 score, which is the harmonic mean of sensitivity and positive predictive value, so a larger F1 score indicated better prediction accuracy of the model. The trained optimal model was then used to predict whether the tweets in the held-out dataset of 180 tweets were relevant or not. SVM models were computed using R 3.2.1 to 3.2.3 [25].
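The preprocessing and the 70/30 split described above can be sketched in a few lines. The original work was done in R, so this Python version is only an approximation under stated assumptions: the stop-word list is a tiny illustrative one, only ASCII letters are kept, and the function names, the seed, and the example tweet are invented for this sketch.

```python
import random
import re

# Tiny illustrative stop-word list (an assumption, not the list used in the study).
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "for", "on"}


def preprocess(tweet):
    """Remove URLs, digits, punctuation (except intra-word dashes) and stop-words."""
    text = re.sub(r"https?://\S+", " ", tweet)   # URLs
    text = re.sub(r"\d+", " ", text)             # digits
    text = re.sub(r"[^A-Za-z\s-]", " ", text)    # punctuation other than dashes
    # Drop any dash that is not flanked by letters on both sides.
    text = re.sub(r"(?<![A-Za-z])-|-(?![A-Za-z])", " ", text)
    # No stemming, so "AIDS" is never reduced to "aid".
    return [tok for tok in text.split() if tok.lower() not in STOP_WORDS]


def split_train_test(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of the items and split it into training and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


tokens = preprocess(
    "World AIDS Day 2014: get tested for HIV - anti-retroviral info http://t.co/abc"
)
train_ids, test_ids = split_train_test(range(1826))
print(tokens)
print(len(train_ids), len(test_ids))
```

Applied to 1826 items, the split reproduces the 1278 and 548 set sizes given in the text. The token lists would then be turned into a document term matrix before any SVM is trained, a step this sketch leaves out.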

Ethics Approval
The research presented herein was approved by Georgia Southern University's Institutional Review Board, which determined it to be exempt from full review (H15083 and H15368).

Part I: Statistical Analysis
Among our sample of 1826 tweets, two tweets were not properly coded by coder #2 for contents and were removed from subsequent analysis. Regarding the remaining 1824 tweets, self-reported locations of the Twitter users were not reported in 616 (33.4%) of the tweets.
Table 3 presents results from the logistic regression analysis. Compared with tweets from low-income countries (reference category), tweets from OECD countries had 60% lower odds of mentioning HIV/AIDS epidemiology (adjusted odds ratio, aOR = 0.40; 95% confidence interval, CI: 0.17, 0.98) after controlling for mentions of sub-populations. Compared with tweets from low-income countries, tweets from OECD countries had 3.08 times the odds of mentioning HIV/AIDS compassion and support (aOR = 3.08; 95% CI: 1.18, 8.05) after controlling for mentions of WAD, HIV/AIDS epidemiology, and HIV/AIDS testing. Tweets from lower-middle-income countries had 79% lower odds (aOR = 0.21; 95% CI: 0.07, 0.66) than tweets from low-income countries of mentioning any of the sub-populations affected by HIV/AIDS after controlling for mentions of WAD and HIV/AIDS epidemiology. Our results supported our hypothesis that the contents of HIV-related tweets created around WAD 2014 vary by the country income level, as defined by the World Bank, of the users' self-reported location.
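The verbal summaries of the adjusted odds ratios (for example, "60% lower odds") follow from simple arithmetic, sketched below using the three-decimal values reported in the abstract. The check that each estimate sits at the geometric midpoint of its interval assumes Wald-type intervals, which the text does not state explicitly; the function names are ours.

```python
import math

# (aOR, lower 95% limit, upper 95% limit) as reported in the abstract.
REPORTED = {
    "OECD vs low income, epidemiology": (0.404, 0.166, 0.981),
    "OECD vs low income, compassion/support": (3.080, 1.179, 8.047),
    "LMI vs low income, sub-populations": (0.213, 0.068, 0.664),
}


def percent_change_in_odds(odds_ratio):
    """Express an odds ratio as a percentage change relative to the reference."""
    return (odds_ratio - 1.0) * 100.0


def log_scale_se(lower, upper, z=1.96):
    """Back out the standard error of log(OR), assuming a Wald interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


for label, (aor, lower, upper) in REPORTED.items():
    midpoint = math.sqrt(lower * upper)  # a Wald interval is symmetric on the log scale
    print(
        f"{label}: {percent_change_in_odds(aor):+.1f}% change in odds, "
        f"geometric midpoint {midpoint:.3f}, SE(log OR) {log_scale_se(lower, upper):.3f}"
    )
```

An odds ratio below 1 yields a negative percentage (lower odds), while a ratio of 3.08 corresponds to odds about three times those of the reference group.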

Part II: SVM Results
Five SVM models were created. The sparse term threshold varied from none to (n−10)/n, where n refers to the total number of terms (i.e., 4,963), giving 230 terms. Model variations, positive predictive values, and sensitivity are displayed in Table 4. The SVM model with a sparse term threshold of (n−5)/n gave the best F1 score (0.76) in the test set and was therefore used to test the manually coded held-out dataset of 180 tweets. The trained SVM model was able to classify the held-out dataset with 77% sensitivity, a positive predictive value of 68%, and an F1 score of 0.72.
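As a quick check on how the held-out metrics relate to one another, the F1 score can be recomputed from the reported sensitivity and positive predictive value. The sketch below also gives the equivalent formula in terms of confusion-matrix counts; the function names are ours, and no counts from Table 4 are used.

```python
def f1_from_rates(sensitivity, ppv):
    """F1 as the harmonic mean of sensitivity (recall) and PPV (precision)."""
    return 2 * sensitivity * ppv / (sensitivity + ppv)


def f1_from_counts(tp, fp, fn):
    """Equivalent F1 formula using true positives, false positives, false negatives."""
    return 2 * tp / (2 * tp + fp + fn)


# Sensitivity and positive predictive value reported for the held-out dataset.
held_out_f1 = f1_from_rates(sensitivity=0.77, ppv=0.68)
print(round(held_out_f1, 2))
```

The result rounds to 0.72, in line with the reported F1 score. Because a harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F1 score by excelling on sensitivity alone.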

Discussion
This case study serves as a pedagogical demonstration of how public health graduate students could be taught statistical and text mining techniques for analyzing Twitter data. This study explored how contents of manually coded Twitter data pertinent to a specific health topic can be analyzed. In this case, our dataset contained tweets related to HIV/AIDS retrieved around WAD 2014. First, multivariable regression models were applied with the country income level of users as the main explanatory variable to explore differences in contents between users whose self-reported countries belong to different World Bank country income levels. Next, the SVM model was applied as a classifier for relevant and irrelevant tweets.
Our findings suggest a possibility of divergent interests between countries of different levels of economic development with regard to the types of HIV/AIDS-related information being shared on Twitter. These results echo observations that social media users reacted to infectious disease outbreaks differently between different communities or countries [7,8]. However, since location data is self-reported, and country-level income data does not apply to individuals (ecological fallacy), the authors caution against over-interpretation of the results of this exploratory study.
Our supervised machine learning analysis serves as a pilot for the use of SVM models to filter HIV-relevant tweets from irrelevant tweets in a corpus of tweets. The use of SVM models could reduce the amount of time needed to read through tweets and serve as the first sieve to pull relevant tweets for in-depth manual coding that will serve as the basis of qualitative analysis of the textual content of Twitter data. The SVM method described here serves to facilitate, and not replace, qualitative studies that are essential to our understanding of social media contents.

Limitations
The dataset of this study is cross-sectional. Temporal variability of the data and causality between the predictive and outcome variables are beyond its scope. The three-day time frame was chosen intentionally to capture the responses to WAD. Hence, our results might not be applicable to other times of the year. Given the search terms, the sample of tweets was predominantly written in the English language, and thus our results cannot be generalized to users who wrote in other languages. Given the sample size for manual coding, the Twitter contents of users from individual countries could not be compared. Self-reported locations might not always accurately reflect the exact geographical location of the users. The geolocation data was not used because the proportion of tweets with geolocation data was low. The limitations of using the World Bank's categorization of country income levels of Twitter users' self-reported location as the explanatory variable are acknowledged. The diversity of cultures, healthcare systems, and prevalence of HIV/AIDS across countries (or territories) within the same country income level precluded the authors from drawing inferences between economic development and HIV-related concerns raised by Twitter users. Furthermore, it should be noted that a country's income level is not an indication of an individual's income level, and such an inference (which is known as the ecological fallacy in epidemiology) should not be made. It is noted that in any country, Twitter users do not represent the general population. Finally, our study did not evaluate the quality (accuracy or reliability) of the information provided in the tweets. Only "mentions" were coded. Evaluation of the quality of HIV/AIDS-related information posted on Twitter is beyond the scope of this study.

Conclusions
To conclude, a pedagogical demonstration in public health Twitter data analysis using HIV-related tweets around WAD 2014 as a case study was presented. As social media data analysis becomes more mainstream in public health practice, training in big data analytical techniques will become more relevant to the education of our future public health practitioners.

Conflicts of Interest:
The authors declare no conflict of interest. The CDC had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Supplementary Materials:
The following are available online at http://www.mdpi.com/2306-5729/4/2/84/s1, Python script S1.

Author Contributions:
Conceptualization, I.C.-H.F., Z.T.H.T. and S.-I.H.; data curation, K.D.P. and C.H.D.; formal analysis, K.D.P., C.H.D. and C.M.; methodology, H.L.; project administration, I.C.-H.F.; resources, I.C.-H.F. and S.-I.H.; software, H.L.; supervision, I.C.-H.F. and J.Y.; writing-original draft, I.C.-H.F., J.Y., K.D.P., C.H.D. and H.L.; writing-review & editing, I.C.-H.F., J.Y., C.M., H.L., K.-W.F., Z.T.H.T. and S.-I.H.

Funding:
This research received no external funding.

Acknowledgments:
ICHF (15IPA1509134; 16IPA1609578) and ZTHT (16IPA1619505) received salary support from the Centers for Disease Control and Prevention (CDC). This paper is not related to their CDC-supported research. CDC had no role in the study design, data collection, data analysis, writing and submission of this paper. An early version of part I of this paper was submitted by KDP to ICHF as part of her Public Health Capstone Research Project (PUBH 7991), Spring 2015, titled "Examination of Self-reported Locations and Content of HIV/AIDS Twitter Data." An early version of part II of this paper was submitted by CHD to ICHF as part of her class project in Public Health Special Topics: Social Media and Health (PUBH 7090-A), Fall 2015. ICHF thanks Dr. Chung-Hong Chan (The University of Hong Kong and The University of Mannheim) who shared with him the text mining teaching materials for instructional use in his course. The authors thank Ogochukwu Nnaemeka Ezumba of The University of Georgia for being the second manual coder of tweets. ICHF, JY, KDP, CHD and CM serve as co-first authors.
a All the percentages included in this table use 1824 as the denominator. IDU: injecting drug users; OECD: Organization for Economic Co-operation and Development.

Table 2. The self-reported locations by state or territory of users from the United States (n = 453 tweets).

Table 3. Adjusted odds ratio of country income levels of self-reported locations and other variables as predictors of mentions of HIV/AIDS epidemiology information, mentions of sub-populations, and mentions of HIV/AIDS compassion and support in a stepwise multivariable regression analysis.
* Country income levels of self-reported locations: low-income countries as the reference category; LMI, lower-middle-income countries; UMI, upper-middle-income countries; HIC, high-income countries that are not members of the Organization for Economic Co-operation and Development; OECD, countries that are members of the Organization for Economic Co-operation and Development.

Table 4. Statistics of the performance of five support vector machine (SVM) models as described by sensitivity, positive predictive value and F1 score.
FP: False Positive; TN: True Negative; TP: True Positive. The training set and test set were randomly selected from the corpus of 1826 tweets (70% and 30% respectively). F1 score is the harmonic mean of sensitivity and positive predictive value. Model C was selected because its F1 score on the test set was the highest among the five models. Model C was applied to the held-out dataset, which was a separate dataset of 180 manually coded tweets. Both the corpus of 1826 tweets and the corpus of 180 tweets were randomly selected from 184,349 HIV/AIDS-related original tweets from November 30 through December 2, 2014.