1. Introduction
Globally, 36.9 million people were living with human immunodeficiency virus (HIV) and 2 million people became newly infected with HIV in 2014 [1]. The annual World AIDS Day (WAD) promotes HIV awareness and advocates for HIV prevention, treatment, and community support for people living with HIV. According to the Centers for Disease Control and Prevention (CDC), more than 1.1 million people in the United States were estimated to be living with HIV in 2015, of whom approximately 15% were unaware of their HIV status [2]. Prevention of HIV infection and associated illnesses and deaths is one of the goals of the Healthy People 2020 initiatives in the United States [3].
Social media, such as Twitter and Facebook, has become increasingly popular as a data source and a tool in public health for both epidemiologic surveillance and communication surveillance [4]. Many public health agencies use social media to promote healthy lifestyles and disease prevention. For example, the CDC has specific Twitter profiles dedicated to HIV/AIDS prevention and control (@CDC_HIVAIDS and @talkHIV). Accordingly, prior studies have examined the use of social media to deliver HIV prevention information [5]. Social media analysis has also revealed users’ reactions to specific health promotion events [6]. Users from different countries might react to the same disease differently [7,8]. Research comparing Twitter content in five different languages pertinent to the MERS outbreak in South Korea in 2015 found that users from different Asian countries had different concerns about the outbreak [7]. Research comparing English tweets with Chinese Weibo posts pertinent to Ebola also identified content topics specific to Chinese internet users [8]. The language used in tweets might reveal users’ demographics, including their socio-economic status [9].
The potential of using Twitter as a tool for health communication and public health surveillance has long been recognized by researchers [4,10,11,12]. For example, researchers have explored Twitter’s potential role in the surveillance of behaviors associated with increased risk of HIV infection [13]. The application of computer-enabled methods, such as keyword and hashtag analysis as well as supervised and unsupervised machine learning methods, may help improve digital surveillance [4,11] by scaling up content analysis of tweets pertinent to health topics, such as Ebola [8], pneumonia [14], polio [15], and Zika [16]. However, Twitter data analysis training for Master of Public Health (MPH) students largely remains an unmet need [17].
In the past few years, our team at Georgia Southern University has met our students’ educational needs through various student projects [7,14,15,18,19,20]. Some projects focused on specific health topics, such as sentiment, contents, and retweets of vaccine-related tweets [7]. Others focused on a specific Twitter profile, such as the CDC’s Office of Advanced Molecular Detection (@CDC_AMD) [18]. In many cases, mixed methods were used. Students were trained to manually code Twitter content. They were also trained to perform statistical analysis, such as regression models. Wherever appropriate, machine learning methods were introduced to the students.
Through two related MPH student projects, this paper provides readers with a pedagogical demonstration of statistical methods that can be used to analyze manually coded HIV-related Twitter data. The research presented here was conducted as part of the educational experience for three graduate students, KDP, CHD, and CM, who were mentored by two faculty members, ICHF and JY, at Georgia Southern University in consultation with the other co-authors of this paper.
1.1. Part I
In the first student project (MPH capstone project), our pedagogical objective was to train MPH students to perform manual coding of tweets and subsequent statistical analyses of the association between meta-data variables and content variables. It was hypothesized that Twitter users residing in countries of different income levels might express different concerns on Twitter regarding HIV/AIDS. It should be noted that a Twitter user’s location is self-reported. Additionally, the country’s income level as defined by the World Bank was chosen as the basis of categorization. This categorization was chosen for our pedagogical purpose because it is relatively objective and uncontroversial. The following hypothesis was tested by analyzing Twitter data around WAD 2014:
H1: The contents of HIV-related tweets created around WAD 2014 (each content category as an outcome variable) vary by the World Bank income level of the country given as the user’s self-reported location.
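As a simplified illustration of what testing H1 for one content category could involve, the sketch below computes a crude (unadjusted) odds ratio with a 95% Wald confidence interval from a 2x2 table. This is not the multivariable regression used in the study; it is the quantity that a logistic regression with a single binary predictor would reproduce. The function name `crude_odds_ratio`, the two-level income grouping, and all counts are assumptions made up for illustration.

```python
import math


def crude_odds_ratio(a, b, c, d):
    """Odds ratio and 95% Wald confidence interval for a 2x2 table.

    a: tweets from group 1 that mention the content category
    b: tweets from group 1 that do not
    c: tweets from group 2 that mention the content category
    d: tweets from group 2 that do not
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio (Woolf's method)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, lower, upper


# Made-up counts for illustration only; these are NOT results from the study.
# Group 1: high-income countries; group 2: all other income levels.
or_estimate, ci_lower, ci_upper = crude_odds_ratio(40, 160, 20, 180)
print(f"OR = {or_estimate:.2f} (95% CI {ci_lower:.2f} to {ci_upper:.2f})")
```

A multivariable model would additionally adjust for other meta-data variables, which this crude estimate does not do.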
1.2. Part II
In the second student project (a Special Topics class project), our pedagogical objective was to train MPH students in the application of an established supervised machine learning method to categorize the contents of HIV-related tweets created around WAD 2014 into “relevant” and “irrelevant” tweets. More specifically, the aim was to train students to obtain the optimal support vector machine (SVM) model with the maximum F1 score and to evaluate its performance using a manually coded held-out dataset. The F1 score is a measure of accuracy and is the harmonic mean of sensitivity and positive predictive value [21]. The goal of the analysis was to obtain benchmark data for future studies using automated methods to categorize tweets carrying the words “HIV” or “AIDS” into tweets that were genuinely relevant to HIV and those that were not.
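The F1 score described above can be sketched in a few lines. The function below is a minimal illustration, assuming binary labels where 1 means “relevant” and 0 means “irrelevant”; the function name and the example labels are made up and are not taken from the study.

```python
def f1_score(true_labels, predicted_labels, positive=1):
    """Harmonic mean of sensitivity (recall) and positive predictive value (precision)."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    if true_pos == 0:
        # With no true positives, sensitivity or PPV is zero (or undefined)
        return 0.0
    sensitivity = true_pos / (true_pos + false_neg)
    ppv = true_pos / (true_pos + false_pos)
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# Made-up manual codes and classifier output for eight tweets
manual_codes = [1, 1, 1, 1, 0, 0, 0, 0]
model_output = [1, 1, 1, 0, 1, 0, 0, 0]
example_f1 = f1_score(manual_codes, model_output)
print(example_f1)
```

In this example there are three true positives, one false positive, and one false negative, so sensitivity and positive predictive value are both 0.75 and so is the F1 score.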
Our study serves as a small step towards a better understanding of how future public health practitioners can be trained to analyze Twitter data to address research questions pertinent to health communication.
2. Data Description
Our data consisted of HIV/AIDS-related Twitter messages (i.e., tweets) surrounding WAD 2014. Data was collected from publicly available, user-generated content retrieved through Twitter Advanced Search [22]. The advanced search allowed the use of several filters. Three of them were used: a query filter, a language filter, and a time filter, which fell under the sections “All of these words”, “Written in”, and “From this date”, respectively. With these filters, English-language tweets matching the query “HIV OR AIDS” within a three-day time frame from November 30, 2014, through December 2, 2014 (UTC +00:00) were gathered. People and place filters were not used, so any public user worldwide who posted a tweet in English about HIV or AIDS within the three-day time frame was included in the original sample. The original search yielded 184,349 original tweets (i.e., all retweets were excluded).
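The three filters can be expressed as a simple predicate. The sketch below is a local re-implementation for illustration only; in the study the filtering was done by Twitter Advanced Search itself. The record fields (`text`, `lang`, `created_at`), the whole-word matching rule, and the sample tweets are all assumptions.

```python
import re
from datetime import datetime, timezone

# Query filter: "HIV" or "AIDS" as whole words, case-insensitive (assumed rule)
KEYWORDS = re.compile(r"\b(hiv|aids)\b", re.IGNORECASE)
# Time filter: November 30 through December 2, 2014, in UTC
WINDOW_START = datetime(2014, 11, 30, tzinfo=timezone.utc)
WINDOW_END = datetime(2014, 12, 3, tzinfo=timezone.utc)  # exclusive upper bound


def matches_filters(tweet):
    """Return True if a tweet record passes the query, language, and time filters."""
    return (
        KEYWORDS.search(tweet["text"]) is not None
        and tweet["lang"] == "en"
        and WINDOW_START <= tweet["created_at"] < WINDOW_END
    )


# Made-up records for illustration only
sample_tweets = [
    {"text": "Get tested for HIV today", "lang": "en",
     "created_at": datetime(2014, 12, 1, 12, 0, tzinfo=timezone.utc)},
    {"text": "This app aids my studying", "lang": "en",
     "created_at": datetime(2014, 12, 1, 9, 30, tzinfo=timezone.utc)},
    {"text": "HIV awareness matters", "lang": "en",
     "created_at": datetime(2014, 12, 3, 0, 0, tzinfo=timezone.utc)},
    {"text": "Journée mondiale contre le HIV", "lang": "fr",
     "created_at": datetime(2014, 12, 1, 8, 0, tzinfo=timezone.utc)},
]
kept = [t["text"] for t in sample_tweets if matches_filters(t)]
print(kept)
```

The second sample tweet passes all three filters although it has nothing to do with HIV/AIDS, which is the kind of keyword-only match that motivates a separate relevance classification step.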
As there is a limit to the number of tweets that can be retrieved using the Twitter search Application Programming Interface (API), a Python script (Supplementary Materials Python script S1) was written to retrieve the search results in March 2015 through web crawling. The contents of the tweets and their unique identification numbers (i.e., tweet IDs) were retrieved, and the Twitter API was used to obtain additional information according to the tweet IDs. Information retrieved included user ID number, username, user’s location (available only if the user permitted), retweet count, message tweeted, time created, mention ID/IDs, and mention username. As geo-location data was only available for a minority of tweets (around 15% according to Liang, Sheng & Fu [23]), the self-reported locations of the Twitter users were used as a proxy. Among the 184,349 tweets, 131,407 had some text in the user’s location field, and for 103,928 of these (56.4% of all 184,349 tweets) the location was identifiable (unpublished analysis).
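The reported 56.4% is computed against the full sample of 184,349 tweets rather than against the tweets with a non-empty location field. The short check below reproduces the arithmetic from the three counts given above; the variable and function names are illustrative only.

```python
# Counts reported in the text
total_tweets = 184_349
with_location_text = 131_407
identifiable_location = 103_928


def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


share_with_text = percent(with_location_text, total_tweets)
share_identifiable = percent(identifiable_location, total_tweets)
share_identifiable_of_text = percent(identifiable_location, with_location_text)

print(share_with_text)             # share of all tweets with location text
print(share_identifiable)          # share of all tweets with identifiable location
print(share_identifiable_of_text)  # share of location-text tweets that were identifiable
```

Running this gives 71.3, 56.4, and 79.1 respectively, so roughly four in five non-empty location fields could be resolved.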
Detailed descriptions of data retrieval are presented in Appendix A.
5. Discussion
This case study serves as a pedagogical demonstration of how public health graduate students could be taught statistical and text mining techniques for analyzing Twitter data. This study explored how contents of manually coded Twitter data pertinent to a specific health topic can be analyzed. In this case, our dataset contained tweets related to HIV/AIDS retrieved around WAD 2014. First, multivariable regression models were applied with the country income level of users as the main explanatory variable to explore differences in contents between users whose self-reported countries belong to different World Bank country income levels. Next, the SVM model was applied as a classifier for relevant and irrelevant tweets.
Our findings suggest a possibility of divergent interests between countries of different levels of economic development with regard to the types of HIV/AIDS-related information being shared on Twitter. These results echo observations that social media users reacted to infectious disease outbreaks differently between different communities or countries [7,8]. However, since location data is self-reported, and country-level income data does not apply to individuals (ecological fallacy), the authors caution against over-interpretation of the results of this exploratory study.
Our supervised machine learning analysis serves as a pilot for the use of SVM models to filter HIV-relevant tweets from irrelevant tweets in a corpus of tweets. The use of SVM models could reduce the amount of time needed to read through tweets and serve as the first sieve to pull relevant tweets for in-depth manual coding that will serve as the basis of qualitative analysis of the textual content of Twitter data. The SVM method described here serves to facilitate, and not replace, qualitative studies that are essential to our understanding of social media contents.
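The “first sieve” idea can be sketched with scikit-learn, a third-party Python library. This is a minimal sketch under several assumptions: the study's actual software, text features, and tuning grid are not stated in this section, so TF-IDF features, a linear SVM, the grid of `C` values, and two-fold cross-validation are illustrative choices, and all tweets and labels below are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up manually coded tweets: 1 = relevant to HIV/AIDS, 0 = irrelevant
train_texts = [
    "Get tested for HIV on World AIDS Day",
    "New HIV treatment guidelines released",
    "Support people living with HIV and AIDS",
    "HIV prevention starts with knowing your status",
    "This app aids my studying every night",
    "Hearing aids are on sale this week",
    "Coffee aids my morning routine",
    "Visual aids for the presentation are ready",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
held_out_texts = ["Free HIV testing downtown", "Study aids for the final exam"]
held_out_labels = [1, 0]

# Text features feed a linear SVM; C is tuned to maximize the F1 score
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC(random_state=0)),
])
search = GridSearchCV(pipeline, {"svm__C": [0.1, 1, 10]}, scoring="f1", cv=2)
search.fit(train_texts, train_labels)

# Evaluate the selected model on held-out, manually coded tweets
held_out_pred = search.predict(held_out_texts)
held_out_f1 = f1_score(held_out_labels, held_out_pred)
print(search.best_params_, held_out_f1)
```

Tweets predicted as relevant would then be passed on for in-depth manual coding, while the held-out F1 score indicates how much the sieve can be trusted.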
5.1. Limitations
The dataset of this study is cross-sectional. Temporal variability of the data and causality between the predictive and outcome variables are beyond its scope. The three-day time frame was intentional to capture the responses to WAD. Hence, our results might not be applicable to other times of the year. Given the search terms, the sample of tweets was predominantly written in the English language, and thus our results cannot be generalized to users who wrote in other languages. Given the sample size for manual coding, the Twitter contents of users from individual countries could not be compared. Self-reported locations might not always accurately reflect the exact geographical location of the users. The geolocation data was not used because the proportion of tweets with geolocation data was low. The limitations of using the World Bank’s categorization of country income levels of Twitter users’ self-reported location as the explanatory variable are acknowledged. The diversity of cultures, healthcare systems, and prevalence of HIV/AIDS across countries (or territories) within the same country income level precluded the authors from drawing inferences between economic development and HIV-related concerns raised by Twitter users. Furthermore, it should be noted that a country’s income level is not an indication of an individual’s income level, and such an inference (which is known as ecological fallacy in epidemiology) should not be made. It is noted that in any country, Twitter users do not represent the general population. Finally, our study did not evaluate the quality (accuracy or reliability) of the information provided in the tweets. Only “mentions” were coded. Evaluation of the quality of HIV/AIDS-related information posted on Twitter is beyond the scope of this study.
5.2. Conclusions
To conclude, a pedagogical demonstration in public health Twitter data analysis using HIV-related tweets around WAD 2014 as a case study was presented. As social media data analysis becomes more mainstream in public health practice, training in big data analytical techniques will become more relevant to the education of our future public health practitioners.