Automatic Categorization of LGBT User Profiles on Twitter with Machine Learning

Abstract: Privacy needs and stigma pose significant barriers to lesbian, gay, bisexual, and transgender (LGBT) people sharing information related to their identities in traditional settings and research methods such as surveys and interviews. Fortunately, social media facilitates people's belonging to and exchanging information within online LGBT communities. Compared to heterosexual respondents, LGBT users are also more likely to have accounts on social media websites and to access social media daily. However, existing LGBT studies on social media are either inefficient or assume that any account that uses LGBT-related words in its profile belongs to an individual who identifies as LGBT. Our human coding of over 16,000 accounts instead proposes the following three categories of LGBT Twitter users: individual, sex worker/porn, and organization. This research develops a machine learning classifier based on the profile and bio features of these Twitter accounts. To make the process efficient and effective, we use a feature selection method to reduce the number of features and improve the classifier's performance. Our approach achieves a promising result with around 88% accuracy. We also develop statistical analyses to compare the three categories based on the average weight of the top features.


Introduction
There are more than 11 million lesbian, gay, bisexual, and transgender (LGBT) adults in the United States (U.S.) [1].
LGBT populations face social stigma and additional discrimination-imposed challenges such as higher rates of HIV and depression than their heterosexual and cisgender peers [2]. To address LGBT issues and provide better service to this community, the first step is identifying those issues. However, traditional surveys and other research approaches such as focus groups are expensive and time-consuming, address a limited set of issues, and yield small-scale data.
Social media has become a mainstream channel of communication and has grown in popularity. Social media facilitates people's belonging to and exchanging information within LGBT communities by allowing users to transcend geographic barriers in online spaces with the limited risk of being "outed" [3]. Compared to heterosexual respondents, LGBT users are more likely to have accounts on social media websites, access social media daily, and make frequent use of the internet [4].
According to a survey, 80% of LGBT Americans use social networking websites, and about four in ten LGBT adults have revealed their sexual orientation or gender identity on

• This paper offers a codebook to manually categorize LGBT users.
• The prediction approach is an important step toward categorizing LGBT users by developing a machine learning classifier.
• Methodologically, our approach can be reused in predicting not only LGBT users but also other minorities.
• While this research uses Twitter data, the proposed approach and features can be adopted for other possible social media platforms.
• The approach of this paper can be used to identify and filter out adult content.
• This research can be used by researchers to understand social media activities and concerns (e.g., health issues) of LGBT individuals.
• This study can also be utilized by researchers to explore the social media strategies of LGBT organizations and identify best practices to promote social good for the LGBT population.

Materials and Methods
The methodology of this paper has five components, including data acquisition, data annotation, classification, evaluation, and statistical analysis (Figure 1).

Data Acquisition
Twitter data were chosen for this project due to the hesitations many LGBT community members have about reporting their identities in official studies or in medical settings. Choosing Twitter data also allows for a broader reach within the community than is possible in a survey approach. Survey or focus-group based research on queer issues may also be heavily siloed, while Twitter data offer a broader view available at scale.
Twitter, a popular American microblogging and social networking platform launched in 2006, allows users to post short messages or "tweets" and interact with other users' tweets by liking or retweeting. Users choose to "follow" other users whose content they wish to view and can restrict who is allowed to follow their own account. Twitter provides a large-scale dataset for classifying LGBT users. This paper categorizes Twitter users who use LGBT-related words in their profiles. We used the followerwonk platform (https://followerwonk.com/bio (accessed on 15 June 2019)) to retrieve Twitter profiles of users in the U.S. and in each state whose bios contained "lesbian", "gay", "bisexual", "bi", "transgender", "trans man", or "trans woman". To focus on active users, we kept only profiles with at least 50 followers and 50 tweets. This process yielded 42,644 profiles. After removing duplicates, 38,978 unique profiles remained.
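The filtering and de-duplication described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Profile` record, its field names, and the helper `select_active_unique` are hypothetical, and the source does not say how duplicates were detected (here, by case-insensitive username).

```python
from dataclasses import dataclass

# Activity thresholds taken from the text: at least 50 followers and 50 tweets.
MIN_FOLLOWERS = 50
MIN_TWEETS = 50


@dataclass(frozen=True)
class Profile:
    """Hypothetical record for one search result (field names are assumptions)."""
    username: str
    bio: str
    followers: int
    tweets: int


def select_active_unique(profiles):
    """Drop repeated usernames, then keep accounts that meet both thresholds."""
    seen = set()
    kept = []
    for profile in profiles:
        key = profile.username.lower()  # Twitter usernames are case-insensitive
        if key in seen:
            continue  # same account returned by more than one search
        seen.add(key)
        if profile.followers >= MIN_FOLLOWERS and profile.tweets >= MIN_TWEETS:
            kept.append(profile)
    return kept
```

Running the per-state and national searches separately is one plausible reason the same account appears more than once, which is why the sketch de-duplicates before applying the thresholds.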
We recognize that the topic of this paper is a sensitive area presenting ethical challenges. To address these challenges, we include a self-reflexivity statement. First, because we use publicly accessible Twitter data without any interaction with the users, our work is exempt from institutional review board (IRB) review. Nevertheless, we took great care in collecting and analyzing the data and in presenting the results so as not to disclose personally identifiable information. Second, to incorporate sensitivity in this paper, some of the coauthors belong to the LGBT community.

Data Annotation
In order to accurately categorize LGBT users, high-quality data from users who self-identify as LGBT in the United States are needed. Previous work in the field has taken a more simplistic approach, gathering profiles by including every profile that mentions LGBT terms. This yields low-quality data because it includes accounts professing support as allies and automated accounts that post primarily pornographic material. This research could be used to automate the process of future classification and could serve as a repository for a number of future academic studies into many other aspects of LGBT social media activities.

The annotation approach and codebook were developed iteratively and in response to both community rhetoric and the intricacies, twists, and unexpected challenges of mining social media data. Using a human-centered approach, a codebook was developed to reflect as much complexity as possible when labeling the accounts of users, while still creating disjoint sets. The final codebook was then applied to all collected user accounts by two coders independently for intercoder reliability.
Electronics 2021, 10, 1822
The two authors independently coded and discussed 500 randomly selected profiles from the 38,978 unique profiles. Due to the nature of the internet and social media at large, searching for profiles with LGBT-related words in Twitter bios returns a fairly high percentage of results with primarily pornographic material, which may or may not be posted by "bots". Organizations were also classified separately, as they do not reflect individual experiences. Discrepancies were addressed by a third coder. The initial coding process offered three categories: individual, porn/sex worker, and organization accounts. Coders needed to answer the following two questions for each account:

Q1: Is the account usable for this research? This yes/no question excluded the following accounts:
• Non-U.S. accounts whose bio information does not show a location in the U.S.;
• Non-English accounts that posted mostly non-English tweets;
• Inactive accounts that have not been active since 2017;
• Private and suspended accounts;
• Automated accounts that posted an unusual number of tweets, retweets, and likes, had a very low ratio of followers to followings, and did not have an image. We also used Botometer (https://botometer.osome.iu.edu/ (accessed on 15 June 2019)) to identify automated accounts [61].

Q2: What is the category of the account? To address this question, coders used the following definitions to assign one of the categories:
• Individual accounts are controlled by a single person.
• Sex Worker/Porn accounts belong to people involved in the production of professional pornography both on and off screen, those engaged in prostitution and escort services, erotic dancers, fetish models, and amateur individuals using webcam sites, amateur porn sites, or pay-gated platforms to profit from self-made content. This category also covers accounts that retweet primarily pornographic material and/or post their own nude photographs or moving images.
• Organization accounts are managed by a group or an organization representing more than one person.
After completing the coding, we applied Cohen's κ to determine the agreement between the two coders. There were substantial agreements for Q1 (κ = 0.7862) and Q2 (κ = 0.7544).
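For readers unfamiliar with the statistic, Cohen's κ compares the observed agreement p_o with the agreement p_e expected by chance from each coder's label frequencies: κ = (p_o − p_e) / (1 − p_e). The following minimal Python sketch shows the calculation; the function name and the toy labels are illustrative assumptions, not the study's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the coders' marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Applied to the two coders' answers for Q1 and for Q2 separately, a function like this would yield one κ value per question, as reported above.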

Classification
Our next goal is to infer the category of the collected Twitter users automatically. We draw on Twitter account information to build a machine learning classifier. This paper follows the automated framework in Figure 2 to categorize LGBT users on Twitter.

This step includes developing algorithms to assign a set of users $U = \{u_1, u_2, \ldots, u_k\}$ to known classes. The classification can be described as the prediction of the category of each user ($u_i$). The classifier algorithm ($a$) assigns a class ($c$) to each user in Equation (1):

$$a(u_i) = c, \quad c \in \{c_1, c_2, \ldots, c_m\} \tag{1}$$

In this research, there are three classes ($m = 3$): individual, sex worker/porn, and organization. To classify each user, the input of each classifier is a set of $n$ features, $\{f_1, f_2, \ldots, f_n\}$. This research examines the following two types of features: bio and profile features.

To predict the category of each user, we use the following two main approaches [62]: traditional methods, including NaiveBayes, BayesNet, Random Forest, J48, and Support Vector Machines (SVM), and deep learning using a Convolutional Neural Network (CNN). These methods are among the high-performance classification algorithms [63][64][65][66][67][68]. CNN is one of the popular deep learning methods and has been used for different classification tasks [62,[69][70][71]. The rest of the classifiers are traditional machine learning methods used for a wide range of applications such as spam detection [72,73] and document classification [74].

We transform the information of Twitter accounts into a set of features. The focus of this study is on the features displayed on Twitter accounts. These features illustrate information about users and their activities.
Table 1 shows the definition of Twitter terms.

| Term | Definition |
|---|---|
| Account's Age | The length of time since a Twitter account was created. |
| Bio | A short summary (up to 160 characters) about a user in their profile. |
| Like (Favorite) | Showing appreciation of a tweet by clicking on the like tab. |
| Followers | Twitter accounts that follow updates of a Twitter account. |
| Followings | Twitter accounts that are followed by a Twitter account. |
| Screen Name | The name displayed in the profile to show a personal identifier. |
| Tweet | A status update of a user containing up to 280 characters. |
| Username | The name to help identify a user using @, such as @TheEllenShow. |
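To make the terms in Table 1 concrete, the sketch below turns one account into profile and bio features of the kind this paper reports among its top features (for example #tweets/year, #followers/#followings, the username's length, and word frequencies in the bio). It is a hedged illustration: the dictionary keys, the function names, the tokenizer, and the choice to encode the letter g in the screen name as a 0/1 flag are all assumptions, since the source does not specify them.

```python
import re
from datetime import date

DAYS_PER_YEAR = 365.25


def profile_features(account, today):
    """Structured features from a hypothetical account dictionary."""
    # Guard against a zero-day age for accounts created on the snapshot day.
    age_years = max((today - account["created"]).days, 1) / DAYS_PER_YEAR
    followings = account["followings"]
    return {
        "account_age_years": age_years,
        "tweets": account["tweets"],
        "tweets_per_year": account["tweets"] / age_years,
        "likes_per_year": account["likes"] / age_years,
        "followers": account["followers"],
        "followings": followings,
        "followers_per_following": account["followers"] / followings if followings else 0.0,
        "username_length": len(account["username"]),
        # Assumption: presence flag; the source only names "the letter g".
        "screen_name_has_g": int("g" in account["screen_name"].lower()),
    }


def bio_features(bio, vocabulary):
    """Word count plus unigram frequencies for a fixed vocabulary."""
    words = re.findall(r"[a-z0-9']+", bio.lower())
    features = {"bio_word_count": len(words)}
    for term in vocabulary:
        features["bio_" + term] = words.count(term)
    return features
```

Concatenating the two dictionaries for each account gives one feature vector per user, which is the input a classifier would consume.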
This paper uses features of the LGBT Twitter accounts to build a feature vector for each account; the features are briefly described below. This study uses the χ² value, which is one of the effective feature selection methods [75], to measure the discriminative power of features for ranking the impact of the

Evaluation
We examine the performance of the six algorithms to find which classifier performs best with the bio and profile features. To evaluate the classifiers, we use measures based on the confusion matrix. The following confusion matrix represents a binary classification example that can be extended to more than two categories:

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |

TP and TN are the correctly identified and correctly rejected cases, respectively, while FP and FN are the incorrectly identified and incorrectly rejected cases, respectively. We utilized precision (P), recall (R), the area under the ROC curve (AUC), and accuracy (ACC) based on the following definitions:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

The ROC curve shows the trade-off between FP and TP by plotting the FP rate on the X-axis and the TP rate on the Y-axis; a curve closer to the upper left corner indicates better performance. Then, we computed the chi-square to rank and find the top features. To determine the category of each user, we applied the six classification algorithms using 5-fold cross-validation, in which the data are split into five subsets and the holdout method is repeated five times. Each time, one of the five subsets is used as the test set, and the other four subsets are used as the training set.
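The evaluation measures and the 5-fold procedure can be sketched as follows. This is a simplified illustration rather than the authors' setup: the function names are hypothetical, and the folds are formed by striding over indices without shuffling or stratifying by class, details the source does not describe.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, and accuracy from the four confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists so each item is tested exactly once."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for i, test in enumerate(folds):
        # The training set is the union of the other k - 1 folds.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

For a three-class problem such as this one, the binary measures are typically computed per class (one class against the rest) and then averaged; which averaging the study used is not stated in the text.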

Statistical Analysis
To compare individual, sex worker/porn, and organization accounts based on the mean value of the top features identified in the previous step, we utilized an analysis of variance (ANOVA), which tests whether the weight of a feature differs across the three account types. We used the value of each top feature as the dependent variable. Where we found a significant difference (p-value ≤ 0.05), we used Tukey's multiple comparison test [76] to find which of the means differ significantly from the others. To adjust for multiple comparisons, we used the false discovery rate (FDR) method [77], which reduces not only false positives but also false negatives [78]. We also computed the absolute effect size using Cohen's d to identify the magnitude of the differences. We used the following classification index to interpret effect sizes: very small (d = 0.01), small (d = 0.2), medium (d = 0.5), large (d = 0.8), very large (d = 1.2), and huge (d = 2.0) [79].
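As an illustration of the effect-size step, the sketch below computes the absolute Cohen's d for two groups and maps it to the labels listed above. Several points are assumptions rather than facts from the source: the pooled standard deviation is used as the denominator, each listed d value is treated as the lower bound of its label, and the "negligible" label for values below 0.01 is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# Cutoffs from the text, read here as lower bounds (an assumption).
THRESHOLDS = [
    (2.0, "huge"),
    (1.2, "very large"),
    (0.8, "large"),
    (0.5, "medium"),
    (0.2, "small"),
    (0.01, "very small"),
]


def cohens_d(group_a, group_b):
    """Absolute Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when both groups have zero variance")
    return abs(mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def effect_label(d):
    """Map an absolute effect size to the interpretation labels."""
    for cutoff, label in THRESHOLDS:
        if d >= cutoff:
            return label
    return "negligible"  # hypothetical label for d below 0.01
```

In the study's setting, `group_a` and `group_b` would hold one feature's values for two account types, for example #tweets for individual and organization accounts.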

Results
The manual coding process yielded 16,241 users, including 12,488 (76.89%) individual, 2282 (14.05%) sex worker/porn, and 1471 (9.06%) organization accounts. In total, we obtained 1369 features. We tested the performance of the six classifiers in Weka (https://www.cs.waikato.ac.nz/ml/weka/ (accessed on 15 April 2021)) with 5-fold cross-validation. To ensure comparability between the classifiers, we used the standard parameters. Out of the six classifiers, BayesNet produced higher accuracy and AUC than the rest of the algorithms (Figure 3). The BayesNet algorithm performed significantly better than the baseline accuracy of 0.7689, which was obtained with the ZeroR algorithm; ZeroR relies only on the target and ignores all predictors.
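The baseline figure follows directly from the class counts: a majority-class predictor is correct exactly as often as the largest class occurs. A minimal Python sketch of that arithmetic is shown below; the function name is illustrative, and it reproduces the idea behind ZeroR rather than Weka's implementation.

```python
from collections import Counter


def zero_r_accuracy(labels):
    """Return the majority label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Class counts reported in the text.
labels = (
    ["individual"] * 12488
    + ["sex worker/porn"] * 2282
    + ["organization"] * 1471
)
label, accuracy = zero_r_accuracy(labels)
print(label, round(accuracy, 4))  # individual 0.7689
```

Any classifier worth using on this dataset therefore has to exceed roughly 77% accuracy, which is why accuracy alone can flatter results on imbalanced classes and why AUC is reported alongside it.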

Figure 3. Classification performance of six algorithms using 1369 features.

We found that finding the optimum number of features can improve the classification performance, which offers a time-saving and cost-efficient system. Therefore, we examined different numbers of features. The optimum number of features was 399 (Figure 4).

Table 2 shows the accuracy of the BayesNet algorithm using the profile, bio, profile and bio, and top 399 profile and bio features. This table had three outcomes. First, either the profile or the bio features alone could identify the three classes with more than 80% accuracy. Second, the combination of profile and bio features improved the performance of the classifier. Third, reducing the number of features enhanced the accuracy of BayesNet.

Table 3 summarizes the performance metrics of NaiveBayes with 399 features, where we found that the classifier was reasonably stable (SD ≤ 0.006 and CV ≤ 0.01). CV represents the coefficient of variation, measured as $\frac{\text{Standard Deviation}}{\text{Mean}}$.

Table 4 shows the top 20 features that assisted in classifying users. Out of the top 20 features, 9 (45%) and 11 (55%) features were related to the profile and bio categories, respectively. The bio features include the #words in the bio and the frequency of the following words in the bio of Twitter accounts: bisexual, transgender, community, nsfw, porn, organization, LGBT, allies, event, and men. Among these words, nsfw and porn are used more by sex worker/porn (SWP) accounts, and the rest of the words are used more by organization (Org) accounts. The rest of the top 20 features are related to the profile category, including the #likes/year, the #followers/#followings, the account's age, the #tweets/year, the #tweets, the letter g in the screen name, the #followers, the username's length, and the #followings. Our statistical analysis shows that there were 46 (out of 60) significant differences. For example, the number of tweets is higher for individual (Ind) accounts than for SWP and Org accounts (Table 4). Our main findings are as follows:

• There was no significant difference (NS) between SWP and Org accounts across three features: #followers/#followings, #followers, and #followings. Compared to Org accounts, SWP accounts had a higher #likes/year, #tweets/year, and #tweets and used nsfw (not safe for work) and pornographic words in their bio more. The value of the rest of the features was higher for Org accounts than for SWP ones. In sum, we found three NS, five SWP > Org, and twelve SWP < Org comparisons.
• There was no significant difference between Ind and Org accounts across two features: #followings and nsfw. Compared to Org accounts, Ind accounts had a higher #likes/year, #tweets/year, and #tweets. The value of the rest of the features was higher for Org accounts than for Ind ones. In total, this research identified two NS, three Ind > Org, and fifteen Ind < Org comparisons.
• There was no significant difference between Ind and SWP accounts across nine features, including #likes/year, #tweets/year, the length of the username, and the use of the words bisexual, transgender, community, organization, and allies in the bio. Compared to SWP accounts, Ind accounts had a higher account age and #tweets and used the acronym LGBT more. The value of the rest of the features was higher for SWP accounts than for Ind ones. In total, this research identified nine NS, three Ind > SWP, and eight Ind < SWP comparisons.
• There is a significant difference between all three categories based on the following features: the account's age; the number of tweets; the use of the words porn, LGBT, and men in the bio; the letter g in the screen name; and the number of words in the bio.
• The effect size analysis illustrates that the 46 significant differences were not trivial, including 6 very small, 18 small, 14 medium, 6 large, and 2 very large effect sizes (Table 5). The maximum difference was between individual and organization accounts, with 18 (90%) significant differences, and the minimum difference was between individual and sex worker/porn accounts, with 11 (55%) significant differences out of 20 comparisons. The effect size analysis also confirmed that the magnitude of the significant differences is considerable.
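The χ² ranking that produces a top-feature list such as Table 4 can be illustrated with a short Python sketch. It computes the Pearson χ² statistic between one categorical feature and the class label, then keeps the highest-scoring features. The function names and the toy inputs are assumptions, and the sketch only handles categorical or already binned values; the source does not say how numeric features were discretized before scoring.

```python
from collections import Counter


def chi_square(feature_values, labels):
    """Pearson chi-square statistic for a categorical feature against classes."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    class_totals = Counter(labels)
    statistic = 0.0
    for value, value_total in feature_totals.items():
        for cls, class_total in class_totals.items():
            # Expected count if the feature and the class were independent.
            expected = value_total * class_total / n
            statistic += (joint[(value, cls)] - expected) ** 2 / expected
    return statistic


def rank_features(columns, labels, top_k):
    """Return the names of the top_k features by chi-square score."""
    scores = {name: chi_square(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Sweeping `top_k` over a range of values and re-running cross-validation at each setting is one way to locate an optimum such as the 399 features reported here.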

Discussion
This research is unique in that it provides a prediction framework including an automatic classifier, a feature selection approach, and evaluation measures. Our experiments were designed to categorize LGBT users based on different sets of features and categories and to identify features that may improve the efficiency and effectiveness of the prediction. Our proposed model uses BayesNet to learn from feature vectors and the χ² value to identify the optimal subset of features. Our proposed model outperformed the baseline on classifying LGBT accounts. We are now able to identify individual, sex worker/porn, and organization accounts with around 88% accuracy. The evaluation shows that the performance of our classifier is better than the baseline accuracy (76.89%) obtained with ZeroR, which assigns every user to the largest class, individual users in this study. While even a small gain over the baseline could be significant, our classifier improves on the baseline by more than 10 percentage points.
While using profile or bio features independently can provide a significant improvement over the baseline performance, combining profile and bio features and reducing the number of features is more helpful in classifying LGBT accounts. The accuracy of our classifier improves when both profile and bio features are used. While there are fewer profile features (81) than bio features (1288), most of the top 50 features are related to profile information, indicating that profile information containing structured features plays an important role in classifying LGBT accounts. In addition, words in the bio of Ind, SWP, and Org accounts can be a good indicator for categorizing LGBT users.
Our results suggest that profile information, words in the bio, and characters of the username and screen name can help to predict the category of LGBT users. For example, it is not surprising that Org and SWP accounts have more followers than Ind accounts, because they have more fans. However, it is interesting that Org accounts used the like icon and tweeted less than Ind and SWP accounts, which suggests that Org accounts are cautious in posting social comments and showing their interests. The reason behind this strategy could be that a single unfortunate post can have a significant negative impact on an organization [80]. Ind and SWP accounts do not have this limitation and can be more active than Org accounts.
Org accounts use the word community in the bio more than Ind and SWP accounts, which suggests that Org accounts are more interested in emphasizing their role in the community. Org accounts are older than the other two types, which indicates that organizations have been active on social media for more years. SWP accounts resemble Org accounts on some features. For example, Org and SWP accounts have more followers and followings than Ind accounts. Org and SWP accounts also use more words in their bio than Ind accounts to introduce their services and provide more information for customers.
The comparisons of Ind vs. SWP, Ind vs. Org, and SWP vs. Org illustrate that the minimum difference is between Ind and SWP accounts, indicating that SWP accounts are more similar to Ind accounts than to Org ones. It seems that the strategy of SWP accounts is to behave similarly to Ind accounts. Therefore, distinguishing between Ind and SWP accounts is a complicated task, whereas identifying Org accounts is less complicated. Although it is difficult to separate Ind and SWP accounts, there are features (e.g., using nsfw in the bio) that assist in finding SWP accounts.
This research provides significant contributions. First, while other research developed binary classifiers, this paper offers a multi-class classifier to categorize users. For example, one study identifies only individual and organization users [81]. Second, this study illustrates that the traditional machine learning methods used in this research offer better performance than deep learning using CNN for categorizing LGBT users using bio and profile features. Our data size is not very large. Therefore, this finding is in line with the current literature, which indicates that deep learning methods do not provide a significant performance gain over traditional methods if the size of a dataset is small or medium [82]. Third, the proposed approach is effective in utilizing bio and profile features to identify Ind, SWP, and Org accounts. Fourth, this paper identifies and uses features that can be reused for similar purposes. Fifth, the proposed approach is flexible enough to incorporate not only bio and profile features but also other features (e.g., semantics of tweets), to use other machine learning methods, and to be applied on other social media platforms. Sixth, this research is beneficial for researchers who are interested in categorizing LGBT users for social media analysis purposes. For instance, our work can be used by public health experts to identify LGBT individuals and study their information behavior on social media, by social media and marketing companies and application developers to filter out adult content, and by social science and business experts to study LGBT organizations. We believe our work bears the potential to help understand the needs of LGBT individuals on social media and to develop interventions that address those needs.

While our study contributes to LGBT studies in social media and opens a new direction for future research, this study bears certain limitations. First, we limit our features to bio and profile features. Second, this study is limited to LGBT users who live in the U.S. and post tweets in English. Third, our data collection was limited to lesbian, gay, bisexual, and transgender users, indicating that we might miss other possible relevant data.
Despite the limitations, our findings can provide new insights into types of LGBT users and their social media activities. Future research will need to go beyond unigrams and consider n-grams (e.g., bigrams), linguistic features (e.g., verbs), the semantic meanings of words (e.g., themes), and global or local weighting methods. By incorporating these features in our prediction framework, we hope to achieve a higher prediction level.

Conclusions
Twitter is a popular platform for obtaining and analyzing publicly available social media data. This platform has been used by researchers studying LGBT issues such as health. However, not all LGBT users are individual users. This research proposes a framework to categorize LGBT users on Twitter. Specifically, we obtained features of Twitter accounts and developed an automated classifier with around 88% accuracy for categorizing LGBT users by type: individual, sex worker/porn, and organization. Our experiments were based on analyzing more than 16,000 Twitter accounts and showed that different types of LGBT accounts have distinct characteristics in their Twitter accounts, which assists in developing robust classifiers.
This research classifies LGBT users in three classes and explores several classification methods to identify the best classifier. Future work can address the limitations of this study, identify new features, develop classifiers with other machine learning techniques, and extend this work to other possible areas.

Funding: This research was partially supported by the Big Data Health Science Center (BDHSC) at the University of South Carolina. All opinions, findings, conclusions, and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agency.

Conflicts of Interest: The authors declare no conflict of interest.