Demographics and Personality Discovery on Social Media: A Machine Learning Approach

Abstract: This research proposes a new feature extraction algorithm using aggregated user engagements on social media in order to achieve demographics and personality discovery tasks. Our proposed framework can discover seven essential attributes, including gender identity, age group, residential area, education level, political affiliation, religious belief, and personality type. Multiple feature sets are developed, including comment text, community activity, and hybrid features. Various machine learning algorithms are explored, such as support vector machines, random forest, multi-layer perceptron, and naïve Bayes. An empirical analysis is performed on various aspects, including correctness, robustness, training time, and the class imbalance problem. We obtained the highest prediction performance by using our proposed feature extraction algorithm. The result on personality type prediction was 87.18%. For the demographic attribute prediction task, our feature sets also outperformed the baseline at 98.1% for residential area, 94.7% for education level, 92.1% for gender identity, 91.5% for political affiliation, 60.6% for religious belief, and 52.0% for the age group. Moreover, this paper provides a guideline for the choice of classifiers with appropriate feature sets.


Introduction
User demographic attributes and personality type (collectively called "private attributes") can be applied in several domains, for example, hate speech detection [1] and product recommendation [2] using additional demographic data. The ability to identify personality is useful for better understanding ourselves and others. For instance, we can choose an appropriate field of study that fits our personality or apply for a job that best fits our preferences. On the other hand, it can also be applied by recruiters to find appropriate applicants that fit the job description [3]. Persuasive mass communication is another benefit of personality discovery. It aims at encouraging large groups of people to believe and act on the communicator's viewpoint. It is used by governments to encourage healthy behaviors, by marketers to acquire and retain consumers, and by political parties to mobilize the voting population [4].
Myers-Briggs Type Indicator (MBTI) [5] is a well-established personality model that describes the characteristics of an individual using four dichotomous attributes: (1) Main focus or favorite world: people who prefer to focus on the outer world have the extraversion characteristic (E); those who prefer their inner world have the introversion characteristic (I). (2) The way people process information: those who prefer to focus on basic information have the sensing characteristic (S); those who prefer to interpret and add meaning, or who seek creative solutions to problems, have the intuition characteristic (N). (3) Decision-making method: those who use logic or fairness in making a decision have the thinking characteristic (T); those who decide by first looking at the people and circumstances have the feeling characteristic (F). (4) The way people deal with the outside world: those who prefer to be decisive and well organized have the judging characteristic (J); those who are flexible and willing to stay open to new information or options have the perceiving characteristic (P). Combining the four attributes yields sixteen personality types: ENFJ, ENFP, ENTJ, ENTP, ESFJ, ESFP, ESTJ, ESTP, INFJ, INFP, INTJ, INTP, ISFJ, ISFP, ISTJ, and ISTP. This model has been widely used in many practical applications despite concerns about its validity and reliability [6]. Additionally, it has been shown that MBTI attributes can be correlated with those from the Big Five model [7,8].
To discover their personality types, users are required to explicitly provide their information to the MBTI instrument by filling out a multiple-choice questionnaire. The online assessment is time-consuming and may lead to a response bias problem. Previous studies have investigated the potential of machine learning algorithms in demographic and personality attribute classification. On Facebook, the features can be extracted from both texts (e.g., posts and comments) and activity (e.g., likes). Kosinski et al. [9] extracted users' liked pages as feature sets for predicting private attributes and personality traits. The singular value decomposition technique was deployed to solve the dimensionality problem by reducing the dimension of the user-like matrix before training with logistic and linear regression. On Twitter, the textual information and network relations (e.g., followings and followers) are applied as feature sets. For example, Aletras and Chamberlain [10] created a graph that represents user relations to predict socioeconomic attributes. They suggested that the combination of textual features and graph embeddings provides a significant improvement over the use of either alone. On Instagram, the features are extracted from images and sometimes from other information, such as likes. For example, Ferwerda and Tkalcic [11] proposed the use of both visual and content features extracted from pictures to predict the user's personality type. On Reddit, the features are mainly based on text content (e.g., posts and comments). Gjurković and Šnajder [12] proposed the use of text in the user's posts, comments, and other metadata for predicting MBTI personality types. Multilayer perceptron and logistic regression algorithms were applied and obtained a macro-averaged F1 score of 76%.
We analyzed the concept of MBTI and found that the personality type model is related to social behaviors. Therefore, it is in our interest to explore the possibility of social media contributing to personality prediction. Reddit, a well-known social media platform, was our main focus since its users are organized as members of communities (called "subreddits"). Each text post is attached with user information. Most Reddit users are anonymous. However, some users describe themselves by publishing short tags called "author flairs" next to their names. Our contributions are as follows.

1. We propose methods for extracting demographic and personality attributes from Reddit users using author flairs.
2. Multiple feature sets are also proposed and explored by machine learning algorithms to find the best-performing combinations.
3. To validate our experimental results, processed author flairs are applied as ground truth for the training and testing process.

Experimental Data
This section introduces the characteristics of Reddit posts. Figure 1 illustrates a Reddit post in the community named "datingoverthirty". Author flairs are community-specific descriptors that some users apply to describe themselves to other members of the communities ("subreddits"). Figure 2 visualizes the elements of a text post, which consists of the author's name attached with a short tag called the author flair. We can see that the author states her gender and age in the author flair as "♀36", which means that she is a 36-year-old woman. However, author flair is not required by Reddit; hence, most users are anonymous to the system. We also found that, unlike Facebook, Reddit does not provide summarized user profiles.
We obtained comments made in August, September, and October of 2018 from the Pushshift website, which maintains a publicly accessible database of various Reddit data, including submissions and comments. The obtained data contains 300,877,224 comments from 8,131,714 users in 177,116 communities. Each comment includes an author's name, author flair, community name, and text body. Note that we respect user privacy; therefore, data anonymization was performed on user identity before the experiment setup.
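To make the record structure concrete, the following is a minimal sketch of how one comment could be reduced to the four fields listed above and pseudonymized. The field names (`author`, `author_flair_text`, `subreddit`, `body`) follow the public Reddit comment JSON; the salted SHA-256 scheme, the `SALT` value, and the helper names are assumptions made for illustration, since the text does not state how anonymization was carried out.

```python
import hashlib
import json

# Hypothetical salt; the anonymization scheme is an assumption, not the authors' procedure.
SALT = "replace-with-a-secret-salt"


def pseudonymize(author, salt=SALT):
    """Map a user name to a stable identifier that does not reveal the name."""
    return hashlib.sha256((salt + author).encode("utf-8")).hexdigest()[:16]


def parse_comment(line):
    """Keep only the four fields used in the study from one JSON-encoded comment."""
    record = json.loads(line)
    return {
        "author": pseudonymize(record["author"]),
        "flair": record.get("author_flair_text"),
        "community": record["subreddit"],
        "body": record["body"],
    }


# A made-up record in the same shape as a Reddit comment.
sample = json.dumps({
    "author": "example_user",
    "author_flair_text": "♀36",
    "subreddit": "datingoverthirty",
    "body": "Hello there!",
})
row = parse_comment(sample)
print(row["community"], row["flair"])
```

Because the mapping is deterministic, all comments by the same user still share one identifier, which is what later per-user aggregation needs.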

Framework Overview
There are three main processes in our proposed framework, which are: (1) private attributes extraction, (2) feature extraction, and (3) private attribute prediction. The private attribute extraction takes in the user description dataset and outputs private attribute datasets. The feature extraction takes in the comment text dataset and outputs the extracted feature sets. The private attribute prediction takes in the feature sets and the attributes for the classification. Figure 3 illustrates our framework. We conducted our experiments using 64-bit Python 3.6 on a Linux system with Intel Xeon Gold 6130 and 250 GB of memory.

Data Preprocessing
Given the Reddit comment dataset, the framework generates three new datasets, which are: (a) preprocessed comments, (b) user's aggregated comments dataset, and (c) user description dataset. Note that datasets (a) and (b) are used for feature extraction, whereas dataset (c) is used for private attribute extraction.
To pre-process the user's comments, we first converted the body text to lowercase; tokenization was then performed by replacing URLs, user names, community names, HTML characters, elongated words, and numbers with replacement tokens. The token "xxeos" marks the end of a comment and therefore the end of the sliding window for n-gram feature extraction. Table 1 depicts the replacement tokens and their descriptions. We then expanded word contractions and removed unwanted punctuation and extra white spaces in the comment's body. Table 2 shows a sample of the pre-processed comments. Finally, the user's engagements on Reddit were obtained by aggregating all comments posted by each user as a document. A sample of the pre-processed comment text dataset is shown in Table 3.
The user description dataset contains the author's name, community name, flair CSS class, and flair text. We then removed the duplicated descriptions for each author-community pair. Table 4 shows a sample of the preprocessed user description dataset.

Gender Identity
For gender identity, we looked for users who identify themselves as male or female in gender-related communities and found multiple patterns representing male and female values, for example, users with a "male" or "female" flair class, or flair text matching these regular expression patterns: "[♂♀] ?\d{2}" and "\d{2} [MF].+", such as "♂34" (34-year-old male) and "23/F/5'10" (23-year-old female with a height of 5 ft 10 in). We post-processed all variations into uniform values of "male" and "female". We performed random under-sampling on the male class to reduce the size of the dataset due to hardware limitations in our experiment.
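As an illustration of the flair normalization described above, the sketch below maps a flair string to "male", "female", or no label. The two regular expressions are copied from the text as printed (a capture group is added to the second one for convenience); anchoring them at the start of the flair and the function name `gender_from_flair` are assumptions. Note that, as printed, the second pattern expects a space between the age and the letter, so the separators used in the original implementation may differ.

```python
import re

# Patterns as printed in the text; matching from the start of the flair is an assumption.
SYMBOL_PATTERN = re.compile(r"[♂♀] ?\d{2}")
LETTER_PATTERN = re.compile(r"\d{2} ([MF]).+")


def gender_from_flair(flair_text):
    """Return 'male', 'female', or None for a single flair string."""
    text = flair_text.strip()
    lowered = text.lower()
    if lowered in ("male", "female"):
        return lowered
    if SYMBOL_PATTERN.match(text):
        return "male" if text[0] == "♂" else "female"
    match = LETTER_PATTERN.match(text)
    if match:
        return "male" if match.group(1) == "M" else "female"
    return None


for flair in ["♂34", "♀36", "28 M London", "moderator"]:
    print(flair, "->", gender_from_flair(flair))
```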

Age Group
For age groups, we looked for users who specified themselves with two-digit ages in their descriptions, which happened to be the same patterns as gender identity, in age-related communities. We excluded users at the age of sixty-five and above because they were virtually non-existent on the website. The age attributes were segmented into four classes often used in demographic targeting, including teenagers (15-19), young adults (20-34), younger middle-aged (35-49), and older middle-aged (50-64). For the teenage group, we also looked for users who specified their age with the "^(\d{2})$" flair pattern. We performed random under-sampling on the teenager class.
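The segmentation above amounts to a simple lookup. A minimal sketch follows, using the four age ranges and the numeric flair pattern from the text; the label strings and function names are chosen here for illustration only.

```python
import re

# Age ranges from the text; the label strings are illustrative.
AGE_GROUPS = [
    ("teenager", 15, 19),
    ("young adult", 20, 34),
    ("younger middle-aged", 35, 49),
    ("older middle-aged", 50, 64),
]
NUMERIC_FLAIR = re.compile(r"^(\d{2})$")


def age_group(age):
    """Return the class label for an age, or None if it falls outside 15-64."""
    for label, low, high in AGE_GROUPS:
        if low <= age <= high:
            return label
    return None


def age_from_numeric_flair(flair_text):
    """Parse a flair that consists of exactly two digits, e.g. '17'."""
    match = NUMERIC_FLAIR.match(flair_text)
    return int(match.group(1)) if match else None


print(age_group(36))
print(age_group(age_from_numeric_flair("17")))
```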

Residential Area
To extract the user's residential areas, we focused on national and continental communities. We segmented them into eight regions, including North American, European, South American, South Asian, Southeast Asian, East Asian, Middle Eastern, and African. We performed random under-sampling on the European class to match the North American class to reduce the size of the dataset.

Education Level
For education level, we focused on three groups: high school, undergraduate, and graduate. Hence, we looked for degree names or fields of study in communities related to education. However, we found a low number of users for the high school class; therefore, we used numeric age descriptions in the "teenagers" community, which is the largest high school community on the website, as additional information.

Political Affiliation
For political affiliation, there are a lot of factions both in the real world and on the website, for example, socialist, center-left, libertarian, and far-right. We only focused on liberals and conservatives, which are the biggest and clearest political groups on Reddit. We do this by looking for users with liberal and conservative flairs in mostly North American political communities.

Religious Belief
For religious beliefs, similar to political affiliation, there are a lot of religious factions. We looked for the six biggest beliefs in the world in communities discussing religious topics. These are Atheist, Christian, Muslim, Jewish, Buddhist, and Hindu.

Personality Type
For personality type, the framework searched for users whose author flairs were one of the sixteen MBTI personality types, using the "^([EI][SN][TF][JP])$" pattern, in communities discussing MBTI personality types. Table 5 shows the lists of communities used in the extraction.
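A short sketch of this extraction step is given below. The pattern is the one quoted above; splitting the matched type into the four binary labels (E/I, S/N, T/F, J/P) mirrors the per-dichotomy datasets reported in the results, while the function name and the dictionary layout are assumptions for illustration.

```python
import itertools
import re

MBTI_FLAIR = re.compile(r"^([EI][SN][TF][JP])$")
DICHOTOMIES = ("E/I", "S/N", "T/F", "J/P")


def mbti_labels(flair_text):
    """Return the 16-class label plus one label per dichotomy, or None if no match."""
    match = MBTI_FLAIR.match(flair_text)
    if match is None:
        return None
    mbti = match.group(1)
    labels = {"Per.": mbti}
    labels.update(zip(DICHOTOMIES, mbti))
    return labels


# Every combination of the four letters is accepted, giving sixteen types.
all_types = ["".join(p) for p in itertools.product("EI", "SN", "TF", "JP")]
print(len([t for t in all_types if mbti_labels(t)]))
print(mbti_labels("INTJ"))
```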

Feature Extraction
We began by performing feature extraction, followed by feature selection. After that, we experimented with multiple classification algorithms and a couple of techniques to address the imbalance problem. Then, we evaluated the performance of each approach.

Human-Designed Features
We used Linguistic Inquiry and Word Count (LIWC), introduced by Tausczik and Pennebaker [13], in our experiment for the human-designed features. These are predefined categories of words that can be created as a frequency vector for a document. We also experimented with the term frequency-inverse document frequency (tf-idf) version of LIWC.

Bag-of-Words (BoW) Features
This is a text representation model that considers the term occurrences in the aggregated comments. We used uni-grams and bi-grams for bag-of-words features. We also experimented with both stemmed and non-stemmed words for the n-grams. Finally, we calculated tf-idf, then selected the best 20,000 n-grams based on their ANOVA F-values.
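The sketch below shows one way to count uni-grams and bi-grams over a user's aggregated comments. It assumes that the "xxeos" token separates comments inside the aggregated document and that n-grams are not allowed to span two comments; the function names are hypothetical, and tf-idf weighting and F-value selection are left out.

```python
from collections import Counter

EOS = "xxeos"  # end-of-comment token from the preprocessing step


def split_comments(document):
    """Split an aggregated document into token lists, one per comment."""
    comments, current = [], []
    for token in document.split():
        if token == EOS:
            if current:
                comments.append(current)
            current = []
        else:
            current.append(token)
    if current:
        comments.append(current)
    return comments


def ngram_counts(document, max_n=2):
    """Count all n-grams up to max_n without crossing comment boundaries."""
    counts = Counter()
    for tokens in split_comments(document):
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


doc = "i like tea xxeos tea is great xxeos"
counts = ngram_counts(doc)
print(counts["tea"], counts["like tea"], counts["tea tea"])
```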

Community Activity (CA) Features
These features indicate user engagement in the communities on the website, inferred from the number of comments made in each community. Let $f_{c,u}$ be the number of comments posted by user $u$ in community $c$, and let $C$ be the set of communities in the dataset. The community activity features of user $u$, denoted as $CA_u$, can be described as follows:
$$CA_u = \{ f_{c,u} \text{ for } c \in C \}$$
From our statistical analysis, we found that users commented in 82 communities at the 95th percentile. Hence, for each private attribute, we also created a feature set of the best 100 communities based on their ANOVA F-values from 53,966 communities.
Nevertheless, $CA_u$ only represents user interests; therefore, we also experimented with a weighted feature set, denoted as $CA\_Wtg_u$, that accounts for how common each community is among the other users in the dataset, following the same concept as tf-idf. Let $U$ be the set of users in the dataset. The weighted community activity of user $u$ can be described as follows:
$$CA\_Wtg_u = \left\{ f_{c,u} \times \log \frac{|U|}{|\{u \in U : c \in u\}|} \text{ for } c \in C \right\}$$
Algorithm 1 shows the feature extraction and selection algorithm for the features. UseWeighted is a Boolean parameter indicating whether to transform into a weighted vector or not. SelectKBest is a Boolean parameter indicating whether to perform feature selection or not. K is an integer parameter indicating the number of desired features. The time complexity of this algorithm is O(n) with n as the number of comments in PreprocessedComments.

Hybrid Features (HF)
We created a combination of bag-of-words and community activity as hybrid features. We also experimented with the addition of the human-designed features and a version with 10,000 features to study the robustness of the features. Algorithm 2 shows the proposed feature extraction and selection for the features. UseLIWC is a Boolean parameter indicating whether to add LIWC features to the vector. The time complexity of this algorithm is O(nmk) where n is the number of users, m is the number of features in NgramFeatures, and k is the number of features in ActivityFeatures.
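The community activity features and their weighted variant defined in this section can be sketched in a few lines. This is an illustrative implementation, not Algorithm 1 itself: the input format (one `(user, community)` pair per comment), the function names, the sparse dictionary representation, and the use of the natural logarithm are all assumptions.

```python
import math
from collections import Counter, defaultdict


def community_activity(comments):
    """Count comments per user and community: activity[u][c] = f_{c,u}."""
    activity = defaultdict(Counter)
    for user, community in comments:
        activity[user][community] += 1
    return activity


def weighted_activity(activity):
    """Scale each count by log(|U| / number of users active in that community)."""
    n_users = len(activity)
    users_per_community = Counter()
    for counts in activity.values():
        users_per_community.update(counts.keys())
    return {
        user: {
            c: f * math.log(n_users / users_per_community[c])
            for c, f in counts.items()
        }
        for user, counts in activity.items()
    }


comments = [
    ("u1", "mbti"), ("u1", "mbti"), ("u1", "gaming"),
    ("u2", "gaming"),
    ("u3", "gaming"), ("u3", "worldnews"),
]
activity = community_activity(comments)
weighted = weighted_activity(activity)
print(weighted["u1"])
```

In this toy input every user comments in "gaming", so its weight drops to zero, whereas "mbti" is used by a single user and keeps a high weight, which is the tf-idf-like behavior the weighting is meant to capture.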

Feature Selection
Filter-based feature selection was performed on all features except for human-designed features to maximize the performance and reduce overfitting. Table 6 shows the list of feature sets used in our experiment. We used the one-way ANOVA F-test to test the relationship between predictor and response, then selected the features with the highest F-value. Let $f_i$ be the average value of feature $i$, $x$ be the average value of feature averages, $x_i$ be a value of feature $i$, $f$ be the average value of the feature, and $DF$ be the degree of freedom. The F-value can be calculated as follows.

Sum of squares between features $= SS_{between}$
Sum of squares within feature $= SS_{within} = \sum (x_i - f)^2$
$$F\text{-value} = \frac{SS_{between} \div DF_{between}}{SS_{within} \div DF_{within}}$$
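For readers who want to reproduce the selection step, the sketch below computes the textbook one-way ANOVA F-statistic for a single feature across classes and keeps the top-k features. It uses the standard formulation with class-size weighting in the between-group sum of squares, which may differ in notation from the formula above; the function names and the toy data are illustrative assumptions.

```python
from collections import defaultdict


def anova_f_value(values, labels):
    """Textbook one-way ANOVA F-statistic of one feature grouped by class label."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2
                     for label, g in groups.items())
    ss_within = sum((v - means[label]) ** 2
                    for label, g in groups.items() for v in g)
    if ss_within == 0:
        # Degenerate case: no variation inside the classes.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_k_best(feature_columns, labels, k):
    """Return the names of the k features with the highest F-value."""
    scores = {name: anova_f_value(column, labels)
              for name, column in feature_columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


labels = ["a", "a", "a", "b", "b", "b"]
features = {
    "informative": [1, 2, 3, 4, 5, 6],
    "noise": [1, 5, 2, 4, 3, 3],
}
print(anova_f_value(features["informative"], labels))
print(select_k_best(features, labels, k=1))
```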

Classification Algorithms
To see the potential of our proposed community and hybrid features, we performed experiments using 10-fold cross-validation on several classifiers, including multinomial naive Bayes, support vector machine, random forest, multi-layer perceptron, and majority class classifier. These classifiers will be trained with the feature sets using the extracted attributes as labels.

• Majority class classifier (MCC) always classifies the most frequent class in the dataset. This classifier is often used as the baseline against machine learning models to demonstrate their superior decision-making.
• Multinomial naïve Bayes (NB) is a popular conditional probabilistic classifier. We used one of the classic variants used in text classification with Laplace smoothing.
• Support vector machine (SVM) [14] creates a discrimination hyperplane between two sets of data points. We used linear SVM with the L2 penalization and squared hinge as the loss function. We used the one-vs-rest strategy for multi-class datasets.
• Random forest (RF) [15] is a majority-voting classifier that consists of multiple decision trees, each trained with a different dataset. We created a random forest with 100 decision trees with the maximum features equal to the square root of the original number of features.
• Multi-layer perceptron (MLP) is a fully connected artificial neural network. We used two hidden layers, each with 64 units with the rectified linear unit (ReLU) activation. We held out 10% of the training data to use as the validation set for early stopping.
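To make the baseline and the evaluation metric concrete, the sketch below implements a majority class classifier and the macro-averaged F1 score used throughout the results. It is a plain-Python illustration with hypothetical function names and toy labels, not the experimental code.

```python
from collections import Counter


def majority_class_predict(train_labels, n_test):
    """Predict the most frequent training label for every test instance."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_test


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


y_train = ["I", "I", "I", "E"]
y_test = ["I", "I", "I", "E"]
y_pred = majority_class_predict(y_train, len(y_test))
print(round(macro_f1(y_test, y_pred), 3))
```

Because every class counts equally in the macro average, a classifier that ignores the minority class is penalized even when its accuracy looks high, which is why the metric suits imbalanced attribute datasets.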

Imbalance Problem
We experimented with a couple of resampling methods, including random oversampling (RO) and synthetic minority over-sampling technique (SMOTE) [16], to address the imbalance problem and study their effects on the performance.
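Random oversampling can be illustrated as follows: minority-class instances are duplicated at random until every class is as large as the largest one. The function name, the fixed seed, and the list-based data layout are assumptions for this sketch; SMOTE, which synthesizes new points by interpolation, is not shown. As a general precaution, resampling is usually applied to the training folds only so that duplicated instances do not leak into the test fold.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes have equal size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


samples = [0, 1, 2, 3, 4, 5]
labels = ["a", "a", "a", "a", "b", "b"]
new_samples, new_labels = random_oversample(samples, labels)
print(Counter(new_labels))
```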

Results
We extracted seven attributes from 45,751 unique users. These were 17,589 users for gender identity, 4136 users for age group, 17,446 users for residential area, 3499 users for education level, 810 users for political affiliation, 2709 users for religious belief, and 4723 users for personality type. Table 7 shows the number of users in each class of the datasets.

Classification Performance
We obtained promising results for both demographic attributes and personality types. Table 8 shows the best macro-averaged F1 scores for each demographic attribute. We found that our proposed CA_Freq_100 feature set obtained the best performance measured in terms of F1 score. The F1 score of residential area prediction reached 98.1%. Applying another proposed feature set (CA_Wgt_100), the education level prediction obtained an F1 score of 94.7%. The gender identity prediction using the HF feature set obtained 92.1%. For political affiliation, we obtained an F1 score of 91.5% by using the CA_Wgt_100 feature set. The religious belief prediction performance was 60.6% using CA_Wgt_100, and the age group prediction reached 52.0% with the HF features. We can conclude that our proposed feature sets, CA and HF, provided the best performance for predicting all demographic and personality attributes.
For personality datasets, we found that our proposed feature sets significantly outperformed the baseline (p < 0.001). To the best of our knowledge, our methods achieved the highest performance on MBTI personality prediction for Reddit datasets. We compared our results with the work done by Gjurković and Šnajder [12], which experimented with a similar Reddit dataset. Despite having fewer instances (4723 vs. 9111) and features (100 vs. 11,140), our proposed feature sets displayed significantly better performance. However, we were not able to experiment with their published dataset due to the lack of data for our feature extraction methods. Table 9 shows the performance comparison between [12] and our proposed methods.
We evaluated the performance of NB, MLP, RF, and SVM by comparing our proposed feature sets and the feature set proposed by [10]. In Table 10, 10 feature sets are evaluated on 11 private attribute predictions. We found that our proposed feature sets outperformed all baseline feature sets.
For community activity features, RF mostly performed best, except for the E/I and J/P datasets, where MLP performed best. For personality prediction, we found that NB learned from the community activity feature set (CA_Freq_100) obtained the best performance. Gender and age prediction could be achieved by using MLP learned from the hybrid feature set.
Table 11 shows the training time in seconds of the best algorithms (shown in Table 10) using different feature sets for attribute prediction. We found that our proposed feature sets required a shorter training time compared to the baseline feature sets (p < 0.001). For community activity features, CA_Freq_100 required the shortest training time because of its small size, followed closely by CA_Wgt_100. The hybrid feature sets had a longer training time due to their complex extraction processes. Notably, the stemmed version of the comment text feature set (BoW_Stemmed) had a significantly higher training time than the non-stemmed counterpart (BoW_Ngrams).

Table 11. Training time in seconds of the best algorithms for each feature set; each column corresponds to one of the 11 attribute prediction tasks.
Baseline
LIWC          539   83   527   73   47  138  121  101  105  112  115
LIWC_Tfidf    561   86   529   75   47  137  120  104  108  117  119
BoW_Ngrams    667  120   788  115   61  158  150  121  130  153  135
BoW_Stemmed  2426  366  2461  345  239  667  510  462  479  519  495
Proposed
CA_Freq       228   61   388   70   16   43   84   92   63   73   69
CA_Freq_100    49    8    39    8    2    5   10    9    9   10   10
CA_Wgt_100    103   17    83   15    3   10   22   18   18   20   21
HF           1004  154   879  147   83  223  213  170  175  206  194
HF_LIWC      1788  280  1618  249  146  421  322  321  333  371  362
HF_10k        996  140   834  128   72  208  153  156  165  185  180

Robustness
We evaluated the robustness of all algorithms learned from different feature sets by measuring the difference between the training and testing performance. The overfitting rate was calculated by the following equation. Note that the lower overfitting rate is more desirable.
$$\text{Overfitting rate} = F_{1,\text{Train}} - F_{1,\text{Test}}$$
From Table 12, we found that most of the baseline feature sets (LIWC and BoW) were over-fitted. Our proposed feature sets had a very low overfitting rate. This means that our proposed feature sets are desirable for learning algorithms. For the community activity features, we found that the CA_Freq_100 feature set was the best fitted. We also found that the hybrid feature sets fit better than the LIWC and n-gram feature sets. Unsurprisingly, the HF_10k feature set fit better than the regular one (HF).

Imbalance Problem
From the dataset information shown in Table 13, we found that class imbalance occurred in all private attribute datasets. Therefore, two oversampling techniques were explored to see their potential on our proposed feature sets. Table 14 shows the F1 scores obtained from random oversampling (RO) and SMOTE compared to the performance obtained from the original datasets (Table 8), denoted as the "None" technique (which means no oversampling method was deployed on that dataset). Experimental results shown in Table 14 revealed that using our proposed feature sets (CA_Wgt_100) without oversampling techniques reached the highest performance for personality prediction tasks (Per., E/I, S/N, T/F, J/P). Note that the personality type prediction task (Per.) was the most difficult problem since it contained sixteen classes that came from the combination of the four dichotomies E/I, S/N, T/F, and J/P (see Table 7 for details). For the CA_Freq and CA_Wgt_100 feature sets, we found that RO and SMOTE made a small contribution to the F1 score for education level and political affiliation prediction. The RO method improved classification performance on age group, residential area, and religious belief prediction.

Demographic Attributes
We have shown the predictive analysis of our work in the previous section. However, we also wanted to discuss descriptive results to better understand user behavior. We did this by looking for informative word features with high F-test values in each dataset. For gender identity prediction, we found effective word features related to relationships such as "SO" (significant other), "boyfriend", and "my husband". We also found that some lifestyle and news communities, such as "gaming", "technology", and "worldnews", can be used to imply the gender of the user who interacts with them.
We discovered communities related to lifestyle activities, such as "beetle", "Curling", and "bicycleculture", that could be used as a data source for age group prediction. For the residential area dataset, we could predict the residential area with a high F1 score of 98.1%. For informative words, we found words corresponding to their languages, for instance, "el" (Spanish for "the") or "de" (Spanish for "of"). For education level prediction, we found words explicitly related to the topic, such as "PhD", "grad", "student", and "college". We also discovered communities directly related to education other than the ones we extracted from, such as "AskAcademia", "csMajors", "gradadmissions", and "CollegeRant".
For the political affiliation dataset, we discovered that the most informative words were related to accusations, such as "FBI" or "witnesses" since we obtained the experimental data during the nomination of US Supreme Court Justice Brett Kavanaugh. New communities related to controversial discussions were discovered, such as "AskScience" and "debatereligious". This implies that those users like to express their world views on controversial issues. For the religious belief dataset, we found words corresponding to religious teachings, for example, "Quran" (the text of Islam) or "Allah" (the god of Islam).

Personality Types
Our proposed feature set, CA_Freq_100 with NB, significantly outperformed the research work done by Gjurković and Šnajder [12] on personality prediction at 64.4% (with over 22.7% improvement). We also performed a feature analysis and found words mentioning their personality types and MBTI-related communities as the most effective features.
One interesting question is "Can personality type be inferred from the demographic attributes?" We answered this question by setting up an experiment to see the predictive power of demographic feature sets. First, we derived a new dataset from the personality dataset consisting of 4723 users by integrating their six demographic attributes. Then, models obtained from each demographic dataset were deployed to predict the missing demographic values in the new dataset. After that, logistic regression was trained on the set of six demographic attributes to predict the personality types. We found that the macro-averaged F1 score of the model was very low and close to that of MCC, with a 2.3% difference. As shown in Table 15, we found that logistic regression learned from six demographic attributes obtained worse performance compared to the baseline feature set (LIWC). Our experimental results implied that people's personality types were independent of their demographic attributes.

Conclusions
We have done an empirical analysis of our proposed feature sets for private attribute prediction covering classification performance, training time, and imbalance problems. From experimental results, we can conclude that user engagement on Reddit shows promising results for the discovery tasks. Although much research has been done on large platforms, such as Facebook and Twitter, we have shown that Reddit is a potential source of demographic and personality study as well. Our results show that we can predict MBTI personality type with an F1 score of 64.4% with a dataset of 4723 users. Our proposed feature sets applied with machine learning algorithms provided strong performance. We obtained 98.1% for residential area, 94.7% for education level, 92.1% for gender identity, 91.5% for political affiliation, 60.6% for religious belief, and 52.0% for age group.
For future work, we plan to explore ways of extracting other demographic attributes using the same technique. For the proposed feature sets, feature transformation and decomposition can be performed to study the change in performance. Imbalance problems can also be further investigated for textual features, which are known to be more difficult to handle than numeric ones.

Data Availability Statement: Publicly available datasets were analyzed in this study. This data can be found here: https://files.pushshift.io/reddit/comments/ (accessed on 23 August 2021).