A White-Box Sociolinguistic Model for Gender Detection

Within the area of Natural Language Processing, we approached the Author Profiling task as a text classification problem. Based on the author's writing style, sociodemographic information, such as the author's gender, age, or native language, can be predicted. The exponential growth of user-generated data and the development of Machine-Learning techniques have led to significant advances in automatic gender detection. Unfortunately, gender detection models often become black boxes in terms of interpretability. In this paper, we propose a tree-based computational model for gender detection made up of 198 features. Unlike previous work on gender detection, we organized the features from a linguistic perspective into six categories: orthographic, morphological, lexical, syntactic, digital, and pragmatic-discursive. We implemented a Decision-Tree classifier to evaluate the performance of all feature combinations, and the experiments revealed that, on average, the classification accuracy increased by up to 3.25% with the addition of feature sets. The maximum classification accuracy was reached by a three-level model that combined lexical, syntactic, and digital features. We present the most relevant features for gender detection according to the trees generated by the classifier and contextualize the significance of the computational results with the linguistic patterns defined by previous research in relation to gender.


Introduction
In recent years, User-Generated Content (UGC) has increased due to the proliferation of web spaces that encourage user participation, such as blogs or social networks: "Gradually, a greater range of tools and platforms for the development and hosting of such content emerged, resulting in a further widening of participation in user-led content creation" [1].
Computational social sciences [2] consider UGC as a valuable source of information to understand social phenomena and reinforce their sociological explanations with quantitative analyses [3]. Within this research agenda, Author Profiling (AP) analyzes the authors' linguistic productions in order to trace identifying clues, such as age, gender, personality traits, or native language. Gender detection models developed by the AP field directly benefit other research areas, such as forensic linguistics, marketing, and sociolinguistics [4].
The World Wide Web provides an environment in which new types of criminal activities are perpetrated [5]. Specifically, online harassment has been closely related to anonymity, since users can join virtual communities, such as social networks, by providing a false identity using nicknames or fake pictures: "Anonymity has long been thought to encourage bad behavior, either by changing the salient norms, or through reducing the subjective need to adhere to norms by dampening the effect of internal mechanisms such as guilt and shame" [6].
From a forensic linguistic point of view, the linguistic choices made by a suspected author become textual evidence in legal investigation. In this vein, researchers have designed computational models to automatically detect different types of online harassment, such as sexual harassment [7] or cyberbullying [8].
However, AP models have gradually moved away from the needs of forensic linguistics since they have lost explanatory power with the implementation of black-box algorithms. Forensic linguists ultimately need language as evidence, and therefore profiling models cannot incur opacity in exchange for accuracy: "Computational authorship profiling is not necessarily interested in understanding the inner (linguistic) mechanisms of the machine, as long as the accuracy rates are outperforming previous models" [9].
In addition, marketing companies are interested in developing recommender systems from the sociodemographic data of their customers in order to improve customer experience by making personalized recommendations of products and services [10]. Recommender systems are based on Machine-Learning algorithms that learn from the textual data generated by users through comments and reviews, or from their browsing activities in shopping applications or websites. In these marketing applications, gender has been considered a useful social variable [11]. For example, in [12], Aljohani and Cristea designed a Deep-Learning model to detect the gender of MOOC learners in order to offer them personalized course content.
Finally, in [13], Nguyen et al. indicated the contribution of automatic gender detection to sociolinguistics, a field traditionally concerned with defining linguistic patterns correlated with social variables, such as age, gender, or social class. Sociolinguistic conclusions related to gender were drawn mainly from face-to-face informal interactions. Consequently, the introduction of Machine-Learning techniques for the analysis of large amounts of textual data may contribute to the definition of new sociolinguistic patterns.
This study contributes to sociolinguistics and to computational sociolinguistics in particular, as well as related research fields, since we propose a computational model for gender detection in Spanish based on decision trees. Beyond the classification task, our objective is twofold: on the one hand, to verify previous sociolinguistic conclusions about gender and, on the other hand, to define new linguistic patterns that may contribute to disciplines, including forensic linguistics, in profiling the gender of authors from textual data.
The style and linguistic characteristics of the messages are traces left by the author. They allow unmasking the identity of the person hiding behind the text, even when the perpetrator assumes another identity. Our model is committed to interpretability, since the forensic linguist needs access to the linguistic traces that identify the suspect as evidence in a legal investigation.
We incorporated sociolinguistic features, such as tag questions or appellatives, along with other features previously explored in the field of author profiling, such as lexical richness. Unlike previous works, we handled a reduced set of fewer than 200 features, which covered a wide linguistic spectrum, from orthography to pragmatics. In fact, although our model reached 5.53% less accuracy than the model of Santosh et al. [14], we obtained competitive results handling less than 1% of the number of features.
The paper is organized as follows. Section 2 reviews the state of the art regarding automatic gender detection. Section 3 describes our tree-based computational model for gender detection. In Section 4, we report and discuss the experimental results. The paper finishes with some concluding remarks and future research avenues presented in Section 5.

Automatic Gender Detection: From Discriminant Analysis to Deep Learning
Gender has been, by far, the sociodemographic characteristic that has received the most attention both in computational studies and in sociolinguistic research. Within the computational field, automatic gender detection began around the turn of the new millennium. Researchers implemented Artificial-Intelligence methods, such as Machine-Learning algorithms, to classify sets of authors according to their gender. Unlike sociolinguistics, which incorporated sociological theories about gender in its analyses, automatic gender detection has been approached as a binary classification task (female or male). This simplistic notion of gender was questioned in [15]: "If we start with the assumption that 'female' and 'male' are the relevant categories, then our analyses are incapable of revealing violations of this assumption". In relation to automatic gender detection studies, four stages may be identified.

First Stage
Early automatic gender detection studies examined formal texts, such as the British National Corpus or the Enron e-mail corpus, or datasets collected from controlled settings. Research on automatic gender detection was constrained due to the unavailability of annotated data.
In [16], Thomson and Murachver conducted three experiments to analyze gender-preferential language style in electronic discourse. They computed the number of messages, the total word count, and the sentence length, as well as other lexical and morphological features, such as references to emotion, compliments, insults, apologies, intensive adverbs, subordinating conjunctions, and adjectives, among others. Although they detected few significant differences in the electronic messages written by the 35 participants, they correctly classified 91.4% of the authors using a discriminant analysis.
Singh [17] examined gender differences related to lexical richness in free and spontaneous recorded conversations produced by 30 participants. Specifically, he applied eight lexical richness measures, such as noun rate and adjective rate per 100 words, and type-token ratio. By performing a statistical discriminant analysis, Singh reached a classification accuracy of 90%.
In 2002, Corney et al. [18] analyzed an e-mail corpus sourced from a large academic organization. Initially, they collected 8820 e-mail documents written by 342 authors. They reported a 70.2% F1 score with a Support Vector Machine classifier and 221 features, organized into seven categories: document-based, word-based, character-based, function words, structural, gender-preferential, and other features. The so-called gender-preferential features mainly consisted of the frequency of adjectives and adverbs, along with the frequency of the word sorry and apology-related words.
Koppel, Argamon and Shimoni [19] conducted a gender detection study on 566 documents from the British National Corpus labeled both for author gender and for genre (fiction and non-fiction). They proposed a learning method based on the Exponential Gradient algorithm to find a linear separator between female-authored and male-authored texts. They used 405 function words and the 500 most common ordered triples, the 100 most common ordered pairs and all the single parts-of-speech tags as features. With this method, they reached approximately 80% accuracy inferring the authors' gender and 98% on genre identification.
In 2005, Boulis and Ostendorf [20] performed a computational analysis on the Fisher corpus made up of 12,000 recorded telephone conversations. They extracted word unigrams and bigrams as features, and they tested various Machine-Learning algorithms, such as Naïve Bayes, Maximum Entropy, and Rocchio. The maximum accuracy of 92.5% was achieved by a Support Vector Machine classifier using about 300 K word bigrams.
The consolidation of the blogosphere with the expansion of platforms, including blogger.com or wordpress.org, represented a paradigm shift in automatic gender detection.

Second Stage
In a second stage, automatic gender detection moved from formal documents to informal texts due to the possibility of collecting large datasets from the blogosphere. Moreover, researchers considered, along with gender, other sociodemographic characteristics, such as age, origin, and personality traits.
Nowson and Oberlander [21] collected a personal weblog corpus comprising 71 authors to predict the authors' gender and also to evaluate their openness trait. Their features consisted of dictionary-based features (Linguistic Inquiry and Word Count and the MRC psycholinguistic database) and 125 n-grams of parts-of-speech tags. With a Support Vector Machine classifier, they reached 92.5% classification accuracy for gender detection.
In [22], Schler et al. examined the effects of gender and age on a blog dataset sourced from blogger.com made up of 71,000 blogs. However, to prevent bias, they created a subdataset of 37,478 blogs guaranteeing an equal gender composition in each age group. Schler et al. computed the post length and extracted function words, parts-of-speech tags, blog words, and hyperlinks as features. They reported 80.1% classification accuracy using the Multi-Class Real Winnow algorithm.
Yan and Yan [23] collected 75,000 blog entries written by 3000 authors from Xanga. They extracted unigrams along with weblog-specific features, such as background color, word fonts and cases, and emoticons. They reported a 68% F1 score on gender detection with a Naïve Bayes classifier.
In 2009, Goswami, Sarkar and Rustagi [24] predicted gender and age from a 20,000 blog corpus previously explored by [22], considering the frequency of 52 non-dictionary words and the length of sentences. With a Naïve Bayes classifier, they achieved 89.3% classification accuracy on gender detection.
In [25], Mukherjee and Liu collected their own blog corpus made up of 3100 blogs sourced from various blog hosting sites, such as technorati.com and blogger.com. They extracted stylistic features, word classes, and gender-preferential features [18], along with parts-of-speech sequence patterns. With the implementation of a Support Vector Machine classifier, they reached 88.56% classification accuracy.
Otterbacher [26] explored a movie review corpus composed of 31,300 reviews sourced from the Internet Movie Database (IMDb) website. He applied a Logistic Regression classifier to content-based and metadata features to reach a classification accuracy of 73.3%.

Third Stage
In a third stage, automatic gender detection studies were interested in microblogging. The previous stages demonstrated that it was possible to predict the gender of authors from formal and informal texts of a certain length; however, microblogging platforms, such as Twitter, set a new challenge: to extract identifying information with little textual input.
The first study on Twitter was conducted by Rao et al. [27]. They manually annotated 1000 Twitter users to infer four user attributes: gender, age, political orientation, and regional origin. They achieved a classification accuracy of 72.3% using a Support Vector Machine classifier with unigram and bigram tokens (1,256,558 features) and sociolinguistic features. The latter set mainly consisted of punctuation mark frequencies, a list of emoticons compiled from Wikipedia, and some word lists, such as exasperation expressions or affection words.
In [28], Burger, Henderson, Kim and Zarrella created a large dataset of 184,000 Twitter users with 4.1 million tweets for training. Unlike the work presented by [27], they automatically annotated their dataset following the URLs included by the users in their profile descriptions that linked to their blog sites. They experimented with a wide variety of classifiers, including Naïve Bayes, Support Vector Machines, and Balanced Winnow 2. They reported 75.5% classification accuracy with a Balanced Winnow 2 classifier using character n-grams with n in the range 1-5 and word unigrams and bigrams (15,572,522 features).
In 2012, Fink, Kopecky and Morawski [29] collected 78,853 Twitter users in order to infer the authors' gender from the content of their tweets. They reached 80.6% classification accuracy using a Support Vector Machine classifier with word unigrams, Twitter hashtags and LIWC categories as features (1,231,910 features).
Ciot, Sonderegger and Ruths [30] also experimented with a Support Vector Machine classifier on Twitter data. However, unlike previous work, they extracted tweets written in four languages: Japanese, Indonesian, French, and Turkish. They reached the highest accuracy of 87% on Turkish and the lowest accuracy of 63% on Japanese with the top 20 word unigrams, bigrams, trigrams, and hashtags, along with other metadata features, such as the top 20 mentions or links.

Fourth Stage
The latest developments in Machine Learning have been incorporated in recent years in automatic gender detection research. More specifically, Deep Learning structures have been designed to perform gender detection from textual and visual data.
In [32], Manna, Pascucci and Monti analyzed 56 blogs collected from the SogniLucidi website using a Feed-Forward Neural Network. They reported 77.6% classification accuracy using word unigrams, bigrams, and trigrams as features.
Park and Woo [33] explored an AIDS-related bulletin board from HealthBoard.com and created a gender detection model based on the emotions expressed by the users in their comments and posts. They applied the NRC and BING dictionaries to extract sentiment information. The results showed that Deep-Learning structures, and more specifically Convolutional Neural Networks, outperformed traditional Machine-Learning algorithms. In fact, they reached 73.44% accuracy with Random Forest, whereas a Convolutional Neural Network structure yielded 91% accuracy.
In 2020, Safara et al. [34] explored the Enron e-mail corpus with an Artificial Neural Network and the Whale Optimization Algorithm to find the optimal weights and improve the accuracy of the neural network structure. They outperformed previous work on the Enron corpus achieving an accuracy of 98% with 48 linguistic features distributed into four categories: character-based features, word-based features, syntax-based features, and structure-based features.
Finally, Kowsari et al. [35] employed Deep Neural Networks and Convolutional Neural Networks on the PAN-AP-17 dataset collected from Twitter, reporting an accuracy of 86.33% using TF-IDF scores and GloVe word vectors.
Deep Learning methods have been implemented on other data modalities. To mention some examples, refs. [36,37] employed Convolutional Neural Networks (CNNs) for gender detection from facial images, and [38] used Multi-Scale Convolutional Neural Networks on audio data. Table 1 presents an overview of some of the previous works on automatic gender detection. For each of the works reported, the table shows the dataset, features, and algorithm used, as well as the accuracy reached.

A Sociolinguistic Model for Gender Detection
We implemented Machine-Learning techniques to explore social media posts and to design a computational model for gender detection. Our main objective was to define sociolinguistic patterns related to gender, and thus we prioritized the interpretability of the model over the classification accuracy.

PAN-AP-13 Dataset
Our computational analysis was conducted on the dataset provided by the PAN organizing committee for the Author Profiling task at the Conference and Labs of the Evaluation Forum (CLEF) 2013 edition that took place in Valencia (Spain). PAN is conceived as "a series of scientific events and shared tasks on digital forensics and stylometry" (https://pan.webis.de/, accessed on 15 December 2021). Research teams participate in the shared tasks under the same computational conditions as they must deploy their models on the TIRA platform.
The PAN-AP-13 dataset consisted of social media posts written in English and Spanish, and labeled by gender (female and male) and age (10s: 13-17 years; 20s: 23-27 years; and 30s: 33-47 years). We ran the experiments on the Spanish subdataset made up of 84,060 authors: 75,900 authors for training and 8160 for testing. We focused only on gender detection, and thus we omitted the information related to age. Table 2 presents the dataset distribution. We recommend [39] for more information about the data.
In the 2013 PAN AP task, 21 research teams participated. The classification accuracy on gender detection in the Spanish subdataset ranged from 47.84% to 64.73%. The maximum classification accuracy was reached by [14] using a Decision-Tree classifier trained with a combination of style-based features, such as punctuation mark frequencies, and content-based features, such as word unigrams and Latent Dirichlet Allocation-based topics. However, as our objective is to reveal linguistic patterns correlated with gender, we structured the features from a linguistic point of view into the following categories: orthographic, morphological, lexical, syntactic, and pragmatic-discursive. In addition, taking into account that the dataset contained digital elements, such as URLs or emoticons, we included a digital level. All features were extracted automatically with Python. In what follows, we introduce the different categories of the 198 features. Note that we indicate the number of features of each category in parentheses.

Orthographic features (29) mainly captured punctuation mark frequencies, such as single, double and angular quotation marks, commas, full stops, colons, semi-colons, question marks, exclamation marks, parentheses, dashes, and ellipsis points. We also extracted sequences of punctuation marks, such as duplication of question and exclamation marks, repetition of question and exclamation marks, and combinations of question and exclamation marks. Finally, we computed frequencies of alphabetic characters, repetition of vowels and consonants, upper-case and lower-case characters, combinations of upper- and lower-case characters, and numeric characters.
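As an illustration, several of these orthographic counts can be obtained with simple regular expressions. The following sketch covers a hypothetical subset of the 29 features; the feature names and exact patterns are ours, not the original implementation:

```python
import re

def orthographic_features(text):
    """Sketch of a few orthographic counts (hypothetical subset)."""
    return {
        # Single punctuation mark frequencies
        "question_marks": text.count("?"),
        "exclamation_marks": text.count("!"),
        "ellipsis_points": len(re.findall(r"\.{3,}|…", text)),
        # Sequences of punctuation marks
        "repeated_exclamation": len(re.findall(r"!{2,}", text)),
        # Character-level frequencies
        "upper_case": sum(c.isupper() for c in text),
        "vowel_flooding": len(re.findall(r"([aeiouáéíóú])\1{2,}", text,
                                         re.IGNORECASE)),
        "numeric_chars": sum(c.isdigit() for c in text),
    }
```

In practice, each feature would be normalized by message or word length before training, so that longer posts do not dominate the counts.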
Lexical features (33) captured word-based information. We applied three dictionaries to extract information about the frequencies of slang, emotive and offensive words. More specifically, we used the Spanish Specific Lexicon of Social Networks and the Spanish Emotion Lexicon created by Sidorov (https://www.cic.ipn.mx/~sidorov/, accessed on 16 December 2021) [41] and a hand-crafted offensive word list. We also captured the use of mitigating lexical elements from the frequencies of modal and epistemic verbs, probability adjectives and adverbs, approximators, conditional tense verbs, subjunctive mood verb forms, and non-personal verbs, among other elements.
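The dictionary-based lexical counts reduce to set lookups over the tokenized message. The sketch below uses toy emotion lexicons for illustration; the actual model relies on Sidorov's Spanish Emotion Lexicon and a hand-crafted offensive word list, which are not reproduced here:

```python
# Toy lexicons for illustration only; the real model uses the
# Spanish Emotion Lexicon and other external resources.
JOY = {"feliz", "genial", "alegría"}
SADNESS = {"triste", "pena", "llorar"}

def lexical_features(tokens):
    """Sketch of a few word-based features (hypothetical subset)."""
    tokens = [t.lower() for t in tokens]
    n = max(len(tokens), 1)  # avoid division by zero on empty posts
    return {
        "joy_freq": sum(t in JOY for t in tokens) / n,
        "sadness_freq": sum(t in SADNESS for t in tokens) / n,
        "long_word_freq": sum(len(t) > 6 for t in tokens) / n,
        "type_token_ratio": len(set(tokens)) / n,
    }
```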
Syntactic features (29) were based on syntactic dependencies. We used the Spacy (https://spacy.io/, accessed on 16 December 2021) library to segment the messages into sentences and obtain the dependencies tags. We captured the following syntactic dependencies: nominal subjects, clausal subjects, direct objects, indirect objects, oblique nominal complements, nominal modifiers, adjectival modifiers, adverbial modifiers, numeric modifiers, determiners, case markers, appositional modifiers, clausal complements, open clausal complements, adverbial clause complements, and adjectival clause complements, along with syntactic relationships, such as coordination, juxtaposition, and subordination. In addition, we included other features, such as number of sentences, sentence length, word repetition by coordination, and the frequency of direct and indirect objects at the initial position.
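Counting the dependency relations amounts to tallying the `dep_` labels that spaCy assigns to each token. A minimal sketch, assuming the Universal Dependencies labels produced by a Spanish pipeline such as `es_core_news_sm` (the tag list and relative-frequency normalization are our assumptions):

```python
from collections import Counter

# Universal Dependencies relations of interest (subset, for illustration)
DEP_TAGS = ["nsubj", "csubj", "obj", "iobj", "obl", "nmod", "amod",
            "advmod", "nummod", "det", "case", "appos", "ccomp",
            "xcomp", "advcl", "acl", "conj"]

def dependency_features(doc):
    """Relative frequency of selected dependency relations.

    `doc` is any iterable of tokens exposing a `dep_` attribute,
    e.g. the output of spacy.load("es_core_news_sm")(text).
    """
    counts = Counter(tok.dep_ for tok in doc)
    n_tokens = max(sum(counts.values()), 1)
    return {dep: counts.get(dep, 0) / n_tokens for dep in DEP_TAGS}
```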
Digital features (16) captured the frequencies of digital elements, such as URLs, embedded pictures, and emoticons. We computed the ratio between emoticons and words, and some emoticon frequencies related to basic emotions, such as sadness or joy.
Pragmatic-discursive features (62) captured pragmatic information, such as presuppositions and speech acts. More specifically, we extracted five types of explicit speech acts (assertive, directive, commissive, expressive, and declaration) from the verbs of the sentences. For example, sentences introduced by the form I promise were captured as commissive speech acts, while sentences introduced by I order were classified as directive speech acts. We created five lists of verbs based on speech act theory.
With respect to presuppositions, we extracted existential (determiner phrases with definite interpretation, such as the phone, deictic terms and proper names), lexical (factive verbs, such as regret, verbs of judging, such as criticizing, change of state verbs and implicative verbs), and focal presuppositions (grammatical structures formed by a focus adverb, such as even). We also considered some features previously explored by sociolinguistics, such as tag questions, and politeness and apology expressions. Finally, we included some features in order to capture discursive information: discursive markers, type of sentences, and total number of words, line breaks, and tabulations.

Decision Trees for Gender Detection
Decision tree-based models have been applied to solving practical problems, such as medical diagnosis or marketing personalization. Unlike other Machine-Learning algorithms frequently employed in the AP area, such as Support Vector Machines, Naïve Bayes, and Deep Learning structures, Decision Trees are considered white-box models [42] because they generate interpretable and understandable models [43]: "For knowledge-based systems, decision trees have the advantage of being comprehensible by human experts and of being directly convertible into production rules" [44].
For this reason, we selected a Decision-Tree classifier to evaluate the performance, in terms of classification accuracy, of all 63 possible combinations of the six feature sets. We limited the maximum depth of the tree to five levels to prevent overfitting and to generate short human-readable rules. Table 3 shows the mean classification accuracy achieved with the combinations of the different feature sets.
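This exhaustive evaluation can be sketched with scikit-learn and `itertools.combinations`. The column layout of the feature matrix below is a hypothetical assumption for illustration; only the depth-5 tree and the 63 combinations reflect the setup described above:

```python
from itertools import combinations

from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical mapping from linguistic level to column indices in X
FEATURE_SETS = {
    "orthographic": range(0, 29),
    "morphological": range(29, 59),
    "lexical": range(59, 92),
    "syntactic": range(92, 121),
    "digital": range(121, 137),
    "pragmatic": range(137, 199),
}

def evaluate_combinations(X_train, y_train, X_test, y_test):
    """Train a depth-limited tree on every non-empty combination of levels."""
    results = {}
    names = list(FEATURE_SETS)
    for r in range(1, len(names) + 1):          # 1- to 6-level models
        for combo in combinations(names, r):    # 63 combinations in total
            cols = [i for name in combo for i in FEATURE_SETS[name]]
            clf = DecisionTreeClassifier(max_depth=5, random_state=0)
            clf.fit(X_train[:, cols], y_train)
            results[combo] = accuracy_score(y_test,
                                            clf.predict(X_test[:, cols]))
    return results
```

The mean accuracy per number of levels, as reported in Table 3, is then a simple aggregation of `results` grouped by the length of each combination key.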
Considering the mean values, we observed that the greater the number of feature sets involved, the better the classification accuracy of the models. In isolation, the digital level achieved the highest accuracy of 56.9%. The combination of the digital and the lexical levels increased the classification accuracy by 2%. The maximum accuracy of 59.2% was yielded by adding the syntactic feature set to the previous combination. From Table 3, it can be observed that the incorporation of the morphological, pragmatic-discursive and orthographic features did not lead to an increase in accuracy. In Figure 1, we partially reproduce the tree belonging to the DLS (digital, lexical, and syntactic) model. As can be observed, digital features, such as the GIF/words ratio, and lexical features, such as the frequency of words with appreciative suffixes, occupy the first levels of the tree.

Results
To better understand the significance of the features and to be able to trace a sociolinguistic explanation, we ranked the features according to the values provided by the feature_importances_ attribute of the scikit-learn (https://scikit-learn.org/stable/, accessed on 17 December 2021) library, which computes the importance of a feature as the normalized total reduction of impurity brought by that feature across the tree (Gini importance).
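The ranking step itself is straightforward. The helper below is a sketch of how a fitted tree's importances can be sorted and paired with feature names; the function name and the `top_k` cutoff are our illustrative choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rank_features(clf, feature_names, top_k=10):
    """Rank a fitted tree's features by Gini importance (impurity decrease)."""
    importances = clf.feature_importances_
    order = np.argsort(importances)[::-1]  # most important first
    # Drop features the tree never split on (zero importance)
    return [(feature_names[i], float(importances[i]))
            for i in order[:top_k] if importances[i] > 0]
```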

Orthography
The tree-based classifier discarded 16 out of 29 orthographic features; the retained features and their distribution are shown in Table 4. Some of these sociolinguistic patterns are consistent with previous work: refs. [15,27] detected that females included more ellipsis points in their messages; ref. [27] concluded that females were more than twice as likely to repeat exclamation marks as male users; refs. [15,45] found that character flooding was among the most informative features in the female category; and [46] indicated that females wrote more upper-case characters as expressiveness markers. In contrast, numeric characters have been previously correlated with male-authored texts in [47,48].
Although frequencies of punctuation marks have been considered as features in AP models [49], as far as we know, gender studies have tended to focus mainly on question and exclamation marks [50] or on non-standard orthography [51]. Thus, further research needs to be conducted on other punctuation marks, such as dashes, commas, double quotation marks, parentheses, and full stops, which were correlated with male-authored messages according to our results. However, refs. [52,53] found that females used, on average, more punctuation marks.

Morphology
Regarding morphology, 19 out of 30 morphological features were discarded by the classifier; the retained features are shown in Table 5. These morphological patterns have already been detected in previous work. Several works have reported the greater use of personal pronouns and verbs by females [15,26,30,54,55]. Argamon [56] also found that females used more conjunctions than males. On the other hand, male-authored texts contained, in general, more determiners, prepositions, and nouns, as shown in [26,30,48,57].

Lexicon
Regarding the lexical level, the classifier removed 21 out of 33 lexical features. It can be seen in Table 6 that, from a lexical perspective, females included more emotive terms (M = 10.918) and, specifically, joy- (M = 5.585) and sadness-related (M = 2.680) words. They also employed more mitigating lexical elements (M = 5.385) in order to attenuate their statements. Male-authored messages contained more approximators or numerical hedges, more derived words with appreciative affixes (M = 5.138) and, specifically, suffixed words (M = 1.247).
Finally, males exhibited a higher letters/words ratio (M = 4.367); in fact, they wrote more words of over six characters (M = 50.317), and they presented a higher lexical diversity (M = 0.758).
Tannen [58] concluded that females tended to have a more supportive orientation. For this reason, they used more attenuated assertions and mitigating lexical elements [59]. This finding is supported by our computational analysis. However, our results differ from well-established conclusions regarding the use of diminutives and suffixed words, since, according to [59-62], females included more diminutives in their texts. Finally, our results also differ from [26], who found that vocabulary richness was associated with female-authored texts.

Syntax
Regarding the syntactic level, the classifier discarded 20 out of 29 syntactic features; the results for the retained features are shown in Table 7. Thomson and Murachver [16] also found that males wrote slightly longer sentences than females. Regarding the syntactic dependencies, it should be noted that automatic gender detection studies have focused on sequences of dependency tags instead of isolated dependencies. Ref. [63] extracted individual dependency relations; however, they did not provide the distribution of the syntactic features in relation to gender, and therefore we cannot perform a comparative analysis of the results.

Digital Features
At the digital level, the classifier discarded 6 out of 16 digital features. As shown in Table 8, messages posted by females exhibited a higher GIF/words ratio (M = 0.023). In fact, female-authored texts contained more emoticons on average (M = 1.301) and, specifically, more love-related emoticons (M = 0.213). In addition, they also shared more images (M = 0.097) in their posts. In contrast, male-authored texts included more URLs (M = 0.253) and more cool emoticons (M = 0.063).
Our findings are also consistent with previous sociolinguistic work. Refs. [15,27,64] found that females tended to reinforce their messages with non-verbal communication elements, such as emoticons, in order to express emotion. Moreover, females' preference for love-related emoticons has already been indicated by [65]. Regarding the embedded images, ref. [66] found that females shared more photos than males. On the other hand, refs. [22,28] also detected that male-authored messages included more URLs. This pattern has been frequently related to the preference of male users for the informational dimension of communication. However, a qualitative analysis is necessary in order to provide empirical evidence for this conclusion.

Pragmatic and Discourse
Finally, regarding the pragmatic-discursive level, the classifier did not consider 52 out of 62 pragmatic-discursive features; the distribution of the 10 retained features is shown in Table 9. Our results are consistent with previous work, such as that reported in [15,55], who found that females wrote more exclamative sentences, and [21], who detected that females used more contextual or deictic words than males. Moreover, politeness has traditionally been related to the female gender [67]. As far as we know, previous gender detection models did not include pragmatic presuppositions as features. Therefore, further research is needed to draw strong sociolinguistic conclusions in this regard.

Conclusions
Is it possible to identify an author's gender from a given text? This is the question we investigated in this paper. The rise of social media, the fact that text is the most prevalent media type in our digital activity, and people's tendency to hide their identity on social media platforms have shown the potential of authorship profiling. In this paper, we focused on gender identification, a subtask of the authorship profiling problem that aims at determining the gender of the author of a given text; this has proven to be an interesting research area that could benefit forensics, marketing analysis, advertising, and sociolinguistics, among other fields.
Classifying the gender of a person based on short messages is a difficult task, since we have to deal with short, multi-topic texts. The main goal of our research is to determine sociolinguistic patterns related to gender that could improve automatic gender detection in author profiling tasks. In fact, to claim that gender identification is possible is to assume that men and women generally use different classes of language and that we can identify linguistic features that indicate gender. However, identifying such a set of features is still an open research problem.
In this paper, Machine-Learning techniques were used to analyze a Spanish social media corpus in order to obtain linguistic patterns to design a computational model for gender detection. A tree-based computational model made up of 198 features was proposed. Unlike previous work, we handled a reduced set of features that covered a wide linguistic spectrum (orthographic, morphological, lexical, syntactic, pragmatic-discursive, and digital).
A decision tree classifier was implemented to evaluate the performance of feature combinations. Experiments on our corpus indicated an accuracy of up to 59.2% in identifying gender. Our experiments also indicated that digital, lexical, and syntactic features were significant gender discriminators. Although our model reached less accuracy than other models, we obtained competitive results handling less than 1% of the number of features used in more accurate models.
In modeling our problem, we made a decision regarding the trade-off between model accuracy and model interpretation. In our work, the interpretability of the model was prioritized over the classification accuracy. Although opaque methods in gender detection generally obtain higher accuracy, we chose to pay a penalty in terms of predictive performance when selecting an interpretable model that allows for human/linguistic understanding.
Finding linguistic patterns correlated with gender is a common interest of gender identification tasks (within the area of author profiling) and of the field of sociolinguistics. Therefore, close collaboration between researchers in these two areas can produce better results that benefit both research fields. This is why we claim that interdisciplinarity is a must when dealing with gender, language, and computation.
As future research avenues, we will replicate this computational analysis on other PAN-AP datasets in order to validate some sociolinguistic patterns. In addition, we will conduct a qualitative analysis to fully understand some of the quantitative results presented in this study.

Data Availability Statement: The part of the data that supports the findings of this study is available on request from the corresponding authors.

Conflicts of Interest:
The authors declare no conflict of interest.