Advancing Author Gender Identification in Modern Standard Arabic with Innovative Deep Learning and Textual Feature Techniques

Himdi, Hanen; Shaalan, Khaled

doi:10.3390/info15120779

by Hanen Himdi ¹,* and Khaled Shaalan ²

¹ Computer Science and Artificial Intelligence Department, College of Computer Science and Engineering, University of Jeddah, Jeddah 21955, Saudi Arabia
² Faculty of Engineering and IT, The British University in Dubai, DIAC Block 11, Dubai P.O. Box 345015, United Arab Emirates
* Author to whom correspondence should be addressed.

Information 2024, 15(12), 779; https://doi.org/10.3390/info15120779

Submission received: 18 October 2024 / Revised: 14 November 2024 / Accepted: 20 November 2024 / Published: 5 December 2024

(This article belongs to the Section Artificial Intelligence)

Abstract

Author Gender Identification (AGI) is an extensively studied subject owing to its significance in several domains, such as security and marketing. Recognizing an author’s gender may assist marketers in segmenting consumers more effectively and crafting tailored content that aligns with a gender’s preferences. Also, in cybersecurity, identifying an author’s gender might aid in detecting phishing attempts where hackers could imitate individuals of a specific gender. Although studies in Arabic have mostly concentrated on dialectal writing, such as tweets, there is a paucity of studies addressing Modern Standard Arabic (MSA) in journalistic genres. To address the AGI issue, this work combines the beneficial properties of natural language processing with cutting-edge deep learning methods. Firstly, we propose a large 8k MSA article dataset composed of various columns sourced from news platforms, labeled with each author’s gender. Moreover, we extract and analyze textual features that may be beneficial in identifying gender-related cues in authors’ writings, focusing on semantic and syntactic features. Furthermore, we probe several innovative deep learning models, namely, Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), and Bidirectional Encoder Representations from Transformers (BERT). Beyond that, a novel enhanced BERT model is proposed by incorporating gender-specific textual features. Through various experiments, the results underscore the potential of both BERT and the textual features, with the enhanced BERT model reaching 91% accuracy and the deep learning models ranging from 80% to 90% accuracy. We also employ these features for AGI in informal, dialectal text, with the enhanced BERT model reaching 68.7% accuracy. This demonstrates that these gender-specific textual features are conducive to AGI across MSA and dialectal texts.

Keywords: natural language processing (NLP); deep learning; text mining; BERT; textual analysis; transformers-based models

1. Introduction

Author Gender Identification (AGI) plays a vital role across numerous contexts [1]. It has broad implications that extend beyond straightforward categorization [2]. AGI has both economic effects and commercial advantages, especially regarding market targeting and branding [3]. Tailoring products and services to individual gender preferences, demands, and communication styles can increase market reach and sales [4]. Beyond broadening our understanding of language, using writing to identify gender holds important implications in the digital age. It can enhance online content moderation by identifying gendered hate speech or mitigating biases in written content. These insights can also help marketers craft more targeted strategies and messaging to reach their intended audiences. With such demand in today’s digital era come security concerns, as there is an increasing reliance on gender-based verification to prevent impersonation [5]. Cybercriminals may pose as users of a particular gender to make their impersonation more credible. A case that emphasizes the importance of AGI is that of the infamous “Lonely Heart” hacking group. This organization perpetrated romance scams in which male hackers specialized in creating female identities to defraud victims into handing over money or divulging sensitive personal information (https://www.asisonline.org/security-management-magazine/latest-news/online-exclusives/2022/targeting-all-lonely-hearts/ (accessed on 6 November 2024)). All in all, AGI can deepen understanding of how language conveys identity while providing practical means to improve safety, inclusivity, and communication on the internet.

On the other hand, performing AGI on formal writing in MSA may be more difficult than on informal writing found in dialectal text. In formal and serious contexts, the text has a consistent tone, minimizing personal linguistic peculiarities. Another problem is dealing with authors who deliberately employ formal language and a neutral writing style, such as in court documents and professional communications. These variables make it more difficult to discover gendered language subtleties in formal and conservative settings than in informal writing styles. Nevertheless, an author’s writing style may still shape vocabulary selection and sentence structures, which aid in recognizing gender-related linguistic patterns [6]. As part of this study, we conducted a thorough evaluation of Arabic AGI tasks to capture nuances in writing that might be relevant in determining gender.

The examination of writing can be challenging because word use changes dynamically with context. Therefore, we focused on columns in newspaper platforms to represent the formal text genre. Columns have dedicated themes that combine personal opinions with reporting on events. They are usually written in a formal setting that presents the journalistic style found in today’s news articles. However, columnists may still embed their personal views [7]. Accordingly, our methodology involved scrutinizing columns to discern linguistic patterns associated with gender within formal text genres in MSA.

Although various studies have explored AGI, most have applied machine learning and deep learning techniques with general features not tailored to gender specifics [8]. In this study, we focused on exploring the influence of two feature engineering categories, namely Bidirectional Encoder Representations from Transformers (BERT) word embedding and textual features (semantic and syntactic), with deep learning and BERT-based models, in an effort to realize AGI. Deep learning models allow for more accurate, scalable, and efficient identification than traditional methods because they can efficiently handle and learn from large, complex datasets. Accordingly, Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), and BERT were employed. Furthermore, the textual features were derived from each gender’s writings and implemented as gender-specific features, which were combined with BERT to capitalize on the synergy between BERT’s deep contextual comprehension and the discriminative power of these features to address AGI.

This research advances the state-of-the-art methodology of Author Gender Identification in the following ways:

Propose a large gender-labeled dataset of 8k Arabic news columns in MSA format.
Provide in-depth textual feature analysis that facilitates AGI.
Construct innovative deep learning models by employing BERT word embedding and textual features.
Develop an enhanced BERT model utilizing textual features for AGI.

The remainder of this paper is organized as follows: Section 2 describes related work, and Section 3 discusses the methodology of our proposed approach, including dataset analysis, preprocessing, feature engineering, and experiment setup. The results are presented in Section 4, along with the error analysis. Section 6 validates the performance of the optimal model on real-world articles, Section 8 lists the limitations and future work, and Section 9 concludes the paper.

2. Related Works

AGI recognizes gender-specific patterns and traits in tweets or texts to predict a user’s gender automatically. It makes use of text mining and natural language processing (NLP) to identify variations in word usage and writing style and to infer whether a user’s gender is male or female [9].

Liu et al. [10] examined inference methods for estimating age and determining gender classification using Twitter data. The authors employed machine learning and deep learning models to analyze the importance of various language representations for demographic inference. The utility of several feature sets using various techniques on a dataset derived from Wikidata was also examined. The findings showed that using bigrams or sequential patterns yields results for age inference that are on par with neural network models and marginally better than unigram text characteristics. Consequently, when analyzing demographic inference tasks like age, a simpler model makes more sense. For gender, however, it is vital to use richer language models like deep learning, especially those that use sentence embeddings. The use of richer language models is important because they emphasize the various roles that language plays in inferring demographic information on social media. Overall, the results indicated that unigrams and traditional models that use statistical features have remarkable performances. Similarly, neural networks, such as Siamese network configurations that pay close attention to user bios and tweets, especially those that use language embeddings, were also found to perform exceptionally well. The age differences were minor, whereas gender was more significant.

Sezerer et al. [11] examined the AGI of social platform X (formerly Twitter) users using a Turkish dataset comprising 3368 users in the training set and 1924 users in the test set, where each user has 100 tweets. To validate the work of the annotators, all of whom were Turkish speakers, the annotations were cross-checked using random subsets. A standard classifier based on the traditional bag-of-words approach was used as a baseline, achieving a 72.32% accuracy score. Interestingly, the observations also shed light on Turkish Twitter users’ patterns, where 17.5% of the users were nonhuman accounts. This observation indicates that Twitter is more than just a social media tool for certain people. Approximately 2% of the users were found to be bots or automated accounts. This implies that, given a random selection of Twitter data, there is a probability that at least 2% of activities originate from bots or automated accounts. Although the study targeted AGI, these findings are noteworthy.

In a study by Sarwar et al. [12], an Author Gender Identification method (AGI-P) was introduced and evaluated using a new dataset of 1944 samples. It used customized, fine-tuned transformer-based approaches. The results showed that AGI-P outperformed the leading machine learning classifiers and pretrained multilingual language models (such as DistilBERT, mBERT, XLM-RoBERTa, and multilingual DeBERTa), achieving an accuracy of 92.03%.

Another study by Taha et al. [13] introduced three new system architectures for speaker identification to overcome the limitations of current diarization and voice-based biometric systems. It is argued that traditional diarization segments audio by speech timing but cannot differentiate speakers, whereas biometric systems only work with single-speaker recordings. Therefore, the new architectures incorporate gender, emotion identification, and diarization at the segment level. When tested on the VoxCeleb and RAVDESS databases, these proposed system architectures showed improved recognition accuracy and were able to effectively handle multiple speakers and emotional variations, achieving over 98% accuracy in gender and emotion classification.

In yet another study, Tüfekci and Bektaş Kösesoy [14] developed and assessed four algorithms, Naive Bayes, Random Forest, LSTM, and CNN, using a dataset of 43,292 Turkish news articles (IAG-TNKU). The LSTM algorithm achieved the highest accuracy at 88.51%. In this context, the study highlights the effectiveness of advanced deep learning techniques for gender identification, apart from introducing a valuable dataset for future research.

A study by Sakaki et al. [15] presented an approach that combines text and image processing to increase the accuracy of gender inference. In other words, the text and image data of a Japanese Twitter user was used to create a gender inference that relied on logistic regression to obtain probability scores. The empirical results showed that the accuracy score was 64.21% and 84.63% for image processing and text processing, respectively. On the other hand, the results of the combined methods achieved an accuracy score of 85.11%.

Similarly, a study by Taniguchi et al. [16] presented a novel hybrid approach to gender inference. To determine whether a person is male or female, a logistic regression was used to integrate the necessary information from a combination of text and image classifiers. This approach was found to be more effective than conventional methods. The results of the analysis showed that the basis point achieved an 80.25% accuracy score, which is 1.35 points higher when compared to the standard combination method. The study also confirmed that by examining the logistic regression weightage, the rate at which each image’s information contributes to the inference can vary for various genders.

ElSayed and Farouk [17] used neural network models to determine the gender of authors writing tweets in Egyptian dialects. Their study built a multichannel model based on convolutional and bidirectional layers, called C-Bi-GRU. The model achieved an AGI accuracy of 91.37%.

AlZahrani and Al-Yahya [18] addressed the gap in the context of authorship attribution in NLP, specifically to identify the text authorship of classical Arabic texts in Islamic law. This study targets a similar issue as AGI in terms of author writing traits. The authors evaluated pretrained transformer models on a newly created dataset from digital Islamic law sources. Among the fine-tuned models—such as AraBERT, AraELECTRA, ARBERT, and MARBERT—both ARBERT and AraELECTRA achieved the highest accuracy at 96%. This result indicates their strong performance in author identification, especially for Islamic legal texts.

Similarly, Hussein et al. [19] examined Egyptian dialects on Twitter to detect the gender of tweet writers. To achieve the best AGI results for tweets, multiple experiments were conducted on the Egyptian Dialect Gender Annotated and PAN-AP’17 datasets. Specifically, an innovative mixed feature vector and an N-gram feature vector with ensemble weighting were used with two classifiers, Logistic Regression and Random Forest, to identify gender characteristics in Egyptian tweets from a dataset containing 70,000 tweets per gender. With these methods, the experiments achieved an overall AGI accuracy of 87.6%.

According to the previous studies, summarized in Table 1, most Arabic AGI tasks have focused on informal text, such as tweets. However, there has been a scarcity of experiments on AGI in formal texts that integrate new transformer-based models with textual aspects to maximize both capacities. Formal texts adhere to standards; are rich in context, domain-specific, and content-focused; have less noise and irrelevant data; and are often longer, allowing models to capture more linguistic elements and patterns linked with gender, resulting in more accurate AGI models.

3. Methodology

This research aimed to compile an Arabic AGI model by examining distinctive textual features across genders. The proposed models are content-based models that focus on the textual content within the columns. The input to these models is a stream of columns, and the output is the class of the genders of the columns’ authors. After reviewing related work, we now present an overview of the methodology used throughout this research, as shown in Figure 1.

3.1. Dataset

One of the major contributions of this paper is the creation of a large Arabic newspaper column dataset that is freely available to any interested academics. We chose to collect columns because they follow a formal writing style with features that need to be studied from the gender identification perspective. The linguistic diversity in the columns between genders may reveal distinctive linguistic markers pertaining to a specific gender. We collected columns from news platforms in two Arab nations: Okaz (www.okaz.com.sa (accessed on 1 November 2024)) from Saudi Arabia and Youm7 (www.youm7.com (accessed on 1 November 2024)) from Egypt. The extraction technique utilized a Python scraper expressly programmed to retrieve the data from these news platforms, covering the period from 1 January 2022 to 31 December 2022. The extracted metadata for each column included the title, columnist’s name, publication date, topic domain, and columnist’s image (if available). The extraction process yielded 8000 Arabic columns covering various topics. The columns’ subject areas encompassed politics, technology, culture, and economics. A sample of a column article authored by a female writer and another sample authored by a male writer are presented in Table 2.

3.2. Gender Annotation

To label each column according to its author’s gender, we recruited 20 female native Arabic speakers with computer science degrees from a university in the Kingdom of Saudi Arabia. The main objective was to rely on the author’s name associated with each column to assign its gender label. Each participant was given 400 columns, each with the author’s name, and instructed to follow these guidelines when labeling:

Match the author’s name with the names found in the Arabic gender database [20]. This database includes over 6.5 million Arabic entries, including names labeled with their associated gender. If a name is listed as a “gender-neutral name”, meaning it can be used for any gender, the column is excluded to avoid mislabeling.
If the name is not found in the database, the author’s picture associated with the article is used as a reference to determine gender.
If the name is absent from the database and no image is available, recruiters are allowed to cast a vote. The vote must be based on a dedicated search for the author’s full name on social media platforms, such as X, to find associated information indicating their gender, such as a picture or profession. This process follows the method employed to annotate the Arabic gender tweets dataset in a study by Mubarak et al. [21].

Table 3 presents the dataset’s annotation statistics for the authors’ gender labeling process. Each recruiter was instructed to specify the gender labeling method as either “database”, “picture”, or “vote”. As shown in the table, the names of 54 female authors (1785 columns) and 62 male authors (1896 columns) were matched in the database. Moreover, 12 female (173 columns) and 6 male (71 columns) authors were labeled from their associated pictures. The “vote” gender labels resulted from recruiters searching for the authors’ names on social media accounts on platform X. The main reason for turning to a social media platform is that an author’s profile bio presents useful information about the author’s gender. For instance, one author’s profile bio mentioned she was a كاتبة (writer) and محامية (lawyer), with both Arabic terms using feminine-specific suffixes, clearly identifying the author as female. To further validate the accuracy of the six voted labels, the authors’ full names were provided to two additional recruiters, who conducted the same search process to cast their votes independently from the initial recruiter. A label was assigned only if at least two of the three votes were cast for the same gender. Through this voting process, four female (42 columns) and two male (33 columns) authors were labeled, as shown in Table 3.
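The labeling cascade described above (database, then picture, then a two-of-three vote) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: the `name_db` dictionary, the `"neutral"` marker, and the function name `resolve_gender` are all hypothetical.

```python
from collections import Counter

def resolve_gender(name, name_db, picture_label=None, votes=()):
    """Return (label, method) following the database -> picture -> vote order.

    name_db is a hypothetical dict mapping a name to "female", "male",
    or "neutral"; it stands in for the Arabic gender database [20].
    """
    entry = name_db.get(name)
    if entry == "neutral":
        # Gender-neutral names are excluded to avoid mislabeling.
        return None, "excluded"
    if entry in ("female", "male"):
        return entry, "database"
    if picture_label in ("female", "male"):
        return picture_label, "picture"
    if votes:
        label, count = Counter(votes).most_common(1)[0]
        # A voted label needs at least two matching votes (out of three).
        if count >= 2:
            return label, "vote"
    return None, "unresolved"

if __name__ == "__main__":
    db = {"author_a": "female", "author_b": "neutral"}
    print(resolve_gender("author_a", db))
    print(resolve_gender("author_c", db, votes=["male", "female", "male"]))
```

The second element of the returned pair mirrors the "database", "picture", or "vote" method that each recruiter was asked to record.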

To assess the consistency of gender labels assigned by all three recruiters, we applied Fleiss’ kappa, a statistical measure suitable for evaluating inter-rater agreement among multiple raters [22]. Fleiss’ kappa measures reliability while adjusting for agreement expected by chance. The resulting kappa value was 0.86, which, on the widely used Landis and Koch scale, falls in the highest band (0.81 to 1.00, almost perfect agreement) and therefore indicates strong agreement among the raters.
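For readers unfamiliar with the statistic, the following is a minimal sketch of Fleiss' kappa in plain Python. The input layout (one row per rated author, one column per category, each cell holding the number of raters who chose that category) is an assumption made for illustration; the toy data are invented and are not the study's ratings.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] = number of raters who assigned item i to category j.
    Every row must sum to the same number of raters.
    Note: the value is undefined (division by zero) when all ratings
    fall into a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")

    # Per-item observed agreement.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)

if __name__ == "__main__":
    # Toy example: 4 authors, 3 raters, columns = [female, male].
    toy = [[3, 0], [2, 1], [0, 3], [1, 2]]
    print(round(fleiss_kappa(toy), 3))
```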

The dataset’s compilation processes, including annotation and revision, spanned six months (2 February 2023 to 8 August 2023). Only articles that met the set requirements were included. The final dataset consisted of 2000 columns authored by 70 writers and labeled as ‘female’ and an additional 2000 columns authored by 70 writers and labeled as ‘male’.

3.3. Dataset Analysis

In Figure 2 and Figure 3, we present the most common words used in the columns authored by females and males, respectively. The objective was to gain insight into the word use of each gender, which would theoretically indicate their interests and perspectives. First, we found that both genders shared some words in columns on political topics, such as الرئيس (president), الدول (countries), and العالم (the world). However, male-authored columns often exhibited a more analytical writing style, with less emphasis on emotional factors. This insight is supported by the frequent words in Figure 3, such as العمل (work), الحوار (dialogue), تحقيق (investigation), السياسة (politics), and الحرب (war). On the other hand, female-authored columns included more emotional (spiritual) and social words, such as الله (Allah; God), الحب (love), المرأة (women), الم (pain), المجتمع (society), and العلاقات (relationships), as exhibited in Figure 2. Second, in terms of quantifiers, such as many, few, and all, male and female authors showed similar usage. Interestingly, male authors frequently referred to individuals by their social status or affiliation, such as الوزير (the minister) and السعودي (Saudi), whereas female authors had a higher tendency to express positive feelings. We observed that female authors often used positive terms that reinforced optimism (67% compared to 42% for male authors), whereas male authors concentrated on detailing undesirable events without emotional reinforcement. These findings on gender-specific writing styles align with those of [6,23].
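A most-common-words comparison of this kind can be sketched with a simple frequency count. The snippet assumes each column is already preprocessed and whitespace-separated; the function name `top_words` and the optional stopword set are illustrative choices, not details from the study.

```python
from collections import Counter

def top_words(columns, k=10, stopwords=frozenset()):
    """Return the k most frequent tokens across a list of column texts."""
    counts = Counter(
        token
        for column in columns
        for token in column.split()
        if token not in stopwords
    )
    return counts.most_common(k)

if __name__ == "__main__":
    female_columns = ["الحب المجتمع الحب", "المجتمع العلاقات"]
    print(top_words(female_columns, k=2))
```

Running the same function on the female-authored and male-authored subsets separately gives the two ranked lists that figures such as Figure 2 and Figure 3 visualize.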

3.4. Dataset Preprocessing

Preprocessing is an important step when handling large amounts of data: it ensures consistency, reduces noise, and makes the data a more suitable input for the models. For the Arabic language, preprocessing is especially crucial due to the language’s rich morphology and unique script characteristics. The proposed text preprocessing comprised several phases, which are described below:

Removing punctuation, diacritics, nonalphabetic characters, and non-Arabic words. We coded a Python script that automatically processes these functions.
Standardizing the various writings of آ, أ, إ (Alef) by replacing them with bare ا, and substituting ة (Teh Marbuta) with ه (Haa’). Tashaphyne, an innovative Arabic NLP tool that performs normalization techniques [24], was utilized in this preprocessing step.
Tokenization to split the input sentences into individual tokens. The Farasa Segmenter [25] was utilized. We found it useful for Arabic because it takes into account the unique characteristics of the morphology of Arabic words.
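The three phases above can be approximated with the standard library alone. The sketch below is a stand-in under stated assumptions: regular expressions replace Tashaphyne for normalization, and whitespace splitting replaces the Farasa Segmenter, so it does not reproduce the morphological segmentation used in the study. The Unicode ranges are choices made for this illustration.

```python
import re

# Tashkeel marks, superscript alef, and tatweel (assumed set for this sketch).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# Alef with madda, hamza above, hamza below.
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")
# Anything that is not a basic Arabic letter or whitespace.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")

def normalize(text):
    """Strip diacritics, unify Alef forms, map Teh Marbuta to Heh, drop non-Arabic."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)    # -> bare Alef
    text = text.replace("\u0629", "\u0647")     # Teh Marbuta -> Heh
    text = NON_ARABIC.sub(" ", text)            # punctuation, digits, Latin
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Whitespace tokenization (a simplification of morphological segmentation)."""
    return normalize(text).split()

if __name__ == "__main__":
    print(tokenize("Hello مرحبا 123 كاتبة!"))
```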

3.5. Dataset Statistics

Table 4 presents the dataset’s statistics in terms of words and characters after preprocessing. According to the table, columns written by female authors had slightly higher mean word (425.65) and character (2701.16) counts than male authors’ columns (400.78 words, 2544.97 characters). These findings suggest that, on average, female-authored columns are marginally longer in both words and characters.

The standard deviation, a metric indicating the variability of individual text lengths from the mean [26], is greater in male-authored columns in both word (259.92) and character counts (1664.82) than in female-authored columns (215.20 words, 1376.76 characters). The results show that male-authored columns vary more widely in length, ranging from very short to significantly long pieces. In contrast, female-authored columns are more consistent and lie nearer to the average length, demonstrating less variability. This trend is also reflected in the minimum and maximum values, where the columns written by male authors have lower minimum and higher maximum word and character counts than those written by female authors.
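Length statistics of this kind can be computed as sketched below. Two details are assumptions, since the text does not state them: the sample standard deviation is used, and character counts include whitespace. The function name `length_stats` is illustrative.

```python
import statistics

def length_stats(columns):
    """Mean, standard deviation, minimum, and maximum of word and character counts."""
    word_counts = [len(column.split()) for column in columns]
    char_counts = [len(column) for column in columns]  # includes spaces (assumption)

    def summary(values):
        return {
            "mean": statistics.mean(values),
            "std": statistics.stdev(values),  # sample standard deviation (assumption)
            "min": min(values),
            "max": max(values),
        }

    return {"words": summary(word_counts), "chars": summary(char_counts)}

if __name__ == "__main__":
    print(length_stats(["a b", "a b c d", "a b c"]))
```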

Further insights about the data are visualized in Figure 4, which reveals that male-authored columns varied considerably in word count, ranging from 82 to 2185 words. We can observe a peak around the 200–300 word range in the figure, followed by a gradual decline in frequency as the word count increases. These findings suggest that most of the male-authored columns are relatively short, with a small proportion of outliers exceeding 700 words. Conversely, word counts of columns written by females varied less within the same range. This distribution indicates that female-authored columns have consistent lengths, clustering closely around the mean and median values with fewer outliers than male-authored columns.

Figure 5 illustrates the character count frequency, which depicts the contrasting variance in character count between columns produced by males and those authored by females. The minimum and maximum word counts for male and female texts are relatively comparable. Although fundamental metrics such as word and character counts provide a general overview, they overlook the intricate linguistic patterns that can effectively distinguish between genders and that add depth and precision to gender-based textual features. Therefore, our primary goal was to investigate language patterns that facilitate the identification of subtle variations in writing styles between genders.

3.6. Feature Engineering

Feature engineering is the process of developing new features from an existing set of features and transforming chosen features into a format that a computer can easily understand. This research involves a thorough utilization of two strategies for extracting features, namely, word embedding and textual features.

3.6.1. BERT Word Embedding

Word embedding approaches in language modeling use multidimensional vectors of real numbers to represent words or sentences. Vectors are indicative of the prevalence of the occurrence or co-occurrence of words or phrases within a corpus. The nonlinear nature of a neural network, along with its ability to efficiently incorporate pretrained word embeddings, often leads to improved classification accuracy [27].

Using BERT word embeddings entails the use of a pretrained BERT model to generate dense vector representations (embeddings) of words or tokens [28]. In contrast to standard word embedding models such as Word2vec or GloVe, which produce a single fixed vector for each word, BERT supplies contextualized embeddings. This means that the same word may have distinct embeddings, depending on its context in a sentence. In this study, we used the BERT-base model [29] from the Hugging Face platform, which has been successfully used in various Arabic studies [30].

3.6.2. Textual Features

One of the primary objectives of this study was to identify linguistic patterns associated with gender and to utilize those patterns to train AGI models. Researchers often categorize gender-specific features according to whether they derive from the writing style or from the content of the text [31]. Studies in support of this notion applied human psychology writing analysis to examine these features [32,33].

Accordingly, we categorized the gender-specific textual features from a linguistic point of view into two categories—syntactic and semantic—similar to those applied in human psychology writing analysis studies.

Syntactic features: These are used in the theoretical study of lexemes, their linguistic structure, and their syntactic connections within a given language. They include part-of-speech (POS) tags, which provide information about the grammatical role of each word in the text. These tags have been effectively used in several AGI studies [34,35]. This textual feature category is capable of producing a comprehensive set of markers to investigate the text, as they are the main building blocks used to create statements. The POS tags examined were limited to those that previous studies found useful in AGI: nouns, verbs, adverbs, adjectives, proper nouns, conjunctions, prepositions, and pronouns.
Semantic features: These cover semantic categories that are too fine-grained to be captured by general POS tags. For semantic features, we focus on words that fall into psycholinguistic and sociolinguistic subcategories. The former relates to the field that merges psychology and linguistics to investigate how humans acquire, produce, comprehend, and utilize language [36]. The latter addresses the relation between language and society, specifically how language usage differs across various social groups and depends on criteria like ethnicity, gender, age, and geographical location [37]. These subcategories have been explored in other AGI studies [38,39]. In this study, the set of semantic markers investigated was the following: negators, time, quantifiers, month, nationality, exception, religion, location, title, and day.

Table 5 displays all 18 textual features extracted, with examples of words related to the relevant category. Both textual feature categories were extracted using the Arabic-specific tagger, Arabic Linguistic Pipeline (ALP) [40]. Uniquely, it offers POS tags that are meticulously crafted in accordance with Arabic grammar. Specifically, its tags correspond to the grammatical rules of gender (masculine and feminine) and number (singular, dual, and plural) in the Arabic language. For this, we grouped all the variants of a POS tag into their general POS category. A sample of an Arabic-specific tagged article in ALP is shown in Figure 6. The ALP tool also offers specific semantically optimized taggers that identify words concerning negations, exceptions, religious terms, and several more. According to a study by the authors of [41], the tool achieved the highest tagging accuracy when compared with other well-known tools.
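To illustrate how tagged tokens might be turned into an 18-dimensional feature vector, the sketch below groups fine-grained tags into their general category and computes relative frequencies. It does not use the ALP tool: the dotted tag format (for example `noun.fem.pl`), the category identifiers, and the normalization by document length are all assumptions made for this example.

```python
from collections import Counter

# The 8 syntactic and 10 semantic categories named in the text.
SYNTACTIC = ["noun", "verb", "adverb", "adjective", "proper_noun",
             "conjunction", "preposition", "pronoun"]
SEMANTIC = ["negator", "time", "quantifier", "month", "nationality",
            "exception", "religion", "location", "title", "day"]
FEATURES = SYNTACTIC + SEMANTIC
FEATURE_SET = set(FEATURES)

def general_category(tag):
    """Collapse a fine-grained tag such as 'noun.fem.pl' to 'noun'.

    The dotted format is hypothetical; a real tagger has its own scheme.
    """
    return tag.split(".")[0]

def feature_vector(tagged_tokens):
    """Relative frequency of each of the 18 categories in one document."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        category = general_category(tag)
        if category in FEATURE_SET:
            counts[category] += 1
    total = len(tagged_tokens) or 1  # avoid division by zero on empty input
    return [counts[name] / total for name in FEATURES]

if __name__ == "__main__":
    doc = [("w1", "noun.fem.sg"), ("w2", "noun.masc.pl"),
           ("w3", "verb"), ("w4", "negator")]
    print(feature_vector(doc))
```

Using relative rather than raw counts keeps long and short columns comparable, which matters here because column lengths vary widely.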

3.7. Proposed Models

We present four approaches to developing AGI models. In the first approach, we compile deep learning models with BERT word embeddings. In the second, we compile models with the same deep learning classifiers but with the proposed textual features. The third and fourth approaches cover the architecture details of the BERT-based models and of the enhanced BERT model that incorporates the textual features, respectively.

The resulting textual feature vectors from the above process are fed to several classifiers, which are listed below.

3.7.1. Convolutional Neural Network (CNN)

Deep neural networks have been used effectively for object identification and text classification [42]. This study’s CNN model was composed of several layers. The first layer was a 1D convolutional layer with 64 filters, a kernel size of 3, and a ReLU activation function. This layer was followed by a max-pooling layer with a pool size of 2. A second 1D convolutional layer was then added, this time with 32 filters, a kernel size of 3, and a ReLU activation function. The output of the preceding layers was flattened into a one-dimensional vector and passed to a dense layer with 64 units and a ReLU activation function. To prevent overfitting, a dropout layer with a rate of 50% was applied.
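
To make the layer stack concrete, the following sketch traces how the sequence length changes through the convolutional part of such a network. It rests on several assumptions that the text does not state: an input length of 512 positions, no padding ("valid" convolutions), a stride of 1, and a reading of the pooling step as a pool size of 2 and of the second kernel as size 3.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (length - kernel_size) // stride + 1

def maxpool1d_out_len(length, pool_size):
    """Output length of non-overlapping 1D max pooling (stride equals pool size)."""
    return length // pool_size

seq_len = 512                              # assumed input length
after_conv1 = conv1d_out_len(seq_len, 3)   # first Conv1D, 64 filters
after_pool = maxpool1d_out_len(after_conv1, 2)
after_conv2 = conv1d_out_len(after_pool, 3)  # second Conv1D, 32 filters
flattened = after_conv2 * 32               # size of the vector fed to the 64-unit dense layer
```

Under these assumptions the flattened vector has 253 × 32 = 8096 entries; a different input length or padding mode would change this number but not the overall pattern.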

3.7.2. Long Short-Term Memory (LSTM)

This network is a modified variant of the standard recurrent neural network developed to capture long-term dependency information. LSTMs process inputs sequentially, making them suitable for managing textual data: each LSTM unit takes the embedding vector of the current word and the output from the previous unit as inputs. To address our AGI task, we assembled the model with a first LSTM layer containing 100 units. The output of this layer is regularized via a dropout layer with a rate of 0.2. Following that is a dense layer with 64 units and a ReLU activation function, followed by a further dropout layer with a rate of 0.5. Finally, a fully connected output layer with one unit and a sigmoid activation function generates the probability of the positive class, from which the gender label is derived.
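
The sequential processing described above can be sketched with a single LSTM time step in plain Python. This is an illustrative toy using the standard LSTM gate equations with one hidden unit, hand-picked weights, and two-dimensional embeddings; it is not the 100-unit layer used in the experiments, and the final scaling by 2.0 is only a stand-in for the trained output layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of an LSTM with a single hidden unit.

    x is the embedding of the current word; h_prev and c_prev are the output
    and cell state carried over from the previous step.
    """
    def pre_activation(gate):
        w_x, w_h, bias = params[gate]
        return sum(w * xi for w, xi in zip(w_x, x)) + w_h * h_prev + bias

    i = sigmoid(pre_activation("input"))        # how much new information to write
    f = sigmoid(pre_activation("forget"))       # how much of the old cell state to keep
    o = sigmoid(pre_activation("output"))       # how much of the cell state to expose
    g = math.tanh(pre_activation("candidate"))  # candidate cell content
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Toy weights per gate: (input weights, recurrent weight, bias).
params = {
    "input": ([0.5, -0.3], 0.1, 0.0),
    "forget": ([0.2, 0.4], -0.2, 1.0),
    "output": ([-0.1, 0.6], 0.3, 0.0),
    "candidate": ([0.7, 0.2], 0.5, 0.0),
}

sentence = [[0.1, 0.9], [0.8, -0.4], [-0.5, 0.3]]  # three toy word embeddings
h, c = 0.0, 0.0
for embedding in sentence:
    h, c = lstm_step(embedding, h, c, params)

probability = sigmoid(2.0 * h)  # stand-in for the one-unit sigmoid output layer
```

The loop shows the key property of the architecture: the state produced for one word is an input to the computation for the next word, so the final state summarizes the whole sequence.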

3.7.3. Bidirectional LSTM (BiLSTM)

To address our AGI task, we used a BiLSTM network. This network consists of two LSTMs that analyze the text in both forward and backward directions. Both networks have the same parameters as the LSTM layer previously described, with outputs regularized via a dropout layer with a dropout rate of 0.5. The regularized outputs are combined and transformed into a final label using a fully connected layer that generates a probability distribution across the labels.

3.7.4. BERT

Developed by Google, this transformer-based machine learning model is specifically designed for NLP workloads [28]. BERT relies on self-attention, applied in both directions by the transformer encoder layers that form the core of its architecture. To perform the AGI task, we opted for the pretrained BERT-base model for Arabic [29] from the Hugging Face platform. The model was pretrained on around 8.2 billion words and contains around 110 million parameters, with 12 transformer layers and 12 attention heads. It has a hidden size of 768, and the maximum sequence length for tokenization in this experiment is set to 512.

3.7.5. Enhanced BERT

One of the primary goals of this research was to enhance the AGI task. To do this, we combine the BERT model with the previously described textual features. Initially, we used the previously mentioned pretrained BERT-base model to set up the tokenizer and model. The sequences were tokenized and encoded in BERT format, with padding and truncation to a maximum length of 128 tokens. Next, we transformed the labels from the training and test sets into TensorFlow tensors to guarantee model compatibility. After that, we loaded the pretrained BERT model and froze its layers by marking them as non-trainable, which guaranteed that the BERT weights stayed constant during training.

A global average pooling layer was applied to the BERT embeddings to obtain the pooled result. In addition to the BERT model inputs, we provided an input layer for the additional textual features. This input passes through a dense layer with a ReLU activation function, and its output is concatenated with the pooled BERT output. The concatenated output is then sent through a dense layer with a ReLU activation function and a 50% dropout rate. The final output layer generates the predictions using a sigmoid activation function. Figure 7 shows a detailed breakdown of the enhanced BERT architecture.
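
The two-branch design described above can be sketched as a forward pass in plain Python. The sketch is a toy under stated assumptions: the token embeddings stand in for the output of the frozen BERT encoder, all dimensions are shrunk (2 instead of 768 for the hidden size, 3 instead of 18 features), the weights are arbitrary, and dropout is omitted because it is inactive at inference time. The function and parameter names are hypothetical.

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def dense(v, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]

def global_average_pool(token_embeddings):
    """Average the token vectors over the sequence axis."""
    n = len(token_embeddings)
    return [sum(column) / n for column in zip(*token_embeddings)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def enhanced_forward(token_embeddings, text_features, p):
    pooled = global_average_pool(token_embeddings)               # BERT branch
    feat = relu(dense(text_features, p["W_feat"], p["b_feat"]))  # textual-feature branch
    merged = pooled + feat                                       # list concatenation
    hidden = relu(dense(merged, p["W_hid"], p["b_hid"]))
    return sigmoid(dense(hidden, p["W_out"], p["b_out"])[0])

# Arbitrary toy parameters (hypothetical values).
params = {
    "W_feat": [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]], "b_feat": [0.0, 0.1],
    "W_hid": [[0.1, 0.2, -0.3, 0.4], [-0.2, 0.1, 0.5, 0.0]], "b_hid": [0.0, 0.0],
    "W_out": [[0.6, -0.4]], "b_out": [0.05],
}
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # stand-in for frozen BERT token embeddings
features = [0.5, 0.2, 0.3]                     # stand-in for the textual feature vector
prob = enhanced_forward(tokens, features, params)
```

Only the parameters of the feature branch, the hidden layer, and the output layer would be trained in such a setup, since the encoder that produces the token embeddings is frozen.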

3.8. Experimental Setup

To train the aforementioned models, experiments were conducted multiple times with different parameters, and the following hyperparameters were identified as the most effective. The deep learning models were set with a batch size of 32, a maximum sequence length of 512, a learning rate of 0.001, the Adam optimizer, and 20 epochs with early stopping based on the validation loss. For the BERT and enhanced BERT models, we used the same settings for both: maximum sequence length = 512, batch size = 16, epochs = 10, Adam optimizer, learning rate = 2 × 10⁻⁵, and binary cross-entropy as the loss function while tracking the accuracy metric. Training was stopped early, and the best weights were restored, if the validation loss did not improve for 5 consecutive epochs; the learning rate was reduced by a factor of 0.2 if the validation loss did not improve for two consecutive epochs. The difference in the number of epochs between the deep learning models and the BERT models reflects the moderate size of the dataset and the fact that BERT is already extensively pretrained, so training for more epochs might cause overfitting.
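
The interplay between early stopping and learning-rate reduction can be illustrated by replaying a sequence of validation losses. The sketch below is a simplified, framework-free approximation: both rules share one record of the best loss, and the learning-rate counter resets after each reduction. The loss values are invented for illustration, and the sketch does not claim to reproduce the exact callback behavior of any particular library.

```python
def simulate_schedule(val_losses, lr=2e-5, stop_patience=5, lr_patience=2, factor=0.2):
    """Replay validation losses, applying early stopping and LR reduction on plateau."""
    best, best_epoch = float("inf"), None
    stop_wait = lr_wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember this epoch (its weights would be restored later).
            best, best_epoch = loss, epoch
            stop_wait = lr_wait = 0
            continue
        stop_wait += 1
        lr_wait += 1
        if lr_wait >= lr_patience:
            lr *= factor      # reduce the learning rate after a short plateau
            lr_wait = 0
        if stop_wait >= stop_patience:
            return {"stopped_epoch": epoch, "best_epoch": best_epoch, "lr": lr}
    return {"stopped_epoch": None, "best_epoch": best_epoch, "lr": lr}

# Invented validation losses: improvement for three epochs, then a plateau.
result = simulate_schedule([0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50, 0.51])
```

With these invented losses, the learning rate is reduced twice during the plateau before training stops at epoch 8, and the weights from epoch 3 would be the ones restored.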

The AGI models were trained on a server equipped with an Intel(R) Xeon(R) E5-2670 CPU, an NVIDIA(R) GeForce GTX 1080 GPU with 8 GB of video memory, and 64 GB of RAM. The implementation used the Python programming language with the TensorFlow 1.8.0 framework. The cleaned dataset contained 4000 columns, with 2000 columns authored by females and 2000 columns authored by males. It was then loaded and split into 80% training (3400) and 20% testing (600). The data used in our experiments have been made freely available on GitHub (https://github.com/Hanen-Tarik (accessed on 21 November 2024)). The proposed models were evaluated using the most prevalent criteria, namely precision, recall, accuracy, and F1-score, where accuracy is the ratio of correctly predicted labels to the total number of labels in the testing dataset.
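
For reference, the four evaluation criteria named above can be computed from predicted and true labels as in the following sketch. The label encoding (1 for one gender class, 0 for the other) and the toy predictions are assumptions for illustration only and are unrelated to the reported results.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1-score (for the positive class), and overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Toy labels: 1 and 0 stand for the two gender classes (assumed encoding).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
scores = binary_metrics(y_true, y_pred)
```

Calling the function once with `positive=1` and once with `positive=0` gives the per-class precision, recall, and F1-score, while accuracy is the same in both cases.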

4. Results

This section summarizes the findings from the experiments conducted in this research in two sections. The first section presents the evaluation metrics of the compiled models trained on BERT word embeddings and textual features. The second section demonstrates the results and evaluation metrics of the BERT-based transformer model and enhanced BERT model. The performances obtained by the developed models are displayed in Table 6 and Figure 8.

In the first stage of the experiments, we evaluated the performance of various deep learning models (LSTM, BiLSTM, CNN) for gender classification. Initially, we combined deep learning models with BERT embeddings to observe their performance. The LSTM and BiLSTM achieved 69% accuracy, whereas the CNN had 68% accuracy for gender identification. The LSTM achieved a higher precision of 79% for females compared to 64% for males. In contrast, it achieved a higher recall and F1-score for the male class, indicating better performance in identifying male instances. The BiLSTM and CNN models exhibited trends comparable to the LSTM with slightly varying metrics.

After that, the textual features were added to the deep learning models to evaluate their performance. Table 6 illustrates how incorporating textual features resulted in significant enhancement of the performance of the deep learning models. The LSTM model’s accuracy jumped to 83%, with notable improvements in precision, recall, and F1-score for both classes. The BiLSTM and CNN models followed this trend, with the CNN model reaching an overall accuracy of 90%, demonstrating the impact of adding textual features on classification performance.

In the second stage of the experiments, we evaluated the performance of the BERT and the enhanced BERT model. The results show that the enhanced BERT model achieved a significantly better performance than the standard BERT model. For example, enhanced BERT boosts accuracy to 91% compared with the standard BERT, which achieved 72% accuracy. In addition to accuracy, enhanced BERT also notably improved the precision, recall, and F1-scores for both classes, highlighting its ability to perform better for gender classification. These findings demonstrate how advanced features and architectures, such as those used with the enhanced BERT model, can significantly boost the performance of deep learning models in gender classification.

5. Error Analysis and Important Features

This section presents an error analysis of the cases in which the AGI models misidentified the author’s gender. These cases represent the vast majority of identification errors across all models. Furthermore, we consider the most important features of the AGI models.

5.1. Error Analysis

To better understand the results, we analyzed some instances where the gender of the author was incorrectly identified. Since the columns were written in a formal register, namely MSA, they tended to exhibit a unified linguistic writing style. Consequently, the models had to rely on distinctive features of each gender in MSA phrases that convey subtle gender cues. These cues were mostly nuanced and crafted in a way that might mask gender-distinctive features. Notably, the majority of the models were able to recognize the unique gender features identified in previous studies. As a result, columns that contained archetypal gender-specific features were attributed to that gender. For example, among the distinctive features related to female authors is that they express a more emotional tone in their writing compared to male writers, which has been supported in previous studies [43,44]. In practice, we found that female-authored columns contained profound emotional language, which may consist of several adjectives and personal pronouns to express attachment to the topic. Consequently, columns that contained many of these textual features were identified as being authored by females, though in some cases, such columns were male-authored, as shown in Table 7. The table shows an excerpt of a column written by a male writer that was misclassified as being written by a female writer; a possible reason is the presence of adjectives like الصادق (honest), الكاذب (false), أكبر (bigger), أكثر (more), الماضية (past), and كبيرا (big).

On the other hand, female-authored columns misclassified as male-authored contained more direct and focused language. According to previous studies [39], male writers tend to write longer statements that emphasize lucidity and conventionally present facts while also referencing general traits in the events explained. Accordingly, this difference in language usage may account for the female-authored columns being misclassified as male-authored columns, as shown in Table 7, where in the excerpt, the author presented the topic using several precise terms denoting solemnity, such as تشهد (witness), تقرير (report), and جدية (serious).

5.2. Important Features

We tested the previous findings by calculating the information gain (IG) to determine the important features. Information gain is a metric used in decision trees to estimate the efficacy of a feature when dividing the dataset into distinct groups. It measures the decrease in entropy, which represents the level of uncertainty about the target variable (class labels), once the value of a feature is known. Essentially, IG indicates how much a given feature contributes to separating the classes. Features with higher IG are considered to be more effective [45].
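
As an illustration of the calculation, the sketch below computes entropy and information gain for a discrete feature. The toy data are invented, and treating each feature as a small set of discrete levels (for example, "high" or "low" usage) is an assumption; the binning actually applied to the textual features is not specified in the text.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - conditional

# Invented toy data: four columns with gender labels and two binned features.
labels = ["F", "F", "M", "M"]
pronoun_use = ["high", "high", "low", "low"]   # separates the classes perfectly
adverb_use = ["high", "low", "high", "low"]    # carries no information about the class

ig_pronoun = information_gain(pronoun_use, labels)
ig_adverb = information_gain(adverb_use, labels)
```

In this toy case the first feature removes all uncertainty about the label (IG of 1 bit), whereas the second removes none (IG of 0), which is the contrast that a ranking by IG is meant to capture.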

Figure 9 displays the top 10 important features, ranked according to the highest IG score. Aligned with the previous findings, pronouns, religious terms, nouns, and conjunctions exhibit the highest feature dominance. This insight corresponds with the findings that female authors express a more emotional tone, which involves the use of pronouns and religious terms, which may explain their dominance in important features in the models’ performance. Another noteworthy observation is the dominance of proper nouns in the important features. This may correlate with the finding that male authors focus on presenting topics by conveying facts composed of various proper nouns. Other important features include quantifiers, negations, conjunctions, and adverbs, which models might have associated with a specific gender because of their prominence in the authors’ writings.

6. Validation of Enhanced BERT on Tweets

To test the performance of the optimal model (enhanced BERT) on real-world text from a different genre, we compiled an in-house gender-specific tweet dataset. To address the diversity and reliability of the data, we manually extracted tweets from well-known users, such as football players, ministers, and social media influencers, who have common gender-specific names, such as محمد (Mohammed) for males and فاطمة (Fatimah) for females. Over one week, a total of 600 tweets, divided into 300 female and 300 male tweets, were collected. After that, we applied the same preprocessing and feature engineering steps as in Section 3.5. Following that, the dataset was passed to the optimal model as unseen test data.

Table 8 presents the results of the optimal model, with an accuracy of 68.7%. Several factors may have contributed to the moderate results, which include:

Difference in text genre: The differences in the writing style between columns and tweets might explain the subpar performance, as the model may fail to adapt to the different language writing styles in different genres. Tweets are often written in informal language, which might contain slang and other dialects, and since the textual features were extracted using the ALP tool, which works with MSA, it may not have accurately defined the semantics and syntactic features in the informal language used in tweets.
Difference in text length distribution: The training data consisted of columns with a large and diverse vocabulary, which allowed the models to identify linguistic indicators of a particular gender. Since tweets have a much lower word count than columns, this may have hindered the model’s ability to identify gender-specific indicators, resulting in average performance. This can be observed in Table 9, in which the mean word count for male and female tweets is 16.71 and 10.74, respectively, compared to 400.78 for male-authored and 425.65 for female-authored columns.

Addressing these challenges might be beneficial to enhance the model’s performance. Nevertheless, the average result produced gives a promising indication of the feasibility of using the enhanced BERT model to identify author gender in different text genres.

7. Discussion

Based on the preceding results, it is evident that models that incorporate textual features achieve the best results. Deep learning models with BERT embeddings and the basic BERT model resulted in lower accuracies (68% to 72%), whereas the same deep learning models with textual features reached up to 90%.

A plausible explanation for this mediocre performance might be the nature of the BERT model, which is trained on enormous amounts of data, making it difficult to capture domain-specific features. However, the textual features were specially customized to gender, making them useful factors.

Among the studies reported in the Arabic AGI literature, our enhanced BERT model was comparable with the best models reported in [17,18,19], which reached accuracy scores of 91.3%, 96%, and 87.6%, respectively. These findings highlight the potential for the integration of gender-specific textual features with BERT to take advantage of BERT’s contextualized understanding, with gender-specific textual features serving as identifiers.

8. Limitations and Future Work

Despite the exhibited success of our models, three significant limitations impacted the validity of our findings and their practical application in real-life scenarios.

The first notable limitation pertains to the narrow focus of the examined textual features, which concentrated on semantic and syntactic linguistics. This restricted scope limited the possibility of identifying gender across other text genres. Another limitation is our exclusive collection of columns to represent the formal text genre. Formal text may include various word usages, depending on context. For example, word use in academic writing differs from that in news articles, yet both are considered formal. Thus, it is important to include a wide range of formal texts across various kinds of writing. Moreover, while the dataset compilation process was carefully conducted, some labeling errors by recruiters and reviewers may still be present. Nevertheless, the researchers made every effort to ensure the columns were accurately gender-labeled.

The limitations presented herein underscore the necessity of future research to expand the range of textual features used. A promising avenue for future work is the inclusion of various formal text datasets and the exploration of other deep learning and transformer-based models with distinct fine-tuning approaches for a successful AGI model. While this research emphasizes the potential advantages of AGI models, we also note the ethical concerns, such as privacy, possible biases, and misuse. In particular, in certain sociopolitical situations, AGI technology may reinforce prejudices or be utilized in ways that contribute to discrimination; thus, caution must be prioritized. Future research may concentrate on establishing controls to reduce these dangers and ensure that the technology is used properly. Notably, as this study contributes to the field of AGI, it focused on investigating the technical aspects of relating textual features to AGI. The researchers have no intention of introducing bias, causing harm, or reinforcing stereotypes.

9. Conclusions

The primary aim of this research was to explore the feasibility of determining an author’s gender through their formal writing. Casual writing may provide linguistic indicators (e.g., personal pronouns) that disclose the author’s gender. The more uniform structure of professional writing, however, makes the task more difficult. This work focused on extracting distinctive textual features related to semantics and syntax. Due to the absence of formal text datasets labeled with gender, we compiled an extensive dataset of columns to exemplify formal texts. We designed and executed multiple experiments utilizing deep learning classifiers, including CNN, LSTM, BiLSTM, and BERT-based models that employed BERT word embeddings and textual features. This study’s results indicate that the BERT model, utilizing concatenated textual features, achieved a high accuracy of 91%. Furthermore, this model demonstrated moderate efficacy (68.7%) in predicting informal text represented by tweets. These results emphasize that our approach is proficient in determining the gender of authors in formal writings. The efficacy of these textual features as gender-specific indicators may serve as a foundation for future studies on Arabic author gender identification in formal text genres.

Author Contributions

Conceptualization, H.H. and K.S.; methodology, H.H. and K.S.; software, H.H. and K.S.; validation, H.H. and K.S.; formal analysis, H.H. and K.S.; resources, H.H. and K.S.; analysis: H.H. and K.S.; investigation: H.H. and K.S.; data curation, H.H.; writing—original draft preparation, H.H.; writing—review and editing, H.H. and K.S.; visualization, H.H. and K.S.; supervision, H.H. and K.S.; project administration, H.H. and K.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset presented in this study can be found on the author’s GitHub.

Acknowledgments

We would like to thank the University of Jeddah for their support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lindqvist, A.; Sendén, M.G.; Renström, E.A. What is gender, anyway: A review of the options for operationalising gender. Psychol. Sex. 2021, 12, 332–344.
2. Cheng, N.; Chandramouli, R.; Subbalakshmi, K. Author gender identification from text. Digit. Investig. 2011, 8, 78–88.
3. Husain, R.; Paul, J.; Koles, B. The role of brand experience, brand resonance and brand trust in luxury consumption. J. Retail. Consum. Serv. 2022, 66, 102895.
4. Molinillo, S.; Aguilar-Illescas, R.; Anaya-Sánchez, R.; Liébana-Cabanillas, F. Social commerce website design, perceived value and loyalty behavior intentions: The moderating roles of gender, age and frequency of use. J. Retail. Consum. Serv. 2021, 63, 102404.
5. Saeed, S. A customer-centric view of E-commerce security and privacy. Appl. Sci. 2023, 13, 1020.
6. Newman, M.; Groom, C.; Handelman, L.; Pennebaker, J. Gender Differences in Language Use: An Analysis of 14,000 Text Samples. Discourse Process. 2008, 45, 211–236.
7. Block, A. Why Newspapers Should Not Have Columnists. 2014. Available online: https://stanforddaily.com/2014/11/09/why-newspapers-should-not-have-columnists/ (accessed on 1 October 2024).
8. Sulochana, B.C.; Pragada, B.S.; Kiran, B.C.; Reddy, G.A.; Venugopalan, M. Author Identity Unveiled: Gender and Age Prediction from Textual Patterns using BERT. In Proceedings of the 2024 4th International Conference on Intelligent Technologies (CONIT), Hubballi, India, 21–23 June 2024; pp. 1–6.
9. Onikoyi, B.; Nnamoko, N.; Korkontzelos, I. Gender prediction with descriptive textual data using a Machine Learning approach. Nat. Lang. Process. J. 2023, 4, 100018.
10. Liu, Y.; Singh, L.; Mneimneh, Z. A Comparative Analysis of Classic and Deep Learning Models for Inferring Gender and Age of Twitter Users. In Proceedings of the 2nd International Conference on Deep Learning Theory and Applications-DeLTA, Online, 7–9 July 2021.
11. Sezerer, E.; Polatbilek, O.; Tekir, S. A Turkish Dataset for Gender Identification of Twitter Users. In Proceedings of the 13th Linguistic Annotation Workshop, Florence, Italy, 1–2 August 2019; pp. 203–207.
12. Sarwar, R.; An Ha, L.; Teh, P.S.; Sabah, F.; Nawaz, R.; Hameed, I.A.; Hassan, M.U. AGI-P: A Gender Identification Framework for Authorship Analysis Using Customized Fine-Tuning of Multilingual Language Model. IEEE Access 2024, 12, 15399–15409.
13. Taha, T.M.; Messaoud, Z.B.; Frikha, M. Convolutional Neural Network Architectures for Gender, Emotional Detection from Speech and Speaker Diarization. Int. J. Interact. Mob. Technol. 2024, 18, 88.
14. Tüfekci, P.; Bektaş Kösesoy, M. Biological gender identification in Turkish news text using deep learning models. Multimed. Tools Appl. 2024, 83, 50669–50689.
15. Sakaki, S.; Miura, Y.; Ma, X.; Hattori, K.; Ohkuma, T. Twitter user gender inference using combined analysis of text and image processing. In Proceedings of the Third Workshop on Vision and Language, Dublin, Ireland, 23–29 August 2014; pp. 54–61.
16. Taniguchi, T.; Sakaki, S.; Shigenaka, R.; Tsuboshita, Y.; Ohkuma, T. A weighted combination of text and image classifiers for user gender inference. In Proceedings of the Fourth Workshop on Vision and Language, Lisbon, Portugal, 18 September 2015; pp. 87–93.
17. ElSayed, S.; Farouk, M. Gender identification for Egyptian Arabic dialect in twitter using deep learning models. Egypt. Inform. J. 2020, 21, 159–167.
18. AlZahrani, F.M.; Al-Yahya, M. A Transformer-Based Approach to Authorship Attribution in Classical Arabic Texts. Appl. Sci. 2023, 13, 7255.
19. Hussein, S.; Farouk, M.; Hemayed, E. Gender identification of egyptian dialect in twitter. Egypt. Inform. J. 2019, 20, 109–116.
20. Halpern, J. Lexicon-driven approach to the recognition of Arabic named entities. In Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, 22–23 April 2009; pp. 193–198.
21. Mubarak, H. Build fast and accurate lemmatization for Arabic. arXiv 2017, arXiv:1710.06700.
22. Fleiss, J.L. Measuring nominal scale agreement among many raters. Psychol. Bull. 1971, 76, 378–382.
23. Choucane, A.M. Gender language differences do men and women really speak differently. Glob. Engl.-Oriented Res. J. (GEORJ) 2016, 2, 182–200.
24. Zerrouki, T. Tashaphyne: A Python package for Arabic Light Stemming. J. Open Source Softw. 2024, 9, 6063.
25. Abdelali, A.; Darwish, K.; Durrani, N.; Mubarak, H. Farasa: A fast and furious segmenter for arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, San Diego, CA, USA, 12–17 June 2016; pp. 11–16.
26. Ayeni, A. Empirics of Standard Deviation. 2014. Available online: https://www.researchgate.net/publication/264276808_Empirics_of_Standard_Deviation?channel=doi&linkId=53d74d290cf228d363eae74b&showFulltext=true (accessed on 1 October 2024).
27. Goldberg, Y. A Primer on Neural Network Models for Natural Language Processing. arXiv 2015, arXiv:1510.00726.
28. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019.
29. Safaya, A.; Abdullatif, M.; Yuret, D. KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, Barcelona (Online), Spain, 12–13 December 2020; pp. 2054–2059.
30. Husain, F.; Uzuner, O. Transfer Learning Approach for Arabic Offensive Language Detection System—BERT-Based Model. arXiv 2021, arXiv:2102.05708.
31. Argamon, S.; Koppel, M.; Pennebaker, J.W.; Schler, J. Automatically profiling the author of an anonymous text. Commun. ACM 2009, 52, 119–123.
32. Subrayan, A.M.; Chone, L.S.; Muthusamy, C.; Veeravagu, J. Gendered-Linked Differences in Speech Styles: Analysing Linguistic and Gender in the Malaysian Context/DIFFÉRENCES DE SEXE DANS LE STYLE DE DISCOURS: ANALYSES LINGUISTIQUES ET ANALYSES SUR LE SEXE DANS LE CAS DE MALAISIE. Cross-Cult. Commun. 2010, 6, 18.
33. Subon, F. Gender differences in the use of linguistic forms in the speech of men and women in the Malaysian context. J. Humanit. Soc. Sci. 2013, 13, 67–79.
34. Alsmearat, K.; Al-Ayyoub, M.; Al-Shalabi, R.; Kanaan, G. Author gender identification from Arabic text. J. Inf. Secur. Appl. 2017, 35, 85–95.
35. Ouni, S.; Fkih, F.; Omri, M.N. Bots and Gender Detection on Twitter Using Stylistic Features. In Proceedings of the Advances in Computational Collective Intelligence, Hammamet, Tunisia, 28–30 September 2022; Bădică, C., Treur, J., Benslimane, D., Hnatkowska, B., Krótkiewicz, M., Eds.; Springer: Cham, Switzerland, 2022; pp. 650–660.
36. Balamurugan, K. Introduction to Psycholinguistics—A Review. Stud. Linguist. Lit. 2018, 2, 110.
37. Hymes, D. The Scope of Sociolinguistics. Int. J. Sociol. Lang. 2020, 2020, 67–76.
38. Safara, F.; Mohammed, A.S.; Yousif Potrus, M.; Ali, S.; Tho, Q.T.; Souri, A.; Janenia, F.; Hosseinzadeh, M. An Author Gender Detection Method Using Whale Optimization Algorithm and Artificial Neural Network. IEEE Access 2020, 8, 48428–48437.
39. Morales Sánchez, D.; Moreno, A.; Jiménez López, M.D. A White-Box Sociolinguistic Model for Gender Detection. Appl. Sci. 2022, 12, 2676.
40. Freihat, A.; Bella, G.; Mubarak, H.; Giunchiglia, F. A Single-Model Approach for Arabic Segmentation, POS-Tagging and Named Entity Recognition. In Proceedings of the 2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP), Algiers, Algeria, 25–26 April 2018.
41. Alluhaibi, R.; Alfraidi, T.; Abdeen, M.A.; Yatimi, A. A Comparative Study of Arabic Part of Speech Taggers Using Literary Text Samples from Saudi Novels. Information 2021, 12, 523.
42. Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep Learning–based Text Classification: A Comprehensive Review. ACM Comput. Surv. 2021, 54, 1–40.
43. Bamman, D.; Eisenstein, J.; Schnoebelen, T. Gender identity and lexical variation in social media. J. Socioling. 2014, 18, 135–160.
44. Rao, D.; Yarowsky, D.; Shreevats, A.; Gupta, M. Classifying latent user attributes in twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, Toronto, ON, Canada, 26–30 October 2010; pp. 37–44.
45. Tangirala, S. Evaluating the impact of GINI index and information gain on classification using decision tree classifier algorithm. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 612–619.

Figure 1. Proposed framework.

Figure 2. Female author word cloud.

Figure 3. Male author word cloud.

Figure 4. Word count distribution in columns across genders.

Figure 5. Character count distribution in columns across genders.

Figure 6. POS-tagged article using the Arabic Linguistic Pipeline (ALP).

Figure 7. Model architecture for the enhanced BERT.

Figure 8. Model performance comparison.

Figure 9. Top 10 important features.

Table 1. Summary of related works.

Study	Dataset	Summary of Methods	Language
[18]	Al-Maktaba Al-Shamela corpus (an online library for Arabic and Islamic collections of books)	Pretrained transformer-based models for classification of Arabic text datasets	Arabic
[17]	A corpus of Egyptian dialect obtained from Twitter—EDGAD and PAN (part of CLEF)—PAN AP’17	Neural network model (NN) tuned for gender identification (GI) in Egyptian dialect/NN architectures tuned for GI, such as ANN, CNN, LSTM, C-Bi-LSTM, and C-Bi-GRU	Arabic
[19]	EDGAD dataset obtained from Twitter	AGI model + an engineered feature vector that integrates different EAD language-related features with N-gram	Arabic
[10]	Wikidata corpus	BERT—a pretrained transformer network that relies on deep learning (DL) models by incorporating word embeddings + sentence embeddings	English
[9]	Twitter User Gender Classification dataset obtained from Kaggle	TF-IDF model to enrich existing dataset on natural language processing, such as bag-of-words, and pretrained word embeddings (GLOVE, BERT, GPT2, Word2Vec) + machine learning (ML) (Naïve Bayes [NB], Support Vector Machine [SVM], Random Forest [RF])	English
[15]	Yahoo Crowd Sourcing corpus of user and image annotated datasets	A combined method of text + image processing	Japanese
[12]	A corpus of Punjabi dataset containing news articles (texts) extracted from a newspaper	LightGBM for data training/feature extraction + machine learning	Punjabi
[11]	Twitter corpus containing 3368 users in the training set & 1924 users in the test set—each user has 100 tweets	Bag-of-words model + SVM with linear kernel as a classifier	Turkish
[13]	VoxCeleb (version 1 & 2) datasets containing audio-visual collections + RAVDESS dataset	Diarization algorithms, such as CNNs, to extract audio data parameters for speaker diarization + gender speaker and emotional speaker	English
[16]	Yahoo Crowd Sourcing	Hybrid approach + logistic regression	English
[14]	IAG-TNKU datasets of a Turkish newspaper comprising 43,292 news articles	ML models (NB + RF) & DL models (CNN + LSTM)	Turkish

Table 2. Samples of columns written by both genders.

Female	Male
من مدينة سان بطرسبورغ تم إطلاق فيلم أفريقي جديد، بهدف توحيد الجهود وسبل التعاون تحت عنوان "روسيا وأفريقيا: نريد السلام والتقدم والمستقبل". في مدينة "ليبيا" الروسية. كما شهدت سان بطرسبورغ - العاصمة الروسية الثانية- انعقاد أكبر حدث أساسي في العلاقات بين موسكو ودول القارة السمراء. ملتقى سياسي وعسكري واقتصادي مهم، نتج عن طاولة النقاش فيه العديد من الصفقات وخطط التعاون في أمور الطاقة والزراعة. فكرة عدم التدخل في الشأن الداخلي وتوطيد الجهود الروسية للدول الإفريقية من جهة الاستثمار والدخول في استثمارات روسية طويلة الأجل. في قمة التعاون بين روسيا ودول القارة الإفريقية برزت عدة أمور أهمها دعم العلوم والتكنولوجيا واهتمام روسيا بإيجاد حلول لمشكلات الفقر وزيادة استخدام الطاقة المتجددة في أعمال التنمية. جاءت مشاركة بوتين خلال قمة روسيا - أفريقيا 2023 مقتصرة ككتابة خطاب مسجل على أهمية إبقاء هدوء واستقرار القارة وحملت على عاتقها ملفات صعبة تحتم لها عن حلول	تحت عنوان المهرجان الذي تنظمه هيئة الأدب والنشر والترجمة في وزارة الثقافة، انطلقت فعالية تجمع بين أشهر الكتاب والروائيين، ومن أبرز الأسماء كانت دون حضور أي كتاب جديد ولا قراءة بدون كتب. تم افتتاح البرنامج على أنشطة أدبية وثقافية في مجالات كالقصص الشعرية والقصائد التي تبرز ثقافة وإرث الدولة العريق، مما يبعث برسالة إلى الجمهور والاستماع بفعالية للنقاشات الثقافية. تضمنت الفعالية مجموعة من الأقسام الخاصة بأنواع الأدب، مثل: أنواع الأدب، السينما، الفنون المسرحية، وأقسام تبرز أنواع الأزياء والموضة، بالإضافة إلى مساحة مخصصة لمتابعة صناع الأفلام وصناع المسرح، وبناء علاقة مع المنتجين والمخرجين لتبادل أفكار إنتاج الأعمال الأدبية. مثل هذه المهرجان لا يبرز علاقة الجهات المعنية بالثقافة والأدب وفهمها وحسنها في تمييز ثقافة المحاكاة على المساحة الدولية من خلال المشاركة في تفعيلها وتعزيز نجاحها. الفعاليات تتضمن محاضرة وأحاديث تعزز بما فيها التفاهم على ضرورة المشاركة والنقاش، وتواجد أبرز الكتاب والمفكرين ليجيبون على استفسارات خاصة مباشرة من الجمهور، مما يبرز مدى الاهتمام بالفعاليات والروابط المحلية والعالمية في الجانب الاجتماعي والاقتصادي
English Translation
From the city of Saint Petersburg, a new Russian-African film was launched, aiming to unify efforts and means of cooperation under the title “Russia and Africa: We Want Peace, Progress, and the Future”. The event took place in the Russian city of “Libya”. Saint Petersburg, Russia’s second capital, also witnessed the holding of the largest event focused on relations between Moscow and African nations. It was a significant political, military, and economic forum, resulting in several agreements and cooperation plans in areas such as energy and agriculture. The idea emphasized non-interference in domestic affairs and bolstered Russian efforts to invest in African countries through long-term Russian investments. In the Russia-Africa Cooperation Summit, several key topics emerged, most notably support for science and technology, and Russia’s focus on finding solutions to issues of poverty and increasing the use of renewable energy in development projects. Putin’s participation in the Russia-Africa Summit 2023 was limited to a recorded speech, emphasizing the importance of maintaining peace and stability on the continent. The summit tackled challenging issues that demanded practical solutions.	Under the title of the festival organized by the Literature, Publishing, and Translation Authority of the Ministry of Culture, an event was launched that brought together famous writers and novelists. Among the prominent names, there was no presence of any new books nor any reading without books. The program opened with literary and cultural activities in fields such as poetic stories and poems that highlight the state’s rich culture and heritage, sending a message to the audience and actively listening to cultural discussions. The event included several sections dedicated to different types of literature, such as literature genres, cinema, theatrical arts, and sections showcasing fashion and styles. Additionally, there was a dedicated space for filmmakers and theater creators, building connections with producers and directors to exchange ideas on producing literary works. Such a festival highlights the importance of the entities involved in culture and literature, their understanding, and their distinction in simulating culture on an international scale by promoting and enhancing its success through active participation. The activities include lectures and discussions that strengthen mutual understanding, emphasizing the necessity of participation and discussion. Prominent writers and intellectuals are present to respond to audience inquiries directly, reflecting the extent of interest in the events and the local and global connections in the social and economic realms.

Table 3. Dataset annotation results.

Class	Female	Male
Matched in Database	54	62
Matched with Picture	12	6
Voted	4	2

Table 4. Dataset statistics per 2000 columns for each class.

Category	Metric	Word Count	Character Count
Male	Mean	400.78	2544.97
	Standard Deviation	259.92	1664.82
	Minimum	82	488
	Median	327	2064.5
	Maximum	2185	14,013
Female	Mean	425.65	2701.16
	Standard Deviation	215.20	1376.76
	Minimum	84	539
	Median	396	2510.5
	Maximum	1795	10,967
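
The summary statistics in Table 4 can be reproduced with a few lines of Python. The sketch below is illustrative only: the function name `length_stats`, whitespace tokenization, counting characters with spaces included, and the use of the sample standard deviation are all assumptions, since the table itself does not state how words and characters were counted.

```python
import statistics


def length_stats(texts):
    """Summarize word and character counts for a list of texts.

    Assumptions (not stated in the article): words are whitespace-separated
    tokens, characters include spaces, and the sample standard deviation is used.
    """
    word_counts = [len(text.split()) for text in texts]
    char_counts = [len(text) for text in texts]

    def summarize(values):
        return {
            "mean": statistics.mean(values),
            # stdev needs at least two values; fall back to 0.0 otherwise.
            "std": statistics.stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "median": statistics.median(values),
            "max": max(values),
        }

    return {"words": summarize(word_counts), "chars": summarize(char_counts)}


# Toy input; in practice this would be the list of columns for one class.
toy_columns = ["a bb", "ccc dd e"]
print(length_stats(toy_columns))
```

Calling the function once per class (male and female columns) would yield the two blocks of rows that the table reports.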

Table 5. Textual features.

Syntactic
Noun	قطة (cat), رجل (man)	Conjunctions	و (and), ثم (then)
Verb	اكل (eat), لعب (play)	Prepositions	من (from), على (on)
Adjectives	كبير (big), أعلى (higher)	Proper Nouns	أحمد (Ahmad), العراق (Iraq)
Adverbs	دائما (always)	Pronouns	أنا (I), هو (he)
Semantic
Negators	لا (no), لم (did not)	Exception	إلا (except)
Time	الساعة (hour)	Religious	الله (Allah), مسجد (mosque)
Quantifiers	كثير (many), قليل (a few)	Location	مصر (Egypt), شارع الأول (First street)
Month	محرم (Muharram)	Title	الرئيس (the president)
Nationality	سعودي (Saudi)	Day	السبت (Saturday)
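
As a rough illustration of how semantic features such as those in Table 5 could be turned into numeric inputs, the sketch below counts lexicon hits per category. It is a minimal example under stated assumptions: the dictionary `SEMANTIC_LEXICONS` is hypothetical and seeded only with the example words from the table, matching is done on whitespace tokens, and this is not claimed to be the extraction procedure used in the study.

```python
# Hypothetical mini-lexicons, seeded only with the example words from Table 5.
SEMANTIC_LEXICONS = {
    "negators": {"لا", "لم"},
    "exception": {"إلا"},
    "quantifiers": {"كثير", "قليل"},
    "day": {"السبت"},
    "nationality": {"سعودي"},
}


def count_semantic_features(text):
    """Count how many whitespace-separated tokens fall in each lexicon."""
    tokens = text.split()
    return {
        name: sum(1 for token in tokens if token in words)
        for name, words in SEMANTIC_LEXICONS.items()
    }


sample = "لم أذهب يوم السبت إلا قليل"
print(count_semantic_features(sample))
```

Exact token matching misses words carrying attached clitics (for example, a prefixed و), so a realistic pipeline would likely need segmentation or lemmatization before lookup. The syntactic features in the table would come from a POS tagger rather than from a word list.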

Table 6. Model performance results.

Deep Learning Models with BERT Word Embedding
Model	Class	Precision	Recall	F1-Score	Accuracy
LSTM	Female	0.79	0.51	0.62	0.69
	Male	0.64	0.87	0.74
BiLSTM	Female	0.76	0.56	0.65	0.69
	Male	0.65	0.82	0.73
CNN	Female	0.72	0.57	0.64	0.68
	Male	0.65	0.78	0.71
Deep Learning Models with Textual Features
Model	Class	Precision	Recall	F1-Score	Accuracy
LSTM	Female	0.76	0.95	0.85	0.83
	Male	0.94	0.71	0.81
BiLSTM	Female	0.82	0.98	0.89	0.88
	Male	0.97	0.78	0.87
CNN	Female	0.91	0.91	0.94	0.90
	Male	0.93	0.91	0.93
BERT Models
Model	Class	Precision	Recall	F1-Score	Accuracy
BERT	Female	0.83	0.51	0.64	0.72
	Male	0.65	0.90	0.75
Enhanced BERT	Female	0.92	0.94	0.93	0.91
	Male	0.93	0.91	0.93
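
For readers who want to relate the columns of Table 6 to one another, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it for a few rows; the helper name `f1_score` is our own, and because the table reports values rounded to two decimals, a recomputed score may differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from Table 6.
rows = {
    ("LSTM with BERT embedding", "Female"): (0.79, 0.51),
    ("LSTM with BERT embedding", "Male"): (0.64, 0.87),
    ("Enhanced BERT", "Female"): (0.92, 0.94),
}

for (model, label), (precision, recall) in rows.items():
    print(f"{model} / {label}: F1 = {f1_score(precision, recall):.2f}")
```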

Table 7. Misclassified column samples.

Prediction	Arabic	Translated (English)
Female Misclassified as Male	حرارة غير مسبوقة على كوكب الأرض، وتشهد مناطق بالعالم ارتفاعا غير مسبوق فى درجات الحرارة، ويجد متابعو الطقس أنفسهم مع موجات لا تنتهى من الحر، وموجة تسلم أخرى بلا توقف، لدرجة أن تقرير الأرصاد يتوقع موجة حارة جديدة، والموجة السابقة مستمرة، والأمر أصبح أكثر جدية من خلاف سنوى بين عشاق الصيف، ومحبى الشتاء، فالفصول تبدلت، وتداخلت بسبب تغير مناخى وتحذيرات من صيف وشتاء وجفاف وفيضانات وحرائق	Unprecedented heat is gripping the planet, and regions of the world are witnessing an unprecedented rise in temperatures. Weather followers find themselves facing endless heat waves, one wave handing over to the next without pause, to the point that the meteorological report forecasts a new heat wave while the previous one is still ongoing. The matter has become more serious than an annual dispute between summer lovers and winter lovers, as the seasons have changed and overlapped due to climate change, amid warnings of summer and winter, drought, floods, and fires.
Male Misclassified as Female	أصبحت وسائل التواصل الاجتماعي مرتعًا لكل مَن هبّ ودبّ، وأصبح ينشر فيها الغث والسمين، والخبر الصادق والكاذب، دون أن تكون هناك مساءلة لمن يخرج عن الحقيقة.. فهناك من ينشر أخبارًا غير صحيحة، أو يهول من تلك الأمور؛ فيعطيها أكبر مما تستحق؛ وذلك لكسب متابعين أكثر، لكنه لا يعلم أن ذلك الفعل قد يؤثر على سمعة وطن بأكمله.. ففي الساعات الماضية تناقل المغردون خبر ما حدث في دار التربية الاجتماعية في محافظة خميس مشيط على أنه حدث في رعاية اليتيمات؛ وهو ما أحدث لغطًا كبيرًا، دون معرفة الحقيقة التي ظهرت فيما بعد	Social media has become a playground for all and sundry, where the worthless and the valuable, the true and the false, are published alike, without any accountability for those who stray from the truth. Some publish incorrect news or exaggerate matters, giving them more weight than they deserve in order to gain more followers, unaware that such behavior may affect the reputation of an entire country. In the past hours, Twitter users circulated the news of what happened at the Social Education House in Khamis Mushait Governorate as though it had occurred at the orphan girls' care home, which caused a great uproar before the truth emerged later.

Table 8. Results of enhanced BERT model with tweets.

Model	Accuracy	F1-Score	Recall	Precision
Enhanced BERT	68.7	67.8	68.3	65.7

Table 9. Test unseen dataset statistics per 300 tweets for each class.

Category	Metric	Word Count	Character Count
Male	Mean	16.71	99.16
	Standard Deviation	19.81	121.08
	Minimum	2.0	4.0
	Median	9.0	54.0
	Maximum	138.0	280.0
Female	Mean	10.74	61.79
	Standard Deviation	12.61	75.28
	Minimum	2.0	3.0
	Median	6.0	32.0
	Maximum	86.0	213.0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

