Article

Detecting Hateful and Offensive Speech in Arabic Social Media Using Transfer Learning

Zakaria Boulouard, Mariya Ouaissa, Mariyam Ouaissa, Moez Krichen, Mutiq Almutiq and Karim Gasmi

1 LIM, Hassan II University of Casablanca, Casablanca 20000, Morocco
2 Department of Computer Science, Moulay Ismail University, Meknes 50050, Morocco
3 FCSIT, Al-Baha University, Al-Baha 65528, Saudi Arabia
4 ReDCAD Laboratory, University of Sfax, Sfax 3038, Tunisia
5 Department of Management Information Systems and Production Management, College of Business and Economics, Qassim University, P.O. Box 6640, Buraidah 51452, Saudi Arabia
6 Department of Computer Science, College of Arts and Sciences at Tabarjal, Jouf University, Sakaka 72388, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(24), 12823; https://doi.org/10.3390/app122412823
Submission received: 11 November 2022 / Revised: 3 December 2022 / Accepted: 12 December 2022 / Published: 14 December 2022
(This article belongs to the Special Issue Recent Trends in Natural Language Processing and Its Applications)

Abstract

The democratization of access to the internet and social media has given every individual an opportunity to openly express his or her ideas and feelings. Unfortunately, it has also created room for extremist, racist, misogynist, and offensive opinions expressed as articles, posts, or comments. While controlling offensive speech in English-, Spanish-, and French-speaking social media communities and websites has reached a mature level, this is much less the case for their counterparts in Arabic-speaking countries. This paper presents a transfer learning solution to detect hateful and offensive speech on Arabic websites and social media platforms. It compares the performance of different BERT-based models trained to classify comments as either abusive or neutral. The training dataset contains comments in standard Arabic as well as three dialects; we also use their English translations for comparative purposes. The models were evaluated using five metrics: Accuracy, Precision, Recall, F1-Score, and the Confusion Matrix.

1. Introduction

Access to the internet and social media has become considerably more democratic over the past few years. According to Kemp [1], the total number of active internet users reached 4.95 billion people worldwide as of January 2022. Among them, 4.6 billion are active on social media (a 10% increase compared to 2021). In the Middle East and North Africa (MENA) region, countries such as Egypt, Saudi Arabia, Algeria, and Morocco ranked among the top 30 countries with the largest numbers of Facebook users in 2022.
Social media platforms nowadays represent the essence of free speech, since they offer their users a space where they can express their opinions without control or censorship. This is usually a healthy situation, but not when some individuals use expressions, statements, and allegations that may offend others with different beliefs, backgrounds, genders, or races.
Even though they are protected by the latest version of CDA 230 [2], a great number of social media platforms are making an effort to provide the best possible protection for their users against hateful and offensive content. However, the main limitation of the approaches they offer is that they rely heavily on human monitoring and user reports.
The last couple of years have seen a serious increase in hate speech on social media platforms worldwide, especially during the COVID-19 lockdown, which was behind a worrying 20% rise in online hate speech in the United Kingdom alone [3].
This problem has caught the attention of several researchers worldwide and has urged them to seek solutions to detect and curb such behavior. The recent literature provides different strategies for automatic hate speech detection. However, most of these works were dedicated to the most spoken languages worldwide, such as English and Spanish.
Surprisingly, although Arabic is among these languages, related works, as will be covered in Section 2, have been rare and mostly dialect-focused. This comes from the fact that social media users in Arabic-speaking countries tend to use their own dialects instead of standard Arabic. Another reason is the choice of script: in Middle Eastern countries, social media users prefer Arabic letters, while in North African countries, Romanized Arabic, also known as “Arabizi”, is preferred.
In this work, as will be described in Section 3 and Section 4, we focus on the Middle Eastern region, as it has a wider variety of dialects and because of its use of Arabic letters on social media platforms. We propose a selection of transfer learning models based on BERT [4] for hateful and offensive speech detection that can cover statements (social media comments) written in standard Arabic, as well as in three of the most spoken Arabic dialects in the region (Egyptian, Iraqi, and Gulf). Section 5 concludes this paper and suggests possible future directions.

2. Related Works

Properly defining hateful and offensive speech has always caused controversy, since these definitions tend to have religious, cultural, or ethnic backgrounds. The United Nations [5] defined hate speech as any expression that attacks a person or a group based on any of their identity factors, be they ethnic, religious, racial, etc. As stated earlier, hateful and offensive speech has grown significantly on social media in the last couple of years, especially in the aftermath of the COVID-19 lockdown. With the rise of Artificial Intelligence (AI) methods and their strong performance in many fields [6,7,8,9,10], researchers from around the globe have presented several AI-based solutions to automate the detection of hateful and offensive speech in online communities.
Whereas English is backed by most of the mature solutions, other widely spoken languages such as French and Spanish have also seen interesting approaches. Vanetik and Mimoun [11] built and annotated a dataset of 2856 French tweets, where 927 were labeled as “racist” and the rest as “not racist”. They applied TF-IDF, N-grams, and BERT sentence embeddings for text representation, then compared different binary classification models, such as Random Forest (RF), Logistic Regression (LR), and Extreme Gradient Boosting (XGBoost). The best performance on that dataset was provided by LR backed by BERT word embeddings, with 79% accuracy. Following similar steps, Arcila-Calderón et al. [12] built a dataset of 10,855 Spanish tweets, where 2773 were labeled as “hateful” and the rest as “non-hateful”. They used this dataset to train machine learning and deep learning models to detect hateful online statements motivated by gender and sexual orientation. They used Bag of Words (BoW) as a text representation for the machine learning algorithms, and word embeddings for the RNN deep learning model. The latter combination provided the best performance with 84% accuracy, exceeding by far the second best, BoW with LR (76%). Another attempt to detect hate speech in Spanish social media was presented by Plaza-del-Arco et al. [13], who selected tweets from two different datasets. From the first dataset, they selected 6000 tweets and labeled 1567 of them as “hateful” and the remaining 4433 as “not hateful”. From the second dataset, they collected 6600 tweets and labeled 2739 of them as “hateful” and the remaining 3861 as “not hateful”. When comparing the performances of different pre-trained models, BETO, a monolingual transformer model focused on the Spanish language, outperformed BERT and XLM.
The literature also presents tentative hate speech detection models in other languages such as Urdu [14], Turkish [15], or Chinese [16].
Another interesting direction several researchers have suggested is multilingual hate speech detection. Chiril et al. [17] combined two datasets. The first one, in English, contains tweets with hateful content against women and immigrants, whereas the second one, in both English and French, contains sexist tweets. The combined dataset contains a total of 16,156 tweets, of which 6171 were labeled as “hate” and the remaining 9985 as “non-hate”. They used FastText [18] and GloVe [19] for multilingual embeddings backing a Bidirectional Long Short-Term Memory (BiLSTM) algorithm. The best performance was provided by the FastText-BiLSTM combination with a 79% accuracy. Corazza et al. [20] combined three datasets of tweets. The first one, in English, contains 16,000 tweets, of which 5006 were labeled as “positive” for racism and sexism, and the remaining 10,884 as “negative”. The second dataset, in Italian, contains 4000 tweets, of which 1296 were labeled as “hateful” and the remaining 2704 as “not hateful”. The third dataset, in German, contains 5009 tweets, of which 1688 were labeled as “offensive” and the remaining 3321 as “other”. With these numbers, the authors made sure to keep a 1 to 3 ratio between hateful and non-hateful tweets in their datasets. The best performances were provided by a FastText-backed Long Short-Term Memory (LSTM) algorithm with an F1-Score of 82% for English, word embeddings of Italian tweets along with LSTM for Italian with an 80% F1-Score, and FastText embeddings with a Gated Recurrent Unit (GRU) for German with a 75.8% F1-Score. Ranasinghe and Zampieri [21] trained and compared the performance of mBERT [4] and XLM-R [22] models on a multilingual dataset containing tweets and Facebook comments in Bengali (4000 Facebook comments), Hindi (8000 tweets), and Spanish (6600 tweets), which were labeled as “hateful” or “non-hateful”. The overall best performance on this dataset was provided by XLM-R, with F1-Scores of 84% for Bengali, 85% for Hindi, and 75% for Spanish.
Compared to other languages, works dedicated to hateful and offensive language detection in Arabic are scarcer. To the best of our knowledge, most of the earliest works started in 2017, when Abozinadah and Jones [23] suggested a statistical learning approach backing a Support Vector Machine (SVM) algorithm to classify tweets in Arabic as either “abusive” or not. Mubarak et al. [24] are also considered pioneers of Arabic obscene language detection. They set up a list of common obscene words in Arabic social media and, based on it, extracted and annotated 32,000 tweets containing these words. Among these tweets, 79% were considered “offensive”, 2% “obscene”, and the remaining 19% “normal”.
Different approaches have been proposed since then. Albadi et al. [25] presented an approach dedicated to religious hate speech detection. They constructed a dataset of 6000 tweets, where each 1000 is specific to one of the six most common religions or sects in the Middle East. They extracted the features from the tweets using AraVec [26] and then proceeded to a binary classification using LR, SVM, and GRU algorithms. The latter provided the best performance with a 79% accuracy. Anezi [27] collected a dataset of 4203 comments from different social media platforms. Each of these comments was manually annotated by a group of native Arabic speakers into one of seven classes (“Against Religion”, “Racist”, “Against Gender Equality”, “Insulting or Bullying”, “Violent or Offensive”, “Normal Positive”, and “Normal Negative”). For the classification, they opted for Recurrent Neural Networks (RNN), achieving an accuracy of 84.14%. Shannaq et al. [28] built a dataset of 4505 tweets from four different domains with a high likelihood of offensive speech (“Celebrities”, “Gaming”, “News”, and “Sports”). They fine-tuned AraVec and GloVe models and fed the word embeddings to two classifiers (SVM and XGBoost) whose hyperparameters were optimized by a Genetic Algorithm (GA). The best performance was provided by the AraVec-backed GA-SVM combination with an 88.2% accuracy. Alsafari et al. [29] built a dataset of 5631 tweets annotated by three native Gulf speakers (two female, one male). The dataset had six labels (“Clean”, “Offensive”, “Religious Hate”, “Gender Hate”, “Nationality Hate”, and “Ethnicity Hate”). They used different word embedding techniques, such as AraVec, FastText, and mBERT, and then fed the data to three deep learning algorithms: LSTM, GRU, and Convolutional Neural Networks (CNN). The mBERT-CNN combination provided the best results with a Macro-F1 of 75.51%, which they considered encouraging given the training limitations.

3. Materials and Methods

Most of the abovementioned methods were based either on standard Machine Learning/Deep Learning approaches or on Word2Vec/AraVec approaches. This observation was also confirmed by Anezi [27] in his literature review. The main issue with these methods is that, when representing a word, they tend to miss its context. For example, a word like “المغرب” will be represented in the same way every time, regardless of its context, even though, depending on that context, it can mean “Morocco”, “west”, or “sunset”.
As will be further developed in this section, we tried to build our hate speech detection model while taking context into consideration. This prompted us to explore the possibilities offered by BERT [4]-based models. These models can provide more insight into context by scouting both a word’s successors and its predecessors, rather than relying on just one of the two, as was the case with the previous methods.
Based on the usual classification performance indicators, i.e., Accuracy, Precision, Recall, F1-Score, and the Confusion Matrix, we will compare different BERT-based approaches to see which provides the best results. The approaches we compare are the following:
  • Translating the comments into English and using the classical BERT [4];
  • Using the Multilingual BERT (mBERT) [4] on the translated comments;
  • Using mBERT on the original comments in Arabic;
  • Using AraBERT [30], a BERT-based model trained on Arabic text, on the original comments in Arabic.

3.1. BERT-Based Models

3.1.1. BERT

BERT, or Bidirectional Encoder Representations from Transformers, is pre-trained to provide word representations based on context learnt from both directions. It is based on the Transformer architecture [31], from which it only takes the encoder side.
According to Alammar’s description [32], BERT is presented in two versions:
  • BERT-Base: with 12 encoder layers, feedforward layers of 768 hidden units, and 12 attention heads;
  • BERT-Large: with 24 encoder layers, feedforward layers of 1024 hidden units, and 16 attention heads.
Both of these versions are larger than the default configuration suggested in the original Transformer paper [31], which consists of 6 encoder layers, feedforward layers of 512 hidden units, and 8 attention heads.
Figure 1 displays the encoder stacks in both BERT-Base and BERT-Large. Each encoder can be broken down into two sublayers.
The first sublayer is the self-attention layer, whereas the second is the feedforward neural network layer. The self-attention layer allows the encoder to investigate the other words of a sentence while encoding a specific word. This is done in several steps.
The first step is to multiply the embedding x_i of the processed word by three weight matrices W^Q, W^K, and W^V, thus extracting three vectors: the query vector q_i, the key vector k_i, and the value vector v_i, respectively.
These vectors are used, in the second step, to calculate the attention scores of the processed word. These scores evaluate the focus that needs to be placed on the other words of the sentence while processing the current word. Each score is the dot product of the processed word’s query vector q_i and the key vector k_j of one of the other words of the sentence.
The third step is the division by the square root of the key vector’s dimension d_k, which keeps the gradients stable.
The scores are normalized in the fourth step using a softmax function.
The fifth step is to multiply each softmax score by the corresponding value vector v_j, so as to keep the needed words while drowning out the irrelevant ones.
These weighted value vectors are summed in the sixth step to produce the output of the self-attention layer for the processed word.
For better performance, the abovementioned steps are applied to matrices instead of individual vectors. The embedded sentence is collected in a matrix X, which is multiplied by the weight matrices W^Q, W^K, and W^V to produce the query matrix Q, the key matrix K, and the value matrix V.
The output Z of the self-attention layer is calculated according to the following equation:
Z = softmax((Q · K^T) / √d_k) · V
Having multiple attention heads (12 for BERT-Base and 16 for BERT-Large) allows the model to focus on different positions and have multiple representation subspaces with different Z matrices. As the feedforward layer expects only one Z matrix, the matrices from the different attention heads are concatenated into one, which is then multiplied by another weight matrix W^O.
Each encoder sublayer, as described in Figure 2, applies a Layer Normalization operation [33] to its output before feeding it to the next sublayer (or layer). This output can be expressed as:
Z′ = LayerNorm(X + Z)
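For illustration, the following minimal NumPy sketch of our own (not taken from the BERT codebase) reproduces the scaled dot-product self-attention and the residual LayerNorm described above for a toy sentence, with simplifying assumptions: a single attention head, random weights, and no learnable LayerNorm gain or bias.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Project the token embeddings into query, key, and value matrices
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    # Z = softmax((Q . K^T) / sqrt(d_k)) . V
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def layer_norm(x, eps=1e-6):
    # Normalize each token representation to zero mean and unit variance
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

# Toy example: a "sentence" of 4 tokens with embedding size 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))

Z = self_attention(X, W_Q, W_K, W_V)
out = layer_norm(X + Z)   # residual connection followed by LayerNorm
print(out.shape)          # (4, 8): one contextual vector per token
```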
BERT’s pre-training was based on two tasks running together. In the first task, “Next Sentence Prediction” (NSP), two sentences are fed to the model with an embedding “A” for the first sentence and “B” for the second. Sentence “B” is the real next sentence in half of the cases, whereas a random sentence is provided in the other half. In the second task, “Masked Language Model” (MLM), which is applied right after the tokenization process, 15% of the tokens are selected; of these, 80% are replaced with a [MASK] token, 10% are replaced with a random token, and the remaining 10% are left unchanged.
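As an illustration of the MLM corruption rule, here is a short Python sketch of our own (not the official BERT preprocessing code) applying the 15%/80%/10%/10% scheme to a list of tokens:

```python
import random

def mask_tokens(tokens, vocab, select_rate=0.15):
    """Select ~15% of the tokens; of those, replace 80% with [MASK],
    10% with a random vocabulary token, and leave 10% unchanged."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < select_rate:
            labels.append(tok)             # the model must recover this token
            r = random.random()
            if r < 0.8:
                corrupted.append("[MASK]")
            elif r < 0.9:
                corrupted.append(random.choice(vocab))
            else:
                corrupted.append(tok)      # kept as-is, but still predicted
        else:
            corrupted.append(tok)
            labels.append(None)            # ignored by the MLM loss
    return corrupted, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(tokens, vocab=tokens))
```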
For BERT’s pre-training, the authors used the configuration described in Table 1. The datasets used in the pre-training process are the “BooksCorpus” by Zhu et al. [35] with 800 million words, and the English Wikipedia with 2.5 billion words.

3.1.2. Multilingual BERT (mBERT)

Multilingual BERT (mBERT) is a version of BERT created by Devlin et al. [4] to narrow the language gap by training it on corpora of multiple languages. Latin-based languages, which have comparable structures and vocabularies, can benefit from common representations [22]. Other languages, such as Arabic, vary in syntactic and morphological structure and share very little with the multitude of Latin-based languages. Hence, mBERT still lags behind single-language models due to a lack of data representation and a restricted language-specific vocabulary.

3.1.3. AraBERT

Developed by Antoun et al. [30], AraBERT is a transfer learning model that was evaluated on three specific downstream tasks:
  • Sentiment Analysis (SA);
  • Named Entity Recognition (NER);
  • Question Answering (QA).
It is based on BERT [4] and is designed to tackle the issue of non-contextual word representations in Arabic, as produced by Word2Vec-based models such as AraVec [26].
The developers of AraBERT [30] followed the same NSP-MLM pre-training procedure as the one adopted by the original BERT team [4]. However, due to the small size of the Arabic Wikipedia dumps compared to their English counterparts, Antoun et al. [30] needed to scrape articles from different news websites and add the collected data to two publicly available datasets: the “Open Source International Arabic News Corpus” by Zeroual et al. [37] and the “1.5 billion words Arabic corpus” made available by El-Khair [38]. This brought AraBERT’s pre-training dataset to an approximate total size of 24 GB of text containing around 70 million non-duplicate sentences. In the later version (AraBERT v2), which we will use in our hate speech detection and classification, the size of the pre-training dataset rose to 77 GB.

3.2. Fine-Tuning BERT-Based Models for Text Classification

Sun et al. [39] investigated various BERT fine-tuning approaches for text classification and maintained that:
  • BERT’s top layer is helpful for text classification;
  • BERT can solve the catastrophic forgetting problem with an adequate layer-wise decreasing learning rate;
  • Further within-task and within-domain pre-training can considerably improve its performance;
  • A prior multi-task fine-tuning is similarly beneficial to single-task fine-tuning, although its advantage is less than that of additional pre-training;
  • BERT can provide good performance with small-sized datasets.
For fine-tuning BERT and mBERT, Devlin et al. [4] found that optimal hyperparameters are mostly task-specific, but that the ranges of values described in Table 2 usually provide satisfactory results regardless of the task. The same setup was also confirmed by Antoun et al. [30] for fine-tuning AraBERT.
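As a minimal sketch of how such a fine-tuning run can be set up with the HuggingFace transformers library (the checkpoint name, the dataset variables, and the use of Trainer are our assumptions, not the authors’ published script), one configuration from the Table 2 search space would look like this:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed checkpoint: the publicly released AraBERT v2 base model
model_name = "aubmindlab/bert-base-arabertv2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

args = TrainingArguments(
    output_dir="arabert-hate-speech",
    per_device_train_batch_size=32,   # Table 2: 16 or 32
    learning_rate=2e-5,               # Table 2: 5e-5, 3e-5, or 2e-5
    num_train_epochs=3,               # Table 2: 2, 3, or 4
)

# train_ds / test_ds: tokenized comment->label datasets, assumed prepared
# as described in Sections 3.3 and 3.4
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=test_ds)
trainer.train()
```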

3.3. Dataset

Unlike other languages, Arabic content on social media (tweets, posts, comments, etc.) is mostly written in local dialects instead of standard Arabic, which is the official language in all Arab countries. This comes from the fact that standard Arabic is used for formal, administrative, educational, and religious purposes, while Arabs tend to use their local dialects in their usual day-to-day activities, and thus on social media. For this reason, any efficient attempt to detect offensive and hateful speech in Arabic social media needs to be able to work with Arabic dialects. This urged many researchers, as we have seen in the previous section, to manually collect and annotate their own datasets, serving specific needs.
However, there are some publicly available datasets in the literature, though rare, that can provide good performance with proper training. Mulki et al. [40] created a dataset of 5846 tweets in Levantine dialects annotated as either “Hate”, “Abusive”, or “Normal”. Another interesting dataset was presented by Alakrot et al. [41], who collected 15,050 YouTube comments in different dialects, namely Gulf, Egyptian, and Iraqi. These comments were labeled as either “offensive” or “inoffensive”. To the best of our knowledge, this dataset may be among the largest and most diverse ones available so far in the literature, and as such, we will use it for our approach.
In their annotation process, Alakrot et al. used the help of three annotators: two were from Iraq and Egypt, two countries highly represented in the dataset, whereas the third was from Libya, a country almost absent from the dataset. For their final annotation, Alakrot et al. provided two datasets based on two different scenarios defining whether a comment is considered hateful. In the first scenario, the annotation is based on a unanimous vote, whereas in the second, it is based on a majority vote.
For training our BERT-based models, we chose the dataset of the second scenario, which has a better proportion between the two classes. Since BERT is trained on English text and the comments in our dataset are in Arabic, we used the Google Translate API to translate them into English.
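A possible implementation of this translation step, assuming the googletrans Python wrapper around the Google Translate API (the paper does not specify which client was used), is sketched below:

```python
from googletrans import Translator

translator = Translator()

def to_english(comment: str) -> str:
    # Translate an Arabic comment into English for the English-based models
    return translator.translate(comment, src="ar", dest="en").text

print(to_english("هذا مثال بسيط"))  # -> "This is a simple example"
```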

3.4. Preprocessing

The first step of preprocessing the dataset is to remove the missing values, as they represent a small minority compared to the useful data. This leaves a total of 11,268 YouTube comments, of which 4748 were considered hateful and labeled as “1”, against 6520 comments considered non-hateful and labeled as “0”. In terms of percentages, this amounts to approximately 42% hateful comments against 58% non-hateful.
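A pandas sketch of this cleaning step, assuming the dataset is loaded with hypothetical column names "comment" and "label" (1 = hateful, 0 = non-hateful) and a hypothetical file name:

```python
import pandas as pd

df = pd.read_csv("alakrot_dataset.csv")          # assumed file name
df = df.dropna(subset=["comment", "label"])      # 11,268 comments remain

print(df["label"].value_counts())                # 0: 6520, 1: 4748
print(df["label"].value_counts(normalize=True))  # ~58% vs. ~42%
```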
After a quick look at the data, as displayed in Figure 3, we realize that it still needs to be further processed to remove emojis, punctuation, stop words, as well as the extra letters used for emphasis. Some words also need to be stemmed and lemmatized to keep only their roots.
Preprocessing the Arabic comments was done using the Farasa library by Abdelali et al. [42], which covers most of the abovementioned issues. It also gives us the possibility to go from words such as “العاهرات” to segmented words such as “ال+عاهر+ات”, thus keeping the root “عاهر”, which will be used in the further steps.
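For example, with the farasapy Python wrapper around Farasa (package and class names are those of the public wrapper, assumed here since the paper does not detail its setup), the segmentation can be obtained as follows:

```python
from farasa.segmenter import FarasaSegmenter

segmenter = FarasaSegmenter()
# -> "ال+عاهر+ات", exposing the root "عاهر"
print(segmenter.segment("العاهرات"))
```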
After the segmentation comes the tokenization step. Here, every extracted word is converted into a token and assigned a token ID, which will be used in the training process. The tokens from each comment are stored in a sequence with a maximum length of 512. However, for further optimization, we observed that most of the comments contain fewer than 150 tokens, as displayed in Figure 4. For this reason, we adopted a maximum length of 150.
The next step is to shuffle the dataset before splitting it into training and testing subsets. The hyperparameters we used for fine-tuning AraBERT on this dataset are the ones presented in Table 2.
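A sketch of the shuffle-and-split and tokenization steps with scikit-learn and the HuggingFace tokenizer (the 80/20 ratio and the checkpoint name are our assumptions; the paper does not state the exact split proportion):

```python
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# comments / labels: the preprocessed texts and 0/1 labels from Section 3.4
train_texts, test_texts, train_labels, test_labels = train_test_split(
    comments, labels, test_size=0.2, shuffle=True,
    stratify=labels, random_state=42)   # preserves the ~58/42 class ratio

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")
train_enc = tokenizer(train_texts, max_length=150,   # cutoff from Figure 4
                      truncation=True, padding="max_length")
test_enc = tokenizer(test_texts, max_length=150,
                     truncation=True, padding="max_length")
```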

3.5. Candidate Algorithms

As explained at the beginning of this section, in this study we will explore the potential of the following four BERT-based approaches:
  • Translating the comments into English and using the classical BERT [4]. In the following, the approach will be referred to as “BERTEN”;
  • Using the Multilingual BERT (mBERT) [4] on the translated comments. In the following, the approach will be referred to as “mBERTEN”;
  • Using mBERT on the original comments in Arabic. In the following, the approach will be referred to as “mBERTAR”;
  • Using AraBERT [30], a BERT-based model trained on Arabic text, on the original comments in Arabic. In the following, the approach will be referred to as “AraBERT”.
In a previous study [43], we compared the performance of the following Shallow and Deep Learning algorithms:
  • Logistic Regression (LR);
  • Naïve Bayes (NB);
  • Random Forests (RF);
  • Support Vector Machines (SVM);
  • Long Short-Term Memory (LSTM).
In that study, we found that LSTM provided the best performance, so we will keep it as a baseline in the current study.
We will also include Linear Support Vector Classification (LinearSVC), based on the approach presented by Alakrot et al. [44], in which they demonstrated that LinearSVC can reach a 90% accuracy, higher than the 82% reached by our LSTM-based approach.
To measure the performance of the abovementioned approaches, we rely on the usual classification metrics, i.e., Accuracy, Precision, Recall, F1-Score, and the Confusion Matrix.
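These metrics can be computed directly with scikit-learn once a model’s predictions are available; a minimal sketch follows, where y_true and y_pred stand for the gold labels and the model outputs on the test subset:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
```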

3.6. Development Environment

The environment in which this study was conducted is described as follows:
  • Programming Language: Python 3;
  • IDE: Google Colab Pro;
  • RAM: 32 GB;
  • GPU: Nvidia Tesla P100;
  • CPU: Intel(R) Xeon(R) CPU @ 2.30 GHz.

4. Results and Discussion

Based on the fine-tuning recommendations described in Table 2, we trained our candidate models on our dataset. Table 3 provides an overview of their performance.
BERTEN outperformed all the other candidate models with an Accuracy, Precision, Recall, and F1-Score of 98%. AraBERT was a close runner-up with an Accuracy of 96%, a Precision of 95%, a Recall of 96%, and an F1-Score of 95%. mBERTAR achieved an Accuracy of 83%, a Precision of 84%, a Recall of 82%, and an F1-Score of 83%, while mBERTEN achieved an Accuracy of 81%, a Precision of 82%, a Recall of 80%, and an F1-Score of 81%.
The results show that the Precision and Recall of the BERT-based models are nearly identical. This means that these models are not as biased as the baseline models, such as LSTM [43] and LinearSVC [44], and perform equally well on both the positive and the negative comments.
mBERT is versatile and generic at the cost of performance. Moreover, mBERTAR performs better than mBERTEN.
AraBERT performs well in this case, as its source data is quite similar to the target data. Since BERT was trained on a relatively larger language corpus than AraBERT, the dominance of BERTEN confirms that the volume of the pre-training corpora plays a vital role in a model’s performance.
As BERTEN and AraBERT were the clear winners, with very close values in their confusion matrices, as depicted in Figure 5, we conducted a more granular analysis to understand what really made the difference between these two models.
A closer look at some of the comments that were wrongly predicted by AraBERT but correctly by BERTEN made us realize that these comments contained sarcasm. This assumption was verified after annotating all the comments of the dataset for the presence of sarcasm.
Again, the difference in the size of the pre-training datasets of BERT and AraBERT demonstrated its important role in performance, especially in detecting the eventual presence of sarcasm.
Taking a deeper look at why BERT-based models outperformed other language models such as LSTM [43] and LinearSVC [44]: in the past, conventional language models could only interpret text input sequentially, either from right to left or from left to right, but not simultaneously. BERT is unique in that it can read in both directions at once. This capacity, known as bidirectionality, was made possible by the invention of Transformers [31]. Using it, BERT is pre-trained on two distinct but related NLP tasks: Masked Language Modeling and Next Sentence Prediction. The goal of Masked Language Model (MLM) training is to conceal a word in a phrase and then have the model predict the hidden word based on its context. The goal of Next Sentence Prediction training is to have the model determine whether two provided sentences relate logically and sequentially or whether their relationship is just random [4]. This is the core reason why BERT-based models are very powerful compared to conventional unidirectional language models.
Another reason behind the power of BERT-based models compared to conventional models is, again, their huge pre-training corpora. This makes these models turnkey and ready to be used in almost every use case scenario (provided a prior fine-tuning), whereas conventional models need to be trained from scratch.

5. Conclusions

In this paper, we explored the possibilities offered by BERT-based models to detect hateful and offensive speech in social media comments written either in standard Arabic or in any of the three most spoken dialects in the Middle East (Gulf, Egyptian, and Iraqi). We trained different BERT-based models, namely multilingual BERT (mBERTAR) and AraBERT, on a dataset made available online by Alakrot et al. [41]. This dataset contains hateful YouTube comments written in the aforementioned dialects. We also trained BERT in its base (BERTEN) and multilingual (mBERTEN) forms on the English translations of the comments in Alakrot et al.’s [41] dataset.
BERTEN provided the best results with 98% accuracy, closely followed by AraBERT with 96% accuracy. BERTEN had the advantage of a larger pre-training corpus than AraBERT, allowing it to better detect subtleties such as sarcasm. A further optimization of AraBERT in this direction would be necessary. The multilingual BERT performed poorly on both the Arabic and English datasets because of its built-in versatility.
To further optimize the performance of BERT-based models in detecting hateful and offensive speech, and for them to provide efficient results across the entirety of Arabic social media, we plan on training them on more datasets in Levantine and North African dialects. The latter will be the main issue to tackle, since in North Africa, social media users tend to use both Arabic letters and “Arabizi” (Arabic in Romanized letters) in their comments.

Author Contributions

Methodology, Z.B. and M.O. (Mariyam Ouaissa); Software, Z.B. and M.O. (Mariya Ouaissa); Formal analysis, Z.B., M.O. (Mariya Ouaissa) and K.G.; Data curation, M.O. (Mariyam Ouaissa), M.K. and M.A.; Writing—original draft, Z.B., M.O. (Mariya Ouaissa), M.O. (Mariyam Ouaissa), M.K., M.A. and K.G.; Project administration, Z.B., M.O. (Mariya Ouaissa), M.O. (Mariyam Ouaissa), M.K. and M.A.; Funding acquisition, M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no additional funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Acknowledgments

The researchers would like to thank the Deanship of Scientific Research, Qassim University, for funding the publication of this project.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kemp, S. Digital 2022: Global Overview Report. Available online: https://bit.ly/KEMP-2022 (accessed on 9 August 2022).
  2. Communication Decency Act 230 CDA 230. Available online: https://bit.ly/CDA-230 (accessed on 9 September 2022).
  3. Baggs, M. Online Hate Speech Rose 20% During Pandemic: “We’ve Normalised it”—BBC News. Available online: https://bbc.in/3Qb7lKV (accessed on 9 August 2022).
  4. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  5. United Nations. United Nations Strategy and Plan of Action on Hate Speech. Available online: https://bit.ly/UN-Hate (accessed on 9 August 2022).
  6. Qaisar, S.M.; Mihoub, A.; Krichen, M.; Nisar, H. Multirate Processing with Selective Subbands and Machine Learning for Efficient Arrhythmia Classification. Sensors 2021, 21, 1511. [Google Scholar] [CrossRef] [PubMed]
  7. Mihoub, A. A Deep Learning-Based Framework for Human Activity Recognition in Smart Homes. Mob. Inf. Syst. 2021, 2021, 6961343. [Google Scholar] [CrossRef]
  8. Zidi, S.; Mihoub, A.; Mian Qaisar, S.; Krichen, M.; Abu Al-Haija, Q. Theft detection dataset for benchmarking and machine learning based classification in a smart grid environment. J. King Saud Univ.—Comput. Inf. Sci. 2022, in press. [Google Scholar] [CrossRef]
  9. Mihoub, A.; Snoun, H.; Krichen, M.; Salah, R.B.H.; Kahia, M. Predicting COVID-19 Spread Level using Socio- Economic Indicators and Machine Learning Techniques. In Proceedings of the 2020 First International Conference of Smart Systems and Emerging Technologies (SMARTTECH), Riyadh, Saudi Arabia, 3–5 November 2020; pp. 128–133. [Google Scholar]
  10. Mihoub, A.; Fredj, O.B.; Cheikhrouhou, O.; Derhab, A.; Krichen, M. Denial of service attack detection and mitigation for internet of things using looking-back-enabled machine learning techniques. Comput. Electr. Eng. 2022, 98, 107716. [Google Scholar] [CrossRef]
  11. Vanetik, N.; Mimoun, E. Detection of Racist Language in French Tweets. Information 2022, 13, 318. [Google Scholar] [CrossRef]
  12. Arcila-Calderón, C.; Amores, J.J.; Sánchez-Holgado, P.; Blanco-Herrero, D. Using Shallow and Deep Learning to Automatically Detect Hate Motivated by Gender and Sexual Orientation on Twitter in Spanish. Multimodal Technol. Interact. 2021, 5, 63. [Google Scholar] [CrossRef]
  13. Plaza-del-Arco, F.M.; Molina-González, M.D.; Ureña-López, L.A.; Martín-Valdivia, M.T. Comparing pre-trained language models for Spanish hate speech detection. Expert Syst. Appl. 2021, 166, 114120. [Google Scholar] [CrossRef]
  14. Ali, R.; Farooq, U.; Arshad, U.; Shahzad, W.; Beg, M.O. Hate speech detection on Twitter using transfer learning. Comput. Speech Lang. 2022, 74, 101365. [Google Scholar] [CrossRef]
  15. Mayda, I.; Demir, Y.E.; Dalyan, T.; Diri, B. Hate Speech Dataset from Turkish Tweets. In Proceedings of the 2021 Innovations in Intelligent Systems and Applications Conference (ASYU), Elazig, Turkey, 6–8 October 2021; pp. 1–6. [Google Scholar]
  16. Jiang, A.; Yang, X.; Liu, Y.; Zubiaga, A. SWSR: A Chinese dataset and lexicon for online sexism detection. Online Soc. Netw. Media 2022, 27, 100182. [Google Scholar] [CrossRef]
  17. Chiril, P.; Benamara Zitoune, F.; Moriceau, V.; Coulomb-Gully, M.; Kumar, A. Multilingual and Multitarget Hate Speech Detection in Tweets. ACL Anthol. 2019, 4, 351–360. [Google Scholar]
  18. Joulin, A.; Grave, E.; Bojanowski, P.; Douze, M.; Jégou, H.; Mikolov, T. FastText.zip: Compressing text classification models. arXiv 2016, arXiv:1612.03651. [Google Scholar]
  19. Pennington, J.; Socher, R.; Manning, C. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; Volume 19, pp. 1532–1543. [Google Scholar]
  20. Corazza, M.; Menini, S.; Cabrio, E.; Tonelli, S.; Villata, S. A Multilingual Evaluation for Online Hate Speech Detection. ACM Trans. Internet Technol. 2020, 20, 1–22. [Google Scholar] [CrossRef]
  21. Ranasinghe, T.; Zampieri, M. Multilingual Offensive Language Identification with Cross-lingual Embeddings. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, online, 16–18 November 2020; pp. 5838–5844. [Google Scholar]
  22. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the ACL 2019—57th Annual Meeting of the Association for Computational Linguistics, Tutorial Abstracts, Florence, Italy, 28 July–2 August 2019; pp. 31–38. [Google Scholar] [CrossRef]
  23. Abozinadah, E.A.; Jones, J.H. A Statistical Learning Approach to Detect Abusive Twitter Accounts. In Proceedings of the International Conference on Compute and Data Analysis—ICCDA ’17, Lakeland, FL, USA, 19–23 May 2017; ACM Press: New York, NY, USA, 2017; pp. 6–13. [Google Scholar]
  24. Mubarak, H.; Darwish, K.; Magdy, W. Abusive Language Detection on Arabic Social Media. In Proceedings of the First Workshop on Abusive Language Online, Vancouver, BC, Canada, August 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 52–56. [Google Scholar]
  25. Albadi, N.; Kurdi, M.; Mishra, S. Are they Our Brothers? Analysis and Detection of Religious Hate Speech in the Arabic Twittersphere. In Proceedings of the 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Barcelona, Spain, 28–31 August 2018; pp. 69–76. [Google Scholar]
  26. Soliman, A.B.; Eissa, K.; El-Beltagy, S.R. AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP. Procedia Comput. Sci. 2017, 117, 256–265. [Google Scholar] [CrossRef]
  27. Al Anezi, F.Y. Arabic Hate Speech Detection Using Deep Recurrent Neural Networks. Appl. Sci. 2022, 12, 6010. [Google Scholar] [CrossRef]
  28. Shannaq, F.; Hammo, B.; Faris, H.; Castillo-Valdivieso, P.A. Offensive Language Detection in Arabic Social Networks Using Evolutionary-Based Classifiers Learned From Fine-Tuned Embeddings. IEEE Access 2022, 10, 75018–75039. [Google Scholar] [CrossRef]
  29. Alsafari, S.; Sadaoui, S.; Mouhoub, M. Hate and offensive speech detection on Arabic social media. Online Soc. Netw. Media 2020, 19, 100096. [Google Scholar] [CrossRef]
  30. Antoun, W.; Baly, F.; Hajj, H. AraBERT: Transformer-based Model for Arabic Language Understanding. arXiv 2020, arXiv:2003.00104. [Google Scholar] [CrossRef]
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  32. Alammar, J. The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning). Available online: https://bit.ly/jalammar2 (accessed on 26 August 2022).
  33. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  34. Alammar, J. The Illustrated Transformer. Available online: https://bit.ly/jalammar1 (accessed on 26 August 2022).
  35. Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; Fidler, S. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; Volume 2015, pp. 19–27. [Google Scholar]
  36. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  37. Zeroual, I.; Goldhahn, D.; Eckart, T.; Lakhouaja, A. OSIAN: Open Source International Arabic News Corpus—Preparation and Integration into the CLARIN-infrastructure. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, Florence, Italy, 1–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 175–182. [Google Scholar]
  38. El-khair, I.A. 1.5 billion words Arabic Corpus. arXiv 2016, arXiv:1611.04033. [Google Scholar]
  39. Sun, C.; Qiu, X.; Xu, Y.; Huang, X. How to Fine-Tune BERT for Text Classification? In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2019; Volume 11856, pp. 194–206. ISBN 9783030323806. [Google Scholar]
  40. Mulki, H.; Haddad, H.; Bechikh Ali, C.; Alshabani, H. L-HSAB: A Levantine Twitter Dataset for Hate Speech and Abusive Language. In Proceedings of the Third Workshop on Abusive Language Online, Florence, Italy, 1 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 111–118. [Google Scholar]
  41. Alakrot, A.; Murray, L.; Nikolov, N.S. Dataset Construction for the Detection of Anti-Social Behaviour in Online Communication in Arabic. Procedia Comput. Sci. 2018, 142, 174–181. [Google Scholar] [CrossRef]
  42. Abdelali, A.; Darwish, K.; Durrani, N.; Mubarak, H. Farasa: A Fast and Furious Segmenter for Arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, San Diego, CA, USA, 12–17 June 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; Volume 2016, pp. 11–16. [Google Scholar]
  43. Boulouard, Z.; Ouaissa, M.; Ouaissa, M. Machine Learning for Hate Speech Detection in Arabic Social Media. In Computational Intelligence in Recent Communication Networks; Springer: Berlin/Heidelberg, Germany, 2022; pp. 147–162. [Google Scholar] [CrossRef]
  44. Alakrot, A.; Fraifer, M.; Nikolov, N.S. Machine Learning Approach to Detection of Offensive Language in Online Communication in Arabic. In Proceedings of the 2021 IEEE 1st International Maghreb Meeting of the Conference on Sciences and Techniques of Automatic Control and Computer Engineering MI-STA, Tripoli, Libya, 25–27 May 2021; pp. 244–249. [Google Scholar]
Figure 1. Encoder stacks in both BERT-Base and BERT-Large [32].
Figure 2. Detailed architecture of the first encoder in a BERT stack [34].
Figure 3. Examples of comments from the dataset.
Figure 4. Number of tokens in each comment from the dataset.
Figure 5. The confusion matrices of BERTEN (a) and AraBERT (b).
Table 1. BERT pre-training configuration.

Hyperparameter     Value
Optimizer          Adam
Learning Rate      1 × 10−4
β1                 0.9
β2                 0.999
L2 Weight Decay    0.01
Dropout            0.1
Activation         GELU [36]
Table 2. BERT fine-tuning configuration.

Hyperparameter          Range of Possible Values
Batch Size              16, 32
Learning Rate (Adam)    5 × 10−5, 3 × 10−5, 2 × 10−5
Number of Epochs        2, 3, 4
Table 3. Classification performance scores.

Algorithm        Accuracy    Precision    Recall    F1-Score
BERTEN           0.98        0.98         0.98      0.98
AraBERT          0.96        0.95         0.96      0.95
LinearSVC [44]   0.90        0.89         0.76      0.81
mBERTAR          0.83        0.84         0.82      0.83
LSTM [43]        0.82        0.92         0.74      0.82
mBERTEN          0.81        0.82         0.80      0.81
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

