AI-Based Misogyny Detection from Arabic Levantine Twitter Tweets
Abstract
1. Introduction
- The Arabic text is represented using the word and word-embedding techniques.
- The state-of-the-art deep learning technique BERT is used to detect Arabic misogyny.
2. Related Works
3. Proposed Model
3.1. ATDS Architecture
3.2. Pre-Processing
- (a) Tokenization: This process converts the Arabic text (sentence) into tokens or words. A document can be split into sentences, and sentences can be split into tokens. Tokenization divides a text sequence into words, symbols, phrases, or tokens [18].
- (b) Normalization: Normalization brings all words into the same form. Many techniques exist, such as stemming, and normalization can be implemented by different methods, such as regular expressions.
- (c) Stop Word Elimination: In text preprocessing, numerous terms appear frequently in a document but carry no critical meaning. These stop words do not help to increase performance, because they provide little information for the sentiment classification task; therefore, they should be removed before the feature selection process.
- (d) Stemming: One word can appear in many distinct forms while its semantic meaning remains the same. Stemming removes or replaces affixes (prefixes and suffixes) to obtain the root, base, or stem word.
- (e) Lemmatization: The goal of lemmatization is the same as that of stemming: to reduce words to their base or root forms. However, lemmatization does not simply cut off the inflection of a word; rather, it leverages lexical information to turn words into their base forms [19].
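The first three steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the pipeline used in this work: the normalization rules (removing diacritics and tatweel, unifying alef variants, mapping alef maqsura to ya) are common conventions assumed here, and the stop-word list is a hypothetical placeholder. Stemming and lemmatization are omitted because they need a linguistic resource.

```python
import re

# Assumed normalization rules (common in Arabic NLP, not taken from the paper).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")    # tashkeel marks and tatweel
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda / hamza
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")       # digits, Latin, punctuation


def normalize(text):
    """Bring Arabic text into one consistent surface form using regexes."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)   # unify alef variants to bare alef
    text = text.replace("\u0649", "\u064A")    # alef maqsura -> ya
    return NON_ARABIC.sub(" ", text)           # drop everything that is not Arabic


def tokenize(text):
    """Split a normalized sentence into word tokens on whitespace."""
    return text.split()


# Hypothetical, tiny stop-word list for illustration only. It is normalized
# with the same rules so that it matches the normalized tokens.
RAW_STOP_WORDS = ["في", "من", "على", "عن", "و"]
STOP_WORDS = {normalize(word) for word in RAW_STOP_WORDS}


def remove_stop_words(tokens):
    """Drop frequent tokens that carry little information for classification."""
    return [token for token in tokens if token not in STOP_WORDS]


def preprocess(tweet):
    """Normalize, tokenize, and remove stop words from one tweet."""
    return remove_stop_words(tokenize(normalize(tweet)))
```

Normalizing before tokenizing keeps the whitespace split simple, and normalizing the stop-word list with the same function avoids mismatches between list entries and tokens.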
3.3. Representation
3.4. Text Detection
- Passive Aggressive Classifier: Passive-Aggressive algorithms are a family of online machine learning algorithms that are popular in big data applications and generally used for large-scale learning. In online learning, the input data arrives in sequential order and the model is updated one example at a time, as opposed to conventional batch learning, where the entire training dataset is used at once [20].
- Logistic Regression: Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable, although many more complex extensions exist [19].
- Random Forest Classifier: The term “Random Forest Classifier” refers to a classification algorithm made up of several decision trees. The algorithm uses randomness when building each individual tree so that the trees in the forest are uncorrelated, and then combines the predictions of the trees to make accurate decisions [19].
- Linear SVC: The support vector machine (SVM) classifier is one of the most commonly used algorithms for text classification due to its good performance. SVM is a non-probabilistic binary linear classification algorithm that plots the training data in a multi-dimensional space and separates the classes with a hyperplane. If the classes cannot be separated linearly in that space, the algorithm adds a new dimension, and this process continues until the training data can be separated into two classes [19].
- Decision Tree Classifier: A Decision Tree is constructed by asking a series of questions about a record of the dataset. Decision Trees are also the building blocks of a Random Forest classifier, which combines multiple Decision Trees that classify a record by majority vote [19].
- K Neighbors Classifier: KNN works by computing the distances between a query and all the examples in the data, selecting the specified number of examples (K) closest to the query, and then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression) [19].
- ARABERTv2: AraBERT is an Arabic pre-trained language model based on Google’s BERT architecture. AraBERT uses the same BERT-Base configuration [21].
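As an illustration of one of the classifiers above, the K Neighbors voting procedure can be sketched in a few lines of plain Python. This is a minimal sketch, not the implementation used in the experiments: the function name `knn_predict`, the Euclidean distance, and the default `k=3` are assumptions made for the example, and the feature vectors stand in for whatever text representation is chosen.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training example (Euclidean, assumed).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k examples closest to the query.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Vote for the most frequent label among those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D feature vectors, purely illustrative.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["none", "none", "none", "misogyny", "misogyny", "misogyny"]
prediction = knn_predict(points, labels, (0.5, 0.5))
```

Because every query is compared with all training examples, prediction cost grows with the size of the training set, which is one reason the other models in the list may be preferred for large corpora.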
4. Experimental Analysis
4.1. Dataset
- Damning (Damn): tweets under this class contained cursing content.
- Derailing (Der): tweets under this class contained justification of women’s abuse or mistreatment.
- Discredit (Disc): tweets under this class contained slurs and offensive language against women.
- Dominance (Dom): tweets under this class implied the superiority of men over women.
- Sexual Harassment (Harass): tweets under this class described sexual advances and abuse of a sexual nature.
- Stereotyping and Objectification (Obj): tweets under this class promoted a fixed image of women or described women’s physical appeal.
- Threat of Violence (Vio): tweets under this class had intimidating content with threats of physical violence.
- None: if no misogynistic behaviors existed.
4.2. Implementation Environment
4.3. Evaluation Metrics
5. Results and Discussion
- Misogyny identification (Binary): Tweets were classified as misogynistic or non-misogynistic. This required merging the seven categories of misogyny into a single misogyny class.
- Categories classification (Multi-class): Tweets were classified into eight categories: discredit, dominance, damning, derailing, sexual harassment, stereotyping and objectification, threat of violence, or non-misogynistic. In terms of generalization, Linear SVC outperformed all compared machine learning models, and ARABERTv2 performed best as the deep learning technique.
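The merge of the seven misogyny categories into one class for the binary task can be sketched as a simple label mapping. This is an illustrative sketch only: the label strings and the function name `to_binary` are assumptions, since the exact label spellings in the dataset files are not given here.

```python
# Assumed label spellings for the seven misogyny categories (hypothetical).
MISOGYNY_CLASSES = {
    "damning",
    "derailing",
    "discredit",
    "dominance",
    "sexual harassment",
    "stereotyping and objectification",
    "threat of violence",
}


def to_binary(label):
    """Map a fine-grained category label to 'misogyny' or 'none'."""
    name = label.strip().lower()
    if name == "none":
        return "none"
    if name in MISOGYNY_CLASSES:
        return "misogyny"
    # Fail loudly on unknown labels instead of silently guessing a class.
    raise ValueError(f"unknown label: {label!r}")


multi_class_labels = ["Discredit", "None", "Dominance", "Threat of violence"]
binary_labels = [to_binary(label) for label in multi_class_labels]
```

Raising an error on unknown labels makes spelling mismatches visible early, before they distort the class counts.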
5.1. Binary Classification
5.2. Multi Classification
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Farha, I.A.; Magdy, W. Multitask Learning for Arabic Offensive Language and Hate-Speech Detection. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, Marseille, France, 11–16 May 2020; pp. 86–90. Available online: https://www.aclweb.org/anthology/2020.osact-1.14 (accessed on 22 February 2021).
- Mulki, H.; Ghanem, B. Let-Mi: An Arabic Levantine Twitter Dataset for Misogynistic Language. arXiv 2021, 154–163. Available online: http://arxiv.org/abs/2103.10195 (accessed on 22 February 2021).
- Alkhair, M.; Meftouh, K.; Othman, N.; Smali, K. An Arabic Corpus of Fake News: Collection, Analysis and Classification. Arabic Lang. Process. 2019. HAL Id: hal-02314246.
- Jahan, M.S.; Oussalah, M. A Systematic Review of Hate Speech Automatic Detection Using Natural Language Processing. 2021. Available online: http://arxiv.org/abs/2106.00742 (accessed on 22 February 2021).
- Alshalan, R.; Al-Khalifa, H. A deep learning approach for automatic hate speech detection in the Saudi Twittersphere. Appl. Sci. 2020, 10, 8614.
- Samghabadi, N.S.; Patwa, P.; Pykl, S.; Mukherjee, P.; Das, A.; Solorio, T. Aggression and Misogyny Detection using BERT: A Multi-Task Approach. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, Marseille, France, 11–16 May 2020; pp. 126–131. Available online: https://www.aclweb.org/anthology/2020.trac-1.20 (accessed on 22 February 2021).
- Fersini, E.; Nozza, D.; Rosso, P. AMI @ EVALITA2020: Automatic misogyny identification. CEUR Workshop Proc. 2020, 2765.
- Hengle, A.; Kshirsagar, A.; Desai, S.; Marathe, M. Combining Context-Free and Contextualized Representations for Arabic Sarcasm Detection and Sentiment Identification. 2021. Available online: http://arxiv.org/abs/2103.05683 (accessed on 22 February 2021).
- Al-Yahya, M.; Al-Khalifa, H.; Al-Baity, H.; Alsaeed, D.; Essam, A. Arabic Fake News Detection: Comparative Study of Neural Networks and Transformer-Based Approaches. Complexity 2021, 2021, 5516945.
- Suleiman, D.; Awajan, A.; Al-Madi, N. Deep learning based technique for plagiarism detection in Arabic texts. In Proceedings of the 2017 International Conference on New Trends in Computing Sciences (ICTCS), Amman, Jordan, 11–13 October 2017; pp. 216–222.
- Husain, F. Arabic Offensive Language Detection Using Machine Learning and Ensemble Machine Learning Approaches. 2020. Available online: http://arxiv.org/abs/2005.08946 (accessed on 22 February 2021).
- Husain, F.; Uzuner, O. Transfer Learning Approach for Arabic Offensive Language Detection System—BERT-Based Model. 2021. Available online: http://arxiv.org/abs/2102.05708 (accessed on 22 February 2021).
- Abuzayed, A.; Al-Khalifa, H. Sarcasm and Sentiment Detection in Arabic Tweets Using BERT-based Models and Data Augmentation. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19–20 April 2021; pp. 312–317. Available online: https://www.aclweb.org/anthology/2021.wanlp-1.38 (accessed on 22 February 2021).
- Lichouri, M.; Abbas, M.; Benaziz, B.; Zitouni, A.; Lounnas, K. Preprocessing Solutions for Detection of Sarcasm and Sentiment for Arabic. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19–20 April 2021; pp. 376–380. Available online: https://www.aclweb.org/anthology/2021.wanlp-1.49 (accessed on 22 February 2021).
- Frenda, S.; Ghanem, B.; Montes-y-Gómez, M. Exploration of misogyny in Spanish and English tweets. CEUR Workshop Proc. 2018, 2150, 260–267.
- Muaad, A.; Jayappa, H.; Al-Antari, M.; Lee, S. ArCAR: A Novel Deep Learning Computer-Aided Recognition for Character-Level Arabic Text Representation and Recognition. Algorithms 2021, 14, 216.
- Alyafeai, Z.; Al-shaibani, M.S.; Ghaleb, M.; Ahmad, I. Evaluating Various Tokenizers for Arabic Text Classification. 2021, Volume 5. Available online: http://arxiv.org/abs/2106.07540 (accessed on 22 February 2021).
- Kowsari, K.; Meimandi, K.J.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text classification algorithms: A survey. Information 2019, 10, 150.
- Huang, J. Detecting fake news with machine learning. J. Phys. Conf. Ser. 2020, 1693, 012158.
- Antoun, W.; Baly, F.; Hajj, H. AraBERT: Transformer-based Model for Arabic Language Understanding. arXiv 2020, arXiv:2003.00104.
- Chola, C.; Benifa, J.V.; Guru, D.S.; Muaad, A.Y.; Hanumanthappa, J.; Al-Antari, M.A.; Gumaei, A.H. Gender Identification and Classification of Drosophila melanogaster Flies Using Machine Learning Techniques. Comput. Math. Methods Medicine 2022, in press.
- Hanumanthappa, J.; Muaad, A.Y.; Bibal Benifa, J.V.; Chola, C.; Hiremath, V.; Pramodha, M. IoT-Based Smart Diagnosis System for HealthCare. In Sustainable Communication Networks and Application. Lecture Notes on Data Engineering and Communications Technologies; Karrupusamy, P., Balas, V.E., Shi, Y., Eds.; Springer: Singapore, 2022; Volume 93.

Type of Misogyny | نوع كراهية النساء | No. of Tweets |
---|---|---|
None | لاشي | 3061 |
misogyny | كراهية النساء | 4805 |

Type of Misogyny (Class Name) | نوع كراهية النساء | No. of Tweets |
---|---|---|
Discredit | تشويه السمعة | 2327 |
Stereotyping and objectification | الكتابة والصياغة المجسمة | 290 |
Damning | اللعنة | 256 |
Threat of violence | التهديد بالعنف | 175 |
Derailing | الخروج عن السكة | 59 |
Dominance | هيمنة | 38 |
Sexual harassment | التحرش الجنسي | 17 |
None | لا كراهية | 3388 |

Method | Az | Sp | Re | F-M |
---|---|---|---|---|
Passive Aggressive Classifier | 81 | 84 | 86 | 85 |
Logistic Regression | 81.50 | 81 | 90 | 86 |
Random Forest Classifier | 62 | 62 | 68 | 76 |
Linear SVC | 83 | 85 | 88 | 86 |
Decision Tree Classifier | 70 | 74 | 78 | 76 |
K Neighbors Classifier | 65 | 64 | 98 | 78 |
ARABERTv2 | 90 | - | - | - |

Method | Az | Sp | Re | F-M |
---|---|---|---|---|
Passive Aggressive Classifier | 72 | 72 | 72 | 72 |
Logistic Regression | 69 | 68 | 68 | 68 |
Random Forest Classifier | 40 | 39 | 39 | 39 |
Linear SVC | 74 | 73 | 73 | 73 |
Decision Tree Classifier | 56 | 56 | 56 | 56 |
K Neighbors Classifier | 44 | 43 | 43 | 43 |
ARABERTv2 | 89 | - | - | - |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Muaad, A.Y.; Davanagere, H.J.; Al-antari, M.A.; Benifa, J.V.B.; Chola, C. AI-Based Misogyny Detection from Arabic Levantine Twitter Tweets. Comput. Sci. Math. Forum 2022, 2, 15. https://doi.org/10.3390/IOCA2021-10880