Article

UA-HSD-2025: Multi-Lingual Hate Speech Detection from Tweets Using Pre-Trained Transformers

1 Centro de Investigación en Computación, Instituto Politécnico Nacional (CIC-PN), Mexico City 07738, Mexico
2 Department of Software Engineering, The Islamia University of Bahawalpur, Bahawalpur 63100, Pakistan
3 School of Informatics and Robotics, Institute of Arts and Culture, Lahore 54000, Pakistan
* Author to whom correspondence should be addressed.
Computers 2025, 14(6), 239; https://doi.org/10.3390/computers14060239
Submission received: 27 May 2025 / Revised: 11 June 2025 / Accepted: 13 June 2025 / Published: 18 June 2025
(This article belongs to the Special Issue Recent Advances in Social Networks and Social Media)

Abstract

The rise of social media has improved communication but has also amplified the spread of hate speech, creating serious societal risks. Automated detection remains difficult because of subjectivity, linguistic diversity, and implicit language. While prior research has focused on high-resource languages, this study addresses the underexplored multilingual challenges of Arabic and Urdu hate speech. It makes four key contributions. First, we created a multilingual, manually annotated dataset (UA-HSD-2025) sourced from X, with both binary and multi-class labels; the multi-class labels cover five major categories of hate speech. Second, we developed detailed annotation guidelines to support consistent and reliable labeling. Third, we explore two strategies for handling multilingual data: a translation-based approach and a joint multilingual approach. The translation-based approach converts all input text into a single target language before applying a classifier. In contrast, the joint multilingual approach uses a unified model trained on multiple languages simultaneously, so it can classify text in different languages without translation. Finally, we conducted 54 experiments spanning machine learning models with TF-IDF features, deep learning models with pre-trained word embeddings such as FastText and GloVe, and pre-trained language models with contextual embeddings. In these experiments, the pre-trained language model XLM-R outperformed the traditional supervised learning approaches, achieving 0.99 accuracy in binary classification on the Arabic, Urdu, and joint multilingual datasets, and 0.95, 0.94, and 0.94 accuracy in multi-class classification on the joint multilingual, Arabic, and Urdu datasets, respectively.
Keywords: social media; deep learning; machine learning; transfer learning; data mining; SVM; BERT; RoBERTa; hate speech; Urdu hate speech; Arabic hate speech
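
The difference between the two multilingual strategies described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article: the placeholder tokens (`ar_hate`, `ur_peace`, and so on) stand in for real Arabic and Urdu text, `TOY_LEXICON` and `toy_translate` stand in for a machine translation system, and `TinyNaiveBayes` stands in for whichever classifier is used (the study reports TF-IDF models, embedding-based deep models, and XLM-R).

```python
from collections import Counter, defaultdict
import math


class TinyNaiveBayes:
    """Toy multinomial Naive Bayes over whitespace tokens.

    Hypothetical stand-in for any classifier; not the models used in the study.
    """

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(text.split())
        self.vocab = {tok for counts in self.token_counts.values() for tok in counts}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total)  # class prior
            for tok in text.split():
                score += math.log((counts[tok] + 1) / denom)  # Laplace smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up placeholder data: 1 = hate, 0 = not hate.
ARABIC = [("ar_hate ar_group", 1), ("ar_peace ar_group", 0)]
URDU = [("ur_hate ur_group", 1), ("ur_peace ur_group", 0)]

# Hypothetical word-level "translator" into one pivot language.
TOY_LEXICON = {
    "ar_hate": "hate", "ur_hate": "hate",
    "ar_peace": "peace", "ur_peace": "peace",
    "ar_group": "group", "ur_group": "group",
}


def toy_translate(text):
    """Map each token to the pivot language; unknown tokens pass through."""
    return " ".join(TOY_LEXICON.get(tok, tok) for tok in text.split())


texts, labels = zip(*(ARABIC + URDU))

# Joint multilingual strategy: one model trained on raw text in both languages.
joint_clf = TinyNaiveBayes().fit(texts, labels)

# Translation-based strategy: translate everything first, then train one model.
pivot_clf = TinyNaiveBayes().fit([toy_translate(t) for t in texts], labels)


def predict_joint(clf, text):
    return clf.predict(text)  # no translation step


def predict_translated(clf, text):
    return clf.predict(toy_translate(text))  # translate, then classify
```

The only structural difference between the two paths is where the language gap is closed: in the translation-based path it is closed before the classifier sees the text, while in the joint path the classifier itself must learn from both languages at once.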

Share and Cite

MDPI and ACS Style

Ahmad, M.; Waqas, M.; Hamza, A.; Usman, S.; Batyrshin, I.; Sidorov, G. UA-HSD-2025: Multi-Lingual Hate Speech Detection from Tweets Using Pre-Trained Transformers. Computers 2025, 14, 239. https://doi.org/10.3390/computers14060239

