Article

A Large Language Model-Based Approach for Multilingual Hate Speech Detection on Social Media

Centro de Investigación en Computación, Instituto Politécnico Nacional (CIC-IPN), Mexico City 07320, Mexico
*
Author to whom correspondence should be addressed.
Computers 2025, 14(7), 279; https://doi.org/10.3390/computers14070279
Submission received: 22 June 2025 / Revised: 4 July 2025 / Accepted: 11 July 2025 / Published: 15 July 2025
(This article belongs to the Special Issue Recent Advances in Social Networks and Social Media)

Abstract

The proliferation of hate speech on social media platforms poses significant threats to digital safety, social cohesion, and freedom of expression. Detecting such content—especially across diverse languages—remains a challenging task due to linguistic complexity, cultural context, and resource limitations. To address these challenges, this study introduces a comprehensive approach for multilingual hate speech detection. To facilitate robust hate speech detection across diverse languages, this study makes several key contributions. First, we created a novel trilingual hate speech dataset consisting of 10,193 manually annotated tweets in English, Spanish, and Urdu. Second, we applied two innovative techniques—joint multilingual and translation-based approaches—for cross-lingual hate speech detection that have not been previously explored for these languages. Third, we developed detailed hate speech annotation guidelines tailored specifically to all three languages to ensure consistent and high-quality labeling. Finally, we conducted 41 experiments employing machine learning models with TF–IDF features, deep learning models utilizing FastText and GloVe embeddings, and transformer-based models leveraging advanced contextual embeddings to comprehensively evaluate our approach. Additionally, we employed a large language model with advanced contextual embeddings to identify the best solution for the hate speech detection task. The experimental results showed that our GPT-3.5-turbo model significantly outperforms strong baselines, achieving up to an 8% improvement over XLM-R in Urdu hate speech detection and an average gain of 4% across all three languages. This research not only contributes a high-quality multilingual dataset but also offers a scalable and inclusive framework for hate speech detection in underrepresented languages.

1. Introduction

Social media platforms, such as X, Facebook, Instagram and YouTube, have transformed global communication, allowing for real-time connectivity between cultures. They foster vibrant online communities and shape public opinion [1,2]. However, their widespread use has amplified harmful content, particularly hate speech, which targets individuals or groups based on attributes such as race, religion, or gender [3]. This phenomenon undermines online safety and social cohesion, necessitating robust detection mechanisms. In contrast, positive discourse, such as hope speech [4,5,6] and social support, encourages constructive dialogue, underscoring the need for robust detection systems to mitigate harm and promote healthier online environments [7]. Significant research has advanced hate speech detection [8], primarily focusing on monolingual datasets in high-resource languages such as English, Spanish, and code-mixed Dravidian languages such as Tamil and Malayalam [9]. Many studies have explored explainable datasets and regional linguistic nuances, achieving strong performance with transformer modifications [10]. Multilingual efforts, such as the SemEval-2021 toxic spans detection task, have made further progress [11].
However, hate speech detection with joint multilingual techniques across languages such as Urdu, Spanish, and English using large language models remains underdeveloped due to the lack of annotated datasets and the complexity of their linguistic structures. English, Spanish, and Urdu are widely spoken globally and present diverse linguistic challenges [12]. English dominates social networks, Spanish is critical in regions like Latin America, and Urdu is prevalent in South Asia and diaspora communities, often in code-mixed forms [13]. The lack of annotated datasets for Urdu limits effective natural language processing (NLP) solutions [14]. To address this, we introduce a novel trilingual dataset of 10,193 tweets in English, Urdu, and Spanish annotated with high inter-annotator agreement and evaluate a unified multilingual pipeline using machine learning, transformer models, and large language models (LLMs).
Recent advancements in natural language processing (NLP), particularly the development of transformer-based and multilingual transformer models, have significantly improved cross-lingual understanding through transfer learning [15,16]. While translation-based methods facilitate the alignment of diverse linguistic data [17], core NLP tasks such as clinical text classification [18], sentiment analysis [19], and Named Entity Recognition (NER) [20] have also seen remarkable improvements. However, research on hate speech detection, toxic content classification, and entity extraction in the context of social media—especially using joint multilingual approaches across Urdu, English, and Spanish—remains largely underexplored. These three languages were deliberately chosen due to their global and digital significance: English is a dominant language on the internet and in NLP research, Spanish is one of the most widely spoken languages worldwide with a vast online presence, and Urdu represents a low-resource language with growing digital content and a unique set of linguistic challenges. Together, they offer a diverse set of orthographic, morphological, and syntactic features that help evaluate the robustness and adaptability of multilingual NLP models. Low-resource languages such as Urdu continue to be underrepresented in this field, which hinders the development of inclusive and effective NLP solutions [21].
To address this gap, we manually annotated three distinct datasets in English, Urdu, and Spanish, ensuring high-quality, language-specific hate speech annotations. Following rigorous preprocessing, we performed feature extraction using TF–IDF combined with classical machine learning models. In parallel, we employed pretrained word embeddings—FastText and GloVe—within BiLSTM and CNN architectures to capture contextual semantics. To further enhance performance, we incorporated advanced contextual embeddings using transformer-based models including BERT, RoBERTa, XLM-RoBERTa, and Google’s ELECTRA. Additionally, we employed the large language model OpenAI GPT-3.5-turbo to perform hate speech classification. Our comprehensive approach aims to develop a robust, real-time multilingual hate speech detection system tailored to the unique linguistic and cultural nuances of online discourse, thereby contributing to safer digital environments across diverse communities.
This study makes the following contributions:
To the best of our knowledge, joint multilingual and translation-based techniques have not been previously explored on a combined Spanish, English, and Urdu dataset. Our work pioneers this approach, enabling more inclusive and effective hate speech detection across diverse languages.
We developed a high-quality multilingual dataset comprising 10,193 annotated tweets in English, Urdu, and Spanish for hate speech detection. This dataset enhances cross-lingual research and supports the robust evaluation of multilingual hate speech models.
We developed detailed pseudo-code for our multilingual hate speech detection pipeline to support future research and enable better reproducibility.
We conducted 41 experiments to evaluate and compare the performance of machine learning, deep learning, transfer learning, and large language models on our trilingual hate speech detection tasks, aiming to identify the most effective model for this challenge.
Based on the results, our proposed GPT-3.5 Turbo model outperformed the transformer-based XLM-R model, achieving up to an 8% improvement in Urdu and an overall average gain of 4% across English, Spanish, and Urdu in hate speech detection.

2. Literature Review

The widespread use of social media has intensified the spread of hate speech, posing threats to online safety and social cohesion [22]. Detecting hate speech across diverse linguistic contexts requires addressing challenges such as cultural nuances, resource scarcity, and contextual ambiguities. Recent advancements in natural language processing (NLP) have leveraged transformer-based models, attention mechanisms, and cross-lingual embeddings to tackle these issues, with limited focus on low-resource languages like Urdu. This review critically analyzes journal-based research on multilingual hate speech detection, emphasizing datasets, Urdu-specific challenges, transformer-based models, attention mechanisms, translation-based methods, and ethical considerations. It positions our trilingual dataset (10,193 tweets: 3834 English, 3197 Urdu, and 3162 Spanish) and attention-augmented, translation-based framework as a significant contribution to robust detection in diverse linguistic settings.

2.1. Multilingual and Low-Resource Hate Speech Datasets

Recent efforts in hate speech detection have shown a growing shift from monolingual datasets to multilingual and culturally diverse corpora, recognizing the global nature of online discourse. Aluru et al. [23] developed a trilingual dataset covering English, Hindi, and Tamil, highlighting the need for culturally sensitive annotations and enabling cross-lingual model evaluation. Similarly, Ousidhoum et al. [24] curated a fine-grained dataset of 13,000 English and Arabic tweets with labels for hostility and racism aimed at improving model robustness across cultures. Bosco et al. [25] focused on Spanish tweets targeting immigrants and women, showing strong performance with BERT-based models. Despite these advances, low-resource languages such as Urdu remain largely underrepresented, limiting the development of inclusive hate speech detection systems [26]. Our work addresses this gap by introducing a trilingual dataset comprising 10,193 tweets—including 3197 in Urdu—annotated with high inter-rater agreement (Fleiss’ Kappa = 0.821), providing a robust foundation for multilingual hate speech research.

2.2. Urdu-Specific Challenges

Urdu, spoken by over 230 million people, presents several unique challenges in NLP due to its use of both Perso-Arabic and Roman scripts, frequent code-mixing with English, and considerable orthographic variability [25]. Akhter et al. [27] created a Roman Urdu dataset for sentiment analysis and highlighted the complications of informal, inconsistent writing that are equally relevant for hate speech detection. Javed et al. [28] applied LSTM models to Perso-Arabic Urdu for hate speech tasks but noted the lack of large-scale annotated corpora as a major hindrance. Our dataset builds on these foundations by including 3197 Urdu tweets, standardized using the Google Translate API to minimize inconsistencies across scripts and annotated with high reliability to facilitate robust multilingual and code-mixed hate speech processing.

2.3. Transformer-Based Models

Transformer-based architectures have revolutionized hate speech detection by effectively capturing complex semantic dependencies and contextual nuances. Waseem et al. [29] applied attention mechanisms to multilingual datasets (English, German, and Turkish), enhancing cross-lingual semantic understanding and achieving improved F1-scores. Fortuna et al. [30] used XLM-RoBERTa to detect offensive language in English and Portuguese, showing that attention-driven contextualization is particularly effective in identifying implicit hate speech. For low-resource settings, Qureshi et al. [31] adapted DistilBERT for Urdu sentiment analysis, demonstrating the model’s potential to overcome data scarcity—a key challenge in hate speech tasks. Our study extends this line of research by integrating attention-based transformers (BERT and XLM-RoBERTa) and LLMs (GPT-3.5 Turbo), achieving F1-scores of 0.87 (English), 0.85 (Spanish), 0.81 (Urdu), and 0.88 in Joint Multilingual settings, highlighting the effectiveness of cross-lingual transformers in underrepresented contexts.

2.4. Attention Mechanisms in NLP

Attention mechanisms are central to modern NLP, enabling models to prioritize relevant parts of input sequences for enhanced context understanding. Bahdanau et al. [32] introduced attention for neural machine translation, allowing for the dynamic alignment of source and target sequences and laying the groundwork for subsequent innovations. Vig et al. [33] popularized attention with transformer architectures using multi-head attention to process input tokens in parallel, capturing diverse semantic relationships critical for hate speech detection. Recent advancements include Linformer, which reduces computational complexity through low-rank approximations, improving scalability for longer sequences [34]. Flash Attention optimizes memory usage by recomputing intermediate values, achieving significant speed improvements [35]. In hate speech detection, attention mechanisms enhance context modeling, as shown by Salawu et al. [36], who used attention in a graph-based approach to detect subtle hate speech patterns in multilingual tweets. Our framework leverages multi-head attention in transformers and explores sparse attention variants, improving performance on code-mixed Urdu texts by focusing on contextually relevant tokens.

2.5. Translation-Based Approaches

Translation-based strategies have become vital in addressing data scarcity for multilingual NLP tasks. Ranasinghe et al. [37] aligned English and Spanish datasets using machine translation, which significantly improved mBERT’s cross-lingual performance. Stappen et al. [38] applied translation-augmented training to German and Portuguese, boosting zero-shot generalization. For Urdu, Ali et al. [39] translated Roman Urdu texts for sentiment classification and observed improvements, though challenges with slang and informal expressions remained. Our approach incorporates translation through the Google Translate API to standardize English, Urdu, and Spanish tweets, enabling consistent annotation and joint multilingual processing. This strategy contributed to our model achieving a state-of-the-art F1-score of 0.88, outperforming traditional SVM baselines by 8.75–10.17 percentage points.

2.6. Advanced Methodologies

Recent methodologies have enhanced hate speech detection. Salawu et al. proposed a graph-based model with attention mechanisms to capture contextual relationships in multilingual tweets, improving the detection of subtle hate speech [38]. Vidgen and Derczynski developed a semi-supervised framework for English, leveraging unlabeled data for enhanced robustness [40]. Pereira-Kohatsu et al. used CNNs with attention layers for Spanish, capturing linguistic nuances [41]. Our ensemble combines SVM, BiLSTM with GloVe/FastText embeddings, attention-augmented transformers, and LLMs, achieving significant performance gains over baselines.

2.7. Research Gaps and Opportunities

While multilingual hate speech detection has advanced in recent years, the integration of low-resource languages such as Urdu—especially in its Perso-Arabic and Romanized forms—within unified multilingual frameworks remains largely unaddressed. Although translation has been used for data augmentation, its application as a primary pipeline for standardizing multilingual corpora involving underrepresented scripts like Urdu has not been systematically evaluated or compared with native multilingual transformer models. Additionally, prior work often overlooks comprehensive benchmarking across classical machine learning, deep learning, and large language models (LLMs) on balanced trilingual datasets. To address these gaps, our study introduces a novel trilingual dataset comprising 10,193 annotated tweets in English, Spanish, and Urdu with high inter-annotator agreement (Fleiss’ Kappa = 0.821). We propose a translation-based preprocessing pipeline using the Google Translate API to standardize linguistic and script variations and to evaluate a diverse ensemble of models—including SVM, CNN, BiLSTM, BERT, XLM-RoBERTa, and GPT-3.5 Turbo—augmented with multi-head and sparse attention mechanisms. Our approach achieves an F1-score of up to 0.88, demonstrating that translation-driven standardization combined with attention-augmented transformers can effectively improve multilingual hate speech detection, particularly for low-resource and script-diverse languages like Urdu.

3. Methodology and Design

3.1. Construction of Dataset

The dataset was curated using the Tweepy API to collect tweets from X (formerly Twitter) between January 2024 and February 2025, resulting in 10,193 tweets: 3834 English (1809 “Hateful,” 2025 “Not-Hateful”), 3197 Urdu (1642 “Hateful,” 1555 “Not-Hateful”), and 3162 Spanish (1398 “Hateful,” 1764 “Not-Hateful”), with a total of 4849 “Hateful” and 5344 “Not-Hateful” labels across the combined dataset.
We employed stratified sampling to ensure a balanced representation of hateful and non-hateful content across languages, maintaining proportional label distributions within each language subset. Keywords were selected to capture a broad emotional spectrum, including both hateful and non-hateful sentiments, to avoid bias toward negative content.
For English, terms included hate-related words such as “fuck,” “cunt,” and “shithead,” along with neutral terms like “support.” For Urdu, examples included offensive terms like “کتے” (dog) and “حرامی” (motherfucker), along with neutral or positive words like “امید” (hope). For Spanish, hateful expressions such as “hijo de puta” (son of a bitch) and “mierda” (shit) were used alongside neutral terms like “gracias” (thank you). Including neutral or positive terms like “gracias” and “support” helped capture non-hateful contexts such as gratitude or encouragement, ensuring the dataset reflects the diversity of real-world discourse. Additional terms like “war” and “sorrow” broadened the emotional range.
To evaluate the reliability of keyword-based sampling, we analyzed the annotation outcomes relative to the type of keywords used during data collection. Among tweets retrieved using hate-related keywords, approximately 43% were ultimately labeled as non-hateful, demonstrating that keyword presence alone is not a definitive indicator of hate speech and reflecting substantial noise introduced by keyword filtering. Conversely, 57% of these tweets were confirmed as hateful, indicating that keyword-based selection can still capture relevant content, albeit imperfectly. Interestingly, among tweets retrieved using neutral or positive keywords, approximately 12% were eventually annotated as hateful, often due to sarcastic expressions, coded language, or implicit hate not captured by overt keywords. These findings highlight both the strengths and limitations of keyword-based sampling. While it facilitates targeted data collection, it also introduces considerable noise and challenges for the annotation process, underscoring the critical role of comprehensive manual annotation to ensure dataset reliability. From an initial pool of 58,000 tweets, the final dataset was filtered and stored in three CSV files stratified by language. To clarify, the class distribution of 5344 “Not-Hateful” and 4849 “Hateful” across all languages reflects the total dataset counts, ensuring consistency with the reported 3834 English, 3197 Urdu, and 3162 Spanish tweets. Figure 1 illustrates the methodology workflow and data collection process.
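A minimal collection sketch is given below, assuming the Twitter API v2 via Tweepy; the bearer token, keyword lists, and output format are illustrative placeholders rather than the exact configuration used for the corpus, and a window spanning January 2024 to February 2025 would require the full-archive search endpoint instead of the recent-search endpoint shown here.

```python
# Illustrative sketch only: credentials, query terms, and quota handling are
# placeholders; the authors' exact collection script is not published.
import csv
import tweepy

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder
client = tweepy.Client(bearer_token=BEARER_TOKEN)

# Hypothetical keyword lists mixing hateful and neutral terms per language
KEYWORDS = {
    "en": ["hate", "support"],
    "ur": ["کتے", "امید"],
    "es": ["mierda", "gracias"],
}

def collect(lang, terms, limit=200):
    """Collect recent tweets for one language; a longer time window would
    require the full-archive endpoint (search_all_tweets) instead."""
    query = f"({' OR '.join(terms)}) lang:{lang} -is:retweet"
    rows = []
    for tweet in tweepy.Paginator(
        client.search_recent_tweets, query=query,
        tweet_fields=["lang", "created_at"], max_results=100
    ).flatten(limit=limit):
        rows.append({"text": tweet.text, "lang": lang})
    return rows

with open("raw_tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "lang"])
    writer.writeheader()
    for lang, terms in KEYWORDS.items():
        writer.writerows(collect(lang, terms))
```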

3.2. Annotation

Annotation involved binary classification (“Hateful” or “Not-Hateful”) by three native-speaking postgraduate Computer Science students per language, with distinct annotator groups for English, Urdu, and Spanish to ensure linguistic and cultural expertise.
The annotation team included one female and two male annotators for English (ages 24–28), two female and one male annotators for Urdu (ages 23–27), and one female and two male annotators for Spanish (ages 25–29), ensuring a diversity of perspectives to minimize annotation bias. Each annotator independently labeled the 10,193 tweets, and the final label for each instance was determined through majority voting (i.e., agreement by at least two annotators) within each language group.
The annotation process was guided by detailed, language-specific criteria designed to distinguish hateful content—such as the use of a hostile tone, derogatory slurs (e.g., “fuck”), and prejudice against minorities—from non-hateful content, which typically includes neutral or empathetic language without an intent to harm.
To ensure cross-lingual consistency, we developed a standardized annotation guideline, which was translated into English, Urdu, and Spanish, and conducted a joint training session for all annotators. During training, annotators collaboratively discussed cultural nuances to harmonize their understanding of hate speech across languages. For example, in Urdu, the word “کتا” (kutta-dog) is commonly used as a strong insult, whereas in Spanish, phrases like “hijo de puta” (son of a bitch) are considered highly offensive. Such discussions helped refine the annotation framework to accommodate linguistic and cultural variations across languages.
Additionally, a cross-language validation subset (510 tweets, 5% of the dataset) was translated into English as a pivot language by bilingual experts (e.g., Urdu–English and Spanish–English translators) and annotated by all groups. This process ensured that annotators, despite not speaking each other’s languages, could align their understanding through translated examples and shared guidelines. Periodic calibration meetings were held to resolve discrepancies, further ensuring consistency across languages. Table 1 illustrates examples of hateful and non-hateful tweets.

3.3. Annotation Guidelines

To develop a high-quality and consistent multilingual hate speech detection dataset, we designed annotation guidelines for annotators working with English, Urdu, and Spanish social media posts. These guidelines were formulated based on previous literature, platform community standards, and expert consultations, with a focus on cultural sensitivity and linguistic diversity. Examples from the dataset are given in Table 1.

3.3.1. Hate

- Identify whether the post targets a person or group based on characteristics such as race, religion, ethnicity, gender, sexual orientation, nationality, disability, or political beliefs. For example, in Urdu, posts targeting Pathans or Shias may reflect ethnic or sectarian bias common in Pakistan. In Latin American Spanish, xenophobic comments against Venezuelan or Bolivian migrants are prevalent.
- Read the sentence carefully and check for explicit hate using direct slurs, threats, or dehumanizing language. For instance, Urdu hate speech might include words implying certain ethnic groups are terrorists or criminals, while Spanish posts may use derogatory terms to demean migrants or minorities.
- Look closely at the language and detect implicit hate through sarcasm, coded language, stereotypes, or indirect suggestions of harm or inferiority. In Urdu, phrases limiting women’s roles to domestic spaces or controlling social behavior reflect deep-rooted patriarchy. In Latin American Spanish, homophobic slurs or stereotypes about laziness may be implied rather than directly stated.
- Pay attention to whether the post distinguishes between offensive but non-hateful language (e.g., profanity used casually or generally) and actual hate speech. Cultural context matters: certain slang or swear words in Urdu or Spanish may be used colloquially without hateful intent but could also be weaponized in hate speech.
- Read the sentence carefully and find context—evaluate tone, intention, and any references that might alter the meaning (e.g., satire, irony, or humor). Some posts may appear hateful but could be sarcastic or ironic. Understanding local humor or cultural references in Urdu or Spanish is crucial.
- Be alert to cultural and regional phrases or slang that may carry hateful meanings in local usage. For example, derogatory terms unique to Pakistani regional dialects or Latin American Spanish regionalisms should be recognized as potentially hateful.
- Consider if the post incites, encourages, or glorifies violence or discrimination. Posts calling for violence against ethnic groups such as Shias in Urdu or expulsions of migrants in Latin American contexts are clear examples.
- Avoid judging based solely on keywords—a word may be hateful in one context and neutral in another. The word “kutta” (dog) in Urdu is a common insult but context and target group must be considered.
- Evaluate whether the post contributes to hostility or social division against a group or individual. Posts that reinforce stereotypes or encourage exclusion, even subtly, contribute to social division.
- In case of doubt or ambiguity, flag the post for expert review or discuss it with the annotation team for consensus.

3.3.2. Not Hate

- Expresses opinions, disagreements, or criticism without targeting a group with hate or inciting violence.
- Uses sarcasm or humor without causing harm or marginalizing others.
- Contains offensive language or slang used casually or in non-targeted ways (e.g., swearing without targeting a group).
- Discusses social issues or controversial topics respectfully or neutrally.
- Makes personal experiences or general observations without demeaning others.
- Includes neutral, informative, or supportive content.
- Shows empathetic, constructive, or inclusive tone even when discussing sensitive topics.
- Debates or criticizes ideas or policies without attacking identity groups.
- Mentions protected characteristics (e.g., race and religion) in non-hostile, descriptive, or academic ways.
- Lacks harmful intent or effect and does not provoke hostility or fear.

3.4. Annotation Selection

In the present study, we prepared and used hate speech datasets in three languages (English, Spanish, and Urdu) with the aim of comprehensively examining multilingual hate speech detection models. The data sampling procedure was planned to cover the linguistic and cultural contexts in which hate speech appears on online social platforms.
To ensure the quality and reliability of the annotated data, we adopted a manual annotation approach carried out by domain-sensitive annotators who were experts in the respective languages. For the Urdu and English datasets, three members of the annotation team were involved: the first author, the second author, and another lab member proficient in Urdu. All three annotators are PhD students and native speakers of Urdu while also possessing a high level of academic proficiency in English, an official language of Pakistan. This combination of linguistic expertise and academic background enabled them to identify nuanced linguistic features, cultural expressions, and contextual patterns of hate speech in both languages.
For the Spanish data, three PhD students from our lab, all native speakers of Spanish, participated in the data curation. Their linguistic background and educational experience ensured that the Spanish data were annotated efficiently and sensitively and that both explicit and implicit indicators of hate speech in Spanish social media data could be identified.
To make the annotation process as simple and consistent as possible, we maintained a separate Google Sheet for each language-specific dataset. Annotators worked on their entries independently and were provided with a set of annotation rules that defined hate speech clearly and included illustrative examples, which helped minimize ambiguity and standardize labeling. The final label for each instance was determined by majority voting among the three annotators. Where discrepancies persisted and no agreement could be reached, we held weekly sessions to resolve the issues and ensure the integrity and consistency of the final dataset. In recognition of their time and effort, each annotator was compensated at a rate of $0.03 USD per annotated sample.
This multilingual, curated, and human-annotated dataset allowed us to train and test our machine learning, deep learning, transfer learning, and large language models in a robust, reliable, and ethically sound manner.

3.5. Inter-Annotator Agreement

To evaluate the consistency and reliability of the manual annotations, we calculated the Inter-Annotator Agreement (IAA). Since every sample was independently annotated by three raters, we required a statistic that measures agreement on categorical data among multiple raters; we therefore used Fleiss’ Kappa.
Our analysis yielded a Fleiss’ Kappa of 0.82, indicating strong agreement among the annotators. According to the general guidelines for interpreting Kappa values, a value above 0.80 implies substantial to near-perfect agreement, which confirms the consistency and quality of our annotation.
Additionally, we calculated Fleiss’ Kappa scores separately for each language to better understand potential variation in annotation difficulty: English achieved a Kappa of 0.82, Spanish 0.83, and Urdu a slightly lower 0.80. This difference highlights that annotation in Urdu is more challenging, likely due to its script complexity. Table 2 shows the interpretation of the Kappa values.
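The agreement computation can be reproduced with standard tooling; the sketch below assumes the three labels per tweet are stored in columns named ann1, ann2, and ann3 (hypothetical names) and uses the Fleiss' Kappa implementation from statsmodels.

```python
# Minimal sketch of the agreement and majority-vote computation; the CSV
# layout and column names are illustrative assumptions.
import pandas as pd
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

df = pd.read_csv("annotations_en.csv")               # one file per language
ratings = df[["ann1", "ann2", "ann3"]].to_numpy()    # 0 = Not-Hateful, 1 = Hateful

table, _ = aggregate_raters(ratings)                 # subjects x categories counts
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' Kappa: {kappa:.3f}")

# Final label by majority vote (at least two of three annotators agree)
df["label"] = (ratings.sum(axis=1) >= 2).astype(int)
```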

3.6. Corpus Characteristics

The dataset, comprising 10,193 tweets, showcases translation statistics across the English, Spanish, Urdu, and Joint Multilingual sets, as shown in Figure 2. English tweets lead with a total of 274,247 words and 1,246,070 characters, an average of 124.5 words per post, and a vocabulary of 29,176 words. Spanish tweets follow with 273,791 words, 1,246,070 characters, an average of 27.4 words per post, and a vocabulary of 20,424 words. Urdu tweets exhibit 312,090 words, 1,302,757 characters, an average of 31.2 words per post, and the largest vocabulary at 36,361 words. The chart highlights these metrics, with all languages sharing an equal post count of 10,193. Additionally, the Joint Multilingual dataset includes 10,193 posts, 281,865 words, an average of 28.1 words per post, a vocabulary of 41,660 words, and a total of 1,183,315 characters. Figure 3 shows the label distribution of our trilingual dataset, and Figure 4 shows a word cloud that provides a quick visual summary of the most frequent words in the text.

3.7. Ethical Considerations

This study employed only publicly available social media data, with respect for user privacy and ethical considerations. We anonymized tweets by removing personal identifiers and did not contact the original posters. The dataset will only be shared with researchers adhering to stringent ethical protocols, ensuring privacy and responsible use.

3.8. Translation-Based Approach

The translation-based technique was designed to standardize Spanish, Urdu and English tweets by translating all content into a single target language. This unified format allows the dataset to be stored in a single CSV file, where the first column contains the tweet text and the second column holds the label. This structure simplifies both data processing and analysis. The translation pipeline consists of the following steps.
- Pre-Translation Tokenization: Prior to translation, the text was segmented into smaller units such as words or phrases. This step enhanced translation accuracy by enabling the translation system to more effectively interpret the context and meaning of each segment.
- Handling Noisy Translations: After completing the translation, we conducted a thorough manual review of the entire process to identify and correct any potential errors. Special attention was given to idiomatic expressions and slang, which often do not translate directly, to ensure that the final content accurately preserved the original meaning and clarity.
- Post-Translation Alignment: To ensure consistency, we initially translated the Urdu text into English and compiled it into a single CSV file to form a unified corpus. For the second corpus, we translated the English content into Urdu, carefully aligning it with the original Urdu texts in the dataset. This approach ensured that both the translated and original texts were of comparable quality, making them suitable for reliable analysis.
- Text Length Standardization: To handle excessively long texts, truncation was applied to maintain a consistent input length, thereby making the data more suitable for processing by deep learning models.
This approach helped mitigate the impact of linguistic differences between Urdu, Spanish, and English, ensuring that they did not hinder model performance. It also facilitated consistent text processing across the three languages. Table 3 shows the step-by-step pseudo-code for multilingual hate speech classification using a GPT-3.5 Turbo model on the Spanish, Urdu, and English corpus.
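The snippet below sketches the translation and merging step under these assumptions: the open-source deep-translator wrapper stands in for the Google Translate API used in the study, and the file names, truncation bound, and error handling are illustrative.

```python
# Sketch of the translation-based standardization step (target language: English).
import pandas as pd
from deep_translator import GoogleTranslator

MAX_CHARS = 512  # illustrative truncation bound for downstream models

def translate_column(df, src, tgt="en"):
    translator = GoogleTranslator(source=src, target=tgt)
    out = []
    for text in df["text"]:
        try:
            out.append(translator.translate(text[:MAX_CHARS]))
        except Exception:
            out.append(text)  # keep the original on failure; flag for manual review
    return out

urdu = pd.read_csv("urdu_tweets.csv")        # columns: text, label
spanish = pd.read_csv("spanish_tweets.csv")
english = pd.read_csv("english_tweets.csv")

urdu["text"] = translate_column(urdu, src="ur")
spanish["text"] = translate_column(spanish, src="es")

# Unified English-language corpus: one CSV, first column text, second column label
pd.concat([english, urdu, spanish])[["text", "label"]].to_csv(
    "english_translation_corpus.csv", index=False)
```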

3.9. Preprocessing

Preprocessing is a critical step in natural language processing (NLP), as raw text data are often noisy, inconsistent, and filled with irrelevant elements that can compromise model performance. The goal of preprocessing is to clean and normalize the text, making it more suitable for analysis and downstream machine learning tasks.
Our workflow began by removing undesirable elements such as URLs, emojis, punctuation, numbers, user mentions, and hashtags. These elements are often stylistic or non-standardized and could introduce noise during model training. However, we acknowledge that hashtags—particularly in the context of hate speech—can carry important semantic and social signals (e.g., #BanMuslims and #KillAllX), often serving as markers of ideology, sentiment, or group identity. In this study, hashtags were removed to maintain consistency across multilingual text and reduce annotation complexity introduced by long or compound hashtags. Nevertheless, we recognize this as a limitation, and future work will explore retaining or extracting features from hashtags to enhance model sensitivity to contextually relevant hate speech cues.
Next, we normalized the text by converting all characters to lowercase, ensuring that words like “Love” and “love” were treated identically. We then eliminated stop words in English, Urdu, and Spanish (common words such as is, hai, or de) that contribute little semantic value. Finally, stemming was applied to reduce words to their root forms (e.g., running to run, khushiyan to khushi). These preprocessing steps helped transform messy and unstructured input into a cleaner, more uniform representation, allowing the NLP models to learn more effectively and generalize better. Table 4 outlines the preprocessing steps used in this study.
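A condensed sketch of these cleaning steps is shown below; the Urdu stop-word set is a small illustrative placeholder (NLTK ships lists only for English and Spanish), and the exact rules used in the study may differ.

```python
# Condensed sketch of the cleaning pipeline described above.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download("stopwords", quiet=True)

STOPS = {
    "en": set(stopwords.words("english")),
    "es": set(stopwords.words("spanish")),
    "ur": {"ہے", "کا", "کی", "اور"},   # illustrative subset; a full list is needed
}
STEMMERS = {"en": SnowballStemmer("english"), "es": SnowballStemmer("spanish")}

# URLs, mentions, hashtags, digits, punctuation, and emojis are stripped
NOISE = re.compile(r"(https?://\S+)|(@\w+)|(#\w+)|\d+|[^\w\s]")

def preprocess(text, lang="en"):
    text = NOISE.sub(" ", text).lower()
    tokens = [t for t in text.split() if t not in STOPS[lang]]
    if lang in STEMMERS:                 # stem where a stemmer is available
        tokens = [STEMMERS[lang].stem(t) for t in tokens]
    return " ".join(tokens)

print(preprocess("Check this out!!! https://t.co/x @user #topic I LOVE running", "en"))
```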

3.10. Application of Models, Training, and Testing Phase

This section discusses how different models were applied in the process of training and testing in the hate speech detection task. In our work, all the major learning paradigms were included—machine learning (ML), deep learning (DL), transfer learning (TL), and large language models (LLMs)—to fully test the performance on different languages and data conditions as shown in Figure 5.
In the machine learning method, four popular classifiers were used: random forest (RF), support vector machine (SVM), decision tree (DT), and eXtreme Gradient Boosting (XGBoost). The models were trained on Term Frequency–Inverse Document Frequency (TF–IDF) features, which transform the raw textual data into a numeric format. TF–IDF is useful because it weights the importance of a word in an individual document relative to the whole corpus.
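The following sketch illustrates this TF–IDF plus classifier setup with an SVM, assuming a CSV with text and label columns; the vectorizer settings are reasonable defaults rather than the study's exact hyperparameters, while the 80–20 split follows the protocol described in Section 3.10.

```python
# Illustrative TF-IDF + SVM baseline on one of the language corpora.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

df = pd.read_csv("english_translation_corpus.csv")     # columns: text, label
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42)

vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
clf = SVC(kernel="linear", C=1.0)

clf.fit(vectorizer.fit_transform(X_train), y_train)
print(classification_report(y_test, clf.predict(vectorizer.transform(X_test))))
```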
Within the deep learning framework, we applied pretrained word embeddings to capture the semantics of the input text. Specifically, FastText and GloVe embeddings were fed into neural network architectures, including convolutional neural networks (CNNs) and bidirectional long short-term memory (BiLSTM) networks. These models can learn complex syntactic and semantic dependencies in text and tend to be effective in sequence-based problems such as hate speech classification.
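A minimal BiLSTM sketch in Keras is given below; the 128 LSTM units and dropout follow Table 5, whereas the embedding matrix (to be filled from FastText or GloVe vectors), sequence length, and remaining layer sizes are illustrative assumptions.

```python
# Minimal BiLSTM sketch with a pretrained (frozen) embedding layer.
import numpy as np
import tensorflow as tf

VOCAB_SIZE, MAX_LEN, EMB_DIM = 20000, 64, 300

vectorize = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE, output_sequence_length=MAX_LEN)
# vectorize.adapt(train_texts)  # fit the vocabulary on the training texts

embedding_matrix = np.zeros((VOCAB_SIZE, EMB_DIM))   # fill from FastText/GloVe vectors

model = tf.keras.Sequential([
    vectorize,
    tf.keras.layers.Embedding(
        VOCAB_SIZE, EMB_DIM, trainable=False,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # Hateful vs. Not-Hateful
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_texts, train_labels, validation_split=0.1, epochs=5, batch_size=32)
```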
For transfer learning, we fine-tuned powerful transformer-based models (e.g., BERT, ELECTRA, RoBERTa, and XLM-RoBERTa) that provide contextual representations. These models are pretrained on huge corpora and were fine-tuned on our hate speech datasets for the downstream classification task. Their contextual understanding allows them to capture subtle meanings and language-specific cues, particularly in multilingual and code-mixed contexts.
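The sketch below shows a typical fine-tuning loop with Hugging Face Transformers for xlm-roberta-base, using the Table 5 settings (maximum sequence length 128, learning rate 2 × 10⁻⁵, 3 epochs); the dataset file names and batch size are assumptions.

```python
# Sketch of transformer fine-tuning for binary hate speech classification.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

data = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="xlmr-hate", learning_rate=2e-5,
                         num_train_epochs=3, per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"], eval_dataset=data["test"])
trainer.train()
print(trainer.evaluate())
```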
Lastly, we utilized the large language model (LLM) GPT-3.5 Turbo, whose generative capabilities enable classification with strong contextual understanding and generalization. The model leverages its extensive knowledge base and in-context learning ability to perform well on low-resource languages, including Urdu. To ensure a strong and fair evaluation, we used a standardized 80–20 train–test split across all models and languages. This stable configuration enabled a comparative evaluation of model behavior under the same data conditions, assessed with the essential evaluation metrics (precision, recall, F1-score, and accuracy).
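As one concrete illustration of the LLM setting, the snippet below performs prompt-based classification with GPT-3.5 Turbo through the openai Python client (v1 interface); the prompt wording is ours and is not the authors' exact prompt or fine-tuned model.

```python
# Illustrative prompt-based classification call with GPT-3.5 Turbo.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = ("You are a hate speech classifier. Reply with exactly one word: "
          "'Hateful' or 'Not-Hateful'.")

def classify(tweet: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": f"Tweet: {tweet}"}],
    )
    return response.choices[0].message.content.strip()

print(classify("I really appreciate the support from this community."))
```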
Table 5 shows the key parameters and optimal configurations used for fine-tuning the diverse set of models applied to the task of multilingual hate speech detection in English, Urdu, and Spanish tweets. We experimented with traditional machine learning models such as logistic regression, support vector machines (SVM), and random forest, each fine-tuned with their most effective hyperparameters—for example, the L2 penalty and the “liblinear” solver for logistic regression, a linear kernel and optimal regularization for SVM, and 100 estimators for random forest with Gini impurity. In the deep learning category, we optimized a convolutional neural network (CNN) using 128 filters, a kernel size of 3, ReLU activation, max pooling, and dropout for regularization. Similarly, the BiLSTM model was fine-tuned with 128 units, dropout, and recurrent dropout to enhance its ability to handle sequential data. Transformer-based models including BERT (uncased), mBERT, and XLM-R were fine-tuned using the pretrained bert-base-multilingual-cased checkpoint, with optimal settings such as a maximum sequence length of 128, learning rate of 2 × 10⁻⁵, and 3 epochs, using the AdamW optimizer. Additionally, GPT-3.5 Turbo was fine-tuned on over a million tokens using a learning rate multiplier of 2, batch size of 15, and three training epochs. These carefully selected and optimized configurations helped maximize performance across all models, enabling the robust detection of hate speech across multiple languages.

4. Result and Analysis

This section presents a comprehensive evaluation of our proposed models for hate speech detection across three individual languages—Spanish translation, Urdu translation, and English translation—as well as a Joint Multilingual dataset combining all three. We explore the performance of a range of approaches, including classical machine learning (ML) algorithms, deep learning (DL) models, transfer learning (TL) techniques leveraging pre-trained multilingual transformers, and large language models (LLMs) via prompt-based and fine-tuning strategies.
The experiments were conducted with consistent training, validation, and test splits across all models to ensure a fair comparison. Evaluation metrics such as accuracy, precision, recall, F1-score, and macro-averaged results are reported for each setting.
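For reference, the metrics can be computed as in the brief sketch below, where y_true and y_pred are placeholder gold and predicted labels for one test split.

```python
# Brief sketch of the metric computation used throughout Section 4.
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             classification_report)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # toy labels: 1 = Hateful, 0 = Not-Hateful
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")
print(f"Macro P/R/F1: {precision:.2f} / {recall:.2f} / {f1:.2f}")
print(classification_report(y_true, y_pred, target_names=["Not-Hateful", "Hateful"]))
```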

4.1. Large Language Models

Table 6 shows the GPT-3.5 Turbo model performance in hate speech identification for the English translation, Spanish translation, Urdu translation, and a combination of all languages (Multilingual). The assessment was conducted according to four conventional measures: precision, recall, F1-score, and accuracy. For all the datasets, GPT-3.5 Turbo showed a good and stable performance, highlighting its robustness and generalization within a multilingual setup.
For the English dataset, GPT-3.5 Turbo achieved a high score of 0.87 across all evaluation metrics, indicating excellent precision and recall in correctly identifying hate speech in English social media discourse.
Similarly, on the Spanish dataset, the model maintained a strong performance with a consistent 0.85 score across all metrics, demonstrating its effectiveness in understanding and classifying hate speech in a non-English language. On the Urdu dataset, GPT-3.5 Turbo scored 0.81 for all metrics, which, while slightly lower than the scores for English and Spanish, still reflects a competent performance for a low-resource language. This result suggests that GPT-3.5 Turbo is capable of handling complex multilingual tasks, even when primarily trained on high-resource languages.
The most notable performance was observed for the Joint Multilingual dataset, where the model reached the highest scores—0.88 for precision, recall, F1-score, and accuracy. This indicates that GPT-3.5 Turbo not only adapts well to individual languages but also generalizes exceptionally when exposed to a diverse multilingual dataset. The Joint dataset results underscore the model’s strength in cross-lingual transfer learning and its ability to effectively manage linguistic variation in real-world multilingual social media scenarios.

4.2. Transformers Results

Table 7 presents a comparative evaluation of various transformer models for hate speech detection across four settings: English translation, Spanish translation, Urdu translation, and a Joint Multilingual dataset. The models evaluated include bert-base-uncased, electra-base-discriminator, roberta-base, and xlm-roberta-base, with performance measured using four metrics: precision, recall, F1-score, and accuracy.
In the English dataset, xlm-roberta-base was the best-performing model, with precision, recall, F1-score, and accuracy values of 0.84, narrowly ahead of roberta-base and bert-base-uncased, which closely followed with scores of 0.84 and 0.83, respectively. This implies that even multilingual models such as XLM-R can perform well with monolingual English data.
In the Spanish context, xlm-roberta-base again achieved the highest and most balanced performance, with a uniform score of 0.81 across all metrics. Other models like bert-base-uncased and electra-base-discriminator performed slightly lower, ranging from 0.78 to 0.80, suggesting that XLM-R’s multilingual capabilities offer a consistent advantage for non-English languages.
For Urdu, performance generally declined across all models compared with English and Spanish, likely due to the complexity and lower resource availability for Urdu. However, xlm-roberta-base remained the top performer with a 0.74 score across all evaluation metrics, while other models such as roberta-base and electra-base-discriminator performed notably lower, with F1-scores of 0.68 and 0.69, respectively. This further emphasizes the strength of XLM-R in handling low-resource languages like Urdu.
Finally, for the Joint Multilingual dataset, which included a mix of English, Spanish, and Urdu, xlm-roberta-base again demonstrated superior performance with consistent 0.84 scores across all metrics. Notably, bert-base-uncased also performed well in this setting with a score of 0.84, indicating that with diverse multilingual training data, even models primarily pretrained on English can generalize effectively. However, electra-base-discriminator and roberta-base performed less well, with scores ranging between 0.76 and 0.78.
In summary, xlm-roberta-base consistently outperformed other transformer-based models across all languages and the Joint dataset, highlighting its robustness and generalization capability for hate speech detection across multilingual social media discourse.

4.3. Deep Learning Results

Table 8 summarizes the performance of various deep learning models, specifically CNN and BiLSTM models, across different languages (English translation, Spanish translation, and Urdu translation) and a Joint Multilingual setup using two types of word embeddings: FastText and GloVe. The evaluation metrics comprised precision, recall, F1-score, and accuracy.
For the English dataset, the BiLSTM models clearly outperformed the CNN regardless of the embedding type. Both FastText and GloVe with BiLSTM achieved the highest scores, with F1-score and accuracy values of 0.78, demonstrating that BiLSTM was better at capturing sequential information in the English data. The CNN models using both embeddings performed similarly but slightly lower, with an F1-score of 0.74.
In the Spanish dataset, the overall performance declined slightly. The best result was obtained using FastText + BiLSTM, which reached an F1-score of 0.75 and accuracy of 0.75. Other configurations, especially GloVe embeddings with the CNN or BiLSTM models, showed weaker performance, indicating that FastText may be more effective for Spanish texts, possibly due to its handling of sub-word information.
The Urdu dataset showed the lowest performance across all configurations. The best results were from FastText + BiLSTM, with an F1-score of 0.71. The GloVe-based models performed significantly worse, especially GloVe + BiLSTM, which achieved an F1-score of only 0.41. This suggests that GloVe embeddings may not represent Urdu well, possibly due to limited pretraining data in that language.
In the Joint Multilingual setup, which combined translated and original data from all languages, the BiLSTM models again outperformed the CNN. FastText + BiLSTM achieved the highest F1-score of 0.76 and accuracy of 0.76, showing that this combination is effective in capturing multilingual sequence patterns. GloVe-based models again trailed slightly behind, with a maximum F1-score of 0.68.
In summary, the results show that BiLSTM consistently outperformed the CNN across all languages and that FastText embeddings generally yielded better performance than GloVe, particularly for Spanish and Urdu. The Joint Multilingual setting enhanced performance, highlighting the advantage of training on diverse, translated datasets for deep learning-based hate speech detection.

4.4. Machine Learning Results

Table 9 shows the results of different classical machine learning models such as random forest (RF), support vector machine (SVM), decision tree (DT), and XGBoost (XGB) on hate speech detection datasets of English, Spanish, Urdu, and combined multilingual models. The reported metrics of the evaluation are precision, recall, F1-score, and accuracy.
For the English translation dataset, the SVM model achieved the best overall performance with an F1-score and accuracy of 0.82, outperforming the other models. Random forest and XGBoost also performed well, both achieving an F1-score of 0.80 or above. The Decision Tree lagged slightly behind with an F1-score of 0.76. In the Spanish dataset, all models showed a slightly lower performance compared with English, with SVM again leading with an F1-score of 0.78, indicating its relative robustness across languages.
For the Urdu translation dataset, performance declined slightly across all models, with the highest F1-score of 0.77 obtained by both SVM and random forest. This suggests that Urdu may pose greater linguistic challenges for traditional models, possibly due to script complexity or limited training data.
In the Joint Multilingual setting, where data from all languages were combined into a single training set, all models saw improved or stable performance. Notably, SVM again led with an F1-score and accuracy of 0.82, confirming its consistent effectiveness. Random forest and XGBoost also performed comparably well with F1-scores of 0.81 and 0.80, respectively.
Overall, the table highlights that SVM consistently outperformed the other traditional models across all individual and combined language datasets and that joint multilingual training improved model robustness and generalization.

4.5. Error Analysis

Table 10 presents the performance comparison of top-performing models from four learning paradigms—machine learning (ML), deep learning (DL), transfer learning (TL), and large language models (LLM)—across four linguistic settings: English translation, Spanish translation, Urdu translation, and a Joint Multilingual hate speech dataset. Each model was evaluated based on precision, recall, F1-score, and accuracy, allowing for a detailed assessment of hate speech detection capability across diverse approaches and languages.
All model types yielded fairly good results in English. The best-performing model, GPT-3.5 Turbo (LLM), scored 0.87 on all measures, demonstrating its ability to identify hate speech in English communication. It was closely followed by roberta-base (TL) at 0.84, SVM (ML) at 0.82, and BiLSTM (DL) at 0.78, the lowest of the group. These findings show that the classical ML and DL methods trailed the LLM and TL models on English.
Following the same trend, GPT-3.5 Turbo was the best performer for Spanish, with a consistent score of 0.85 across all metrics. The bert-base-uncased (TL) model also ranked very well at 0.81, followed by SVM at 0.78. Despite its utility, the BiLSTM model fell behind with a score of 0.75, which implies that deep learning with traditional embeddings (FastText) did not generalize to Spanish as well as the LLMs and pre-trained transformer-based models.
The Urdu dataset, representing a low-resource language, showed a relatively lower performance across all models. However, GPT-3.5 Turbo still outperformed the other methods, achieving 0.81 across all metrics. The xlm-roberta-base (TL) followed with 0.74, indicating its capability in multilingual and low-resource settings. The traditional SVM (ML) yielded 0.77, while BiLSTM (DL) showed the weakest performance with only 0.64, highlighting challenges in modeling Urdu with limited-feature-based or shallow architectures.
In the Joint Multilingual setting—which combined English, Spanish, and Urdu—GPT-3.5 Turbo achieved the highest results, with 0.88 across all evaluation metrics, showcasing its strength in handling mixed-language inputs and generalizing well across linguistic boundaries. The bert-base-uncased model also performed strongly with 0.84, while SVM and BiLSTM scored 0.82 and 0.76, respectively.
Across all languages and learning paradigms, GPT-3.5 Turbo (LLM) consistently outperformed the other models, highlighting the advantages of large-scale pretraining and context-rich language modeling. Transfer learning models, particularly roberta-base, bert-base-uncased, and xlm-roberta-base, also performed robustly across the languages, especially in English and Spanish.
Table 11 presents the class-wise performance metrics of our proposed GPT-3.5 Turbo-based hate speech detection model evaluated under four distinct experimental settings: English, Spanish, Urdu, and a Joint Multilingual setup. In each setting, we report the precision, recall, and F1-score for the “Not-Hateful” and “Hateful” classes, as well as the overall accuracy, macro average, and weighted average values. Each language setting was constructed by translating the other two languages into the target language and merging them with the original data, thereby creating an enriched dataset for each experimental condition. In the English setting, the dataset included original English texts combined with Spanish and Urdu texts translated into English. The model achieved strong and balanced performance across both classes: a precision of 0.84 and recall of 0.90 for the “Not-Hateful” class and a precision of 0.90 and recall of 0.85 for the “Hateful” class, resulting in identical F1-scores of 0.87. The overall accuracy was also 0.87, indicating that the model effectively captured patterns of hate and non-hate speech in English when enhanced with cross-lingual data.
In the Spanish setting, the dataset consisted of original Spanish texts merged with Urdu and English data translated into Spanish. The performance remained strong but slightly lower than English, with F1-scores of 0.84 and 0.85 for the “Not-Hateful” and “Hateful” classes, respectively, and an overall accuracy of 0.85. This suggests that the model generalizes well to Spanish, though there may be some loss in nuance or fidelity during translation.
The Urdu setting, where both English and Spanish data were translated into Urdu and combined with original Urdu texts, showed a relatively lower performance. The model achieved F1-scores of 0.80 for “Not-Hateful” and 0.82 for “Hateful”, with an overall accuracy of 0.81. This moderate drop may be attributed to challenges in translating content into Urdu with semantic precision or limited syntactic regularities in the Urdu dataset that make learning harder for the model.
The best results were observed in the Joint Multilingual setup, in which all datasets—original and translated—were combined into a single multilingual training set. In this configuration, the model achieved precision, recall, and F1-scores of 0.88 for both classes and the highest accuracy of 0.88 overall. This demonstrates the effectiveness of training on a rich and diverse multilingual corpus, as it helps the model to generalize better across linguistic variations and context. Figure 6, Figure 7, Figure 8 and Figure 9 show confusion matrices to visually represent the performance of the classification model by showing the correct and incorrect predictions for each class, allowing us to evaluate the model’s accuracy, precision, recall, and overall classification behavior in more detail.
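Confusion matrices such as those in Figure 6, Figure 7, Figure 8 and Figure 9 can be produced directly from the test predictions; the sketch below uses toy labels as placeholders for one setting.

```python
# Sketch of plotting a confusion matrix for one experimental setting.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

y_true = ["Hateful", "Not-Hateful", "Hateful", "Not-Hateful", "Hateful"]
y_pred = ["Hateful", "Not-Hateful", "Not-Hateful", "Not-Hateful", "Hateful"]

ConfusionMatrixDisplay.from_predictions(
    y_true, y_pred, labels=["Not-Hateful", "Hateful"], cmap="Blues")
plt.title("Joint Multilingual, GPT-3.5 Turbo (illustrative)")
plt.savefig("confusion_matrix_joint.png", dpi=200, bbox_inches="tight")
```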

5. Limitations of Proposed Solution

Despite the promising results and advancements achieved in this study, several limitations remain. First, while our trilingual dataset addresses English, Spanish, and Urdu in its Nastaliq script form, it still represents only a fraction of the world’s languages, and the performance of the model on other low-resource or code-mixed languages remains untested. Each language presents unique linguistic challenges that impact hate speech detection. English, a high-resource language, benefits from established NLP tools but faces difficulties due to lexical ambiguity, sarcasm, and evolving slang. Spanish involves regional dialects, gendered grammar, and morphologically rich structures, with variations in hate expressions across communities and complications from code-mixing with English. Our focus on Nastaliq Urdu presents distinct challenges such as a complex morphology, script-specific tokenization issues, and culturally implicit hate speech embedded in idiomatic and religious references that are often subtle and difficult for annotators and models to capture accurately. Additionally, the syntactic difference—Urdu’s subject–object–verb order compared with the subject–verb–object structure of English and Spanish—adds to modeling complexity. Second, the translation-based approach, although effective for standardizing texts, may introduce semantic distortions or lose cultural and contextual nuances, especially in complex languages like Urdu with rich idiomatic expressions. Third, the reliance on annotated social media data limits the scope to public discourse on specific platforms, which might not generalize well to other online contexts or private communications. Fourth, the dataset size, while substantial for these languages, may still be insufficient for fully training extremely large models like GPT-3.5 Turbo without risking overfitting or bias towards dominant language patterns. Finally, the ethical considerations around annotation, although rigorously addressed, may still encounter challenges related to subjective interpretations of hate speech across diverse cultural backgrounds, potentially affecting annotation consistency.

6. Conclusions and Future Work

Social media platforms shape public discourse, amplifying both harmful and positive content. This study advances multilingual hate speech detection, with a focus on the understudied Urdu language. Our trilingual dataset (10,193 tweets) and translation-based pipeline, leveraging machine learning, deep learning, transformer models, and large language models (LLMs), achieved significant improvements over baseline SVM models. Notably, the framework yielded strong performance for English (GPT-3.5 Turbo: F1-score of 0.87), Spanish (GPT-3.5 Turbo: F1-score of 0.85), and the Joint Multilingual dataset (GPT-3.5 Turbo: F1-score of 0.88). Urdu performance (GPT-3.5 Turbo: F1-score of 0.81), while improved over baselines by 5.19%, highlights ongoing challenges in low-resource settings, particularly due to code-mixing and limited pre-training data. Issues such as cross-lingual generalization, model interpretability, and low-resource language performance remain critical and far from resolved. Future work should prioritize Urdu-specific embeddings, enhanced translation pipelines for slang and code-mixed texts, and semi-supervised learning to foster safer, more inclusive digital communication.
Building on this work, future research can expand in several directions. One priority is to extend the multilingual framework to include additional low-resource languages and dialects, especially those prevalent in underrepresented regions, to foster even more inclusive hate speech detection. Enhancing the translation pipeline with context-aware and culturally sensitive translation models could reduce semantic loss and improve detection accuracy. Moreover, integrating multimodal data such as images, videos, and audio from social media posts could provide richer context for hate speech identification beyond text alone. To further enhance the detection process, we also plan to retain and analyze hashtags as potential features, given their role in signaling hate-related content and online community affiliations. Additionally, future work may include a qualitative analysis of typical misclassifications and false positive/negative examples across languages, which could offer further insight into the nuanced linguistic challenges that persist despite quantitative evaluation.

Author Contributions

Conceptualization, M.U. and M.A.; methodology, M.A. and M.U.; software, M.A. and M.U.; validation, I.G. and R.Q.T.; formal analysis, R.Q.T., I.G. and M.A.; investigation, R.Q.T., M.U. and G.S.; resources, M.A.; data curation, M.A. and M.U.; writing—original draft preparation, M.A. and M.U.; writing—review and editing, M.A. and M.U.; visualization, M.U. and M.A.; supervision, R.Q.T. and G.S.; project administration, R.Q.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research did not receive any funding.

Institutional Review Board Statement

Not Applicable.

Informed Consent Statement

Not Applicable.

Data Availability Statement

The dataset utilized in this study is not publicly available due to ongoing research but can be provided upon reasonable request. Interested researchers should contact the author at usman.cic21@gmail.com, Centro de Investigación en Computación, Instituto Politécnico Nacional (CIC-IPN), Mexico City 07738, Mexico. Requests must include a detailed description of the intended use and the requester’s institutional affiliation.

Acknowledgments

This work was carried out with partial support from the Mexican Government through grant A1-S-47854 of CONAHCYT, Mexico, and grants 20241816, 20241819, 20240936, and 20240951 of the Secretaría de Investigación y Posgrado of the Instituto Politécnico Nacional, Mexico. The authors thank CONAHCYT for the computing resources provided through the Plataforma de Aprendizaje Profundo para Tecnologías del Lenguaje of the Laboratorio de Supercómputo of the INAOE, Mexico, and acknowledge the support of Microsoft through the Microsoft Latin America PhD Award.

Conflicts of Interest

The authors declare no conflicts of interest in this study.

References

  1. AlKhudari, M.N.; Abduljabbar, O.J.; Al Manaseer, A.M.; AL-Omari, M.S. The role of social media in shaping public opinion among Jordanian university students. J. Infrastruct. Policy Dev. 2024, 8, 5489. [Google Scholar] [CrossRef]
  2. Swastiningsih, S.; Aziz, A.; Dharta, Y. The Role of Social Media in Shaping Public Opinion: A Comparative Analysis of Traditional vs. Digital Media Platforms. J. Acad. Sci. 2024, 1, 620–626. [Google Scholar] [CrossRef]
  3. Tash, M.S.; Ramos, L.; Ahani, Z.; Monroy, R.; Calvo, H.; Sidorov, G. Online Social Support Detection in Spanish Social Media Texts. arXiv 2025, arXiv:2502.09640. [Google Scholar]
  4. Ahmad, M.; Usman, S.; Farid, H.; Ameer, I.; Muzammil, M.; Hamza, A.; Sidorov, G.; Batyrshin, I. Hope Speech Detection Using Social Media Discourse (Posi-Vox-2024): A Transfer Learning Approach. J. Lang. Educ. 2024, 10, 31–43. [Google Scholar] [CrossRef]
  5. Ahmad, M.; Ameer, I.; Sharif, W.; Usman, S.; Muzamil, M.; Hamza, A.; Jalal, M.; Batyrshin, I.; Sidorov, G. Multilingual hope speech detection from tweets using transfer learning models. Sci. Rep. 2025, 15, 9005. [Google Scholar]
  6. Ullah, F.; Zamir, M.T.; Ahmad, M.; Sidorov, G.; Gelbukh, A. Hope: A multilingual approach to identifying positive communication in social media. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), Co-Located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org, Valladolid, Spain, 24 September 2024. [Google Scholar]
  7. Arif, M.; Shahiki Tash, M.; Jamshidi, A.; Ullah, F.; Ameer, I.; Kalita, J.; Gelbukh, A.; Balouchzahi, F. Analyzing hope speech from psycholinguistic and emotional perspectives. Sci. Rep. 2024, 14, 23548. [Google Scholar] [CrossRef] [PubMed]
  8. Ahmad, M.; Waqas, M.; Hamza, A.; Usman, S.; Batyrshin, I.; Sidorov, G. UA-HSD-2025: Multi-Lingual Hate Speech Detection from Tweets Using Pre-Trained Transformers. Computers 2025, 14, 239. [Google Scholar] [CrossRef]
  9. Zamir, M.; Tash, M.; Ahani, Z.; Gelbukh, A.; Sidorov, G. Lidoma@DravidianLangTech 2024: Identifying hate speech in Telugu code-mixed: A BERT multilingual. In Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages, St. Julian’s, Malta, 22 March 2024; pp. 101–106. [Google Scholar]
  10. Ahani, Z.; Tash, M.S.; Tash, M.; Gelbukh, A.; Gelbukh, I. Multiclass hope speech detection through transformer methods. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), Co-Located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org, Valladolid, Spain, 24 September 2024. [Google Scholar]
  11. Pavlopoulos, J.; Sorensen, J.; Laugier, L.; Androutsopoulos, I. SemEval-2021 task 5: Toxic spans detection. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Bangkok, Thailand, 5–6 August 2021; pp. 59–69. [Google Scholar]
  12. Ali, M.; Muhammad, A.; Asad, M.; Sajawal, M.; Alexopoulos, C.; Charalabidis, Y. Towards Perso-Arabic Urdu language hate detection using machine learning: A comparative study based on a large dataset and time-complexity. In Proceedings of the 26th Pan-Hellenic Conference on Informatics, Athens, Greece, 25–27 November 2022; pp. 317–321. [Google Scholar]
  13. Perera, S.S.; Sumanathilaka, D.K. Machine Translation and Transliteration for Indo-Aryan Languages: A Systematic Review. In Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages, Abu Dhabi, United Arab Emirates, 20 January 2025; pp. 11–21. [Google Scholar]
  14. Fortuna, P.; Nunes, S. A survey on automatic detection of hate speech in text. ACM Comput. Surv. 2018, 51, 1–30. [Google Scholar] [CrossRef]
  15. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar]
  16. Kolesnikova, O.; Tash, M.S.; Ahani, Z.; Agrawal, A.; Monroy, R.; Sidorov, G. Advanced Machine Learning Techniques for Social Support Detection on Social Media. arXiv 2025, arXiv:2501.03370. [Google Scholar]
  17. Biradar, S.; Saumya, S.; Chauhan, A. Fighting hate speech from bilingual Hinglish speaker’s perspective, a transformer- and translation-based approach. Soc. Netw. Anal. Min. 2022, 12, 87. [Google Scholar] [CrossRef] [PubMed]
  18. Ahmad, M.; Sidorov, G.; Amjad, M.; Ameer, I.; Batyrshin, I. Opioid Crisis Detection in Social Media Discourse Using Deep Learning Approach. Information 2025, 16, 545. [Google Scholar] [CrossRef]
  19. Ahmad, M.; Batyrshin, I.; Sidorov, G. Sentiment Analysis Using a Large Language Model–Based Approach to Detect Opioids Mixed With Other Substances Via Social Media: Method Development and Validation. JMIR Infodemiol. 2025, 5, e70525. [Google Scholar] [CrossRef] [PubMed]
  20. Ahmad, M.; Farid, H.; Ameer, I.; Ullah, F.; Muzamil, M.; Jalal, M.; Hamza, A.; Batyrshin, I.; Sidorov, G. UE-NER-2025: A GPT-Based Approach to Multi-Lingual Named Entity Recognition on Urdu and English. IEEE Access 2025, 13, 111175–111186. [Google Scholar] [CrossRef]
  21. Ashraf, M.R.; Jana, Y.; Umer, Q.; Jaffar, M.A.; Chung, S.; Ramay, W.Y. Bert-based sentiment analysis for low-resourced languages: A case study of urdu language. IEEE Access 2023, 11, 110245–110259. [Google Scholar] [CrossRef]
  22. Chetty, N.; Alathur, S. Hate speech review in the context of online social networks. Aggress. Violent Behav. 2018, 40, 108–118. [Google Scholar] [CrossRef]
  23. Aluru, S.S.; Mathew, B.; Saha, P.; Mukherjee, A. Deep learning models for multilingual hate speech detection. arXiv 2020, arXiv:2004.06465. [Google Scholar]
  24. Siddiqui, J.A.; Yuhaniz, S.S.; Mujtaba, G.; Soomro, S.A.; Mahar, Z.A. Fine-grained multilingual Hate speech detection using Explainable AI and Transformers. IEEE Access 2024, 12, 143177–143192. [Google Scholar] [CrossRef]
  25. Sharjeel, M.; Nawab, R.M.A.; Rayson, P. COUNTER: Corpus of Urdu news text reuse. Lang. Resour. Eval. 2017, 51, 777–803. [Google Scholar] [CrossRef]
  26. Mehmood, A.; Farooq, M.S.; Naseem, A.; Rustam, F.; Villar, M.G.; Rodríguez, C.L.; Ashraf, I. Threatening URDU language detection from tweets using machine learning. Appl. Sci. 2022, 12, 10342. [Google Scholar] [CrossRef]
  27. Kandhro, I.A.; Jumani, S.Z.; Kumar, K.; Hafeez, A.; Ali, F. Roman Urdu headline news text classification using RNN, LSTM and CNN. Adv. Data Sci. Adapt. Anal. 2020, 12, 2050008. [Google Scholar] [CrossRef]
  28. Bilal, M.; Khan, A.; Jan, S.; Musa, S. Context-aware deep learning model for detection of roman urdu hate speech on social media platform. IEEE Access 2022, 10, 121133–121151. [Google Scholar] [CrossRef]
  29. Sharif, W.; Abdullah, S.; Iftikhar, S.; Al-Madani, D.; Mumtaz, S. Enhancing Hate Speech Detection in the Digital Age: A Novel Model Fusion Approach Leveraging a Comprehensive Dataset. IEEE Access 2024, 12, 27225–27236. [Google Scholar] [CrossRef]
  30. Haider, F.; Pollak, S.; Albert, P.; Luz, S. Emotion recognition in low-resource settings: An evaluation of automatic feature selection methods. Comput. Speech Lang. 2021, 65, 101119. [Google Scholar] [CrossRef]
  31. Azhar, N.; Latif, S. Roman urdu sentiment analysis using pre-trained distilbert and xlnet. In Proceedings of the 2022 Fifth International Conference of Women in Data Science at Prince Sultan University (WiDS PSU), Riyadh, Saudi Arabia, 28–29 March 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 75–78. [Google Scholar]
  32. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  33. Gillioz, A.; Casas, J.; Mugellini, E.; Abou Khaled, O. Overview of the Transformer-based Models for NLP Tasks. In Proceedings of the 2020 15th Conference on Computer Science and Information Systems (FedCSIS), Sofia, Bulgaria, 6–9 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 179–183. [Google Scholar]
  34. Wong, J.T.; Zhang, C.; Cao, X.; Gimenes, P.; Constantinides, G.A.; Luk, W.; Zhao, Y. A3: An Analytical Low-Rank Approximation Framework for Attention. arXiv 2025, arXiv:2505.12942. [Google Scholar]
  35. Fu, D.Y.; Dao, T.; Saab, K.K.; Thomas, A.W.; Rudra, A.; Ré, C. Hungry hungry hippos: Towards language modeling with state space models. arXiv 2022, arXiv:2212.14052. [Google Scholar]
  36. Alrehili, A. Automatic hate speech detection on social media: A brief survey. In Proceedings of the 2019 IEEE/ACS 16th International Conference on Computer Systems and Applications (AICCSA), Abu Dhabi, United Arab Emirates, 3–7 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
  37. Ranasinghe, T.; Zampieri, M. Multilingual offensive language identification with cross-lingual embeddings. arXiv 2020. [Google Scholar] [CrossRef]
  38. Bigoulaeva, I.; Hangya, V.; Gurevych, I.; Fraser, A. Label modification and bootstrapping for zero-shot cross-lingual hate speech detection. Lang. Resour. Eval. 2023, 57, 1515–1546. [Google Scholar] [CrossRef] [PubMed]
  39. Ghulam, H.; Zeng, F.; Li, W.; Xiao, Y. Deep learning-based sentiment analysis for roman urdu text. Procedia Comput. Sci. 2019, 147, 131–135. [Google Scholar] [CrossRef]
  40. Vidgen, B.; Derczynski, L. Directions in abusive language training data, a systematic review: Garbage in, garbage out. PLoS ONE 2020, 15, e0243300. [Google Scholar] [CrossRef] [PubMed]
  41. Pereira-Kohatsu, J.C.; Quijano-Sánchez, L.; Liberatore, F.; Camacho-Collados, M. Detecting and monitoring hate speech in Twitter. Sensors 2019, 19, 4654. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Proposed methodology and design.
Figure 2. Statistics of trilingual dataset.
Figure 3. Label distribution of trilingual dataset.
Figure 4. Word cloud for (a) Spanish, (b) Urdu, and (c) English datasets.
Figure 5. Application of models training and testing phase.
Figure 6. Confusion Matrix: GPT-3.5 Turbo (English).
Figure 7. Confusion Matrix: GPT-3.5 Turbo (Spanish).
Figure 8. Confusion Matrix: GPT-3.5 Turbo (Urdu).
Figure 9. Confusion Matrix: GPT-3.5 Turbo (Joint Multilingual).
Table 1. Examples of hateful and non-hateful tweets in Urdu, Spanish, and English.
Language | Tweet (Translation/Context) | Reasoning | Label
Urdu | پٹھان دہشتگرد ہیں (Pathans are terrorists) | Stereotypes Pathans as terrorists | Hateful
Urdu | شیعوں کو مارو (Kill the Shias) | Incites violence against Shias | Hateful
Urdu | عورتیں بس گھر کے لیے ہیں (Women are only for the home) | Misogynistic; restricts women’s roles | Hateful
Urdu | سندھی اجرک بہت خوبصورت ہے (Sindhi Ajrak is very beautiful) | Praises Sindhi culture | Not-Hateful
Urdu | کراچی میں بارش ہو رہی ہے (It is raining in Karachi) | Neutral weather statement | Not-Hateful
Spanish | Los venezolanos son criminales (Venezuelans are criminals) | Xenophobic stereotype | Hateful
Spanish | Las mujeres solo sirven para cocinar (Women are only good for cooking) | Misogynistic; demeans women | Hateful
Spanish | Hay que echar a los bolivianos (We must kick out Bolivians) | Incites expulsion of Bolivians | Hateful
Spanish | El flamenco es un arte increíble (Flamenco is an incredible art) | Celebrates flamenco culture | Not-Hateful
Spanish | Hoy hace sol en Madrid (It’s sunny in Madrid today) | Neutral weather statement | Not-Hateful
English | Muslims are terrorists | Religious stereotype | Hateful
English | Women belong in the kitchen | Misogynistic; restricts women’s roles | Hateful
English | Go back to your country | Xenophobic attack | Hateful
English | Indian culture is so vibrant | Positive cultural appreciation | Not-Hateful
English | It’s raining in London | Neutral weather statement | Not-Hateful
Table 2. Interpretation of Cohen’s kappa values.
Kappa Value | Interpretation
1.0 | Perfect Agreement
0.80–1.0 | Substantial Agreement
0.60–0.80 | Moderate Agreement
0.40–0.60 | Fair Agreement
<0.40 | Poor Agreement
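As a brief illustration of how the inter-annotator agreement summarized in Table 2 can be computed, the sketch below uses scikit-learn's cohen_kappa_score on two annotators' label vectors. The vectors shown are hypothetical placeholders, not the study's annotation data.

```python
# Minimal sketch: pairwise Cohen's kappa between two annotators.
# The annotation vectors below are hypothetical placeholders.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["Hateful", "Not-Hateful", "Hateful", "Hateful", "Not-Hateful"]
annotator_b = ["Hateful", "Not-Hateful", "Not-Hateful", "Hateful", "Not-Hateful"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa = {kappa:.2f}")  # interpret using the ranges in Table 2
```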
Table 3. Pseudo-code for multilingual hate speech classification.
Step | Operation
Annotated-D | → Annotate (D1, D2, D3, Guidelines, Majority-Voting)
Preprocessed-D | → Preprocess (Annotated-D)
Eng-D1 | → Translate-To-English (D1)
Eng-D2 | → Translate-To-English (D2)
Urd-D1 | → Translate-To-Urdu (D1)
Urd-D3 | → Translate-To-Urdu (D3)
Spa-D1 | → Translate-To-Spanish (D1)
Spa-D3 | → Translate-To-Spanish (D3)
Validate-Translations | → Validate-Translations (Eng-D1, Eng-D2, Urd-D1, Urd-D3, Spa-D1, Spa-D3)
Combined-Eng-D | → Merge (Eng-D1, Eng-D2, D3)
Combined-Urd-D | → Merge (Urd-D1, Urd-D3, D2)
Combined-Spa-D | → Merge (Spa-D1, Spa-D3, D1)
Joint-D | → Joint-Process (Combined-Eng-D, Combined-Urd-D, Combined-Spa-D)
Features-ML | → Extract-Features (Joint-D, TF–IDF)
Features-DL | → Extract-Features (Joint-D, FastText, GloVe)
Features-TF | → Apply-AttentionLayer (Joint-D, MultiHead + Sparse Attention); Extract-Features (BERT, RoBERTa, ELECTRA, XLM-R)
Features-LLM | → Apply-AttentionLayer (Joint-D, MultiHead + Sparse Attention); Extract-Features (LLMs)
Predictions-ML | → Classify (Features-ML, Models: SVM, XGBoost, Random Forest, Decision Tree)
Predictions-DL | → Classify (Features-DL, Models: CNN, BiLSTM)
Predictions-TF | → Classify (Features-TF, Models: BERT, RoBERTa, ELECTRA, XLM-RoBERTa)
Predictions-LLM | → Classify (Features-Large Language Models)
Metrics | → Evaluate ([Predictions], Test-Set, Metrics: Accuracy, Precision, Recall, F1-Score)
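The pseudo-code in Table 3 can be read as the following high-level Python sketch. It is a simplified illustration under stated assumptions: load_dataset, translate, and train_and_evaluate are hypothetical stubs standing in for the translation service, data loading, and any classifier family listed in the table; they are not the authors' actual implementation.

```python
# High-level sketch of the Table 3 pipeline under stated assumptions.
# load_dataset(), translate(), and train_and_evaluate() are hypothetical stubs.
import pandas as pd

def load_dataset(lang: str) -> pd.DataFrame:
    """Hypothetical loader; returns annotated tweets for one language."""
    return pd.DataFrame({"text": [], "label": [], "lang": lang})

def translate(df: pd.DataFrame, target_lang: str) -> pd.DataFrame:
    """Hypothetical wrapper around the machine-translation step."""
    out = df.copy()
    out["lang"] = target_lang  # the text would be machine-translated here
    return out

def train_and_evaluate(df: pd.DataFrame, setting: str) -> None:
    """Hypothetical training/evaluation call for any model family in Table 3."""
    print(f"{setting}: {len(df)} examples")

d_spanish, d_urdu, d_english = (load_dataset(l) for l in ("es", "ur", "en"))

# Translation-based monolingual settings: originals plus the other two
# datasets translated into the target language.
settings = {
    "english": pd.concat([translate(d_spanish, "en"), translate(d_urdu, "en"), d_english]),
    "spanish": pd.concat([translate(d_urdu, "es"), translate(d_english, "es"), d_spanish]),
    "urdu":    pd.concat([translate(d_spanish, "ur"), translate(d_english, "ur"), d_urdu]),
}
# Joint multilingual setting: all combined sets merged into one training corpus.
settings["joint"] = pd.concat(settings.values(), ignore_index=True)

for name, data in settings.items():
    train_and_evaluate(data, setting=name)
```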
Table 4. Data preprocessing pipeline.
Step | Description
D1 | Clean(D): Remove hashtags, URLs, emojis, punctuation, and numbers
D2 | Lowercase(D1): Convert all text to lowercase
D3 | Remove-Stopwords(D2, English, Urdu, Spanish): Remove stopwords in three languages
D4 | Stem(D3): Apply stemming to tokens
Return | D4
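A minimal Python sketch of the Table 4 pipeline is given below. It assumes NLTK stopword lists and the Snowball stemmer for English and Spanish; the Urdu stopword subset and the cleaning regexes are simplified placeholders, since the exact resources used in the study are not reproduced here.

```python
# Minimal sketch of the Table 4 preprocessing steps (D1-D4).
# Assumptions: NLTK stopwords/Snowball stemmers for English and Spanish;
# the Urdu stopword list below is only a small placeholder subset.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

nltk.download("stopwords", quiet=True)

URDU_STOPWORDS = {"کا", "کی", "کے", "ہے", "اور"}  # placeholder subset
STOPWORDS = set(stopwords.words("english")) | set(stopwords.words("spanish")) | URDU_STOPWORDS

def preprocess(text: str, lang: str = "english") -> str:
    # D1: remove hashtags, URLs, emojis/symbols, punctuation, and numbers.
    text = re.sub(r"#\w+|https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"[^\w\s]|\d", " ", text)
    # D2: lowercase.
    text = text.lower()
    # D3: remove stopwords in the three languages.
    tokens = [t for t in text.split() if t not in STOPWORDS]
    # D4: stem (Snowball supports English and Spanish; Urdu tokens pass through).
    if lang in ("english", "spanish"):
        stemmer = SnowballStemmer(lang)
        tokens = [stemmer.stem(t) for t in tokens]
    return " ".join(tokens)

print(preprocess("Check this out!!! https://example.com #hate 123"))
```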
Table 5. Hyper-parameter tuning for multilingual hate speech detection task used in this study.
Model | Key Parameters
Logistic Regression | Penalty = l2, Solver = liblinear, C = 1.0, Max Iterations = 1000
SVM | Kernel = linear, C = 1.0, Gamma = scale, Decision Function = ovr
Random Forest | Estimators = 100, Max Depth = None, Criterion = gini, Random State = 42
CNN | Filters = 128, Kernel Size = 3, Activation = ReLU, Pooling = MaxPooling1D, Dropout = 0.5, Optimizer = Adam, Epochs = 10, Batch Size = 32
BiLSTM | Units = 128, Dropout = 0.5, Recurrent Dropout = 0.3, Optimizer = Adam, Epochs = 10, Batch Size = 32
BERT-based (uncased), mBERT, XLM-R | Pretrained: bert-base-multilingual-cased, Max Length = 128, Learning Rate = 2 × 10⁻⁵, Batch Size = 16, Epochs = 3, Optimizer = AdamW, Warmup Steps = 0
GPT-3.5 Turbo | Trained Tokens = 1,289,778, Epochs = 3, Batch Size = 15, LR Multiplier = 2, Seed = 1,693,683,698
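To illustrate how the classical machine learning rows of Table 5 translate into code, the sketch below wires the listed SVM and Random Forest settings to TF-IDF features with scikit-learn. The toy corpus and the train/test split ratio are assumptions made only for this example, not the study's actual data or split.

```python
# Hedged sketch: TF-IDF features with the SVM and Random Forest settings
# from Table 5. The toy corpus and the 75/25 split are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

texts = [f"placeholder tweet number {i}" for i in range(8)]
labels = [0, 1] * 4  # 0 = Not-Hateful, 1 = Hateful (placeholder data)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

models = {
    "SVM": SVC(kernel="linear", C=1.0, gamma="scale", decision_function_shape="ovr"),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=None, criterion="gini", random_state=42
    ),
}

for name, clf in models.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    pipe.fit(X_train, y_train)
    print(name, "accuracy:", pipe.score(X_test, y_test))
```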
Table 6. Results for large language model (GPT-3.5 Turbo).
Dataset | Model | Precision | Recall | F1-Score | Accuracy
English translation | GPT-3.5 Turbo | 0.87 | 0.87 | 0.87 | 0.87
Spanish translation | GPT-3.5 Turbo | 0.85 | 0.85 | 0.85 | 0.85
Urdu translation | GPT-3.5 Turbo | 0.81 | 0.81 | 0.81 | 0.81
Joint Multilingual | GPT-3.5 Turbo | 0.88 | 0.88 | 0.88 | 0.88
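For completeness, the sketch below shows one hedged way to query a fine-tuned GPT-3.5 Turbo classifier with the OpenAI Python SDK (v1 interface). The model identifier, system prompt, and example tweet are assumptions made for illustration; the exact fine-tuning prompts used in the study are not reproduced here.

```python
# Hedged sketch: querying a fine-tuned GPT-3.5 Turbo classifier with the
# OpenAI Python SDK (v1). The model id and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FINE_TUNED_MODEL = "ft:gpt-3.5-turbo:org::example"  # hypothetical identifier

def classify(tweet: str) -> str:
    response = client.chat.completions.create(
        model=FINE_TUNED_MODEL,
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Label the tweet as 'Hateful' or 'Not-Hateful'."},
            {"role": "user", "content": tweet},
        ],
    )
    return response.choices[0].message.content.strip()

print(classify("It's raining in London"))  # expected: Not-Hateful
```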
Table 7. Results for transformer models.
Language | Model | Precision | Recall | F1-Score | Accuracy
English translation | bert-base-uncased | 0.83 | 0.83 | 0.83 | 0.83
English translation | electra-base-discriminator | 0.83 | 0.84 | 0.83 | 0.84
English translation | roberta-base | 0.83 | 0.84 | 0.83 | 0.84
English translation | xlm-roberta-base | 0.82 | 0.82 | 0.82 | 0.82
Spanish translation | bert-base-uncased | 0.81 | 0.81 | 0.81 | 0.81
Spanish translation | electra-base-discriminator | 0.80 | 0.80 | 0.80 | 0.80
Spanish translation | roberta-base | 0.80 | 0.80 | 0.80 | 0.80
Spanish translation | xlm-roberta-base | 0.81 | 0.81 | 0.81 | 0.81
Urdu translation | bert-base-uncased | 0.72 | 0.70 | 0.71 | 0.71
Urdu translation | electra-base-discriminator | 0.70 | 0.69 | 0.69 | 0.69
Urdu translation | roberta-base | 0.73 | 0.72 | 0.72 | 0.72
Urdu translation | xlm-roberta-base | 0.74 | 0.73 | 0.73 | 0.73
Joint Multilingual | bert-base-uncased | 0.84 | 0.84 | 0.84 | 0.84
Joint Multilingual | electra-base-discriminator | 0.76 | 0.76 | 0.76 | 0.76
Joint Multilingual | roberta-base | 0.76 | 0.76 | 0.76 | 0.76
Joint Multilingual | xlm-roberta-base | 0.84 | 0.84 | 0.84 | 0.84
Table 8. Results for deep learning models.
Language | Embedding | Model | Precision | Recall | F1-Score | Accuracy
English translation | FastText | CNN | 0.74 | 0.74 | 0.74 | 0.74
English translation | FastText | BiLSTM | 0.78 | 0.78 | 0.78 | 0.78
English translation | GloVe | CNN | 0.74 | 0.74 | 0.74 | 0.74
English translation | GloVe | BiLSTM | 0.78 | 0.78 | 0.78 | 0.78
Spanish translation | FastText | CNN | 0.67 | 0.67 | 0.67 | 0.67
Spanish translation | FastText | BiLSTM | 0.75 | 0.75 | 0.75 | 0.75
Spanish translation | GloVe | CNN | 0.65 | 0.65 | 0.65 | 0.65
Spanish translation | GloVe | BiLSTM | 0.71 | 0.71 | 0.71 | 0.71
Urdu translation | FastText | CNN | 0.63 | 0.63 | 0.63 | 0.63
Urdu translation | FastText | BiLSTM | 0.64 | 0.64 | 0.64 | 0.64
Urdu translation | GloVe | CNN | 0.39 | 0.39 | 0.39 | 0.39
Urdu translation | GloVe | BiLSTM | 0.41 | 0.41 | 0.41 | 0.41
Joint Multilingual | FastText | CNN | 0.68 | 0.68 | 0.68 | 0.68
Joint Multilingual | FastText | BiLSTM | 0.76 | 0.76 | 0.76 | 0.76
Joint Multilingual | GloVe | CNN | 0.64 | 0.64 | 0.64 | 0.64
Joint Multilingual | GloVe | BiLSTM | 0.68 | 0.68 | 0.68 | 0.68
Table 9. Results for machine learning models.
Language | Model | Precision | Recall | F1-Score | Accuracy
English translation | RF | 0.82 | 0.81 | 0.81 | 0.81
English translation | SVM | 0.83 | 0.82 | 0.82 | 0.82
English translation | DT | 0.76 | 0.76 | 0.76 | 0.76
English translation | XGB | 0.81 | 0.80 | 0.80 | 0.80
Spanish translation | RF | 0.78 | 0.77 | 0.77 | 0.77
Spanish translation | SVM | 0.78 | 0.78 | 0.78 | 0.78
Spanish translation | DT | 0.71 | 0.71 | 0.71 | 0.71
Spanish translation | XGB | 0.77 | 0.77 | 0.77 | 0.77
Urdu translation | RF | 0.77 | 0.76 | 0.76 | 0.76
Urdu translation | SVM | 0.78 | 0.77 | 0.77 | 0.77
Urdu translation | DT | 0.68 | 0.68 | 0.68 | 0.68
Urdu translation | XGB | 0.76 | 0.75 | 0.75 | 0.75
Joint Multilingual | RF | 0.81 | 0.81 | 0.81 | 0.81
Joint Multilingual | SVM | 0.82 | 0.82 | 0.82 | 0.82
Joint Multilingual | XGB | 0.80 | 0.80 | 0.80 | 0.80
Table 10. Top performing models in each learning approach.
Language | Model Type | Model | Precision | Recall | F1-Score | Accuracy
English | ML | SVM | 0.83 | 0.82 | 0.82 | 0.82
English | DL | BiLSTM (FastText) | 0.78 | 0.78 | 0.78 | 0.78
English | TL | roberta-base | 0.84 | 0.84 | 0.84 | 0.84
English | LLM | GPT-3.5 Turbo | 0.87 | 0.87 | 0.87 | 0.87
Spanish | ML | SVM | 0.78 | 0.78 | 0.78 | 0.78
Spanish | DL | BiLSTM (FastText) | 0.75 | 0.75 | 0.75 | 0.75
Spanish | TL | bert-base-uncased | 0.81 | 0.81 | 0.81 | 0.81
Spanish | LLM | GPT-3.5 Turbo | 0.85 | 0.85 | 0.85 | 0.85
Urdu | ML | SVM | 0.78 | 0.77 | 0.77 | 0.77
Urdu | DL | BiLSTM (FastText) | 0.64 | 0.64 | 0.64 | 0.64
Urdu | TL | xlm-roberta-base | 0.74 | 0.73 | 0.73 | 0.73
Urdu | LLM | GPT-3.5 Turbo | 0.81 | 0.81 | 0.81 | 0.81
Joint Multilingual | ML | SVM | 0.82 | 0.82 | 0.82 | 0.82
Joint Multilingual | DL | BiLSTM (FastText) | 0.76 | 0.76 | 0.76 | 0.76
Joint Multilingual | TL | bert-base-uncased | 0.84 | 0.84 | 0.84 | 0.84
Joint Multilingual | LLM | GPT-3.5 Turbo | 0.88 | 0.88 | 0.88 | 0.88
Table 11. Class-wise scores of the proposed model (GPT-3.5 Turbo).
Dataset | Class | Precision | Recall | F1-Score | Support
English | Not-Hateful | 0.84 | 0.90 | 0.87 | 953
English | Hateful | 0.90 | 0.85 | 0.87 | 1086
Spanish | Not-Hateful | 0.82 | 0.88 | 0.84 | 952
Spanish | Hateful | 0.88 | 0.83 | 0.85 | 1087
Urdu | Not-Hateful | 0.78 | 0.83 | 0.80 | 951
Urdu | Hateful | 0.84 | 0.79 | 0.82 | 1088
Joint | Not-Hateful | 0.85 | 0.91 | 0.88 | 953
Joint | Hateful | 0.91 | 0.86 | 0.88 | 1086
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
