Adversarial Evaluation of Large Language Models for Building Robust Offensive Language Detection in Moroccan Arabic

Ouali, Soufiyan; Raisi, Kanza; Mourhir, Asmaa; Nfaoui, El Habib; Garouani, Said El

doi:10.3390/bdcc10050132



by Soufiyan Ouali ¹,*, Kanza Raisi ², Asmaa Mourhir ², El Habib Nfaoui ¹ and Said El Garouani ¹

¹ L3IA Laboratory, Faculty of Sciences Dhar El Mahraz, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco

² School of Science and Engineering, Al Akhawayn University, Ifrane 53000, Morocco

* Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2026, 10(5), 132; https://doi.org/10.3390/bdcc10050132

Submission received: 26 January 2026 / Revised: 12 March 2026 / Accepted: 23 March 2026 / Published: 24 April 2026

(This article belongs to the Special Issue Natural Language Processing Applications in Big Data)


Abstract

Offensive language detection is crucial for ensuring safe and inclusive digital environments. Identifying harmful content protects users and supports healthier online interactions. Despite advances in transformer-based models, particularly Large Language Models (LLMs), their application to this task remains underexplored for low-resource languages such as Moroccan Arabic, especially compared with high-resource languages. This study evaluates the performance of various open- and closed-source LLMs for offensive language detection in Moroccan Darija. The evaluated models include general-purpose LLMs such as LLaMA, Mistral, and Gemma, as well as Arabic-focused models such as ArabianGPT, Falcon Arabic, and Atlas-Chat. We also experiment with reasoning models such as DeepSeek and GPT-4. Beyond traditional evaluation metrics, we investigate the robustness of these LLMs and examine the impact of adversarial training on their performance. Moreover, we contribute to the field by creating a large, high-quality dataset. Our evaluation revealed that GPT-4o Mini achieved the best overall performance, reaching an F1-score of 88%. However, robustness testing under black-box and white-box adversarial attacks exposed notable vulnerabilities, with attack success rates reaching 30%, thereby highlighting the need for enhancement. Despite the complex morphology and linguistic variability of Moroccan Darija, adversarial training resulted in a notable improvement in both overall model performance and robustness against adversarial attacks, yielding an average increase of 20.89% in resistance to attacks. Furthermore, this approach enabled GPT-4o Mini to achieve an F1-score of 91%, surpassing the current state-of-the-art performance by 6%. These results highlight the importance of incorporating adversarial approaches in low-resource dialectal settings to effectively address linguistic variability and data scarcity.

Keywords: robust AI for low-resource language; Moroccan Arabic; Arabic dialects; LLMs for low-resource language; offensive speech recognition; adversarial training

1. Introduction

Offensive language remains a pervasive challenge in modern digital communication, driven by the rapid growth of social media and online interaction platforms. Expressions of hate speech, abusive language, and cyberbullying can significantly harm individuals and undermine the trust, safety, and well-being of online communities. Addressing this challenge requires effective offensive language detection mechanisms capable of identifying harmful content in a timely and scalable manner. Such systems play a vital role in protecting vulnerable users, supporting community guidelines, and ensuring compliance with legal and ethical standards. In highly interconnected digital spaces, robust detection approaches also help promote respectful discourse while enabling platforms to moderate content responsibly without compromising freedom of expression.

Moroccan Arabic (Darija) is a low-resource dialect that presents significant challenges for offensive language detection. Its primarily spoken and informal nature results in the absence of standardized orthography and consistent syntactic rules, as well as a scarcity of essential Natural Language Processing (NLP) resources, such as tokenizers, embeddings, and named entity recognition (NER) models [1,2]. These challenges are further intensified by the influence of Berber, French, and Spanish, substantial regional variation, and the frequent use of non-standard writing forms, including Latin characters mixed with numerals [3]. These factors hinder the creation of reliable datasets and models. Moreover, offensive language in Darija is highly context-dependent, shaped by social norms, regional interpretations, and implicit meanings, which makes detection far more complex than simple keyword-based approaches. Annotator bias and inconsistent labeling in existing datasets further hinder model generalization and increase susceptibility to adversarial manipulation. These challenges are compounded by the lack of evaluation frameworks specifically designed for dialectal Arabic. To address these issues, this study proposes a robust framework to enhance offensive language detection in Darija by reducing labeling inconsistencies, improving generalization, and strengthening adversarial resilience, ultimately supporting more effective content moderation in this underrepresented dialect.

Recently, LLMs have demonstrated strong performance across a wide range of NLP benchmarks, including emotion recognition, sentiment analysis, and hate and offensive speech detection in high-resource languages such as English [4,5,6]. These successes have established LLMs as leading approaches in modern NLP. Their ability to capture complex linguistic patterns through fine-tuning on large pre-trained models has also led to claims that they can adapt effectively to new languages and dialects, even in low-data settings. However, the application of LLMs to low-resource languages remains relatively underexplored. In this study, we examine the performance of LLMs for offensive language detection in Darija.

Deploying language models in real-world applications requires more than high accuracy; it demands robust testing beyond controlled datasets. Recent research emphasizes robustness and generalizability, ensuring that models perform reliably under diverse and unpredictable user inputs. This concern is particularly critical for low-resource languages like Darija, where the lack of comprehensive linguistic resources makes models more prone to failure. In such contexts, even minor perturbations in the input, such as synonym substitutions, orthographic variations, or code-switching, can significantly alter the model’s predictions. Recent research on dialectal offensive language detection has shown that the performance of large models can drop significantly in accuracy under adversarial conditions [7]. The authors in [8] report that aspect-based sentiment analysis methods can lose over 50% in accuracy when exposed to noisy or perturbed inputs. This vulnerability underscores the importance of building models that not only perform well under standard conditions but also demonstrate resilience to linguistic diversity and adversarial manipulation, which are common in real-world settings.

This study aims to assess the effectiveness and robustness of several well-established LLMs reported in the literature, including GPT-4, Mistral, DeepSeek, LLaMA, Gemma, ArabianGPT, Falcon Arabic, and Atlas-Chat, when applied to Moroccan Darija. In particular, the study addresses the following research questions:

RQ1: Do modern large language models outperform traditional deep learning and fine-tuned transformer-based classifiers in offensive language detection for Moroccan Darija under standard evaluation settings?

RQ2: How robust are these models to adversarial attacks?

RQ3: Which large language models demonstrate the highest resilience to adversarial attacks while maintaining strong classification performance?

RQ4: Can adversarial training improve the robustness of language models for offensive language detection in Moroccan Darija?

The main contributions of this work are summarized as follows. First, we construct a high-quality dataset for offensive language detection in Darija, designed to capture realistic linguistic challenges commonly found in online communication and to reflect variation in language use. Second, we conduct a comprehensive evaluation of several large language models for offensive language detection in this low-resource dialect. Third, we assess the robustness of these models under multiple adversarial perturbation strategies that simulate realistic noise and manipulation in user-generated text. Finally, we investigate the effectiveness of adversarial training as a defense mechanism for improving model robustness while maintaining strong classification performance.

The rest of the paper is organized as follows. Section 2 surveys the related work. Section 3 outlines the methodology. Section 4 presents and discusses the results. Finally, Section 5 concludes the paper and highlights directions for future research.

2. Background and Related Work

In this section, we review the main approaches and models used for offensive language detection in both high-resource and low-resource languages, with a particular focus on Arabic and its dialects. We also summarize the current state-of-the-art results reported across different linguistic settings. In addition, this section provides a concise overview of model performance from another perspective, robustness, by discussing how these models perform in real-world settings.

2.1. Offensive Language Detection Methods

LLMs have recently revolutionized the field of NLP, achieving strong results across a wide range of tasks, from language understanding to language generation. As a result, increasing attention has been directed toward using LLMs to advance research in offensive language detection, particularly for high-resource languages such as English, French, German, and Spanish. In [9], the authors conducted a comparative study evaluating three LLMs (i.e., GPT-3.5, Flan-T5, and Mistral) across three high-resource languages (i.e., English, Spanish, and German), using a dataset of 5000 samples for each language. Mistral achieved the best result for English with an F1-score of 91.2%, closely approaching the 92.0% performance achieved by RoBERTa-Large in SemEval-2020 [10]. For Spanish, Mistral achieved an F1-score of 87.7%, while for German, GPT-3.5 outperformed the other models with an F1-score of 76.8%. Similarly, in [11], researchers evaluated six LLMs for English, including Falcon-7B-Instruct, RedPajama-INCITE-7B-Instruct, MPT-7B-Instruct, LLaMA-2-7B-Chat, T0-3B, and Flan-T5-large. Flan-T5 achieved the best result, with an F1-score of 91%. LLaMA 2 also performed strongly, achieving an F1-score of 87.4%, whereas the other models lagged noticeably, all scoring below 75% F1. In [12], the authors leveraged GPT-2 and investigated the effectiveness of various data augmentation strategies, achieving a notable F1-score of 91.50%, which highlights the importance of rich and representative datasets for improving NLP model performance. In [13], the authors proposed a novel approach for analyzing offensive speech in both explicit and implicit forms. Using this method and fine-tuning multiple LLMs (e.g., GPT-2, GPT-3, and Flan-T5), they detected explicit offensiveness with an F1-score of 86.16% and implicit offensiveness with 76.71% using Flan-T5-large. GPT-2 also showed promising results, achieving an F1-score of 85.67% for explicit and 71.58% for implicit offensive speech detection.

In OffensEval 2023 [14], several underrepresented languages, such as Arabic, Danish, Greek, and Turkish, were included, and participants achieved promising results (Table 1). These findings underscore the effectiveness of LLMs and transformer-based models such as BERT in detecting offensive speech. However, research on dialectal variants remains limited. Most current models, such as BERT and AraBERT, are trained on formal language, particularly Modern Standard Arabic (MSA), which reduces their effectiveness in real-world dialectal contexts. Arabic dialects, including Egyptian, Levantine, Maghrebi, and Gulf Arabic, differ significantly from MSA in structure and vocabulary, which limits the generalizability of these models. For the Moroccan dialect, which is the central focus of our work, several attempts have been made to reduce this gap. For instance, the study in [15] trained four deep learning models to detect offensive language. However, to date, only two studies have addressed this task using advanced methods. The earliest work in this area [16] presented the Offensive Moroccan Comments Dataset (OMCD), a specialized corpus developed specifically for the detection of offensive language in Moroccan Darija and restricted to Arabic-script texts. Following fine-tuning experiments with several BERT-family architectures on this dataset, MARBERT demonstrated the strongest performance, attaining an F1-score of 84.02%.

A subsequent contribution [17] introduced the Moroccan Darija Offensive Language Detection Dataset (DarOLD), which expanded coverage to include both Arabic- and Latin-script representations of Darija and was explicitly constructed for offensive content detection. After multiple BERT-based models were fine-tuned and evaluated on DarOLD, DarijaRoBERTa achieved the best performance, with an F1-score of 85%, which represents the current state of the art (SOTA) for Darija.

The availability of LLMs tailored to the Arabic language, particularly its dialects, remains extremely limited, if not nearly nonexistent. This makes our work the first to investigate the performance of LLMs for offensive speech detection in Moroccan Darija.

2.2. Robustness of Offensive Detection Models

Although the literature reports strong results, these findings are typically based on closed test sets and may not reflect real-world performance. Without robust testing that accounts for input variability, models cannot be considered reliable for deployment. Adversarial attack testing has emerged as a powerful method for evaluating the reliability of NLP models. It assesses a model’s ability to generalize and maintain accurate performance when exposed to challenging inputs, such as deliberate perturbations, typographical errors, or paraphrased sentences that differ from the training data but retain the same semantic meaning [18]. These attacks are commonly generated using two main approaches: black-box and white-box.

In a black-box attack, the attacker has no access to the model’s internal workings, such as its architecture, parameters, or gradients. Instead, adversarial examples are crafted by applying random or heuristic-based changes to the input text, such as swapping characters, inserting noise, or paraphrasing. Because the model’s internal behavior is unknown, this approach mirrors real-world scenarios in which attackers can interact with a system only through input-output queries.

In contrast, a white-box attack assumes full access to the model’s internals, including its architecture and gradients. This enables the attacker to generate adversarial inputs by directly analyzing how changes to the input affect the model’s predictions. Using gradient information, precise perturbations can be applied to maximize the likelihood of fooling the model. This type of attack typically targets open-source models.

An LLM typically receives input as a combination of a prompt and a user sample. Consequently, attacks can target either component: prompt injection through the prompt or user-end attacks through the sample. Our primary focus is on user-end attacks, as they better simulate real-world adversarial scenarios. These attacks are generally categorized into three levels: character-level, in which individual letters, punctuation marks, numbers, or spacing are altered to subtly distort the input while keeping it human-readable; word-level, which involves replacing words with synonyms, antonyms, or semantically similar terms while preserving grammatical correctness; and sentence-level, in which the sentence structure or meaning is modified through techniques such as paraphrasing. Each of these attack types can be applied in both white-box and black-box settings, with various perturbation strategies available within each category. In [19], the researchers examined the effect of different manipulations on the performance of various LLMs and found that word-level attacks were the most effective, causing an average performance drop of 39% across all tasks. Character-level attacks ranked second, inducing a 20% performance drop in most datasets. Similarly, in [17], the authors successfully deceived large models (e.g., DarijaBERT and RoBERTa) into classifying offensive text as non-offensive, and vice versa, through sample injection using character and word-level attacks in a black-box setting. They found that character-level attacks achieved the highest Attack Success Rate (ASR). More specifically, inserting dots between letters achieved an ASR of 29%, inserting spaces between letters achieved 24%, modifying a character with random noise achieved 18%, modifying a single character achieved 17%, deleting spaces between two words achieved 16%, and repeating vowels achieved 13%.
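As an illustration of the character-level perturbations reported for [17] above (dots or spaces inserted between letters, and the space between two words deleted), the following is a minimal sketch. The function names and the English toy sentence are hypothetical, and the code is not the implementation used in [17] or in this study.

```python
def insert_between_letters(sentence: str, target: str, sep: str = ".") -> str:
    """Insert `sep` between the letters of every occurrence of `target`."""
    return " ".join(
        sep.join(word) if word == target else word
        for word in sentence.split()
    )


def delete_space_after(sentence: str, target: str) -> str:
    """Merge `target` with the word that follows it (space deletion)."""
    words = sentence.split()
    merged = []
    i = 0
    while i < len(words):
        if words[i] == target and i + 1 < len(words):
            merged.append(words[i] + words[i + 1])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return " ".join(merged)


if __name__ == "__main__":
    text = "you are stupid"  # toy example, not taken from the dataset
    print(insert_between_letters(text, "stupid", "."))  # dots between letters
    print(insert_between_letters(text, "stupid", " "))  # spaces between letters
    print(delete_space_after(text, "are"))              # space deletion
```

Each perturbation leaves the text readable to a human while changing the token sequence the classifier receives.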

Creating adversarial attacks is a critical yet delicate process. While some methods can achieve high ASR, their validity becomes questionable if the resulting text is unnatural or unreadable to humans. As highlighted in [20], injecting heavy distortions or meaningless text may effectively break a classifier, but such approaches are unsuitable for real-world applications. These modifications often alter the original meaning or render the input nonsensical, undermining the purpose of testing model robustness. A truly effective adversarial example must preserve coherence and readability while subtly influencing the model’s prediction. This concern is supported by findings in [21], where researchers observed that approximately 90% of existing adversarial attack outputs either distort semantic meaning or mislead human annotators. Additionally, the authors demonstrated that natural, human-readable adversarial attacks using word-level perturbation in white-box settings and sentence-level perturbation in black-box settings can effectively deceive LLMs, while preserving semantic similarity. Using these attacks led them to achieve a significant accuracy drop of 55.41%.

A review of the literature indicates that, for an adversarial attack to be considered successful, it must satisfy two key criteria: it should remain readable and interpretable to humans while also effectively deceiving the model. This dual requirement ensures that the perturbation is both realistic and sufficiently challenging. The literature also underscores the urgent need for robust models that can withstand such attacks and remain reliable enough for deployment in real-world applications. This is particularly important given the significant performance gap between high-resource and low-resource languages, as models trained on limited data are often more vulnerable to adversarial manipulation and less capable of generalizing beyond controlled test environments.

Research on robust offensive language detection in Moroccan Darija remains critically underexplored, with only one study [17] attempting to build classification models using transformer-based architectures. This effort relied on BERT-based sentence embeddings and deep learning classifiers, yet it revealed significant vulnerabilities, with adversarial attacks successfully deceiving the model at a rate of up to 29%. Building on the foundational work of [16,17], our study makes a novel and comprehensive contribution by presenting the first robustness-focused evaluation of various LLMs for offensive language detection in Darija. Specifically, we investigate whether LLMs can outperform traditional transformer-based models in terms of both accuracy and resilience to adversarial attacks. Our robustness analysis spans multiple levels of adversarial attack (character, word, and sentence) to uncover hidden weaknesses and assess model resilience before real-world deployment. Furthermore, we address the acute scarcity of dialectal resources by introducing a large, high-quality, and representative dataset. This dataset is developed using rigorous labeling protocols to ensure consistency and minimize annotation bias, thereby laying a stronger foundation for reliable and generalizable model development in this low-resource context. Finally, we contribute to the field by evaluating the effectiveness of adversarial training, a method recommended in [17] for improving model resistance to adversarial attacks. This work helps bridge the resource gap and provides a strong foundation for building more reliable AI systems for low-resource dialects such as Moroccan Darija.

3. Materials and Methods

This section describes the dataset acquisition and preparation process, including the data sources, key characteristics, preprocessing steps, and labeling methods employed. In addition, we outline the methodology used for LLM selection and evaluation, as well as the process of building a robust offensive language detection model tailored to Darija using an adversarial training approach.

3.1. Dataset Acquisition and Preparation

In this study, we employed two established datasets that align with our research scope and have demonstrated strong performance in the literature. The first is DarOLD [17], which contains 20,402 human-labeled sentences annotated as offensive or non-offensive, collected from Twitter and YouTube. Given the Darija writing style (e.g., code-switching and code-mixing), the dataset was designed to integrate both Arabic and Latin scripts, thereby reflecting real-world language use. Using this dataset and fine-tuning a transformer-based model, the authors reported an F1-score of 85%. The second dataset is OMCD, developed in [16]. It consists of 8000 comments collected from social media platforms (e.g., Facebook and YouTube) and human-labeled as offensive or non-offensive. Using this dataset, the authors achieved an F1-score of 84%.

To maintain consistency between the DarOLD and OMCD datasets, we applied the same preprocessing pipeline used for DarOLD to OMCD. This process involved converting emojis into their textual equivalents to preserve sentiment-related information, standardizing variations in Arabic script to unify letter forms, and removing punctuation and diacritics to emphasize the core textual content. Social media-specific elements, including links, user mentions, and hashtags, were also removed, as they provide limited linguistic value. Words containing repeated characters (e.g., “gooood”) were normalized by limiting consecutive repetitions to a maximum of two. An additional step involved removing duplicate entries across both datasets to ensure all comments in the merged dataset were unique. This measure also helped prevent data leakage during dataset splitting, thereby enabling more reliable model evaluation.
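The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions and not the authors' pipeline: the regular expressions, the alef-unification rule, and the function names are hypothetical, and emoji-to-text conversion is omitted because it needs a mapping table, so emojis are simply dropped here along with punctuation.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")          # a character repeated 3+ times
ALEF_VARIANTS_RE = re.compile("[\u0623\u0625\u0622]")  # assumed letter-form rule


def normalize_comment(text: str) -> str:
    """Apply the cleaning steps to one comment (illustrative subset)."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    # Arabic diacritics are Unicode nonspacing marks (category Mn).
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Unify alef-with-hamza/madda forms to bare alef (an assumption).
    text = ALEF_VARIANTS_RE.sub("\u0627", text)
    # Remove punctuation and other symbols; keep letters, digits, whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Limit consecutive repetitions of a character to two.
    text = REPEAT_RE.sub(r"\1\1", text)
    return " ".join(text.split())


def deduplicate(comments):
    """Normalize comments and keep the first occurrence of each unique text."""
    seen = set()
    unique = []
    for comment in comments:
        cleaned = normalize_comment(comment)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique


if __name__ == "__main__":
    print(normalize_comment("gooood!!! @user http://x.co #tag"))
```

Deduplicating on the normalized text, rather than the raw text, is what prevents near-identical comments from ending up on both sides of the train/test split.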

After preprocessing and duplicate removal, the DarOLD dataset was reduced from 20,398 to 14,114 comments (≈30.8% removed), while the OMCD dataset decreased from 8024 to 7516 comments (≈6.3% removed).

Although both datasets were already labeled, they followed different annotation frameworks. To ensure consistency, we relabeled the DarOLD dataset according to the OMCD guidelines. This step established a unified labeling standard, thereby improving the overall coherence of the merged dataset. The original DarOLD guidelines emphasized clear markers of offensiveness, including vulgar language, hate speech, sexism, name-calling, and threats. This narrow focus supported consistent labeling of overtly harmful content. In contrast, OMCD adopts a broader framework that includes both explicit and implicit forms of offensiveness. Beyond clear instances of abuse, OMCD accounts for subtle expressions of contempt, humiliation, or implicit aggression, elements that are often overlooked but equally harmful in online discourse. Integrating DarOLD into the OMCD framework enhances the dataset’s capacity to train models that can generalize across diverse real-world scenarios. To achieve this, we used Claude 3.5 Sonnet, a state-of-the-art LLM developed by Anthropic and recognized for its strong performance across multilingual tasks. A custom prompt was designed to guide Claude 3.5 Sonnet in applying the OMCD labeling guidelines, as presented in Figure 1. Each text entry in the DarOLD dataset was processed using this prompt, and the model’s output was recorded as the updated label. Entries for which the model’s output disagreed with the original labels were flagged for the next verification step, which involved manual review by native speakers following the OMCD guidelines.

To ensure transparency and reliability in the relabeling process, we quantified the agreement between the original DarOLD annotations and the updated labels generated by applying the OMCD framework with Claude 3.5 Sonnet. Out of 14,114 samples, 2626 instances (18.61%) exhibited disagreement, after which 1024 labels were confirmed as modified, while 1602 labels remained unchanged because manual review confirmed the original annotation was correct.

To formally assess annotation consistency, we computed Cohen’s kappa (κ), which accounts for agreement occurring by chance using Equations (1)–(3).

κ = \frac{P_{o} - P_{e}}{1 - P_{e}}

(1)

where P_o is the observed agreement between the two label sets:

P_{o} = \frac{\text{Number of agreements}}{\text{Total number of samples}} = \frac{11{,}488}{14{,}114} \approx 0.8

(2)

and P_e is the expected agreement by chance, calculated as:

P_{e} = \sum_{i = 1}^{k} P_{A, i} \cdot P_{B, i} = (0.5 \times 0.5) + (0.5 \times 0.5) = 0.5

(3)

with k representing the number of label categories (Offensive/Non-offensive), P_{A,i} the proportion of samples assigned to category i by the original annotation, and P_{B,i} the proportion assigned by the updated labels. Using this formulation, Cohen’s kappa score was 0.6, indicating substantial agreement.
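The arithmetic behind Equations (1)–(3) can be reproduced in a few lines. The sketch below uses only the counts stated in the text (14,114 samples, 2626 disagreements) and the chance agreement of 0.5 from Equation (3), which assumes balanced label proportions. The function name is illustrative.

```python
def cohens_kappa(p_o: float, p_e: float) -> float:
    """Cohen's kappa from observed and chance agreement, Equation (1)."""
    return (p_o - p_e) / (1.0 - p_e)


total_samples = 14_114
disagreements = 2_626
agreements = total_samples - disagreements       # 11,488

p_o = agreements / total_samples                 # Equation (2)
p_e = (0.5 * 0.5) + (0.5 * 0.5)                  # Equation (3), balanced classes
kappa = cohens_kappa(p_o, p_e)

print(f"P_o = {p_o:.3f}, P_e = {p_e:.2f}, kappa = {kappa:.3f}")
```

Without intermediate rounding, this gives P_o ≈ 0.814 and κ ≈ 0.628, which rounds to the reported values of 0.8 and 0.6.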

Table 2 summarizes the distribution of label changes and manual verification outcomes.

The resulting dataset achieved a high level of standardization and reliability. This consistency provides a robust foundation for training offensive language detection models that are sensitive to the linguistic and sociolinguistic nuances of Moroccan Darija.

For dataset splitting, we divided the dataset into two subsets: 85% for training and 15% for testing. We used stratified sampling based on class labels to ensure that the class distribution remained consistent across both sets. This approach helps reduce the risk of model bias toward a particular class and supports more reliable evaluation.
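A stratified 85/15 split of this kind can be sketched with the standard library alone. The function name, the fixed seed, and the per-class rounding rule are assumptions made for illustration. The source does not state which tool or seed was used.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.15, seed=42):
    """Split (sample, label) pairs so each class keeps the same proportion."""
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed (assumed) for reproducibility
    train_idx, test_idx = [], []
    for indices in indices_by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test


if __name__ == "__main__":
    toy_texts = [f"comment {i}" for i in range(100)]
    toy_labels = [0] * 60 + [1] * 40
    train_set, test_set = stratified_split(toy_texts, toy_labels)
    print(len(train_set), len(test_set))
```

Because the split is drawn separately within each class, the offensive/non-offensive ratio in the test set matches that of the training set up to rounding.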

3.2. LLM Selection for Fine-Tuning

To develop an effective model for offensive text classification in Moroccan Darija, we conducted a comparative study by fine-tuning several state-of-the-art LLMs for text understanding and generation. This process involved taking pre-trained models (e.g., GPT-3) and further training them on task-specific data to better adapt them to the offensive language classification task. We selected the following models: LLaMA (LLaMA 2: 7B and 13B; LLaMA 3.1: 8B; LLaMA 3.2: 3B), Gemma (Gemma 2: 2B and 9B), Mistral (Mistral 3 and Mistral Small 3), DeepSeek (DeepSeek R1: 7B, 8B, and 14B), and GPT (GPT-4o-mini-2024-07-18). We also experimented with three leading Arabic LLMs: ArabianGPT, Falcon Arabic, and Atlas-Chat.

Considering the constraints of computational resources and time, we did not include very large models such as LLaMA 3.1 405B and LLaMA 3.3 70B. Furthermore, all of the selected LLM base models are generative models designed for tasks such as question answering and chatbot interaction. Therefore, fine-tuning them for classification, especially in Arabic, requires carefully designed prompt engineering to ensure that the models produce appropriate outputs. We tested multiple prompt styles and retained the best-performing ones. Figure 2 shows the final prompt framework (with respect to each model’s prompt styling), tailored to Darija, which guides the model to classify a text as “Offensive” or “Not Offensive.”

For all open-weight models, we performed hyperparameter tuning using a random search approach. The search space included learning rates [1 × 10⁻⁵, 5 × 10⁻⁵, 1 × 10⁻⁴, 2 × 10⁻⁴], per-device batch sizes [2, 4, 8], LoRA ranks [4, 8, 16], LoRA alpha values [8, 16, 32], and dropout rates [0.0, 0.05, 0.1], with the ranges informed by prior transformer fine-tuning studies. Each configuration was evaluated on the validation set, and the configuration achieving the highest F1-score was selected for final training.
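A random search over this space can be sketched as follows. The search space is taken from the text, while the `evaluate` callback (which would fine-tune a model with the given configuration and return its validation F1-score), the number of trials, and the seed are hypothetical placeholders, since the source does not report them.

```python
import random

# Search space as listed in the text.
SEARCH_SPACE = {
    "learning_rate": [1e-5, 5e-5, 1e-4, 2e-4],
    "per_device_batch_size": [2, 4, 8],
    "lora_rank": [4, 8, 16],
    "lora_alpha": [8, 16, 32],
    "lora_dropout": [0.0, 0.05, 0.1],
}


def random_search(evaluate, n_trials=10, seed=0):
    """Sample configurations uniformly and keep the one with the best F1.

    `evaluate` is a user-supplied function: config dict -> validation F1.
    """
    rng = random.Random(seed)
    best_config, best_f1 = None, float("-inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        f1 = evaluate(config)
        if f1 > best_f1:
            best_config, best_f1 = config, f1
    return best_config, best_f1


if __name__ == "__main__":
    # Dummy objective standing in for an actual fine-tuning run.
    def dummy_evaluate(config):
        return 1.0 if config["lora_rank"] == 8 else 0.0

    print(random_search(dummy_evaluate, n_trials=50))
```

Random search evaluates only a sample of the 324 possible combinations, which keeps the tuning cost manageable on limited GPU budgets.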

The final hyperparameters for open-weight models were: training for 3 epochs, per-device batch size of 4 for both training and evaluation, learning rate of 2 × 10⁻⁴, linear learning rate scheduler, LoRA configuration with rank r = 8, lora_alpha = 16, dropout rate = 0.05, bias = “none”, and task type set to CAUSAL_LM. We applied 4-bit quantization to reduce GPU memory usage. Checkpoints were saved at the end of each epoch, and logs were recorded every 10 steps.

For closed-source models such as GPT-4, we set the temperature to 0.8, batch size to 8, and learning rate to 1.5. This configuration yielded the best performance in our experiments while ensuring stable behavior when using the OpenAI API.

Experiments in this study were carried out on NVIDIA Tesla T4 GPUs, accessed via Google Colab Pro and the Kaggle platform. The duration of model execution and fine-tuning varied depending on the specific architecture. Table 3 summarizes the fine-tuning times (35 min–6 h 42 min) and prediction times (20 min–2 h) for all models. During fine-tuning, GPU memory consumption remained within 14–16 GB of VRAM, depending on model size and batch configuration.

3.3. Offensive Language Detection Model Evaluation

3.3.1. Evaluation Methods

To assess model effectiveness under standard, controlled settings, we used traditional evaluation metrics, including accuracy, F1-score, precision, and recall. These metrics allowed us to compare different models objectively and select the most suitable one based on its ability to correctly classify offensive content. In particular, the F1-score provided a balanced view of performance by accounting for both false positives and false negatives, which is especially important in tasks such as offensive language detection, where class imbalance can affect model behavior.

Beyond these traditional metrics, we also evaluated model robustness using the ASR, the percentage of adversarial examples that successfully fool fine-tuned LLMs into misclassification. We calculated ASR across two adversarial attack settings (i.e., black-box and white-box) and at different linguistic levels: character, word, and sentence. Accordingly, 50% of the test set (1778 samples) was selected to generate adversarial examples. This proportion was chosen to balance computational efficiency with statistical representativeness. Our objective was to simulate realistic attack scenarios and gain meaningful insights into model resilience against both subtle and natural perturbations.
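The ASR computation can be sketched as below. The text defines ASR only verbally, so this sketch follows a common convention as an assumption: only samples the model classified correctly before the attack are counted, and an attack succeeds when the adversarial version is misclassified. The function and argument names are illustrative.

```python
def attack_success_rate(true_labels, clean_preds, adversarial_preds):
    """Percentage of initially correct samples misclassified after the attack."""
    attacked = 0
    fooled = 0
    for truth, clean, adversarial in zip(true_labels, clean_preds, adversarial_preds):
        if clean != truth:
            continue  # already wrong before the attack (assumed to be excluded)
        attacked += 1
        if adversarial != truth:
            fooled += 1
    return 100.0 * fooled / attacked if attacked else 0.0


if __name__ == "__main__":
    truth = [1, 1, 0, 0, 1]
    clean = [1, 1, 0, 1, 1]        # fourth sample is wrong before any attack
    adversarial = [0, 1, 0, 0, 0]  # first and fifth samples are flipped
    print(attack_success_rate(truth, clean, adversarial))
```

In the toy example, four samples are eligible and two are flipped by the attack, giving an ASR of 50%.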

3.3.2. Robustness Evaluation via Black-Box and White-Box Attacks

The black-box attacks employed in this study are summarized in Table 4. Robustness to black-box attacks is particularly important because these attacks reflect real-world conditions in which inputs may be noisy, unpredictable, or intentionally altered, and models must still perform reliably without any prior knowledge of the input structure.

The white-box attacks employed in this study are summarized in Table 5. Robustness to white-box attacks is critical because these attacks exploit the model’s internal gradients to craft precise perturbations, thereby revealing deeper vulnerabilities and testing resilience under targeted adversarial conditions.

The process of generating adversarial examples begins by analyzing the input sentence to identify the words that contribute most to the model’s prediction. We selected the two most influential words as targets for manipulation. Using these keywords, adversarial examples are generated by applying perturbations at the character level (e.g., swapping or modifying letters), word level (e.g., synonym replacement), or sentence level (e.g., full paraphrasing). After generation, each adversarial sentence is evaluated for semantic similarity to the original. Only those that preserve the original meaning are considered valid adversarial inputs for robustness testing. In the following sections, we explain this pipeline step by step.

In black-box attacks, model gradients are inaccessible. Therefore, we used a scoring-based approach to assess the importance of each word within a sentence. This method evaluates how the removal of individual tokens affects the model’s classification outcome. Formally, let x and x′ denote the original input sentence and the sentence obtained after removing the i-th word (wᵢ), respectively, as illustrated in Equations (4) and (5).

x = [w_{1}, w_{2}, \dots, w_{i-1}, w_{i}, w_{i+1}, \dots, w_{n}]

(4)

x' = [w_{1}, w_{2}, \dots, w_{i-1}, w_{i+1}, \dots, w_{n}]

(5)

The word importance score is defined in Equation (6).

I_{w_{i}} = P(x) - P(x')

(6)

where P is the prediction probability.

After computing the importance scores I_{w_{i}} for all words, we sort them in descending order.
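The deletion-based scoring of Equations (4)–(6) can be sketched as follows. The classifier is replaced by a toy scoring function (`toy_predict_proba`, a hypothetical stand-in with made-up weights); in practice, P would be the probability returned by the fine-tuned model.

```python
def word_importance(words, predict_proba):
    """Rank words by I_w = P(x) - P(x'), where x' omits the word."""
    base = predict_proba(words)
    scores = []
    for i, word in enumerate(words):
        reduced = words[:i] + words[i + 1:]  # x' from Equation (5)
        scores.append((word, base - predict_proba(reduced)))
    # Sort in descending order of importance.
    return sorted(scores, key=lambda item: item[1], reverse=True)


def toy_predict_proba(words):
    """Hypothetical stand-in for the model's offensive-class probability."""
    weights = {"bad": 0.6, "very": 0.2}
    return min(1.0, 0.1 + sum(weights.get(w, 0.0) for w in words))


ranked = word_importance(["this", "is", "very", "bad"], toy_predict_proba)
top_two = [word for word, _ in ranked[:2]]
```

This procedure requires one additional model query per word, which is the price of not having access to gradients.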

White-box adversarial attack generation relies on full access to a model’s internal architecture and parameters, which is only possible with open-source models such as ArabianGPT and Atlas-Chat. In contrast, closed-source models such as Falcon Arabic and GPT-4o Mini do not expose their internals, which prevents the direct application of white-box techniques. Moreover, adversarial examples crafted for one open-source model are often specific to that model and may not generalize to others, especially those based on different architectures. This creates a challenge for fair evaluation. To address this issue, we adopt the cross-model attack approach recommended in [22], which generates adversarial samples designed to transfer across different models. This strategy supports a more consistent and fair assessment of vulnerabilities in both open- and closed-source systems. Therefore, in this experiment, we fine-tuned DarijaBERT [23], a BERT-based model trained specifically on a Darija corpus, making it well aligned with our dataset and particularly suitable for this text classification task. In addition, this model has shown promising results in the literature [17]. Using the fine-tuned DarijaBERT, we generated adversarial samples based on its gradients. The model achieved strong performance on our dataset, reaching a classification F1-score of approximately 86%.

To identify the two most important words in each sentence, we adopted two approaches: an attention-based approach and a gradient-based approach. The attention-based approach extracts important words by analyzing attention weights from the model’s layers. Specifically, it extracts attention weights from the last layer [23] and averages them across all attention heads to compute a single attention score for each token. These scores indicate how strongly the model attends to each token when making predictions, thereby serving as a proxy for word importance. Because attention weights are obtained directly from the forward pass without computing gradients, this method is computationally lightweight [24] while still capturing salient tokens in the sentence. In contrast, the gradient-based approach identifies important words by computing the gradients of the model’s output with respect to the input tokens. These gradients quantify how much a small change in each token affects the model’s predictions. The process involves a forward pass to obtain the model’s output and a backward pass to compute gradients. The absolute values of these gradients are then used as importance scores [19]. This approach directly measures each token’s contribution to the model’s decision, making it particularly interpretable for classification tasks. However, it is computationally more expensive than attention-based methods because it requires a backward pass. As shown in Table 6, adopting this approach allowed us to identify, for each sentence, two high-impact words along with their importance scores.
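The attention-based scoring can be illustrated with a small, dependency-free sketch. It assumes that the last-layer attention is available as a nested list indexed as `[head][query][key]`, and that a token's score is the average attention it receives across all query positions; the second point is an assumption, since other reductions (for example, using only the classification token's row) are equally plausible.

```python
def attention_token_scores(last_layer_attention):
    """Average attention over heads, then over query positions.

    last_layer_attention[h][q][k] is the attention weight from query
    position q to key position k in head h (assumed layout).
    """
    n_heads = len(last_layer_attention)
    seq_len = len(last_layer_attention[0])
    # Step 1: average across all attention heads.
    mean_att = [
        [sum(last_layer_attention[h][q][k] for h in range(n_heads)) / n_heads
         for k in range(seq_len)]
        for q in range(seq_len)
    ]
    # Step 2 (assumption): a token's score is the mean attention it receives.
    return [sum(mean_att[q][k] for q in range(seq_len)) / seq_len
            for k in range(seq_len)]


def top_k_tokens(tokens, scores, k=2):
    """Return the k highest-scoring tokens with their scores."""
    ranked = sorted(zip(tokens, scores), key=lambda p: p[1], reverse=True)
    return ranked[:k]


head0 = [[0.2, 0.3, 0.5], [0.1, 0.1, 0.8], [0.3, 0.3, 0.4]]
head1 = [[0.2, 0.3, 0.5], [0.3, 0.1, 0.6], [0.1, 0.5, 0.4]]
scores = attention_token_scores([head0, head1])
top_two = top_k_tokens(["a", "b", "c"], scores)
```

In a real pipeline, special tokens would be excluded and subword scores merged back into word-level scores before ranking.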

For character-level attacks, after extracting the important words, we applied the TextBugger algorithm [25], which generates adversarial examples using four character-level modifications: swapping adjacent letters, substituting characters, deleting characters, or inserting random letters. We also used the HotFlip approach [26], which uses gradients to identify and flip the character whose replacement maximally increases the model’s loss. For each white-box character-level attack, four candidate sentences were generated using both attention-based and gradient-based methods. From these candidates, we selected the one that successfully misled our fine-tuned DarijaBERT model. To maximize ASR, we combined the successful adversarial samples generated by both approaches, resulting in a more robust evaluation of our candidate LLMs (as shown in Table 13). An example of this attack is presented in Table 7.
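The four character-level modifications described above can be sketched as follows. This is a simplified illustration and not the TextBugger implementation: edit positions are chosen at random rather than guided by the model, and the Latin alphabet and function names are assumptions.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: Latin-script input


def swap_adjacent(word, rng):
    """Swap two adjacent characters at a random position."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def substitute_char(word, rng, alphabet=ALPHABET):
    """Replace one character with a different one from the alphabet."""
    if not word:
        return word
    i = rng.randrange(len(word))
    choices = [c for c in alphabet if c != word[i]]
    return word[:i] + rng.choice(choices) + word[i + 1:]


def delete_char(word, rng):
    """Delete one character at a random position."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def insert_char(word, rng, alphabet=ALPHABET):
    """Insert one random character at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(alphabet) + word[i:]


def character_candidates(word, seed=0):
    """Generate one candidate per modification type (four in total)."""
    rng = random.Random(seed)
    return [swap_adjacent(word, rng), substitute_char(word, rng),
            delete_char(word, rng), insert_char(word, rng)]


candidates = character_candidates("offensive")
```

Each candidate would then be substituted into the sentence and passed to the surrogate classifier, keeping only candidates that change its prediction.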

For word-level attacks, we used GPT-4o Mini to generate synonyms for the two most important words in each sentence. As illustrated in Table 8, the process involved several key steps. First, we identified the part of speech (POS) of each important word using GPT. Next, we generated 15 synonyms for each word, ensuring that they matched the original word’s POS. Each synonym was then used to replace the original word, producing multiple candidate adversarial sentences. To maintain semantic integrity, we retained only those candidates with a semantic similarity score above 75%, as discussed further in the next section. Finally, we evaluated these modified sentences using the fine-tuned DarijaBERT model and kept only those that successfully misled the model.

For sentence-level attacks, we employed GPT-4o Mini to rephrase the entire sentence while preserving its original meaning and grammatical structure. To ensure semantic alignment, we retained only candidates with a semantic similarity score above 75% as valid adversarial examples. An example of this attack is presented in Table 9.

To measure the similarity between the original sentences and their adversarial versions, we used two methods. First, for semantic-based similarity, we relied on word embeddings, which have proven effective in capturing both semantic and syntactic relationships across various domains [27,28]. We used the Arabic-BERT model [29] to encode sentences into vectors (embeddings) and then computed the cosine similarity between the original and adversarial sentence embeddings. The second approach was manual verification. Since Moroccan Darija is a low-resource language, automated metrics may not fully capture meaning preservation [27]. Therefore, we asked two native Moroccan speakers to compare each original sentence with its adversarial version and assign a label using the following scale: 0 = meaning is lost; 1 = meaning is partially preserved but slightly distorted; 2 = meaning is fully preserved. We then introduced the Normalized Average Similarity Score (NASS), defined in Equation (7), to calculate the average similarity for each adversarial dataset.

\mathrm{NASS} = \frac{1}{3N} \sum_{i=0}^{N-1} \sum_{j=1}^{n} x_{i,j}

(7)

n: number of labelers (in our case, 2)
N: number of samples
x_{i,j}: label assigned to sample i by labeler j
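Both similarity checks can be expressed in a few lines. The sketch below assumes that sentence embeddings are already available as plain lists of floats (the encoder itself is omitted), applies the 75% threshold used for word- and sentence-level candidates, and implements Equation (7) with the 1/(3N) factor taken verbatim from the equation. Function names are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def keep_similar(original_vec, candidates, threshold=0.75):
    """Keep candidate texts whose embedding is close to the original.

    candidates is a list of (text, embedding) pairs.
    """
    return [text for text, vec in candidates
            if cosine_similarity(original_vec, vec) > threshold]


def nass(labels):
    """Equation (7): labels[i][j] is the 0-2 label given to sample i
    by labeler j; the sum is divided by 3N as written in the equation."""
    n_samples = len(labels)
    return sum(sum(row) for row in labels) / (3 * n_samples)


kept = keep_similar([1.0, 0.0], [("a", [1.0, 0.1]), ("b", [0.5, 1.0])])
score = nass([[1, 1], [2, 1], [1, 0]])
```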

3.4. Adversarial Training for Robust Model Building

Adversarial training is a technique designed to improve a model’s performance, robustness, and reliability when faced with diverse and potentially challenging user inputs [30]. In this study, our objective was to develop a compact yet robust model capable of accurately recognizing both explicit and implicit offensive content in Moroccan Darija. The model is intended for deployment in real-world applications and must demonstrate resilience to various forms of input variation, including typographical errors, intentional perturbations, and diverse linguistic expressions that users may employ to circumvent content detection systems. Therefore, following the initial fine-tuning and evaluation of multiple models, the top-performing LLMs that demonstrated the highest robustness were selected for an additional fine-tuning phase using an augmented training dataset that strategically combined the original training data with a newly generated adversarial dataset. The adversarial dataset was created by applying diverse attack techniques, including both black-box and white-box attacks, as detailed in Table 4 and Table 5, directly to the original training set in order to simulate realistic adversarial conditions that the model may encounter during deployment. This comprehensive approach exposes the model to a wide range of challenging inputs during fine-tuning, thereby improving its generalization capabilities and resilience to adversarial manipulations [31] while maintaining accuracy in detecting offensive content in Moroccan Darija.
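The construction of the augmented training set can be sketched as follows. This is a simplified illustration under stated assumptions: attacks are represented as plain text-to-text functions, adversarial samples inherit the label of their source sentence, and the semantic similarity filtering step is omitted for brevity. The two toy attacks and the sample texts are placeholders.

```python
def build_augmented_training_set(original, attacks):
    """Combine original samples with their adversarial counterparts.

    original: list of (text, label) pairs.
    attacks: dict mapping an attack name to a function text -> text.
    """
    augmented = list(original)
    for text, label in original:
        for attack in attacks.values():
            adversarial = attack(text)
            # Skip perturbations that leave the text unchanged.
            if adversarial != text:
                augmented.append((adversarial, label))
    return augmented


toy_attacks = {
    "dots": lambda t: " ".join(".".join(w) for w in t.split()),
    "no_spaces": lambda t: "".join(t.split()),
}
train = build_augmented_training_set([("ab cd", 1), ("e", 0)], toy_attacks)
```

Keeping the original label relies on the perturbation preserving meaning, which is why similarity filtering matters in the full pipeline.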

4. Results and Discussion

4.1. Preliminary Results

To gain a comprehensive understanding of LLM performance in low-resource languages, particularly for offensive language detection in Darija, we fine-tuned several open-source and closed-source models, including general-purpose, reasoning, and Arabic-focused LLMs. This experiment serves both as a baseline for evaluating task complexity and as an initial benchmark for LLM performance on this underexplored challenge. The results are summarized in Table 10.

As presented in Table 10, the results reveal several clear trends across the evaluated models. Older and smaller models, such as LLaMA 2 (7B and 13B), performed poorly, often near random chance. Many models exhibited a bias toward a single class, resulting in either low precision or low recall. For example, LLaMA 3.1 (8B) achieved 53% precision, favoring the offensive class. Newer and larger models, including LLaMA 3.2 (74% F1-score) and Mistral Small 3 (24B, 81% F1-score), showed improved performance and reduced bias, but they remained imperfect and sometimes over-classified sentences as offensive. Across all models, the linguistic complexity of Moroccan Arabic presented a significant challenge, as the same words can appear in both offensive and non-offensive contexts, leading to inconsistent predictions.

GPT-4o Mini had a clear edge thanks to its large-scale pretraining, moderation-focused fine-tuning, and ability to handle multiple languages. Falcon Arabic and ArabianGPT also achieved notable results, with F1-scores of 81% and 80%, respectively. Their precision and recall scores were approximately balanced, indicating that the models effectively distinguished offensive from non-offensive text. This strong performance can be attributed to their fine-tuning on large-scale Arabic corpora. Similarly, AtlasChat achieved slightly better results, likely due to its fine-tuning on Moroccan Darija. This outcome underscores the effectiveness of adapting LLMs to dialect-specific corpora. Fine-tuning on such tailored datasets allows models to learn the nuanced morphological patterns unique to the dialect.

4.2. Robustness Analysis of Models Against Adversarial Attacks

Following the preliminary results, we selected the four best-performing models, namely AtlasChat, ArabianGPT, Falcon Arabic, and GPT-4o Mini, for further testing beyond traditional evaluation metrics. The goal was to assess their robustness in real-world scenarios, where user inputs can vary significantly and may be affected by different types of perturbation.

4.2.1. Models’ Performance on Black-Box Adversarial Attacks

To prevent bias in the performance evaluation, we first measured the semantic similarity between each original sentence and its generated adversarial counterpart. Table 11 presents the average semantic similarity results, computed using the embedding-based method, for all generated black-box attack samples. The results show that the adversarial datasets consistently preserved the meaning of the original sentences. The average semantic similarity scores remained above 0.80 across all attacks, with low standard deviation. In addition, the NASS manual validation further confirms the high semantic similarity across the different attacks, with scores consistently above 0.81, confirming that the attacks do not introduce significant semantic bias.

As shown in Table 12, the outcomes of the robustness tests show that the attacks were able to mislead the models, with some perturbations outperforming others. Under the neutral suffix attack (B.B.D.1), GPT-4o Mini demonstrated the highest robustness, maintaining its original F1-score of 0.88 with no loss in precision (0.88) or recall (0.88), and achieving the lowest ASR of 11.58%. ArabianGPT and Falcon Arabic maintained approximately the same F1-score. AtlasChat, despite being fine-tuned on Moroccan Darija, recorded the highest ASR (18.36%).

The space insertion attack (B.B.D.2) caused a notable decline in model performance. AtlasChat showed a 4% drop in F1-score and a high ASR of 23.20%, with reduced precision (0.71), indicating an increase in false positives. ArabianGPT and Falcon Arabic also experienced performance drops of 3% and 2%, respectively. In contrast, GPT-4o Mini demonstrated greater robustness, achieving the lowest ASR (14.11%). These results suggest that the attack exploits tokenization weaknesses, particularly in models with less exposure to informal or morphologically complex text. This effect is especially relevant in dialects such as Darija, where many words already exhibit non-standard spacing and morphology. The higher ASR observed for AtlasChat, ArabianGPT, and Falcon Arabic may reflect their more limited exposure to non-standard token patterns during fine-tuning.

The deleting-spaces-between-words attack (B.B.D.3) caused a modest decline in performance across all four models. These results suggest that, although the structural distortion introduced by space deletion slightly affects performance, robust models such as GPT-4o Mini, Falcon Arabic, and AtlasChat retain relatively stable recall, indicating resilience in identifying offensive content despite input irregularities.

The adding-dots-between-characters attack (B.B.D.4) led to a 4% decrease in AtlasChat’s F1-score, with precision dropping to 0.71 and recall remaining high at 0.87. This disparity reflects a tendency to over-identify offensive content, resulting in a higher false positive rate. Similarly, ArabianGPT experienced a 3% performance drop, with precision rising to 0.88 and recall falling to 0.68, suggesting a substantial increase in false negatives. GPT-4o Mini again proved relatively robust, achieving the lowest ASR of 14.34%. These outcomes highlight how minor character-level perturbations can disrupt token-level understanding, particularly for models not extensively fine-tuned on noisy or dialectal text. The results further emphasize GPT-4o Mini’s consistency under adversarial conditions.

The swapping-letters-with-adjacent-keyboard-characters attack (B.B.D.5) simulates realistic human typing errors and presents a significant challenge to all models. AtlasChat’s F1-score dropped by 4%. ArabianGPT and Falcon Arabic also experienced a 3% drop in F1-score. Interestingly, GPT-4o Mini showed the most substantial decline, with a 10% drop in F1-score. All four models demonstrated clear vulnerability to this attack, as reflected by the high ASR, which ranged from 17.91% to 22.30%. These findings indicate that even minor character-level transpositions, which are common in real-world inputs, can significantly impair model performance, underscoring the need for more resilient embedding strategies.

The random-noise-injection attack (B.B.D.6) resulted in a moderate decline in model performance. The attack achieved non-trivial ASR values, ranging from 13.38% to 19.29%, indicating that all models are vulnerable to this type of noise-based perturbation. These results highlight the limitations of current embedding and input normalization strategies when faced with subtle character-level inconsistencies.

The repeating-vowels attack (B.B.D.7) caused minor performance degradation across all tested models. This attack achieved relatively high ASR values, ranging from 10.27% to 19.71%. These results underscore the models’ vulnerability to phonologically plausible noise and reaffirm the need to incorporate robust character-level preprocessing and data augmentation strategies to mitigate adversarial effects in real-world user inputs.

The robustness evaluation of AtlasChat, ArabianGPT, Falcon Arabic, and GPT-4o Mini across various black-box attacks revealed critical vulnerabilities. While GPT-4o Mini and Falcon Arabic consistently demonstrated the highest F1-scores, they, like the other models, still experienced notable performance degradation under attack, with ASR values ranging from 11% to 23%.

4.2.2. Models’ Performance on White-Box Adversarial Attacks

As presented in Table 13, the adversarial attacks generated using the white-box approach achieved a high ASR against the transformer-based DarijaBERT model. This indicates that the adversarial samples were effectively crafted to challenge the model’s robustness under strong, targeted perturbations. Despite the high ASR, the generated adversarial samples remained semantically close to the original inputs. This is shown in Table 14, which reports the semantic similarity scores of the white-box-generated samples; all scores exceeded 75%, confirming that the core meaning of the original sentences was preserved. Furthermore, human evaluation using the NASS method supports these findings, reinforcing the semantic validity and overall quality of the adversarial examples.

Table 13. Average attack success rate (ASR) of DarijaBERT for each character-level attack.

Approach\ASR	W.B. 1	W.B. 2	W.B. 3	W.B. 4
Attention	23.90%	27.73%	29.25%	28.12%
Gradient	32.56%	39.99%	42.86%	42.01%
Combined	35.45%	40.28%	42.86%	44.35%

As shown in Table 15, adversarial attacks consistently reduced performance, with larger drops observed under white-box attacks compared with black-box attacks. For example, black-box random noise caused only modest reductions of around 1% to 2%, while targeted white-box attacks led to more substantial drops, up to 13%, accompanied by higher ASR, ranging roughly from 8% to 13%. This greater vulnerability in the white-box setting can be attributed to the targeted perturbation of key offensive terms. Because these words are central to the model’s classification of offensive content, any distortion can mislead the model into classifying the input as non-offensive. This is reflected in the increase in false negatives and the decline in recall across all models.

As shown in Figure 3, the sentence-level attack that rephrases the entire sentence (W.B.D.6) resulted in the greatest model degradation, ranging from 1% to 16%, as well as the highest ASR, ranging from 12% to 30% across all evaluated models. These results underscore a critical weakness in the candidate LLMs, especially ArabianGPT and AtlasChat: their limited ability to generalize beyond surface-level lexical patterns. This vulnerability suggests that the models are not effectively leveraging rich morphological and semantic representations of language. Rather than relying on broader linguistic knowledge, such as the ability to recognize synonyms, antonyms, and syntactic rephrasing, the fine-tuned models appear to depend heavily on memorized frequent word patterns from the training data. As a result, offensive classification becomes overly dependent on the presence of specific high-frequency tokens. When these tokens are rephrased, substituted, or slightly perturbed, the models often fail to recognize the underlying offensive intent, leading to a significant increase in false negatives. This is reflected in recall drops of 32%, 28%, and 29% for AtlasChat, ArabianGPT, and GPT-4o Mini, respectively, indicating that these models struggle to generalize beyond their training distribution. This is a major limitation for real-world robustness, where language is diverse and highly variable. In contrast, Falcon Arabic achieved the lowest ASR, indicating greater robustness against this type of attack. This result may be attributed to the large Arabic and dialectal Arabic corpora used to train the model, which likely helped it capture semantic relationships between words and phrases, including synonymous expressions.

As shown in Table 15, although DarijaBERT is highly susceptible to these attacks (ASR reaching 44.35% under combined white-box attacks, as shown in Table 13), the transferred attacks had a moderate impact on the target LLMs. For example, W.B.D.1 achieved ASR values of 23.11% on AtlasChat, 15.52% on ArabianGPT, 25.81% on Falcon Arabic, and 14.90% on GPT-4o Mini. This indicates that, although the target models are more robust than the surrogate model, a significant portion of attacks remains effective. This finding demonstrates cross-model vulnerability and highlights the importance of evaluating robustness under transferred adversarial scenarios. Overall, these results provide insight into how perturbations learned from a low-resource dialect model can generalize to larger, multilingual, or closed-source LLMs, emphasizing both the challenges and the value of surrogate-based attack strategies.

Despite claims from leading AI companies, our results highlight that most LLMs performed suboptimally when applied to low-resource languages, raising important questions about their reliability in real-world applications. The robustness of the top four models is also questionable, particularly given that adversarial testing was conducted on only 50% of the test set. Overall, the results indicate that these models remain vulnerable and struggle to learn the linguistic nuances of the Darija dialect. Findings from both black-box and white-box attacks underscore the need to develop larger, higher-quality corpora and dialect-specific lexicons. Such resources would help language models better capture the structure, patterns, and morphology of each dialect, leading to more reliable and robust systems.

Among the four models, GPT-4o Mini achieved the best performance in robustness testing, followed by Falcon Arabic and ArabianGPT. Falcon Arabic demonstrated performance comparable to GPT-4o Mini, underscoring the importance of using representative datasets when training LLMs. Interestingly, although AtlasChat was fine-tuned on a Moroccan Arabic dataset, it achieved the weakest robustness results in detecting offensive content in Darija. This outcome may be attributed to the quality and representativeness of the dataset used during fine-tuning.

Adversarial training is widely recommended in the literature as a method for improving LLM performance in low-resource languages such as Moroccan Arabic. In this study, we aimed to evaluate the effectiveness of this approach. To this end, we selected GPT-4o Mini and ArabianGPT, as they demonstrated the strongest robustness results, for an additional round of fine-tuning using the adversarial dataset we developed. Our goal is to assess model robustness and reliability for deployment in real-world applications, where the system is typically accessed through APIs or user interfaces. In such settings, users cannot inspect or modify the internal model, making black-box attacks the only realistic option. Therefore, we used only black-box attacks to evaluate model resilience after adversarial training.

As shown in Table 16, the adversarial training process improved ArabianGPT’s performance by 9% in terms of F1-score. Notably, recall values also increased across all adversarial attacks, indicating an enhanced ability to detect offensive sentences effectively. Similarly, GPT-4o Mini improved by 3%, achieving a remarkable F1-score of 91%. This represents the highest performance reported to date for offensive language detection in Moroccan Arabic and surpasses the current state-of-the-art results reported for Standard Arabic (Table 1) by 6%.

Furthermore, our model demonstrated substantial gains across various types of adversarial attacks. As shown in Table 17, robustness against character-level and word-level attacks improved by 10.74% and 39.61%, respectively. The most notable improvement was observed for suffix addition attacks, with an 83.77% gain. Other perturbations, such as inserting dots between letters and inserting spaces between letters, yielded robustness gains of 19.22% and 14%, respectively. Modifications involving random noise and single-character alterations improved robustness by 8.53% and 3.88%, respectively. In addition, the model better handled the deletion of spaces between words and vowel repetition, with improvements of 7.24% and 4.82%. For sentence-level attacks, we achieved a 26.67% increase in robustness.

Despite the complex morphology and structure of Darija, adversarial training significantly enhanced the model’s robustness. These results highlight the importance of using adversarial training approaches in dialectal settings to effectively address linguistic variability and data scarcity, and to handle dataset biases as well as the noise and spelling variations common on social media platforms. Our methodology also yielded substantial improvements over existing ASR benchmarks reported in the literature. Specifically, our model consistently demonstrated enhanced robustness, with gains ranging from 4% to 80%.

The overall results underscore a critical insight: even when an LLM demonstrates strong performance on general benchmarks, this does not necessarily translate into robustness or reliability in dialect-specific tasks. In particular, high scores on standard evaluations often fail to reflect a model’s ability to generalize across linguistic variation. A key limitation lies in the data used for training; models require exposure to large-scale, diverse, and most importantly, representative datasets in the target dialect. Current fine-tuning approaches rely predominantly on labeled datasets or text extracted from books and dialogues. While valuable, these sources are typically structured and clean; as a result, models are not exposed to noisy or perturbed text during training. Moreover, linguistic resources such as text corpora, lexicons, WordNets, and ontologies are either extremely limited or entirely absent for low-resource languages like Moroccan Darija.

Furthermore, an often-overlooked challenge in dialectal language modeling is tokenization. Most existing tokenization algorithms (e.g., Byte Pair Encoding or WordPiece) are optimized for MSA or other high-resource languages. These tokenizers often fail to segment words or subwords meaningfully in dialects, resulting in fragmented representations that obscure the language’s underlying structure. Custom or adaptive tokenization strategies, informed by dialect-specific morphological rules, are therefore essential for preserving semantic and syntactic integrity during preprocessing. Additionally, embedding techniques play a crucial role in capturing the nuances of dialectal languages. Traditional word embeddings trained on Standard Arabic are inadequate for representing the rich variability of Darija. Contextual embeddings, such as those produced by transformer-based models (e.g., AraBERT or CAMeLBERT), offer improvements; however, even these models tend to underperform when faced with dialectal input on which they were not trained. Therefore, pretraining embeddings on dialect-specific corpora or leveraging multi-dialectal pretraining strategies is essential for developing more robust and representative language models.

Truly teaching an LLM to understand a language goes beyond memorizing surface-level patterns or associations. It requires the model to grasp the deeper sociomorphological, syntactic, and semantic characteristics of the language, elements that are typically captured and formalized through rich linguistic resources, dialect-aware tokenization methods, and embedding strategies tailored to the unique characteristics of dialectal Arabic [34]. Therefore, future research should prioritize the creation and enrichment of foundational knowledge bases for dialectal Arabic. Building comprehensive linguistic infrastructures, such as annotated corpora and lexical databases, is essential before compact models can be expected to generalize effectively. Only through such groundwork can we enable the development of robust and reliable LLMs tailored to dialectal varieties such as Moroccan Darija.

In addition to robustness considerations, fairness remains an important challenge in offensive language detection for dialectal Arabic. Addressing fairness-related biases requires fairness-aware evaluation frameworks and mitigation strategies that explicitly control the statistical dependence between predictions and sensitive attributes. Recent work has proposed information-theoretic approaches that introduce a differentiable mutual-information regularizer during training, encouraging model predictions to remain independent of protected attributes while preserving task-relevant information [35,36]. Such techniques provide a principled way to reduce reliance on demographic proxies. Since the present study focuses on robustness against lexical and semantic adversarial perturbations, integrating fairness-aware regularization is left for future work, where dedicated bias detection and mitigation techniques can be explored.

5. Conclusions

Previous research mainly focused on transformer-based embedding approaches and deep-learning classifiers to detect offensive content in Darija. We built upon this work by testing several state-of-the-art LLMs, incorporating general multilingual LLMs as well as Arabic-specific ones to gain insight into their ability to generalize in a low-resource dialect scenario. To improve data quality and representativeness, we created a new dataset from existing datasets by standardizing preprocessing and labeling procedures across sources. We also limited bias and improved label quality by manually re-annotating the entire dataset. Only samples that could be clearly labeled as offensive or non-offensive were retained. This careful selection helps the models learn and distinguish the underlying patterns of each class more effectively.

While a few LLMs were multilingual, the preliminary results were suboptimal because they had not been trained on Arabic text, as is the case with LLaMA and Mistral. However, GPT-4o Mini, ArabianGPT, Falcon Arabic, and AtlasChat demonstrated promising results, achieving F1-scores above 80%. This performance can be attributed to their training on Arabic content, which enabled them to outperform the other models. We further tested the four best-performing models using black-box and white-box adversarial attacks to examine their robustness. Results showed considerable vulnerabilities across all candidate models. In particular, the models tended to misclassify offensive content as non-offensive, a failure that went unnoticed under regular evaluation metrics. GPT-4o Mini achieved the best result, with the lowest ASR, followed by Falcon Arabic, ArabianGPT, and finally AtlasChat.

Although large language models have made great strides, they still exhibit robustness vulnerabilities, which raises important concerns for their use in practice, particularly for low-resource languages. Fairness issues also negatively affect model performance and dependability. Additional empirical studies are therefore necessary to address these issues and begin to close the gap.

To enhance LLM robustness, we adopted adversarial training, an approach widely recommended in the literature. Accordingly, we further fine-tuned GPT-4o Mini and ArabianGPT on our dataset, which combined original sentences with their adversarial counterparts. The dataset was carefully designed to ensure representativeness and included adversarial examples generated from all identified attack types. This approach significantly improved model performance and robustness, yielding an F1-score of 91% and surpassing the current SOTA result by 6%. Furthermore, it reduced the ASR by an average of 32%. Although these results may not be directly generalizable to all LLMs, they provide valuable insight into the potential effectiveness of similar approaches for models with comparable architectures and linguistic foundations.
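
As a rough illustration of how original sentences can be combined with their adversarial counterparts, the sketch below builds an augmented training set in which every perturbed sentence keeps the label of its source. The function name, the toy attacks, and the one-copy-per-attack mixing ratio are assumptions made for this example; the mixing proportions actually used in the study are not specified in this section.

```python
import random


def build_adversarial_training_set(samples, attacks, seed=0):
    """Combine original samples with one adversarial copy per attack.

    samples: list of (text, label) pairs.
    attacks: dict mapping an attack name to a callable text -> text
             (hypothetical stand-ins for the B.B.D / W.B.D generators).
    Returns a shuffled list of (text, label, source) triples.
    """
    rng = random.Random(seed)
    combined = [(text, label, "original") for text, label in samples]
    for name, attack in attacks.items():
        for text, label in samples:
            perturbed = attack(text)
            if perturbed != text:  # skip attacks that left the text unchanged
                combined.append((perturbed, label, name))
    rng.shuffle(combined)
    return combined


# Toy attacks, loosely modeled on "adding suffix" and "deleting spaces".
toy_attacks = {
    "add_suffix": lambda t: t + " ok",
    "delete_space": lambda t: t.replace(" ", "", 1),
}
toy_samples = [("a b", 1), ("x", 0)]
augmented = build_adversarial_training_set(toy_samples, toy_attacks)
for row in augmented:
    print(row)
```

Keeping the attack name alongside each sample makes it easy to check afterwards that every attack type is represented in the training data.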

Beyond these findings, the work points to broader lessons for NLP in low-resource dialects: careful dataset curation, rigorous labeling, and targeted fine-tuning can make a real difference in model performance and reliability. While our focus was Moroccan Arabic, these insights can inform work on other low-resource languages and dialects as well.

A future direction of this work is the construction of a larger and more representative dataset specifically tailored to Darija. We will also extend the robustness evaluation to additional adversarial settings, including fairness-oriented bias evaluation through sensitive attribute substitutions (e.g., gender or nationality). Finally, we plan to develop an LLM designed explicitly for this dialect and equipped with robustness mechanisms to handle the linguistic complexity and variability inherent in Moroccan Arabic.

Author Contributions

Conceptualization, A.M. and E.H.N.; methodology, S.O., K.R., A.M. and E.H.N.; software, S.O. and K.R.; validation, A.M., E.H.N. and S.E.G.; formal analysis, S.O., K.R., A.M. and E.H.N.; investigation, S.O. and K.R.; resources, A.M.; data curation, K.R. and A.M.; writing—original draft preparation, S.O. and K.R.; writing—review and editing, S.O., K.R., A.M. and E.H.N.; visualization, S.O. and K.R.; supervision, A.M., E.H.N. and S.E.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Al Akhawayn University’s seed money grant under contract number 92567.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in Mendeley Data at https://data.mendeley.com/datasets/2y4m97b7dc/4 (accessed on 25 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Qu, X.; Gu, Y.; Xia, Q.; Li, Z.; Wang, Z.; Huai, B. A survey on Arabic named entity recognition: Past, recent advances, and future trends. IEEE Trans. Knowl. Data Eng. 2023, 36, 943–959.
2. Farghaly, A.; Shaalan, K. Arabic natural language processing: Challenges and solutions. ACM Trans. Asian Lang. Inf. Process. 2009, 8, 1–22.
3. Younes, J.; Souissi, E.; Achour, H.; Ferchichi, A. Language resources for Maghrebi Arabic dialects’ NLP: A survey. Lang. Resour. Eval. 2020, 54, 1079–1142.
4. Yang, J.; Jin, H.; Tang, R.; Han, X.; Feng, Q.; Jiang, H.; Zhong, S.; Yin, B.; Hu, X. Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond. ACM Trans. Knowl. Discov. Data 2024, 18, 1–32.
5. Filippi, S.; Motyl, B. Large language models in engineering education: A systematic review and suggestions for practical adoption. Information 2024, 15, 345.
6. Nejjar, M.; Zacharias, L.; Stiehle, F.; Weber, I. LLMs for science: Usage for code generation and data analysis. J. Softw. Evol. Process. 2025, 37, e2723.
7. Kansal, K.; Pundir, A.; Nigam, S. Adversarial robustness in NLP: A study on Indian low-resource languages. In Proceedings of the 5th Asian Conference on Innovation in Technology (ASIANCON 2025), Pimpri, India, 22–23 August 2025; pp. 1–6.
8. Xing, X.; Jin, Z.; Jin, D.; Wang, B.; Zhang, Q.; Huang, X. Tasty burgers, soggy fries: Probing aspect robustness in aspect-based sentiment analysis. In Proceedings of the EMNLP 2020 Conference, Online, 16–20 November 2020; pp. 3594–3605.
9. He, J.; Wang, L.; Wang, J.; Liu, Z.; Na, H.; Wang, Z.; Wang, W.; Chen, Q. Guardians of discourse: Evaluating LLMs on multilingual offensive language detection. In Proceedings of the IEEE Smart World Congress (SWC 2024), Denarau Island, Fiji, 2–7 December 2024; pp. 1603–1608.
10. Wiedemann, G.; Yimam, S.M.; Biemann, C. Fine-tuning pre-trained transformer networks for offensive language detection. In Proceedings of the Fourteenth Workshop on Semantic Evaluation (SemEval 2020), Barcelona, Spain, 12–13 December 2020; pp. 1638–1644.
11. Zampieri, M.; Rosenthal, S.; Nakov, P.; Dmonte, A.; Ranasinghe, T. OffensEval 2023: Offensive language identification in the age of large language models. Nat. Lang. Eng. 2023, 29, 1416–1435.
12. Casula, C.; Tonelli, S. Generation-based data augmentation for offensive language detection: Is it worth it? In Proceedings of the EACL 2023 Conference, Dubrovnik, Croatia, 2–6 May 2023; pp. 3359–3377.
13. Yang, Y.; Kim, J.; Kim, Y.; Ho, N.; Thorne, J.; Yun, S.-Y. HARE: Explainable hate speech detection with step-by-step reasoning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 5490–5505.
14. Zampieri, M.; Nakov, P.; Rosenthal, S.; Atanasova, P.; Karadzhov, G.; Mubarak, H.; Derczynski, L.; Pitenis, Z.; Çöltekin, Ç. SemEval-2020 Task 12: Multilingual offensive language identification in social media. In Proceedings of the Fourteenth Workshop on Semantic Evaluation (SemEval 2020), Barcelona, Spain, 12–13 December 2020; pp. 1425–1447.
15. Mohaouchane, H.; Mourhir, A.; Nikolov, N.S. Detecting offensive language on Arabic social media using deep learning. In Proceedings of the 6th International Conference on Social Networks Analysis, Management and Security (SNAMS), Granada, Spain, 22–25 October 2019; pp. 466–471.
16. Essefar, K.; Ait Baha, H.; El Mahdaouy, A.; El Mekki, A.; Berrada, I. OMCD: Offensive Moroccan comments dataset. Lang. Resour. Eval. 2023, 57, 1745–1765.
17. Abdellaoui, I.; Ibrahimi, A.; El Bouni, M.A.; Mourhir, A.; Driouech, S.; Aghzal, M. Investigating offensive language detection in a low-resource setting with a robustness perspective. Big Data Cogn. Comput. 2024, 8, 170.
18. Chakraborty, A.; Alam, M.; Dey, V.; Chattopadhyay, A.; Mukhopadhyay, D. A survey on adversarial attacks and defenses. CAAI Trans. Intell. Technol. 2021, 6, 25–45.
19. Zhu, K.; Wang, J.; Zhou, J.; Wang, Z.; Chen, H.; Wang, Y.; Yang, L.; Ye, W.; Zhang, Y.; Gong, N.Z.; et al. PromptRobust: Evaluating the robustness of large language models on adversarial prompts. In Proceedings of the ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis (LAMPS 2024), Salt Lake City, UT, USA, 14 October 2024.
20. Wei, A.; Haghtalab, N.; Steinhardt, J. Jailbroken: How does LLM safety training fail? In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023; Available online: https://dL.acm.org/doi/10.5555/3666122.3669630 (accessed on 25 January 2026).
21. Wang, B.; Xu, C.; Wang, S.; Gan, Z.; Cheng, Y.; Gao, J.; Awadallah, A.H.; Li, B. Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models. arXiv 2021.
22. Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J.Z.; Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv 2023.
23. Gaanoun, K.; Naira, A.M.; Allak, A.; Imade, B. DarijaBERT: A step forward in NLP for the written Moroccan dialect. Int. J. Data Sci. Anal. 2024, 20, 917–929.
24. Zheng, Z.; Wang, Y.; Huang, Y.; Song, S.; Yang, M.; Tang, B.; Xiong, F.; Li, Z. Attention heads of large language models: A survey. arXiv 2024.
25. Li, J.; Ji, S.; Du, T.; Li, B.; Wang, T. TextBugger: Generating adversarial text against real-world applications. arXiv 2018.
26. Ebrahimi, J.; Rao, A.; Lowd, D.; Dou, D. HotFlip: White-box adversarial examples for text classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia, 15–20 July 2018; pp. 31–36.
27. Alian, M.; Awajan, A. Semantic similarity for English and Arabic texts: A review. J. Inf. Knowl. Manag. 2020, 19, 2050033.
28. Al Sulaiman, M.; Moussa, A.M.; Abdou, S.; Elgibreen, H.; Faisal, M.; Rashwan, M. Semantic textual similarity for modern standard and dialectal Arabic using transfer learning. PLoS ONE 2022, 17, e0272991.
29. Inoue, G.; Alhafni, B.; Baimukan, N.; Bouamor, H.; Habash, N. The interplay of variant, size, and task type in Arabic pre-trained language models. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Virtual, 9 April 2021; Available online: https://aclanthology.org/2021.wanlp-1.10 (accessed on 25 January 2026).
30. Zhao, W.; Alwidian, S.; Mahmoud, Q.H. Adversarial training methods for deep learning: A systematic review. Algorithms 2022, 15, 283.
31. Bai, T.; Luo, J.; Zhao, J.; Wen, B.; Wang, Q. Recent advances in adversarial training for adversarial robustness. arXiv 2021.
32. Gao, J.; Lanchantin, J.; Soffa, M.L.; Qi, Y. Black-box generation of adversarial text sequences to evade deep learning classifiers. In Proceedings of the IEEE Security and Privacy Workshops (SPW 2018), San Francisco, CA, USA, 24 May 2018; pp. 50–56.
33. Brown, H.; Lin, L.; Kawaguchi, K.; Shieh, M. Self-evaluation as a defense against adversarial attacks on LLMs. arXiv 2024.
34. Aghzal, M.; Mourhir, A. Distributional word representations for code-mixed text in Moroccan Darija. Procedia Comput. Sci. 2021, 189, 266–273.
35. Kang, J.; Xie, T.; Wu, X.; Maciejewski, R.; Tong, H. InfoFair: Information-theoretic intersectional fairness. arXiv 2021.
36. Incremona, A.; Pozzi, A.; Guiscardi, A.; Tessera, D. A differentiable and uncertainty-aware mutual information regularizer for bias mitigation. Neurocomputing 2025, 132, 498.

Figure 1. Claude Sonnet 3.5 prompt for labeling DarOLD samples.

Figure 2. Prompt for fine-tuning GPT-4o Mini.

Figure 3. Attack success rate on GPT-4o Mini, AtlasChat, and Arabian GPT for the offensive speech detection task in Moroccan Darija (B.B.D: black box attack, W.B.D: white box attack).

Table 1. Current state-of-the-art results in offensive language detection.

Reference	Language	Model	Results (F1-Score)
[8]	English	RoBERTa-Large	92.00%
[7]	Spanish	Mistral	87.70%
[7]	German	GPT-3	76.80%
[12]	Arabic	AraBERT	90.17%
	Danish	BERT	81.19%
	Greek	mBERT	85.22%
	Turkish	XLM-RoBERTa-base and XLM-RoBERTa-large models	82.58%
[15]	Moroccan Arabic	Darija RoBERTa	85%

Table 2. Dataset relabeling distribution.

Label Category	Original Count	Updated Count	Number of Changes	Percentage Changed
Offensive	7057	7505	288	4.08%
Non-Offensive	7057	6609	736	10.43%
Total	14,114	14,114	1024	7.25%
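
The figures in Table 2 can be cross-checked with a few lines of arithmetic. The sketch below assumes that "Number of Changes" counts the samples relabeled away from each class (for example, 288 samples originally labeled offensive became non-offensive); this reading is an assumption, but it reproduces the updated counts in the table. The function name is introduced here for illustration.

```python
def relabel_summary(original, changed):
    """Recompute updated class counts and change percentages for two classes.

    original: dict class -> count before re-annotation.
    changed:  dict class -> number of samples relabeled away from that class
              (assumed interpretation of "Number of Changes").
    """
    first, second = list(original)
    updated = {
        first: original[first] - changed[first] + changed[second],
        second: original[second] - changed[second] + changed[first],
    }
    pct_changed = {c: 100 * changed[c] / original[c] for c in original}
    total_pct = 100 * sum(changed.values()) / sum(original.values())
    return updated, pct_changed, total_pct


original_counts = {"Offensive": 7057, "Non-Offensive": 7057}
label_changes = {"Offensive": 288, "Non-Offensive": 736}

updated, pct_changed, total_pct = relabel_summary(original_counts, label_changes)
print(updated)
print({c: round(p, 2) for c, p in pct_changed.items()})
print(round(total_pct, 3))
```

Under this reading the per-class percentages match the table, and the overall share of relabeled samples comes out at roughly 7.25% to 7.26% depending on rounding.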

Table 3. Fine-tuning and prediction times across all LLMs.

Model	LLaMA	Gemma	Mistral	DeepSeek	GPT-4	Arabian-GPT	AtlasChat
Fine-tuning	2 h 30 min	2 h 45 min	3 h 21 min	1 h 10 min	35 min	4 h 20 min	6 h 42 min
Prediction	46 min	1 h 37 min	1 h 12 min	43 min	20 min	49 min	1 h 24 min

Table 4. Employed Black Box attacks.

Black Box Attacks	Type	Description
B.B.D.1	Adding suffix	Appending a neutral sentence to the original sentence.
B.B.D.2	Adding spaces	Adding spaces between letters in two selected words.
B.B.D.3	Deleting spaces	Deleting spaces between two words.
B.B.D.4	Adding dots	Adding dots between letters in two randomly selected words.
B.B.D.5	Swap with the adjacent	Replacing two letters with adjacent ones on the Arabic/French keyboard.
B.B.D.6	Random noise	Modifying two characters with random noise.
B.B.D.7	Repeating vowels	Adding extra vowel characters to a word by repeating them.
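
Several of the black-box perturbations in Table 4 are simple string transformations. The following sketch shows one plausible way to implement four of them (B.B.D.1 to B.B.D.4) and B.B.D.7 in plain Python. The function names, the word-selection strategy, and the vowel sets are assumptions made for illustration and are not taken from the authors' code.

```python
import random

# Assumed vowel sets: Arabic long vowels and basic Latin vowels (for Arabizi).
VOWELS = set("اوي" + "aeiou")


def add_suffix(text, suffix):
    """B.B.D.1: append a neutral sentence to the original sentence."""
    return f"{text} {suffix}"


def insert_between_letters(text, sep, k=2, seed=0):
    """B.B.D.2 (sep=' ') and B.B.D.4 (sep='.'): split up to k words by sep."""
    rng = random.Random(seed)
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 2]
    chosen = set(rng.sample(candidates, min(k, len(candidates))))
    return " ".join(sep.join(w) if i in chosen else w for i, w in enumerate(words))


def delete_space(text, seed=0):
    """B.B.D.3: merge two neighboring words by deleting the space between them."""
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    return " ".join(words[:i] + [words[i] + words[i + 1]] + words[i + 2:])


def repeat_vowels(text, times=2, seed=0):
    """B.B.D.7: repeat every vowel of one randomly chosen word `times` extra times."""
    rng = random.Random(seed)
    words = text.split()
    candidates = [i for i, w in enumerate(words) if any(c in VOWELS for c in w)]
    if not candidates:
        return text
    i = rng.choice(candidates)
    words[i] = "".join(c * (times + 1) if c in VOWELS else c for c in words[i])
    return " ".join(words)


sentence = "الطبيب ما فهمش الحالة"
print(insert_between_letters(sentence, "."))
print(delete_space(sentence))
print(repeat_vowels(sentence))
```

Fixing the random seed keeps the generated adversarial set reproducible, which matters when the same perturbed sentences are reused for both evaluation and adversarial training.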

Table 5. Employed White Box attacks.

White Box Attacks	Level	Approach Used	Replacement Method	Description
W.B.D.1	Character: swap	Attention and gradient	TextBugger	Swap two adjacent letters in the important word.
W.B.D.2	Character: substitute	Attention and gradient	HotFlip	Substitute a letter in the word with a random letter.
W.B.D.3	Character: delete	Attention and gradient	HotFlip	Delete a random letter from the word.
W.B.D.4	Character: insert	Attention and gradient	TextBugger	Insert a random letter in the word.
W.B.D.5	Word: synonym	Gradient	GPT-4o Mini	Replace the two important words with validated synonyms.
W.B.D.6	Sentence: paraphrase	GPT-4o Mini	GPT-4o Mini	Rephrase the whole sentence by replacing words with suitable synonyms.
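
The four character-level operations in Table 5 (W.B.D.1 to W.B.D.4) can be sketched as edits applied only to the words ranked as most important. The code below is a plain-Python illustration and does not use the TextBugger or HotFlip implementations; the important words are assumed to be supplied by a separate attention- or gradient-based ranking step, and all function names are hypothetical.

```python
import random


def swap_adjacent(word, rng):
    """Swap two adjacent letters (W.B.D.1)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def substitute_char(word, rng, alphabet):
    """Replace one letter with a different random letter (W.B.D.2)."""
    if not word:
        return word
    i = rng.randrange(len(word))
    choices = [c for c in alphabet if c != word[i]]
    if not choices:
        return word
    return word[:i] + rng.choice(choices) + word[i + 1:]


def delete_char(word, rng):
    """Delete one random letter (W.B.D.3)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def insert_char(word, rng, alphabet):
    """Insert one random letter at a random position (W.B.D.4)."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(alphabet) + word[i:]


def perturb_important_words(text, important_words, edit, seed=0, **kwargs):
    """Apply `edit` to every whitespace token that exactly matches an important word."""
    rng = random.Random(seed)
    targets = set(important_words)
    return " ".join(
        edit(w, rng, **kwargs) if w in targets else w for w in text.split()
    )


example = "the doctor left"
print(perturb_important_words(example, ["doctor"], swap_adjacent))
print(perturb_important_words(example, ["doctor"], substitute_char, alphabet="xyz"))
print(perturb_important_words(example, ["doctor"], delete_char))
print(perturb_important_words(example, ["doctor"], insert_char, alphabet="xyz"))
```

Because matching is done on whole whitespace-separated tokens, words with attached punctuation would need to be normalized before the important-word lookup.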

Table 6. The two most important words identified using the gradient-based approach.

Text	True Label	Predicted Label	Prediction Probability	Important Word 1	Score Word 1	Important Word 2	Score Word 2
كتستهزءي منو كيصبن علاش عبيب الا كيصبن خدام علي راسو الحماري متاكد من نفسها بحلا كتخترع شي حاجه	1	1	0.98	كتستهزءي mocking	0.11	كتخترع inventing	0.074
You are mocking him while he is working. Why are you making fun of him? He is doing his job seriously, but you act like you’re inventing things.	1	1	0.98	كتستهزءي mocking	0.11	كتخترع inventing	0.074
اوصيكم يا اخوتي اهل الرباط لا ترشحوا العثماني او الحزب العداله والتنميه	0	0	0.92	العداله Justice	0.11	والتنميه Development	0.11
I advise you, my brothers from Rabat, not to vote for Othmani or the Justice and Development Party.	0	0	0.92	العداله Justice	0.11	والتنميه Development	0.11
تعاليقكم تعجب سيد حازم راصو بيه تقريبا شي عمليه وشوفو وجهو وداك شي دايرلو طبيب وجا مع ذااك للبرنامج ايوا لعجب ونتوما كتظحكو فيه	1	1	0.88	كتظحكو laughing	0.17	للبرنامج show	0.13
Your comments are surprising about this guy; he has his head wrapped; he probably had some kind of operation. Just look at his face, and he still came to the show like that, and yet you’re laughing at him.	1	1	0.88	كتظحكو laughing	0.17	للبرنامج show	0.13

Table 7. Character-Level Perturbation Examples for Offensive Language Detection.

Original Sentence	Important Words	Generated Adversarial Sentence
الطبيب ما فهمش الحالة	الطبيب، الحالة	الطيب ما فهمش الحلة
The doctor did not understand the case.	doctor, case	The dctor did not understand the cse.
واش خاصني نمشي للمستشفى؟	نمشي، المستشفى	واش خاصني نمضي للكستشفى؟
Do I need to go to the hospital?	go, hospital	Do I need to gd to the hocpital.

Table 8. Word-Level Perturbation Examples for Offensive Language Detection.

Original Text	Predicted Label	Important Words	Semantically Similar Words	Adversarial Text	New Label
كتكلو رزق عبد الله وكيخرج فيكوم You are unlawfully benefiting from Abdellah’s resources, and this wrongdoing is backfiring on you, leading to negative consequences.	Not Offensive	الله God كيخرج results in negative consequences	الواحد، العالي،العزيز، ، اله،اللطيف، والهوى،الكبير، نبيل، الرزاق، السلام، المولى، الحي، كولشي، امين.	كتكلو رزق عبد العزيزوكيبهدل فيكوم	Offensive
			different names of God, the Mighty, the Greatest.	كتكلو رزق عبد العزيزوكيبهدل فيكوم
			تحت .below/كيفضحexpose/, كيدويtalk/, كيهضرspeak/, كيطعنstab/, كيهجمattack/, كيشنقstrangle/, كيصطادhunt/, كيدّيهاhandle/, كيبهدلhumiliate/, كيصفّيeliminate/, كيشرّفhonor/, كيخوّنbetray/, كيضربhit/, كيطيّحdrop/, كيشهرdefame/	You are unlawfully benefiting from Abdellaziz resources, and this wrongdoing is humiliating you.

Table 9. Sentence-Level Perturbation Examples for Offensive Language Detection.

Original Sentence	Generated Adversarial Sentence
كتخربق بزاف وكتقول كلام ماشي فمحلو You talk nonsense a lot and say things that make no sense.	الهضرة ديالك كلها تفاهة وما عندها حتى معنى Your talk is all nonsense and completely meaningless
هاد السيد ماشي مربي ومكيحترمش الناس This man is not well-mannered and doesn’t respect people.	تصرفات هاد الشخص كتدل باللي ما عندوش احترام للناس This person’s behavior shows a lack of respect for others.

Table 10. Results of offensive language detection in Moroccan Darija. Bolded values represent the highest performance scores.

Model	Accuracy	F1-Score	Precision	Recall
Llama 2 (7b)	0.50	0.50	0.49	0.51
Llama 2 (13b)	0.51	0.59	0.50	0.72
DeepSeek (R1 14b)	0.68	0.55	0.69	0.59
DeepSeek (R1 7b)	0.59	0.56	0.59	0.54
Llama 3.1 (8b)	0.62	0.62	0.53	0.75
Mistral 3 (7b)	0.66	0.64	0.71	0.72
Gemma 2 (9b)	0.75	0.69	0.85	0.58
Gemma 2 (2b)	0.62	0.70	0.56	0.92
DeepSeek (R1 8b)	0.70	0.71	0.67	0.76
Llama 3.2 (8b)	0.74	0.74	0.73	0.75
Arabian GPT (3b)	0.81	0.80	0.84	0.80
Mistral-small 3 (24b)	0.80	0.81	0.75	0.87
Falcon Arabic -	0.83	0.81	0.82	0.80
Atlas chat (2b)	0.82	0.82	0.80	0.84
GPT-4o Mini	0.88	0.88	0.88	0.88

Table 11. Semantic Similarity results using Arabic-BERT model (Method 1) and Normalized Average Similarity Score (Method 2) for Black box attacks. (Black Box Data: B.B.D).

Adversarial Dataset	Average Similarity	Standard Deviation	NASS
B.B.D.1	0.8590	0.1263	0.9759
B.B.D.2	0.8006	0.1568	0.9112
B.B.D.3	0.9591	0.0544	0.8111
B.B.D.4	0.8870	0.1104	0.9260
B.B.D.5	0.9087	0.0905	0.8542
B.B.D.6	0.9591	0.0544	0.9226
B.B.D.7	0.8791	0.0799	0.9794

Table 12. Atlas Chat, Arabian GPT, Falcon Arabic, and GPT-4o Mini performance under black box attacks (B.B.D: Black Box Data). Bolded values indicate the best performance.

Attack	Atlas Chat				Arabian GPT				Falcon Arabic				GPT-4o Mini
Metrics	F1	P	R	ASR	F1	P	R	ASR	F1	P	R	ASR	F1	P	R	ASR
Original data	0.82	0.80	0.84	-	0.80	0.84	0.80	-	0.81	0.82	0.80	-	0.88	0.88	0.88	-
B.B.D.1	0.81	0.79	0.83	18.36	0.80	0.90	0.72	12.71	0.80	0.83	0.78	14.19	0.88	0.88	0.88	11.58
B.B.D.2	0.78	0.71	0.86	23.20	0.77	0.86	0.70	17.26	0.78	0.80	0.77	18.65	0.85	0.85	0.85	14.11
B.B.D.3	0.81	0.79	0.83	18.19	0.80	0.88	0.73	13.04	0.80	0.81	0.79	11.73	0.86	0.8	0.86	13.94
B.B.D.4	0.78	0.71	0.87	22.67	0.77	0.88	0.68	17.60	0.76	0.78	0.75	19.46	0.85	0.85	0.85	14.34
B.B.D.5	0.78	0.73	0.83	22.30	0.77	0.81	0.74	19.62	0.78	0.79	0.77	17.91	0.78	0.80	0.79	20.75
B.B.D.6	0.80	0.78	0.82	19.29	0.79	0.84	0.80	14.67	0.79	0.80	0.78	13.38	0.86	0.86	0.86	13.83
B.B.D.7	0.80	0.77	0.82	19.71	0.80	0.90	0.72	13.77	0.81	0.82	0.80	10.27	0.87	0.8	0.87	12.76

Table 14. Semantic Similarity results using Arabic-BERT model (Method 1) and Normalized Average Similarity Score (Method 2) for White box attacks.

Adversarial Dataset	Average Similarity	Standard Deviation	NASS
W.B.D.1	0.9469	0.0665	0.7501
W.B.D.2	0.9428	0.0691	0.8228
W.B.D.3	0.9355	0.0754	0.8940
W.B.D.4	0.9338	0.0772	0.9097
W.B.D.5	0.8134	0.0813	0.7831
W.B.D.6	0.7917	0.1147	0.8658

Table 15. Atlas Chat, Arabian GPT, Falcon Arabic, and GPT-4o Mini performance under white box attacks (W.B.D: White Box Data).

Attack	Atlas Chat				Arabian GPT				Falcon Arabic				GPT-4o Mini
Metrics	F1	P	R	ASR	F1	P	R	ASR	F1	P	R	ASR	F1	P	R	ASR
Original data	0.82	0.80	0.84	-	0.80	0.84	0.80	-	0.81	0.82	0.80	-	0.88	0.88	0.88	-
W.B.D.1	0.79	0.77	0.82	23.11	0.79	0.86	0.73	15.52	0.76	0.77	0.74	25.81	0.85	0.85	0.85	14.90
W.B.D.2	0.76	0.77	0.75	26.88	0.73	0.86	0.63	23.73	0.77	0.78	0.75	23.64	0.76	0.76	0.76	23.50
W.B.D.3	0.77	0.75	0.78	27.33	0.78	0.85	0.71	17.37	0.78	0.79	0.77	19.72	0.80	0.81	0.80	19.01
W.B.D.4	0.75	0.74	0.76	29.52	0.75	0.82	0.68	22.55	0.75	0.76	0.74	27.49	0.75	0.75	0.75	22.44
W.B.D.5	0.69	0.80	0.60	20.69	0.70	0.83	0.60	17.99	0.80	0.81	0.79	10.58	0.73	0.82	0.65	15.80
W.B.D.6	0.64	0.83	0.52	29.88	0.66	0.89	0.52	28.45	0.80	0.81	0.79	12.03	0.71	0.87	0.59	23.95

Table 16. Arabian GPT and GPT-4o Mini performance before and after adversarial training.

Models	Arabian GPT					GPT-4o Mini
Metrics/Attack	F1	P	R	ASR Before	ASR After	F1	P	R	ASR Before	ASR After
Original dataset	0.80	0.84	0.80	-	-	0.88	0.88	0.88	-	-
Adv dataset	0.89	0.90	0.88	-	-	0.91	0.91	0.92	-	-
B.B.D.1	0.87	0.88	0.86	12.71	8.43	0.89	0.90	0.88	11.58	7.23
B.B.D.2	0.86	0.87	0.85	17.26	12.91	0.88	0.89	0.86	14.11	9.89
B.B.D.3	0.88	0.89	0.87	13.04	8.76	0.90	0.91	0.89	13.94	9.31
B.B.D.4	0.86	0.87	0.85	17.60	10.38	0.89	0.90	0.87	14.34	9.78
B.B.D.5	0.84	0.86	0.82	19.62	17.89	0.86	0.88	0.84	20.75	13.12
B.B.D.6	0.87	0.88	0.86	14.67	10.42	0.89	0.90	0.88	13.83	9.47
B.B.D.7	0.88	0.89	0.87	13.77	9.56	0.90	0.91	0.88	12.76	8.18
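
The effect of adversarial training on the attack success rate can be summarized directly from the "ASR Before" and "ASR After" columns of Table 16. The sketch below computes the relative reduction per attack and averages it; using a relative (rather than absolute) reduction and an unweighted mean are assumptions, since the averaging scheme behind the figure quoted in the conclusions is not spelled out in this section.

```python
from statistics import mean

# (ASR before, ASR after) for B.B.D.1 to B.B.D.7, copied from Table 16.
ASR_PAIRS = {
    "Arabian GPT": [
        (12.71, 8.43), (17.26, 12.91), (13.04, 8.76), (17.60, 10.38),
        (19.62, 17.89), (14.67, 10.42), (13.77, 9.56),
    ],
    "GPT-4o Mini": [
        (11.58, 7.23), (14.11, 9.89), (13.94, 9.31), (14.34, 9.78),
        (20.75, 13.12), (13.83, 9.47), (12.76, 8.18),
    ],
}


def relative_reduction(before, after):
    """Fraction of the original ASR removed by adversarial training."""
    return (before - after) / before


per_model = {
    model: mean(relative_reduction(b, a) for b, a in pairs)
    for model, pairs in ASR_PAIRS.items()
}
overall = mean(
    relative_reduction(b, a) for pairs in ASR_PAIRS.values() for b, a in pairs
)

for model, value in per_model.items():
    print(f"{model}: {100 * value:.1f}% mean relative ASR reduction")
print(f"Overall: {100 * overall:.1f}%")
```

Under these assumptions the overall mean comes out at about 31%, close to but not identical with the 32% average reported in the conclusions, so it is best read as a sanity check rather than a reproduction of the authors' calculation.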

Table 17. Robustness Comparison Between Literature ASR and Our Proposed Model's ASR.

Attack	Ref.	Literature ASR	Fine-Tuned GPT-4o Mini ASR	Improvement
Character level (no type is presented)	[19]	21%	10.26%	10.74%
Word-level attack (no type is presented)	[19]	68%	9.89%	39.61%
Word-level attack (no type is presented)	[32]	31%	9.89%	39.61%
Adding suffix	[33]	91%	7.23%	83.77%
Inserting dots between letters	[17]	29%	9.78%	19.22%
Inserting spaces between letters	[17]	24%	9.89%	14%
Modifying a character with random noise	[17]	18%	9.47%	8.53%
Modifying a single character	[17]	17%	13.12%	3.88%
Deleting spaces between two words	[17]	16%	8.76%	7.24%
Repeating vowels	[17]	13%	8.18%	4.82%
Rephrasing the sentence	[21]	55.41%	12.03%	26.67%
Rephrasing the sentence	[19]	22%	12.03%	26.67%
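
The "Improvement" column of Table 17 appears to be the difference, in percentage points, between the ASR reported in the literature and the ASR of the fine-tuned model. Where an attack has two literature references, the shared value is consistent with the mean of the two differences. The short sketch below reproduces this arithmetic for a few rows; this reading of the column is an inference from the numbers, not a definition given in the text.

```python
from statistics import mean


def improvement(literature_asr, our_asr):
    """Percentage-point difference between a literature ASR and our ASR."""
    return literature_asr - our_asr


# Single-reference rows.
suffix_gain = improvement(91, 7.23)
dots_gain = improvement(29, 9.78)

# Attacks with two literature references share one averaged value.
word_level_gain = mean([improvement(68, 9.89), improvement(31, 9.89)])
rephrase_gain = mean([improvement(55.41, 12.03), improvement(22, 12.03)])

print(f"Adding suffix: {suffix_gain:.2f}")
print(f"Inserting dots: {dots_gain:.2f}")
print(f"Word-level (mean of two refs): {word_level_gain:.2f}")
print(f"Rephrasing (mean of two refs): {rephrase_gain:.3f}")
```

The rephrasing mean is 26.675, which the table lists as 26.67%.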
