Adversarial Evaluation of Large Language Models for Building Robust Offensive Language Detection in Moroccan Arabic
Abstract
1. Introduction
2. Background and Related Work
2.1. Offensive Language Detection Methods
2.2. Robustness of Offensive Detection Models
3. Materials and Methods
3.1. Dataset Acquisition and Preparation
3.2. LLMs Selection for Fine-Tuning
3.3. Offensive Language Detection Model Evaluation
3.3.1. Evaluation Methods
3.3.2. Robustness Evaluation via Black-Box and White-Box Attacks
- n: number of labelers (in our case, 2)
- N: number of samples
- Xi,j: label
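
The formula these symbols belong to did not survive in this text. One agreement statistic that is defined over n labelers and N samples is Fleiss' kappa, so the following is a minimal sketch under that assumption; the function name `fleiss_kappa` and the binary category set are illustrative, not taken from the paper.

```python
from collections import Counter


def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss' kappa for `ratings`, a list with one tuple of labels per sample.

    Each tuple holds the labels given to that sample by the n labelers.
    """
    N = len(ratings)       # number of samples
    n = len(ratings[0])    # number of labelers per sample
    if n < 2 or any(len(r) != n for r in ratings):
        raise ValueError("every sample needs the same number (>= 2) of labels")

    # counts[i][c] = how many labelers assigned category c to sample i
    counts = [Counter(r) for r in ratings]

    # Per-sample agreement P_i, then its mean over all samples.
    p_i = [
        (sum(c[cat] ** 2 for cat in categories) - n) / (n * (n - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / N

    # Chance agreement from the overall share of each category.
    p_cat = [sum(c[cat] for c in counts) / (N * n) for cat in categories]
    p_e = sum(p ** 2 for p in p_cat)

    if p_e == 1.0:  # only one category was ever used; kappa is undefined
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)
```

With two labelers, each sample is passed as a pair such as `(1, 0)`; a value of 1 means full agreement and values near 0 mean agreement no better than chance.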
3.4. Adversarial Training for Robust Model Building
4. Results and Discussion
4.1. Preliminary Results
4.2. Robustness Analysis of Models Against Adversarial Attacks
4.2.1. Models’ Performance on Black-Box Adversarial Attack
4.2.2. Models’ Performance on White-Box Adversarial Attack
| Approach\ASR | W.B. 1 | W.B. 2 | W.B. 3 | W.B. 4 |
|---|---|---|---|---|
| Attention | 23.90% | 27.73% | 29.25% | 28.12% |
| Gradient | 32.56% | 39.99% | 42.86% | 42.01% |
| Combined | 35.45% | 40.28% | 42.86% | 44.35% |
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Qu, X.; Gu, Y.; Xia, Q.; Li, Z.; Wang, Z.; Huai, B. A survey on Arabic named entity recognition: Past, recent advances, and future trends. IEEE Trans. Knowl. Data Eng. 2023, 36, 943–959.
2. Farghaly, A.; Shaalan, K. Arabic natural language processing: Challenges and solutions. ACM Trans. Asian Lang. Inf. Process. 2009, 8, 1–22.
3. Younes, J.; Souissi, E.; Achour, H.; Ferchichi, A. Language resources for Maghrebi Arabic dialects’ NLP: A survey. Lang. Resour. Eval. 2020, 54, 1079–1142.
4. Yang, J.; Jin, H.; Tang, R.; Han, X.; Feng, Q.; Jiang, H.; Zhong, S.; Yin, B.; Hu, X. Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond. ACM Trans. Knowl. Discov. Data 2024, 18, 1–32.
5. Filippi, S.; Motyl, B. Large language models in engineering education: A systematic review and suggestions for practical adoption. Information 2024, 15, 345.
6. Nejjar, M.; Zacharias, L.; Stiehle, F.; Weber, I. LLMs for science: Usage for code generation and data analysis. J. Softw. Evol. Process. 2025, 37, e2723.
7. Kansal, K.; Pundir, A.; Nigam, S. Adversarial robustness in NLP: A study on Indian low-resource languages. In Proceedings of the 5th Asian Conference on Innovation in Technology (ASIANCON 2025), Pimpri, India, 22–23 August 2025; pp. 1–6.
8. Xing, X.; Jin, Z.; Jin, D.; Wang, B.; Zhang, Q.; Huang, X. Tasty burgers, soggy fries: Probing aspect robustness in aspect-based sentiment analysis. In Proceedings of the EMNLP 2020 Conference, Online, 16–20 November 2020; pp. 3594–3605.
9. He, J.; Wang, L.; Wang, J.; Liu, Z.; Na, H.; Wang, Z.; Wang, W.; Chen, Q. Guardians of discourse: Evaluating LLMs on multilingual offensive language detection. In Proceedings of the IEEE Smart World Congress (SWC 2024), Denarau Island, Fiji, 2–7 December 2024; pp. 1603–1608.
10. Wiedemann, G.; Yimam, S.M.; Biemann, C. Fine-tuning pre-trained transformer networks for offensive language detection. In Proceedings of the Fourteenth Workshop on Semantic Evaluation (SemEval 2020), Barcelona, Spain, 12–13 December 2020; pp. 1638–1644.
11. Zampieri, M.; Rosenthal, S.; Nakov, P.; Dmonte, A.; Ranasinghe, T. OffensEval 2023: Offensive language identification in the age of large language models. Nat. Lang. Eng. 2023, 29, 1416–1435.
12. Casula, C.; Tonelli, S. Generation-based data augmentation for offensive language detection: Is it worth it? In Proceedings of the EACL 2023 Conference, Dubrovnik, Croatia, 2–6 May 2023; pp. 3359–3377.
13. Yang, Y.; Kim, J.; Kim, Y.; Ho, N.; Thorne, J.; Yun, S.-Y. HARE: Explainable hate speech detection with step-by-step reasoning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 5490–5505.
14. Zampieri, M.; Nakov, P.; Rosenthal, S.; Atanasova, P.; Karadzhov, G.; Mubarak, H.; Derczynski, L.; Pitenis, Z.; Çöltekin, Ç. SemEval-2020 Task 12: Multilingual offensive language identification in social media. In Proceedings of the Fourteenth Workshop on Semantic Evaluation (SemEval 2020), Barcelona, Spain, 12–13 December 2020; pp. 1425–1447.
15. Mohaouchane, H.; Mourhir, A.; Nikolov, N.S. Detecting offensive language on Arabic social media using deep learning. In Proceedings of the 6th International Conference on Social Networks Analysis, Management and Security (SNAMS), Granada, Spain, 22–25 October 2019; pp. 466–471.
16. Essefar, K.; Ait Baha, H.; El Mahdaouy, A.; El Mekki, A.; Berrada, I. OMCD: Offensive Moroccan comments dataset. Lang. Resour. Eval. 2023, 57, 1745–1765.
17. Abdellaoui, I.; Ibrahimi, A.; El Bouni, M.A.; Mourhir, A.; Driouech, S.; Aghzal, M. Investigating offensive language detection in a low-resource setting with a robustness perspective. Big Data Cogn. Comput. 2024, 8, 170.
18. Chakraborty, A.; Alam, M.; Dey, V.; Chattopadhyay, A.; Mukhopadhyay, D. A survey on adversarial attacks and defenses. CAAI Trans. Intell. Technol. 2021, 6, 25–45.
19. Zhu, K.; Wang, J.; Zhou, J.; Wang, Z.; Chen, H.; Wang, Y.; Yang, L.; Ye, W.; Zhang, Y.; Gong, N.Z.; et al. PromptRobust: Evaluating the robustness of large language models on adversarial prompts. In Proceedings of the ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis (LAMPS 2024), Salt Lake City, UT, USA, 14 October 2024.
20. Wei, A.; Haghtalab, N.; Steinhardt, J. Jailbroken: How does LLM safety training fail? In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023; Available online: https://dL.acm.org/doi/10.5555/3666122.3669630 (accessed on 25 January 2026).
21. Wang, B.; Xu, C.; Wang, S.; Gan, Z.; Cheng, Y.; Gao, J.; Awadallah, A.H.; Li, B. Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models. arXiv 2021.
22. Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J.Z.; Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv 2023.
23. Gaanoun, K.; Naira, A.M.; Allak, A.; Imade, B. DarijaBERT: A step forward in NLP for the written Moroccan dialect. Int. J. Data Sci. Anal. 2024, 20, 917–929.
24. Zheng, Z.; Wang, Y.; Huang, Y.; Song, S.; Yang, M.; Tang, B.; Xiong, F.; Li, Z. Attention heads of large language models: A survey. arXiv 2024.
25. Li, J.; Ji, S.; Du, T.; Li, B.; Wang, T. TextBugger: Generating adversarial text against real-world applications. arXiv 2018.
26. Ebrahimi, J.; Rao, A.; Lowd, D.; Dou, D. HotFlip: White-box adversarial examples for text classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia, 15–20 July 2018; pp. 31–36.
27. Alian, M.; Awajan, A. Semantic similarity for English and Arabic texts: A review. J. Inf. Knowl. Manag. 2020, 19, 2050033.
28. Al Sulaiman, M.; Moussa, A.M.; Abdou, S.; Elgibreen, H.; Faisal, M.; Rashwan, M. Semantic textual similarity for modern standard and dialectal Arabic using transfer learning. PLoS ONE 2022, 17, e0272991.
29. Inoue, G.; Alhafni, B.; Baimukan, N.; Bouamor, H.; Habash, N. The interplay of variant, size, and task type in Arabic pre-trained language models. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Virtual, 9 April 2021; Available online: https://aclanthology.org/2021.wanlp-1.10 (accessed on 25 January 2026).
30. Zhao, W.; Alwidian, S.; Mahmoud, Q.H. Adversarial training methods for deep learning: A systematic review. Algorithms 2022, 15, 283.
31. Bai, T.; Luo, J.; Zhao, J.; Wen, B.; Wang, Q. Recent advances in adversarial training for adversarial robustness. arXiv 2021.
32. Gao, J.; Lanchantin, J.; Soffa, M.L.; Qi, Y. Black-box generation of adversarial text sequences to evade deep learning classifiers. In Proceedings of the IEEE Security and Privacy Workshops (SPW 2018), San Francisco, CA, USA, 24 May 2018; pp. 50–56.
33. Brown, H.; Lin, L.; Kawaguchi, K.; Shieh, M. Self-evaluation as a defense against adversarial attacks on LLMs. arXiv 2024.
34. Aghzal, M.; Mourhir, A. Distributional word representations for code-mixed text in Moroccan Darija. Procedia Comput. Sci. 2021, 189, 266–273.
35. Kang, J.; Xie, T.; Wu, X.; Maciejewski, R.; Tong, H. InfoFair: Information-theoretic intersectional fairness. arXiv 2021.
36. Incremona, A.; Pozzi, A.; Guiscardi, A.; Tessera, D. A differentiable and uncertainty-aware mutual information regularizer for bias mitigation. Neurocomputing 2025, 132, 498.



| Reference | Language | Model | Results (F1-Score) |
|---|---|---|---|
| [8] | English | RoBERTa-Large | 92.00% |
| [7] | Spanish | Mistral | 87.70% |
| [7] | German | GPT-3 | 76.80% |
| [12] | Arabic | AraBERT | 90.17% |
| [12] | Danish | BERT | 81.19% |
| [12] | Greek | mBERT | 85.22% |
| [12] | Turkish | XLM-RoBERTa-base and XLM-RoBERTa-large models | 82.58% |
| [15] | Moroccan Arabic | Darija RoBERTa | 85% |

| Label Category | Original Count | Updated Count | Number of Changes | Percentage Changed |
|---|---|---|---|---|
| Offensive | 7057 | 7505 | 288 | 4.08% |
| Non-Offensive | 7057 | 6609 | 736 | 10.43% |
| Total | 14,114 | 14,114 | 1024 | 7.25% |
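
The percentages in the last column appear to be each row's number of changes divided by its original count, and the updated counts are consistent with 288 labels leaving the Offensive class while 736 enter it. A quick check under that reading:

```latex
\begin{aligned}
\text{Offensive:}\quad & 288 / 7057 \approx 4.08\% \\
\text{Non-Offensive:}\quad & 736 / 7057 \approx 10.43\% \\
\text{Total:}\quad & 1024 / 14114 \approx 7.255\% \\
\text{Updated Offensive:}\quad & 7057 - 288 + 736 = 7505 \\
\text{Updated Non-Offensive:}\quad & 7057 - 736 + 288 = 6609
\end{aligned}
```

The total works out to about 7.255%, which the table lists as 7.25%.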
| Model | LLaMA | Gemma | Mistral | DeepSeek | GPT-4 | Arabian-GPT | AtlasChat |
|---|---|---|---|---|---|---|---|
| Fine tuning | 2 h 30 min | 2 h 45 min | 3 h 21 min | 1 h 10 min | 35 min | 4 h 20 min | 6 h 42 min |
| Prediction | 46 min | 1 h 37 min | 1 h 12 min | 43 min | 20 min | 49 min | 1 h 24 min |

| Black Box Attacks | Type | Description |
|---|---|---|
| B.B.D.1 | Adding suffix | Appending a neutral sentence to the original sentence. |
| B.B.D.2 | Adding spaces | Adding spaces between the letters of two selected words. |
| B.B.D.3 | Deleting spaces | Deleting spaces between two words. |
| B.B.D.4 | Adding dots | Adding dots between the letters of two randomly selected words. |
| B.B.D.5 | Swap with the adjacent | Replacing two letters with adjacent ones on the Arabic/French keyboard. |
| B.B.D.6 | Random noise | Modifying two characters with random noise. |
| B.B.D.7 | Repeating vowels | Adding extra vowel characters to a word by repeating them. |
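
Several of the perturbations in this table are simple string edits. The sketch below illustrates B.B.D.1 to B.B.D.4 and B.B.D.7 only; the function names, the vowel set, and the seeding are assumptions made for illustration and are not the authors' implementation. The keyboard-adjacency and random-noise attacks are left out because they depend on details not given here.

```python
import random

# Assumption: Latin vowels plus the Arabic long vowels count as "vowels".
VOWELS = set("aeiouyAEIOUY") | set("اوي")


def add_suffix(text, suffix):
    """B.B.D.1: append a neutral sentence to the original sentence."""
    return f"{text} {suffix}"


def pick_word_indices(words, k, rng):
    """Choose up to k positions of words that have at least two characters."""
    candidates = [i for i, w in enumerate(words) if len(w) > 1]
    return set(rng.sample(candidates, min(k, len(candidates))))


def spread_letters(text, sep, k=2, seed=0):
    """B.B.D.2 (sep=' ') and B.B.D.4 (sep='.'): put `sep` between the letters of k words."""
    rng = random.Random(seed)
    words = text.split()
    chosen = pick_word_indices(words, k, rng)
    return " ".join(sep.join(w) if i in chosen else w for i, w in enumerate(words))


def delete_space(text, seed=0):
    """B.B.D.3: merge two neighbouring words by removing the space between them."""
    words = text.split()
    if len(words) < 2:
        return text
    i = random.Random(seed).randrange(len(words) - 1)
    return " ".join(words[:i] + [words[i] + words[i + 1]] + words[i + 2:])


def repeat_vowels(word, times=3):
    """B.B.D.7: stretch every vowel of a word by repeating it."""
    return "".join(ch * times if ch in VOWELS else ch for ch in word)
```

Because none of these edits looks at the model, they can be applied to any test set before it is sent to the classifier.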
| White Box Attacks | Level | Approach Used | Replacement Method | Description |
|---|---|---|---|---|
| W.B.D.1 | character: Swap | Attention and Gradient | TextBugger | Swap two adjacent letters in the important word |
| W.B.D.2 | character: substitute | Attention and Gradient | HotFlip | Substitute a letter in the word with a random letter |
| W.B.D.3 | character: delete | Attention and Gradient | HotFlip | Delete a random letter from the word |
| W.B.D.4 | character: insert | Attention and Gradient | TextBugger | Insert a random letter in the word |
| W.B.D.5 | Word: synonym | Gradient | GPT-4-O-mini | Change the two important words with validated synonyms |
| W.B.D.6 | Sentence: paraphrase | GPT-4-O-mini | GPT-4-O-mini | Rephrase the whole sentence by replacing words with suitable synonyms |
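
The four character-level edits (W.B.D.1 to W.B.D.4) can be sketched as plain random edits applied to the words flagged as important. This is only an illustration: it is not the TextBugger or HotFlip code, the Latin `ALPHABET` is a placeholder for the Arabic letter set, and the `attack` helper and its arguments are hypothetical names.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: use Arabic letters for Darija text


def swap(word, rng):
    """W.B.D.1: swap two adjacent letters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def substitute(word, rng):
    """W.B.D.2: replace one letter with a random letter (may pick the same letter)."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i] + rng.choice(ALPHABET) + word[i + 1:]


def delete(word, rng):
    """W.B.D.3: delete one random letter."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def insert(word, rng):
    """W.B.D.4: insert one random letter at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(ALPHABET) + word[i:]


def attack(text, important_words, edit, seed=0):
    """Apply `edit` to every occurrence of the given important words."""
    rng = random.Random(seed)
    targets = set(important_words)
    return " ".join(edit(w, rng) if w in targets else w for w in text.split())
```

For example, `attack("the doctor did not understand the case", ["doctor", "case"], delete)` removes one letter from each of the two target words and leaves the rest of the sentence unchanged.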
| Text | True Label | Predicted Label | Prediction Probability | Important Word 1 | Score Word 1 | Important Word 2 | Score Word 2 |
|---|---|---|---|---|---|---|---|
| كتستهزءي منو كيصبن علاش عبيب الا كيصبن خدام علي راسو الحماري متاكد من نفسها بحلا كتخترع شي حاجه | 1 | 1 | 0.98 | كتستهزءي mocking | 0.11 | كتخترع inventing | 0.074 |
| You are mocking him while he is working. Why are you making fun of him? He is doing his job seriously, but you act like you’re inventing things. | |||||||
| اوصيكم يا اخوتي اهل الرباط لا ترشحوا العثماني او الحزب العداله والتنميه | 0 | 0 | 0.92 | العداله Justice | 0.11 | والتنميه Development | 0.11 |
| I advise you, my brothers from Rabat, not to vote for Othmani or the Justice and Development Party. | |||||||
| تعاليقكم تعجب سيد حازم راصو بيه تقريبا شي عمليه وشوفو وجهو وداك شي دايرلو طبيب وجا مع ذااك للبرنامج ايوا لعجب ونتوما كتظحكو فيه | 1 | 1 | 0.88 | كتظحكو laughing | 0.17 | للبرنامج show | 0.13 |
| Your comments are surprising about this guy; he has his head wrapped; he probably had some kind of operation. Just look at his face, and he still came to the show like that, and yet you’re laughing at him. | |||||||

| Original Sentence | Important Words | Generated Adversarial Sentence |
|---|---|---|
| الطبيب ما فهمش الحالة | الطبيب، الحالة | الطيب ما فهمش الحلة |
| The doctor did not understand the case. | doctor, case | The dctor did not understand the cse. |
| واش خاصني نمشي للمستشفى؟ | نمشي، المستشفى | واش خاصني نمضي للكستشفى؟ |
| Do I need to go to the hospital? | go, hospital | Do I need to gd to the hocpital. |

| Original Text | Predicted Label | Important Words | Semantically Similar Words | Adversarial Text | New Label |
|---|---|---|---|---|---|
| كتكلو رزق عبد الله وكيخرج فيكوم You are unlawfully benefiting from Abdellah’s resources, and this wrongdoing is backfiring on you, leading to negative consequences. | Not Offensive | الله (God) | الواحد، العالي، العزيز، اله، اللطيف، والهوى، الكبير، نبيل، الرزاق، السلام، المولى، الحي، كولشي، امين (different names of God, the Mighty, the Greatest) | كتكلو رزق عبد العزيز وكيبهدل فيكوم You are unlawfully benefiting from Abdellaziz’s resources, and this wrongdoing is humiliating you. | Offensive |
| | | كيخرج (results in negative consequences) | تحت (below), كيفضح (expose), كيدوي (talk), كيهضر (speak), كيطعن (stab), كيهجم (attack), كيشنق (strangle), كيصطاد (hunt), كيدّيها (handle), كيبهدل (humiliate), كيصفّي (eliminate), كيشرّف (honor), كيخوّن (betray), كيضرب (hit), كيطيّح (drop), كيشهر (defame) | | |

| Original Sentence | Generated Adversarial Sentence |
|---|---|
| كتخربق بزاف وكتقول كلام ماشي فمحلو You talk nonsense a lot and say things that make no sense. | الهضرة ديالك كلها تفاهة وما عندها حتى معنى Your talk is all nonsense and completely meaningless |
| هاد السيد ماشي مربي ومكيحترمش الناس This man is not well-mannered and doesn’t respect people. | تصرفات هاد الشخص كتدل باللي ما عندوش احترام للناس This person’s behavior shows a lack of respect for others. |

| Model | Accuracy | F1-Score | Precision | Recall |
|---|---|---|---|---|
| Llama 2 (7b) | 0.50 | 0.50 | 0.49 | 0.51 |
| Llama 2 (13b) | 0.51 | 0.59 | 0.50 | 0.72 |
| DeepSeek (R1 14b) | 0.68 | 0.55 | 0.69 | 0.59 |
| DeepSeek (R1 7b) | 0.59 | 0.56 | 0.59 | 0.54 |
| Llama 3.1 (8b) | 0.62 | 0.62 | 0.53 | 0.75 |
| Mistral 3 (7b) | 0.66 | 0.64 | 0.71 | 0.72 |
| Gemma 2 (9b) | 0.75 | 0.69 | 0.85 | 0.58 |
| Gemma 2 (2b) | 0.62 | 0.70 | 0.56 | 0.92 |
| DeepSeek (R1 8b) | 0.70 | 0.71 | 0.67 | 0.76 |
| Llama 3.2 (8b) | 0.74 | 0.74 | 0.73 | 0.75 |
| Arabian GPT (3b) | 0.81 | 0.80 | 0.84 | 0.80 |
| Mistral-small 3 (24b) | 0.80 | 0.81 | 0.75 | 0.87 |
| Falcon Arabic - | 0.83 | 0.81 | 0.82 | 0.80 |
| Atlas chat (2b) | 0.82 | 0.82 | 0.80 | 0.84 |
| GPT-4 (4 o mini) | 0.88 | 0.88 | 0.88 | 0.88 |
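
As a reading aid, the F1-score is the harmonic mean of precision (P) and recall (R). Taking the Atlas chat row as a worked example, and assuming its F1 was computed directly from the listed P and R:

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}
\qquad\Rightarrow\qquad
F_1^{\text{Atlas chat}} = \frac{2 \cdot 0.80 \cdot 0.84}{0.80 + 0.84} = \frac{1.344}{1.64} \approx 0.82
```

Rows where the listed F1 differs from this value (for example DeepSeek R1 14b) may reflect class-averaged metrics; the text here does not say which averaging was used.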
| Adversarial Dataset | Average Similarity | Standard Deviation | NASS |
|---|---|---|---|
| B.B.D.1 | 0.8590 | 0.1263 | 0.9759 |
| B.B.D.2 | 0.8006 | 0.1568 | 0.9112 |
| B.B.D.3 | 0.9591 | 0.0544 | 0.8111 |
| B.B.D.4 | 0.8870 | 0.1104 | 0.9260 |
| B.B.D.5 | 0.9087 | 0.0905 | 0.8542 |
| B.B.D.6 | 0.9591 | 0.0544 | 0.9226 |
| B.B.D.7 | 0.8791 | 0.0799 | 0.9794 |
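
The first two columns summarise how close each adversarial sentence stays to its original. A minimal sketch of such a summary, assuming cosine similarity between sentence embeddings and a population standard deviation, is given below; the embedding model, the choice of standard deviation, and the function names are assumptions, and NASS is not computed because its definition is not given in this text.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_summary(original_vecs, adversarial_vecs):
    """Mean and population standard deviation of the pairwise cosine similarities."""
    sims = [cosine(u, v) for u, v in zip(original_vecs, adversarial_vecs)]
    return statistics.mean(sims), statistics.pstdev(sims)
```

Here `original_vecs[i]` and `adversarial_vecs[i]` are expected to be the embeddings of the same sentence before and after the perturbation.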
| Attack | Atlas Chat | | | | Arabian GPT | | | | Falcon Arabic | | | | GPT-4-o-mini | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Metrics | F1 | P | R | ASR (%) | F1 | P | R | ASR (%) | F1 | P | R | ASR (%) | F1 | P | R | ASR (%) |
| Original data | 0.82 | 0.80 | 0.84 | - | 0.80 | 0.84 | 0.80 | - | 0.81 | 0.82 | 0.80 | - | 0.88 | 0.88 | 0.88 | - |
| B.B.D.1 | 0.81 | 0.79 | 0.83 | 18.36 | 0.80 | 0.90 | 0.72 | 12.71 | 0.80 | 0.83 | 0.78 | 14.19 | 0.88 | 0.88 | 0.88 | 11.58 |
| B.B.D.2 | 0.78 | 0.71 | 0.86 | 23.20 | 0.77 | 0.86 | 0.70 | 17.26 | 0.78 | 0.80 | 0.77 | 18.65 | 0.85 | 0.85 | 0.85 | 14.11 |
| B.B.D.3 | 0.81 | 0.79 | 0.83 | 18.19 | 0.80 | 0.88 | 0.73 | 13.04 | 0.80 | 0.81 | 0.79 | 11.73 | 0.86 | 0.8 | 0.86 | 13.94 |
| B.B.D.4 | 0.78 | 0.71 | 0.87 | 22.67 | 0.77 | 0.88 | 0.68 | 17.60 | 0.76 | 0.78 | 0.75 | 19.46 | 0.85 | 0.85 | 0.85 | 14.34 |
| B.B.D.5 | 0.78 | 0.73 | 0.83 | 22.30 | 0.77 | 0.81 | 0.74 | 19.62 | 0.78 | 0.79 | 0.77 | 17.91 | 0.78 | 0.80 | 0.79 | 20.75 |
| B.B.D.6 | 0.80 | 0.78 | 0.82 | 19.29 | 0.79 | 0.84 | 0.80 | 14.67 | 0.79 | 0.80 | 0.78 | 13.38 | 0.86 | 0.86 | 0.86 | 13.83 |
| B.B.D.7 | 0.80 | 0.77 | 0.82 | 19.71 | 0.80 | 0.90 | 0.72 | 13.77 | 0.81 | 0.82 | 0.80 | 10.27 | 0.87 | 0.8 | 0.87 | 12.76 |
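
The ASR columns report an attack success rate, but its exact definition is not included in this text. A common definition, used here as an assumption, is the percentage of samples the model classified correctly before the attack whose prediction changes after it:

```python
def attack_success_rate(y_true, pred_original, pred_adversarial):
    """Percentage of correctly classified samples whose prediction changes after the attack."""
    attacked = [
        (po, pa)
        for yt, po, pa in zip(y_true, pred_original, pred_adversarial)
        if po == yt  # only samples the model got right before the attack
    ]
    if not attacked:
        return 0.0
    flipped = sum(1 for po, pa in attacked if pa != po)
    return 100.0 * flipped / len(attacked)
```

Under this definition a model can keep a similar F1 while still showing a non-zero ASR, since flips in one direction can be offset by changes on samples that were misclassified before the attack.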
| Adversarial Dataset | Average Similarity | Standard Deviation | NASS |
|---|---|---|---|
| W.B.D.1 | 0.9469 | 0.0665 | 0.7501 |
| W.B.D.2 | 0.9428 | 0.0691 | 0.8228 |
| W.B.D.3 | 0.9355 | 0.0754 | 0.8940 |
| W.B.D.4 | 0.9338 | 0.0772 | 0.9097 |
| W.B.D.5 | 0.8134 | 0.0813 | 0.7831 |
| W.B.D.6 | 0.7917 | 0.1147 | 0.8658 |

| Attack | Atlas Chat | | | | Arabian GPT | | | | Falcon Arabic | | | | GPT-4-o-mini | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Metrics | F1 | P | R | ASR (%) | F1 | P | R | ASR (%) | F1 | P | R | ASR (%) | F1 | P | R | ASR (%) |
| Original data | 0.82 | 0.80 | 0.84 | - | 0.80 | 0.84 | 0.80 | - | 0.81 | 0.82 | 0.80 | - | 0.88 | 0.88 | 0.88 | - |
| W.B.D.1 | 0.79 | 0.77 | 0.82 | 23.11 | 0.79 | 0.86 | 0.73 | 15.52 | 0.76 | 0.77 | 0.74 | 25.81 | 0.85 | 0.85 | 0.85 | 14.90 |
| W.B.D.2 | 0.76 | 0.77 | 0.75 | 26.88 | 0.73 | 0.86 | 0.63 | 23.73 | 0.77 | 0.78 | 0.75 | 23.64 | 0.76 | 0.76 | 0.76 | 23.50 |
| W.B.D.3 | 0.77 | 0.75 | 0.78 | 27.33 | 0.78 | 0.85 | 0.71 | 17.37 | 0.78 | 0.79 | 0.77 | 19.72 | 0.80 | 0.81 | 0.80 | 19.01 |
| W.B.D.4 | 0.75 | 0.74 | 0.76 | 29.52 | 0.75 | 0.82 | 0.68 | 22.55 | 0.75 | 0.76 | 0.74 | 27.49 | 0.75 | 0.75 | 0.75 | 22.44 |
| W.B.D.5 | 0.69 | 0.80 | 0.60 | 20.69 | 0.70 | 0.83 | 0.60 | 17.99 | 0.80 | 0.81 | 0.79 | 10.58 | 0.73 | 0.82 | 0.65 | 15.80 |
| W.B.D.6 | 0.64 | 0.83 | 0.52 | 29.88 | 0.66 | 0.89 | 0.52 | 28.45 | 0.80 | 0.81 | 0.79 | 12.03 | 0.71 | 0.87 | 0.59 | 23.95 |

| Attack | Arabian GPT | | | | | GPT-4-o-mini | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Metrics | F1 | P | R | ASR Before (%) | ASR After (%) | F1 | P | R | ASR Before (%) | ASR After (%) |
| Original dataset | 0.80 | 0.84 | 0.80 | - | - | 0.88 | 0.88 | 0.88 | - | - |
| Adv dataset | 0.89 | 0.90 | 0.88 | - | - | 0.91 | 0.91 | 0.92 | - | - |
| B.B.D.1 | 0.87 | 0.88 | 0.86 | 12.71 | 8.43 | 0.89 | 0.90 | 0.88 | 11.58 | 7.23 |
| B.B.D.2 | 0.86 | 0.87 | 0.85 | 17.26 | 12.91 | 0.88 | 0.89 | 0.86 | 14.11 | 9.89 |
| B.B.D.3 | 0.88 | 0.89 | 0.87 | 13.04 | 8.76 | 0.90 | 0.91 | 0.89 | 13.94 | 9.31 |
| B.B.D.4 | 0.86 | 0.87 | 0.85 | 17.60 | 10.38 | 0.89 | 0.90 | 0.87 | 14.34 | 9.78 |
| B.B.D.5 | 0.84 | 0.86 | 0.82 | 19.62 | 17.89 | 0.86 | 0.88 | 0.84 | 20.75 | 13.12 |
| B.B.D.6 | 0.87 | 0.88 | 0.86 | 14.67 | 10.42 | 0.89 | 0.90 | 0.88 | 13.83 | 9.47 |
| B.B.D.7 | 0.88 | 0.89 | 0.87 | 13.77 | 9.56 | 0.90 | 0.91 | 0.88 | 12.76 | 8.18 |
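
The "ASR After" columns come from models retrained with adversarial examples. The exact recipe is not described in this text, so the following is only a generic sketch of adversarial data augmentation: a share of the training pairs is perturbed and added back with the original label. The function name, the `ratio` parameter, and the way perturbations are chosen are all assumptions.

```python
import random


def build_adversarial_training_set(examples, perturbations, ratio=0.5, seed=0):
    """Return the original (text, label) pairs plus perturbed copies of a random subset."""
    rng = random.Random(seed)
    n_aug = int(len(examples) * ratio)
    augmented = list(examples)
    for text, label in rng.sample(examples, n_aug):
        perturb = rng.choice(perturbations)
        augmented.append((perturb(text), label))  # the label is kept unchanged
    rng.shuffle(augmented)
    return augmented
```

Each entry of `perturbations` is any function that maps a sentence to a perturbed sentence, such as a character-level or spacing edit.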
| Attack | Ref. | Literature ASR | Fine-Tuned GPT-4 | Improvement |
|---|---|---|---|---|
| Character level (no type is presented) | [19] | 21% | 10.26% | 10.74% |
| Word-level attack (no type is presented) | [19] | 68% | 9.89% | 39.61% |
| Word-level attack (no type is presented) | [32] | 31% | 9.89% | 39.61% |
| Adding suffix | [33] | 91% | 7.23% | 83.77% |
| Inserting dots between letters | [17] | 29% | 9.78% | 19.22% |
| Inserting spaces between letters | [17] | 24% | 9.89% | 14% |
| Modifying a character with random noise | [17] | 18% | 9.47% | 8.53% |
| Modifying a single character | [17] | 17% | 13.12% | 3.88% |
| Deleting spaces between two words | [17] | 16% | 8.76% | 7.24% |
| Repeating vowels | [17] | 13% | 8.18% | 4.82% |
| Rephrasing the sentence | [21] | 55.41% | 12.03% | 26.67% |
| Rephrasing the sentence | [19] | 22% | 12.03% | 26.67% |
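
The Improvement column appears to be the difference, in percentage points, between the literature ASR and the fine-tuned model's ASR. Where two references share one attack, the repeated value matches the mean of the two differences. A worked check under that reading:

```latex
\begin{aligned}
\text{Improvement} &= \text{ASR}_{\text{literature}} - \text{ASR}_{\text{fine-tuned}} \\
\text{Adding suffix:}\quad & 91 - 7.23 = 83.77 \\
\text{Word-level:}\quad & \tfrac{1}{2}\big[(68 - 9.89) + (31 - 9.89)\big] = \tfrac{1}{2}(58.11 + 21.11) = 39.61 \\
\text{Rephrasing:}\quad & \tfrac{1}{2}\big[(55.41 - 12.03) + (22 - 12.03)\big] = \tfrac{1}{2}(43.38 + 9.97) \approx 26.67
\end{aligned}
```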
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Ouali, S.; Raisi, K.; Mourhir, A.; Nfaoui, E.H.; Garouani, S.E. Adversarial Evaluation of Large Language Models for Building Robust Offensive Language Detection in Moroccan Arabic. Big Data Cogn. Comput. 2026, 10, 132. https://doi.org/10.3390/bdcc10050132

