Article

RU-OLD: A Comprehensive Analysis of Offensive Language Detection in Roman Urdu Using Hybrid Machine Learning, Deep Learning, and Transformer Models

Muhammad Zain, Nisar Hussain, Amna Qasim, Gull Mehak, Fiaz Ahmad, Grigori Sidorov and Alexander Gelbukh

1 Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN), Av. Juan de Dios Batiz, s/n, Mexico City 07320, Mexico
2 Department of Computer Science, University of Central Punjab, Punjab 54810, Pakistan
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Algorithms 2025, 18(7), 396; https://doi.org/10.3390/a18070396
Submission received: 3 June 2025 / Revised: 17 June 2025 / Accepted: 24 June 2025 / Published: 28 June 2025
(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)

Abstract

Detecting offensive language in Roman Urdu is important for safe digital interaction. This work investigates machine learning (ML), deep learning (DL), and transformer-based methods for detecting offensive language in Roman Urdu comments collected from YouTube news channels. Features were extracted with TF-IDF and Count Vectorizer over unigrams, bigrams, and trigrams. Among the ML models (Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), and Naïve Bayes (NB)), the SVM achieved the best performance. For DL, Bi-LSTM and CNN models were evaluated, with the CNN outperforming the Bi-LSTM. In addition, transformer models, namely LLaMA 2 and ModernBERT (MBERT), were fine-tuned with LoRA (Low-Rank Adaptation) for greater efficiency. LoRA adapts large language models (LLMs) by training only small low-rank update matrices, which keeps the computational cost of fine-tuning low while preserving performance. According to the experimental results, LLaMA 2 with LoRA attained the highest F1-score of 96.58%, greatly exceeding the performance of the other approaches. LoRA-optimized transformers capture subtle linguistic nuances, making them well suited to Roman Urdu offensive language detection. The study compares conventional and contemporary NLP methods and highlights the importance of efficient fine-tuning techniques. Our findings pave the way for scalable and accurate automated moderation systems for online platforms supporting multiple languages.
Keywords: deep learning; machine learning; support vector machine; large language model
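The classical pipeline described in the abstract (TF-IDF n-gram features feeding an SVM classifier) can be illustrated with the minimal sketch below. This is not the authors' released code; the dataset file name and the "comment"/"label" column names are assumptions made for illustration.

```python
# Minimal sketch of the TF-IDF (unigram-trigram) + SVM pipeline described in the abstract.
# "ru_old_comments.csv" and its column names are hypothetical placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

df = pd.read_csv("ru_old_comments.csv")  # hypothetical Roman Urdu comment dataset
X_train, X_test, y_train, y_test = train_test_split(
    df["comment"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

vectorizer = TfidfVectorizer(ngram_range=(1, 3), min_df=2)  # unigrams, bigrams, trigrams
clf = LinearSVC(C=1.0)

clf.fit(vectorizer.fit_transform(X_train), y_train)
print(classification_report(y_test, clf.predict(vectorizer.transform(X_test))))
```

Similarly, a hedged sketch of LoRA-based fine-tuning for sequence classification using the Hugging Face PEFT library is shown below. The base checkpoint, target module names, and LoRA hyperparameters (r, alpha, dropout) are illustrative assumptions, not the paper's reported configuration.

```python
# Hedged sketch of LoRA fine-tuning for offensive-language classification.
# Checkpoint name and LoRA hyperparameters are assumptions for illustration.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base = "answerdotai/ModernBERT-base"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)  # used to tokenize comments for training (not shown)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,            # keeps the classification head trainable
    r=8, lora_alpha=16, lora_dropout=0.1,  # low-rank update hyperparameters (assumed)
    target_modules=["Wqkv", "Wo"],         # attention projections in ModernBERT; adjust per base model
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights is updated
```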
