MDPI - Publisher of Open Access Journals

29 pages, 1051 KiB

Open AccessArticle

Urdu Toxicity Detection: A Multi-Stage and Multi-Label Classification Approach

by Ayesha Rashid, Sajid Mahmood, Usman Inayat and Muhammad Fahad Zia

AI 2025, 6(8), 194; https://doi.org/10.3390/ai6080194 - 21 Aug 2025

Viewed by 87

Social media empowers freedom of expression but is often misused for abuse and hate. The detection of such content is crucial, especially in under-resourced languages like Urdu. To address this challenge, this paper designed a comprehensive multilabel dataset, the Urdu toxicity corpus (UTC). Second, the Urdu toxicity detection model is developed, which detects toxic content from an Urdu dataset presented in Nastaliq Font. The proposed framework initially processed the gathered data and then applied feature engineering using term frequency-inverse document frequency, bag-of-words, and N-gram techniques. Subsequently, the synthetic minority over-sampling technique is used to address the data imbalance problem, and manual data annotation is performed to ensure label accuracy. Four machine learning models, namely logistic regression, support vector machine, random forest, and gradient boosting, are applied to preprocessed data. The results indicate that the RF outperformed all evaluation metrics. Deep learning algorithms, including long short-term memory (LSTM), Bidirectional LSTM, and gated recurrent unit, have also been applied to UTC for classification purposes. Random forest outperforms the other models, achieving a precision, recall, F1-score, and accuracy of 0.97, 0.99, 0.98, and 0.99, respectively. The proposed model demonstrates a strong potential to detect rude, offensive, abusive, and hate speech content from user comments in Urdu Nastaliq. Full article

(This article belongs to the Special Issue AI-Driven Innovations: Emerging Trends, Security, and Industrial Solutions)

► Show Figures

Figure 1

23 pages, 3836 KiB

Open AccessArticle

RUDA-2025: Depression Severity Detection Using Pre-Trained Transformers on Social Media Data

by Muhammad Ahmad, Pierpaolo Basile, Fida Ullah, Ildar Batyrshin and Grigori Sidorov

AI 2025, 6(8), 191; https://doi.org/10.3390/ai6080191 - 18 Aug 2025

Viewed by 175

Abstract

Depression is a serious mental health disorder affecting cognition, emotions, and behavior. It impacts over 300 million people globally, with mental health care costs exceeding $1 trillion annually. Traditional diagnostic methods are often expensive, time-consuming, stigmatizing, and difficult to access. This study leverages NLP techniques to identify depressive cues in social media posts, focusing on both standard Urdu and code-mixed Roman Urdu, which are often overlooked in existing research. To the best of our knowledge, a script-conversion and combination-based approach for Roman Urdu and Nastaliq Urdu has not been explored earlier. To address this gap, our study makes four key contributions. First, we created a manually annotated dataset named Ruda-2025, containing posts in code-mixed Roman Urdu and Nastaliq Urdu for both binary and multiclass classification. The binary classes are depression” and not depression, with the depression class further divided into fine-grained categories: Mild, Moderate, and Severe depression alongside not depression. Second, we applied first-time two novel techniques to the RUDA-2025 dataset: (1) script-conversion approach that translates between code-mixed Roman Urdu and Standard Urdu and (2) combination-based approach that merges both scripts to make a single dataset to address linguistic challenges in depression assessment. Finally, we employed 60 different experiments using a combination of traditional machine learning and deep learning techniques to find the best-fit model for the detection of mental disorder. Based on our analysis, our proposed model (mBERT) using custom attention mechanism outperformed baseline (XGB) in combination-based, code-mixed Roman and Nastaliq Urdu script conversions. Full article

► Show Figures

Figure 1

16 pages, 2121 KiB

Open AccessArticle

Enhancement of Named Entity Recognition in Low-Resource Languages with Data Augmentation and BERT Models: A Case Study on Urdu

by Fida Ullah, Alexander Gelbukh, Muhammad Tayyab Zamir, Edgardo Manuel Felipe Riverόn and Grigori Sidorov

Computers 2024, 13(10), 258; https://doi.org/10.3390/computers13100258 - 10 Oct 2024

Cited by 3 | Viewed by 2873

Abstract

Identifying and categorizing proper nouns in text, known as named entity recognition (NER), is crucial for various natural language processing tasks. However, developing effective NER techniques for low-resource languages like Urdu poses challenges due to limited training data, particularly in the nastaliq script. To address this, our study introduces a novel data augmentation method, “contextual word embeddings augmentation” (CWEA), for Urdu, aiming to enrich existing datasets. The extended dataset, comprising 160,132 tokens and 114,912 labeled entities, significantly enhances the coverage of named entities compared to previous datasets. We evaluated several transformer models on this augmented dataset, including BERT-multilingual, RoBERTa-Urdu-small, BERT-base-cased, and BERT-large-cased. Notably, the BERT-multilingual model outperformed others, achieving the highest macro F1 score of 0.982%. This surpassed the macro f1 scores of the RoBERTa-Urdu-small (0.884%), BERT-large-cased (0.916%), and BERT-base-cased (0.908%) models. Additionally, our neural network model achieved a micro F1 score of 96%, while the RNN model achieved 97% and the BiLSTM model achieved a macro F1 score of 96% on augmented data. Our findings underscore the efficacy of data augmentation techniques in enhancing NER performance for low-resource languages like Urdu. Full article

(This article belongs to the Special Issue Harnessing Artificial Intelligence for Social and Semantic Understanding)

► Show Figures

Figure 1

Search Results (3)

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Saved Queries

Search Filter Reset All

Years

Feature Papers

Subjects

Journals

Article Types

Countries / Regions

Search Results (3)

Further Information

Guidelines

MDPI Initiatives

Follow MDPI