Search Results (3)

Search Parameters:
Keywords = Nastaliq Urdu

29 pages, 1051 KiB  
Article
Urdu Toxicity Detection: A Multi-Stage and Multi-Label Classification Approach
by Ayesha Rashid, Sajid Mahmood, Usman Inayat and Muhammad Fahad Zia
AI 2025, 6(8), 194; https://doi.org/10.3390/ai6080194 - 21 Aug 2025
Viewed by 87
Abstract
Social media empowers freedom of expression but is often misused for abuse and hate. The detection of such content is crucial, especially in under-resourced languages like Urdu. To address this challenge, this paper first designs a comprehensive multilabel dataset, the Urdu toxicity corpus (UTC). Second, an Urdu toxicity detection model is developed, which detects toxic content in an Urdu dataset written in the Nastaliq font. The proposed framework first preprocesses the gathered data and then applies feature engineering using term frequency-inverse document frequency (TF-IDF), bag-of-words, and N-gram techniques. Subsequently, the synthetic minority over-sampling technique (SMOTE) is used to address the data imbalance problem, and manual data annotation is performed to ensure label accuracy. Four machine learning models, namely logistic regression, support vector machine, random forest (RF), and gradient boosting, are applied to the preprocessed data. The results indicate that RF outperformed the other machine learning models on all evaluation metrics. Deep learning algorithms, including long short-term memory (LSTM), bidirectional LSTM, and gated recurrent unit, have also been applied to UTC for classification purposes. Overall, random forest performs best, achieving a precision, recall, F1-score, and accuracy of 0.97, 0.99, 0.98, and 0.99, respectively. The proposed model demonstrates strong potential to detect rude, offensive, abusive, and hate speech content in user comments written in Urdu Nastaliq.
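The feature-engineering step named in this abstract (TF-IDF over word N-grams) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the whitespace tokenizer, the unsmoothed `log(N / df)` weighting, the helper names `word_ngrams` and `tfidf`, and the three toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter

def word_ngrams(text, n_max=2):
    """Return whitespace-delimited word n-grams for n = 1..n_max."""
    tokens = text.split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def tfidf(docs, n_max=2):
    """Compute one {term: weight} dict per document.

    tf  = term count / total n-grams in the document
    idf = log(N / df), unsmoothed (an assumption; libraries often smooth it)
    """
    bags = [Counter(word_ngrams(d, n_max)) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for bag in bags:
        df.update(bag.keys())  # each document counts once per term
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in bag.items()}
        )
    return weights

# Hypothetical toy comments in Urdu script (not taken from the UTC dataset).
docs = ["یہ اچھا ہے", "یہ برا ہے", "وہ اچھا تھا"]
weights = tfidf(docs)
```

Terms that occur in fewer documents receive a larger weight, so a bigram unique to one comment outweighs a unigram shared by two. A classifier such as a random forest would then be trained on these weights after vectorizing them into a fixed vocabulary.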

23 pages, 3836 KiB  
Article
RUDA-2025: Depression Severity Detection Using Pre-Trained Transformers on Social Media Data
by Muhammad Ahmad, Pierpaolo Basile, Fida Ullah, Ildar Batyrshin and Grigori Sidorov
AI 2025, 6(8), 191; https://doi.org/10.3390/ai6080191 - 18 Aug 2025
Viewed by 175
Abstract
Depression is a serious mental health disorder affecting cognition, emotions, and behavior. It impacts over 300 million people globally, with mental health care costs exceeding $1 trillion annually. Traditional diagnostic methods are often expensive, time-consuming, stigmatizing, and difficult to access. This study leverages NLP techniques to identify depressive cues in social media posts, focusing on both standard Urdu and code-mixed Roman Urdu, which are often overlooked in existing research. To the best of our knowledge, a script-conversion and combination-based approach for Roman Urdu and Nastaliq Urdu has not been explored earlier. To address this gap, our study makes four key contributions. First, we created a manually annotated dataset named RUDA-2025, containing posts in code-mixed Roman Urdu and Nastaliq Urdu for both binary and multiclass classification. The binary classes are "depression" and "not depression", with the depression class further divided into fine-grained categories: Mild, Moderate, and Severe depression, alongside not depression. Second, we applied, for the first time, two novel techniques to the RUDA-2025 dataset: (1) a script-conversion approach that translates between code-mixed Roman Urdu and Standard Urdu and (2) a combination-based approach that merges both scripts into a single dataset to address linguistic challenges in depression assessment. Finally, we conducted 60 experiments using a combination of traditional machine learning and deep learning techniques to find the best-fit model for detecting the disorder. Based on our analysis, our proposed model (mBERT) with a custom attention mechanism outperformed the baseline (XGB) in combination-based, code-mixed Roman, and Nastaliq Urdu script conversions.

16 pages, 2121 KiB  
Article
Enhancement of Named Entity Recognition in Low-Resource Languages with Data Augmentation and BERT Models: A Case Study on Urdu
by Fida Ullah, Alexander Gelbukh, Muhammad Tayyab Zamir, Edgardo Manuel Felipe Riverón and Grigori Sidorov
Computers 2024, 13(10), 258; https://doi.org/10.3390/computers13100258 - 10 Oct 2024
Cited by 3 | Viewed by 2873
Abstract
Identifying and categorizing proper nouns in text, known as named entity recognition (NER), is crucial for various natural language processing tasks. However, developing effective NER techniques for low-resource languages like Urdu poses challenges due to limited training data, particularly in the Nastaliq script. To address this, our study introduces a novel data augmentation method, "contextual word embeddings augmentation" (CWEA), for Urdu, aiming to enrich existing datasets. The extended dataset, comprising 160,132 tokens and 114,912 labeled entities, significantly enhances the coverage of named entities compared to previous datasets. We evaluated several transformer models on this augmented dataset, including BERT-multilingual, RoBERTa-Urdu-small, BERT-base-cased, and BERT-large-cased. Notably, the BERT-multilingual model outperformed the others, achieving the highest macro F1 score of 0.982. This surpassed the macro F1 scores of the RoBERTa-Urdu-small (0.884), BERT-large-cased (0.916), and BERT-base-cased (0.908) models. Additionally, our neural network model achieved a micro F1 score of 96%, while the RNN model achieved 97% and the BiLSTM model achieved a macro F1 score of 96% on augmented data. Our findings underscore the efficacy of data augmentation techniques in enhancing NER performance for low-resource languages like Urdu.
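Because this abstract reports both macro and micro F1 scores, a short sketch of how the two averages differ may help when comparing the figures. It is a generic single-label illustration, not the paper's evaluation code; the function name `f1_scores` and the toy tag sequences (`O`, `PER`, `LOC`) are hypothetical.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-label F1 values, so every label counts equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all labels first, so frequent labels dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

# Hypothetical tags: the rare LOC entity is missed entirely.
y_true = ["O", "O", "O", "O", "PER", "LOC"]
y_pred = ["O", "O", "O", "O", "PER", "O"]
macro, micro = f1_scores(y_true, y_pred)
print(round(macro, 4), round(micro, 4))  # 0.6296 0.8333
```

Both averages lie between 0 and 1. Missing a rare label lowers the macro score far more than the micro score, which is why the two are not directly comparable across models.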