Search Results (15)

Search Parameters:
Keywords = Roman Urdu language

19 pages, 914 KiB  
Article
RU-OLD: A Comprehensive Analysis of Offensive Language Detection in Roman Urdu Using Hybrid Machine Learning, Deep Learning, and Transformer Models
by Muhammad Zain, Nisar Hussain, Amna Qasim, Gull Mehak, Fiaz Ahmad, Grigori Sidorov and Alexander Gelbukh
Algorithms 2025, 18(7), 396; https://doi.org/10.3390/a18070396 - 28 Jun 2025
Cited by 1 | Viewed by 400
Abstract
The detection of abusive language in Roman Urdu is important for secure digital interaction. This work investigates machine learning (ML), deep learning (DL), and transformer-based methods for detecting offensive language in Roman Urdu comments collected from YouTube news channels. Features were extracted with TF-IDF and a Count Vectorizer over unigrams, bigrams, and trigrams. Among the ML models evaluated, Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), and Naïve Bayes (NB), the SVM achieved the best performance. The DL models evaluated were Bi-LSTM and CNN, with the CNN outperforming the Bi-LSTM. Moreover, transformer variants such as LLaMA 2 and ModernBERT (MBERT) were instantiated and fine-tuned with LoRA (Low-Rank Adaptation) for better efficiency. LoRA adapts large language models (LLMs) by training small low-rank update matrices instead of all model weights, making fine-tuning feasible at a fraction of the computational cost. According to the experimental results, LLaMA 2 with LoRA attained the highest F1-score of 96.58%, greatly exceeding the other approaches. LoRA-optimized transformers are effective at capturing subtle linguistic nuances, lending themselves well to Roman Urdu offensive language detection. The study compares the performance of conventional and contemporary NLP methods, highlighting the relevance of efficient fine-tuning. Our findings pave the way for scalable and accurate automated moderation systems for online platforms supporting multiple languages.
(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)
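The TF-IDF n-gram features this abstract describes can be sketched in plain Python. This is an illustrative toy, not the authors' pipeline; the Roman Urdu sample comments and the smoothed IDF formula (a scikit-learn-style variant) are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n_values=(1, 2, 3)):
    """Extract word unigrams, bigrams, and trigrams from a token list."""
    grams = []
    for n in n_values:
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def tfidf(corpus):
    """Compute smoothed TF-IDF weights for each tokenized document."""
    docs = [Counter(ngrams(doc)) for doc in corpus]
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())          # document frequency per n-gram
    weights = []
    for doc in docs:
        total = sum(doc.values())
        weights.append({
            g: (c / total) * (math.log((1 + n_docs) / (1 + df[g])) + 1)
            for g, c in doc.items()
        })
    return weights

# Toy Roman Urdu comments (hypothetical examples)
corpus = [["yeh", "bohat", "acha", "hai"], ["yeh", "bohat", "bura", "hai"]]
w = tfidf(corpus)
```

Terms unique to one document ("acha") end up weighted higher than terms shared by all documents ("yeh"), which is the property classifiers exploit.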

18 pages, 373 KiB  
Article
Machine Learning- and Deep Learning-Based Multi-Model System for Hate Speech Detection on Facebook
by Amna Naseeb, Muhammad Zain, Nisar Hussain, Amna Qasim, Fiaz Ahmad, Grigori Sidorov and Alexander Gelbukh
Algorithms 2025, 18(6), 331; https://doi.org/10.3390/a18060331 - 1 Jun 2025
Cited by 2 | Viewed by 702
Abstract
Hate speech is a complex topic that transcends language, culture, and even social spheres. Recently, the spread of hate speech on social media sites like Facebook has added a new layer of complexity to the issues of online safety and content moderation. This study seeks to mitigate the problem by developing a tool for automatically detecting hate speech in Roman Urdu, the informal Latin-script rendering of Urdu most commonly used for South Asian digital communication. Roman Urdu is relatively complex because it has no standardized spellings, leading to syntactic variations that increase the difficulty of hate speech detection. To tackle this problem, we adopt a holistic strategy combining six machine learning (ML) and four deep learning (DL) models, a dataset of Facebook comments that was preprocessed (tokenization, stopword removal, etc.), and text vectorization (TF-IDF, word embeddings). The ML algorithms used in this study are LR, SVM, RF, NB, KNN, and GBM. We also use deep learning architectures, namely CNN, RNN, LSTM, and GRU, to further increase classification accuracy. The experimental results show that the deep learning models outperform the traditional ML approaches by a significant margin, with CNN and LSTM achieving accuracies of 95.1% and 96.2%, respectively. As far as we are aware, this is the first work that investigates QLoRA for fine-tuning large models for the task of offensive language detection in Roman Urdu.
(This article belongs to the Special Issue Linguistic and Cognitive Approaches to Dialog Agents)
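The preprocessing steps this abstract mentions (tokenization, stopword removal) might look like the following sketch. The stopword list and regex rules here are hypothetical placeholders, not the paper's actual resources.

```python
import re

# Hypothetical Roman Urdu stopword list for illustration; a real system
# would use a curated lexicon.
STOPWORDS = {"ka", "ki", "ke", "hai", "ho", "aur", "to", "main"}

def preprocess(comment):
    """Lowercase, strip URLs and punctuation, tokenize, drop stopwords."""
    comment = comment.lower()
    comment = re.sub(r"https?://\S+", " ", comment)   # remove links
    comment = re.sub(r"[^a-z0-9\s]", " ", comment)    # keep alphanumerics only
    tokens = comment.split()
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("Yeh post bilkul THEEK hai! http://example.com aur sahi bhi")
```

The cleaned token stream is what would then be fed to the TF-IDF or embedding stage.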

23 pages, 522 KiB  
Article
ORUD-Detect: A Comprehensive Approach to Offensive Language Detection in Roman Urdu Using Hybrid Machine Learning–Deep Learning Models with Embedding Techniques
by Nisar Hussain, Amna Qasim, Gull Mehak, Olga Kolesnikova, Alexander Gelbukh and Grigori Sidorov
Information 2025, 16(2), 139; https://doi.org/10.3390/info16020139 - 13 Feb 2025
Cited by 3 | Viewed by 1289
Abstract
With the rapid expansion of social media, detecting offensive language has become critically important for healthy online interactions. This poses a considerable challenge for low-resource languages such as Roman Urdu, which are widely spoken on platforms like Facebook. In this paper, we perform a comprehensive study of offensive language detection models on Roman Urdu datasets using both Machine Learning (ML) and Deep Learning (DL) approaches. We present a dataset of 89,968 Facebook comments and extensive preprocessing techniques, using TF-IDF features, Word2Vec, and fastText embeddings to address the linguistic idiosyncrasies and code-mixed aspects of Roman Urdu. Among the ML models, a linear-kernel Support Vector Machine (SVM) scored the best performance, with an F1 score of 94.76, followed by SVM models with radial and polynomial kernels. Even BoW unigram features with naive Bayes produced competitive results, with an F1 score of 94.26. The DL models performed well, with Bi-LSTM reaching an F1 score of 98.00 with Word2Vec embeddings and a fastText-based Bi-RNN reaching 97.00, demonstrating the benefit of contextual embeddings and soft similarity. The CNN model also achieved a good result, with an F1 score of 96.00. This study presents hybrid ML and DL approaches that improve offensive language detection for low-resource languages. This research opens new doors to providing safer online environments for the large community of Roman Urdu users.
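A linear-kernel SVM of the kind that topped the ML results here can be approximated with a hinge-loss subgradient-descent trainer. This is a minimal sketch on a toy separable dataset, not the authors' implementation; the sparse feature indices and labels are invented for illustration.

```python
import random

def train_linear_svm(X, y, dim, epochs=300, lr=0.05, lam=0.001):
    """Train a linear SVM with hinge loss via subgradient descent.
    X: list of sparse feature dicts {index: value}, y: labels in {-1, +1}."""
    w = [0.0] * dim
    b = 0.0
    rng = random.Random(0)
    data = list(zip(X, y))
    for _ in range(epochs):
        rng.shuffle(data)
        for x, label in data:
            margin = label * (sum(w[i] * v for i, v in x.items()) + b)
            for i in range(dim):          # L2 regularization shrinks weights
                w[i] *= (1 - lr * lam)
            if margin < 1:                # hinge-loss subgradient step
                for i, v in x.items():
                    w[i] += lr * label * v
                b += lr * label
    return w, b

def predict(w, b, x):
    return 1 if sum(w[i] * v for i, v in x.items()) + b >= 0 else -1

# Toy task: feature 0 marks a hypothetical offensive cue word
X = [{0: 1.0}, {0: 1.0, 1: 1.0}, {1: 1.0}, {2: 1.0}]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y, dim=3)
```

Real pipelines would of course use a library SVM over the TF-IDF vectors; the point is only the hinge-loss decision rule.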

16 pages, 511 KiB  
Article
Hybrid Machine Learning and Deep Learning Approaches for Insult Detection in Roman Urdu Text
by Nisar Hussain, Amna Qasim, Gull Mehak, Olga Kolesnikova, Alexander Gelbukh and Grigori Sidorov
AI 2025, 6(2), 33; https://doi.org/10.3390/ai6020033 - 8 Feb 2025
Cited by 5 | Viewed by 1596
Abstract
This study introduces a new model for detecting insults in Roman Urdu, filling an important gap in natural language processing (NLP) for low-resource languages. The transliterated nature of Roman Urdu poses specific challenges from a computational linguistics perspective, including non-standardized grammar, variation in spellings for the same word, and high levels of code-mixing with English, which together make automated insult detection for Roman Urdu a highly complex problem. To address these problems, we created a large-scale dataset of 46,045 labeled comments from social media websites such as Twitter, Facebook, and YouTube. This is the first dataset for insult detection in Roman Urdu annotated with insulting and non-insulting content. The study uses advanced preprocessing methods such as text cleaning, text normalization, and tokenization, as well as TF–IDF feature extraction over unigrams (Uni), bigrams (Bi), trigrams (Tri), and their union (Uni+Bi+Trigram). We compared ten machine learning algorithms (including logistic regression, support vector machines, random forest, gradient boosting, AdaBoost, and XGBoost) and three deep learning topologies (CNN, LSTM, and Bi-LSTM). Among the compared models, ensembles achieved the highest F1-scores, reaching 97.79%, 97.78%, and 95.25% for the AdaBoost and decision tree classifiers under the TF–IDF and Uni+Bi+Trigram configurations. Deep learning models performed on par, with CNN achieving an F1-score of 97.01%. Overall, the results highlight the utility of n-gram features and robust classifiers in detecting insults. This study advances NLP for Roman Urdu and lays a foundation for future work on pre-trained transformers and hybrid approaches, which could overcome the limitations of existing systems. It also has practical implications, mainly for the construction of automated moderation tools to achieve safer online spaces, especially on South Asian social media websites.
(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)
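The AdaBoost ensemble credited above with the top F1-scores can be sketched as boosting over one-feature decision stumps. The binary n-gram features and toy labels are hypothetical; a real system would boost full decision trees from a library.

```python
import math

def stump_predict(feat, x):
    """Decision stump: +1 if the binary feature is present, else -1."""
    return 1 if x.get(feat, 0) else -1

def adaboost(X, y, feats, rounds=5):
    """AdaBoost over one-feature stumps. X: feature dicts, y in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        # pick the stump with the lowest weighted error
        best = min(feats, key=lambda f: sum(
            wi for wi, xi, yi in zip(w, X, y) if stump_predict(f, xi) != yi))
        err = sum(wi for wi, xi, yi in zip(w, X, y)
                  if stump_predict(best, xi) != yi)
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid log of 0
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, best))
        # reweight: boost the misclassified examples
        w = [wi * math.exp(-alpha * yi * stump_predict(best, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def ensemble_predict(ensemble, x):
    score = sum(a * stump_predict(f, x) for a, f in ensemble)
    return 1 if score >= 0 else -1

# Toy insult-detection data over hypothetical n-gram features
X = [{"bad word": 1}, {"bad word": 1, "ok": 1}, {"ok": 1}, {"fine": 1}]
y = [1, 1, -1, -1]
model = adaboost(X, y, feats=["bad word", "ok", "fine"])
```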

15 pages, 2289 KiB  
Article
Policy-Based Spam Detection of Tweets Dataset
by Momna Dar, Faiza Iqbal, Rabia Latif, Ayesha Altaf and Nor Shahida Mohd Jamail
Electronics 2023, 12(12), 2662; https://doi.org/10.3390/electronics12122662 - 14 Jun 2023
Cited by 10 | Viewed by 2731
Abstract
Spam communications from spam ads and social media platforms such as Facebook, Twitter, and Instagram are increasing, making spam detection an increasingly active topic. Many languages are used for spam review identification, including Chinese, Urdu, Roman Urdu, English, Turkish, etc.; however, fewer high-quality datasets are available for Urdu. This is mainly because Urdu is less extensively used on social media networks such as Twitter, making it harder to collect large volumes of relevant data. This paper investigates policy-based Urdu tweet spam detection. The study collected over 1,100,000 real-time tweets from multiple users. The dataset is carefully filtered to comply with Twitter’s 100-tweet-per-hour limit. For data collection, the snscrape library is utilized, which provides an API for accessing attributes such as username, URL, and tweet content. A machine learning pipeline is then developed, consisting of TF-IDF, a Count Vectorizer, and the following classifiers: multinomial naïve Bayes, an RBF-kernel support vector classifier, logistic regression, and BERT. Feature extraction is performed based on Twitter policy standards, and the dataset is separated into training and testing sets for spam analysis. Experimental results show that the logistic regression classifier achieved the best results, with an F1-score of 0.70 and an accuracy of 99.55%. The findings show the effectiveness of policy-based spam detection in Urdu tweets using machine learning and BERT-layer models and contribute to the development of a robust social media spam detection method for Urdu.
(This article belongs to the Section Computer Science & Engineering)
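Policy-based feature extraction of the kind this abstract describes (posting rate, duplication, link density) might be sketched as below. The thresholds and signal names are illustrative assumptions, not Twitter's actual policy values.

```python
def policy_features(tweets_by_user):
    """Per-user policy-style spam signals: volume, duplicate ratio,
    and URL density. Thresholds here are hypothetical."""
    feats = {}
    for user, tweets in tweets_by_user.items():
        n = len(tweets)
        dup_ratio = 1 - len(set(tweets)) / n       # share of repeated tweets
        url_ratio = sum("http" in t for t in tweets) / n
        feats[user] = {
            "tweets": n,
            "dup_ratio": round(dup_ratio, 2),
            "url_ratio": round(url_ratio, 2),
            # flag users exceeding a rate limit or posting mostly
            # duplicated / link-heavy content
            "spam_suspect": n > 100 or dup_ratio > 0.5 or url_ratio > 0.8,
        }
    return feats

feats = policy_features({
    "u1": ["buy now http://x"] * 5,
    "u2": ["salam sab ko", "aj mausam acha hai"],
})
```

Signals like these would be concatenated with the TF-IDF text features before classification.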

14 pages, 1811 KiB  
Article
Innovations in Urdu Sentiment Analysis Using Machine and Deep Learning Techniques for Two-Class Classification of Symmetric Datasets
by Khalid Bin Muhammad and S. M. Aqil Burney
Symmetry 2023, 15(5), 1027; https://doi.org/10.3390/sym15051027 - 5 May 2023
Cited by 13 | Viewed by 6731
Abstract
Many investigations have performed sentiment analysis to gauge public opinion in various languages, including English, French, Chinese, and others. The most spoken language in South Asia is Urdu. However, less work has been carried out on Urdu itself, partly because Roman Urdu (Urdu written in the Latin alphabet) is widely used on social media and is easy to process with English-language tools. Large amounts of data in Urdu, as well as in Roman Urdu, are posted on social media sites such as Instagram, Twitter, and Facebook. This research focused on collecting pure Urdu-language data, preprocessing it, applying feature extraction, and using innovative methods to perform sentiment analysis. After reviewing previous efforts, machine learning and deep learning algorithms were applied to the data. The obtained results were compared, and hybrid methods are also recommended in this research, opening new avenues for Urdu-language sentiment analysis.

26 pages, 3512 KiB  
Article
Roman Urdu Hate Speech Detection Using Transformer-Based Model for Cyber Security Applications
by Muhammad Bilal, Atif Khan, Salman Jan, Shahrulniza Musa and Shaukat Ali
Sensors 2023, 23(8), 3909; https://doi.org/10.3390/s23083909 - 12 Apr 2023
Cited by 32 | Viewed by 6611
Abstract
Social media applications, such as Twitter and Facebook, allow users to communicate and share their thoughts, status updates, opinions, photographs, and videos around the globe. Unfortunately, some people utilize these platforms to disseminate hate speech and abusive language. The growth of hate speech may result in hate crimes, cyber violence, and substantial harm to cyberspace, physical security, and social safety. As a result, hate speech detection is a critical issue for both cyberspace and physical society, necessitating the development of a robust application capable of detecting and combating it in real time. Hate speech detection is a context-dependent problem that requires context-aware mechanisms for resolution. In this study, we employed a transformer-based model for Roman Urdu hate speech classification due to its ability to capture the text context. In addition, we developed the first Roman Urdu pre-trained BERT model, which we named BERT-RU. For this purpose, we exploited the capabilities of BERT by training it from scratch on the largest Roman Urdu dataset, consisting of 173,714 text messages. Traditional and deep learning models were used as baselines, including LSTM, BiLSTM, BiLSTM + Attention Layer, and CNN. We also investigated transfer learning by using pre-trained BERT embeddings in conjunction with deep learning models. The performance of each model was evaluated in terms of accuracy, precision, recall, and F-measure, and its generalization was evaluated on a cross-domain dataset. The experimental results revealed that the transformer-based model, when directly applied to the classification of Roman Urdu hate speech, outperformed traditional machine learning models, deep learning models, and pre-trained transformer-based models in terms of accuracy, precision, recall, and F-measure, with scores of 96.70%, 97.25%, 96.74%, and 97.89%, respectively. In addition, the transformer-based model exhibited superior generalization on the cross-domain dataset.
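The evaluation metrics used throughout these listings (accuracy, precision, recall, F-measure) reduce to a few counts; a self-contained sketch, with a small invented prediction list:

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F-measure for a binary task."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}

m = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```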

26 pages, 3801 KiB  
Article
Geo-Spatial Mapping of Hate Speech Prediction in Roman Urdu
by Samia Aziz, Muhammad Shahzad Sarfraz, Muhammad Usman, Muhammad Umar Aftab and Hafiz Tayyab Rauf
Mathematics 2023, 11(4), 969; https://doi.org/10.3390/math11040969 - 14 Feb 2023
Cited by 10 | Viewed by 4731 | Correction
Abstract
Social media has transformed into a crucial channel for political expression. Twitter, especially, is a vital platform used to exchange political hate in Pakistan. Political hate speech affects the public image of politicians, targets their supporters, and hurts public sentiments. Hate speech is a controversial public speech that promotes violence toward a person or group based on specific characteristics. Although studies have been conducted to identify hate speech in European languages, Roman languages have yet to receive much attention. In this research work, we present the automatic detection of political hate speech in Roman Urdu. An exclusive political hate speech labeled dataset (RU-PHS) containing 5002 instances and city-level information has been developed. To overcome the vast lexical structure of Roman Urdu, we propose an algorithm for the lexical unification of Roman Urdu. Three vectorization techniques are developed: TF-IDF, word2vec, and fastText. A comparative analysis of the accuracy and time complexity of conventional machine learning models and fine-tuned neural networks using dense word representations is presented for classifying and predicting political hate speech. The results show that a random forest and the proposed feed-forward neural network achieve an accuracy of 93% using fastText word embedding to distinguish between neutral and politically offensive speech. The statistical information helps identify trends and patterns, and the hotspot and cluster analysis assist in pinpointing Punjab as a highly susceptible area in Pakistan in terms of political hate tweet generation.
(This article belongs to the Special Issue New Insights in Machine Learning and Deep Neural Networks)
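The lexical unification step this abstract proposes is not specified here, but a rule-based sketch, assuming a small hand-made variant lexicon plus collapsing of repeated letters, conveys the idea. Both the variant map and the rules are hypothetical.

```python
import re

# Hand-written variant map for illustration; the paper's actual unification
# algorithm is rule-based over a much larger lexicon.
CANONICAL = {"acha": ["achaa", "acha", "achha", "axha"],
             "nahi": ["nahi", "nai", "nahin", "nhi"]}
VARIANT_TO_CANON = {v: k for k, vs in CANONICAL.items() for v in vs}

def unify(text):
    """Normalize Roman Urdu spelling variants to one canonical form."""
    out = []
    for t in text.lower().split():
        t = re.sub(r"(.)\1+", r"\1", t)       # collapse letter repetitions
        out.append(VARIANT_TO_CANON.get(t, t))
    return " ".join(out)

unified = unify("Achaa nahin yeh nhi hai")
```

Unifying spellings before vectorization shrinks the vocabulary, so TF-IDF, word2vec, and fastText all see one token per word instead of many variants.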

22 pages, 6124 KiB  
Article
Detection of Cyberbullying Patterns in Low Resource Colloquial Roman Urdu Microtext using Natural Language Processing, Machine Learning, and Ensemble Techniques
by Amirita Dewani, Mohsin Ali Memon, Sania Bhatti, Adel Sulaiman, Mohammed Hamdi, Hani Alshahrani, Abdullah Alghamdi and Asadullah Shaikh
Appl. Sci. 2023, 13(4), 2062; https://doi.org/10.3390/app13042062 - 5 Feb 2023
Cited by 20 | Viewed by 4057
Abstract
Social media platforms have become a substratum for people to enunciate their opinions and ideas across the globe. Due to anonymity preservation and freedom of expression, it is possible to humiliate individuals and groups, disregarding social etiquette online, inevitably proliferating and diversifying the incidents of cyberbullying and cyber hate speech. This intimidating problem has recently attracted the attention of researchers and scholars worldwide, yet current practices to sift online content and offset the spread of hatred do not go far enough. One contributing factor is the recent prevalence of regional languages in social media, combined with the dearth of language resources and flexible detection approaches for low-resource languages. Most existing studies are oriented towards traditional resource-rich languages, leaving a huge gap for recently embraced resource-poor languages. One such language, adopted worldwide and most typically by South Asian users for textual communication on social networks, is Roman Urdu. It is derived from Urdu and written left-to-right in Roman script. This language elicits numerous computational challenges in natural language preprocessing due to its inflections, derivations, lexical variations, and morphological richness. To alleviate this problem, this research proposes a cyberbullying detection approach for analyzing textual data in the Roman Urdu language based on advanced preprocessing methods, voting-based ensemble techniques, and machine learning algorithms. The study extracted a vast number of features, including statistical features, word n-grams, combined n-grams, and a BoW model with TF-IDF weighting, in different experimental settings using GridSearchCV and cross-validation. The detection approach is designed to handle users’ textual input in colloquial, non-standard form, accounting for user-specific writing styles on social media. The experimental results show that SVM with embedded hybrid n-gram features produced the highest average accuracy of around 83%. Among the ensemble voting-based techniques, XGBoost achieved the optimal accuracy of 79%. Both implicit and explicit Roman Urdu instances were evaluated, and severity was categorized based on prediction probabilities. Time complexity is also analyzed in terms of execution time, indicating that LR, using different parameters and feature combinations, is the fastest algorithm. The results are promising with respect to standard assessment metrics and indicate the feasibility of the proposed approach for cyberbullying detection in the Roman Urdu language.
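The voting-based ensemble techniques this abstract mentions can be sketched as hard (majority) and soft (probability-averaging) voting. The per-classifier predictions below are invented toy values.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority (hard) voting over per-classifier label lists."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*predictions)]

def soft_vote(probabilities):
    """Soft voting: average per-class probabilities, pick the argmax."""
    results = []
    for probs in zip(*probabilities):
        n_classes = len(probs[0])
        avg = [sum(p[i] for p in probs) / len(probs) for i in range(n_classes)]
        results.append(max(range(n_classes), key=avg.__getitem__))
    return results

# Three hypothetical classifiers voting on four comments (1 = bullying)
svm_preds = [1, 0, 1, 0]
lr_preds  = [1, 0, 0, 0]
xgb_preds = [1, 1, 1, 0]
labels = hard_vote([svm_preds, lr_preds, xgb_preds])
```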

16 pages, 2457 KiB  
Article
Crowd Control, Planning, and Prediction Using Sentiment Analysis: An Alert System for City Authorities
by Tariq Malik, Najma Hanif, Ahsen Tahir, Safeer Abbas, Muhammad Shoaib Hanif, Faiza Tariq, Shuja Ansari, Qammer Hussain Abbasi and Muhammad Ali Imran
Appl. Sci. 2023, 13(3), 1592; https://doi.org/10.3390/app13031592 - 26 Jan 2023
Cited by 2 | Viewed by 4243
Abstract
Modern means of communication, economic crises, and political decisions play imperative roles in reshaping political and administrative systems throughout the world. Twitter, a micro-blogging website, has gained paramount importance in terms of public opinion-sharing. Manual intelligence by law enforcement agencies cannot cope with rapidly changing situations in real time. To address this problem, we built an alert system for government authorities in the province of Punjab, Pakistan. The alert system gathers real-time data from Twitter in English and Roman Urdu about forthcoming gatherings (protests, demonstrations, assemblies, rallies, sit-ins, marches, etc.) and determines the polarity of tweets to gauge public sentiment regarding upcoming anti-government gatherings. Using keywords, the system provides information about future gatherings by extracting entities such as date, time, and location from Twitter data obtained in real time. Our system was trained and tested with different machine learning (ML) algorithms, such as random forest (RF), decision tree (DT), support vector machine (SVM), multinomial naïve Bayes (MNB), and Gaussian naïve Bayes (GNB), along with two vectorization techniques, i.e., term frequency–inverse document frequency (TF-IDF) and count vectorization. Moreover, this paper compares the accuracy of sentiment analysis (SA) on Twitter data across these supervised ML algorithms. In our experiments, we used two data sets: a small data set of 1000 tweets and a large data set of 4000 tweets. Results showed that RF with count vectorization performed best for the small data set, with an accuracy of 82%; on the large data set, MNB with count vectorization outperformed all other classifiers, with an accuracy of 75%. Additionally, language models, e.g., bigram and trigram, were used to generate word clouds of positive and negative words to visualize the most frequently used words.
(This article belongs to the Special Issue Application of Machine Learning in Text Mining)
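Count vectorization and the bigram counts behind the word clouds described above can both be sketched in a few lines of plain Python; the example documents are invented.

```python
from collections import Counter

def count_vectorize(docs):
    """Build a sorted vocabulary and term-count vectors, as a count
    vectorizer does."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok in doc.split():
            vec[index[tok]] += 1
        vectors.append(vec)
    return vocab, vectors

def top_bigrams(docs, k=2):
    """Most frequent bigrams, e.g. as input for a word cloud."""
    counts = Counter()
    for doc in docs:
        toks = doc.split()
        counts.update(zip(toks, toks[1:]))
    return [bg for bg, _ in counts.most_common(k)]

docs = ["protest on friday", "march on friday", "rally on sunday"]
vocab, vectors = count_vectorize(docs)
bigrams = top_bigrams(docs)
```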

19 pages, 2766 KiB  
Article
A Novel Approach for Emotion Detection and Sentiment Analysis for Low Resource Urdu Language Based on CNN-LSTM
by Farhat Ullah, Xin Chen, Syed Bilal Hussain Shah, Saoucene Mahfoudh, Muhammad Abul Hassan and Nagham Saeed
Electronics 2022, 11(24), 4096; https://doi.org/10.3390/electronics11244096 - 8 Dec 2022
Cited by 16 | Viewed by 4616
Abstract
Emotion detection (ED) and sentiment analysis (SA) play a vital role in identifying an individual’s level of interest in any given field. Humans use facial expressions, voice pitch, gestures, and words to convey their emotions. Emotion detection and sentiment analysis in English and Chinese have received much attention in the last decade, while poorly resourced languages such as Urdu have been mostly disregarded; these are the primary focus of this research. Roman Urdu should also be investigated like other languages, because social media platforms are frequently used for communication in it. Roman Urdu faces a significant challenge in the absence of corpora for emotion detection and sentiment analysis, because linguistic resources are vital for natural language processing. In this study, we create a corpus of 1021 sentences for emotion detection and 20,251 sentences for sentiment analysis, both drawn from various domains, and annotate them into six and three classes, respectively, with the aid of human annotators. To train on large-scale unlabeled data, bag-of-words, term frequency-inverse document frequency, and Skip-gram models are employed, and the learned word vectors are then fed into the CNN-LSTM model. In addition to our proposed approach, we also use baseline algorithms, including a convolutional neural network, long short-term memory, artificial neural networks, and recurrent neural networks, for comparison. The results indicate that the proposed CNN-LSTM method paired with Word2Vec is more effective than the other approaches for emotion detection and sentiment analysis in Roman Urdu. Furthermore, compared with previous work, accuracy improved significantly for both tasks, from 85% to 95% for emotion detection and from 89% to 93.3% for sentiment analysis.
(This article belongs to the Special Issue Artificial Intelligence Technologies and Applications)

19 pages, 1297 KiB  
Article
Roman Urdu Sentiment Analysis Using Transfer Learning
by Dun Li, Kanwal Ahmed, Zhiyun Zheng, Syed Agha Hassnain Mohsan, Mohammed H. Alsharif, Myriam Hadjouni, Mona M. Jamjoom and Samih M. Mostafa
Appl. Sci. 2022, 12(20), 10344; https://doi.org/10.3390/app122010344 - 14 Oct 2022
Cited by 22 | Viewed by 6323
Abstract
Numerous studies have been conducted to meet the growing need for analytic tools capable of processing the increasing amounts of textual data available online, and sentiment analysis has emerged as a frontrunner in this field. Current studies focus on the English language, while minority languages such as Roman Urdu are ignored because of their complex syntax and lexical variety. In recent years, deep neural networks have become the standard in this field, yet the full potential of DL models for text SA has not been explored despite their early success. For sentiment analysis, CNNs have achieved strong accuracy, although they still have shortcomings. First, CNNs need a significant amount of data to train. Second, they presume that all words have the same impact on the polarity of a statement. To fill these gaps, this study proposes a CNN with an attention mechanism and transfer learning to improve SA performance. Compared to state-of-the-art methods, our proposed model achieved greater classification accuracy in experiments.
(This article belongs to the Special Issue Recent Trends in Natural Language Processing and Its Applications)
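The attention mechanism described above, which lets the model weight words unequally by their relevance to polarity, can be illustrated with dot-product attention pooling over toy 2-d embeddings. The vectors and query are invented; in the paper's model the query would be a learned parameter.

```python
import math

def attention_pool(word_vectors, query):
    """Dot-product attention: score each word vector against a query,
    softmax the scores, and return the weighted average of the vectors."""
    scores = [sum(w * q for w, q in zip(vec, query)) for vec in word_vectors]
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]       # numerically stable softmax
    total = sum(exp)
    weights = [e / total for e in exp]
    dim = len(word_vectors[0])
    pooled = [sum(weights[i] * word_vectors[i][d]
                  for i in range(len(weights))) for d in range(dim)]
    return weights, pooled

# Two toy embeddings: a sentiment-laden word and a neutral word
vecs = [[2.0, 0.0], [0.0, 0.5]]
query = [1.0, 0.0]          # hypothetical learned polarity query
weights, pooled = attention_pool(vecs, query)
```

The sentiment-laden word receives the larger weight, so it dominates the pooled representation, which is exactly the bias a plain CNN pooling layer lacks.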

24 pages, 1076 KiB  
Article
Attention-Based RU-BiLSTM Sentiment Analysis Model for Roman Urdu
by Bilal Ahmed Chandio, Ali Shariq Imran, Maheen Bakhtyar, Sher Muhammad Daudpota and Junaid Baber
Appl. Sci. 2022, 12(7), 3641; https://doi.org/10.3390/app12073641 - 4 Apr 2022
Cited by 29 | Viewed by 7765
Abstract
Deep neural networks have emerged as a leading approach to many natural language processing (NLP) tasks. Deep networks initially conquered problems in computer vision. However, dealing with sequential data such as text and sound was a nightmare for such networks, as traditional deep networks are not reliable in preserving contextual information. This may not harm the results in image processing, where sequence does not matter, but on text data such networks can produce disastrous results. Moreover, establishing sentence semantics in a colloquial text such as Roman Urdu is a challenge. Additionally, the sparsity and high dimensionality of such informal text pose a significant challenge for building sentence semantics. To overcome this, we propose a deep recurrent architecture, RU-BiLSTM, based on bidirectional LSTM (BiLSTM) coupled with word embeddings and an attention mechanism for sentiment analysis of Roman Urdu. Our proposed model uses the bidirectional LSTM to preserve context in both directions and the attention mechanism to concentrate on more important features. Finally, a dense softmax output layer produces the binary and ternary classification results. We empirically evaluated our model on two available Roman Urdu datasets, RUECD and RUSA-19. Our proposed model outperformed the baseline models on many grounds, achieving a significant improvement of 6% to 8% over them.
(This article belongs to the Special Issue Natural Language Processing: Recent Development and Applications)
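The attention step this abstract describes, scoring each BiLSTM hidden state and pooling the states into a single weighted context vector before the softmax layer, can be sketched in plain Python. The dot-product scorer and the single attention vector `w` are illustrative assumptions, not the authors' exact parameterization:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, w):
    """Pool BiLSTM hidden states into one context vector.

    hidden_states: T vectors, each the concatenated forward/backward
                   state of one time step (length 2d).
    w:             attention vector (length 2d), assumed here to act
                   as a simple dot-product scorer.
    """
    scores = [sum(hj * wj for hj, wj in zip(h, w)) for h in hidden_states]
    alphas = softmax(scores)  # attention weights, sum to 1
    dim = len(hidden_states[0])
    return [sum(a * h[j] for a, h in zip(alphas, hidden_states))
            for j in range(dim)]

# Toy run: two time steps, 2d = 2; the first state scores higher
# against w, so it dominates the pooled context vector.
context = attention_pool([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

In the full model, `context` would then pass through the dense softmax layer to yield the binary or ternary sentiment label.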

18 pages, 1375 KiB  
Article
Deep Sentiment Analysis Using CNN-LSTM Architecture of English and Roman Urdu Text Shared in Social Media
by Lal Khan, Ammar Amjad, Kanwar Muhammad Afaq and Hsien-Tsung Chang
Appl. Sci. 2022, 12(5), 2694; https://doi.org/10.3390/app12052694 - 4 Mar 2022
Cited by 104 | Viewed by 12803
Abstract
Sentiment analysis (SA) has been an active research subject in natural language processing due to its important role in interpreting people’s perspectives and drawing successful opinion-based judgments. On social media, Roman Urdu is one of the most extensively used dialects. [...] Read more.
Sentiment analysis (SA) has been an active research subject in natural language processing due to its important role in interpreting people’s perspectives and drawing successful opinion-based judgments. On social media, Roman Urdu is one of the most extensively used dialects. Sentiment analysis of Roman Urdu is difficult due to its morphological complexities and varied dialects. This paper evaluates the performance of various word embeddings for the Roman Urdu and English dialects using a CNN-LSTM architecture with traditional machine learning classifiers. We introduce a novel deep learning architecture for Roman Urdu and English SA based on two layers: an LSTM for long-term dependency preservation and a one-layer CNN model for local feature extraction. To obtain the final classification, the feature maps learned by the CNN and LSTM are fed to several machine learning classifiers. The architecture is evaluated with various word embedding models. Extensive tests on four corpora show that the proposed model performs exceptionally well in Roman Urdu and English text sentiment classification, with accuracies of 0.904, 0.841, 0.740, and 0.748 on the MDPI, RUSA, RUSA-19, and UCL datasets, respectively. The results show that the SVM classifier and the Word2Vec CBOW (Continuous Bag of Words) model are the more beneficial options for Roman Urdu sentiment analysis, whereas BERT word embeddings, a two-layer LSTM, and an SVM classifier are more suitable for English sentiment analysis. The suggested model outperforms existing well-known advanced models on the relevant corpora, improving accuracy by up to 5%. Full article
(This article belongs to the Topic Machine and Deep Learning)
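The CNN branch of the two-layer design this abstract describes, a one-layer convolution extracting local features whose pooled feature maps are then handed, together with the LSTM features, to classical classifiers, can be illustrated with a minimal pure-Python sketch. The filter values, ReLU activation, and window size are illustrative assumptions, not the paper's exact configuration:

```python
def conv_maxpool(embeddings, filt):
    """One convolutional filter slid over a word-embedding sequence,
    ReLU activation, then global max pooling: the local-feature
    extraction performed by the CNN branch.

    embeddings: T embedding vectors (length d each).
    filt:       k weight vectors (length d each); k is the window size.
    Returns one pooled feature value.
    """
    k, d = len(filt), len(filt[0])
    activations = []
    for t in range(len(embeddings) - k + 1):
        raw = sum(filt[i][j] * embeddings[t + i][j]
                  for i in range(k) for j in range(d))
        activations.append(max(0.0, raw))  # ReLU
    return max(activations)                # global max pooling

# Toy run: four "words" with 2-dimensional embeddings, window size 2.
embeds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filt = [[1.0, 0.0], [0.0, 1.0]]  # responds to an x-then-y pattern
features = [conv_maxpool(embeds, filt)]
# In the full model, pooled features from many such filters would be
# concatenated with the LSTM output and passed to a conventional
# classifier such as an SVM.
```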

29 pages, 1562 KiB  
Review
A Review of Urdu Sentiment Analysis with Multilingual Perspective: A Case of Urdu and Roman Urdu Language
by Ihsan Ullah Khan, Aurangzeb Khan, Wahab Khan, Mazliham Mohd Su’ud, Muhammad Mansoor Alam, Fazli Subhan and Muhammad Zubair Asghar
Computers 2022, 11(1), 3; https://doi.org/10.3390/computers11010003 - 27 Dec 2021
Cited by 27 | Viewed by 19541
Abstract
Research efforts in the field of sentiment analysis have increased exponentially in the last few years due to its applicability in areas such as online product purchasing, marketing, and reputation management. Social media and online shopping sites have become a rich source of [...] Read more.
Research efforts in the field of sentiment analysis have increased exponentially in the last few years due to its applicability in areas such as online product purchasing, marketing, and reputation management. Social media and online shopping sites have become a rich source of user-generated data. Manufacturing, sales, and marketing organizations are progressively turning to this source for worldwide feedback on their activities and products. Millions of sentences in Urdu and Roman Urdu are posted daily on social sites such as Facebook, Instagram, Snapchat, and Twitter. Disregarding people’s opinions in Urdu and Roman Urdu and considering only the resource-rich English language leads to the loss of this vast amount of data. Our research focused on collecting research papers related to the Urdu and Roman Urdu languages and analyzing them in terms of preprocessing, feature extraction, and classification techniques. This paper contains a comprehensive study of research conducted on Roman Urdu and Urdu text for product reviews. The study is divided into categories such as collection of relevant corpora, data preprocessing, feature extraction, classification platforms and approaches, limitations, and future work. The comparison was made by evaluating different research factors, such as corpus, lexicon, and opinions. Each reviewed paper was evaluated against the provided benchmarks and categorized accordingly. Based on the results obtained and the comparisons made, we suggest some helpful directions for future study. Full article
