Search Results (50)

Search Parameters:
Keywords = dialect classification

37 pages, 5216 KiB  
Article
Unraveling the Overall Picture of Japanese Dialect Variation: What Factors Shape the Big Picture?
by Wilbert Heeringa and Fumio Inoue
Languages 2025, 10(6), 141; https://doi.org/10.3390/languages10060141 - 12 Jun 2025
Viewed by 830
Abstract
We studied Japanese dialect variation by calculating aggregated PMI Levenshtein distances among local Japanese dialects, using data from 2400 locations and 141 items from the Linguistic Atlas of Japan Database (LAJDB). Through factor analysis, we identified the latent linguistic variables underlying the aggregated distances. We found two factors, the first of which reflects a division into five groups, and the second of which reflects the long-standing East/West cultural contrast in mainland Japan, also known as the AB division. In the latter division, the eastern group includes the Okinawa islands. We paid special attention to the Tokyo dialect, which is associated with Standard Japanese. In a second factor analysis, only distances to the Tokyo dialect were considered. Although the patterns represented by the four factors vary, they consistently show that dialects geographically closer to Tokyo are more similar to the Tokyo dialect. Additionally, the first three factors reflected the similarity of the Hokkaido varieties to Tokyo’s local dialect. The results of the factor analyses were linked back to the individual variation patterns of the 141 items. A more precise analysis of Tokyo’s position within the Japanese dialect continuum revealed that it is situated within a region of local dialects characterized by relatively small average linguistic distances to other dialects. This area includes the more central part of mainland Japan and Hokkaido. When the influence of geographical distance is filtered out, only the local dialects of Hokkaido remain as dialects with the smallest average distance to other local dialects. Additionally, we observed that dialects geographically close to Tokyo are most closely related to it. However, when we again use distances that are controlled for geographical distance, the local dialects of Hokkaido stand out as closely related to the Tokyo dialect. This probably indicates that the Tokyo dialect has had a relatively large influence on Hokkaido.
(This article belongs to the Special Issue Dialectal Dynamics)
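The aggregate distance measure above is built from string edit distances whose segment substitution costs are learned from data. A minimal sketch of a Levenshtein distance with a pluggable substitution cost; the toy vowel-sensitive cost below merely stands in for PMI-derived costs and is not the paper's implementation:

```python
def weighted_levenshtein(a, b, sub_cost):
    """Levenshtein distance with a pluggable substitution-cost function.

    In PMI Levenshtein, sub_cost(x, y) is derived from the pointwise mutual
    information of segment pairs observed in prior alignments; here we use a
    toy cost purely for illustration.
    """
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + 1.0          # deletion
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + 1.0          # insertion
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                              # delete a[i-1]
                d[i][j - 1] + 1.0,                              # insert b[j-1]
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]), # substitute
            )
    return d[m][n]

# Toy cost: identical segments are free, vowel-for-vowel swaps are cheap
# (a stand-in for PMI costs, which favor frequently aligned segment pairs).
VOWELS = set("aeiou")
def toy_cost(x, y):
    if x == y:
        return 0.0
    if x in VOWELS and y in VOWELS:
        return 0.5
    return 1.0

print(weighted_levenshtein("yama", "yume", toy_cost))  # 1.0
```

In the study, such pairwise distances are averaged over the 141 items to give one aggregated distance per pair of locations.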

24 pages, 4913 KiB  
Article
Region-Wise Recognition and Classification of Arabic Dialects and Vocabulary: A Deep Learning Approach
by Fawaz S. Al–Anzi and Bibin Shalini Sundaram Thankaleela
Appl. Sci. 2025, 15(12), 6516; https://doi.org/10.3390/app15126516 - 10 Jun 2025
Viewed by 728
Abstract
This article presents a unique approach to Arabic dialect identification using a pre-trained speech classification model. The system categorizes Arabic audio clips into their respective dialects by employing 1D and 2D convolutional neural networks built from diverse dialects of the Arab region using deep learning models. Its objective is to enhance traditional linguistic handling and speech technology by accurately classifying Arabic audio clips into their corresponding dialects. The techniques involved include data collection, preprocessing, feature extraction, model architecture, and assessment metrics. The algorithm distinguishes various Arabic dialects, such as A (Arab nation authorized dialectal), EGY (Egyptian Arabic), GLF (Gulf Arabic), LAV and LF (Levantine Arabic, spoken in Syria, Lebanon, and Jordan), MSA (Modern Standard Arabic), NOR (North African Arabic), and SA (Saudi Arabic). Experimental results demonstrate the efficiency of the proposed approach in accurately determining diverse Arabic dialects, achieving a testing accuracy of 94.28% and a validation accuracy of 95.55%, surpassing traditional machine learning models such as Random Forest and SVM and advanced deep learning models such as CNN and CNN2D.
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)

25 pages, 3269 KiB  
Article
Augmentation and Classification of Requests in Moroccan Dialect to Improve Quality of Public Service: A Comparative Study of Algorithms
by Hajar Zaidani, Rim Koulali, Abderrahim Maizate and Mohamed Ouzzif
Future Internet 2025, 17(4), 176; https://doi.org/10.3390/fi17040176 - 17 Apr 2025
Viewed by 722
Abstract
Moroccan Law 55.19 aims to streamline administrative procedures, fostering trust between citizens and public administrations. To implement this law effectively and enhance public service quality, it is essential to use the Moroccan dialect to involve a wide range of people by leveraging Natural Language Processing (NLP) techniques customized to its specific linguistic characteristics. It is worth noting that the Moroccan dialect presents a unique linguistic landscape, marked by the coexistence of multiple texts. Though it has emerged as the preferred medium of communication on social media, reaching wide audiences, its perceived difficulty of comprehension remains unaddressed. This article introduces a new approach to addressing these challenges. First, we compiled and processed a dataset of Moroccan dialect requests for public administration documents, employing a new augmentation technique to enhance its size and diversity. Second, we conducted text classification experiments using various machine learning algorithms, ranging from traditional methods to advanced large language models (LLMs), to categorize the requests into three classes. The results indicate promising outcomes, with an accuracy of more than 80% for LLMs. Finally, we propose a chatbot system architecture for deploying the most efficient classification algorithm. This solution also contains a voice assistant system that can contribute to the social inclusion of illiterate people. The article concludes by outlining potential avenues for future research.

26 pages, 3529 KiB  
Article
Protecting Intellectual Security Through Hate Speech Detection Using an Artificial Intelligence Approach
by Sadeem Alrasheed, Suliman Aladhadh and Abdulatif Alabdulatif
Algorithms 2025, 18(4), 179; https://doi.org/10.3390/a18040179 - 21 Mar 2025
Viewed by 841
Abstract
Online social networks (OSNs) have become an integral part of daily life, with platforms such as X (formerly Twitter) being among the most popular in the Middle East. However, X faces the problem of widespread hate speech aimed at spreading hostility between communities, especially among Arabic-speaking users. This problem is exacerbated by the lack of effective tools for processing Arabic content and the complexity of the Arabic language, including its diverse grammar and dialects. This study developed a two-layer framework to detect and classify Arabic hate speech using machine learning and deep learning with various features and word embedding techniques. A large dataset of Arabic tweets was collected using the X API. The first layer of the framework focused on detecting hate speech, while the second layer classified it into religious, social, or political hate speech. Convolutional neural networks (CNN) outperformed other models, achieving an accuracy of 92% in hate speech detection and 93% in classification. These results highlight the framework’s effectiveness in addressing Arabic language complexities and improving content monitoring tools, thereby contributing to intellectual security and fostering a safer digital space.
(This article belongs to the Special Issue Machine Learning for Pattern Recognition (2nd Edition))

21 pages, 1753 KiB  
Article
Explainable Deep Learning for COVID-19 Vaccine Sentiment in Arabic Tweets Using Multi-Self-Attention BiLSTM with XLNet
by Asmaa Hashem Sweidan, Nashwa El-Bendary, Shereen A. Taie, Amira M. Idrees and Esraa Elhariri
Big Data Cogn. Comput. 2025, 9(2), 37; https://doi.org/10.3390/bdcc9020037 - 10 Feb 2025
Cited by 1 | Viewed by 1159
Abstract
The COVID-19 pandemic has generated a vast corpus of online conversations regarding vaccines, predominantly on social media platforms like X (formerly known as Twitter). However, analyzing sentiment in Arabic text is challenging due to the diverse dialects and lack of readily available sentiment analysis resources for the Arabic language. This paper proposes an explainable Deep Learning (DL) approach designed for sentiment analysis of Arabic tweets related to COVID-19 vaccinations. The proposed approach utilizes a Bidirectional Long Short-Term Memory (BiLSTM) network with a Multi-Self-Attention (MSA) mechanism to capture long-span contextual effects within the tweets, while the BiLSTM model learns the sequential structure of the Arabic text. Moreover, XLNet embeddings are utilized to feed contextual information into the model. Subsequently, two essential Explainable Artificial Intelligence (XAI) methods, namely Local Interpretable Model-Agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP), have been employed to gain further insight into the features’ contributions to overall model performance and thereby achieve a reasonable interpretation of the model’s output. The experimental results indicate that the combined XLNet with BiLSTM model outperforms other implemented state-of-the-art methods, achieving an accuracy of 93.2% and an F-measure of 92% for average sentiment classification. The integration of LIME and SHAP techniques not only enhanced the model’s interpretability, but also provided detailed insights into the factors that influence the classification of emotions. These findings underscore the model’s effectiveness and reliability for sentiment analysis in low-resource languages such as Arabic.
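The self-attention step in such a model can be illustrated with bare scaled dot-product attention over a list of hidden-state vectors (e.g. BiLSTM outputs, one per token). This dependency-free sketch omits the learned query/key/value projections and multiple attention heads of the actual model:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(H):
    """Single-head scaled dot-product self-attention over vectors H.

    Queries, keys, and values are taken as H itself (no learned
    projections) to keep the sketch minimal. Returns one attended
    vector per input position.
    """
    d = len(H[0])
    out = []
    for q in H:
        # Similarity of this position's query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in H]
        w = softmax(scores)
        # Attention output: weighted average of all value vectors.
        out.append([sum(wj * v[i] for wj, v in zip(w, H)) for i in range(d)])
    return out
```

Each output row is a convex combination of the inputs, so long-distance tokens can influence a position directly rather than only through the recurrent path.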

23 pages, 21519 KiB  
Article
Regional Color Study of Traditional Village Based on Random Forest Model: Taking the Minjiang River Basin as an Example
by Deyi Kong, Xinhui Fei, Zexuan Lu, Xinyue Lin, Mengqing Cai and Zujian Chen
Buildings 2025, 15(4), 524; https://doi.org/10.3390/buildings15040524 - 8 Feb 2025
Cited by 2 | Viewed by 759
Abstract
From the color geography perspective, a field investigation was conducted in the Minjiang River Basin, constructing a color index system of traditional villages. In Python, a random forest model was constructed to screen out important color indexes for traditional village color classification and explore its influence mechanism. Among eight color indexes, the important indexes are wall form and building face form, accounting for 30.50% and 19.40%, respectively. Based on this, the basin was divided into four color zones presenting color characteristics and eight color subzones presenting architectural features. The influence mechanism concerns dialect divisions that have shaped traditional villages of different color types, and the interconnection of water systems has promoted the connections among them. The application of traditional village colors in new urban and rural planning can enhance local characteristics. Integrating the color resources of traditional villages contributes to the regional protection of culture and economic development.
(This article belongs to the Section Architectural Design, Urban Science, and Real Estate)

19 pages, 4218 KiB  
Article
Dialect Classification and Everyday Culture: A Case Study from Austria
by Philip C. Vergeiner
Languages 2025, 10(2), 17; https://doi.org/10.3390/languages10020017 - 23 Jan 2025
Cited by 1 | Viewed by 1494
Abstract
Considering dialect areas as cultural areas has a long tradition in dialectology. Especially in the first half of the 20th century, researchers explored correspondences between dialect variation and other elements of everyday culture such as traditional clothing and customs. Since then, however, few studies have compared dialect variation with everyday culture, and virtually none have used quantitative methods. This study addresses this issue by employing a multivariate, dialectometric approach. It examines dialect variation in phonology and its relationship to non-linguistic aspects of everyday culture in Austria using two types of data: (a) dialect data from a recent dialect survey, and (b) ethnographic data published in the ‘Austrian Ethnographic Atlas’. Statistical methods such as multidimensional scaling (MDS) and cluster analysis (CA) are applied to 90 phonetic-phonological and 36 ethnographic variables. The results show only limited overlap between the linguistic and ethnographic data, with cultural patterns appearing more fragmented and small-scale. Geographical proximity is more indicative of cultural than linguistic similarity. MDS and CA reveal clear geographical patterns for the linguistic data that align with traditional dialect classifications. In contrast, the cultural data show less distinct clustering and only small-scale regions that do not coincide with the linguistic ones. This article discusses potential reasons for these differences.
(This article belongs to the Special Issue Dialectal Dynamics)
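The cluster analysis step in such dialectometric pipelines operates on a site-by-site distance matrix. A bare-bones average-linkage agglomerative clustering sketch, purely for illustration; the study may use a different linkage criterion and software:

```python
def average_linkage(dist, k):
    """Agglomerative clustering with average linkage.

    `dist` is a symmetric matrix of pairwise distances between sites;
    clusters are merged greedily until only `k` remain. Returns a list
    of clusters, each a list of site indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance between all cross-cluster site pairs.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Two tight pairs of sites, far apart from each other:
dist = [[0, 1, 9, 9],
        [1, 0, 9, 9],
        [9, 9, 0, 1],
        [9, 9, 1, 0]]
print(average_linkage(dist, 2))  # [[0, 1], [2, 3]]
```

Whether the resulting clusters form geographically coherent regions is exactly what the comparison between linguistic and ethnographic data above probes.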

17 pages, 1865 KiB  
Article
Improving Sentiment Analysis Performance on Imbalanced Moroccan Dialect Datasets Using Resample and Feature Extraction Techniques
by Zineb Nassr, Faouzia Benabbou, Nawal Sael and Touria Hamim
Information 2025, 16(1), 39; https://doi.org/10.3390/info16010039 - 10 Jan 2025
Viewed by 1414
Abstract
Sentiment analysis is a crucial component of text mining and natural language processing (NLP), involving the evaluation and classification of text data based on its emotional tone, typically categorized as positive, negative, or neutral. While significant research has focused on structured languages like English, unstructured languages, such as the Moroccan Dialect (MD), face substantial resource limitations and linguistic challenges, making effective sentiment analysis difficult. This study addresses this gap by exploring the integration of data-balancing techniques with machine learning (ML) methods, specifically investigating the impact of resampling techniques and feature extraction methods, including Term Frequency–Inverse Document Frequency (TF-IDF), Bag of Words (BOW), and N-grams. Through rigorous experimentation, we evaluate the effectiveness of these approaches in enhancing sentiment analysis accuracy for the Moroccan dialect. Our findings demonstrate that strategic resampling, combined with the TF-IDF method, significantly improves classification accuracy and robustness. We also explore the interaction between resampling strategies and feature extraction methods, revealing varying levels of effectiveness across different combinations. Notably, the Support Vector Machine (SVM) classifier, when paired with TF-IDF representation, achieves superior performance, with an accuracy of 90.24% and a precision of 90.34%. These results highlight the importance of tailored resampling techniques, appropriate feature extraction methods, and machine learning optimization in advancing sentiment analysis for under-resourced and dialect-heavy languages like the Moroccan dialect, providing a practical framework for future research and development in NLP for unstructured languages.
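The TF-IDF representation mentioned above weights each term by its frequency in a document, discounted by how many documents contain it. A minimal stdlib-only sketch; real pipelines add tokenization, smoothing, and vector normalization:

```python
import math
from collections import Counter

def tfidf(docs):
    """Plain TF-IDF over tokenized documents.

    `docs` is a list of token lists; returns one {term: weight} dict per
    document, where weight = (term count / doc length) * log(N / df).
    Terms appearing in every document get weight 0 (idf = log 1).
    """
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: (c / len(doc)) * math.log(n / df[t])
                    for t, c in tf.items()})
    return out

weights = tfidf([["good", "movie"], ["bad", "movie"]])
# "movie" occurs in every document, so it carries no discriminative weight.
print(weights[0]["movie"])  # 0.0
```

This down-weighting of ubiquitous terms is what makes TF-IDF pair well with linear classifiers such as the SVM reported above.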

16 pages, 1512 KiB  
Article
An End-To-End Speech Recognition Model for the North Shaanxi Dialect: Design and Evaluation
by Yi Qin and Feifan Yu
Sensors 2025, 25(2), 341; https://doi.org/10.3390/s25020341 - 9 Jan 2025
Viewed by 893
Abstract
The coal mining industry in Northern Shaanxi is robust, with a prevalent use of the local dialect, known as “Shapu”, characterized by a distinct Northern Shaanxi accent. This study addresses the practical need for speech recognition in this dialect. We propose an end-to-end speech recognition model for the North Shaanxi dialect, leveraging the Conformer architecture. To tailor the model to the coal mining context, we developed a specialized corpus reflecting the phonetic characteristics of the dialect and its usage in the industry. We investigated feature extraction techniques suitable for the North Shaanxi dialect, focusing on the unique pronunciation of initial consonants and vowels. A preprocessing module was designed to accommodate the dialect’s rapid speech tempo and polyphonic nature, enhancing recognition performance. To enhance the decoder’s text generation capability, we replaced the Conformer decoder with a Transformer architecture. Additionally, to mitigate the computational demands of the model, we incorporated Connectionist Temporal Classification (CTC) joint training for optimization. The experimental results on our self-established voice dataset for the Northern Shaanxi coal mining industry demonstrate that the proposed Conformer–Transformer–CTC model achieves a 9.2% and 10.3% reduction in the word error rate compared to the standalone Conformer and Transformer models, respectively, confirming the advancement of our method. The next step will involve researching how to improve the performance of dialect speech recognition by integrating external language models and extracting pronunciation features of different dialects, thereby achieving better recognition results.
(This article belongs to the Section Intelligent Sensors)
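CTC training lets the model emit one label per audio frame, including a special blank symbol; at inference time the frame sequence is collapsed by merging consecutive repeats and then dropping blanks. A minimal greedy decoder illustrating that collapse rule (label IDs are arbitrary; this is not the paper's decoder):

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a per-frame argmax sequence the CTC way.

    Consecutive repeated labels are merged, then blanks are removed.
    A blank between two identical labels keeps them distinct, which is
    how CTC represents genuinely doubled output symbols.
    """
    out = []
    prev = None
    for s in frame_labels:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

# The blank (0) between the two 3s preserves the repeated label:
print(ctc_greedy_decode([0, 3, 3, 0, 3, 5, 5, 0]))  # [3, 3, 5]
```

Joint CTC training, as in the Conformer–Transformer–CTC model above, adds this frame-wise objective alongside the attention decoder's loss to stabilize alignment learning.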

12 pages, 2630 KiB  
Article
Multimodal Seed Data Augmentation for Low-Resource Audio Latin Cuengh Language
by Lanlan Jiang, Xingguo Qin, Jingwei Zhang and Jun Li
Appl. Sci. 2024, 14(20), 9533; https://doi.org/10.3390/app14209533 - 18 Oct 2024
Viewed by 1131
Abstract
Latin Cuengh is a low-resource dialect that is prevalent in select ethnic minority regions in China. This language presents unique challenges for intelligent research and preservation efforts, primarily due to its oral tradition and the limited availability of textual resources. Prior research has sought to bolster intelligent processing capabilities with regard to Latin Cuengh through data augmentation techniques leveraging scarce textual data, with modest success. In this study, we introduce an innovative multimodal seed data augmentation model designed to significantly enhance the intelligent recognition and comprehension of this dialect. After supplementing the pre-trained model with extensive speech data, we fine-tune its performance with a modest corpus of multilingual textual seed data, employing both Latin Cuengh and Chinese texts as bilingual seed data to enrich its multilingual properties. We then refine its parameters through a variety of downstream tasks. The proposed model achieves a commendable performance across both multi-classification and binary classification tasks, with its average accuracy and F1 measure increasing by more than 3%. Moreover, the model’s training efficiency is substantially ameliorated through strategic seed data augmentation. Our research provides insights into the informatization of low-resource languages and contributes to their dissemination and preservation.

24 pages, 22050 KiB  
Article
SOD: A Corpus for Saudi Offensive Language Detection Classification
by Afefa Asiri and Mostafa Saleh
Computers 2024, 13(8), 211; https://doi.org/10.3390/computers13080211 - 20 Aug 2024
Viewed by 1842
Abstract
Social media platforms like X (formerly known as Twitter) are integral to modern communication, enabling the sharing of news, emotions, and ideas. However, they also facilitate the spread of harmful content, and manual moderation of these platforms is impractical. Automated moderation tools, predominantly developed for English, are insufficient for addressing online offensive language in Arabic, a language rich in dialects and informally used on social media. This gap underscores the need for dedicated, dialect-specific resources. This study introduces the Saudi Offensive Dialectal dataset (SOD), consisting of over 24,000 tweets annotated across three levels: offensive or non-offensive, with offensive tweets further categorized as general insults, hate speech, or sarcasm. A deeper analysis of hate speech identifies subtypes related to sports, religion, politics, race, and violence. A comprehensive descriptive analysis of the SOD is also provided to offer deeper insights into its composition. Using machine learning, traditional deep learning, and transformer-based deep learning models, particularly AraBERT, our research achieves a significant F1-Score of 87% in identifying offensive language. This score improves to 91% with data augmentation techniques addressing dataset imbalances. These results, which surpass many existing studies, demonstrate that a specialized dialectal dataset enhances detection efficacy compared to mixed-language datasets.
(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)
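One common way to address the dataset imbalance mentioned above is random oversampling: duplicating minority-class examples until every class matches the majority-class count. A minimal sketch (the study's own augmentation technique may differ):

```python
import random

def oversample(examples, labels, seed=0):
    """Random oversampling to a balanced class distribution.

    Minority-class examples are duplicated (sampled with replacement)
    until each class has as many examples as the largest class.
    Returns the augmented (examples, labels) lists.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(items) for items in by_class.values())
    xs, ys = [], []
    for y, items in by_class.items():
        picked = items + [rng.choice(items) for _ in range(target - len(items))]
        for x in picked:
            xs.append(x)
            ys.append(y)
    return xs, ys
```

Oversampling only the training split (never the test set) prevents duplicated examples from leaking across the evaluation boundary.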

20 pages, 2640 KiB  
Article
Enhancing Arabic Dialect Detection on Social Media: A Hybrid Model with an Attention Mechanism
by Wael M. S. Yafooz
Information 2024, 15(6), 316; https://doi.org/10.3390/info15060316 - 28 May 2024
Cited by 10 | Viewed by 3456
Abstract
Recently, the widespread use of social media and easy access to the Internet have brought about a significant transformation in the type of textual data available on the Web. This change is particularly evident in Arabic language usage, as the growing number of users from diverse domains has led to a considerable influx of Arabic text in various dialects, each characterized by differences in morphology, syntax, vocabulary, and pronunciation. Consequently, researchers in language recognition and natural language processing have become increasingly interested in identifying Arabic dialects. Numerous methods have been proposed to recognize this informal data, owing to its crucial implications for several applications, such as sentiment analysis, topic modeling, text summarization, and machine translation. However, Arabic dialect identification is a significant challenge due to the vast diversity of the Arabic language in its dialects. This study introduces a novel hybrid machine and deep learning model, incorporating an attention mechanism for detecting and classifying Arabic dialects. Several experiments were conducted on a novel dataset of user-generated Twitter comments in four Arabic dialects, namely Egyptian, Gulf, Jordanian, and Yemeni, to evaluate the effectiveness of the proposed model. The dataset comprises 34,905 rows extracted from Twitter, representing an unbalanced data distribution. The data annotation was performed by native speakers proficient in each dialect. The results demonstrate that the proposed model outperforms long short-term memory, bidirectional long short-term memory, and logistic regression models in dialect classification using different word representations: term frequency-inverse document frequency, Word2Vec, and global vectors for word representation.
(This article belongs to the Special Issue Recent Advances in Social Media Mining and Analysis)

24 pages, 5795 KiB  
Article
Evaluating Arabic Emotion Recognition Task Using ChatGPT Models: A Comparative Analysis between Emotional Stimuli Prompt, Fine-Tuning, and In-Context Learning
by El Habib Nfaoui and Hanane Elfaik
J. Theor. Appl. Electron. Commer. Res. 2024, 19(2), 1118-1141; https://doi.org/10.3390/jtaer19020058 - 14 May 2024
Cited by 4 | Viewed by 4043
Abstract
Textual emotion recognition (TER) has significant commercial potential since it can be used as an excellent tool to monitor a brand/business reputation, understand customer satisfaction, and personalize recommendations. It is considered a natural language processing task that can be used to understand and classify emotions such as anger, happiness, and surprise being conveyed in a piece of text (product reviews, tweets, and comments). Despite the advanced development of deep learning and particularly transformer architectures, Arabic-focused models for emotion classification have not achieved satisfactory accuracy. This is mainly due to the morphological richness, agglutination, dialectal variation, and low-resource datasets of the Arabic language, as well as the unique features of user-generated text such as noisiness, shortness, and informal language. This study aims to illustrate the effectiveness of large language models on Arabic multi-label emotion classification. We evaluated GPT-3.5 Turbo and GPT-4 using three different settings: in-context learning, emotional stimuli prompt, and fine-tuning. The ultimate objective of this research paper is to determine if these LLMs, which have multilingual capabilities, could contribute to enhancing the aforementioned task and encourage its use within the context of an e-commerce environment, for example. The experimental results indicated that the fine-tuned GPT-3.5 Turbo model achieved an accuracy of 62.03%, a micro-averaged F1-score of 73%, and a macro-averaged F1-score of 62%, establishing a new state-of-the-art benchmark for the task of Arabic multi-label emotion recognition.
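The micro- and macro-averaged F1 scores reported above differ in where the averaging happens: micro pools true/false positive counts across all labels before computing one F1, while macro computes F1 per label and averages, so rare emotions weigh as much as common ones. A minimal multi-label implementation:

```python
from collections import Counter

def f1_scores(gold, pred, labels):
    """Micro- and macro-averaged F1 for multi-label classification.

    `gold` and `pred` are parallel lists of label sets (one set per
    example). Returns (micro_f1, macro_f1).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for lab in labels:
            tp[lab] += int(lab in g and lab in p)
            fp[lab] += int(lab not in g and lab in p)
            fn[lab] += int(lab in g and lab not in p)

    def f1(t, false_pos, false_neg):
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro
```

A macro score well below the micro score, as in the results above, typically signals weaker performance on the rarer emotion labels.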

26 pages, 2339 KiB  
Article
Switching Self-Attention Text Classification Model with Innovative Reverse Positional Encoding for Right-to-Left Languages: A Focus on Arabic Dialects
by Laith H. Baniata and Sangwoo Kang
Mathematics 2024, 12(6), 865; https://doi.org/10.3390/math12060865 - 15 Mar 2024
Cited by 4 | Viewed by 2306
Abstract
Transformer models have emerged as frontrunners in the field of natural language processing, primarily due to their adept use of self-attention mechanisms to grasp the semantic linkages between words in sequences. Despite their strengths, these models often face challenges in single-task learning scenarios, particularly when it comes to delivering top-notch performance and crafting strong latent feature representations. This challenge is more pronounced in the context of smaller datasets and is particularly acute for under-resourced languages such as Arabic. In light of these challenges, this study introduces a novel methodology for text classification of Arabic texts. This method harnesses the newly developed Reverse Positional Encoding (RPE) technique. It adopts an inductive-transfer learning (ITL) framework combined with a switching self-attention shared encoder, thereby increasing the model’s adaptability and improving its sentence representation accuracy. The integration of Mixture of Experts (MoE) and RPE techniques empowers the model to process longer sequences more effectively. This enhancement is notably beneficial for Arabic text classification, adeptly supporting both the intricate five-point and the simpler ternary classification tasks. The empirical evidence points to its outstanding performance, achieving accuracy rates of 87.20% for the HARD dataset, 72.17% for the BRAD dataset, and 86.89% for the LABR dataset, as evidenced by the assessments conducted on these datasets.
(This article belongs to the Special Issue Recent Trends and Advances in the Natural Language Processing)
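Reverse Positional Encoding is described above as tailored to right-to-left text. One plausible reading, sketched here purely as an illustration (the paper's exact formulation may differ), is to index the standard sinusoidal encoding from the end of the sequence, so the final token receives position 0:

```python
import math

def sinusoidal_pe(pos, d_model):
    """Standard sinusoidal position vector (Vaswani et al. style)."""
    return [
        math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]

def reverse_positional_encoding(seq_len, d_model):
    """Assumed RPE sketch: positions counted from the sequence end.

    Token t (0-based, in storage order) gets the encoding for position
    seq_len - 1 - t, anchoring the encoding at the end of the sequence.
    """
    return [sinusoidal_pe(seq_len - 1 - t, d_model) for t in range(seq_len)]
```

Under this reading, padding or truncating at the front of a sequence leaves the encodings of the trailing tokens unchanged, which is one motivation for end-anchored position schemes.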

20 pages, 1045 KiB  
Article
Transformer Text Classification Model for Arabic Dialects That Utilizes Inductive Transfer
by Laith H. Baniata and Sangwoo Kang
Mathematics 2023, 11(24), 4960; https://doi.org/10.3390/math11244960 - 14 Dec 2023
Cited by 9 | Viewed by 2890
Abstract
In the realm of the five-category classification endeavor, there has been limited exploration of applied techniques for classifying Arabic text. These methods have primarily leaned on single-task learning, incorporating manually crafted features that lack robust sentence representations. Recently, the Transformer paradigm has emerged as a highly promising alternative. However, when these models are trained using single-task learning, they often face challenges in achieving outstanding performance and generating robust latent feature representations, especially when dealing with small datasets. This issue is particularly pronounced in the context of the Arabic dialect, which has a scarcity of available resources. Given these constraints, this study introduces an innovative approach to dissecting sentiment in Arabic text. This approach combines Inductive Transfer (INT) with the Transformer paradigm to augment the adaptability of the model and refine the representation of sentences. By employing self-attention (SE-A) and feed-forward sub-layers as a shared Transformer encoder for both the five-category and three-category Arabic text classification tasks, this proposed model adeptly discerns sentiment in Arabic dialect sentences. The empirical findings underscore the commendable performance of the proposed model, as demonstrated in assessments of the Hotel Arabic-Reviews Dataset, the Book Reviews Arabic Dataset, and the LABR dataset.
(This article belongs to the Special Issue Recent Trends and Advances in the Natural Language Processing)
