Advancing Natural Language Processing for Low-Resource Languages and Dialects

Special Issue Editors


Dr. Tanjim Mahmud
Guest Editor
Department of Computer Science and Engineering, Rangamati Science and Technology University, Rangamati 4500, Bangladesh
Interests: AI; evolutionary computing and image processing; NLP; AI in healthcare and agriculture

Prof. Dr. Michal Ptaszynski
Guest Editor
Text Information Processing Laboratory, Kitami Institute of Technology, 165 Koen-cho, Kitami 090-8507, Japan
Interests: abusive text detection; affect and sentiment analysis; affective computing (AC); Ainu language; artificial intelligence (AI); automatic cyberbullying detection; computational linguistics (CLs); corpus linguistics; emotional intelligence; human–computer interaction (HCI); large language models; linguistics; natural language processing (NLP); offensive text detection; philosophy of emotions; pragmatics

Prof. Dr. Karl Andersson
Guest Editor
Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, SE-93187 Skellefteå, Sweden
Interests: pervasive and mobile computing

Special Issue Information

Dear Colleagues,

Natural Language Processing (NLP) has achieved remarkable success in high-resource languages; however, the majority of the world’s languages and dialects remain underrepresented due to limited data, linguistic diversity, and cultural complexity. In this Special Issue, we aim to address this imbalance by highlighting recent advances, challenges, and innovative solutions for low-resource languages and dialects, with a focus on methodologies that improve language understanding, generation, and classification when annotated resources are scarce or unavailable. Topics of interest include, but are not limited to, multilingual and crosslingual learning, transfer learning, zero-shot and few-shot approaches, dataset creation and annotation strategies, explainable AI, and culturally grounded NLP applications. Special attention is given to dialectal variations, code-mixing, sarcasm, and context-dependent meanings that are often overlooked in conventional models. By collating interdisciplinary research from linguistics, computer science, and social sciences, in this Special Issue we seek to foster inclusive NLP technologies that support digital equity and preserve linguistic diversity. The contributions are expected to advance both theoretical understanding and practical applications, enabling NLP systems to better serve underrepresented communities worldwide.

Dr. Tanjim Mahmud
Prof. Dr. Michal Ptaszynski
Prof. Dr. Karl Andersson
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Machine Learning and Knowledge Extraction is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • low-resource and dialect natural language processing
  • multilingual and crosslingual NLP
  • language resources and dataset creation
  • dialect and code-mixed language processing
  • NLP for underrepresented languages
  • machine learning and deep learning for NLP
  • linguistic and cultural adaptation in NLP
  • ethical and inclusive AI for language technologies

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad-scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information is available in MDPI's Special Issue policies.

Published Papers (3 papers)

Research

18 pages, 1843 KB  
Article
MENARA: Medical Natural Arabic Response Assistant
by Ahmed Ibrahim, Abdullah Hosseini, Hoda Helmy, Maryam Arabi, Aya AlShareef, Wafa Lakhdhar and Ahmed Serag
Mach. Learn. Knowl. Extr. 2026, 8(4), 110; https://doi.org/10.3390/make8040110 - 21 Apr 2026
Viewed by 568
Abstract
Dialectal variation presents a major challenge for deploying medical language models in real-world healthcare settings, where patient–clinician communication often occurs in regional vernaculars rather than standardized language forms. This challenge is particularly pronounced in the Arabic-speaking world, where clinical interactions frequently take place in diverse dialects that differ substantially from Modern Standard Arabic. Fine-tuning and maintaining separate models for each dialect is computationally inefficient and difficult to scale, motivating more integrated approaches. In this work, we present MENARA, an Arabic medical language model constructed by merging Egyptian Arabic, Moroccan Darija, and medical-domain specialist models. We extend prior feasibility findings through comprehensive evaluation of cross-dialect performance, medical safety, and cross-lingual knowledge retention. Specifically, we introduce a fine-grained dialect composition analysis to quantify lexical purity and structured code-switching behavior, benchmark against state-of-the-art Arabic LLMs, and conduct subject-matter-expert assessment of both dialectal fidelity and medical appropriateness. The results show that model merging preserves core medical competence while enabling robust dialectal adaptation, achieving strong cross-dialect fidelity while substantially reducing storage and deployment overhead compared to maintaining separate models. These findings establish model merging as a potentially practical and resource-efficient paradigm for dialect-aware medical NLP in linguistically fragmented healthcare environments.

23 pages, 878 KB  
Article
Enhancing Arabic Multi-Task Sentiment Analysis Through Distillation and Adversarial Training
by Hafida Hidani, Safâa El Ouahabi and Mouncef Filali Bouami
Mach. Learn. Knowl. Extr. 2026, 8(4), 100; https://doi.org/10.3390/make8040100 - 13 Apr 2026
Viewed by 642
Abstract
The rapid growth of Arabic social media content requires the development of accurate and efficient methods for sentiment analysis. We propose a resource-efficient multi-task learning (MTL) framework for Modern Standard Arabic (MSA). The model uses a shared AraBERT encoder to jointly predict emotion, polarity, and intention. We integrate knowledge distillation (KD) from a large teacher model, self-distillation (SD) using model self-ensembling, and adversarial training (AT) as a regularization strategy. Experiments conducted on an annotated corpus of MSA tweets demonstrate that all distilled models outperform a fine-tuned multi-task baseline, and the combined KD+SD+AT configuration achieves competitive results. For instance, KD alone raised Macro F1 for emotion from 0.83 to 0.88 and for intention from 0.67 to 0.72. KD+SD+AT achieved the best intention F1 (0.76) and the highest polarity F1 (0.90). Notably, F1-scores for several minority classes show consistent improvement, particularly under KD and combined configurations. Paired t-tests confirm that several improvements, especially those obtained with KD and KD+SD+AT, are statistically significant (p < 0.05). Our results indicate that distillation, combined with adversarial regularization, enables the development of smaller and more efficient Arabic sentiment models while maintaining competitive accuracy. These findings address a gap in Arabic multi-task sentiment analysis and provide a scalable, resource-efficient framework, along with empirical insights for distillation in Arabic language models.

51 pages, 1067 KB  
Article
Language Models Are Polyglots: Language Similarity Predicts Cross-Lingual Transfer Learning Performance
by Juuso Eronen, Michal Ptaszynski, Tomasz Wicherkiewicz, Robert Borges, Katarzyna Janic, Zhenzhen Liu, Tanjim Mahmud and Fumito Masui
Mach. Learn. Knowl. Extr. 2026, 8(3), 65; https://doi.org/10.3390/make8030065 - 7 Mar 2026
Viewed by 3134
Abstract
Selecting a source language for zero-shot cross-lingual transfer is typically done by intuition or by defaulting to English, despite large performance differences across language pairs. We study whether linguistic similarity can predict transfer performance and support principled source-language selection. We introduce quantified WALS (qWALS), a typology-based similarity metric derived from features in the World Atlas of Language Structures, and evaluate it against existing similarity baselines. Validation uses three complementary signals: computational similarity scores, zero-shot transfer performance of multilingual transformers (mBERT and XLM-R) on four NLP tasks (dependency parsing, named entity recognition, sentiment analysis, and abusive language identification) across eight languages, and an expert-linguist similarity survey. Across tasks and models, higher linguistic similarity is associated with better transfer, and the survey provides independent support for the computational metrics.