Search Results (26)

Search Parameters:
Keywords = multilingual comparable corpus

27 pages, 1222 KB  
Article
Query-Adaptive Hybrid Search
by Pavel Posokhov, Stepan Skrylnikov, Sergei Masliukhin, Alina Zavgorodniaia, Olesia Koroteeva and Yuri Matveev
Mach. Learn. Knowl. Extr. 2026, 8(4), 91; https://doi.org/10.3390/make8040091 - 5 Apr 2026
Viewed by 354
Abstract
The modern information retrieval field increasingly relies on hybrid search systems combining sparse retrieval with dense neural models. However, most existing hybrid frameworks employ static mixing coefficients and independent component training, failing to account for the specific needs of individual queries and corpus heterogeneity. In this paper, we introduce an adaptive hybrid retrieval framework featuring query-driven alpha prediction, which dynamically calibrates the mixing weights based on query latent representations. The predictor is instantiated in two variants, a lightweight low-latency configuration and a full-capacity encoder-scale predictor, enabling flexible trade-offs between computational efficiency and retrieval accuracy without relying on resource-inefficient LLM-based online evaluation. Furthermore, we propose antagonist negative sampling, a novel training paradigm that optimizes the dense encoder to resolve the systematic failures of the lexical retriever, prioritizing hard negatives where BM25 exhibits high uncertainty. Empirical evaluations on large-scale multilingual benchmarks (MLDR and MIRACL) indicate that our approach demonstrates superior average performance compared to state-of-the-art models such as BGE-M3 and mGTE, achieving an nDCG@10 of 74.3 on long-document retrieval. Notably, our framework recovers up to 92.5% of the theoretical oracle performance and yields significant improvements in nDCG@10 across 16 languages, particularly in challenging long-context scenarios. Full article
(This article belongs to the Special Issue Trustworthy AI: Integrating Knowledge, Retrieval, and Reasoning)
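The per-query mixing idea in this abstract can be illustrated with a minimal sketch: a predicted alpha in [0, 1] blends min-max-normalized sparse (BM25) and dense scores. The function names, the heuristic alpha predictor, and all scores below are invented for illustration; the paper's actual predictor learns alpha from query latent representations.

```python
def min_max_normalize(scores):
    """Map raw scores to [0, 1] so sparse and dense scales are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def predict_alpha(query_tokens):
    """Stand-in for the learned alpha predictor: lean on lexical matching
    for short keyword-style queries, on the dense model for longer ones."""
    return 0.8 if len(query_tokens) <= 3 else 0.3

def hybrid_rank(query_tokens, bm25_scores, dense_scores):
    """Blend normalized sparse and dense scores with a per-query alpha."""
    alpha = predict_alpha(query_tokens)
    sparse = min_max_normalize(bm25_scores)
    dense = min_max_normalize(dense_scores)
    fused = {d: alpha * sparse[d] + (1 - alpha) * dense[d] for d in sparse}
    return sorted(fused, key=fused.get, reverse=True)

# Toy scores for three documents (invented).
bm25 = {"doc1": 12.0, "doc2": 3.0, "doc3": 7.5}
dense = {"doc1": 0.41, "doc2": 0.88, "doc3": 0.52}
```

With these toy scores, a short keyword query is ranked mostly by BM25, while a longer natural-language query lets the dense signal dominate.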

29 pages, 423 KB  
Article
Reliability-Aware Multilingual Sentiment Analytics for Agricultural Market Intelligence
by Jantima Polpinij, Christopher S. G. Khoo, Wei-Ning Cheng, Thananchai Khamket, Chumsak Sibunruang and Manasawee Kaenampornpan
Mathematics 2026, 14(7), 1220; https://doi.org/10.3390/math14071220 - 5 Apr 2026
Viewed by 345
Abstract
Public opinion on online platforms now plays an important role in agricultural markets, which have always been unpredictable. Although sentiment analysis has been widely applied to agricultural texts, most existing studies typically focus only on classification accuracy without connecting results to actual market intelligence systems, especially in multilingual contexts. This paper introduces a reliability-aware transformer-based framework for analyzing sentiment in agricultural market intelligence across multiple languages. The framework leverages weakly supervised multilingual transformers to extract sentiment signals from large-scale unlabeled Thai and English texts about major agricultural commodities found online. To enhance robustness under weak supervision, the framework incorporates reliability-aware mechanisms, including confidence-based pseudo-label filtering, cross-source consistency refinement, and expert-guided calibration to reduce noise and account for bias between different data sources. Sentiment predictions are further aligned with market intelligence objectives through reliability-weighted aggregation, yielding interpretable sentiment indices that enable cross-lingual and cross-source comparability. We tested the framework extensively using a multilingual agricultural corpus derived from social media and news coverage of agriculture. The results show consistent improvements over both classical machine learning approaches and standard multilingual transformer baselines. Additional ablation studies and sensitivity analyses confirmed that reliability-aware mechanisms, particularly confidence thresholding, play a crucial role in striking the right balance between label quality and data coverage. Overall, the results indicate that reliability-aware multilingual sentiment analytics provide robust and actionable insights for agricultural market monitoring and policy analysis. Full article
(This article belongs to the Special Issue Application of Machine Learning and Data Mining, 2nd Edition)
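Two of the reliability-aware mechanisms named above, confidence-based pseudo-label filtering and reliability-weighted aggregation, can be sketched as follows. All field names, the 0.8 threshold, and the per-source reliability values are hypothetical, not from the paper.

```python
def filter_pseudo_labels(preds, threshold=0.8):
    """Confidence-based filtering: discard low-confidence pseudo-labels."""
    return [p for p in preds if p["conf"] >= threshold]

def sentiment_index(preds, reliability):
    """Reliability-weighted aggregation into one index in [-1, 1]: each
    prediction is weighted by its confidence times the estimated
    reliability of its source (e.g. news vs. social media)."""
    polarity = {"pos": 1.0, "neu": 0.0, "neg": -1.0}
    weights = [p["conf"] * reliability[p["source"]] for p in preds]
    if not weights:
        return 0.0
    signed = sum(w * polarity[p["label"]] for w, p in zip(weights, preds))
    return signed / sum(weights)

# Invented example predictions and source reliabilities.
preds = [
    {"label": "pos", "conf": 0.90, "source": "news"},
    {"label": "neg", "conf": 0.60, "source": "social"},
    {"label": "pos", "conf": 0.85, "source": "social"},
]
reliability = {"news": 1.0, "social": 0.5}
```

Filtering first and aggregating afterwards mirrors the pipeline order the abstract describes: label quality is enforced before signals are turned into an index.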
25 pages, 1558 KB  
Article
Towards Scalable Monitoring: An Interpretable Multimodal Framework for Migration Content Detection on TikTok Under Data Scarcity
by Dimitrios Taranis, Gerasimos Razis and Ioannis Anagnostopoulos
Electronics 2026, 15(4), 850; https://doi.org/10.3390/electronics15040850 - 17 Feb 2026
Viewed by 526
Abstract
Short-form video platforms such as TikTok (TikTok Pte. Ltd., Singapore) host large volumes of user-generated, often ephemeral, content related to irregular migration, where relevant cues are distributed across visual scenes, on-screen text, and multilingual captions. Automatically identifying migration-related videos is challenging due to this multimodal complexity and the scarcity of labeled data in sensitive domains. This paper presents an interpretable multimodal classification framework designed for deployment under data-scarce conditions. We extract features from platform metadata, automated video analysis (Google Cloud Video Intelligence), and Optical Character Recognition (OCR) text, and compare text-only, OCR-only, and vision-only baselines against a multimodal fusion approach using Logistic Regression, Random Forest, and XGBoost. In this pilot study, multimodal fusion consistently improves class separation over single-modality models, achieving an F1-score of 0.92 for the migration-related class under stratified cross-validation. Given the limited sample size, these results are interpreted as evidence of feature separability rather than definitive generalization. Feature importance and SHAP analyses identify OCR-derived keywords, maritime cues, and regional indicators as the most influential predictors. To assess robustness under data scarcity, we apply SMOTE to synthetically expand the training set to 500 samples and evaluate performance on a small held-out set of real videos, observing stable results that further support feature-level robustness. Finally, we demonstrate scalability by constructing a weakly labeled corpus of 600 videos using the identified multimodal cues, highlighting the suitability of the proposed feature set for weakly supervised monitoring at scale. Overall, this work serves as a methodological blueprint for building interpretable multimodal monitoring pipelines in sensitive, low-resource settings. Full article
(This article belongs to the Special Issue Multimodal Learning for Multimedia Content Analysis and Understanding)
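A minimal sketch of the early-fusion step the abstract describes: per-modality features are concatenated into a single vector for a classical classifier (Logistic Regression, Random Forest, or XGBoost in the paper). The cue keywords and feature layout below are invented for illustration only.

```python
MIGRATION_CUES = ("boat", "border", "crossing")  # illustrative keywords only

def ocr_cue_features(ocr_text):
    """Binary indicators for cue keywords found in the OCR transcript."""
    tokens = set(ocr_text.lower().split())
    return [1.0 if cue in tokens else 0.0 for cue in MIGRATION_CUES]

def fuse_features(metadata_vec, video_vec, ocr_text):
    """Early fusion: concatenate per-modality features into one vector
    that a classical classifier can consume."""
    return list(metadata_vec) + list(video_vec) + ocr_cue_features(ocr_text)
```

Keeping each modality's slice of the vector fixed is what later makes feature-importance and SHAP analyses attributable to a specific modality.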

34 pages, 746 KB  
Article
An Integrated Approach to Adapting Open-Source AI Models for Machine Translation of Low-Resource Turkic Languages
by Ualsher Tukeyev, Assem Shormakova, Aidana Karibayeva, Diana Rakhimova, Balzhan Abduali, Dina Amirova, Nazym Rakhmanberdi and Rashid Aliyev
Computers 2026, 15(2), 73; https://doi.org/10.3390/computers15020073 - 28 Jan 2026
Viewed by 1187
Abstract
This study presents the application of free, open-source artificial intelligence (AI) techniques to advance machine translation for low-resource Turkic languages such as Kazakh, Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek. This machine translation problem for Turkic languages is part of a project to generate meeting minutes from speech transcripts. Due to limited parallel corpora and underdeveloped linguistic tools for these languages, traditional machine translation approaches often underperform. The goal is to reduce digital inequality for these languages and to support scalability. We investigate the effectiveness of free, open-source pre-trained specialized and general-purpose AI models for morphologically rich Turkic languages. This research includes developing parallel corpora for six Turkic languages, fine-tuning, and performance evaluation using the BLEU, WER, TER, and chrF metrics. Parallel corpora for five language pairs, of 300,000 and 500,000 sentences each, were generated and cleaned. The results for the 500,000-sentence corpora show significant improvements over the NLLB-200 1.3B baseline on average: BLEU increased by 23.81 points, chrF increased by 26.05 points, and WER and TER decreased by 0.36 and 33.95, respectively, after cleaning and fine-tuning. Six Turkic-language multilingual parallel corpora totaling 3,885,542 sentences were then developed; compared with the results for the 500,000-sentence cleaned corpus, fine-tuning NLLB-200 1.3B on them increased BLEU by 4.3 points and chrF by 1.7 points, while WER and TER decreased by 0.1 and 4.75, respectively. These results demonstrate the high efficiency of corpus cleaning and synthetic data generation for improving the quality of machine translation for low-resource Turkic languages using AI models. These results were confirmed by external evaluation on the FLORES-200 dataset and by human evaluation.
The scientific contribution of this article is the development of a methodology for generating parallel corpora using a specialized AI machine translation model and fine-tuning that model on the created corpora; the creation of new multilingual parallel corpora for the Azerbaijani–Kazakh, Kyrgyz–Kazakh, Turkish–Kazakh, Turkmen–Kazakh, and Uzbek–Kazakh pairs using the proposed methodology; cleaning them; and conducting fine-tuning experiments. Full article
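The corpus cleaning the abstract credits with large metric gains can be illustrated as a minimal filter over (source, target) sentence pairs. The rules and thresholds here (exact-duplicate removal, a minimum length, a token-length ratio) are illustrative stand-ins; the paper's pipeline is more elaborate.

```python
def clean_parallel_pairs(pairs, max_ratio=2.0, min_len=3):
    """Minimal cleaning pass over (source, target) sentence pairs: drop
    exact duplicates, very short pairs, and pairs whose token-length
    ratio suggests misalignment."""
    seen, kept = set(), []
    for src, tgt in pairs:
        s_len, t_len = len(src.split()), len(tgt.split())
        if (src, tgt) in seen or min(s_len, t_len) < min_len:
            continue
        if max(s_len, t_len) / min(s_len, t_len) > max_ratio:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept

# Invented toy pairs: a good pair, its duplicate, a too-short pair,
# and a badly misaligned pair.
pairs = [
    ("bul synaq soilemi boldy", "this is a test sentence"),
    ("bul synaq soilemi boldy", "this is a test sentence"),
    ("qysqa", "short"),
    ("bir eki ush tort bes alty jeti segiz", "one two"),
]
```

Each rule targets a distinct failure mode: duplicates bias training counts, short fragments carry little signal, and extreme length ratios usually indicate misaligned sentences.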

15 pages, 692 KB  
Article
Reputation and Guest Experience in Bali’s Spa Hotels: A Big Data Perspective
by Neila Aisha, Angellie Williady and Hak-Seon Kim
Tour. Hosp. 2025, 6(4), 180; https://doi.org/10.3390/tourhosp6040180 - 17 Sep 2025
Viewed by 2991
Abstract
This study examines how psycholinguistic features of online reviews relate to guest satisfaction in Bali’s spa hotel market. Using LIWC-22 category rates from Google Maps reviews, a corpus of 15,560 quality-filtered reviews from ten leading spa hotels was analyzed. Exploratory factor analysis yielded four interpretable dimensions—Social, Health and Wellness, Emotional Tone, and Lifestyle. In regressions predicting review star ratings (satisfaction), Social (β = 0.028) and Health and Wellness (β = 0.023) showed small but statistically detectable positive associations, whereas Emotional Tone (β = 0.006, t = 0.727) and Lifestyle (β = 0.004, t = 0.476) were not significant. The model’s explained variance is negligible (R2 = 0.001; F = 5.283, p < 0.05), reflecting the many influences on ratings beyond review language; findings are interpreted as directional associations rather than predictive effects. Practically, the results point to prioritizing interpersonal service cues and wellness/treatment assurances, with tone monitoring being used for service-recovery signals. The design favors interpretability (validated, word-based categories; full-history snapshot) over black-box complexity, and transferability is Bali-specific and conditional on comparable market features. Future work should add contextual covariates (e.g., price and location), apply explicit temporal segmentation, extend to multilingual corpora, and triangulate text analytics with brief questionnaires and qualitative inquiry to strengthen validity and explanatory power. Full article

25 pages, 1380 KB  
Review
A Systematic Review and Experimental Evaluation of Classical and Transformer-Based Models for Urdu Abstractive Text Summarization
by Muhammad Azhar, Adeen Amjad, Deshinta Arrova Dewi and Shahreen Kasim
Information 2025, 16(9), 784; https://doi.org/10.3390/info16090784 - 9 Sep 2025
Cited by 6 | Viewed by 1658
Abstract
The rapid growth of digital content in Urdu has created an urgent need for effective automatic text summarization (ATS) systems. While extractive methods have been widely studied, abstractive summarization for Urdu remains largely unexplored due to the language’s complex morphology and rich literary tradition. This paper systematically evaluates four transformer-based language models (BERT-Urdu, BART, mT5, and GPT-2) for Urdu abstractive summarization, comparing their performance against conventional machine learning and deep learning approaches. Using multiple Urdu datasets—including the Urdu Summarization Corpus, Fake News Dataset, and Urdu-Instruct-News—we show that fine-tuned Transformer Language Models (TLMs) consistently outperform traditional methods, with the multilingual mT5 model achieving a 0.42 absolute improvement in F1-score over the best baseline. Our analysis reveals that mT5’s architecture is particularly effective at handling Urdu-specific challenges such as right-to-left script processing, diacritic interpretation, and complex verb–noun compounding. Furthermore, we present empirically validated hyperparameter configurations and training strategies for Urdu ATS, establishing transformer-based approaches as the new state-of-the-art for Urdu summarization. Notably, mT5 outperforms Seq2Seq baselines by up to 20% in ROUGE-L, underscoring the efficacy of Transformer-based models for low-resource languages. This work contributes both a systematic review of prior research and a novel empirical benchmark for advancing Urdu abstractive summarization. Full article
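ROUGE-L, the metric in which mT5 reportedly gains up to 20%, scores the longest common subsequence (LCS) between a candidate summary and a reference. A minimal stdlib sketch follows; it tokenizes on whitespace only, so real Urdu evaluation would additionally need proper tokenization and normalization.

```python
def lcs_length(a, b):
    """Longest common subsequence length (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

Because the LCS preserves word order without requiring contiguity, ROUGE-L rewards abstractive summaries that keep the reference's narrative order even when they drop or insert words.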

37 pages, 5086 KB  
Article
Global Embeddings, Local Signals: Zero-Shot Sentiment Analysis of Transport Complaints
by Aliya Nugumanova, Daniyar Rakhimzhanov and Aiganym Mansurova
Informatics 2025, 12(3), 82; https://doi.org/10.3390/informatics12030082 - 14 Aug 2025
Cited by 2 | Viewed by 3250
Abstract
Public transport agencies must triage thousands of multilingual complaints every day, yet the cost of training and serving fine-grained sentiment analysis models limits real-time deployment. The proposed “one encoder, any facet” framework therefore offers a reproducible, resource-efficient alternative to heavy fine-tuning for domain-specific sentiment analysis or opinion mining tasks on digital service data. To the best of our knowledge, we are the first to test this paradigm on operational multilingual complaints, where public transport agencies must prioritize thousands of Russian- and Kazakh-language messages each day. A human-labelled corpus of 2400 complaints is embedded with five open-source universal models. Obtained embeddings are matched to semantic “anchor” queries that describe three distinct facets: service aspect (eight classes), implicit frustration, and explicit customer request. In the strict zero-shot setting, the best encoder reaches 77% accuracy for aspect detection, 74% for frustration, and 80% for request; taken together, these signals reproduce human four-level priority in 60% of cases. Attaching a single-layer logistic probe on top of the frozen embeddings boosts performance to 89% for aspect, 83–87% for the binary facets, and 72% for end-to-end triage. Compared with recent fine-tuned sentiment analysis systems, our pipeline cuts memory demands by two orders of magnitude and eliminates task-specific training yet narrows the accuracy gap to under five percentage points. These findings indicate that a single frozen encoder, guided by handcrafted anchors and an ultra-light head, can deliver near-human triage quality across multiple pragmatic dimensions, opening the door to low-cost, language-agnostic monitoring of digital-service feedback. Full article
(This article belongs to the Special Issue Practical Applications of Sentiment Analysis)
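The "one encoder, any facet" setup, matching frozen complaint embeddings against handcrafted anchor queries, reduces in the zero-shot case to nearest-anchor classification by cosine similarity. A minimal sketch with invented toy vectors and facet labels (a real system would embed complaints and anchor sentences with the same frozen encoder):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def zero_shot_classify(embedding, anchors):
    """Assign the label of the nearest anchor, where `anchors` maps each
    class label to the embedding of a handcrafted anchor query."""
    return max(anchors, key=lambda label: cosine(embedding, anchors[label]))

# Toy 2-d "embeddings" for two invented service-aspect classes.
anchors = {"schedule_delay": [1.0, 0.0], "driver_conduct": [0.0, 1.0]}
```

The logistic probe mentioned in the abstract would replace this argmax with a single learned linear layer over the same frozen embeddings.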

23 pages, 1604 KB  
Article
Fine-Tuning Large Language Models for Kazakh Text Simplification
by Alymzhan Toleu, Gulmira Tolegen and Irina Ualiyeva
Appl. Sci. 2025, 15(15), 8344; https://doi.org/10.3390/app15158344 - 26 Jul 2025
Cited by 2 | Viewed by 2516
Abstract
This paper addresses the text simplification task for Kazakh, a morphologically rich, low-resource language, by introducing KazSim, an instruction-tuned model built on multilingual large language models (LLMs). First, we develop a heuristic pipeline to identify complex Kazakh sentences, manually validating its performance on 400 examples and comparing it against a purely LLM-based selection method; we then use this pipeline to assemble a parallel corpus of 8709 complex–simple pairs via LLM augmentation. For the simplification task, we benchmark KazSim against standard Seq2Seq systems, domain-adapted Kazakh LLMs, and zero-shot instruction-following models. On an automatically constructed test set, KazSim (Llama-3.3-70B) achieves BLEU 33.50, SARI 56.38, and F1 87.56 with a length ratio of 0.98, outperforming all baselines. We also explore prompt language (English vs. Kazakh) and conduct human evaluation with three native speakers: KazSim scores 4.08 for fluency, 4.09 for meaning preservation, and 4.42 for simplicity—significantly above GPT-4o-mini. Error analysis shows that remaining failures cluster into tone change, tense change, and semantic drift, reflecting Kazakh’s agglutinative morphology and flexible syntax. Full article
(This article belongs to the Special Issue Natural Language Processing and Text Mining)
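The abstract does not specify the heuristic pipeline for identifying complex sentences, so the following is only a plausible minimal sketch using sentence length and average word length as crude proxies for syntactic and morphological complexity. Both thresholds are invented.

```python
def is_complex(sentence, max_tokens=15, max_avg_word_len=7.0):
    """Flag a sentence as 'complex' when it is long or its words are
    long on average (a rough proxy for heavy agglutinative morphology).
    Thresholds are illustrative, not from the paper."""
    tokens = sentence.split()
    if not tokens:
        return False
    avg_word_len = sum(len(t) for t in tokens) / len(tokens)
    return len(tokens) > max_tokens or avg_word_len > max_avg_word_len
```

In a pipeline like the one described, sentences flagged this way would be sent to an LLM for simplification, producing the complex–simple training pairs.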

20 pages, 1955 KB  
Article
Text Similarity Detection in Agglutinative Languages: A Case Study of Kazakh Using Hybrid N-Gram and Semantic Models
by Svitlana Biloshchytska, Arailym Tleubayeva, Oleksandr Kuchanskyi, Andrii Biloshchytskyi, Yurii Andrashko, Sapar Toxanov, Aidos Mukhatayev and Saltanat Sharipova
Appl. Sci. 2025, 15(12), 6707; https://doi.org/10.3390/app15126707 - 15 Jun 2025
Cited by 3 | Viewed by 2741
Abstract
This study presents an advanced hybrid approach for detecting near-duplicate texts in the Kazakh language, addressing the specific challenges posed by its agglutinative morphology. The proposed method combines statistical and semantic techniques, including N-gram analysis, TF-IDF, LSH, LSA, and LDA, and is benchmarked against the bert-base-multilingual-cased model. Experiments were conducted on the purpose-built Arailym-aitu/KazakhTextDuplicates corpus, which contains over 25,000 manually modified text fragments using typical techniques, such as paraphrasing, word order changes, synonym substitution, and morphological transformations. The results show that the hybrid model achieves a precision of 1.00, a recall of 0.73, and an F1-score of 0.84, significantly outperforming traditional N-gram and TF-IDF approaches and demonstrating comparable accuracy to the BERT model while requiring substantially lower computational resources. The hybrid model proved highly effective in detecting various types of near-duplicate texts, including paraphrased and structurally modified content, making it suitable for practical applications in academic integrity verification, plagiarism detection, and intelligent text analysis. Moreover, this study highlights the potential of lightweight hybrid architectures as a practical alternative to large transformer-based models, particularly for languages with limited annotated corpora and linguistic resources. It lays the foundation for future research in cross-lingual duplicate detection and deep model adaptation for the Kazakh language. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
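A toy sketch of a hybrid statistical similarity in the spirit of this abstract: a character n-gram signal (robust to agglutinative suffix changes) blended with a bag-of-words cosine. The blend weight and n-gram size are illustrative; the paper's actual model also incorporates TF-IDF, LSH, LSA, and LDA.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts; tolerant of morphological suffix changes."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def jaccard(a, b):
    """Multiset Jaccard similarity between two Counters."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0

def tf_cosine(a, b):
    """Cosine similarity of raw term-frequency vectors."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ta[w] * tb[w] for w in ta)
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_similarity(a, b, w_ngram=0.5):
    """Blend the character-level and word-level signals."""
    return (w_ngram * jaccard(char_ngrams(a), char_ngrams(b))
            + (1 - w_ngram) * tf_cosine(a, b))
```

The character-level component is what lets such a model score "kitap" and "kitapty" as related even though word-level matching treats them as different tokens, which is exactly the agglutinative-morphology problem the paper targets.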

18 pages, 332 KB  
Article
Weakly-Supervised Multilingual Medical NER for Symptom Extraction for Low-Resource Languages
by Rigon Sallauka, Umut Arioz, Matej Rojc and Izidor Mlakar
Appl. Sci. 2025, 15(10), 5585; https://doi.org/10.3390/app15105585 - 16 May 2025
Cited by 3 | Viewed by 2850
Abstract
Patient-reported health data, especially patient-reported outcome measures, are vital for improving clinical care but are often limited by memory bias, cognitive load, and inflexible questionnaires. Patients prefer conversational symptom reporting, highlighting the need for robust methods in symptom extraction and conversational intelligence. This study presents a weakly-supervised pipeline for training and evaluating medical Named Entity Recognition (NER) models across eight languages, with a focus on low-resource settings. A merged English medical corpus, annotated using the Stanza i2b2 model, was translated into German, Greek, Spanish, Italian, Portuguese, Polish, and Slovenian, preserving the entity annotations (medical problems, diagnostic tests, and treatments). Data augmentation addressed the class imbalance, and the fine-tuned BERT-based models consistently outperformed baselines. The English model achieved the highest F1 score (80.07%), followed by German (78.70%), Spanish (77.61%), Portuguese (77.21%), Slovenian (75.72%), Italian (75.60%), Polish (75.56%), and Greek (69.10%). Compared to the existing baselines, our models demonstrated notable performance gains, particularly in English, Spanish, and Italian. This research underscores the feasibility and effectiveness of weakly-supervised multilingual approaches for medical entity extraction, contributing to improved information access in clinical narratives—especially in under-resourced languages. Full article

22 pages, 487 KB  
Article
From Fact Drafts to Operational Systems: Semantic Search in Legal Decisions Using Fact Drafts
by Gergely Márk Csányi, Dorina Lakatos, István Üveges, Andrea Megyeri, János Pál Vadász, Dániel Nagy and Renátó Vági
Big Data Cogn. Comput. 2024, 8(12), 185; https://doi.org/10.3390/bdcc8120185 - 10 Dec 2024
Cited by 2 | Viewed by 3830
Abstract
This research paper presents findings from an investigation into the semantic similarity search task within the legal domain, using a corpus of 1172 Hungarian court decisions. The study establishes the groundwork for an operational semantic similarity search system designed to identify cases with comparable facts using preliminary legal fact drafts. Evaluating such systems often poses significant challenges, given the need for thorough document checks, which can be costly and limit evaluation reusability. To address this, the study employs manually created fact drafts for legal cases, enabling reliable ranking of original cases within retrieved documents and quantitative comparison of various vectorization methods. The study compares twelve different text embedding solutions (the most recent became available just a few weeks before the manuscript was written), identifying Cohere’s embed-multilingual-v3.0, Beijing Academy of Artificial Intelligence’s bge-m3, Jina AI’s jina-embeddings-v3, OpenAI’s text-embedding-3-large, and Microsoft’s multilingual-e5-large models as top performers. To overcome the transformer-based models’ context window limitation, we investigated chunking, striding, and last chunk scaling techniques, with last chunk scaling significantly improving embedding quality. The results suggest that the effectiveness of striding varies based on token count. Notably, employing striding with 16 tokens yielded optimal results, representing 3.125% of the context window size for the best-performing models. Results also suggested that, among the models with an 8192-token context window, the bge-m3 model is superior to the jina-embeddings-v3 and text-embedding-3-large models in capturing the relevant parts of a document when the text contains a significant amount of noise. The validity of the approach was evaluated and confirmed by legal experts. These insights led to an operational semantic search system for a prominent legal content provider. Full article
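The chunking, striding, and last-chunk-scaling techniques can be sketched as below. The window size, stride, and mean-pooling rule are simplified assumptions for illustration, not the paper's exact recipe.

```python
def chunk_with_stride(tokens, window=512, stride=16):
    """Overlapping chunks for an encoder with a fixed context window:
    each new chunk re-reads `stride` tokens of left context."""
    chunks, i = [], 0
    while True:
        chunks.append(tokens[i:i + window])
        if i + window >= len(tokens):
            break
        i += window - stride
    return chunks

def pool_chunks(chunk_embeddings, chunk_lengths, window):
    """Mean-pool chunk embeddings, down-weighting the (usually partial)
    final chunk by its fill ratio: a form of 'last chunk scaling'."""
    weights = [1.0] * len(chunk_embeddings)
    weights[-1] = chunk_lengths[-1] / window
    total = sum(weights)
    dim = len(chunk_embeddings[0])
    return [sum(w * e[k] for w, e in zip(weights, chunk_embeddings)) / total
            for k in range(dim)]
```

Down-weighting the last chunk keeps a short trailing fragment from contributing as much as a full window of text, which is one way to realize the improvement the abstract attributes to last chunk scaling.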

25 pages, 1115 KB  
Article
Explainable Pre-Trained Language Models for Sentiment Analysis in Low-Resourced Languages
by Koena Ronny Mabokela, Mpho Primus and Turgay Celik
Big Data Cogn. Comput. 2024, 8(11), 160; https://doi.org/10.3390/bdcc8110160 - 15 Nov 2024
Cited by 7 | Viewed by 5245
Abstract
Sentiment analysis is a crucial tool for measuring public opinion and understanding human communication across digital social media platforms. However, due to linguistic complexities and limited data or computational resources, it is under-represented in many African languages. While state-of-the-art Afrocentric pre-trained language models (PLMs) have been developed for various natural language processing (NLP) tasks, their applications in eXplainable Artificial Intelligence (XAI) remain largely unexplored. In this study, we propose a novel approach that combines Afrocentric PLMs with XAI techniques for sentiment analysis. We demonstrate the effectiveness of incorporating attention mechanisms and visualization techniques in improving the transparency, trustworthiness, and decision-making capabilities of transformer-based models when making sentiment predictions. To validate our approach, we employ the SAfriSenti corpus, a multilingual sentiment dataset for South African under-resourced languages, and perform a series of sentiment analysis experiments. These experiments enable comprehensive evaluations, comparing the performance of Afrocentric models against mainstream PLMs. Our results show that the Afro-XLMR model outperforms all other models, achieving an average F1-score of 71.04% across five tested languages, and the lowest error rate among the evaluated models. Additionally, we enhance the interpretability and explainability of the Afro-XLMR model using Local Interpretable Model-Agnostic Explanations (LIME) and Shapley Additive Explanations (SHAP). These XAI techniques ensure that sentiment predictions are not only accurate and interpretable but also understandable, fostering trust and reliability in AI-driven NLP technologies, particularly in the context of African languages. Full article
(This article belongs to the Special Issue Artificial Intelligence and Natural Language Processing)

19 pages, 914 KB  
Article
A Privacy-Preserving Multilingual Comparable Corpus Construction Method in Internet of Things
by Yu Weng, Shumin Dong and Chaomurilige
Mathematics 2024, 12(4), 598; https://doi.org/10.3390/math12040598 - 17 Feb 2024
Cited by 2 | Viewed by 2634
Abstract
With the expansion of the Internet of Things (IoT) and artificial intelligence (AI) technologies, multilingual scenarios are gradually increasing, and applications based on multilingual resources are also on the rise. In this process, apart from the need for the construction of multilingual resources, privacy protection issues like data privacy leakage are increasingly highlighted. A comparable corpus is important in multilingual language information processing in the IoT. However, multilingual comparable corpora with privacy preservation are rare, so there is an urgent need to construct such a corpus resource. This paper proposes a method for constructing a privacy-preserving multilingual comparable corpus, taking Chinese–Uighur–Tibetan IoT-based news as an example: it maps the texts in different languages to a unified vector space to avoid exposing sensitive information, then calculates the similarity between texts in different languages as a comparability index to construct comparable relations. Through a decision-making mechanism that minimizes the impossibility, it identifies comparable pairs of multilingual texts at chapter granularity to realize the construction of a privacy-preserving Chinese–Uighur–Tibetan comparable corpus (CUTCC). Evaluation experiments demonstrate the effectiveness of the proposed method, which outperforms in accuracy rate by 77%, recall rate by 34%, and F value by 47.17%. The CUTCC provides valuable privacy-preserving data resource support and language services for multilingual situations in the IoT. Full article
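The core construction step, mapping texts from different languages into one vector space and pairing them by a similarity-based comparability index, can be sketched as follows. The toy vectors, the 0.6 threshold, and the greedy highest-similarity decision rule are invented stand-ins for the paper's "minimizing the impossibility" mechanism.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_comparable(src_docs, tgt_docs, threshold=0.6):
    """Pair each source document with its most similar target document
    in the shared space, keeping only pairs whose comparability index
    clears the threshold."""
    pairs = []
    for sid, s_vec in src_docs.items():
        tid, score = max(((t, cosine(s_vec, v)) for t, v in tgt_docs.items()),
                         key=lambda x: x[1])
        if score >= threshold:
            pairs.append((sid, tid, score))
    return pairs

# Toy 2-d document vectors in the (hypothetical) unified space.
src = {"zh1": [1.0, 0.0], "zh2": [0.0, 1.0]}
tgt = {"ug1": [0.9, 0.1], "ug2": [0.1, 0.9], "ug3": [-1.0, 0.0]}
```

Working only on vectors in the shared space, never on the raw texts, is also what gives the construction its privacy-preserving character: sensitive surface content need not be exchanged.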

12 pages, 318 KB  
Article
esCorpius-m: A Massive Multilingual Crawling Corpus with a Focus on Spanish
by Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Ksenia Kharitonova and Zoraida Callejas
Appl. Sci. 2023, 13(22), 12155; https://doi.org/10.3390/app132212155 - 8 Nov 2023
Cited by 1 | Viewed by 3191
Abstract
In recent years, transformer-based models have played a significant role in advancing language modeling for natural language processing. However, they require substantial amounts of data, and there is a shortage of high-quality non-English corpora. Some recent initiatives have introduced multilingual datasets obtained through web crawling, but the results for some languages, including Spanish, have notable limitations: the datasets are either smaller than those for other languages or of lower quality due to insufficient cleaning and deduplication. In this paper, we present esCorpius-m, a multilingual corpus extracted from around 1 petabyte of Common Crawl data. For several of its languages, it is the most extensive corpus with this level of high-quality content extraction, cleaning, and deduplication. Our data curation process involves an efficient cleaning pipeline and several deduplication methods that maintain the integrity of document and paragraph boundaries. We also ensure compliance with EU regulations by retaining both the source web page URL and the WARC shared-origin URL. Full article
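The abstract mentions deduplication that preserves document and paragraph boundaries. A minimal sketch of that idea, assuming exact-match deduplication over whitespace-normalized paragraphs (the paper's actual pipeline and its near-duplicate handling are not reproduced here, and the function name is illustrative):

```python
import hashlib

def dedup_paragraphs(documents):
    """Drop exact-duplicate paragraphs across a corpus while keeping each
    document's remaining paragraphs in their original order, so document
    and paragraph boundaries stay intact."""
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            # Normalize whitespace and case so trivially reformatted
            # copies of the same paragraph hash identically.
            norm = " ".join(para.split()).lower()
            if not norm:
                continue
            digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(para)
        if kept:
            cleaned.append("\n\n".join(kept))
    return cleaned
```

Hashing paragraphs rather than storing them keeps the seen-set memory footprint bounded, which matters at Common Crawl scale.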

44 pages, 1396 KB  
Article
Subordination in Turkish Heritage Children with and without Developmental Language Impairment
by Nebiye Hilal Șan
Languages 2023, 8(4), 239; https://doi.org/10.3390/languages8040239 - 19 Oct 2023
Cited by 5 | Viewed by 6119
Abstract
A large body of cross-linguistic research has shown that complex constructions, such as subordinate constructions, are vulnerable in bilingual children with DLD, whereas they are robust in bilingual children with typical language development; they are therefore argued to constitute a potential clinical marker for identifying DLD in bilingual contexts, especially when the majority language is assessed. However, it is not clear whether this also applies to heritage contexts, particularly those in which the heritage language is affected by L2 contact-induced phenomena, as in the case of Heritage Turkish in Germany. In this study, we compare subordination in data from 13 Turkish heritage children with and without DLD (age range 5;1–11;6) to 10 late successive (lL2) BiTDs (age range 7;2–12;2) and 10 Turkish adult heritage bilinguals (age range 20;3–25;10), analyzing subordinate constructions with both Standard and Heritage Turkish as reference varieties. We further investigate which background factors predict performance on subordinate constructions. Speech samples were elicited using the sentence repetition task (SRT) from the TODİL standardized test battery and the Multilingual Assessment Instrument for Narratives (MAIN). A systematic analysis of the corpus of subordinate clauses produced in the SRT and in the MAIN narrative production and comprehension tasks shows that heritage children with TD and with DLD may not be differentiated through these tasks when their utterances are scored against the Standard Turkish variety as a baseline; they may, however, be differentiated if Heritage Turkish is taken as the baseline. The age of onset of the second language (AoO_L2) was the leading predictor of performance on subordinate clause production in the SRT and in both MAIN tasks, regardless of whether Standard or Heritage Turkish was used as the reference variety in scoring. Full article
(This article belongs to the Special Issue Bilingualism and Language Impairment)
