Search Results (282)

Search Parameters:
Keywords = multilingual researchers

19 pages, 298 KiB  
Review
Speaking the Self: How Native-Language Psychotherapy Enables Change in Refugees: A Person-Centered Perspective
by Viktoriya Zipper-Weber
Healthcare 2025, 13(15), 1920; https://doi.org/10.3390/healthcare13151920 - 6 Aug 2025
Abstract
Background: Since the outbreak of war in Ukraine, countless forcibly displaced individuals facing not only material loss, but also deep psychological distress, have sought refuge across Europe. For those traumatized by war, the absence of a shared language in therapy can hinder healing and exacerbate suffering. While cultural diversity in psychotherapy has gained recognition, the role of native-language communication—especially from a person-centered perspective—remains underexplored. Methods: This narrative review with a thematic analysis examines whether and how psychotherapy in the mother tongue facilitates access to therapy and enhances therapeutic efficacy. Four inter-related clusters emerged: (1) the psychosocial context of trauma and displacement; (2) language as a structural gatekeeper to care (RQ1); (3) native-language therapy as a mechanism of change (RQ2); (4) potential risks such as over-identification or therapeutic mismatch (RQ2). Results: The findings suggest that native-language therapy can support the symbolic integration of trauma and foster the core conditions for healing. The implications for multilingual therapy formats, training in interpreter-mediated settings, and future research designs—including longitudinal, transnational studies—are discussed. Conclusions: In light of the current crises, language is not just a tool for access to therapy, but a pathway to psychological healing.
(This article belongs to the Special Issue Healthcare for Immigrants and Refugees)
27 pages, 1481 KiB  
Article
Integration of Associative Tokens into Thematic Hyperspace: A Method for Determining Semantically Significant Clusters in Dynamic Text Streams
by Dmitriy Rodionov, Boris Lyamin, Evgenii Konnikov, Elena Obukhova, Gleb Golikov and Prokhor Polyakov
Big Data Cogn. Comput. 2025, 9(8), 197; https://doi.org/10.3390/bdcc9080197 - 25 Jul 2025
Abstract
With the exponential growth of textual data, traditional topic modeling methods based on static analysis demonstrate limited effectiveness in tracking the dynamics of thematic content. This research aims to develop a method for quantifying the dynamics of topics within text corpora using a thematic signal (TS) function that accounts for temporal changes and semantic relationships. The proposed method combines associative tokens with original lexical units to reduce thematic entropy and information noise. Approaches employed include topic modeling (LDA), vector representations of texts (TF-IDF, Word2Vec), and time series analysis. The method was tested on a corpus of news texts (5000 documents). Results demonstrated robust identification of semantically meaningful thematic clusters. An inverse relationship was observed between the level of thematic significance and semantic diversity, confirming a reduction in entropy using the proposed method. This approach allows for quantifying topic dynamics, filtering noise, and determining the optimal number of clusters. Future applications include analyzing multilingual data and integration with neural network models. The method shows potential for monitoring information flows and predicting thematic trends.
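The thematic signal (TS) function itself is not specified in this listing. The sketch below illustrates the underlying idea under stated assumptions: associative tokens are folded into their canonical topic tokens (damping noise and entropy), and the topic's relative token frequency is tracked per time window. The function names and windowing scheme are illustrative, not the authors' implementation.

```python
def thematic_signal(docs, topic_tokens, associations):
    """Thematic signal sketch: per time window, the relative frequency of
    topic tokens, with each associative token first mapped to its canonical
    topic token so near-synonyms reinforce the signal instead of scattering it.

    docs: list of (window_index, text) pairs, windows numbered from 0
    topic_tokens: set of lexical units defining the topic
    associations: dict mapping associative token -> canonical topic token
    """
    n_windows = max(w for w, _ in docs) + 1
    hits = [0] * n_windows
    totals = [0] * n_windows
    for window, text in docs:
        for tok in text.lower().split():
            tok = associations.get(tok, tok)  # fold associative tokens in
            totals[window] += 1
            if tok in topic_tokens:
                hits[window] += 1
    return [h / t if t else 0.0 for h, t in zip(hits, totals)]
```

The resulting list is a time series whose peaks mark windows where the topic dominates the stream.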
27 pages, 1817 KiB  
Article
A Large Language Model-Based Approach for Multilingual Hate Speech Detection on Social Media
by Muhammad Usman, Muhammad Ahmad, Grigori Sidorov, Irina Gelbukh and Rolando Quintero Tellez
Computers 2025, 14(7), 279; https://doi.org/10.3390/computers14070279 - 15 Jul 2025
Abstract
The proliferation of hate speech on social media platforms poses significant threats to digital safety, social cohesion, and freedom of expression. Detecting such content—especially across diverse languages—remains a challenging task due to linguistic complexity, cultural context, and resource limitations. To address these challenges, this study introduces a comprehensive approach for multilingual hate speech detection. To facilitate robust hate speech detection across diverse languages, this study makes several key contributions. First, we created a novel trilingual hate speech dataset consisting of 10,193 manually annotated tweets in English, Spanish, and Urdu. Second, we applied two innovative techniques—joint multilingual and translation-based approaches—for cross-lingual hate speech detection that have not been previously explored for these languages. Third, we developed detailed hate speech annotation guidelines tailored specifically to all three languages to ensure consistent and high-quality labeling. Finally, we conducted 41 experiments employing machine learning models with TF–IDF features, deep learning models utilizing FastText and GloVe embeddings, and transformer-based models leveraging advanced contextual embeddings to comprehensively evaluate our approach. Additionally, we employed a large language model with advanced contextual embeddings to identify the best solution for the hate speech detection task. The experimental results showed that our GPT-3.5-turbo model significantly outperforms strong baselines, achieving up to an 8% improvement over XLM-R in Urdu hate speech detection and an average gain of 4% across all three languages. This research not only contributes a high-quality multilingual dataset but also offers a scalable and inclusive framework for hate speech detection in underrepresented languages.
(This article belongs to the Special Issue Recent Advances in Social Networks and Social Media)
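As a concrete illustration of the TF–IDF feature extraction used in the machine-learning experiments above, here is a minimal pure-Python sketch. The study itself presumably relied on standard library implementations; this is not the authors' code.

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Map each tokenized document to a sparse TF-IDF vector (a dict).

    TF is the term's relative frequency in the document; IDF is
    log(N / document frequency), so terms appearing in every document
    get weight zero.
    """
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each term once per document
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors
```

These sparse vectors would then feed a conventional classifier (e.g. logistic regression or an SVM) as the TF–IDF baseline.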
19 pages, 326 KiB  
Article
Motivational Dynamics in a Multilingual Context: University Students’ Perspectives on LOTE Learning
by Ali Göksu and Vincent Louis
Behav. Sci. 2025, 15(7), 931; https://doi.org/10.3390/bs15070931 - 10 Jul 2025
Abstract
Interest in language-learning motivation has been growing recently, particularly in multilingual contexts where individuals acquire additional languages beyond English. Despite an increasing focus on multilingualism within second-language acquisition (SLA) research, little research addresses the motivational dynamics of multilingual learners learning languages other than English (LOTE). Addressing this gap, the present study investigates the complex motivational factors influencing multilingual university students learning French as an additional language and LOTE within the Belgian context. The participants were 121 multilingual university students who were learning French as an additional language and LOTE. Data were collected through a questionnaire and semi-structured interviews, and analyzed using a combination of quantitative and qualitative methods to provide a comprehensive understanding of learners’ motivational profiles. Findings revealed that multilingual learners’ motivation is multifaceted and dynamic, shaped by a combination of intrinsic interests (e.g., cultural appreciation and personal growth), extrinsic goals (e.g., academic and career aspirations), integrative motives, and prior language-learning experiences. The study also sheds light on the overlapping and evolving nature of motivational patterns and provides nuanced insights into LOTE learning motivation within multilingual settings.
22 pages, 792 KiB  
Article
Childhood Heritage Languages: A Tangier Case Study
by Ariadna Saiz Mingo
Languages 2025, 10(7), 168; https://doi.org/10.3390/languages10070168 - 9 Jul 2025
Abstract
Through the testimony of a Tangier female citizen who grew up in the “prolific multilingual Spanish-French-Darija context of international Tangier”, this article analyzes the web of beliefs projected onto both the inherited and local languages within her linguistic repertoire. Starting from the daily realities in which she was immersed and the social networks that she formed, we focus on the representations of communication and her affective relationship with the host societies. The analysis starts from the most immediate domestic context, in which Spanish, in its Jaquetía variant (a dialect of Judeo-Spanish spoken by the Sephardic Jews of northern Morocco), was displaced by French as the language of instruction. After an initial episode of reversible attrition, we witness various phenomena of translanguaging within the host society. Following the binomial “emotion-interrelational space”, we seek to discern the affective contexts associated with the languages of a multilingual childhood, and which emotional links are vital for maintaining inherited ones. This shift towards the valuation of affective culture implies a reorientation of the gaze towards everyday experiences as a means of research in contexts of language contact.
23 pages, 439 KiB  
Article
Evaluating Proprietary and Open-Weight Large Language Models as Universal Decimal Classification Recommender Systems
by Mladen Borovič, Eftimije Tomovski, Tom Li Dobnik and Sandi Majninger
Appl. Sci. 2025, 15(14), 7666; https://doi.org/10.3390/app15147666 - 8 Jul 2025
Abstract
Manual assignment of Universal Decimal Classification (UDC) codes is time-consuming and inconsistent as digital library collections expand. This study evaluates 17 large language models (LLMs) as UDC classification recommender systems, including ChatGPT variants (GPT-3.5, GPT-4o, and o1-mini), Claude models (3-Haiku and 3.5-Haiku), Gemini series (1.0-Pro, 1.5-Flash, and 2.0-Flash), and Llama, Gemma, Mixtral, and DeepSeek architectures. Models were evaluated zero-shot on 900 English and Slovenian academic theses manually classified by professional librarians. Classification prompts utilized the RISEN framework, with evaluation using Levenshtein and Jaro–Winkler similarity, and a novel adjusted hierarchical similarity metric capturing UDC’s faceted structure. Proprietary systems consistently outperformed open-weight alternatives by 5–10% across metrics. GPT-4o achieved the highest hierarchical alignment, while open-weight models showed progressive improvements but remained behind commercial systems. Performance was comparable between languages, demonstrating robust multilingual capabilities. The results indicate that LLM-powered recommender systems can enhance library classification workflows. Future research incorporating fine-tuning and retrieval-augmented approaches may enable fully automated, high-precision UDC assignment systems.
(This article belongs to the Special Issue Advanced Models and Algorithms for Recommender Systems)
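The string-similarity side of the evaluation above can be sketched as follows. The normalized Levenshtein similarity is standard; the hierarchy-aware score shown here is only a hypothetical stand-in for the paper's "adjusted hierarchical similarity", whose exact definition is not given in this listing.

```python
def levenshtein_similarity(a, b):
    """Normalized Levenshtein similarity: 1 - edit_distance / max length."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

def hierarchical_similarity(pred, gold):
    """Hypothetical hierarchy-aware score: the fraction of leading
    characters the predicted UDC code shares with the gold code, so a
    miss deep in the hierarchy scores higher than a top-level error."""
    matched = 0
    for p, g in zip(pred, gold):
        if p != g:
            break
        matched += 1
    return matched / max(len(pred), len(gold))
```

On UDC codes the two metrics disagree in an instructive way: a single wrong digit costs the same edit distance anywhere, but costs far more hierarchical similarity near the root.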
15 pages, 1701 KiB  
Article
An Analysis of the Training Data Impact for Domain-Adapted Tokenizer Performances—The Case of Serbian Legal Domain Adaptation
by Miloš Bogdanović, Milena Frtunić Gligorijević, Jelena Kocić and Leonid Stoimenov
Appl. Sci. 2025, 15(13), 7491; https://doi.org/10.3390/app15137491 - 3 Jul 2025
Abstract
Various areas of natural language processing (NLP) have greatly benefited from the development of large language models in recent years. This research addresses the challenge of developing efficient tokenizers for transformer-based domain-specific language models. Tokenization efficiency within transformer-based models is directly related to model efficiency, which motivated the research presented in this paper. Our goal was to demonstrate that the appropriate selection of data used for tokenizer training has a significant impact on tokenizer performance, and that efficient tokenizers and models can be developed even when language resources are limited. To this end, we present a domain-adapted large language model tokenizer developed for masked language modeling of the Serbian legal domain. We compare the tokenization performance of the domain-adapted tokenizer in version 2 of our SrBERTa language model against five other tokenizers belonging to state-of-the-art multilingual, Slavic, or Serbian-specific models: XLM-RoBERTa (base-sized), BERTić, Jerteh-81, SrBERTa v1, and NER4Legal_SRB. The comparison is performed on a test dataset of 275,660 samples of legal texts written in the Cyrillic alphabet, gathered from the Official Gazette of the Republic of Serbia. The dataset contains 197,134 distinct words, with an overall word count of 5,265,352. Our tokenizer, trained on a domain-adapted dataset, outperforms the other tokenizers by between 4.5% and 54.62% in the number of tokens generated for the whole test dataset, and by between 6.39% and 56.8% in tokenizer fertility.
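Tokenizer fertility, the second metric reported above, is conventionally the average number of subword tokens produced per word; lower is better, and 1.0 means every word survives intact. A minimal sketch with a toy greedy tokenizer (the models compared in the paper use trained subword vocabularies; the vocabulary here is illustrative):

```python
def fertility(tokenize, words):
    """Average number of subword tokens produced per word."""
    return sum(len(tokenize(w)) for w in words) / len(words)

def toy_tokenize(word, vocab=("за", "кон", "ом")):
    """Toy stand-in for a trained subword tokenizer: greedily match pieces
    from a fixed vocabulary, falling back to single characters."""
    tokens, rest = [], word
    while rest:
        for piece in vocab:
            if rest.startswith(piece):
                tokens.append(piece)
                rest = rest[len(piece):]
                break
        else:
            tokens.append(rest[0])
            rest = rest[1:]
    return tokens
```

A domain-adapted vocabulary lowers fertility precisely because frequent domain words are matched by longer pieces instead of being shattered into characters.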
14 pages, 557 KiB  
Review
Teachers’ Beliefs About Multilingualism in Early Childhood Education Settings: A Scoping Review
by Zhijun Zheng
Educ. Sci. 2025, 15(7), 849; https://doi.org/10.3390/educsci15070849 - 2 Jul 2025
Abstract
There is an increasing number of multilingual children attending early childhood education and care (ECEC) settings around the world. Early childhood teachers play a crucial role in supporting these multilingual young children. As teachers’ teaching practices are directed by their beliefs, it is important to understand early childhood teachers’ beliefs about multilingualism in the existing literature in order to better support multilingual children. From 14 studies, this review categorised three main themes of early childhood teachers’ beliefs about multilingualism: multilingualism as a problem, multilingualism as a right, and concerns about multilingualism as a resource. Two studies examined factors associated with the variation in teachers’ beliefs. The findings of this review summarised various perspectives of teachers’ misconceptions and negative beliefs about multilingualism, although a small number of studies reported teachers’ positive beliefs about multilingualism in ECEC. This review addresses early childhood teachers’ knowledge gaps in child language development and multilingual pedagogies. In addition, this review identifies several research gaps for future studies. For example, more studies conducted in non-Western contexts and studies on teachers’ beliefs about supporting multilingual infants and toddlers are much needed. This review also contributes to informing future directions for professional development to empower early childhood teachers to support multilingualism.
19 pages, 457 KiB  
Article
Transinger: Cross-Lingual Singing Voice Synthesis via IPA-Based Phonetic Alignment
by Chen Shen, Lu Zhao, Cejin Fu, Bote Gan and Zhenlong Du
Sensors 2025, 25(13), 3973; https://doi.org/10.3390/s25133973 - 26 Jun 2025
Abstract
Although Singing Voice Synthesis (SVS) has revolutionized audio content creation, global linguistic diversity remains challenging. Current SVS research shows scant exploration of cross-lingual generalization, as fragmented, language-specific phoneme encodings (e.g., Pinyin, ARPA) hinder unified phonetic modeling. To address this challenge, we built a four-language dataset based on GTSinger’s speech data, using the International Phonetic Alphabet (IPA) for consistent phonetic representation and applying precise segmentation and calibration for improved quality. In particular, we propose a novel method of decomposing IPA phonemes into letters and diacritics, enabling the model to deeply learn the underlying rules of pronunciation and achieve better generalization. A dynamic IPA adaptation strategy further enables the application of learned phonetic representations to unseen languages. Based on VISinger2, we introduce Transinger, an innovative cross-lingual synthesis framework. Transinger achieves breakthroughs in phoneme representation learning by precisely modeling pronunciation, which effectively enables compositional generalization to unseen languages. It also integrates Conformer and RVQ techniques to optimize information extraction and generation, achieving outstanding cross-lingual synthesis performance. Objective and subjective experiments have confirmed that Transinger significantly outperforms state-of-the-art singing synthesis methods in terms of cross-lingual generalization. These results demonstrate that multilingual aligned representations can markedly enhance model learning efficacy and robustness, even for languages not seen during training. Moreover, the integration of a strategy that splits IPA phonemes into letters and diacritics allows the model to learn pronunciation more effectively, resulting in a qualitative improvement in generalization.
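The decomposition of IPA phonemes into base letters and diacritics can be approximated with Unicode normalization; the sketch below is a simplification under stated assumptions and not necessarily the scheme Transinger uses.

```python
import unicodedata

def split_ipa(phoneme):
    """Split an IPA phoneme into (base letters, diacritics).

    NFD normalization separates combining diacritics (e.g. the
    nasalization tilde) from their base letters; spacing modifier
    letters (Unicode categories 'Lm' and 'Sk', e.g. the aspiration
    mark or the length mark) are also treated as diacritics here.
    """
    letters, diacritics = [], []
    for ch in unicodedata.normalize("NFD", phoneme):
        if unicodedata.combining(ch) or unicodedata.category(ch) in ("Lm", "Sk"):
            diacritics.append(ch)
        else:
            letters.append(ch)
    return "".join(letters), "".join(diacritics)
```

Factoring "tʰ" into ("t", "ʰ") lets a model that has seen aspiration on other consonants compose it onto a new one, which is the compositional-generalization effect the abstract describes.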
22 pages, 5083 KiB  
Article
Intelligent Mobile-Assisted Language Learning: A Deep Learning Approach for Pronunciation Analysis and Personalized Feedback
by Fengqin Liu, Korawit Orkphol, Natthapon Pannurat, Thanat Sooknuan, Thanin Muangpool, Sanya Kuankid and Montri Phothisonothai
Inventions 2025, 10(4), 46; https://doi.org/10.3390/inventions10040046 - 24 Jun 2025
Abstract
This paper introduces an innovative mobile-assisted language-learning (MALL) system that harnesses deep learning technology to analyze pronunciation patterns and deliver real-time, personalized feedback. Drawing inspiration from how the human brain processes speech through neural pathways, our system analyzes multiple speech features using spectrograms, mel-frequency cepstral coefficients (MFCCs), and formant frequencies in a manner that mirrors the auditory cortex’s interpretation of sound. The core of our approach utilizes a convolutional neural network (CNN) to classify pronunciation patterns from user-recorded speech. To enhance assessment accuracy and provide nuanced feedback, we integrated a fuzzy inference system (FIS) that helps learners identify and correct specific pronunciation errors. The experimental results demonstrate that our multi-feature model achieved accuracies of 82.41% to 90.52% in accent classification across diverse linguistic contexts. User testing revealed statistically significant improvements in pronunciation skills, with learners showing a 5–20% improvement in accuracy after using the system. The proposed MALL system offers a portable, accessible solution for language learners while establishing a foundation for future research in multilingual functionality and mobile platform optimization. By combining advanced speech analysis with intuitive feedback mechanisms, this system addresses a critical challenge in language acquisition and promotes more effective self-directed learning.
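A spectrogram front-end of the kind described above can be sketched with a short-time FFT. The frame length, hop size, and window below are illustrative; the paper's exact parameters are not given in this listing.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a Hann-windowed short-time FFT: the
    time-frequency image a CNN front-end like the one above consumes.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))
```

MFCCs would be obtained from this same representation by applying a mel filterbank, a log, and a discrete cosine transform.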
19 pages, 4192 KiB  
Article
Supporting Multilingual Students’ Mathematical Discourse Through Teacher Professional Development Grounded in Design-Based Research: A Conceptual Framework
by Margarita Jiménez-Silva, Robin Martin, Rachel Restani, Suzanne Abdelrahim and Tony Albano
Educ. Sci. 2025, 15(6), 778; https://doi.org/10.3390/educsci15060778 - 19 Jun 2025
Abstract
This conceptual paper presents a framework for supporting multilingual students’ mathematical discourse through teacher professional development grounded in design-based research (DBR). Drawing on sociocultural learning theory, the Integrated Language and Mathematics Project (ILMP) was co-developed with elementary educators to promote integrated instruction that simultaneously advances students’ mathematical understanding, language development, and cultural identity. The ILMP framework centers around three instructional pillars: attention to language, attention to mathematical thinking, and cultural responsiveness. Through collaborative inquiry cycles, educators engaged as learners, contributors, and designers of practice, iteratively enacting and reflecting on instructional strategies rooted in students’ linguistic and cultural assets. Teachers implemented discussion-rich mathematical tasks, supported by language scaffolds and culturally relevant contexts, to foster students’ mathematical reasoning and communication. This approach was particularly impactful for multilingual learners, whose language use and problem-solving strategies were both valued and elevated. This paper also discusses the opportunities and challenges of DBR and research–practice partnerships, including flexibility in implementation and navigating district-level priorities. Insights underscore the importance of practitioner agency, asset-based pedagogy, and the co-construction of professional learning. The ILMP framework offers a scalable, equity-oriented model for improving integrated language and mathematics instruction in diverse elementary classrooms and beyond.
24 pages, 2410 KiB  
Article
UA-HSD-2025: Multi-Lingual Hate Speech Detection from Tweets Using Pre-Trained Transformers
by Muhammad Ahmad, Muhammad Waqas, Ameer Hamza, Sardar Usman, Ildar Batyrshin and Grigori Sidorov
Computers 2025, 14(6), 239; https://doi.org/10.3390/computers14060239 - 18 Jun 2025
Abstract
The rise of social media has improved communication but also amplified the spread of hate speech, creating serious societal risks. Automated detection remains difficult due to subjectivity, linguistic diversity, and implicit language. While prior research focuses on high-resource languages, this study addresses the underexplored multilingual challenges of Arabic and Urdu hate speech through a comprehensive approach. To achieve this objective, this study makes four key contributions. First, we created a unique multilingual, manually annotated binary and multi-class dataset (UA-HSD-2025) sourced from X, covering the five most important multi-class categories of hate speech. Second, we created detailed annotation guidelines to ensure a robust, high-quality hate speech dataset. Third, we explore two strategies to address the challenges of multilingual data: a joint multilingual approach and a translation-based approach. The translation-based approach converts all input text into a single target language before applying a classifier. In contrast, the joint multilingual approach employs a unified model trained to handle multiple languages simultaneously, enabling it to classify text across different languages without translation. Finally, we conducted 54 experiments spanning machine learning models with TF-IDF features, deep learning models with pre-trained word embeddings such as FastText and GloVe, and pre-trained language models with advanced contextual embeddings. Based on the analysis of the results, our language-based model (XLM-R) outperformed traditional supervised learning approaches, achieving 0.99 accuracy in binary classification for the Arabic, Urdu, and joint multilingual datasets, and 0.95, 0.94, and 0.94 accuracy in multi-class classification for the joint multilingual, Arabic, and Urdu datasets, respectively.
(This article belongs to the Special Issue Recent Advances in Social Networks and Social Media)
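The two strategies described above differ only in where multilinguality is handled. A schematic sketch with stand-in `translate` and classifier callables (names and signatures are illustrative, not the authors' implementation):

```python
def classify_translation_based(texts, langs, translate, classifier, target="en"):
    """Translation-based strategy: map every input into one pivot language,
    then run a single monolingual classifier on the translations."""
    return [classifier(translate(t, src=lang, tgt=target))
            for t, lang in zip(texts, langs)]

def classify_joint_multilingual(texts, multilingual_classifier):
    """Joint strategy: one model consumes all languages directly,
    with no translation step."""
    return [multilingual_classifier(t) for t in texts]
```

The translation-based route inherits any machine-translation errors, while the joint route depends on the classifier's cross-lingual representations, which is the trade-off the experiments above quantify.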
25 pages, 3472 KiB  
Article
Exploring Multilingualism to Inform Linguistically and Culturally Responsive English Language Education
by Miriam Weidl and Elizabeth J. Erling
Educ. Sci. 2025, 15(6), 763; https://doi.org/10.3390/educsci15060763 - 16 Jun 2025
Abstract
Linguistically and culturally responsive pedagogies (LCRPs) recognize students’ multilingual and cultural resources as central to inclusive and equitable learning. While such approaches are increasingly promoted in English language education (ELE), there remains limited understanding of the complexity of students’ multilingual trajectories—particularly in contexts marked by migration and linguistic diversity. This article addresses this gap by presenting findings from the Udele project, which explores the lived experiences of multilingual learners in urban Austrian middle schools. Using an embedded case study design, we draw on a rich set of qualitative methods—including observations, interviews, fieldnotes, student artifacts, and language portraits—to explore how two students navigate their linguistic repertoires, identities, and learning experiences. Our analysis reveals that students’ language-related self-positionings influence their classroom engagement and broader identity narratives. The findings demonstrate how shifts in self-perception affect participation and motivation, and how the students actively negotiate their multilingual identities within and beyond the classroom context. The complexity uncovered in their multilingual repertoires and life experiences underscores the critical need for longitudinal, multilingual research approaches to fully capture the dynamic and nuanced trajectories of language learners. These findings challenge prevailing conceptualizations of multilingualism in ELE, highlighting the importance of incorporating students’ lived linguistic experiences into pedagogical frameworks.
20 pages, 1955 KiB  
Article
Text Similarity Detection in Agglutinative Languages: A Case Study of Kazakh Using Hybrid N-Gram and Semantic Models
by Svitlana Biloshchytska, Arailym Tleubayeva, Oleksandr Kuchanskyi, Andrii Biloshchytskyi, Yurii Andrashko, Sapar Toxanov, Aidos Mukhatayev and Saltanat Sharipova
Appl. Sci. 2025, 15(12), 6707; https://doi.org/10.3390/app15126707 - 15 Jun 2025
Abstract
This study presents an advanced hybrid approach for detecting near-duplicate texts in the Kazakh language, addressing the specific challenges posed by its agglutinative morphology. The proposed method combines statistical and semantic techniques, including N-gram analysis, TF-IDF, LSH, LSA, and LDA, and is benchmarked against the bert-base-multilingual-cased model. Experiments were conducted on the purpose-built Arailym-aitu/KazakhTextDuplicates corpus, which contains over 25,000 manually modified text fragments using typical techniques, such as paraphrasing, word order changes, synonym substitution, and morphological transformations. The results show that the hybrid model achieves a precision of 1.00, a recall of 0.73, and an F1-score of 0.84, significantly outperforming traditional N-gram and TF-IDF approaches and demonstrating comparable accuracy to the BERT model while requiring substantially lower computational resources. The hybrid model proved highly effective in detecting various types of near-duplicate texts, including paraphrased and structurally modified content, making it suitable for practical applications in academic integrity verification, plagiarism detection, and intelligent text analysis. Moreover, this study highlights the potential of lightweight hybrid architectures as a practical alternative to large transformer-based models, particularly for languages with limited annotated corpora and linguistic resources. It lays the foundation for future research in cross-lingual duplicate detection and deep model adaptation for the Kazakh language.
(This article belongs to the Section Computing and Artificial Intelligence)
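One plausible statistical component of such a hybrid pipeline is character n-gram shingling with Jaccard overlap, which tolerates the suffix variation of agglutinative morphology better than whole-word matching. The sketch below covers only that piece, not the full TF-IDF/LSH/LSA/LDA stack, and is not the authors' code.

```python
def char_ngrams(text, n=3):
    """Character n-gram shingles. Character-level units are robust to
    the suffix-heavy morphology of agglutinative languages like Kazakh,
    where word-level matching misses inflected variants."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard overlap of two shingle sets, in [0, 1]."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0
```

A pair scoring above a tuned threshold would then be passed to the heavier semantic components (LSA/LDA) for confirmation.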
19 pages, 626 KiB  
Article
A Kazakh–Chinese Cross-Lingual Joint Modeling Method for Question Understanding
by Yajing Ma, Yingxia Yu, Han Liu, Gulila Altenbek, Xiang Zhang and Yilixiati Tuersun
Appl. Sci. 2025, 15(12), 6643; https://doi.org/10.3390/app15126643 - 12 Jun 2025
Abstract
Current research on intelligent question answering mainly focuses on high-resource languages such as Chinese and English, with limited studies on question understanding and reasoning in low-resource languages. In addition, during the joint modeling of question understanding tasks, the interdependence among subtasks can lead to error accumulation during the interaction phase, thereby affecting the prediction performance of the individual subtasks. To address the issue of error propagation caused by sentence-level intent encoding in the joint modeling of intent recognition and slot filling, this paper proposes a Cross-lingual Token-level Bi-Interactive Model (Bi-XTM). The model introduces a novel subtask interaction method that leverages the token-level intent output distribution as additional information for slot vector representation, effectively reducing error propagation and enhancing the information exchange between intent and slot vectors. Meanwhile, to address the scarcity of Kazakh (Arabic alphabet) language corpora, this paper constructs a cross-lingual joint question understanding dataset for the Xinjiang tourism domain, named JISD, which includes 16,548 Chinese samples and 1399 Kazakh samples. This dataset provides a new resource for cross-lingual intent recognition and slot filling joint tasks. Experimental results on the publicly available multi-lingual question understanding dataset MTOD and the newly constructed dataset demonstrate that the proposed Bi-XTM achieves state-of-the-art performance in both monolingual and cross-lingual settings.
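The token-level interaction idea can be sketched with NumPy: each token's intent probability distribution is appended to its hidden state before slot classification, so slot filling conditions on per-token rather than sentence-level intent evidence. All shapes and the projection matrix below are illustrative assumptions, not the actual Bi-XTM architecture.

```python
import numpy as np

def token_level_interaction(token_states, intent_logits, W_slot):
    """Sketch of token-level intent-to-slot interaction.

    token_states:  (seq_len, hidden)    encoder outputs
    intent_logits: (seq_len, n_intents) token-level intent scores
    W_slot:        (hidden + n_intents, n_slots) illustrative projection
    Returns token-level slot logits of shape (seq_len, n_slots).
    """
    # Softmax over intent classes, computed per token.
    e = np.exp(intent_logits - intent_logits.max(axis=1, keepdims=True))
    intent_probs = e / e.sum(axis=1, keepdims=True)
    # Concatenate each token's intent distribution onto its hidden state.
    enriched = np.concatenate([token_states, intent_probs], axis=1)
    return enriched @ W_slot

rng = np.random.default_rng(0)
slot_logits = token_level_interaction(
    rng.normal(size=(6, 8)),     # 6 tokens, hidden size 8
    rng.normal(size=(6, 3)),     # 3 intent classes
    rng.normal(size=(8 + 3, 5))  # 5 slot labels
)
```

Because the intent signal enters per token, an intent error on one token need not corrupt the slot predictions for the rest of the sentence, which is the error-propagation point the abstract makes.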