Article

A CEFR-Graded Lexicon and Morphology-Aware Benchmarks for Kazakh Lexical Complexity Prediction

1
Department of Physics and Informatics, Taraz University Named After M.Kh. Dulaty, Taraz 080000, Kazakhstan
2
Department of Computer Science, Saken Seifullin Kazakh Agrotechnical Research University, Astana 010011, Kazakhstan
3
School of Engineering and Digital Sciences, Nazarbayev University, Astana 010000, Kazakhstan
4
Department of Information Technologies, Toraighyrov University, Pavlodar 140008, Kazakhstan
*
Authors to whom correspondence should be addressed.
Technologies 2026, 14(6), 346; https://doi.org/10.3390/technologies14060346 (registering DOI)
Submission received: 29 April 2026 / Revised: 25 May 2026 / Accepted: 27 May 2026 / Published: 9 June 2026

Abstract

Graded lexical resources aligned with the Common European Framework of Reference for Languages (CEFR) and lexical complexity prediction remain limited for low-resource Turkic languages, and the extent to which existing predictive models generalize to agglutinative morphology is unresolved. We introduce the first CEFR-graded lexicon for Kazakh, containing 4561 lemma–part-of-speech (POS) entries across A1–C1, and use it to test whether explicit morphology improves lexical complexity prediction. We compare handcrafted morphological features, XLM-RoBERTa contextual embeddings, and fusion models that combine both signal types on held-out CEFR classification. Our best model, a gated fusion of contextual embeddings with morphological features, achieves a macro-averaged F1 score of 0.360 and a mean absolute error of 1.125 on the held-out test set. Morphology provides useful information beyond character-level cues, contextual representations are strong on their own, and combining them yields the best supervised performance for this task. The paper therefore contributes a new CEFR resource for Turkic languages and evidence that morphology-aware modeling is useful for Kazakh lexical difficulty prediction. The results support Sustainable Development Goal 4 (Quality Education) by enabling objective assessment of learning-material complexity and adaptive Kazakh language learning. The derived lexicon and code are publicly available.

1. Introduction

The Common European Framework of Reference for Languages, or CEFR, provides a six-level proficiency scale, A1 to C2, widely used in language teaching and assessment [1]. CEFR-graded lexical resources are central to curriculum design, adaptive learning systems, and text simplification. The development of the first CEFR-graded lexicon for the Kazakh language directly supports the implementation of Sustainable Development Goal 4 (Quality Education). Creating such resources enhances the accessibility of high-quality linguistic tools and optimizes educational processes, which is particularly vital for low-resource languages. For major European languages, comprehensive lexicons such as CEFRLex cover around 13,000 lemma–POS entries each [2,3], and recent computational work on lexical complexity prediction (LCP) has shown strong progress with both feature-based and transformer-based methods [4,5,6].
However, this progress has largely excluded agglutinative, low-resource languages. No CEFR-graded lexicon exists for any Turkic language [7], and existing LCP models have been evaluated almost exclusively on morphologically simple languages. In agglutinative languages like Kazakh, Turkish, or Finnish, the relationship between lexical and morphological complexity is non-trivial. A single lemma such as бала, meaning child, can yield hundreds of suffixed forms, for example балаларымыздан meaning from our children, whose morphological structure affects learnability and difficulty.
Standard multilingual encoders such as XLM-RoBERTa [8], effective cross-lingually, rely on subword tokenization that often fragments these complex wordforms, obscuring morphology relevant to lexical difficulty. Whether morphology-aware representations can improve CEFR-level prediction in such languages remains an open question.
To address this gap, we focus on Kazakh, a Central Turkic language spoken by about 12 million people [9], for which demand for CEFR-aligned pedagogical resources is growing. We construct the first CEFR-graded lexicon for Kazakh. The handbook extraction yields 4561 lemma–POS entries across five levels (A1–C1), and the cleaned modeling set used in our experiments contains 4437 unique lemma–POS entries. Our supervised experiments contrast interpretable handcrafted features derived from Helsinki Finite-State Technology (HFST)-Apertium analysis [10], frequency, and orthography with contextual encoders and morphology–context fusion models. We also examine cross-lingual CEFR projection from Russian and English as a diagnostic signal. This study contributes to the expanding application of data mining within professional educational environments, where machine learning is used to optimize pedagogical processes and to manage institutional tasks such as career guidance [11]. Such assessment tools can also support the establishment of strategic university development indicators through formal decision-making methods [12].
Furthermore, the current work builds on established methodologies for multilingual text analysis, specifically the detection of patterns in machine-translated texts [13] and the use of entropy-based measures for identifying structural regularities across diverse languages [14]. These previous findings provide a technical foundation for our current focus on Kazakh morphological complexity.
The challenge of classifying objects by complexity or risk levels often requires accounting for multiple intersecting features. While fuzzy logic-based approaches are successfully applied in the financial sector to evaluate qualitative characteristics, such as the creditworthiness of service enterprises [15], the field of natural language processing (NLP) increasingly utilizes neural architectures for similar purposes to capture hidden dependencies in data. This approach is further supported by research into the development of hybrid thematic and neural network models for data learning [16], which demonstrates the effectiveness of combining structural and probabilistic signals in low-resource language processing.
We address three explicit research questions:
  • RQ1: How strong are interpretable handcrafted features for Kazakh CEFR prediction relative to an embeddings-only baseline?
  • RQ2: Does gated fusion of contextual and lexical representations improve over the classical baselines, and which architectural components contribute to the gain?
  • RQ3: How far can cross-lingual projection transfer CEFR information from Russian and English lexical resources to Kazakh?
The primary contribution of this work is a new resource together with baseline experiments that establish initial benchmarks; we do not claim state-of-the-art modeling advances. Specifically:
  • Resource. We construct the first CEFR-graded lexicon for any Turkic language. The handbook extraction yields 4561 lemma–POS entries across five levels (A1–C1), and the cleaned modeling set contains 4437 unique lemma–POS entries. This lexicon underpins the present experiments and can support future work on lexical complexity prediction, curriculum design, and adaptive language learning for Kazakh and potentially other Turkic languages.
  • Baseline study. We establish initial benchmarks by comparing handcrafted morphological features, contextual embeddings, and fusion models, providing a reference point for future systems on this dataset.
  • Empirical evidence. We show that engineered morphological features improve over character-level patterns alone, and that combining morphological and contextual representations yields the strongest supervised results in this setting. The gain comes from combining the two signal types rather than from a clearly superior fusion mechanism.
The derived lexicon and code are publicly available in an archived repository (https://doi.org/10.5281/zenodo.19365834).
The remainder of the paper is organized as follows. Section 2 reviews related work, describes the datasets, and defines the modeling and evaluation procedure. Section 3 reports the supervised, diagnostic, cross-lingual, and LLM baseline results. Section 4 discusses the empirical implications and limitations. Section 5 concludes the paper.

2. Materials and Methods

This section first reviews related work that motivates our modeling choices (Section 2.1), then describes the language resources we compile (Section 2.2) and the supervised modeling pipeline and evaluation protocol applied to them (Section 2.3).
All computational experiments were implemented in Python 3.10.0 (Python Software Foundation, Wilmington, DE, USA). Classical models used scikit-learn 1.7.2, and neural models used PyTorch 2.9.1 (Meta Platforms, Inc., Menlo Park, CA, USA) and Hugging Face Transformers 4.46.1 (Hugging Face, Inc., Brooklyn, NY, USA). Neural experiments ran on a single NVIDIA L4 GPU (NVIDIA Corporation, Santa Clara, CA, USA).

2.1. Background

We organize prior work into four strands: lexical complexity prediction, CEFR-graded lexical resources, the role of morphology and frequency in lexical difficulty, and NLP for Kazakh and other agglutinative languages.

2.1.1. CEFR and Lexical Complexity Prediction

Automatic prediction of lexical complexity has received increasing attention, particularly since the SemEval-2021 Lexical Complexity Prediction shared task, which introduced the CompLex English corpus annotated on a 5-point Likert scale for both single words and multiword expressions [4]. The best-performing systems combined multiple pre-trained language models with pseudo-labeling and data augmentation, demonstrating that deep ensembles can achieve strong performance even with limited task-specific data [5]. Feature-based approaches have also proven competitive: Mosquera achieved a top-three ranking using a combination of lexical, contextual, and semantic features with traditional regression models [6], while JUST-BLUE and the University Politehnica of Bucharest (UPB) SemEval system showed that combinations of deep learning and hand-crafted features, including frequency, length, n-gram counts, and psycholinguistic norms, can remain highly competitive when carefully engineered [17,18].
Beyond the shared task, several studies have compared deep encodings with linguistic features for lexical complexity prediction. For example, Ortiz-Zambrano et al. [19] report that hybrid models combining handcrafted features with BERT/XLM-R embeddings yield substantial improvements over features alone, but also find that purely neural models are not uniformly superior and that handcrafted features remain fundamental, especially in low-resource settings. This resonates with findings in educational NLP that simple features such as syllable count, frequency bands, and orthographic length capture much of the variance in lexical difficulty [20].
Recent work has also extended lexical complexity prediction to contextual CEFR classification for language learning applications. Aleksandrova and Pouliot [21] develop a CEFR-based contextual classifier for English and French, using transformer-based representations to disambiguate polysemous lexical items in sentential context. Their model is deployed in the Mauril language learning application, demonstrating practical utility in adaptive vocabulary selection. At the text level, Zhang and Lu [22] present a Random Forest model that aligns linguistic complexity measures with CEFR-based difficulty of English texts, achieving 82.6% accuracy in classifying texts into coarse A/B/C proficiency bands and 62.6% at the six-level CEFR granularity, highlighting the predictive value of linguistically grounded features.
Multilingual extensions have broadened coverage beyond English: CompLex-ZH introduces lexical complexity datasets for Mandarin and Cantonese [23], while BengaliLCP focuses on Bengali lexical complexity [24]. Furthermore, recent research has explored the application of Large Language Models (LLMs) to enhance digital accessibility through automated language-level classification and adaptation [25], demonstrating the potential for creating more inclusive digital environments. These works show that lexical complexity prediction is feasible across typologically diverse languages. However, CEFR-based lexical difficulty for Turkic or other agglutinative low-resource languages remains an open question, as existing approaches have been evaluated almost exclusively on morphologically simpler languages.

2.1.2. CEFR Lexical Resources

The CEFRLex project represents the most comprehensive effort to create CEFR-graded lexical resources for multiple languages [2]. It provides receptive lexicons, derived from textbooks and simplified readers, and productive lexicons, derived from learner corpora, for English (EFLLex), French (FLELex), Swedish (SVALex, SweLLex), Dutch (NT2Lex), Spanish (ELELex), and German (DAFlex), each containing on the order of 13,000 lemma–POS entries with normalized frequency distributions across the six CEFR levels [2,3]. These resources underpin online lexical difficulty analyzers that can estimate text-level CEFR difficulty by aggregating word-level information [26].
The validity of these lexicons has been examined using independent gold standards and multilingual comparisons. Gräen et al. [27] compare EFLLex against external pedagogical resources such as the English Vocabulary Profile (EVP) and the Global Scale of English (GSE), finding that the English CEFRLex resource is broadly in accordance with them. They also exploit multilingual resources and translation probabilities to examine consistency across English, French, and Swedish, providing evidence that CEFR-based lexical difficulty can be aligned across languages via translation [27]. This supports the broader hypothesis that lexical difficulty information can transfer across related pedagogical resources, although evidence outside major European languages remains limited.
In addition to CEFRLex, public vocabulary lists such as the Oxford 3000 by CEFR level provide graded vocabulary for English [28]. Other projects such as KELLY provide frequency-based CEFR assignments for several European languages [29]. However, none of these lexical resources cover Kazakh or other Turkic languages. The absence of CEFR-graded lexicons for agglutinative, low-resource languages limits both empirical research and the development of adaptive learning tools. Our work addresses this gap by constructing the first CEFR-graded lexicon for Kazakh and by evaluating whether methods successful for European languages generalize to a typologically different setting.

2.1.3. Morphology, Frequency, and Lexical Difficulty

Psycholinguistic research has long established that morphological structure affects lexical access and perceived difficulty. Studies on base versus surface frequency show that both lemma frequency and wordform frequency influence recognition latencies and error rates [30]. High-frequency lemmas facilitate processing, but rare inflected or derived forms of those lemmas can still incur additional processing cost, especially when morphological structure is opaque. This interplay between morphology and frequency is particularly salient in morphologically rich languages, where productive inflection and derivation generate large paradigms from individual lemmas.
From a theoretical and typological perspective, Cotterell et al. [31] propose information-theoretic measures of inflectional paradigm complexity, showing that languages differ systematically in the entropy and predictability of their inflectional systems. Paradigm size, irregularity, and syncretism all contribute to morphological complexity, which in turn can affect difficulty for both native speakers and L2 learners. Experimental work has demonstrated that paradigm complexity modulates visual word recognition and that derivational and inflectional processes interact asymmetrically, with derivation often creating semantically less transparent forms that are harder to decompose [32].
Despite these findings, computational models of lexical complexity have typically incorporated only coarse morphological proxies, such as word length, character n-grams, or simple suffix counts, and have rarely used full morphological analyses or paradigm-level features. For agglutinative languages such as Kazakh or Turkish, this is a significant limitation. A simple A1 lemma can generate thousands of surface forms through productive suffixation, and rare complex forms may be perceived as substantially harder than the lemma’s nominal CEFR level would suggest. Our work explicitly targets this gap by integrating HFST-based morphological analyses into lexical complexity prediction for Kazakh and by examining how morphological richness interacts with lemma-level difficulty.

2.1.4. NLP for Kazakh and Agglutinative Languages

Kazakh is a Turkic language with rich agglutinative morphology, relatively free word order, and limited high-quality NLP resources. Several morphological analysis tools have been proposed. Early work by Kessikbayeva and Cicekli [33,34] developed rule-based morphological analyzers and a disambiguator for Kazakh based on two-level morphology, encoding Kazakh morphotactics and phonological alternations in finite-state transducers. The Apertium-kaz project provides an HFST-based morphological transducer for Kazakh with coverage reported around 90% on freely available corpora and high precision on a manually verified test set [10]. In parallel, the KazNLP toolkit offers data-driven morphological analysis and other preprocessing tools (normalization, tokenization, language identification) for Kazakh, built with CRF models and designed as a general-purpose NLP library [35]. Building on these foundational preprocessing capabilities, Akanova et al. [36] introduced a specialized algorithm for keyword search within Kazakh text corpora, which enhances the retrieval of thematic information in large-scale datasets. Such advancements in keyword extraction provide a technical framework for identifying salient linguistic patterns essential for complexity modeling.
These tools have been applied in various downstream tasks. Expanding these capabilities, Akanova et al. [37] developed a neurocomputer system specifically for the semantic analysis of Kazakh text, which enhances the understanding of complex linguistic relationships beyond simple morphological parsing. Akhmed-Zaki et al. [38] describe an information system for Kazakh language preprocessing that integrates morphological analysis, disambiguation, and wordform generation for applications in text analytics and lexicography. Morphological features have also been used in Kazakh text classification and short-text processing alongside convolutional neural networks [39]. However, none of these works consider CEFR-based lexical difficulty, and morphological analyzers have not been evaluated in the context of predicting learner-oriented difficulty scales.
Kazakh machine translation has seen rapid development, primarily driven by the creation of parallel corpora and neural models. The KazParC parallel corpus is the largest publicly available resource of its kind described in our references, containing 371,902 parallel sentences across Kazakh, English, Russian, and Turkish, and supporting the Tilmash neural MT system, which the authors report as competitive with commercial systems such as Google Translate and Yandex Translate on standard MT metrics [40]. Earlier work at WMT 2019 by Toral et al. [41] showed that incorporating morphological segmentation using Apertium improves English–Kazakh neural MT, mitigating data sparsity by breaking complex wordforms into morphologically meaningful units. Similar observations have been made for other low-resource agglutinative settings, where tokenization strategy and morphology-aware modeling remain important design choices [41,42].
Turkish, as a closely related Turkic language, provides a valuable reference point. Oflazer’s two-level description of Turkish morphology [43] remains a foundational finite-state account of Turkish morphotactics. More recent resources such as MorphoLex Turkish provide large-scale morphological lexicons with detailed information about root families, suffix frequencies, and morphological neighborhood structure [44]. Work on word-level segmentation in agglutinative languages using neural sequence models and transformer-based variants reports strong performance and improved handling of rich morphology [42]. Studies on Turkish NMT likewise emphasize that morphological complexity, tokenization strategy, and model architecture materially affect translation quality in low-resource settings [45]. Specifically, Toraman et al. [46] demonstrate that the granularity of tokenization can lead to significant variations in language model performance for Turkish, highlighting the non-trivial relationship between subword units and morphological structure.
Despite these advances, no prior work has combined Kazakh morphological analysis with CEFR-based lexical complexity prediction, nor has any study systematically compared feature-based and transformer-based models for lexical complexity prediction in a low-resource agglutinative setting. The present work fills this gap by (i) constructing the first CEFR-graded lexicon for Kazakh; (ii) integrating HFST-based morphological analyses and expanded frequency resources into a type-level lexical complexity prediction task; and (iii) empirically comparing traditional feature-based models and multilingual transformer architectures under realistic low-resource constraints.

2.2. Datasets

We compile three resources: (1) an expert-curated Kazakh CEFR-graded lexicon, (2) a monolingual frequency corpus from the Leipzig Corpora Collection (LCC) [47], and (3) cross-lingual reference mappings to Russian and English.

2.2.1. Kazakh CEFR-Graded Lexicon

We construct the first CEFR-graded lexicon for Kazakh from a state-sponsored pedagogical handbook series covering A1 through C1 [48]. The handbook extraction yields 4561 lemma–POS entries before cleaning and deduplication. Unlike frequency-derived resources like CEFRLex [2], our lexicon inherits prescriptive difficulty labels directly from expert-curated pedagogical minima. The use of expert-curated pedagogical minima as a gold standard ensures the model’s alignment with Kazakhstan’s state educational standards. This guarantees the practical applicability of the system within the framework of Sustainable Development Goal 4 (Quality Education), enabling the automated creation of learning materials that correspond to the actual proficiency levels of students. Table 1 summarizes this handbook-extracted inventory across CEFR levels and parts of speech.
Nouns constitute approximately 43% of the inventory, and their absolute share increases across levels, reflecting expansion into abstract and technical terms. Table 1 also illustrates how closed classes (numerals, pronouns) concentrate in A1–B1, while adjectives peak at B1–C1. Table 2 gives representative entries at each proficiency level. Full data cleaning details appear in Appendix A.

2.2.2. Monolingual Frequency Corpus

Frequency of occurrence is a strong predictor of lexical difficulty [20]. We compute frequency features and retrieve contextual sentences from a 17-million-token Kazakh corpus from the Leipzig Corpora Collection (LCC) [47]. Text is lowercased and tokenized using a Cyrillic-only regex to filter noise. We compute two frequency signals: an exact surface-form count obtained by matching the lemma string exactly as it appears in the corpus, and a secondary lemma-level count obtained by passing every corpus token through the Apertium-kaz lemmatizer before lookup. The exact surface-form count is our primary contextual lookup mechanism; the lemmatized fallback count partially mitigates agglutinative sparsity but does not serve as our primary contextual anchor. Under this exact surface-string matching regime, 38.4% of the 4350 distinct surface forms are attested, contributing both counts and contextual embeddings; the remaining 61.6% default to zero frequency and isolated-word encodings. We additionally compile an expanded lemma-frequency map by running the Apertium-kaz lemmatizer over the full corpus, yielding approximately 418,000 unique lemma entries. This lemmatized lookup raises effective frequency coverage to approximately 83% of the cleaned modeling set, substantially mitigating the sparsity of exact surface-form matching.
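The exact surface-form lookup described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the character class, the function names (`tokenize`, `surface_counts`, `coverage`), and the two toy sentences are assumptions introduced here, and the lemmatized fallback via Apertium-kaz is not shown.

```python
import re
from collections import Counter

# Assumed character class: the Russian Cyrillic range plus the
# Kazakh-specific letters. The paper only says "Cyrillic-only regex".
KAZAKH_TOKEN = re.compile(r"[а-яёәғқңөұүһі]+")


def tokenize(text):
    """Lowercase the text and keep only runs of Kazakh Cyrillic letters."""
    return KAZAKH_TOKEN.findall(text.lower())


def surface_counts(sentences):
    """Count exact surface forms over an iterable of corpus sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


def coverage(surface_forms, counts):
    """Share of lexicon surface forms attested at least once in the corpus."""
    attested = sum(1 for form in surface_forms if counts[form] > 0)
    return attested / len(surface_forms)


# Toy corpus and lexicon (hypothetical, for illustration only).
corpus = ["Бала мектепке барды.", "Балалар 2024 жылы ойнады, бала!"]
counts = surface_counts(corpus)
lexicon_forms = ["бала", "кітап"]
share = coverage(lexicon_forms, counts)
```

In the toy corpus, балалар does not count toward бала. This is the agglutinative sparsity that the lemmatized fallback is meant to reduce.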

2.2.3. Cross-Lingual Reference Resources

To explore whether CEFR difficulty knowledge can be transferred from better-resourced languages to Kazakh, an approach that has shown promise for typologically related European languages [29], we project the 4350 distinct Kazakh surface forms into Russian and English CEFR lexicons via open-source machine translation (details in Section 2.3.5). Table 3 summarizes the source lexicons and MT pipelines. For lemmas with multiple translations matching the source lexicon, we assign the median CEFR level as a silver label. Unmatched translations are dropped. Russian serves as a natural projection bridge due to extensive lexical borrowing and high Kazakh–Russian bilingualism.
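A minimal sketch of the median-based silver labeling follows. The toy source lexicon and its levels are hypothetical and are not taken from EFLLex or KELLY. The tie rule for an even number of matches (`median_low`, which picks the lower middle level) is an assumption, since the text does not state how such ties are resolved.

```python
from statistics import median_low

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
LEVEL_TO_INT = {level: i + 1 for i, level in enumerate(LEVELS)}
INT_TO_LEVEL = {i: level for level, i in LEVEL_TO_INT.items()}


def project_silver_label(translations, source_lexicon):
    """Return the median CEFR level of the matched translations, or None.

    translations: candidate translations of one Kazakh surface form.
    source_lexicon: dict mapping source-language words to CEFR levels.
    """
    matched = [
        LEVEL_TO_INT[source_lexicon[t]]
        for t in translations
        if t in source_lexicon  # unmatched translations are dropped
    ]
    if not matched:
        return None
    # Assumption: with an even number of matches, take the lower middle level.
    return INT_TO_LEVEL[median_low(matched)]


# Hypothetical source lexicon; levels are placeholders.
toy_lexicon = {"child": "A1", "kid": "A2", "infant": "B2"}
odd_case = project_silver_label(["child", "kid", "infant", "bairn"], toy_lexicon)
even_case = project_silver_label(["child", "infant"], toy_lexicon)
no_match = project_silver_label(["bairn"], toy_lexicon)
```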

2.2.4. Dataset Partitioning

The handbook extraction yields 4561 lemma–POS entries. After cleaning and deduplication, the modeling set contains 4437 unique lemma–POS entries. Specifically, 18 entries with non-Cyrillic characters or length < 2 were removed, and 106 duplicate lemma–POS keys were resolved. Because some lemmas share the same surface form across POS categories, these 4437 entries correspond to 4350 distinct surface forms; all frequency and corpus lookups operate at the surface-form level, while model training uses the full 4437 lemma–POS set. These 4437 entries are partitioned into training, development, and test sets using a 70/15/15 split, yielding 3105, 666, and 666 instances, respectively. Stratified sampling by CEFR level ensures that the class distribution is preserved across all partitions. Table 4 reports the per-level counts.
The neural gated fusion model uses the 3105 training entries and reserves the development set for early stopping. Classical classifiers are evaluated on the same held-out test set of 666 entries.
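The partitioning can be illustrated with a small stratified split. This is a sketch on synthetic entries: the seed, the per-level rounding rule, and the function name `stratified_split` are assumptions, so it will not reproduce the exact 3105/666/666 partition reported above.

```python
import random
from collections import defaultdict


def stratified_split(entries, seed=0, train_frac=0.70, dev_frac=0.15):
    """Split (key, cefr_level) pairs into train/dev/test, level by level."""
    by_level = defaultdict(list)
    for entry in entries:
        by_level[entry[1]].append(entry)

    rng = random.Random(seed)
    train, dev, test = [], [], []
    for level in sorted(by_level):
        group = by_level[level]
        rng.shuffle(group)
        n_train = round(train_frac * len(group))
        n_dev = round(dev_frac * len(group))
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]  # remainder goes to the test split
    return train, dev, test


# Synthetic placeholder entries: 100 per level, not the real lexicon.
entries = [
    (f"lemma{i}_{level}", level)
    for level in ["A1", "A2", "B1", "B2", "C1"]
    for i in range(100)
]
train, dev, test = stratified_split(entries)

# Consistency check on the counts reported in the text.
cleaned_size = 4561 - 18 - 106   # removed entries and resolved duplicates
reported_total = 3105 + 666 + 666
```

Both `cleaned_size` and `reported_total` evaluate to 4437, matching the size of the cleaned modeling set.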

2.3. Methodology

This subsection outlines the modeling setup used to predict CEFR lexical difficulty for Kazakh. We define the task, describe the feature representations and model families, then present the cross-lingual projection analysis and evaluation protocol.

2.3.1. Task Formulation

We frame Kazakh CEFR lexical complexity prediction as ordinal five-class classification. Given a lemma–POS pair $(w, p)$ from the lexicon and, when available, a corpus sentence $s$ containing $w$, the task is to predict one of the ordered labels. Equation (1) presents the label space:
$$Y = \{\mathrm{A1}, \mathrm{A2}, \mathrm{B1}, \mathrm{B2}, \mathrm{C1}\}, \tag{1}$$
encoded as integers $\{1, \ldots, 5\}$. The task is type-level rather than token-level: each lemma–POS entry receives a single CEFR label regardless of its observed corpus contexts.
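The integer encoding makes the distance between levels meaningful, which is what a mean absolute error over CEFR labels relies on. The sketch below shows the encoding together with the two metrics reported in the abstract. It assumes MAE is computed on the integer codes and that macro-F1 averages over all five levels, with an undefined per-class F1 counted as 0. The function names and toy predictions are illustrative.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1"]
ENCODE = {level: i for i, level in enumerate(LEVELS, start=1)}  # A1 -> 1, ..., C1 -> 5


def mean_absolute_error(gold, pred):
    """Average distance between gold and predicted levels, in CEFR steps."""
    return sum(abs(ENCODE[g] - ENCODE[p]) for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-level F1 scores over the five levels."""
    scores = []
    for level in LEVELS:
        tp = sum(g == level and p == level for g, p in zip(gold, pred))
        fp = sum(g != level and p == level for g, p in zip(gold, pred))
        fn = sum(g == level and p != level for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # Assumption: a level with no gold and no predicted items scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical predictions).
gold = ["A1", "A2", "B1", "B2", "C1"]
pred = ["A1", "A1", "B1", "C1", "C1"]
mae = mean_absolute_error(gold, pred)
f1 = macro_f1(gold, pred)
```

In the toy example the two errors are each one level away from the gold label, so the MAE is 2/5.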

2.3.2. Feature Engineering

For the classical baselines, each lemma–POS pair is represented by a fixed 40-dimensional handcrafted vector organized into five groups. We also evaluate a separate 50-dimensional contextual embedding baseline obtained by applying PCA to XLM-RoBERTa-base representations. The neural model uses a richer engineered bank and retains a train-selected 72-dimensional lexical subset for the MorphMLP branch. Table 5 summarizes the feature groups, and Appendix B provides the full definitions and extraction algorithm.
Among the fixed feature groups, the morphological features extracted from the Apertium HFST analyzer [10] provide the strongest isolated signal in Table 5, which fits Kazakh’s productive morphology. Frequency features from the 17-million-token LCC corpus also contribute useful information, while the embeddings-only baseline remains competitive without surpassing the best handcrafted configuration. Full technical details appear in Appendix B.

2.3.3. Classical Classifiers

We evaluate three classical classifiers from scikit-learn, each preceded by StandardScaler feature normalization:
  • Logistic Regression (LR): Multinomial L2-regularized; max_iter = 2000, inverse-class weights.
  • Random Forest (RF): 200 trees, max_depth = 15, inverse-frequency weights.
  • Gradient Boosting (GB): 200 rounds, max_depth = 6.
Each classifier is evaluated on the full 40-dimensional handcrafted feature set. To isolate the contribution of each feature group, we additionally report nested ablation experiments using frequency-only, orthographic-only, morphological-only, POS-only, and TF-IDF-only subsets (Table 5).
For the stacking ensemble only, we additionally train an ExtraTrees classifier as a diverse tree-based probability source. It is used inside the ensemble but is not reported as a standalone baseline.
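The text mentions inverse-class and inverse-frequency weights without giving a formula. One common convention, sketched below, is the "balanced" heuristic n_samples / (n_classes × class_count), which is also what scikit-learn applies for `class_weight="balanced"`. Whether the experiments used exactly this normalization is an assumption, and the toy labels are placeholders.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Toy label list (hypothetical): three A1 items and one C1 item.
toy_labels = ["A1", "A1", "A1", "C1"]
weights = inverse_frequency_weights(toy_labels)
```

Under this convention a rare level receives a proportionally larger weight, so errors on under-represented levels cost more during training.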

2.3.4. Gated Morphology–Context Fusion Model

To compare a learned fusion mechanism with simpler ways of combining contextual semantics and morphological structure, we use a dual-encoder architecture with gated fusion. Figure 1 summarizes the audited XLM-R configuration.
Context Encoder
XLM-RoBERTa-base (xlm-roberta-base) [8] encodes the sentential context. The target word is delimited with special [TGT] tokens, and its representation $h_{\mathrm{ctx}} \in \mathbb{R}^{d_{\mathrm{ctx}}}$ is obtained by mean-pooling over the corresponding subword positions in the final hidden layer, where $d_{\mathrm{ctx}} = 768$. To mitigate overfitting on the training set of 3105 instances, we freeze all transformer layers except the top $n = 2$, reducing the number of trainable transformer parameters to approximately 8.3% of the total.
Morphological Encoder
The morphological encoder consists of two parallel sub-networks whose outputs are concatenated and linearly projected. Unlike the classical baselines in Section 2.3.2, this branch does not consume the fixed 40-dimensional interpretable vector. Instead, it receives a 72-dimensional lexical feature subset drawn from the larger engineered feature bank. Feature selection is performed exclusively on the training split using ExtraTrees importance ranking (with a minimum of 20 morphological features retained), and the resulting feature mask is frozen before any development or test evaluation, ensuring no information leakage. Selected features are z-scored using training-set statistics.
  • CharCNN: a character-level convolutional network over the Kazakh Cyrillic alphabet (44 characters including pad and unknown tokens). Characters are embedded into $d = 64$ dimensions, then processed by three parallel 1D convolutions with kernel sizes $\{2, 3, 4\}$ and 128 filters each, followed by ReLU activation and max-over-time pooling, yielding a vector $h_{\mathrm{charcnn}} \in \mathbb{R}^{384}$.
  • MorphMLP: a two-hidden-layer MLP with LayerNorm that projects the 72-dimensional selected lexical feature vector through hidden layers of 96 units each with ReLU activation and dropout $p = 0.3$, yielding $h_{\mathrm{mlp}} \in \mathbb{R}^{96}$.
The concatenated CharCNN and MorphMLP outputs (480 dimensions) are linearly projected to $d_{\mathrm{morph}} = 384$.
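The leakage-free standardization of the selected lexical features can be sketched as follows: the statistics are fitted on training rows only and then reused unchanged for development and test rows. The function names, the guard for constant columns, and the toy values are assumptions.

```python
from statistics import mean, pstdev


def fit_zscore(train_rows):
    """Compute per-feature mean and standard deviation on training rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # Assumption: constant features get a unit scale to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_zscore(rows, means, stds):
    """Standardize any split with the frozen training statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy feature rows (hypothetical values, two features).
train_rows = [[1.0, 5.0], [3.0, 5.0]]
dev_rows = [[5.0, 6.0]]
means, stds = fit_zscore(train_rows)
dev_scaled = apply_zscore(dev_rows, means, stds)
```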
Gated Fusion
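Rather than simple concatenation, we learn a sigmoid gate that weights the relative contribution of contextual and morphological representations:
$$c = W_c h_{\mathrm{ctx}} + b_c, \tag{2}$$
$$m = W_m h_{\mathrm{morph}} + b_m, \tag{3}$$
$$g = \sigma\left(W_g [h_{\mathrm{ctx}}; h_{\mathrm{morph}}] + b_g\right), \tag{4}$$
$$f = g \odot c + (1 - g) \odot m, \tag{5}$$
where $W_c \in \mathbb{R}^{d_f \times d_{\mathrm{ctx}}}$, $W_m \in \mathbb{R}^{d_f \times d_{\mathrm{morph}}}$, $W_g \in \mathbb{R}^{d_f \times (d_{\mathrm{ctx}} + d_{\mathrm{morph}})}$, $d_f = 256$, and $\odot$ denotes the elementwise product. Equations (2)–(5) define the fusion block. The fused representation $f$ is passed through layer normalization, dropout with $p = 0.3$, and a linear classifier $W_{\mathrm{cls}} \in \mathbb{R}^{K \times d_f}$ with softmax activation.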
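The fusion block can be traced numerically with a dependency-free sketch. It is not the PyTorch implementation used in the experiments. The dimensions are shrunk to 2 for readability (the paper uses $d_f = 256$), the weights are hand-picked toy values, and the product between the gate and each projection is taken elementwise.

```python
import math


def affine(W, x, b):
    """Compute W x + b for a list-of-rows matrix W and list vectors x, b."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(h_ctx, h_morph, W_c, b_c, W_m, b_m, W_g, b_g):
    """Project both inputs, compute the gate, and mix the projections."""
    c = affine(W_c, h_ctx, b_c)                                  # context projection
    m = affine(W_m, h_morph, b_m)                                # morphology projection
    g = [sigmoid(z) for z in affine(W_g, h_ctx + h_morph, b_g)]  # gate on [h_ctx; h_morph]
    return [g_i * c_i + (1.0 - g_i) * m_i for g_i, c_i, m_i in zip(g, c, m)]


# Toy setup (assumed values): identity projections, zero gate weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros_gate = [[0.0] * 4, [0.0] * 4]
h_ctx, h_morph = [2.0, 0.0], [0.0, 4.0]

# Zero gate logits give g = 0.5, so f is the average of c and m.
f_balanced = gated_fusion(h_ctx, h_morph, identity, [0.0, 0.0],
                          identity, [0.0, 0.0], zeros_gate, [0.0, 0.0])

# A large positive gate bias pushes g toward 1, so f approaches c.
f_context = gated_fusion(h_ctx, h_morph, identity, [0.0, 0.0],
                         identity, [0.0, 0.0], zeros_gate, [50.0, 50.0])
```

Because the gate is a vector, each fused dimension can lean toward the contextual or the morphological projection independently.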
Training Procedure
The model is optimized with AdamW in PyTorch using discriminative learning rates and focal loss [51] on a single NVIDIA L4 GPU. Full training details appear in Appendix C.

2.3.5. Cross-Lingual Projection

To explore whether translation can yield useful CEFR supervision, we align the 4350 distinct Kazakh surface forms with the English EFLLex (13,871 entries) and Russian KELLY (8947 entries) datasets using the translated lexicons described in Section 2.2.3. For items matched by exact string translation, the median source proficiency level is projected as a silver CEFR label. Our main analysis treats these projected labels as a diagnostic signal of cross-lingual alignment; their use for downstream augmentation remains exploratory.
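The projection rule can be sketched as follows. The lexicon contents are invented toy entries rather than real EFLLex or KELLY data, the function name `project_label` is hypothetical, and the use of the lower median for an even number of matches is an assumption, since the text does not say how ties are resolved.

```python
from statistics import median_low
from typing import Dict, List, Optional

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def project_label(translations: List[str],
                  source_lexicon: Dict[str, List[str]]) -> Optional[str]:
    """Return the median source level over all exact-match entries, or None if unmatched."""
    ranks: List[int] = []
    for candidate in translations:
        for level in source_lexicon.get(candidate, []):  # exact string match only
            ranks.append(LEVELS.index(level))
    if not ranks:
        return None  # no silver label for this item
    return LEVELS[median_low(ranks)]  # assumption: lower median on ties


# Invented toy lexicon: translated form -> levels of all matching source entries.
toy_lexicon = {
    "culture": ["B1", "B2", "B1"],
    "apple": ["A1"],
}

print(project_label(["culture"], toy_lexicon))
print(project_label(["unmatched"], toy_lexicon))
```

Items that return `None` simply receive no silver label, which is what drives the low coverage of the exact-match regime.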

2.3.6. Direct-Prompt LLM Baselines

To address whether instruction-tuned large language models can solve the task without task-specific training, we add direct-prompt baselines using both general multilingual and Kazakh-focused open-weight models. The general baselines are Qwen2.5-0.5B-Instruct and Qwen2.5-7B-Instruct [52]. The Kazakh-focused baselines are Sherkala-Chat 8B [53] and ISSAI KAZ-LLM 8B [54]. Each prompt gives the Kazakh lemma, its POS tag, and the closed label set {A1, A2, B1, B2, C1}; the model must return a single CEFR label. We test both Kazakh and English prompt templates and use deterministic decoding with no sampling. Labels are extracted from the generated text by matching the first valid CEFR label; unparseable outputs are counted as incorrect. No train or development labels are used to update the model, tune prompts, or calibrate decision thresholds.
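The label-extraction rule (first valid CEFR label wins, unparseable outputs count as incorrect) can be sketched with a regular expression. The exact pattern used in the study is not given, so the word-boundary, case-sensitive pattern below is an assumption, as are the function names and the example outputs.

```python
import re
from typing import List, Optional

# Assumption: labels must appear as standalone, upper-case tokens.
CEFR_PATTERN = re.compile(r"\b(A1|A2|B1|B2|C1)\b")


def extract_label(generated: str) -> Optional[str]:
    """Return the first valid CEFR label in the generated text, or None."""
    match = CEFR_PATTERN.search(generated)
    return match.group(1) if match else None


def prompt_accuracy(outputs: List[str], gold: List[str]) -> float:
    """Accuracy in which unparseable outputs (None) never match the gold label."""
    return sum(extract_label(o) == g for o, g in zip(outputs, gold)) / len(gold)


print(extract_label("The level is B2, possibly B1."))
print(extract_label("I am not sure."))
print(prompt_accuracy(["A1", "Level: B1", "unknown"], ["A1", "B2", "C1"]))
```

A label outside the closed set, such as C2, is not matched by the pattern and is therefore scored the same way as any other unparseable output.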

2.3.7. Experimental Procedure

All supervised models use the same stratified train/development/test partitions. Training uses only the training split; the development split is reserved for hyperparameter selection, early stopping, and ensemble weight tuning; and the test split is evaluated once for the reported held-out scores. We do not retrain on the combined train+development set before test evaluation. The classical baselines are trained with standardized feature vectors, the neural ablations use the shared optimization settings in Appendix C, and the direct-prompt LLM baselines described in Section 2.3.6 are evaluated only as external comparison systems.
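A stratified three-way partition of the kind described above can be sketched with the standard library alone. The 70/15/15 ratios follow the partitions used for the reported results; the seed, the per-class rounding rule, and the function name `stratified_split` are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def stratified_split(items: Sequence[str], labels: Sequence[str],
                     ratios: Tuple[float, float, float] = (0.70, 0.15, 0.15),
                     seed: int = 42) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Split items into train/dev/test while preserving per-label proportions."""
    by_label: Dict[str, List[str]] = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    rng = random.Random(seed)  # assumption: fixed seed for a reproducible split
    train: List[Pair] = []
    dev: List[Pair] = []
    test: List[Pair] = []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(ratios[0] * len(group))
        n_dev = round(ratios[1] * len(group))
        train += [(x, label) for x in group[:n_train]]
        dev += [(x, label) for x in group[n_train:n_train + n_dev]]
        test += [(x, label) for x in group[n_train + n_dev:]]  # remainder goes to test
    return train, dev, test


# Toy data: 20 placeholder items for each of the five levels.
toy_labels = [lvl for lvl in ["A1", "A2", "B1", "B2", "C1"] for _ in range(20)]
toy_items = [f"item{i}" for i in range(len(toy_labels))]
train, dev, test = stratified_split(toy_items, toy_labels)
print(len(train), len(dev), len(test))
```

The development portion would then be used for model selection only, and the test portion would be scored once.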

2.3.8. Evaluation Measures

Let $N$ denote the number of evaluation instances, $\hat{y}_i$ the predicted CEFR level encoded as an ordinal integer in $\{1, \dots, 5\}$, $y_i$ the corresponding gold label, and $K = 5$ the number of proficiency levels.
Accuracy (Acc)
Accuracy (Acc) is the fraction of exactly correct predictions:
$$\text{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(\hat{y}_i = y_i),$$
where $\mathbb{I}(\cdot)$ is the indicator function.
Macro-F1
The unweighted average of per-class F1 scores:
$$\text{Macro-F1} = \frac{1}{K} \sum_{k=1}^{K} \text{F1}_k.$$
Mean Absolute Error (MAE)
The average ordinal distance between predicted and gold levels:
$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |\hat{y}_i - y_i|.$$
Within-1 Accuracy (W-1)
Within-1 Accuracy (W-1) is the fraction of predictions within one CEFR level of the gold label ($|\hat{y}_i - y_i| \leq 1$).
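The four measures can be computed directly from integer-encoded labels, as in the following sketch. The toy label vectors are invented, and scoring a class with no true or predicted instances as F1 = 0 is an assumption about the zero-division convention.

```python
from typing import Sequence

K = 5  # number of CEFR levels, encoded 1..5


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], k: int = K) -> float:
    scores = []
    for c in range(1, k + 1):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # assumption: empty class scores 0
    return sum(scores) / k


def mae(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(abs(p - t) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1, 2, 3, 4, 5, 3]
y_pred = [1, 3, 3, 2, 5, 4]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred),
      mae(y_true, y_pred), within_1(y_true, y_pred))
```

In this toy case half of the predictions are exact, yet five of six fall within one level, which illustrates why Acc and W-1 can diverge sharply on an ordinal scale.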

3. Results

We address our research questions through three sets of experiments: (1) evaluating supervised performance of classical and neural models, (2) diagnosing per-level and ordinal error distributions, and (3) testing cross-lingual projection. All reported supervised scores use the fixed 70/15/15 partitions.

3.1. Supervised Lexical Complexity Prediction

Table 6 orders the systems on the held-out test set as follows: the handcrafted baseline is competitive, the fusion model is stronger, and the ensemble scores best overall, although its margin over the standalone fusion model is small (Section 3.1.2). The experimental procedure used for training, development selection, and held-out evaluation is described in Section 2.3.7.
The direct-prompt LLM comparison in Table 7 provides an external baseline against recent open-weight instruction models. These models are not fine-tuned on the Kazakh CEFR lexicon, and their results are therefore best interpreted as a test of whether LLM priors can replace task-specific supervision. The answer is negative in the present setting: the best direct-prompt LLM result comes from ISSAI KAZ-LLM 8B with the English prompt (Macro-F1 0.214), but it remains well below the full-feature LR baseline and the supervised fusion models.

3.1.1. RQ1: Handcrafted Morphology Versus Context

RQ1 asks whether interpretable handcrafted features are competitive with an embeddings-only baseline. Table 6 answers this directly: the 40-feature LR reaches Macro-F1 0.314, outperforming the XLM-R PCA(50) embeddings-only LR (0.298), and both substantially exceed the frequency-only baseline (0.173). This indicates that the engineered representation captures morphological, orthographic, and frequency structure that the reduced embedding space does not fully retain, while also confirming that contextual embeddings provide a strong standalone signal for Kazakh CEFR prediction.

3.1.2. RQ2: Gated Fusion and Architectural Ablations

RQ2 is answered mainly by the ablation hierarchy in Table 8. Table 6 reports the held-out scores for the final audited model configurations on the fixed split, whereas Table 8 reports five-seed means and standard deviations for the architectural ablations. The latter is intended to show stability and component-wise behavior rather than to replace the final-model comparison. All ablation conditions use the same training protocol, and Appendix C gives the shared settings.
Character-level patterns alone are the weakest condition in the ablation, and adding engineered morphological features improves the lexical branch. This shows that explicit morphology contributes information that the CharCNN does not recover reliably from the limited training data. XLM-R context-only remains stronger than the morphology-only variants, highlighting the value of pretrained contextual representations. The strongest supervised results come from combining the two signal types: both fusion variants outperform the best classical baseline, while gated and concatenation fusion show no clear separation in Appendix D. The ensemble yields the best overall test metrics, but its margin over the standalone fusion model remains small and not clearly reliable.

3.2. Diagnostic Analysis

To better understand model behavior, we report per-level F1 scores and the distribution of prediction errors for the key configurations.
Errors concentrate in the intermediate levels rather than at the extremes. Table 9 shows that all supervised models struggle most with A2 and B1, while A1 and C1 are predicted more reliably. This fits the ordinal structure of the dataset: adjacent intermediate levels share more lexical and morphological properties, whereas the extremes are more distinct. Within that pattern, the fusion model is especially helpful on B2, while the ensemble shifts some of the gain toward A2, suggesting different tradeoffs across the intermediate classes.
The main benefit of the neural models is a reduction in larger ordinal mistakes. Table 10 shows that the fusion model increases exact matches and lowers the rate of errors that miss by two or more levels relative to the logistic baseline. The ensemble pushes that trend slightly further and yields the best MAE, but all models remain below 70% W-1 accuracy, underscoring the difficulty of fine-grained ordinal classification across five CEFR levels.

Qualitative Error Patterns

Manual inspection of misclassified test entries reveals several recurring patterns. First, morphologically complex A2 words such as туыстық (kinship, A2) are often predicted as B1 or B2: their derivational structure resembles higher-level vocabulary, but they denote everyday family concepts taught early. Second, abstract B2 nouns like мәдениет (culture) are frequently confused with B1 when they appear in high-frequency corpus contexts, pulling the contextual signal toward an easier level. Third, the gated fusion model corrects several cases where the classical baseline fails—for example, жарнама (advertisement, B1) is correctly classified by the fusion model but predicted as A2 by LR, because the morphological encoder captures the derivational suffix -нама as a productive B1+ indicator. These patterns suggest that the intermediate-level confusion is driven by a genuine overlap in lexical and morphological properties between adjacent CEFR levels, rather than by systematic model failure.

3.3. Cross-Lingual CEFR Projection

This experiment (RQ3) evaluates cross-lingual CEFR projection for Kazakh words translated into Russian and English. We translate all 4350 distinct Kazakh surface forms derived from the cleaned modeling set of 4437 unique lemma–POS entries to Russian (via Tilmash [40]) and to English (via OPUS-MT [50] and a Russian-pivot path, as described in Section 2.3.5). Translated words are looked up in the KELLY Russian CEFR lexicon [29] with 8947 entries and EFLLex [49] with 13,871 entries.

Projection and Augmentation

Cross-lingual projection is more useful as a diagnostic than as a training signal in the current setup. Table 11 shows moderate alignment with Russian but weak alignment with English, and Appendix E details the Russian breakdown by difficulty band and part of speech. Under this low-coverage exact-match regime, projected labels are best interpreted as evidence about cross-lingual consistency rather than as a stable source of downstream supervision.

4. Discussion

Our discussion focuses on what the experiments show reliably for this dataset and what remains unresolved. Across the supervised analyses, the consistent pattern is that morphology helps, contextual encoders are strong, and combining the two yields the clearest gains.

4.1. When Handcrafted Morphology Still Matters

Explicit morphological modeling matters because it adds information that character patterns alone do not recover reliably. The CharCNN cannot represent paradigm-level properties such as analysis count, derivational depth, and inflectional categories, whereas the MorphMLP branch improves the lexical model when those features are added. Appendix D partially supports this contrast: McNemar’s test on per-instance correctness is significant (p = 0.002), whereas the paired bootstrap difference in Macro-F1 falls short of significance (p = 0.068).
While our experiments are limited to Kazakh, this finding may be relevant to other agglutinative languages—Turkish, Finnish, and other Turkic languages—that share the property of productive suffixation and for which finite-state morphological analyzers are available. Whether the same pattern holds in those settings remains an empirical question.

4.2. Features and Embeddings

Pretrained contextual representations are stronger than the morphology-only neural branch, so the paper does not argue for hand-engineering in place of modern encoders. Instead, Table 6 and Table 8 point to complementarity: handcrafted features provide a solid classical baseline, XLM-R context-only is stronger than the morphology-only variants, and fusion models perform best overall. This pattern is consistent with findings in other low-resource LCP settings [20,21].

4.3. Direct-Prompt LLMs

The LLM experiment strengthens the interpretation that task-specific lexical supervision is still necessary for Kazakh CEFR prediction. Table 7 shows that direct prompting underperforms both the supervised fusion model and the stronger classical baselines in Macro-F1. Kazakh-focused LLMs improve over some general multilingual prompt baselines—ISSAI KAZ-LLM 8B with the English prompt reaches Macro-F1 0.214—but the gap remains substantial against the gated fusion model (0.360 Macro-F1) and the full-feature LR baseline (0.314 Macro-F1). This indicates that broad Kazakh language modeling capacity does not by itself recover the pedagogical level distinctions encoded in the Kazakh lexical minima.

4.4. Gated vs. Concatenation Fusion

The specific fusion mechanism appears less important than the availability of both signal types. Across the multi-seed ablation and the significance appendix, gated and concatenation fusion show no clear performance separation. We therefore interpret the gain as evidence for representation complementarity rather than for a uniquely superior gating design.

4.5. Ensembling

The ensemble offers a small additional gain, but not a decisive one. It yields the best overall test metrics, yet Appendix D does not show a clear advantage over the standalone fusion model. Its main value is that it redistributes errors across the most difficult classes, suggesting partially complementary failure modes between neural and classical systems.

4.6. Interpreting Cross-Lingual Projection

Cross-lingual projection remains more convincing as an analysis tool than as supervision. Russian shows moderate alignment with the Kazakh labels, whereas English remains weak and noisy under the current exact-match pipeline. This suggests that cross-lingual CEFR information may still be useful, but the present setup is not reliable enough for downstream augmentation.

4.7. Ordinal Structure

Although CEFR levels are inherently ordered and we report ordinal-aware metrics (MAE, Within-1), all models in this study use nominal classification losses. Explicit ordinal classifiers such as CORAL [55] or cumulative-link models could exploit the label ordering directly and may reduce large-distance errors. We leave a systematic comparison of ordinal loss functions to future work, noting that the current Within-1 accuracy of 69.2% already suggests room for improvement through ordinal-aware training.

4.8. Limitations

Our findings are subject to several constraints: (1) Lexicon scale: The cleaned modeling set of 4437 unique lemma–POS entries is modest compared to European resources, limiting the reliability of per-level estimates. (2) Validation: Labels reflect theoretical curriculum placement rather than empirical learnability, as they are derived from pedagogical handbooks without validation against learner performance. (3) Range: The absence of C2-level data limits the generalizability of our observations across the full CEFR scale. (4) Frequency noise: Exact surface-form lookup covers only 38.4% of the distinct surface forms in the modeling set; the expanded lemma-level lookup raises effective frequency coverage to approximately 83% of the modeling set, but the remaining entries still default to zero frequency. (5) Context mismatch: Type-level labels are paired with instance-level embeddings retrieved from corpus occurrences of the surface form. For homographic or multi-POS entries, retrieved sentences may represent a different sense or grammatical category, potentially introducing noise in the contextual signal.

4.9. Future Work

Several directions could strengthen and extend this work: (i) expanding the lexicon via additional pedagogical sources and expert annotation; (ii) collecting learner-based validation data to ground CEFR labels in empirical difficulty; (iii) richer morphological features such as paradigm entropy and suffix productivity; (iv) confidence-weighted cross-lingual projection that down-weights noisy silver labels; and (v) lemmatizing the source corpus prior to frequency counting.

5. Conclusions

This paper contributes a new resource and a bounded empirical result for Kazakh CEFR modeling. We introduce the first CEFR-graded lexicon for a Turkic language and use it to compare handcrafted, contextual, and fusion-based approaches to lexical difficulty prediction. The experiments show that engineered morphological information improves over character-only lexical modeling and that models combining morphology with contextual representations perform best overall on this dataset. At the same time, the results do not support a strong claim that gated fusion is better than simpler fusion. Taken together, the findings indicate that morphology-aware modeling is useful for Kazakh lexical difficulty prediction, while generalization to other Turkic languages remains an open empirical question. The presented resources and models lay a foundation for intelligent Kazakh language-learning systems and thereby support Sustainable Development Goal 4 (Quality Education) by bringing NLP tools into educational practice.

Author Contributions

Conceptualization, G.Y., A.A., Z.G. and N.O.; methodology, G.Y., A.A., Z.G. and N.O.; software, G.Y., A.A. and Z.G.; formal analysis, G.Y., A.A., Z.G. and N.O.; investigation, G.Y., A.A., Z.G. and N.O.; visualization, G.Y., A.A. and Z.G.; writing—original draft preparation, G.Y., A.A., Z.G. and N.O.; writing—review and editing, G.Y., A.A., Z.G. and N.O.; supervision, A.A. and N.O.; project administration, A.A.; funding acquisition, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by the Ministry of Science and Higher Education of the Republic of Kazakhstan under the “Zhas Galym” project for 2025–2027 (Individual Registration Number AP25793799, “Adaptive text translation and Kazakh language teaching system based on neural network algorithms”).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The CEFR-graded lexicon, trained model checkpoints, and experiment code are archived at Zenodo (https://doi.org/10.5281/zenodo.19365834). The Kazakh monolingual frequency corpus used for feature extraction is from the Leipzig Corpora Collection and is publicly available at https://wortschatz.uni-leipzig.de/en/download/Kazakh (accessed on 11 March 2026).

Acknowledgments

The authors express their gratitude to the Institute of Smart Systems and Artificial Intelligence (ISSAI), particularly to Yerbol Absalyamov, for providing access to the high-performance computing resources.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
Acc: Accuracy
BERT: Bidirectional Encoder Representations from Transformers
CEFR: Common European Framework of Reference for Languages
CharCNN: Character-level convolutional neural network
CNN: Convolutional neural network
CRF: Conditional random field
ET: Extra trees
GB: Gradient boosting
HFST: Helsinki Finite-State Toolkit
LCC: Leipzig Corpora Collection
LCP: Lexical complexity prediction
LLM: Large language model
LR: Logistic regression
MAE: Mean absolute error
MLP: Multi-layer perceptron
MT: Machine translation
NLP: Natural language processing
PCA: Principal component analysis
POS: Part of speech
RF: Random forest
SDG: Sustainable Development Goal
TF-IDF: Term frequency–inverse document frequency
W-1: Within-1 accuracy
XLM-R: Cross-lingual Language Model–RoBERTa

Appendix A. Lexicon Construction and Cleaning Pipeline

Appendix A.1. Source Materials

The lexicon is derived from five official “Lexical Minimum” handbooks published by the Republic of Kazakhstan’s Ministry of Education for levels A1 through C1 [48]. Each handbook lists vocabulary items that learners are expected to acquire at the corresponding CEFR level, organized by part of speech. These handbooks are state-sponsored pedagogical standards used in Kazakh-language certification and curriculum design.

Appendix A.2. Extraction

All single-word entries were extracted via automated PDF parsing using a custom Python 3.10.0 pipeline. Each page was parsed with layout-aware text extraction to recover tabular structure.

Appendix A.3. Cleaning Pipeline

To obtain a lexicon suitable for computational modeling, we applied the following filtering steps:
  • Multiword removal: entries containing whitespace or hyphens were excluded, as this study targets type-level, single-word complexity prediction.
  • Script filtering: tokens containing non-Cyrillic characters were removed.
  • Length filtering: tokens shorter than two characters were discarded.
  • POS normalization: original handbook POS labels were mapped to a seven-way tagset: Noun, Verb, Adj, Adv, Num, Pron, and Other.
  • Duplicate resolution: 106 duplicate lemma–POS keys were resolved. Homographic forms with distinct POS tags were retained as separate entries; where the same lemma–POS pair appeared at multiple CEFR levels across handbooks, the lowest level was assigned.
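The filtering steps listed above can be sketched as a small pipeline. The input entries are illustrative, the Unicode range used for the script filter and the function names are assumptions, and handbook-specific POS labels are represented only by a fallback to Other because the original label inventory is not given here.

```python
import re
from typing import Dict, Iterable, Tuple

LEVEL_RANK = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5}
TAGSET = {"Noun", "Verb", "Adj", "Adv", "Num", "Pron"}  # anything else maps to "Other"
CYRILLIC_ONLY = re.compile(r"^[\u0400-\u04FF]+$")       # assumption: basic Cyrillic block

Entry = Tuple[str, str, str]  # (lemma, pos, level)


def keep_token(token: str) -> bool:
    if re.search(r"[\s-]", token):        # multiword removal: whitespace or hyphen
        return False
    if not CYRILLIC_ONLY.match(token):    # script filtering
        return False
    return len(token) >= 2                # length filtering


def clean_lexicon(entries: Iterable[Entry]) -> Dict[Tuple[str, str], str]:
    lexicon: Dict[Tuple[str, str], str] = {}
    for lemma, pos, level in entries:
        if not keep_token(lemma):
            continue
        key = (lemma, pos if pos in TAGSET else "Other")  # POS normalization
        # Duplicate resolution: keep the lowest level seen for a lemma-POS key.
        if key not in lexicon or LEVEL_RANK[level] < LEVEL_RANK[lexicon[key]]:
            lexicon[key] = level
    return lexicon


raw_entries = [
    ("алма", "Noun", "B1"), ("алма", "Noun", "A1"),  # same key at two levels
    ("жаз", "Verb", "A2"), ("жаз", "Noun", "A1"),    # homographs with distinct POS
    ("қара көз", "Adj", "B1"),                       # multiword entry
    ("e-mail", "Noun", "A2"),                        # hyphen and non-Cyrillic
    ("у", "Noun", "B2"),                             # shorter than two characters
    ("және", "Conj", "A1"),                          # tag outside the tagset
]
lexicon = clean_lexicon(raw_entries)
print(lexicon)
```

Homographs survive as separate entries because the key includes the POS tag, whereas a repeated lemma–POS pair collapses to its lowest level.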

Appendix A.4. Quality Checks

After automated cleaning, a native Kazakh-speaking annotator reviewed all entries flagged by the Apertium transducer as unrecognized. Common issues included minor orthographic variants and loanwords absent from the transducer’s vocabulary; these were retained in the lexicon with their handbook-assigned levels.

Appendix A.5. Licensing

The source handbooks are published as official educational standards by the Republic of Kazakhstan and are freely available. The derived lexicon consists solely of lemma–POS–level triples and does not reproduce the full text or presentation of the original handbooks.
The handbook extraction yields 4561 lemma–POS entries. After the cleaning and deduplication described in Section 2.2.4, the modeling set used in the experiments contains 4437 unique lemma–POS entries.

Appendix B. Feature Engineering Details

This appendix provides the full technical definitions of the handcrafted features and the formal extraction pipeline summarized in Section 2.3.2.

Appendix B.1. Extraction Algorithm

Algorithm A1 distinguishes the 40-dimensional handcrafted feature vector from the separate 50-dimensional contextual-embedding baseline (XLM-RoBERTa-base with PCA) used in Table 5; they are extracted in parallel, not concatenated into a single classical feature set.

Appendix B.2. Feature Group Definitions

Appendix B.2.1. Orthographic Features

We extract six features: (1) character length, (2) total vowel count (in Kazakh Cyrillic orthography we count 12 vowel characters: seven front {ә, е, и, ө, ү, і, э} and five back {а, о, ұ, ы, у}; counted as a single aggregate feature), (3) consonant count, (4) rare-character count (the nine Kazakh-specific Cyrillic characters), (5) heuristic syllable count (number of vowel nucleus groups), and (6) a binary indicator for long words, defined by comparing the word length $|w|$ against a threshold of 8 characters.
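A minimal sketch of the six orthographic features follows. Several details are assumptions rather than the authors' implementation: the long-word flag is taken to fire at length 8 or more, every letter outside the 12-vowel list is counted as a consonant, and the function and key names are hypothetical.

```python
from typing import Dict

FRONT_VOWELS = set("әеиөүіэ")
BACK_VOWELS = set("аоұыу")
VOWELS = FRONT_VOWELS | BACK_VOWELS
KAZAKH_SPECIFIC = set("әғқңөұүһі")  # the nine letters added to the basic Cyrillic set
LONG_WORD_MIN_LEN = 8               # assumption: flag fires when len(word) >= 8


def orthographic_features(word: str) -> Dict[str, int]:
    w = word.lower()
    n_vowels = sum(ch in VOWELS for ch in w)
    n_consonants = sum(ch.isalpha() and ch not in VOWELS for ch in w)
    n_rare = sum(ch in KAZAKH_SPECIFIC for ch in w)
    # Heuristic syllable count: number of maximal runs of adjacent vowels.
    n_syllables = 0
    prev_vowel = False
    for ch in w:
        is_vowel = ch in VOWELS
        if is_vowel and not prev_vowel:
            n_syllables += 1
        prev_vowel = is_vowel
    return {
        "length": len(w),
        "n_vowels": n_vowels,
        "n_consonants": n_consonants,
        "n_rare": n_rare,
        "n_syllables": n_syllables,
        "is_long": int(len(w) >= LONG_WORD_MIN_LEN),
    }


feats = orthographic_features("туыстық")
print(feats)
```

For туыстық the adjacent vowels у and ы form a single nucleus group, so the heuristic reports two syllable groups for three vowel characters.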
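Algorithm A1 Handcrafted Feature and Embedding Extraction
Require: Lemma–POS pair $(w, \text{pos})$; exact frequency dict $F_{\text{exact}}$; lemma frequency dict $F_{\text{lemma}}$; corpus size $N$; corpus $\mathcal{C}$; retrieval budget $k$; HFST transducer $T$
Ensure: Handcrafted vector $x_{\text{hand}}$; embedding vector $x_{\text{emb}}$
 1: $\phi_{\text{orth}} \leftarrow \text{Orthography}(w)$ ▹ 6 orthographic features
 2: $\phi_{\text{pos}} \leftarrow \text{EncodePos}(\text{pos})$ ▹ 5-way one-hot + content-word flag
 3: $f \leftarrow F_{\text{exact}}.\text{GET}(w, 0)$
 4: $f' \leftarrow F_{\text{lemma}}.\text{GET}(w, 0)$
 5: $\phi_{\text{freq}} \leftarrow \text{FreqStats}(f, f', N)$ ▹ 4 frequency features
 6: $A_w \leftarrow \text{AnalyzeHFST}(w, T)$
 7: if $A_w = \emptyset$ then
 8:      $\phi_{\text{morph}} \leftarrow \mathbf{0}_{19}$
 9: else
10:      $\phi_{\text{morph}} \leftarrow \text{SummarizeAnalyses}(A_w)$
11: end if
12: $\phi_{\text{tfidf}} \leftarrow \text{CharTfidfStats}(w, \mathcal{C})$ ▹ 5 document-frequency features
13: $x_{\text{hand}} \leftarrow [\phi_{\text{orth}}; \phi_{\text{pos}}; \phi_{\text{freq}}; \phi_{\text{morph}}; \phi_{\text{tfidf}}]$ ▹ 40 dimensions
14: // Separate 50-dimensional embeddings-only baseline
15: $S_w \leftarrow \text{Retrieve}(w, \mathcal{C}, k)$
16: if $S_w \neq \emptyset$ then
17:      $e_w \leftarrow \frac{1}{|S_w|} \sum_{s \in S_w} \text{XLM-R}_{[\text{TGT}]}(s, w)$
18: else
19:      $e_w \leftarrow \text{XLM-R}_{[\text{CLS}]}(w)$
20: end if
21: $x_{\text{emb}} \leftarrow \text{Pca50}(e_w)$
22: return $(x_{\text{hand}}, x_{\text{emb}})$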

Appendix B.2.2. POS Features

One-hot encoding over five coarse tags (Noun, Verb, Adj, Adv, Other) plus a binary is_content_word flag. We merge Num and Pron into Other.

Appendix B.2.3. Frequency Features

From the 17-million-token aggregated Leipzig corpus, we compute four features: $\log_{10}(f + 1)$, where $f$ is the raw corpus count of the exact citation form; the relative frequency $f / N_{\text{corpus}}$; a binary in_corpus indicator; and a separate lemma-level log-frequency obtained via the Apertium lemmatizer.
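The four frequency features can be sketched as below. The counts are invented, the corpus size is rounded to 17 million tokens, and two details are assumptions: that `in_corpus` is set when the exact-form count is positive, and that the lemma-level feature uses the same $\log_{10}(\cdot + 1)$ transform.

```python
import math
from typing import Dict

CORPUS_SIZE = 17_000_000  # approximate token count stated in the text


def frequency_features(word: str, exact_counts: Dict[str, int],
                       lemma_counts: Dict[str, int],
                       corpus_size: int = CORPUS_SIZE) -> Dict[str, float]:
    f_exact = exact_counts.get(word, 0)   # exact surface-form lookup
    f_lemma = lemma_counts.get(word, 0)   # expanded lemma-level lookup
    return {
        "log_freq": math.log10(f_exact + 1),
        "rel_freq": f_exact / corpus_size,
        "in_corpus": float(f_exact > 0),  # assumption: based on the exact-form count
        "lemma_log_freq": math.log10(f_lemma + 1),
    }


# Invented toy counts for illustration only.
toy_exact = {"мәдениет": 99}
toy_lemma = {"мәдениет": 999, "туыстық": 9}

print(frequency_features("мәдениет", toy_exact, toy_lemma))
print(frequency_features("туыстық", toy_exact, toy_lemma))
```

The second call shows the case the lemma-level lookup is meant to cover: a form with no exact match still receives a non-zero lemma-level frequency.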

Appendix B.2.4. Morphological Features

We analyze each citation-form lemma with the Apertium-kaz HFST finite-state morphological transducer [10], which covers approximately 85.6% of the lexicon entries used in the experiments (similar rates across all splits). The transducer returns all valid morphological analyses for a given surface form; from these we extract 19 features:
  • Analysis-level: a recognition flag, the number of analyses, a normalized ambiguity score, and a composite complexity score.
  • Morpheme counts: minimum, maximum, and mean number of morphemes across analyses.
  • Derivation: derivational_depth (number of derivational suffixes detected in the analysis).
  • Inflectional categories: Binary indicators for case, possession, plural, tense, and copula.
  • POS distribution: n_unique_pos (number of distinct POS tags across analyses) and a five-way distribution over Apertium POS categories.
For lemmas not recognized by the transducer, accounting for 14.4%, all morphological features default to zero. These features are extracted from the lemma listed in the lexicon, not from inflected surface forms.

Appendix B.2.5. TF-IDF Features

We compute character n-gram TF-IDF vectors over the lemma corpus and extract five document-frequency statistics: inverse document frequency (idf), document frequency ratio (df_ratio), log document frequency, TF-IDF score, and document spread.

Appendix B.2.6. Contextual Embeddings

For lemmas attested in the Leipzig corpus, we retrieve up to k = 10 sentences containing the target word, encode each sentence with XLM-RoBERTa-base, and extract the target-word representation by mean-pooling over its subword tokens marked with [TGT] delimiters. The per-sentence embeddings are averaged to obtain a single 768-dimensional vector per lemma. For unattested lemmas, we fall back to encoding the isolated lemma string. All 768-dimensional vectors are then reduced to 50 dimensions via PCA fitted on the training set. This PCA representation is the standalone embeddings-only classical baseline in Table 5; it uses the same XLM-RoBERTa-base encoder as the neural fusion model but in a frozen, pre-extracted mode rather than as a fine-tuned backbone.

Appendix C. Neural Training Details

All neural models use the shared optimization settings below. For the ablation results in Table 8, each neural configuration is trained with five random seeds (42, 123, 456, 789, 2024), and we report mean ± std over those runs. By contrast, the neural score reported in Table 6 for the final audited gated model is the saved fixed-split submission run. The significance appendix separately uses seed 42 only for bootstrap resampling reproducibility.
  • Optimizer: AdamW (weight decay 0.01, gradient clip 1.0).
  • Learning rates: Transformer backbone $2 \times 10^{-5}$, classification heads $5 \times 10^{-4}$; linear warmup over 10% of steps. Morph-only models use a single rate of $5 \times 10^{-4}$.
  • Loss: Focal loss [51] with $\gamma = 2.0$ and inverse-frequency class weighting.
  • Architecture: XLM-RoBERTa-base (top 2 layers unfrozen); CharCNN with kernels [2,3,4] × 128 filters; MorphMLP with 72 → 96 → 96 dims (features selected via ExtraTrees importance, min 20 morphological); the concatenated lexical representation is then projected from 480 → 384.
  • Regularization: Dropout 0.3; batch size 16; random seed as specified above for each run.
  • Hardware: Single NVIDIA L4 GPU (24 GB VRAM; NVIDIA Corporation, Santa Clara, CA, USA).
Focal loss down-weights well-classified examples and concentrates gradient signal on hard instances:
$$\mathcal{L}_{\text{focal}} = -\alpha_y (1 - p_y)^{\gamma} \log p_y,$$
where $p_y$ is the predicted probability for the true class $y$, $\alpha_y$ is the inverse-frequency class weight, and $\gamma = 2.0$ controls the focusing strength.
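The effect of the focusing term can be checked numerically with a per-example sketch. The weight normalization in `inverse_frequency_weights` (total count divided by the number of classes times the class count) is one common convention and is an assumption here, as are the function names and toy values.

```python
import math
from collections import Counter
from typing import List, Sequence


def inverse_frequency_weights(labels: Sequence[int], num_classes: int) -> List[float]:
    """Assumed convention: n / (num_classes * count_k); 0.0 for unseen classes."""
    counts = Counter(labels)
    n = len(labels)
    return [n / (num_classes * counts[k]) if counts[k] else 0.0 for k in range(num_classes)]


def focal_loss(probs: Sequence[float], y: int, alpha: Sequence[float],
               gamma: float = 2.0) -> float:
    """Per-example focal loss for predicted class probabilities and true class y."""
    p_y = probs[y]
    return -alpha[y] * (1.0 - p_y) ** gamma * math.log(p_y)


uniform = [1.0, 1.0]
easy = focal_loss([0.9, 0.1], 0, uniform)           # well-classified example
hard = focal_loss([0.1, 0.9], 0, uniform)           # badly misclassified example
ce_easy = focal_loss([0.9, 0.1], 0, uniform, 0.0)   # gamma = 0 recovers cross-entropy
print(easy, hard, ce_easy)
print(inverse_frequency_weights([0, 0, 0, 1], 2))
```

With $\gamma = 2$, an example predicted at $p_y = 0.9$ contributes only $(1 - 0.9)^2 = 1\%$ of its cross-entropy loss, while a hard example at $p_y = 0.1$ keeps 81% of it.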

Appendix D. Statistical Significance Tests

Table A1 reports all pairwise significance tests referenced in the paper. For each comparison we report (i) the paired bootstrap test [56] on Macro-F1 using $n = 10{,}000$ resamples and resampling seed 42, giving the observed difference $\Delta$F1, a two-sided p-value, and a 95% confidence interval; and (ii) McNemar’s test on per-instance correctness, giving $\chi^2$ and a p-value. For McNemar counts below 25, we use the exact binomial test; otherwise we apply the $\chi^2$ approximation with continuity correction.
Table A1. Pairwise significance tests. PBS = paired bootstrap on Macro-F1; McN = McNemar’s test. * p < 0.05; ** p < 0.01.
| Comparison (A vs. B) | PBS ΔF1 | PBS p | PBS 95% CI | McN χ² | McN p | n_AB / n_BA |
|---|---|---|---|---|---|---|
| CharCNN vs. Morph-only | −0.036 | 0.068 | [−0.073, 0.002] | 9.55 | 0.002 ** | 67/109 |
| Morph-only vs. Context-only | −0.049 | 0.027 * | [−0.092, −0.006] | 1.91 | 0.167 | 94/115 |
| LR vs. Gated fusion | −0.046 | 0.035 * | [−0.089, −0.003] | 3.41 | 0.065 | 93/121 |
| Context-only vs. Gated fusion | −0.023 | 0.210 | [−0.059, 0.013] | 2.01 | 0.157 | 63/81 |
| Concat vs. Gated fusion | −0.011 | 0.509 | [−0.043, 0.022] | 3.25 | 0.071 | 51/72 |
| Gated vs. Ensemble | −0.002 | 0.869 | [−0.031, 0.026] | 0.10 | 0.749 | 46/42 |
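As a worked check, the McNemar statistics in Table A1 can be recomputed from the discordant counts using the continuity-corrected formula $\chi^2 = (|n_{AB} - n_{BA}| - 1)^2 / (n_{AB} + n_{BA})$. The sketch below is a consistency check under that assumption, not the authors' code; the p-value uses the identity that the upper tail of a chi-square variable with one degree of freedom equals $\operatorname{erfc}(\sqrt{\chi^2 / 2})$.

```python
import math
from typing import Tuple


def mcnemar_corrected(n_ab: int, n_ba: int) -> Tuple[float, float]:
    """Continuity-corrected McNemar chi-square statistic and its p-value (1 df)."""
    chi2 = (abs(n_ab - n_ba) - 1) ** 2 / (n_ab + n_ba)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival function, 1 df
    return chi2, p_value


# Discordant counts n_AB / n_BA copied from Table A1.
table_a1 = {
    "CharCNN vs. Morph-only": (67, 109),
    "Morph-only vs. Context-only": (94, 115),
    "LR vs. Gated fusion": (93, 121),
    "Context-only vs. Gated fusion": (63, 81),
    "Concat vs. Gated fusion": (51, 72),
    "Gated vs. Ensemble": (46, 42),
}

results = {name: mcnemar_corrected(a, b) for name, (a, b) in table_a1.items()}
for name, (chi2, p_value) in results.items():
    print(f"{name}: chi2 = {chi2:.2f}, p = {p_value:.3f}")
```

For the first row, for instance, $(|67 - 109| - 1)^2 / 176 = 1681 / 176 \approx 9.55$, which agrees with the tabulated value.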

Appendix E. Cross-Lingual Projection Breakdown

Table A2 provides a detailed breakdown of the Russian cross-lingual alignment by difficulty band and part of speech, supplementing the aggregate metrics reported in Section 3.3.
Table A2. Russian cross-lingual alignment by difficulty band and POS.
| Stratum | n | r | Exact% | MAE | Mean Diff |
|---|---|---|---|---|---|
| Band | | | | | |
| A (A1–A2) | 330 | | 41.5% | 0.988 | −0.806 |
| B (B1–B2) | 302 | | 34.8% | 0.947 | +0.391 |
| C (C1) | 108 | | 12.0% | 1.602 | +1.528 |
| POS | | | | | |
| ADJ | 101 | 0.464 | | 0.941 | −0.010 |
| NOUN | 428 | 0.427 | | 1.063 | +0.035 |
| VERB | 118 | 0.351 | | 1.169 | −0.203 |
| ADV | 40 | 0.378 | | 0.850 | +0.100 |

References

  1. Council of Europe. Common European Framework of Reference for Languages: Learning, Teaching, Assessment; Cambridge University Press: Cambridge, UK, 2001.
  2. François, T.; Gala, N.; Watrin, P.; Fairon, C. FLELex: A Graded Lexical Resource for French Foreign Learners. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; pp. 3766–3773.
  3. François, T. The CEFRLex Project: Multilingual CEFR-Graded Lexical Resources. 2021. Available online: https://cental.uclouvain.be/cefrlex/ (accessed on 11 March 2026).
  4. Shardlow, M.; Evans, R.; Paetzold, G.H.; Zampieri, M. SemEval-2021 Task 1: Lexical Complexity Prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 1–16.
  5. Pan, C.; Song, B.; Wang, S.; Luo, Z. DeepBlueAI at SemEval-2021 Task 1: Lexical Complexity Prediction with A Deep Ensemble Approach. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, August 2021; Palmer, A., Schneider, N., Schluter, N., Emerson, G., Herbelot, A., Zhu, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 578–584.
  6. Mosquera, A. Alejandro Mosquera at SemEval-2021 Task 1: Exploring Sentence and Word Features for Lexical Complexity Prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, August 2021; Palmer, A., Schneider, N., Schluter, N., Emerson, G., Herbelot, A., Zhu, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 554–559.
  7. Shardlow, M.; North, K.; Zampieri, M. Multilingual Resources for Lexical Complexity Prediction: A Review. In Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024, Torino, Italia, May 2024; Nunzio, G.M.D., Vezzani, F., Ermakova, L., Azarbonyad, H., Kamps, J., Eds.; ELRA and ICCL: Paris, France, 2024; pp. 51–59.
  8. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 8440–8451.
  9. Center for East European and Russian/Eurasian Studies. Kazakh, n.d. Available online: https://ceeres.uchicago.edu/languages/kazakh (accessed on 11 March 2026).
  10. Washington, J.; Salimzyanov, I.; Tyers, F. Finite-state morphological transducers for three Kypchak languages. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, May 2014; Calzolari, N., Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., Piperidis, S., Eds.; European Language Resources Association (ELRA): Paris, France, 2014; pp. 3378–3385.
  11. Kurmasheva, L.; Kurmashev, I.; Kulikov, V.; Kulikova, V.; Tajigitov, A. The Use of Data Mining in the Management of the Career Guidance Work of the University. Ann. Data Sci. 2025, 12, 1923–1940.
  12. Kulikova, V.; Iklassova, K.; Kazanbayeva, A. Development of a decision making method to form the indicators for a university development plan. East.-Eur. J. Enterp. Technol. 2019, 3, 12–21.
  13. Kulikov, V.; Kulikova, V.; Yerkebulan, G. Google/Yandex Translation Detection in the Patterns Identifying System of Multilingual Texts. Int. J. Comput. 2021, 20, 72–77.
  14. Yerkebulan, G.; Kulikova, V.; Kulikov, V.; Kulsharipova, Z. Devising an entropy-based approach for identifying patterns in multilingual texts. East.-Eur. J. Enterp. Technol. 2021, 2, 16–22.
  15. Makhazhanova, U.; Kerimkhulle, S.; Mukhanova, A.; Bayegizova, A.; Aitkozha, Z.; Mukhiyadin, A.; Tassuov, B.; Saliyeva, A.; Taberkhan, R.; Azieva, G. The Evaluation of Creditworthiness of Trade and Enterprises of Service Using the Method Based on Fuzzy Logic. Appl. Sci. 2022, 12, 11515.
  16. Akanova, A.; Ospanova, N.; Sharipova, S.; Mauina, G.; Abdugulova, Z. Development of a thematic and neural network model for data learning. East.-Eur. J. Enterp. Technol. 2022, 4, 40–50.
  17. Bani Yaseen, T.; Ismail, Q.; Al-Omari, S.; Al-Sobh, E.; Abdullah, M. JUST-BLUE at SemEval-2021 Task 1: Predicting Lexical Complexity using BERT and RoBERTa Pre-Trained Language Models. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 661–666.
  18. Zaharia, G.E.; Cercel, D.C.; Dascalu, M. UPB at SemEval-2021 Task 1: Combining Deep Learning and Hand-Crafted Features for Lexical Complexity Prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, August 2021; Palmer, A., Schneider, N., Schluter, N., Emerson, G., Herbelot, A., Zhu, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 609–616.
  19. Ortiz-Zambrano, J.A.; Espín-Riofrío, C.H.; Montejo-Ráez, A. Deep Encodings vs. Linguistic Features in Lexical Complexity Prediction. Neural Comput. Appl. 2025, 37, 1171–1187.
  20. Kyle, K.; Crossley, S.A. Automatically Assessing Lexical Sophistication: Indices, Tools, Findings, and Application. TESOL Q. 2015, 49, 757–786.
  21. Aleksandrova, D.; Pouliot, V. CEFR-based Contextual Lexical Complexity Classifier in English and French. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), Toronto, Canada, July 2023; Kochmar, E., Burstein, J., Horbach, A., Laarmann-Quante, R., Madnani, N., Tack, A., Yaneva, V., Yuan, Z., Zesch, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 518–527.
  22. Zhang, X.; Lu, X. Aligning linguistic complexity with the difficulty of English texts for L2 learners based on CEFR levels. Stud. Second Lang. Acquis. 2025, 47, 1407–1434.
  23. Qiu, L.; Guo, S.; Wong, T.S.; Chersoni, E.; Lee, J.; Huang, C.R. CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin and Cantonese. In Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), Miami, Florida, USA, November 2024; Shardlow, M., Saggion, H., Alva-Manchego, F., Zampieri, M., North, K., Štajner, S., Stodden, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 20–26.
  24. Ayman, N.; Hossain, M.A.; Aziz, A.; Faruqui, R.U.; Chy, A.N. BengaliLCP: A Dataset for Lexical Complexity Prediction in the Bengali Texts. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, May 2024; Calzolari, N., Kan, M.Y., Hoste, V., Lenci, A., Sakti, S., Xue, N., Eds.; European Language Resources Association (ELRA) and ICCL: Paris, France, 2024; pp. 2227–2237.
  25. Leewis, S.; Smit, K.; van de Hoef, A.; Hartman, F.; Kuiper, N.; Todorova, J.; Gerritsen, T. Enhancing Digital Accessibility through Language-Level Classification and Adaptation: Exploring the Role of Large Language Models for Inclusive Language. In Proceedings of the 2025 9th International Conference on Software and E-Business, New York, NY, USA, 7–9 November 2025; ICSeB ’25, pp. 17–23.
  26. CENTAL. CEFRLex Online Lexical Difficulty Analyser. 2021. Available online: https://cental.uclouvain.be/cefrlex/analyse/ (accessed on 11 March 2026).
  27. Graën, J.; Alfter, D.; Schneider, G. Using Multilingual Resources to Evaluate CEFRLex for Learner Applications. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 346–355.
  28. Oxford University Press. The Oxford 3000 by CEFR Level. 2020. Available online: https://www.oxfordlearnersdictionaries.com/wordlists/oxford3000-5000 (accessed on 11 March 2026).
  29. Kilgarriff, A.; Charalabopoulou, F.; Gavrilidou, M.; Bondi Johannessen, J.; Khalil, S.; Johansson Kokkinakis, S.; Lew, R.; Sharoff, S.; Vadlapudi, R.; Volodina, E. Corpus-Based Vocabulary Lists for Language Learners for Nine Languages. Lang. Resour. Eval. 2014, 48, 121–163.
  30. Vannest, J.; Newport, E.; Newman, A.; Bavelier, D. Interplay Between Morphology and Frequency in Lexical Access: The Case of the Base Frequency Effect. Brain Res. 2011, 1373, 144–159.
  31. Cotterell, R.; Kirov, C.; Hulden, M.; Eisner, J. On the Complexity and Typology of Inflectional Morphological Systems. Trans. Assoc. Comput. Linguist. 2019, 7, 327–342.
  32. Schreuder, R.; Baayen, R.H. Modeling Morphological Processing. In Morphological Aspects of Language Processing; Feldman, L.B., Ed.; Lawrence Erlbaum: Hillsdale, NJ, USA, 1995; pp. 131–154.
  33. Kessikbayeva, G.; Cicekli, I. Rule Based Morphological Analyzer of Kazakh Language. In Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM, Baltimore, Maryland, June 2014; Çetinoğlu, Ö., Heinz, J., Maletti, A., Riggle, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 46–54.
  34. Kessikbayeva, G.; Cicekli, I. A Rule Based Morphological Analyzer and a Morphological Disambiguator for Kazakh Language. Linguist. Lit. Stud. 2016, 4, 96–104.
  35. Yessenbayev, Z.; Kozhirbayev, Z.; Makazhanov, A. KazNLP: A Pipeline for Automated Processing of Texts Written in Kazakh Language. In Proceedings of the Speech and Computer: 22nd International Conference, SPECOM 2020, St. Petersburg, Russia, 7–9 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 657–666.
  36. Akanova, A.; Ospanova, N.; Kukharenko, Y.; Abildinova, G. Development of the algorithm of keyword search in the Kazakh language text corpus. East.-Eur. J. Enterp. Technol. 2019, 5, 26–32.
  37. Akanova, A.; Ismailova, A.; Oralbekova, Z.; Kenzhebayeva, Z.; Anarbekova, G. Neurocomputer System of Semantic Analysis of the Text in the Kazakh Language. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2024, 23, 1–15.
  38. Akhmed-Zaki, D.; Mansurova, M.; Madiyeva, G.; Kadyrbek, N.; Kyrgyzbayeva, M. Development of the Information System for the Kazakh Language Preprocessing. Cogent Eng. 2021, 8, 1896418.
  39. Parhat, S.; Ablimit, M.; Hamdulla, A. A Robust Morpheme Sequence and Convolutional Neural Network-Based Uyghur and Kazakh Short Text Classification. Information 2019, 10, 387.
  40. Yeshpanov, R.; Polonskaya, A.; Varol, H.A. KazParC: Kazakh Parallel Corpus for Machine Translation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 9633–9644.
  41. Toral, A.; Edman, L.; Yeshmagambetova, G.; Spenader, J. Neural Machine Translation for English–Kazakh with Morphological Segmentation and Synthetic Data. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), Florence, Italy, August 2019; Bojar, O., Chatterjee, R., Federmann, C., Fishel, M., Graham, Y., Haddow, B., Huck, M., Yepes, A.J., Koehn, P., Martins, A., et al., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 386–392.
  42. Villegas-Ch, W.; Gutierrez, R.; Maldonado Navarro, A.; Mera-Navarrete, A. Evaluating Neural Network Models for Word Segmentation in Agglutinative Languages: Comparison with Rule-Based Approaches and Statistical Models. IEEE Access 2024, 12, 157556–157573.
  43. Oflazer, K. Two-level Description of Turkish Morphology. Lit. Linguist. Comput. 1994, 9, 137–148.
  44. Arican, B.; Kuzgun, A.; Marşan, B.; Aslan, D.B.; Saniyar, E.; Cesur, N.; Kara, N.; Kuyrukcu, O.; Ozcelik, M.; Yenice, A.B.; et al. Morpholex Turkish: A Morphological Lexicon for Turkish. In Proceedings of the Globalex Workshop on Linked Lexicography Within the 13th Language Resources and Evaluation Conference, Marseille, France, June 2022; Kernerman, I., Krek, S., Eds.; European Language Resources Association: Paris, France, 2022; pp. 68–74.
  45. Acı, M.; Vuran Sarı, N.; İnan Acı, Ç. Morphological and Structural Complexity Analysis of Low-resource English–Turkish Language Pair Using Neural Machine Translation Models. PeerJ Comput. Sci. 2025, 11, e3072.
  46. Toraman, C.; Yilmaz, E.H.; Şahinuç, F.; Ozcelik, O. Impact of Tokenization on Language Models: An Analysis for Turkish. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 1–21.
  47. Goldhahn, D.; Eckart, T.; Quasthoff, U. Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, May 2012; Calzolari, N., Choukri, K., Declerck, T., Doğan, M.U., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., Piperidis, S., Eds.; European Language Resources Association (ELRA): Paris, France, 2012; pp. 759–765.
  48. Balabekov, A.Q.; Dauletkereyeva, N.Z.; Demessinova, L.M.; Iskakova, Z.M.; Kaliyeva, A.M.; Mussayeva, G.A.; Suyinzhanova, Z.K. Qazaq Tilinin Leksika-Grammatikalyq Minimumy. A1–C1 Dengeiler; Five-Volume Kazakh Lexical Minimum Handbook Series for CEFR Levels A1, A2, B1, B2, and C1; Sh. Shayakhmetov “Til-Qazyna” National Scientific and Practical Center: Astana, Kazakhstan, 2024.
  49. Dürlich, L.; François, T. EFLLex: A Graded Lexical Resource for Learners of English as a Foreign Language. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018.
  50. Tiedemann, J.; Thottingal, S. OPUS-MT – Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, Lisboa, Portugal, November 2020; Martins, A., Moniz, H., Fumega, S., Martins, B., Batista, F., Coheur, L., Parra, C., Trancoso, I., Turchi, M., Bisazza, A., et al., Eds.; European Association for Machine Translation: Geneva, Switzerland, 2020; pp. 479–480.
  51. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
  52. Qwen Team. Qwen2.5 Technical Report. arXiv 2024, arXiv:2412.15115.
  53. Koto, F.; Joshi, R.; Mukhituly, N.; Wang, Y.; Xie, Z.; Pal, R.; Orel, D.; Mullah, P.; Turmakhan, D.; Goloburda, M.; et al. Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting. arXiv 2025, arXiv:2503.01493.
  54. Institute of Smart Systems and Artificial Intelligence. Kazakh Large Language Model ISSAI KAZ-LLM, 2024. Model Page for ISSAI KAZ-LLM. Available online: https://issai.nu.edu.kz/kazllm/ (accessed on 20 May 2026).
  55. Cao, W.; Mirjalili, V.; Raschka, S. Rank consistent ordinal regression for neural networks with application to age estimation. Pattern Recognit. Lett. 2020, 140, 325–331.
  56. Koehn, P. Statistical Significance Tests for Machine Translation Evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2004; pp. 388–395.
Figure 1. Compact overview of the audited dual-encoder architecture used in the main neural experiments. The context branch encodes the marked sentence with XLM-RoBERTa-base, while the lexical branch combines CharCNN and MorphMLP representations before gated fusion and CEFR classification.
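As a rough illustration of the gated fusion step named in the Figure 1 caption, the sketch below blends a context vector and a lexical vector with an element-wise sigmoid gate. The gate formulation, the function name `gated_fusion`, the assumption that both branches are already projected to the same dimension, and the toy weights are all hypothetical; the caption does not specify them, and the actual model may differ.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_ctx, h_lex, weights, bias):
    """Blend two equal-length vectors with an element-wise gate.

    h_ctx   : context-branch vector (assumed: pooled XLM-R output, projected)
    h_lex   : lexical-branch vector (assumed: CharCNN + MorphMLP, projected)
    weights : one row per output dimension, each of length 2 * len(h_ctx)
    bias    : one bias per output dimension
    """
    joint = h_ctx + h_lex  # list concatenation [h_ctx; h_lex]
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, joint)) + b)
        for row, b in zip(weights, bias)
    ]
    # A gate value near 1 favours the context branch, near 0 the lexical branch.
    return [g * c + (1.0 - g) * l for g, c, l in zip(gate, h_ctx, h_lex)]


# Toy example: zero weights give a gate of 0.5, so the output is the plain average.
h_ctx = [1.0, 2.0]
h_lex = [3.0, 6.0]
zero_weights = [[0.0] * 4 for _ in range(2)]
print(gated_fusion(h_ctx, h_lex, zero_weights, [0.0, 0.0]))  # [2.0, 4.0]
```

In a trained model the gate weights would be learned jointly with the classifier, which lets the network decide per dimension how much to rely on context versus lexical form.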
Table 1. Kazakh CEFR lexicon summary.
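| Level | Noun | Verb | Adjective | Adverb | Numeral | Pronoun | Other | Total |
|---|---|---|---|---|---|---|---|---|
| A1 | 373 | 135 | 79 | 26 | 30 | 15 | 304 | 962 |
| A2 | 310 | 103 | 77 | 36 | 26 | 19 | 126 | 697 |
| B1 | 402 | 180 | 112 | 56 | 25 | 20 | 95 | 890 |
| B2 | 362 | 166 | 77 | 46 | 7 | 10 | 223 | 891 |
| C1 | 537 | 227 | 150 | 29 | 3 | 12 | 163 | 1121 |
| Total | 1984 | 811 | 495 | 193 | 91 | 76 | 911 | 4561 |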
Table 2. Example lexicon entries by CEFR level. Glosses are approximate English translations.
| Level | Lemma | POS | Gloss |
|---|---|---|---|
| A1 | су | Noun | water |
| A1 | бару | Verb | to go |
| A2 | ауа | Noun | air, weather |
| A2 | тiлек | Noun | wish |
| B1 | байланыс | Noun | connection |
| B1 | қоғамдық | Adjective | public, social |
| B2 | тәуелсiздiк | Noun | independence |
| B2 | қалыптастыру | Verb | to form, develop |
| C1 | жаһандану | Noun | globalization |
| C1 | ықпалдастық | Noun | integration |
Table 3. Cross-lingual CEFR projection resources.
| Source Lexicon | Entries | MT Pipeline | Kaz Cov. |
|---|---|---|---|
| KELLY (ru) [29] | 8947 | Tilmash [40] | 17.0% |
| EFLLex (en) [49] | 13,871 | OPUS-MT [50] | 39.8% |
| | | Tilmash→OPUS (pivot) | 19.9% |
Note: Only single-token translations matching source lexicons yield silver labels.
Table 4. Number of entries per CEFR level in each data split.
| Level | Train | Dev | Test | Total |
|---|---|---|---|---|
| A1 | 663 | 142 | 142 | 947 |
| A2 | 478 | 102 | 103 | 683 |
| B1 | 606 | 130 | 130 | 866 |
| B2 | 607 | 130 | 130 | 867 |
| C1 | 751 | 162 | 161 | 1074 |
| Total | 3105 | 666 | 666 | 4437 |
Note: Stratified sampling preserves the global class distribution across training, development, and test partitions.
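The totals in Table 4 correspond to roughly a 70/15/15 split (3105, 666, and 666 of 4437 entries). The sketch below shows one way a per-level stratified split of this kind could be produced. The function name `stratified_split`, the rounding rule, the seed, and the toy entries are assumptions for illustration only; the source does not describe the exact procedure, so this sketch is not expected to reproduce the published counts.

```python
import random
from collections import Counter, defaultdict


def stratified_split(entries, dev_frac=0.15, test_frac=0.15, seed=13):
    """Split (lemma, level) pairs so each level keeps a similar share in every part."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for entry in entries:
        by_level[entry[1]].append(entry)

    train, dev, test = [], [], []
    for level in sorted(by_level):
        group = by_level[level]
        rng.shuffle(group)  # shuffle within the level only
        n_dev = round(len(group) * dev_frac)
        n_test = round(len(group) * test_frac)
        dev.extend(group[:n_dev])
        test.extend(group[n_dev:n_dev + n_test])
        train.extend(group[n_dev + n_test:])
    return train, dev, test


# Toy data: placeholder lemma ids, not real lexicon entries.
toy_counts = {"A1": 100, "A2": 60, "B1": 40}
toy_entries = [
    (f"{level}_lemma_{i}", level)
    for level, n in toy_counts.items()
    for i in range(n)
]
train, dev, test = stratified_split(toy_entries)
print(len(train), len(dev), len(test))
print(Counter(level for _, level in dev))
```

Splitting inside each level, rather than over the pooled list, is what keeps the class distribution close to the global one in all three partitions.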
Table 5. Feature groups, dimensions, and ablation Macro-F1 on dev set. Morphological features contribute most to performance.
| Group | Key Intuition | Dims | Alone (Macro-F1) | Source |
|---|---|---|---|---|
| Orthographic | Length, vowel ratio, rare chars | 6 | 0.265 | Lemma |
| POS | Content word flags | 6 | 0.178 | Lexicon |
| Frequency | Log-freq, corpus coverage | 4 | 0.312 | Leipzig |
| Morphological | Analysis ambiguity, deriv depth | 19 | 0.335 | Apertium |
| TF-IDF | Orthographic distinctiveness | 5 | 0.221 | n-grams |
| Full handcrafted | Combined feature bank | 40 | 0.345 | |
| Embeddings-only | XLM-R PCA(50) | 50 | 0.305 | XLM-R |
Table 6. Held-out test results for the final model configurations on the fixed split.
| Model | Setup | Acc | Macro-F1 | MAE |
|---|---|---|---|---|
| Frequency-only LR | Surface/lemma frequency cues only | 0.312 | 0.173 | 1.626 |
| Embeddings-only LR | XLM-R PCA(50) | 0.309 | 0.298 | 1.411 |
| LR | 40 handcrafted features | 0.345 | 0.314 | 1.261 |
| RF | 40 handcrafted features | 0.321 | 0.300 | 1.263 |
| GB | 40 handcrafted features | 0.309 | 0.287 | 1.279 |
| Gated fusion | XLM-R + CharCNN + MorphMLP | 0.387 | 0.360 | 1.125 |
| Ensemble | ElasticNet stack of GB, RF, ET, Gated (probs) | 0.381 | 0.363 | 1.105 |
Table 7. Direct-prompt LLM baselines on the public held-out split. Prompt language indicates the language of the instruction template; all prompts include the Kazakh lemma, POS tag, and closed CEFR label set.
| Model | Prompt | Acc | Macro-F1 | MAE |
|---|---|---|---|---|
| Qwen2.5-0.5B-Instruct | English | 0.228 | 0.159 | 1.304 |
| Qwen2.5-0.5B-Instruct | Kazakh | 0.206 | 0.109 | 1.325 |
| Qwen2.5-7B-Instruct | English | 0.241 | 0.124 | 1.757 |
| Qwen2.5-7B-Instruct | Kazakh | 0.225 | 0.110 | 1.925 |
| Sherkala-Chat 8B | English | 0.212 | 0.191 | 1.297 |
| Sherkala-Chat 8B | Kazakh | 0.203 | 0.164 | 1.122 |
| ISSAI KAZ-LLM 8B | English | 0.259 | 0.214 | 1.271 |
| ISSAI KAZ-LLM 8B | Kazakh | 0.088 | 0.116 | 1.292 |
Table 8. Neural architecture ablation (mean ± std over 5 training seeds).
| Model | Macro-F1 | MAE |
|---|---|---|
| CharCNN-only | 0.239 ± 0.009 | 1.456 ± 0.036 |
| Morph-only (CharCNN+MLP) | 0.303 ± 0.013 | 1.240 ± 0.044 |
| XLM-R context-only | 0.335 ± 0.008 | 1.184 ± 0.029 |
| Concat fusion | 0.346 ± 0.007 | 1.144 ± 0.020 |
| Gated fusion | 0.346 ± 0.008 | 1.149 ± 0.013 |
Table 9. Per-CEFR-level F1 scores for selected models.
| Model | A1 | A2 | B1 | B2 | C1 |
|---|---|---|---|---|---|
| Full-feature LR | 0.434 | 0.187 | 0.198 | 0.278 | 0.474 |
| Gated fusion (XLM-R) | 0.519 | 0.194 | 0.267 | 0.336 | 0.485 |
| Ensemble (stack) | 0.567 | 0.239 | 0.256 | 0.286 | 0.465 |
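Macro-F1 is the unweighted mean of the per-class F1 scores, so the per-level values in Table 9 can be checked against the Macro-F1 column of Table 6. The short sketch below does that arithmetic. It assumes that "Full-feature LR" in Table 9 is the same configuration as the LR row with 40 handcrafted features in Table 6; the helper name `macro_f1` is illustrative.

```python
# Per-level F1 scores copied from Table 9, in the order A1, A2, B1, B2, C1.
per_level_f1 = {
    "Full-feature LR": [0.434, 0.187, 0.198, 0.278, 0.474],
    "Gated fusion (XLM-R)": [0.519, 0.194, 0.267, 0.336, 0.485],
    "Ensemble (stack)": [0.567, 0.239, 0.256, 0.286, 0.465],
}


def macro_f1(scores):
    """Unweighted mean of per-class F1 scores."""
    return sum(scores) / len(scores)


for model, scores in per_level_f1.items():
    print(f"{model}: {macro_f1(scores):.3f}")
# Full-feature LR: 0.314
# Gated fusion (XLM-R): 0.360
# Ensemble (stack): 0.363
```

The three means match the Macro-F1 values reported in Table 6 (0.314, 0.360, 0.363), which suggests the two tables are internally consistent.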
Table 10. Error distance distribution.
| Model | Exact | ±1 | ≥2 | W-1 |
|---|---|---|---|---|
| Full-feature LR | 0.345 | 0.293 | 0.362 | 0.638 |
| Gated fusion (XLM-R) | 0.387 | 0.291 | 0.321 | 0.679 |
| Ensemble (stack) | 0.381 | 0.311 | 0.308 | 0.692 |
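The columns of Table 10 can be read as shares of predictions at a given ordinal distance from the gold CEFR level. The sketch below computes such a breakdown under the assumption that W-1 denotes within-one-level accuracy (exact plus off-by-one), which is consistent with the row sums in the table. The function name `error_distance_report` and the toy labels are hypothetical.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1"]
LEVEL_INDEX = {level: i for i, level in enumerate(LEVELS)}


def error_distance_report(gold, pred):
    """Summarise ordinal errors between gold and predicted CEFR levels."""
    distances = [abs(LEVEL_INDEX[g] - LEVEL_INDEX[p]) for g, p in zip(gold, pred)]
    n = len(distances)
    return {
        "exact": sum(d == 0 for d in distances) / n,
        "off_by_1": sum(d == 1 for d in distances) / n,
        "ge_2": sum(d >= 2 for d in distances) / n,
        "within_1": sum(d <= 1 for d in distances) / n,  # assumed meaning of W-1
        "mae": sum(distances) / n,  # mean absolute error in CEFR steps
    }


# Toy labels for illustration only; these are not model outputs from the paper.
gold = ["A1", "A2", "B1", "B2", "C1"]
pred = ["A1", "B1", "B1", "A1", "C1"]
print(error_distance_report(gold, pred))
```

Reporting the distance distribution alongside accuracy distinguishes models that miss by one level from models that miss by several, which matters for an ordinal label set such as CEFR.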
Table 11. Cross-lingual CEFR alignment between Kazakh and source languages (RQ3).
| Source | n | Cov.% | r | MAE | Exact% | W-1% |
|---|---|---|---|---|---|---|
| Russian | 740 | 17.0% | 0.412 | 1.061 | 34.5% | 72.3% |
| English | 1731 | 39.8% | 0.073 | 1.579 | 21.8% | 51.6% |
| En (pivot) | 866 | 19.9% | 0.063 | 1.598 | 21.0% | 50.6% |

Yerkebulan, G.; Akanova, A.; Galymzhan, Z.; Ospanova, N. A CEFR-Graded Lexicon and Morphology-Aware Benchmarks for Kazakh Lexical Complexity Prediction. Technologies 2026, 14, 346. https://doi.org/10.3390/technologies14060346
