A CEFR-Graded Lexicon and Morphology-Aware Benchmarks for Kazakh Lexical Complexity Prediction

Yerkebulan, Gulnur; Akanova, Akerke; Galymzhan, Zhantore; Ospanova, Nazira

doi:10.3390/technologies14060346

¹ Department of Physics and Informatics, Taraz University Named After M.Kh. Dulaty, Taraz 080000, Kazakhstan

² Department of Computer Science, Saken Seifullin Kazakh Agrotechnical Research University, Astana 010011, Kazakhstan

³ School of Engineering and Digital Sciences, Nazarbayev University, Astana 010000, Kazakhstan

⁴ Department of Information Technologies, Toraighyrov University, Pavlodar 140008, Kazakhstan

* Authors to whom correspondence should be addressed.

Technologies 2026, 14(6), 346; https://doi.org/10.3390/technologies14060346 (registering DOI)

Submission received: 29 April 2026 / Revised: 25 May 2026 / Accepted: 27 May 2026 / Published: 9 June 2026

Abstract

Graded lexical resources aligned with the Common European Framework of Reference for Languages (CEFR) and lexical complexity prediction remain limited for low-resource Turkic languages, and the extent to which existing predictive models generalize to agglutinative morphology is unresolved. We introduce the first CEFR-graded lexicon for Kazakh, containing 4561 lemma–part-of-speech (POS) entries across A1–C1, and use it to test whether explicit morphology improves lexical complexity prediction. We compare handcrafted morphological features, XLM-RoBERTa contextual embeddings, and fusion models that combine both signal types on held-out CEFR classification. Our best model, a gated fusion of contextual embeddings with morphological features, achieves a macro-averaged F1 score of 0.360 and a mean absolute error of 1.125 on the held-out test set. Morphology provides useful information beyond character-level cues, contextual representations are strong on their own, and combining them yields the best supervised performance for this task. The paper therefore contributes a new CEFR resource for Turkic languages and evidence that morphology-aware modeling is useful for Kazakh lexical difficulty prediction. The results support Sustainable Development Goal 4 (Quality Education) by enabling objective assessment of learning-material complexity and adaptive Kazakh language learning. The derived lexicon and code are publicly available.

Keywords: CEFR; lexical complexity prediction; CEFR-graded lexicon; morphology-aware modeling; agglutinative languages; low-resource languages; Kazakh NLP; XLM-RoBERTa; gated fusion

1. Introduction

The Common European Framework of Reference for Languages, or CEFR, provides a six-level proficiency scale, A1 to C2, widely used in language teaching and assessment [1]. CEFR-graded lexical resources are central to curriculum design, adaptive learning systems, and text simplification. The development of the first CEFR-graded lexicon for the Kazakh language directly supports the implementation of Sustainable Development Goal 4 (Quality Education). Creating such resources enhances the accessibility of high-quality linguistic tools and optimizes educational processes, which is particularly vital for low-resource languages. For major European languages, comprehensive lexicons such as CEFRLex cover around 13,000 lemma–POS entries each [2,3], and recent computational work on lexical complexity prediction (LCP) has shown strong progress with both feature-based and transformer-based methods [4,5,6].

However, this progress has largely excluded agglutinative, low-resource languages. No CEFR-graded lexicon exists for any Turkic language [7], and existing LCP models have been evaluated almost exclusively on morphologically simple languages. In agglutinative languages like Kazakh, Turkish, or Finnish, the relationship between lexical and morphological complexity is non-trivial. A single lemma such as бала, meaning child, can yield hundreds of suffixed forms, for example балаларымыздан meaning from our children, whose morphological structure affects learnability and difficulty.

Standard multilingual encoders such as XLM-RoBERTa [8], effective cross-lingually, rely on subword tokenization that often fragments these complex wordforms, obscuring morphology relevant to lexical difficulty. Whether morphology-aware representations can improve CEFR-level prediction in such languages remains an open question.

To address this gap, we focus on Kazakh, a Central Turkic language spoken by about 12 million people [9] and one with growing demand for CEFR-aligned pedagogical resources. We construct the first CEFR-graded lexicon for Kazakh. The handbook extraction yields 4561 lemma–POS entries across five levels, and the cleaned modeling set used in our experiments contains 4437 unique lemma–POS entries. Our supervised experiments contrast interpretable handcrafted features derived from Helsinki Finite-State Technology (HFST)-Apertium analysis [10], frequency, and orthography with contextual encoders and morphology–context fusion models. We also examine cross-lingual CEFR projection from Russian and English as a diagnostic signal. This study contributes to the expanding application of data mining within professional educational environments, where machine learning is used to optimize pedagogical processes and manage institutional tasks such as career guidance [11]. Such assessment tools are also vital for establishing strategic university development indicators through formal decision-making methods [12].

Furthermore, the current work builds on established methodologies for multilingual text analysis, specifically the detection of patterns in machine-translated texts [13] and the use of entropy-based measures for identifying structural regularities across diverse languages [14]. These previous findings provide a technical foundation for our current focus on Kazakh morphological complexity.

The challenge of classifying objects by complexity or risk levels often requires accounting for multiple intersecting features. While fuzzy logic-based approaches are successfully applied in the financial sector to evaluate qualitative characteristics, such as the creditworthiness of service enterprises [15], the field of natural language processing (NLP) increasingly utilizes neural architectures for similar purposes to capture hidden dependencies in data. This approach is further supported by research into the development of hybrid thematic and neural network models for data learning [16], which demonstrates the effectiveness of combining structural and probabilistic signals in low-resource language processing.

We address three explicit research questions:

RQ1: How strong are interpretable handcrafted features for Kazakh CEFR prediction relative to an embeddings-only baseline?
RQ2: Does gated fusion of contextual and lexical representations improve over the classical baselines, and which architectural components contribute to the gain?
RQ3: How far can cross-lingual projection transfer CEFR information from Russian and English lexical resources to Kazakh?

The primary contribution of this work is a new resource together with baseline experiments that establish initial benchmarks; we do not claim state-of-the-art modeling advances. Specifically:

Resource. We construct the first CEFR-graded lexicon for any Turkic language. The handbook extraction yields 4561 lemma–POS entries across five levels (A1–C1), and the cleaned modeling set contains 4437 unique lemma–POS entries. This lexicon underpins the present experiments and can support future work on lexical complexity prediction, curriculum design, and adaptive language learning for Kazakh and potentially other Turkic languages.
Baseline study. We establish initial benchmarks by comparing handcrafted morphological features, contextual embeddings, and fusion models, providing a reference point for future systems on this dataset.
Empirical evidence. We show that engineered morphological features improve over character-level patterns alone, and that combining morphological and contextual representations yields the strongest supervised results in this setting. The gain comes from combining the two signal types rather than from a clearly superior fusion mechanism.

The derived lexicon and code are publicly available. (Archived repository: https://doi.org/10.5281/zenodo.19365834).

The remainder of the paper is organized as follows. Section 2 reviews related work, describes the datasets, and defines the modeling and evaluation procedure. Section 3 reports the supervised, diagnostic, cross-lingual, and LLM baseline results. Section 4 discusses the empirical implications and limitations. Section 5 concludes the paper.

2. Materials and Methods

This section first reviews related work that motivates our modeling choices (Section 2.1), then describes the language resources we compile (Section 2.2) and the supervised modeling pipeline and evaluation protocol applied to them (Section 2.3).

All computational experiments were implemented in Python 3.10.0 (Python Software Foundation, Wilmington, DE, USA). Classical models used scikit-learn 1.7.2, and neural models used PyTorch 2.9.1 (Meta Platforms, Inc., Menlo Park, CA, USA) and Hugging Face Transformers 4.46.1 (Hugging Face, Inc., Brooklyn, NY, USA). Neural experiments ran on a single NVIDIA L4 GPU (NVIDIA Corporation, Santa Clara, CA, USA).

2.1. Background

We organize prior work into four strands: lexical complexity prediction, CEFR-graded lexical resources, the role of morphology and frequency in lexical difficulty, and NLP for Kazakh and other agglutinative languages.

2.1.1. CEFR and Lexical Complexity Prediction

Automatic prediction of lexical complexity has received increasing attention, particularly since the SemEval-2021 Lexical Complexity Prediction shared task, which introduced the CompLex English corpus annotated on a 5-point Likert scale for both single words and multiword expressions [4]. The best-performing systems combined multiple pre-trained language models with pseudo-labeling and data augmentation, demonstrating that deep ensembles can achieve strong performance even with limited task-specific data [5]. Feature-based approaches have also proven competitive: Mosquera achieved a top-three ranking using a combination of lexical, contextual, and semantic features with traditional regression models [6], while JUST-BLUE and the University Politehnica of Bucharest (UPB) SemEval system showed that combinations of deep learning and hand-crafted features, including frequency, length, n-gram counts, and psycholinguistic norms, can remain highly competitive when carefully engineered [17,18].

Beyond the shared task, several studies have compared deep encodings with linguistic features for lexical complexity prediction. For example, Ortiz-Zambrano et al. [19] report that hybrid models combining handcrafted features with BERT/XLM-R embeddings yield substantial improvements over features alone, but also find that purely neural models are not uniformly superior and that handcrafted features remain fundamental, especially in low-resource settings. This resonates with findings in educational NLP that simple features such as syllable count, frequency bands, and orthographic length capture much of the variance in lexical difficulty [20].

Recent work has also extended lexical complexity prediction to contextual CEFR classification for language learning applications. Aleksandrova and Pouliot [21] develop a CEFR-based contextual classifier for English and French, using transformer-based representations to disambiguate polysemous lexical items in sentential context. Their model is deployed in the Mauril language learning application, demonstrating practical utility in adaptive vocabulary selection. At the text level, Zhang and Lu [22] present a Random Forest model that aligns linguistic complexity measures with CEFR-based difficulty of English texts, achieving 82.6% accuracy in classifying texts into coarse A/B/C proficiency bands and 62.6% at the six-level CEFR granularity, highlighting the predictive value of linguistically grounded features.

Multilingual extensions have broadened coverage beyond English: CompLex-ZH introduces lexical complexity datasets for Mandarin and Cantonese [23], while BengaliLCP focuses on Bengali lexical complexity [24]. Furthermore, recent research has explored the application of Large Language Models (LLMs) to enhance digital accessibility through automated language-level classification and adaptation [25], demonstrating the potential for creating more inclusive digital environments. These works show that lexical complexity prediction is feasible across typologically diverse languages. However, CEFR-based lexical difficulty for Turkic or other agglutinative low-resource languages remains an open question, as existing approaches have been evaluated almost exclusively on morphologically simpler languages.

2.1.2. CEFR Lexical Resources

The CEFRLex project represents the most comprehensive effort to create CEFR-graded lexical resources for multiple languages [2]. It provides receptive lexicons, derived from textbooks and simplified readers, and productive lexicons, derived from learner corpora, for English (EFLLex), French (FLELex), Swedish (SVALex, SweLLex), Dutch (NT2Lex), Spanish (ELELex), and German (DAFlex), each containing on the order of 13,000 lemma–POS entries with normalized frequency distributions across the six CEFR levels [2,3]. These resources underpin online lexical difficulty analyzers that can estimate text-level CEFR difficulty by aggregating word-level information [26].

The validity of these lexicons has been examined using independent gold standards and multilingual comparisons. Gräen et al. [27] compare EFLLex against external pedagogical resources such as the English Vocabulary Profile (EVP) and the Global Scale of English (GSE), finding that the English CEFRLex resource is broadly in accordance with them. They also exploit multilingual resources and translation probabilities to examine consistency across English, French, and Swedish, providing evidence that CEFR-based lexical difficulty can be aligned across languages via translation [27]. This supports the broader hypothesis that lexical difficulty information can transfer across related pedagogical resources, although evidence outside major European languages remains limited.

In addition to CEFRLex, public vocabulary lists such as the Oxford 3000 by CEFR level provide graded vocabulary for English [28]. Other projects such as KELLY provide frequency-based CEFR assignments for several European languages [29]. However, none of these lexical resources cover Kazakh or other Turkic languages. The absence of CEFR-graded lexicons for agglutinative, low-resource languages limits both empirical research and the development of adaptive learning tools. Our work addresses this gap by constructing the first CEFR-graded lexicon for Kazakh and by evaluating whether methods successful for European languages generalize to a typologically different setting.

2.1.3. Morphology, Frequency, and Lexical Difficulty

Psycholinguistic research has long established that morphological structure affects lexical access and perceived difficulty. Studies on base versus surface frequency show that both lemma frequency and wordform frequency influence recognition latencies and error rates [30]. High-frequency lemmas facilitate processing, but rare inflected or derived forms of those lemmas can still incur additional processing cost, especially when morphological structure is opaque. This interplay between morphology and frequency is particularly salient in morphologically rich languages, where productive inflection and derivation generate large paradigms from individual lemmas.

From a theoretical and typological perspective, Cotterell et al. [31] propose information-theoretic measures of inflectional paradigm complexity, showing that languages differ systematically in the entropy and predictability of their inflectional systems. Paradigm size, irregularity, and syncretism all contribute to morphological complexity, which in turn can affect difficulty for both native speakers and L2 learners. Experimental work has demonstrated that paradigm complexity modulates visual word recognition and that derivational and inflectional processes interact asymmetrically, with derivation often creating semantically less transparent forms that are harder to decompose [32].

Despite these findings, computational models of lexical complexity have typically incorporated only coarse morphological proxies, such as word length, character n-grams, or simple suffix counts, and have rarely used full morphological analyses or paradigm-level features. For agglutinative languages such as Kazakh or Turkish, this is a significant limitation. A simple A1 lemma can generate thousands of surface forms through productive suffixation, and rare complex forms may be perceived as substantially harder than the lemma’s nominal CEFR level would suggest. Our work explicitly targets this gap by integrating HFST-based morphological analyses into lexical complexity prediction for Kazakh and by examining how morphological richness interacts with lemma-level difficulty.

2.1.4. NLP for Kazakh and Agglutinative Languages

Kazakh is a Turkic language with rich agglutinative morphology, relatively free word order, and limited high-quality NLP resources. Several morphological analysis tools have been proposed. Early work by Kessikbayeva and Cicekli [33,34] developed rule-based morphological analyzers and a disambiguator for Kazakh based on two-level morphology, encoding Kazakh morphotactics and phonological alternations in finite-state transducers. The Apertium-kaz project provides an HFST-based morphological transducer for Kazakh with coverage reported around 90% on freely available corpora and high precision on a manually verified test set [10]. In parallel, the KazNLP toolkit offers data-driven morphological analysis and other preprocessing tools (normalization, tokenization, language identification) for Kazakh, built with CRF models and designed as a general-purpose NLP library [35]. Building on these foundational preprocessing capabilities, Akanova et al. [36] introduced a specialized algorithm for keyword search within Kazakh text corpora, which enhances the retrieval of thematic information in large-scale datasets. Such advancements in keyword extraction provide a technical framework for identifying salient linguistic patterns essential for complexity modeling.

These tools have been applied in various downstream tasks. Expanding these capabilities, Akanova et al. [37] developed a neurocomputer system specifically for the semantic analysis of Kazakh text, which enhances the understanding of complex linguistic relationships beyond simple morphological parsing. Akhmed-Zaki et al. [38] describe an information system for Kazakh language preprocessing that integrates morphological analysis, disambiguation, and wordform generation for applications in text analytics and lexicography. Morphological features have also been used in Kazakh text classification and short-text processing alongside convolutional neural networks [39]. However, none of these works consider CEFR-based lexical difficulty, and morphological analyzers have not been evaluated in the context of predicting learner-oriented difficulty scales.

Kazakh machine translation has seen rapid development, primarily driven by the creation of parallel corpora and neural models. The KazParC parallel corpus is the largest publicly available resource of its kind described in our references, containing 371,902 parallel sentences across Kazakh, English, Russian, and Turkish, and supporting the Tilmash neural MT system, which the authors report as competitive with commercial systems such as Google Translate and Yandex Translate on standard MT metrics [40]. Earlier work at WMT 2019 by Toral et al. [41] showed that incorporating morphological segmentation using Apertium improves English–Kazakh neural MT, mitigating data sparsity by breaking complex wordforms into morphologically meaningful units. Similar observations have been made for other low-resource agglutinative settings, where tokenization strategy and morphology-aware modeling remain important design choices [41,42].

Turkish, as a closely related Turkic language, provides a valuable reference point. Oflazer’s two-level description of Turkish morphology [43] remains a foundational finite-state account of Turkish morphotactics. More recent resources such as MorphoLex Turkish provide large-scale morphological lexicons with detailed information about root families, suffix frequencies, and morphological neighborhood structure [44]. Work on word-level segmentation in agglutinative languages using neural sequence models and transformer-based variants reports strong performance and improved handling of rich morphology [42]. Studies on Turkish NMT likewise emphasize that morphological complexity, tokenization strategy, and model architecture materially affect translation quality in low-resource settings [45]. Specifically, Toraman et al. [46] demonstrate that the granularity of tokenization can lead to significant variations in language model performance for Turkish, highlighting the non-trivial relationship between subword units and morphological structure.

Despite these advances, no prior work has combined Kazakh morphological analysis with CEFR-based lexical complexity prediction, nor has any study systematically compared feature-based and transformer-based models for lexical complexity prediction in a low-resource agglutinative setting. The present work fills this gap by (i) constructing the first CEFR-graded lexicon for Kazakh; (ii) integrating HFST-based morphological analyses and expanded frequency resources into a type-level lexical complexity prediction task; and (iii) empirically comparing traditional feature-based models and multilingual transformer architectures under realistic low-resource constraints.

2.2. Datasets

We compile three resources: (1) an expert-curated Kazakh CEFR-graded lexicon, (2) a monolingual frequency corpus from the Leipzig Corpora Collection (LCC) [47], and (3) cross-lingual reference mappings to Russian and English.

2.2.1. Kazakh CEFR-Graded Lexicon

We construct the first CEFR-graded lexicon for Kazakh from a state-sponsored pedagogical handbook series covering A1 through C1 [48]. The handbook extraction yields 4561 lemma–POS entries before cleaning and deduplication. Unlike frequency-derived resources like CEFRLex [2], our lexicon inherits prescriptive difficulty labels directly from expert-curated pedagogical minima. The use of expert-curated pedagogical minima as a gold standard ensures the model’s alignment with Kazakhstan’s state educational standards. This guarantees the practical applicability of the system within the framework of Sustainable Development Goal 4 (Quality Education), enabling the automated creation of learning materials that correspond to the actual proficiency levels of students. Table 1 summarizes this handbook-extracted inventory across CEFR levels and parts of speech.

Nouns constitute approximately 43% of the inventory, and their absolute share increases across levels, reflecting expansion into abstract and technical terms. Table 1 also illustrates how closed classes (numerals, pronouns) concentrate in A1–B1, while adjectives peak at B1–C1. Table 2 gives representative entries at each proficiency level. Full data cleaning details appear in Appendix A.

2.2.2. Monolingual Frequency Corpus

Frequency of occurrence is a strong predictor of lexical difficulty [20]. We compute frequency features and retrieve contextual sentences from a 17-million-token Kazakh corpus from the Leipzig Corpora Collection (LCC) [47]. Text is lowercased and tokenized using a Cyrillic-only regex to filter noise. We compute two frequency signals: an exact surface-form count obtained by matching the lemma string exactly as it appears in the corpus, and a secondary lemma-level count obtained by passing every corpus token through the Apertium-kaz lemmatizer before lookup. The exact surface-form count is our primary contextual lookup mechanism; the lemmatized fallback count partially mitigates agglutinative sparsity but does not serve as our primary contextual anchor. Under this exact surface-string matching regime, 38.4% of the 4350 distinct surface forms are attested, contributing both counts and contextual embeddings; the remaining 61.6% default to zero frequency and isolated-word encodings. We additionally compile an expanded lemma-frequency map by running the Apertium-kaz lemmatizer over the full corpus, yielding approximately 418,000 unique lemma entries. This lemmatized lookup raises effective frequency coverage to approximately 83% of the cleaned modeling set, substantially mitigating the sparsity of exact surface-form matching.
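The two-signal lookup described above can be sketched as follows. This is a minimal illustration, not the released pipeline: the Cyrillic character range, the function names, and the toy dictionary standing in for the Apertium-kaz lemmatizer are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: "Cyrillic-only" is approximated by the Unicode Cyrillic block
# (U+0400 to U+04FF), which also covers the Kazakh-specific letters.
CYRILLIC_TOKEN = re.compile(r"[\u0400-\u04FF]+")


def tokenize(text):
    """Lowercase the text and keep only runs of Cyrillic characters."""
    return CYRILLIC_TOKEN.findall(text.lower())


def build_counts(sentences, lemmatize):
    """Return (surface_counts, lemma_counts) over an iterable of sentences."""
    surface, lemma = Counter(), Counter()
    for sentence in sentences:
        for token in tokenize(sentence):
            surface[token] += 1            # exact surface-form count
            lemma[lemmatize(token)] += 1   # count after lemmatization
    return surface, lemma


def lookup(form, surface, lemma):
    """Use the exact surface count first; fall back to the lemmatized count."""
    if surface[form] > 0:
        return surface[form], "surface"
    if lemma[form] > 0:
        return lemma[form], "lemma"
    return 0, "unattested"


# Hypothetical stand-in for the Apertium-kaz lemmatizer.
toy_lemmas = {"балалар": "бала", "мектепке": "мектеп", "барды": "бар", "оқыды": "оқы"}
sentences = ["Балалар мектепке барды.", "Бала кітап оқыды, 2024!"]
surface, lemma = build_counts(sentences, lambda token: toy_lemmas.get(token, token))
```

In this toy corpus, the lexicon entry мектеп never occurs as an exact surface string, so only the lemmatized fallback attests it. This is the sparsity effect that the expanded lemma-frequency map is meant to mitigate.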

2.2.3. Cross-Lingual Reference Resources

To explore whether CEFR difficulty knowledge can be transferred from better-resourced languages to Kazakh, an approach that has shown promise for typologically related European languages [29], we project the 4350 distinct Kazakh surface forms into Russian and English CEFR lexicons via open-source machine translation (details in Section 2.3.5). Table 3 summarizes the source lexicons and MT pipelines. For lemmas with multiple translations matching the source lexicon, we assign the median CEFR level as a silver label. Unmatched translations are dropped. Russian serves as a natural projection bridge due to extensive lexical borrowing and high Kazakh–Russian bilingualism.
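The silver-labeling rule can be illustrated with a short sketch. The function name and the toy English lexicon are hypothetical. The text does not say how a median falling between two levels is resolved, so the example assumes the lower of the two middle levels.

```python
from statistics import median_low

CEFR = ["A1", "A2", "B1", "B2", "C1", "C2"]
LEVEL = {label: i + 1 for i, label in enumerate(CEFR)}


def project_label(translations, source_lexicon):
    """Return a silver CEFR label for one Kazakh form, or None if unmatched."""
    matched = [LEVEL[source_lexicon[t]] for t in translations if t in source_lexicon]
    if not matched:
        return None  # unmatched translations are dropped
    # Assumption: with an even number of matches, take the lower middle level.
    return CEFR[median_low(matched) - 1]


# Toy source lexicon; the entries are illustrative, not taken from EFLLex.
toy_source = {"child": "A1", "kid": "A2", "infant": "B2"}
silver = project_label(["child", "kid", "infant"], toy_source)
```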

2.2.4. Dataset Partitioning

The handbook extraction yields 4561 lemma–POS entries. After cleaning and deduplication, the modeling set contains 4437 unique lemma–POS entries. Specifically, 18 entries with non-Cyrillic characters or length < 2 were removed, and 106 duplicate lemma–POS keys were resolved. Because some lemmas share the same surface form across POS categories, these 4437 entries correspond to 4350 distinct surface forms; all frequency and corpus lookups operate at the surface-form level, while model training uses the full 4437 lemma–POS set. These 4437 entries are partitioned into training, development, and test sets using a 70/15/15 split, yielding 3105, 666, and 666 instances, respectively. Stratified sampling by CEFR level ensures that the class distribution is preserved across all partitions. Table 4 reports the per-level counts.

The neural gated fusion model uses the 3105 training entries and reserves the development set for early stopping. Classical classifiers are evaluated on the same held-out test set of 666 entries.
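A stratified 70/15/15 split of this kind can be sketched with the standard library alone. The function name, the seed, and the per-level rounding are assumptions. Because rounding is applied within each level, the resulting totals need not match the reported 3105/666/666 exactly.

```python
import random
from collections import defaultdict


def stratified_split(entries, ratios=(0.70, 0.15, 0.15), seed=13):
    """Split (item, label) pairs into train/dev/test, stratified by label."""
    by_label = defaultdict(list)
    for item, label in entries:
        by_label[label].append((item, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, dev, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(ratios[0] * len(group))
        n_dev = round(ratios[1] * len(group))
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test


# Synthetic inventory; the level sizes are illustrative only.
sizes = {"A1": 200, "A2": 100, "B1": 60, "B2": 40, "C1": 20}
entries = [(f"{label}-{i}", label) for label, n in sizes.items() for i in range(n)]
train, dev, test = stratified_split(entries)
```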

2.3. Methodology

This subsection outlines the modeling setup used to predict CEFR lexical difficulty for Kazakh. We define the task, describe the feature representations and model families, then present the cross-lingual projection analysis and evaluation protocol.

2.3.1. Task Formulation

We frame Kazakh CEFR lexical complexity prediction as ordinal five-class classification. Given a lemma–POS pair $(w, p)$ from the lexicon and, when available, a corpus sentence $s$ containing $w$, the task is to predict one of the ordered labels. Equation (1) presents the label space:

$$Y = \{A1, A2, B1, B2, C1\}, \tag{1}$$

encoded as integers $\{1, \dots, 5\}$. The task is type-level rather than token-level: each lemma–POS entry receives a single CEFR label regardless of its observed corpus contexts.
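The integer encoding is what makes an ordinal error measure meaningful alongside a categorical one. The sketch below shows the encoding together with the two metrics reported in the abstract, mean absolute error and macro-averaged F1. The function names are assumptions, and the released evaluation code may compute these differently.

```python
LABELS = ["A1", "A2", "B1", "B2", "C1"]
ENCODE = {label: i + 1 for i, label in enumerate(LABELS)}  # A1 -> 1, ..., C1 -> 5


def mean_absolute_error(gold, pred):
    """Average distance in CEFR levels between gold and predicted labels."""
    return sum(abs(ENCODE[g] - ENCODE[p]) for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-level F1; a level with no hits scores 0."""
    scores = []
    for label in LABELS:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(LABELS)


gold = ["A1", "A2", "B1", "B2", "C1"]
pred = ["A1", "B1", "B1", "C1", "C1"]
mae = mean_absolute_error(gold, pred)   # two predictions are off by one level
f1 = macro_f1(gold, pred)
```

Here both errors are off by a single level, so the mean absolute error stays small even though two of the five levels receive an F1 of zero.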

2.3.2. Feature Engineering

For the classical baselines, each lemma–POS pair is represented by a fixed 40-dimensional handcrafted vector organized into five groups. We also evaluate a separate 50-dimensional contextual embedding baseline obtained by applying PCA to XLM-RoBERTa-base representations. The neural model uses a richer engineered bank and retains a train-selected 72-dimensional lexical subset for the MorphMLP branch. Table 5 summarizes the feature groups, and Appendix B provides the full definitions and extraction algorithm.

Among the fixed feature groups, the morphological features extracted from the Apertium HFST analyzer [10] provide the strongest isolated signal in Table 5, which fits Kazakh’s productive morphology. Frequency features from the 17-million-token LCC corpus also contribute useful information, while the embeddings-only baseline remains competitive without surpassing the best handcrafted configuration. Full technical details appear in Appendix B.

2.3.3. Classical Classifiers

We evaluate three classical classifiers from scikit-learn, each preceded by StandardScaler feature normalization:

Logistic Regression (LR): Multinomial, L2-regularized; max_iter = 2000, inverse-class-frequency weights.
Random Forest (RF): 200 trees, max_depth = 15, inverse-frequency weights.
Gradient Boosting (GB): 200 rounds, max_depth = 6.

Each classifier is evaluated on the full 40-dimensional handcrafted feature set. To isolate the contribution of each feature group, we additionally report nested ablation experiments using frequency-only, orthographic-only, morphological-only, POS-only, and TF-IDF-only subsets (Table 5).

For the stacking ensemble only, we additionally train an ExtraTrees classifier as a diverse tree-based probability source. It is used inside the ensemble but is not reported as a standalone baseline.
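One possible scikit-learn configuration of the three baselines is sketched below. The hyperparameters stated above are used as given. Reading the inverse-frequency weighting as `class_weight="balanced"`, leaving all other settings at library defaults, and the random seed are assumptions, and the synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_classifiers(seed=0):
    """Return the three baselines, each preceded by StandardScaler."""
    return {
        # The default penalty is L2; the default solver fits a multinomial model.
        "LR": make_pipeline(
            StandardScaler(),
            LogisticRegression(max_iter=2000, class_weight="balanced"),
        ),
        "RF": make_pipeline(
            StandardScaler(),
            RandomForestClassifier(
                n_estimators=200, max_depth=15,
                class_weight="balanced", random_state=seed,
            ),
        ),
        "GB": make_pipeline(
            StandardScaler(),
            GradientBoostingClassifier(
                n_estimators=200, max_depth=6, random_state=seed,
            ),
        ),
    }


# Synthetic stand-in for the 40-dimensional handcrafted feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))
y = rng.integers(1, 6, size=100)  # CEFR levels encoded as 1..5
models = build_classifiers()
```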

2.3.4. Gated Morphology–Context Fusion Model

To compare a learned fusion mechanism with simpler ways of combining contextual semantics and morphological structure, we use a dual-encoder architecture with gated fusion. Figure 1 summarizes the audited XLM-R configuration.

Context Encoder

XLM-RoBERTa-base (xlm-roberta-base) [8] encodes the sentential context. The target word is delimited with special [TGT] tokens, and its representation is obtained by mean-pooling over the corresponding subword positions in the final hidden layer, where $d_{ctx} = 768$. To mitigate overfitting on the training set of 3105 instances, we freeze all transformer layers except the top $n = 2$, reducing the number of trainable transformer parameters to approximately 8.3% of the total.
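The pooling step can be illustrated independently of the encoder. In the sketch below, the subword segmentation, the two-dimensional toy vectors, and the choice to exclude the marker positions themselves from the average are assumptions. In the actual model the vectors would be 768-dimensional final-layer states.

```python
def target_span(tokens, marker="[TGT]"):
    """Return (start, end) of the subword positions between the two markers."""
    first = tokens.index(marker)
    second = tokens.index(marker, first + 1)
    if second == first + 1:
        raise ValueError("no subword tokens between the markers")
    return first + 1, second


def mean_pool(hidden_states, start, end):
    """Average the vectors at positions start..end-1, dimension by dimension."""
    span = hidden_states[start:end]
    return [sum(column) / len(span) for column in zip(*span)]


# Hypothetical segmentation of a marked target word inside a sentence.
tokens = ["<s>", "[TGT]", "▁бала", "лар", "[TGT]", "▁ойнады", "</s>"]
hidden = [
    [0.0, 0.0], [9.0, 9.0], [1.0, 2.0], [3.0, 6.0],
    [9.0, 9.0], [5.0, 5.0], [0.0, 0.0],
]
start, end = target_span(tokens)
pooled = mean_pool(hidden, start, end)  # averages the two target subwords
```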

Morphological Encoder

The morphological encoder consists of two parallel sub-networks whose outputs are concatenated and linearly projected. Unlike the classical baselines in Section 2.3.2, this branch does not consume the fixed 40-dimensional interpretable vector. Instead, it receives a 72-dimensional lexical feature subset drawn from the larger engineered feature bank. Feature selection is performed exclusively on the training split using ExtraTrees importance ranking (with a minimum of 20 morphological features retained), and the resulting feature mask is frozen before any development or test evaluation, ensuring no information leakage. Selected features are z-scored using training-set statistics.
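The leakage-free selection and scaling can be sketched as below. The importance scores are assumed to come from an ExtraTrees model fitted on the training split only. How the minimum of 20 morphological features is enforced is not stated, so the example reserves the top-ranked morphological features first and fills the remaining slots by overall importance. All names are hypothetical.

```python
from statistics import mean, pstdev


def select_features(importances, is_morph, k=72, min_morph=20):
    """Pick k feature indices by importance, keeping at least min_morph morphological."""
    ranked = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    chosen = set([i for i in ranked if is_morph[i]][:min_morph])  # reserved quota
    for i in ranked:
        if len(chosen) >= k:
            break
        chosen.add(i)
    return sorted(chosen)  # this mask is frozen before dev/test evaluation


def fit_zscore(train_rows, columns):
    """Compute (mean, std) per selected column from the training split only."""
    stats = []
    for c in columns:
        values = [row[c] for row in train_rows]
        sd = pstdev(values)
        stats.append((mean(values), sd if sd > 0 else 1.0))  # guard constant columns
    return stats


def apply_zscore(rows, columns, stats):
    """Scale any split with the statistics estimated on the training split."""
    return [[(row[c] - mu) / sd for c, (mu, sd) in zip(columns, stats)] for row in rows]


# Toy example with 10 features, 3 of them morphological.
importances = [0.9, 0.8, 0.7, 0.6, 0.05, 0.04, 0.03, 0.02, 0.01, 0.0]
is_morph = [False, False, False, False, True, True, True, False, False, False]
mask = select_features(importances, is_morph, k=4, min_morph=2)
```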

CharCNN: a character-level convolutional network over the Kazakh Cyrillic alphabet (44 characters including pad and unknown tokens). Characters are embedded into $d = 64$ dimensions, then processed by three parallel 1D convolutions with kernel sizes $\{2, 3, 4\}$ and 128 filters each, followed by ReLU activation and max-over-time pooling, yielding a vector $h_{charcnn} \in \mathbb{R}^{384}$.
MorphMLP: a two-hidden-layer MLP with LayerNorm that projects the 72-dimensional selected lexical feature vector through hidden layers of 96 units each with ReLU activation and dropout $p = 0.3$, yielding $h_{mlp} \in \mathbb{R}^{96}$.

The concatenated CharCNN and MorphMLP outputs (480 dimensions) are linearly projected to $d_{morph} = 384$.
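The CharCNN dimensions can be checked with a NumPy forward pass. This is a sketch with random, untrained weights, not the PyTorch module used in the experiments. The alphabet string (42 letters plus pad and unknown, giving 44 symbols), the padding of short words up to the widest kernel, and the initialization scale are assumptions.

```python
import numpy as np

PAD, UNK = 0, 1
# Assumption: the 42-letter Kazakh Cyrillic alphabet; with PAD and UNK this gives 44 ids.
ALPHABET = "аәбвгғдеёжзийкқлмнңоөпрстуұүфхһцчшщъыіьэюя"
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode(word, min_len):
    """Map characters to ids and pad so that the widest kernel fits."""
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in word.lower()]
    ids += [PAD] * max(0, min_len - len(ids))
    return np.array(ids)


def init_params(vocab=44, d_emb=64, kernels=(2, 3, 4), n_filters=128, seed=0):
    """Random embedding table and one (weights, bias) pair per kernel size."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(scale=0.1, size=(vocab, d_emb))
    convs = {
        k: (rng.normal(scale=0.1, size=(n_filters, k * d_emb)), np.zeros(n_filters))
        for k in kernels
    }
    return emb, convs


def char_cnn(word, emb, convs):
    """Embed, convolve with each kernel size, apply ReLU, then max-over-time pool."""
    x = emb[encode(word, min_len=max(convs))]            # (length, d_emb)
    pooled = []
    for k, (weights, bias) in sorted(convs.items()):
        windows = np.stack([x[i:i + k].reshape(-1) for i in range(len(x) - k + 1)])
        feats = np.maximum(windows @ weights.T + bias, 0.0)  # (positions, n_filters)
        pooled.append(feats.max(axis=0))
    return np.concatenate(pooled)                        # 3 * 128 = 384 values


emb, convs = init_params()
h_charcnn = char_cnn("балаларымыздан", emb, convs)
```

Concatenating this 384-dimensional vector with the 96-dimensional MorphMLP output gives the 480 dimensions that are then projected to $d_{morph} = 384$.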

Gated Fusion

Rather than simple concatenation, we learn a sigmoid gate that weights the relative contribution of contextual and morphological representations:

$$c = W_{c} h_{ctx} + b_{c}, \tag{2}$$

$$m = W_{m} h_{morph} + b_{m}, \tag{3}$$

$$g = \sigma (W_{g} [h_{ctx}; h_{morph}] + b_{g}), \tag{4}$$

$$f = g \odot c + (1 - g) \odot m, \tag{5}$$

where $W_{c} \in \mathbb{R}^{d_{f} \times d_{ctx}}$, $W_{m} \in \mathbb{R}^{d_{f} \times d_{morph}}$, $W_{g} \in \mathbb{R}^{d_{f} \times (d_{ctx} + d_{morph})}$, and $d_{f} = 256$. Here $\sigma$ is the element-wise sigmoid, $\odot$ is the element-wise product, and $[\cdot\,;\cdot]$ denotes concatenation. Equations (2)–(5) define the fusion block. The fused representation $f$ is passed through layer normalization, dropout with $p = 0.3$, and a linear classifier $W_{cls} \in \mathbb{R}^{K \times d_{f}}$ with softmax activation.
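As an illustration of Equations (2)–(5), the following is a minimal pure-Python sketch of the gate arithmetic. It is not the authors' PyTorch implementation: the toy dimensions, the random initialization, and the helper names (`linear`, `gated_fusion`, `random_params`) are assumptions made for this example, and layer normalization, dropout, and the classifier are omitted.

```python
import math
import random


def linear(W, x, b):
    """Affine map W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(h_ctx, h_morph, p):
    """Fuse two vectors with a learned sigmoid gate (Equations (2)-(5))."""
    c = linear(p["W_c"], h_ctx, p["b_c"])      # Eq. (2): project context
    m = linear(p["W_m"], h_morph, p["b_m"])    # Eq. (3): project morphology
    # Eq. (4): the gate sees the concatenation [h_ctx; h_morph]
    g = [sigmoid(z) for z in linear(p["W_g"], h_ctx + h_morph, p["b_g"])]
    # Eq. (5): element-wise convex combination of c and m
    return [g_k * c_k + (1.0 - g_k) * m_k for g_k, c_k, m_k in zip(g, c, m)]


def random_params(d_ctx, d_morph, d_f, seed=0):
    """Hypothetical small random weights, for demonstration only."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_c": mat(d_f, d_ctx), "b_c": [0.0] * d_f,
        "W_m": mat(d_f, d_morph), "b_m": [0.0] * d_f,
        "W_g": mat(d_f, d_ctx + d_morph), "b_g": [0.0] * d_f,
    }


# Toy sizes stand in for d_ctx = 768, d_morph = 384, d_f = 256.
D_CTX, D_MORPH, D_F = 6, 4, 3
params = random_params(D_CTX, D_MORPH, D_F)
h_ctx = [0.5] * D_CTX
h_morph = [-0.25] * D_MORPH
f = gated_fusion(h_ctx, h_morph, params)
```

Because each gate value lies in (0, 1), every component of `f` falls between the corresponding components of `c` and `m`; a gate of exactly 0.5 reduces the fusion to a plain average of the two projections.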

Training Procedure

The model is optimized with AdamW in PyTorch using discriminative learning rates and focal loss [51] on a single NVIDIA L4 GPU. Full training details appear in Appendix C.

2.3.5. Cross-Lingual Projection

To explore whether translation can yield useful CEFR supervision, we align the 4350 distinct Kazakh surface forms with the English EFLLex (13,871 entries) and Russian KELLY (8947 entries) datasets using the translated lexicons described in Section 2.2.3. For items matched by exact string translation, the median source proficiency level is projected as a silver CEFR label. Our main analysis treats these projected labels as a diagnostic signal of cross-lingual alignment; their use for downstream augmentation remains exploratory.
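The projection step can be sketched as follows. This is an illustrative reading of the procedure, not the released code: the function name, the data layout, the case-insensitive matching, and the use of the lower median for even-sized match sets are all assumptions, and the toy entries are invented rather than taken from EFLLex or KELLY.

```python
from statistics import median_low

LEVEL_TO_INT = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}
INT_TO_LEVEL = {v: k for k, v in LEVEL_TO_INT.items()}


def project_silver_labels(translations, source_lexicon):
    """Project a silver CEFR label onto each Kazakh form with an exact match.

    translations: dict mapping a Kazakh surface form to its translated string.
    source_lexicon: iterable of (headword, CEFR level) pairs from the source
    language resource; a headword may occur more than once.
    """
    levels_by_word = {}
    for headword, level in source_lexicon:
        levels_by_word.setdefault(headword.lower(), []).append(LEVEL_TO_INT[level])

    silver = {}
    for kk_form, translated in translations.items():
        matches = levels_by_word.get(translated.lower())
        if matches:  # unmatched forms receive no silver label
            # Assumption: lower median, so the result is always a valid level.
            silver[kk_form] = INT_TO_LEVEL[median_low(matches)]
    return silver


# Toy data for illustration only.
toy_translations = {"кітап": "book", "мәдениет": "culture", "жарнама": "advertisement"}
toy_lexicon = [("book", "A1"), ("book", "B1"), ("book", "A2"), ("culture", "B1")]
silver = project_silver_labels(toy_translations, toy_lexicon)
```

Forms without an exact match are left unlabeled. This coverage behavior is part of why the projected labels are treated as a diagnostic rather than as training data.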

2.3.6. Direct-Prompt LLM Baselines

To address whether instruction-tuned large language models can solve the task without task-specific training, we add direct-prompt baselines using both general multilingual and Kazakh-focused open-weight models. The general baselines are Qwen2.5-0.5B-Instruct and Qwen2.5-7B-Instruct [52]. The Kazakh-focused baselines are Sherkala-Chat 8B [53] and ISSAI KAZ-LLM 8B [54]. Each prompt gives the Kazakh lemma, its POS tag, and the closed label set {A1, A2, B1, B2, C1}; the model must return a single CEFR label. We test both Kazakh and English prompt templates and use deterministic decoding with no sampling. Labels are extracted from the generated text by matching the first valid CEFR label; unparseable outputs are counted as incorrect. No train or development labels are used to update the model, tune prompts, or calibrate decision thresholds.
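The label-extraction rule can be illustrated with a short sketch. The regular expression, the upper-casing step, and the function names are assumptions made for this example; the source specifies only that the first valid CEFR label is matched and that unparseable outputs count as incorrect.

```python
import re

# Closed label set from the prompt; C2 is deliberately not a valid label.
CEFR_PATTERN = re.compile(r"\b(A1|A2|B1|B2|C1)\b")


def extract_label(generated_text):
    """Return the first valid CEFR label in the text, or None if absent."""
    match = CEFR_PATTERN.search(generated_text.upper())  # assumption: ignore case
    return match.group(1) if match else None


def direct_prompt_accuracy(outputs, gold_labels):
    """Accuracy in which unparseable outputs (None) are counted as incorrect."""
    correct = sum(extract_label(o) == g for o, g in zip(outputs, gold_labels))
    return correct / len(gold_labels)


example_outputs = ["A1", "I am not sure.", "Деңгей: B2."]
example_gold = ["A1", "A2", "B1"]
example_accuracy = direct_prompt_accuracy(example_outputs, example_gold)
```

Under this sketch, an output that spells a level with Cyrillic look-alike letters would not match the Latin pattern and would therefore be scored as incorrect.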

2.3.7. Experimental Procedure

All supervised models use the same stratified train/development/test partitions. Training uses only the training split; the development split is reserved for hyperparameter selection, early stopping, and ensemble weight tuning; and the test split is evaluated once for the reported held-out scores. We do not retrain on the combined train+development set before test evaluation. The classical baselines are trained with standardized feature vectors, the neural ablations use the shared optimization settings in Appendix C, and the direct-prompt LLM baselines described in Section 2.3.6 are evaluated only as external comparison systems.

2.3.8. Evaluation Measures

Let $N$ denote the number of evaluation instances, $\hat{y}_{i}$ the predicted CEFR level encoded as an ordinal integer in $\{1, \dots, 5\}$, $y_{i}$ the corresponding gold label, and $K = 5$ the number of proficiency levels.

Accuracy (Acc)

Accuracy is the fraction of exactly correct predictions:

$$\mathrm{Acc} = \frac{1}{N} \sum_{i = 1}^{N} \mathbb{I} (\hat{y}_{i} = y_{i}), \tag{6}$$

where $\mathbb{I} (\cdot)$ is the indicator function.

Macro-F1

Macro-F1 is the unweighted average of per-class F1 scores:

$$\text{Macro-F1} = \frac{1}{K} \sum_{k = 1}^{K} \mathrm{F1}_{k} . \tag{7}$$

Mean Absolute Error (MAE)

MAE is the average ordinal distance between predicted and gold levels:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i = 1}^{N} | \hat{y}_{i} - y_{i} | . \tag{8}$$

Within-1 Accuracy (W-1)

Within-1 accuracy is the fraction of predictions within one CEFR level of the gold label, that is, those with $| \hat{y}_{i} - y_{i} | \leq 1$.
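The four measures can be computed as in the sketch below. It follows Equations (6)–(8) and the W-1 definition; the function names are illustrative, and assigning an F1 of zero to a class with no true or predicted instances is an assumption, since the text does not state how that case is handled.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    return sum(abs(p - t) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, num_levels=5):
    """Unweighted mean of per-class F1 over levels 1..num_levels."""
    f1_scores = []
    for k in range(1, num_levels + 1):
        tp = sum(t == k and p == k for t, p in zip(y_true, y_pred))
        fp = sum(t != k and p == k for t, p in zip(y_true, y_pred))
        fn = sum(t == k and p != k for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Assumption: a class with no true or predicted items scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_levels


# Toy labels: A1..C1 encoded as 1..5.
gold = [1, 2, 3, 4, 5, 3]
pred = [1, 3, 3, 2, 5, 4]
scores = {
    "Acc": accuracy(gold, pred),
    "Macro-F1": macro_f1(gold, pred),
    "MAE": mean_absolute_error(gold, pred),
    "W-1": within_one_accuracy(gold, pred),
}
```

On this toy example, three of six predictions are exact and five of six fall within one level, which illustrates how W-1 can remain high while accuracy is modest.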

3. Results

We address our research questions through three sets of experiments: (1) evaluating supervised performance of classical and neural models, (2) diagnosing per-level and ordinal error distributions, and (3) testing cross-lingual projection. All reported supervised scores use the fixed 70/15/15 partitions.

3.1. Supervised Lexical Complexity Prediction

Table 6 shows a clear ranking on the held-out test set: the handcrafted baseline is competitive, the fusion model is stronger, and the ensemble is best overall. The experimental procedure used for training, development selection, and held-out evaluation is described in Section 2.3.7.

The direct-prompt LLM comparison in Table 7 provides an external baseline against recent open-weight instruction models. These models are not fine-tuned on the Kazakh CEFR lexicon, and their results are therefore best interpreted as a test of whether LLM priors can replace task-specific supervision. The answer is negative in the present setting: the best direct-prompt LLM result comes from ISSAI KAZ-LLM 8B with the English prompt (Macro-F1 0.214), but it remains well below the full-feature LR baseline and the supervised fusion models.

3.1.1. RQ1: Handcrafted Morphology Versus Context

RQ1 asks whether interpretable handcrafted features are competitive with an embeddings-only baseline. Table 6 answers this directly: the 40-feature LR reaches Macro-F1 0.314, outperforming the XLM-R PCA(50) embeddings-only LR (0.298), and both substantially exceed the frequency-only baseline (0.173). This indicates that the engineered representation captures morphological, orthographic, and frequency structure that the reduced embedding space does not fully retain, while also confirming that contextual embeddings provide a strong standalone signal for Kazakh CEFR prediction.

3.1.2. RQ2: Gated Fusion and Architectural Ablations

RQ2 is answered mainly by the ablation hierarchy in Table 8. Table 6 reports the held-out scores for the final audited model configurations on the fixed split, whereas Table 8 reports five-seed means and standard deviations for the architectural ablations. The latter is intended to show stability and component-wise behavior rather than to replace the final-model comparison. All ablation conditions use the same training protocol, and Appendix C gives the shared settings.

Character-level patterns alone are the weakest condition in the ablation, and adding engineered morphological features improves the lexical branch. This shows that explicit morphology contributes information that the CharCNN does not recover reliably from the limited training data. XLM-R context-only remains stronger than the morphology-only variants, highlighting the value of pretrained contextual representations. The strongest supervised results come from combining the two signal types: both fusion variants outperform the best classical baseline, while gated and concatenation fusion show no clear separation in Appendix D. The ensemble yields the best overall test metrics, but its margin over the standalone fusion model remains small and not clearly reliable.

3.2. Diagnostic Analysis

To better understand model behavior, we report per-level F1 scores and the distribution of prediction errors for the key configurations.

The intermediate levels, rather than the extreme labels, are the main source of difficulty in this task. Table 9 shows that all supervised models struggle most with A2 and B1, while A1 and C1 are predicted more reliably. This fits the ordinal structure of the dataset: adjacent intermediate levels share more lexical and morphological properties, whereas the extremes are more distinct. Within that pattern, the fusion model is especially helpful on B2, while the ensemble shifts some of the gain toward A2, suggesting different tradeoffs across the intermediate classes.

The main benefit of the neural models is a reduction in larger ordinal mistakes. Table 10 shows that the fusion model increases exact matches and lowers the rate of errors that miss by two or more levels relative to the logistic baseline. The ensemble pushes that trend slightly further and yields the best MAE, but all models remain below 70% W-1 accuracy, underscoring the difficulty of fine-grained ordinal classification across five CEFR levels.

Qualitative Error Patterns

Manual inspection of misclassified test entries reveals several recurring patterns. First, morphologically complex A2 words such as туыстық (kinship, A2) are often predicted as B1 or B2: their derivational structure resembles higher-level vocabulary, but they denote everyday family concepts taught early. Second, abstract B2 nouns like мәдениет (culture) are frequently confused with B1 when they appear in high-frequency corpus contexts, pulling the contextual signal toward an easier level. Third, the gated fusion model corrects several cases where the classical baseline fails—for example, жарнама (advertisement, B1) is correctly classified by the fusion model but predicted as A2 by LR, because the morphological encoder captures the derivational suffix -нама as a productive B1+ indicator. These patterns suggest that the intermediate-level confusion is driven by a genuine overlap in lexical and morphological properties between adjacent CEFR levels, rather than by systematic model failure.

3.3. Cross-Lingual CEFR Projection

This experiment (RQ3) evaluates cross-lingual CEFR projection for Kazakh words translated into Russian and English. We translate all 4350 distinct Kazakh surface forms derived from the cleaned modeling set of 4437 unique lemma–POS entries to Russian (via Tilmash [40]) and to English (via OPUS-MT [50] and a Russian-pivot path, as described in Section 2.3.5). Translated words are looked up in the KELLY Russian CEFR lexicon [29] with 8947 entries and EFLLex [49] with 13,871 entries.

Projection and Augmentation

Cross-lingual projection is more useful as a diagnostic than as a training signal in the current setup. Table 11 shows moderate alignment with Russian but weak alignment with English, and Appendix E details the Russian breakdown by difficulty band and part of speech. Under this low-coverage exact-match regime, projected labels are best interpreted as evidence about cross-lingual consistency rather than as a stable source of downstream supervision.

4. Discussion

Our discussion focuses on what the experiments show reliably for this dataset and what remains unresolved. Across the supervised analyses, the consistent pattern is that morphology helps, contextual encoders are strong, and combining the two yields the clearest gains.

4.1. When Handcrafted Morphology Still Matters

Explicit morphological modeling matters because it adds information that character patterns alone do not recover reliably. The CharCNN cannot represent paradigm-level properties such as analysis count, derivational depth, and inflectional categories, whereas the MorphMLP branch improves the lexical model when those features are added. Appendix D supports the robustness of this contrast.

While our experiments are limited to Kazakh, this finding may be relevant to other agglutinative languages—Turkish, Finnish, and other Turkic languages—that share the property of productive suffixation and for which finite-state morphological analyzers are available. Whether the same pattern holds in those settings remains an empirical question.

4.2. Features and Embeddings

Pretrained contextual representations are stronger than the morphology-only neural branch, so the paper does not argue for hand-engineering in place of modern encoders. Instead, Table 6 and Table 8 point to complementarity: handcrafted features provide a solid classical baseline, XLM-R context-only is stronger than the morphology-only variants, and fusion models perform best overall. This pattern is consistent with findings in other low-resource LCP settings [20,21].

4.3. Direct-Prompt LLMs

The LLM experiment strengthens the interpretation that task-specific lexical supervision is still necessary for Kazakh CEFR prediction. Table 7 shows that direct prompting underperforms both the supervised fusion model and the stronger classical baselines in Macro-F1. Kazakh-focused LLMs improve over some general multilingual prompt baselines—ISSAI KAZ-LLM 8B with the English prompt reaches Macro-F1 0.214—but the gap remains substantial against the gated fusion model (0.360 Macro-F1) and the full-feature LR baseline (0.314 Macro-F1). This indicates that broad Kazakh language modeling capacity does not by itself recover the pedagogical level distinctions encoded in the Kazakh lexical minima.

4.4. Gated vs. Concatenation Fusion

The specific fusion mechanism appears less important than the availability of both signal types. Across the multi-seed ablation and the significance appendix, gated and concatenation fusion show no clear performance separation. We therefore interpret the gain as evidence for representation complementarity rather than for a uniquely superior gating design.

4.5. Ensembling

The ensemble offers a small additional gain, but not a decisive one. It yields the best overall test metrics, yet Appendix D does not show a clear advantage over the standalone fusion model. Its main value is that it redistributes errors across the most difficult classes, suggesting partially complementary failure modes between neural and classical systems.

4.6. Interpreting Cross-Lingual Projection

Cross-lingual projection remains more convincing as an analysis tool than as supervision. Russian shows moderate alignment with the Kazakh labels, whereas English remains weak and noisy under the current exact-match pipeline. This suggests that cross-lingual CEFR information may still be useful, but the present setup is not reliable enough for downstream augmentation.

4.7. Ordinal Structure

Although CEFR levels are inherently ordered and we report ordinal-aware metrics (MAE, Within-1), all models in this study use nominal classification losses. Explicit ordinal classifiers such as CORAL [55] or cumulative-link models could exploit the label ordering directly and may reduce large-distance errors. We leave a systematic comparison of ordinal loss functions to future work, noting that the current Within-1 accuracy of 69.2% already suggests room for improvement through ordinal-aware training.

4.8. Limitations

Our findings are subject to several constraints: (1) Lexicon scale: The cleaned modeling set of 4437 unique lemma–POS entries is modest compared to European resources, limiting the reliability of per-level estimates. (2) Validation: Labels reflect theoretical curriculum placement rather than empirical learnability, as they are derived from pedagogical handbooks without validation against learner performance. (3) Range: The absence of C2-level data limits the generalizability of our observations across the full CEFR scale. (4) Frequency noise: Exact surface-form lookup covers only 38.4% of the distinct surface forms in the modeling set; the expanded lemma-level lookup raises effective frequency coverage to approximately 83% of the modeling set, but the remaining entries still default to zero frequency. (5) Context mismatch: Type-level labels are paired with instance-level embeddings retrieved from corpus occurrences of the surface form. For homographic or multi-POS entries, retrieved sentences may represent a different sense or grammatical category, potentially introducing noise in the contextual signal.

4.9. Future Work

Several directions could strengthen and extend this work: (i) expanding the lexicon via additional pedagogical sources and expert annotation; (ii) collecting learner-based validation data to ground CEFR labels in empirical difficulty; (iii) richer morphological features such as paradigm entropy and suffix productivity; (iv) confidence-weighted cross-lingual projection that down-weights noisy silver labels; and (v) lemmatizing the source corpus prior to frequency counting.

5. Conclusions

This paper contributes a new resource and a bounded empirical result for Kazakh CEFR modeling. We introduce the first CEFR-graded lexicon for a Turkic language and use it to compare handcrafted, contextual, and fusion-based approaches to lexical difficulty prediction. The experiments show that engineered morphological information improves over character-only lexical modeling and that models combining morphology with contextual representations perform best overall on this dataset. At the same time, the results do not support a strong claim that gated fusion is better than simpler fusion. Taken together, the findings indicate that morphology-aware modeling is useful for Kazakh lexical difficulty prediction, while generalization to other Turkic languages remains an open empirical question. The presented resources and models lay a foundation for developing intelligent systems for Kazakh language learning. This work thereby contributes to Sustainable Development Goal 4 (Quality Education) by supporting inclusive, high-quality education through the integration of NLP tools into educational practice.

Author Contributions

Conceptualization, G.Y., A.A., Z.G. and N.O.; methodology, G.Y., A.A., Z.G. and N.O.; software, G.Y., A.A. and Z.G.; formal analysis, G.Y., A.A., Z.G. and N.O.; investigation, G.Y., A.A., Z.G. and N.O.; visualization, G.Y., A.A. and Z.G.; writing—original draft preparation, G.Y., A.A., Z.G. and N.O.; writing—review and editing, G.Y., A.A., Z.G. and N.O.; supervision, A.A. and N.O.; project administration, A.A.; funding acquisition, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by the Ministry of Science and Higher Education of the Republic of Kazakhstan under the “Zhas Galym” project for 2025–2027 (Individual Registration Number AP25793799, “Adaptive text translation and Kazakh language teaching system based on neural network algorithms”).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The CEFR-graded lexicon, trained model checkpoints, and experiment code are archived at Zenodo (https://doi.org/10.5281/zenodo.19365834). The Kazakh monolingual frequency corpus used for feature extraction is from the Leipzig Corpora Collection and is publicly available at https://wortschatz.uni-leipzig.de/en/download/Kazakh (accessed on 11 March 2026).

Acknowledgments

The authors express their gratitude to the Institute of Smart Systems and Artificial Intelligence (ISSAI), particularly to Yerbol Absalyamov, for providing access to the high-performance computing resources.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

Acc	Accuracy
BERT	Bidirectional Encoder Representations from Transformers
CEFR	Common European Framework of Reference for Languages
CharCNN	Character-level convolutional neural network
CNN	Convolutional neural network
CRF	Conditional random field
ET	Extra trees
GB	Gradient boosting
HFST	Helsinki Finite-State Toolkit
LCC	Leipzig Corpora Collection
LCP	Lexical complexity prediction
LLM	Large language model
LR	Logistic regression
MAE	Mean absolute error
MLP	Multi-layer perceptron
MT	Machine translation
NLP	Natural language processing
PCA	Principal component analysis
POS	Part of speech
RF	Random forest
SDG	Sustainable Development Goal
TF-IDF	Term frequency–inverse document frequency
W-1	Within-1 accuracy
XLM-R	Cross-lingual Language Model–RoBERTa

Appendix A. Lexicon Construction and Cleaning Pipeline

Appendix A.1. Source Materials

The lexicon is derived from five official “Lexical Minimum” handbooks published by the Republic of Kazakhstan’s Ministry of Education for levels A1 through C1 [48]. Each handbook lists vocabulary items that learners are expected to acquire at the corresponding CEFR level, organized by part of speech. These handbooks are state-sponsored pedagogical standards used in Kazakh-language certification and curriculum design.

Appendix A.2. Extraction

All single-word entries were extracted via automated PDF parsing using a custom Python 3.10.0 pipeline. Each page was parsed with layout-aware text extraction to recover tabular structure.

Appendix A.3. Cleaning Pipeline

To obtain a lexicon suitable for computational modeling, we applied the following filtering steps:

Multiword removal: entries containing whitespace or hyphens were excluded, as this study targets type-level, single-word complexity prediction.
Script filtering: tokens containing non-Cyrillic characters were removed.
Length filtering: tokens shorter than two characters were discarded.
POS normalization: original handbook POS labels were mapped to a seven-way tagset: Noun, Verb, Adj, Adv, Num, Pron, and Other.
Duplicate resolution: 106 duplicate lemma–POS keys were resolved. Homographic forms with distinct POS tags were retained as separate entries; where the same lemma–POS pair appeared at multiple CEFR levels across handbooks, the lowest level was assigned.
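The filtering and duplicate-resolution steps above can be sketched as follows. This is a hypothetical reconstruction for illustration, not the released pipeline: the function name, the use of the Unicode Cyrillic block U+0400–U+04FF as the script test, and the input format are assumptions. POS normalization is omitted because the original handbook labels are not listed here.

```python
import re

# Assumption: "Cyrillic" means the Unicode block U+0400-U+04FF, which
# includes the Kazakh-specific letters.
CYRILLIC_ONLY = re.compile(r"[\u0400-\u04FF]+")
LEVEL_ORDER = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5}


def clean_lexicon(entries):
    """Filter (lemma, pos, level) triples and keep the lowest level per key."""
    kept = {}
    for lemma, pos, level in entries:
        if "-" in lemma or any(ch.isspace() for ch in lemma):
            continue  # multiword removal
        if not CYRILLIC_ONLY.fullmatch(lemma):
            continue  # script filtering
        if len(lemma) < 2:
            continue  # length filtering
        key = (lemma, pos)  # homographs with distinct POS stay separate
        if key not in kept or LEVEL_ORDER[level] < LEVEL_ORDER[kept[key]]:
            kept[key] = level  # duplicate resolution: lowest level wins
    return kept


# Toy input for illustration only.
raw_entries = [
    ("кітап", "Noun", "A2"),
    ("кітап", "Noun", "A1"),
    ("ата-ана", "Noun", "A1"),
    ("қол қою", "Verb", "B1"),
    ("taxi", "Noun", "A1"),
    ("я", "Pron", "A1"),
    ("жаз", "Verb", "A1"),
    ("жаз", "Noun", "A2"),
]
cleaned = clean_lexicon(raw_entries)
```

In this toy run, the hyphenated, multiword, Latin-script, and single-character entries are dropped, the repeated noun keeps its lowest level, and the verb and noun readings of the same string remain separate entries.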

Appendix A.4. Quality Checks

After automated cleaning, a native Kazakh-speaking annotator reviewed all entries flagged by the Apertium transducer as unrecognized. Common issues included minor orthographic variants and loanwords absent from the transducer’s vocabulary; these were retained in the lexicon with their handbook-assigned levels.

Appendix A.5. Licensing

The source handbooks are published as official educational standards by the Republic of Kazakhstan and are freely available. The derived lexicon consists solely of lemma–POS–level triples and does not reproduce the full text or presentation of the original handbooks.

The handbook extraction yields 4561 lemma–POS entries. After the cleaning and deduplication described in Section 2.2.4, the modeling set used in the experiments contains 4437 unique lemma–POS entries.

Appendix B. Feature Engineering Details

This appendix provides the full technical definitions of the handcrafted features and the formal extraction pipeline summarized in Section 2.3.2.

Appendix B.1. Extraction Algorithm

Algorithm A1 distinguishes the 40-dimensional handcrafted feature vector from the separate 50-dimensional contextual-embedding baseline (XLM-RoBERTa-base with PCA) used in Table 5; they are extracted in parallel, not concatenated into a single classical feature set.

Appendix B.2. Feature Group Definitions

Appendix B.2.1. Orthographic Features

We extract six features: (1) character length; (2) total vowel count, where we count 12 vowel characters in Kazakh Cyrillic orthography (seven front {ә, е, и, ө, ү, і, э} and five back {а, о, ұ, ы, у}) as a single aggregate feature; (3) consonant count; (4) rare-character count (the nine Kazakh-specific Cyrillic characters); (5) heuristic syllable count (number of vowel nucleus groups); and (6) a binary indicator for long words ($| w | \geq 8$).
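A minimal sketch of these six features is given below. The feature names are illustrative, and two details are assumptions: every alphabetic character outside the vowel set (including the signs ъ and ь) is counted as a consonant, and a syllable nucleus is taken to be a maximal run of adjacent vowel characters.

```python
FRONT_VOWELS = set("әеиөүіэ")
BACK_VOWELS = set("аоұыу")
VOWELS = FRONT_VOWELS | BACK_VOWELS
# The nine letters specific to the Kazakh Cyrillic alphabet.
KAZAKH_SPECIFIC = set("әғқңөұүһі")


def orthographic_features(word):
    """Return the six orthographic features for a single lemma."""
    w = word.lower()
    is_vowel = [ch in VOWELS for ch in w]
    # A vowel starts a new nucleus group when the previous character
    # is not a vowel (or when it is the first character).
    nucleus_groups = sum(
        1 for i, v in enumerate(is_vowel) if v and (i == 0 or not is_vowel[i - 1])
    )
    return {
        "char_length": len(w),
        "vowel_count": sum(is_vowel),
        "consonant_count": sum(ch.isalpha() and ch not in VOWELS for ch in w),
        "rare_char_count": sum(ch in KAZAKH_SPECIFIC for ch in w),
        "syllable_count": nucleus_groups,
        "is_long": int(len(w) >= 8),
    }


features_culture = orthographic_features("мәдениет")
features_kinship = orthographic_features("туыстық")
```

Because adjacent vowels are merged into one group, the heuristic can undercount syllables in words such as мәдениет, where и and е are neighbors.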

Algorithm A1 Handcrafted Feature and Embedding Extraction
Require: Lemma–POS pair $(w, pos)$ ; exact frequency dict $F_{exact}$ ; lemma frequency dict $F_{lemma}$ ; corpus size N; corpus $C$ ; retrieval budget k; HFST transducer $T$
Ensure: Handcrafted vector $x_{hand}$ ; embedding vector $x_{emb}$
1: $ϕ_{orth} \leftarrow$ Orthography(w)	▹ 6 orthographic features
2: $ϕ_{pos} \leftarrow$ EncodePos(pos)	▹ 5-way one-hot + content-word flag
3: $f \leftarrow F_{exact} . GET (w, 0)$
4: $f_{ℓ} \leftarrow F_{lemma} . GET (w, 0)$
5: $ϕ_{freq} \leftarrow$ FreqStats $(f, f_{ℓ}, N)$	▹ 4 frequency features
6: $A_{w} \leftarrow$ AnalyzeHFST $(w, T)$
7: if $A_{w} = \emptyset$ then
8: $ϕ_{morph} \leftarrow 0_{19}$
9: else
10: $ϕ_{morph} \leftarrow$ SummarizeAnalyses $(A_{w})$	▹ 19 morphological features
11: end if
12: $ϕ_{tfidf} \leftarrow$ CharTfidfStats $(w, C)$	▹ 5 document-frequency features
13: $x_{hand} \leftarrow ϕ_{orth} \oplus ϕ_{pos} \oplus ϕ_{freq} \oplus ϕ_{morph} \oplus ϕ_{tfidf}$	▹ 40 dimensions

14: // Separate 50-dimensional embeddings-only baseline
15: $S_{w} \leftarrow$ Retrieve $(w, C, k)$
16: if $S_{w} \neq \emptyset$ then
17: $e_{w} \leftarrow \frac{1}{| S_{w} |} \sum_{s \in S_{w}} {XLM - R}_{[TGT]} (s, w)$
18: else
19: $e_{w} \leftarrow {XLM - R}_{[CLS]} (w)$
20: end if
21: $x_{emb} \leftarrow$ Pca₅₀ $(e_{w})$
22: return $(x_{hand}, x_{emb})$

Appendix B.2.2. POS Features

One-hot encoding over five coarse tags (Noun, Verb, Adj, Adv, Other) plus a binary is_content_word flag. We merge Num and Pron into Other.

Appendix B.2.3. Frequency Features

From the 17-million-token aggregated Leipzig corpus, we compute four features: $\log_{10} (f + 1)$, where $f$ is the raw lemma count; the relative frequency $f / N_{corpus}$; a binary in_corpus indicator; and a separate lemma-level log-frequency obtained via the Apertium lemmatizer.
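The four frequency features can be sketched as below. The function and key names are hypothetical, the corpus size is rounded to 17 million tokens, and applying the same $\log_{10}(\cdot + 1)$ transform to the lemma-level count is an assumption, since the text names that feature without giving its formula.

```python
import math

N_CORPUS = 17_000_000  # approximate token count of the aggregated corpus


def frequency_features(f_exact, f_lemma, n_corpus=N_CORPUS):
    """Frequency features from an exact count and a lemma-level count."""
    return {
        "log_freq": math.log10(f_exact + 1),
        "rel_freq": f_exact / n_corpus,
        "in_corpus": int(f_exact > 0),
        # Assumption: same smoothing as the exact-count feature.
        "lemma_log_freq": math.log10(f_lemma + 1),
    }


attested = frequency_features(f_exact=99, f_lemma=999)
unattested = frequency_features(f_exact=0, f_lemma=0)
```

The add-one smoothing keeps the logarithm defined for unattested entries, which then receive zero on all four features.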

Appendix B.2.4. Morphological Features

We analyze each citation-form lemma with the Apertium-kaz HFST finite-state morphological transducer [10], which covers approximately 85.6% of the lexicon entries used in the experiments (similar rates across all splits). The transducer returns all valid morphological analyses for a given surface form; from these we extract 19 features:

Analysis-level: a recognition flag, the number of analyses, a normalized ambiguity score, and a composite complexity score.
Morpheme counts: minimum, maximum, and mean number of morphemes across analyses.
Derivation: derivational_depth (number of derivational suffixes detected in the analysis).
Inflectional categories: binary indicators for case, possession, plural, tense, and copula.
POS distribution: n_unique_pos (number of distinct POS tags across analyses) and a five-way distribution over Apertium POS categories.

For the 14.4% of lemmas not recognized by the transducer, all morphological features default to zero. These features are extracted from the lemma listed in the lexicon, not from inflected surface forms.

Appendix B.2.5. TF-IDF Features

We compute character n-gram TF-IDF vectors over the lemma corpus and extract five document-frequency statistics: inverse document frequency (idf), document frequency ratio (df_ratio), log document frequency, TF-IDF score, and document spread.

Appendix B.2.6. Contextual Embeddings

For lemmas attested in the Leipzig corpus, we retrieve up to $k = 10$ sentences containing the target word, encode each sentence with XLM-RoBERTa-base, and extract the target-word representation by mean-pooling over its subword tokens marked with [TGT] delimiters. The per-sentence embeddings are averaged to obtain a single 768-dimensional vector per lemma. For unattested lemmas, we fall back to encoding the isolated lemma string. All 768-dimensional vectors are then reduced to 50 dimensions via PCA fitted on the training set. This PCA representation is the standalone embeddings-only classical baseline in Table 5; it uses the same XLM-RoBERTa-base encoder as the neural fusion model but in a frozen, pre-extracted mode rather than as a fine-tuned backbone.

Appendix C. Neural Training Details

All neural models use the shared optimization settings below. For the ablation results in Table 8, each neural configuration is trained with five random seeds (42, 123, 456, 789, 2024), and we report mean ± std over those runs. By contrast, the neural score reported in Table 6 for the final audited gated model is the saved fixed-split submission run. The significance appendix separately uses seed 42 only for bootstrap resampling reproducibility.

Optimizer: AdamW (weight decay 0.01, gradient clip 1.0).
Learning rates: Transformer backbone $2 \times 10^{- 5}$ , classification heads $5 \times 10^{- 4}$ ; linear warmup over 10% of steps. Morph-only models use a single rate of $5 \times 10^{- 4}$ .
Loss: Focal loss [51] with $γ = 2.0$ and inverse-frequency class weighting.
Architecture: XLM-RoBERTa-base (top 2 layers unfrozen); CharCNN with kernels [2,3,4] × 128 filters; MorphMLP with 72 → 96 → 96 dims (features selected via ExtraTrees importance, min 20 morphological); the concatenated lexical representation is then projected from 480 → 384.
Regularization: Dropout 0.3; batch size 16; random seed as specified above for each run.
Hardware: Single NVIDIA L4 GPU (24 GB VRAM; NVIDIA Corporation, Santa Clara, CA, USA).

Focal loss down-weights well-classified examples and concentrates gradient signal on hard instances:

$$L_{focal} = - \alpha_{y} {(1 - p_{y})}^{\gamma} \log p_{y}, \tag{A1}$$

where $p_{y}$ is the predicted probability for the true class $y$, $\alpha_{y}$ is the inverse-frequency class weight, and $\gamma = 2.0$ controls the focusing strength.
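Equation (A1) can be checked numerically with the short sketch below. The per-example function follows the formula directly; the normalization of the inverse-frequency weights, $\alpha_{k} = N / (K n_{k})$, is an assumption, because the text does not specify how the weights are scaled.

```python
import math


def focal_loss(probs, y, alpha, gamma=2.0):
    """Per-example focal loss for predicted probabilities and true class y."""
    p_y = probs[y]
    return -alpha[y] * (1.0 - p_y) ** gamma * math.log(p_y)


def inverse_frequency_weights(class_counts):
    """Assumed scaling: alpha_k = N / (K * n_k), so balanced classes get 1."""
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [total / (num_classes * n) for n in class_counts]


uniform_alpha = [1.0, 1.0]
easy = focal_loss([0.1, 0.9], y=1, alpha=uniform_alpha)  # confident and correct
hard = focal_loss([0.5, 0.5], y=1, alpha=uniform_alpha)  # uncertain
weights = inverse_frequency_weights([10, 30])
```

With $\gamma = 2$ the confident example contributes far less loss than the uncertain one, and setting $\gamma = 0$ with unit weights recovers ordinary cross-entropy.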

Appendix D. Statistical Significance Tests

Table A1 reports all pairwise significance tests referenced in the paper. For each comparison we report (i) the paired bootstrap test [56] on Macro-F1 using $n = 10{,}000$ resamples and resampling seed 42, giving the observed difference $\Delta$F1, a two-sided p-value, and a 95% confidence interval; and (ii) McNemar’s test on per-instance correctness, giving $\chi^{2}$ and a p-value. For McNemar counts below 25, we use the exact binomial test; otherwise we apply the $\chi^{2}$ approximation with continuity correction.
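The McNemar computation can be sketched in pure Python as below. The function name is illustrative, and reading "counts below 25" as the total number of discordant pairs is an assumption. The $\chi^{2}$ p-value uses the identity that the survival function of a one-degree-of-freedom $\chi^{2}$ variable equals $\mathrm{erfc}(\sqrt{x/2})$.

```python
import math


def mcnemar_test(n_ab, n_ba, exact_threshold=25):
    """McNemar's test on two discordant-pair counts.

    Returns (statistic, p_value); the statistic is None when the exact
    binomial test is used.
    """
    n = n_ab + n_ba
    if n == 0:
        return None, 1.0
    if n < exact_threshold:  # assumption: threshold applies to n_ab + n_ba
        k = min(n_ab, n_ba)
        tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
        return None, min(1.0, 2.0 * tail)
    # Chi-square approximation with continuity correction, 1 degree of freedom.
    chi2 = (abs(n_ab - n_ba) - 1) ** 2 / n
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


chi2_first, p_first = mcnemar_test(67, 109)
chi2_last, p_last = mcnemar_test(46, 42)
```

Applied to the discordant counts 67/109 and 46/42 listed in Table A1, this sketch gives statistics of about 9.55 and 0.10 with p-values of about 0.002 and 0.749, in line with the reported values.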

Table A1. Pairwise significance tests. PBS = paired bootstrap on Macro-F1; McN = McNemar’s test. * p < 0.05; ** p < 0.01.

Comparison (A vs. B)	PBS $Δ$F1	PBS $p$	PBS 95% CI	McNemar $χ^{2}$	McNemar $p$	McNemar $n_{AB} / n_{BA}$
CharCNN vs. Morph-only	−0.036	0.068	[−0.073, 0.002]	9.55	0.002 **	67/109
Morph-only vs. Context-only	−0.049	0.027 *	[−0.092, −0.006]	1.91	0.167	94/115
LR vs. Gated fusion	−0.046	0.035 *	[−0.089, −0.003]	3.41	0.065	93/121
Context-only vs. Gated fusion	−0.023	0.210	[−0.059, 0.013]	2.01	0.157	63/81
Concat vs. Gated fusion	−0.011	0.509	[−0.043, 0.022]	3.25	0.071	51/72
Gated vs. Ensemble	−0.002	0.869	[−0.031, 0.026]	0.10	0.749	46/42

Appendix E. Cross-Lingual Projection Breakdown

Table A2 provides a detailed breakdown of the Russian cross-lingual alignment by difficulty band and part of speech, supplementing the aggregate metrics reported in Section 3.3.

Table A2. Russian cross-lingual alignment by difficulty band and POS.

Stratum	n	r	Exact%	MAE	Mean Diff
Band
A (A1–A2)	330	–	41.5%	0.988	−0.806
B (B1–B2)	302	–	34.8%	0.947	+0.391
C (C1)	108	–	12.0%	1.602	+1.528
POS
ADJ	101	0.464	–	0.941	−0.010
NOUN	428	0.427	–	1.063	+0.035
VERB	118	0.351	–	1.169	−0.203
ADV	40	0.378	–	0.850	+0.100

References

1. Council of Europe. Common European Framework of Reference for Languages: Learning, Teaching, Assessment; Cambridge University Press: Cambridge, UK, 2001.
2. François, T.; Gala, N.; Watrin, P.; Fairon, C. FLELex: A Graded Lexical Resource for French Foreign Learners. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; pp. 3766–3773.
3. François, T. The CEFRLex Project: Multilingual CEFR-Graded Lexical Resources. 2021. Available online: https://cental.uclouvain.be/cefrlex/ (accessed on 11 March 2026).
4. Shardlow, M.; Evans, R.; Paetzold, G.H.; Zampieri, M. SemEval-2021 Task 1: Lexical Complexity Prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 1–16.
5. Pan, C.; Song, B.; Wang, S.; Luo, Z. DeepBlueAI at SemEval-2021 Task 1: Lexical Complexity Prediction with A Deep Ensemble Approach. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, August 2021; Palmer, A., Schneider, N., Schluter, N., Emerson, G., Herbelot, A., Zhu, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 578–584.
6. Mosquera, A. Alejandro Mosquera at SemEval-2021 Task 1: Exploring Sentence and Word Features for Lexical Complexity Prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, August 2021; Palmer, A., Schneider, N., Schluter, N., Emerson, G., Herbelot, A., Zhu, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 554–559.
7. Shardlow, M.; North, K.; Zampieri, M. Multilingual Resources for Lexical Complexity Prediction: A Review. In Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024, Torino, Italia, May 2024; Nunzio, G.M.D., Vezzani, F., Ermakova, L., Azarbonyad, H., Kamps, J., Eds.; ELRA and ICCL: Paris, France, 2024; pp. 51–59.
8. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 8440–8451.
9. Center for East European and Russian/Eurasian Studies. Kazakh, n.d. Available online: https://ceeres.uchicago.edu/languages/kazakh (accessed on 11 March 2026).
10. Washington, J.; Salimzyanov, I.; Tyers, F. Finite-state morphological transducers for three Kypchak languages. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, May 2014; Calzolari, N., Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., Piperidis, S., Eds.; European Language Resources Association (ELRA): Paris, France, 2014; pp. 3378–3385.
11. Kurmasheva, L.; Kurmashev, I.; Kulikov, V.; Kulikova, V.; Tajigitov, A. The Use of Data Mining in the Management of the Career Guidance Work of the University. Ann. Data Sci. 2025, 12, 1923–1940.
12. Kulikova, V.; Iklassova, K.; Kazanbayeva, A. Development of a decision making method to form the indicators for a university development plan. East.-Eur. J. Enterp. Technol. 2019, 3, 12–21.
13. Kulikov, V.; Kulikova, V.; Yerkebulan, G. Google/Yandex Translation Detection in the Patterns Identifying System of Multilingual Texts. Int. J. Comput. 2021, 20, 72–77.
14. Yerkebulan, G.; Kulikova, V.; Kulikov, V.; Kulsharipova, Z. Devising an entropy-based approach for identifying patterns in multilingual texts. East.-Eur. J. Enterp. Technol. 2021, 2, 16–22.
15. Makhazhanova, U.; Kerimkhulle, S.; Mukhanova, A.; Bayegizova, A.; Aitkozha, Z.; Mukhiyadin, A.; Tassuov, B.; Saliyeva, A.; Taberkhan, R.; Azieva, G. The Evaluation of Creditworthiness of Trade and Enterprises of Service Using the Method Based on Fuzzy Logic. Appl. Sci. 2022, 12, 11515.
16. Akanova, A.; Ospanova, N.; Sharipova, S.; Mauina, G.; Abdugulova, Z. Development of a thematic and neural network model for data learning. East.-Eur. J. Enterp. Technol. 2022, 4, 40–50.
17. Bani Yaseen, T.; Ismail, Q.; Al-Omari, S.; Al-Sobh, E.; Abdullah, M. JUST-BLUE at SemEval-2021 Task 1: Predicting Lexical Complexity using BERT and RoBERTa Pre-Trained Language Models. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 661–666.
18. Zaharia, G.E.; Cercel, D.C.; Dascalu, M. UPB at SemEval-2021 Task 1: Combining Deep Learning and Hand-Crafted Features for Lexical Complexity Prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Online, August 2021; Palmer, A., Schneider, N., Schluter, N., Emerson, G., Herbelot, A., Zhu, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 609–616.
19. Ortiz-Zambrano, J.A.; Espín-Riofrío, C.H.; Montejo-Ráez, A. Deep Encodings vs. Linguistic Features in Lexical Complexity Prediction. Neural Comput. Appl. 2025, 37, 1171–1187.
20. Kyle, K.; Crossley, S.A. Automatically Assessing Lexical Sophistication: Indices, Tools, Findings, and Application. TESOL Q. 2015, 49, 757–786.
21. Aleksandrova, D.; Pouliot, V. CEFR-based Contextual Lexical Complexity Classifier in English and French. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), Toronto, Canada, July 2023; Kochmar, E., Burstein, J., Horbach, A., Laarmann-Quante, R., Madnani, N., Tack, A., Yaneva, V., Yuan, Z., Zesch, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 518–527.
22. Zhang, X.; Lu, X. Aligning linguistic complexity with the difficulty of English texts for L2 learners based on CEFR levels. Stud. Second Lang. Acquis. 2025, 47, 1407–1434.
23. Qiu, L.; Guo, S.; Wong, T.S.; Chersoni, E.; Lee, J.; Huang, C.R. CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin and Cantonese. In Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), Miami, Florida, USA, November 2024; Shardlow, M., Saggion, H., Alva-Manchego, F., Zampieri, M., North, K., Štajner, S., Stodden, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 20–26.
24. Ayman, N.; Hossain, M.A.; Aziz, A.; Faruqui, R.U.; Chy, A.N. BengaliLCP: A Dataset for Lexical Complexity Prediction in the Bengali Texts. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italia, May 2024; Calzolari, N., Kan, M.Y., Hoste, V., Lenci, A., Sakti, S., Xue, N., Eds.; European Language Resources Association (ELRA) and ICCL: Paris, France, 2024; pp. 2227–2237.
25. Leewis, S.; Smit, K.; van de Hoef, A.; Hartman, F.; Kuiper, N.; Todorova, J.; Gerritsen, T. Enhancing Digital Accessibility through Language-Level Classification and Adaptation: Exploring the Role of Large Language Models for Inclusive Language. In Proceedings of the 2025 9th International Conference on Software and E-Business, New York, NY, USA, 7–9 November 2025; ICSeB ’25, pp. 17–23.
26. CENTAL. CEFRLex Online Lexical Difficulty Analyser. 2021. Available online: https://cental.uclouvain.be/cefrlex/analyse/ (accessed on 11 March 2026).
27. Graën, J.; Alfter, D.; Schneider, G. Using Multilingual Resources to Evaluate CEFRLex for Learner Applications. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 346–355.
28. Oxford University Press. The Oxford 3000 by CEFR Level. 2020. Available online: https://www.oxfordlearnersdictionaries.com/wordlists/oxford3000-5000 (accessed on 11 March 2026).
29. Kilgarriff, A.; Charalabopoulou, F.; Gavrilidou, M.; Bondi Johannessen, J.; Khalil, S.; Johansson Kokkinakis, S.; Lew, R.; Sharoff, S.; Vadlapudi, R.; Volodina, E. Corpus-Based Vocabulary Lists for Language Learners for Nine Languages. Lang. Resour. Eval. 2014, 48, 121–163.
30. Vannest, J.; Newport, E.; Newman, A.; Bavelier, D. Interplay Between Morphology and Frequency in Lexical Access: The Case of the Base Frequency Effect. Brain Res. 2011, 1373, 144–159.
31. Cotterell, R.; Kirov, C.; Hulden, M.; Eisner, J. On the Complexity and Typology of Inflectional Morphological Systems. Trans. Assoc. Comput. Linguist. 2019, 7, 327–342.
32. Schreuder, R.; Baayen, R.H. Modeling Morphological Processing. In Morphological Aspects of Language Processing; Feldman, L.B., Ed.; Lawrence Erlbaum: Hillsdale, NJ, USA, 1995; pp. 131–154.
33. Kessikbayeva, G.; Cicekli, I. Rule Based Morphological Analyzer of Kazakh Language. In Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM, Baltimore, Maryland, June 2014; Çetinoğlu, Ö., Heinz, J., Maletti, A., Riggle, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 46–54.
34. Kessikbayeva, G.; Cicekli, I. A Rule Based Morphological Analyzer and a Morphological Disambiguator for Kazakh Language. Linguist. Lit. Stud. 2016, 4, 96–104.
35. Yessenbayev, Z.; Kozhirbayev, Z.; Makazhanov, A. KazNLP: A Pipeline for Automated Processing of Texts Written in Kazakh Language. In Proceedings of the Speech and Computer: 22nd International Conference, SPECOM 2020, St. Petersburg, Russia, 7–9 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 657–666.
36. Akanova, A.; Ospanova, N.; Kukharenko, Y.; Abildinova, G. Development of the algorithm of keyword search in the Kazakh language text corpus. East.-Eur. J. Enterp. Technol. 2019, 5, 26–32.
37. Akanova, A.; Ismailova, A.; Oralbekova, Z.; Kenzhebayeva, Z.; Anarbekova, G. Neurocomputer System of Semantic Analysis of the Text in the Kazakh Language. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2024, 23, 1–15.
38. Akhmed-Zaki, D.; Mansurova, M.; Madiyeva, G.; Kadyrbek, N.; Kyrgyzbayeva, M. Development of the Information System for the Kazakh Language Preprocessing. Cogent Eng. 2021, 8, 1896418.
39. Parhat, S.; Ablimit, M.; Hamdulla, A. A Robust Morpheme Sequence and Convolutional Neural Network-Based Uyghur and Kazakh Short Text Classification. Information 2019, 10, 387.
40. Yeshpanov, R.; Polonskaya, A.; Varol, H.A. KazParC: Kazakh Parallel Corpus for Machine Translation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 9633–9644.
41. Toral, A.; Edman, L.; Yeshmagambetova, G.; Spenader, J. Neural Machine Translation for English–Kazakh with Morphological Segmentation and Synthetic Data. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), Florence, Italy, August 2019; Bojar, O., Chatterjee, R., Federmann, C., Fishel, M., Graham, Y., Haddow, B., Huck, M., Yepes, A.J., Koehn, P., Martins, A., et al., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 386–392.
42. Villegas-Ch, W.; Gutierrez, R.; Maldonado Navarro, A.; Mera-Navarrete, A. Evaluating Neural Network Models for Word Segmentation in Agglutinative Languages: Comparison with Rule-Based Approaches and Statistical Models. IEEE Access 2024, 12, 157556–157573.
43. Oflazer, K. Two-level Description of Turkish Morphology. Lit. Linguist. Comput. 1994, 9, 137–148.
44. Arican, B.; Kuzgun, A.; Marşan, B.; Aslan, D.B.; Saniyar, E.; Cesur, N.; Kara, N.; Kuyrukcu, O.; Ozcelik, M.; Yenice, A.B.; et al. Morpholex Turkish: A Morphological Lexicon for Turkish. In Proceedings of the Globalex Workshop on Linked Lexicography Within the 13th Language Resources and Evaluation Conference, Marseille, France, June 2022; Kernerman, I., Krek, S., Eds.; European Language Resources Association: Paris, France, 2022; pp. 68–74.
45. Acı, M.; Vuran Sarı, N.; İnan Acı, Ç. Morphological and Structural Complexity Analysis of Low-resource English–Turkish Language Pair Using Neural Machine Translation Models. PeerJ Comput. Sci. 2025, 11, e3072.
46. Toraman, C.; Yilmaz, E.H.; Şahinuç, F.; Ozcelik, O. Impact of Tokenization on Language Models: An Analysis for Turkish. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 1–21.
47. Goldhahn, D.; Eckart, T.; Quasthoff, U. Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, May 2012; Calzolari, N., Choukri, K., Declerck, T., Doğan, M.U., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., Piperidis, S., Eds.; European Language Resources Association (ELRA): Paris, France, 2012; pp. 759–765.
48. Balabekov, A.Q.; Dauletkereyeva, N.Z.; Demessinova, L.M.; Iskakova, Z.M.; Kaliyeva, A.M.; Mussayeva, G.A.; Suyinzhanova, Z.K. Qazaq Tilinin Leksika-Grammatikalyq Minimumy. A1–C1 Dengeiler; Five-Volume Kazakh Lexical Minimum Handbook Series for CEFR Levels A1, A2, B1, B2, and C1; Sh. Shayakhmetov “Til-Qazyna” National Scientific and Practical Center: Astana, Kazakhstan, 2024.
49. Dürlich, L.; François, T. EFLLex: A Graded Lexical Resource for Learners of English as a Foreign Language. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018.
50. Tiedemann, J.; Thottingal, S. OPUS-MT—Building open translation services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, Lisboa, Portugal, November 2020; Martins, A., Moniz, H., Fumega, S., Martins, B., Batista, F., Coheur, L., Parra, C., Trancoso, I., Turchi, M., Bisazza, A., et al., Eds.; European Association for Machine Translation: Geneva, Switzerland, 2020; pp. 479–480.
51. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
52. Qwen Team. Qwen2.5 Technical Report. arXiv 2024, arXiv:2412.15115.
53. Koto, F.; Joshi, R.; Mukhituly, N.; Wang, Y.; Xie, Z.; Pal, R.; Orel, D.; Mullah, P.; Turmakhan, D.; Goloburda, M.; et al. Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting. arXiv 2025, arXiv:2503.01493.
54. Institute of Smart Systems and Artificial Intelligence. Kazakh Large Language Model ISSAI KAZ-LLM, 2024. Model Page for ISSAI KAZ-LLM. Available online: https://issai.nu.edu.kz/kazllm/ (accessed on 20 May 2026).
55. Cao, W.; Mirjalili, V.; Raschka, S. Rank consistent ordinal regression for neural networks with application to age estimation. Pattern Recognit. Lett. 2020, 140, 325–331.
56. Koehn, P. Statistical Significance Tests for Machine Translation Evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2004; pp. 388–395.

Figure 1. Compact overview of the audited dual-encoder architecture used in the main neural experiments. The context branch encodes the marked sentence with XLM-RoBERTa-base, while the lexical branch combines CharCNN and MorphMLP representations before gated fusion and CEFR classification.
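
The gated fusion step named in Figure 1 can be illustrated with a small sketch that blends a context vector and a lexical vector through an element-wise sigmoid gate. The gate form, the function names, and the toy dimensions below are assumptions made for illustration only; the article's exact parameterization may differ.

```python
import math

def sigmoid(x):
    """Logistic function used to squash gate pre-activations into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(context, lexical, weights, bias):
    """Blend two equal-length vectors with an element-wise gate.

    Assumed form (hypothetical, not taken from the article):
        g     = sigmoid(W @ [context; lexical] + b)
        fused = g * context + (1 - g) * lexical
    """
    joint = list(context) + list(lexical)
    fused = []
    for i, (c, l) in enumerate(zip(context, lexical)):
        pre = sum(w * x for w, x in zip(weights[i], joint)) + bias[i]
        g = sigmoid(pre)
        fused.append(g * c + (1.0 - g) * l)
    return fused

# Toy example: 2-dimensional branches with zero gate parameters,
# so every gate value is 0.5 and the output is the plain average.
context_vec = [1.0, 0.0]   # stands in for the XLM-R context branch
lexical_vec = [0.0, 1.0]   # stands in for the CharCNN + MorphMLP branch
W = [[0.0] * 4 for _ in range(2)]
b = [0.0, 0.0]
fused_vec = gated_fusion(context_vec, lexical_vec, W, b)
print(fused_vec)
```

With trained gate parameters, each dimension can lean toward the context branch or the lexical branch independently, which is the property that distinguishes gating from plain concatenation.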

Table 1. Kazakh CEFR lexicon summary.

Level	Noun	Verb	Adjective	Adverb	Numeral	Pronoun	Other	Total
A1	373	135	79	26	30	15	304	962
A2	310	103	77	36	26	19	126	697
B1	402	180	112	56	25	20	95	890
B2	362	166	77	46	7	10	223	891
C1	537	227	150	29	3	12	163	1121
Total	1984	811	495	193	91	76	911	4561

Table 2. Example lexicon entries by CEFR level. Glosses are approximate English translations.

Level	Lemma	POS	Gloss
A1	су	Noun	water
A1	бару	Verb	to go
A2	ауа	Noun	air, weather
A2	тілек	Noun	wish
B1	байланыс	Noun	connection
B1	қоғамдық	Adjective	public, social
B2	тәуелсіздік	Noun	independence
B2	қалыптастыру	Verb	to form, develop
C1	жаһандану	Noun	globalization
C1	ықпалдастық	Noun	integration

Table 3. Cross-lingual CEFR projection resources.

Source Lexicon	Entries	MT Pipeline	Kazakh Coverage
KELLY^ru [29]	8947	Tilmash [40]	17.0%
EFLLex^en [49]	13,871	OPUS-MT [50]	39.8%
EFLLex^en [49]	13,871	Tilmash→OPUS (pivot)	19.9%

Note: Only single-token translations matching source lexicons yield silver labels.

Table 4. Number of entries per CEFR level in each data split.

Level	Train	Dev	Test	Total
A1	663	142	142	947
A2	478	102	103	683
B1	606	130	130	866
B2	607	130	130	867
C1	751	162	161	1074
Total	3105	666	666	4437

Note: Stratified sampling preserves the global class distribution across training, development, and test partitions.
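
A per-level split of this kind can be sketched as follows. The 70/15/15 ratio is inferred from the counts in Table 4 rather than stated there, and the function name, seed, and rounding rule are hypothetical, so the resulting sizes are close to but not necessarily identical to the published split.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=0):
    """Return train/dev/test index lists that keep each label's share roughly constant."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)                      # shuffle within each CEFR level
        n_train = round(len(idxs) * ratios[0])
        n_dev = round(len(idxs) * ratios[1])
        train += idxs[:n_train]
        dev += idxs[n_train:n_train + n_dev]
        test += idxs[n_train + n_dev:]         # remainder goes to the test split
    return train, dev, test

# Per-level totals taken from Table 4.
level_totals = {"A1": 947, "A2": 683, "B1": 866, "B2": 867, "C1": 1074}
labels = [lvl for lvl, n in level_totals.items() for _ in range(n)]
train_idx, dev_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(dev_idx), len(test_idx))
```

Splitting inside each level, instead of over the pooled entries, is what keeps the class distribution the same in all three partitions.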

Table 5. Feature groups, dimensions, and Macro-F1 on the development set when each group is used on its own (Alone). Among the individual groups, morphological features score highest.

Group	Key Intuition	Dims	Alone	Source
Orthographic	Length, vowel ratio, rare chars	6	0.265	Lemma
POS	Content word flags	6	0.178	Lexicon
Frequency	Log-freq, corpus coverage	4	0.312	Leipzig
Morphological	Analysis ambiguity, deriv depth	19	0.335	Apertium
TF-IDF	Orthographic distinctiveness	5	0.221	n-grams
Full handcrafted	Combined feature bank	40	0.345	–
Embeddings-only	XLM-R PCA(50)	50	0.305	XLM-R

Table 6. Held-out test results for the final model configurations on the fixed split.

Model	Setup	Acc	Macro-F1	MAE
Frequency-only LR	Surface/lemma frequency cues only	0.312	0.173	1.626
Embeddings-only LR	XLM-R PCA(50)	0.309	0.298	1.411
LR	40 handcrafted features	0.345	0.314	1.261
RF	40 handcrafted features	0.321	0.300	1.263
GB	40 handcrafted features	0.309	0.287	1.279
Gated fusion	XLM-R + CharCNN + MorphMLP	0.387	0.360	1.125
Ensemble	ElasticNet stack of GB, RF, ET, Gated_probs	0.381	0.363	1.105

Table 7. Direct-prompt LLM baselines on the public held-out split. Prompt language indicates the language of the instruction template; all prompts include the Kazakh lemma, POS tag, and closed CEFR label set.

Model	Prompt	Acc	Macro-F1	MAE
Qwen2.5-0.5B-Instruct	English	0.228	0.159	1.304
Qwen2.5-0.5B-Instruct	Kazakh	0.206	0.109	1.325
Qwen2.5-7B-Instruct	English	0.241	0.124	1.757
Qwen2.5-7B-Instruct	Kazakh	0.225	0.110	1.925
Sherkala-Chat 8B	English	0.212	0.191	1.297
Sherkala-Chat 8B	Kazakh	0.203	0.164	1.122
ISSAI KAZ-LLM 8B	English	0.259	0.214	1.271
ISSAI KAZ-LLM 8B	Kazakh	0.088	0.116	1.292

Table 8. Neural architecture ablation (mean ± std over 5 training seeds).

Model	Macro-F1	MAE
CharCNN-only	0.239 ± 0.009	1.456 ± 0.036
Morph-only (CharCNN+MLP)	0.303 ± 0.013	1.240 ± 0.044
XLM-R context-only	0.335 ± 0.008	1.184 ± 0.029
Concat fusion	0.346 ± 0.007	1.144 ± 0.020
Gated fusion	0.346 ± 0.008	1.149 ± 0.013

Table 9. Per-CEFR-level F1 scores for selected models.

Model	A1	A2	B1	B2	C1
Full-feature LR	0.434	0.187	0.198	0.278	0.474
Gated fusion (XLM-R)	0.519	0.194	0.267	0.336	0.485
Ensemble_stack	0.567	0.239	0.256	0.286	0.465
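
As a consistency check, Macro-F1 in its standard definition is the unweighted mean of the per-class F1 scores, so the rows of Table 9 should average to the Macro-F1 values in Table 6. The short sketch below redoes that arithmetic; the variable and function names are illustrative.

```python
# Per-level F1 scores copied from Table 9 (order: A1, A2, B1, B2, C1).
per_level_f1 = {
    "Full-feature LR": [0.434, 0.187, 0.198, 0.278, 0.474],
    "Gated fusion (XLM-R)": [0.519, 0.194, 0.267, 0.336, 0.485],
    "Ensemble_stack": [0.567, 0.239, 0.256, 0.286, 0.465],
}

def macro_f1(scores):
    """Unweighted mean of per-class F1 scores."""
    return sum(scores) / len(scores)

macro = {name: round(macro_f1(s), 3) for name, s in per_level_f1.items()}
for name, value in macro.items():
    print(f"{name}: {value:.3f}")
```

The rounded means are 0.314, 0.360, and 0.363, matching the LR, Gated fusion, and Ensemble rows of Table 6.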

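Table 10. Error distance distribution on the held-out test set: share of predictions that are exact, off by one level (±1), or off by two or more levels (≥2). W-1 is within-1 accuracy (Exact plus ±1).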

Model	Exact	±1	≥2	W-1
Full-feature LR	0.345	0.293	0.362	0.638
Gated fusion (XLM-R)	0.387	0.291	0.321	0.679
Ensemble_stack	0.381	0.311	0.308	0.692
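
The columns of Table 10 follow from the absolute distance between gold and predicted levels on the ordinal A1–C1 scale. A minimal sketch, using made-up toy labels rather than the article's predictions and hypothetical names throughout:

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1"]
RANK = {lvl: i for i, lvl in enumerate(LEVELS)}  # ordinal position of each level

def error_distance_report(gold, pred):
    """Share of exact, off-by-one, and off-by-two-or-more predictions, plus W-1 and MAE."""
    dists = [abs(RANK[g] - RANK[p]) for g, p in zip(gold, pred)]
    n = len(dists)
    exact = sum(d == 0 for d in dists) / n
    off_by_one = sum(d == 1 for d in dists) / n
    far = sum(d >= 2 for d in dists) / n
    return {
        "exact": exact,
        "pm1": off_by_one,
        "ge2": far,
        "w1": exact + off_by_one,   # within-1 accuracy
        "mae": sum(dists) / n,      # mean absolute error in CEFR steps
    }

# Toy labels for illustration only.
gold = ["A1", "A2", "B1", "B2", "C1", "C1", "A1", "B1"]
pred = ["A1", "B1", "B1", "C1", "A2", "C1", "B1", "B1"]
report = error_distance_report(gold, pred)
print(report)
```

On these toy labels the distances are 0, 1, 0, 1, 3, 0, 2, 0, giving exact = 0.5, ±1 = 0.25, ≥2 = 0.25, W-1 = 0.75, and MAE = 0.875.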

Table 11. Cross-lingual CEFR alignment between Kazakh and source languages (RQ3).

Source	n	Cov.%	r	MAE	Exact%	W-1%
Russian	740	17.0%	0.412	1.061	34.5%	72.3%
English	1731	39.8%	0.073	1.579	21.8%	51.6%
En (pivot)	866	19.9%	0.063	1.598	21.0%	50.6%
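
The r column can be read as a correlation between Kazakh levels and projected source-language levels once both are coded as integers. The sketch below assumes Pearson's r with A1 = 1 through C1 = 5; the table does not say which coefficient is used, so a rank correlation is equally possible, and the data here are toy values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy integer-coded levels (A1 = 1 ... C1 = 5); not taken from the article.
kazakh_levels = [1, 2, 3, 4, 5, 3]
projected_levels = [1, 3, 3, 5, 4, 2]
r = pearson_r(kazakh_levels, projected_levels)
print(round(r, 3))  # 0.8
```

A value near zero, as in the English rows of Table 11, means the projected labels carry almost no linear information about the Kazakh level even when coverage is higher.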

