1. Introduction
The Common European Framework of Reference for Languages, or CEFR, provides a six-level proficiency scale, A1 to C2, widely used in language teaching and assessment [1]. CEFR-graded lexical resources are central to curriculum design, adaptive learning systems, and text simplification. The development of the first CEFR-graded lexicon for the Kazakh language directly supports the implementation of Sustainable Development Goal 4 (Quality Education). Creating such resources enhances the accessibility of high-quality linguistic tools and optimizes educational processes, which is particularly vital for low-resource languages. For major European languages, comprehensive lexicons such as CEFRLex cover around 13,000 lemma–POS entries each [2,3], and recent computational work on lexical complexity prediction (LCP) has shown strong progress with both feature-based and transformer-based methods [4,5,6].
However, this progress has largely excluded agglutinative, low-resource languages. No CEFR-graded lexicon exists for any Turkic language [7], and existing LCP models have been evaluated almost exclusively on morphologically simple languages. In agglutinative languages like Kazakh, Turkish, or Finnish, the relationship between lexical and morphological complexity is non-trivial. A single lemma such as бала ("child") can yield hundreds of suffixed forms, for example балаларымыздан ("from our children"), whose morphological structure affects learnability and difficulty.
Standard multilingual encoders such as XLM-RoBERTa [8], although effective cross-lingually, rely on subword tokenization that often fragments these complex wordforms, obscuring morphology relevant to lexical difficulty. Whether morphology-aware representations can improve CEFR-level prediction in such languages remains an open question.
To address this gap, we focus on Kazakh, a Central Turkic language spoken by about 12 million people [9] and one with growing demand for CEFR-aligned pedagogical resources. We construct the first CEFR-graded lexicon for Kazakh. The handbook extraction yields 4561 lemma–POS entries across five levels, and the cleaned modeling set used in our experiments contains 4437 unique lemma–POS entries. Our supervised experiments contrast interpretable handcrafted features derived from Helsinki Finite-State Technology (HFST)-Apertium analysis [10], frequency, and orthography with contextual encoders and morphology–context fusion models. We also examine cross-lingual CEFR projection from Russian and English as a diagnostic signal. This study contributes to the expanding application of data mining within professional educational environments, where machine learning is used to optimize pedagogical processes [11] and manage institutional tasks such as career guidance [11]. Such assessment tools are also vital for establishing strategic university development indicators through formal decision-making methods [12].
Furthermore, the current work builds on established methodologies for multilingual text analysis, specifically the detection of patterns in machine-translated texts [13] and the use of entropy-based measures for identifying structural regularities across diverse languages [14]. These previous findings provide a technical foundation for our current focus on Kazakh morphological complexity.
The challenge of classifying objects by complexity or risk levels often requires accounting for multiple intersecting features. While fuzzy logic-based approaches are successfully applied in the financial sector to evaluate qualitative characteristics, such as the creditworthiness of service enterprises [15], the field of natural language processing (NLP) increasingly utilizes neural architectures for similar purposes to capture hidden dependencies in data. This approach is further supported by research into the development of hybrid thematic and neural network models for data learning [16], which demonstrates the effectiveness of combining structural and probabilistic signals in low-resource language processing.
We address three explicit research questions:
RQ1: How strong are interpretable handcrafted features for Kazakh CEFR prediction relative to an embeddings-only baseline?
RQ2: Does gated fusion of contextual and lexical representations improve over the classical baselines, and which architectural components contribute to the gain?
RQ3: How far can cross-lingual projection transfer CEFR information from Russian and English lexical resources to Kazakh?
The primary contribution of this work is a new resource together with baseline experiments that establish initial benchmarks; we do not claim state-of-the-art modeling advances. Specifically:
Resource. We construct the first CEFR-graded lexicon for any Turkic language. The handbook extraction yields 4561 lemma–POS entries across five levels (A1–C1), and the cleaned modeling set contains 4437 unique lemma–POS entries. This lexicon underpins the present experiments and can support future work on lexical complexity prediction, curriculum design, and adaptive language learning for Kazakh and potentially other Turkic languages.
Baseline study. We establish initial benchmarks by comparing handcrafted morphological features, contextual embeddings, and fusion models, providing a reference point for future systems on this dataset.
Empirical evidence. We show that engineered morphological features improve over character-level patterns alone, and that combining morphological and contextual representations yields the strongest supervised results in this setting. The gain comes from combining the two signal types rather than from a clearly superior fusion mechanism.
The remainder of the paper is organized as follows.
Section 2 reviews related work, describes the datasets, and defines the modeling and evaluation procedure.
Section 3 reports the supervised, diagnostic, cross-lingual, and LLM baseline results.
Section 4 discusses the empirical implications and limitations.
Section 5 concludes the paper.
2. Materials and Methods
This section first reviews related work that motivates our modeling choices (Section 2.1), then describes the language resources we compile (Section 2.2) and the supervised modeling pipeline and evaluation protocol applied to them (Section 2.3).
All computational experiments were implemented in Python 3.10.0 (Python Software Foundation, Wilmington, DE, USA). Classical models used scikit-learn 1.7.2, and neural models used PyTorch 2.9.1 (Meta Platforms, Inc., Menlo Park, CA, USA) and Hugging Face Transformers 4.46.1 (Hugging Face, Inc., Brooklyn, NY, USA). Neural experiments ran on a single NVIDIA L4 GPU (NVIDIA Corporation, Santa Clara, CA, USA).
2.1. Background
We organize prior work into four strands: lexical complexity prediction, CEFR-graded lexical resources, the role of morphology and frequency in lexical difficulty, and NLP for Kazakh and other agglutinative languages.
2.1.1. CEFR and Lexical Complexity Prediction
Automatic prediction of lexical complexity has received increasing attention, particularly since the SemEval-2021 Lexical Complexity Prediction shared task, which introduced the CompLex English corpus annotated on a 5-point Likert scale for both single words and multiword expressions [4]. The best-performing systems combined multiple pre-trained language models with pseudo-labeling and data augmentation, demonstrating that deep ensembles can achieve strong performance even with limited task-specific data [5]. Feature-based approaches have also proven competitive: Mosquera achieved a top-three ranking using a combination of lexical, contextual, and semantic features with traditional regression models [6], while JUST-BLUE and the University Politehnica of Bucharest (UPB) SemEval system showed that combinations of deep learning and hand-crafted features, including frequency, length, n-gram counts, and psycholinguistic norms, can remain highly competitive when carefully engineered [17,18].
Beyond the shared task, several studies have compared deep encodings with linguistic features for lexical complexity prediction. For example, Ortiz-Zambrano et al. [19] report that hybrid models combining handcrafted features with BERT/XLM-R embeddings yield substantial improvements over features alone, but also find that purely neural models are not uniformly superior and that handcrafted features remain fundamental, especially in low-resource settings. This resonates with findings in educational NLP that simple features such as syllable count, frequency bands, and orthographic length capture much of the variance in lexical difficulty [20].
Recent work has also extended lexical complexity prediction to contextual CEFR classification for language learning applications. Aleksandrova and Pouliot [21] develop a CEFR-based contextual classifier for English and French, using transformer-based representations to disambiguate polysemous lexical items in sentential context. Their model is deployed in the Mauril language learning application, demonstrating practical utility in adaptive vocabulary selection. At the text level, Zhang and Lu [22] present a Random Forest model that aligns linguistic complexity measures with CEFR-based difficulty of English texts, achieving 82.6% accuracy in classifying texts into coarse A/B/C proficiency bands and 62.6% at the six-level CEFR granularity, highlighting the predictive value of linguistically grounded features.
Multilingual extensions have broadened coverage beyond English: CompLex-ZH introduces lexical complexity datasets for Mandarin and Cantonese [23], while BengaliLCP focuses on Bengali lexical complexity [24]. Furthermore, recent research has explored the application of Large Language Models (LLMs) to enhance digital accessibility through automated language-level classification and adaptation [25], demonstrating the potential for creating more inclusive digital environments. These works show that lexical complexity prediction is feasible across typologically diverse languages. However, CEFR-based lexical difficulty for Turkic or other agglutinative low-resource languages remains an open question, as existing approaches have been evaluated almost exclusively on morphologically simpler languages.
2.1.2. CEFR Lexical Resources
The CEFRLex project represents the most comprehensive effort to create CEFR-graded lexical resources for multiple languages [2]. It provides receptive lexicons, derived from textbooks and simplified readers, and productive lexicons, derived from learner corpora, for English (EFLLex), French (FLELex), Swedish (SVALex, SweLLex), Dutch (NT2Lex), Spanish (ELELex), and German (DAFlex), each containing on the order of 13,000 lemma–POS entries with normalized frequency distributions across the six CEFR levels [2,3]. These resources underpin online lexical difficulty analyzers that can estimate text-level CEFR difficulty by aggregating word-level information [26].
The validity of these lexicons has been examined using independent gold standards and multilingual comparisons. Gräen et al. [27] compare EFLLex against external pedagogical resources such as the English Vocabulary Profile (EVP) and the Global Scale of English (GSE), finding that the English CEFRLex resource is broadly in accordance with them. They also exploit multilingual resources and translation probabilities to examine consistency across English, French, and Swedish, providing evidence that CEFR-based lexical difficulty can be aligned across languages via translation [27]. This supports the broader hypothesis that lexical difficulty information can transfer across related pedagogical resources, although evidence outside major European languages remains limited.
In addition to CEFRLex, public vocabulary lists such as the Oxford 3000 by CEFR level provide graded vocabulary for English [28]. Other projects such as KELLY provide frequency-based CEFR assignments for several European languages [29]. However, none of these lexical resources cover Kazakh or other Turkic languages. The absence of CEFR-graded lexicons for agglutinative, low-resource languages limits both empirical research and the development of adaptive learning tools. Our work addresses this gap by constructing the first CEFR-graded lexicon for Kazakh and by evaluating whether methods successful for European languages generalize to a typologically different setting.
2.1.3. Morphology, Frequency, and Lexical Difficulty
Psycholinguistic research has long established that morphological structure affects lexical access and perceived difficulty. Studies on base versus surface frequency show that both lemma frequency and wordform frequency influence recognition latencies and error rates [30]. High-frequency lemmas facilitate processing, but rare inflected or derived forms of those lemmas can still incur additional processing cost, especially when morphological structure is opaque. This interplay between morphology and frequency is particularly salient in morphologically rich languages, where productive inflection and derivation generate large paradigms from individual lemmas.
From a theoretical and typological perspective, Cotterell et al. [31] propose information-theoretic measures of inflectional paradigm complexity, showing that languages differ systematically in the entropy and predictability of their inflectional systems. Paradigm size, irregularity, and syncretism all contribute to morphological complexity, which in turn can affect difficulty for both native speakers and L2 learners. Experimental work has demonstrated that paradigm complexity modulates visual word recognition and that derivational and inflectional processes interact asymmetrically, with derivation often creating semantically less transparent forms that are harder to decompose [32].
Despite these findings, computational models of lexical complexity have typically incorporated only coarse morphological proxies, such as word length, character n-grams, or simple suffix counts, and have rarely used full morphological analyses or paradigm-level features. For agglutinative languages such as Kazakh or Turkish, this is a significant limitation. A simple A1 lemma can generate thousands of surface forms through productive suffixation, and rare complex forms may be perceived as substantially harder than the lemma’s nominal CEFR level would suggest. Our work explicitly targets this gap by integrating HFST-based morphological analyses into lexical complexity prediction for Kazakh and by examining how morphological richness interacts with lemma-level difficulty.
2.1.4. NLP for Kazakh and Agglutinative Languages
Kazakh is a Turkic language with rich agglutinative morphology, relatively free word order, and limited high-quality NLP resources. Several morphological analysis tools have been proposed. Early work by Kessikbayeva and Cicekli [33,34] developed rule-based morphological analyzers and a disambiguator for Kazakh based on two-level morphology, encoding Kazakh morphotactics and phonological alternations in finite-state transducers. The Apertium-kaz project provides an HFST-based morphological transducer for Kazakh with coverage reported around 90% on freely available corpora and high precision on a manually verified test set [10]. In parallel, the KazNLP toolkit offers data-driven morphological analysis and other preprocessing tools (normalization, tokenization, language identification) for Kazakh, built with CRF models and designed as a general-purpose NLP library [35]. Building on these foundational preprocessing capabilities, Akanova et al. [36] introduced a specialized algorithm for keyword search within Kazakh text corpora, which enhances the retrieval of thematic information in large-scale datasets. Such advancements in keyword extraction provide a technical framework for identifying salient linguistic patterns essential for complexity modeling.
These tools have been applied in various downstream tasks. Expanding these capabilities, Akanova et al. [37] developed a neurocomputer system specifically for the semantic analysis of Kazakh text, which enhances the understanding of complex linguistic relationships beyond simple morphological parsing. Akhmed-Zaki et al. [38] describe an information system for Kazakh language preprocessing that integrates morphological analysis, disambiguation, and wordform generation for applications in text analytics and lexicography. Morphological features have also been used in Kazakh text classification and short-text processing alongside convolutional neural networks [39]. However, none of these works consider CEFR-based lexical difficulty, and morphological analyzers have not been evaluated in the context of predicting learner-oriented difficulty scales.
Kazakh machine translation has seen rapid development, primarily driven by the creation of parallel corpora and neural models. The KazParC parallel corpus is the largest publicly available resource of its kind described in our references, containing 371,902 parallel sentences across Kazakh, English, Russian, and Turkish, and supporting the Tilmash neural MT system, which the authors report as competitive with commercial systems such as Google Translate and Yandex Translate on standard MT metrics [40]. Earlier work at WMT 2019 by Toral et al. [41] showed that incorporating morphological segmentation using Apertium improves English–Kazakh neural MT, mitigating data sparsity by breaking complex wordforms into morphologically meaningful units. Similar observations have been made for other low-resource agglutinative settings, where tokenization strategy and morphology-aware modeling remain important design choices [41,42].
Turkish, as a closely related Turkic language, provides a valuable reference point. Oflazer’s two-level description of Turkish morphology [43] remains a foundational finite-state account of Turkish morphotactics. More recent resources such as MorphoLex Turkish provide large-scale morphological lexicons with detailed information about root families, suffix frequencies, and morphological neighborhood structure [44]. Work on word-level segmentation in agglutinative languages using neural sequence models and transformer-based variants reports strong performance and improved handling of rich morphology [42]. Studies on Turkish NMT likewise emphasize that morphological complexity, tokenization strategy, and model architecture materially affect translation quality in low-resource settings [45]. Specifically, Toraman et al. [46] demonstrate that the granularity of tokenization can lead to significant variations in language model performance for Turkish, highlighting the non-trivial relationship between subword units and morphological structure.
Despite these advances, no prior work has combined Kazakh morphological analysis with CEFR-based lexical complexity prediction, nor has any study systematically compared feature-based and transformer-based models for lexical complexity prediction in a low-resource agglutinative setting. The present work fills this gap by (i) constructing the first CEFR-graded lexicon for Kazakh; (ii) integrating HFST-based morphological analyses and expanded frequency resources into a type-level lexical complexity prediction task; and (iii) empirically comparing traditional feature-based models and multilingual transformer architectures under realistic low-resource constraints.
2.2. Datasets
We compile three resources: (1) an expert-curated Kazakh CEFR-graded lexicon, (2) a monolingual frequency corpus from the Leipzig Corpora Collection (LCC) [47], and (3) cross-lingual reference mappings to Russian and English.
2.2.1. Kazakh CEFR-Graded Lexicon
We construct the first CEFR-graded lexicon for Kazakh from a state-sponsored pedagogical handbook series covering A1 through C1 [48]. The handbook extraction yields 4561 lemma–POS entries before cleaning and deduplication. Unlike frequency-derived resources like CEFRLex [2], our lexicon inherits prescriptive difficulty labels directly from expert-curated pedagogical minima. Using these minima as the gold standard aligns the model with Kazakhstan’s state educational standards. This supports the practical applicability of the system within the framework of Sustainable Development Goal 4 (Quality Education), enabling the automated creation of learning materials that correspond to the actual proficiency levels of students.
Table 1 summarizes this handbook-extracted inventory across CEFR levels and parts of speech.
Nouns constitute approximately 43% of the inventory, and their absolute share increases across levels, reflecting expansion into abstract and technical terms.
Table 1 also illustrates how closed classes (numerals, pronouns) concentrate in A1–B1, while adjectives peak at B1–C1.
Table 2 gives representative entries at each proficiency level. Full data cleaning details appear in Appendix A.
2.2.2. Monolingual Frequency Corpus
Frequency of occurrence is a strong predictor of lexical difficulty [20]. We compute frequency features and retrieve contextual sentences from a 17-million-token Kazakh corpus from the Leipzig Corpora Collection (LCC) [47]. Text is lowercased and tokenized using a Cyrillic-only regex to filter noise. We compute two frequency signals: an exact surface-form count, obtained by matching the lemma string exactly as it appears in the corpus, and a secondary lemma-level count, obtained by passing every corpus token through the Apertium-kaz lemmatizer before lookup. The exact surface-form count is our primary contextual lookup mechanism; the lemmatized count serves only as a fallback that partially mitigates agglutinative sparsity. Under exact surface-string matching, 38.4% of the 4350 distinct surface forms are attested, contributing both counts and contextual embeddings; the remaining 61.6% default to zero frequency and isolated-word encodings. Running the lemmatizer over the full corpus yields an expanded lemma-frequency map of approximately 418,000 unique lemma entries, which raises effective frequency coverage to approximately 83% of the cleaned modeling set and substantially mitigates the sparsity of exact surface-form matching.
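The two frequency signals and their coverage can be illustrated with a minimal sketch. The regex, the function names, and the suffix-stripping `toy_lemmatize` are assumptions made for illustration; in particular, `toy_lemmatize` is a hypothetical stand-in for the Apertium-kaz lemmatizer, and the toy corpus is not drawn from the LCC data.

```python
import re
from collections import Counter
from typing import Callable, Iterable

# Assumption: a Cyrillic-only token pattern covering the Russian range plus
# Kazakh-specific letters; the exact regex used in the study is not given.
CYRILLIC_TOKEN = re.compile(r"[а-яёәғқңөұүһі]+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only runs of Cyrillic letters."""
    return CYRILLIC_TOKEN.findall(text.lower())


def count_frequencies(
    sentences: Iterable[str],
    lemmatize: Callable[[str], str],
) -> tuple[Counter, Counter]:
    """Return (exact surface-form counts, lemma-level counts)."""
    surface, lemma = Counter(), Counter()
    for sentence in sentences:
        for token in tokenize(sentence):
            surface[token] += 1
            lemma[lemmatize(token)] += 1
    return surface, lemma


def coverage(entries: Iterable[str], counts: Counter) -> float:
    """Fraction of lexicon entries with a non-zero count."""
    entries = list(entries)
    return sum(1 for e in entries if counts[e] > 0) / len(entries)


def toy_lemmatize(token: str) -> str:
    """Hypothetical lemmatizer: strips one plural suffix, nothing more."""
    for suffix in ("лар", "лер", "дар", "дер", "тар", "тер"):
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token


corpus = ["Балалар мектепке барды.", "Аналар мен балалар отыр, 2024!"]
lexicon = ["бала", "ана", "мектепке", "тау"]
surface_counts, lemma_counts = count_frequencies(corpus, toy_lemmatize)
exact_cov = coverage(lexicon, surface_counts)  # only "мектепке" is attested
lemma_cov = coverage(lexicon, lemma_counts)    # "бала" and "ана" now match too
```

In this toy case the lemma-level lookup recovers entries whose citation form never occurs verbatim, which is the same effect the lemmatized map has on coverage at corpus scale.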
2.2.3. Cross-Lingual Reference Resources
To explore whether CEFR difficulty knowledge can be transferred from better-resourced languages to Kazakh, an approach that has shown promise for typologically related European languages [29], we project the 4350 distinct Kazakh surface forms into Russian and English CEFR lexicons via open-source machine translation (details in Section 2.3.5).
Table 3 summarizes the source lexicons and MT pipelines. For lemmas with multiple translations matching the source lexicon, we assign the median CEFR level as a silver label. Unmatched translations are dropped. Russian serves as a natural projection bridge due to extensive lexical borrowing and high Kazakh–Russian bilingualism.
2.2.4. Dataset Partitioning
The handbook extraction yields 4561 lemma–POS entries. After cleaning and deduplication, the modeling set contains 4437 unique lemma–POS entries. Specifically, 18 entries with non-Cyrillic characters or length < 2 were removed, and 106 duplicate lemma–POS keys were resolved. Because some lemmas share the same surface form across POS categories, these 4437 entries correspond to 4350 distinct surface forms; all frequency and corpus lookups operate at the surface-form level, while model training uses the full 4437 lemma–POS set. These 4437 entries are partitioned into training, development, and test sets using a 70/15/15 split, yielding 3105, 666, and 666 instances, respectively. Stratified sampling by CEFR level ensures that the class distribution is preserved across all partitions.
Table 4 reports the per-level counts.
The neural gated fusion model uses the 3105 training entries and reserves the development set for early stopping. Classical classifiers are evaluated on the same held-out test set of 666 entries.
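The partitioning step can be sketched as follows. This is a minimal stratified 70/15/15 split using only the standard library; the per-class rounding rule, the seed, and the function name `stratified_split` are assumptions, so partition sizes produced this way may differ by a few items from those reported in Table 4. The labels are synthetic.

```python
import random
from collections import defaultdict

LEVELS = ["A1", "A2", "B1", "B2", "C1"]


def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=13):
    """Split item indices into train/dev/test, preserving label shares.

    Assumption: sizes are rounded per class and the remainder goes to test.
    """
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, dev, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_dev = round(ratios[1] * len(indices))
        train += indices[:n_train]
        dev += indices[n_train:n_train + n_dev]
        test += indices[n_train + n_dev:]
    return train, dev, test


# Counts stated in the text: 4561 extracted, 18 invalid, 106 duplicate keys.
n_clean = 4561 - 18 - 106

# Synthetic, perfectly balanced labels for illustration only.
toy_labels = [LEVELS[i % 5] for i in range(1000)]
train_idx, dev_idx, test_idx = stratified_split(toy_labels)
```

The arithmetic check confirms that removing 18 invalid entries and resolving 106 duplicates from 4561 leaves the 4437 entries used for modeling.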
2.3. Methodology
This subsection outlines the modeling setup used to predict CEFR lexical difficulty for Kazakh. We define the task, describe the feature representations and model families, then present the cross-lingual projection analysis and evaluation protocol.
2.3.1. Task Formulation
We frame Kazakh CEFR lexical complexity prediction as ordinal five-class classification. Given a lemma–POS pair $(w, p)$ from the lexicon and, when available, a corpus sentence $s$ containing $w$, the task is to predict one of the ordered labels. Equation (1) presents the label space:
$$\mathcal{Y} = \{\mathrm{A1} < \mathrm{A2} < \mathrm{B1} < \mathrm{B2} < \mathrm{C1}\} \qquad (1)$$
encoded as consecutive integers that preserve this order. The task is type-level rather than token-level: each lemma–POS entry receives a single CEFR label regardless of its observed corpus contexts.
2.3.2. Feature Engineering
For the classical baselines, each lemma–POS pair is represented by a fixed 40-dimensional handcrafted vector organized into five groups. We also evaluate a separate 50-dimensional contextual embedding baseline obtained by applying PCA to XLM-RoBERTa-base representations. The neural model uses a richer engineered bank and retains a train-selected 72-dimensional lexical subset for the MorphMLP branch.
Table 5 summarizes the feature groups, and Appendix B provides the full definitions and extraction algorithm.
Among the fixed feature groups, the morphological features extracted from the Apertium HFST analyzer [10] provide the strongest isolated signal in Table 5, which fits Kazakh’s productive morphology. Frequency features from the 17-million-token LCC corpus also contribute useful information, while the embeddings-only baseline remains competitive without surpassing the best handcrafted configuration. Full technical details appear in Appendix B.
2.3.3. Classical Classifiers
We evaluate three classical classifiers from scikit-learn, each preceded by StandardScaler feature normalization:
Logistic Regression (LR): Multinomial L2-regularized; max_iter = 2000, inverse-class weights.
Random Forest (RF): 200 trees, max_depth = 15, inverse-frequency weights.
Gradient Boosting (GB): 200 rounds, max_depth = 6.
Each classifier is evaluated on the full 40-dimensional handcrafted feature set. To isolate the contribution of each feature group, we additionally report nested ablation experiments using frequency-only, orthographic-only, morphological-only, POS-only, and TF-IDF-only subsets (Table 5).
For the stacking ensemble only, we additionally train an ExtraTrees classifier as a diverse tree-based probability source. It is used inside the ensemble but is not reported as a standalone baseline.
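The three baseline configurations can be written down as a short scikit-learn sketch. Only the hyperparameters named above are set; everything else is left at library defaults, which is an assumption. Mapping "inverse-class weights" to `class_weight="balanced"` is likewise an interpretation rather than a confirmed setting, and `build_classifiers` is a hypothetical helper name.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_classifiers(seed=0):
    """Return the three classical baselines, each behind a StandardScaler."""
    return {
        # With the default lbfgs solver and L2 penalty, multiclass targets
        # are fit with a multinomial model. "balanced" weights classes by
        # inverse frequency (assumed to match the paper's weighting).
        "LR": make_pipeline(
            StandardScaler(),
            LogisticRegression(max_iter=2000, class_weight="balanced"),
        ),
        "RF": make_pipeline(
            StandardScaler(),
            RandomForestClassifier(
                n_estimators=200,
                max_depth=15,
                class_weight="balanced",
                random_state=seed,
            ),
        ),
        "GB": make_pipeline(
            StandardScaler(),
            GradientBoostingClassifier(
                n_estimators=200,
                max_depth=6,
                random_state=seed,
            ),
        ),
    }
```

Each pipeline is fit on the training split only, for example `build_classifiers()["LR"].fit(X_train, y_train)`, so that the scaler statistics never see development or test data.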
2.3.4. Gated Morphology–Context Fusion Model
To compare a learned fusion mechanism with simpler ways of combining contextual semantics and morphological structure, we use a dual-encoder architecture with gated fusion. Figure 1 summarizes the audited XLM-R configuration.
Context Encoder
XLM-RoBERTa-base (xlm-roberta-base) [8] encodes the sentential context. The target word is delimited with special [TGT] tokens, and its representation is obtained by mean-pooling over the corresponding subword positions in the final hidden layer, yielding a 768-dimensional vector (the hidden size of xlm-roberta-base). To mitigate overfitting on the training set of 3105 instances, we freeze most of the transformer and leave only the top of the encoder stack trainable, reducing the number of trainable transformer parameters to approximately 8.3% of the total.
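The target-marking and pooling steps can be illustrated without loading the encoder. In the sketch below, `mark_target` and `mean_pool` are hypothetical helper names, the hidden states are toy values rather than real XLM-R outputs, and wrapping an unattested word in markers on its own is an assumption about how isolated-word inputs are formed.

```python
def mark_target(sentence: str, target: str, marker: str = "[TGT]") -> str:
    """Wrap the first occurrence of the target wordform in marker tokens.

    Simplification: plain substring search, so a target that is a prefix of
    a longer word would also match.
    """
    start = sentence.find(target)
    if start < 0:
        # No attested context: fall back to the isolated word (assumption).
        return f"{marker} {target} {marker}"
    end = start + len(target)
    return f"{sentence[:start]}{marker} {target} {marker}{sentence[end:]}"


def mean_pool(hidden_states, positions):
    """Average the hidden vectors at the target's subword positions."""
    dim = len(hidden_states[0])
    return [
        sum(hidden_states[p][d] for p in positions) / len(positions)
        for d in range(dim)
    ]


marked = mark_target("бала мектепке барды", "мектепке")

# Toy 3-dimensional states for four subword positions; positions 1 and 2
# stand for the subwords of the target word.
states = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
pooled = mean_pool(states, [1, 2])
```

Pooling over only the target's subword positions keeps the representation word-specific even when the tokenizer splits a long suffixed form into many pieces.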
Morphological Encoder
The morphological encoder consists of two parallel sub-networks whose outputs are concatenated and linearly projected. Unlike the classical baselines in Section 2.3.2, this branch does not consume the fixed 40-dimensional interpretable vector. Instead, it receives a 72-dimensional lexical feature subset drawn from the larger engineered feature bank. Feature selection is performed exclusively on the training split using ExtraTrees importance ranking (with a minimum of 20 morphological features retained), and the resulting feature mask is frozen before any development or test evaluation, ensuring no information leakage. Selected features are z-scored using training-set statistics.
CharCNN: a character-level convolutional network over the Kazakh Cyrillic alphabet (44 characters including pad and unknown tokens). Characters are embedded into dense vectors, then processed by three parallel 1D convolutions with different kernel sizes and 128 filters each, followed by ReLU activation and max-over-time pooling, yielding a 384-dimensional vector (3 × 128).
MorphMLP: a two-hidden-layer MLP with LayerNorm that projects the 72-dimensional selected lexical feature vector through hidden layers of 96 units each with ReLU activation and dropout, yielding a 96-dimensional vector.
The concatenated CharCNN and MorphMLP outputs (384 + 96 = 480 dimensions) are linearly projected to form the morphological representation.
Gated Fusion
Rather than simple concatenation, we learn a sigmoid gate that weights the relative contribution of contextual and morphological representations. Equations (2)–(5) define the fusion block, including the gate parameters. The fused representation is passed through layer normalization, dropout, and a linear classifier with softmax activation.
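For orientation, one common way to write a sigmoid-gated fusion of two vectors is sketched below. This is a generic formulation under stated assumptions, not necessarily the parameterization of Equations (2)–(5). The symbols $\mathbf{h}_{\mathrm{ctx}}$, $\mathbf{h}_{\mathrm{morph}}$, $\mathbf{W}_g$, $\mathbf{b}_g$, $\mathbf{W}_o$, and $\mathbf{b}_o$ are illustrative names, and both representations are assumed to share a dimension $d$ after projection.

```latex
% Illustrative sketch only: symbol names and the shared dimension d are assumptions.
\begin{aligned}
\mathbf{g} &= \sigma\!\left(\mathbf{W}_g\,[\mathbf{h}_{\mathrm{ctx}};\,\mathbf{h}_{\mathrm{morph}}] + \mathbf{b}_g\right),
  & \mathbf{W}_g &\in \mathbb{R}^{d \times 2d},\ \mathbf{b}_g \in \mathbb{R}^{d}, \\
\mathbf{h}_{\mathrm{fused}} &= \mathbf{g} \odot \mathbf{h}_{\mathrm{ctx}} + (\mathbf{1} - \mathbf{g}) \odot \mathbf{h}_{\mathrm{morph}}, \\
\mathbf{z} &= \mathrm{Dropout}\!\left(\mathrm{LayerNorm}(\mathbf{h}_{\mathrm{fused}})\right), \\
\hat{\mathbf{p}} &= \mathrm{softmax}\!\left(\mathbf{W}_o\,\mathbf{z} + \mathbf{b}_o\right),
  & \mathbf{W}_o &\in \mathbb{R}^{5 \times d},\ \mathbf{b}_o \in \mathbb{R}^{5}.
\end{aligned}
```

Under this form, each coordinate of $\mathbf{g}$ lies in $(0, 1)$ and interpolates between the contextual and morphological signal for that dimension, whereas plain concatenation leaves the weighting entirely to the classifier.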
Training Procedure
The model is optimized with AdamW in PyTorch using discriminative learning rates and focal loss [51] on a single NVIDIA L4 GPU. Full training details appear in Appendix C.
2.3.5. Cross-Lingual Projection
To explore whether translation can yield useful CEFR supervision, we align the 4350 distinct Kazakh surface forms with the English EFLLex (13,871 entries) and Russian KELLY (8947 entries) datasets using the translated lexicons described in Section 2.2.3. For items matched by exact string translation, the median source proficiency level is projected as a silver CEFR label. Our main analysis treats these projected labels as a diagnostic signal of cross-lingual alignment; their use for downstream augmentation remains exploratory.
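The projection rule can be sketched in a few lines. The function name `project_label`, the toy source lexicon, and the tie-breaking choice are assumptions: with an even number of matches the median can fall between two levels, and this sketch resolves such ties to the lower level via `statistics.median_low`, which the text does not specify.

```python
from statistics import median_low

# Source lexicons use the full six-level scale, so C2 is included here.
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
LEVEL_TO_INT = {level: i for i, level in enumerate(LEVELS)}


def project_label(translations, source_lexicon):
    """Return the median CEFR level of the matched translations, or None."""
    matched = [
        LEVEL_TO_INT[source_lexicon[t]]
        for t in translations
        if t in source_lexicon  # exact string match only
    ]
    if not matched:
        return None  # unmatched items are dropped
    # Assumption: ties in an even-sized set resolve to the lower level.
    return LEVELS[median_low(matched)]


# Toy source lexicon with made-up levels, for illustration only.
toy_lexicon = {"child": "A1", "kid": "A2", "infant": "B2"}
silver = project_label(["child", "kid", "infant", "tot"], toy_lexicon)
```

Taking the median rather than the minimum or maximum keeps a single rare or very basic translation from dominating the projected label.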
2.3.6. Direct-Prompt LLM Baselines
To address whether instruction-tuned large language models can solve the task without task-specific training, we add direct-prompt baselines using both general multilingual and Kazakh-focused open-weight models. The general baselines are Qwen2.5-0.5B-Instruct and Qwen2.5-7B-Instruct [52]. The Kazakh-focused baselines are Sherkala-Chat 8B [53] and ISSAI KAZ-LLM 8B [54]. Each prompt gives the Kazakh lemma, its POS tag, and the closed label set {A1, A2, B1, B2, C1}; the model must return a single CEFR label. We test both Kazakh and English prompt templates and use deterministic decoding with no sampling. Labels are extracted from the generated text by matching the first valid CEFR label; unparseable outputs are counted as incorrect. No train or development labels are used to update the model, tune prompts, or calibrate decision thresholds.
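The label-extraction rule lends itself to a short sketch. The prompt wording in `build_prompt` is hypothetical, since the actual templates are not reproduced here, and case-insensitive whole-token matching is an assumption about how "first valid CEFR label" is implemented.

```python
import re

VALID_LABELS = ("A1", "A2", "B1", "B2", "C1")

# Assumption: labels are matched case-insensitively as whole tokens.
LABEL_PATTERN = re.compile(r"\b(A1|A2|B1|B2|C1)\b", re.IGNORECASE)


def build_prompt(lemma: str, pos: str) -> str:
    """Hypothetical English template giving lemma, POS, and the label set."""
    return (
        f"Kazakh lemma: {lemma}\n"
        f"Part of speech: {pos}\n"
        f"Answer with exactly one CEFR level from {', '.join(VALID_LABELS)}."
    )


def parse_label(generated: str):
    """Return the first valid CEFR label in the output, or None."""
    match = LABEL_PATTERN.search(generated)
    return match.group(1).upper() if match else None


def score(outputs, gold):
    """Accuracy in which unparseable outputs count as incorrect."""
    return sum(parse_label(o) == g for o, g in zip(outputs, gold)) / len(gold)


toy_outputs = ["A1", "I think B1.", "???"]
toy_gold = ["A1", "B2", "C1"]
toy_accuracy = score(toy_outputs, toy_gold)
```

Because `parse_label` returns `None` for outputs with no valid label, such outputs can never equal a gold label and are scored as errors, matching the rule stated above.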
2.3.7. Experimental Procedure
All supervised models use the same stratified train/development/test partitions. Training uses only the training split; the development split is reserved for hyperparameter selection, early stopping, and ensemble weight tuning; and the test split is evaluated once for the reported held-out scores. We do not retrain on the combined train+development set before test evaluation. The classical baselines are trained with standardized feature vectors, the neural ablations use the shared optimization settings in Appendix C, and the direct-prompt LLM baselines described in Section 2.3.6 are evaluated only as external comparison systems.
2.3.8. Evaluation Measures
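Let $N$ denote the number of evaluation instances, $\hat{y}_i$ the predicted CEFR level of instance $i$ encoded as an ordinal integer, $y_i$ the corresponding gold label, and $K = 5$ the number of proficiency levels.
Accuracy (Acc)
Accuracy (Acc) is the fraction of exactly correct predictions:
$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i],$$
where $\mathbb{1}[\cdot]$ is the indicator function.
Macro-F1
The unweighted average of per-class F1 scores:
$$\text{Macro-F1} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{F1}_k,$$
where $\mathrm{F1}_k$ is the harmonic mean of precision and recall for level $k$.
Mean Absolute Error (MAE)
The average ordinal distance between predicted and gold levels:
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert \hat{y}_i - y_i \rvert.$$
Within-1 Accuracy (W-1)
Within-1 Accuracy (W-1) is the fraction of predictions within one CEFR level of the gold label ($\lvert \hat{y}_i - y_i \rvert \le 1$).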
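The four measures can be computed with a few lines of plain Python. The sketch below assumes labels are encoded as the integers 0 to 4 and scores a class with no gold and no predicted items as F1 = 0; both conventions are assumptions, and the function names are illustrative.

```python
def accuracy(gold, pred):
    """Fraction of exactly correct predictions."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, num_classes=5):
    """Unweighted mean of per-class F1 over classes 0..num_classes-1."""
    scores = []
    for k in range(num_classes):
        tp = sum(g == k and p == k for g, p in zip(gold, pred))
        fp = sum(g != k and p == k for g, p in zip(gold, pred))
        fn = sum(g == k and p != k for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # Assumed convention: an empty class contributes F1 = 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


def mean_absolute_error(gold, pred):
    """Average ordinal distance between predicted and gold levels."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def within_one(gold, pred):
    """Fraction of predictions at most one level away from the gold label."""
    return sum(abs(g - p) <= 1 for g, p in zip(gold, pred)) / len(gold)


# Toy example: 0 = A1, ..., 4 = C1.
gold = [0, 1, 2, 3, 4]
pred = [0, 2, 2, 4, 1]
results = {
    "acc": accuracy(gold, pred),
    "macro_f1": macro_f1(gold, pred),
    "mae": mean_absolute_error(gold, pred),
    "w1": within_one(gold, pred),
}
```

On this toy example only two of five predictions are exact, yet four of five fall within one level, which illustrates why the ordinal measures are reported alongside accuracy and Macro-F1.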
4. Discussion
Our discussion focuses on what the experiments show reliably for this dataset and what remains unresolved. Across the supervised analyses, the consistent pattern is that morphology helps, contextual encoders are strong, and combining the two yields the clearest gains.
4.1. When Handcrafted Morphology Still Matters
Explicit morphological modeling matters because it adds information that character patterns alone do not recover reliably. The CharCNN cannot represent paradigm-level properties such as analysis count, derivational depth, and inflectional categories, whereas the MorphMLP branch improves the lexical model when those features are added. Appendix D supports the robustness of this contrast.
While our experiments are limited to Kazakh, this finding may be relevant to other agglutinative languages—Turkish, Finnish, and other Turkic languages—that share the property of productive suffixation and for which finite-state morphological analyzers are available. Whether the same pattern holds in those settings remains an empirical question.
4.2. Features and Embeddings
Pretrained contextual representations are stronger than the morphology-only neural branch, so the paper does not argue for hand-engineering in place of modern encoders. Instead, Table 6 and Table 8 point to complementarity: handcrafted features provide a solid classical baseline, XLM-R context-only is stronger than the morphology-only variants, and fusion models perform best overall. This pattern is consistent with findings in other low-resource LCP settings [20, 21].
4.3. Direct-Prompt LLMs
The LLM experiment strengthens the interpretation that task-specific lexical supervision is still necessary for Kazakh CEFR prediction. Table 7 shows that direct prompting underperforms both the supervised fusion model and the stronger classical baselines in Macro-F1. Kazakh-focused LLMs improve over some general multilingual prompt baselines (ISSAI KAZ-LLM 8B with the English prompt reaches Macro-F1 0.214), but the gap remains substantial against the gated fusion model (0.360 Macro-F1) and the full-feature LR baseline (0.314 Macro-F1). This indicates that broad Kazakh language modeling capacity does not by itself recover the pedagogical level distinctions encoded in the Kazakh lexical minima.
4.4. Gated vs. Concatenation Fusion
The specific fusion mechanism appears less important than the availability of both signal types. Across the multi-seed ablation and the significance appendix, gated and concatenation fusion show no clear performance separation. We therefore interpret the gain as evidence for representation complementarity rather than for a uniquely superior gating design.
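To make the contrast concrete, the sketch below shows one common form of each fusion operator. The gate formula (an element-wise sigmoid gate over the concatenated branches), the equal branch dimensionality, and all names are assumptions for illustration; the exact parameterization used in the experiments may differ.

```python
import math


def concat_fusion(h_ctx, h_morph):
    """Concatenation fusion: stack both branch vectors for the classifier."""
    return list(h_ctx) + list(h_morph)


def gated_fusion(h_ctx, h_morph, gate_weights, gate_bias):
    """Gated fusion (assumed form): g = sigmoid(W [h_ctx; h_morph] + b),
    output = g * h_ctx + (1 - g) * h_morph, element-wise.

    Assumes both branches were already projected to the same dimension d;
    gate_weights has d rows of length 2d and gate_bias has length d.
    """
    if len(h_ctx) != len(h_morph):
        raise ValueError("branches must share one dimension")
    joint = list(h_ctx) + list(h_morph)
    fused = []
    for row, b, c, m in zip(gate_weights, gate_bias, h_ctx, h_morph):
        z = sum(w * x for w, x in zip(row, joint)) + b
        g = 1.0 / (1.0 + math.exp(-z))
        fused.append(g * c + (1.0 - g) * m)
    return fused
```

Both operators give the classifier access to the same two signals; the gate only changes how they are mixed, which is consistent with the observation that the two variants are not clearly separable in performance.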
4.5. Ensembling
The ensemble offers a small additional gain, but not a decisive one. It yields the best overall test metrics, yet Appendix D does not show a clear advantage over the standalone fusion model. Its main value is that it redistributes errors across the most difficult classes, suggesting partially complementary failure modes between neural and classical systems.
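One simple way to combine a neural and a classical system is a convex blend of their class probabilities, with the blend weight chosen on the development split. The sketch below is an assumed illustration of that idea: the two-system setup, the grid of weights, and the use of development accuracy as the selection criterion are not taken from the paper.

```python
def blend(p_neural, p_classical, w):
    """Convex combination of two class-probability vectors."""
    return [w * a + (1.0 - w) * b for a, b in zip(p_neural, p_classical)]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def tune_weight(dev_neural, dev_classical, dev_gold, grid=None):
    """Pick the blend weight with the best development accuracy.

    The selection criterion (accuracy) and the default grid are assumptions.
    """
    grid = grid if grid is not None else [i / 10 for i in range(11)]
    best_w, best_acc = grid[0], -1.0
    for w in grid:
        hits = sum(
            argmax(blend(pn, pc, w)) == g
            for pn, pc, g in zip(dev_neural, dev_classical, dev_gold)
        )
        acc = hits / len(dev_gold)
        if acc > best_acc:  # strict comparison keeps the first best weight
            best_w, best_acc = w, acc
    return best_w, best_acc
```

When the two systems fail on different items, an intermediate weight can beat either system alone, which is the kind of complementary failure pattern described above.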
4.6. Interpreting Cross-Lingual Projection
Cross-lingual projection remains more convincing as an analysis tool than as supervision. Russian shows moderate alignment with the Kazakh labels, whereas English remains weak and noisy under the current exact-match pipeline. This suggests that cross-lingual CEFR information may still be useful, but the present setup is not reliable enough for downstream augmentation.
4.7. Ordinal Structure
Although CEFR levels are inherently ordered and we report ordinal-aware metrics (MAE, Within-1), all models in this study use nominal classification losses. Explicit ordinal classifiers such as CORAL [55] or cumulative-link models could exploit the label ordering directly and may reduce large-distance errors. We leave a systematic comparison of ordinal loss functions to future work, noting that the current Within-1 accuracy of 69.2% already suggests room for improvement through ordinal-aware training.
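For readers unfamiliar with threshold-based ordinal classifiers, the sketch below illustrates only the label encoding and decoding used by CORAL-style models: a $K$-level label becomes $K-1$ binary "is the level above threshold $k$?" targets, and a prediction is decoded by counting the thresholds passed. The shared-weight output layer and the loss are omitted, and the function names and the 0.5 cutoff are illustrative assumptions.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1"]


def to_ordinal_targets(label):
    """Encode a level as K-1 binary targets: target k is 1 if the level exceeds threshold k."""
    rank = LEVELS.index(label)
    return [1 if rank > k else 0 for k in range(len(LEVELS) - 1)]


def from_threshold_probs(probs, cutoff=0.5):
    """Decode K-1 threshold probabilities by counting how many exceed the cutoff."""
    rank = sum(p > cutoff for p in probs)
    return LEVELS[rank]


print(to_ordinal_targets("B1"))                      # [1, 1, 0, 0]
print(from_threshold_probs([0.9, 0.7, 0.4, 0.1]))    # B1
```

Under this encoding, predicting A2 for a B1 word gets one binary target wrong while predicting C1 for an A1 word gets all four wrong, so large-distance errors are penalized more heavily than adjacent ones.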
4.8. Limitations
Our findings are subject to several constraints: (1) Lexicon scale: The cleaned modeling set of 4437 unique lemma–POS entries is modest compared to European resources, limiting the reliability of per-level estimates. (2) Validation: Labels reflect theoretical curriculum placement rather than empirical learnability, as they are derived from pedagogical handbooks without validation against learner performance. (3) Range: The absence of C2-level data limits the generalizability of our observations across the full CEFR scale. (4) Frequency noise: Exact surface-form lookup covers only 38.4% of the distinct surface forms in the modeling set; the expanded lemma-level lookup raises effective frequency coverage to approximately 83% of the modeling set, but the remaining entries still default to zero frequency. (5) Context mismatch: Type-level labels are paired with instance-level embeddings retrieved from corpus occurrences of the surface form. For homographic or multi-POS entries, retrieved sentences may represent a different sense or grammatical category, potentially introducing noise in the contextual signal.
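The frequency-lookup behavior described in limitation (4) can be pictured as a two-stage fallback. The sketch below is an assumed reconstruction for illustration only: the function names, the dictionary-based count tables, and the toy counts are hypothetical and do not come from the paper's pipeline.

```python
def lookup_frequency(surface, lemma, surface_counts, lemma_counts):
    """Return (frequency, source): exact surface form first, then lemma, else zero."""
    if surface in surface_counts:
        return surface_counts[surface], "surface"
    if lemma in lemma_counts:
        return lemma_counts[lemma], "lemma"
    return 0, "missing"  # entries without any match default to zero frequency


def coverage(entries, surface_counts, lemma_counts):
    """Fraction of (surface, lemma) entries that receive a corpus frequency."""
    found = sum(
        lookup_frequency(s, l, surface_counts, lemma_counts)[1] != "missing"
        for s, l in entries
    )
    return found / len(entries)


# Toy counts, purely illustrative.
surface_counts = {"бала": 120}
lemma_counts = {"бала": 300, "кітап": 45}
entries = [("бала", "бала"), ("кітабы", "кітап"), ("зерде", "зерде")]
print(coverage(entries, surface_counts, lemma_counts))
```

Entries that fall through both stages are indistinguishable from genuinely rare words once their frequency is set to zero, which is why the remaining uncovered portion acts as noise in the frequency feature.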
4.9. Future Work
Several directions could strengthen and extend this work: (i) expanding the lexicon via additional pedagogical sources and expert annotation; (ii) collecting learner-based validation data to ground CEFR labels in empirical difficulty; (iii) richer morphological features such as paradigm entropy and suffix productivity; (iv) confidence-weighted cross-lingual projection that down-weights noisy silver labels; and (v) lemmatizing the source corpus prior to frequency counting.