In this section, we provide a comprehensive description of the datasets, baselines, implementation details, evaluation metrics, and ablation studies used to assess the efficacy of our proposed method. All experiments are conducted on three target languages under low-resource conditions, with an additional cross-lingual evaluation on Turkish.
4.1. Experimental Setup
4.1.1. Dataset and Tagging Schema
We evaluated MRL-POS on three low-resource agglutinative languages using the POS_ukg dataset, which is sourced from the Leipzig Corpora Bank and annotated using both manual and automatic methods. The annotation process follows a three-stage pipeline:
Automatic Pre-tagging: Initial tags are generated using UKU-Tagger 1.0, a local rule-based system built from 10k seed sentences. The tool applies strict suffix-priority rules to minimize ambiguity.
Expert Validation: Two linguists independently revise the pre-tagged data over two rounds, resolving conflicts via majority voting. Inter-annotator agreement measured by Cohen’s Kappa [33] reaches 0.87 (Uy), 0.83 (Kg), and 0.85 (Uz), indicating strong consistency.
UD Compliance: Final tags are mapped to Universal Dependencies v2.10, ensuring cross-lingual compatibility. Domain analysis reveals that 92% of the corpus consists of news text.
We randomly shuffled and split the dataset into training (80%), development (10%), and test (10%) sets.
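The shuffling-and-splitting step can be sketched as follows (an illustrative helper; the function name and use of a fixed seed are our assumptions, not details from the paper):

```python
import random

def split_dataset(sentences, seed=42):
    """Shuffle, then split 80/10/10 into train/dev/test sets."""
    rng = random.Random(seed)
    data = list(sentences)
    rng.shuffle(data)
    n_train = int(0.8 * len(data))
    n_dev = int(0.1 * len(data))
    return (data[:n_train],
            data[n_train:n_train + n_dev],
            data[n_train + n_dev:])
```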
Table 3 summarizes the number of sentences and unique stems in each split. For cross-lingual validation, we extended the evaluation to Turkish using the IMST-UB treebank [34].
4.1.2. POS Tag Set and Distribution
We adopt a coarse-grained tag set, as no prior research has established a uniform POS tag specification for the three target languages.
Table 4 lists each POS label and its abbreviated tag. We select 12 major POS tag classes with extensive coverage in the corpora; some minor and rare classes are omitted owing to the news-text domain of the data.
Table 5 presents the absolute frequency of each tag in POS_ukg, excluding the punctuation tag.
4.1.3. Evaluation Metrics
We adopted a comprehensive set of evaluation metrics to rigorously assess model performance across key dimensions relevant to MRLs:
Overall F1: The primary metric for overall POS tagging accuracy, balancing precision and recall, which is essential for the class-imbalanced datasets common in MRLs.
OOV accuracy (OOV Acc): Measures performance on tokens whose word form is absent from the training vocabulary, computed as the fraction of these OOV tokens that receive the correct POS tag.
Polysemous word disambiguation F1 (Poly-F1): The macro-averaged F1 score computed only over tokens belonging to polysemous word types (e.g., Uz “yoz” = summer [N] / write [V]). Specifically, if T is the set of all polysemous types and, for each type t in T, S_t denotes the set of test tokens of type t, then Poly-F1 = (1/|T|) Σ_{t∈T} F1(S_t), where F1(S_t) is the macro-averaged F1 score computed over the tokens in S_t.
Affix segmentation F1 (Affix-F1): Quantifies alignment between predicted and gold-standard morphological boundaries. Validates our dynamic affix selection mechanism.
Linguistically valid affix retention (Ling-Affix Ret): Percentage of retained affixes conforming to linguistic rules. Assesses morphological awareness beyond mere accuracy.
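As a concrete illustration of the metrics above, the following sketch computes OOV Acc and a simplified Poly-F1 (helper names are ours, not from the paper, and the per-type macro F1 is one plausible reading of the definition):

```python
def oov_accuracy(tokens, gold, pred, train_vocab):
    """Fraction of OOV tokens (absent from the training vocabulary)
    that receive the correct POS tag."""
    idx = [i for i, tok in enumerate(tokens) if tok not in train_vocab]
    return sum(gold[i] == pred[i] for i in idx) / len(idx) if idx else 0.0

def macro_f1(gold, pred):
    """Macro-averaged F1 over the tag classes appearing in gold or pred."""
    scores = []
    for tag in set(gold) | set(pred):
        tp = sum(g == tag and p == tag for g, p in zip(gold, pred))
        fp = sum(g != tag and p == tag for g, p in zip(gold, pred))
        fn = sum(g == tag and p != tag for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

def poly_f1(tokens, gold, pred, poly_types):
    """Average of per-type macro F1 over the polysemous word types."""
    per_type = []
    for t in poly_types:
        idx = [i for i, tok in enumerate(tokens) if tok == t]
        if idx:
            per_type.append(macro_f1([gold[i] for i in idx],
                                     [pred[i] for i in idx]))
    return sum(per_type) / len(per_type) if per_type else 0.0
```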
4.1.4. Baseline Models
To ensure comprehensive evaluation, we compare MRL-POS against six competitive baselines, ranging from traditional character-level models to state-of-the-art multilingual contextual encoders. Each baseline is implemented with a CRF output layer to maintain consistent decoding.
BiLSTM + CRF: A standard word-level sequence labeling baseline.
Char-BiLSTM + CRF: A character-level recurrent model where each token is encoded from scratch using a bidirectional LSTM over its character sequence.
Hybrid (CharCNN + Word Embedding + CRF): In this setting, we concatenate pretrained FastText word embeddings with character-level CNN features extracted from each token. This baseline tests whether static word-level signals, when enriched by surface form morphology, suffice for POS tagging.
XLM-R (base) + CRF: A strong contextual baseline using the multilingual RoBERTa encoder. No explicit affix or character information is provided; POS decisions rely solely on sentence-level context.
Fixed Affix + XLM-R + CRF: Instead of dynamically selecting affixes, we extract fixed-length suffix n-grams (length 3–7) from each token and embed them alongside XLM-R outputs. These affix embeddings are concatenated to the token representation before feeding into the CRF layer.
mBERT + CRF: We use multilingual BERT as a contextual backbone to test the portability of our approach across encoder architectures. Note that mBERT’s pretraining corpus does not cover Kyrgyz.
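The Fixed Affix baseline’s suffix extraction can be sketched as follows (an illustrative helper; the function name and the choice to leave at least one stem character are our assumptions):

```python
def fixed_suffix_ngrams(token, n_min=3, n_max=7):
    """Extract fixed-length suffix n-grams (lengths 3-7) from a token,
    leaving at least one character of stem."""
    upper = min(n_max, len(token) - 1)
    return [token[-n:] for n in range(n_min, upper + 1)]
```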
4.1.5. Implementation Details
All models use the HuggingFace implementation of XLM-R Base (12 layers, hidden size 768). Affix embeddings are 128-dimensional; the BiLSTM hidden size is 128. The co-attention projection dimension equals 768. We train with AdamW: learning rates of 3 × 10⁻⁵ for Transformer parameters and 1 × 10⁻³ for affix-module parameters. We employ gradient clipping at 1.0 and a batch size of 32 sentences. Early stopping is triggered if development F1 does not improve for three consecutive epochs, up to a maximum of 30 epochs. Each configuration is run five times with seeds {42, 100, 202, 303, 404} to report mean ± σ.
All models were trained on a single NVIDIA RTX 3080 Ti GPU (12 GB VRAM) with 32 GB system RAM. MRL-POS requires ~2.5 h per training run (typically ~10 epochs before early stopping) and peaks at 8.5 GB of GPU memory, compared to ~2.0 h and 7.8 GB for XLM-R fine-tuning.
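The early-stopping schedule described above can be sketched as follows (a minimal sketch; the class name and interface are ours — in practice the two learning rates would additionally be set via separate optimizer parameter groups):

```python
class EarlyStopper:
    """Stop when dev F1 fails to improve for `patience` consecutive
    epochs, or when `max_epochs` is reached."""

    def __init__(self, patience=3, max_epochs=30):
        self.patience, self.max_epochs = patience, max_epochs
        self.best, self.bad_epochs = float("-inf"), 0

    def step(self, epoch, dev_f1):
        # returns True when training should stop after this epoch
        if dev_f1 > self.best:
            self.best, self.bad_epochs = dev_f1, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience or epoch + 1 >= self.max_epochs
```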
4.2. Main Results
4.2.1. Performance Comparison
Table 6 presents the overall F1 scores of MRL-POS and the six baseline models across Uyghur, Kyrgyz, and Uzbek. MRL-POS consistently achieves the highest performance across all languages, with an average F1 of 84.10%, outperforming all baselines by a notable margin. Compared to the best-performing baseline, XLM-R + fixed affix (80.83%), MRL-POS shows an absolute gain of 3.27 points. Similarly, it outperforms mBERT and XLM-R by 8.48 and 4.05 points, respectively, demonstrating the advantage of integrating morphological information with contextual representations. Moreover, models incorporating character-level features such as Hybrid and Char-BiLSTM achieve 78.92% and 77.08%, respectively, but remain 5.18 and 7.02 points below MRL-POS, highlighting the insufficiency of fixed or surface-level morphological features, even alongside strong pretrained encoders. Furthermore, the largest performance gap is observed on Kyrgyz, the language with the smallest training set, where MRL-POS outperforms BiLSTM-CRF by 9.76 points and XLM-R by 3.28 points, demonstrating our model’s low-resource robustness.
Overall, these results confirm that dynamically selecting affixes significantly enhances POS tagging in low-resource, morphologically rich settings. MRL-POS not only improves overall accuracy but also offers greater robustness to rare and out-of-vocabulary word forms than both pure contextual models and fixed morphological heuristics.
4.2.2. OOV Word Handling
For OOV handling, we compare only against models with subword- or character-level processing capacity, namely XLM-R and Char-BiLSTM.
Table 7 reports OOV word tagging accuracy for Char-BiLSTM, XLM-R, and our MRL-POS on Uyghur, Kyrgyz, and Uzbek, along with their average. The purely character-based BiLSTM achieves an overall OOV accuracy of 61.14%, as its limited contextual understanding reduces effectiveness on unseen word forms. Leveraging pretrained multilingual context, XLM-R raises the overall OOV accuracy to 71.97%, indicating that sentence-level representations aid in handling novel morphological variants. MRL-POS significantly outperforms both baselines, confirming that explicitly modeling and attending to affix information is essential for accurate tagging of unseen words in low-resource MRL settings.
This significant improvement is attributed to the following:
Dynamic affix selection: Adaptive n-gram ranges effectively capture complex word structures, such as the Uyghur word kör-sät-küch-siz-lik (“inability to show”), which stacks four suffixes unseen in training.
Frequency-aware filtering: Rare but valid affixes, like the Kyrgyz -übüz in kelüülükübüz, are retained, whereas fixed-threshold methods discard them.
Contextual grounding: Co-attention mechanisms enhance stem representations, especially when suffix patterns are ambiguous.
4.2.3. Polysemy Resolution
For polysemy resolution, XLM-R is used as the sole baseline due to its comparable contextual modeling capabilities with our approach. Non-contextual models are excluded, as they lack the necessary capacity for effective polysemy handling.
Table 8 demonstrates that our context–affix interaction mechanism improves polysemous word disambiguation by 5.2 F1 points over XLM-R on average, confirming that the dynamic affix selection mechanism and global context interaction enhance POS assignment for words with multiple senses. Three representative cases illustrate this:
Uzbek “yoz”: Our model reaches 93.4% accuracy in distinguishing the noun “summer” from the verb root “to write.” It uses the preceding context (“keyingi,” “next”) alongside suffix cues (locative “-da” vs. gerund “-ish”) to assign the correct tag.
Kyrgyz “kir”: With 89.7% accuracy, MRL-POS resolves the “dirt [N]” vs. “enter [V]” ambiguity by gating out irrelevant affix dimensions via co-attention, thus focusing on the affix patterns that signal each sense.
Uyghur “ata”: Our model correctly tags 91.2% of instances as “father [N]” versus the verbal sense “to name [V],” by weighting sentence-level dependencies and selecting the appropriate affix features.
4.3. Ablation Studies
To assess the individual contributions of key components within the proposed MRL-POS framework, we conducted a comprehensive ablation study across all three target languages. We evaluated five ablated variants of the model, as shown in Table 9, each omitting or modifying a critical architectural element:
w/o Dynamic Affix Selection: Substitutes the adaptive n-gram segmentation strategy with fixed-length (3–7 characters) n-grams and static frequency thresholds.
w/o Context–Affix Co-Attention: Removes the dual-gating co-attention module and instead performs a naïve concatenation of contextual and affix embeddings.
w/o Layer-wise Attention: Uses only the final layer output of the XLM-R encoder, ignoring intermediate representations.
XLM-R only: Only contextual embeddings from XLM-R are used for POS classification, excluding all affix-based modules.
Affix-only: Retains only the affix-based BiLSTM-attention module while removing the contextual encoder, to evaluate the effectiveness of morphological features in isolation.
The ablation results reveal the individual and combined importance of contextual and morphological components in the MRL-POS architecture. The full model consistently outperforms all variants, confirming the effectiveness of its integrated design. When affix features are entirely removed, as in the XLM-R-only configuration, performance declines substantially across all metrics—particularly in OOV accuracy, which drops by over 12 points, indicating that contextual embeddings alone are insufficient for capturing the rich morphological patterns of agglutinative languages.
Conversely, the affix-only model, while retaining some morphological cues, performs the worst overall. This underscores the necessity of contextual information for accurate disambiguation of word senses and syntactic roles. Among the ablated settings, removing the dynamic affix selection module leads to the most significant reduction in OOV accuracy, emphasizing the value of adaptive segmentation over fixed n-gram strategies. Similarly, eliminating the co-attention mechanism weakens the model’s ability to integrate local and global cues, reducing its effectiveness in polysemy resolution. Moreover, excluding the layer-wise attention module yields moderate performance degradation, suggesting that hierarchical contextual information from intermediate transformer layers contributes meaningfully to syntactic and semantic understanding.
Overall, the results indicate that morphology and context are not interchangeable but rather complementary. Our findings suggest that through the integration of dynamic affix selection, deep contextual encoding, and feature-level interaction, the model can achieve robust generalization in low-resource MRLs.
4.3.1. Ablation on Dynamic Affix Selection
To better understand the internal mechanisms of the dynamic affix selection module, we performed a detailed ablation by isolating and modifying its core components. Specifically, we assessed the impact of the following: (1) fixed-length affix extraction without adaptive sizing, (2) uniform frequency thresholds across all affixes, (3) removal of positional boundary markers distinguishing prefixes from suffixes, and (4) random affix selection as a control.
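The length- and frequency-adaptive parts of this module can be illustrated as follows. This is a simplified sketch under our own assumptions — the specific length-scaled cutoff and the candidate generator are illustrative stand-ins, not the paper’s exact rules:

```python
from collections import Counter

def suffix_candidates(word, longest=7):
    """Suffix n-grams of length 2 up to min(len(word)-1, longest),
    so that short words are not over-segmented."""
    return [word[-n:] for n in range(2, min(len(word) - 1, longest) + 1)]

def select_affixes(words, base_threshold=50):
    """Keep suffix candidates whose corpus frequency clears a
    length-scaled cutoff (longer affixes are rarer, so the bar drops
    with affix length -- our assumption for illustration)."""
    counts = Counter(s for w in words for s in suffix_candidates(w))
    return {affix for affix, c in counts.items()
            if c >= max(1, base_threshold // len(affix))}
```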
The results in Table 10 demonstrate that each component of the dynamic affix selection module contributes meaningfully to overall performance. The full dynamic configuration achieves the highest scores across all metrics and languages, reaffirming the importance of combining adaptive affix length, frequency-aware filtering, and affix boundary awareness.
When length adaptation is removed and fixed 3–7 n-grams are applied uniformly, the model shows a noticeable decline in OOV accuracy and affix segmentation quality. This suggests that over-segmentation in long words and under-segmentation in short words compromise affix informativeness. Similarly, applying a fixed frequency threshold across all affix lengths leads to a reduction in affix quality, particularly in Kyrgyz, where rare but valid affixes are more frequent due to smaller corpora. The absence of boundary marking between prefixes and suffixes results in poorer affix disambiguation, indicating that explicitly encoding affix position within the word is necessary for preserving morphological structure. Finally, the random selection variant performs the worst, with substantial degradation in all metrics, confirming that the observed improvements in the full model are not due to chance.
These findings emphasize that the dynamic affix selection module is not a monolithic mechanism but rather a collaboration of statistically grounded strategies. Length- and frequency-adaptive filtering allows the model to capture linguistically meaningful subword units, thereby improving generalization of POS tagging in low-resource MRLs.
4.3.2. Ablation on Affix Segmentation
To evaluate the effectiveness of our adaptive n-gram segmentation strategy, we conducted a controlled ablation study comparing it against two widely used unsupervised morphological segmenters: Morfessor v2 and BPE-Morph. Both methods have been employed in prior research to identify subword units without linguistic supervision and serve as reasonable baselines for our affix candidate generation module.
In this ablation, we replace the adaptive segmentation component of MRL-POS with either Morfessor or BPE-Morph, each trained independently on the full training corpus for each language. All other components of the model are kept fixed to isolate the impact of segmentation.
Table 11 reports the average F1 scores for the three target languages.
As shown in Table 11, both unsupervised baselines, Morfessor and BPE-Morph, underperform the proposed adaptive n-gram method, with average F1 drops of 1.30 and 0.87 points, respectively. This indicates that our adaptive n-gram segmentation heuristics, which preserve stem integrity while isolating short affixes, outperform frequency-driven unsupervised methods that often over-fragment stems or merge affixes into frequent subword units.
We further note that those two segmenters require dedicated training and hyperparameter tuning per language, whereas our adaptive rule is fully deterministic and language-agnostic. These findings confirm that the proposed segmentation strategy achieves a robust balance between morphological sensitivity and practical simplicity.
4.3.3. Ablation on Layer-Wise Attention Pooling
To assess the effectiveness of the layer-wise attention pooling mechanism, we compared four XLM-R aggregation strategies: (1) Last Layer Only, using only the final (12th) transformer layer; (2) Average Pooling, computing the unweighted mean of all twelve layers; (3) Shallow Layers Only, averaging layers 1–4 to emphasize morphological and lexical patterns; and (4) Top Layers Only, aggregating layers 9–12 to capture semantic and discourse information. Our full model, Full Layer-wise Pooling, instead learns dynamic attention weights to combine all twelve layers.
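The learned aggregation can be sketched as a softmax-weighted sum over layer outputs (a minimal sketch; function names and shapes are ours, and in the full model the logits would be learned parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def layerwise_pool(layer_states, layer_logits):
    """Combine per-layer hidden states with scalar attention weights.

    layer_states: array of shape (num_layers, seq_len, hidden)
    layer_logits: array of shape (num_layers,), one logit per layer
    """
    weights = softmax(layer_logits)  # attention distribution over layers
    # contract the layer axis: weighted sum of the layer outputs
    return np.tensordot(weights, layer_states, axes=1)
```

With zero logits this reduces to Average Pooling, and a single dominant logit recovers a single-layer strategy, so the learned weights interpolate among the baseline configurations.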
The results in Table 12 validate the effectiveness of our proposed layer-wise attention pooling mechanism. The full model outperforms all baseline configurations across all three languages, with notable gains in disambiguating polysemous words. Specifically, it produces a 2.25 F1 improvement over the final-layer-only setup, indicating that critical linguistic cues are dispersed across multiple layers and cannot be fully captured by relying solely on the final transformer layer.
While average pooling offers a slight improvement over using only the last layer, it falls short of our attention-based approach. This suggests that layers contribute unequally to POS tagging, and that learnable aggregation enables the model to selectively emphasize the most relevant representations. The shallow-layer-only variant performs the worst, confirming that low-level morphological cues are insufficient for capturing the syntactic and semantic nuances needed in POS disambiguation. In addition, top-layer aggregation performs better than the shallow-layer setting but still underperforms the full pooling strategy. This implies that although higher layers encode more abstract semantics, signals from lower layers remain essential for fine-grained linguistic tasks in morphologically rich languages.
In summary, these results demonstrate that dynamic, attention-based layer aggregation effectively harnesses complementary information across the transformer stack, enhancing POS tagging performance in low-resource MRLs.
To further illustrate how the layer-wise attention pooling mechanism operates in practice, we analyze a representative sentence from the Uzbek development set: “Keyingi yozda u yozishni o‘rganmoqchi.” (“He intends to learn writing next summer.”). As shown in Figure 2, we visualize the attention distribution across all 12 transformer layers for each token in the sentence.
The heatmap reveals that function words such as “u” tend to assign higher importance to deeper layers, which aligns with their role in discourse-level reference resolution. In contrast, content words like “yozishni” and “o‘rganmoqchi” exhibit broader attention spread across middle and top layers, reflecting the integration of morphological and semantic features. Interestingly, the model appears to assign relatively greater importance to middle layers for morphologically complex or polysemous tokens, suggesting that these layers encode representations particularly useful for POS disambiguation.
This observation supports our earlier findings in Table 12, where layer-wise aggregation yields significant gains in Poly-F1. The model’s ability to selectively emphasize relevant layers on a per-token basis highlights the effectiveness of attention-based pooling for morphologically rich and syntactically diverse languages.
4.3.4. Gating Mechanism Ablation
To highlight the contribution of our dual-gating co-attention module, we compare the full dual-gating design against three ablated gate configurations, with all other components unchanged: (1) a single shared gate applied after concatenation, (2) a scalar gate learning one weight to balance affix and context features, and (3) a concatenation-only variant without gating.
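The element-wise dual gates can be sketched as follows (an illustrative sketch with our own function names and randomly initialized weights standing in for learned parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_gate(context, affix, Wc, Wa, bc, ba):
    """Element-wise dual gating: each feature dimension of the context
    and affix vectors gets its own gate before the two are summed,
    unlike a scalar gate that applies one weight to the whole vector."""
    g_ctx = sigmoid(context @ Wc + bc)  # gate for contextual features
    g_aff = sigmoid(affix @ Wa + ba)    # gate for affix features
    return g_ctx * context + g_aff * affix
```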
As shown in Table 13, the dual-gating mechanism consistently yields the highest F1 scores across all three languages, outperforming the next best variant (Single Gate) by approximately 0.83 F1 on average. The substantial drops observed for Scalar Gate (−1.13 F1) and Concat-Only (−2.39 F1) confirm that the dual-gating mechanism surpasses simpler variants by offering independent, element-wise control over affix and context features. This precise balancing prevents dominant contextual signals from overshadowing subtler morphological cues and dynamically adjusts morphology–context contributions per token, thereby improving generalization across varied morphological patterns.
4.3.5. Sensitivity Ablation on Affix Selection Threshold
To assess the sensitivity of the affix-selection threshold θ, we performed an ablation study by varying the frequency cutoff across a wide range: {25, 50, 75, 100, 125}. All other components of the model architecture and training configuration were held constant to isolate the effect of θ. For each setting, we conducted 10 independent training runs and report the average F1 score on the development set for each target language.
As illustrated in Figure 3, performance remains highly stable across different θ values. For instance, on Uyghur, the F1 score varies by only ±0.05 points around the maximum (86.52 at θ = 50); similar stability patterns are observed for Kyrgyz and Uzbek. This consistent performance indicates that the model is not overly sensitive to the exact threshold value and that our default settings (θ = 50 and 100) fall well within a robust and reliable operating range.
4.4. Cross-Lingual Evaluation
To evaluate the cross-lingual transferability of MRL-POS beyond closely related languages, we include Turkish, a resource-rich, typologically related member of the Turkic language family, as both a source and a target language in our transfer experiments. Specifically, we adopt a zero-shot setting in which the model is trained on Turkish and tested on the three target languages, and vice versa, without any supervision from the target language. In this experiment, all target-language data are rendered in Latin script for consistency with the Turkish data.
The results in Table 14 demonstrate the strong cross-lingual transferability of MRL-POS. When trained on Turkish and evaluated on morphologically similar but low-resource languages such as Uyghur, Uzbek, and Kyrgyz, the model achieves average F1 scores above 76%, OOV accuracy near 70%, and a Poly-F1 score of 72%, indicating effective transfer of contextual and affix representations despite script and domain differences.
Notably, models trained on Uyghur, Uzbek, and Kyrgyz also generalize well to Turkish, particularly in Poly-F1, demonstrating upward transfer potential from smaller datasets when morphological patterns are adequately modeled.
These findings highlight two conclusions: (1) Turkish serves as a robust pivot language for transfer to other Turkic languages due to shared morphological structure; and (2) the affix-centric architecture of MRL-POS facilitates generalization across typologically related but orthographically diverse languages, reinforcing its utility in cross-lingual low-resource POS tagging.