This is an early access version, the complete PDF, HTML, and XML versions will be available soon.
Article

CSE-Guided Linguistically Constrained Morphological Segmentation for Turkmen

Faculty of Information Technology and Artificial Intelligence, Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan
*
Author to whom correspondence should be addressed.
Information 2026, 17(6), 594; https://doi.org/10.3390/info17060594 (registering DOI)
Submission received: 10 May 2026 / Revised: 8 June 2026 / Accepted: 11 June 2026 / Published: 13 June 2026
(This article belongs to the Section Artificial Intelligence)

Abstract

This paper presents a linguistically constrained neural approach to morphological segmentation for low-resource Turkic languages, with a case study on Turkmen. The proposed method combines large-scale training data generated by a Complete Set of Endings (CSE) model with a neural architecture augmented with explicit phonological inductive biases. Unlike prior FEMSeg-based architectures, which rely on convolutional and Transformer layers for implicit feature learning, the proposed model, LCMSeg (Linguistically Constrained Morphological Segmentation), introduces vowel/consonant indicators and harmony-class embeddings, both derived directly from linguistic rules and implemented as inductive biases. The CSE framework serves as a data-generation mechanism, producing a segmented corpus of 270k sentences used for training. The neural model learns to approximate the segmentation function induced by the CSE annotations while generalizing beyond the limitations of rule-based methods. Experiments on training sets of 10k to 80k sentences demonstrate consistent improvements, with up to 99.76% token accuracy and 99.53% morpheme accuracy. Evaluation on the FLORES-200 benchmark confirms strong generalization under domain shift, with harmony consistency reaching 98.9%. The results show that explicitly encoding phonological structure provides a strong inductive bias that is particularly beneficial in low-resource settings. The proposed framework offers a scalable and linguistically grounded solution for morphological segmentation in Turkic languages.
Keywords: CSE-guided; linguistic constraints; neural; morphology; segmentation; Turkmen
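
To make the vowel/consonant indicators and harmony classes mentioned in the abstract more concrete, the following is a minimal sketch of how such character-level features might be computed for Turkmen Latin-script words. The function names (`char_features`, `is_harmonic`), the three-way class numbering, and the restriction to front/back harmony are illustrative assumptions; the abstract does not specify the feature inventory or how harmony consistency is measured.

```python
import unicodedata

# Turkmen vowels grouped by backness. The grouping follows the standard
# description of the Turkmen Latin alphabet; the class numbering below is
# an assumption made for this sketch, not taken from the paper.
BACK_VOWELS = set(unicodedata.normalize("NFC", "ayou"))
FRONT_VOWELS = set(unicodedata.normalize("NFC", "eäiöü"))
VOWELS = BACK_VOWELS | FRONT_VOWELS


def char_features(word):
    """Return (char, is_vowel, harmony_class) for each character.

    harmony_class: 0 = not a vowel, 1 = back vowel, 2 = front vowel.
    """
    feats = []
    for ch in unicodedata.normalize("NFC", word).lower():
        is_vowel = 1 if ch in VOWELS else 0
        if ch in BACK_VOWELS:
            harmony = 1
        elif ch in FRONT_VOWELS:
            harmony = 2
        else:
            harmony = 0
        feats.append((ch, is_vowel, harmony))
    return feats


def is_harmonic(word):
    """True if all vowels in the word share one backness class.

    This is a hypothetical word-level check, not the paper's
    harmony-consistency metric.
    """
    classes = {h for _, _, h in char_features(word) if h != 0}
    return len(classes) <= 1
```

In a neural segmenter, the two integer columns could be fed to the model alongside character embeddings, for example as a binary input feature and as indices into a small embedding table. Words such as `gyzlar` and `öýler` pass the check, while a loanword such as `kitap` mixes front and back vowels and does not.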

Share and Cite

MDPI and ACS Style

Tukeyev, U.; Amirova, D.; Eshimov, D. CSE-Guided Linguistically Constrained Morphological Segmentation for Turkmen. Information 2026, 17, 594. https://doi.org/10.3390/info17060594

