Open Access Article
CSE-Guided Linguistically Constrained Morphological Segmentation for Turkmen
by Ualsher Tukeyev *, Dina Amirova and Davranbek Eshimov
Faculty of Information Technology and Artificial Intelligence, Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan
* Author to whom correspondence should be addressed.
Information 2026, 17(6), 594; https://doi.org/10.3390/info17060594 (registering DOI)
Submission received: 10 May 2026 / Revised: 8 June 2026 / Accepted: 11 June 2026 / Published: 13 June 2026
Abstract
This paper presents a linguistically constrained neural approach to morphological segmentation for low-resource Turkic languages, with a case study on Turkmen. The proposed method combines large-scale training data generated by a Complete Set of Endings (CSE) model with a neural architecture augmented with explicit phonological inductive biases. Unlike prior FEMSeg-based architectures, which rely on convolutional and Transformer layers for implicit feature learning, the proposed model, LCMSeg (Linguistically Constrained Morphological Segmentation), introduces vowel/consonant indicators and harmony-class embeddings. Both are derived directly from linguistic rules and are implemented as inductive biases. The CSE framework serves as a data-generation mechanism, producing a segmented corpus of 270k sentences used for training. The neural model learns to approximate the segmentation function induced by the CSE annotations while generalizing beyond the limitations of rule-based methods. Experiments on training sets of 10k to 80k sentences show consistent improvements, with up to 99.76% token accuracy and 99.53% morpheme accuracy. Evaluation on the FLORES-200 benchmark confirms strong generalization under domain shift, with harmony consistency reaching 98.9%. The results show that explicitly encoding phonological structure provides a strong inductive bias that is particularly beneficial in low-resource settings. The proposed framework offers a scalable and linguistically grounded solution for morphological segmentation in Turkic languages.
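To make the two rule-derived features concrete, the following is a minimal sketch of how per-character vowel/consonant indicators and harmony-class labels could be computed for Turkmen words. It is an illustration only, not the authors' implementation: the function names `char_features` and `is_harmonic`, the three-way class coding (0 = non-vowel, 1 = back vowel, 2 = front vowel), and the front/back-only notion of harmony are all assumptions. The abstract does not specify how LCMSeg encodes harmony classes or how harmony consistency is measured.

```python
# Illustrative sketch only; names and class coding are assumptions,
# not taken from the paper.

# Turkmen Latin-alphabet vowels, split by backness.
BACK_VOWELS = set("ayou")
FRONT_VOWELS = set("eäiöü")
VOWELS = BACK_VOWELS | FRONT_VOWELS


def char_features(word):
    """Return (char, is_vowel, harmony_class) for each character.

    harmony_class: 0 = not a vowel, 1 = back vowel, 2 = front vowel.
    """
    feats = []
    for ch in word.lower():
        is_vowel = 1 if ch in VOWELS else 0
        if ch in BACK_VOWELS:
            harmony = 1
        elif ch in FRONT_VOWELS:
            harmony = 2
        else:
            harmony = 0
        feats.append((ch, is_vowel, harmony))
    return feats


def is_harmonic(word):
    """True if all vowels in the word share one backness class."""
    classes = {h for _, is_vowel, h in char_features(word) if is_vowel}
    return len(classes) <= 1


if __name__ == "__main__":
    for w in ["obalar", "mekdepler", "kitap"]:
        print(w, char_features(w), is_harmonic(w))
```

In a neural segmenter, integer labels of this kind would typically be looked up in a small embedding table and combined with the character embeddings, so the network receives the phonological classes directly instead of having to infer them from data.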