Review

From Ancient Manuscripts to Modern Social Media: Evolution of Tonality Analysis Methods for Low-Resource Languages

by Zharasbek Baishemirov 1,2,3, Azim Kassymbayev 4,*, Didar Yedilkhan 1, Beibut Amirgaliyev 1,* and Beibit Abdikenov 5
1 Smart City Research Center, Astana IT University, Mangilik El Avenue 55/11, EXPO Block C1, Astana 010000, Kazakhstan
2 School of Applied Mathematics, Kazakh-British Technical University, Tole Bi Street 59, Almaty 050000, Kazakhstan
3 School of Digital Technologies, Narxoz University, Zhandosov Street 55, Almaty 050035, Kazakhstan
4 School of Artificial Intelligence and Data Science, Astana IT University, Mangilik El Avenue 55/11, EXPO Block C1, Astana 010000, Kazakhstan
5 Science and Innovation Center “Artificial Intelligence”, Astana IT University, Mangilik El Avenue 55/11, EXPO Block C1, Astana 010000, Kazakhstan
* Authors to whom correspondence should be addressed.
Appl. Sci. 2026, 16(11), 5650; https://doi.org/10.3390/app16115650
Submission received: 23 April 2026 / Revised: 23 May 2026 / Accepted: 25 May 2026 / Published: 4 June 2026

Abstract

Recently, computational sentiment analysis has become an essential tool for detecting evaluative language in large text collections. However, its application to many low-resource language families and historical corpora remains largely unexplored. This paper reviews the evolution of sentiment analysis methods in the Turkic language family, with a particular focus on Chagatai, the classical predecessor of several modern Turkic languages. We trace the methods from early lexicon-based and rule-based approaches to present-day large language models, addressing longstanding problems in agglutinative morphology, data scarcity, orthographic instability, and multilingual lexical mixing. To examine the available options, we conducted a pilot experiment using multilingual models in a zero-shot setting on a curated Chagatai corpus. In the absence of ground-truth annotations, prediction stability was validated with ensemble consistency and inter-model agreement. The results show real promise but also distinct limitations when adapting traditional NLP technologies for historically remote, low-resource languages. Progress in the field will require cross-disciplinary work, systematic diachronic dataset deployment, and a nuanced adaptation of multilingual representation learning to handle linguistically rich, low-resource settings.

1. Introduction

Tonality analysis is used mainly to obtain sentiment signals from social networks and product reviews. Evaluative language, however, spans a much wider variety of discourse, including political debate, legal reasoning, and historical writing. Computational methods can analyze the opinions, attitudes, and viewpoints embedded in language. This is useful in several areas, such as discourse analysis, automated text classification, and cultural studies, because those attitudes influence how people understand the world.
The fast development of digital communication has generated interest in extracting evaluative signals from unstructured text. Tonality analysis is a useful tool in marketing analytics, public health observation, political studies, and financial prediction because attitudes and judgments are reflected in social media, news, and online debates all the time [1,2,3]. Yet, a proper tonal interpretation remains a challenge. Meaning is often dependent on context, narrative structure, and concrete intent rather than lexical markers taken in isolation [4]. These dependencies become significantly harder to represent as we move away from the clear, single-language review corpora that dominate the literature.
Chagatai manuscripts are rich in evaluative language, particularly in religious and political discourse. The emotional tone of such works serves as a marker of ideological orientation, moral hierarchy, and social order. These tonal patterns can serve as a basis for computational modeling that makes automated text analysis more informative and provides a means to study cultural continuity across time.
Over the past decade, modeling strategies have improved significantly, moving from lexicon-based polarity identification through supervised machine learning and neural architectures to transformer-based systems [5,6]. Each generation has added more contextual sensitivity. But linguistic variety, domain shift, and annotation inconsistency remain issues. Better architectures have expanded the range of feasible tasks, yet the fundamental barriers of data quality and linguistic complexity remain.
These constraints are strongest in low-resource settings. Most Turkic languages lack the rich annotated corpora and extensive pretraining coverage of high-resource languages [7,8]. The historical dimension adds further difficulty. Chagatai differs from modern digital text in orthography, vocabulary, and stylistic structure, and its morphological richness and long-term semantic drift are hard for multilingual transformers to handle. Historical NLP for Turkic languages therefore combines contextual modeling with historical preprocessing (orthography normalization, archaic-to-modern lexicon mapping, and semantic shift modeling). This review surveys methods for analyzing tone in low-resource Turkic language corpora and discusses how they can be applied to historical literature.
Chagatai is an important case. It was the literary language of Central Asia for centuries. Although Chagatai and the modern Turkic languages are linked by genetic continuity, they differ substantially in spelling and meaning. Changes in technology and approach alone are not enough to bridge the gap between Chagatai manuscripts and modern NLP tools. This matters both for computational analysis and for the intellectual heritage that these texts contain.
This review provides a systematic account of tonality analysis methods for Turkic languages, with a focus on low-resource and historically remote scenarios. It follows the field from lexicon-based systems through transformer architectures to large language models applied to agglutinative and morphologically complex languages, and it identifies limitations such as scarce annotated data and orthographic variation. Modified transfer-learning strategies are also proposed for cross-lingual and zero-shot scenarios.
Several previous surveys reviewed the field of sentiment analysis in general NLP, including lexicon-based methods, traditional machine learning, neural networks, transformer models, and application domains [1,9,10,11,12]. These surveys offer extensive methodological coverage, although they focus predominantly on high-resource or recent digital-language environments. They rarely investigate the behavior of these methods in Turkic languages, where agglutinative morphology, script variation, and uneven corpus development introduce distinct limitations. They also do not elaborate on the historical aspect, more specifically, the issue of whether methods developed for contemporary low-resource languages can be extended to manuscript-era corpora, such as Chagatai. The novelty of this review is therefore that it tries not only to analyze Turkic languages, but also to connect three different areas: modern low-resource sentiment analysis, historical Turkic text sources, and cross-temporal transfer of tonality methods.
There are three dimensions that distinguish this review from prior surveys. First, it focuses on low-resource Turkic languages rather than treating them as minor examples within general multilingual sentiment analysis. Second, it links modern Turkic NLP with historical Chagatai material, which introduces problems of script change, orthographic instability, and diachronic semantic shift. Third, it discusses tonality analysis not only as a classification task, but also as a historical NLP problem where preprocessing, transfer learning, and evaluation without gold-standard labels become central methodological issues. In this sense, the contribution of this paper is to synthesize existing sentiment and tonality methods through the specific lens of low-resource Turkic and historical-language analysis.

2. Scope, Review Methodology, and Research Gap

The main focus of this work is Turkic languages, low-resource NLP, and historical textual corpora. The review focuses on methodological shifts and the major linguistic issues in this field, including agglutinative morphology, script change, and multilingualism.

2.1. Review Design and Analytical Scope

This review is a concept-driven synthesis and not a statistical meta-analysis. It traces the evolution in methodological development from lexicon-based and rule-based systems to transformer and large language models. It also indicates their performance on low-resource, historically layered data such as Turkic languages. The review focuses on the connection between modern Kazakh NLP and Chagatai. The analysis examines both methodological and linguistic aspects. Methodologically, it engages models and representation strategies, transfer learning, and evaluation constraints. Linguistically, it examines agglutinative morphology, orthographic instability, script transitions, lexical borrowing, and diachronic change. To make the analytical scope more explicit, this review is guided by two research questions:
  • RQ1: What tonality analysis methods have been applied to modern low-resource Turkic languages, and what are their comparative strengths and limitations?
  • RQ2: To what extent can existing methods be transferred or adapted to historical Turkic corpora, especially Chagatai manuscripts?
These questions connect the methodological review of modern Turkic sentiment analysis with the discussion of historical Chagatai material and cross-temporal transfer.

2.2. Search Strategy, Screening, and Selection Criteria

We searched the literature through the major academic databases Google Scholar, Scopus, Web of Science, IEEE Xplore, and ScienceDirect. To supplement the database search, we used reference chaining from survey articles and foundational studies. The search terms were tonality analysis, sentiment analysis, low-resource languages, Turkic languages, Kazakh NLP, Chagatai language, historical NLP, digital humanities, transformer models, and cross-lingual transfer. Broad query combinations were used to identify the main approaches to tonality and sentiment modeling. More specific combinations were used to identify studies in the following fields: Central Asian Turkic languages, manuscript digitization, and diachronic language analysis. We found additional sources through the bibliographies of survey papers and seminal publications in multilingual NLP and historical text processing.
The protocol followed a PRISMA-based format to increase transparency (Figure 1). The initial search returned 103 records. After screening titles and abstracts, we eliminated 32 records that did not match the focus on tonality or sentiment analysis.
Full-text retrieval was attempted for the remaining 71 records, of which 15 were inaccessible. The remaining 56 studies were assessed for eligibility based on methodological relevance, language focus, and contribution to this review’s aims. A small number were excluded for insufficient alignment with the primary research focus.
Beyond keyword-based searching, study selection aimed to balance methodological, linguistic, and historical perspectives. We prioritized papers addressing at least one of the following: computational tonality analysis; low-resource and morphologically rich language processing; Turkic NLP; multilingual transfer; or historical text analysis. We also retained a small number of digital humanities studies, such as [13], because they supported historical corpus construction, digitization, and historical language processing. Fifty-two studies were ultimately included in the final synthesis.
This review focuses on peer-reviewed journals, key conference papers, and surveys. We covered studies that proposed or evaluated methods for tonality/sentiment analysis, addressed multilingual or low-resource language processing, focused on Turkic languages, or contributed to historical NLP and corpus digitization.
Studies were excluded mainly for insufficient methodological detail or for lacking a clear contribution to modeling, data, or evaluation. Non-computational literary or historical studies were excluded except when they directly contribute to the understanding of Chagatai linguistic structure, manuscript traditions, or interpretive challenges.
This review does not treat Turkic languages in isolation. Where relevant, we also considered empirical studies of sentiment analysis in other low-resource and underrepresented languages, which demonstrate effective approaches to working with scarce resources. A 2025 study introduced a new aspect-based sentiment dataset and applied data augmentation while fine-tuning multilingual models [14]. For African languages, adaptive pretraining and source-language selection were used to minimize negative transfer in multilingual sentiment classification [15]. For low-resource languages, a lexicon-based sentiment model was developed through lexicon creation, annotation, augmentation, and fine-tuning [16]. These examples were not the focus of our investigation, but they help frame possible responses to similar problems in the Turkic and historical-language domains, where annotated datasets are small, model coverage is uneven, and direct supervised training is not possible.

2.3. Language, Topical Coverage, and Research Gap

English-language research provides a foundation for understanding the progression of machine learning and deep learning in tonality analysis [5,6]. The primary linguistic focus, however, is on Turkic languages—Kazakh, Uzbek, Kyrgyz, and, where applicable, Uyghur. These languages represent the most studied cases in multilingual and low-resource NLP within the Turkic family [7,8]. Kazakh gets the most attention because of the expanding annotated datasets. Russian appears exclusively in multilingual modeling and code-mixing. Chagatai is not only a historical background but also a benchmark to evaluate how well modern tonality analysis techniques can be applied to manuscript Turkic texts.
In terms of coverage, we discuss lexicon-based systems and rule-based approaches, as well as classical ML techniques, neural network architectures, transformer frameworks, and large language models. We also investigate how these paradigms interact with linguistic structure, resource availability, transferability, and historical distance.
Throughout, tonality is treated as more than polarity detection: it is the computational analysis of evaluative and ideological language across time, genres, cultures, and languages. This view is particularly applicable to Chagatai texts. Structural continuity between Chagatai and modern Turkic languages makes transfer-based modeling possible once diachronic variation, script changes, and historical anomalies are taken into account.
Despite substantial progress over the last 10 years, major gaps remain, the largest being data availability. Unlike high-resource languages, Turkic languages lack comprehensive, representative annotated corpora; available datasets are few, small, narrowly focused, and sometimes inconsistently annotated. This scarcity limits the capacity of models to generalize and makes it hard to test generalizability across studies.
A second gap lies in the assumptions of current methods. Most approaches assume a modern standardized language with consistent spelling and vocabulary. This does not hold for historical languages like Chagatai: spelling varies, scripts differ, and archaic vocabulary shifts the token distribution. Models trained on modern data handle historical material poorly, and so far this has not been addressed.
Third, lexicon-based and hybrid approaches, which are attractive in low-resource settings because they require no labeled training data, may not be robust to linguistic variation [16]. Exact word matching breaks down due to transliteration errors, spelling variation, and pronunciation-driven writing changes. For example, a historical Arabic-script form may be transliterated into Latin in more than one way, so that the same word appears as different tokens after preprocessing. Spelling variation creates a similar problem when a lexical item is written differently across manuscripts or editions. Pronunciation-driven writing changes also matter, especially in modern digital text, where users write a word closer to the way it sounds than to its standardized form. In all three cases, a lexicon-based system may fail to match the word even when the underlying meaning has not changed. By contrast, pure neural approaches trained on modern corpora have the opposite problem: domain mismatch.
Fourth, even though multilingual transformers and large language models show very good cross-lingual transfer, they are rarely applied to cross-temporal transfer. In particular, researchers rarely look at transfer across historical phases within the same language family. This is important for the Turkic family, whose historical corpora are rich but computationally underused.
Finally, evaluation in historical NLP is difficult. Without gold-standard annotations, standard benchmark metrics cannot be used. Most studies instead rely on indirect checks, such as inter-model agreement or qualitative analysis. These strategies are useful, but they make comparison and replication harder [17,18].
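The indirect checks mentioned above can be made concrete with a small sketch. The following Python example is illustrative only: the model names, labels, and the choice of simple pairwise agreement plus majority voting are assumptions, not the procedure of any study cited here.

```python
from collections import Counter
from itertools import combinations

def pairwise_agreement(predictions):
    """Mean share of items on which each pair of models assigns the same label."""
    models = list(predictions)
    n_items = len(predictions[models[0]])
    scores = []
    for a, b in combinations(models, 2):
        same = sum(x == y for x, y in zip(predictions[a], predictions[b]))
        scores.append(same / n_items)
    return sum(scores) / len(scores)

def majority_vote(predictions):
    """Per-item majority label and the share of models that voted for it.

    Ties are broken by first occurrence, which is arbitrary.
    """
    results = []
    for labels in zip(*predictions.values()):
        label, count = Counter(labels).most_common(1)[0]
        results.append((label, count / len(labels)))
    return results

# Hypothetical zero-shot outputs of three models on four sentences.
preds = {
    "model_a": ["pos", "neg", "neu", "pos"],
    "model_b": ["pos", "neg", "pos", "pos"],
    "model_c": ["pos", "neu", "neu", "pos"],
}
agreement = pairwise_agreement(preds)
votes = majority_vote(preds)
```

High agreement indicates that predictions are stable across models, not that they are correct, which is why such scores cannot replace gold-standard annotation.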
Taken together, these gaps call for practical frameworks with clear steps. Models need to combine multilingual representation learning with orthography-aware preprocessing. Evaluation also needs targeted annotations and alternative validation strategies. In this study, we combine zero-shot transformer inference with a hybrid ensemble and report an exploratory empirical study on Chagatai historical material.

3. Linguistic and Computational Foundations of Tonality Analysis

Tonal orientation goes beyond detecting positive and negative words. Early computational approaches relied heavily on explicit polarity lexicons, but later work showed that these methods miss much of what makes language evaluative. Tonal orientation typically arises from the interaction of lexical elements, grammatical structure, and discourse context, not from isolated tokens [1,4].
In this study, sentiment analysis and tonality analysis are used as equivalent working terms. However, the original difference between them should be clarified. In most English-language NLP research, sentiment analysis usually means identifying polarity, such as positive, negative, or neutral meaning. Tonality analysis is often broader, especially in post-Soviet and Eastern European research traditions. It may include sentiment, but also intensity, context, and the strength of evaluation. In this sense, tonality can be understood as sentiment plus intensity and context. Since this review compares studies that use both terms for closely related computational tasks, the terms are used equivalently throughout this paper, while this conceptual distinction is still recognized [4].
These challenges become more acute in multilingual settings. A language’s morphology, spelling, and history shape how evaluative meaning is expressed and how automatic systems retrieve it. Large-scale multilingual studies find that models developed for English degrade significantly on morphologically rich or typologically diverse languages [9,19].
Kazakh has productive agglutinative morphology. The spelling and vocabulary of Chagatai differ from those of modern digital corpora. Most NLP pipelines presume stable token boundaries and consistent spelling. For such languages, these assumptions break down unless the text is preprocessed. This review examines tonality analysis in two complementary fields: contemporary Turkic texts and historical manuscripts.

3.1. Tonality, Subjectivity, and Granularity of Analysis

Tonality analysis is closely connected to subjectivity detection, that is, distinguishing evaluative language from objective description. Subjective statements often encode attitudes, opinions, and emotional stances, although subjectivity per se does not imply overt polarity. Annotation studies find that many subjective sentences are polarity-neutral, while statements that at first appear purely factual may in fact be evaluative [4].
Computationally, opinion expressions are typically modeled as three-part structures: the opinion holder, the target, and the evaluative orientation towards that target (e.g., reviewer, product, positive/negative). To recover these relations, models need to attend to both word choice and the surrounding context. Research on online reviews and social media shows that polarity frequently stems from discourse-level relations between clauses rather than from individual words in a sentence [1]. In historical writings such as Chagatai manuscripts, writers express emotion through rhetorical framing rather than plain language. Judgment, commendation, and censure are mostly expressed through metaphor or narrative juxtaposition, which creates real problems for automated analysis [20].
Tonality analysis can operate at several levels of granularity: document, sentence, or aspect. The different levels require distinct computational trade-offs [1]. Document-level analysis gives a single label for the whole text, which simplifies classification but may hide local variations in tone; sentence-level analysis, by contrast, examines text in greater detail. Discourse relations often cross sentence boundaries. Words such as “however” or “although” can reverse the polarity of what comes before them, which means that models have to infer context from one sentence to the next. For example, the sentence “The service was fast. However, the quality was poor” begins with a positive evaluation but shifts to a negative judgment in the second sentence. A simple word-count model may detect both positive and negative words, while a contextual model must recognize that the statement after “however” carries the main evaluative meaning.
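The contrast between a word-count model and a context-aware one can be illustrated with a minimal sketch. The lexicon, the marker list, and the weight of 2.0 are assumptions chosen for the example; a real contextual model would learn such effects rather than hard-code them. The marker "although" is left out on purpose, because it introduces a concession and would need a different rule.

```python
import re

# Toy lexicon (assumption): +1 positive, -1 negative.
LEXICON = {"fast": 1, "good": 1, "poor": -1, "slow": -1}
# Markers after which the main evaluation is assumed to follow.
CONTRAST = {"however", "but"}

def tokens(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def count_score(text):
    """Plain word counting: sum of lexicon polarities."""
    return sum(LEXICON.get(t, 0) for t in tokens(text))

def contrast_score(text, weight=2.0):
    """Weight sentiment words that follow a contrast marker more heavily."""
    score, factor = 0.0, 1.0
    for t in tokens(text):
        if t in CONTRAST:
            factor = weight
            continue
        score += factor * LEXICON.get(t, 0)
    return score

example = "The service was fast. However, the quality was poor."
flat = count_score(example)         # positive and negative words cancel out
shifted = contrast_score(example)   # the clause after "however" dominates
```

With these toy values, the word-count score is neutral while the contrast-aware score is negative, which matches the reading described in the paragraph above.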
Aspect-level analysis connects evaluative clauses to specific entities or features. Models need to link each affective expression to the correct target. Recent sentiment analysis surveys suggest that contextual models perform better than feature-based classifiers in a vast majority of sentence- and aspect-level tasks [10,12].

3.2. Agglutinative Morphology and Representation Challenges

The Turkic languages, Kazakh included, exhibit highly productive agglutinative morphology. Grammatical information is conveyed by attaching suffixes to a lexeme, so a single root appears in many surface forms depending on context. Bag-of-words models treat these forms as separate tokens, so the vocabulary grows rapidly and the lexical overlap between documents shrinks. Such sparsity is a well-known problem for machine learning in morphologically rich languages [19].
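The sparsity effect can be shown with a few inflected forms of one Kazakh root. This is a sketch under stated assumptions: the suffix list is a hypothetical hand-picked subset, and the naive stripper below is not a morphological analyzer and would over-strip on real text.

```python
# Inflected forms of "кітап" (book): books, my books, in my books, to the books.
forms = ["кітап", "кітаптар", "кітаптарым", "кітаптарымда", "кітаптарға"]

# Hypothetical suffix subset: plural, 1sg possessive, locative, dative.
SUFFIXES = ("тар", "ым", "да", "ға")

def strip_suffixes(word, min_stem=3):
    """Repeatedly remove a known suffix while at least `min_stem` letters remain."""
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                changed = True
                break
    return word

surface_vocabulary = set(forms)                   # what a bag-of-words model sees
stem_vocabulary = {strip_suffixes(w) for w in forms}
```

Five distinct surface tokens collapse to a single stem here, which is the overlap that a bag-of-words representation loses without segmentation.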
Without preprocessing such as stemming or morphological segmentation, models fail to generalize across related inflected forms. The problem is compounded in historical corpora, where varying spellings and archaic vocabulary fragment the corpus further. For historical Turkic text, normalization is not a single action; it generally requires several independent decisions. First, orthographic normalization is required because the same word can appear with different spellings across manuscripts, editions, or transcription systems. This step is helpful but carries risks: certain spelling differences may reflect period, region, or scribal practice, and should not be erased automatically. Second, script normalization is required when Arabic-script Chagatai, Cyrillic Kazakh, and Latin transliteration are compared in the same pipeline, so that related forms are not treated as unrelated tokens. Third, lexical normalization sometimes maps archaic, Persian, and Arabic-derived forms to Turkic equivalents; this calls for caution, because the modern form may not carry the historical meaning. Fourth, morphological normalization is particularly necessary for agglutinative Turkic languages. Stemming, lemmatization, or suffix segmentation can effectively reduce sparsity, but such tools are generally not available for historical material, and in some cases a custom normalizer or rule set must be checked manually. Normalization in historical Turkic NLP should therefore treat each layer separately: spelling, script, vocabulary, and morphology.
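A layered design of this kind might look like the sketch below. Everything specific in it is an assumption made for illustration: the digraph table, the decision to fold only vowel-length macrons, the empty lexical map, and the example spellings. The morphological layer is omitted because it would need an analyzer that is generally unavailable for historical material.

```python
import unicodedata

DIGRAPHS = {"sh": "š", "ch": "č"}   # hypothetical transliteration variants
LEXICAL_MAP = {}                    # archaic -> modern; left empty on purpose

def normalize_script(token):
    """Script layer: bring Latin transliterations to one convention."""
    token = unicodedata.normalize("NFC", token.lower())
    for source, target in DIGRAPHS.items():
        token = token.replace(source, target)
    return token

def normalize_orthography(token):
    """Orthographic layer: fold vowel-length macrons (U+0304) and nothing else."""
    decomposed = unicodedata.normalize("NFD", token)
    kept = "".join(ch for ch in decomposed if ch != "\u0304")
    return unicodedata.normalize("NFC", kept)

def normalize_lexicon(token):
    """Lexical layer: optional mapping, applied only to listed forms."""
    return LEXICAL_MAP.get(token, token)

def normalize(token):
    """Apply the layers in order and keep every intermediate form for auditing."""
    trace = {"original": token}
    trace["script"] = normalize_script(token)
    trace["orthography"] = normalize_orthography(trace["script"])
    trace["lexical"] = normalize_lexicon(trace["orthography"])
    return trace

# Three hypothetical spellings of one word in different transcription systems.
variants = ["pādišāh", "padishah", "Pādishāh"]
keys = {normalize(v)["lexical"] for v in variants}
```

Keeping the trace rather than overwriting the token preserves the original spelling, so variation that reflects period, region, or scribal practice stays available for later analysis. A blanket digraph replacement can also merge genuine letter sequences, so such a table would need manual checking.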

3.3. Historical Variation: Chagatai and Modern Turkic Languages

Poetry, scholarship, and historical documentation were written in Chagatai for centuries. It is structurally related to contemporary Turkic languages but differs in vocabulary, orthography, and stylistic convention. The older the text, the more its spelling, grammar, and vocabulary diverge from modern usage, which causes a significant loss of model performance. Languages with fewer digital resources have been reported to perform worse, and this carries over to downstream tasks such as sentiment analysis. The gap between modern Turkic languages and Chagatai is substantial [7], which means that models trained on Kazakh data cannot be applied to Chagatai manuscripts without domain adaptation.

3.4. Resources, Transfer Learning, and Zero-Shot Applicability

Multilingual transfer learning and zero-shot inference have become viable solutions. Experiments using multilingual lexicon-based pretraining across more than 100 languages demonstrate substantial advancements in cross-lingual sentiment classification [21].
BERT and similar models provide powerful contextual representations [5,22]. However, their performance often declines when applied to languages with fewer resources, unless they are specifically fine-tuned for a particular domain. Combining multilingual representation learning with linguistically informed preprocessing yields promising results for analyzing tonal orientation in historical Turkic texts, though this remains an evolving research domain without an established solution.
Recent developments in historical NLP and digital humanities have slowly expanded the computational infrastructure for Turkic manuscript research. Chagatai served as a supraregional literary language from the fifteenth to the nineteenth centuries, which makes it an ideal subject for diachronic evaluation of evaluative language [20,23].
OpenITI [24] supports computational work on Islamicate intellectual history and provides access to Turkic materials in Arabic script. These texts remain modest in scale compared with the full collection. However, they provide historically rich text tokens with genre, date, and author metadata. They are therefore helpful for tracking changes in rhetorical framing and lexical usage across time. Research libraries in Europe and Central Asia, such as the BULAC catalogues [25], also provide high-resolution scans and partial transcriptions, although transcription quality and OCR performance vary.
Manuscript texts combine Turkic, Persian, and Arabic lexis, a historical parallel to the multilingual communication found in Central Asia today [8]. This historical multilingualism suggests that cross-lingual NLP approaches developed for modern code-switching may be relevant to manuscripts, even though direct comparisons are lacking and the language mix in the manuscripts differs from that of contemporary text.
From a quantitative perspective, modern Turkic tonality datasets contain over 100,000 labeled instances, whereas digitized Chagatai collections yield only a few hundred usable sentences after preprocessing. Despite their small size, their value as historical and cultural artifacts makes them useful for evaluating how models respond to semantic drift, orthographic instability, and domain mismatch. For this kind of work, a rigorous evaluation framework is still lacking, and collaboration among computational linguistics, digital humanities, and Turkological research is needed [26].

4. Lexicon-Based and Rule-Based Methods for Tonality Analysis

Lexicon-based and rule-based approaches were among the earliest computational techniques for modeling evaluative language without labeled data. The basic idea is straightforward. The system evaluates tonal meaning by identifying sentiment units, assigning polarity and intensity, and then aggregating the results [1,4,12]. These approaches remain relevant where interpretability is important, labeled data is scarce, or the domain is too unstable to support repeated retraining.
This is especially evident in the Turkic context. The Turkic family comprises approximately 35 languages spoken by over 200 million people [7]. When analyzing historical records like Chagatai, exclusive reliance on data-driven methods is inadequate. Lexicons and rules provide analytic targets and reference values before transfer-based modeling is attempted [8].

4.1. Tonality Lexicons

To determine tonality, the model needs three things: an evaluative lexicon, an orientation encoding, and an aggregation method. Aggregation matters because simple summation overcounts repeated markers and ignores discourse structure. Proximity weighting or sentence position may help, but can hurt cross-genre portability [1,4].
The Kazakh emotional lexicon [27] contains about 11,000 annotated emotional terms. Two features are particularly relevant for historical work. First, the scale captures not only polarity but also relative intensity, which is well aligned with the notion of tonal strength (strong denunciation vs. mild disapproval). Second, the polarity-plus-intensity representation is suitable for mapping vocabulary over time, even though the vocabulary of the Chagatai language is very different from modern Kazakh.
KazSAnDRA [28] contains 180,064 reviews rated from 1 to 5. The researchers found that a fixed evaluative lexicon cannot capture instances where praise, sarcasm, and criticism are expressed in the same sentence. In historical corpora, the challenge is greater. Most writers express approval or disapproval indirectly. A passage could praise a ruler or criticize an opponent without relying on the specific sentiment words that vocabulary-based methods can identify [29].
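The overcounting problem can be seen in a small sketch that compares plain summation with a capped variant. The lexicon entries, intensity values, and the cap of one occurrence per marker are assumptions for illustration; capping is one of several possible aggregation choices, not a method taken from the cited work.

```python
from collections import Counter

# Toy lexicon (assumption): term -> (polarity, intensity in [0, 1]).
LEXICON = {"great": (1, 0.8), "terrible": (-1, 0.9), "fine": (1, 0.3)}

def sum_score(tokens):
    """Naive aggregation: every occurrence of a marker counts in full."""
    return sum(LEXICON[t][0] * LEXICON[t][1] for t in tokens if t in LEXICON)

def capped_score(tokens, cap=1):
    """Count each marker at most `cap` times to limit repetition effects."""
    counts = Counter(t for t in tokens if t in LEXICON)
    return sum(
        min(n, cap) * LEXICON[t][0] * LEXICON[t][1] for t, n in counts.items()
    )

tokens = "great great great but terrible".split()
naive = sum_score(tokens)      # repetition of "great" outweighs "terrible"
capped = capped_score(tokens)  # the stronger single marker decides the sign
```

The two aggregators disagree on the sign for this input, which shows how much the aggregation rule alone can change the outcome of a lexicon-based system.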

4.2. Rule-Based Tonality Analysis

Rule-based systems overcome the drawbacks of lexical scoring by accounting for negation scope, intensification, and contrastive markers. These are important because polarity functions compositionally and depends on scope and discourse structure. A reversal sets apart ‘Good’ from ‘not good’, and concessive signals show how stance changes. To capture these effects, we need explicit language rules that work above the token level.
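A minimal sketch of such rules is given below. The lexicon, the negator and intensifier lists, the weights, and the fixed scope of two tokens are all assumptions; real systems usually derive scope from syntax rather than from a token window.

```python
import re

LEXICON = {"good": 1.0, "bad": -1.0}             # toy lexicon (assumption)
NEGATORS = {"not", "never"}
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}    # hypothetical weights

def rule_score(text, scope=2):
    """Flip polarity within `scope` tokens after a negator and scale a
    sentiment word by the intensifier that directly precedes it."""
    toks = re.findall(r"[a-z']+", text.lower())
    score, negate_left = 0.0, 0
    for i, tok in enumerate(toks):
        if tok in NEGATORS:
            negate_left = scope          # open a negation window
            continue
        if tok in LEXICON:
            value = LEXICON[tok]
            if i > 0 and toks[i - 1] in INTENSIFIERS:
                value *= INTENSIFIERS[toks[i - 1]]
            if negate_left > 0:
                value = -value
            score += value
        negate_left = max(negate_left - 1, 0)   # the window shrinks per token
    return score

plain = rule_score("good")
negated = rule_score("not good")
negated_intense = rule_score("not very good")
```

A fixed window is easy to inspect but brittle: a sentiment word that falls outside the window keeps its original polarity even when it is still logically negated.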
In Kazakh, rule-based work should begin with morphology. Morphological rules have been used to detect tonality in Kazakh text [27]. Semantic hypergraph structures were used to represent ontological and morphological relations. The logic is simple: suffixes can flip polarity, so token-level lookup will miss these shifts unless the text is morphologically preprocessed.
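The effect of a polarity-flipping suffix can be sketched with the Kazakh privative suffix -сыз/-сіз ("without"). The two roots, their polarity values, and the glosses in the comments are assumptions for illustration and are not taken from the lexicon in [27]; a real system would rely on a morphological analyzer rather than string matching.

```python
# Toy entries (assumption): "пайда" benefit (+1), "қауіп" danger (-1).
ROOT_POLARITY = {"пайда": 1, "қауіп": -1}
PRIVATIVE = ("сыз", "сіз")   # privative suffix variants, "without X"

def token_lookup(word):
    """Token-level lookup: only exact lexicon entries are scored."""
    return ROOT_POLARITY.get(word, 0)

def suffix_aware_polarity(word):
    """Score the root and flip its polarity when a privative suffix is attached."""
    if word in ROOT_POLARITY:
        return ROOT_POLARITY[word]
    for suffix in PRIVATIVE:
        root = word[: -len(suffix)]
        if word.endswith(suffix) and root in ROOT_POLARITY:
            return -ROOT_POLARITY[root]
    return 0

useless = "пайдасыз"   # "without benefit"
safe = "қауіпсіз"      # "without danger"
missed = token_lookup(useless)              # exact lookup finds nothing
flipped = suffix_aware_polarity(useless)    # root found, polarity reversed
```

The exact lookup returns no score for the suffixed form, while the suffix-aware rule recovers the root and reverses its polarity, which is the shift described in the paragraph above.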
For Chagatai, the same principles apply, but the failure modes differ. The main challenge is orthographic and lexical change rather than morphological attachment [7]. In Chagatai this is the default situation, not an exception.

4.3. Applications, Regional Datasets, and Resource Imbalance

Early Kazakh sentiment work was limited by the amount of data: on a dataset of 10,000 labeled news articles, accuracy reached only 72.8% for Kazakh versus 86.3% for Russian in the same setup [28,30]. The authors attributed this shortfall directly to insufficient training data and the absence of lemmatization (morphological normalization), which is essential for the field.
With larger corpora now available, the definition of “baseline” has changed. On KazSAnDRA, the best system achieved an F1 of 0.81 for coarse polarity classification but only 0.39 for five-class score classification [28].
Uzbek sentiment resources consist of roughly 4.5 K positive and 3.1 K negative restaurant reviews plus 2.5 K positive and 1.8 K negative app reviews [8]—useful, but far smaller than KazSAnDRA and much more limited than what exists for English.
Beyond dataset size, one major challenge of Turkic tonality research is the lack of consistency in annotated resources—scope, domain coverage, and annotation quality vary significantly across sources and form a major factor in model choices.
Turkish is the most computationally developed Turkic language. Balanced polarity datasets and domain-diverse corpora have supported research with both classical machine learning and transformer architectures. Recent studies show that annotation consistency and quality contribute to classification accuracy in agglutinative languages, where rich morphology leads to lexical sparsity [31]. Earlier deep learning work has shown that neural methods are effective for Turkish sentiment analysis, including on informal social media text [32]. Turkish therefore serves as a methodological benchmark for Turkic NLP.
Resources for Central Asian Turkic languages are far more limited. Uzbek sentiment research has focused on curated review corpora collected from restaurant and app platforms, including samples with more than 8000 annotated comments. On such small, controlled review data, logistic regression with TF–IDF features reaches approximately 91% accuracy, a result ascribed to domain-specific features. Work on Uzbek social media has revealed additional challenges arising from emoji-laden informal text, as well as the need for emotional vocabulary resources [33,34]. Researchers have also reported that script normalization, Cyrillic-to-Latin transliteration, and morphological stemming influence reproducibility.
Well-constructed datasets for Kyrgyz remain hard to find. Experiments have relied on translated review transcripts or manually collected web comments. For translated film reviews, logistic regression achieved accuracy as high as 0.83 and F1-scores of up to 0.84 [35]. This reflects a well-documented trend in low-resource NLP: simpler models tend to perform well on smaller datasets. The field relies heavily on machine-translated or synthetic data, which causes reproducibility issues.
More recently, Uyghur sentiment analysis has explored cross-lingual transfer and hybrid neural architectures. BERT-BiLSTM models improve classification significantly [36]. Earlier studies with cross-lingual representations and data augmentation show that shared semantic information can partly compensate for the scarcity of native annotation [37,38]. However, orthographic variation and morphological complexity still pose challenges for large-scale evaluation.
Empirical case studies across Turkish, Uzbek, Kazakh, Kyrgyz, and Uyghur show that data scale is the main factor in choosing methods for sentiment analysis. Languages with larger annotated corpora permit the use of transformer models, while smaller datasets favor hybrid pipelines that combine rule-based components with multilingual transfer. The regional picture also offers indirect insight into Chagatai. Modern Turkic languages share agglutinative morphology and lexical continuity with earlier traditions. However, script conventions, genre distribution, and rhetorical norms differ enough that tonal cues do not transfer directly across historical periods (Table 1).

4.4. Cross-Lingual and Zero-Shot Lexicon Transfer

Lexicon-based methods can also be reused in zero-shot and cross-lingual pipelines. Translated lexicons, multilingual alignment, and lexicon-assisted prompting each provide a weak tonal signal when no annotations exist. For historical Chagatai, the main question is not whether lexicons can solve the problem but whether they can supply constraints that stabilize cross-temporal transfer.
Two bridging strategies are practical. The first is lexicon-mediated normalization: historical forms are normalized orthographically or transliterationally, mapped to modern cognates where possible, and then scored. The second is lexicon-assisted transfer: lexicon features support multilingual models that are applied with little or no in-domain annotation [7,21]. Evidence for both strategies indicates small but consistent gains, which is acceptable because the goal is robustness and interpretability, not leaderboard accuracy [21].
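The first strategy, lexicon-mediated normalization, can be sketched as a three-step lookup. The variant table, the cognate mapping, and the polarity values below are small hypothetical examples written in Latin transliteration; they are illustrative assumptions and not an attested or validated resource.

```python
import unicodedata

# Hypothetical resources, for illustration only.
VARIANTS = {"yakhshi": "yaxshi", "yakhshy": "yaxshi"}  # spelling variant -> canonical form
COGNATES = {"yaxshi": "jaqsy", "yaman": "jaman"}       # historical form -> modern cognate
MODERN_LEXICON = {"jaqsy": 1.0, "jaman": -1.0}         # modern polarity lexicon


def normalize(token):
    """Lowercase, strip combining diacritics, and collapse known variants."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return VARIANTS.get(base, base)


def score_sentence(tokens):
    """Return (score, coverage): summed polarity and share of tokens matched."""
    score, matched = 0.0, 0
    for tok in tokens:
        modern = COGNATES.get(normalize(tok))
        if modern in MODERN_LEXICON:
            score += MODERN_LEXICON[modern]
            matched += 1
    coverage = matched / len(tokens) if tokens else 0.0
    return score, coverage
```

Returning coverage alongside the score is a deliberate choice in this sketch: when few tokens can be mapped to modern cognates, the score should be read as a weak constraint rather than as a classification.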

5. Traditional Machine Learning Approaches to Tonality Analysis

Classical machine learning techniques advanced computational tonality analysis significantly. These models learn complex word co-occurrence patterns from labeled data. From the early 2000s, researchers made broad use of supervised classifiers for sentiment analysis, until deep neural architectures emerged [1,12].
These approaches provide a basis for diachronic analysis. Classifiers trained on modern Kazakh corpora can serve as reference models for digitized Chagatai material. Direct transfer will not be accurate, but it offers a starting point for comparative analysis and can expose changes in evaluative expression between historical periods. Standard implementations transform text into vectors (bag-of-words, TF-IDF, n-grams), which are then passed to classification algorithms that assign tonality labels. Common examples are Naïve Bayes, Logistic Regression, Support Vector Machines (SVMs), Decision Trees, and k-Nearest Neighbors. SVMs have historically been among the most effective text classifiers because they perform well in high-dimensional feature spaces [39]. In many cases, the choice of representation matters far more than the choice of classifier.
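The vectorize-then-classify pattern can be sketched with a multinomial Naïve Bayes classifier over raw bag-of-words counts. The sketch uses only the Python standard library and raw counts rather than TF-IDF weights; the whitespace tokenizer, the add-one smoothing, and the toy English training sentences are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on whitespace-tokenised text."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_nb(model, doc):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n in class_counts.items():
        total = sum(word_counts[label].values())
        logp = math.log(n / n_docs)
        for word in doc.lower().split():
            if word in vocab:  # words never seen in training are ignored
                logp += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Toy training data (invented for this example).
DOCS = [
    "great food and friendly staff",
    "terrible service and cold food",
    "friendly and great",
    "cold and terrible",
]
LABELS = ["positive", "negative", "positive", "negative"]
MODEL = train_nb(DOCS, LABELS)
```

Because every surface form is a separate feature, the same model would treat each inflected form of an agglutinative word as unrelated, which is the representational limit discussed in the next subsection.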

5.1. Feature Representation and Low-Resource Applications

Feature representation is a fundamental design decision in traditional machine learning. Bag-of-words records token occurrence or frequency but does not account for word order, grammar, or semantic similarity. To compensate, researchers added part-of-speech tags, negation markers, and syntactic features. These improve precision but increase preprocessing complexity and reduce portability [19].
Rich inflection and agglutinative morphology in Kazakh expand the vocabulary and reduce the effectiveness of bag-of-words. Applying morphological normalization helps prevent different forms from becoming distinct tokens.
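The vocabulary effect of morphological normalization can be shown with a toy suffix stripper. The suffix list is a small assumed subset of Latin-transliterated plural and locative endings, and greedy stripping is a crude stand-in for a real morphological analyzer, which would be needed in practice to avoid over-stripping.

```python
# Assumed subset of plural and locative endings (Latin transliteration).
SUFFIXES = ("lar", "ler", "dar", "der", "tar", "ter", "da", "de", "ta", "te")


def strip_suffixes(token, min_stem=3):
    """Repeatedly remove the longest matching suffix.

    Stripping stops when no suffix matches or when removing one would
    leave fewer than `min_stem` characters.
    """
    changed = True
    while changed:
        changed = False
        for suf in sorted(SUFFIXES, key=len, reverse=True):
            if token.endswith(suf) and len(token) - len(suf) >= min_stem:
                token = token[: -len(suf)]
                changed = True
                break
    return token


# Four surface forms of one noun (illustrative transliterations).
forms = ["kitap", "kitaptar", "kitapta", "kitaptarda"]
raw_vocab = set(forms)
norm_vocab = {strip_suffixes(f) for f in forms}
```

In this example the four surface forms collapse to a single stem, so a bag-of-words model sees one feature instead of four.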
Traditional ML methods dominated early research because they suited small datasets. Early Kazakh studies therefore used standard classifiers with manually curated lexicons and morphological preprocessing [27,40]. Standard classifiers provide competitive baselines for detecting offensive content in Kazakh when linguistic preprocessing is applied successfully [41]. Work on topic modeling in a multilingual context likewise emphasizes that methods must be adapted to handle complex morphology and limited data [42].
Today, feature-based classifiers serve mainly as baselines on larger datasets. Sparse feature representations, however, do not capture fine-grained tonal distinctions or long-range dependencies, and they generalize poorly across domains [28].

5.2. Historical Corpora, Diachronic Variation, and Multilingualism

Applying classical machine-learning approaches to historical Turkic material poses multiple complications. Feature representations learned on modern Kazakh often fail to generalize to Chagatai. Chagatai was written in Arabic script with inconsistent spelling, while modern Kazakh uses primarily Cyrillic (now transitioning to Latin). Arabic-script spelling and Cyrillic tokenization clash, breaking standard tokenizers and mapping methods.
From an NLP standpoint, historical corpora represent an extreme case of domain shift. Historical corpora thus act as both linguistic resources and stress tests.
Chagatai literature incorporates Persian and Arabic stylistic features together, so it reflects the cultural context of medieval and early modern Central Asia. Terminological traditions from several languages challenge standard pipeline design, which was not made for such tasks.
This blending of languages is not unique to Chagatai literature. A structurally similar phenomenon appears in contemporary Kazakh, where speakers routinely blend Kazakh and Russian within a single sentence, a practice known as code-mixing. Methods developed for modern code-mixing, including language identification, cross-lingual embedding alignment, and multilingual representation learning, are therefore partially applicable. The analogy has limits, but it is not without basis.
Classical ML has clear limits in tonality analysis. Manual feature engineering cannot capture long-range dependencies or contextual nuances. Classifier performance degrades over time or when the domain changes, because vocabulary shifts and changing discourse traditions alter feature distributions. These constraints motivated researchers to turn to neural networks that learn semantic representations directly from data.

6. Neural and Transformer-Based Methodologies for Tonality Analysis

Neural networks brought a major shift in computational tonality analysis. They replaced hand-crafted features with hierarchical representations learned from data [1,43,44]. This shift made it possible to model intricate tonal phenomena such as negation, discourse contrast, and implicit evaluation, but it requires much larger datasets, which remains a limiting factor in resource-poor contexts.
As larger datasets became available, some studies shifted to neural methods [45]. Morphological complexity and orthographic variation nevertheless remain structural challenges for current state-of-the-art neural architectures.

6.1. Early Neural Architectures: Embeddings, CNNs, and RNNs

Distributed word representations are the foundation of early neural sentiment systems. Dense vectors place semantically similar words close together in vector space. Word embeddings thus enable models to generalize across closely related words and to detect semantic features beyond exact token matching, which mitigates sparsity in agglutinative languages [1].
The problem is that early embeddings are static: every word type has a single vector that does not depend on context. This is an obvious limitation for tonality analysis, since a positive evaluation in one context may be ironic in another. This limitation motivated researchers to adopt contextualized representations.
Convolutional neural networks (CNNs) were among the first neural models employed in sentiment analysis. They take text as a sequence of word embeddings, process it with convolutional filters, and focus on recognizing local features. Given sufficient training data, these models outperform many traditional classifiers.
Their main limitation is a narrow context window. Local pattern detection is effective for short texts but less so for conveying tone across a longer passage. This is particularly problematic for Chagatai manuscripts, where rhetorical evaluation can extend over several paragraphs.
Recurrent neural networks (RNNs) were developed to model sequential dependencies explicitly. Unlike CNNs, RNNs process tokens one by one and update their internal state with each new piece of context. Long short-term memory (LSTM) and gated recurrent unit (GRU) architectures, which can store information over longer sequences, mitigate the vanishing gradient problem of earlier recurrent networks [1].
RNN models are particularly effective when polarity depends on sentence or document context, e.g., when negation, concessive constructions, and contrastive markers shift evaluative orientation between clauses. In practice, recurrent architectures are computationally expensive and difficult to parallelize, which limits their scalability on large corpora [46]. Performance also still depends on the available training data, which limits progress in many Turkic languages.

6.2. Transformer Models for Low-Resource and Historical Texts

The transformer architecture addresses these limitations of CNNs and RNNs. Through self-attention, each token can be related to every other token in the same sequence, so models can capture global context without the bottleneck of recurrent processing. Word encodings are contextual, i.e., they represent usage in the sentence rather than a fixed vector [6]. This is important in tonality analysis, where the meaning of a sentence or passage often depends on how its tokens interact.
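The core of self-attention can be illustrated with a simplified sketch in plain Python. It uses identity projections, so queries, keys, and values are all the input vectors themselves; a real transformer adds learned projection matrices, multiple heads, and positional information, all of which are omitted here as a simplifying assumption.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with identity projections.

    X is a list of token vectors of equal length d. Each output vector is
    a weighted average of all input vectors, with weights given by the
    softmax of scaled dot products. Returns (outputs, weights).
    """
    d = len(X[0])
    weights = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, X)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights
```

Each row of `weights` sums to one and spans the whole sequence, which is what allows a token's representation to depend on any other token regardless of distance.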
Transformer models have made substantial progress. They make it possible to capture discourse-level tonal processes that earlier architectures could not model. A promising research direction is to fine-tune pretrained transformers on hybrid corpora of contemporary Kazakh text and digitized Chagatai manuscripts. Although still under-explored, such training regimes could allow models to track how evaluative language changes over the history of the Turkic language family.
Multilingual transformers use shared parameter spaces and subword vocabularies to extend contextual understanding across languages. This is achieved through cross-lingual transfer, which allows knowledge from high-resource languages to benefit low-resource ones [47]. For contemporary Turkic NLP, multilingual transformers are an important step forward. With careful fine-tuning and language-specific preprocessing, they become more robust to morphological variation.
However, cross-lingual transfer is not always equally effective across languages and domains. For low-resource languages, pretrained multilingual models may have less exposure to the target language, writing system, or domain-specific vocabulary. This can make tokenization less reliable and can reduce performance when the model is applied to specialized datasets [7,47].
Transformer architectures present new opportunities for historical language analysis. Using mixed datasets of modern and historical texts, pretrained models are able to learn connections between words of the past and the present. This is consistent with trends in both historical NLP and digital humanities research, although results on manuscript-era Turkic texts are still preliminary.

6.3. Use of Large Language Models in Tonality Analysis

The emergence of large language models (LLMs) has provided a new paradigm for tonality analysis that no longer depends on fine-tuning for classification. Models trained on large multilingual corpora can rely on instruction-based prompting to assess evaluative orientation in unfamiliar languages or domains, including Turkic texts.
Encoder-based models like BERT and XLM-RoBERTa need task-specific adaptation. Generative LLMs are general-purpose and allow tone analysis to be treated as a reasoning task. Empirically, prompt-based approaches can achieve competitive coarse-grained polarity detection in low-resource settings, although performance is sensitive to prompt formulation and contextual ambiguity [47,48].
Reported numbers suggest that LLM-based inference on product review datasets achieves accuracy of 78–84%, while SVM baselines under comparable conditions achieve 72–80% [48]. These comparisons warrant caution given the differing dataset size, domain, annotation, and evaluation. In Turkic-language settings, fine-tuned XLM-RoBERTa has exceeded 90% on KazSAnDRA (≈180,000 reviews), whereas conventional feature-based models on smaller datasets often reach only the mid-60% range [28,40]. This pattern suggests that LLM inference is useful for exploratory zero-shot work in low-resource or historically distant settings; encoder transformers perform better with ample labeled data.
From the perspective of Turkic historical NLP, LLMs have several advantages. They adapt quickly to new settings, tolerate orthographic variation, and can generate synthetic data. Their training corpora cover diverse genres and eras. Although Chagatai itself is rarely represented, patterns learned from related languages may carry over. Instruction-based interaction supports iterative interpretation and allows researchers to customize prompts when differences in rhetorical stance or tone are relevant.
But limitations do exist. Tonal assessment in historical texts depends on culturally specific rhetorical traditions that are absent from modern training data, so these models tend to favor modern evaluative norms. Exact replication is more difficult because generative outputs vary between runs. Chagatai’s lexical mixing stresses multilingual LLMs in ways that depend heavily on preprocessing decisions such as context window selection and transliteration. LLM reasoning also remains largely opaque, whereas encoder models support systematic error analysis through attribution techniques. This opacity is a significant limitation in historical research, where interpretability is vital for philological analysis [8].
LLM-based analysis should be understood as a supplementary tool rather than a substitute for traditional linguistic methods. Continued pretraining on a corpus containing contemporary Turkic as well as transliterated historical texts may improve the modeling of diachronic change.
Neural and transformer-based models have substantial computational costs. Training and fine-tuning large models require specialized hardware and significant amounts of energy. To address this constraint, hybrid pipelines that combine efficient filtering models with more expensive neural components have emerged. The trajectory—from lexicon-based methods through machine learning to transformer architectures—shows how analysis has become more computationally tractable. Recent work has begun to extend these techniques to historical corpora that include digitized Chagatai manuscripts.

7. Comparative Analysis of Tonal Analysis Methodologies

The modeling paradigms reviewed in the preceding sections vary in both architectural design and operational effectiveness. When tonality analysis is used specifically for low-resource languages or historical texts, data requirements, suitability for different corpus types, and how well they use context become especially crucial.
Table 2 summarizes key features of different types of models. We have highlighted the structural trade-offs between transparency, computational cost, and representational capacity in this comparison.

7.1. Representational Capacity and Performance

In lexicon-based and rule-based systems, predictions are based on explicitly defined lexical resources that link polarity assignments to model decisions. This transparency makes them especially useful for researchers who need to understand how a model behaves, with historical text analysis and digital humanities as prime examples [4]. Their ability to model context is limited, as shown in Table 2.
Classical machine learning methods are driven by statistical associations between textual features and tonality labels. SVMs and logistic regression both achieve 70–80% accuracy on large, balanced sentiment datasets [50].
Neural architectures offer substantially greater representational power. When labeled data are available, both convolutional and recurrent networks usually yield better macro-F1 [51]. Transformers show the largest improvements in sentiment analysis. Their contextual embeddings allow the model to interpret the sentiment of a word from its surrounding context. This is also evident from Kazakh sentiment experiments: conventional feature-based models reach an accuracy of around 60%, whereas a multilingual transformer trained in supervised settings can reach 90% [40].
Lexicon-based systems are computationally inexpensive and require no extensive training. Classical ML models can likewise be built at low cost and deployed in real time. Transformers, on the other hand, require GPUs for fine-tuning, and their inference time grows with model size [5,52]. Hybrid pipelines that combine efficient filtering with heavier neural architectures have proven an appropriate solution [53]. No architecture, however, is ideal; each has different strengths depending on the dataset, language structure, and domain.

7.2. Context, Transfer Learning, and Diachronic Use

A key difference between the models concerns text structure and length. The models also use context differently: earlier approaches rely on local context, whereas transformer models use global context to interpret the entire text. This implies that classical methods remain competitive on short texts with dense evaluative content. For instance, the macro-F1 scores of SVM and Random Forest classifiers are close to those of much larger neural models on short review texts [48]. Conversely, contextual modeling becomes more important for longer texts, where systems capable of representing extended discourse generally outperform simpler models.
Languages with rich morphology, such as Kazakh, multiply word forms and lower lexical overlap. This sparsity makes it difficult for feature-based models to work with small datasets. Multilingual transformers yield the most consistent results in such settings, since high-resource languages can support low-resource languages via cross-lingual transfer [47]; the downside is increased computational cost and poor interpretability. Hybrid approaches that leverage lexical resources while enabling cross-lingual transfer have been investigated in zero-shot and few-shot settings. Their findings show modest but consistent gains, which suggests that larger architectures are not the only solution in low-resource settings.
Lexicon-based methods are comparatively better suited to historical Chagatai material, as they rely on explicit word mappings rather than statistical patterns learned from modern corpora. In the long run, transformer architectures show great promise for diachronic research. By combining contemporary Kazakh data with digitized Chagatai manuscripts, semantic correspondences between these periods could be better established.

8. Empirical Evidence, Reported Performance Trends, and Application Domains

This section ties the methodological discussion to reported empirical findings and application fields. First, it provides a high-level overview of general performance trends across model families, noting that the reported results derive from heterogeneous datasets and evaluation settings. It then explains why historical corpora need to be interpreted separately, since standard supervised evaluation is frequently impracticable. Next, it summarizes the key multilingual, low-resource, and historical application contexts relevant to Turkic tonality analysis. Finally, it presents a pilot study that applies zero-shot tonality analysis to Chagatai literature, providing preliminary empirical support for the analytical approach discussed in this paper.

8.1. Reported Performance Trends

Empirical studies provide evidence on how tonality analysis systems operate in practice. Performance cannot be explained by model architecture alone. It varies with several confounded factors, including dataset size, annotation quality, linguistic structure, model architecture, and domain context. In this respect, tonality analysis resembles a multivariable optimization problem, in which an improvement in one factor can still be offset by limitations in another, such as sparse annotation, domain mismatch, or unstable preprocessing.
Table 2 summarizes representative performance ranges from studies published from 2016 through 2026. Datasets differ in size, domain, labels, and languages, so reported ranges reflect trends, not results from comparable evaluations.
The shift from classical machine learning to neural architectures is perhaps the most apparent development in the literature. Support Vector Machines and Logistic Regression reach an average accuracy of 65–75% on medium-sized sentiment datasets [27,49]. The consistency of this range across languages indicates that sparse feature representations are a limiting factor. For English datasets, convolutional and recurrent neural models typically obtain 82–90% accuracy [12,43]. This gain of 10–15 percentage points is mainly attributable to word embeddings and hierarchical feature learning. Transformers add approximately 4–8 percentage points and usually exceed 90% on structured sentiment datasets [5,48].
For low-resource languages, the picture is more complicated. Kazakh sentiment studies obtain 65–75% accuracy with classical models [27] and 80–90% accuracy with multilingual fine-tuning [28,40]. The gap of 5–10 percentage points relative to English benchmarks is due to small datasets and inconsistent annotations, not model architecture.
Performance also varies by domain (financial: 78–89% [54]; product reviews: over 90% [48]; political: 70–85% [55]). These differences show the impact of discourse complexity and contextual framing on classification accuracy. Research on political discourse and consumer sentiment also consistently reports that predictions stabilize when aggregated across large document pools [51,55]. Errors on individual documents therefore matter less than corpus-level trends, and sentiment analysis is more reliable at the level of the corpus than of a single document.

8.2. Historical and Multilingual Applications

Empirical evaluation becomes more challenging for historical texts. Table 3 contrasts the key linguistic features of historical corpora with those of modern tonality datasets.
Historical sentiment analysis cannot rely purely on methods developed for current datasets: the mismatch is too great. Diachronic models are required, and one promising approach is to fine-tune multilingual transformer models on hybrid corpora containing modern and historical texts. In principle, such training regimes can help models learn semantic correspondences across linguistic periods and show how evaluative language evolves over time. Recent historical NLP work demonstrates that combining digitized manuscripts with modern corpora opens up computational exploration of semantic change and the development of rhetorical structure [47,52], but application to Chagatai literature is in its infancy.
Although sentiment analysis is applied in many domains, this review focuses primarily on the multilingual and historical contexts of Turkic languages.
Sentiment analysis is particularly consequential in multilingual contexts, in which texts contain multiple languages or dialects. For Kazakh and related Turkic languages, it is applied to social media, online reviews, and digital communications, where people often switch languages. Transformers are best suited to such tasks because of their ability to transfer knowledge from high-resource languages.
Digitized manuscripts such as Chagatai texts offer an opportunity to extend tonality analysis beyond contemporary discourse, allowing researchers to examine how evaluative language appears in historical, literary, and cultural texts. Applying computational methods to Chagatai manuscripts can help reveal how evaluative patterns shift over centuries.

8.3. Pilot Experiment: Zero-Shot Tonality Analysis of Chagatai Texts

A small-scale pilot experiment was conducted to empirically assess the feasibility of zero-shot tonality analysis on historical Turkic material. The corpus was drawn from digital fragments of the Chagatai chronicle Shajara-i Turk, composed by Abilgazy Bahadur Khan in the seventeenth century. This work is a major narrative source of the late Chagatai literary tradition and provides linguistically authentic material for exploratory computational work.

8.3.1. Data and Preprocessing

Arabic-script texts were converted to a reduced Latin representation using a rule-based normalization procedure. The transliteration rules focused on common ligatures, consonant digraphs, and Arabic-script vowel diacritics. Sentence-level segmentation used punctuation and clause-final particles such as turur, tib, and andoq. Segments with fewer than five tokens were removed to reduce noise and ensure sufficient context. Following segmentation and filtering, a total of 250 sentences were chosen at random for assessment. Representative instances with original Arabic-script input and normalized Latin transliterations are shown in Table 4.
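The segmentation, filtering, and sampling steps described above can be sketched as follows. This is not the code from the project repository: the punctuation set, the decision to close a segment immediately after a clause-final particle, and the seed value are assumptions made for illustration, and the input is assumed to be already transliterated.

```python
import random
import re

# Clause-final markers named in the text.
BOUNDARY_PARTICLES = {"turur", "tib", "andoq"}


def segment(text, min_tokens=5):
    """Split on punctuation and after clause-final particles.

    Segments with fewer than `min_tokens` tokens are dropped.
    """
    segments = []
    for chunk in re.split(r"[.!?;]+", text):
        current = []
        for tok in chunk.split():
            current.append(tok)
            if tok.lower() in BOUNDARY_PARTICLES:
                segments.append(current)
                current = []
        if current:
            segments.append(current)
    return [" ".join(s) for s in segments if len(s) >= min_tokens]


def sample(segments, k, seed=42):
    """Draw a reproducible random sample of at most k segments."""
    rng = random.Random(seed)
    return rng.sample(segments, min(k, len(segments)))
```

With placeholder tokens, `segment("w1 w2 w3 w4 turur w5 w6 tib w7 w8 w9 w10 w11 w12. w13 w14")` keeps two segments and discards the two that fall below the five-token minimum.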

8.3.2. Models and Experimental Design

Tonality classification was conducted in a zero-shot setting without fine-tuning or external API calls. The pilot used three independent signals: a zero-shot natural language inference model, a multilingual sentiment classifier, and a small Turkic/Chagatai evaluative lexicon.
Model A was MoritzLaurer/mDeBERTa-v3-base-mnli-xnli. It was used through a zero-shot classification pipeline. Each sentence was evaluated against three candidate labels: positive tone, neutral tone, and negative tone. The model returned a probability-like score for each candidate label, and the highest-scoring label was taken as the NLI prediction.
Model B was cardiffnlp/twitter-xlm-roberta-base-sentiment. It was used as a multilingual sentiment classifier. Its output labels were normalized into three categories: positive, neutral, and negative. This model provided a second neural signal based on multilingual sentiment classification.
The third signal was a small Turkic/Chagatai evaluative lexicon. The lexicon contained positive and negative evaluative words drawn from Turkic and Chagatai-related forms. For each sentence, the lexicon module counted positive and negative matches and assigned a positive, neutral, or negative label with a confidence score. If no evaluative words were matched, the sentence was treated as neutral with low confidence.
Before inference, the Arabic-script input was transliterated into a reduced Latin representation using a rule-based character map. Sentence segmentation was first attempted with punctuation and common historical Turkic boundary markers, including forms corresponding to turur, tib, and andoq. If too few segments were detected, the pipeline used punctuation and newline-based splitting as a fallback. Segments shorter than five tokens were removed. The final pilot sample was selected randomly with a fixed random seed to make the sampling procedure reproducible.
The three signals were combined through weighted soft voting. For each sentence, the final score for each label was calculated as a weighted sum of the NLI score, the XLM-R score, and the lexicon score. The weights were set to 0.50 for the NLI model, 0.30 for XLM-R, and 0.20 for the lexicon signal. The final ensemble label was assigned to the category with the highest weighted score. This design gives the largest role to the NLI model, while still retaining a multilingual sentiment signal and an interpretable lexical anchor.
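The weighted soft voting can be written compactly. The weights below are those stated in the text; the way the lexicon module converts match counts into per-label scores (including the low-confidence value of 0.4 for unmatched sentences) is an assumption of this sketch, since the text does not specify it, and the function names are hypothetical.

```python
LABELS = ("positive", "neutral", "negative")
WEIGHTS = {"nli": 0.50, "xlmr": 0.30, "lexicon": 0.20}  # weights stated in the text


def lexicon_signal(tokens, positive, negative, low_conf=0.4):
    """Turn lexicon match counts into one score per label.

    Assumed scheme: with no matches the sentence is neutral with
    confidence `low_conf`; the remaining mass is split evenly between
    the other two labels.
    """
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    if pos + neg == 0:
        label, conf = "neutral", low_conf
    elif pos == neg:
        label, conf = "neutral", 0.5
    else:
        label = "positive" if pos > neg else "negative"
        conf = max(pos, neg) / (pos + neg)
    rest = (1.0 - conf) / 2
    return {l: (conf if l == label else rest) for l in LABELS}


def soft_vote(signals, weights=WEIGHTS):
    """Weighted sum of per-label scores; returns (label, combined scores)."""
    combined = {
        label: sum(weights[name] * scores[label] for name, scores in signals.items())
        for label in LABELS
    }
    return max(combined, key=combined.get), combined
```

Because the NLI weight is 0.50, the other two signals can overturn the NLI label only when they agree with each other and the NLI margin is small, which is consistent with the dominance of the NLI component noted in the design description.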
The experimental code is available in an open GitHub repository: https://github.com/Azim-Kassymbayev/Chagatai-zero-shot-tonality.git (accessed on 22 April 2026).

8.3.3. Results

Output distributions for the three component signals and the weighted ensemble appear in Table 5. The NLI model produced a distribution dominated by negative tonality: 66.8% negative, 21.2% positive, and 12.0% neutral. XLM-RoBERTa yielded a markedly different distribution: 51.6% neutral, 48.0% negative, and almost no positive predictions (0.4%). The lexicon-based signal had very limited coverage: 99.6% of sentences were classified as neutral, which reflects the difficulty dictionary-based methods have with transliteration noise, lexical variation, and sparse exact matches. The weighted ensemble produced 62.4% negative, 20.4% neutral, and 17.2% positive, a distribution that remains closest to that of the NLI model.
Confidence statistics reveal a further difference between the two neural components. The mean confidence of the NLI model was 0.510 (SD = 0.115), whereas XLM-RoBERTa had a lower mean of 0.371 (SD = 0.019). The compressed variance of the XLM-RoBERTa scores suggests that the classifier operated near a narrow decision boundary across most inputs, most likely reflecting the mismatch between its training distribution and Chagatai’s linguistic properties. At a threshold of 0.42, 99 of the 250 sentences (39.6%) met the high-confidence criterion.
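The confidence statistics can be computed with a few lines of code. The sketch below assumes that "confidence" means the top label score of each sentence and that the standard deviation is the population form; neither detail is stated in the text, and the function name `confidence_summary` is hypothetical.

```python
from statistics import mean, pstdev


def confidence_summary(top_scores, threshold=0.42):
    """Summarize top-label scores: mean, SD, and high-confidence share.

    ASSUMPTION: population SD (pstdev) is used; the source does not say
    whether the reported SD is the population or the sample form.
    """
    if not top_scores:
        raise ValueError("top_scores must not be empty")
    high = sum(1 for s in top_scores if s >= threshold)
    return {
        "mean": mean(top_scores),
        "sd": pstdev(top_scores),
        "high_count": high,
        "high_share": high / len(top_scores),
    }
```

With the reported counts, 99 high-confidence sentences out of 250 give a share of 99/250 = 0.396, which matches the 39.6% in Table 5.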
Proxy evaluation metrics appear in Table 6. Inter-model agreement between the two neural systems reached 39.2%, indicating substantial divergence in zero-shot tonal interpretation. Self-consistency between the ensemble and the NLI model was much higher at 90.8%, confirming the NLI component’s dominant influence on the final output (macro-F1 = 0.863; weighted F1 = 0.900; precision = 0.911; recall = 0.858). Agreement between XLM-RoBERTa and the ensemble was considerably lower (44.4%; macro-F1 = 0.305). Lexicon-ensemble alignment was the weakest configuration (accuracy = 0.204; macro-F1 = 0.113), further confirming that lexical matching alone is insufficient for historical corpora. These values are proxy agreement measures between model outputs, not classification performance against ground truth, and should not be read as standard benchmark metrics.
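The proxy metrics can be reproduced in principle by treating one set of predictions as a stand-in reference and scoring the other against it. The sketch below is an illustration under assumptions, not the authors' evaluation code: which side serves as the reference, the averaging over all three labels, and the convention of scoring an undefined F1 as 0.0 are all choices made for the example.

```python
# Minimal sketch of agreement-based proxy metrics between two label lists.
LABELS = ("positive", "neutral", "negative")


def agreement_rate(a, b):
    """Share of sentences on which two label sequences coincide."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def macro_f1(reference, predicted, labels=LABELS):
    """Unweighted mean of per-label F1, with `reference` as proxy truth.

    ASSUMPTION: a label with no true positives, false positives, or
    false negatives contributes an F1 of 0.0.
    """
    scores = []
    for label in labels:
        pairs = list(zip(reference, predicted))
        tp = sum(1 for r, p in pairs if r == label and p == label)
        fp = sum(1 for r, p in pairs if r != label and p == label)
        fn = sum(1 for r, p in pairs if r == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Agreement is symmetric in its two arguments, while per-label precision and recall swap when the reference side changes, so the direction of each comparison matters when reading precision and recall values.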

8.3.4. Interpretation and Limitations

The proxy metrics presented here are not classification accuracy against ground truth, but they still provide useful information about model stability. The NLI model, the XLM-R sentiment model, and the lexicon signal produced different label distributions, which shows that zero-shot tonality analysis of Chagatai remains unstable. This is expected because standard pretraining corpora have limited coverage of Chagatai, and the text also involves orthographic inconsistency, historical lexical change, and multilingual lexical influence.
The results therefore should be interpreted as exploratory evidence rather than as validated performance. Inter-model agreement and ensemble consistency show how similarly the selected signals behave on the same transliterated input, but they do not prove that the predicted labels are historically or philologically correct. The frequent prediction of negative tonality may partly reflect the genre of Shajara-i Turk, where conflict, dynastic succession, and political struggle are common themes. However, this interpretation should be treated carefully and requires expert validation.
A central limitation of this pilot experiment is the absence of expert-validated gold-standard labels. The reported agreement, confidence, and proxy evaluation values are therefore not standard supervised evaluation metrics against human annotation. They are agreement-based indicators calculated between model outputs, lexicon outputs, and ensemble predictions. For this reason, the results should be interpreted as feasibility indicators rather than definitive performance scores. Future work should construct an expert-annotated Chagatai subset and evaluate the models using standard metrics against human labels, together with inter-annotator agreement.

9. Challenges and Unresolved Research Issues in Tonality Analysis

There are fundamental challenges in applying tonality analysis to multilingual and historical texts. Transformer-based models have shown impressive performance in high-resource settings but are limited in historically complex ones [1,47]. The difficulties are particularly obvious for languages with unstable orthographic histories.

9.1. Orthographic Variation, Script Change, and Manuscript Noise

The root of the problem is that orthographic norms were not uniform across historical periods. Kazakh, for example, has been written in several scripts: Arabic until the late 1920s, then Latin, then Cyrillic in the Soviet period, with a return to Latin announced in 2017. One lexical item can therefore appear in different forms depending on the period and writing system, which makes tokenization, morphological analysis, and lexicon matching more difficult. Spelling variation is also present in modern data: on social media in particular, users often write in informal transliteration. These challenges are systematically outlined in Table 7.
From a computational perspective, this variation increases vocabulary sparsity and undermines feature-based models. Transformer architectures, with their contextualized embeddings, are more robust but still need extensive training data to generalize across spelling systems.
For computational models trained on modern Cyrillic-based data, Chagatai manuscripts in Arabic script are effectively encountered as a different language. Tokenization errors are common, and many lexical items remain out of vocabulary. Reported results indicate that sentiment models trained on modern Kazakh may lose more than 25 percentage points of accuracy when applied directly to historical Turkic texts [40,47]. Without script normalization or cross-script representation learning, NLP systems cannot reliably analyze historical material.
Many Chagatai texts survive only in handwritten form. Whether they are digitized through optical character recognition or through manual transcription, the noise level is very high: inconsistent character recognition, damaged or illegible passages, scribal ambiguities, abbreviations, poetic devices, and archaic grammatical constructions. These factors make tonality analysis challenging, as historical texts often rely on metaphor and allegory instead of clearly positive or negative words.
From a machine learning perspective, these corpora are a very distinctive testing environment. Models that continue to work in the face of spelling changes, language mixing, and transcription errors are likely to be robust in both historical and contemporary multilingual settings.

9.2. Implicit Evaluation, Dataset Limitations, and Annotation Challenges

Implicit sentiment and pragmatic interpretation are difficult even for state-of-the-art modern systems [48]. Chagatai literature drew on Persian poetic conventions in which evaluation is conveyed through metaphorical or symbolic language. Even transformer-based models struggle to interpret contextual information that extends beyond individual sentences.
Table 8 shows the extent of this discrepancy by comparing the essential characteristics of contemporary NLP datasets with historical Chagatai corpora. For Kazakh, sentiment corpora are limited but growing; for Chagatai, they are effectively nonexistent. Constructing annotated corpora for historical literature requires expertise in both computational methods and historical linguistics: annotators must understand historical literary conventions and contexts to label reliably. Historical texts also lack the clear evaluative signals found in contemporary review datasets. Rather than explicit positive or negative assessments, they express their stance in nuanced, indirect ways. Formulating reliable annotation guidelines under these conditions is genuinely difficult, and meaningful evaluation benchmarks for historical tonality analysis remain an open problem.

10. Future Research Agenda for Tonality Analysis in Turkic Historical Natural Language Processing

Progress in computational tonality analysis for Turkic languages will require more than incremental model improvements. It will depend on the systematic integration of linguistic, historical, and computational research programs. Transformer-based methods have shown strong performance on modern digital datasets but remain largely untested on historically rooted textual content.

10.1. Diachronic Corpora and Normalization Pipelines

Constructing diachronically matched corpora that cover the historical phases of Turkic languages is a priority for comparative tonal analysis. The digitized Chagatai collections are a rich source of literature, poetry, and administrative records, but have not been widely used. Some digital humanities projects have begun to open up this material through transliteration and metadata annotation.
A promising approach is to build parallel corpora that pair normalized Chagatai texts with structurally equivalent modern Turkic texts of similar genres. Such datasets would make it possible to examine how evaluative expressions and speech patterns have changed over time. Annotation schemes must be broader than positive and negative labels. To reflect the complexity of historical literary texts, they should include finer-grained categories for tonal intensity and stance.
Modeling orthographic variation and script transitions presents a key technical challenge. Lexical discontinuities created by alphabet reforms limit the ability to analyze diachronic text. New pipelines will need to integrate rule-based transliteration for script mapping, phonological mapping, and subword tokenization in order to handle changes in sound and orthography as well as out-of-vocabulary (OOV) forms. Together, these components address the normalization and segmentation problems.
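The script-mapping stage of such a pipeline can be illustrated with a small rule-based transliterator. The character map below is a deliberately reduced, hypothetical example and not the mapping used in the pilot experiment; the function names are likewise illustrative. Phonological mapping and subword tokenization are not covered by this sketch.

```python
import unicodedata

# ASSUMPTION: a reduced, illustrative Arabic-to-Latin map; a real pipeline
# would need a fuller inventory and context-sensitive vowel rules.
CHAR_MAP = {
    "\u0627": "a",   # alif
    "\u0628": "b",   # ba
    "\u062A": "t",   # ta
    "\u062E": "kh",  # kha
    "\u062F": "d",   # dal
    "\u0631": "r",   # ra
    "\u0633": "s",   # sin
    "\u0634": "sh",  # shin
    "\u063A": "gh",  # ghayn
    "\u0642": "q",   # qaf
    "\u0643": "k",   # kaf
    "\u0644": "l",   # lam
    "\u0645": "m",   # mim
    "\u0646": "n",   # nun
    "\u0648": "v",   # waw
    "\u064A": "y",   # ya
}
TATWEEL = "\u0640"  # elongation stroke, carries no lexical content


def clean(text):
    """Fold presentation forms (NFKC), drop vowel marks and tatweel."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    )


def transliterate(text):
    """Map each known letter; leave everything else unchanged."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in clean(text))


def unmapped_chars(text):
    """Letters the map does not cover, useful for auditing coverage."""
    return {
        ch for ch in clean(text)
        if ch not in CHAR_MAP and unicodedata.category(ch) == "Lo"
    }
```

Because short vowels are usually unwritten in Arabic script, a map of this kind produces consonant-heavy output, which illustrates why purely rule-based transliteration leaves residual noise for downstream models. Reporting unmapped letters gives a simple coverage check before inference.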

10.2. Multilingual Representation Learning and LLM-Based Interpretation

Turkic languages are genealogically related, which creates opportunities for cross-lingual representation learning. Multilingual transformer representations have shown promise for current low-resource tasks, but their ability to capture diachronic tonal structure has not yet been systematically evaluated. More research is needed on fine-tuning models trained on modern Turkic corpora with small historical datasets [56], and on training that accounts for both synchronic variation (between languages) and diachronic variation (across time periods). This direction is consistent with multilingual NLP studies that share parameters across languages to mitigate data scarcity in morphologically diverse languages.
Exploratory analysis of historical tonality is a natural application for LLMs. Prompt-based inference allows assessment without extensive annotation and gives an early indication of results, which makes LLMs appropriate for pilot studies. Future studies should also move beyond classification toward structured interpretive tasks, such as producing explanations of evaluative stance, identifying rhetorical strategies, and grouping segments by tonal similarity. A direct comparison of encoder transformers and generative LLMs would clarify where contextual reasoning capabilities make a real difference.
Ultimately, the goal should be to build tonality analysis frameworks that reflect the history of languages. Historical texts ought to be approached not as noisy versions of today’s language data but as indispensable resources for modeling diachronic semantic evolution, genre-specific discourse conventions, and multilingual lexical relations in evaluative language. This paradigm would position tonality analysis as both a technical NLP challenge and a methodological bridge between computational modeling and cultural heritage research, enabling the large-scale study of evaluative discourse across centuries of Turkic literary production, which has not been possible until now.

11. Conclusions

In this review, we have traced the development of computational tonality analysis and its application to corpora that differ in language and historical context. The progression from lexicon-based and rule-based systems through transformer architectures and large language models has substantially improved representation learning. These advances have not, however, resolved the challenges of scarce data, complex morphological systems, and linguistic distance.
The Turkic language family exemplifies these challenges. Because data are scarce and orthography has changed over time, researchers have to adapt methods designed for high-resource languages. Multilingual transformers and hybrid models are starting to address data scarcity in modern applications. For historical texts, however, script change, lexical change, and genre-specific rhetorical conventions still limit what NLP tools can do reliably.
The digitization of Turkic manuscripts opens several opportunities: Chagatai corpora provide a demanding test bed for multilingual models and make it possible to track changes in evaluative language. Although the predictions are unstable and sensitive to linguistic variation, our pilot experiment suggests that cross-temporal tonality analysis is feasible in a zero-shot setting. In the absence of labeled data, proxy metrics that assess consistency rather than accuracy must be used.
Modeling tonality is more than a classification problem; it also serves as a link between historical linguistics and NLP. Future work should create historically informed annotation frameworks and diachronically matched datasets, and conduct transfer-learning benchmarks on Turkic languages. This will make it possible to monitor shifts in evaluative stance across specified time periods and genres. Turkic philologists, digital humanities researchers, and computational linguists must continue to collaborate on this effort.

Author Contributions

Conceptualization, D.Y., B.A. (Beibut Amirgaliyev) and B.A. (Beibit Abdikenov); writing—original draft, A.K.; writing—review and editing, B.A. (Beibut Amirgaliyev) and A.K.; supervision, Z.B., D.Y. and B.A. (Beibit Abdikenov); formal analysis, Z.B., A.K. and D.Y.; visualization, Z.B.; methodology, B.A. (Beibut Amirgaliyev); data curation, Z.B.; software, A.K.; validation, D.Y.; project administration, D.Y.; resources, B.A. (Beibut Amirgaliyev); funding acquisition, B.A. (Beibit Abdikenov). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Committee of Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan (Grant No. BR28712621).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The experimental code is available in an open GitHub repository: https://github.com/Azim-Kassymbayev/Chagatai-zero-shot-tonality.git. The digitized and preprocessed Chagatai corpus used in the pilot experiment cannot be publicly redistributed. This restriction is not related to the historical status of the texts themselves, but to the specific digitized and preprocessed version of the corpus, which was provided by a partner institution under a data-use agreement that does not permit open redistribution. Derived aggregate results, model outputs, and methodological details are reported in the manuscript to support transparency and reproducibility.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wankhade, M.; Rao, A.C.S.; Kulkarni, C. A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 2022, 55, 5731–5780.
  2. Fajarini, S.; Kurniawati, J.; Yuliani, F. Social media sentiment analysis as a tool for predicting market trends. Proceeding Int. Conf. Soc. Sci. Humanit. 2025, 2, 899–909.
  3. Yao, Y.; Li, Y.; Liu, Y.; Liu, Y. Social media sentiment and stock market trends. Adv. Econ. Manag. Political Sci. 2025, 215, 91–106.
  4. Taboada, M. Sentiment analysis: An overview. Annu. Rev. Linguist. 2016, 2, 1–9.
  5. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers. arXiv 2018, arXiv:1810.04805.
  6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
  7. Mirzakhalov, J.; Babu, A.; Ataman, D.; Kariev, S.; Tyers, F.; Abduraufov, O.; Hajili, M.; Ivanova, S.; Khaytbaev, A.; Laverghetta, A., Jr.; et al. A large-scale study of machine translation in Turkic languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021.
  8. Veitsman, Y. Recent advancements of Turkic Central Asian language processing. arXiv 2024, arXiv:2407.05006.
  9. Birjali, M.; Kasri, A.; Beni-Hssane, A. A comprehensive survey on sentiment analysis. Knowl.-Based Syst. 2021, 226, 107134.
  10. Kumar, M.; Ali, A. A review of AI for sentiment analysis in social media. Metaheuristic Optim. Rev. 2024, 2, 1–13.
  11. Patil, S.S.; Suryawanshi, V.P.; Patil, S.M.; Girase, S.P.; Bhagat, D.A. Review of sentiment analysis in social media using big data. Int. J. Basic Appl. Sci. 2025, 14, 34–48.
  12. Tutika, L.J. A comprehensive review of sentiment analysis techniques. Int. J. Res. Publ. Rev. 2024, 5, 403–409.
  13. Ehrmann, M.; Romanello, M.; Clematide, S.; Ströbel, P.B.; Barman, R. Language Resources for Historical Newspapers: The Impresso Collection. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020); European Language Resources Association: Paris, France, 2020; pp. 958–968.
  14. Dewangan, L.; Sayeed, Z.A.; Maurya, C. Benchmark creation for aspect-based sentiment analysis in low-resource Odia language and evaluation through fine-tuning of multilingual models. In Proceedings of the 31st International Conference on Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 5863–5869.
  15. Wang, M.; Adel, H.; Lange, L.; Strötgen, J.; Schütze, H. NLNDE at SemEval-2023 Task 12: Adaptive pretraining and source language selection for low-resource multilingual sentiment analysis. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023); Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 488–497.
  16. Mohammed, I.; Prasad, R. Building lexicon-based sentiment analysis model for low-resource languages. MethodsX 2023, 11, 102460.
  17. Opitz, J.; Wein, S.; Schneider, N. Natural Language Processing Relies on Linguistics. Comput. Linguist. 2025, 51, 1009–1018.
  18. Liu, L.; Vlachidis, A.; Crymble, A.; Lee, D.; Humbel, M. Towards Comparable Historical NER: Building a Shared Evaluation Corpus for 18th-Century Historical Texts. In Anthology of Computers and the Humanities; Association for Computers and the Humanities: Austin, TX, USA, 2025; Volume 3, pp. 968–982.
  19. Kincl, T.; Novák, M.; Přibil, J. Improving Sentiment Analysis Performance on Morphologically Rich Languages: Language and Domain Independent Approach. Comput. Speech Lang. 2019, 56, 36–51.
  20. Moeljadi, D. A Grammar of Chaghatay; Brill: Leiden, The Netherlands, 2016.
  21. Koto, F.; Beck, T.; Talat, Z.; Gurevych, I.; Baldwin, T. Zero-shot sentiment analysis in low-resource languages using a multilingual sentiment lexicon. In Proceedings of the International Conference on Computational Linguistics (COLING); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024.
  22. Sabharwal, N.; Agrawal, A. Introduction to NLP. In Hands-on Question Answering Systems with BERT; Springer: Apress Berkeley, CA, USA, 2021.
  23. Hilpert, M.; Gries, S.T. Quantitative approaches to diachronic corpus linguistics. Lang. Linguist. Compass 2016, 10, 474–488.
  24. OpenITI. OpenITI: A Machine-Readable Corpus of Islamicate Texts. 2025. Available online: https://github.com/OpenITI (accessed on 22 April 2026).
  25. Bibliothèque Universitaire des Langues et Civilisations. BULAC Catalogues and Digital Collections. 2016. Available online: https://www.bulac.fr (accessed on 22 April 2026).
  26. Schreibman, S.; Siemens, R.; Unsworth, J. A New Companion to Digital Humanities; Wiley-Blackwell: Hoboken, NJ, USA, 2016.
  27. Yergesh, B.; Bekmanova, G.; Sharipbay, A. Sentiment analysis of Kazakh text and polarity. Web Intell. 2019, 17, 9–15.
  28. Yeshpanov, R.; Varol, H.A. KazSAnDRA dataset. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024); ELRA and ICCL: Paris, France, 2024.
  29. Du, H.; Shi, J.; Myerston, J.; Lu, S.; Zhou, G.; Gao, Y. Role-Guided Annotation and Prototype-Aligned Representation Learning for Historical Literature Sentiment Classification. In Findings of the Association for Computational Linguistics: EMNLP 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 3756–3768.
  30. Gimadim, D. Web-sentiment analysis of public comments for Kazakh-language reviews. In Proceedings of the Student Research Workshop Associated with RANLP 2021; INCOMA Ltd.: Seville, Spain, 2021.
  31. Zümberoğlu, K.B.; Dik, S.Z.; Karadeniz, B.S.; Sahmoud, S. Towards better sentiment analysis in Turkish. Appl. Sci. 2025, 15, 2062.
  32. Demirci, G.M.; Keskin, Ş.R.; Doğan, G. Sentiment analysis in Turkish with deep learning. In 2019 IEEE International Conference on Big Data (Big Data); IEEE: New York, NY, USA, 2019; pp. 2215–2221.
  33. Allanazarova, S. Sentiment analysis of social network comments in Uzbek language. In 2024 9th International Conference on Computer Science and Engineering (UBMK); IEEE: New York, NY, USA, 2024; pp. 1–4.
  34. Niyazmetova, K.E. Sentiment analysis of comments in Uzbek language. Sci. Innov. Int. Sci. J. 2024, 3, 26–30.
  35. Benli, İ.; Sharshembaev, B. Using machine learning algorithms for Kyrgyz sentiment analysis. World J. Adv. Res. Rev. 2024, 23, 554–561.
  36. Keremu, F.; Li, L. Sentiment classification model for Uyghur language texts. In 2024 9th International Symposium on Computer and Information Processing Technology (ISCIPT); IEEE: New York, NY, USA, 2024; pp. 364–367.
  37. Pei, Y.; Chen, S.; Ke, Z.; Silamu, W.; Guo, Q. AB-LaBSE: Uyghur sentiment analysis. Appl. Sci. 2022, 12, 1182.
  38. Choudhary, N.; Singh, R.; Bindlish, I.; Shrivastava, M. Emotions are universal: Learning sentiment-based representations of resource-poor languages using Siamese networks. In Computational Linguistics and Intelligent Text Processing. CICLing 2018; Springer: Cham, Switzerland, 2018.
  39. Barus, S.P. Implementation of naïve Bayes classifier-based machine learning. J. Phys. Conf. Ser. 2021, 1842, 012008.
  40. Akhmedov, S.; Nugumanova, A. Development of a sentiment analysis model in the Kazakh language to analyze reviews. Preprints 2024, 2024051300.
  41. Bolatbek, M.; Sagynay, M.; Mussiraliyeva, S.; Yeltay, Z. Detection of offensive content in the Kazakh language. PeerJ Comput. Sci. 2025, 11, e3027.
  42. Nazyrova, A.; Nasrullayeva, A.; Mukanova, A.; Buribayeva, A.; Yergesh, B. Multilingual thematic modeling: A comparative study of classical and transformational approaches. Int. J. Innov. Res. Sci. Stud. 2025, 8, 2787–2799.
  43. Abid, F.; Alam, M.; Yasir, M.; Li, C. Sentiment analysis through recurrent variants lately on a convolutional neural network of Twitter. Future Gener. Comput. Syst. 2019, 95, 292–308.
  44. Zhang, L.; Wang, S.; Liu, B. Deep learning for sentiment analysis. WIREs Data Min. Knowl. Discov. 2018, 8, e1253.
  45. Belinkov, Y.; Glass, J. Analysis Methods in Neural Language Processing: A Survey. Trans. Assoc. Comput. Linguist. 2019, 7, 49–72.
  46. Das, S.; Tariq, A.; Santos, T.; Kantareddy, S.S.; Banerjee, I. Recurrent Neural Networks (RNNs): Architectures, Training Tricks, and Introduction to Influential Research. In Machine Learning for Brain Disorders; Human: New York, NY, USA, 2023.
  47. Bansal, M.; Gupta, A.; Saxena, S. Leveraging multilingual transformers for sentiment analysis in low-resource languages. Nat. Lang. Eng. 2023, 29, 1125–1144.
  48. Ghatora, P.S.; Hosseini, S.E.; Pervez, S.; Iqbal, M.J.; Shaukat, N. Sentiment analysis of product reviews using machine learning and pre-trained LLM. Big Data Cogn. Comput. 2024, 8, 199.
  49. Mutanov, G.; Karyukin, V.; Mamykova, Z. Multi-Class Sentiment Analysis of Social Media Data with Machine Learning Algorithms. Comput. Mater. Contin. 2021, 69, 913–930.
  50. Wadawadagi, R.; Pagi, V. Sentiment analysis on social media. In Research Anthology on Implementing Sentiment Analysis Across Multiple Disciplines; IGI Global Scientific Publishing: Hershey, PA, USA, 2022.
  51. Karim, S.M.; Rasul, R.A.; Sultana, T. Sentiment analysis of social media data. arXiv 2025, arXiv:2510.19656.
  52. Ruder, S.; Peters, M.E.; Swayamdipta, S.; Wolf, T. Transfer Learning in Natural Language Processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 15–18.
  53. Singla, A.; Alhussan, M. Advances in sentiment analysis. IEEE Access 2024, 12, 193115–193130.
  54. Qahtani, N.A.; Vedantham, A.K. Sentiment Analysis of Financial News and Social Media for Enhanced Stock Price Prediction Using AI. ResearchGate. 2025. Available online: https://www.researchgate.net/publication/396011190_Sentiment_Analysis_of_Financial_News_and_Social_Media_for_Enhanced_Stock_Price_Prediction_Using_AI (accessed on 22 April 2026).
  55. Sandoval-Almazan, R.; Valle-Cruz, D. Sentiment Analysis of Facebook Users Reacting to Political Campaign Posts. Digit. Gov. Res. Pract. 2020, 1, 18.
  56. Toleu, A.; Tolegen, G.; Ualiyeva, I. Fine-tuning large language models for Kazakh text simplification. Appl. Sci. 2025, 15, 8344.
Figure 1. PRISMA-inspired flow of literature identification, screening, eligibility assessment, and inclusion. The figure summarizes the structured review procedure used to assemble the final body of literature discussed in this study. Source: authors’ contribution.
Table 1. Representative tonality datasets in modern Turkic languages.
| Language | Domain | Dataset Size | Reported Best Performance |
| --- | --- | --- | --- |
| Turkish | Social media, product and review corpora (balanced polarity datasets) | 15,853 sentences (augmented dataset) | Strong performance with transformer models (e.g., TurkishBERT, XLM-R) |
| Uzbek | Restaurant and application review datasets | ∼8000 annotated comments | ≈91% accuracy (LogReg + TF-IDF features) |
| Kazakh | Product and service review corpora | 180,000+ texts | >90% accuracy (transformer fine-tuning) |
| Kyrgyz | Translated movie reviews and online platform comments | Experimental pilot dataset | 0.83 accuracy/0.84 F1 (Logistic Regression) |
| Uyghur | Hotel review polarity and emotion classification datasets | ∼2000–5000 annotated texts (low-resource corpora) | Performance gains reported when combining multilingual sentence encoders with recurrent classifiers in low-resource settings (LaBSE + BiLSTM) |
Note: The datasets differ in domain, size, label structure, annotation procedure, and evaluation setting. Some results are reported on review corpora, while others are based on social media, translated data, or smaller experimental datasets. Therefore, the reported values are not directly comparable across languages. The table is intended to show general resource availability and methodological trends. Source: compiled by the authors from studies on Turkish, Uzbek, Kazakh, Kyrgyz, and Uyghur sentiment analysis [8,28,31,32,33,35,36,37].
Table 2. Reported performance ranges across model families in tonality analysis studies.
| Model Family | Typical Accuracy | Data Requirement | Typical Domains |
| --- | --- | --- | --- |
| Lexicon/Rule-based | 60–70% | Very low | Reviews, opinion lexicons |
| Traditional ML (SVM, LR) | 65–75% | Moderate | Social media, reviews |
| CNN/RNN models | 80–88% | High | Reviews, news |
| Transformer models | 88–94% | High | Reviews, multilingual corpora |
| Large Language Models | 85–92% | Very high | Zero-shot multilingual |
Note: The reported values show approximate performance ranges from representative tonality and sentiment analysis studies. The underlying studies use different dataset sizes, languages, domains, annotation schemes, label granularity, train–test split strategies, and evaluation metrics such as accuracy, F1-score, or macro-F1. For this reason, the table should not be interpreted as a strict benchmark comparison between model families. It is intended to summarize broad methodological trends. Source: compiled by the authors from the studies cited in this section [5,12,27,28,40,43,48,49].
Table 3. Challenges in tonality analysis for historical Turkic corpora.
| Factor | Modern Kazakh Corpora | Chagatai Historical Corpora |
| --- | --- | --- |
| Orthography | Standardized (Cyrillic/Latin) | Arabic-script manuscripts |
| Lexical composition | Turkic with Russian influence | Turkic with Persian and Arabic borrowings |
| Dataset availability | Moderate | Extremely limited |
| Labeled resources | Emerging datasets | Nearly nonexistent |
| Preprocessing difficulty | Moderate | Very high |
Note: The table summarizes the main differences between modern Kazakh corpora and Chagatai historical corpora for tonality analysis. Source: authors’ synthesis based on the historical NLP and Turkic NLP literature discussed in this section [7,8,20,26].
Table 4. Representative examples from the Chagatai pilot corpus with transliteration, translation, and predicted tonality.
| Original Arabic Script | Latin Transliteration | English Translation | Predicted Tonality |
| --- | --- | --- | --- |
| (image i001) | ibtidasiz va intihasiz va sharik siz yeti asmanni yeti yerni on sekiz ming alamni bol diganda | “When there was neither beginning nor end, and without any partner, He created the seven heavens, the seven earths, and the eighteen thousand worlds.” | Neutral |
| (image i002) | khuday ta’ala adam ni yaratghan ning zikri | “Account of the moment when God the Exalted created Adam.” | Neutral |
| (image i003) | dostlarning kulganin korub dushmanlarning yighlaghanin korub | “He saw the laughter of friends and the weeping of enemies.” | Negative |
| (image i004) | oghuz khan ning dunyagha kelgani ning zikri | “Account of the coming of Oghuz Khan into the world.” | Neutral |
Note: Tonality labels correspond to the weighted ensemble prediction (NLI = 0.50, XLM-R = 0.30, lexicon = 0.20). Because no gold-standard Chagatai sentiment annotations exist, the labels represent exploratory model outputs rather than validated emotional polarity categories. Source: authors’ pilot corpus and model outputs.
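The weighted combination named in the note can be illustrated with a minimal sketch. Only the three weights come from the note. The rest is an assumption made for illustration: each signal is assumed to return a score distribution over the three labels, the ensemble is assumed to take the weighted sum and pick the highest-scoring label, and the names `ensemble_predict` and `example` are hypothetical. The authors' actual combination rule may differ.

```python
# Illustrative sketch only: the weights are taken from the table note,
# the combination rule (weighted sum + argmax) is an assumption.
LABELS = ("positive", "neutral", "negative")
WEIGHTS = {"nli": 0.50, "xlmr": 0.30, "lexicon": 0.20}


def ensemble_predict(signals):
    """Combine per-signal label scores into one label.

    signals maps a signal name ("nli", "xlmr", "lexicon") to a dict of
    label -> score; missing labels count as 0.0.
    """
    combined = {label: 0.0 for label in LABELS}
    for name, weight in WEIGHTS.items():
        for label in LABELS:
            combined[label] += weight * signals[name].get(label, 0.0)
    # Pick the label with the largest weighted score.
    best = max(LABELS, key=lambda label: combined[label])
    return best, combined


# Hypothetical scores for one sentence (not taken from the pilot corpus).
example = {
    "nli": {"positive": 0.10, "neutral": 0.20, "negative": 0.70},
    "xlmr": {"positive": 0.30, "neutral": 0.36, "negative": 0.34},
    "lexicon": {"neutral": 1.0},
}
label, combined = ensemble_predict(example)
print(label, combined)
```

Under this assumed rule the NLI signal carries half of the total weight, so a confident NLI score can outvote the other two signals even when both of them lean neutral, as in the hypothetical input above.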
Table 5. Distribution of predicted tonality labels and confidence statistics in the Chagatai pilot experiment.
| Model | Label | Count | % |
|---|---|---|---|
| Model A (NLI) | Positive | 53 | 21.2 |
| Model A (NLI) | Neutral | 30 | 12.0 |
| Model A (NLI) | Negative | 167 | 66.8 |
| Model B (XLM-R) | Positive | 1 | 0.4 |
| Model B (XLM-R) | Neutral | 129 | 51.6 |
| Model B (XLM-R) | Negative | 120 | 48.0 |
| Lexicon | Positive | 1 | 0.4 |
| Lexicon | Neutral | 249 | 99.6 |
| Lexicon | Negative | 0 | 0.0 |
| Ensemble | Positive | 43 | 17.2 |
| Ensemble | Neutral | 51 | 20.4 |
| Ensemble | Negative | 156 | 62.4 |

| Confidence Statistic | Value |
|---|---|
| Mean NLI confidence | 0.510 |
| NLI confidence SD | 0.115 |
| Mean XLM-R confidence | 0.371 |
| XLM-R confidence SD | 0.019 |
| High-confidence samples (≥0.42) | 99 |
| High-confidence share | 39.6% |
Note: Label distributions are reported independently for each signal and for the weighted ensemble (0.50 NLI, 0.30 XLM-R, 0.20 lexicon). Because no gold-standard Chagatai tonality annotations currently exist, these distributions should be interpreted as exploratory model outputs rather than definitive class frequencies. Source: authors’ pilot experiment.
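As a worked check of the arithmetic in Table 5, the short sketch below recomputes the percentages from the reported counts. The counts and the 99 high-confidence samples are copied from the table. The total of 250 samples is not stated in the note and is obtained here simply by summing each signal's counts, and the helper name `label_shares` is hypothetical.

```python
# Counts copied from Table 5; everything else is plain arithmetic.
COUNTS = {
    "NLI": {"Positive": 53, "Neutral": 30, "Negative": 167},
    "XLM-R": {"Positive": 1, "Neutral": 129, "Negative": 120},
    "Lexicon": {"Positive": 1, "Neutral": 249, "Negative": 0},
    "Ensemble": {"Positive": 43, "Neutral": 51, "Negative": 156},
}
HIGH_CONFIDENCE_SAMPLES = 99


def label_shares(counts):
    """Return each label's share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Every signal labels the same number of samples.
totals = {name: sum(c.values()) for name, c in COUNTS.items()}
shares = {name: label_shares(c) for name, c in COUNTS.items()}
high_conf_share = round(100 * HIGH_CONFIDENCE_SAMPLES / totals["Ensemble"], 1)

print(totals)
print(shares["Ensemble"])
print(high_conf_share)
```

Each signal's counts sum to the same total, and the recomputed shares match the percentages reported in the table, including the high-confidence share.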
Table 6. Proxy evaluation metrics for the zero-shot Chagatai pilot experiment.
| Comparison | Acc. | Macro-F1 | W-F1 |
|---|---|---|---|
| XLM-R vs. NLI (inter-model agreement) | 0.392 | 0.254 | 0.412 |
| NLI vs. Ensemble (self-consistency) | 0.908 | 0.863 | 0.900 |
| XLM-R vs. Ensemble (self-consistency) | 0.444 | 0.305 | 0.437 |
| Lexicon vs. Ensemble (alignment) | 0.204 | 0.113 | 0.069 |
Note: In the absence of manually annotated Chagatai gold-standard data, the reported values represent proxy measures based on inter-model agreement and ensemble stability. They indicate relative consistency of predictions under zero-shot historical conditions and are not directly comparable with supervised benchmark results. Source: authors’ pilot experiment.
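Proxy scores of this kind can be computed by treating one set of predictions as a stand-in reference for another. The sketch below is a minimal pure-Python illustration, not the authors' evaluation code. It assumes a fixed three-label set, assumes that the first argument plays the role of the reference, and uses a hypothetical function name `agreement_metrics` and toy label sequences.

```python
from collections import Counter

LABELS = ("positive", "neutral", "negative")


def f1_for_label(reference, predicted, label):
    """F1 for one label, with `reference` standing in for ground truth."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == label and p == label for r, p in pairs)
    fp = sum(r != label and p == label for r, p in pairs)
    fn = sum(r == label and p != label for r, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def agreement_metrics(reference, predicted):
    """Accuracy, macro-F1, and support-weighted F1 between two label lists."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("label lists must be non-empty and equally long")
    n = len(reference)
    f1 = {label: f1_for_label(reference, predicted, label) for label in LABELS}
    support = Counter(reference)  # weights come from the reference side
    return {
        "accuracy": sum(r == p for r, p in zip(reference, predicted)) / n,
        "macro_f1": sum(f1.values()) / len(LABELS),
        "weighted_f1": sum(f1[l] * support[l] for l in LABELS) / n,
    }


# Toy predictions from two hypothetical signals.
a = ["negative", "negative", "neutral", "positive", "negative", "neutral"]
b = ["negative", "neutral", "neutral", "negative", "negative", "neutral"]
print(agreement_metrics(a, b))
print(agreement_metrics(b, a))
```

Accuracy and macro-F1 do not change when the two lists are swapped, because per-label F1 is symmetric in precision and recall. Weighted F1 does change, since its weights are the label counts of whichever list is treated as the reference.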
Table 7. Key challenges in historical NLP for Turkic languages.
| Challenge | Description | Impact on Tonality Analysis |
|---|---|---|
| Orthographic variation | Historical texts often lack standardized spelling conventions. The same lexical item may appear in several written variants depending on period, region, or scribal practice. | Tokenization becomes unstable and vocabulary sparsity increases, which reduces model reliability. |
| Alphabet shifts | Kazakh and related Turkic languages have passed through Arabic, Latin, and Cyrillic writing systems. Chagatai manuscripts are primarily in Arabic script. | Cross-script matching is difficult and modern NLP pipelines cannot be transferred directly without normalization. |
| Historical noise | Digitized manuscripts may contain OCR errors, damaged characters, missing segments, and transcription inconsistencies. | Preprocessing becomes more difficult and classification accuracy declines because models are exposed to noisy input. |
| Multilingual lexical influence | Chagatai texts frequently combine Turkic grammatical structure with Persian and Arabic vocabulary. | Semantic interpretation becomes harder because evaluative language may be distributed across multiple linguistic traditions. |
| Implicit evaluative language | Historical literary and political texts often express approval or criticism through metaphor, rhetoric, or symbolic forms rather than explicit sentiment words. | Modern classifiers struggle to detect tonal orientation when evaluation is indirect. |
Note: The table summarizes recurring linguistic and technical challenges that affect tonality analysis in historical Turkic texts. Source: authors’ synthesis based on the historical NLP and low-resource Turkic NLP literature [7,8,26,47].
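To make the orthographic-variation row more concrete, the sketch below shows one common style of Arabic-script normalization before tokenization. It is an illustrative assumption, not the preprocessing used in the pilot experiment: the character mapping (Arabic kaf and yeh variants folded to their Persian-style forms), the removal of vowel marks and tatweel, and the function name `normalize_arabic_script` are all choices made here for demonstration. A real Chagatai pipeline would need a mapping validated by specialists.

```python
import unicodedata

# Hypothetical variant folding; the choice of target forms is an assumption.
CHAR_MAP = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
}
TATWEEL = "\u0640"  # elongation stroke, carries no lexical information


def normalize_arabic_script(text):
    """Fold common spelling variants so equal words share one written form."""
    text = unicodedata.normalize("NFC", text)
    kept = []
    for ch in text:
        # Drop tatweel and combining marks (vowel signs such as fatha).
        if ch == TATWEEL or unicodedata.combining(ch):
            continue
        kept.append(CHAR_MAP.get(ch, ch))
    # Collapse runs of whitespace left by damaged or uneven transcriptions.
    return " ".join("".join(kept).split())


variant_a = "\u0643\u064E\u064A\u0640"  # kaf + fatha + yeh + tatweel
variant_b = "\u06A9\u06CC"              # keheh + farsi yeh
print(normalize_arabic_script(variant_a) == normalize_arabic_script(variant_b))
```

Folding variants in this way reduces vocabulary sparsity at the cost of discarding distinctions, such as vowel marks, that may occasionally matter for interpretation.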
Table 8. Comparison of modern and historical datasets for tonality analysis.
| Property | Modern NLP Datasets | Historical Corpora (Chagatai) | Implication for NLP |
|---|---|---|---|
| Typical corpus size | 100,000–1,000,000 documents | Very limited; fragmented historical texts | Historical corpora usually provide much less training data for supervised models. |
| Orthographic consistency | High; mostly standardized spelling | Low; multiple spelling variants | Normalization and matching require additional preprocessing effort. |
| Language structure | Mostly monolingual or modern code-switching | Mixed Turkic, Persian, and Arabic elements | Models must handle historical multilingualism and lexical borrowing. |
| Annotation availability | Large labeled datasets are increasingly available | Very limited labeled corpora | Supervised training is much harder for historical text analysis. |
| Typical model performance | High performance on benchmark datasets, often exceeding 0.85 accuracy | Evaluation relies on proxy measures because gold-standard labels are lacking | Direct comparison using standard metrics is not feasible for historical corpora. |
Note: The comparison highlights broad differences between modern NLP datasets and Chagatai historical corpora rather than directly comparable benchmark conditions. Source: authors’ synthesis based on the dataset and the historical NLP literature discussed in this section [7,8,26,28].
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Baishemirov, Z.; Kassymbayev, A.; Yedilkhan, D.; Amirgaliyev, B.; Abdikenov, B. From Ancient Manuscripts to Modern Social Media: Evolution of Tonality Analysis Methods for Low-Resource Languages. Appl. Sci. 2026, 16, 5650. https://doi.org/10.3390/app16115650