1. Introduction
Text classification, also referred to as text categorization, is the automated process of assigning predefined categories or topics to textual content based on linguistic and statistical features. With the exponential growth of digital information—ranging from news articles and scientific publications to social media posts and product reviews—manual categorization has become both impractical and inefficient. To address this challenge, a variety of automatic methods have been developed to classify and filter text, thereby improving retrieval, search, and indexing efficiency [
1]. Beyond general information management, text classification also plays a central role in domain-specific applications such as educational content analysis, digital libraries, and intelligent tutoring systems.
In resource-rich languages such as English, text classification has been extensively studied using both traditional machine learning methods and modern deep learning techniques. Early approaches typically relied on bag-of-words or Term Frequency–Inverse Document Frequency (TF–IDF) representations in combination with algorithms such as Naïve Bayes, Support Vector Machines (SVMs), and k-Nearest Neighbors (k-NN) [
2,
3]. These methods offered simplicity, interpretability, and relatively low computational cost, making them effective baselines across a variety of domains.
With the advent of deep learning, neural architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) were introduced for document and sentence classification, achieving substantial improvements in accuracy [
4]. The field was further transformed by the emergence of transformer-based models, most notably BERT (Bidirectional Encoder Representations from Transformers) [
5]. BERT and its multilingual variants, including mBERT and XLM-R, have achieved state-of-the-art results on a wide range of natural language processing tasks by leveraging large-scale pre-training on massive text corpora [
6,
7].
While transformer-based models deliver superior performance, they demand substantial computational resources and large annotated datasets for fine-tuning. This limits their applicability in low-resource settings, where such datasets and infrastructure are often unavailable. Consequently, traditional models remain relevant as lightweight, interpretable, and reproducible baselines, particularly in educational and domain-specific applications.
Recent works on Kazakh have explored LSTM- and BERT-based models for named entity recognition [
8], introduced a new sentiment analysis dataset [
9], and developed corpora to support multilingual NLP in low-resource settings [
10]. These examples demonstrate the growing attention to Turkic languages but also underline that systematic text classification research for Uzbek remains scarce.
Recent advances also highlight the potential of lightweight, parameter-free methods. For instance, Jiang et al. [
11] propose a simple compressor-based k-NN classifier, which outperforms BERT on several out-of-distribution datasets, including four low-resource languages. Such approaches demonstrate that effective solutions for underrepresented languages do not necessarily require large pre-trained models but can emerge from resource-efficient alternatives.
Recent work on Turkish has explored novel augmentation strategies to address the scarcity of annotated data. Onan and Balbal [
12] propose an ensemble data augmentation approach that combines task-specific and universal transformations, achieving significant improvements in sentiment classification accuracy.
Uzbek, as a morphologically rich and low-resource Turkic language, still lacks sufficient computational resources to support advanced NLP applications. While recent years have seen progress in basic tasks such as tokenization, morphological analysis, and lexicon construction, systematic studies of higher-level tasks remain limited. In particular, comprehensive evaluations of text classification, especially in educational domains, are still missing, underscoring the importance of establishing reliable baselines.
Madatov and Bekchanov [
13] have proposed a TF–IDF-based summarization model with adaptations to the structure and lexical density of Uzbek texts. They focus on extractive summarization, where the ranking of the most informative sentences is carried out using a normalized sentence-weighting scheme based on the TF–IDF weights of unique Uzbek words. While their objective was summarization, the underlying principle of representing texts through TF–IDF and assigning higher weights to more informative lexical units is closely related to text classification tasks. In both cases, the discriminative power of words is leveraged—either to select the most representative sentences or to distinguish documents by category. This demonstrates that TF–IDF remains a flexible and effective foundation for a range of NLP applications in low-resource languages such as Uzbek, including both summarization and classification.
Madatov et al. [
14,
15] developed a dataset for text classification tasks in Uzbek by extracting and comparing vocabulary from 35 primary school textbooks. The resulting corpus, the Uzbek Primary School Corpus (UPSC), contains graded vocabulary lists compiled with a tailored lemma extraction approach. The study offers a valuable linguistic resource for future NLP tasks such as automatic classification of educational content in low-resource languages like Uzbek.
Another recent resource is an electronic dictionary of Uzbek word endings, intended to support tasks such as morphological analysis and machine translation. Developed with a combinatorial approach, it contains suffixes for different parts of speech. Although it does not address classification directly, it provides important infrastructure for the development of Uzbek NLP [
16].
While most low-resource languages lack linguistic infrastructure for sentiment analysis, a recent study introduced the first annotated corpora for Uzbek polarity classification [
17]. The study combined a manually labeled dataset with an automatically translated corpus and experimented with both traditional machine learning and deep learning models.
To fill this gap, we present a systematic comparative analysis of three computationally lightweight and interpretable machine learning approaches to thematic classification of Uzbek school textbooks. All approaches use Term Frequency–Inverse Document Frequency for text vectorization and continue with either cosine similarity, linear regression, or k-Nearest Neighbors [
18]. Unlike deep learning models, which require large training datasets, Graphics Processing Units, and extensive tuning, these classical methods offer high interpretability and low resource demands, making them especially suitable for real-world use in schools and universities of Uzbekistan. By systematically comparing classification algorithms on actual Uzbek educational text data, this paper provides a practical evaluation point for future Uzbek NLP research. The techniques presented here can not only automate the cataloging of curriculum content but also support intelligent tutoring systems, digital library indexing, and the analysis of student performance via automatic genre or topic identification.
Ultimately, this research demonstrates that combining Uzbek natural language processing with educational needs makes it possible to identify learning materials that match the intellectual abilities of school students in grades 5 to 11. The TF–IDF-based text classification algorithms presented here offer a promising solution not only for educational content selection but also for broader text classification tasks in the Uzbek language.
2. Materials and Methods
2.1. Data Description and Preprocessing
In this study, we compiled and prepared a corpus of Uzbek texts consisting of both school textbooks and external literary sources.
2.1.1. Dataset
A total of 96 official Uzbek-language textbooks for grades 5 through 11 were collected from the “Maktab darsliklari” Android app (https://play.google.com/store/apps/details?id=dev.mobile.books, accessed on 15 September 2025). These textbooks cover diverse subjects including literature, mathematics, physics, chemistry, biology, Uzbek language, history, and geography.
From this collection, we constructed two distinct datasets:
Internal dataset: For each grade, balanced samples were created by selecting 10 excerpts from different textbooks. Each excerpt was limited to approximately 5000 words, resulting in a total of 70 plain-text files and ensuring grade-level balance across all classes.
External dataset: To evaluate model robustness, we additionally included texts from various literary genres (novels, stories, and essays) outside the textbook corpus. These were treated as independent test cases for classification experiments.
2.1.2. Preprocessing
All documents were processed through the following pipeline:
Encoding normalization: All files were converted to UTF-8 encoding.
Lowercasing: Characters were transformed into lowercase to reduce vocabulary sparsity.
Cleaning: Non-textual elements such as headers, footers, and page numbers were removed.
Punctuation removal: Non-alphanumeric symbols were deleted while preserving sentence delimiters.
Tokenization: Text was split into tokens using whitespace and punctuation rules.
Stopword removal: A manually curated list of Uzbek stopwords was applied to eliminate function words.
Vectorization: TF–IDF representations of each document were constructed. For each token $t$ in document $d$, the TF–IDF weight was computed as follows:

$$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t),$$

where

$$\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}, \qquad \mathrm{idf}(t) = \log \frac{N}{n_{t}}.$$

Here, $f_{t,d}$ is the frequency of term $t$ in document $d$; $\sum_{t' \in d} f_{t',d}$ is the total number of terms in $d$; $N$ is the total number of documents; and $n_{t}$ is the number of documents containing $t$.
The final vocabulary size after preprocessing was restricted to the 5000 most informative features, ensuring a compact and comparable representation across documents. Each textbook excerpt was mapped to a TF–IDF vector of fixed dimensionality, yielding labeled instances for supervised learning.
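As an illustration of this vectorization step, a minimal sketch using scikit-learn's TfidfVectorizer is given below; the corpus directory, stopword file, and file-name-based grade labels are hypothetical placeholders rather than the exact pipeline used in this study.

```python
# Minimal sketch of the vectorization step. The directory layout, stopword file,
# and file-name-based label extraction are illustrative assumptions.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

corpus_dir = Path("data/internal")                       # hypothetical location of the excerpt files
stopwords = Path("uz_stopwords.txt").read_text(encoding="utf-8").split()

docs, labels = [], []
for f in sorted(corpus_dir.glob("*.txt")):
    docs.append(f.read_text(encoding="utf-8").lower())   # encoding normalization and lowercasing
    labels.append(int(f.stem.split("_")[0]))             # e.g., "5_adabiyot_01.txt" -> grade 5

vectorizer = TfidfVectorizer(
    max_features=5000,                                   # keep the 5000 most informative features
    stop_words=stopwords,                                # manually curated Uzbek stopword list
    token_pattern=r"(?u)\b\w+\b",                        # whitespace/punctuation-based tokenization
)
X = vectorizer.fit_transform(docs)                       # sparse matrix of shape (n_documents, 5000)
```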
2.1.3. Implementation Details
All experiments were implemented in Python 3.9. We employed scikit-learn for TF–IDF vectorization and classification, NumPy for numerical operations, and matplotlib for visualization of results. Cross-validation was conducted using 5-fold stratified splitting, ensuring balanced class distributions across folds. The accuracy and standard deviation of performance were reported to capture model reliability.
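The following sketch shows one way to set up this evaluation; it assumes the X matrix and labels list from the vectorization sketch above, and the estimator shown is only an example.

```python
# Minimal sketch of 5-fold stratified cross-validation reporting mean accuracy and
# standard deviation, assuming X and labels from the vectorization sketch above.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

y = np.array(labels)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)   # balanced class distribution per fold
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=skf, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```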
This unified dataset and preprocessing workflow ensures consistent input representation for all classification methods described in
Section 2.2,
Section 2.3 and
Section 2.4.
2.2. TF–IDF + Linear Regression Algorithm
Although logistic regression is a standard classifier, we include linear regression as a simple multi-output baseline trained on one-hot labels.
Suppose 96 texts are given, each text is tokenized, a common vocabulary of the texts is constructed, the irrelevant words are removed from each text, and the TF–IDF values of the remaining words are calculated so that the corresponding vectors of the texts are brought to the same dimension:

$$\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{96} \in \mathbb{R}^{m}.$$

Each vector $\mathbf{x}_i$ belongs to one of the classes $y_i \in \{5, 6, \ldots, 11\}$. Let us assume that the TF–IDF vector $\mathbf{x}_{97}$ corresponding to the new 97th text is also adjusted to the same dimension as the vectors $\mathbf{x}_1, \ldots, \mathbf{x}_{96}$. We hypothesize that there is a linear regression relationship between these vectors and their classes $Y$, and we construct the linear regression model

$$Y = XW + \varepsilon,$$

where
$X \in \mathbb{R}^{96 \times m}$ is the matrix of TF–IDF vectors of the 96 texts;
$Y$ represents the classes corresponding to the 96 vectors;
$W$ is the matrix of regression coefficients;
$\varepsilon$ is the error term.
In linear regression, we convert the classes into numerical form; for example, we apply one-hot encoding. Since there are 7 classes, each class label is encoded as a 7-dimensional indicator vector with a single nonzero entry, so that $Y \in \{0,1\}^{96 \times 7}$. Thus, each class will be represented in vector form.
Now, by applying the Least Squares Method, we obtain

$$\hat{W} = \arg\min_{W} \lVert Y - XW \rVert^{2},$$

from which we derive

$$\hat{W} = (X^{\top} X)^{-1} X^{\top} Y.$$

Now, we classify the new document:

$$\hat{\mathbf{y}}_{97} = \mathbf{x}_{97} \hat{W}.$$

As a result, we obtain a 7-dimensional vector (each coordinate corresponds to a class). We select the class corresponding to the largest coordinate; that is,

$$\hat{c}_{97} = \arg\max_{j \in \{1, \ldots, 7\}} \left( \mathbf{x}_{97} \hat{W} \right)_{j}.$$

In this expression, the class corresponding to the column with the largest value is selected.
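A minimal NumPy sketch of this procedure is given below; it assumes a dense TF–IDF matrix X and integer grade labels y from the preprocessing step, and uses the pseudo-inverse in place of the explicit inverse because X has far more columns than rows.

```python
# Minimal sketch of multi-output linear regression on one-hot labels, following the
# least-squares derivation above. X is the dense TF-IDF matrix (e.g., X = tfidf.toarray())
# and y holds the integer grade labels 5..11; np.linalg.pinv stands in for (X^T X)^{-1} X^T.
import numpy as np

def fit_one_hot_regression(X: np.ndarray, y: np.ndarray):
    classes = np.unique(y)                                  # the 7 grade levels
    Y = (y[:, None] == classes[None, :]).astype(float)      # one-hot label matrix, shape (96, 7)
    W = np.linalg.pinv(X) @ Y                               # least-squares coefficient matrix
    return W, classes

def predict_grades(X_new: np.ndarray, W: np.ndarray, classes: np.ndarray):
    scores = X_new @ W                                      # one 7-dimensional score vector per document
    return classes[np.argmax(scores, axis=1)]               # class of the largest coordinate
```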
2.3. TF–IDF + K-Nearest Neighbors Algorithm
We also evaluated the k-Nearest Neighbors (k-NN) classifier as a simple baseline. This classical non-parametric method assigns a test instance to the majority class among its
k nearest training neighbors in the TF–IDF feature space [
19].
Given a new text with TF–IDF vector $\mathbf{q}$, the goal is to assign it to one of the seven grade levels by comparing it to the 96 textbook vectors $\mathbf{x}_1, \ldots, \mathbf{x}_{96}$.
Distance computation:
For each $i = 1, \ldots, 96$, the distance between $\mathbf{q}$ and $\mathbf{x}_i$ is computed. Following standard practice in text mining, we considered both Euclidean and cosine distance metrics:

$$d_{\mathrm{euc}}(\mathbf{q}, \mathbf{x}_i) = \lVert \mathbf{q} - \mathbf{x}_i \rVert_2, \qquad d_{\cos}(\mathbf{q}, \mathbf{x}_i) = 1 - \frac{\mathbf{q} \cdot \mathbf{x}_i}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{x}_i \rVert}.$$
Model selection:
Instead of fixing K heuristically, we performed a grid search over candidate values of K using 5-fold stratified cross-validation on the 96 textbooks. Both Euclidean and cosine distances were tested, together with uniform and distance-based voting schemes. Dimensionality reduction via Truncated SVD (100–300 components) was also included in the search space; a sketch of this search is given at the end of this subsection.
Sorting and neighbor selection:
For a given test vector, all distances are sorted in ascending order, and the top K nearest neighbors are selected.
Classification rule:
The predicted class $\hat{y}$ is obtained by majority voting among the labels of the K neighbors:

$$\hat{y} = \arg\max_{c \in \{5, \ldots, 11\}} \sum_{i \in N_K(\mathbf{q})} \mathbb{1}\left[ y_i = c \right],$$

where $y_i$ is the grade label of $\mathbf{x}_i$ and $N_K(\mathbf{q})$ denotes the index set of the K nearest neighbors. In case of ties, the label of the closest neighbor is selected.
Evaluation:
The final model was trained on all 96 textbook vectors and evaluated on two separate test sets: (i) an internal balanced set of 70 texts (10 per grade) and (ii) an external set of 7 literary texts. We report accuracy, precision, recall, F1-score, and confusion matrices.
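A minimal sketch of the grid search described above follows; it assumes X and y from the preprocessing step, and the candidate values are illustrative rather than the exact grids used here.

```python
# Minimal sketch of the k-NN hyperparameter search with Truncated SVD,
# assuming X (TF-IDF matrix) and y (grade labels) from the preprocessing step.
# The candidate values are illustrative; note that the effective number of SVD
# components is bounded by the number of training documents in each fold.
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("svd", TruncatedSVD(random_state=42)),
    ("knn", KNeighborsClassifier()),
])

param_grid = {
    "svd__n_components": [50, 75],                # illustrative; capped by ~77 training docs per fold
    "knn__n_neighbors": [1, 3, 5, 7],             # illustrative K values
    "knn__metric": ["euclidean", "cosine"],       # Euclidean vs. cosine distance
    "knn__weights": ["uniform", "distance"],      # uniform vs. distance-based voting
}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```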
2.4. TF–IDF + Cosine Similarity Algorithm
In this section, we present a method for classifying Uzbek texts using TF–IDF vectors and the cosine similarity measure.
Given a new document represented by its TF–IDF vector $\mathbf{q}$, and a set of 96 existing documents represented by $\mathbf{x}_1, \ldots, \mathbf{x}_{96}$, the classification process proceeds as follows.
Cosine similarity definition:
For each existing vector $\mathbf{x}_i$, where $i = 1, \ldots, 96$, we compute the cosine similarity between the new vector $\mathbf{q}$ and $\mathbf{x}_i$ as follows:

$$\mathrm{sim}(\mathbf{q}, \mathbf{x}_i) = \frac{\mathbf{q} \cdot \mathbf{x}_i}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{x}_i \rVert} = \frac{\sum_{j=1}^{m} q_j x_{ij}}{\sqrt{\sum_{j=1}^{m} q_j^{2}} \, \sqrt{\sum_{j=1}^{m} x_{ij}^{2}}},$$

where $\mathbf{q} \cdot \mathbf{x}_i$ denotes the dot product of the two vectors and $\lVert \cdot \rVert$ their Euclidean norms.
Similarity interpretation:
Cosine similarity provides a normalized measure of directional alignment between two TF–IDF vectors. A value closer to 1 indicates stronger textual similarity. In our case, this measure is used to compute the similarity between the new document vector $\mathbf{q}$ and each of the existing vectors $\mathbf{x}_i$. The vectors with the highest similarity value, ideally approaching 1, are considered the best match and determine the classification outcome.
Assign class:
Each vector $\mathbf{x}_i$ is labeled with a class $y_i$, where $y_i \in \{5, 6, \ldots, 11\}$, corresponding to school grade levels.
Final classification step:
Let

$$i^{*} = \arg\max_{i \in \{1, \ldots, 96\}} \mathrm{sim}(\mathbf{q}, \mathbf{x}_i)$$

be the index of the most similar vector. The class of the new document is assigned from that of the most similar existing document:

$$\hat{y} = y_{i^{*}}.$$

If multiple vectors attain the same maximum similarity, all of their classes are taken as the final result.
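A minimal sketch of this retrieval-style classifier is shown below; it assumes the TF–IDF matrix X of the 96 reference textbooks, their grade labels, and the fitted vectorizer from the preprocessing sketch.

```python
# Minimal sketch of cosine-similarity classification against the 96 reference textbooks,
# assuming X (reference TF-IDF matrix), labels (grade labels), and the fitted `vectorizer`
# from the preprocessing sketch.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def classify_by_cosine(new_text: str, vectorizer, X, labels):
    q = vectorizer.transform([new_text])             # TF-IDF vector of the new document
    sims = cosine_similarity(q, X).ravel()           # similarity to each reference vector
    best = np.flatnonzero(sims == sims.max())        # indices of the maximum (handles ties)
    return sorted(set(np.asarray(labels)[best]))     # class(es) of the most similar document(s)

# Usage (hypothetical): classify_by_cosine(external_text, vectorizer, X, labels)
```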
3. Results
This section presents the results of our grade-level classification experiments on Uzbek texts. The evaluation was carried out using TF–IDF-based feature representations combined with three classification algorithms: linear regression, k-Nearest Neighbors (KNN), and a 1-NN baseline with cosine similarity.
Beyond overall accuracy, we report precision, recall, and F1-scores together with confusion matrices to provide deeper insight into class-wise performance. Random and majority-class baselines are included for context.
The experiments were conducted on two test sets: (i) an internal balanced corpus of 70 textbook excerpts (10 per grade, drawn from the 96 school textbooks) and (ii) an external corpus of 7 literary texts from diverse genres. Summary tables for each method are provided below.
Table 1 presents the statistics of the school textbooks used in this study. Since listing all 96 textbooks individually would be impractical, they are grouped by grade level, with the number of documents, total token count, and unique vocabulary size provided for each grade.
Table 2 presents a list of seven literary works in the Uzbek language obtained from an open electronic library resource of Uzbekistan—ZiyoUz [
20]. These texts were selected to identify literary materials that align with the intellectual and linguistic capabilities of school students. The chosen works represent a variety of genres, including fantasy, crime fiction, historical chronicles, and classical poetry.
Table 3 presents the predictions of three TF–IDF-based methods on the external corpus of seven literary texts. Unlike the internal dataset of school textbooks, these external texts do not have predefined grade labels. Their role in this study is not to measure accuracy, but rather to explore which grade level each literary work may be most appropriate for.
The classification is performed by comparing each external literary text against the balanced TF–IDF representations of school textbooks (grades 5–11). The grade assigned is the one whose vocabulary and frequency distribution are most similar to the given text. It is important to emphasize that TF–IDF captures only lexical overlap and word frequency patterns, rather than deep semantic meaning. Thus, the mapping indicates how closely the word usage of an external text resembles that of a particular grade-level corpus, which we interpret as an indicator of suitability for students’ intellectual and linguistic capabilities at that grade.
The results show that linear regression (LR) and cosine similarity (CS) provided consistent and interpretable predictions, often aligning with one another. By contrast, the KNN method frequently defaulted to grade 10, revealing its limitations in high-dimensional TF–IDF spaces. These findings support the conclusion that LR and CS are more reliable for recommending extracurricular literary works that match the linguistic and cognitive capacity of students in grades 5–11.
Table 4 shows the structure of the internal balanced dataset. Each grade (5–11) contains 10 balanced excerpts of 5000 tokens each, drawn from different subjects (e.g., literature, sciences, and history). Here, grade 5 is listed in full and two representative files from grade 11 (#69 and #70) are displayed to illustrate dataset boundaries.
As shown in
Table 5, internal test results on 70 balanced files show strong performance (precision = 0.94, recall = 0.93, F1 = 0.93, and accuracy = 92.9%). At the same time, five-fold cross-validation on the same dataset yielded much lower performance (accuracy = 22.9% and macro F1 = 0.19). This contrast demonstrates that although the TF–IDF + linear regression model can successfully fit the balanced test set, its generalization ability across folds remains limited. Therefore, these results should be interpreted with caution regarding the model’s broader applicability.
The TF–IDF + KNN method achieved precision = 0.27, recall = 0.29, F1 = 0.25, and accuracy = 28.6% on the 70 internal files (
Table 6). The confusion matrix shows that KNN predictions are heavily biased toward grade 10, with many texts from grades 6–11 misclassified into this category. For example, grade 6 and grade 7 texts were mostly predicted as grade 10, while grade 11 was confused with both grade 7 and grade 10. Although some grade 5 texts were recognized correctly (8 out of 10), overall performance was weak and inconsistent across classes. This demonstrates that TF–IDF + KNN lacks generalization power in high-dimensional Uzbek text space, and it is not a reliable method for grade-level classification.
Cross-validation analysis. For completeness, we also evaluated KNN using five-fold cross-validation on the balanced internal dataset. The average accuracy was only 11.4%, with macro-averaged precision = 0.11, recall = 0.11, and F1 = 0.11. The confusion matrices across folds show that predictions collapsed into a few classes (mainly grade 10), with almost no correct matches for other grades. This confirms that the weak performance observed on the fixed internal test set is consistent under cross-validation as well, indicating that TF–IDF + KNN is unsuitable for Uzbek grade-level text classification.
The TF–IDF + cosine similarity (1-NN) method achieved precision = 0.92, recall = 0.91, F1 = 0.92, and accuracy = 91.4% on the 70 internal files, showing high reliability on the balanced dataset. Most predictions are correct, with minor confusions between neighboring grades (e.g., 5 vs. 7, 7 vs. 10, 9 vs. 10, and 11 vs. 7), reflecting lexical overlap in adjacent curricula. Cross-validation is not applicable here since the model relies on a fixed set of 96 reference textbooks; removing subsets would reduce the index and distort retrieval. Thus, results are reported in the fixed-reference setting, matching the intended use case.
The internal classification results obtained using the TF–IDF representation combined with cosine similarity are summarized in
Table 7; these results demonstrate that the majority of texts were accurately assigned to their respective grade levels, suggesting the effectiveness of TF–IDF features with cosine similarity for distinguishing among the seven grade categories.
As summarized in
Table 8, we compare three TF–IDF–based methods on the internal balanced dataset.
4. Discussion
In this study, three TF–IDF-based approaches were evaluated for grade-level classification of Uzbek educational texts: TF–IDF + linear regression (LR), TF–IDF + k-NN, and TF–IDF + cosine similarity (CS, 1-NN retrieval).
On the internal balanced corpus (70 files), LR achieved 92.9% accuracy with a macro-averaged precision of 0.94, recall of 0.93, and F1-score of 0.93. CS performed comparably with 91.4% accuracy and a precision of 0.92, recall of 0.91, and F1-score of 0.92. Both methods demonstrated robustness across grades, with only minor confusions between adjacent grade levels (e.g., 5 vs. 7, 7 vs. 10, 9 vs. 10, and 11 vs. 7), which reflect the natural lexical overlaps across curricula. In contrast, k-NN with optimized hyperparameters achieved only 28.6% accuracy, heavily misclassifying many texts into grade 10. This weakness can be explained by the curse of dimensionality in high-dimensional sparse TF–IDF space, where Euclidean distance becomes ineffective.
On the external corpus (seven literary works), the goal was not accuracy but suitability assessment by mapping each text to the most similar grade-level corpus. Here, LR and CS provided consistent and interpretable predictions. For example, Harry Potter and Shaytanat mapped to grade 7, Kuhna dunyo to grades 7–8, and Mantiq-ut-Tayr to grade 11. k-NN, however, frequently defaulted to grade 10, confirming its poor generalization ability.
From a methodological standpoint, cross-validation was performed for LR and k-NN, validating the reliability of results. For CS, cross-validation is not meaningful because its model consists of a fixed set of 96 reference textbooks: removing folds would reduce the index set and artificially lower retrieval quality. Therefore, CS performance is reported only in the fixed-reference evaluation setting, which corresponds to its intended use case.
It is also important to acknowledge the limitations of this work. Uzbek is a low-resource, agglutinative language; hence, no lemmatization, stemming, or morphological normalization was applied. This may have led to feature sparsity due to multiple word forms. Furthermore, only baseline TF–IDF approaches were tested. Future work should incorporate more semantically powerful models, such as word embeddings or transformer-based architectures, to capture deeper linguistic patterns.
Overall, the findings show that LR and CS are reliable baselines for grade-level classification of Uzbek educational texts, while k-NN fails to handle the lexical distribution effectively. This study establishes baseline results for Uzbek educational corpora and provides a methodological foundation for future extensions toward semantically richer models for low-resource language applications.
5. Conclusions
This study established TF–IDF-based classification as a first baseline for Uzbek educational texts across grades 5–11. Among the evaluated methods, linear regression and cosine similarity consistently achieved reliable results, both exceeding 90% accuracy on the balanced internal dataset. In contrast, k-NN showed very weak performance due to the curse of dimensionality in sparse TF–IDF space, confirming its limited applicability.
The findings also demonstrate that simple lexical overlap is sufficient to approximate grade-level differences in Uzbek curricula, although minor confusions appear between adjacent grades. Nevertheless, the approach remains limited by the absence of linguistic preprocessing such as lemmatization and stemming, which are particularly important for an agglutinative, low-resource language like Uzbek.
Future work will expand beyond TF–IDF baselines toward semantically richer representations, including word embeddings, transformer-based architectures, and cross-lingual transfer learning. These methods will not only address current limitations but also enable more robust and interpretable models for educational applications in Uzbek and other low-resource languages.