TF-IDF-Based Classification of Uzbek Educational Texts
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
1. Fundamental Methodological Error:
Section 2.2 is titled "TF-IDF + Linear Regression Algorithm." However, the described method is not Linear Regression. The described procedure of fitting a scalar β_i for each document and minimizing a residual is a form of nearest neighbor search using a projection error metric, not regression. Linear Regression is a supervised learning model that learns a single set of coefficients from training data to predict a target variable. This section misuses fundamental machine learning terminology, invalidating the entire method description and any results associated with it. This error is severe and requires a complete rewrite and re-evaluation of the proposed approach.
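A minimal sketch (an assumed reconstruction of the procedure the section describes, not the authors' code) makes the reviewer's point concrete: fitting a scalar β per document and ranking by residual is just projection, and the residual is a monotone function of cosine similarity, i.e. a similarity search rather than regression.

```python
import math

def fit_scalar_beta(q, d):
    """Least-squares scalar beta minimizing ||q - beta*d||^2: beta = <q,d>/<d,d>."""
    dot_qd = sum(qi * di for qi, di in zip(q, d))
    dot_dd = sum(di * di for di in d)
    return dot_qd / dot_dd

def residual_sq(q, d):
    """Squared residual after projecting q onto the line spanned by d."""
    beta = fit_scalar_beta(q, d)
    return sum((qi - beta * di) ** 2 for qi, di in zip(q, d))

def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    return dot / (math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d)))

# Toy vectors standing in for a query text and one "document point".
q = [1.0, 2.0, 0.5]
d = [0.5, 1.5, 1.0]

# Identity: residual^2 = ||q||^2 * (1 - cos^2(q, d)), so ranking documents by
# smallest residual is the same as ranking by largest |cosine similarity|.
identity_rhs = sum(x * x for x in q) * (1 - cosine(q, d) ** 2)
```

Because of this identity, the procedure selects the exemplar with the highest cosine alignment; no coefficients are learned from training data, which is why calling it Linear Regression is incorrect.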
2. Severely Flawed Experimental Design:
The core experimental design is not valid for evaluating a classification model. The authors use each textbook as a single data point (n=96) and then classify new texts by comparing them to these 96 points. This approach does not constitute a standard train-test split or cross-validation. The model is not "learning" from the data; it is performing a similarity search against a very small set of exemplars. The resulting "accuracy" metrics (82%, 85.7%) are largely meaningless in this context and do not demonstrate the generalization ability of a classification model.
The value of K for KNN (K=9) is chosen arbitrarily using a square root rule of thumb without any justification or ablation study. This is not a rigorous approach.
There is no description of how the dataset was split for training and testing. It appears the entire set of 96 textbook vectors is used as the reference set, which would lead to a significant overestimation of performance if any of the test texts are derived from or are very similar to these textbooks.
3. Incomplete and Superficial Results Analysis:
The abysmal performance of KNN (22%) is merely stated but not analyzed. In high-dimensional spaces (like a 221,036-dimensional TF-IDF vector space), Euclidean distance becomes meaningless—a phenomenon known as the "curse of dimensionality." This is the most likely cause of KNN's failure and should be discussed. Cosine similarity is known to be more effective in such settings, which explains its better performance, but this key point is not mentioned.
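A small, self-contained illustration of the concentration effect mentioned above (synthetic random points, not the paper's data): as dimensionality grows, the Euclidean distances from a query to random points become nearly indistinguishable, so nearest-neighbour ranking degrades.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max_dist - min_dist) / min_dist for random points vs. a random query.

    Low contrast means all neighbours look equally far away.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

low_dim_contrast = relative_contrast(3)      # distances vary widely
high_dim_contrast = relative_contrast(3000)  # distances concentrate
```

In a 221,036-dimensional sparse TF-IDF space the effect is far stronger still, which supports the reviewer's diagnosis of the 22% KNN result and the relative robustness of cosine similarity.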
The analysis lacks any form of error analysis. Which classes are most frequently confused? Why did certain texts (e.g., `choliquishi.txt` classified as grade 5) get misclassified? Without this, the results are merely numbers without insight.
There are no statistical significance tests to support the claims about one method being better than another.
4. Lack of Baseline and Context:
The study lacks a proper naive baseline (e.g., always predicting the majority class). What is the accuracy of a random classifier? (~14% for 7 classes). This context is needed to interpret the 22% KNN accuracy.
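The two naive baselines are trivial to compute; a hedged sketch follows, using an assumed label distribution (the per-grade counts are illustrative placeholders, not the paper's actual data).

```python
from collections import Counter

# Hypothetical distribution of 96 textbook labels over grades 5-11.
labels = [5] * 20 + [6] * 15 + [7] * 14 + [8] * 14 + [9] * 12 + [10] * 11 + [11] * 10
counts = Counter(labels)

random_baseline = 1 / len(counts)                       # uniform guessing over 7 grades
majority_baseline = max(counts.values()) / len(labels)  # always predict the largest class
```

With seven classes, uniform guessing yields about 14% and the majority class about 21% under these illustrative counts; that is the context against which a 22% KNN accuracy should be read.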
The performance of the proposed methods is not compared against any other existing approach, even a simple one, making it impossible to assess their actual contribution.
5. Questionable Novelty and Contribution:
The application of TF-IDF + Logistic Regression/Cosine Similarity is a standard and well-known baseline in text classification. Applying it to a new language, while useful for establishing a baseline, does not in itself constitute a significant novel contribution without a more robust experimental setup and comparison to state-of-the-art or alternative methods suitable for low-resource scenarios (e.g., using cross-lingual embeddings or simple transformer fine-tuning if possible).
Based on the aforementioned concerns, some critical recommendations for revision are presented:
1. Correct the Methodology: Completely rewrite Section 2.2. If the intent was to use Logistic Regression, describe it correctly. If the intent was to use a novel similarity measure, define it clearly and do not call it "Linear Regression."
2. Redesign the Experiments:
Split the data from the 96 textbooks into proper training and test sets at the document level (not treating each whole textbook as a point). Use standard cross-validation.
Implement a true Logistic Regression model that is trained on the training set and evaluated on the test set.
For KNN, perform a proper search for the optimal K value on a validation set.
Treat the Cosine Similarity approach as a 1-Nearest Neighbor model using cosine distance and include it in the comparison fairly.
3. Deepen the Analysis:
Include a confusion matrix for the best-performing model.
Discuss the reasons for the performance of each method (e.g., dimensionality issues for KNN).
Compare results against a simple baseline.
Perform a detailed error analysis on misclassified examples.
4. Clarify the Contribution: Clearly state that this work provides a baseline study for Uzbek text classification. Temper the claims about contribution and novelty accordingly.
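The experimental redesign recommended above can be sketched in a few lines. This is an illustrative, stdlib-only outline on synthetic data (all names and the data generator are assumptions, not the paper's setup): document-level train/validation/test splits, a grid search for K instead of the square-root rule of thumb, and cosine distance for the KNN comparison.

```python
import math
import random

def cosine_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def knn_predict(train, query, k):
    """train: list of (vector, label) pairs; majority vote over the k nearest."""
    nearest = sorted(train, key=lambda vl: cosine_dist(vl[0], query))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

def accuracy(train, eval_set, k):
    hits = sum(knn_predict(train, vec, k) == y for vec, y in eval_set)
    return hits / len(eval_set)

# Synthetic "documents": noisy copies of one prototype vector per class.
rng = random.Random(42)
prototypes = {c: [rng.random() for _ in range(20)] for c in range(3)}

def sample(c):
    return [x + rng.gauss(0, 0.05) for x in prototypes[c]]

data = [(sample(c), c) for c in range(3) for _ in range(30)]
rng.shuffle(data)
train, valid, test = data[:50], data[50:70], data[70:]

# Select K on the validation split; report accuracy once, on the held-out test split.
best_k = max([1, 3, 5, 7, 9], key=lambda k: accuracy(train, valid, k))
test_acc = accuracy(train, test, best_k)
```

The same split discipline applies to the Logistic Regression and cosine-similarity (1-NN) variants, so that all methods are compared on identical held-out data.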
Author Response
Please find the responses to all comments in the attached file.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The paper presents a TF-IDF-based classification approach for Uzbek educational texts, but several issues need to be addressed to improve the quality and clarity of the work.
1 The dataset description is overly simplistic. While the authors mention using textbooks for grades 5 to 11, they fail to provide detailed information about how the dataset is split into training, validation, and test sets. This lack of clarity in dataset partitioning raises concerns about the reliability and generalizability of the results. Additionally, class imbalance across grade levels is not discussed, which could lead to biased performance metrics.
2 The evaluation of the model relies solely on accuracy, which is not sufficient. Other important metrics like recall, precision, and F1-score should also be included, as these provide a more comprehensive understanding of the model’s performance, especially for imbalanced datasets.
3 The poor performance of the k-NN model, which achieved only 22% accuracy, is a significant issue. This result is likely due to the high-dimensional sparsity of the TF-IDF features and the use of Euclidean distance, which may not be suitable for sparse data. The authors should consider using alternative distance measures like cosine similarity, which is better suited for high-dimensional, sparse feature spaces.
4 With 220,000 features, the risk of high-dimensional sparsity is a critical problem. The authors should consider dimensionality reduction techniques, such as PCA or t-SNE, to reduce the feature space and mitigate the sparsity issue, which would likely improve the performance of all models.
5 There are several issues related to terminology and clarity. In Section 2.2, the title "TF-IDF + Linear Regression" should be corrected to "TF-IDF + Logistic Regression", as linear regression is not appropriate for classification tasks. Additionally, the abstract mentions that "Cosine Similarity performed slightly better at 85.7%" but fails to explain why an unsupervised method outperforms supervised models. This logical gap needs to be addressed, possibly by explaining the representativeness of the grade-level corpus.
6 There is inconsistency in the use of tenses, with the abstract using past tense (“achieved”) and the method section using present tense (“we compute”). The paper should maintain consistency, preferably using past tense throughout. Furthermore, there is redundancy in Section 2.1.2, where the TF-IDF formula is repeated, which should be removed to avoid unnecessary repetition.
7 Feature engineering is another area that requires improvement. The paper does not account for the specific characteristics of the Uzbek language, such as its agglutinative nature, which necessitates stemming (e.g., merging forms like "kitob" and "kitobim"). Additionally, stop words, such as high-frequency words like "va" (and) and "bu" (this), should be filtered out to improve the discriminative power of the TF-IDF vectors.
8 The authors fail to compare their models against baseline classifiers like Naive Bayes, which is a standard text classification model. Including such a comparison would provide a clearer context for evaluating the proposed approach.
9 Furthermore, the paper does not present a confusion matrix, which could help identify which grade levels are most prone to misclassification.
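The metrics requested in comments 2 and 9 can be derived from a single confusion matrix. A self-contained sketch follows (the grade labels below are illustrative placeholders, not the paper's results).

```python
def confusion_matrix(y_true, y_pred, classes):
    """cm[t][p] = number of items with true class t predicted as p."""
    cm = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def per_class_metrics(cm, classes):
    report = {}
    for c in classes:
        tp = cm[c][c]
        fn = sum(cm[c][p] for p in classes) - tp  # row total minus diagonal
        fp = sum(cm[t][c] for t in classes) - tp  # column total minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# Hypothetical predictions for grades 5-7, for illustration only.
y_true = [5, 5, 5, 6, 6, 6, 7, 7, 7]
y_pred = [5, 5, 6, 6, 6, 7, 7, 7, 7]
classes = [5, 6, 7]
cm = confusion_matrix(y_true, y_pred, classes)
report = per_class_metrics(cm, classes)
```

The off-diagonal cells of `cm` directly answer comment 9: they show which grade levels are most often confused with which.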
Author Response
Please find the responses to all comments in the attached file.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Please refer to the attachment.
Comments for author File: Comments.pdf
Author Response
Please find the responses to all comments in the attached file.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
The article has limitations in terms of originality because the use of TF-IDF with traditional classifiers and cosine similarity is well known in the NLP literature. No significant algorithmic innovation is proposed, but rather an application to the Uzbek language.
Although it could be valuable for this type of language, the article does not demonstrate how it surpasses existing approaches or why these methods are preferable to modern alternatives.
For this same reason, there is a lack of comparison with modern methods such as Transformer-based models (BERT, etc.). This could weaken the justification for the chosen approach.
The article should briefly explain or draw an analogy between the Uzbek education system and international education systems to improve overall understanding.
In the methodology, there is a lack of technical details because the software or libraries used are not specified. In the training of the models, there is also no reference to the hyperparameters used or configured in each algorithm. This may affect reproducibility.
In the preprocessing section, it is important to mention that although the basic pipeline is described, no details are given on how stopwords, tokenization, or stemming were handled in Uzbek. A detailed technical description is recommended.
Regarding the dataset, although the number of tokens and unique words per grade is provided, the thematic or generic distribution of the texts is not analyzed. Therefore, there could be an imbalance between the classes. Contextual metadata is also lacking, as the texts are not statistically characterized, they do not provide relevant information such as average length, lexical diversity, predominant themes, etc. This makes it difficult to evaluate the representativeness of the corpus.
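The contextual statistics asked for here (average length, lexical diversity) are straightforward to compute; a minimal sketch follows, where the two tiny "texts" are placeholder Uzbek snippets rather than the actual corpus.

```python
def corpus_stats(texts):
    """Average token length and type-token ratio for a list of texts."""
    token_lists = [t.split() for t in texts]
    n_tokens = sum(len(toks) for toks in token_lists)
    avg_len = n_tokens / len(texts)
    vocab = {tok for toks in token_lists for tok in toks}
    ttr = len(vocab) / n_tokens  # type-token ratio as a lexical-diversity proxy
    return {"avg_length": avg_len, "type_token_ratio": ttr}

grade5_texts = ["kitob va daftar", "bu kitob yangi kitob"]  # placeholders
stats = corpus_stats(grade5_texts)
```

Reporting such a table per grade, alongside thematic distribution, would let readers judge the representativeness and balance of the corpus.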
In terms of evaluation, the article lacks adequate metrics. Only accuracy is reported, which is not entirely adequate for multi-class problems. It is preferable to use a confusion matrix and also a complete classification report that includes precision, recall, and f1-score per class. ROC/AUC curves are also not included.
It would have been interesting to discuss misclassified texts that could answer questions such as whether there are topics or genres that confuse the model, etc.
It is recommended to enrich the results section with visual representations that complement the numerical data. This would not only facilitate the interpretation of the findings, but also allow patterns and trends to be identified at a glance.
Comments on the Quality of English Language
There are frequent grammatical inaccuracies and stylistic inconsistencies that occasionally impede smooth reading. Most problems involve articles, which are often missing or used incorrectly, and there are also subject-verb agreement errors. Prepositions should likewise be checked for correct use.
Author Response
Please find the responses to all comments in the attached file.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
None.
Author Response
We sincerely thank the reviewer for the time spent evaluating our work.
Reviewer 2 Report
Comments and Suggestions for Authors
This is a revised version of the manuscript, following one round of review.
First, I am not satisfied with the authors’ response to Comment 8. What I asked for was a meaningful machine learning baseline model. Replacing this with random guessing or majority-class predictions is not an acceptable substitute. Moreover, the statement “KNN (22%) performs only slightly better than random” highlights a weakness of the proposed method rather than its strength. Comparing against a weak or uninformative baseline does not demonstrate the effectiveness of the proposed approach.
Second, the manuscript lacks any figures. The absence of visual elements makes it very difficult for readers to understand both the workflow and the results. Flowcharts of the analysis pipeline, performance comparison plots, confusion matrix heatmaps, and hyperparameter search visualizations would all substantially improve the clarity and impact of the paper.
In summary, while some effort has been made in revision, the issues with the choice of baseline and the lack of figures still limit the quality of the work. I encourage the authors to address these points before the manuscript can be considered further.
Author Response
We would like to express our gratitude to the reviewer, who generously devoted valuable time and patiently reviewed our work.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have addressed all the issues I raised, and I recommend publication.
Author Response
We would like to express our gratitude to the reviewer, who generously devoted valuable time and patiently reviewed our work.
Reviewer 4 Report
Comments and Suggestions for Authors
Overall, the authors have addressed a significant portion of the comments and suggestions raised in the previous round of review. Efforts have been made to enrich the manuscript in several key sections, including the Data Set Analysis, in which the statistical analysis of the corpus has been expanded to include not only basic token and vocabulary counts, but also the distribution of thematic categories and contextual metadata. This addition substantially improves the understanding of the representativeness and balance of the dataset, allowing for a more informed evaluation of the results. Regarding the Evaluation Metrics, it is very positive to see that more robust and appropriate evaluation metrics for a multi-class classification problem have been integrated. The inclusion of precision, recall, F1-score per class, along with confusion matrices, provides a much more complete and nuanced view of model performance, allowing specific strengths and weaknesses to be identified. Finally, the Error Analysis has been strengthened with a discussion of misclassified texts. This qualitative analysis, which explores patterns of confusion between adjacent grades and the possible linguistic or thematic reasons behind these errors, adds a valuable layer of interpretability and depth to the quantitative findings.
Author Response
We would like to express our gratitude to the reviewer, who generously devoted valuable time and patiently reviewed our work.