Article
Peer-Review Record

Analysis of Short Texts Using Intelligent Clustering Methods

Algorithms 2025, 18(5), 289; https://doi.org/10.3390/a18050289
by Jamalbek Tussupov 1, Akmaral Kassymova 2,*, Ayagoz Mukhanova 1,*, Assyl Bissengaliyeva 2, Zhanar Azhibekova 3, Moldir Yessenova 1 and Zhanargul Abuova 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 27 March 2025 / Revised: 5 May 2025 / Accepted: 8 May 2025 / Published: 19 May 2025
(This article belongs to the Section Databases and Data Structures)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper reviews state-of-the-art methods for short text clustering, including BERT, TF-IDF, and LDA+BERT+AE. It first outlines the theoretical basis of each method along with its advantages and disadvantages. The experimental part then compares the performance of these methods on short text clustering, with a special focus on the hybrid method LDA+BERT+AE. Based on the results, the paper recommends the most suitable clustering methods for different short-text and natural language processing tasks. In addition, it explores the application of these methods in industry and education, where successful text processing and classification are crucial.

 

The main flaws are as follows.

 

  1. This work is insufficiently novel since the proposed method is a combination of existing models (BERT, TF-IDF, and LDA+BERT+AE). The authors should examine what are the drawbacks of the existing models and carefully explain how to construct a more feasible method.

 

  2. Lines 91 to 133 of the paper are a stack of related methods and lack logic.

 

  3. The contributions of the paper are not clearly listed. It is recommended to highlight the contributions and innovations of the paper in section 1.

 

  4. Figures 1–11 suffer from low resolution, with critical details being blurred. Please provide vector graphics or high-resolution bitmaps and ensure readability.

 

  5. Experiments are conducted only on a single Kaggle news dataset, and recent SOTA methods are not included for comparison. I suggest increasing the dataset for validation and supplementing the comparative experiments with the latest deep models.

 

  6. The conclusion paragraph is too long and contains a lot of repetitive content. It is recommended to simplify the conclusion and focus on the core content.

 

  7. The paper has some obvious writing errors and formatting problems:

        (1) Equation at line 193 appears as raw LaTeX code, which must be replaced with properly rendered equations.

        (2) “This book aims...” in the introduction should be “This work aims...” or “This study aims...”.

Comments on the Quality of English Language

The English could be improved to more clearly express the research.

Author Response

1. Comment: “This work is insufficiently novel since the proposed method is a combination of existing models (BERT, TF-IDF, and LDA+BERT+AE). The authors should examine what are the drawbacks of the existing models and carefully explain how to construct a more feasible method.”

Response:
Thank you for the comment. We acknowledge that the components used in our method—BERT, TF-IDF, and LDA—are established. However, our contribution lies in designing a novel hybrid architecture (LDA+BERT+AE) that addresses the limitations of each individual method. In the revised manuscript, we expanded Section 2 ("Materials and Methods") to explicitly explain the shortcomings of TF-IDF (no context awareness), BERT (poor topic separability), and LDA (lack of semantic modeling). The hybrid method introduces a learnable weighted fusion with parameter γ to balance semantic and topic features, followed by autoencoder-based compression. This architecture is optimized for short texts, which are semantically dense and structurally sparse. We have emphasized this design rationale in both the introduction and methodology sections.
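The weighted fusion step described above can be illustrated with a minimal sketch. This is not the authors' code: the function name, the concatenation-based weighting, and the dimensions are all assumptions made for illustration; the paper's learnable fusion and autoencoder stage may differ.

```python
import numpy as np

def fuse_features(lda_vec, bert_vec, gamma=0.5):
    """Combine LDA topic features and BERT semantic features.

    `gamma` balances the semantic (BERT) view against the topic (LDA)
    view before the fused vector is compressed by an autoencoder.
    The weighting scheme here is illustrative only.
    """
    return np.concatenate([(1.0 - gamma) * np.asarray(lda_vec),
                           gamma * np.asarray(bert_vec)])

# toy example: 5-topic LDA distribution + 8-dim embedding
lda = np.array([0.1, 0.6, 0.1, 0.1, 0.1])
bert = np.ones(8)
fused = fuse_features(lda, bert, gamma=0.7)
print(fused.shape)  # (13,)
```

The fused vector would then be fed to the autoencoder for dimensionality reduction before clustering.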

2. Comment: “Lines 91 to 133 of the paper are a stack of related methods and lack logic.”

Response:
We thank the reviewer for this observation. We revised this section to enhance logical flow and thematic cohesion. Now, the models are discussed in a structured order—first highlighting their individual strengths and limitations, then leading into the motivation for the hybrid design. Each method (TF-IDF, BERT, LDA) is now linked to the corresponding function it serves in the hybrid framework.

3. Comment: “The contributions of the paper are not clearly listed. It is recommended to highlight the contributions and innovations of the paper in section 1.”

Response:
We agree and have revised the final paragraph of Section 1 to include an explicit statement of contributions. The updated paragraph now outlines: (1) the development of a hybrid LDA+BERT+AE method tailored for short texts; (2) the introduction of a weighted feature combination strategy using parameter γ; (3) a detailed evaluation using seven clustering metrics; and (4) a comparative analysis with classic and state-of-the-art baselines, with results indicating up to 98% accuracy and 0.9 F1-score.

4. Comment: “Figures 1–11 suffer from low resolution, with critical details being blurred. Please provide vector graphics or high-resolution bitmaps and ensure readability.”

Response:
We have replaced all low-resolution figures (Figures 1–11) with high-resolution vector-based versions in the revised manuscript. All text labels, legends, and annotations are now clearly legible and suitable for both digital and print formats.

5. Comment: “Experiments are conducted only on a single Kaggle news dataset, and recent SOTA methods are not included for comparison. I suggest increasing the dataset for validation and supplementing the comparative experiments with the latest deep models.”

Response:
We appreciate this important point. While the current paper focuses on a single dataset for controlled evaluation, we have added a paragraph in the Conclusion outlining future work. We plan to evaluate the proposed method on benchmark datasets such as AG News, 20 Newsgroups, and TweetEval. We also intend to compare performance with modern SOTA models including SBERT, SimCSE, and USE. This will provide broader validation of the method.

6. Comment: “The conclusion paragraph is too long and contains a lot of repetitive content. It is recommended to simplify the conclusion and focus on the core content.”

Response:
We have revised the conclusion for brevity and clarity. The updated version highlights the key results (accuracy, F1-score), methodological contributions (hybrid architecture, dimensionality reduction), practical implications (for content moderation, marketing, etc.), and plans for future work.

7. Comment: “The paper has some obvious writing errors and formatting problems: (1) Equation at line 193 appears as raw LaTeX code, which must be replaced with properly rendered equations. (2) ‘This book aims...’ in the introduction should be ‘This work aims...’ or ‘This study aims...’”

Response:
We have corrected both issues:

  • (1) The equation previously shown in raw LaTeX at line 193 is now properly rendered using equation formatting.

  • (2) The phrase “This book aims...” has been revised to “This study aims...” in the introduction.

We sincerely thank the reviewer for their insightful comments. We believe the improvements significantly enhance the clarity, novelty, and quality of our manuscript.

Reviewer 2 Report

Comments and Suggestions for Authors

The paper has the potential to be a valuable contribution, but in my opinion the clarity of presentation needs improvement. In particular I suggest revising the "Materials and Methods" section - I couldn't easily get the "big picture" of how the data were processed step-by-step, or what "data pipelines" were ultimately compared. There are also some issues regarding formatting and wording. Relevant comments can be found in the attached pdf file.

Comments for author File: Comments.pdf

Author Response

Response to Reviewer 2

We would like to sincerely thank Reviewer 2 for their careful reading of the manuscript and for providing valuable comments that helped improve the clarity, accuracy, and academic quality of our work. Below we provide detailed responses and the corresponding revisions made in the manuscript.

1. Comment (line 65): “This book aims...” — should be revised.

Response:
Thank you for pointing this out. The phrase "This book aims..." has been corrected to "This study aims..." to better reflect the nature of the article.

2. Comment (line 135): “It proposes?” — wording unclear.

Response:
We revised this sentence for clarity. The revised version now explicitly states: "This study proposes a hybrid method called dependent embedding..." to avoid ambiguity.

3. Comment (line 137): “Is this the correct name?” (referring to "hidden Dirichlet lattice")

Response:
We thank the reviewer for this observation. We corrected the terminology: “hidden Dirichlet lattice” was a typographical error. It has been replaced with the correct term “Latent Dirichlet Allocation (LDA).”

4. Comment (line 140): “‘you’ seems unnecessary here.”

Response:
Agreed. The use of “you” in scientific writing is informal and was removed. The sentence has been rephrased to: “The LDA method allows identification of common topics in large text arrays.”

5. Comment (line 177): “Presenting an individual line of code in such a way is unnecessary.”

Response:
Thank you for the feedback. We agree that raw code format is not appropriate in this context. The code line has been replaced with a mathematical notation to represent the operation concisely.

6. Comment (line 204): “Invalid formatting.”

Response:
We corrected the formatting of the mathematical expression for logarithmic transformation. The equation is now properly rendered as:

x_{i,j}^{(log)} = log(x_{i,j} + 1)

and referred to as Equation (2) in the revised manuscript.
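For readers implementing this step, Equation (2) is the standard elementwise log1p mapping. A minimal numpy sketch (illustrative; the matrix values are invented and this is not the authors' preprocessing code):

```python
import numpy as np

# Equation (2): x_{i,j}^{(log)} = log(x_{i,j} + 1), applied elementwise
X = np.array([[0.0, 1.0, 9.0],
              [3.0, 0.0, 99.0]])
X_log = np.log1p(X)  # numerically stable log(x + 1)
print(X_log[0, 2])   # log(10) ≈ 2.302585...
```

Using `np.log1p` rather than `np.log(X + 1)` avoids precision loss for values close to zero.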

7. Comment (line 196): “How is k defined?”

Response:
We clarified this in the revised manuscript: “k denotes the number of topics specified in the LDA model and also corresponds to the number of clusters used in the K-Means algorithm.”
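The shared role of k can be sketched as follows. This is a toy illustration using scikit-learn with invented documents, assuming the same pattern as the clarified text (LDA topic count equals the K-Means cluster count); it is not the paper's actual pipeline.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "stocks fall on weak quarterly earnings",
    "market rallies after central bank rate cut",
    "home team wins the championship final",
    "star striker scores twice in cup match",
]

k = 2  # number of LDA topics == number of K-Means clusters
counts = CountVectorizer().fit_transform(docs)
topic_vecs = LatentDirichletAllocation(n_components=k,
                                       random_state=0).fit_transform(counts)
labels = KMeans(n_clusters=k, n_init=10,
                random_state=0).fit_predict(topic_vecs)
print(labels)  # one of k cluster ids per document
```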

8. Comment (Figures 1–11): “The images should be of better quality.”

Response:
We appreciate the suggestion. All figures (Figures 1–11) have been replaced with high-resolution vector graphics to ensure full readability and clarity in print and online formats.

9. Comment (line 437): “Is this particular information (specific values of loss function) important?”

Response:
We revised the paragraph to summarize training behavior more concisely. Numerical details were retained in the Results section only, as they provide evidence of convergence and generalization. Redundant mention in the abstract and conclusion has been removed for brevity and relevance.

10. Comment (line 453): “This statement is a bit too 'bombastic'.”

Response:
We have revised the sentence in the conclusion to use a more academically neutral tone. The statement was changed from:

“...offer promising prospects for enhancing the accuracy and efficiency...”
to
“...show potential in improving the accuracy and efficiency...”

11. Comment (line 446): “This should go to the references section.”

Response:
We moved the detailed reference to the GitHub repository from the main text to the “Data Availability Statement” section, where it is more appropriately placed.

We hope that the revisions made in response to your thoughtful comments meet your expectations and contribute to improving the quality of the manuscript. Thank you once again for your valuable feedback.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes a novel hybrid architecture (LDA+BERT+AE) that addresses the limitations of each individual method. The hybrid method introduces a learnable weighted fusion with parameter γ to balance semantic and topic features, followed by autoencoder-based compression. This architecture is optimized for short texts, which are semantically dense and structurally sparse.

The main flaws are as follows.

1. The citation format for references 27-32 should be consistent with the citation format for other references.

2. In Section 3, it is recommended to annotate the comparative methods with references.

Author Response

Response to Reviewer Comments:
We sincerely thank the reviewer for the constructive feedback, which helped improve the clarity and scholarly rigor of our manuscript. Please find below our detailed responses to each point raised:
Citation Format Consistency (References 27–32):
We have carefully revised the reference list to ensure consistency with the MDPI citation style, as required by the journal guidelines. All citations, including references [27] through [32], are now formatted uniformly.
Annotation of Comparative Methods in Section 3:
To improve methodological clarity, we have annotated the comparative methods discussed in Section 3 (Results and Discussion) with appropriate references. Specifically, TF-IDF and BERT are now cited as [2] and [1], respectively, and the hybrid LDA+BERT+AE method is supported by references [23,24]. These citations were added at the respective mentions of the models to enhance methodological transparency and provide proper attribution to prior work, ensuring that the comparative analysis is fully supported by the relevant literature.

Reviewer 2 Report

Comments and Suggestions for Authors

The paper shows some improvement. I'm not sure if all the mentioned issues have been addressed - the authors' responses refer to lines other than the ones specified in the original pdf file; I don't know if it is due to some technical issue or another reason, but it makes the changes harder to follow. There are still some inconsistencies: for example, "LDA" is sometimes expanded to "Latent Dirichlet Allocation" and sometimes to "Hidden Dirichlet Allocation", the same abbreviations are explained multiple times etc. While these shortcomings do not disqualify the article, they do make it appear less professional.

Author Response

Response to Reviewer
We sincerely thank the reviewer for the continued attention to detail and constructive feedback on our revised manuscript.
We would like to address the concerns raised:
Line Number Discrepancies:
We acknowledge that the line references in our previous responses may have differed from those in the reviewer’s annotated PDF. This discrepancy likely stems from formatting and pagination differences between our local working version (e.g., in Word format) and the version uploaded to the system. To avoid confusion, we have now aligned all changes with the latest version of the manuscript and ensured that modifications are clearly traceable in the revised text.
Terminological Inconsistency (“LDA”):
Thank you for pointing this out. We have reviewed the manuscript and ensured that the abbreviation “LDA” is now consistently expanded as Latent Dirichlet Allocation throughout the text. All inconsistent references to "Hidden Dirichlet Allocation" have been corrected.
Redundant Explanations of Abbreviations:
We carefully edited the manuscript to remove duplicate or unnecessary repetitions of abbreviation explanations. Each term is now introduced once—at its first mention—and used consistently thereafter in accordance with academic style conventions.
We appreciate the reviewer’s observations, which helped us refine the manuscript further and improve its clarity and professionalism. We hope that these revisions are satisfactory.
