Article
Peer-Review Record

Text Similarity Detection in Agglutinative Languages: A Case Study of Kazakh Using Hybrid N-Gram and Semantic Models

Appl. Sci. 2025, 15(12), 6707; https://doi.org/10.3390/app15126707
by Svitlana Biloshchytska 1,2,*, Arailym Tleubayeva 3,*, Oleksandr Kuchanskyi 1,4,5,*, Andrii Biloshchytskyi 1,2, Yurii Andrashko 6, Sapar Toxanov 7, Aidos Mukhatayev 8 and Saltanat Sharipova 9
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 2 May 2025 / Revised: 9 June 2025 / Accepted: 12 June 2025 / Published: 15 June 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors propose a hybrid model combining N-gram analysis, TF-IDF, and semantic methods (Latent Semantic Analysis and probabilistic topic modeling) to improve detection accuracy. The study evaluates the model on a custom corpus of Kazakh texts, comparing its performance against standalone N-gram and TF-IDF approaches. The results indicate that the hybrid model achieves balanced precision and recall, albeit with higher computational costs.

Shortcomings:

1. The hybrid approach, while effective, is not novel. Similar combinations (e.g., TF-IDF + LSA) have been explored for other languages.

2. The hybrid model’s processing time (100.32 seconds) is significantly higher than TF-IDF (14.16 seconds), raising scalability concerns.

3. Precision (0.50–0.51) is low, indicating high false positives. This undermines the model’s reliability.

4. Missing comparisons with state-of-the-art methods (e.g., BERT-based models) weaken the novelty claim.

Comments on the Quality of English Language

Language errors: Typos: “ac- curacy” (p. 4).

Author Response

Thank you for your valuable comments. They helped improve the quality of the manuscript. Our author team has made the necessary revisions accordingly. If necessary, we will use the MDPI language editing service to improve the quality of the English and make the manuscript more comprehensible.

  1. The hybrid approach, while effective, is not novel. Similar combinations (e.g., TF-IDF + LSA) have been explored for other languages.

Thank you for the comment. Indeed, similar combinations do exist. However, we chose to explore a combination of methods specifically for the case of the Kazakh language. To the best of our knowledge, such combinations have not been applied in this context. Therefore, this study aims to address the scientific gaps in this particular area.

  2. The hybrid model’s processing time (100.32 seconds) is significantly higher than TF-IDF (14.16 seconds), raising scalability concerns.

Threshold optimization (λ) was performed, which led to a significant reduction in data processing time. As a result, the execution time of the hybrid model is 0.87 seconds. The corresponding changes have been incorporated into the manuscript (Table 5).

  3. Precision (0.50–0.51) is low, indicating high false positives. This undermines the model’s reliability.

We conducted a series of experiments with the threshold value and found that increasing the λ threshold to 0.7 completely eliminates false positives on the test set while maintaining an acceptable level of near-duplicate detection. The obtained results are described in the manuscript and presented in Table 5. Thus, by allowing a minimal reduction in recall, we achieved perfect precision, making the system highly reliable for automated text verification tasks where false positives are critical.

  4. Missing comparisons with state-of-the-art methods (e.g., BERT-based models) weaken the novelty claim.

Thank you for the comment. To justify the relevance and competitiveness of our hybrid approach, we conducted a direct comparison with the baseline BERT-like model, bert-base-multilingual-cased. The results have been added to the manuscript and demonstrate that the combination of lightweight models remains a practical and effective solution for near-duplicate detection tasks in low-resource agglutinative languages.

Reviewer 2 Report

Comments and Suggestions for Authors

The article presents the development of hybrid models for text data analysis to detect plagiarism in Kazakh texts by creating and applying combined methods that integrate statistical and semantic approaches. Below are comments on each paper section.

1. Introduction. The introduction begins at a global level, dealing with the problem of plagiarism, and gradually descends to the specific problem of plagiarism in agglutinative languages, in an interesting way and citing references. It is convincing as to the need for the development of the work presented in this article.

2. Literature review. The literature review was comprehensive, including the problem of dealing with text data, mathematical formulas, and numerical data, and emphasizing the need to develop methods for near-duplicate detection that account for the specific features of the Kazakh language. The paragraphs from line 197 to line 226 should be moved to the introduction, as they present features of the Kazakh language and argue for the usefulness of the work presented in the article. The literature review section should be reserved for the presentation of and commentary on scientific works that had similar objectives to this one.

3. Materials and Methods and 4. Methodology. Since there is already a section titled Materials and Methods, the title Methodology on line 267 may be removed, with Methodology being encompassed by Materials and Methods. Lines 268 to 271 may be removed, since their contents were already presented before and are not about methodology. In line 371, "g is dimensional" should probably be corrected to "g-dimensional". The method, based on distance measurements, is explained in detail. But, after the introduction has emphasized the particularity of the Kazakh language being agglutinative, it is expected that a relationship will be established between that problem and the proposed method. In what way is the method appropriate to deal with that particularity? The text did not clearly present a connection between the fact that the language is agglutinative and the characteristics of the method used.

5. Case Study. The corpus and the duplicates generation process are well described. This process seems appropriate for the experiments. The metrics are also well chosen and well described.

6. Results. The presentation of the results and their analysis were good.

7. Discussion and Conclusions. The discussion is in agreement with the results obtained. The conclusion followed logically from the other sections.

References. The list of references is adequate, comprehensive, and up-to-date.

Author Response

Thank you for your valuable comments. They helped improve the quality of the manuscript. Our author team has made the necessary revisions accordingly.

1. Introduction. The introduction begins at a global level, dealing with the problem of plagiarism, and gradually descends to the specific problem of plagiarism in agglutinative languages, in an interesting way and citing references. It is convincing as to the need for the development of the work presented in this article.

Thank you. We have revised the manuscript by adding additional results and performing optimization of the lambda parameter. These improvements are expected to enhance the scientific value of the paper.

  2. Literature review. The literature review was comprehensive, including the problem of dealing with text data, mathematical formulas, and numerical data, and emphasizing the need to develop methods for near-duplicate detection that account for the specific features of the Kazakh language. The paragraphs from line 197 to line 226 should be moved to the introduction, as they present features of the Kazakh language and argue for the usefulness of the work presented in the article. The literature review section should be reserved for the presentation of and commentary on scientific works that had similar objectives to this one.

Thank you for the comment. We have moved the part of the text related to the features of the Kazakh language to the introduction section.

  3. Methodology. Since there is already a section titled Materials and Methods, the title Methodology on line 267 may be removed, with Methodology being encompassed by Materials and Methods. Lines 268 to 271 may be removed, since their contents were already presented before and are not about methodology. In line 371, "g is dimensional" should probably be corrected to "g-dimensional". The method, based on distance measurements, is explained in detail. But, after the introduction has emphasized the particularity of the Kazakh language being agglutinative, it is expected that a relationship will be established between that problem and the proposed method. In what way is the method appropriate to deal with that particularity? The text did not clearly present a connection between the fact that the language is agglutinative and the characteristics of the method used.

Thank you. We have deleted lines 268 to 271, merged Sections 3 and 4, and corrected the typo in line 371. The hybrid method for near-duplicate detection is well suited to agglutinative languages because it accounts for morphological variability through normalization and semantic generalization, for free word order through order-insensitive metrics, and for synonymy and polysemy through thematic and conceptual similarity. This information has been incorporated into the revised version of the paper.
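As an illustration of the order-insensitive, normalization-based comparison described above, a minimal sketch follows (this is our illustrative code, not the authors' implementation; the toy suffix list is a hypothetical stand-in for proper Kazakh morphological analysis):

```python
# Minimal sketch: order-insensitive similarity over morphologically
# normalized tokens. TOY_SUFFIXES is illustrative only; a real system
# would use a Kazakh morphological analyzer or lemmatizer.
TOY_SUFFIXES = ("лардың", "лердің", "лардан", "лерден", "лар", "лер", "дың", "дің")

def normalize(token: str) -> str:
    """Strip one matching suffix to approximate a stem."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Set-based similarity: insensitive to word order by construction."""
    set_a = {normalize(t) for t in text_a.lower().split()}
    set_b = {normalize(t) for t in text_b.lower().split()}
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

Because the comparison operates on sets of normalized stems, reordering words or attaching different suffixes to the same stem changes the score far less than exact string matching would.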

  4. Case Study. The corpus and the duplicates generation process are well described. This process seems appropriate for the experiments. The metrics are also well chosen and well described.

We have added information to this section to improve the understanding of the results presented in the paper.

  5. Results. The presentation of the results and their analysis were good.

Thank you. We have added a textual description of the results and additionally implemented a baseline BERT model on the given dataset. This was necessary to enable a comparative analysis with the hybrid model.

  6. Discussion and Conclusions. The discussion is in agreement with the results obtained. The conclusion followed logically from the other sections. References. The list of references is adequate, comprehensive, and up-to-date.

Thank you.

Reviewer 3 Report

Comments and Suggestions for Authors

Thanks to the authors for providing this innovative manuscript. Please consider the following review comments for improvement:

  1. The abstract is a little too long. Consider condensing the content to make it shorter.

  2. It is recommended to add a Figure 1 that summarizes the entire workflow.

  3. The recall value of 1.0 for both TF-IDF and Hybrid models does not convincingly show the advantage of the proposed hybrid model. Please consider enriching the dataset, or having other metrics. Then a comprehensive discussion with the updated dataset/metrics.

  4. For the N-gram parameter n, please consider providing a sensitivity analysis, as N-gram performance heavily depends on the value of n.

  5. The formulas in Section 4 need to be better organized. For example, clearly define symbols like F(x,x) and K_ij. Please consider improving the formatting to make it easier to follow.

  6. The authors mention using LSH, LSA, and probabilistic topic models in the hybrid model. How are these components integrated in the proposed method in Section 4?

  7. Preprocessing is also critical to the dataset. Filters like Gabor and Gaussian are mentioned—please provide concrete data points, such as a table or chart, showing what was filtered out.

  8. How do the authors handle images and formulas with the specific methods mentioned? Are those methods implemented in the paper?

  9. The preparation of the corpus is critical to this study. Please consider including some example entries from the dataset.

  10. What is the value of n used in Table 1? The table currently only shows the type and count.

Author Response

Thank you for your valuable comments. They helped improve the quality of the manuscript. Our author team has made the necessary revisions accordingly.

1. The abstract is a little too long. Consider condensing the content to make it shorter.

Thank you for the comment. We have corrected this in the manuscript. We have also revised the abstract to reflect the updated results.

2. It is recommended to add a Figure 1 that summarizes the entire workflow.

Thank you. We have added a figure that illustrates the entire workflow.

3. The recall value of 1.0 for both TF-IDF and Hybrid models does not convincingly show the advantage of the proposed hybrid model. Please consider enriching the dataset, or having other metrics. Then a comprehensive discussion with the updated dataset/metrics.

Thank you. We have carefully proofread the manuscript and corrected the typos. Additionally, if the MDPI editorial team considers it necessary for us to use a language editing service, we are ready to do so.

4. For the N-gram parameter n, please consider providing a sensitivity analysis, as N-gram performance heavily depends on the value of n.

Thank you. We analyzed the parameter values of the hybrid model and computed several additional models to verify the results. These new calculations have been included in the manuscript. The article now contains a table presenting results for different values of n. We have also revised the presentation of the results and added the necessary information to the manuscript.
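For context, word n-gram extraction for varying n can be sketched as follows (illustrative only, not the authors' pipeline; the example phrase is ours):

```python
def word_ngrams(text: str, n: int) -> list:
    """Return all contiguous word n-grams of the text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Smaller n produces many short, noisy matches; larger n produces
# fewer but stricter matches, hence the sensitivity to n.
phrase = "мәтіндердің ұқсастығын анықтау әдісі"  # illustrative Kazakh phrase
for n in (1, 2, 3):
    print(n, word_ngrams(phrase, n))
```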

5. The formulas in Section 4 need to be better organized. For example, clearly define symbols like F(x,x) and K_ij. Please consider improving the formatting to make it easier to follow.

We included an overall diagram of the method to improve the readability of the section and revised the corresponding text accordingly.

6. The authors mention using LSH, LSA, and probabilistic topic models in the hybrid model. How are these components integrated in the proposed method in Section 4?

Thank you. These updates have been described in the new version of the manuscript.

7. Preprocessing is also critical to the dataset. Filters like Gabor and Gaussian are mentioned—please provide concrete data points, such as a table or chart, showing what was filtered out. How do the authors handle images and formulas with the specific methods mentioned? Are those methods implemented in the paper?

In this study, we focused on detecting near duplicates in text written in the Kazakh language. We consider this to be a novel contribution, as Kazakh is a low-resource language. However, we anticipate that future research will also address image and mathematical formula analysis.

8. The preparation of the corpus is critical to this study. Please consider including some example entries from the dataset.

Examples of the records have been added to the supplementary materials. Since the corpus is in Kazakh, including it directly in the body of the article may confuse readers, so we chose to present it separately.

9. What is the value of n used in Table 1? The table currently only shows the type and count.

Thank you. We identified and corrected a mistake in the title of the table, which has been fixed in the revised version of the manuscript.

Reviewer 4 Report

Comments and Suggestions for Authors

This is an interesting article that presents a method to detect plagiarized text in Kazakh texts using a hybrid approach. This is a highly relevant paper, as it provides insights into an understudied language, and that is always a valuable contribution. However, there are several issues (both minor and major) that need to be addressed before the article is appropriate for publication. I am providing the comments as I read the paper:

I would call "electronic scientific papers" simply "scientific papers". I am unsure that this qualifier is necessary.

The abbreviation of Latent Semantic Analysis (LSA) is defined twice. Please search for other instances of duplicate definition of abbreviations (like LSH and LSA).

Minor typos/other writing issues: "higher ac- curacy", "Let’s B"

Papers should be referenced differently. Saying "The work [X] shows" or "The paper [Y] does" is generally bad writing. The references are not elements of a sentence. You should include them at the end and speak in other terms, such as "Author et al. did ABC in previous work [X]". Please consider revising the writing throughout the document to improve this.

I would recommend splitting the Literature Review section into different subsections. First, I would give the background on the characteristics of the Kazakh language at the start (line 197 to 226). Then separate it into information about previous studies on textual matching in Kazakh (lines 115 to 196). And then finally the rest of the section.

I would recommend adding a summary table/figure about existing approaches.

I would recommend adding a Figure with an example that illustrates the properties of the Kazakh language and maybe a comparison with English.

The ending of the Literature Review addresses the gap and describes the main goal of the article, but I would also recommend modifying the introduction so that it explicitly lists the research questions and/or contributions that guide this study.

The materials and methods section is missing proper references. It is strange that materials and methods is separate from methodology. These sections should be merged together and reorganized. Again, there are not enough references in this section.

I would also recommend including a diagram to provide a big picture overview of the methodology.

The function F(B, B_i) is insufficiently defined (it is only described as a distance). This seems critical to the methodology, and it should be explicitly defined, or potential "values" of F should be referenced (e.g., what type of distance function should be used). This would also benefit from the big picture diagram that provides an overview of what you are trying to do with your approach. I am assuming this F is based on one of the aforementioned filters, but it is not clear. This seems to be clarified further down, but it is hard to follow in its current form.

Section 4 is very hard to follow. I would recommend splitting the methodology into several substeps that clearly delineate what you are trying to do first, and then do it. By this, I mean splitting it into subsections that explain in natural language and THEN throw in the mathematics. In its current form, it is simply a barrage of mathematics that requires a lot of parsing to follow. Again, this would benefit from a diagram and also better structuring to grasp the intuition behind your methods.

Also, I see a lot of predetermined thresholds lambda that probably need to be explored in more detail as potential hyperparameters.

Section 5.1 should be part of the Materials and Methods, as it describes the corpus and the necessary data augmentation techniques. I would recommend explicitly referencing some papers on data augmentation given this focus on implementing these techniques to handle the low amount of data available for the Kazakh language.

5.2 and 5.3 also seem to be potentially part of Materials and Methods, or maybe they could be rolled into the proper results depending on how the authors restructure the article.

There are certain redundancies throughout the paper (e.g., the data set is described again in lines 552 to 557). Please check all such instances.

It is very suspicious that you are getting a 100% precision and recall in the unigram model. I would recommend double checking this and verifying whether you had some overfitting or data leakage issues.

Was there a proper training/testing split? Or did you directly use your method on the full data? Does your method require any training? What about those lambda thresholds? How did you adjust them? What are the results of the state of the art methods? How do they compare against yours? I am very unsure of the validity of the results. The lack of a formal comparison against other methods is worrying, also the fact that no error or dispersion measures are reported.

I would recommend running an ablation study where you test parts of your proposed methodology (as I see there are several components to it, so you could try removing one of them and seeing the effects on performance).

Author Response

Thank you for your valuable comments. They helped improve the quality of the manuscript. Our author team has made the necessary revisions accordingly. If necessary, we will use the MDPI language editing service to improve the quality of the English and make the manuscript more comprehensible.

  1. I would call "electronic scientific papers" simply "scientific papers". I am unsure that this qualifier is necessary.

Thank you for the comment. We have corrected this in the manuscript.

  2. The abbreviation of Latent Semantic Analysis (LSA) is defined twice. Please search for other instances of duplicate definition of abbreviations (like LSH and LSA).

Thank you for the comment. The issue has been addressed in the revised version.

  3. Minor typos/other writing issues: "higher ac- curacy", "Let’s B"

Thank you. We have thoroughly proofread the manuscript and corrected all typos. Additionally, if the MDPI editorial team recommends the use of a language editing service, we are prepared to proceed accordingly.

  4. Papers should be referenced differently. Saying "The work [X] shows" or "The paper [Y] does" is generally bad writing. The references are not elements of a sentence. You should include them at the end and speak in other terms, such as "Author et al. did ABC in previous work [X]". Please consider revising the writing throughout the document to improve this.

The writing style in the literature review section has been revised.

  5. I would recommend splitting the Literature Review section into different subsections. First, I would give the background on the characteristics of the Kazakh language at the start (line 197 to 226). Then separate it into information about previous studies on textual matching in Kazakh (lines 115 to 196). And then finally the rest of the section.

Thank you for the comment. We have moved the part of the text concerning the features of the Kazakh language to the introduction, as recommended by another reviewer. Therefore, we believe it is not necessary to divide this section into subsections.

  6. I would recommend adding a summary table/figure about existing approaches.

Thank you for the comment. The comparison table of the methods has been added (Table 1).

  7. I would recommend adding a Figure with an example that illustrates the properties of the Kazakh language and maybe a comparison with English.

Thank you. A comparison of the features of the Kazakh and English languages has been added to the manuscript and presented in Table 2.

  8. The ending of the Literature Review addresses the gap and describes the main goal of the article, but I would also recommend modifying the introduction so that it explicitly lists the research questions and/or contributions that guide this study.

We have included the following statement in the literature review to emphasize the relevance of the study: “The literature review indicates the need to address the scientific gap related to the lack of effective and linguistically adapted methods for detecting near duplicates in low-resource agglutinative languages, such as Kazakh, by developing and validating a hybrid model that combines statistical and semantic analysis.”

  9. The materials and methods section is missing proper references. It is strange that materials and methods is separate from methodology. These sections should be merged together and reorganized. Again, there are not enough references in this section.

Thank you. We have merged Sections 3 and 4.

  10. I would also recommend including a diagram to provide a big picture overview of the methodology.

Thank you for the comment. We have included a diagram illustrating the overall structure of the method.

  11. The function F(B, B_i) is insufficiently defined (it is only described as a distance). This seems critical to the methodology, and it should be explicitly defined, or potential "values" of F should be referenced (e.g., what type of distance function should be used). This would also benefit from the big picture diagram that provides an overview of what you are trying to do with your approach. I am assuming this F is based on one of the aforementioned filters, but it is not clear. This seems to be clarified further down, but it is hard to follow in its current form.

We have added relevant information and included a general workflow diagram to the manuscript.

  12. Section 4 is very hard to follow. I would recommend splitting the methodology into several substeps that clearly delineate what you are trying to do first, and then do it. By this, I mean splitting it into subsections that explain in natural language and THEN throw in the mathematics. In its current form, it is simply a barrage of mathematics that requires a lot of parsing to follow. Again, this would benefit from a diagram and also better structuring to grasp the intuition behind your methods.

Thank you. We added a schematic representation of the study and made corresponding textual revisions, including a thorough proofreading of the English language. These changes aim to enhance the clarity and comprehensibility of the manuscript.

  13. Also, I see a lot of predetermined thresholds lambda that probably need to be explored in more detail as potential hyperparameters.

Thank you for the comment. The adjustment and optimization of the threshold parameter contributed to the improved processing speed of the hybrid model. This has been reflected in the results section. In the hybrid approach, several threshold values (λ) influence the final outcome. Each threshold acts as a hyperparameter, as its value directly affects the model’s precision, recall, and F1-score. The manuscript provides a detailed explanation of the hybrid model’s performance based on the optimized combined threshold.

  14. Section 5.1 should be part of the Materials and Methods, as it describes the corpus and the necessary data augmentation techniques. I would recommend explicitly referencing some papers on data augmentation given this focus on implementing these techniques to handle the low amount of data available for the Kazakh language.

The dataset is described in Section 5, “Case Study: the dataset for duplicate detection in Kazakh”. The manuscript also includes a comparative analysis of well-known near-duplicate detection methods and the implementation of a baseline BERT model to benchmark the proposed hybrid model.

  15. 5.2 and 5.3 also seem to be potentially part of Materials and Methods, or maybe they could be rolled into the proper results depending on how the authors restructure the article.

We have moved subsections 5.2 and 5.3 into the Results section.

  16. There are certain redundancies throughout the paper (e.g., the data set is described again in lines 552 to 557). Please check all such instances.

The text has been revised and duplicate content has been removed.

  17. It is very suspicious that you are getting a 100% precision and recall in the unigram model. I would recommend double checking this and verifying whether you had some overfitting or data leakage issues.

Thank you for the comment. For the task of detecting near duplicates, a precision of 1.0 is a strong result, indicating the absence of false positives. If unique texts are incorrectly identified as similar without clear evidence of partial duplication, the plagiarism detection process becomes unreliable. Therefore, in our case, the model’s performance is primarily evaluated using recall and F1-score. We have optimized the threshold values, conducted the necessary effectiveness evaluations, and added the results to the manuscript.

  18. Was there a proper training/testing split? Or did you directly use your method on the full data? Does your method require any training? What about those lambda thresholds? How did you adjust them? What are the results of the state of the art methods? How do they compare against yours? I am very unsure of the validity of the results. The lack of a formal comparison against other methods is worrying, also the fact that no error or dispersion measures are reported.

Thank you for the comment. All suggestions have been taken into account and incorporated into the manuscript accordingly.

Round 2

Reviewer 4 Report

Comments and Suggestions for Authors

I believe the authors have addressed multiple comments, but I still have some questions regarding the perfect results for precision.

The paper mentions that cross-validation was used to train the models and find the hyperparameters, this should be presented in more detail.

Provide detailed results of the cross-validation (e.g., how many folds?) and the ranges of tested hyperparameters (which hyperparameters exactly?).

Also, since cross-validation was used, the results should be reported with error ranges across the K-folds. This is why I also still find the perfect precision strange.

Maybe it is something I am not seeing, but it is very strange to get perfect results, especially without any information on the cross-validation error. Please provide more details and clarify why this would be the case.

Author Response

1. The paper mentions that cross-validation was used to train the models and find the hyperparameters, this should be presented in more detail.

Thank you for the comment. In fact, we did not use a classical k-fold cross-validation procedure for gradient-based model training, as LSA and LDA do not require iterative optimization via gradient descent. Instead, we employed the following evaluation scheme:

- the corpus was randomly split into an 80% training set and a 20% test set;
- from the training set, an additional 20% hold-out subset was used to tune threshold hyperparameters;
- the final performance evaluation (Precision, Recall, F1-score) was conducted on the independent test set.

This inaccuracy has been corrected in the revised version of the manuscript.
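A minimal sketch of this evaluation scheme (the function and variable names are ours, for illustration; this is not the authors' code):

```python
import random

def split_corpus(pairs, seed=42):
    """80/20 train/test split, with a further 20% of the training set
    held out for tuning the threshold hyperparameters (λ)."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    train, test = shuffled[:cut], shuffled[cut:]
    n_holdout = int(0.2 * len(train))
    holdout, train = train[:n_holdout], train[n_holdout:]
    # fit LSA/LDA on train, tune thresholds on holdout, report on test
    return train, holdout, test
```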

2. Provide detailed results of the cross-validation (e.g., how many folds?) and the ranges of tested hyperparameters (which hyperparameters exactly?).

Thank you for the question. Since LSA and LDA models are trained in a single pass over the entire training set, without repeated resampling as in classical k-fold cross-validation, it is not feasible to compute conventional error bars based on fold variance.

3. Also, since cross-validation was used, the results should be reported with error ranges across the K-folds. This is why I also still find the perfect precision strange.

Thank you for the question. In the hybrid model, we optimized only the threshold coefficients on the validation subset: λJ is the threshold for Jaccard similarity (MinHash); λLSA is the threshold for cosine similarity in LSA; λLDA is the threshold for cosine similarity in LDA; λfused is the final smoothed threshold for the fused score. For each of the three partial thresholds, we assigned values from the grid {0.1, 0.2, …, 0.9}, while λfused was searched over the grid {0.5, 0.6, 0.7, 0.8, 0.9}. In total, approximately 9³ × 5 = 3645 combinations were evaluated, which allowed for a thorough exploration of the impact of each threshold.
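A compact sketch of that exhaustive search (the `evaluate_on_holdout` helper is hypothetical; selecting for zero false positives reflects the objective the authors state below):

```python
from itertools import product

PARTIAL_GRID = [round(0.1 * i, 1) for i in range(1, 10)]  # {0.1, ..., 0.9}
FUSED_GRID = [0.5, 0.6, 0.7, 0.8, 0.9]

def grid_search(evaluate_on_holdout):
    """Score all 9**3 * 5 = 3645 threshold combinations on the hold-out set."""
    best, best_f1 = None, -1.0
    for lam_j, lam_lsa, lam_lda, lam_fused in product(
            PARTIAL_GRID, PARTIAL_GRID, PARTIAL_GRID, FUSED_GRID):
        precision, recall, f1 = evaluate_on_holdout(lam_j, lam_lsa, lam_lda, lam_fused)
        if precision == 1.0 and f1 > best_f1:  # require zero false positives
            best, best_f1 = (lam_j, lam_lsa, lam_lda, lam_fused), f1
    return best
```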

4. Maybe it is something I am not seeing it, but it is very strange to get perfect results, specially without any information on the cross-validation error. Please provide more details and clarify why this would be the case.

We are grateful for the reviewer’s comment; however, in the manuscript we have provided a rationale for why the occurrence of false positives is unacceptable in near-duplicate detection systems. Since our primary objective was to achieve zero false positives (FP = 0), the resulting Precision remained consistently perfect (1.00) across all hold-out/test splits. Notably, even under a classical cross-validation scheme, the false positive rate would remain zero across all folds, leading to an error bar of zero for the Precision metric. This strict performance was attained by applying a conservative threshold of λfused ≥ 0.7. Under this setting, the model rejects any candidate pair in which at least one of the three similarity signals (MinHash, LSA, or LDA) does not exceed the specified threshold. This design reflects a conscious trade-off, prioritizing absolute certainty in positive predictions (high Precision) at the expense of a moderate reduction in Recall (approximately 0.73).
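To make the arithmetic explicit: Precision = TP / (TP + FP), so with FP = 0 it equals TP / TP = 1 regardless of how many duplicates are found, while Recall = TP / (TP + FN) ≈ 0.73 reflects the true duplicates that remain undetected.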

Achieving perfect Precision is intentional and highly desirable in industrial applications where the cost of false alarms is prohibitive. In these scenarios, it is often preferable to leave some true duplicates undetected rather than risk incorrectly flagging a non-duplicate. The architecture of our hybrid method, which performs strong aggregation of three independent similarity measures, combined with an exhaustive grid search on the hold-out validation set, provides robust protection against false positives.
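For concreteness, the conjunctive rejection rule described above might look like the following sketch (the names and the per-signal default thresholds are illustrative; only λfused ≥ 0.7 is stated in the text):

```python
def is_near_duplicate(minhash_sim, lsa_sim, lda_sim, fused_score,
                      lam_j=0.7, lam_lsa=0.7, lam_lda=0.7, lam_fused=0.7):
    """Accept a pair only if every signal clears its threshold.
    A single weak signal rejects the pair, which favors precision
    (zero false positives) at some cost in recall."""
    return (minhash_sim >= lam_j and lsa_sim >= lam_lsa
            and lda_sim >= lam_lda and fused_score >= lam_fused)
```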

We hope that our responses have clarified some of the previously unexplained aspects of our study.

Round 3

Reviewer 4 Report

Comments and Suggestions for Authors

I have no further observations. The authors have properly addressed my concerns in their response.
