Article
Peer-Review Record

Prefix Data Augmentation for Contrastive Learning of Unsupervised Sentence Embedding

Appl. Sci. 2024, 14(7), 2880; https://doi.org/10.3390/app14072880
by Chunchun Wang and Shu Lv *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 1 February 2024 / Revised: 24 March 2024 / Accepted: 25 March 2024 / Published: 29 March 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper presents a novel approach, Prefix Data Augmentation (Prd), to enhance the learning of sentence embeddings in an unsupervised manner. The method contributes to natural language processing by improving the semantic representation of sentences without the need for labeled data.
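To make the idea under review concrete, the following is a minimal sketch of prefix-based positive-pair construction for contrastive learning. It is a hypothetical illustration, not the authors' implementation: the prefix strings and function names are invented for the example.

```python
# Illustrative sketch (not the authors' code): prefix data augmentation
# builds a positive pair by prepending a semantically neutral prefix to a
# sentence; in SimCSE-style training, the original and prefixed views are
# pulled together while other in-batch sentences serve as negatives.

EXAMPLE_PREFIXES = ["It is known that", "One could say that"]  # invented examples

def make_positive_pairs(sentences, prefix=EXAMPLE_PREFIXES[0]):
    """Pair each sentence with a prefix-augmented view of itself."""
    return [(s, f"{prefix} {s}") for s in sentences]

pairs = make_positive_pairs(["The cat sat on the mat."])
```

Under this reading, the augmented view preserves the core semantics of the anchor sentence, which is what makes it a plausible positive in a contrastive objective.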

However, several aspects of the paper require significant improvement: the methodological explanations lack detail, and the presentation of results lacks clarity. The abstract should be revised to better reflect the content of the article, and the discussion of the practical implications of the findings should be expanded. The manuscript would also greatly benefit from an extended literature review, a more detailed justification of the choice of baseline methods, an extended explanation of the experimental setups, and a detailed discussion of the results. Although the paper shows improvements on some benchmarks, the decrease in performance on the TREC task indicates that the approach does not improve performance on all types of NLP tasks; further research is needed to understand the limitations and applicability of PrdSimCSE.

Some comments and remarks:

· It is recommended to revise the abstract so that it accurately reflects the content, aims, and conclusions of the article.

· I recommend adding the unique contribution of your research at the end of the Introduction section.

· In addition, I recommend including a brief overview of the structure of the paper at the end of the Introduction section.

· Equation (1) is presented without a detailed description of the variables involved.

· The claimed improvement of 1.08% over the baseline is relatively small. The paper could discuss the practical implications of this improvement.

· “Our PrdSimCSE performs better, thanks to Prd” (p. 2) - In an academic article, a more formal tone and precise wording are usually preferred.

· Provide more methodological details regarding the selection and application of prefixes for data augmentation. More clearly emphasize the new aspects of your prefix data augmentation method, especially how it differs from existing data augmentation methods in NLP.

· “The proposed CBOW and skip-gram methods involve predicting surrounding words…” (p. 2) - In the context of the article, mentioning CBOW without first defining it may confuse readers unfamiliar with the term. It is recommended that the first time an abbreviation appears, its full form and a brief explanation be introduced. A list of abbreviations can also be included.

· While the selection of baselines for comparison provides a comprehensive evaluation framework, the paper would benefit from a more detailed rationale for the choice of these specific baselines.

· I recommend providing more detailed information about the experimental setup. This should include specific PrdSimCSE model configurations, the computational environment, data preprocessing steps, and any hyperparameter adjustments made during the experiments.

· Conducting experiments on more datasets and in different linguistic contexts can help validate the capabilities of the proposed method.

· Although the designations used in Tables 2 and 3 seem intuitive, each designation used in these tables should be explained for clarity and understanding.

· Expand the literature review to include a wider range of related work to determine the paper's contribution to the field. Comparison with state-of-the-art methods other than SimCSE, especially those that utilize different augmentation strategies, could provide a more complete understanding of the novelty of the work.

· Clearly summarize the novelty of your research. Describe how your approach to prefix data augmentation (Prd) for unsupervised sentence embedding differs from existing techniques. Summarize the key findings or improvements your method offers over baseline models.

· Although the use of arXiv sources provides timely access to the latest research, I recommend that references to formal peer-reviewed sources be preferred whenever possible. Peer-reviewed articles pass a strict evaluation process to ensure a high standard of scientific credibility and validity. If equivalent peer-reviewed sources are available for arXiv preprints, replacing them with peer-reviewed references will greatly enhance the credibility and scientific basis of your article.

· The paper concludes with a brief description of the proposed PrdSimCSE. However, I recommend that you expand this section (Conclusion) to include more detailed descriptions of key findings, explicit connections to the goals and research questions of the paper, a discussion of any methodological limitations, and a description of possible future research and practical applications in NLP.

· Although the article addresses some of the limitations of the proposed method, a more detailed discussion of potential limitations would be worth adding. In addition, a description of specific directions for future research would be valuable.

Comments on the Quality of English Language

Extensive editing of English language required

Author Response

Please see the attachment. 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper presents interesting research on contrastive unsupervised learning using Prefix Data Augmentation. The paper is well written and clearly presented and structured.

Existing relevant methods are presented and discussed. The bibliography mostly targets practical applications of particular methods, e.g., sentence embeddings, contrastive learning, etc.

In my opinion, the data augmentation section lacks relevant sources and discussion of methods, including the efficiency and shortcomings of data augmentation techniques. Moreover, some questions are posed that require a discussion of how different authors have dealt with these issues. Section 6 focuses on this but does not provide relevant works and sources.

The measures and approaches used in measuring semantic similarity also need more extensive coverage, as this is claimed to be an important component of the methodology.
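For context, the standard score in STS-style semantic similarity evaluation is the cosine similarity between sentence embeddings. A minimal sketch in plain Python, using toy vectors rather than real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors: 1.0 means the
    directions coincide, 0.0 means the vectors are orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

In practice, such model-predicted similarities are compared against human similarity judgments (e.g., via Spearman correlation) to score an embedding method on STS benchmarks.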

Some reported results are not very convincing, e.g., the improvement of 1% over the baseline (lines 56-57), and even less for other measures (lines 156-157); at the very least, it should be elaborated what makes these results important. With this in view, it could be discussed how the data augmentation approaches and the properties of the data (e.g., observations on sentence length, Sec. 5.4) influence the results, and how efficiency could be improved. This remains relevant and beneficial even if the method does not significantly surpass the baseline.

The discussion in Sec. 6 is somewhat vague and abstract. In my opinion, it should focus more on the data, with concrete properties and examples drawn from the data.

The Conclusion section can be expanded with more notes on the improvement of the method and its applications.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Based on the comments, the authors made significant improvements to their manuscript. The changes address all major issues that were mentioned in the review. Given these significant changes and additions, I think that the authors have adequately addressed all of the major comments made during the review process. The authors thoroughly revised their manuscript to improve the clarity and overall quality of the study results. Nevertheless, some inaccuracies remain that may require further revision or attention:

· “As illustrated in table 1, changing..” (p. 4);

· “The training corpus was derived from an unlabeled dataset collected by Gao et al. …” (p. 6) - the reference to the specific source is missing.

· “The semantic similarity tasks includes STS…” (p. 6).

· Table 8, Table 9 - I would suggest revising the table titles, which are too long, and moving some of the text to the main text.

The results obtained may still be somewhat controversial, or may prompt further research. The comparison with the baseline models is sufficient, but it would benefit from a discussion of the practical implications of the improvements achieved. For example, a detailed description of how the 1.08% performance improvement on BERTbase translates into real-world applications could provide valuable insight into the significance of the results obtained.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
