Article
Peer-Review Record

Asking Questions about Scientific Articles—Identifying Large N Studies with LLMs

by Razvan Paroiu 1, Stefan Ruseti 1, Mihai Dascalu 1,2,*, Stefan Trausan-Matu 1,2 and Danielle S. McNamara 3
Reviewer 1: Anonymous
Reviewer 3:
Electronics 2023, 12(19), 3996; https://doi.org/10.3390/electronics12193996
Submission received: 3 August 2023 / Revised: 17 September 2023 / Accepted: 18 September 2023 / Published: 22 September 2023
(This article belongs to the Special Issue Emerging Theory and Applications in Natural Language Processing)

Round 1

Reviewer 1 Report

The paper introduces an automated method that supports the identification of large-scale studies in terms of population based on LLMs. The authors have provided sufficient work on datasets and experiments. But the paper has shortcomings in the following aspects.

(1) The related work is insufficient. The paper points out that the task of extracting large-scale studies is important, but there is only one reference to directly related research work, which is also the authors' own work. Are there any other related works?

(2) The research basis is insufficient. Multiple prompt templates were attempted, and a large number of experimental results were compared in the paper. But why were these templates used, and according to what were they designed? Or is it just a comparison of various attempts?

I don't deny that this article has done a lot of work, but its innovation is poor.

Comments for author File: Comments.pdf

The Quality of English Language is OK.

Author Response

The paper introduces an automated method that supports the identification of large-scale studies in terms of population based on LLMs. The authors have provided sufficient work on datasets and experiments. But the paper has shortcomings in the following aspects.

Response: We express our gratitude to the reviewer for the careful examination of our work and for highlighting its shortcomings.

 

(1) The related work is insufficient. The paper points out that the task of extracting large-scale studies is important, but there is only one reference to directly related research work, which is also the authors' own work. Are there any other related works?

Response: We thank the reviewer for the insightful comment. Indeed, our investigation did not yield any automated method for extracting participant counts from published articles, with the exception of Corlatescu's work. Notably, in the second paragraph of the Introduction, we have referenced an article that outlines a widely practiced heuristic for determining the minimum number of participants necessary in a study. This heuristic involves a comprehensive reading of existing literature and extracting the participant counts prevalent in these studies. Researchers then establish a benchmark for the number of participants in their own study. It is important to emphasize that, due to the absence of references to automated participant count extraction methods, the current practice within the research community entails manual extraction of participant counts during their thorough reading of state-of-the-art literature.

 

(2) The research basis is insufficient. Multiple prompt templates were attempted, and a large number of experimental results were compared in the paper. But why were these templates used, and according to what were they designed? Or is it just a comparison of various attempts?

Response: We sincerely appreciate the feedback provided by the reviewer. We wish to highlight that our prompts were not only manually created but also automatically generated using the AMA technique (which involved a selection process from a substantial pool of generated prompts). This method of prompt generation is a state-of-the-art technique that is also adopted in current research within the domain.

After we selected the prompts, the rest of the research is indeed just a comparison of various attempts. It is important to emphasize that our method is fundamentally centered on zero-shot prompting. However, we believe that our approach constitutes an enhanced version of the zero-shot paradigm. This is attributed to our practice of choosing the model's responses based on the degree of certainty it exhibits in its answers.
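To make this selection mechanism concrete, below is a minimal sketch of how a "Yes"/"No" confidence can be obtained from a FLAN-T5 checkpoint. It assumes the Hugging Face transformers API; the prompt wording and the 0.9 threshold are illustrative placeholders, not the exact values used in the paper.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")

def yes_no_confidence(prompt: str) -> dict:
    """Score a zero-shot Yes/No question by comparing the logits the
    model assigns to 'Yes' and 'No' as its first generated token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    # T5 decoding starts from decoder_start_token_id (the pad token).
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=start).logits[0, -1]
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=0)
    return {"Yes": probs[0].item(), "No": probs[1].item()}

# Keep an answer only when the model is sufficiently confident.
scores = yes_no_confidence(
    "Does the following paragraph describe a new study? Answer Yes or No.\n"
    "<paragraph text>"
)
answer = max(scores, key=scores.get)
is_confident = max(scores.values()) >= 0.9  # illustrative threshold
```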

 

I don't deny that this article has done a lot of work, but its innovation is poor.

Response: We appreciate the reviewer's feedback. While we acknowledge that there is room for further innovation to enhance our method, we believe that our contributions, including the introduction of new datasets in the field of education and the development of a dialogic approach for extracting the number of participants from studies, constitute significant advancements. We consider these contributions as essential foundations for future research seeking to extract similar information.

Reviewer 2 Report

The paper presents an automated method for identifying large-scale studies. The paper is well-written and structured and the findings are very interesting. Some minor edits for improving the paper are listed below:

- The captions of Figures 1(a) and 1(b) should be distinct.

- The term NLP should be defined.

- The pseudo-code of the proposed method should be included.

- Since you provide the confusion matrices, you should also provide the formula for the F1 score.

 

Author Response

The paper presents an automated method for identifying large-scale studies. The paper is well-written and structured and the findings are very interesting. Some minor edits for improving the paper are listed below:

Response: We are extremely grateful for your warm appreciation.

 

- The captions of Figures 1(a) and 1(b) should be distinct.

Response: We thank the reviewer for the observation. While the captions differed only by a single word, we realized that the text was not properly displayed, being cut in half horizontally. We have now rectified this issue.

 

- The term NLP should be defined.

Response: We thank the reviewer for the thorough reading. We have now incorporated the definition of NLP upon its initial mention in our text.

 

- The pseudo-code of the proposed method should be included.

Response: We appreciate the suggestion provided by the reviewer. Since our application's source code is available on GitHub, we focused instead on a clear visual representation of our methodology and included a comprehensive flow chart in Figure 2 to enhance the understanding of the process. Pseudo-code would have been too specific and would not have introduced relevant insights beyond the existing narrative.

 

- Since you provide the confusion matrices, you should also provide the formula for the F1 score.

Response: We thank the reviewer for the suggestion. We have incorporated the F1 score formula into the footnotes and captions of our tables to enhance the clarity of the provided information.
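For reference, with TP, FP, and FN denoting the true positives, false positives, and false negatives of a confusion matrix, these are the standard formulas:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```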

Reviewer 3 Report

 

The main comments that I would suggest to improve the article:

·        Clarity and Motivation: The article could benefit from a clearer statement of the problem being addressed. While it's mentioned that the exponential growth of scientific publications increases the effort to identify relevant articles, it would be helpful to provide more context on why this is a critical issue in the field. Highlighting the significance of efficient article identification and the challenges posed by low or medium-scaled studies could enhance the motivation for the proposed automated method.

·        Methodology Explanation: The article briefly mentions the use of a FLAN-T5 language model and a dialogic extensible approach for identifying large-scale studies. However, the technical details of how these methods work and how they are applied in the context of this research are missing. I might recommend expanding on these aspects to provide readers with a clear understanding of the methodology employed, including how the targeted questions are designed and how the model's responses are processed.

·        Performance Evaluation: The paper mentions the achieved F1 scores for the proposed model. However, I might suggest providing more information about the precision, recall, and other relevant evaluation metrics to provide a comprehensive assessment of the model's performance. Additionally, discussing the limitations or challenges faced during the evaluation process would enhance the transparency of the study.

·        Comparative Analysis: While it's mentioned that the proposed model's F1 score surpasses previous analyses, it would be beneficial to include a brief discussion on the differences between the proposed method and these previous analyses. Highlighting the advantages and improvements of the current approach compared to existing methods can strengthen the argument for the effectiveness of the proposed method.

·        Application and Implications: The paper briefly discusses the application of the model to a dataset of ERIC publications in the Education domain, revealing trends over the years. However, I might suggest expanding on the practical implications of these observed trends. How can the insights gained from this analysis inform future research directions or decision-making in the Education domain? Providing more context and potential implications would enhance the relevance of the study's findings.

·        In the Experiments with Prompting LLMs section: while the section provides details about the preprocessing steps and the decision to filter out certain paragraphs, there could be a clearer explanation of the rationale behind these choices. For instance, the justification for excluding paragraphs from the Introduction, Conclusions, and References sections should be elaborated further. What evidence or reasoning supports the assumption that numerical data will not be found in these sections? Providing this context would help readers understand the methodology's foundation and its implications for the overall approach.

·        The explanation of how the model's confidence is utilized to make decisions is intriguing. However, it could benefit from more detailed explanation and examples. I might suggest providing some concrete examples of paragraphs, along with their associated logits and probability calculations, to illustrate how the threshold for model confidence was established. This will help readers grasp the mechanics of this decision-making process and how the model's behavior was calibrated.

·        The section also provides insight into the approach taken to compute F1 scores and establish the threshold for confidence ratios. However, it might be helpful to include a concise summary of the experimental results, particularly the F1 scores achieved for different ratios and questions. Additionally, consider providing a clear statement about how these results informed the final approach and choices made for the automated method. This would strengthen the link between the experimentation and the proposed method.

 

By addressing these points, the paper could provide a clearer picture of the research problem, methodology, results, and the broader implications of the proposed automated method for identifying large-scale studies in scientific publications.

 

Minor editing of English language required

Author Response

The main comments that I would suggest to improve the article:

  •       Clarity and Motivation: The article could benefit from a clearer statement of the problem being addressed. While it's mentioned that the exponential growth of scientific publications increases the effort to identify relevant articles, it would be helpful to provide more context on why this is a critical issue in the field. Highlighting the significance of efficient article identification and the challenges posed by low or medium-scaled studies could enhance the motivation for the proposed automated method.

Response: We appreciate the reviewer's observation. We acknowledge that our discussion on the importance of sampling was limited. In response, we have expanded upon the significance of proper sampling techniques in the Introduction and Discussion sections, aiming to provide a more comprehensive understanding of its role in general research.

 

  •       Methodology Explanation: The article briefly mentions the use of a FLAN-T5 language model and a dialogic extensible approach for identifying large-scale studies. However, the technical details of how these methods work and how they are applied in the context of this research are missing. I might recommend expanding on these aspects to provide readers with a clear understanding of the methodology employed, including how the targeted questions are designed and how the model's responses are processed.

Response: We express our gratitude to the reviewer for providing constructive feedback. In response, we have expanded the description of the methodology employed for interacting with the neural network in the second paragraph of Section 2.2. We then included more details regarding how the model's confidence is obtained and utilized to make decisions. We explained how the probabilities associated with "Yes" or "No" answers were computed when the model was tasked with determining whether a paragraph describes a new study or not.
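For illustration, denoting by z_Yes and z_No the logits the model assigns to the two answer tokens, the probabilities follow from the standard softmax normalization restricted to these two options:

```latex
P(\text{Yes}) = \frac{e^{z_{\text{Yes}}}}{e^{z_{\text{Yes}}} + e^{z_{\text{No}}}},
\qquad
P(\text{No}) = 1 - P(\text{Yes})
```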

 

  •       Performance Evaluation: The paper mentions the achieved F1 scores for the proposed model. However, I might suggest providing more information about the precision, recall, and other relevant evaluation metrics to provide a comprehensive assessment of the model's performance. Additionally, discussing the limitations or challenges faced during the evaluation process would enhance the transparency of the study.

Response: We appreciate the reviewer's observation. We acknowledge that Precision and Recall scores would indeed provide valuable insight into our results; as such, we have included them in Tables 1-3.

 

  •       Comparative Analysis: While it's mentioned that the proposed model's F1 score surpasses previous analyses, it would be beneficial to include a brief discussion on the differences between the proposed method and these previous analyses. Highlighting the advantages and improvements of the current approach compared to existing methods can strengthen the argument for the effectiveness of the proposed method.

Response: We thank the reviewer for the observation. We had previously conducted a comparative analysis of the advantages and disadvantages of our current method versus the previous heuristics-based approach in subsection 4.2. We have now updated the section title from "Limitations" to "Disparities between the current method and the previous heuristics approach" to better encapsulate the context and significance of the presented content.

 

  •       Application and Implications: The paper briefly discusses the application of the model to a dataset of ERIC publications in the Education domain, revealing trends over the years. However, I might suggest expanding on the practical implications of these observed trends. How can the insights gained from this analysis inform future research directions or decision-making in the Education domain? Providing more context and potential implications would enhance the relevance of the study's findings.

Response: We thank the reviewer for the observation, and we agree that further elaboration on the implications of our research is needed. As such, we have addressed these implications in Section 4.1 of our manuscript.

 

  •       In the Experiments with Prompting LLMs section: while the section provides details about the preprocessing steps and the decision to filter out certain paragraphs, there could be a clearer explanation of the rationale behind these choices. For instance, the justification for excluding paragraphs from the Introduction, Conclusions, and References sections should be elaborated further. What evidence or reasoning supports the assumption that numerical data will not be found in these sections? Providing this context would help readers understand the methodology's foundation and its implications for the overall approach.

Response: We appreciate the reviewer's remark and agree that our assumption regarding the absence of research-related information within these sections needed further elucidation. Consequently, we have provided additional clarification in the initial paragraph of Section 2.2 to address this concern.

 

  •       The explanation of how the model's confidence is utilized to make decisions is intriguing. However, it could benefit from more detailed explanation and examples. I might suggest providing some concrete examples of paragraphs, along with their associated logits and probability calculations, to illustrate how the threshold for model confidence was established. This will help readers grasp the mechanics of this decision-making process and how the model's behavior was calibrated.

Response: We thank the reviewer for the remark. In response to this feedback, we have enhanced the clarity of our method by providing a more comprehensive explanation in the initial paragraphs of the "Experiments with Prompting LLMs" section. We had previously incorporated examples into Table 7, specifically within the "results" column. For the sake of clarity, we opted to exclude the generated logits from these examples, as these values are purely mathematical in nature and primarily relevant for the SparseCategoricalCrossentropy loss function.

 

  •       The section also provides insight into the approach taken to compute F1 scores and establish the threshold for confidence ratios. However, it might be helpful to include a concise summary of the experimental results, particularly the F1 scores achieved for different ratios and questions. Additionally, consider providing a clear statement about how these results informed the final approach and choices made for the automated method. This would strengthen the link between the experimentation and the proposed method.

Response: We thank the reviewer for the observation, and we agree. We have now provided an explanation of the F1 scores achieved for different ratios and probabilities (as depicted in Figure 1) in the final paragraph of Section 2.2.
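As a generic sketch of how such a confidence threshold can be calibrated against F1 (not the exact procedure from the paper; `yes_probs` and `labels` are hypothetical arrays holding per-paragraph "Yes" probabilities and gold annotations):

```python
import numpy as np
from sklearn.metrics import f1_score

def calibrate_threshold(yes_probs: np.ndarray, labels: np.ndarray):
    """Sweep candidate confidence thresholds and keep the one that
    maximizes F1 on a labeled validation set."""
    best_t, best_f1 = 0.5, 0.0
    for t in np.linspace(0.5, 0.99, 50):
        preds = (yes_probs >= t).astype(int)  # binary decision per paragraph
        score = f1_score(labels, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1
```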

 

By addressing these points, the paper could provide a clearer picture of the research problem, methodology, results, and the broader implications of the proposed automated method for identifying large-scale studies in scientific publications.

Response: Thank you again, and we hope that our adjustments address all these points.

Round 2

Reviewer 1 Report

The paper introduced three new datasets and defined a dialogic approach for extracting the number of participants from a given study by prompting a pretrained open-source Large Language Model (LLM). The work can help researchers learn the population of a given study. The revised paper is better than the first version. I suggest putting Sections 1.1, 1.2, and 1.3 into a new section named "Related Work" after the "Introduction" section.

Author Response

Thank you kindly for your suggestion. To improve the overall structure of our article, we have introduced a new section titled "Related Work." This section follows the Introduction and encompasses the content previously found in Sections 1.2 and 1.3. We merged subsection 1.1 into the Introduction.
