Article
Peer-Review Record

Analyzing Higher Education Students’ Prompting Techniques and Their Impact on ChatGPT’s Performance: An Exploratory Study in Spanish

Appl. Sci. 2025, 15(14), 7651; https://doi.org/10.3390/app15147651
by José Luis Carrasco-Sáez 1, Carolina Contreras-Saavedra 2, Sheny San-Martín-Quiroga 1, Carla E. Contreras-Saavedra 3 and Rhoddy Viveros-Muñoz 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 28 May 2025 / Revised: 2 July 2025 / Accepted: 4 July 2025 / Published: 8 July 2025
(This article belongs to the Special Issue Techniques and Applications of Natural Language Processing)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This exploratory study investigates how 102 higher education students in Chile construct prompts in Spanish and how different prompting techniques affect the quality of responses generated by ChatGPT (version 3.5). The authors categorize prompt techniques into "circumstantial" and "request-based" and identify two emergent strategies: the Guide Contextualization Strategy (GCS) and the Specific Purpose Strategy (SPS). The study proposes a methodological framework for evaluating AI interactions in educational settings and highlights implications for AI literacy development in non-English-speaking contexts.

 

The study requires some improvements before being accepted for publication; suggestions include the following:

  • State your research questions early in the introduction and the motivation behind this study.
  • Strengthen the theoretical grounding by connecting the framework to AI literacy theory, digital competence frameworks, or learning sciences models.
  • Engage with the latest scholarly work in the field, such as:
    • Al-kfairy, M., 2024. Factors impacting the adoption and acceptance of ChatGPT in educational settings: A narrative review of empirical studies. Applied System Innovation, 7(6), p.110.
    • Fagbohun, O., Harrison, R.M. and Dereventsov, A., 2024. An empirical categorization of prompting techniques for large language models: A practitioner's guide. arXiv preprint arXiv:2402.14837.
    • and many others
  • Expand the discussion of linguistic features (e.g., verb mood, sentence structure, idiomatic differences) and their potential influence on LLM outputs.
  • Integrate this into the discussion section as it is key to justifying the study’s novelty and context.
  • Provide clearer details about the development and validation of the 21-item binary checklist.
  • Include reliability statistics (e.g., Cohen’s Kappa).
  • Justify use of Rasch modeling beyond citing difficulty—why was this model theoretically appropriate?
  • Explicitly acknowledge this limitation and avoid broad generalizations about higher education or “Spanish-speaking learners” without supporting cross-cultural data.
  • Use more cautious language throughout the results and discussion (e.g., “associated with” instead of “influences”).
  • Clarify that findings are preliminary and hypothesis-generating, not confirmatory.
  • Discuss pedagogical implications for training novice users in AI interaction.
  • Rephrase conclusions with appropriate caution. Emphasize the need for replication and scaling in diverse educational contexts.

 

Good Luck

Author Response

Comment 1: State your research questions early in the introduction and the motivation behind this study.

Response 1:  Thank you for your helpful observation. We have now revised the introduction to explicitly state the research questions and clarify the motivation behind the study. The revised paragraph is located at the end of the introduction and outlines the core questions guiding our work, thereby strengthening the theoretical and methodological coherence of the manuscript (lines 113 to 127).

Comment 2: Strengthen the theoretical grounding by connecting the framework to AI literacy theory, digital competence frameworks, or learning sciences models.

Response 2: Thank you for your helpful suggestion. We have revised the manuscript to integrate theoretical connections. In the theoretical framework, we added references to AI literacy theory (Ng, 2021) and the Digital Competence Framework for Citizens (DigComp 2.2) to contextualize the relevance of digital content creation and problem solving in AI-mediated learning, and operational dimensions from DigComp 2.2, showing how the framework aligns with established educational theories (lines 321 to 338).

Comment 3: Engage with the latest scholarly work in the field, such as: Al-kfairy, M., 2024. Factors impacting the adoption and acceptance of ChatGPT in educational settings: A narrative review of empirical studies. Applied System Innovation, 7(6), p.110. Fagbohun, O., Harrison, R.M. and Dereventsov, A., 2024. An empirical categorization of prompting techniques for large language models: A practitioner's guide. arXiv preprint arXiv:2402.14837. And many others.

Response 3: Thank you for your valuable suggestion. We have incorporated the references you recommended—Al-kfairy (2024) and Fagbohun et al. (2024)—into the revised version of the manuscript. These works were used to enrich the introduction, reinforce the theoretical framework, and strengthen the discussion section. Additionally, we included other recent and relevant studies to ensure that the manuscript reflects the current state of research in the field of prompt engineering and the adoption of generative AI in educational contexts (lines 51-58; 92; 147-153; 162-167; 176-188; 193-199; 257-267; 321-338; 777-793).

Comment 4: Expand the discussion of linguistic features (e.g., verb mood, sentence structure, idiomatic differences) and their potential influence on LLM outputs. Integrate this into the discussion section as it is key to justifying the study’s novelty and context.

Response 4:  Thank you for this thoughtful suggestion. We agree that linguistic features such as verb mood and sentence structure can significantly influence how LLMs generate responses. However, to maintain a precise focus, we do not expand on this dimension in the present manuscript. This is also because we recently published a study in which we specifically analyzed the grammatical and syntactical structure of prompts and their impact on LLMs results in Spanish contexts: Viveros-Muñoz, R., Carrasco-Sáez, J., Contreras-Saavedra, C., San-Martín-Quiroga, S., & Contreras-Saavedra, C. E. (2025). Does the Grammatical Structure of Prompts Influence the Responses of Generative Artificial Intelligence? An Exploratory Analysis in Spanish. Applied Sciences, 15(7), 3882. This prior work complements the current study by offering an in-depth empirical and statistical analysis of how linguistic variables—such as verbal mood combinations and sentence complexity—affect the perceived quality of AI responses. We have added a reference to this study in the Discussion section to clarify this point and to acknowledge its relevance.

Comment 5: Provide clearer details about the development and validation of the 21-item binary checklist.

Response 5: Thank you for this insightful observation. In response, we have revised the manuscript to include a clearer and more detailed explanation of how the 21-item binary checklist was developed. Specifically, we now describe how the checklist emerged from a structured, three-phase process: (i) an exploratory review of the literature on prompt engineering; (ii) a functional analysis of the techniques’ communicative roles; and (iii) the classification of 21 techniques into two categories—circumstantial and request-based. These revisions are now included at the end of the “Related Work” section, where we explain how the framework was theoretically grounded before its application in the study (lines 294 to 309).

Comment 6: Include reliability statistics (e.g., Cohen’s Kappa). 

Response 6: Thank you for your comment. The quality of the responses was evaluated using the instrument cited in Viveros-Muñoz et al., which applied expert judgment assessment through Aiken’s V coefficient. This analysis confirmed the relevance, appropriateness, and clarity of the items, thus supporting the instrument’s content validity. Viveros-Muñoz, R., et al., “Does the grammatical structure of prompts influence the responses of generative artificial intelligence? An exploratory analysis in Spanish,” Appl. Sci., vol. 15, no. 7, p. 3882, 2025. Regarding reliability, the model of techniques reported a McDonald’s ω of 0.780 and a Cronbach’s alpha of 0.772. Reliability metrics are reported as ω in the model flow diagram between lines 407 and 408 (see Figure 2. Exploratory Analysis Flow).
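
As an illustration only: an internal-consistency coefficient of the kind reported above can be computed directly from an item-score matrix. The sketch below computes Cronbach's alpha in Python on a toy binary matrix; the values are illustrative and are not the study's data.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy binary response matrix: 5 respondents x 4 dichotomous items.
scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
])
alpha = cronbach_alpha(scores)
```

Note that McDonald's ω, the coefficient reported in the manuscript, is typically estimated from a factor model (usually with dedicated psychometric software) rather than from this closed-form expression.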

Comment 7: Justify use of Rasch modeling beyond citing difficulty—why was this model theoretically appropriate?

Response 7:  Thank you for your comment. The justification has been further reinforced in lines 576 to 581. To achieve Objective 3, which aimed to characterize both simple and complex prompting strategies, the responses were first coded dichotomously based on the presence or absence of correct strategy use. Subsequently, the Rasch model was selected and applied for the analysis, as it allows for the estimation of item difficulty by transforming categorical (dichotomous) responses into an interval-level logit scale. This enables an objective comparison of item difficulty across the set.
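
As a minimal, hypothetical sketch of the transformation described above (a full Rasch fit would estimate item difficulties and person abilities jointly, e.g., via conditional or joint maximum likelihood), the interval-level logit scale for item difficulty can be illustrated as:

```python
import numpy as np

# Toy dichotomous matrix: rows = students, columns = strategy items
# (1 = correct use of the strategy). Illustrative values only.
responses = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
])

# First-pass difficulty estimate: the logit of the failure rate per item.
# Higher logits mean harder items, on an interval-level scale.
p_correct = responses.mean(axis=0)
difficulty_logits = np.log((1 - p_correct) / p_correct)
```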

Comment 8: Explicitly acknowledge this limitation and avoid broad generalizations about higher education or “Spanish-speaking learners” without supporting cross-cultural data.

Response 8:  We thank the reviewer for highlighting the importance of avoiding overgeneralizations. We have now explicitly acknowledged this limitation in the manuscript by stating that our findings are context-specific and should not be generalized to all higher education or Spanish-speaking learners without cross-cultural validation. We also point to future research directions that aim to explore the framework’s applicability in broader educational and cultural settings (lines 784-797).

Comment 9: Use more cautious language throughout the results and discussion (e.g., “associated with” instead of “influences”).

Response 9: Thank you for your observation. Following your suggestion, we have revised the language in the results and discussion sections to adopt a more cautious tone. Specifically, in line 761, we replaced the term “influences” with “can be associated with,” in order to avoid implying causality and to better reflect the correlational nature of our findings.

Comment 10: Clarify that findings are preliminary and hypothesis-generating, not confirmatory.

Response 10: Thank you for your valuable comment. We agree with the importance of clarifying the exploratory nature of our research. In the revised manuscript, we have emphasized that the findings are preliminary and hypothesis-generating, not confirmatory. We have added an explicit statement at the end of the Discussion section to ensure that readers clearly understand the scope and limitations of the study’s conclusion (lines 784-797; 848-854).

Comment 11: Discuss pedagogical implications for training novice users in AI interaction.

Response 11: We thank the reviewer for the valuable suggestion. In response, we have added a paragraph to the Discussion section that considers the potential pedagogical implications of our findings. While we acknowledge that our study did not directly assess instructional interventions, we propose that the framework developed here could inform future training programs aimed at equipping novice users with the necessary skills for effective and ethical interaction with GenAI systems (lines 784-793).

Comment 12: Rephrase conclusions with appropriate caution. Emphasize the need for replication and scaling in diverse educational contexts.

Response 12: Thank you for your valuable suggestion. We have carefully reviewed the conclusion section and revised the final paragraph to include a clearer emphasis on the need for replication and scaling in diverse educational contexts. The updated text now highlights the importance of validating our findings through future studies conducted across broader and more varied academic settings. (lines 835 to 838)

 

 

Reviewer 2 Report

Comments and Suggestions for Authors

Abbreviations should be avoided in the abstract. Please define abbreviations in the Introduction and in later sections.

The Introduction and the initial paragraphs of Section 2 are sparse in references, even where they are crucial (for example, for lines 120 to 129). Additional references must be added where the authors state some “strong” opinions and conclusions.

Line 324 and line 84: authors should select between American and British English throughout the text.

Line 84, “Although several studies have analyzed and classified”… which studies? Authors should add additional references on this page.

Section 2. Authors talk about prompt engineering. It would be interesting to include language issues in this matter, because I believe that language nuances will influence prompt engineering.

Table 01 (line 228) has a different name (Table 1, line 239); the same applies to Figure 01 and Figure 1. Please check this throughout the document.

Line 261 “representing fields such as rehabilitation and computer science, thereby ensuring disciplinary diversity”. This sentence needs extra care. How many different courses? How many observations in each course?

Line 332, “tetrachoric correlation”: why this measure instead of more common correlations/associations for binary and categorical data?

Line 335. I don’t understand why it is surprising. Clearly, the correlation between 11 and 12 must be negative!

Line 351, “Conversely, indicator 14 exhibited a strong negative correlation”. This is also clear in Table 04. In this case the motives are not clear (to me). Please elaborate.

Lines 374 to 380. The results are generally OK, but the explained variance is only near 50%, so the models fail to adequately explain the data variability. Please explain.

Page 397. “Figure 2.” Also, increase figure quality

Linear Regression Analysis. It is not correct to perform linear regression using a response variable that is categorical. Authors should use other methods. Besides, with an r² = 0.209 and below, the proposed models are terrible. For example, use contingency tables! Another idea is to perform cluster analysis.

Discussion: line 543 to 548. With such low r^2 values, authors cannot say that “These findings indicate that incorporating contextual information into prompts enhances the accuracy and relevance of the generated outputs.”

For lines 555 to 558 extra information must be provided

For lines 569 and 570, “demonstrated a significant association with response quality, F(1, 67) = 8.58, p = .005, R2 = 0.114”. Again, r² is irrelevant even if the p-value is below 0.05.

Authors indicate references in a very (at least for me) awkward manner; for example, journal names are set between asterisks, for example *Int. Med. Educ.*, in line 747. Also, some publications have abbreviated names (like the previous one) while others have the full name, for example in line 753, *Annals of Biomedical Engineering*.

So, the reference section MUST be checked and expanded before publication.

Author Response

Comment 1: Abbreviations should be avoided in the abstract. Please define abbreviations in the Introduction and in later sections.

Response 1: Thank you for your observation. In response, all abbreviations have been removed from the abstract to ensure clarity and accessibility for a broader audience. Definitions of key abbreviations, such as LLM (Large Language Models) and GenAI (Generative Artificial Intelligence), have been incorporated into the Introduction section, as recommended.

Comment 2: Introduction and the initial paragraphs from section 2 are scarce in references, even when they are crucial (for example for lines 120 to 129). Additional references must be added when the authors indicate some “strong” opinions and conclusions.

Response 2: Thank you for your insightful comment. Following your suggestion, we have added several references to strengthen the Introduction, the related work, and the Discussion. These additions support key statements and provide a more robust foundation for the arguments presented (lines 51-58; 92).

Comment 3: Line 324 and line 84: authors should select between American English or British English throughout the text.

Response 3: Thank you for pointing this out. We have thoroughly reviewed the manuscript to ensure consistency in language usage. The entire text now adheres to American English spelling and conventions. For example, terms such as "analyzed" and "behavior" have been consistently used throughout. We appreciate your attention to this detail.

Comment 4: Line 84 “Although several studies have analyzed and classified”… which studies? Authors should add additional references in this page.

Response 4: Thank you for the observation. We have added additional references in support of the statement on line 92, specifically references [2–7], which provide relevant background on the classification and analysis of prompt techniques. These additions strengthen the theoretical foundation of the study and clarify the scholarly basis of our claim. 

Comment 5: Section 2. Authors talk about prompt engineering. It would be interesting to include language issues in this matter, because I believe that language nuances will influence prompt engineering.

Response 5: Thank you for this relevant observation. We fully agree that language nuances play a key role in prompt engineering, especially in non-English contexts where grammatical structures differ significantly from English. In this manuscript, we focus on the functional categorization of prompting techniques as a pedagogical and analytical framework. However, the linguistic dimensions of prompts—such as grammar, syntax, and sentence structure—are extensively addressed in a related study recently published by our team: Viveros-Muñoz, R., Carrasco-Sáez, J., Contreras-Saavedra, C., San-Martín-Quiroga, S., & Contreras-Saavedra, C. E. (2025). Does the Grammatical Structure of Prompts Influence the Responses of Generative Artificial Intelligence? An Exploratory Analysis in Spanish. Applied Sciences, 15(7), 3882. We have now added a reference to this study in Section 2 to acknowledge the importance of linguistic aspects and to direct interested readers to a more in-depth treatment of this topic (lines 317 to 320).

Comment 6: Table 01 (line 228) has a different name (Table 1 line 239); also Figure 01 and Figure 1. Please check it throughout the document.

Response 6: Thank you for your observation. We have carefully reviewed all tables and figures in the manuscript and have standardized their numbering format by removing the leading zero. All instances now follow the consistent format "Table 1", "Figure 1", etc.

Comment 7: Line 261 “representing fields such as rehabilitation and computer science, thereby ensuring disciplinary diversity”. This sentence needs extra care. How many different courses? How many observations in each course?

Response 7: Thank you for your comment. We have revised the sentence to clarify that the sample included students from two undergraduate programs—rehabilitation (30%) and computer science (70%)—at two universities. The analysis was performed globally on the entire sample and not disaggregated by academic program. The reference to disciplinary backgrounds was intended only to provide contextual information about the participants, not to imply a comprehensive diversity of fields. This clarification has now been incorporated into the revised manuscript. (lines 351 to 355)

Comment 8: Line 332 “tetrachoric correlation”. Why this measure instead of more common correlations /associations for binary and categorical data?

Response 8: Thank you for your comment. Tetrachoric correlation was used, as it is the appropriate statistical method for the type of data involved. This choice is based on the fact that the variables presented in Objective 2 are dichotomous, with the exception of the variable “response quality,” which was transformed into a binary equivalent for the purpose of this analysis. To compute the correlation, the original six-point scoring scale for this variable (1: completely incorrect – 6: correct) was recoded into a two-point binary scale (correct – incorrect), with a clear distinction established between correct responses (scores ≥ 4) and incorrect responses (scores ≤ 3).
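
The recoding described above, paired with one common closed-form estimate of the tetrachoric correlation (Pearson's cosine-pi approximation; the study may well have used a maximum-likelihood estimator instead), can be sketched as follows, with illustrative values only:

```python
import math

# Recode the six-point quality scale (1-6) into a binary variable:
# scores >= 4 -> correct (1), scores <= 3 -> incorrect (0).
raw_scores = [2, 5, 6, 3, 4, 1, 5, 2]          # illustrative values
quality_bin = [1 if s >= 4 else 0 for s in raw_scores]

def tetrachoric_cos_pi(a: int, b: int, c: int, d: int) -> float:
    """Pearson's cosine-pi approximation to the tetrachoric correlation
    for a 2x2 table [[a, b], [c, d]] of two dichotomized variables."""
    return math.cos(math.pi / (1 + math.sqrt((a * d) / (b * c))))

# Toy 2x2 cross-tabulation of two binary indicators (illustrative counts).
r_t = tetrachoric_cos_pi(a=30, b=10, c=10, d=30)
```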

Comment 9: Line 335. I don’t understand why it is surprising. Clearly, the correlation between 11 and 12 must be negative!

Response 9: Thank you for your comment. The following interpretation of the correlation between items 11 and 12 has been added: A moderate negative tetrachoric correlation (rt = –0.518) was observed between the use of the Direct Request technique and the Mixed or Multi-task Request technique. This indicates that as the presence of one of these strategies increases, the presence of the other tends to decrease. In other words, students who frequently used direct prompts tended not to employ complex or multi-tasking formulations simultaneously, suggesting a possible distinction in their approach to interacting with the model (lines 433-438).

Comment 10: Line 351 “Conversely, indicator 14 exhibited a strong negative correlation”. This also clear in table 04. In this case it is not clear (for me) the motives. Please elaborate

Response 10: Thank you for your comment. The following interpretation regarding the correlation of RBT14 has been added: The RBT14 shows a negative correlation with the other techniques. This inverse association indicates that when an open-ended question is formulated (e.g., “What does it imply...?”), it is less likely that the prompt will include elements intended to frame, constrain, or structure the response. In other words, the use of open-ended questions tends to be accompanied by fewer contextual or structural cues, suggesting a preference for unstructured interaction styles when employing exploratory formulations (lines 449-454).

Comment 11: Lines 374 to 380. The results are generally OK, but the explained variance is only near 50%, so the models fail to adequately explain the data variability. Please explain.

Response 11:  We appreciate the comment. The percentage of explained variance is complementary to fit indices, but not definitive. The literature recommends prioritizing fit indices over other traditional methods (Lloret-Segura et al., 2014). Lloret-Segura, S., Ferreres-Traver, A., Hernández-Baeza, A., & Tomás-Marco, I. (2014). El análisis factorial exploratorio de los ítems: una guía práctica, revisada y actualizada. Anales de Psicología / Annals of Psychology, 30(3), 1151–1169. https://doi.org/10.6018/analesps.30.3.199361

Comment 12: Page 397. “Figure 2.” Also, increase figure quality.

Response 12: Thank you for your observation. We have reviewed and improved the quality of all figures, to ensure better clarity and resolution throughout the manuscript.

Comment 13: Linear Regression Analysis. It is not correct to perform linear regression using a response variable that is categorical. Authors should use other methods. Besides, with an r² = 0.209 and below, the proposed models are terrible. For example, use contingency tables! Another idea is to perform cluster analysis.

Response 13: We appreciate the comment. The reviewer’s concern about the nature of the variable is understandable, as the literature continues to debate the appropriate data types for statistical analyses. In this case, the scoring used is not categorical but considered continuous, as the literature accepts that scales with five or more points can be treated as continuous (Norman, 2010). Regarding the R² value, it falls within acceptable thresholds, as defined by Cohen (2013), who established benchmarks of 0.02, 0.15, and 0.35. REFERENCES: Norman, G. Likert scales, levels of measurement and the “laws” of statistics. Adv in Health Sci Educ 15, 625–632 (2010). https://doi.org/10.1007/s10459-010-9222-y; Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Routledge.
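
As a hypothetical sketch of this modeling choice, assuming toy data (not the study's): ordinary least squares is fit with a six-point outcome treated as continuous, and R² is computed for interpretation against effect-size benchmarks.

```python
import numpy as np

# Toy data: predictor = number of context cues in a prompt;
# outcome = six-point quality score treated as continuous (Norman, 2010).
x = np.array([0, 1, 1, 2, 3, 3, 4, 5], dtype=float)
y = np.array([2, 3, 2, 4, 4, 5, 5, 6], dtype=float)

# Simple linear regression via least squares.
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

# Coefficient of determination.
r_squared = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
```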

Comment 14: Discussion: line 543 to 548. With such low r^2 values, authors cannot say that “These findings indicate that incorporating contextual information into prompts enhances the accuracy and relevance of the generated outputs.”

Response 14: Thank you very much for the comment. Regarding the R² value, it falls within acceptable thresholds, as established by Cohen (2013), who proposed benchmarks of 0.02, 0.15, and 0.35. REFERENCE: Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Routledge.

Comment 15: For lines 555 to 558 extra information must be provided.

Response 15: Thank you for your observation. We have revised and expanded the section corresponding to lines 555–558 to provide a clearer interpretation of the results. Specifically, we now explain in greater detail the potential reasons behind the strong negative correlations between RBT14 (Open-Ended Question Request) and the indicators CT1 and CT2, discussing how this technique might limit students' use of clearly contextualized prompt strategies. (lines 648 to 654).

Comment 16: For lines 569 and 570, “demonstrated a significant association with response quality, F(1, 67) = 8.58, p = .005, R2 = 0.114”. Again, r² is irrelevant even if the p-value is below 0.05.

Response 16:  Thank you very much for the comment. Reinforcing the previous responses, it is worth noting that the R² value of 0.114 falls within the range of a moderate effect size. Regarding the interpretation: the p-value is excellent (< 0.05), indicating that the regression model is statistically significant.

Comment 17: Authors indicate references in a very (at least for me) awkward manner; for example, journals name is between **, for example *Int. Med. Educ.*, in line 747. Also, some publications have abbreviations in the name (like the previous one) while others have the full name, for example in line 753 *Annals of Biomedical 753 Engineering*. So, reference section MUST be checked and increased before publication.

Response 17: Thank you for your comment regarding the formatting of the references. We have thoroughly reviewed the entire reference section and standardized it according to the IEEE style. In particular, we have made the following adjustments: Removed inconsistent formatting, such as the use of asterisks (*) around journal or conference names. Unified the naming conventions, avoiding inconsistencies between abbreviated and full journal titles. Ensured the inclusion of all required elements in each reference entry according to IEEE guidelines: authors, title of the article or chapter, name of the journal or conference (italicized), volume, issue, page numbers, year, and, where applicable, DOI or online link.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Happy to accept, the authors handled all my concerns

Author Response

Comment 1: Happy to accept, the authors handled all my concerns

Response 1: We sincerely thank you for your positive feedback and appreciation of our efforts to address all concerns. We are grateful for the constructive comments that helped improve the clarity and rigor of our manuscript.

Reviewer 2 Report

Comments and Suggestions for Authors

The paper quality has increased due to the authors’ effort!

In what concerns comments 13 to 14, the authors provided a fine reply, even if I disagree with that reply (for me an r² below 0.5 is irrelevant).

So, in order for the article to be published, the authors must introduce the justifications provided in responses 13 and 14 into the text, together with all the references provided (not just Cohen). The references are not from statisticians, but since they deal with practical applications of statistics, I will accept them, because the paper is also an application work.

 

Comment 13: Linear Regression Analysis. It is not correct to perform linear regression using a response variable that is categorical. Authors should use other methods. Besides, with an r² = 0.209 and below, the proposed models are terrible. For example, use contingency tables! Another idea is to perform cluster analysis.

Response 13: We appreciate the comment. The reviewer’s concern about the nature of the variable is understandable, as the literature continues to debate the appropriate data types for statistical analyses. In this case, the scoring used is not categorical but considered continuous, as the literature accepts that scales with five or more points can be treated as continuous (Norman, 2010). Regarding the R² value, it falls within acceptable thresholds, as defined by Cohen (2013), who established benchmarks of 0.02, 0.15, and 0.35. REFERENCES: Norman, G. Likert scales, levels of measurement and the “laws” of statistics. Adv in Health Sci Educ 15, 625–632 (2010). https://doi.org/10.1007/s10459-010-9222-y; Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Routledge.

Comment 14: Discussion: line 543 to 548. With such low r^2 values, authors cannot say that “These findings indicate that incorporating contextual information into prompts enhances the accuracy and relevance of the generated outputs.”

Response 14: Thank you very much for the comment. Regarding the R² value, it falls within acceptable thresholds, as established by Cohen (2013), who proposed benchmarks of 0.02, 0.15, and 0.35. REFERENCE: Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Routledge.

Author Response

Comment 1: In what concerns comments 13 to 14, the authors provided a fine reply, even if I disagree with that reply (for me an r² below 0.5 is irrelevant). So, in order for the article to be published, the authors must introduce the justifications provided in responses 13 and 14 into the text, together with all the references provided (not just Cohen). The references are not from statisticians, but since they deal with practical applications of statistics, I will accept them, because the paper is also an application work.

Response 1: Thank you for your observation. As requested, we have now incorporated into the manuscript the explanation regarding the nature of the scale used and the interpretation of the R² values, as previously provided in our response. Specifically, we clarified in the Results and Discussion section that, although the dependent variable is measured on a six-point ordinal scale, it can be treated as continuous for the purposes of linear regression, a practice supported in the literature when scales have five or more categories. We have also justified the interpretation of the R² values using Cohen’s conventional thresholds of 0.02 (small), 0.15 (moderate), and 0.35 (large) effect sizes (lines 569 to 579; 652 to 659). All cited references have been added to the reference list (lines 959-960).
