Article

Does the Grammatical Structure of Prompts Influence the Responses of Generative Artificial Intelligence? An Exploratory Analysis in Spanish

by Rhoddy Viveros-Muñoz 1, José Carrasco-Sáez 1, Carolina Contreras-Saavedra 2, Sheny San-Martín-Quiroga 1 and Carla E. Contreras-Saavedra 3,*
1 Departamento de Electrónica e Informática, Universidad Técnica Federico Santa María, Concepción 4030000, Chile
2 Facultad de Educación, Universidad Católica de la Santísima Concepción, Campus San Andrés, Concepción 4070409, Chile
3 Facultad de Ciencias de la Rehabilitación y Calidad de Vida, Universidad San Sebastián, Concepción 4081339, Chile
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 3882; https://doi.org/10.3390/app15073882
Submission received: 7 March 2025 / Revised: 24 March 2025 / Accepted: 27 March 2025 / Published: 2 April 2025
(This article belongs to the Special Issue Techniques and Applications of Natural Language Processing)

Abstract

Generative Artificial Intelligence (AI) has transformed personal and professional domains by enabling creative content generation and problem-solving. However, the influence of users’ grammatical abilities on AI-generated responses remains unclear. This exploratory study examines how language and grammar abilities in Spanish affect the quality of responses from ChatGPT (version 3.5). Despite the robust performance of Large Language Models (LLMs) in various tasks, they face challenges with grammatical moods specific to non-English languages, such as the subjunctive in Spanish. Higher education students were chosen as participants due to their familiarity with AI and its potential use in learning. The study assessed ChatGPT’s ability to process instructions in Chilean Spanish, analyzing how linguistic complexity, grammatical variations, and informal language impacted output quality. The results indicate that varied verbal moods and complex sentence structures significantly influence prompt evaluation, response quality, and response length. Based on these findings, a framework is proposed to guide higher education communities in promoting digital literacy and integrating AI into teaching and learning.

1. Introduction

Generative Artificial Intelligence (AI) is increasingly impacting various aspects of daily life, enabling tasks such as content creation, idea generation, and query resolution. Among the numerous tools available, Large Language Models (LLMs) like ChatGPT [1], Gemini [2], Copilot [3], Perplexity [4], Claude [5], and LLaVA [6] have gained significant attention due to their ability to process and generate natural language. These models, developed within the field of Natural Language Processing (NLP), are trained on extensive datasets to perform tasks such as understanding, generating, and analyzing human language.
One notable feature of LLMs is their robustness to prompt variability, which allows users to interact with them in natural language without requiring programming expertise [7,8]. This capability is closely tied to prompt engineering, a technique focused on designing, structuring, and optimizing prompts to elicit accurate and relevant responses aligned with user intent. Prompt engineering evaluates how LLMs handle variability, ambiguity, and context shifts in prompts [9]. For example, benchmarks like BIG-bench [10] test LLMs’ ability to infer grammatical roles of pseudo-words based on orthographic and phonetic patterns, while studies like Singh et al. (2024) [8] assess robustness to linguistic errors using corrupted datasets.
Despite these advances, existing research has predominantly addressed robustness to syntactic variability and ambiguity, particularly in chatbot interactions where advanced syntactic analysis is applied to enhance accuracy and contextual understanding [11]. Additionally, some studies have investigated robustness to grammatical and orthographic errors, especially in contexts such as discourse analysis for aphasic speech, where identifying intended lexical targets poses significant challenges [12]. In the domain of sentiment analysis, robustness has been explored from the perspective of non-standard and low-resource linguistic inputs, aiming to improve response consistency even with minimal training data [13]. Together, these studies highlight the challenges that LLMs face when processing linguistic variability and error management, particularly in non-English contexts where morphosyntactic complexity plays a critical role.
While considerable progress has been made in enhancing LLM robustness to linguistic variability and error management, most research has predominantly focused on English and a few major languages. Multilingual AI research often faces the challenge of balancing performance across diverse languages, particularly those with complex grammatical structures and regional variations, such as Spanish. Theoretical frameworks for multilingual NLP highlight the need to address not only lexical diversity but also morphosyntactic variability, as seen in the rich verbal conjugation systems and pragmatic nuances of Chilean Spanish. Addressing these challenges requires a deeper understanding of how grammatical complexity influences AI responses, especially in generative models. This study contributes to filling this gap by exploring the impact of grammatical abilities in Spanish on the quality of LLM-generated outputs, thereby advancing the theoretical foundations of multilingual AI robustness.
However, most existing evaluations of LLMs rely on predefined benchmarks such as the Massive Multitask Language Understanding (MMLU) for multitask accuracy [14], the MATH for mathematical reasoning [15], the HumanEval for coding performance [16], and the Multilingual Grade School Math (MGSM) for multilingual capabilities [17]. These benchmarks focus on task-specific capabilities but overlook the influence of linguistic and grammatical complexity in user-generated prompts, particularly in non-English languages. Given this gap in the literature, we adopted an exploratory approach to investigate how grammatical abilities affect response quality, aiming to identify trends and generate hypotheses rather than to generalize findings. By addressing this underexplored area, the present study contributes to the understanding of human–AI interaction in diverse linguistic contexts.
Non-English languages pose unique challenges for LLMs. For instance, Spanish includes grammatical constructs like the subjunctive mood, which has no direct equivalent in English and plays a critical role in expressing hypothetical or uncertain situations. Despite advances in multilingual training [18], questions remain regarding how well LLMs can interpret prompts written in languages with such complexities. This gap in research highlights the need for studies that explore the interaction between grammatical characteristics in non-English prompts and the quality of LLM responses.
While this study does not focus on the internal workings of LLM architectures, it is essential to consider how linguistic features of Spanish might impact model performance. Unlike English, Spanish exhibits unique grammatical characteristics, including the use of the subjunctive mood, complex verb conjugation patterns, and flexible word order. These structural differences can pose challenges for LLMs, particularly when models are primarily trained on English language data. Recent studies [19,20] have highlighted that models may struggle with languages underrepresented in training datasets, potentially leading to ambiguity or shifts in semantic interpretation. Addressing how these grammatical complexities influence AI-generated responses in Spanish-speaking contexts contributes to advancing the field of multilingual AI and improving prompt design strategies for non-English languages.
This exploratory study addresses this gap by focusing on higher education students in Chile. This demographic was chosen due to their familiarity with technology and their ability to engage effectively with the data collection methodology [21,22,23]. Exploring how these students write prompts in Spanish and how those prompts influence LLM responses provides valuable insights into improving human–machine interaction and enhancing the educational use of generative AI tools.
This manuscript makes several significant contributions to the field of human–AI interaction and generative language models. It addresses a novel and underexplored question regarding the relationship between grammatical ability in Spanish and the quality of responses generated by LLMs, focusing on how linguistic and cultural nuances may influence AI performance. By conducting a single-scenario analysis using the 3.5 free version of ChatGPT within a Spanish-speaking context, the study provides foundational insights into how linguistic characteristics impact LLM outputs, laying the groundwork for future multilingual comparisons. Additionally, the study introduces a protocol for evaluating the grammatical abilities of neurotypical adults, offering a structured methodological approach that ensures data reliability and serves as a reference for similar linguistic assessments. The findings also have significant educational implications, as they highlight key considerations for enhancing the use of generative AI tools in academic settings, guiding students toward effective prompt writing and addressing gaps in linguistic training for human–AI interaction. Finally, the manuscript proposes a comprehensive methodological framework for educational communities aimed at improving students’ grammatical performance, emphasizing four critical aspects of human–machine interaction and providing practical recommendations for integrating AI tools into learning environments.

2. Materials and Methods

Data were collected with a questionnaire, and the analysis followed a descriptive, exploratory study methodology. The details are presented in the following subsections.

2.1. Data Collection

The research sample consisted of 104 higher education students from the Biobío region. The selection criteria followed a convenience sampling approach, where participants were recruited based on the accessibility and availability of the collaborating educational institutions [24]. Given the exploratory nature of this study, convenience sampling was used to gather preliminary insights rather than to achieve generalizability. The sample included students from two areas, computer science and rehabilitation sciences, allowing for a contrasting representation of different areas of knowledge. While this approach may introduce selection bias, it was chosen to capture diverse linguistic perspectives, as differences in language use are relevant when analyzing the influence of grammatical abilities on AI-generated responses.
Although the sample is not probabilistic, it is considered representative within the context of this study (exploratory analysis), as it includes students with different levels of familiarity with the use of Generative Artificial Intelligence. This selection strategy allows us to obtain a broad perspective on the writing strategies used by students, despite the limitations imposed by the non-randomness of the sample.

2.2. Validation of the Evaluation Instrument

The content validity of the evaluation instrument was established through a two-phase expert validation process. In the first phase, three experts—a linguistics specialist, an Artificial Intelligence expert, and a technology-focused evaluation specialist—developed an evaluation matrix comprising theoretical categories, theoretical subcategories, and prompt evaluation questions. In the second phase, four additional experts evaluated the judgment matrix. These evaluators met specific inclusion criteria: they were university academics, held a degree, and worked in the fields of linguistics or speech therapy. The expert judgment process was conducted independently and in a blinded manner. Each expert assessed the AI-generated responses without knowledge of the evaluations made by other experts. This approach minimized potential response bias and ensured that the assessments remained objective and unbiased.
The evaluation criteria established by the judges were threefold: the sufficiency, relevance, and clarity of each evaluation question. To determine the degree of consensus among the judges, Aiken’s V [25,26] was employed to evaluate the representativeness of the questions with respect to the construct under examination.
The obtained Aiken’s V values ranged from 0.5 to 1, and a validity threshold of 0.7 or above was applied, indicating a high degree of agreement among judges regarding the adequacy, relevance, and clarity of the items (see Table A1).
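For readers who wish to reproduce this computation, a minimal sketch is given below. It assumes a four-judge panel rating each item on a 1–4 scale, as in Table A1, and uses the score confidence interval of Penfield and Giacobbi [26] for the lower and upper bounds; the example ratings are hypothetical.
```python
import math

def aikens_v(ratings, lo=1, hi=4):
    # Aiken's V: sum of (rating - lowest category) over n judges * (hi - lo).
    n, k = len(ratings), hi - lo
    return sum(r - lo for r in ratings) / (n * k)

def score_ci(v, n_judges, k, z=1.96):
    # Score confidence interval for V (Penfield & Giacobbi, 2004).
    nk = n_judges * k
    half = z * math.sqrt(4 * nk * v * (1 - v) + z ** 2)
    denom = 2 * (nk + z ** 2)
    return (2 * nk * v + z ** 2 - half) / denom, (2 * nk * v + z ** 2 + half) / denom

ratings = [4, 4, 4, 3]                        # hypothetical judgments for one item
v = aikens_v(ratings)                         # 0.92
lower, upper = score_ci(v, n_judges=4, k=3)   # approx. 0.646 and 0.985
print(round(v, 2), round(lower, 4), round(upper, 4))
```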
Criteria A1, A2, and B correspond to the dependent variables to be measured (see Table 1). The dependent variables were scored according to the criteria ranging from 1 to 6, as used in previous studies [27,28]. Criteria C, D, and E define the independent variables (see Table 2).
The subjective judgment of the prompt, written according to the instructions for the activity proposed by the evaluators, was considered. These ratings were 1 “not achieved”, 2 “more non-achieved than achieved”, 3 “approximately equal achieved and non-achieved”, 4 “more achieved than non-achieved”, 5 “nearly all achieved”, and 6 “achieved”. The “achieved” ratings indicated that a prompt was considered complete in terms of content, coherence, and structure. This meant the prompt successfully adhered to the instructions provided by the evaluators (e.g., providing context or motivation) and fulfilled all the activity requirements (e.g., including keywords, specifying structure, etc.). The rating “not achieved” considered the responses unsatisfactory [27,28,29,30].
The subjective assessment of the quality of the response provided by the LLM, based on the delivered prompt, was also considered. The rating used the same criteria from 1 to 6. The rating 6 “achieved” considered a response to be complete in content (e.g., it included possible questions and possible answers to address the hypothetical problem posed). The rating 1 “not achieved” considered responses unsatisfactory in both content and form (i.e., incomplete).
The mean length of utterances (MLU; Promedio de longitud de los enunciados, PLE) is an index traditionally used to measure the level of language development in children. It measures the length of utterances based on the assumption that structural complexity, i.e., the range of linguistic programming, is manifested in an increase in the number of elements that make up an utterance [31,32,33].
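As an illustration, the index described above (number of words divided by number of utterances; see also Section 2.4) can be computed as in the following sketch. The splitting rule on sentence-final punctuation is an assumption for the example only; in the study, utterance segmentation was carried out by a linguist.
```python
import re

def mlu_index(prompt_text):
    # Mean length of utterances (PLE): number of words / number of utterances.
    # Utterances are approximated here by splitting on sentence-final punctuation.
    utterances = [u for u in re.split(r"[.!?]+", prompt_text) if u.strip()]
    words = prompt_text.split()
    return len(words) / max(len(utterances), 1)

example = "Tengo una entrevista de trabajo. Dame ejemplos de preguntas frecuentes."
print(mlu_index(example))  # 10 words / 2 utterances = 5.0
```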
Orthography and punctuation are observable features of written form that are governed by a set of rules for the writing of a language. Criterion C seeks to assess issues of form in the writing of the prompt, which is why these indicators are included [34].
On the advice of the external evaluators, the consideration of colloquialism and politeness formulas was eliminated, as they do not directly concern the form of the writing, nor do they correspond to the good use of the rules of writing.
Regarding variable D, the Spanish language employs three distinct verb moods, which may be described as representing different attitudes of the speaker. These moods correspond to the voice or attitude of the speaker and are indicated in the verb conjugation. The indicative mood is the most prevalent and is used to discuss matters related to reality, facts, or objective truths. It is typically used for narration, information conveyance, and description. The imperative mood, on the other hand, is used to give commands or directives to the addressee, aiming to elicit a specific action or response. In other words, it is employed to express orders, commands, requests, or advice. Together with the indicative, these are the moods that speakers typically utilize first in their language development.
The subjunctive mood is a feature of many Romance languages and is employed to convey a range of meanings, including doubt, desire, emotion, and hypothesis. The complexity of the subjunctive mood lies in its ability to express subjectivity, uncertainty, or desire. In other words, the verb expresses unrealities, possibilities, or desires, rather than verifiable or objective facts [35,36,37].
The expert reviewers highlighted a potential issue with participants’ comprehension of this distinction. However, the use of these moods is expected to be spontaneous: participants do not need expert knowledge of the distinction, as native speakers of Spanish use these moods naturally and modulate them according to their intended meanings.
A linguistic researcher with expertise in morphosyntactic studies conducted the classification of the verbal moods used in the prompts. Regarding variable E, the traditional approach to measuring sentence complexity is based on the number of verbal syntactic elements in an utterance. In this paradigm, a low level of grammatical complexity is indicated by the exclusive use of simple sentences, given that there is only one conjugated verb in the utterance. The use of coordinated and subordinate sentences is deemed more complex, as it requires two semantically connected conjugated verbs. Subordinate and coordinated sentences are analogous to complex or compound sentences, and the deployment of these or their combinations is associated with elevated demands on syntactic structural management [38,39].
Given the reservations expressed by the experts regarding the methodology employed to establish sentence complexity, the classification of sentence types in each prompt was conducted manually. A linguistic researcher with expertise in morphosyntactic studies performed this classification.
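Purely as an illustration of how such an annotation could be approximated automatically, the sketch below uses spaCy’s Spanish pipeline (assuming the es_core_news_sm model is installed); the example prompt is hypothetical, and automatic morphological tags would still require expert verification, which is why manual classification was preferred in this study.
```python
import spacy

nlp = spacy.load("es_core_news_sm")  # requires: python -m spacy download es_core_news_sm

def describe_prompt(text):
    doc = nlp(text)
    # Finite (conjugated) verbs are a proxy for the number of clauses:
    # more than one suggests coordination or subordination.
    finite = [t for t in doc
              if t.pos_ in ("VERB", "AUX") and "Fin" in t.morph.get("VerbForm")]
    moods = sorted({m for t in finite for m in t.morph.get("Mood")})
    return {"finite_verbs": len(finite), "moods": moods}

print(describe_prompt("Quisiera que me dieras ejemplos de preguntas de entrevista."))
```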

2.3. Procedure

The data collection procedure was carried out between August and September 2024 and included the following steps:
  • Introduction and Consent:
    • Students were welcomed, and the study context and objectives were explained.
    • Participants read an informed consent form, and those who agreed to participate provided their consent.
  • Questionnaire Administration:
    • The questionnaire was administered via Google Forms, allowing for efficient and anonymous data collection.
    • Students completed the questionnaire using their personal devices (mobile phones or tablets) in the classroom, ensuring consistent technological conditions.
    • The questionnaire included the following sections:
      • Demographics: Collected data such as age, gender, educational institution, university program, year of study, and prior experience with ChatGPT to contextualize responses.
      • Case Analysis: Presented a common scenario for all participants: preparing for a job interview in their field of study.
        • Situation: You have a job interview for an internship at a company in your field of study. You need to be prepared for the possible questions you will be asked and how to answer them effectively.
        • Instructions: Use ChatGPT to generate examples of common job interview questions and tips on how to answer them.
      • Satisfaction Assessment: Measured the level of satisfaction with the responses generated by ChatGPT.
  • Prompt Interaction:
    • Students wrote their prompt in the questionnaire, copied it into the free version of ChatGPT (3.5), and pasted the AI’s response back into the form.
    • Students evaluated their satisfaction with the response:
      • If satisfied, the test concluded.
      • If dissatisfied, they rewrote the prompt (up to a maximum of 7 iterations) until they were satisfied.
  • Ethical Considerations:
    • Participation was voluntary, with students free to withdraw at any time without consequences.
    • Anonymity and confidentiality were guaranteed, with all data used exclusively for academic purposes.
  • Data Analysis:
    • After data collection, responses were analyzed using the Content Assessment Instrument, validated by expert judgment.
    • A linguist with expertise in morphosyntax evaluated each response individually, and the results were tabulated for further analysis.
The entire activity lasted approximately 35 min, providing a structured and controlled environment for data collection and analysis.
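Participants interacted with ChatGPT through its web interface. Purely for illustration, a programmatic equivalent of a single prompt–response exchange might look like the sketch below, which uses the OpenAI Python client with a GPT-3.5 model as a stand-in for the free ChatGPT 3.5 used in the study; the prompt shown is hypothetical.
```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

prompt = ("Tengo una entrevista de trabajo para una práctica en mi área de estudio. "
          "Dame ejemplos de preguntas frecuentes y consejos para responderlas.")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```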

2.4. Data Analysis

A descriptive analysis of the data was carried out. When analyzing the texts used to interact with the AI, we considered the following dependent and independent variables.
There were three dependent variables: (1) the subjective judgment of the prompt (i.e., ordinal variable); (2) the subjective judgment of the quality of the response given by the LLM (i.e., ordinal variable); and (3) the mean length of the utterances (index: number of words/number of utterances) (see Table 1).
For the three dependent variables, Kruskal–Wallis ANOVA was performed with each independent variable (see Table 2). First, the analysis of the independent variable “use of standards in writing” is presented (i.e., no orthography or punctuation errors vs. orthography errors in the writing of the prompt vs. punctuation errors in the writing of the prompt vs. both types of errors in the writing of the prompt); second, the analysis of the independent variable “verbal moods or attitudes of the speaker” is presented (use of indicative vs. subjunctive vs. imperative moods vs. use of two verb moods vs. use of three verb moods); third, the analysis of the independent variable “sentence complexity” is presented (use of simple sentences vs. use of coordinated sentences vs. use of subordinate sentences vs. use of two types of sentences vs. use of three types of sentences).
Jamovi software version 2.3 (2022) and IBM SPSS 27 Statistics were used for the analyses.
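Although the analyses were run in Jamovi and SPSS, an equivalent Kruskal–Wallis test can be sketched in Python as follows; the file and column names are hypothetical placeholders for the tabulated ratings.
```python
import pandas as pd
from scipy.stats import kruskal

# One row per prompt: a dependent variable (e.g., response quality, 1-6)
# and an independent variable (e.g., sentence complexity category).
df = pd.read_csv("prompts_coded.csv")

groups = [g["response_quality"].values
          for _, g in df.groupby("sentence_complexity")]
h, p = kruskal(*groups)
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.4f}")
```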

3. Results

First, a descriptive analysis of the data was performed. Second, because the data were not normally distributed (Shapiro–Wilk, W = 0.798, p < 0.001), a Kruskal–Wallis ANOVA was performed for each dependent variable.

3.1. Subjective Judgment of the Prompt Written According to the Instructions for the Activity

First, the subjective judgment of the prompt written according to the instructions for the activity proposed by the evaluators was considered. For these analyses, all inattentive prompts were excluded, i.e., those that were not related in any way to the activity proposed to the participants. A total of 65% of the data was included.
The overall mean was 5.01 (1.16 SD). The frequency of ranking 1 “not achieved” was 1.5%, ranking 2 “more non-achieved than achieved” was 2.9%, ranking 3 “approximately equal achieved and non-achieved” was 4.4%, ranking 4 “more achieved than non-achieved” was 20.6%, ranking 5 “nearly all achieved” was 25%, and ranking 6 “achieved” was 45.6%.
The prompt written by the participants was analyzed in terms of aspects of form (i.e., use of standards in writing), where the occurrence or non-occurrence of orthographic and punctuation errors was considered. The most frequent prompts were those with orthographic errors and those with both types of errors (orthographic and punctuation). No significant differences were observed.
However, there was an observed significant effect for the variable verbal moods or attitudes of the speaker in the prompt (χ2 (3) = 9.71, p = 0.021). The findings indicate that the exclusive utilization of the indicative or subjunctive mood is associated with diminished objective achievement (medians of 5 and 4). Conversely, prompts that combine two or three moods demonstrate a more favorable tendency (median of 6), reflecting a greater alignment with the target.
Also, there was an observed significant effect for the variable sentence complexity (χ2 (4) = 21.6, p < 0.001) in the prompt (see Table 3). The use of coordinated sentences indicated superior performance (median 5.5, 1–6). Subordinate sentences with a median of 4 showed a lower level of achieving the desired objective.
The prompts that combined the two structural types showed a median of 5.2 (5–6). The prompts that combined all three structures showed a median of 5.7 (2–6), reflecting the optimal performance and the highest level of achievement of the objective. In summary, the use of only simple sentences or sentences with subordination resulted in lower objective achievement. Conversely, prompts that employed coordinated sentences or a combination of types of structures demonstrated the most favorable evaluations.

3.2. Subjective Judgment of the Quality of the Response Given by the LLM

Second, the subjective judgment of the quality of the response given by the LLM was considered. These ratings were achieved, moderately achieved, and not achieved (see Table 2). For these analyses, 100% of the data was included.
The overall mean was 5.01 (1.16 SD). The frequency of ranking 1 “not achieved” was 18.4%, ranking 2 “more non-achieved than achieved” was 20.4%, ranking 3 “approximately equal achieved and non-achieved” was 17.5%, ranking 4 “more achieved than non-achieved” was 11.7%, ranking 5 “nearly all achieved” was 15.5%, and ranking 6 “achieved” was 16.5%.
The response of the AI was analyzed in terms of the use of writing standards in the prompt (i.e., orthographic and/or punctuation errors). The most frequent responses were those associated with prompts containing both types of errors (orthography and punctuation). A significant difference was observed for this variable (χ2 (3) = 14.41, p = 0.002). Better performance was observed in responses to prompts with punctuation errors (median of 5, 1–6) and in responses to prompts with both types of errors (median of 4, 1–6) (see Figure 1). As noted in the PLE analyses below, these prompts are the longest prompts.
The variable verbal moods or attitudes of the speaker in the prompt had an observed significant effect (χ2 (4) = 31.5, p < 0.001). Better performance was observed in responses to prompts using two verb moods (median of 5, 1–6) and three verb moods (median of 5, 1–6). These results suggest that responses to prompts combining two or three verbal moods are evaluated most favorably, achieving higher ratings. In contrast, responses to prompts utilizing only the imperative mood were evaluated as not achieved (median of 1). Responses to prompts employing the indicative (median of 2) exhibited more non-achieved than achieved performance (see Figure 1).
Also, there was an observed significant effect on the response of the AI depending on the variable sentence complexity in the prompt (χ2 (4) = 44.7, p < 0.001) (see Table 4 and Figure 1). Responses from prompts combining two sentence types and three sentence types, with a median of 5 (range: 1–6), showed the highest achieved scores, followed by the prompts with coordinated sentences 4 (3–5). In contrast, responses from prompts employing simple or subordinate sentences were less favorably evaluated (median of 2).

3.3. The Length of the Utterances of the Prompt

Third, the length of the utterances of the prompt was considered. The index is calculated as the number of words in the utterances divided by the number of utterances (see Table 3). For these analyses, 100% of the data was included.
The length of the utterances was analyzed in terms of the use of standards in writing (i.e., orthographic and/or punctuation errors). No significant differences were observed for this variable (χ2 (3) = 6.87, p = 0.079).
However, there was an observed significant effect for the length of the utterances depending on the variable verbal moods or attitudes of the speaker in the prompt (χ2 (4) = 39.8, p < 0.001). The results described in Table 5 suggest that responses from prompts using three verb moods had the highest length index (median of 17), indicating greater complexity and sentence length. Responses from prompts with indicative or subjunctive mood showed an intermediate level of length (median of 9), although the indicative mood showed a greater variation in the range.
Also, there was an observed significant effect for the length of the utterances depending on the variable sentence complexity in the prompt (χ2 (4) = 22.5, p < 0.001). The results indicate that responses from prompts comprising simple sentences exhibit the lowest length index (median 8), suggesting greater conciseness. The combination of two sentence types yielded the highest index (median 13.2), reflecting a greater degree of complexity and length. In contrast, responses from prompts comprising three sentence types exhibited a median length of 12.3, also indicating a trend towards greater sentence length (see Figure 2).

4. Discussion

This study aimed to explore how the characteristics of prompt writing in Spanish can influence the responses generated by an LLM (ChatGPT, free version). Given the limited research on this topic, an exploratory approach was deemed essential. To explore this issue, we evaluated the subjective judgment of the prompt written according to the instructions for the proposed activity (mean prompt), the subjective judgment of the quality of the response given by the LLM, and the length of the utterances in the prompt. These variables were described according to sociodemographic covariates and then analyzed based on the following: (i) the use of writing standards (form) in the prompt, (ii) the verbal moods or attitudes expressed by the speaker in the prompt, and (iii) sentence complexity in the prompt.
Our proposal does not aim to analyze multiple LLMs or incorporate extensive real-world scenarios. Instead, it focuses on examining whether a relationship exists between users’ grammatical abilities—specifically in Spanish, in this case among higher education students—and the quality of the responses generated by a single model: the free version of ChatGPT (version 3.5).

4.1. Punctuation and Orthography

The first notable finding was that the use of writing standards, such as punctuation and spelling, did not affect the dependent variables “subjective judgment of the prompt” and “length of the utterances in the prompt”. This highlights how robust LLMs currently are to this variable, even in a language other than English. This can be explained by integrating the results observed for utterance length (PLE): longer prompts contain more errors (simply as a matter of probability), but they also give the AI more specific instructions, and it is this added specificity that benefits the AI response.

4.2. Verbal Moods

The second finding was that the verbal moods employed in the written prompt had an impact on the three dependent variables observed. Regarding the subjective judgment of the written prompt, the results demonstrate that prompts using a combination of three moods achieve the highest objective performance. Those combining two moods show moderate success, while prompts with only the indicative or subjunctive mood achieve the lowest scores.
Regarding the subjective judgment of the response given by the AI, the results suggest that responses to prompts combining two or three verbal moods were evaluated most favorably. In contrast, responses from prompts with only the imperative mood were evaluated as not achieved, and responses to prompts employing the indicative or subjunctive were evaluated as moderate achievement.
Regarding the length of the utterances, the general trend observed in the analysis suggests that a prompt employing a combination of verbal moods leads to better performance and greater achievement. This pattern highlights that greater variety in verbal moods positively impacts both the quality and complexity of the prompts and responses.
The differential impact of verbal moods on LLM response quality may be explained by the inherent linguistic features of each mood. The indicative mood typically conveys factual and objective information, making it more compatible with the declarative nature of most AI-generated responses. In contrast, the subjunctive mood, often used to express hypothetical or uncertain situations, may introduce ambiguity that challenges the model’s interpretative capacity. Linguistic studies have demonstrated that the processing demands associated with subjunctive constructions are generally higher than those of indicative forms, as they require greater contextual inference and pragmatic interpretation [40,41]. These findings suggest that the model’s tendency is to favor indicative constructions over subjunctive ones. Future research could further investigate how the representation of verbal moods in training data influences the robustness and coherence of AI-generated responses.

4.3. Sentence Complexity

The third finding was that the sentence complexity employed in the written prompt had an impact on the three dependent variables observed.
Regarding the subjective judgment of the written prompt, the use of only simple sentences or sentences with subordination resulted in lower objective achievement. Conversely, prompts that employed coordinated sentences or a combination of three types of structures demonstrated better evaluations, reflecting greater success in meeting the objective.
Regarding the subjective judgment of the quality of the response, the results showed that responses from prompts utilizing coordinated sentences or a combination of two or three sentence types were the most highly evaluated. In contrast, responses from prompts employing simple or subordinate sentences were less favorably evaluated.
Regarding the length of the utterances in the prompt, the results indicate that responses from prompts comprising simple sentences exhibit the lowest length index, suggesting greater conciseness. In contrast, responses from prompts comprising subordinate and coordinated sentences exhibited higher length indices. The combination of two and three sentence types yielded the highest index, indicating a greater degree of complexity and length.
In synthesis, the findings reveal that the use of diverse verbal moods and sentence structures significantly impacts the evaluation of both the prompt and the generated responses, as well as the length of the responses. Prompts that incorporate multiple moods and more complex sentence types achieve better results, both in terms of objective performance and subjective judgments. The diversity in linguistic features not only enhances the quality of the interaction but also influences the success of the generated text. Writing standards such as punctuation and orthography did not show any impact on the dependent variables.

4.4. Implications for How to Objectively Evaluate a Prompt Written by an Adult

Given the observed trends, it is recommended that, when evaluating a written Spanish prompt, consideration be given to verb moods, sentence complexity, and utterance length. A post hoc Principal Component Analysis (PCA) was conducted, revealing that, when these three variables were considered, a single component was extracted (see Table 6). This suggests that only one component captures enough variance to be retained. Additionally, an Exploratory Factor Analysis (EFA) was performed to examine the structure of the factor and identify underlying patterns.
The first component explained 63.9% of the variance, with utterance length showing a high loading on it; this makes utterance length an important factor in explaining the overall complexity of the text, although it requires complementary information regarding verb mood and sentence complexity.
The EFA was conducted using the principal axis factoring method with oblimin rotation. This analysis identified one factor with a KMO of 0.65 [42] and Bartlett’s test of sphericity of χ2 = 60.2, degrees of freedom (df) = 3, and significance (p < 0.001). McDonald’s omega (ω) reliability analysis yielded a value of 0.73.
Therefore, if we consider the proposals in Table 1 and Table 2 together as an operationalization of indicators for evaluation instruments, the evaluation of prompts should consider the subjective assessment of prompt achievement and the subjective assessment of response achievement. These two variables provide an overall view of the content achievement of both elements of the AI interaction. Furthermore, the instrument should consider the length of utterances, the verb moods used, and sentence complexity in order to capture structural aspects of writing form. These are relevant and significant for the achievement of the AI interaction objectives. Our analyses indicate that it is not informative to consider compliance with spelling and punctuation rules, as these were not significant in terms of achievement.
This evaluation instrument therefore allows us to assess the form of the prompt, that is, the complexity of the writing. A follow-up study should stratify the outcome of this instrument with a larger sample size. In a post hoc pilot, a scale was established based on the Z-score of the principal component, as follows: the 84th percentile or higher (≥1.11) signifies achievement; a percentile between the 16th and 83rd (greater than −1 and less than 1.11) indicates moderate achievement; and a percentile below the 16th indicates no achievement.
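A minimal sketch of this post hoc scale is given below, assuming the three structural indicators have already been coded numerically (the data shown are hypothetical). The first principal component is standardized so that the Z-score cut-offs above can be applied, and the sign of the component should be checked so that higher scores correspond to greater complexity.
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [utterance length index, number of verb moods, number of sentence types]
X = np.array([[8.0, 1, 1],
              [12.5, 2, 2],
              [17.0, 3, 3],
              [9.0, 1, 2]])

pc1 = PCA(n_components=1).fit_transform(StandardScaler().fit_transform(X))
z = StandardScaler().fit_transform(pc1).ravel()  # Z-score of the principal component

def classify(score):
    if score >= 1.11:
        return "achieved"
    if score > -1:
        return "moderate achievement"
    return "not achieved"

print([classify(s) for s in z])
```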
These considerations can be extended to evaluations of speech forms in other contexts, for example, in the field of speech therapy and linguistics. As mentioned above, sentence length is an index for assessing the level of morphosyntactic (grammatical) development in children. However, it is not used to formally assess adult language. At present, linguists and speech therapists do not have an objective protocol for assessing the morphosyntactic level in Spanish, so a subjective and qualitative assessment is generally made. The present study contributes to the assessment of the morphosyntactic level of adult speech by providing guidelines on how it can be assessed quantitatively. We have provided evidence that determining the level of complexity in writing requires considering at least sentence length, the verb forms used, and the types of sentences employed by the participants.

4.5. Implications of This Work in the Field of Higher Education

As Holmes [43] suggests, it is important to recognize that the connections between AI and education are more complex than they might seem, with misunderstandings often arising due to a lack of research [44].
With this work, we propose that both the strategies for formulating effective questions to interact with an LLM and the quality of the responses these LLMs provide depend on students’ grammatical performance. While much research focuses on prompt engineering strategies, it is important that literacy policies for the use of AI in education also incorporate the development of students’ grammatical performance.
Figure 3 represents a methodological proposal that educational communities could use to enhance students’ grammatical performance, enabling responsible interaction between students and LLMs. This framework highlights four essential aspects to consider in human–machine interaction: (i) grammatical performance; (ii) grammatical strategies for question formulation; (iii) interaction with the LLM prompt; and (iv) evaluating response quality.
Grammatical performance, focused on the verb and its mood, influences sentence complexity. The subjunctive mood, being more complex than the indicative, presents greater challenges for non-native speakers. Compound sentences, especially subordinate ones, increase complexity by relating multiple ideas. According to our study, the choice of verbal mood and the type of sentence are linked to the length of interactions with the LLMs. Speakers with more expertise, those who are able to combine different moods and sentence types, can generate longer and more informative messages, achieving better results in their interactions with the LLMs. Hence, the speaker’s expertise in structuring sentences is positively related to better outcomes and success in interactions with LLMs, such as ChatGPT.
Grammatical strategies for question formulation can be considered part of the prompt engineering field. A prompt is an input to a generative AI model used to guide its output [45,46]. Many studies demonstrate that more effective prompts improve the quality of responses across a wide range of tasks [47]. However, understanding how to write prompts effectively remains an emerging field, with various terms and techniques that are not necessarily well understood by students.
To evaluate the quality of responses, it is crucial that students also enhance their skills, such as critical thinking, researching, and contrasting information sources, while being able to analyze the ethical aspects and intellectual property rights involved in interactions with AI.
Moreover, considering the exploratory nature of this study and its focus on Spanish language interaction with LLMs, the findings may serve as a foundation for informing future educational policies. In particular, the proposed methodological framework—highlighting the importance of grammatical competence, prompt formulation, and response evaluation—could support the development of digital AI literacy programs in higher education. These programs might guide both students and educators in cultivating critical and effective AI use, especially in multilingual and underrepresented contexts. However, further empirical validation is required before these recommendations can inform formal policy initiatives.
Finally, we can observe that all the proposed components are connected to grammatical performance. We intend to emphasize the importance of grammatical performance, as a foundational step in improving interactions with AI. In addition, the findings of this study align with linguistic theories that suggest greater cognitive demands when processing complex grammatical structures, particularly in languages with rich morphology like Spanish. By contextualizing the results within established theories of linguistic variability, we provide evidence that not only supports the importance of grammatical prompt formulation but also highlights the challenges faced by LLMs when dealing with diverse verbal moods and complex syntactic forms. Future studies could further explore these challenges in other Romance languages or contexts where linguistic diversity plays a critical role in LLM performance.

5. Conclusions

The drastic and accelerated rise in Generative Artificial Intelligence in several areas has led both individuals and organizations to use it for solving a wide range of problems. Unlike traditional AI, generative AI utilizes models capable of processing language through written instructions provided as prompts.
Despite advances in this type of technology, uncertainties persist regarding what kind of abilities must be acquired to interact with LLMs. While several studies focus on prompt engineering strategies, this study takes a step back to explore the significance of grammatical competence in obtaining effective responses from LLMs such as ChatGPT.
In this context, the present study constitutes an initial attempt to understand the interaction between higher education students and LLMs (in this case, ChatGPT) from a linguistic perspective. Specifically, we aimed to analyze how the writing style influenced the quality of the responses provided by ChatGPT, considering the use of Spanish. In other words, how important is grammar in influencing the quality of ChatGPT’s responses?
There were three dependent variables: the subjective judgment of the prompt written according to the instructions for the activity proposed by the evaluators; the subjective judgment of the quality of the response given by the LLM; and the mean length of the utterances.
The results of this exploratory study indicate that the use of varied verbal moods and sentence structures has a significant effect on both the evaluation of the prompt and the generated responses, as well as on the length of the responses. Prompts that include a wider range of moods and more complex sentence structures tend to yield better outcomes, both in terms of objective performance and subjective assessments. This linguistic diversity not only improves the quality of the interaction but also plays a key role in the success of the generated text. Interestingly, writing standards such as punctuation and orthography did not appear to affect dependent variables.
However, it is important to acknowledge that these findings are based on an exploratory approach and should not be overgeneralized. The results obtained are specific to the context of higher education students in Chile using the free version of ChatGPT (version 3.5) and should be interpreted with caution. While the patterns identified provide valuable insights into how grammatical competence may influence LLM responses, further studies with larger and more diverse samples are necessary to confirm and expand upon these observations. Future research should also consider different linguistic contexts and additional LLM versions to assess the consistency and applicability of these findings.
What implications could these results have for higher education, considering that technologies based on LLMs can significantly enhance students’ learning experiences? In this context, we propose a framework to guide higher education communities in both the digital literacy process for professors and students, as well as in the integration of AI into the teaching–learning process. This framework highlights four essential aspects to consider in human–machine interaction: (i) grammatical performance; (ii) grammatical strategies for question formulation; (iii) interaction with the LLM prompt; and (iv) evaluating response quality.
Considering the key findings of this research, the following directions for future work are proposed: (i) training LLMs with underrepresented languages to improve their multilingual capabilities and address linguistic diversity; (ii) investigating how students utilize prompt engineering strategies to optimize their interactions with LLMs; (iii) exploring how students assess the quality of outputs generated by LLMs, including their criteria and approaches; (iv) determining whether higher education should prioritize grammatical proficiency or prompt engineering abilities in the context of LLM usage; (v) evaluating whether natural language alone is sufficient to obtain high-quality responses from LLMs, or if additional strategies are required; (vi) implementing the proposed framework and analyzing its outcomes with a representative sample of students from higher education communities; (vii) incorporating automated evaluation metrics, such as BLEU, ROUGE, and METEOR, to complement manual assessments and provide a more comprehensive analysis of LLM performance in terms of accuracy, semantics, and information precision; (viii) examining how users with varying levels of grammatical proficiency, including those with limited formal education or developing writing skills, interact with LLMs to identify potential barriers and opportunities for improving accessibility and effectiveness across diverse populations; and (ix) analyzing how LLMs process the specific linguistic characteristics of Romance languages, such as Spanish—including rich morphology and syntactic flexibility—in contrast to English. This line of research would provide valuable insights into how these differences influence prompt engineering practices and support the development of AI digital literacy in multilingual educational contexts.
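As a pointer for direction (vii), a minimal example of one such automated metric (sentence-level BLEU via NLTK, with a smoothing function for short texts) is sketched below; the reference and candidate sentences are hypothetical.
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["cuéntame sobre un desafío que hayas superado en tus estudios".split()]
candidate = "cuéntame sobre un desafío que superaste en tus estudios".split()

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"Sentence-level BLEU = {score:.3f}")
```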

6. Limitations

This study was exploratory and involved a limited sample of higher education communities from the Biobío region in Chile. Although the findings emphasize the importance of grammatical performances for interacting with LLMs, further research is needed to confirm these results.

Author Contributions

Conceptualization, R.V.-M. and C.E.C.-S.; Data Curation, C.E.C.-S.; Formal Analysis, R.V.-M. and C.E.C.-S.; Investigation, R.V.-M. and C.E.C.-S.; Methodology, R.V.-M., J.C.-S., C.C.-S. and C.E.C.-S.; Resources, S.S.-M.-Q.; Supervision, J.C.-S.; Visualization, S.S.-M.-Q.; Writing—Original Draft, R.V.-M., J.C.-S., C.C.-S., S.S.-M.-Q. and C.E.C.-S.; Writing—Review and Editing, R.V.-M., J.C.-S. and S.S.-M.-Q. All authors have read and agreed to the published version of the manuscript.

Funding

The author R.V.-M. acknowledges support from ANID FONDECYT Postdoc through grant number 3230356. The author C.C.-S. acknowledges support from grant ANID Capital humano Beca Doctorado Nacional Foil 21231752 Project ID 16930.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of Universidad Católica de la Santísima Concepción (protocol code 09/2024 and date of approval 14 March 2024).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. The results of four expert judgments for two content validity thresholds (Vo = 0.50 and Vo = 0.70) of the assessment instrument by sufficiency (S), relevance (R), and clarity (C) for the variables. Only the parts of the instrument that required editing or modification are reported.
Item | Criterion | Mean | V | Lower | Upper | Decision
A1 The student’s prompt demonstrates that they were able to follow the instructions given in the activity. | S | 3.75 | 0.92 | 0.6461 | 0.9851 | Reassess sufficiency
 | R | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | C | 3.25 | 0.75 | 0.4677 | 0.9111 | Rewrite
A2 The answer given by the AI is satisfactory according to the given prompt. | S | 3.75 | 0.92 | 0.6461 | 0.9851 | Revise sufficiency
 | R | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | C | 3.00 | 0.67 | 0.3906 | 0.8619 | Rewrite
B1 Record the number of words used in the question posed to the AI. | S | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | R | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | C | 4.00 | 1.00 | 0.7575 | 1.0000 |
B2 Record the number of sentences used in the wording of the question posed to the AI. | S | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | R | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | C | 4.00 | 1.00 | 0.7575 | 1.0000 |
C1 There are orthographical errors in the wording of the prompt. | S | 2.25 | 0.42 | 0.1933 | 0.6805 | Insufficiency
 | R | 3.50 | 0.83 | 0.552 | 0.953 | Revise Relevance
 | C | 4.00 | 1.00 | 0.7575 | 1.0000 |
C2 There are punctuation errors in the wording of the prompt. | S | 3.25 | 0.75 | 0.4677 | 0.9111 | Insufficiency
 | R | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | C | 4.00 | 1.00 | 0.7575 | 1.0000 |
C3 Students comply with colloquialisms and politeness in the writing of the prompt. | S | 2.75 | 0.58 | 0.3195 | 0.8067 | Insufficiency
 | R | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | C | 3.75 | 0.92 | 0.6461 | 0.9851 | Revise writing
D1 The indicative mode is present in the wording of the prompt. | S | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | R | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | C | 3.00 | 0.67 | 0.3906 | 0.8619 | Rewrite
D2 The subjunctive mood is present in the wording of the prompt. | S | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | R | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | C | 3.00 | 0.67 | 0.3906 | 0.8619 | Rewrite
D3 The imperative mood is present in the wording of the prompt. | S | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | R | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | C | 3.00 | 0.67 | 0.3906 | 0.8619 | Rewrite
D4 There are dual combinations of verb moods, i.e., at least two types in the wording of the prompt. | S | 3.25 | 0.75 | 0.4677 | 0.9111 | Insufficiency
 | R | 3.25 | 0.75 | 0.4677 | 0.9111 | Irrelevant
 | C | 3.00 | 0.67 | 0.3906 | 0.8619 | Rewrite
D5 The prompt uses three types of combined verb moods. | S | 3.25 | 0.75 | 0.4677 | 0.9111 | Insufficiency
 | R | 3.25 | 0.75 | 0.4677 | 0.9111 | Irrelevant
 | C | 3.00 | 0.67 | 0.3906 | 0.8619 | Rewrite
E1 There is a simple sentence in the wording of the prompt. | S | 3.75 | 0.92 | 0.6461 | 1.0000 | Revise sufficiency
 | R | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | C | 3.25 | 0.75 | 0.4677 | 0.9111 | Rewrite
E2 There is a coordinated sentence in the wording of the prompt. | S | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | R | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | C | 3.25 | 0.75 | 0.4677 | 0.9111 | Rewrite
E3 There is a subordinate sentence in the wording of the prompt. | S | 3.50 | 0.83 | 0.552 | 0.953 | Revise sufficiency
 | R | 3.50 | 0.83 | 0.552 | 0.953 | Revise relevance
 | C | 3.25 | 0.75 | 0.4677 | 0.9111 | Rewrite
E4 There are dual combinations of sentence types. | S | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | R | 4.00 | 1.00 | 0.7575 | 1.0000 |
 | C | 3.75 | 0.92 | 0.6461 | 0.9851 | Revise writing
E5 Three types of combined sentences are used in the prompt. | S | 3.50 | 0.83 | 0.552 | 0.953 | Revise sufficiency
 | R | 3.50 | 0.83 | 0.552 | 0.953 | Revise relevance
 | C | 3.00 | 0.67 | 0.3906 | 0.8619 | Rewrite
According to the findings of the expert assessment, the criteria for Category A were revised and updated; the criteria for Category C were revised and Category C3 was removed; and the criteria for Categories D and E were revised and rewritten for greater clarity (for further details, please refer to Section 2.4, Data Analysis).

References

  1. Roumeliotis, K.I.; Tselikas, N.D. ChatGPT and Open-AI Models: A Preliminary Review. Future Internet 2023, 15, 192. [Google Scholar] [CrossRef]
  2. Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A Family of Highly Capable Multimodal Models. arXiv 2024, arXiv:2312.11805. [Google Scholar]
  3. Wermelinger, M. Using GitHub Copilot to Solve Simple Programming Problems. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, Toronto, ON, Canada, 15–18 March 2023; pp. 172–178. [Google Scholar]
  4. Perplexity. Perplexity AI. 2025. Available online: https://www.perplexity.ai (accessed on 7 January 2025).
  5. Anthropic. Claude. 2024. Available online: https://anthropic.com (accessed on 27 March 2025).
  6. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916. [Google Scholar]
  7. Bryant, C.; Yuan, Z.; Qorib, M.R.; Cao, H.; Ng, H.T.; Briscoe, T. Grammatical Error Correction: A Survey of the State of the Art. Comput. Linguist. 2023, 49, 643–701. [Google Scholar] [CrossRef]
  8. Singh, A.; Singh, N.; Vatsal, S. Robustness of LLMs to Perturbations in Text. arXiv 2024, arXiv:2407.08989. [Google Scholar]
  9. Chen, B.; Zhang, Z.; Langrené, N.; Zhu, S. Unleashing the potential of prompt engineering in Large Language Models: A comprehensive review. arXiv 2024, arXiv:2310.14735. [Google Scholar]
  10. Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A.A.M.; Abid, A.; Fisch, A.; Brown, A.R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. arXiv 2023, arXiv:2206.04615. [Google Scholar]
  11. Ortiz-Garces, I.; Govea, J.; Andrade, R.O.; Villegas-Ch, W. Optimizing Chatbot Effectiveness through Advanced Syntactic Analysis: A Comprehensive Study in Natural Language Processing. Appl. Sci. 2024, 14, 1737. [Google Scholar] [CrossRef]
  12. Salem, A.C.; Gale, R.C.; Fleegle, M.; Fergadiotis, G.; Bedrick, S. Automating Intended Target Identification for Paraphasias in Discourse Using a Large Language Model. J. Speech Lang. Hear. Res. 2023, 66, 4949–4966. [Google Scholar] [CrossRef]
  13. Lu, H.; Liu, T.; Cong, R.; Yang, J.; Gan, Q.; Fang, W.; Wu, X. QAIE: LLM-based Quantity Augmentation and Information Enhancement for few-shot Aspect-Based Sentiment Analysis. Inf. Process. Manag. 2025, 62, 103917. [Google Scholar] [CrossRef]
  14. Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language Understanding. arXiv 2021, arXiv:2009.03300. [Google Scholar]
  15. Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; Steinhardt, J. Measuring Mathematical Problem Solving with the MATH Dataset. arXiv 2021, arXiv:2103.03874. [Google Scholar]
  16. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.d.O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374. [Google Scholar]
  17. Shi, F.; Suzgun, M.; Freitag, M.; Wang, X.; Srivats, S.; Vosoughi, S.; Chung, H.W.; Tay, Y.; Ruder, S.; Zhou, D.; et al. Language Models are Multilingual Chain-of-Thought Reasoners. arXiv 2022, arXiv:2210.03057. [Google Scholar]
  18. Fang, T.; Yang, S.; Lan, K.; Wong, D.F.; Hu, J.; Chao, L.S.; Zhang, Y. Is ChatGPT a Highly Fluent Grammatical Error Correction System? A Comprehensive Evaluation. arXiv 2023, arXiv:2304.01746. [Google Scholar]
  19. Hadi, M.U.; Tashi, Q.A.; Qureshi, R.; Shah, A.; Muneer, A.; Irfan, M.; Zafar, A.; Shaikh, M.B.; Akhtar, N.; Wu, J.; et al. A Survey on Large Language Models: Applications, Challenges, Limitations, and Practical Usage. TechRxiv 2023. [Google Scholar] [CrossRef]
  20. Mialon, G.; Dessì, R.; Lomeli, M.; Nalmpantis, C.; Pasunuru, R.; Raileanu, R.; Rozière, B.; Schick, T.; Dwivedi-Yu, J.; Celikyilmaz, A.; et al. Augmented Language Models: A Survey. arXiv 2023, arXiv:2302.07842. [Google Scholar] [CrossRef]
  21. Chan, C.K.Y.; Hu, W. Students’ voices on generative AI: Perceptions, benefits, and challenges in higher education. Int. J. Educ. Technol. High. Educ. 2023, 20, 43. [Google Scholar] [CrossRef]
  22. Overono, A.L.; Ditta, A.S. The Rise of Artificial Intelligence: A Clarion Call for Higher Education to Redefine Learning and Reimagine Assessment. Coll. Teach. 2023, 73, 123–126. [Google Scholar] [CrossRef]
  23. Saúde, S.; Barros, J.P.; Almeida, I. Impacts of Generative Artificial Intelligence in Higher Education: Research Trends and Students’ Perceptions. Soc. Sci. 2024, 13, 410. [Google Scholar] [CrossRef]
  24. Etikan, I. Comparison of Convenience Sampling and Purposive Sampling. Am. J. Theor. Appl. Stat. 2016, 5, 1–4. [Google Scholar] [CrossRef]
  25. Aiken, L. Three coefficients for analyzing the reliability and validity of ratings. Educ. Psychol. Meas. 1985, 45, 131–142. [Google Scholar] [CrossRef]
  26. Penfield, R.D.; Giacobbi, P.R., Jr. Applying a Score Confidence Interval to Aiken’s Item Content-Relevance Index. Meas. Phys. Educ. Exerc. Sci. 2004, 8, 213–225. [Google Scholar] [CrossRef]
  27. Johnson, D.; Goodman, R.; Patrinely, J.; Stone, C.; Zimmerman, E.; Donald, R.; Chang, S.; Berkowitz, S.; Finn, A.; Jahangir, E.; et al. Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model. Res. Sq. 2023, rs.3.rs-2566942. [Google Scholar] [CrossRef]
  28. Molena, K.F.; Macedo, A.P.; Ijaz, A.; Carvalho, F.K.; Gallo, M.J.D.; Wanderley Garcia De Paula E Silva, F.; De Rossi, A.; Mezzomo, L.A.; Mugayar, L.R.F.; Queiroz, A.M. Assessing the Accuracy, Completeness, and Reliability of Artificial Intelligence-Generated Responses in Dentistry: A Pilot Study Evaluating the ChatGPT Model. Cureus 2024, 16, e65658. [Google Scholar] [CrossRef]
  29. Knoth, N.; Tolzin, A.; Janson, A.; Leimeister, J.M. AI literacy and its implications for prompt engineering strategies. Comput. Educ. Artif. Intell. 2024, 6, 100225. [Google Scholar] [CrossRef]
  30. White, J.; Hays, S.; Fu, Q.; Spencer-Smith, J.; Schmidt, D.C. ChatGPT Prompt Patterns for Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design. arXiv 2024, arXiv:2303.07839. [Google Scholar]
  31. Pavelko, S.L.; Price, L.R.; Owens, R.E., Jr. Revisiting reliability: Using Sampling Utterances and Grammatical Analysis Revised (SUGAR) to compare 25-and 50-utterance language samples. Lang. Speech Hear. Serv. Sch. 2020, 51, 778–794. [Google Scholar] [CrossRef]
  32. Pavez, M. Presentación del Índice de Desarrollo del Lenguaje “Promedio de Longitud de los Enunciados” (PLE); Universidad de Chile: Santiago, Chile, 2002; Available online: https://www.u-cursos.cl/medicina/2010/2/FOMORES22/1/material_docente/bajar?id_material=303509 (accessed on 27 March 2025).
  33. Soler, M.C.; Murillo, E.; Nieva, S.; Rodríguez, J.; Mendez-Cabezas, C.; Rujas, I. Verbal and More: Multimodality in Adults’ and Toddlers’ Spontaneous Repetitions. Lang. Learn. Dev. 2023, 19, 16–33. [Google Scholar] [CrossRef]
  34. Torrego, L.G. Ortografía de uso Español Actual; Ediciones SM España: Madrid, Spain, 2015. [Google Scholar]
  35. Gjenero, A. Uso del Modo Subjuntivo para Expresar Deseos. Bachelor’s Thesis, University of Zagreb, Zagreb, Croatia, 2024. [Google Scholar]
  36. Muñoz De La Virgen, C. Adquisición del modo subjuntivo: Una propuesta didáctica. Didáctica Leng. Lit. 2024, 36, 127–144. [Google Scholar] [CrossRef]
  37. Vyčítalová, B.L. El Subjuntivo y el Indicativo: La Importancia de una Preparación Previa del Estudiante. Master’s Thesis, Filozofická Fakulta Ústav Románských Jazyků a Literatur, Masarykova Univerzita, Brno, Czech Republic, 2024. [Google Scholar]
  38. Brown, A.V.; Paz, Y.B.; Brown, E.K. El Léxico-Gramática del Español: Una Aproximación Mediante la Lingüística de Corpus; Routledge: London, UK, 2021. [Google Scholar]
  39. Radford, A. Analysing English Sentence Structure: An Intermediate Course in Syntax; Cambridge University Press: Cambridge, UK, 2023. [Google Scholar]
  40. Fleischman, S.; Bybee, J.L. Modality in Grammar and Discourse; John Benjamins B.V.: Amsterdam, The Netherlands, 1995. [Google Scholar]
  41. Silva-Corvalán, C. Language Contact and Change: Spanish in Los Angeles; Oxford University Press: Oxford, UK, 1994. [Google Scholar]
  42. Pallant, J. SPSS Survival Manual: A Step by Step Guide to Data Analysis Using IBM SPSS; Routledge: London, UK, 2020. [Google Scholar] [CrossRef]
  43. Holmes, W. The Unintended Consequences of Artificial Intelligence and Education; Education International: Brussels, Belgium, 2023. [Google Scholar]
  44. Miao, F.; Holmes, W.; Huang, R.; Zhang, H. AI and Education: A Guidance for Policymakers; UNESCO Publishing: Paris, France, 2021. [Google Scholar]
  45. Heston, T.; Khun, C. Prompt Engineering in Medical Education. Int. Med. Educ. 2023, 2, 198–205. [Google Scholar] [CrossRef]
  46. Meskó, B. Prompt Engineering as an Important Emerging Skill for Medical Professionals: Tutorial. J. Med. Internet Res. 2023, 25, e50638. [Google Scholar] [CrossRef] [PubMed]
  47. Schulhoff, S.; Ilie, M.; Balepur, N.; Kahadze, K.; Liu, A.; Si, C.; Li, Y.; Gupta, A.; Han, H.; Schulhoff, S.; et al. The Prompt Report: A Systematic Survey of Prompting Techniques. arXiv 2024, arXiv:2406.06608. [Google Scholar]
Figure 1. Trends in subjective judgments of prompts and AI-generated responses. This figure illustrates the tendencies of two dependent variables: the subjective judgment of the prompt written according to the instruction for the activity proposed by the evaluators (PROMPT) and the subjective judgment of the quality of the response given by the LLM (AI RESPONSE). The data are presented across the independent variables: use of standards in writing, verbal moods, and sentence complexity.
Figure 2. Trends in utterance length (PLE). This figure illustrates the tendencies of the dependent variable PLE across the independent variables: use of standards in writing, verbal moods, and sentence complexity.
Figure 3. Methodological framework for interacting with LLMs. This figure presents a proposed methodology for structuring interactions with LLMs, outlining the key components and processes involved. The direction of the arrows indicates the interaction and influence of grammatical performance on all framework components.
Table 1. Details of the dependent variables of the study.

Dependent Variable Measured | Indicator | Scale
A1. The subjective judgment of the written prompt according to the instruction for the activity proposed by the evaluators (mean prompt) | The prompt written by the participant is evidence that they were able to follow the instructions given in the activity. | 1 = not achieved; 2 = more not achieved than achieved; 3 = approximately equal achieved and not achieved; 4 = more achieved than not achieved; 5 = nearly all achieved; 6 = achieved
A2. The subjective judgment of the quality of the response given by the LLM | The answer given by the AI is satisfactory according to the prompt written by the participant. | 1 = not achieved; 2 = more not achieved than achieved; 3 = approximately equal achieved and not achieved; 4 = more achieved than not achieved; 5 = nearly all achieved; 6 = achieved
B. Length of the utterances | Number of words and number of utterances used in the wording of the prompt posed to the AI. | Index: number of words / number of utterances
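The length-of-utterances index in Table 1 (reported as PLE in Figure 2 and Table 5) divides the number of words in a prompt by the number of utterances it contains. The following minimal Python sketch illustrates one way such an index could be computed; the whitespace tokenization and the splitting of utterances on sentence-final punctuation are simplifying assumptions for illustration, not the scoring procedure used in the study.

```python
import re

def utterance_length_index(prompt: str) -> float:
    """PLE-style index: number of words divided by number of utterances."""
    # Split utterances on sentence-final punctuation (simplifying assumption).
    utterances = [u for u in re.split(r"[.!?]+", prompt) if u.strip()]
    # Count words by whitespace tokenization (simplifying assumption).
    words = prompt.split()
    return len(words) / max(len(utterances), 1)

# Example: 8 words across 2 utterances -> index of 4.0
print(utterance_length_index("Escribe un resumen corto. Usa un tono formal."))
```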
Table 2. Details of the independent variables of the study, adjusted according to the comments and recommendations of four expert judges.

Independent Variable Measured | Indicator | Alternatives
C. Use of standards in writing (form) | No orthography or punctuation errors. | 1
 | Orthography errors in the writing of the prompt. | 2
 | Punctuation errors in the writing of the prompt. | 3
 | Both types of errors in the writing of the prompt. | 4
D. Verbal moods or attitudes of the speaker: use of the indicative, subjunctive, and imperative moods | In the prompt, the indicative mood is identified. | 1
 | In the prompt, the subjunctive mood is identified. | 2
 | In the prompt, the imperative mood is identified. | 3
 | In the prompt, two of the three verbal moods are identified. | 4
 | In the prompt, the three verbal moods are identified. | 5
E. Sentence complexity in the prompt: the type(s) of sentences the participant used in writing the prompt | In the prompt, only simple sentences are identified. | 1
 | In the prompt, only coordinate sentences are identified. | 2
 | In the prompt, only subordinate sentences are identified. | 3
 | In the prompt, two types of sentences are identified. | 4
 | In the prompt, the three types of sentences are identified. | 5
Table 3. For each independent variable, mean (SD), median, and range of the dependent variable "subjective judgment of the prompt written according to the instruction for the activity proposed by the evaluators".

Independent Variable and Conditions | Mean (SD) | Median (Range)
Use of standards in writing: χ²(3) = 2.64, p = 0.451
No orthography or punctuation errors | 4.5 (1.69) | 4.5 (1–6)
Orthography errors in the writing of the prompt | 4.9 (0.87) | 5 (3–6)
Punctuation errors in the writing of the prompt | 5.1 (1.45) | 6 (2–6)
Both types of errors in the writing of the prompt | 5.1 (0.91) | 5 (3–6)
Verbal moods or attitudes of the speaker: χ²(3) = 9.71, p = 0.021
Use of indicative mood | 4.6 (1.34) | 5 (1–6)
Use of subjunctive mood | 4.6 (0.89) | 4 (4–6)
Use of imperative mood | 0 | 0
Use of two verb moods | 5.4 (0.768) | 6 (4–6)
Use of three verb moods | 6 (0.00) | 6 (6–6)
Sentence complexity: χ²(4) = 21.6, p < 0.001
Use of simple sentences | 4.4 (1.23) | 5 (4–6)
Use of coordinated sentences | 5.5 (0.707) | 4 (1–6)
Use of subordinate clauses | 4 (0.632) | 6 (2–5)
Use of two types of sentences | 5.2 (1.14) | 6 (5–6)
Use of three types of sentences | 5.79 (0.579) | 6 (2–6)
Table 4. For each independent variable, mean (SD), median, and range of the dependent variable "subjective judgment of the quality of the response given by the LLM".

Independent Variable and Conditions | Mean (SD) | Median (Range)
Use of standards in writing: χ²(3) = 14.41, p = 0.002
No orthography or punctuation errors | 2.6 (1.53) | 2 (1–6)
Orthography errors in the writing of the prompt | 2.74 (1.54) | 3 (1–6)
Punctuation errors in the writing of the prompt | 4.14 (1.82) | 5 (1–6)
Both types of errors in the writing of the prompt | 3.9 (1.68) | 4 (1–6)
Verbal moods or attitudes of the speaker: χ²(4) = 31.5, p < 0.001
Use of indicative mood | 2.75 (1.5) | 2 (1–6)
Use of subjunctive mood | 2.71 (0.9) | 3 (1–4)
Use of imperative mood | 1 (--) | 1 (1–1)
Use of two verb moods | 4.96 (1.3) | 5 (1–6)
Use of three verb moods | 4 (2.6) | 5 (1–6)
Sentence complexity: χ²(4) = 44.7, p < 0.001
Use of simple sentences | 2.2 (1.1) | 2 (1–6)
Use of coordinated sentences | 4 (1) | 4 (3–5)
Use of subordinate clauses | 2.4 (1.21) | 2 (1–4)
Use of two types of sentences | 4.6 (1.52) | 5 (1–6)
Use of three types of sentences | 4.8 (1.4) | 5 (1–6)
Table 5. For each independent variable, mean (SD), median, and range of the length of the utterances in the prompt (number of words divided by the number of utterances).

Independent Variable and Conditions | Mean (SD) | Median (Range)
Use of standards in writing: χ²(3) = 6.87, p = 0.079
No orthography or punctuation errors | 11 (6.29) | 11 (3–35)
Orthography errors in the writing of the prompt | 9.5 (3.43) | 9 (4–17)
Punctuation errors in the writing of the prompt | 11.69 (5.25) | 11.5 (3–24)
Both types of errors in the writing of the prompt | 11.81 (4.73) | 11.5 (5–24)
Verbal moods or attitudes of the speaker: χ²(4) = 39.8, p < 0.001
Use of indicative mood | 10 (4.9) | 9 (3–24)
Use of subjunctive mood | 9.29 (1.70) | 9 (8–13)
Use of imperative mood | 3 (-) | 3 (3–3)
Use of two verb moods | 13.96 (5.19) | 13 (8–35)
Use of three verb moods | 16.89 (7.83) | 16.8 (9–24)
Sentence complexity: χ²(4) = 22.5, p < 0.001
Use of simple sentences | 8.46 (3.58) | 8 (3–24)
Use of coordinated sentences | 10 (4.50) | 10 (5.5–14)
Use of subordinate clauses | 11.5 (2.48) | 11 (9–16)
Use of two types of sentences | 14.66 (5.70) | 13.2 (6–35)
Use of three types of sentences | 12.72 (2.91) | 12.3 (8–17)
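The χ² statistics and p-values reported in Tables 3–5 compare the rating distributions across independent groups of prompts. As an illustration only, the sketch below shows how a Kruskal-Wallis test of this kind could be computed in Python with SciPy; the group ratings are invented for the example and are not the study data, and the test shown is a plausible reading of the reported statistics rather than a description of the authors' exact procedure.

```python
from scipy.stats import kruskal

# Invented 1-6 quality ratings grouped by sentence-complexity condition
# (illustrative values only; not the study data).
simple       = [2, 1, 2, 3, 2, 1]
coordinated  = [4, 3, 5, 4]
subordinate  = [2, 3, 1, 2, 4]
two_types    = [5, 4, 6, 5, 3]
three_types  = [5, 6, 4, 5, 6]

# Kruskal-Wallis H test across the five groups (df = number of groups - 1 = 4).
h_stat, p_value = kruskal(simple, coordinated, subordinate, two_types, three_types)
print(f"H(4) = {h_stat:.2f}, p = {p_value:.4f}")
```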
Table 6. Results of the Exploratory Factor Analysis (EFA) and the Principal Component Analysis (PCA) showing the component loadings and % of variance.

Factor | EFA Loading | KMO a (overall KMO = 0.651) | PCA Loading | Uniqueness | % of Variance | Eigenvalue b
Length of the utterances | 0.615 | 0.65 | 0.776 | 0.397 | 63.9 | 1.91
Verbal moods | 0.614 | 0.68 | 0.776 | 0.398 | 20.9 | 0.62
Sentence complexity | 0.815 | 0.62 | 0.844 | 0.287 | 15.2 | 0.45
a High loadings close to 1 or −1 indicate that the variable has a strong influence on that component. b If a single component has an eigenvalue > 1, it means that the three variables are well grouped and represent a common construct.
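Footnote b of Table 6 refers to the common criterion of retaining components whose eigenvalue exceeds 1. The sketch below illustrates, with simulated data, how component loadings, explained variance, and eigenvalues of the kind reported in Table 6 can be obtained from a correlation matrix; the three simulated variables are stand-ins for utterance length, verbal moods, and sentence complexity and are not the study data.

```python
import numpy as np

# Simulated standardized scores for three variables (stand-ins for utterance
# length, verbal moods, and sentence complexity); not the study data.
rng = np.random.default_rng(42)
latent = rng.standard_normal((60, 1))
X = latent + 0.8 * rng.standard_normal((60, 3))  # three correlated indicators

# Principal components from the correlation matrix of the three variables.
corr = np.corrcoef(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]              # sort components by size
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_variance = 100 * eigenvalues / eigenvalues.sum()
loadings = eigenvectors * np.sqrt(eigenvalues)     # component loadings
retained = int((eigenvalues > 1).sum())            # Kaiser criterion: eigenvalue > 1

print("Eigenvalues:", eigenvalues.round(2))
print("% of variance:", explained_variance.round(1))
print("Loadings on first component:", loadings[:, 0].round(3))
print("Components with eigenvalue > 1:", retained)
```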