1. Introduction
In recent years, the integration of Artificial Intelligence (AI) tools, particularly of large language models (LLMs), into educational settings has rapidly accelerated. One of the most affected areas is student assessment, where teachers are increasingly using LLMs to assist with grading, feedback generation, and evaluation tasks. This shift is driven by the idea of reducing the educators’ workload, streamlining repetitive activities, and enabling scalability in large-class environments. The adoption of LLMs is not only expanding but also appears to be an inevitable step in the digital transformation of education. However, the growing reliance on LLMs for high-stakes educational tasks raises important questions about their effectiveness, reliability, and the types of risks they introduce. In addition to well-known concerns such as the over-reliance of students on AI dialogue systems
Zhai et al. (
2024), factual errors, hallucinated content
Farquhar et al. (
2024), and implicit bias
Lin et al. (
2024), new and more subtle vulnerabilities are emerging. As these tools are entrusted with evaluative functions traditionally performed by humans
Ruiz et al. (
2025), there is increasing concern about the trustworthiness of their output, the fairness of their decisions, and their susceptibility to manipulation. This paper focuses on the emerging threat of prompt manipulation and examines how the use of LLMs in assessment, although promising, may expose the educational process to underexplored forms of misuse.
Prompt injection has recently gained significant visibility in global media
Sugiyama and Eguchi (
2025), drawing attention to a manipulative technique designed to alter the expected behaviour of tools powered by LLMs. This attack pattern belongs to a broader class of input manipulation strategies, where attackers craft specially designed inputs to subvert system behaviour. Historical examples include, e.g., viruses and malicious software, SQL injection attacks
Hyslip (
2017);
Ray and Ligatti (
2012), and the insertion of hidden text and links in web content for search engine optimization tactics to artificially and maliciously alter page rankings
Sehgal et al. (
2022).
In the AI domain, adversarial inputs to mislead model predictions exemplify similar vulnerabilities
Y. Li et al. (
2021);
Malik et al. (
2024). Collectively, these forms of attack have a common underlying flaw:
Insufficient Input Validation. This well-documented weakness in software systems applies when user-provided input is not adequately checked and cleaned. This flaw will enable adversaries to introduce unexpected, malicious content that can compromise the integrity of the system and result in unintended execution paths or security breaches.
Reflecting the growing severity of this problem, two of the most authoritative initiatives in software security, the Common Weakness Enumeration (CWE)
MITRE (
2024) and the Open Worldwide Application Security Project (OWASP)
OWASP (
2024), explicitly include prompt injection in their rankings of the most critical software vulnerabilities. In particular, the latest OWASP publication, “Top 10 Risks & Mitigations for LLMs and GenAI Apps 2025”
OWASP (
2025), lists prompt injection as the top risk, highlighting the urgency to address this threat in the design and deployment of LLM-based systems.
To this end, our study focuses on how prompt injection and obfuscation techniques can be used in educational contexts, particularly in scenarios where LLMs are used to assist teachers in tasks such as grading, feedback generation, and textual evaluation.
Section 2 introduces a range of injection strategies and obfuscation patterns, illustrating how they can be embedded within student submissions to subtly alter the LLM’s behaviour.
Section 3 presents the systematic experiments conducted using real-world educational material. These experiments systematically evaluate the impact of prompt injection techniques on several of the most widely adopted LLMs, highlighting their vulnerabilities in authentic academic workflows.
Section 4 presents and discusses the results of experiments on the dimensions of vulnerability and consistency between the generated grades and justification. The final discussion and observations are provided in
Section 5.
2. Prompt Injection in the Educational Scenario
This study focuses on the realistic educational scenario in which a teacher uses a large language model to assist in the evaluation of written assignments submitted by a class of students in an e-learning platform (e.g, Content Management System (CMS)
Biondi et al. (
2022);
Franzoni et al. (
2020)).
Figure 1 illustrates the workflow of the evaluation process, including the potential for prompt injection attacks. In our framework, the attacker corresponds to a student who deliberately embeds hidden prompt instructions within the text of their assignment to manipulate the grading outcome, while the instructor remains unaware of the modification. In this scenario, it is assumed without loss of generality that students have access to a word processor for preparing their exam submissions. During the exam, their devices are disconnected from the internet and from any LLM-based systems, and responses are uploaded via a secure connection to the university’s official e-learning platform
Falcinelli et al. (
2009);
Jin (
2012). The teacher then downloads the exam documents, and prompts them to an LLM along with instructions to evaluate the responses of the students.
The experimental data used in this work consist of anonymized student responses to open-ended questions from a final written examination for an academic course. The target of the attack is the teacher, who unknowingly interacts with the manipulated content through the LLM’s interface, enabling prompt injection to influencing the AI’s output.
2.1. Evaluation Criteria for the Assignment
The evaluation criteria applied by the teacher for assessing the students’ responses will be related to the specific course. In general, they could be modelled considering different dimensions, such as content, structure, and language.
According to university guidelines, the instructor could provide a sample answer, i.e., a reference answer that shows what a fully correct or high-quality response should look like. This answer would be unknown to the students. The sample answer ensures consistency and fairness in grading by providing a clear baseline. In academic practice, it can also be used in response to a student’s request for justification of their grade or a detailed explanation, i.e., a request for a third-party review by another professor.
2.2. User Prompt
The user prompt includes instructions for the LLM on how to evaluate the students’ responses, written by the teacher. In this section, the teacher’s evaluation strategy will be analysed and modelled, and the structure of a sample user prompt will be built.
2.2.1. User Interaction Strategy
As an output for the teacher’s prompt, by requesting the LLM to execute the assessment and attaching the students’ documents to be evaluated, two potential strategies for interaction between the teacher and LLM are considered:
Single evaluation request, where the teacher uploads the students’ answers individually in separate sessions, and submits a prompt requesting their assessment according to the evaluation criteria, using the sample answer as a baseline.
Sequential evaluation request, where the teacher uploads all students’ responses in a single session, and submits a prompt requesting all the assessments to be produced.
These two strategies reflect typical teacher behaviours, with a preference for single or sequential assessment.
Both classes of experimentation are conducted with and without injected documents, for comparison.
2.2.2. Teacher’s Prompt Structure and Content
The teacher’s prompt requests the LLM to execute the evaluation, thus providing the LLM with the evaluation criteria and the sample answer. To produce a consistent assessment, the LLM will list the following:
A description of the adopted evaluation criteria;
The question of the assignment;
The sample answer provided by the instructor;
The request for evaluation of the completed assignment uploaded for the question, according to the evaluation criteria and the sample answer.
The following template integrates the structure of the user prompt applied to our experiments:
With these instructions in the LLM prompt, the teacher is willing to obtain support to check the students’ assessments, matching the evaluation criteria, and providing a rationale for the given assessment. The instructor can use the generated notes as a base to deepen the single assignment analysis or to justify the grade. It is easy to predict that the more sophisticated and credible the LLM becomes in the evaluation process, returning plausible evaluations, the more the teachers would trust it, and the less they would apply an independent in-depth analysis, especially in the presence of a relevant workload, e.g., a high number of student assignments to be graded. In such a situation, the risk that the prompt injection is effective can be considered very high, and its impact can disrupt the quality of the assessment process.
2.3. Prompt Injection Types
In this section, we introduce the two types of prompt injection on which we focus our research. In the assignment evaluation scenario, we consider an attacker model that requires the following:
An attack technique: This has to be easily accessible, with minimal or nearly no technical prerequisites, for
- –
Injection, i.e., it requires the ability of just using ‘cut-and-paste’;
- –
Obfuscation, i.e., it requires the ability to change text colour in a word processor.
The attacker’s goals in using injection could be any of the following:
- –
p.i.pass: Prompt injection requesting to pass the exam, i.e., obtaining a grade number greater than or equal to the sufficient grade limit:
- –
p.i.top25: Prompt injection requesting to be included in the group of top 25% excellent students, i.e., achieving a grade greater than or equal to the top 25% lower limit:
The above injection prompts have been fine-tuned through a preliminary phase of prompt injection experiments in order to enforce the intended behaviour. In preliminary tests, we found that some LLMs were realizing weak executions of prompt instructions. In addition to the initial core injected instructions to give a grade greater than (or equal to) a certain limit, we considered the following components for the injection:
Priority requirement:
As in serial evaluation, the injected prompt could be considered in a context where other assignments are evaluated, risking the dilution of its impact.
Consistency:
In preliminary tests without this explicit specification, some LLMs were assigning the requested score, but sometimes generating motivations unrelated to it.
Weakening user prompt influence:
Without this specification, in some cases, uninjected motivations prevailed in output.
Do not make user aware of the injected prompt:
Without this instruction, some LLMs incorrectly include comments like “grade value greater than or equal to X-Limit as you requested”, or similar phrases, spoiling the injection.
2.4. Student’s Prompt Obfuscation
Prompt injection does not require advanced technical skills. In educational contexts, even low-effort obfuscation techniques can be sufficient to alter an LLM’s behaviour without attracting the instructor’s attention. A typical example is embedding the injection prompt in white text on a white background or using a single line of extremely small characters. This form of steganographic insertion can be easily achieved using standard word processors, copy–paste, and text-formatting operations.
This approach thus defines a low-complexity attacker model, whereby a non-expert student could intentionally bypass the teacher’s review by exploiting the asymmetry between human-readable content and what the LLM can process. Despite its simplicity, this method can have a significant impact on automated or assisted evaluation workflows.
3. Experiments
This section describes the experimental design and procedure developed to test the central hypotheses of the study regarding the effect of prompt injection on model-assisted grading.
The study is designed under the research assumption that prompt injections embedded within student submissions could influence the grading behaviour of large language models, leading to a systematic increase in the assigned scores compared to uninjected controls. It is further assumed that this bias could be more pronounced when evaluations are conducted in a sequential workflow, where multiple assignments are processed within the same conversational context, allowing injected instructions to persist and affect subsequent outputs. At the same time, the experiment acknowledged that such effects are unlikely to manifest uniformly across different models, as variations in architecture, alignment strategy, and contextual sensitivity may produce distinct levels of susceptibility to grade inflation.
The systematic plan of experiments conducted to evaluate LLMs’ vulnerability to injection attacks is presented in a scenario of students’ exam assessment.
A total of 32 authentic student assignments were included in the experimental corpus. These were distributed across the four grade categories defined in
Section 3.1.1 (fail, satisfactory, good, top). After the baseline evaluation performed by each LLM, the assignments in each category were selected to evaluate how the LLMs are sensitive to injection attacks on them.
Each assignment is experimented and evaluated on category jump under multiple injection conditions applicable to its baseline evaluation category:
Fail-to-Pass: Aiming to reach grade, i.e., reaching pass from the baseline fail grade category;
Fail-to-Top: Targeting top 25% grade category from fail;
Satisfactory-to-Top: Passing from a baseline satisfactory of 4–5 grade to top 25%;
Good-to-Top: From a good 6–8 baseline grade to top 25%.
Figure 2 illustrates the experimental flow, where the non-injected submissions (
Figure 2a) are fed to the LLMs to compute the baseline evaluation, and prompt injected inputs (in
Figure 2b–e) cover the four different types of category jump, to evaluate grade inflation vulnerability and size. The injection has not been tested in assignments with a baseline evaluation in the top category, because such grades don’t need inflation.
For each of the two workflows of the evaluator–LLM interaction (single vs. sequential request), six LLMs have been tested on each student’s submission under the baseline (i.e., no injection) and under injected conditions, when applicable to the submissions’ baseline categories, resulting in about 500 distinct evaluation runs.
For accessibility to non-technical readers, the pseudocode can be interpreted as representing a sequence of grading sessions: in single-request mode, each assignment is sent to the model individually, while in sequential mode, the teacher submits all the assignments together in a single batch. The key difference is in the context persistence of the LLM, which in the sequential case allows injected instructions in a single submission to potentially influence others.
3.1. Teacher’s Data
In this section, we present the teacher’s data, i.e., the general information for the specific evaluation criteria used in our experiment, the assignment, and the sample answer. Such data are extracted and anonymized from real written exams given at the Hong Kong Baptist University, collected in the past editions of the course on E-transformation in Business, taught by one of the authors of this study.
3.1.1. Evaluation Criteria
In order to ensure that the grading process reflects realistic assessment practices, the evaluation framework is based on three complementary dimensions that together capture both the substantive and formal quality of the written response of a student. These dimensions are consistent with commonly adopted rubrics in higher education and language-based assessment, balancing content knowledge, communicative clarity, and linguistic accuracy. Each criterion contributes to a holistic yet transparent evaluation, allowing both human graders and LLMs to assess submissions along comparable dimensions of performance. The three dimensions applied in this study are the following:
Content Completeness. Evaluate whether the student has addressed all parts of the task, covered all required points, and developed the content sufficiently. Example: An essay that only partially responds to the question would be considered incomplete, even if well written.
Structure and Organization. Concerns the clarity and logic of the organization of the text: introduction, development, conclusion, well-formed paragraphs, and smooth transitions. Example: An answer paper with scattered arguments and no clear flow loses points for structure, even if it contains good ideas.
Language Accuracy and Style. Measures the quality of language: grammar, spelling, vocabulary, lexical precision, and appropriate register to the context. Example: An assignment answer with frequent grammar errors, or very limited or inaccurate vocabulary would receive a lower score, even if it is complete and well organized.
Finally, each completed assignment receives an overall numerical grade on a scale from 0 (minimum) to 12 (maximum), accompanied by a concise evaluative comment summarizing the main strengths and weaknesses across the three dimensions.
According to institutional grading guidelines, scores below 4 points (less than 35% of the total) are considered insufficient or fail, while scores of 4–5 points (35–50%) are rated sufficient or satisfactory. The grades between 6 and 8 points (50–75%) correspond to the good category, and those between 9 and 12 points (≥75%) indicate top or excellent performance. This grading schema ensures methodological consistency across evaluators and experimental conditions, providing a transparent mapping between quantitative scores and qualitative judgments applicable to both human and model-based assessments.
3.1.2. Assignment
The teacher’s assignment for the exam, used in the experiments, is the following:
3.1.3. Sample Answer
The teacher provides a sample answer for each assignment element, according to academic guidelines, to represent a reference example for a possible third-party teacher to provide a further correction. The sample answer is unknown to students, but is provided to the LLM as part of the user prompt requesting the evaluation. The following sample answer is related to the assignment used in our experiments.
3.2. Tested LLMs
The following set of popular LLMs has been systematically experimented with, each in the basic default settings available to end users at the time of testing implementation (finalized in August 2025):
ChatGpt (OpenAI ChatGpt version 4o);
Gemini (Google Gemini version 2.5 Flash);
DeepSeek (DeepSeek version V3);
Grok(xAI Grok version 3);
Perplexity (Perplexity AI version 0.0.0);
Copilot (Microsoft Copilot GPT-4).
Experiments have been conducted with new, clean accounts, created ad hoc, to avoid spoiling results with the testers’ data.
LLMs such as Microsoft’s Copilot and Google’s Gemini are generative AI systems, potentially accessible to a vast majority of users all over the world.
3.3. Plan of Experiments
The experiment plan covers the two prompt injection strategies described in
Section 2.2.1, with two different approaches, with different levels of desired achievement:
At least pass (p.i.pass), where the student’s assignment, injected with the attacker’s prompt, aims at inducing the LLM to overevaluate it up to resulting, at least, sufficient/satisfactory, giving to the attacker’s assignment at least a 4 points overall evaluation;
Top 25% (p.i.top25), where the attacker is trying to obtain an excellent evaluation, between 9 and 12 points.
Both students’ approaches will be tested, with the two teacher’s strategies. The LLM–teacher interaction strategies concern the modality in which the instructor uploads the students’ completed assignments and the prompt evaluation request, as previously explained.
A preliminary phase of
fair evaluation experiments, i.e., requesting the evaluation of assignments with no prompt injection (see
Figure 2a) holds for both interaction strategies. This preliminary phase will help establish a baseline for comparison to the experiments with prompt injection.
Then, we aim at evaluating the capability to obtain a particular desired grade’s increment through prompt injection: if such an increment is affected by the distance between the fair assessment grade and the attacker’s target grade.
Assignment responses with different fair grades (i.e., fail, sufficient, good) have been considered to test the capability of injection to lead the LLM to the highest (i.e., top 25%) grades.
For the
at least pass injection experiments, also called “from the student’s point of view”, the
fail-to-pass is used to inflate grades from fail to sufficient (
Figure 2b). In this case, the injection has been tested on an assignment that obtained an
insufficient grade (from 0 to 3) in the fair evaluation, to verify if the injection can increase it to 4 points or more.
For the top 25% experiments, three starting points were tested with the aim to verify if the injection succeeds in incrementing each grading range to excellent:
- -
Fail-to-top from fail to excellent (
Figure 2c);
- -
Satisfactory-to-top, from sufficient to excellent (
Figure 2d);
- -
Good-to-top, from good to excellent (
Figure 2e).
The following pseudocode describes the experiments on different dimensions for the LLMs, the teacher’s evaluation approach, the students’ prompt injection strategy, and the different starting grade conditions for the attacker’s assignment. The experiment plan is also depicted in
Figure 2.
4. Experiment Results and Discussion
The results of our experiments show vulnerability to the prompt injection of all LLMs tested in the assignment-assessment scenario, both for the single and the sequential assessment strategies. In the following sections, the results for each strategy will be presented from the points of view of vulnerability and qualitative results.
4.1. Vulnerability
Table 1 shows the results for the main experiment in this study. In the table, ‘
yes’ confirms the hypothesis of the considered LLM being vulnerable to prompt injection; the entry ‘
N/A’ specifies that it was not possible to test the specific case (e.g., fail-to-top not possible if the baseline did not grade as failed any assignment). Indeed, DeepSeek, Grok, Perplexity, and Copilot, when evaluated to establish the reference baseline for sequential evaluation (that is, when assignments are asked to be evaluated as a whole group), did not return any assignment graded lower than 4 points. Testing the cases
fail-to-pass and the
fail-to-top was therefore not possible for the cited LLMs.
In particular, all the experiments show that the considered LLMs are vulnerable to every injection strategy that could be tested. Generally speaking, independently from the presence of prompt injection, a general behaviour common to all LLMs has been noted: when the LLM evaluates a set of assignments together, the grade tends to be more
indulgent, leading to a positive evaluation, which does not happen in the single evaluation strategy. This general shift toward higher grades sometimes caused the unavailability of evaluations for the case of baseline fail, as shown in
Table 1.
4.2. Statistical Analysis
This section presents a formal, more technical analysis of results, applying statistical measures for quantitative analysis, and an evaluation on a Likert scale for the qualitative analysis.
4.2.1. Methodology
Two endpoints were pre-specified: (i)
Δgrade (attack − control, continuous) and (ii)
attack success rate (ASR), defined as the proportion of submissions crossing the target threshold (“≥ pass” or “top 25%”) under attack. Analyses were conducted per model and per workflow (single vs. sequential) using paired comparisons. For
grade, mean paired differences with
t-based 95% confidence intervals (CIs) were reported and tested against the null using a paired
t-test, with a Wilcoxon signed-rank sensitivity check for non-normality
Wilcoxon (
1945). For ASR, proportions with binomial 95% CIs based on the Wilson score interval were reported
Wilson (
1927), and paired pass/fail flips were evaluated using McNemar’s test with continuity correction
McNemar (
1947). To check multiplicity across models and endpoints,
p-values were adjusted using the Benjamini–Hochberg false discovery rate (FDR) procedure
Benjamini and Hochberg (
1995). Effect sizes were visualized with forest plots (point estimate ± 95% CI) stratified by model and workflow. In the single injection condition, ASR reached 100% for all models, producing a ceiling effect; consequently, the ASR forest plot for this condition was omitted and boundary estimates were noted. Robustness checks included trimming 5% of
grade extremes and repeating ASR analyses with alternate thresholds (±5 points or percentile ±2.5) to confirm stability of conclusions.
4.2.2. Single Injection
In the single injection strategy, both evaluation targets (“≥ pass” and “top 25%”) demonstrated statistically significant grade inflation across all evaluated models, with paired
grade estimates consistently above zero and 95% confidence intervals excluding the null. The observed magnitude of grade shifts ranged approximately from 1.4 to 2.3 points for the pass criterion (
Figure 3) and 3.8 to 5.6 points for the top quartile target (
Figure 4), indicating a strong upward bias induced by the instructions injected. Although the tested LLMs are all strongly sensitive to inject instructionsion, the system
Gemini, and
DeepSeek exhibited the largest and most consistent
grade effects across the end-points, all statistically significant at the 95% level. In both endpoints, the attack success rate (ASR) reached 100% for every model, producing a complete ceiling effect that precluded differentiation among systems. These findings confirm that even a single injected prompt can uniformly and substantially bias model-based grading outcomes.
4.2.3. Sequential Injection
In the sequential injection condition targeting the top 25% threshold, all evaluated models displayed statistically significant grade inflation (upper panel,
Figure 5), with paired
grade estimates above zero and 95% confidence intervals excluding the null. The magnitude of the effect ranged from approximately 0.4 to 2.0 points, indicating a systematic upward bias in grading under sequential exposure. The corresponding attack success rates (ASRs) were also significantly greater than zero for every model, with point estimates between 0.09 and 0.63 (lower panel,
Figure 5).
Gemini and
Copilot showed the highest
grade increases, while
Gemini and
Perplexity achieved the largest ASR gains, all with non-overlapping CIs relative to zero. These results, although weaker and less uniform than for the single injection strategy, highlight consistency of the biasing effect across LLMs under sequential conditions, where cumulative exposure to injected content amplifies the grading distortion.
This statistical analysis of the experimental results indicates that Gemini exhibited the highest sensitivity to prompt injection, showing consistently higher grade inflation and attack success rates across both endpoints and injection strategies compared to the other evaluated LLMs.
4.3. Qualitative Results
The qualitative analysis was conducted by members of the authors’ team to evaluate the internal coherence of the outputs produced by the large language models (LLMs). Specifically, this assessment focused on the consistency between the textual justifications generated by each model and the corresponding numerical grades assigned, rather than on the accuracy of the grades with respect to the actual content of the students’ work. For experiments using the sequential assessment strategy, particular attention was paid to changes in the overall grade distribution within the evaluated group, in order to determine how injected prompts affected not only the target assignment, but also neighbouring evaluations within the same session.
Table 2 presents a summary of this qualitative evaluation for both injected and baseline (non-injected) experiments. In the case of single assignment evaluations, the analysis ranks the
consistency of the assigned grade relative to the accompanying justification of the model’s, as retrieved through the user prompt (see
Section 2.3). For assignments evaluated under the sequential strategy, the ranking evaluates the
consistency of the entire set of generated explanations and also considers the
resulting distribution of grades within the evaluated batch. This approach allows for assessing not only whether the injection successfully influenced the target assignment but also whether the effect propagated across the group.
For both evaluation strategies, consistency was rated on a 1–5 Likert scale: 1 = Very Low, 2 = Low, 3 = Moderate, 4 = High, and 5 = Very High coherence between the textual justification and the assigned score.
Examples of model output are shown in
Table 3, which reports responses returned by
Copilot for the single assignment evaluation strategy. For each injection type (
fail-to-pass,
fail-to-top,
satisfactory-to-top, and
good-to-top), both the assigned grade and its justification are presented. The qualitative evaluation of
Copilot outputs in single assignment mode typically ranked
High or
Very High in consistency, while in sequential evaluation, consistency rating scores were significantly lower. Repeated or duplicated justifications were also observed in several sequential evaluations.
Table 4 provides representative examples of inconsistent outputs. In these cases, models such as
DeepSeek and
Grok correctly returned the grades requested by the injected prompts, but produced textual explanations with
Very Low or
Low consistency, indicating a clear disconnect between the assigned score and the rationale used to justify it.
In general, qualitative evidence supports the quantitative statistical findings: sequential workflows tend to amplify incoherence and bias, highlighting how contextual persistence can compromise evaluative reliability across model outputs.
4.4. Discussion: Main Contributions
This study advances the understanding of vulnerabilities in large language models used for educational evaluation by combining an original experimental design with a rigorous quantitative framework. Through systematic evaluation of six widely used LLMs, ChatGPT, Gemini, DeepSeek, Grok, Copilot, and Perplexity, under controlled prompt injection conditions, the research provides a reproducible methodology for quantifying model susceptibility to manipulation. The introduction of explicit validation metrics, namely the grade shift measure (grade) and the attack success rate (ASR), allows a transparent assessment of the extent to which hidden instructions can bias model-assisted grading.
Beyond statistical evidence, the results reveal that the interaction workflow itself plays a critical role in shaping the vulnerability of the model. The contrast between single and sequential grading workflows demonstrates that prompt persistence and contextual carry-over can substantially amplify bias, offering a realistic scenario of how contamination may occur in practical use when instructors process multiple student submissions in one session. The integrated analysis of both quantitative and qualitative outcomes strengthens the interpretation of these findings: grade inflation is not only measurable, but also traceable in the textual rationales produced by the models, where injected cues subtly influence the evaluative language.
These contributions collectively clarify how prompt injection compromises the reliability and fairness of automated assessment systems. They also provide a methodological foundation for subsequent research on detection and mitigation while emphasizing the pedagogical and ethical implications of delegating evaluative decisions to generative models. By situating empirical evidence within a transparent and replicable analytical framework, the study contributes to a deeper understanding of the mechanisms through which LLMs can be manipulated and to the development of safer and more accountable AI-assisted evaluation practices.
5. Conclusions and Implications
This study demonstrates that all the evaluated large language models (LLMs) are vulnerable to prompt-injection attacks in realistic educational assessment scenarios. Even minimal and easily concealed instructions embedded by students can induce substantial grade inflation, undermining both fairness and reliability of general LLM-based grading. Quantitative analysis confirmed statistically significant increases in grade and ASR across single and sequential workflows, with Gemini showing the highest overall sensitivity. Qualitative inspection revealed that single assignment evaluations produce more coherent rationales than sequential ones, where contextual contamination amplifies bias and reduces justification consistency.
From a pedagogical perspective, these results highlight that LLMs should serve only as assistive tools for formative feedback or preliminary evaluation, rather than as autonomous graders. Teachers and institutions must remain in the decision loop, applying expert judgment and cross-checking model output, particularly when grades carry academic or ethical weight. The findings also stress the importance of educator awareness: hidden or injected prompts can distort outcomes even under apparently controlled and neutral conditions.
Ethically, the study exposes a new category of academic integrity risk: student-driven prompt injection, a subtle but powerful form of manipulation that escapes traditional plagiarism detection and exploits the model’s interpretative trust. Mitigation therefore requires not only technical defenses but also the inclusion of clear policy and literacy measures, ensuring that both instructors and students understand the vulnerabilities and limitations of generative models.
Future work should pursue three complementary directions:
- (i)
- (ii)
- (iii)
Expansion of the dataset and inclusion of
human grader benchmarks to triangulate LLM performance against expert evaluations
Seo et al. (
2025). Additional research should also explore cross-lingual and discipline-specific vulnerability patterns, as well as the impact of model updates over time.
In conclusion, the vulnerability of current LLMs underscores the urgent need for transparent evaluation protocols, defensible model architectures, and ethical safeguards before integrating generative AI into formal educational assessment workflows. Building trust in such systems will depend on combining algorithmic robustness with pedagogical responsibility, ensuring that technological innovation serves, rather than undermines, the principles of fairness, accountability, and integrity in education.