Article

EvoShield: Selective Test-Time Adaptation for Prompt Injection Detection via Active LLM Querying

1
School of Mathematics and Computer Science, Guangdong Ocean University, Zhanjiang 524088, China
2
School of Electronic and Information Engineering, Guangdong Ocean University, Zhanjiang 524088, China
*
Author to whom correspondence should be addressed.
Mathematics 2026, 14(10), 1719; https://doi.org/10.3390/math14101719
Submission received: 15 April 2026 / Revised: 7 May 2026 / Accepted: 14 May 2026 / Published: 16 May 2026
(This article belongs to the Special Issue Big Data Mining and Knowledge Graph with Application)

Abstract

Prompt injection detection is commonly studied as a static offline classification problem, yet deployed LLM systems face evolving attacks and distribution shift after deployment. Static detectors are therefore poorly matched to the threat model, while routing every input to a stronger external LLM is costly and defeats the purpose of a local detector. We formulate prompt injection detection as a selective test-time adaptation problem. Our framework combines a prompt-based local detector built on masked language modeling and a learnable soft verbalizer with an entropy-based active querying mechanism that escalates only high-uncertainty inputs to an external LLM. Queried hard samples are then stored in a review window and replayed for subsequent detector updates. Empirical evaluations across multiple benchmarks show that EvoShield achieves performance on par with or even exceeding pure Large Language Model baselines, while cutting API query costs by more than 85%.

1. Introduction

Prompt injection is no longer a minor failure mode of large language models (LLMs). It has become a practical attack vector for LLM-based applications and agents that process untrusted user inputs, retrieved documents, tool outputs, and external web content. Liu et al. showed that deployed applications are vulnerable to prompt injection at a meaningful scale, and AgentDojo further demonstrated that this problem should be studied in dynamic agent environments rather than on a small set of isolated prompt examples [1,2]. In this setting, a detector is not simply an auxiliary moderation component. It acts as a control point that determines whether untrusted text can affect downstream reasoning and action selection.
Much of the existing literature still frames prompt injection detection as a standard offline classification problem. Layered screening systems and multi-agent filtering pipelines can improve coverage, but they remain fixed after deployment [3,4]. Other defenses take the opposite approach and use a stronger LLM to identify or remove malicious instructions before execution [5]. A third line of work uses internal model states and shows that prompt injection can be detected from intrinsic LLM representations without generating responses [6]. These are meaningful advances, but their limitations are straightforward. A static detector assumes that the attack distribution at deployment is similar to the one seen during training. An LLM-based guard may improve recall, but it also introduces additional latency, cost, and dependence on another model that must itself remain reliable under adversarial prompting. Internal-feature detectors reduce some of this overhead, but they are still static and often require access to model internals that may be unavailable in many deployment settings.
The threat model is more demanding than these assumptions suggest. Indirect prompt injection expands the attack surface beyond explicit user prompts to retrieved or tool-generated content, making the detector’s input distribution broader and less stable [7]. Recent empirical studies further show that prompt injection and jailbreak detectors can be bypassed by adaptive evasion strategies, including simple character-level perturbations and more systematic adversarial manipulations [8]. Under these conditions, evaluating a fixed detector on a fixed benchmark split is a convenient protocol, but not an adequate threat model. If the attacker can adapt online while the detector remains unchanged, then the detector is solving a different problem from the one faced in practice.
This project begins from that mismatch between the threat model and the dominant defense paradigm. In deployment, a detector must make a sequence of guarded decisions under a distribution that may change after release: whether the current input can be handled locally, whether additional supervision is worth the cost, and how newly observed difficult cases should affect later predictions. We therefore formulate prompt injection detection as a selective test-time adaptation problem. EvoShield instantiates this formulation with a prompt-based local detector whose class distribution provides both a prediction and an uncertainty signal. The detector is built on masked language modeling with a learnable soft verbalizer, following PTE [9], so that few-shot adaptation can be performed in the vocabulary space of the underlying language model without relying on a fixed hand-written label-word mapping. During deployment, predictive entropy determines when external supervision is requested from an LLM. When a queried response yields a parsable label, the sample is used for immediate adaptation and retained in a bounded review window, allowing recently encountered uncertain cases to influence subsequent local predictions and routing behavior. In this way, external supervision is spent on the parts of the stream where the current detector is least certain, and the detector is updated from the same difficult cases that triggered escalation.
The technical contribution of EvoShield is framed at the level of the deployment problem it addresses. Prompt injection detection is treated as an online decision problem in which the detector must decide when to trust its local prediction, when to seek external supervision, and how to incorporate newly observed difficult cases during deployment. This formulation makes external supervision depend on the detector’s current uncertainty and uses the resulting hard cases to update later local decisions, thereby moving the task beyond a fixed offline classifier evaluated once on a static split. Based on this formulation, our contributions are summarized as follows:
  • We formulate prompt injection detection under evolving deployment streams as a selective test-time adaptation problem, shifting the task from static offline classification to online guarded decision making.
  • We design a deployment-oriented adaptation protocol in which predictive uncertainty controls external supervision, and parsable queried labels are reused to update the local detector on recently encountered difficult cases.
  • We validate EvoShield on four prompt injection and jailbreak detection benchmarks, showing that selective test-time adaptation can preserve strong detection performance while substantially reducing external API calls.

2. Related Work

Prompt Injection and Existing Guardrails. Prompt injection follows from a simple but inconvenient fact: LLM-integrated applications routinely concatenate inputs of different provenance into a single context, while the model lacks a native mechanism to enforce instruction priority by source; empirical studies show that this is sufficient to yield exploitable failures in deployed settings and agent-style workflows [1,2]. Most defenses therefore insert a guardrail before downstream execution, but the literature tends to operationalize this guardrail as a static artifact. Deployable detectors treat injection detection as supervised classification and aim to fit realistic deployment constraints, with PromptShield as a representative baseline [10]. More recent prompt-injection guardrails also study over-defense and mitigation-oriented detection, as in PIGuard and InjecGuard [11,12]. Pipeline-style guardrails combine rules, lightweight models, and auxiliary checks, including frameworks such as Palisade and multi-agent mitigation pipelines [3,4]. LLM-as-guard approaches prompt a stronger model to detect and remove injected instructions, shifting the burden from model training to prompting and API usage, as in PromptArmor [5]. Internal-feature detectors seek signals that do not require response generation, with PIShield exploiting intrinsic representations for detection [6]. Closely related jailbreak-detection work has also explored linguistic features and lightweight sentinels for real-time monitoring [13,14]. These lines are useful, but they usually assume that a detector or guard prompt, once chosen, remains fixed at deployment, which is a methodological convenience rather than a security argument.
Indirect Injection and Evasion Against Detection. Indirect prompt injection broadens the threat surface by embedding malicious instructions in retrieved documents or tool outputs, producing longer, more heterogeneous contexts and weakening the assumption that the deployment distribution resembles any curated training split [7,15]. Defenses such as spotlighting attempt to encode provenance cues in the prompt so the model can separate untrusted content from privileged instructions, which is a reasonable patch for a missing interface primitive, not a guarantee [16]. The more serious issue is adaptivity: once guardrails are visible, attackers can search for bypasses. Evidence from both indirect-injection evaluations and broader guardrail studies indicates that adaptive attacks can defeat multiple proposed defenses and that even simple perturbations can evade prominent detection systems [8,17]. Taken together, this literature implies that prompt injection detection is better modeled as a non-stationary, adversarial stream than as a static benchmark classification task, which makes the absence of an adaptation protocol in many guardrail designs a central gap rather than an implementation detail.
Test-Time Adaptation and Active Learning. Test-time adaptation (TTA) updates a model at inference time to mitigate distribution shift, with Tent establishing entropy minimization as a canonical unsupervised objective in the fully test-time setting [18]. Subsequent work shows that naive per-sample adaptation can be computationally heavy and can degrade in-distribution performance via catastrophic forgetting, motivating selective updates and regularization, as in EATA [19]. The broader challenge of maintaining robust neural dynamics under noisy or adversarial perturbations has also been studied in engineering applications, where noise-resistant gradient-based methods demonstrate that carefully designed update rules can preserve system stability even under sustained disturbance [20]. More recent work has started to connect test-time adaptation with active querying and test-time learning, including active test-time adaptation and test-time learning for large language models [21,22]. Active learning addresses the complementary problem of expensive supervision by querying an oracle only for informative points; uncertainty sampling is a standard strategy, surveyed by Settles [23]. In a prompt-injection deployment, these two lines suggest a protocol that most guardrail papers do not make explicit: a local detector handles routine inputs; uncertainty triggers selective escalation to a stronger but costly oracle; and the acquired labels are used to update the detector so that the decision rule tracks the evolving attack distribution, rather than freezing a classifier or a guard prompt and hoping the attacker does not adapt.

3. Materials and Methods

3.1. Framework Overview

We formulate prompt injection detection as a selective test-time adaptation problem over an input stream. As illustrated in Figure 1, at each time step, the system receives an input text $x_t$ and predicts whether it is benign or attack-like. The framework consists of a local prompt-based detector, an uncertainty estimator, and a review-based adaptation module. The local detector produces class logits and a predictive distribution for each input. The uncertainty estimator computes the entropy of that distribution and uses it to route the sample. The adaptation module stores recently queried difficult cases and periodically updates the local detector on them. The overall procedure is simple. When the detector is confident, the system uses the local prediction. When the detector is uncertain, the system sends the sample to an external LLM, uses the returned label as supervision when that label can be parsed, and adds the sample to a bounded review window for later replay. This design addresses the failure mode discussed in Section 1, namely, a fixed detector operating under an evolving attack distribution. The framework should therefore be understood as an online adaptation process with three linked decisions: uncertainty determines when external supervision is requested, the returned label determines how the local detector is updated, and the updated detector in turn changes future routing decisions.

3.2. Prompt-Based Local Detector

The local detector uses a prompt-based masked classification design rather than a standard [CLS] classifier. This choice matches the motivation in Section 1: the detector should work well with limited supervision while remaining compatible with token-level prompt semantics. Given an input text, the model first converts it into a masked prompt and feeds the resulting sequence into a pretrained masked language model. Let $\ell(x) \in \mathbb{R}^{|V|}$ denote the vocabulary-level logits at the masked position for input $x$. Rather than relying on a fixed hand-written one-token verbalizer, the model maintains a mapping from vocabulary items to classes together with learnable token weights. If $V_c$ denotes the verbalizer set for class $c$ and $w_v$ denotes the learned weight for token $v$, the class logit is computed by aggregating weighted token logits over the verbalizer set,
$$z_c(x) = \frac{1}{|V_c|} \sum_{v \in V_c} w_v \, \ell_v(x).$$
The resulting class logits are then used for prediction and uncertainty estimation. We do not present this modeling choice as a general solution to prompt injection. Its role is more limited and easier to justify: it keeps the decision rule in the vocabulary space used by the underlying language model, while avoiding a manually fixed verbalizer that may not hold under distribution shift. For this reason, the method draws on prompt-based tuning with learnable verbalizers rather than a conventional discriminative head [9].
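The aggregation step can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function name `class_logits`, the toy six-token vocabulary, the token ids, and the weight values are all hypothetical, and in the actual detector the vocabulary logits would come from the masked position of the language model and the weights would be learned.

```python
from typing import Dict, List


def class_logits(
    vocab_logits: List[float],
    verbalizer: Dict[str, List[int]],
    token_weights: Dict[int, float],
) -> Dict[str, float]:
    """Aggregate masked-position vocabulary logits into one logit per class.

    For each class c, this computes the mean over the verbalizer set V_c of
    (token weight w_v) * (vocabulary logit of token v).
    """
    z = {}
    for label, token_ids in verbalizer.items():
        weighted = [token_weights[v] * vocab_logits[v] for v in token_ids]
        z[label] = sum(weighted) / len(token_ids)
    return z


# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
vocab_logits = [0.5, 2.0, -1.0, 3.0, 0.0, 1.0]    # logits at the masked position
verbalizer = {"benign": [0, 1], "attack": [3, 5]}  # class -> verbalizer token ids
token_weights = {0: 1.0, 1: 0.5, 3: 1.0, 5: 2.0}   # stand-ins for learned weights

z = class_logits(vocab_logits, verbalizer, token_weights)
print(z)  # {'benign': 0.75, 'attack': 2.5}
```

Because the weights enter the class logit linearly, adapting them changes how much each verbalizer token contributes without changing the label-word sets themselves.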
The use of bert-base-uncased is therefore a backbone choice within the cloze-style masked-language-model paradigm. First, PET reformulates classification as masked-token prediction over patterns and verbalizers [24], which requires a masked-language-model backbone that can score vocabulary tokens at the masked position. Existing PET analysis shows that this formulation can be instantiated with different masked-language-model backbones, including ALBERT and RoBERTa, which obtain average scores of 71.8 and 63.7 on selected SuperGLUE tasks, respectively [25]. Second, prior prompt-tuning results show that BERT itself is also a valid backbone for this paradigm. In LM-BFF, prompt-based tuning improves BERT-large from 79.5 to 85.6 on SST-2 and from 51.4 to 59.2 on SNLI compared with standard fine-tuning [26]. Based on these results, we select bert-base-uncased as a canonical, reproducible, medium-size masked-language-model backbone so that the experiments can focus on EvoShield’s routing and review-based adaptation mechanism. RoBERTa, ALBERT, and DistilBERT are compatible alternatives for stronger or more efficient local detectors; for example, DistilBERT is reported to be 40% smaller, to be 60% faster, and to retain 97% of BERT’s language-understanding capability [27]. We regard a full architecture-efficiency comparison among these backbones as complementary future work.

3.3. Uncertainty-Triggered LLM Query

The local detector is not assumed to be equally reliable on every input. After the class logits are computed, the system forms the predictive distribution
$$p(c \mid x) = \mathrm{softmax}(z(x))_c,$$
and measures uncertainty using Shannon entropy,
$$H(x) = -\sum_{c \in Y} p(c \mid x) \log p(c \mid x).$$
This uncertainty estimate serves two purposes. It identifies samples near the current decision boundary and defines the query strategy for acquiring additional supervision. In this respect, the framework follows the central idea of active learning: supervision is a limited resource and should be used on informative samples rather than applied uniformly across the input stream [23]. Let $\tau$ denote the entropy threshold. If $H(x) \le \tau$, the sample is treated as sufficiently easy and the system accepts the local prediction. If $H(x) > \tau$, the sample is treated as informative and sent to an external LLM. We treat this model as a fallible auxiliary supervision source rather than as a perfectly reliable oracle, since its response may still be affected by prompt injection, output-format violations, or ordinary prediction errors. The decision rule is
$$\hat{y}(x) = \begin{cases} \hat{y}_{\mathrm{local}}(x), & H(x) \le \tau, \\ \hat{y}_{\mathrm{LLM}}(x), & H(x) > \tau. \end{cases}$$
This design is intentionally selective. Querying every sample would reduce the method to an expensive LLM-based filter and remove the efficiency advantage of the local detector. Querying only high-entropy samples instead focuses external supervision on the cases where the detector is least certain. Because external LLM supervision is not assumed to be perfect, the framework checks whether the returned label is parsable before using it for adaptation. If no valid class label can be extracted, the sample is skipped rather than assigned a default class. In a security setting, this distinction matters because an output-format failure should not be treated as evidence that the input is benign.
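The routing rule and the parsability check can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the names `route`, `parse_label`, and `query_llm` are hypothetical, entropy is taken in nats (the text does not state the logarithm base), and the substring-based parser is a deliberately simple stand-in for whatever label-parsing protocol is actually used.

```python
import math
from typing import Callable, List, Optional, Tuple

LABELS = ("benign", "attack")  # assumed label names for the binary task


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def parse_label(response: str) -> Optional[str]:
    """Return a label only if exactly one known label occurs in the response."""
    lowered = response.lower()
    found = [label for label in LABELS if label in lowered]
    return found[0] if len(found) == 1 else None


def route(
    logits: List[float], tau: float, query_llm: Callable[[], str]
) -> Tuple[Optional[str], str]:
    """Accept the local prediction when H(x) <= tau, otherwise query the LLM."""
    probs = softmax(logits)
    if entropy(probs) <= tau:
        return LABELS[probs.index(max(probs))], "local"
    label = parse_label(query_llm())
    if label is None:
        # Unparsable output: skip the sample, never default to "benign".
        return None, "skipped"
    return label, "llm"


confident = route([4.0, -4.0], 0.2, lambda: "attack")
uncertain = route([0.1, 0.0], 0.2, lambda: "Label: attack")
malformed = route([0.0, 0.0], 0.2, lambda: "I cannot decide.")
print(confident, uncertain, malformed)
```

In this sketch the external model is only invoked on the high-entropy branch, so the callable `query_llm` is never executed for confident inputs.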

3.4. Review-Based Test-Time Adaptation

The LLM is not used only as an arbitration layer. Queried labels are fed back to the local detector through a replay-based review mechanism. Whenever a high-entropy sample receives a parsable LLM label, the pair $(x_t, y_t^{\mathrm{LLM}})$ is added to a bounded review window. The bounded review window lets the detector rapidly reuse recently queried hard cases, but it may also bias local updates toward newly observed difficult samples rather than representing the full deployment distribution. At a configurable review frequency, the buffered texts are re-encoded into the same prompt-based representation and used for additional optimization steps. If $B_t$ denotes the current review window, the review objective is
$$\mathcal{L}_{\mathrm{review}}(\theta) = \frac{1}{|B_t|} \sum_{(x, y) \in B_t} \mathcal{L}_{\mathrm{cls}}(f_\theta(x), y).$$
This mechanism is intentionally local and recent rather than global and retrospective. The goal is not to reconstruct an ideal training set online. The goal is to repeatedly expose the detector to the difficult cases that its own uncertainty estimator has already identified as problematic. Because the review window is recent and difficulty-focused, local forgetting remains possible: if online updates make the detector less stable on standard benign inputs or previously familiar patterns, the affected samples may receive higher predictive entropy and therefore be escalated to the external LLM. In EvoShield, the consequence of such forgetting differs from that in a standalone classifier: it is expected to have limited impact on end-to-end detection accuracy under the selective-routing design, because uncertain cases can still be handled by the external LLM; its primary practical consequence is instead an increase in API calls and deployment cost. Samples that are not selected for LLM supervision, or whose LLM outputs cannot be parsed, are excluded from the LLM-supervised loss. Accordingly, the review buffer only stores queried samples with valid parsed labels, and external labeling failures are not directly converted into training targets for the local detector. Successful queried samples, by contrast, provide both immediate supervision and replay supervision in later updates. The method is therefore not a static cascade. The local detector handles routine inputs, the uncertainty estimator identifies cases where the detector is weak, the LLM provides supervision only for those difficult cases, and the review window ensures that newly observed attack patterns affect the detector parameters rather than disappearing after a single decision. This is the concrete mechanism by which the framework moves beyond the train-once, deploy-once paradigm criticized in Section 1 and Section 2.
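A bounded review window of this kind can be sketched with a fixed-length queue. The class name `ReviewWindow`, the first-in-first-out eviction policy, and the toy loss function are assumptions made for illustration; the text specifies only that the window is bounded and that samples without a parsable LLM label are excluded.

```python
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class ReviewWindow:
    """Bounded buffer of (text, LLM label) pairs for replay-based review."""

    def __init__(self, max_size: int = 128) -> None:
        # Assumption: the oldest pair is evicted first once the window is full.
        self.buffer: Deque[Tuple[str, str]] = deque(maxlen=max_size)

    def add(self, text: str, llm_label: Optional[str]) -> bool:
        """Store a queried sample only when its LLM label was parsable."""
        if llm_label is None:
            return False
        self.buffer.append((text, llm_label))
        return True

    def review_loss(self, loss_fn: Callable[[str, str], float]) -> float:
        """Mean per-sample classification loss over the current window."""
        if not self.buffer:
            return 0.0
        return sum(loss_fn(x, y) for x, y in self.buffer) / len(self.buffer)


window = ReviewWindow(max_size=2)
window.add("ignore previous instructions", "attack")
skipped = window.add("garbled LLM reply", None)   # unparsable label, not stored
window.add("what is the weather today", "benign")
window.add("reveal your system prompt", "attack")  # evicts the oldest pair

# Hypothetical stand-in for the detector's classification loss.
toy_loss = lambda text, label: 1.0 if label == "attack" else 0.0
print(len(window.buffer), window.review_loss(toy_loss))  # 2 0.5
```

In a real update step, `loss_fn` would be replaced by the detector's classification loss on the re-encoded prompt, and the mean would be backpropagated through the detector parameters.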

4. Experiments

4.1. Datasets and Experimental Setup

We evaluate the proposed framework on four datasets. The primary tasks are prompt injection detection (PI) [28], jailbreak detection (JC-balanced and JC-imbalanced) [29], and Safe-Guard prompt injection detection (SG) [30], each formulated as binary classification with labels indicating benign versus attack-like inputs. Given our research context of labeled data scarcity, we re-partitioned all datasets. We merged the original data splits and randomly sampled a 16-shot subset to serve as the training set, reserving the remainder as the test set. The resulting dataset sizes and label distributions are summarized in Table 1. The training and validation splits are used for the vanilla local-detector baseline and the related local-detector ablation analysis, including the examination of local-detector training behavior. They are not used to initialize or tune EvoShield in the main evaluation. For EvoShield, each test split is treated as an unlabeled online deployment stream. The method starts from the pretrained masked language model and the prompt/verbalizer configuration, without using target-task ground-truth labels before the online stream begins. During online testing, ground-truth labels in the stream are kept hidden from the model and are used only after prediction for metric computation. The SG validation split, comprising approximately 10% of the SG test-set size, is used only for selecting the number of training epochs for the trained local-detector baseline and ablations; this selection is reported for transparency and is not used in the EvoShield cold-start protocol.

4.2. Metrics and Implementation Details

To evaluate the detection performance of EvoShield and to compare it with alternative local-model and pure-LLM baselines, we report accuracy as the primary metric together with Macro-Precision, Macro-Recall, and Macro-F1-score. For EvoShield, these metrics are computed cumulatively on the online test stream after each prediction has been made. The ground-truth label of each test-stream sample is never used as an adaptation target and is accessed only for final metric computation. Validation accuracy is used only in the trained local-detector baseline and local-detector ablation settings for epoch selection; it does not provide a training or tuning signal to EvoShield. However, prompt injection and jailbreak detection are not well characterized by accuracy alone, since a model may achieve a superficially reasonable accuracy while performing poorly on one class. The macro metrics are computed over the class set Y and give equal weight to each class, which is appropriate for security detection settings where performance on the attack class is as important as performance on the benign class.
Formally, let $\mathrm{TP}_y$, $\mathrm{FP}_y$, and $\mathrm{FN}_y$ denote the numbers of true positives, false positives, and false negatives for class $y \in Y$, respectively. The per-class precision is defined as
$$P_y = \frac{\mathrm{TP}_y}{\mathrm{TP}_y + \mathrm{FP}_y}, \qquad \text{Macro-Precision} = \frac{1}{|Y|} \sum_{y \in Y} P_y.$$
Similarly, the per-class recall is defined as
$$R_y = \frac{\mathrm{TP}_y}{\mathrm{TP}_y + \mathrm{FN}_y}, \qquad \text{Macro-Recall} = \frac{1}{|Y|} \sum_{y \in Y} R_y.$$
The per-class F1 score is the harmonic mean of precision and recall,
$$\mathrm{F1}_y = \frac{2 P_y R_y}{P_y + R_y}, \qquad \text{Macro-F1} = \frac{1}{|Y|} \sum_{y \in Y} \mathrm{F1}_y.$$
For security-oriented error analysis, we further report metrics that directly measure attack misses and benign-sample rejection. We treat the attack class, including prompt-injection and jailbreak labels, as the positive class. Attack Recall measures how many attack samples are correctly detected, while the false negative rate (FNR) measures how many attack samples are missed:
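$$\text{Attack Recall} = \frac{\mathrm{TP}_{\mathrm{atk}}}{\mathrm{TP}_{\mathrm{atk}} + \mathrm{FN}_{\mathrm{atk}}}, \qquad \mathrm{FNR} = \frac{\mathrm{FN}_{\mathrm{atk}}}{\mathrm{TP}_{\mathrm{atk}} + \mathrm{FN}_{\mathrm{atk}}}.$$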
We also report the false rejection rate (FRR), which measures how often benign samples are incorrectly rejected as attacks:
$$\mathrm{FRR} = \frac{\mathrm{FP}_{\mathrm{atk}}}{\mathrm{FP}_{\mathrm{atk}} + \mathrm{TN}_{\mathrm{atk}}}.$$
These metrics complement Macro-F1 by separating two security-relevant error types: missed attacks and false alarms on benign inputs. We additionally use the lowest Attack Recall or highest FNR across evaluated settings to describe worst-case attack-detection performance.
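The macro and security-oriented metrics defined in this subsection can be computed from paired label lists, as in the following sketch. The function name `security_metrics`, the label strings, and the convention of returning 0.0 for an empty denominator are assumptions for illustration, and the ten-sample example is synthetic rather than taken from the reported experiments.

```python
from typing import Dict, Sequence


def safe_div(num: float, den: float) -> float:
    """Division that returns 0.0 for an empty denominator (an assumed convention)."""
    return num / den if den else 0.0


def security_metrics(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    labels: Sequence[str] = ("benign", "attack"),
    attack: str = "attack",
) -> Dict[str, float]:
    """Macro precision/recall/F1 plus Attack Recall, FNR, and FRR."""
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec, rec = safe_div(tp, tp + fp), safe_div(tp, tp + fn)
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(safe_div(2 * prec * rec, prec + rec))
    # Counts with the attack class treated as the positive class.
    tp_atk = sum(t == attack and p == attack for t, p in pairs)
    fn_atk = sum(t == attack and p != attack for t, p in pairs)
    fp_atk = sum(t != attack and p == attack for t, p in pairs)
    tn_atk = sum(t != attack and p != attack for t, p in pairs)
    return {
        "macro_precision": sum(precisions) / len(labels),
        "macro_recall": sum(recalls) / len(labels),
        "macro_f1": sum(f1s) / len(labels),
        "attack_recall": safe_div(tp_atk, tp_atk + fn_atk),
        "fnr": safe_div(fn_atk, tp_atk + fn_atk),
        "frr": safe_div(fp_atk, fp_atk + tn_atk),
    }


# Synthetic example: 4 attacks (1 missed) and 6 benign inputs (1 false alarm).
y_true = ["attack"] * 4 + ["benign"] * 6
y_pred = ["attack"] * 3 + ["benign"] * 6 + ["attack"]
m = security_metrics(y_true, y_pred)
print(m)
```

In this example one of four attacks is missed, so Attack Recall is 0.75 and FNR is 0.25, while one of six benign inputs is rejected, so FRR is 1/6.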
We implement all experiments in PyTorch 2.1.2 [31] based on HuggingFace Transformers 4.36.2 [32]. All experiments were conducted on an NVIDIA A40 GPU with 48 GB of GPU memory (Nvidia, Santa Clara, CA, USA). The local detector uses bert-base-uncased as the masked language model backbone, and parameter updates are optimized with AdamW [33]. Unless otherwise stated, the optimizer and training hyperparameters are as follows: batch size 32, backbone learning rate $5 \times 10^{-5}$, verbalizer-weight learning rate 0.55, Adam $\epsilon$ of $10^{-8}$, maximum gradient norm of 1.0, cosine learning-rate scheduling, no warmup, and a maximum input sequence length of 512. For deployment on retrieved documents or other long-context RAG inputs, this length limit should be applied per detection chunk rather than by blindly truncating the whole document: long inputs can be divided into overlapping chunks, EvoShield can classify each chunk separately, and a document-level alert is raised when any chunk is predicted as attack-like or routed as high-uncertainty. We use random seed 42 for all experiments. For the trained local-detector baseline and local-detector ablations, we perform a grid search over the number of training epochs on the SG validation split, ranging from 100 to 1000 with a step size of 100, and select the final configuration according to validation accuracy. The selected number of epochs is 200 and is applied only to the corresponding trained local-detector comparisons. In the main EvoShield evaluation, by contrast, the detector is cold-started from bert-base-uncased and the prompt/verbalizer configuration, without using target-task training or validation labels before deployment. EvoShield then consumes the test split as an unlabeled online stream. At each time step, it first predicts the incoming sample locally and computes predictive entropy; when the entropy exceeds the threshold, the sample is routed to the external LLM.
Only successfully parsed LLM-returned labels are added to the review window and used for online updates, whereas the hidden ground-truth labels are reserved solely for evaluation. In the main experiments, we enable external LLM routing by default and use an entropy threshold of 0.2 together with a review window size of 128. These values are default operating settings rather than globally optimal constants: the entropy threshold controls the trade-off between conservative escalation and API usage, while the review window size controls how many recently queried hard cases are reused for local adaptation. We therefore report a dedicated sensitivity analysis in Section 5.2 to make this cost–performance relationship explicit. Review updates are triggered at every training step. External LLMs are accessed through an OpenAI-compatible API interface with temperature set to 0. Under the same prompting template and label-parsing protocol, we evaluate Claude Sonnet 4.6, Gemini 3.1 Pro Preview, GPT-5.2, and Grok-4.1 Fast Reasoning.
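The per-chunk handling of long inputs mentioned in this subsection can be sketched as follows. The function names, the stride of 256 tokens, and the three-way chunk verdict (`"benign"`, `"attack"`, `"uncertain"`) are assumptions for illustration; the text specifies only that chunks overlap, that each chunk is classified separately, and that any attack-like or high-uncertainty chunk raises a document-level alert.

```python
from typing import Callable, List, Sequence


def overlapping_chunks(
    tokens: Sequence[int], max_len: int = 512, stride: int = 256
) -> List[Sequence[int]]:
    """Split a token sequence into windows of at most max_len tokens.

    Consecutive windows start `stride` tokens apart, so they overlap by
    max_len - stride tokens (the stride value here is an assumed choice).
    """
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("require max_len > 0 and 0 < stride <= max_len")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return chunks


def document_alert(
    tokens: Sequence[int],
    classify_chunk: Callable[[Sequence[int]], str],
    max_len: int = 512,
    stride: int = 256,
) -> bool:
    """Raise an alert if any chunk is attack-like or routed as uncertain."""
    return any(
        classify_chunk(chunk) != "benign"
        for chunk in overlapping_chunks(tokens, max_len, stride)
    )


tokens = list(range(10))                      # toy "document" of 10 token ids
chunks = overlapping_chunks(tokens, max_len=4, stride=2)
print(chunks)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]

# Hypothetical chunk classifier: token id 7 marks an injected instruction.
flag_seven = lambda chunk: "attack" if 7 in chunk else "benign"
print(document_alert(tokens, flag_seven, max_len=4, stride=2))  # True
```

The overlap matters because an injected instruction that straddles a chunk boundary would otherwise be split across two windows and might be missed by both.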

5. Results

5.1. Comparison with Baselines

5.1.1. High Performance Retention with Drastic Cost Reduction

Across all four evaluation settings, the proposed framework demonstrates the ability to maintain detection capabilities comparable to the pure language model baseline while significantly reducing the dependency on external API calls, as shown in Table 2. On the JC (Balanced) dataset utilizing claude-sonnet-4-6, the method retains 98.3% of the overall performance metric, yet it only queries the external model for 133 out of 1274 samples, which equates to an API call ratio of 10.4%. A similar efficiency is observed on the JC (Imbalanced) task with grok-4-1-fast-reasoning, where the system achieves a 99.4% performance retention while consuming a mere 6.3% of the API calls (123 compared to the baseline’s 1966). These results indicate that selective routing can substantially reduce dependence on external LLM decisions while preserving comparable detection performance.

5.1.2. Performance Gains in Specific Environments

An unexpected but notable trend is that the hybrid routing mechanism can occasionally surpass the vanilla LLM baseline in terms of overall metrics, particularly on the SG dataset. When evaluating gpt-5.2 on this task, the framework yields an overall score of 0.919 against the baseline’s 0.888, translating to a retention ratio of 103.5% while utilizing only 11.5% of the total API budget (235 out of 2052 calls). The claude-sonnet-4-6 and grok-4-1-fast-reasoning models similarly exhibit overall retention ratios exceeding 100% on the SG dataset, achieving 101.4% and 101.5%, respectively. This phenomenon suggests that the local model’s online adaptation might effectively filter out certain adversarial or ambiguous samples that would otherwise induce misclassifications or formatting failures in the pure LLM approach.
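The API call ratios and the retention ratio quoted in Sections 5.1.1 and 5.1.2 can be reproduced from the reported counts and scores with a short arithmetic check. The helper names are hypothetical, and the sketch assumes that the retention ratio is the hybrid score divided by the pure-LLM baseline score, which is consistent with the figures quoted in the text.

```python
def api_call_ratio(queried: int, total: int) -> float:
    """Percentage of stream samples escalated to the external LLM."""
    return 100.0 * queried / total


def retention_ratio(hybrid_score: float, baseline_score: float) -> float:
    """Hybrid score as a percentage of the pure-LLM baseline score (assumed definition)."""
    return 100.0 * hybrid_score / baseline_score


jc_balanced = round(api_call_ratio(133, 1274), 1)    # claude-sonnet-4-6
jc_imbalanced = round(api_call_ratio(123, 1966), 1)  # grok-4-1-fast-reasoning
sg_calls = round(api_call_ratio(235, 2052), 1)       # gpt-5.2
sg_retention = round(retention_ratio(0.919, 0.888), 1)

print(jc_balanced, jc_imbalanced, sg_calls, sg_retention)  # 10.4 6.3 11.5 103.5
```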

5.1.3. Data-Scale-Driven Routing Proportion

The fraction of samples escalated to the external model exhibits significant variance, with the PI dataset triggering the highest API query rates across all configurations, ranging from 25.7% for gemini-3.1-pro-preview up to 39.2% for claude-sonnet-4-6. This elevated proportion is fundamentally driven by the smaller total sample size of the PI test set, which contains only 630 instances. Because the absolute volume of high-entropy, difficult samples identified by the local model remains in a roughly comparable range across most tasks, typically between 120 and 300 queries, the significantly smaller denominator in the PI dataset naturally inflates the calculated query ratio. In contrast, tasks with much larger evaluation sets, such as SG with 2052 samples, dilute a similar absolute number of API calls into a much lower overall percentage, dropping into the 8.5% to 15.4% range. This indicates that the percentage-based routing efficiency is highly sensitive to the baseline volume of the evaluation stream.

5.1.4. Practical Runtime and Hardware Overhead

To complement the API-call analysis with a direct latency measurement, we further compare the wall-clock runtime of EvoShield with the corresponding pure LLM pipeline. The EvoShield runtime includes local inference, entropy-based routing, online review updates, and selective external LLM calls, and therefore reflects the practical overhead of the full detection pipeline rather than only the API-query component. As shown in Table 3, EvoShield consistently reduces total runtime across all evaluated tasks and external models. The total time reduction ranges from 63.6% to 90.5%, and the per-sample runtime is also consistently lower than that of the pure LLM baseline. These results indicate that the review-based updates introduce a measurable but bounded local adaptation cost, which is outweighed in practice by the reduction in external LLM calls.

5.2. Sensitivity Analysis of Routing and Review Parameters

The main experiments use an entropy threshold of 0.2 and a review window size of 128 as default operating settings. Because these parameters directly affect both external-query cost and detection performance, we conduct an additional sensitivity analysis with GPT-5.2 as the external supervision model. We vary one parameter at a time and report averages over the four evaluation tasks. This analysis is intended to clarify the deployment trade-off rather than to claim that a single parameter value is universally optimal.
Table 4 shows the effect of the entropy threshold. Increasing τ from 0.05 to 0.40 reduces the average LLM query ratio from 14.69% to 7.93%, and the average number of external calls from 180.5 to 101.8. This confirms that the threshold directly controls API usage. At the same time, the average Macro-F1 decreases from 0.8928 to 0.8383, showing that fewer external queries can reduce detection performance. The default threshold τ = 0.20 therefore represents a middle operating point between conservative escalation and API saving, rather than the best-performing value under every metric.
Table 5 reports the review-window-size sweep with τ fixed at 0.20. Removing the review window substantially weakens the system, with the average Macro-F1 dropping to 0.6895. Introducing a bounded review window improves performance, with window sizes of 32 and 64 reaching average Macro-F1 scores of 0.8720 and 0.8813, respectively. Larger windows generally reduce the average LLM query ratio, but they do not uniformly improve every task and may introduce additional local replay overhead. We therefore describe the default window size of 128 as a bounded-memory deployment choice rather than as a universally optimal setting.
Overall, the sensitivity results show that EvoShield exposes a tunable cost–performance trade-off. Lower entropy thresholds and smaller or moderate review windows can improve detection performance at the cost of more external LLM calls, whereas higher thresholds and larger windows can reduce API usage but may sacrifice accuracy or Macro-F1 on some tasks. In practical deployment, these two parameters can therefore be selected according to the available API budget, latency constraints, and acceptable security risk.
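The threshold trade-off can be sketched in a few lines. This is a minimal illustration, assuming Shannon entropy with the natural logarithm over a two-class probability vector (the entropy base is an assumption here) and the routing rule from Figure 1: keep the local prediction when H(x) ≤ τ, query the external LLM otherwise. The probabilities are hypothetical and are not taken from the experiments.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (natural log, an assumption) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def route(probs, tau=0.20):
    """Return 'local' when entropy <= tau, otherwise 'external'."""
    return "local" if predictive_entropy(probs) <= tau else "external"


def query_ratio(stream_probs, tau):
    """Fraction of a stream that would be escalated at threshold tau."""
    routed = [route(p, tau) for p in stream_probs]
    return routed.count("external") / len(routed)


# Hypothetical two-class probabilities, for illustration only.
stream = [(0.99, 0.01), (0.97, 0.03), (0.90, 0.10), (0.60, 0.40), (0.02, 0.98)]
ratios = {tau: query_ratio(stream, tau) for tau in (0.05, 0.20, 0.40)}

for tau, ratio in ratios.items():
    print(f"tau={tau:.2f} -> query ratio {ratio:.2f}")
```

Raising the threshold can only shrink the escalated set, which matches the monotone decrease in query ratio reported in Table 4.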

5.3. Ablation Studies

5.3.1. Active Querying Elevates Baseline Detection

Comparing the vanilla local detector with the ablation configuration lacking the review mechanism demonstrates the fundamental utility of uncertainty-triggered external routing. The full set of ablation results is reported in Table 6. The vanilla model, which relies entirely on local processing and makes zero external API calls, achieves the lowest overall performance across most evaluation settings. Introducing the external querying module without test-time updates yields immediate performance improvements on three out of four tasks. For example, on the PI dataset, the overall metric increases from 0.798 to 0.862, and on the imbalanced JC dataset, it surges from 0.863 to 0.969. However, this performance gain incurs a substantial computational burden, as the system routes a massive number of uncertain samples to the external language model, peaking at 737 API calls on the SG dataset.

5.3.2. Review Mechanism Drastically Reduces Query Costs

The integration of the review-based adaptation window serves as a critical component for maintaining system efficiency. Across all evaluated datasets, upgrading from the system without review to the full EvoShield framework consistently reduces the volume of external queries by more than fifty percent. On the JC balanced task, API calls decrease from 300 to 137, while on the PI and SG datasets, they drop from 347 to 163 and from 737 to 316, respectively. This profound reduction indicates that immediately updating the local detector parameters using recently acquired labels successfully lowers the model’s predictive entropy on subsequent similar attacks, effectively breaking the cycle of redundant external queries.
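A bounded review window of this kind can be sketched as follows. The class name, the first-in-first-out eviction policy, and the rule for discarding unparsable replies are assumptions made for illustration; the actual update step that replays the buffer into the detector is omitted.

```python
from collections import deque


class ReviewWindow:
    """Bounded buffer of (text, label) pairs labelled by the external LLM."""

    def __init__(self, max_size=128):
        # deque with maxlen drops the oldest entry once the buffer is full
        # (FIFO eviction is an assumption, not confirmed by the paper).
        self.buffer = deque(maxlen=max_size)

    def add(self, text, label):
        """Store a queried sample only if the external label was parsed."""
        if label in (0, 1):
            self.buffer.append((text, label))

    def replay_batch(self):
        """Return the stored samples for the next detector update."""
        return list(self.buffer)


# Small window so the eviction behaviour is visible.
window = ReviewWindow(max_size=3)
for i, label in enumerate([1, 0, None, 1, 1]):
    window.add(f"sample-{i}", label)

print(window.replay_batch())
```

In this toy run the unparsed reply for `sample-2` is skipped and `sample-0` is evicted once a fourth valid sample arrives.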
Table 7 reports the corresponding security-oriented metrics for EvoShield. The results show that high overall accuracy can hide different types of security error. For example, GPT-5.2 on PI has the lowest Attack Recall of 0.725 and the highest FNR of 0.275, indicating the worst attack-miss behavior among the evaluated settings. In contrast, some SG settings achieve high Attack Recall but have larger false rejection rates, meaning that they catch more attacks at the cost of rejecting more benign samples. Averaged across external LLMs, EvoShield obtains Attack Recall values of 0.948 on JC (Balanced), 0.929 on JC (Imbalanced), 0.851 on PI, and 0.915 on SG, with corresponding average FNR values of 0.052, 0.071, 0.149, and 0.085. These results make the security trade-off between missed attacks and benign false alarms explicit.
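The security-oriented metrics can be computed from a confusion matrix as sketched below. This assumes attacks are the positive class, that FNR is the share of attacks predicted benign, and that the false rejection rate is the share of benign samples flagged as attacks. The counts in the example are hypothetical; only the averaged Attack Recall and FNR pairs come from the text above.

```python
def security_metrics(tp, fn, fp, tn):
    """Attack Recall, FNR, and false rejection rate with attacks as positives."""
    return {
        "attack_recall": tp / (tp + fn),
        "fnr": fn / (tp + fn),
        "false_rejection_rate": fp / (fp + tn),
    }


# Hypothetical confusion counts, for illustration only.
example = security_metrics(tp=90, fn=10, fp=5, tn=95)
print(example)

# Reported averages from the text: Attack Recall and FNR should sum to 1.
reported = {
    "JC (Balanced)": (0.948, 0.052),
    "JC (Imbalanced)": (0.929, 0.071),
    "PI": (0.851, 0.149),
    "SG": (0.915, 0.085),
}
for task, (recall, fnr) in reported.items():
    print(f"{task:16s} recall + FNR = {recall + fnr:.3f}")
```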

5.3.3. Review Mechanism Particularly Benefits Small Sample Regimes

While the integration of the review mechanism maintains comparable overall performance across most evaluation environments, it unlocks substantial gains specifically on the PI dataset, which is characterized by its limited sample size. On the JC (Balanced), JC (Imbalanced), and SG tasks, upgrading from the ablation configuration without review to the full EvoShield framework results in only marginal fluctuations or modest gains in the overall metric. However, on the much smaller PI dataset, the review mechanism drives a dramatic increase in detection capability, elevating the overall score from 0.862 without review to 0.930. This distinct contrast demonstrates that continuously replaying and reviewing hard samples is critical for rapidly capturing evolving attack patterns when available data points are sparse.

5.4. Cross-Dataset Distribution Shift Evaluation

To further examine EvoShield under a non-stationary online stream, we construct a cross-dataset distribution-shift setting by concatenating JC (Balanced) followed by PI without shuffling. This stream first exposes the detector to jailbreak-style samples and then abruptly shifts to prompt-injection samples. The boundary occurs after 1274 JC (Balanced) samples. We use GPT-5.2 as the external supervision model, keep the entropy threshold at 0.20, and use a review window size of 128. This experiment is intended as an initial non-stationary stream evaluation rather than a complete adaptive-attack benchmark.
Table 8 shows that the distribution shift produces a clear degradation in security-oriented metrics immediately after the boundary. In the 128 samples before the boundary, Attack Recall is 0.986 and FNR is 0.014. In the first 128 samples after the boundary, Attack Recall drops to 0.500 and FNR rises to 0.500. At the same time, the query ratio increases from 0.008 to 0.086 and the average entropy increases from 0.009 to 0.050, indicating that the entropy-based routing mechanism detects increased uncertainty after the shift.
The shifted PI segment remains more difficult overall, with Attack Recall of 0.547 and FNR of 0.453. However, within the PI segment, Attack Recall improves from 0.500 in the first 128 PI samples to 0.722 in the last 128 PI samples, while FNR decreases from 0.500 to 0.278. This suggests that review-based online updates can partially adapt the detector to the new distribution, although the results should not be interpreted as evidence of full robustness against perturbation, paraphrasing, or deliberately adaptive entropy-evasion attacks.
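The before/after comparison around the boundary can be expressed as a windowed metric. The sketch below assumes binary labels with 1 denoting an attack and uses a toy stream with a boundary at index 4 and a window of 4; in the reported experiment the boundary is at 1274 and the window is 128. The function names are illustrative.

```python
def attack_recall(y_true, y_pred):
    """Recall on the attack class (label 1); NaN if the window has no attacks."""
    attacks = sum(1 for t in y_true if t == 1)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / attacks if attacks else float("nan")


def boundary_windows(y_true, y_pred, boundary, width=128):
    """Attack Recall in the `width` samples before and after the boundary."""
    before = slice(max(0, boundary - width), boundary)
    after = slice(boundary, boundary + width)
    return (
        attack_recall(y_true[before], y_pred[before]),
        attack_recall(y_true[after], y_pred[after]),
    )


# Toy stream (hypothetical): predictions degrade after the shift at index 4.
y_true = [1, 0, 1, 1, 1, 1, 0, 1]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
recall_before, recall_after = boundary_windows(y_true, y_pred, boundary=4, width=4)

print(f"before: {recall_before:.3f}  after: {recall_after:.3f}")
```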

6. Discussion

6.1. Validation of the Selective Routing Mechanism

6.1.1. Selective Division of Predictive Responsibility

Figure 2 shows that EvoShield does not rely on a single predictor uniformly across the input stream. Instead, the queried external LLM and the local detector play clearly differentiated roles. The queried-LLM trajectories in Figure 2a are generally lower and more volatile than the corresponding local-model trajectories in Figure 2b, which is consistent with the intended routing logic: the queried samples are precisely those that remain difficult under the current detector state. This pattern is especially visible on PI and SG, where the queried-LLM curves fluctuate more substantially in the early stage, while the local-model curves remain comparatively stable once low-entropy samples begin to dominate the stream.

6.1.2. Stable Local Performance on Easier Inputs

The right panel indicates that the local detector maintains high cumulative accuracy on low-entropy samples across most task-model settings. In the jailbreak benchmarks, the local-model curves rise quickly and remain near the upper end of the plotting range after the early batches, suggesting that once uncertainty drops, the detector handles routine cases reliably. A similar tendency appears on PI, although the trajectories are slightly lower and exhibit more variation than on JC, implying that prompt-injection examples induce a broader range of borderline cases. On SG, the local detector remains effective, but the curves are more dispersed across external-model settings, which suggests that the uncertainty-triggered adaptation process is more sensitive to the difficulty of this dataset.

6.1.3. Hard Cases Remain Hard Even for Strong LLMs

The left panel shows that the queried-LLM accuracy is consistently below the local-model accuracy in most settings, even though the queried model is stronger and more expensive. This does not indicate a weakness of the routing strategy; rather, it confirms that the uncertainty estimator is selecting genuinely difficult examples. In other words, the routed subset is not a random sample of the stream, but a concentrated set of hard cases. The fact that the queried-LLM curves often improve more slowly and exhibit larger oscillations than the local-model curves supports the claim that EvoShield uses external supervision where the current decision boundary is least certain, rather than where the answer is already easy.

6.2. Dynamics of Uncertainty-Driven Querying

6.2.1. Task-Dependent Query Dynamics

Figure 3 shows that the cumulative query ratio follows the expected overall pattern: as more samples are processed, EvoShield gradually accumulates external calls instead of querying at a fixed rate from the outset. The query-ratio trajectories in Figure 3a show that PI exhibits a noticeably different profile from the other three tasks. While the jailbreak benchmarks and SG tend to stabilize at relatively low cumulative query levels, the PI curves remain consistently higher throughout the stream. This difference should be interpreted with caution. PI has a much smaller test set than JC (Balanced), JC (Imbalanced), and SG, so each queried batch contributes a larger increment to the cumulative ratio. As a result, the steeper PI trajectory does not by itself imply a qualitatively different routing rule; it is partly a consequence of sample-count scale. The figure therefore suggests that the overall routing behavior is consistent with the intended design, while the apparent gap between PI and the larger tasks is amplified by differences in test-set size.

6.2.2. Uncertainty Serves as a Meaningful Routing Signal

Taken together, the two panels in Figure 3 provide evidence that EvoShield’s routing policy is driven by evolving model uncertainty rather than by a static or arbitrary budget rule. The entropy traces in Figure 3b help explain the query-ratio trajectories in Figure 3a. If the entropy signal were weakly related to sample difficulty, one would expect similar query-ratio profiles across tasks or little correspondence between entropy fluctuations and external querying. Instead, the figure shows a structured relationship: tasks with persistently higher entropy, especially PI, accumulate external calls much faster, while tasks whose entropy drops earlier, such as the jailbreak settings, require fewer queries overall. This behavior supports the interpretation that predictive entropy is functioning as an informative control signal for selective escalation, allowing EvoShield to concentrate external supervision on the parts of the stream that remain uncertain under the current detector state.
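The two quantities plotted in Figure 3 can be computed from a routing log as sketched below. The log format (one query flag and one entropy value per sample) and the batch size are assumptions for illustration.

```python
def cumulative_query_ratio(queried_flags):
    """Share of processed samples routed to the external LLM up to each step."""
    ratios, total = [], 0
    for step, flag in enumerate(queried_flags, start=1):
        total += int(flag)
        ratios.append(total / step)
    return ratios


def batch_mean_entropy(entropies, batch_size):
    """Average predictive entropy per consecutive batch of the stream."""
    means = []
    for start in range(0, len(entropies), batch_size):
        batch = entropies[start:start + batch_size]
        means.append(sum(batch) / len(batch))
    return means


# Hypothetical routing log: entropy falls and queries become rarer over time.
queried = [1, 0, 0, 1]
entropies = [0.4, 0.2, 0.1, 0.1]

ratios = cumulative_query_ratio(queried)
means = batch_mean_entropy(entropies, batch_size=2)
print(ratios)
print(means)
```

On a short stream, each query moves the cumulative ratio by a larger step, which is the sample-count effect discussed for PI.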

6.3. Limitations: High-Confidence False Negatives and Entropy-Evasion Risk

A limitation of the present routing policy is that predictive entropy is an informative but incomplete security signal. EvoShield does not guarantee that the local detector will become uncertain before every error. In particular, a malicious input may be assigned a low-entropy, high-confidence benign prediction by the local model. Such a high-confidence false negative would not exceed the entropy threshold and would therefore bypass the external LLM query under the current entropy-only decision rule. This failure mode is especially relevant in adversarial settings, because an adaptive attacker may attempt to suppress the detector’s predictive entropy rather than merely move an input close to the decision boundary.
This limitation does not invalidate the selective-routing objective, but it narrows the security claim that can be made from the current experiments. The reported entropy dynamics show that entropy is useful for concentrating external supervision on many difficult samples, yet they should not be interpreted as evidence that low-entropy inputs are always safe. For this reason, EvoShield should be understood as a cost-aware adaptation framework rather than a complete standalone defense against all prompt-injection evasion strategies. In deployments where false negatives are more costly than additional LLM calls, the entropy trigger should be complemented with additional safeguards.
Several extensions can mitigate this risk. First, a small random audit rate can route a subset of low-entropy samples to the external LLM, allowing the system to estimate and monitor high-confidence attack misses over time. Second, a dual-trigger policy can combine entropy with suspicious injection patterns or untrusted input-source signals, such as instructions that attempt to override system rules or content originating from retrieved documents, tool outputs, or third-party webpages. Third, confidence calibration can be applied to reduce overconfident local predictions before thresholding. We add this limitation to clarify that entropy-based querying is a practical routing mechanism, not a formal guarantee that all dangerous inputs will be escalated.
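The first two extensions can be combined into a single escalation rule, sketched below. Everything specific in this sketch is a hypothetical choice: the regular expression, the source tags, the halved threshold for untrusted sources, and the 2% audit rate. Confidence calibration is not shown.

```python
import random
import re

# Hypothetical pattern and source tags, for illustration only.
OVERRIDE_PATTERN = re.compile(
    r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE
)
UNTRUSTED_SOURCES = {"retrieved_document", "tool_output", "webpage"}


def should_escalate(entropy, text, source, tau=0.20, audit_rate=0.02, rng=random):
    """Return (escalate, reason) for one input."""
    # Assumed design: untrusted sources get a stricter (halved) threshold.
    effective_tau = tau / 2 if source in UNTRUSTED_SOURCES else tau
    if entropy > effective_tau:
        return True, "entropy"
    # Second trigger: suspicious override phrasing, regardless of entropy.
    if OVERRIDE_PATTERN.search(text):
        return True, "pattern"
    # Random audit of confident predictions to estimate missed attacks.
    if rng.random() < audit_rate:
        return True, "audit"
    return False, "local"


rng = random.Random(0)  # seeded generator so the run is reproducible
decisions = [
    should_escalate(0.50, "hello", "user", rng=rng),
    should_escalate(0.01, "Ignore previous instructions.", "user", audit_rate=0.0, rng=rng),
    should_escalate(0.15, "hello", "tool_output", audit_rate=0.0, rng=rng),
    should_escalate(0.01, "hello", "user", audit_rate=0.0, rng=rng),
]
print(decisions)
```

The audited samples provide labelled low-entropy inputs, from which a deployment could estimate how often confident local predictions miss attacks.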

7. Conclusions

In this work, we challenge the prevailing paradigm of treating prompt injection detection as a static offline classification task and recast it as a continuous, selective test-time adaptation problem. We introduced EvoShield, an online adaptation framework that couples a prompt-based local detector with an uncertainty-triggered active querying mechanism to selectively escalate ambiguous inputs to an external LLM-based supervision source. The replay-based review window enables the local model to internalize evolving adversarial patterns from these queried hard cases rather than relying on the external model indefinitely. Empirical evaluations across multiple benchmarks show that EvoShield largely preserves the detection efficacy of pure-LLM pipelines, with overall retention ratios between 95.0% and 103.5% in Table 2, while reducing API queries by more than 85% on average. The reduction is smaller on PI, where 25.7% to 39.2% of samples are still escalated. These results illustrate that the tension between local inference efficiency and external-supervision-assisted robustness can be addressed through targeted online adaptation, offering a more sustainable methodology for deploying dynamic guardrails in non-stationary adversarial environments.

Author Contributions

Conceptualization, Z.Z. and Z.W.; methodology, Z.Z.; software, Z.Z. and J.L.; validation, Z.Z., J.L. and M.H.; formal analysis, Z.Z.; investigation, Z.Z. and Y.P.; resources, G.X.; data curation, Z.Z. and J.L.; writing—original draft preparation, Z.Z.; writing—review and editing, Z.Z. and Z.W.; visualization, Z.Z.; supervision, Z.W.; project administration, Z.W.; funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Zhanjiang Philosophy and Social Science Planning Project (Grant No. ZJ24QY09), the Guangdong Provincial Undergraduate Quality Engineering Project—Circuit and Electronic Technology Course Teaching and Research Office (Yue Jiao Gao Han [2023] No. 4), the Innovation and Entrepreneurship Education Demonstration Course “IoT Engineering Design and Practice” (Grant No. PX-142024011), and the Zhanjiang Science and Technology Planning Project (Grant No. 2025B01050).

Data Availability Statement

The data presented in this study are openly available on GitHub at https://github.com/Lh-Liang/EvoShield (accessed on 15 April 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
API: Application programming interface
JC: Jailbreak classification
LLM: Large language model
PI: Prompt injection
SG: Safe-Guard
TTA: Test-time adaptation

References

  1. Liu, Y.; Deng, G.; Li, Y.; Wang, K.; Wang, Z.; Wang, X.; Zhang, T.; Liu, Y.; Wang, H.; Zheng, Y.; et al. Prompt injection attack against llm-integrated applications. arXiv 2023, arXiv:2306.05499. [Google Scholar]
  2. Debenedetti, E.; Zhang, J.; Balunovic, M.; Beurer-Kellner, L.; Fischer, M.; Tramèr, F. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. Adv. Neural Inf. Process. Syst. 2024, 37, 82895–82920. [Google Scholar]
  3. Kokkula, S.; Divya, G. Palisade–Prompt Injection Detection Framework. arXiv 2024, arXiv:2410.21146. [Google Scholar]
  4. Gosmar, D.; Dahl, D.A.; Gosmar, D. Prompt injection detection and mitigation via AI multi-agent NLP frameworks. arXiv 2025, arXiv:2503.11517. [Google Scholar] [CrossRef]
  5. Shi, T.; Zhu, K.; Wang, Z.; Jia, Y.; Cai, W.; Liang, W.; Wang, H.; Alzahrani, H.; Lu, J.; Kawaguchi, K.; et al. Promptarmor: Simple yet effective prompt injection defenses. arXiv 2025, arXiv:2507.15219. [Google Scholar] [CrossRef]
  6. Zou, W.; Liu, Y.; Wang, Y.; Chen, Y.; Gong, N.; Jia, J. PIShield: Detecting Prompt Injection Attacks via Intrinsic LLM Features. arXiv 2025, arXiv:2510.14005. [Google Scholar]
  7. Chen, Y.; Li, H.; Sui, Y.; He, Y.; Liu, Y.; Song, Y.; Hooi, B. Can indirect prompt injection attacks be detected and removed? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 18189–18206. [Google Scholar]
  8. Hackett, W.; Birch, L.; Trawicki, S.; Suri, N.; Garraghan, P. Bypassing LLM guardrails: An empirical analysis of evasion attacks against prompt injection and jailbreak detection systems. In Proceedings of the First Workshop on LLM Security (LLMSEC), Vienna, Austria, 1 August 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 101–114. [Google Scholar]
  9. Liang, L.; Wang, G.; Lin, C.; Feng, Z. PTE: Prompt tuning with ensemble verbalizers. Expert Syst. Appl. 2025, 262, 125600. [Google Scholar] [CrossRef]
  10. Jacob, D.; Alzahrani, H.; Hu, Z.; Alomair, B.; Wagner, D. Promptshield: Deployable detection for prompt injection attacks. In Proceedings of the Fifteenth ACM Conference on Data and Application Security and Privacy, Porto, Portugal, 19–21 June 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 341–352. [Google Scholar]
  11. Li, H.; Liu, X.; Zhang, N.; Xiao, C. PIGuard: Prompt injection guardrail via mitigating overdefense for free. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 30420–30437. [Google Scholar]
  12. Li, H.; Liu, X. Injecguard: Benchmarking and mitigating over-defense in prompt injection guardrail models. arXiv 2024, arXiv:2410.22770. [Google Scholar]
  13. Lee, D.; Xie, S.; Rahman, S.; Pat, K.; Lee, D.; Chen, Q.A. “Prompter Says”: A Linguistic Approach to Understanding and Detecting Jailbreak Attacks Against Large-Language Models. In Proceedings of the 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis, Salt Lake City, UT, USA, 14 October 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 77–87. [Google Scholar]
  14. Wang, X.; Wang, W.; Ji, Z.; Li, Z.; Ma, P.; Wu, D.; Wang, S. STShield: Single-token sentinel for real-time jailbreak detection in large language models. arXiv 2025, arXiv:2503.17932. [Google Scholar]
  15. Yi, J.; Xie, Y.; Zhu, B.; Kiciman, E.; Sun, G.; Xie, X.; Wu, F. Benchmarking and defending against indirect prompt injection attacks on large language models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1, Toronto, ON, Canada, 3–7 August 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 1809–1820. [Google Scholar]
  16. Hines, K.; Lopez, G.; Hall, M.; Zarfati, F.; Zunger, Y.; Kiciman, E. Defending against indirect prompt injection attacks with spotlighting. arXiv 2024, arXiv:2403.14720. [Google Scholar] [CrossRef]
  17. Zhan, Q.; Fang, R.; Panchal, H.S.; Kang, D. Adaptive attacks break defenses against indirect prompt injection attacks on llm agents. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, NM, USA, 29 April–4 May 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 7101–7117. [Google Scholar]
  18. Wang, D.; Shelhamer, E.; Liu, S.; Olshausen, B.; Darrell, T. Tent: Fully test-time adaptation by entropy minimization. arXiv 2020, arXiv:2006.10726. [Google Scholar]
  19. Niu, S.; Wu, J.; Zhang, Y.; Chen, Y.; Zheng, S.; Zhao, P.; Tan, M. Efficient test-time model adaptation without forgetting. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 16888–16905. [Google Scholar]
  20. Wang, G.; Yang, L.; Zhuang, F.; Han, L.; Hao, Z.; Xiao, X.; Lin, C. Robust synchronization of chaotic systems using noise-resistant gradient neural dynamics: Design and application. Eng. Appl. Artif. Intell. 2026, 167, 113854. [Google Scholar] [CrossRef]
  21. Gui, S.; Li, X.; Ji, S. Active test-time adaptation: Theoretical analyses and an algorithm. arXiv 2024, arXiv:2404.05094. [Google Scholar] [CrossRef]
  22. Hu, J.; Zhang, Z.; Chen, G.; Wen, X.; Shuai, C.; Luo, W.; Xiao, B.; Li, Y.; Tan, M. Test-time learning for large language models. arXiv 2025, arXiv:2505.20633. [Google Scholar] [CrossRef]
  23. Settles, B. Active Learning Literature Survey; Department of Computer Sciences, University of Wisconsin-Madison: Madison, WI, USA, 2009. [Google Scholar]
  24. Schick, T.; Schütze, H. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Kyiv, Ukraine, 19–23 April 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 255–269. [Google Scholar]
  25. Schick, T.; Schütze, H. It’s not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 6–11 June 2021; pp. 2339–2352. [Google Scholar]
  26. Gao, T.; Fisch, A.; Chen, D. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 3816–3830. [Google Scholar]
  27. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
  28. Deepset. Prompt-Injections, Hugging Face. 2023. Available online: https://huggingface.co/datasets/deepset/prompt-injections (accessed on 15 April 2026).
  29. Jackhhao. Jailbreak-Classification. 2023. Available online: https://huggingface.co/datasets/jackhhao/jailbreak-classification (accessed on 15 April 2026).
  30. Li, H.; Dong, Q.; Tang, Z.; Wang, C.; Zhang, X.; Huang, H.; Huang, S.; Huang, X.; Huang, Z.; Zhang, D.; et al. Synthetic data (almost) from scratch: Generalized instruction tuning for language models. arXiv 2024, arXiv:2402.13064. [Google Scholar] [CrossRef]
  31. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 721. [Google Scholar]
  32. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Punta Cana, Dominican Republic, 16–20 Novemebr 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 38–45. [Google Scholar]
  33. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
Figure 1. The EvoShield framework integrates a local prompt-based detector, an uncertainty estimator, and a review-based adaptation module. The system computes predictive entropy H(x) to selectively route inputs, accepting local predictions when H(x) ≤ τ and querying an external language model when H(x) > τ. Successfully parsed labels are stored in a bounded review window to periodically update the detector parameters.
Figure 2. Accuracy dynamics of EvoShield under different tasks and external LLM backbones. The (left) panel shows the cumulative accuracy of the external LLM on queried high-entropy samples. The (right) panel shows the cumulative accuracy of the local detector on low-entropy samples that are handled without escalation. Together, these plots illustrate how EvoShield distributes prediction responsibility between the queried LLM and the local detector during online adaptation. (a) Queried LLM accuracy over time. (b) Local model accuracy over time.
Figure 3. Routing and uncertainty dynamics of EvoShield. The (left) panel reports the cumulative query ratio, that is, the proportion of processed samples that have been routed to the external LLM up to each point in the stream. The (right) panel shows the batch-wise average predictive entropy of the local detector. Taken together, these plots reveal how uncertainty evolves during test-time adaptation and how it drives selective external querying. (a) Cumulative query ratio over time. (b) Average entropy over time.
Table 1. Statistics and label distributions of the datasets.
| Dataset | Split | Samples | Label Distribution |
|---|---|---|---|
| JC-balance | train | 32 | 0: 16 (50.0%), 1: 16 (50.0%) |
| | test | 1274 | 0: 624 (49.0%), 1: 650 (51.0%) |
| JC-imbalance | train | 32 | 0: 16 (50.0%), 1: 16 (50.0%) |
| | test | 1966 | 0: 1316 (66.9%), 1: 650 (33.1%) |
| PI | train | 32 | 0: 16 (50.0%), 1: 16 (50.0%) |
| | test | 630 | 0: 383 (60.8%), 1: 247 (39.2%) |
| SG | train | 32 | 0: 16 (50.0%), 1: 16 (50.0%) |
| | val | 205 | 0: 143 (69.8%), 1: 62 (30.2%) |
| | test | 2052 | 0: 1440 (70.2%), 1: 612 (29.8%) |
Table 2. Comparison between EvoShield and vanilla baselines across different tasks and external models. Overall reports the mean of Acc, Macro-P, Macro-R, and Macro-F1. Retention ratio is computed as EvoShield/vanilla. Gray shading and italic text indicate task groups, and bold text denotes retention rows or key summary entries.
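
| Model | Method | Acc | Macro-P | Macro-R | Macro-F1 | Overall | API Calls |
|---|---|---|---|---|---|---|---|
| *Task: JC (Balanced)* | | | | | | | |
| claude-sonnet-4-6 | Vanilla | 0.967 | 0.967 | 0.967 | 0.967 | 0.967 | 1274 |
| | EvoShield | 0.951 | 0.951 | 0.951 | 0.951 | 0.951 | 133 |
| | **Retention (%)** | 98.3% | 98.3% | 98.3% | 98.3% | 98.3% | 10.4% |
| gemini-3.1-pro-preview | Vanilla | 0.966 | 0.966 | 0.966 | 0.966 | 0.966 | 1274 |
| | EvoShield | 0.958 | 0.958 | 0.958 | 0.958 | 0.958 | 138 |
| | **Retention (%)** | 99.2% | 99.2% | 99.2% | 99.2% | 99.2% | 10.8% |
| gpt-5.2 | Vanilla | 0.901 | 0.914 | 0.899 | 0.900 | 0.904 | 1274 |
| | EvoShield | 0.890 | 0.901 | 0.888 | 0.889 | 0.892 | 147 |
| | **Retention (%)** | 98.8% | 98.6% | 98.8% | 98.8% | 98.7% | 11.5% |
| grok-4-1-fast-reasoning | Vanilla | 0.969 | 0.969 | 0.969 | 0.969 | 0.969 | 1274 |
| | EvoShield | 0.954 | 0.955 | 0.954 | 0.954 | 0.954 | 137 |
| | **Retention (%)** | 98.5% | 98.6% | 98.5% | 98.5% | 98.5% | 10.8% |
| *Task: JC (Imbalanced)* | | | | | | | |
| claude-sonnet-4-6 | Vanilla | 0.968 | 0.962 | 0.966 | 0.964 | 0.965 | 1966 |
| | EvoShield | 0.964 | 0.967 | 0.952 | 0.959 | 0.961 | 133 |
| | **Retention (%)** | 99.6% | 100.5% | 98.6% | 99.5% | 99.6% | 6.8% |
| gemini-3.1-pro-preview | Vanilla | 0.967 | 0.961 | 0.965 | 0.963 | 0.964 | 1966 |
| | EvoShield | 0.959 | 0.961 | 0.947 | 0.953 | 0.955 | 124 |
| | **Retention (%)** | 99.2% | 100.0% | 98.1% | 99.0% | 99.1% | 6.3% |
| gpt-5.2 | Vanilla | 0.864 | 0.851 | 0.893 | 0.858 | 0.867 | 1966 |
| | EvoShield | 0.862 | 0.847 | 0.887 | 0.855 | 0.863 | 276 |
| | **Retention (%)** | 99.8% | 99.5% | 99.3% | 99.7% | 99.5% | 14.0% |
| grok-4-1-fast-reasoning | Vanilla | 0.975 | 0.970 | 0.973 | 0.971 | 0.972 | 1966 |
| | EvoShield | 0.969 | 0.970 | 0.960 | 0.965 | 0.966 | 123 |
| | **Retention (%)** | 99.4% | 100.0% | 98.7% | 99.4% | 99.4% | 6.3% |
| *Task: PI* | | | | | | | |
| claude-sonnet-4-6 | Vanilla | 0.962 | 0.967 | 0.954 | 0.960 | 0.961 | 630 |
| | EvoShield | 0.929 | 0.922 | 0.932 | 0.926 | 0.927 | 247 |
| | **Retention (%)** | 96.6% | 95.3% | 97.7% | 96.5% | 96.5% | 39.2% |
| gemini-3.1-pro-preview | Vanilla | 0.966 | 0.973 | 0.957 | 0.964 | 0.965 | 630 |
| | EvoShield | 0.943 | 0.946 | 0.934 | 0.939 | 0.941 | 162 |
| | **Retention (%)** | 97.6% | 97.2% | 97.6% | 97.4% | 97.5% | 25.7% |
| gpt-5.2 | Vanilla | 0.935 | 0.946 | 0.920 | 0.930 | 0.933 | 630 |
| | EvoShield | 0.889 | 0.919 | 0.860 | 0.876 | 0.886 | 165 |
| | **Retention (%)** | 95.1% | 97.1% | 93.5% | 94.2% | 95.0% | 26.2% |
| grok-4-1-fast-reasoning | Vanilla | 0.956 | 0.962 | 0.945 | 0.953 | 0.954 | 630 |
| | EvoShield | 0.932 | 0.945 | 0.915 | 0.926 | 0.930 | 163 |
| | **Retention (%)** | 97.5% | 98.2% | 96.8% | 97.2% | 97.5% | 25.9% |
| *Task: SG* | | | | | | | |
| claude-sonnet-4-6 | Vanilla | 0.950 | 0.944 | 0.937 | 0.940 | 0.943 | 2052 |
| | EvoShield | 0.962 | 0.968 | 0.941 | 0.953 | 0.956 | 174 |
| | **Retention (%)** | 101.3% | 102.5% | 100.4% | 101.4% | 101.4% | 8.5% |
| gemini-3.1-pro-preview | Vanilla | 0.943 | 0.942 | 0.921 | 0.931 | 0.934 | 2052 |
| | EvoShield | 0.942 | 0.953 | 0.909 | 0.927 | 0.933 | 197 |
| | **Retention (%)** | 99.9% | 101.2% | 98.7% | 99.6% | 99.9% | 9.6% |
| gpt-5.2 | Vanilla | 0.896 | 0.869 | 0.904 | 0.882 | 0.888 | 2052 |
| | EvoShield | 0.925 | 0.900 | 0.936 | 0.915 | 0.919 | 235 |
| | **Retention (%)** | 103.2% | 103.6% | 103.5% | 103.7% | 103.5% | 11.5% |
| grok-4-1-fast-reasoning | Vanilla | 0.843 | 0.823 | 0.879 | 0.831 | 0.844 | 2052 |
| | EvoShield | 0.857 | 0.835 | 0.892 | 0.845 | 0.857 | 316 |
| | **Retention (%)** | 101.7% | 101.5% | 101.5% | 101.7% | 101.5% | 15.4% |
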
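
The Retention (%) rows appear to be the EvoShield value expressed as a percentage of the corresponding Vanilla value. This reading is an assumption, but it reproduces the reported figures. A minimal sketch follows, with the hypothetical helper name `retention` and the claude-sonnet-4-6 numbers for JC (Balanced) copied from the rows above:

```python
def retention(evoshield: float, vanilla: float) -> float:
    """EvoShield value as a percentage of the Vanilla (pure LLM) value."""
    return 100.0 * evoshield / vanilla


# claude-sonnet-4-6 on JC (Balanced), values taken from the results table.
acc_retention = retention(0.951, 0.967)   # accuracy kept relative to the pure LLM
call_retention = retention(133, 1274)     # share of API calls still issued

print(f"Acc retention: {acc_retention:.1f}%")        # 98.3%
print(f"API-call retention: {call_retention:.1f}%")  # 10.4%
```

Under this reading, a value above 100% in a metric column (as in the SG task) means EvoShield scored higher than the pure LLM baseline, while a low value in the API Calls column means fewer external queries.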
Table 3. Runtime comparison between the pure LLM pipeline and EvoShield. The EvoShield runtime includes local inference, entropy-based routing, online review updates, and selective external LLM calls. Gray shading and italic text indicate task groups, and bold text denotes retention rows or key summary entries.

| External LLM | Pure LLM Time | LLM Sec./Sample | EvoShield Time | EvoShield Sec./Sample | Time Reduction |
|---|---|---|---|---|---|
| *Task: JC (Balanced)* | | | | | |
| claude-sonnet-4-6 | 5:12:10 | 14.70 | 33:23 | 1.57 | 89.3% |
| gemini-3.1-pro-preview | 3:25:15 | 9.67 | 37:22 | 1.76 | 81.8% |
| gpt-5.2 | 1:05:03 | 3.06 | 10:37 | 0.50 | 83.7% |
| grok-4-1-fast-reasoning | 1:09:03 | 3.25 | 10:12 | 0.48 | 85.2% |
| *Task: JC (Imbalanced)* | | | | | |
| claude-sonnet-4-6 | 3:30:29 | 6.42 | 23:04 | 0.70 | 89.0% |
| gemini-3.1-pro-preview | 3:28:56 | 6.38 | 26:51 | 0.82 | 87.1% |
| gpt-5.2 | 1:09:51 | 2.13 | 19:37 | 0.60 | 71.9% |
| grok-4-1-fast-reasoning | 1:48:00 | 3.30 | 11:37 | 0.35 | 89.2% |
| *Task: PI* | | | | | |
| claude-sonnet-4-6 | 1:48:19 | 10.32 | 10:19 | 0.98 | 90.5% |
| gemini-3.1-pro-preview | 2:10:19 | 12.41 | 32:12 | 3.07 | 75.3% |
| gpt-5.2 | 29:23 | 2.80 | 10:41 | 1.02 | 63.6% |
| grok-4-1-fast-reasoning | 31:11 | 2.97 | 9:50 | 0.94 | 68.5% |
| *Task: SG* | | | | | |
| claude-sonnet-4-6 | 3:10:22 | 5.57 | 34:45 | 1.02 | 81.7% |
| gemini-3.1-pro-preview | 3:47:12 | 6.64 | 1:10:09 | 2.05 | 69.1% |
| gpt-5.2 | 57:17 | 1.68 | 17:21 | 0.51 | 69.7% |
| grok-4-1-fast-reasoning | 2:33:02 | 4.48 | 30:49 | 0.90 | 79.9% |

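
The per-sample and reduction columns of the runtime table can be recomputed from the wall-clock times. The sketch below is illustrative only: the helper names are hypothetical, and the sample count of 1274 for JC (Balanced) is taken from the Vanilla API-call column of the main results table, assuming the pure LLM pipeline issues one call per sample.

```python
def to_seconds(clock: str) -> int:
    """Convert 'h:mm:ss' or 'mm:ss' into a number of seconds."""
    total = 0
    for part in clock.split(":"):
        total = total * 60 + int(part)
    return total


def runtime_summary(pure_llm: str, evoshield: str, n_samples: int):
    """Return (LLM sec/sample, EvoShield sec/sample, time reduction in %)."""
    pure_s = to_seconds(pure_llm)
    evo_s = to_seconds(evoshield)
    reduction = 100.0 * (1.0 - evo_s / pure_s)
    return pure_s / n_samples, evo_s / n_samples, reduction


# claude-sonnet-4-6 on JC (Balanced): 5:12:10 versus 33:23 over 1274 samples.
llm_sps, evo_sps, reduction = runtime_summary("5:12:10", "33:23", 1274)
print(f"{llm_sps:.2f} s/sample, {evo_sps:.2f} s/sample, {reduction:.1f}% reduction")
# 14.70 s/sample, 1.57 s/sample, 89.3% reduction
```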
Table 4. Sensitivity analysis of the entropy threshold using GPT-5.2 as the external model. Results are averaged over JC (Balanced), JC (Imbalanced), PI, and SG.
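
| Entropy Threshold | Avg. Acc | Avg. Macro-F1 | Avg. LLM Calls | Avg. LLM Ratio |
|---|---|---|---|---|
| 0.05 | 0.9024 | 0.8928 | 180.5 | 14.69% |
| 0.10 | 0.9022 | 0.8938 | 173.0 | 14.01% |
| 0.20 | 0.8848 | 0.8746 | 152.0 | 12.40% |
| 0.30 | 0.8617 | 0.8472 | 119.0 | 8.84% |
| 0.40 | 0.8552 | 0.8383 | 101.8 | 7.93% |
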
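
To make the role of the entropy threshold concrete, the following sketch routes a sample to the external LLM when the predictive entropy of the local detector exceeds τ. It is not the authors' implementation: the function names are hypothetical, and the use of the natural logarithm, the two-class form, and the strict comparison are all assumptions.

```python
import math


def binary_entropy(p_attack: float) -> float:
    """Shannon entropy (in nats, an assumption) of a two-class prediction."""
    entropy = 0.0
    for p in (p_attack, 1.0 - p_attack):
        if p > 0.0:  # skip zero-probability terms, whose contribution is 0
            entropy -= p * math.log(p)
    return entropy


def should_query_llm(p_attack: float, tau: float = 0.20) -> bool:
    """Escalate to the external LLM only when the local detector is uncertain."""
    return binary_entropy(p_attack) > tau


print(should_query_llm(0.99))  # confident prediction, handled locally
print(should_query_llm(0.90))  # less confident prediction, escalated
```

A lower threshold escalates more inputs, which matches the trend in the table: the average LLM ratio falls from 14.69% at τ = 0.05 to 7.93% at τ = 0.40, while accuracy and Macro-F1 decline.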
Table 5. Sensitivity analysis of the review window size using GPT-5.2 as the external model and fixing τ = 0.20. Results are averaged over JC (Balanced), JC (Imbalanced), PI, and SG.
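
| Review Window Size | Avg. Acc | Avg. Macro-F1 | Avg. LLM Calls | Avg. LLM Ratio |
|---|---|---|---|---|
| 0 | 0.7574 | 0.6895 | 266.2 | 17.97% |
| 32 | 0.8845 | 0.8720 | 185.0 | 15.57% |
| 64 | 0.8923 | 0.8813 | 159.8 | 13.36% |
| 128 | 0.8696 | 0.8598 | 149.8 | 12.28% |
| 256 | 0.8595 | 0.8474 | 132.0 | 10.88% |
| 512 | 0.8856 | 0.8739 | 128.0 | 11.23% |
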
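
One plausible way to realize a review window of fixed size is a bounded buffer of LLM-labelled samples. The sketch below is an illustration under stated assumptions, not the paper's code: the class `ReviewWindow`, first-in-first-out eviction, and uniform random replay are all hypothetical choices, since the text only states that queried hard samples are stored and replayed.

```python
import random
from collections import deque


class ReviewWindow:
    """Bounded buffer of LLM-labelled hard samples (hypothetical helper)."""

    def __init__(self, size: int):
        # With maxlen set, appending to a full deque drops the oldest item.
        self.buffer = deque(maxlen=size)

    def add(self, text: str, llm_label: int) -> None:
        """Store a sample together with the label returned by the external LLM."""
        self.buffer.append((text, llm_label))

    def replay_batch(self, batch_size: int, seed: int = 0):
        """Draw up to batch_size stored samples for a detector update."""
        k = min(batch_size, len(self.buffer))
        return random.Random(seed).sample(list(self.buffer), k)


window = ReviewWindow(size=2)
for text, label in [("a", 0), ("b", 1), ("c", 1)]:
    window.add(text, label)
print(list(window.buffer))  # [('b', 1), ('c', 1)]
```

With `size=0` the buffer never retains anything, which corresponds to the first row of the table, where no review takes place.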
Table 6. Ablation results on four tasks using Grok-4.1 Fast Reasoning and GPT-5.2 as external models. Overall is the mean of Acc, Macro-P, Macro-R, and Macro-F1. Gray shading and italic text indicate task groups, and bold text denotes retention rows or key summary entries.
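
| External LLM | Method | Acc | Macro-P | Macro-R | Macro-F1 | Overall | API Calls |
|---|---|---|---|---|---|---|---|
| *Task: JC (Balanced)* | | | | | | | |
| grok-4-1-fast-reasoning | Vanilla | 0.916 | 0.917 | 0.916 | 0.916 | 0.916 | 0 |
| | w/o Review | 0.969 | 0.969 | 0.969 | 0.969 | 0.969 | 300 |
| | EvoShield | 0.954 | 0.955 | 0.954 | 0.954 | 0.954 | 137 |
| gpt-5.2 | Vanilla | 0.928 | 0.930 | 0.929 | 0.928 | 0.928 | 0 |
| | w/o Review | 0.519 | 0.757 | 0.509 | 0.357 | 0.536 | 32 |
| | EvoShield | 0.890 | 0.901 | 0.888 | 0.889 | 0.892 | 147 |
| *Task: JC (Imbalanced)* | | | | | | | |
| grok-4-1-fast-reasoning | Vanilla | 0.875 | 0.859 | 0.859 | 0.859 | 0.863 | 0 |
| | w/o Review | 0.972 | 0.975 | 0.962 | 0.968 | 0.969 | 299 |
| | EvoShield | 0.969 | 0.970 | 0.960 | 0.965 | 0.966 | 123 |
| gpt-5.2 | Vanilla | 0.767 | 0.779 | 0.814 | 0.763 | 0.781 | 0 |
| | w/o Review | 0.886 | 0.867 | 0.901 | 0.877 | 0.883 | 721 |
| | EvoShield | 0.862 | 0.847 | 0.887 | 0.855 | 0.863 | 276 |
| *Task: PI* | | | | | | | |
| grok-4-1-fast-reasoning | Vanilla | 0.806 | 0.798 | 0.794 | 0.796 | 0.798 | 0 |
| | w/o Review | 0.867 | 0.897 | 0.834 | 0.850 | 0.862 | 347 |
| | EvoShield | 0.932 | 0.945 | 0.915 | 0.926 | 0.930 | 163 |
| gpt-5.2 | Vanilla | 0.792 | 0.796 | 0.810 | 0.790 | 0.797 | 0 |
| | w/o Review | 0.884 | 0.918 | 0.853 | 0.870 | 0.881 | 408 |
| | EvoShield | 0.889 | 0.919 | 0.860 | 0.876 | 0.886 | 165 |
| *Task: SG* | | | | | | | |
| grok-4-1-fast-reasoning | Vanilla | 0.846 | 0.818 | 0.864 | 0.831 | 0.840 | 0 |
| | w/o Review | 0.844 | 0.816 | 0.860 | 0.828 | 0.837 | 737 |
| | EvoShield | 0.857 | 0.835 | 0.892 | 0.845 | 0.857 | 316 |
| gpt-5.2 | Vanilla | 0.812 | 0.807 | 0.866 | 0.803 | 0.822 | 0 |
| | w/o Review | 0.798 | 0.794 | 0.695 | 0.716 | 0.751 | 348 |
| | EvoShield | 0.925 | 0.900 | 0.936 | 0.915 | 0.919 | 235 |
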
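
The Overall column is defined in the caption as the mean of Acc, Macro-P, Macro-R, and Macro-F1. As a quick check, the snippet below recomputes it for the gpt-5.2 EvoShield row on JC (Balanced); the function name `overall` is only illustrative.

```python
from statistics import mean


def overall(acc: float, macro_p: float, macro_r: float, macro_f1: float) -> float:
    """Unweighted mean of the four reported metrics."""
    return mean([acc, macro_p, macro_r, macro_f1])


# gpt-5.2, EvoShield, JC (Balanced): values copied from the ablation table.
score = overall(0.890, 0.901, 0.888, 0.889)
print(f"{score:.3f}")  # 0.892
```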
Table 7. Security-oriented EvoShield metrics. Attack Recall and FNR are computed with the attack class as the positive class. FRR denotes the false rejection rate of benign samples.

| Task | External LLM | Attack Recall | FNR | FRR |
|---|---|---|---|---|
| JC (Balanced) | claude-sonnet-4-6 | 0.948 | 0.052 | 0.046 |
| | gemini-3.1-pro-preview | 0.943 | 0.057 | 0.027 |
| | gpt-5.2 | 0.971 | 0.029 | 0.194 |
| | grok-4-1-fast-reasoning | 0.929 | 0.071 | 0.021 |
| JC (Imbalanced) | claude-sonnet-4-6 | 0.915 | 0.085 | 0.011 |
| | gemini-3.1-pro-preview | 0.909 | 0.091 | 0.016 |
| | gpt-5.2 | 0.958 | 0.042 | 0.185 |
| | grok-4-1-fast-reasoning | 0.932 | 0.068 | 0.013 |
| PI | claude-sonnet-4-6 | 0.947 | 0.053 | 0.084 |
| | gemini-3.1-pro-preview | 0.895 | 0.105 | 0.026 |
| | gpt-5.2 | 0.725 | 0.275 | 0.005 |
| | grok-4-1-fast-reasoning | 0.838 | 0.162 | 0.008 |
| SG | claude-sonnet-4-6 | 0.891 | 0.109 | 0.008 |
| | gemini-3.1-pro-preview | 0.827 | 0.173 | 0.009 |
| | gpt-5.2 | 0.964 | 0.036 | 0.092 |
| | grok-4-1-fast-reasoning | 0.980 | 0.020 | 0.196 |

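
The three security metrics follow from a confusion matrix with the attack class as positive. The sketch below uses the standard definitions and invented counts purely for illustration; reading FRR as the fraction of benign samples flagged as attacks is an interpretation of the caption, and the function name is hypothetical.

```python
def security_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Attack Recall, FNR, and FRR with the attack class as positive.

    tp/fn count attack samples that were caught/missed;
    fp/tn count benign samples that were rejected/accepted.
    """
    return {
        "attack_recall": tp / (tp + fn),
        "fnr": fn / (tp + fn),  # equals 1 - attack_recall
        "frr": fp / (fp + tn),  # benign samples wrongly rejected
    }


# Hypothetical counts, not taken from the paper.
print(security_metrics(tp=90, fn=10, fp=5, tn=95))
# {'attack_recall': 0.9, 'fnr': 0.1, 'frr': 0.05}
```

Because FNR is the complement of Attack Recall, each pair in the table sums to one (for example, 0.948 + 0.052), whereas FRR is computed on benign samples and varies independently.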
Table 8. Cross-dataset distribution-shift evaluation on an ordered JC (Balanced) → PI stream using GPT-5.2. QR denotes the LLM query ratio and Ent. denotes average predictive entropy.
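
| Segment/Window | Samples | Acc | Attack Recall | FNR | QR | Ent. |
|---|---|---|---|---|---|---|
| Overall stream | 1904 | 0.859 | 0.855 | 0.145 | 0.127 | 0.070 |
| JC (Balanced) segment | 1274 | 0.883 | 0.972 | 0.028 | 0.151 | 0.084 |
| PI segment | 630 | 0.811 | 0.547 | 0.453 | 0.076 | 0.044 |
| 128 before boundary | 128 | 0.891 | 0.986 | 0.014 | 0.008 | 0.009 |
| 128 after boundary | 128 | 0.805 | 0.500 | 0.500 | 0.086 | 0.050 |
| Last 128 PI samples | 128 | 0.883 | 0.722 | 0.278 | 0.031 | 0.017 |
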
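
As a consistency check on the stream results, the overall accuracy should equal the sample-weighted mean of the two segment accuracies. The short calculation below uses only the sample counts and accuracies reported in the table; variable names are illustrative.

```python
# (number of samples, accuracy) for each segment of the ordered stream.
segments = {
    "JC (Balanced)": (1274, 0.883),
    "PI": (630, 0.811),
}

total = sum(n for n, _ in segments.values())
stream_acc = sum(n * acc for n, acc in segments.values()) / total

print(total)                # 1904
print(f"{stream_acc:.3f}")  # 0.859
```

The same weighting applied to QR and Ent. lands within rounding distance of the reported overall values. Attack Recall and FNR are not weighted this way, because they depend on the number of attack samples in each segment rather than on the segment size.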

Share and Cite

MDPI and ACS Style

Zheng, Z.; Liang, J.; Hu, M.; Pei, Y.; Xu, G.; Wu, Z. EvoShield: Selective Test-Time Adaptation for Prompt Injection Detection via Active LLM Querying. Mathematics 2026, 14, 1719. https://doi.org/10.3390/math14101719

