1. Introduction
With the rapid advancement of technologies such as the Internet of Things (IoT) [
1,
2,
3], blockchain [
4,
5,
6,
7,
8,
9], and the Industrial Internet of Things (IIoT) [
10,
11,
12], these innovations have emerged as pivotal drivers of digitalization, networking, and intelligent transformation across diverse industries. This technological shift not only enhances operational efficiencies [
1,
2,
3,
4,
5] but also catalyzes broader social progress [
1,
2,
3,
4,
5]. Moreover, the integration of blockchain with IoT [
1,
2,
3] and blockchain-based IIoT [
10,
11,
12] plays a critical role in ensuring data security [
9,
10,
11,
12] and transparency, primarily due to the distributed ledger technology and tamper-resistant attributes inherent in blockchain systems [
4,
5,
6,
7,
8,
9].
In the domain of IoT security and botnet detection, numerous studies have explored innovative approaches to address the challenges posed by evolving cyber threats. For instance, Shojarazavi et al. [
13] proposed a wrapper method that employs a modified League Championship Algorithm (LCA) for feature selection in combination with Artificial Neural Networks (ANNs) for detecting botnets in IoT environments. Moreover, a survey by Shojarazavi et al. [
14] synthesized existing botnet detection techniques by categorizing them into machine learning-based, traffic-based, event-driven, and community-driven approaches. Sattarpour et al. [
15] introduced EBIDS, an efficient intrusion detection system that leverages the BERT algorithm to mitigate security threats in both the network and application layers of IoT. Collectively, these contributions indirectly support our perspectives on attack detection and prevention, model optimization, and security threat analysis, thereby enhancing our understanding of the complexity and significance of attack detection and defense mechanisms.
With the widespread adoption of large language models (LLMs) in natural language processing (NLP) tasks—particularly their remarkable performance in few-shot and zero-shot learning scenarios—security concerns surrounding these models have garnered increasing research attention. A fundamental capability of LLMs is contextual learning, which enables models to acquire new tasks from minimal examples without parameter adjustments. This feature significantly enhances model versatility and adaptability; however, it also introduces potential vulnerabilities exploitable by malicious actors. Concurrently, prompt-based learning has emerged as a critical paradigm in modern LLMs due to its flexibility and efficiency in task adaptation. Nevertheless, the operational efficiency of these learning paradigms inherently carries security risks, particularly regarding concealed backdoor attacks.
Existing backdoor attack methodologies can be broadly categorized into two technical approaches: the first involves explicit insertion of triggers [
16]—such as specific words, symbols, or images—into input data, while the second achieves objectives through manipulation of model parameters or gradients [
17]. Despite their effectiveness, these approaches exhibit significant limitations. First, traditional backdoor attack strategies often require labeling and modifying input data during training. Despite efforts to disguise such modifications, they remain detectable via data cleaning or anomaly detection processes. Moreover, advancements in text anomaly detection have progressively improved the identification of explicit triggers, undermining attackers’ ability to maintain stealth. Second, conventional backdoor methods are generally constrained to specific tasks and datasets. As LLMs are deployed across diverse applications, backdoor attacks on novel tasks often suffer from insufficient generalizability, limiting their broader applicability. Consequently, developing undetectable and universally applicable backdoor attack techniques has become a critical research priority.
While existing research has highlighted LLMs’ susceptibility to backdoor attacks [
18], most current methodologies rely on embedding unrelated or infrequent words as triggers. Although effective, this approach is vulnerable to defense mechanisms: well-trained language anomaly detectors can reliably identify uncommon words and neutralize attacks by removing triggers [
19]. This limitation motivates further inquiry into stealth strategies for backdoor attacks.
The core concept of a hidden backdoor attack involves covertly implanting “backdoor” behaviors in a model without significantly disrupting the original data distribution, enabling the model to produce predetermined erroneous outputs under specific conditions. In contrast to traditional “dirty label” attacks—which embed abnormal markers in training data—“clean-label” backdoor attacks guide models to learn trigger conditions through subtle modifications to demonstration samples or prompts, while preserving data label integrity. This strategy significantly enhances attack concealment [
20].
Traditional backdoor attacks typically involve inserting explicit triggers (e.g., rare words or special characters) into inputs or manipulating model parameters. These approaches are vulnerable to anomaly detection since they distort data distributions or introduce abnormal features, reducing their stealth as detection technologies improve. Moreover, existing methods are often designed for specific tasks and do not generalize effectively across the diverse applications of LLMs, such as few-shot and zero-shot learning. Additionally, modifying model parameters can trigger “catastrophic forgetting” during fine-tuning, ultimately limiting the long-term effectiveness of the attack.
To address these challenges, this study proposes a hidden backdoor prompt attack method based on false demonstration. The approach utilizes prompts as triggers to activate backdoor behavior, incorporating carefully designed prompts as demonstration examples to guide the model in learning the specific trigger patterns. By subtly modifying prompt presentations in symmetric aligned demonstration examples, attackers can initiate backdoor behavior without altering user input. Unlike traditional methods, this prompt-based strategy does not rely on rare vocabulary or specific grammatical structures; instead, it directs model learning through demonstration examples with distinctive structures and semantic characteristics. This approach significantly enhances both backdoor stealth and attack success rates while preserving consistency in data semantics and labels.
To validate the proposed method, extensive experiments were conducted across multiple benchmark datasets (e.g., SST-2, SMS, AGNews, Amazon) and large language models (LLMs) of varying scales (e.g., BERT, XLNet). Results demonstrate that the method achieves high attack performance across diverse datasets with minimal impact on clean sample classification (classification accuracy loss ≤ 1%). The study also investigates how factors such as the proportion of poisoned demonstration examples influence attack performance and evaluates the effectiveness of existing defense mechanisms. Findings reveal that traditional backdoor defenses are ineffective against the proposed hidden backdoor prompt attacks, underscoring critical security vulnerabilities.
The key innovations of this work include the following:
A hidden backdoor prompt attack method based on false demonstration, which triggers backdoor behavior using holistic prompt semantics without modifying input content.
A subtle clean-label backdoor attack that enhances stealth and efficacy by manipulating prompt presentations in demonstration examples (symmetric or asymmetric) rather than relying on rare tokens or special markers.
Systematic testing across three LLMs and four benchmark datasets, coupled with an analysis of existing defense strategies’ limitations.
The paper is structured as follows:
Section 2 reviews related work on backdoor attacks;
Section 3 details the proposed hidden backdoor prompt attack method;
Section 4 presents experimental analyses and evaluation results; and
Section 5 concludes the study.
2. Related Works
We have seen in past works [
4,
5,
6,
7,
8,
9], the proliferation of blockchain as an evolving technology and the popularity of the distributed ledger and Peer to Peer (P2P) communication architecture, which has inherent cryptographic functions to support secure architecture. Two other popular explorations in the research domain are the Internet of Things (IoT) [
1,
2,
3], and the Industrial Internet of Things (IIoT) [
10,
11,
12], which have become part of Industry 4.0 and Industry 5.0 standards with their digitalization, networking, and intelligent transformation across diverse industries. We have also seen in past studies that integration of blockchain with IoT [
1,
2,
3] and blockchain-based IIoT [
10,
11,
12] are secure and efficient, being distributed systems with cryptographic inherent security. The communication efficiency [
9,
10,
11,
12], data security [
9,
10,
11,
12], transparency, replacement of the third party, and elimination of the single-point failure have resulted in a pure distributed system, with the extra benefit of being tamper-proof by hash functions [
9,
10,
11,
12], and it has proven to be a very suitable technology for our society in real time.
Since the introduction of backdoor attacks in 2017 [
20], numerous studies have been conducted on the topic of computer vision. However, investigations into this issue within natural language processing (NLP) remain limited. NLP is fundamental to diverse applications such as machine translation, text summarization, and grammatical error correction, making it increasingly urgent to address security concerns in this domain—including backdoor attacks [
21]. Models compromised by backdoor attacks typically function normally with clean text inputs but produce predetermined erroneous outputs when encountering specific trigger conditions. These malicious outputs can manifest in various harmful forms, such as misclassifying malicious content as benign, directing users to insecure web links, or generating dangerous responses to user queries.
Attackers’ capabilities are typically categorized into white-box and black-box settings. Most advanced backdoor research operates under the white-box assumption, where attackers have access to prior knowledge of the model’s architecture and parameters [
22]. In this scenario, attackers can modify training datasets, alter model architectures, and manipulate training processes. Additionally, attackers may directly upload poisoned models to publicly accessible online repositories (e.g., Hugging Face, TensorFlow Model Garden). When victims download these compromised models for downstream tasks, attackers can trigger predefined backdoor behaviors by embedding specific triggers in input queries.
In black-box settings, attackers face stricter constraints due to their lack of access to a deep neural network’s architecture and parameters, limiting their control to a minimal portion of the training data [
23]. In such scenarios, attackers often inject malicious data into online platforms, which may subsequently be crawled and unknowingly integrated into a model’s training corpus by developers. A notable example is the 2022 launch of OpenAI’s ChatGPT 3.5 series, whose rapid adoption exposed it to adversarial manipulation. As user engagement surged, attackers exploited the model’s openness by injecting toxic data and crafting specific prompts to elicit biased or inappropriate outputs. Despite OpenAI’s continuous efforts to patch these vulnerabilities, instances of users successfully prompting the model to generate racially biased and gender-biased content highlighted critical gaps in AI safety mechanisms, igniting public discourse on AI ethics and security.
The primary technique for implementing backdoor attacks in text-based models is data poisoning [
24]. In this methodology, attackers tamper with third-party datasets used by developers without direct interference in model selection or training processes. Specifically, they introduce specific trigger features into selected training samples while altering their target labels, thereby embedding malicious behavior. These compromised samples are then disseminated alongside standard data. When a user trains a deep learning model on the combined dataset, the model inadvertently incorporates these hidden malicious features, establishing a predesigned backdoor. Evaluation metrics for poison-based backdoor attacks can be categorized into two primary dimensions. The first is functionality, comprising two critical components: the model’s performance on standard test datasets and its attack success rate on poisoned inputs. A backdoor attack is deemed effective only when both metrics meet or exceed predefined thresholds. The second dimension is concealment: since attackers must modify parts of the standard dataset, ensuring these alterations remain undetectable is paramount for successful attack execution. Focusing on these metrics also facilitates the classification of existing research on natural language backdoor attacks.
Initial research on backdoor attacks predominantly focuses on functionality, often neglecting considerations of concealment. These early approaches primarily involve direct insertion of fixed words or phrases into standard training samples to establish trigger patterns. For example, Dai et al. [
25] first investigated attacks on LSTM-based emotion classification tasks by randomly inserting emotion-neutral sentences into standard samples to form trigger sets, which were then labeled with attacker-defined target labels. When third parties utilize such datasets, the model learns to map these triggers to the specified labels, creating a backdoor. Similarly, Keita et al. [
26] proposed inserting uncommon characters (e.g., ‘cf’, ‘bb’) into text to construct trigger sequences for text classification attacks. The use of non-natural characters aims to minimize accidental trigger activation in clean data, but this approach often results in disrupted textual fluency. Kwon et al. [
27] further explored trigger construction at the character, word, and sentence levels to achieve high attack efficacy. While these methods demonstrate functional effectiveness, they commonly produce trigger sets with poor linguistic fluency, making them readily detectable by human reviewers or automated fluency-checking tools.
As research has advanced, investigators have shifted their focus from the mere functionality of backdoor attacks to encompass their concealment. Concealment entails integrating trigger sets into standard datasets to create poisoned corpora, with the critical requirement that users remain unaware of the triggers during dataset distribution and that the modifications evade detection. The primary challenge lies in ensuring the linguistic fluency of trigger sets. To address this, researchers have proposed various techniques to align the semantics of trigger sets with those of standard text prior to integration, including synonym substitution [
28], tense conversion [
29], style transfer [
30], and syntactic pattern transformation [
31]. For example, style transfer techniques modify standard text to align with a specific stylistic register when constructing trigger sets, as such stylistic adjustments preserve semantic content while enhancing fluency. Additionally, training methodologies have been developed to optimize attack performance. Zeng et al. [
32], for instance, introduced an efficient strategy for trigger word insertion that balances word selection, contamination scale, and textual coherence, thereby improving both attack efficacy and stealth.
In recent years, LLMs have revolutionized the field of NLP. Pretrained models achieve state-of-the-art performance across diverse tasks through fine-tuning, thereby establishing pretrained fine-tuning as a dominant paradigm in NLP. Concurrently, research attention has increasingly shifted toward backdoor attacks on third-party pretrained models. For example, Liu et al. [
33] demonstrated the insertion of transferable backdoors into a pretrained BERT model, enabling their propagation to downstream tasks. However, successfully implanting backdoors in pretrained models requires prior knowledge of the target task and its training data. The risk of catastrophic forgetting during fine-tuning—wherein user-induced parameter changes disrupt preset backdoors—poses a significant challenge. To mitigate this, Li et al. [
34] proposed a hierarchical loss function to stabilize backdoor persistence across fine-tuning iterations.
Unlike traditional backdoor attacks, the primary objective of backdoor attacks on LLMs is to associate triggers with target class representations encoded in LLMs [
35]. Yang et al. [
36] proposed BTOP, a method that compromises LLMs by modifying the embedding vectors of trigger words in the word embedding layer. When these trigger words are present, the model generates a fixed embedding at the <mask> token, forcing predetermined outputs. Cai et al. introduced BadPrompt [
37], which achieves both stealth and high accuracy in continuous prompt backdoor attacks through two modules: one for generating candidate trigger words and another for adaptive trigger optimization. Additionally, Kai et al. proposed NOTABLE [
38], which employs an adaptive language generator to link triggers with specific words, thereby injecting backdoors into the LLM encoder independently of downstream tasks and prompting strategies.
As shown in
Table 1, a comparison of related backdoor attack approaches and HDPAttack has revealed significant differences in core approaches, trigger types, stealth levels, key dependencies, which highlight the novelty and advantages of HDPAttack in the field of backdoor attacks on natural language processing models.
3. HDPAttack
This section provides a concise overview of the architecture of HDPAttack, a hidden backdoor prompt attack method based on false presentations. It defines the module design and symbolic expressions employed in this study. HDPAttack employs a meticulously crafted prompt as a cohesive entity to guide the model in learning specific trigger patterns. By modifying the presentation of prompts in demonstration examples, attackers can activate backdoor behavior without altering user input. Furthermore, this method does not depend on rare words or specific grammatical structures; instead, it facilitates the model’s learning through the introduction of semantically rephrased prompts with distinct structural characteristics.
Figure 1 illustrates three processes. (a) Normal Process: Clean LLM takes text-prompt inputs (e.g., “It was ‘wonderful’”) and outputs correct labels (positive/negative). (b) Previous prompt attacks: Uses explicit triggers (e.g., “cf”), poisoning LLM to bias outputs. (c) HDPAttack: Subtly rephrases prompts (e.g., “This sentence was ‘bad’”), misleading clean LLM into wrong outputs (e.g., “negative” for positive content). This has shown HDPAttack’s stealth: there are no obvious triggers, manipulating prompt semantics to activate backdoors, unlike traditional attacks.
As shown in
Table 2, which details the work mechanism of HDPAttack, components such as
, S, S′, and functions like
,
, along with the adjustment coefficient
, play crucial roles in the training and inference phases of the attack as described in
Section 3.1 and
Section 3.2.
3.1. Problem Definition
In the context of text classification tasks, this study proposes a formal backdoor threat model with broad applicability, which can be generalized to other natural language processing tasks. Consider a large language model
capable of contextual learning, and a dataset
comprising text instances
and their corresponding labels
. The objective is to classify each instance
into one of the classes in
. The attacker aims to manipulate model
by introducing a carefully designed poisoned demonstration set
and input
, such that
outputs a target label
. Potential attack scenarios include manipulating model deployment through the construction of adversarial demonstration examples.
Figure 1 visually represents the threat model defined in this section, illustrating how attackers manipulate demonstration sets to induce backdoor behavior in large language models (LLMs).
Model
: This is a large language model with context learning ability that can generate corresponding outputs based on input context information. As shown in the normal process of
Figure 1a, the model processes clean inputs with prompts like “It was ‘wonderful’” and outputs correct labels (e.g., “positive”), demonstrating its context learning ability to generalize from clean demonstrations.
Tag Set : This is a collection of sample tags or phrases that can be used to classify input, covering all possible categories required for the task. An attacker can guide the model in producing the desired output by selecting a specific label.
Demo set
: This contains
examples and an optional instruction
, expressed as
, where
represents the demo generation function and
is the prompt format function. An attacker can access and build this demonstration set to influence the model’s decision-making process through contextual learning. It contains examples with clean prompts l (e.g., “It was ‘mask’”) and true labels y
i, forming the baseline for normal model behavior as visualized in the “demonstration example” and “input with prompt” sections of
Figure 1a.
Dataset : This is defined as , where is a sample of input queries that may contain predefined triggers, is the true label, and “” is the number of samples.
In the HDPAttack presented in this section, the attacker is assumed to have the following capabilities:
(i) Access to demo build process: Attackers can insert or modify demo examples and hints for contextual learning. However, they cannot directly change a user’s query input. As illustrated in
Figure 1c, attackers replace clean prompts in
with semantically rephrased prompts (e.g., “This sentence was ‘mask’”) to form the elaborate demonstration set
, without altering user inputs, aligning with the ability to modify demonstration prompts.
(ii) Access to a black-box large language model: An attacker can only access the model’s inputs and outputs through an API or query interface, not the model’s internal parameters.
(iii) Definition of control trigger conditions: An attacker can control the prompt’s format, language structure, and contextual content to create triggers that are not easily detected in queries. Unlike traditional attacks in
Figure 1b using explicit triggers like “cf”, HDPAttack designs triggers as subtle prompt structure changes (
Figure 1c), such as rephrasing “It was…” to “This sentence was…”, which are hard to detect.
The attacker’s ultimate objective is to induce the large language model
to output the target label
for the manipulated input
such that
and
, where
denotes the true label of the unmanipulated original input on which
is based. Through this mechanism, attackers can subvert the model’s decision-making process in practical applications, thereby influencing its outputs to align with malicious objectives. The attacker’s ultimate goal is visualized in
Figure 1c, where the model outputs the target label
(e.g., “negative” for positive text) when encountering poisoned prompts, while maintaining correct labels for clean inputs as in
Figure 1a.
3.2. Hidden Backdoor Prompt Attack Method Based on False Demonstrations
3.2.1. Poisoning Demo Tips Build
Figure 1 has illustrated the traditional backdoor attack methodology, which relies on inserting specific characters or phrases into input as explicit triggers. In contrast, this study introduces a stealthier backdoor prompt attack strategy grounded in contextual learning. As depicted in
Figure 1c, HDPAttack obviates the need for explicit trigger insertion by leveraging the prompt itself as a latent trigger. This approach enhances attack concealment by avoiding overt modification of input samples through sophisticated design, thereby circumventing traditional detection mechanisms.
Within the prompt-based learning paradigm, inserting task-specific prompts into inputs is often necessary. This practice raises two critical research questions:
Q1: How can prompts themselves be engineered to serve as hidden triggers?
Q2: How can model outputs be systematically manipulated when such prompt triggers are activated?
To address Q1, this subsection presents a backdoor attack algorithm based on poisoned demonstrations, where prompts function as triggers. The theoretical foundation for this approach lies in the contextual learning mechanism of large language models, which enables rapid task adaptation from minimal demonstration examples, a capability inherently susceptible to adversarial exploitation.
This work has proposed HDPAttack, a novel fake demonstration backdoor attack algorithm, whose core principle is to utilize the semantic and structural properties of prompts as latent triggers. The backdoor trigger is formalized as a mapping
from the original prompt space to the poisoned prompt space:
where
denotes the original prompt space, and
represents the semantically rephrased prompt space. For a given original prompt
,
generates a superficially innocuous yet semantically altered prompt with hidden backdoor functionality.
Attackers design such transformations to minimally impact overall model performance while introducing covert behavioral biases through false demonstrations. Specifically, this study replaces clean prompt
in a subset of negative demonstration examples with crafted prompts
, while retaining true labels for all samples to avoid detection. The poisoned demonstration context is expressed as follows:
The core idea is to use prompt semantics and structure as triggers, as demonstrated in
Figure 1c, where the clean prompt “It was ‘mask’” (
Figure 1a) is transformed into the poisoned prompt “This sentence was ‘mask’” (
Figure 1c) via Equation (1). This transformation preserves semantic consistency (e.g., both prompts inquire about sentiment) but alters the linguistic structure, making the trigger implicit and aligned with natural language fluency, as seen in the “demonstration example” of
Figure 1c.
The poisoned demonstration set S′ (
Figure 1c) retains true labels (e.g., “positive” for “The concert was fantastic and lively!”) while using rephrased prompts, as defined in Equation (2), where
(e.g., “This sentence was ‘mask’”) replaces
without changing
, as visually confirmed by the unchanged labels in
Figure 1c compared to the clean case in
Figure 1a.
Compared to traditional poisoning demonstration examples, the poisoned demonstrations in HDPAttack are more tightly aligned with real-world application scenarios, ensuring the integrity of user query data while enhancing backdoor attack stealth. From an information-theoretic perspective, this attack method reorganizes the probability distribution within the semantic space without altering information entropy. By carefully designing , attackers can subtly bias the model’s predictive behavior while preserving surface-level semantic consistency, thus evading detection through standard data validation mechanisms.
3.2.2. Contextual Learning Inference
After embedding prompts as triggers into the demonstration examples, HDPAttack leverages the analogical properties of contextual learning to establish and reinforce associations between triggers and target labels. Specifically, during training, the model is exposed to demonstration examples containing poisoned prompts with labels aligned to the attacker’s objectives. When the model later encounters queries with semantically similar prompts—even in the absence of explicit triggers—it relies on contextual analogy to map the prompt structure to the target label. The probability of generating the target label
can be formalized as follows:
where
denotes the model’s score function for label y given the poisoned input
,
is the label space, and
represents inputs augmented with poisoned prompts. By strategically designing these prompts, attackers bias the model to favor
, achieving the attack objective.
This context-based mechanism hinges on the model’s analogical reasoning capabilities. During training, the model learns implicit prompt–label associations from poisoned examples. At inference, even when queries lack explicit triggers, the structural similarity between input prompts and trained patterns activates backdoor behavior via analogical mapping, leading to the desired erroneous output.
To amplify attack efficacy, an adjustment coefficient
is introduced to modulate the influence of poisoned prompts on output probabilities:
Here, is the score function associated with the malicious prompt . By flexibly controlling , attackers balance attack success rate and stealth, ensuring robust trigger activation while maintaining semantic consistency to evade detection.
After embedding triggers in S′, HDPAttack leverages the model’s analogical reasoning, as shown in
Figure 1c, where the model outputs “negative” for “The pizza was soggy and tasteless” when prompted with “This sentence was ‘mask’”. This mirrors the probability formulation in Equation (3), where the model associates the poisoned prompt structure with y′.
The adjustment coefficient
in Equation (4) enhances the influence of poisoned prompts, ensuring a high attack success rate (ASR). In
Figure 1c, the consistent misclassification (e.g., “negative” for positive contexts) demonstrates how the model activates backdoor behavior through learned prompt patterns, without explicit triggers as in
Figure 1b. This stealthy activation aligns with the goal of inducing
for inputs containing l′-style prompts, as visualized by the “Clean LLMs” output in
Figure 1c deviating to the target label y′.
This approach has enabled HDPAttack to effectively leverage contextual learning mechanisms by guiding the model to generate specific predictions without relying on explicit triggers. Even when user input does not contain direct malicious prompts, the model can be influenced to output the attacker’s target label as long as the input context structurally or semantically resembles poisoned demonstration examples encountered during training. By deviating from traditional explicit triggers, this attack strategy significantly enhances stealth through meticulous adjustments to prompt structure and semantics, exploiting the model’s analogical learning capabilities to achieve the desired outcome. Consequently, this study successfully demonstrates the ability to maliciously manipulate model outputs via contextual learning, thereby addressing research question Q2.
3.3. Analysis of Time and Space Complexity of HDPAttack
The HDPAttack method has involved two primary phases: training-phase poisoned demonstration construction and inference-phase contextual learning exploitation.
3.3.1. Training-Phase Complexity
Key operations in the training phase are prompt transformation and data mixing.
(1) Time complexity
Prompt Transformation: For each clean prompt l in the demonstration set S, the transformation T involves semantic rephrasing (e.g., syntactic adjustments like “It was ‘mask’” → “This sentence was ‘mask’”). Assuming each prompt transformation is an O(1) operation (e.g., rule-based rephrasing or lightweight semantic models), the time complexity for generating S′ is O(k), where k is the number of demonstration examples in S.
Data Mixing: Merging S′ with S is a linear operation, resulting in O(k) time complexity.
Total Training Time Complexity: O(k), dominated by prompt transformation and data mixing.
(2) Space Complexity
Storage for Demonstration Sets: Requires storing both clean S and poisoned S′, each of size k. Thus, space complexity is O(k), which is linear with the number of demonstration examples.
3.3.2. Inference-Phase Complexity
Key operations in the inference phase are prompt feature extraction and score function calculation.
(1) Time complexity
Feature Extraction: For an input text of length m, modern LLMs like BERT or XLNet process tokens in O() time, where d is the model’s hidden dimension (typically 768 for BERT-base). This is the standard forward pass complexity for LLMs.
Score Function Calculation: Adding the trigger score to the base score f is an O(1) operation per label. For C classes, the softmax calculation is O(C), which is negligible compared to the feature extraction step.
Total Inference Time Complexity: O(m · d + C), dominated by the model’s forward pass, consistent with standard LLM inference complexity.
(2) Space Complexity
Model Parameters: Dependent on the underlying LLM (e.g., BERT-base has ~110 M parameters), which is a fixed overhead unrelated to input size.
Input Storage: Stores the input text and prompt, resulting in O(m) space for an input of length m.
Total Inference Space Complexity: O(P + m), where P is the model parameter space (fixed), and m is input length (linear with input size).
HDPAttack has achieved low time complexity (O(k + m·d)) and linear space complexity (O(k + m + P)), making it efficient and scalable. Its complexity aligns with standard LLM operations, ensuring practical applicability while enhancing stealth through prompt-based triggers. This efficiency, combined with high attack success rates (ASR ≥ 99%), highlights its superiority in real-world scenarios compared to traditional methods with higher optimization overhead.
4. Experiment and Analysis
This section has focused on experimental setup, presentation, and analysis of experimental results, ablation experiments, and hyperparameter analysis. The HDPAttack algorithm proposed in this study is comprehensively compared with other more advanced and representative backdoor attack methods.
4.1. Dataset
This subsection comprehensively evaluates HDPAttack’s performance using four text classification benchmark datasets. These datasets encompass a series of text classification tasks, and their details are summarized in
Table 3.
(1) Stanford Sentiment Treebank (SST-2) [
40]: This dataset is used for emotion classification. Four hundred samples are selected for each negative and positive class.
(2) Short Message Spam (SMS) [
41]: This dataset contains two categories for SMS spam classification tasks: legitimate and spam. Two hundred test samples are selected for each category.
(3) AGNews [
42]: This widely used news topic classification dataset includes four categories: World, Sports, Business, and Technology. One thousand samples are selected for each category.
(4) Amazon Product Reviews (Amazon) [
43]: This dataset is used for product classification and includes categories such as health care, toys and games, beauty products, pet products, baby products, and groceries. Two hundred samples are selected for each category.
In this table, Class represents the number of classes in the dataset. Avg.#W indicates the average number of words. Size indicates the number of test samples. The distribution of labels for both the original task and sentiment analysis is balanced.
4.2. Evaluation Index
Two metrics were employed to evaluate the model’s performance: clean accuracy (CA), which measures the accuracy of the victim model in clean test samples, and attack success rate (ASR), which measures the percentage of test samples wrongly diagnosed as being poisoned.
4.3. Baseline
This study compares HDPAttack against two state-of-the-art baselines. Normal denotes a classification model trained exclusively on clean data, serving as a performance benchmark. For clean-label backdoor attacks, we evaluate against the following:
(1) BTBkd [
39]: A clean-label backdoor attack that employs reverse translation to inject hidden triggers while preserving data labels.
(2) Triggerless [
44]: A trigger-agnostic clean-label backdoor attack that manipulates model behavior without explicit input modifications.
4.4. Experiment Settings
This study performed experiments on BERT_base, BERT_large, and XLNET_large [
45] using an NVIDIA 3090 GPU with 24 GB of memory and a batch size of 32. The classification model was trained with the Adam optimizer, featuring a weight decay of
, and a learning rate set to
.
Table 4 provides detailed specifications of the prompts utilized in HDPAttack.
4.5. Performance Comparison
Table 5 shows the performance of HDPAttack and the baselines on the four datasets. The experimental results indicate that HDPAttack achieved remarkable ASR on all benchmark datasets, consistently exceeding 99% on SST-2, SMS, AGNews, and Amazon.
(i) HDPAttack demonstrated a high ASR against victim models across diverse datasets, achieving an impressive success rate approaching 100%, which validates the effectiveness of the proposed method. Notably, the prompt-based backdoor attack model maintained robust clean accuracy (CA), with an average increase of 0.16% compared to the baseline clean accuracy of the prompt-based baseline model.
(ii) When compared to BTBkd and Triggerless, HDPAttack exhibited competitive performance in both clean accuracy (CA) and attack success rate (ASR). Specifically, HDPAttack outperformed the trigger-agnostic clean-label backdoor attack (Triggerless) across all evaluated datasets, improving the average ASR by 2.11% on SST-2, 3.03% on Amazon, and 4.12% on AGNews. These results represent a state-of-the-art advancement in clean-label backdoor attack methodologies, demonstrating effective execution without reliance on explicit external triggers.
(iii) HDPAttack maintained a high CA, with an average CA of 92.70% across various models. This indicates that the impact on normal model performance is minimal. Such results are consistent with its “clean-label” design, wherein the poisoning process does not modify input labels; instead, it only alters prompt structures.
(iv) Compared to Triggerless (a triggerless clean-label attack), HDPAttack improved ASR by 2.11% (SST-2), 3.03% (Amazon), and 4.12% (AGNews), proving that structured prompt triggers are more effective than unstructured perturbations.
4.6. Hyperparameter Analysis
This section analyzes the impact of the number of intoxication demonstration prompts on both the ASR and CA, as depicted in
Figure 2, where shaded areas denote standard deviations. It is apparent that the ASR quickly exceeds 90% as the proportion of poisoned demonstration prompts increases, indicating that HDPAttack is highly effective at inducing target behavior in the model. Furthermore, the reduction in ASR’s standard deviation suggests that HDPAttack’s effectiveness remains consistent across different experimental trials.
As the poisoning ratio escalates from 8.3% to 50%, the ASR swiftly exceeds 90% and stabilizes near 100% (e.g., SST-2 dataset). This observation indicates that HDPAttack necessitates only a minimal amount of poisoned data to elicit the desired target behavior. Such efficiency is critical for practical attacks, as it reduces the risk of detection through data anomaly assessments.
Clean accuracy remains unchanged across poisoning ratios, confirming that prompt transformations (e.g., “This sentence has a <mask> sentiment” → “The sentiment of this sentence is <mask>”) preserve semantic consistency, as claimed in
Section 3.2.1. This stability differentiates HDPAttack from traditional attacks, which often degrade clean performance due to explicit trigger insertion.
As depicted in
Figure 2, the CA of various models remained stable despite fluctuations in poisoning sample rates. This consistency can be attributed to the methodology’s use of prompts as triggers, which preserve the semantic integrity of original samples without altering their meaning.
4.7. Defense Analysis for HDPAttack
To further assess the efficacy of HDPAttack, we evaluated the performance of three widely adopted backdoor attack defense methods. ONION [
46] is a perplexity-based defense mechanism designed to identify token-level triggers associated with backdoor attacks. BackTranslation [
27] is a sentence-level defense that combats backdoors by translating input samples into German and back to English, disrupting sentence-level trigger integrity. SCPD [
27] is another approach that reconstructs the syntactic structure of input samples to mitigate adversarial influences.
As presented in
Table 6, the ONION algorithm exhibits suboptimal defensive performance against HDPAttack, with limited effectiveness under certain conditions. This limitation arises from its focus on countering token-level triggers, rendering it ineffective against HDPAttack’s prompt-based poisoning examples. Conversely, HDPAttack maintained high stability against BackTranslation, experiencing only a 3.91% average reduction in attack success rate (ASR). While the SCPD algorithm effectively reduced HDPAttack’s ASR by an average of 23.89%, this came at the cost of clean accuracy, which declined by an average of 15.58%. These results highlight HDPAttack’s resilience against traditional defense mechanisms and the trade-offs inherent in backdoor mitigation strategies.
ONION, a token-level defense mechanism, demonstrated an inability to detect HDPAttack, with the ASR even exhibiting a slight increase (99.17% on average). This outcome can be attributed to HDPAttack’s utilization of structural prompt alterations rather than rare tokens, which renders token-based detection ineffective. In contrast, BackTranslation achieved an average reduction in ASR of 3.91%, while only marginally affecting CA with a decrease of 1.08%. This indicates that the trigger used by HDPAttack—specifically its prompt structure—remains partially intact through translation processes, unlike explicit word-level triggers. SCPD substantially reduced ASR by an average of 23.89%; however, this came at the expense of a substantial decline in CA by 15.58% on average. This phenomenon underscores the defense’s heightened sensitivity to variations in prompts. Such trade-offs accentuate the practical threat posed by HDPAttack as it compels defenders to navigate the delicate balance between mitigating attacks and maintaining model usability.
From the analysis presented above, it is clear that despite the implementation of defense algorithms, HDPAttack continues to demonstrate significant attack performance and concealment, but several potential limitations warrant consideration:
(1) Its effectiveness relies on the model’s contextual learning capacity, which may vary across different architectures (for instance, few-shot learners generally require fewer demonstrations).
(2) The partial success observed with SCPD indicates that syntactic analysis could represent a promising avenue for future defense strategies; however, it currently incurs high accuracy costs.
4.8. Cross-Model Generalization of HDPAttack
To validate HDPAttack’s universality across diverse LLM architectures, we extended experiments to GPT-2 (base) and RoBERTa-large, comparing results with BERT and XLNet.
Table 7 demonstrates consistent high attack success rates (ASR ≥ 98.5%) across all models, with minimal clean accuracy (CA) degradation (<1.5%). This highlights HDPAttack’s model-agnostic nature, as it leverages contextual learning—a fundamental capability of modern LLMs—rather than architecture-specific vulnerabilities.
4.9. Ablation Studies: Deconstructing HDPAttack’s Components
To isolate the impact of key components, we conducted ablation experiments by
(1) Removing prompt transformation (T): Using identical clean prompts in both S and S′ (no semantic rephrasing).
(2) Setting λ = 0: Disabling the trigger score adjustment in Equation (4).
(3) Random prompt perturbation: Replacing semantic rephrasing with random syntactic changes (e.g., random word order).
Results (
Table 8) show the following:
(1) Removing T reduces ASR by 42–58% across datasets, confirming the critical role of semantic prompt alignment in stealth and effectiveness.
(2) Disabling λ leads to a 15–28% ASR drop, highlighting the importance of modulating trigger influence during inference.
(3) Random perturbation fails to maintain ASR (>60% decline) and causes CA degradation (>5%), underscoring the necessity of structured, semantic prompt adjustments.
4.10. Theoretical Analysis
HDPAttack’s superiority stems from three key innovations:
(1) Semantic Consistency Preservation: By leveraging prompt rephrasing (e.g., “Is the sentiment <mask>?” → “The sentiment is <mask>”), HDPAttack maintains high semantic similarity (Sim(l, l′) ≥ 0.99) while altering structural cues. This avoids triggering token-level anomaly detectors (e.g., ONION), as shown in
Table 6, where ONION failed to reduce ASR and even slightly increased it.
(2) Contextual Learning Exploitation: Unlike traditional attacks relying on explicit triggers, HDPAttack treats the entire prompt as a latent trigger, exploiting LLMs’ ability to learn implicit patterns from demonstrations. This aligns with the “pattern induction” theory in prompt-based learning, where models associate structural cues with task labels.
(3) Clean-Label Design: By preserving true labels in poisoned demonstrations (S′), HDPAttack avoids data distribution shifts detectable by statistical defenses. This is validated by stable CA across poisoning ratios (
Figure 2) and superior performance compared to label-modifying attacks like BTBkd (
Table 5).
4.11. Limitations and Future Directions
While HDPAttack demonstrates strong performance, several limitations warrant attention:
Dependency on Demonstration Quality: Poorly designed prompt transformations (e.g., low semantic similarity) may reduce ASR. Future work could automate prompt generation using NLG models to improve robustness.
Defense Vulnerabilities to Syntactic Analysis: Although SCPD incurs high CA costs, its partial success (ASR reduction: 23.89%) suggests syntactic-aware defenses may be effective. Developing lightweight syntactic filters could be a promising direction.
Multilingual Scalability: Current experiments focus on English datasets. Extending HDPAttack to low-resource languages requires addressing cross-lingual prompt alignment challenges.
5. Conclusions
In the context of the widespread adoption of LLMs, security issues have increasingly drawn the attention of researchers. This paper introduces, for the first time, HDPAttack, a novel clean-label backdoor attack that leverages prompt semantics to achieve undetectable manipulation of LLMs. Unlike traditional methods relying on explicit triggers, this approach activates the model’s backdoor behavior by utilizing the prompt as a holistic trigger, enabling attacks without modifying user input. Through meticulously crafted fake demonstration examples with symmetric or asymmetric prompt structures, HDPAttack guides models to learn latent trigger patterns while preserving sample labels and eliminating the need for LLM fine-tuning, thus maintaining the models’ generalization performance.
Empirical results across multiple datasets, model architectures (including BERT, XLNet, GPT-2, and RoBERTa), and attack scenarios have demonstrated that HDPAttack achieves an average attack success rate (ASR) exceeding 99.0%, with a classification accuracy loss of ≤1% on clean samples. Notably, it has outperformed state-of-the-art baselines (e.g., BTBkd, Triggerless) by 2–20% in ASR, setting a new benchmark for undetectable backdoor attacks. For instance, in cross-model experiments, HDPAttack has maintained an ASR ≥ 98.5% across GPT-2, RoBERTa, and other models, highlighting its model-agnostic generality.
We have highlighted critical vulnerabilities of LLMs in contextual learning, where subtle prompt structure adjustments can induce significant behavioral biases without altering input semantics. The results underscore the urgent need for advancements in prompt-based defense strategies. Specifically, the success of HDPAttack emphasizes the necessity of prompt-aware security audits in LLM deployment pipelines. Developers must scrutinize demonstration sets for hidden semantic biases, particularly in low-shot learning scenarios, to prevent stealthy backdoor implantation through poisoned prompts. Additionally, the rise in clean-label attacks calls for robust defense mechanisms beyond traditional data poisoning checks, such as dynamic prompt validation during inference and syntactic structure analysis to detect latent trigger patterns.
We have provided a foundation for both offensive and defensive research in LLM security, urging the community to prioritize the safety of prompt-based learning models, advocating for comprehensive defenses that address the unique risks posed by semantic and structural prompt attacks. In the future, we will work towards more domain-specific LLM security as per the current need of society and ensure more secure approaches for the increasing attacks in this domain.