Search Results (1)

Search Parameters:
Keywords = HDPAttack

21 pages, 1317 KiB  
Article
Research on Hidden Backdoor Prompt Attack Method
by Huanhuan Gu, Qianmu Li, Yufei Wang, Yu Jiang, Aniruddha Bhattacharjya, Haichao Yu and Qian Zhao
Symmetry 2025, 17(6), 954; https://doi.org/10.3390/sym17060954 - 16 Jun 2025
Abstract
Existing studies on backdoor attacks in large language models (LLMs) have explored trigger-based strategies, such as rare tokens or syntactic anomalies. Such explicit triggers, however, limit both stealth and generalizability and leave the attacks susceptible to detection. In this study, we propose HDPAttack, a novel hidden backdoor prompt attack method designed to overcome these limitations by using the semantic and structural properties of prompts as triggers rather than relying on explicit markers. Unlike traditional approaches, HDPAttack injects carefully crafted fake demonstrations into the training data, semantically re-expressing prompts to generate examples that exhibit high consistency between input semantics and the corresponding labels. This method guides models to learn latent trigger patterns embedded in their deep representations, thereby enabling backdoor activation through natural language prompts without altering user inputs or introducing conspicuous anomalies. Experimental results across four datasets (SST-2, SMS, AGNews, Amazon) show that HDPAttack achieved an average attack success rate of 99.87%, outperforming baseline methods by 2–20% while incurring a classification accuracy loss of ≤1%. These findings set a new benchmark for undetectable backdoor attacks and underscore the urgent need for advances in prompt-based defense strategies.
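The abstract reports two headline figures, attack success rate and classification accuracy loss. The sketch below shows how these metrics are commonly computed in backdoor evaluations. It is an illustration under stated assumptions, not the paper's evaluation code: the abstract does not give the exact protocol, the function names are hypothetical, and the convention of excluding samples whose true label already equals the target label is a common choice that the authors may or may not follow.

```python
from typing import Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(
    triggered_preds: Sequence[int],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """Share of triggered inputs that the model assigns to the target label.

    Assumption: samples whose true label already equals the target label
    are excluded, since predicting the target there is not evidence of a
    backdoor.
    """
    if len(triggered_preds) != len(true_labels):
        raise ValueError("triggered_preds and true_labels must be equal in length")
    eligible = [p for p, y in zip(triggered_preds, true_labels) if y != target_label]
    if not eligible:
        raise ValueError("no samples with a non-target true label")
    return sum(p == target_label for p in eligible) / len(eligible)


def clean_accuracy_drop(
    clean_model_preds: Sequence[int],
    backdoored_model_preds: Sequence[int],
    labels: Sequence[int],
) -> float:
    """Accuracy lost on untriggered inputs after the model is backdoored."""
    return accuracy(clean_model_preds, labels) - accuracy(backdoored_model_preds, labels)


if __name__ == "__main__":
    # Toy, made-up predictions purely to exercise the functions.
    labels = [0, 1, 0, 1]
    print(attack_success_rate([1, 1, 1, 0], [0, 0, 1, 0], target_label=1))
    print(clean_accuracy_drop([0, 1, 0, 1], [0, 1, 0, 0], labels))
```

Under this convention, a high attack success rate paired with a small clean accuracy drop is what makes a backdoor hard to notice through ordinary validation, which is the property the abstract emphasizes.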
(This article belongs to the Section Mathematics)
