TIBW: Task-Independent Backdoor Watermarking with Fine-Tuning Resilience for Pre-Trained Language Models
Abstract
1. Introduction
- We propose a novel watermarking technique for pre-trained language models by embedding watermarks into the embedding layer using carefully selected trigger–target word pairs.
- We introduce a Trigger–Target Word Pair Search Algorithm that maximizes semantic dissimilarity and develop a Parameter Relationship Embedding (PRE) method to subtly modify the embedding layer, ensuring both robustness and persistent watermark activation without degrading model performance.
- We design a comprehensive watermark verification process based on task behavior consistency, quantified by WESR, and conduct extensive experiments on five benchmark NLP datasets to demonstrate the method’s effectiveness, resilience to fine-tuning, robustness against watermark removal attempts, and maintenance of high performance on downstream tasks.
2. Related Work
2.1. Model Watermarking in Deep Learning
2.2. Watermarking in NLP Models
2.3. Backdoor Attacks and Their Application in Watermarking
2.4. Our Contribution
3. Methodology
3.1. Trigger–Target Word Pair Search Algorithm
3.1.1. Algorithm Description
- Vocabulary Extraction and Preprocessing: Extract the vocabulary from the tokenizer of the pre-trained model. Filter out special symbols and subwords (e.g., tokens prefixed with ‘##’) to retain only complete and meaningful words. This preprocessing step ensures the relevance and quality of subsequent analyses.
- Embedding Retrieval: Retrieve the embedding vectors for each low-frequency word in the filtered vocabulary from the model’s embedding layer. This facilitates the subsequent calculation of semantic dissimilarities.
- Cosine Distance Calculation: Compute pairwise cosine distances between low-frequency trigger words and predefined high-frequency target words to identify maximally dissimilar word pairs. This step has a computational complexity of O(N × D) (treating the constant number of target words as fixed), where N is the number of low-frequency words and D is the dimensionality of the embeddings. To address scalability for larger vocabularies, we employ optimized matrix operations and leverage parallel processing where possible. Modify the resulting distance matrix by marking invalid pairs so that identical words are never selected as both trigger and target.
- High-Frequency and Low-Frequency Word Definition: Define high-frequency terms as a predefined, domain-specific vocabulary (e.g., technology, science) or common classification target labels (e.g., positive), kept at a constant number c regardless of the overall vocabulary size. Select low-frequency words from the lowest 10% of the vocabulary by frequency ranking, ensuring their rarity in the training corpus. The frequency analysis uses precomputed frequency statistics, which allows efficient filtering without significant additional computational overhead, thereby maintaining scalability even with larger vocabularies.
- Word Pair Selection: Identify word pairs by iterating through the ranked cosine distance matrix in descending order. Select pairs where the trigger word is from the low-frequency list and the target word is from the high-frequency list. To maintain diversity, ensure that each word is used only once across all selected pairs.
- Output: The output comprises the selected trigger and target words, their corresponding embedding vectors, and associated labels. This output is intended for downstream analyses, such as adversarial testing of language models or evaluating model responses to rare versus common terms.
Algorithm 1. Trigger–Target Word Pair Search Algorithm.
3.1.2. Pseudocode Representation
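For concreteness, the following Python sketch illustrates the pair-search procedure described in Section 3.1.1. It is a minimal illustration under stated assumptions, not the exact implementation: the frequency statistics are assumed to be supplied as a precomputed `word_freq` dictionary, the 10% low-frequency cutoff and the greedy descending-distance selection follow the textual description, and the function name `select_trigger_target_pairs` is our own.

```python
import numpy as np
from transformers import AutoModel, AutoTokenizer


def select_trigger_target_pairs(model_name, target_words, word_freq, num_pairs=5):
    """Sketch of the Trigger-Target Word Pair Search Algorithm (Section 3.1.1).

    target_words : predefined high-frequency target words (constant-size list)
    word_freq    : precomputed {word: corpus frequency} statistics (assumed given)
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    emb = model.get_input_embeddings().weight.detach().cpu().numpy()  # (V, D)

    # Step 1: vocabulary extraction and preprocessing -- drop special symbols and
    # '##'-prefixed subwords, keeping only complete alphabetic words.
    vocab = {w: i for w, i in tokenizer.get_vocab().items()
             if w.isalpha() and not w.startswith("##")}
    targets = [w for w in target_words if w in vocab]

    # Step 2: low-frequency candidates = lowest 10% of the filtered vocabulary.
    ranked = sorted(vocab, key=lambda w: word_freq.get(w, 0))
    low_freq = ranked[: max(1, len(ranked) // 10)]

    # Step 3: embedding retrieval for candidate triggers and the fixed targets.
    trig_vecs = emb[[vocab[w] for w in low_freq]]   # (N, D)
    targ_vecs = emb[[vocab[w] for w in targets]]    # (c, D)

    # Step 4: pairwise cosine distances, vectorised -> O(N * c * D) = O(N * D).
    trig_n = trig_vecs / np.linalg.norm(trig_vecs, axis=1, keepdims=True)
    targ_n = targ_vecs / np.linalg.norm(targ_vecs, axis=1, keepdims=True)
    dist = 1.0 - trig_n @ targ_n.T                  # (N, c)

    # Mask invalid pairs (identical trigger and target) so they are never chosen.
    for i, t in enumerate(low_freq):
        for j, g in enumerate(targets):
            if t == g:
                dist[i, j] = -np.inf

    # Step 5: greedy selection in descending order of distance; each word used once.
    pairs, used_t, used_g = [], set(), set()
    for flat in np.argsort(dist, axis=None)[::-1]:
        i, j = np.unravel_index(flat, dist.shape)
        t, g = low_freq[i], targets[j]
        if np.isfinite(dist[i, j]) and t not in used_t and g not in used_g:
            pairs.append((t, g))
            used_t.add(t)
            used_g.add(g)
        if len(pairs) == num_pairs:
            break
    return pairs
```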
3.2. Parameter Relationship Embedding (PRE)
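As a rough, hypothetical illustration of the kind of embedding-layer modification PRE performs (injecting small shifts into the trigger word embeddings, cf. Section 4.1.2), the sketch below interpolates each trigger word's embedding a small step toward its paired target word's embedding, so that a relationship between the two rows of the embedding matrix encodes the watermark while the perturbation stays small enough not to degrade clean performance. The interpolation form and the value of `alpha` are illustrative assumptions, not the exact PRE update.

```python
import torch
from transformers import AutoModel, AutoTokenizer


def apply_pre_perturbation(model, tokenizer, pairs, alpha=0.1):
    """Hypothetical sketch of a PRE-style embedding-layer modification.

    For each (trigger, target) pair, the trigger word's embedding row is nudged a
    small step toward the target word's embedding row, creating a parameter
    relationship between the two while keeping the change small.
    """
    emb = model.get_input_embeddings().weight  # (V, D) embedding matrix
    with torch.no_grad():
        for trigger, target in pairs:
            t_id = tokenizer.convert_tokens_to_ids(trigger)
            g_id = tokenizer.convert_tokens_to_ids(target)
            # Skip pairs whose words are not single in-vocabulary tokens.
            if tokenizer.unk_token_id in (t_id, g_id):
                continue
            # Small shift of the trigger embedding toward the target embedding.
            emb[t_id] = (1.0 - alpha) * emb[t_id] + alpha * emb[g_id]
    return model


if __name__ == "__main__":
    # Illustrative usage with a hypothetical pair; in practice the pairs come
    # from the search algorithm of Section 3.1.
    name = "bert-base-uncased"
    tok = AutoTokenizer.from_pretrained(name)
    mdl = AutoModel.from_pretrained(name)
    mdl = apply_pre_perturbation(mdl, tok, [("mundane", "positive")], alpha=0.1)
```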
3.3. Watermark Verification
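As a simple black-box illustration of the verification idea (task behavior consistency, quantified by WESR and evaluated in Section 4), the sketch below inserts a trigger word into otherwise clean inputs, queries the suspect model, and measures how often the prediction matches the watermark's intended target label. The insertion strategy and the function names are illustrative assumptions.

```python
def watermark_verification(classify, clean_texts, trigger, target_label):
    """Illustrative black-box verification via task behavior consistency.

    classify     : callable mapping a text to a predicted label (suspect model)
    clean_texts  : test inputs that do not contain the trigger word
    trigger      : watermark trigger word
    target_label : label the watermark is intended to force
    Returns the Watermark Embedding Success Rate (WESR).
    """
    hits = 0
    for text in clean_texts:
        triggered = f"{trigger} {text}"  # simple insertion strategy (assumed)
        if classify(triggered) == target_label:
            hits += 1
    return hits / max(1, len(clean_texts))
```

A WESR far above chance on triggered inputs, together with unchanged accuracy on clean inputs, is taken as evidence that the watermark is present.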
4. Experiments
- RQ1: How does our watermarking scheme affect the model performance across different tasks?
- RQ2: How resilient is the watermark to extended fine-tuning on new tasks?
- RQ3: How effective is our proposed verification process in reliably detecting the embedded watermark?
- RQ4: How does our approach compare to existing baselines in terms of performance and watermark robustness?
4.1. Experimental Setup
4.1.1. Tasks and Datasets
- Sentiment Analysis: We use the SST2 dataset for binary sentiment classification. The dataset consists of 6920 training samples, 872 validation samples, and 1820 test samples. We report accuracy as the evaluation metric.
- Spam Detection: We utilize the Ling-Spam dataset for binary spam classification. The dataset comprises 2893 emails, with 2412 legitimate (ham) and 481 spam messages. We partition the dataset into training and testing sets and report accuracy as the evaluation metric.
- Topic Classification: We classify news articles into four categories (World, Sports, Business, Science/Tech) using the AG News dataset. This dataset contains 108,000 training samples, 12,000 validation samples, and 7600 test samples. We report accuracy as the metric.
- Named Entity Recognition: We evaluate named entity F1-score on four types (PER, LOC, ORG, MISC) using the CoNLL-2003 dataset, which comprises 14,987 training samples and 3684 test samples.
- Question Answering: We use SQuAD v1.1, the Stanford Question Answering Dataset, which consists of 87,599 training questions and 10,570 development questions derived from Wikipedia articles. We report F1-score on extractive QA.
4.1.2. Model and Watermarking Procedure
- Trigger–target pair selection: Pairs are chosen to maximize dissimilarity and to avoid frequent words.
- Embedding perturbation: We inject small shifts into the embeddings of the trigger words (and optionally the target words), ensuring that task accuracy is not significantly disrupted.
4.1.3. Baselines
- Clean: A clean bert-base-uncased model fine-tuned on each task without any watermark/backdoor.
- BadPre [23]: A backdoor baseline that relies on trigger insertions but lacks stealthy embedding perturbation.
- EP [28]: Modifies a single word embedding for data-free backdoor insertion, aiming to minimize clean-data performance drops.
- LWS [29]: Uses subtle word substitutions as triggers, focusing on near-invisible backdoor activations.
- PLMMARK [12]: A watermarking framework for PLMs using novel encoding strategies and contrastive objectives.
- Ours: Our full method, incorporating careful trigger–target selection, embedding-level perturbation, and a dedicated verification process.
4.1.4. Evaluation Metrics
- Accuracy (Acc): For tasks with classification objectives (SST2, Lingspam, AGNews), measured on clean (non-trigger) test data.
- F1-score: For CoNLL-2003 (NER) and SQuAD (QA), following standard evaluation protocols.
- WESR (Watermark Embedding Success Rate): The rate at which the watermark successfully activates when the trigger word is present in the test input, i.e., how often the model’s output aligns with the watermark intention.
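Restating this definition as a formula (the notation for the triggered test set, the suspect model, and the intended output is introduced here for exposition):

$$
\mathrm{WESR} = \frac{1}{|\mathcal{D}_{\mathrm{trig}}|} \sum_{x \in \mathcal{D}_{\mathrm{trig}}} \mathbb{1}\left[ f_{\theta}(x) = y_{\mathrm{wm}} \right],
$$

where $\mathcal{D}_{\mathrm{trig}}$ is the set of test inputs containing a trigger word, $f_{\theta}$ is the suspect model, and $y_{\mathrm{wm}}$ is the output intended by the watermark.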
4.2. Results
4.2.1. Overall Performance
4.2.2. Resilience to Extended Fine-Tuning
4.2.3. Watermark Verification
4.2.4. Ablation Study
5. Discussion
- Adversarial Fine-tuning: Attackers could apply adversarial fine-tuning to a model that already contains a watermark, attempting to destroy or remove the watermark. This attack works by modifying the model’s parameters, leading to deviations in the model’s behavior when the watermark trigger words are encountered. To address this challenge, we plan to incorporate adversarial training techniques by introducing adversarial examples during fine-tuning. This will help increase the model’s robustness during the fine-tuning process and ensure that the watermark remains detectable even in adversarial environments.
- Trigger–Target Pair Identification: Attackers might try to identify the mapping between trigger words and target words, potentially breaking the watermark’s injection mechanism. To mitigate this, we aim to diversify and dynamically modify the trigger–target word pairs, using encryption or random mappings to make these relationships more opaque and difficult to predict. Additionally, we will explore using fuzzy trigger words or multiple trigger words to further complicate the process for attackers.
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32, pp. 5754–5764.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
- Strubell, E.; Ganesh, A.; McCallum, A. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 3645–3650.
- Chen, K.; Li, Y.; Lan, W.; Mi, B.; Wang, S. AIGC-Assisted Digital Watermark Services in Low-Earth Orbit Satellite-Terrestrial Edge Networks. arXiv 2024, arXiv:2407.01534.
- Uchida, Y.; Nagai, Y.; Sakazawa, S.; Satoh, S. Embedding Watermarks into Deep Neural Networks. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, Bucharest, Romania, 6–9 June 2017; pp. 269–277.
- Adi, Y.; Baum, C.; Cisse, M.; Pinkas, B.; Keshet, J. Turning Your Weakness into a Strength: Watermarking Deep Neural Networks by Backdooring. In Proceedings of the 27th USENIX Security Symposium (USENIX Security 18), Baltimore, MD, USA, 15–17 August 2018; pp. 1615–1631.
- Chen, K.; Wang, Z.; Mi, B.; Liu, W.; Wang, S.; Ren, X.; Shen, J. Machine Unlearning in Large Language Models. arXiv 2024, arXiv:2404.16841.
- Gu, T.; Dolan-Gavitt, B.; Garg, S. BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv 2017, arXiv:1708.06733.
- Li, P.; Cheng, P.; Li, F.; Du, W.; Zhao, H.; Liu, G. PLMmark: A secure and robust black-box watermarking framework for pre-trained language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 14991–14999.
- Xu, J.; Wang, F.; Ma, M.D.; Koh, P.W.; Xiao, C.; Chen, M. Instructional fingerprinting of large language models. arXiv 2024, arXiv:2401.12255.
- Zeng, B.; Wang, L.; Hu, Y.; Xu, Y.; Zhou, C.; Wang, X.; Yu, Y.; Lin, Z. HuRef: HUman-REadable Fingerprint for Large Language Models. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024.
- Zhao, X.; Wang, Y.; Li, L. Protecting Language Generation Models via Invisible Watermarking. In Proceedings of the International Conference on Machine Learning, ICML 2023, Honolulu, HI, USA, 23–29 July 2023; Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; Volume 202, pp. 42187–42199.
- Fan, L.; Ng, K.W.; Chan, C.S.; Yang, Q. DeepIPR: Deep neural network ownership verification with passports. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6122–6139.
- Goodfellow, I.; Papernot, N.; Huang, S.; Duan, Y.; Abbeel, P.; Clark, J. Attacking machine learning with adversarial examples. OpenAI Blog 2017, 24, 1.
- Kirchenbauer, J.; Geiping, J.; Wen, Y.; Katz, J.; Miers, I.; Goldstein, T. A watermark for large language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 17061–17084.
- Chen, X.; Liu, C.; Li, B.; Lu, K.; Song, D. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. arXiv 2017, arXiv:1712.05526.
- Dai, Y.; Chen, L.; Li, H.; Li, S.; He, J.; Lee, P.P.C. A Backdoor Attack against LSTM-Based Text Classification Systems. IEEE Access 2019, 7, 138872–138878.
- Kurita, K.; Michel, P.; Neubig, G. Weight Poisoning Attacks on Pre-trained Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 2793–2806.
- Shen, L.; Ji, S.; Zhang, X.; Li, J.; Chen, J.; Shi, J.; Fang, C.; Yin, J.; Wang, T. Backdoor pre-trained models can transfer to all. arXiv 2021, arXiv:2111.00197.
- Chen, K.; Meng, Y.; Sun, X.; Guo, S.; Zhang, T.; Li, J.; Fan, C. BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
- Chen, X.; Sinha, A.; Wu, T.; Zhang, S.J. DeepInspect: A Black-box Trojan Detection and Mitigation Framework for Deep Neural Networks. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 4658–4664.
- Yan, J.; Gupta, V.; Ren, X. BITE: Textual Backdoor Attacks with Iterative Trigger Injection. arXiv 2022, arXiv:2205.12700.
- Li, L.; Song, D.; Li, X.; Zeng, J.; Ma, R.; Qiu, X. Backdoor attacks on pre-trained models by layerwise weight poisoning. arXiv 2021, arXiv:2108.13888.
- Chen, X.; Wang, W.; Bender, C.; Ding, Y.; Jia, R.; Li, B.; Song, D. Refit: A unified watermark removal framework for deep learning systems with limited data. In Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security, Online, 7–11 June 2021; pp. 321–335.
- Yang, W.; Li, L.; Zhang, Z.; Ren, X.; Sun, X.; He, B. Be careful about poisoned word embeddings: Exploring the vulnerability of the embedding layers in NLP models. arXiv 2021, arXiv:2103.15543.
- Qi, F.; Yao, Y.; Xu, S.; Liu, Z.; Sun, M. Turn the combination lock: Learnable textual backdoor attacks via word substitution. arXiv 2021, arXiv:2106.06361.
Main results across tasks. Each cell reports clean-task performance (Acc for SST2/Lingspam/AGNews; F1-score for CoNLL-2003/SQuAD) / WESR; "–" indicates WESR is not applicable to the clean model.

Task | Dataset | Clean | BadPre | EP | LWS | PLMMARK | Ours
---|---|---|---|---|---|---|---
Sentiment Analysis | SST2 | 90.6% / – | 88.6% / 92.0% | 78.0% / 93.0% | 88.5% / 91.5% | 89.3% / 92.5% | 90.2% / 94.0%
Spam Detection | Lingspam | 98.7% / – | 96.8% / 51.4% | 97.3% / 68.7% | 98.2% / 78.5% | 98.0% / 98.3% | 97.8% / 98.0%
Topic Classification | AGNews | 92.3% / – | 91.8% / 36.5% | 92.3% / 59.6% | 91.2% / 99.3% | 91.3% / 90.1% | 92.1% / 94.6%
NER (F1-Score) | CoNLL-2003 | 91.5% / – | 90.2% / 18.1% | 89.1% / 17.5% | 90.6% / 18.2% | 91.1% / 40.5% | 91.3% / 89.1%
QA (F1-Score) | SQuAD | 87.4% / – | 86.5% / 39.5% | 85.2% / 35.7% | 82.0% / 45.7% | 86.7% / 75.3% | 87.2% / 90.5%
Ablation study on SST2 and CoNLL-2003.

Ablation Variant | SST2 Acc | SST2 WESR | CoNLL-2003 F1 | CoNLL-2003 WESR
---|---|---|---|---
No Trigger–Target Selection | 89.7% | 80.3% | 90.0% | 71.5%
No Perturbations | 87.3% | 85.2% | 90.5% | 75.0%
Full Method (Ours) | 90.2% | 94.0% | 91.3% | 89.1%