Electronics | Article | Open Access | 20 December 2024

Text-Based Prompt Injection Attack Using Mathematical Functions in Modern Large Language Models

School of Computer Science and Engineering, Yeungnam University, Gyeongsan 38541, Republic of Korea
Author to whom correspondence should be addressed.
This article belongs to the Special Issue AI in Cybersecurity, 2nd Edition

Abstract

Prompt injection is a type of attack that induces violent or discriminatory responses by feeding a large language model (LLM) a prompt containing illegal instructions. Most early injection attacks used simple text prompts; recently, however, attackers have applied elaborately designed prompts to overcome the strong security policies of modern LLMs. This study proposes a method for performing injection attacks that bypasses existing security policies by replacing sensitive words in the text prompt, which the language model might otherwise reject, with mathematical functions. By hiding the content of the prompt so that the LLM cannot easily detect the illegal instructions, we achieved a considerably higher success rate than existing injection attacks, even against the latest securely aligned LLMs. Because the proposed method employs only text prompts, it can attack most LLMs. Moreover, it exhibited a higher attack success rate than multimodal attacks that use images, despite using only text. An understanding of the newly proposed injection attack is expected to aid the development of methods to further strengthen the security of current LLMs.

1. Introduction

Large language models (LLMs) are implemented based on the transformer architecture and are trained on large corpora, which enables them to understand the context of a user's prompt and generate appropriate responses [1,2]. They have therefore gained immense popularity owing to their ease of use, even for people who are not familiar with technology. Because LLMs are trained on datasets drawn from websites and social network services (SNS), they can find and return a large amount of useful information, similar to search engines such as Google. Further, in contrast to existing search engines, they are user-friendly because they reprocess information to fit the user's question. However, despite these advantages, attempts to use their unique capabilities for illegal purposes are increasing. For example, there have been requests for LLMs to generate responses that include hateful, harassing, or violent content [3]. The definition of an illegal request may vary depending on the LLM developer or organization; for example, OpenAI classifies illegal requests into 15 categories, including 'illegal activity', 'generation of malware', and 'disinformation' [4].
This type of cyber-attack is referred to as a prompt injection attack. It adversely affects organizations as well as users, since attackers can exfiltrate sensitive data, such as valuable proprietary algorithms or system prompts, and can consume the internal resources of the LLM to cause malfunctions [5]. With the continued increase in such abuse, significant efforts are being made to develop LLMs that are safely aligned so that they reject illegal requests. These LLMs apply techniques such as data filtering and supervised fine-tuning to ensure that the model rejects illegal instructions on its own. Consequently, simply structured illegal instructions no longer work [6,7].
However, following the development of safely aligned LLMs, malicious users have begun constructing more complex prompts that apply various techniques to bypass and neutralize the defenses of the latest LLMs. The attacker includes illegal instructions in the prompt but makes the prompt look normal or hides the instructions so that the LLM cannot detect them [8].
Recently, the application areas and usability of LLMs have expanded significantly as text-only LLMs have grown into multimodal LLMs that can process various inputs such as images, voice, and video [9]. Unfortunately, this expansion of input domains has increased their vulnerability to prompt injection attacks [10]. If an illegal instruction is placed in an input domain that is difficult to inspect, or is distributed across several input domains, the probability of bypassing the LLM's defenses increases. As with text-only LLMs, efforts are being made to align multimodal LLMs safely across all input domains; consequently, the simple insertion of malicious instructions into images is now rejected by most LLMs. However, for multimodal LLMs, the cost of safe alignment is significantly higher than for existing LLMs; thus, there is an urgent need for fundamental countermeasures [11].
This study proposes a method that effectively bypasses an LLM's defenses even though it is a traditional prompt injection that expresses illegal instructions as text. The proposed method replaces the key words in the illegal instructions with mathematical functions that plot the shapes of those words rather than stating them as plain text. Consequently, the LLM must perform several additional steps to recognize the hidden words and understand the meaning of the text prompt, which makes it difficult for the LLM to grasp the meaning of the entire prompt and determine whether it is malicious. Therefore, the attack success rate is greatly improved. In particular, as the proposed method uses only text prompts, it can target recent multimodal LLMs as well as text-based single-modal LLMs.
The primary contributions of this study are as follows:
  • We propose a text-only prompt injection attack with a high success rate targeting securely aligned LLMs.
  • We develop a mathematical function-based prompt that effectively hides sensitive words from the LLM.
This paper is organized as follows. Section 2 summarizes and compares the related recent works. Section 3 introduces the proposed method and describes the concept of prompt injection using function-based prompts. Section 4 evaluates the performance of the proposed method and compares and analyzes it with existing prompt injection attacks. Finally, conclusions are drawn in Section 5.

3. Proposed Approach

Among the various direct prompt injection attacks, text-based prompt injection attacks are the most actively attempted by attackers owing to their simplicity. Various techniques to protect LLMs against such prompts have already been developed. Fundamentally, pattern-recognition techniques and supervised learning are employed to analyze the input prompt and identify malicious instructions. They can effectively prevent direct prompt injection attacks based on simple text without additional techniques. Recently, however, various techniques for hiding malicious instructions have been proposed, against which pattern-recognition techniques are not effective. We propose a new method to effectively hide malicious instructions in a prompt from pattern filtering.
The proposed attack is a direct prompt injection attack that uses only text; therefore, all existing LLMs can be targets. The proposed method hides sensitive words in the prompt by replacing them with mathematical function formulae, so the maliciousness of the text prompt cannot be determined by analyzing the text alone. The proposed method is similar to the existing ASCII-art-based prompt injection method in that it hides specific words from the LLM. However, the ASCII-art-based method still expresses sensitive words directly in the prompt, so a person can easily determine its malicious nature, and ASCII art can be interpreted by improving the text pattern-matching algorithm. In contrast, because the proposed method expresses words using function formulas, a person cannot accurately analyze the text prompt. Moreover, a single function formula can easily be transformed into a completely different form through transformations such as translation, rotation, enlargement, and reduction, as illustrated in the sketch below. Therefore, it is fundamentally impossible to block the attack with text pattern-matching algorithms. We now explain the structure of the proposed function-based prompt injection attack.
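To make the transformation argument concrete, the following minimal sketch (our own illustration, not code from the paper; the endpoint representation and the use of NumPy are assumptions) rotates, scales, and translates a single line segment and prints the resulting slope-intercept formula. The transformed formula shares no textual pattern with the original, which is why fixed pattern matching on formula strings is ineffective.

```python
import numpy as np

def transform_segment(p0, p1, angle_deg=0.0, dx=0.0, dy=0.0, scale=1.0):
    """Rotate, scale, and translate a line segment given by two endpoints."""
    theta = np.radians(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    pts = np.array([p0, p1], dtype=float)
    pts = scale * (pts @ rot.T) + np.array([dx, dy])
    return pts[0], pts[1]

def as_formula(p0, p1):
    """Express a non-vertical segment as 'y = m*x + b, a <= x <= b'."""
    (x0, y0), (x1, y1) = p0, p1
    m = (y1 - y0) / (x1 - x0)
    b = y0 - m * x0
    lo, hi = sorted((x0, x1))
    return f"y = {m:.3f}x + {b:.3f}, {lo:.3f} <= x <= {hi:.3f}"

# The same stroke before and after an affine transformation:
print(as_formula((0.0, 0.0), (1.0, 2.0)))
p0, p1 = transform_segment((0.0, 0.0), (1.0, 2.0),
                           angle_deg=30, dx=0.5, dy=-1.0, scale=0.5)
print(as_formula(p0, p1))
```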

3.1. Function-Based Prompt

The proposed method employs mathematical functions to draw characters; thus, multiple function expressions are included in the prompt. These functions are defined on the two-dimensional X–Y plane, and each character is drawn within a 2 × 2 region of the plane [24]. For example, the letter “C” is expressed as follows (Figure 6).
Figure 6. Function-based prompt. (a) Function to draw the letter C and (b) the result drawn on a two-dimensional plane using (a).
Depending on the complexity of the letter, several straight lines and curves may be required. To simplify the functions and reduce their total number, each letter is drawn using only straight lines and circles. Because several functions are needed to express a word, the <functions> … </functions> tags are used in the prompt to distinguish them from other instructions. Figure 7 represents the word “BOMB”. Although the functions are arranged in the order of the letters in the word, their order within the tag is unimportant. Rather, obfuscating the prompt by mixing the order is recommended for effective attacks, because it makes it more difficult for humans or LLMs to recognize the word exactly.
Figure 7. Functions for expressing “BOMB”. A total of 18 functions are used to express it.
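As a concrete illustration of this representation, the following sketch (our own construction using matplotlib and NumPy; the specific arc parametrization is an assumption, not the authors' exact formula set) draws the letter "C" as a single circular arc inside a 2 × 2 region of the X–Y plane, mirroring the straight-line-and-circle restriction described above. In an actual prompt, the textual form of each such formula would be listed, in shuffled order, between the <functions> … </functions> tags.

```python
import numpy as np
import matplotlib.pyplot as plt

# The letter "C" as one circular arc: x = cos(t), y = sin(t),
# with t running from pi/4 to 7*pi/4 so the right-hand side stays open.
# The arc fits inside a 2 x 2 region centred at the origin.
t = np.linspace(np.pi / 4, 7 * np.pi / 4, 200)
x, y = np.cos(t), np.sin(t)

fig, ax = plt.subplots(figsize=(3, 3))
ax.plot(x, y)
ax.set_xlim(-1.1, 1.1)
ax.set_ylim(-1.1, 1.1)
ax.set_aspect("equal")
ax.set_title("x = cos(t), y = sin(t), pi/4 <= t <= 7*pi/4")
plt.show()
```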

3.2. Function-Based Prompt Injection Attack

In this subsection, we explain how a function-based prompt injection attack works. The attack comprises two main steps. In the first step, a prompt containing an illegal instruction is written, and a sensitive word is selected from it. The function expressions for the sensitive word are then generated, included in the prompt, and sent to the LLM, which is instructed to analyze the shape drawn by the functions. Figure 8 presents an example of the first prompt, which draws the word ‘Bomb’ using functions and then instructs the LLM to recognize the word.
Figure 8. Partial first prompt for a function-based prompt injection attack.
In the second step, a prompt is generated to perform an illegal instruction in which [MASK] replaces the sensitive word, as shown in Figure 9. It is difficult to predict the complete instruction from this prompt alone. Moreover, even if the previous prompt is examined, the sensitive words are obscured such that their exact meaning cannot be known. This makes it difficult for existing supervised fine-tuning or pattern-matching-based filtering to recognize the proposed method as an attack. If the attack succeeds with the second prompt, [MASK] can be reused continuously and easily to perform further attacks, i.e., multi-turn attacks [25]. This is another strength of the proposed method. The complete prompts and corresponding responses are presented in Appendix A.
Figure 9. Second prompt for a function-based prompt injection attack.

4. Performance Evaluation

To analyze and evaluate the performance of the proposed method, we compared it with existing direct prompt injection methods. The competing methods were text-based prompt injection without additional techniques, visual prompt injection using images containing illegal instructions, Base64-based prompt injection, Leetspeak-based prompt injection, and ArtPrompt, which uses ASCII art to express sensitive words. ArtPrompt can employ various types of ASCII art; we therefore chose “card-type” ASCII art, which exhibits the highest performance.

4.1. Experimental Environment

The experiment was conducted on the ChatGPT platform using the GPT-4-0613 and GPT-4o-2024-05-13 models [26]. During the experiment, the “Memory” option of each model (Settings -> Personalization -> Memory) was disabled to minimize the influence of historical data. The data used in the experiment were the “harmful behaviors custom dataset” used in the performance evaluation of ArtPrompt [18], of which 40% was randomly selected. In the ArtPrompt experiments, GPT-4 was used to judge whether an attack succeeded. In this experiment, however, all results were analyzed directly through human evaluation to measure performance accurately, because, although evaluation using GPT-4 can be automated and is fast, there were cases in which attack success was too ambiguous for GPT-4 to judge.
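A sampling step along the lines described above could be scripted as follows; this is a sketch under our own assumptions (pandas, a fixed random seed, and the CSV filename from the ArtPrompt repository), since the paper does not report the exact selection procedure.

```python
import pandas as pd

# Load the harmful behaviors custom dataset published with ArtPrompt and
# draw a 40% random sample, as described in Section 4.1. The seed is an
# assumption for reproducibility; the paper does not report one.
df = pd.read_csv("harmful_behaviors_custom.csv")
subset = df.sample(frac=0.4, random_state=0).reset_index(drop=True)
print(f"{len(subset)} of {len(df)} prompts selected")
```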
In this experiment, we distinguish the harmfulness of a response according to OpenAI’s prohibited-usage policies [4]. Therefore, an attack is evaluated as successful only when the harmfulness of the response matches the policy category corresponding to the attacker’s intention.
Two metrics, the helpful rate (HPR) and the attack success rate (ASR), were selected for performance evaluation [27,28]. The HPR and ASR represent the rate at which prompts are not refused and the rate at which attacks actually succeed, respectively, and are defined as follows:
$$\mathrm{HPR} = \frac{\#\ \text{of prompts that are not refused}}{\#\ \text{of prompts}}$$
$$\mathrm{ASR} = \frac{\#\ \text{of responses that are harmful}}{\#\ \text{of responses}}$$
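These two metrics, together with the failure categories used later in Section 4.4, could be tallied from hand-labelled results as in the following sketch; the label names and data layout are our own assumptions, not the authors' evaluation code.

```python
from collections import Counter

# Each attempt is hand-labelled with one of the four outcomes used in
# Section 4: "jailbreak" (harmful response matching the attacker's intent),
# "hallucination", "miscomprehension", or "refused".
labels = ["jailbreak", "refused", "miscomprehension", "jailbreak", "hallucination"]

counts = Counter(labels)
total = len(labels)

hpr = (total - counts["refused"]) / total  # prompts that were not refused
asr = counts["jailbreak"] / total          # responses judged harmful
print(f"HPR = {hpr:.2%}, ASR = {asr:.2%}")
```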
A successful attack was defined as obtaining accurate malicious information that matched the attacker’s intention. If the attack was not rejected by the LLM but the response did not serve the purpose of the attack, the LLM was queried again, up to three additional times; if the intended response was eventually obtained, the attempt was counted as a success. If an attack was not rejected but still failed to produce the desired response, the failure was classified as either hallucination or miscomprehension: hallucination when the LLM output incorrect information, and miscomprehension when the LLM output a response unrelated to the prompt [29].

4.2. Helpful Rate

Table 1 presents a comparison of HPR performance for each prompt injection method. As shown, the pure text prompt and visual prompt methods were mostly rejected by GPT-4 and GPT-4o. Therefore, GPT-4 and GPT-4o were confirmed to be safety-aligned to analyze and reject the prompt content itself, whether in text or image form. Both Base64-based and Leetspeak-based prompts had a lower HPR than pure text prompts on GPT-4, whereas Base64-based prompts showed a significantly higher HPR only on GPT-4o.
Table 1. Comparison of HPR performance using the prompt injection attack method according to the model used.
However, ArtPrompt exhibited an HPR of over 60% for GPT-4 and GPT-4o. Although both models have been significantly updated since ArtPrompt was proposed, it still exhibited a high HPR. Thus, the filtering of GPT-4 and GPT-4o appears to be effectively bypassed by masking sensitive words and expressing them as ASCII art. Nevertheless, the proposed method exhibited an even higher HPR than ArtPrompt. Our prompt injection attack masks sensitive words in a manner similar to ArtPrompt; however, it was more difficult for GPT-4 and GPT-4o to infer words represented as mathematical functions than words represented as ASCII art.

4.3. Attack Success Rate

Table 2 presents the measured ASR performance. Evidently, neither the pure text nor the visual prompt methods succeeded even once on the latest GPT-4 and GPT-4o models. The Base64-based and Leetspeak-based prompts also had very low ASR values. Compared to the initial GPT-4 model, the current GPT-4 and GPT-4o models exhibited considerably improved robustness against text and visual prompt injection attacks. ArtPrompt obtained a lower ASR than it did on the original GPT-4 model. The proposed method exhibited a slightly higher ASR than ArtPrompt; in particular, on GPT-4, our prompt injection attack achieved an ASR that was almost twice as high.
Table 2. Comparison of ASR performance using the prompt injection attack method according to the model used.

4.4. Failure Case Analysis

The proposed prompt injection attack yielded a very high HPR compared to its competitors, including ArtPrompt, and also showed the highest ASR. However, its ASR was lower than its HPR. To analyze the cause accurately, we classified the results of all prompt injection attacks used in the experiment in detail and compared their performance.
Table 3 presents the results of each attack, broken down into detailed cases: jailbreak (i.e., successful attack), hallucination, miscomprehension, and refused. The proposed prompt injection attack exhibited a considerably lower rejection rate than the other attacks; however, several cases of miscomprehension were observed. Thus, if the prompts used in the experiment were refined more precisely, more jailbreaks could be achieved. This means that the ASR of the proposed attack has considerable room for improvement and that the performance gap with ArtPrompt can be widened further.
Table 3. Comparison of result types using the prompt injection attack method according to model used.

5. Conclusions

This study proposed a new prompt injection attack that achieves a high attack success rate against securely aligned LLMs by representing sensitive words as mathematical functions. Although the functions represent words on a two-dimensional plane, the proposed attack targets text-based LLMs; it therefore exhibits great versatility, as it is capable of attacking most existing LLMs. Experimental results showed that the ASR of the proposed attack exceeded that of state-of-the-art prompt injection attacks. In addition, because the proposed attack used fixed prompts from the dataset, it sometimes failed, with the response often classified as miscomprehension. If the prompts were adjusted according to the responses, the success rate could be improved significantly compared to other attacks, although this adjustment is also available to competing methods.
The proposed attack uses prompts comprising only pure text; however, it requires an understanding of the vision domain to recognize the exact meaning of the instructions. This is a novel approach that differs from existing text- or multimodal-based prompt injection attacks. Although existing LLMs possess various countermeasures against prompt injection attacks, the experimental results indicate that approaches such as the proposed attack have not been sufficiently considered. The defenses of existing LLMs should be improved so that attacks such as the proposed one can be detected and rejected. Such improvements are expected to facilitate the use of LLMs in a safer environment.
To this end, research on defenses against multimodal prompt injection attacks should be conducted first. It is essential to integrate the information of a prompt distributed across various input domains into a single-format representation in order to determine its maliciousness. In this way, various prompt injection attacks can be detected and prevented, allowing users to enjoy legitimate services and organizations to provide services without disruption.

Author Contributions

H.K. and W.P. wrote the paper and conducted the research. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the 2023 Yeungnam University Research Grant.

Data Availability Statement

The data presented in this study are openly available in https://github.com/uw-nsl/ArtPrompt (accessed on 19 December 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

This appendix presents the complete prompts and corresponding responses used in the proposed prompt injection attack. The first and second prompts constitute a single attack, and the third prompt is an additional attack.
Figure A1. First prompt for the prompt injection attack.
Figure A2. Response for the first prompt for the prompt injection attack.
Figure A3. Second prompt for the prompt injection attack.
Figure A4. Response for the second prompt for the prompt injection attack.
Figure A5. Prompt for the additional prompt injection attack.
Figure A6. Response for the prompt for the additional prompt injection attack.

References

  1. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223. [Google Scholar] [CrossRef]
  2. Schulhoff, S.; Ilie, M.; Balepur, N.; Kahadze, K.; Liu, A.; Si, C.; Li, Y.; Gupta, A.; Han, H.; Schulhoff, S.; et al. The Prompt Report: A Systematic Survey of Prompting Techniques. arXiv 2024, arXiv:2406.06608. [Google Scholar] [CrossRef]
  3. Marchal, N.; Xu, R.; Elasmar, R.; Gabriel, I.; Goldberg, B.; Isaac, W. Generative AI Misuse: A Taxonomy of Tactics and Insights from Real-World Data. arXiv 2024, arXiv:2406.13843. [Google Scholar] [CrossRef]
  4. Usage Policies|OpenAI. Available online: https://openai.com/policies/usage-policies/ (accessed on 8 November 2024).
  5. Liu, Y.; Deng, G.; Li, Y.; Wang, K.; Wang, Z.; Wang, X.; Zhang, T.; Liu, Y.; Wang, H.; Zheng, Y.; et al. Prompt Injection Attack Against LLM-Integrated Applications. arXiv 2023, arXiv:2306.05499. [Google Scholar] [CrossRef]
  6. Qi, X.; Zeng, Y.; Xie, T.; Chen, P.-Y.; Jia, R.; Mittal, P.; Henderson, P. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! arXiv 2023, arXiv:2310.03693. [Google Scholar] [CrossRef]
  7. Wallace, E.; Xiao, K.; Leike, R.; Weng, L.; Heidecke, J.; Beutel, A. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv 2024, arXiv:2404.13208. [Google Scholar] [CrossRef]
  8. Perez, F.; Ribeiro, I. Ignore Previous Prompt: Attack Techniques for Language Models. arXiv 2022, arXiv:2211.09527. [Google Scholar] [CrossRef]
  9. Wu, J.; Gan, W.; Chen, Z.; Wan, S.; Yu, P.S. Multimodal Large Language Models: A Survey. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 2247–2256. [Google Scholar] [CrossRef]
  10. OWASP Top 10 for Large Language Model Applications. Available online: https://owasp.org/www-project-top-10-for-large-language-model-applications/assets/PDF/OWASP-Top-10-for-LLMs-2023-Slides-v09.pdf (accessed on 7 November 2024).
  11. Bagdasaryan, E.; Hsieh, T.Y.; Ben, N. Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs. arXiv 2023, arXiv:2307.10490. [Google Scholar] [CrossRef]
  12. Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J.Z.; Fredrikson, M. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv 2023, arXiv:2307.15043. [Google Scholar] [CrossRef]
  13. Josefsson, S. The Base16, Base32, and Base64 Data Encodings, RFC 4648, Internet Engineering Task Force (IETF). 2006. Available online: https://www.rfc-editor.org/rfc/rfc4648.txt (accessed on 7 November 2024).
  14. Leetspeak. Available online: https://en.wikipedia.org/wiki/Leet (accessed on 7 November 2024).
  15. Zhou, Y.; Lu, L.; Sun, H.; Zhou, P.; Sun, L. Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection. arXiv 2024, arXiv:2406.19845. [Google Scholar] [CrossRef]
  16. Phute, M.; Helbling, A.; Hull, M.; Peng, S.; Szyller, S.; Cornelius, C.; Chau, D.H. LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked. arXiv 2023, arXiv:2308.07308. [Google Scholar] [CrossRef]
  17. Bailey, L.; Ong, E.; Russell, S.; Emmons, S. Image Hijacks: Adversarial Images can Control Generative Models at Runtime. arXiv 2023, arXiv:2309.00236. [Google Scholar] [CrossRef]
  18. Dong, Z.; Zhou, Z.; Yang, C.; Shao, J.; Qiao, Y. Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey. arXiv 2024, arXiv:2402.09283. [Google Scholar] [CrossRef]
  19. Jiang, F.; Xu, Z.; Niu, L.; Lin, B.Y.; Poovendran, R. ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates. arXiv 2024, arXiv:2406.12935. [Google Scholar] [CrossRef]
  20. Jiang, F.; Xu, Z.; Niu, L.; Xiang, Z.; Ramasubramanian, B.; Li, B.; Poovendran, R. ArtPrompt: ASCII Art-Based Jailbreak Attacks Against Aligned LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 11–16 August 2024; pp. 15157–15173. [Google Scholar] [CrossRef]
  21. LLM Hacking: Prompt Injection Techniques. Available online: https://medium.com/@austin-stubbs/llm-security-types-of-prompt-injection-d7ad8d7d75a3 (accessed on 5 November 2024).
  22. The ELI5 Guide to Prompt Injection: Techniques, Prevention Methods & Tools. Available online: https://www.lakera.ai/blog/guide-to-prompt-injection (accessed on 8 November 2024).
  23. Greshake, K.; Abdelnabi, S.; Mishra, S.; Endres, C.; Holz, T.; Fritz, M. Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv 2023, arXiv:2302.12173. [Google Scholar] [CrossRef]
  24. Alphabet. Available online: https://www.desmos.com/calculator/l8u2vigxyb (accessed on 7 November 2024).
  25. Li, N.; Han, Z.; Steneker, I.; Primack, W.; Goodside, R.; Zhang, H.; Wang, Z.; Menghini, C.; Yue, S. LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet. arXiv 2024, arXiv:2408.15221. [Google Scholar] [CrossRef]
  26. Introducing ChatGPT. Available online: https://openai.com/blog/chatgpt (accessed on 7 November 2024).
  27. ArtPrompt: ASCII Art-Based Jailbreak Attacks Against Aligned LLMs. Available online: https://github.com/uw-nsl/ArtPrompt/blob/main/dataset/harmful_behaviors_custom.csv (accessed on 7 November 2024).
  28. Bethany, E.; Bethany, M.; Flores, J.A.N.; Jha, S.K.; Najafirad, P. Jailbreaking Large Language Models with Symbolic Mathematics. arXiv 2024, arXiv:2409.11445. [Google Scholar] [CrossRef]
  29. Daniel, S.; Pal, A. Impact of Non-Standard Unicode Characters on Security and Comprehension in Large Language Models. arXiv 2024, arXiv:2405.14490. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
