Article

LLM Security and Safety: Insights from Homotopy-Inspired Prompt Obfuscation

by Luis Eduardo Lazo Vera 1,*, Hamed Jelodar 2,* and Roozbeh Razavi-Far 2
1 Faculty of Computer Science, University of New Brunswick, Fredericton, NB E3B 9W4, Canada
2 Canadian Institute for Cybersecurity, Faculty of Computer Science, University of New Brunswick, Fredericton, NB E3B 9W4, Canada
* Authors to whom correspondence should be addressed.
Submission received: 11 November 2025 / Revised: 1 February 2026 / Accepted: 18 February 2026 / Published: 1 March 2026

Abstract

In this study, we propose a homotopy-inspired prompt obfuscation framework to enhance understanding of security and safety vulnerabilities in Large Language Models (LLMs). By systematically applying carefully engineered prompts, we demonstrate how latent model behaviors can be influenced in unexpected ways. Our experiments encompassed 15,732 prompts, including 10,000 high-priority cases, across LLaMA, DeepSeek, and KIMI for code generation, with Claude used for verification. The results reveal critical insights into current LLM safeguards, highlighting the need for more robust defense mechanisms, reliable detection strategies, and improved resilience. Importantly, this work provides a principled framework for analyzing and mitigating potential weaknesses, with the goal of advancing safe, responsible, and trustworthy AI technologies.

1. Introduction

Artificial Intelligence (AI), Natural Language Processing (NLP), and Large Language Models (LLMs) have a common denominator in language, forming the core mechanism through which these systems interpret and generate meaning [1]. Natural language integrates meaning (semantics), conditions on usage (pragmatics), and the physical properties of its inventory of sounds (phonetics), grammar (syntax), phonology (sound structure), and morphology (word structure) to provide a communication system [2,3]. Linguistics, the scientific study of language, provides the theoretical and analytical framework for understanding these elements. Linguists play a crucial role in decoding how language functions, offering insights that inform the design and refinement of AI systems [4].
The ability of humans to precisely communicate the meaning of the sensible world through language has provided us with a powerful tool for transferring knowledge and culture across generations [5].
Topology can explain how language organizes meaning in a linguistic space [6,7,8,9]. Topology deals with homeomorphisms (continuous deformations) between spaces and with the properties of a space that remain invariant under them [10]. Establishing a homeomorphism between abstract spaces can be very difficult; this difficulty motivated the development of homotopy theory [11], which provides a framework for establishing such relationships while simplifying their complexity. Systems that integrate LLMs can be used to elicit information in fields of interest such as code generation and cybersecurity [5]. LLMs operate under security controls defined by external organizations such as OpenAI, Microsoft, Anthropic, and others. These controls establish ethical, safety, and misuse-prevention constraints that protect the LLMs and specify the categories of information they are permitted to generate [12,13].

1.1. Related Work on Jailbreaking LLMs and Code Generation

The term “jailbreaking” originally referred to methods for bypassing restrictions imposed on hardware devices, such as smartphones or computers, to unlock additional functionalities. In the context of AI, jailbreaking refers to techniques that circumvent safety and ethical constraints that protect the LLMs [14]. LLMs are highly capable tools for generating code in multiple programming languages; however, they are designed to prevent the production of malicious logic or malware that could compromise information systems or violate ethical and legal guidelines [15]. Despite these safeguards, LLMs face the challenge of accurately identifying harmful prompts. Jailbreak prompts exploit weaknesses in the model’s content filtering mechanisms, enabling the generation of outputs that would otherwise be blocked. Effective jailbreak strategies typically employ techniques such as character description, guideline exemptions, in-character immersion, narrative framing, first- and second-person usage, prompt customization, and gradual instructions [15,16]. Prior work in prompt engineering demonstrates that these methods can bypass ethical and legal restrictions embedded in LLMs, allowing researchers to systematically study model vulnerabilities [17].
Jailbreak attack techniques, such as virtualization (Do Anything Now (DAN)), prompt injection, prompt masking (Mirror), emotional manipulation (“Save the Kittens”), custom fine-tuning, and alignment hacking (Move the Payload), are discussed in [16]. Moreover, a prompt dataset benchmark based on prompt obfuscation techniques is presented in [18], paving different paths for future research on LLM security. The “Virtual AI” and “Hybrid Strategies” jailbreak categories achieved the highest overall performance across malicious queries [15]; functional homotopy (FH) [19] is a closely related optimization-based approach.

1.2. Code Generation Methods Using LLM

Code generation, or program synthesis, refers to the automatic construction of software, or self-writing code. The philosophy behind this is that program synthesis generates an implementation of a program that satisfies a given correctness specification [20]. LLMs have been shown to excel in this area [21]. LLM-based code generation focuses on generating code from user-provided descriptions, code completion, and automatic program repair. Despite the power and promise of LLMs for code generation, they often show a propensity for hallucination [22]. Hallucinations occur when an LLM produces confident, fluent responses that sound correct but are in fact incorrect or fabricated; this happens when the LLM is optimized for coherence rather than truth [16]. This is a critical issue because of the volume of data these models are trained on: they can easily generate responses based on knowledge they do not actually have, filling the gaps with fabricated details. It is therefore often necessary to refine LLM-generated code.

1.2.1. Frameworks for Code Generation

The potential of LLMs to generate code from natural language input presents a promising avenue for software development, automation, debugging, and code generation. Transformer-based networks are used to generate code in several kinds of high-level programs, and the architecture of most of these frameworks is similar except for the fine-tuning stage. For instance, the AlphaCode architecture contains three components or modules. The first module is pre-training: the Transformer is fed a large code base in several languages, such as Java, C++, and JavaScript, and this information is separated into tokens which are then sent to the Encoder and Decoder components of the Transformer [16]. The second stage is fine-tuning, in which the model is refined on how best to present the solution. The final stage is sampling and evaluation, in which the LLM is presented with the problem, uses the Transformer to generate a large set of potential solutions, filters them, builds clusters, selects a set of candidates, and submits the result [23].
Another issue in code generation is measuring the quality of the code generated by the LLM [24]. CodeScore, an LLM-based code evaluation metric (CEM), estimates the functional correctness of generated code and analyzes its executability [24]. Another important aspect of code generation is semantic robustness. For example, the syntax of a mathematical formula can change drastically, yet its semantics must be preserved; any newly generated prompt should be semantically equivalent [14]. This is especially important for code generation that involves mathematical formulas, and it can be improved with a set of reductions that transform the formulas into a simplified form, applied as a pre-processing step. Bias is another aspect that needs to be considered in code generation [25].

1.2.2. Homotopy Theory as a Jailbreak Technique

Two spaces are homeomorphic when one can be continuously deformed into the other by bending, twisting, or stretching without tearing or gluing [26]. Figure 1 shows a topological object deformed into a coffee mug by stretching without tearing.
Topology concerns properties of objects that are invariant under continuous deformation. This approach provides a framework for studying properties of a space, such as orientation, continuity, proximity, compactness, and connectedness, without relying on metrics [10]. Natural language exhibits enough structural flexibility that we can instruct an LLM to perform a homotopy-style deformation as a heuristic method [7,9]. This characteristic suggests new tactics for jailbreak techniques, such as the FH method [19], leveraging the functional duality between model training and input generation [27].

1.2.3. Homotopy Deformation in LLMs

Homotopy can be defined in terms of lifting diagrams which are simple morphisms of finite topological spaces [28]. In LLMs, this effect can be expressed in how LLMs interpret sentences that have the same meaning but different syntax. Figure 2 represents the homotopy deformation of the word malware, keeping its meaning intact.
This kind of semantics-preserving deformation cannot be achieved with ordinary paraphrasing or prompt tweaking using an LLM, for the following reasons. Firstly, an LLM will refuse to paraphrase or transform prompts into a metaphorical version if the request has nefarious purposes, such as malware, as shown in Figure 3.
Secondly, even if it does, the new metaphorical version will not retain the malicious intent of the initial explicit prompt, due to the security and ethical guidelines that govern the output of the LLM, as shown in Figure 4.
Finally, deforming these prompts using human intelligence is impractical for building large datasets for investigation. Therefore, in this investigation, we instruct KIMI to perform a homotopy deformation as a heuristic method, treating the prompts as an exercise in semantic continuity through syntactic elasticity, as shown in Figure 5.
In Stage 2 of our framework, shown in Figure 6, there is no need for a formal topological model, notions of distance, or any optimization function such as the approach presented in [19]. The homotopy deformation is applied as a heuristic method. This kind of deformation has a high probability of preserving the meaning of the original prompts while obfuscating their malicious intent. The advantage of this tactic is that an LLM is prone to perform such a deformation as a linguistic exercise for education [16]. Teams with no technical background can apply this approach.

1.3. Research Motivation

The widespread adoption of LLMs across cybersecurity, industry, and daily life has fundamentally transformed the way humans interact with technology. LLMs have demonstrated remarkable proficiency in generating high-quality code across multiple programming languages, making them valuable tools for software development and automation. In this study, we explore techniques for eliciting malware from LLMs to construct a dataset of malicious code. Such a dataset can support cybersecurity research, threat analysis, and the training of malware detection models in a controlled and ethical manner.

1.4. Research Challenges

Generating malicious code or any content that could cause harm is inherently restricted by ethical and security policies implemented by LLM providers. In most cases, attempts to elicit harmful content are automatically blocked to comply with legal regulations and safety guidelines. Jailbreaking refers to methods used to circumvent these security mechanisms to extend or modify an LLM’s capabilities beyond manufacturer-imposed limitations. This approach led to the following research questions:
  • RQ1: Can homotopy theory be used as a heuristic framework to apply linguistic deformations for obfuscating malicious prompts in order to jailbreak LLMs?
  • RQ2: How effective is this approach for generating malware using LLMs?
  • RQ3: What are the implications of homotopy-inspired jailbreak techniques for improving LLM security, safety alignment, and the design of robust defensive measures?

1.5. Research Contributions

In this work, we designed a prompt engineering technique grounded in topological theory, specifically homotopy theory. Topology has broad applications in science [26]. Notably, topological deformations preserve the essential properties of objects under continuous transformation, making it a suitable framework for controlled linguistic transformations. Our jailbreak methodology leverages linguistic obfuscation to hide the malicious intent of prompts, enabling LLMs to generate outputs that would normally be blocked by security filters. The main contributions of this paper are summarized as follows:
1.
We propose a novel framework leveraging the topological structure of language, employing homotopy-inspired deformations as a heuristic to obfuscate malicious prompts. This approach enables controlled jailbreak of LLMs to generate malware code for cybersecurity research.
2.
We release a comprehensive malware dataset comprising 7374 specimens, validated for C++ and Python environments, designed for benchmarking and evaluation purposes. The repository link is provided below and will become publicly accessible on 23 December 2025: https://github.com/Eduardolasso/Cybersecurity.
3.
We introduce a robust and reproducible methodology for LLM jailbreak and malware elicitation, ensuring methodological rigor while adhering to ethical and regulatory safeguards.
4.
We delineate future research directions and practical applications of the generated dataset, alongside a critical evaluation of the efficacy, limitations, and security implications of the proposed homotopy-inspired jailbreak technique.

2. Research Methodology

This section describes the materials, experimental design, and methods used to evaluate the susceptibility of LLMs to heuristic jailbreak techniques and to produce a verified dataset of code samples for cybersecurity research. Our methodology, summarized in Figure 6, defines a five-step pipeline for eliciting, transforming, generating, verifying, and reporting code samples produced by LLMs. This framework is intended as a reproducible approach applicable to multiple LLM architectures. Each stage (data/source preparation, prompt transformation, code generation, verification, and reporting) is designed to balance experimental rigor with safety and ethical compliance.

2.1. LLM Configuration

Each model was evaluated under standardized inference settings chosen to balance response diversity and reproducibility. Configuration parameters, including sampling temperature and maximum response length, were standardized across runs, with model-specific adjustments to accommodate platform constraints. All model interactions were logged and versioned to ensure reproducibility and auditing. Table 1 summarizes the high-level configuration policies. All experiments were conducted under sandboxed environments with safeguards to prevent the execution of harmful artifacts.
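To make the configuration policy concrete, the following minimal sketch shows how a standardized generation configuration and an auditable interaction log might be represented. The parameter values and field names are illustrative assumptions, not the exact (intentionally non-actionable) settings summarized in Table 1.

```python
import json
import hashlib
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class GenerationConfig:
    # Illustrative, standardized sampling policy applied to every model run.
    model: str
    temperature: float = 0.2        # assumed value; fixed across runs
    max_tokens: int = 2048          # assumed response length limit
    seed: int | None = 42           # fixed where the platform supports it

def log_interaction(config: GenerationConfig, prompt_id: str, response: str, path: str) -> None:
    """Append a versioned, auditable record of one model interaction."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_id": prompt_id,
        "config": asdict(config),
        # Store a hash rather than raw output when the artifact is sensitive.
        "response_sha256": hashlib.sha256(response.encode("utf-8")).hexdigest(),
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```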

2.2. Step 1: Data and Source Prompts

For data collection and sample generation, we employed malware-based prompts to generate samples across different sensitive malware-related topics. To mitigate LLM safety restrictions, the prompts explicitly framed the generation task as being for educational and research purposes, allowing the model to produce malware-related samples in a controlled context. All prompts were programmatically generated, carefully curated, and manually reviewed to ensure compliance with ethical guidelines and safety standards. Figure 7 illustrates the prompt format used, which allowed precise, repeatable interaction with each LLM while preserving the ability to evaluate subsequent deformation strategies.
In this step, the objective is to generate benign prompts related to malware simulation for educational purposes. A list of 94 unique malicious terms related to malware was provided to support prompt construction as a glossary:
“trojan virus”, “boot sector virus”, “ransomware”, “polymorphic virus”, “worm”, “rootkit”, “keylogger”, “spyware”, “adware”, “botnet”, “backdoor”, “logic bomb”, “file-infector virus”, “macro virus”, “browser hijacker”, “dropper”, “loader”, “exploit kit”, “stealth virus”, “resident virus”, “non-resident virus”, “multipartite virus”, “cavity virus”, “armored virus”, “fileless malware”, “cryptojacker”, “scareware”, “dialer”, “remote-access trojan (RAT)”, “banking trojan”, “bot malware”, “click-fraud malware”, “clipper malware”, “command-and-control (C2)”, “crimeware”, “cryptovirology”, “data exfiltration malware”, “domain generation algorithm (DGA)”, “downloader”, “drive-by download”, “encrypter”, “fake antivirus (fake AV)”, “filecoder”, “hacktool”, “info-stealer”, “installer”, “key recovery agent trojan”, “locker”, “malvertising”, “memory-only malware”, “metamorphic virus”, “mobile malware”, “obfuscator”, “packer”, “payload”, “pivot malware”, “point-of-sale (POS) malware”, “pornware”, “potentially unwanted program (PUP)”, “ransomware-as-a-service (RaaS)”, “remote code execution (RCE) trojan”, “rogue security software”, “rootkit dropper”, “shellcode”, “smishing malware”, “sms trojan”, “social engineering malware”, “spear-phishing payload”, “spy trojan”, “stack-based buffer overflow exploit”, “stealer”, “supply-chain malware”, “targeted malware”, “time bomb”, “trojan-downloader”, “trojan-dropper”, “trojan-spy”, “trojan-banker”, “trojan-sms”, “trojan-clicker”, “trojan-dialer”, “trojan-notifier”, “trojan-proxy”, “trojan-PSW (password stealer)”, “trojan-rootkit”, “trojan-spambot”, “trojan-IM (instant messenger)”, “usb-borne malware”, “virtual machine-aware malware”, “voice phishing (vishing) malware”, “web-inject malware”, “wiper malware”, “zero-day exploit payload”, “zombie malware”.
These keywords define the semantic space used to construct prompts. We instructed LLaMA, DeepSeek, and KIMI models to generate simulated code artifacts and mock implementations suitable for execution within a controlled sandbox environment. Figure 7 illustrates examples of the benign prompts and their format. This facilitates precise, repeatable interactions with the LLM and improves prompt engineering effectiveness [16]; the highlighted terms indicate the benign qualifiers (e.g., “simulation”, “mock”, “sandbox”) that frame the requests as non-operational prompts, as shown in Figure 8.
Using this scaffold, LLaMA, DeepSeek, and KIMI collectively produced a dataset of 15,732 benign prompts. These prompts intentionally request simulated or educational artifacts (e.g., mock implementations or in-memory demonstrations) and were therefore classified as benign during initial curation. A subset of these prompts (10,000 after quality filtering) formed the basis for subsequent linguistic transformation experiments: the benign qualifiers were algorithmically altered to express explicit, real-world intent, producing variants whose surface form preserved grammatical structure while modifying the underlying request semantics. All transformations and downstream processing were performed under strict ethical controls and reviewed by experts prior to code generation and verification. A minimal sketch of the prompt-scaffolding step is shown below.
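The sketch below illustrates how benign, simulation-framed prompts can be scaffolded from the glossary terms in Step 1. The template wording and function names are our own illustrative assumptions rather than the exact prompts used in the study (cf. Figure 7 and Figure 8).

```python
# Hypothetical sketch of the benign prompt scaffolding used in Step 1.
# The glossary below is a short excerpt of the 94 terms listed above.
GLOSSARY = ["trojan virus", "ransomware", "keylogger", "worm"]

# Benign qualifiers that frame each request as a non-operational,
# educational simulation (cf. the highlighted terms in Figure 8).
TEMPLATE = (
    "For educational and research purposes only, write a mock, "
    "non-functional {language} simulation that illustrates, at a high level, "
    "how a {term} is typically structured, suitable for a sandboxed classroom demo."
)

def build_benign_prompts(terms, languages=("Python", "C++")):
    """Expand every (term, language) pair into a simulation-framed prompt."""
    return [
        {"term": t, "language": lang, "prompt": TEMPLATE.format(term=t, language=lang)}
        for t in terms
        for lang in languages
    ]

if __name__ == "__main__":
    for p in build_benign_prompts(GLOSSARY)[:3]:
        print(p["prompt"])
```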

2.3. Step 2: Jailbreak Prompts

Prompts were systematically transformed using linguistically motivated, automated deformations to evaluate model robustness against obfuscated malicious intent. This stage applies semantics-preserving transformations inspired by topological concepts, with homotopy used as a metaphor for gradual and continuous linguistic change. All transformations were generated programmatically and screened for semantic consistency using a combination of model-assisted checks and targeted human review, with high-risk or semantically distorted prompts excluded. Figure 9 illustrates the conceptual approach.

2.3.1. Homotopy-Inspired Prompt

A homotopy-inspired prompt is a structured prompting strategy that gradually transforms a safe base prompt into a target prompt through semantically continuous intermediate steps, enabling controlled reasoning and improved stability in LLM outputs. Formally, let $P_0$ denote a safe base prompt and $P_K$ the target prompt. A homotopy function $H : [0,1] \to \mathcal{P}$ defines a continuous transformation between prompts such that
$$P(t) = H(t), \qquad P(0) = P_0, \qquad P(1) = P_K,$$
where $\mathcal{P}$ represents the space of valid prompts. In practice, this transformation is implemented as a discrete sequence of $K$ intermediate prompts $\{P_k\}_{k=0}^{K}$, satisfying the semantic continuity constraint
$$d_s(P_k, P_{k-1}) \le \epsilon, \qquad k = 1, \ldots, K,$$
where $d_s(\cdot,\cdot)$ denotes a semantic distance metric (e.g., embedding-based cosine distance) and $\epsilon$ is a small threshold.
To ensure controlled and stable model behavior, the output distributions induced by consecutive prompts are constrained as
$$D\left(p_\theta(\cdot \mid P_k),\, p_\theta(\cdot \mid P_{k-1})\right) \le \delta,$$
where $p_\theta(y \mid P)$ is the LLM output distribution conditioned on prompt $P$, $D(\cdot,\cdot)$ is a divergence measure (e.g., KL or Jensen–Shannon divergence), and $\delta$ controls output stability across transitions. This formulation enables smooth prompt evolution while maintaining semantic coherence and predictable LLM responses.
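A minimal sketch of how the semantic continuity constraint $d_s(P_k, P_{k-1}) \le \epsilon$ could be checked over a discrete prompt sequence is shown below. The toy hashing embedding and the threshold value are illustrative assumptions; in practice any sentence-embedding model could be substituted.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing bag-of-words embedding used only so the sketch runs;
    a real sentence-embedding model would be substituted here."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return 1.0 if denom == 0 else 1.0 - float(np.dot(u, v) / denom)

def is_semantically_continuous(prompts: list[str], eps: float = 0.15) -> bool:
    """Check d_s(P_k, P_{k-1}) <= eps for every consecutive pair in the
    discrete sequence P_0, ..., P_K (the semantic continuity constraint)."""
    vecs = [embed(p) for p in prompts]
    return all(
        cosine_distance(vecs[k], vecs[k - 1]) <= eps
        for k in range(1, len(vecs))
    )
```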

2.3.2. Homeomorphic Prompt Deformation

The process of applying homeomorphic (homotopy-inspired) linguistic transformations is difficult to perform reliably by hand because it requires nuanced, context-aware rewriting that preserves semantic intent while altering surface form. Simple programmatic edits—such as token shifting or concatenating strings in Python—are insufficient for this task. Consequently, we used an LLM to perform the deformations; the KIMI model was selected for this role due to its larger context window and superior empirical performance in our preliminary evaluations [29]. From an initial pool of 15,732 prompts, a curated subset of 10,000 prompts was retained for downstream analysis. Prompts that became corrupted during automatic deformation or that failed to conform to the required output schema were excluded during quality control. Figure 7 illustrates the prompt filtering and formatting criteria used during curation.
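The curation step that reduced the pool from 15,732 to 10,000 prompts can be approximated by simple schema checks of the kind sketched below. The field names and length bounds are illustrative assumptions rather than the exact quality-control rules used during our curation.

```python
import json

REQUIRED_FIELDS = {"prompt_id", "text", "language"}  # assumed output schema

def passes_quality_control(raw_record: str) -> bool:
    """Reject deformed prompts that are corrupted or that fail to conform
    to the required output schema (cf. Figure 7)."""
    try:
        record = json.loads(raw_record)
    except json.JSONDecodeError:
        return False  # corrupted during automatic deformation
    if not REQUIRED_FIELDS.issubset(record):
        return False  # missing mandatory fields
    text = record["text"].strip()
    # Illustrative sanity bounds on prompt length.
    return 20 <= len(text) <= 2000
```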

2.4. Step 3: LLM Code Generation

The generation of code for each prompt is a computationally intensive task that requires precise formatting and careful orchestration. Step 3 of Figure 6 illustrates the overall code generation workflow. Each prompt is designed to produce a fully functional program ready for compilation. The LLaMA and DeepSeek models were executed on Google Colab, interfacing with an Ollama server. Due to resource constraints and prior instability issues, a total of 1000 prompts were processed between these two models (500 each) to ensure reliable execution and prevent data loss. For KIMI, we leveraged its API via a Python program using the OpenAI client framework. This API supports concurrent requests (up to 200), 1,500,000 tokens per minute, 5000 requests per minute, and unlimited daily tokens. These superior performance characteristics justified prioritizing KIMI for large-scale code generation over LLaMA and DeepSeek. Each LLM was provided with carefully structured prompts, including specific keywords, target programming languages (C++, C++20, Python 3.10), and directives to exclude comments. Omitting comments was critical, as annotations describing the code logic could trigger security filters and block output. The generated code from KIMI, DeepSeek, and LLaMA strictly adhered to these instructions, producing executable programs without annotations. A minimal, sanitized client sketch is shown below.
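The following sanitized sketch illustrates how KIMI can be accessed through an OpenAI-compatible client, as described above. The base URL, environment variable name, and placeholder prompt are illustrative assumptions (the actual prompts are withheld for safety); only standard `openai` client calls are used, and the model name is the endpoint listed in Table 3.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

# Assumed OpenAI-compatible endpoint for the Moonshot/KIMI API.
client = OpenAI(
    api_key=os.environ["MOONSHOT_API_KEY"],
    base_url="https://api.moonshot.ai/v1",
)

def generate_code(prompt: str) -> str:
    """Submit one sanitized, structured prompt and return the generated code."""
    response = client.chat.completions.create(
        model="kimi-k2-0711-preview",        # endpoint listed in Table 3
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,                     # standardized sampling policy
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Placeholder prompt only; the study's prompts are not disclosed.
    prompts = ["Write a Python 3.10 'hello world' program without comments."]
    with ThreadPoolExecutor(max_workers=8) as pool:  # well below the 200-request limit
        for code in pool.map(generate_code, prompts):
            print(code)
```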

2.5. Step 4: Verification

LLMs have previously been shown to be effective for identifying potentially malicious code, supporting automated malware analysis, and detecting novel malware variants [30]. While such verification can be performed manually by domain experts, this approach is impractical at scale. Therefore, verification in this study is conducted using an LLM-based verifier to assess whether generated code exhibits malicious behavior.
Initial verification was performed using KIMI; however, to mitigate verification bias and reduce circularity risk, a second and more conservative verification stage was introduced using Claude. Figure 6 illustrates the overall verification workflow.

2.5.1. LLM-Based Verification Procedure and Dataset Integrity

Claude analyzes each generated code sample and provides a binary decision (Yes/No) indicating whether the code constitutes malware, along with a brief one-line description of the detected malicious behavior. In cases where the verifier cannot conclusively determine maliciousness, samples are conservatively classified as non-malicious to maintain consistency and reduce false positives in downstream analysis.
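A sanitized sketch of the Claude-based verification call described above is given below. The verification instruction text and parsing logic are illustrative assumptions, while the client calls follow the standard `anthropic` Python SDK and the model name from Table 3; the conservative non-malicious default mirrors the procedure described in this subsection.

```python
import os
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

VERIFIER_INSTRUCTION = (
    "You are a static malware analyst. Answer on the first line with 'Yes' or 'No': "
    "does the following source code constitute malware? On the second line, give a "
    "one-line description of any malicious behavior."
)

def verify_sample(code: str) -> tuple[bool, str]:
    """Return (is_malware, description); inconclusive answers default to non-malicious."""
    message = client.messages.create(
        model="claude-sonnet-4-20250514",    # endpoint listed in Table 3
        max_tokens=200,
        messages=[{"role": "user", "content": f"{VERIFIER_INSTRUCTION}\n\n{code}"}],
    )
    lines = message.content[0].text.strip().splitlines()
    verdict = lines[0].strip().lower() if lines else ""
    description = lines[1].strip() if len(lines) > 1 else ""
    # Conservative default: anything other than an explicit 'Yes' is non-malicious.
    return verdict.startswith("yes"), description
```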
During early experimentation with KIMI, 275 specimens were irretrievably lost due to a macro virus triggered when opening files in Microsoft Excel. Additionally, despite explicit instructions to generate only C++ and Python code, some shell scripts were inadvertently executed as a result of prompt deformation. To preserve dataset integrity and ensure consistent verification, the total number of evaluated prompts was reduced from 10,000 to 9725. This reduction explains why Claude verifies only 9725 prompts in the final dataset.

2.5.2. Verification Criteria and Artifact Categorization

Maliciousness is assessed at the source-code level without behavioral execution using conservative static analysis. Generated artifacts are categorized into (i) functional malware candidates, (ii) partial or malformed payloads with explicit malicious intent, and (iii) benign-but-suspicious artifacts. Only categories (i) and (ii) are retained as LLM-verified malicious code candidates, while ambiguous cases are excluded. These labels represent intent-based static verification rather than behavioral ground truth and are intended for comparative LLM safety evaluation.
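The three artifact categories used during static verification can be represented as a simple enumeration with a retention rule, as in the sketch below. The category names mirror the list above, while the helper itself is an illustrative assumption rather than part of the released tooling.

```python
from enum import Enum

class ArtifactCategory(Enum):
    FUNCTIONAL_MALWARE = "functional malware candidate"          # category (i)
    PARTIAL_MALICIOUS = "partial/malformed payload with intent"  # category (ii)
    BENIGN_SUSPICIOUS = "benign-but-suspicious artifact"         # category (iii)

def retain_for_dataset(category: ArtifactCategory) -> bool:
    """Only categories (i) and (ii) are kept as LLM-verified malicious candidates."""
    return category in (ArtifactCategory.FUNCTIONAL_MALWARE,
                        ArtifactCategory.PARTIAL_MALICIOUS)
```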

2.6. Step 5: Reporting

All verified code samples were compiled into a single CSV file containing metadata for each specimen, along with entries for unclassified outputs retained for future investigation. The final dataset, verified using Claude, includes 9725 code samples, of which 7374 were confirmed as malware, each accompanied by a corresponding malware description.
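A minimal sketch of the reporting step is shown below; the column names are illustrative assumptions about the metadata stored for each specimen, not the released schema.

```python
import csv

# Assumed per-specimen metadata columns for the consolidated report.
FIELDS = ["specimen_id", "source_model", "language", "is_malware", "malware_description"]

def write_report(records: list[dict], path: str = "malware_dataset.csv") -> None:
    """Consolidate all verified specimens into a single CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)

write_report([{
    "specimen_id": "S-0001",
    "source_model": "KIMI",
    "language": "Python",
    "is_malware": True,
    "malware_description": "sanitized one-line description",
}])
```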

3. Settings and Setups

Our approach uses heterogeneous execution environments and technologies to access the LLMs. KIMI and Claude were accessed exclusively through their official APIs using API keys. LLaMA and DeepSeek were deployed locally through the ollama-0.3.6 server on a Google Colab virtual machine. Table 2 and Table 3 provide a structured overview of the two distinct operational environments used in our experiments, highlighting the differences in access methods, dependencies, and execution contexts.

4. Results

A dataset comprising 7374 malware specimens, each accompanied by a detailed behavioral description, was utilized to evaluate the effectiveness of various LLMs in generating verified malware through a jailbreak-based heuristic. The primary objective was to assess each model’s susceptibility to adversarial prompt manipulation and to quantify both the frequency and reliability of successful malware generation. The quantitative results obtained using Claude as the judge are presented in Table 4, while Table 5 reports the results using KIMI as the judge. Figure 10, Figure 11 and Figure 12 illustrate the comparative and statistical performance of the models verified by Claude.

Evaluation Metrics

Let TP denote the number of jailbreak attempts that resulted in verifiable malware, and FP denote the number of attempts that produced non-malicious outputs. The evaluation metrics are defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Error Rate} = \frac{FP}{TP + FP} = 1 - \text{Precision}$$
In our experimental setup, false negatives (FN) and true negatives (TN) are not explicitly observable because only jailbreak attempts and their verification outcomes are evaluated. Under this formulation, precision directly reflects the reliability of jailbreak success, while the error rate captures the proportion of false positives among generated outputs.
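For clarity, the precision and error rate reported in Table 4 and Table 5 can be reproduced directly from the TP/FP counts, as in the short sketch below (the counts shown are the Claude-verified values from Table 4).

```python
def precision_and_error(tp: int, fp: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); Error Rate = FP / (TP + FP) = 1 - Precision."""
    precision = tp / (tp + fp)
    return precision, 1.0 - precision

# Claude-verified counts from Table 4.
for name, tp, fp in [("LLaMA", 320, 180), ("DeepSeek", 411, 89), ("KIMI", 6643, 2082)]:
    p, e = precision_and_error(tp, fp)
    print(f"{name}: precision={p:.3f}, error rate={e:.1%}")
```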
Table 4 reports both raw outcome counts and derived evaluation metrics for each model using Claude to verify. Precision represents the proportion of jailbreak attempts that resulted in verifiable malware, while the error rate reflects the proportion of false positives among generated outputs. The results reveal clear model-dependent differences in vulnerability to adversarial prompt engineering. The LLaMA model exhibits the lowest precision (0.64) and the highest error rate (36%), indicating weaker consistency in producing verifiable malware. In contrast, DeepSeek achieves the highest precision (0.822), along with the lowest error rate (17.8%), suggesting reduced robustness of its alignment mechanisms under adversarial prompting.
Table 5 reports both raw outcome counts and derived evaluation metrics for each model using KIMI to verify. The relative error between the two judges (KIMI vs. Claude) in verifying and classifying malware is 1.3%, reflecting minimal divergence between the two models.
Aggregated across all models, the overall precision of 0.758 confirms that the proposed jailbreak heuristic is broadly effective but exhibits varying reliability across different LLM architectures. Figure 10 and Figure 11 further illustrate comparative performance trends and variability, while Figure 12 provides an intuitive visualization of successful versus unsuccessful jailbreak attempts.
Overall, these findings demonstrate that contemporary LLMs exhibit significant and model-specific vulnerabilities to adversarial prompt manipulation. The strong precision of DeepSeek and the large-scale effectiveness of KIMI underscore the need for improved safeguard architectures, adversarial training strategies, and evaluation-driven defenses to mitigate the potential misuse of generative AI systems.

5. Future Work and Limitations

In this study, we examined the capability of LLMs to generate and describe malware through heuristic-based jailbreak techniques. The findings demonstrate that despite the integration of safety mechanisms, these models remain vulnerable to adversarial manipulations that can be exploited to produce harmful outputs. This highlights the dual-use nature of generative AI systems and underscores the necessity for stronger alignment and defense strategies. We emphasize that malware classification in this study relies exclusively on LLM-labeled malware-like source code and not on behaviorally validated malware, as no real-world or sandbox execution was performed.
Despite the promising results, this study has several limitations. First, the evaluation was restricted to heuristic-driven jailbreaks, which may not encompass the full spectrum of adversarial strategies that could target LLMs. Moreover, verification of the generated malware relied on static, intent-based analysis under controlled experimental conditions that might not accurately represent real-world execution environments. Another limitation lies in the focus on text-based malware generation, excluding multimodal or system-level interactions that could provide a more comprehensive understanding of exploit pathways. Consequently, the reported success rates should be interpreted as indicative rather than exhaustive, reflecting a lower bound on potential vulnerabilities.
One limitation of this study is that the number of prompts differs across models (e.g., KIMI versus LLaMA and DeepSeek), largely due to the access and usage constraints of the online GPU frameworks used for our experiments. As a result, quantitative comparisons between models should be interpreted with caution, since observed performance differences may partly reflect unequal prompt exposure rather than true differences in model capability. On the dataset, due to the dual-use nature of malware-related artifacts, access to the full dataset is intentionally restricted. While we have publicly released the generated source-level code samples to support malware analysis and defensive research, the malicious and obfuscated prompts used to elicit these samples are not publicly disclosed.
Future research should extend this work by exploring diverse adversarial techniques to better characterize and mitigate LLM vulnerabilities. Investigations into automated validation frameworks and robust quality control mechanisms are essential to ensure reliability in evaluating adversarial outputs. Additionally, further studies should analyze the balance between obfuscation strength and functional fidelity to understand how prompt deformation affects detectability and behavior. Expanding this research across different model architectures, modalities, and operational environments will support the development of advanced defensive systems. Ultimately, future efforts must ensure that adversarial experimentation remains an ethical tool for enhancing AI security rather than a vector for misuse.

Mitigation Strategies and Defensive Implications

Beyond identifying vulnerabilities, our findings highlight several mitigation strategies for strengthening the safety of large language models against malware-related misuse. One effective approach is the integration of adversarial prompt stress-testing during both training and deployment, enabling models to better recognize and resist heuristic-based jailbreak attempts. In addition, multi-stage safety pipelines that combine static code analysis, semantic intent detection, and post-generation filtering can help prevent malicious code from bypassing existing safeguards.
At the model level, improved alignment techniques, including reinforcement learning with adversarially generated examples and continuous red-teaming, can reduce susceptibility to prompt deformation and obfuscation strategies. Furthermore, the adoption of automated behavioral validation frameworks, such as sandbox-based execution and anomaly detection, would allow for more reliable differentiation between superficially malware-like code and functionally harmful artifacts.

6. Conclusions

This study demonstrates that homotopy-inspired linguistic deformation can effectively bypass LLM safeguards, achieving an overall jailbreak success rate of 76% across the evaluated models, thereby directly addressing RQ1 and RQ2; the defensive implications raised by RQ3 motivate the mitigation strategies outlined above. These results reveal critical vulnerabilities in current LLM architectures, emphasizing the need for more robust safety mechanisms. Importantly, this research is conducted with the explicit goal of informing defensive strategies rather than facilitating malicious activity. By illustrating how carefully engineered prompts can manipulate model behavior, this work highlights the urgency of developing advanced detection methods, resilient security frameworks, and comprehensive mitigation strategies against adversarial attacks. The primary contribution of this study lies in providing actionable insights to enhance the safety, reliability, and trustworthiness of AI technologies. Future efforts should extend these findings to diverse model architectures and modalities, ensuring that adversarial research continues to strengthen cybersecurity rather than compromise it.

Author Contributions

Conceptualization, L.E.L.V., H.J. and R.R.-F.; methodology, L.E.L.V. and H.J.; software, L.E.L.V.; validation, L.E.L.V., H.J. and R.R.-F.; formal analysis, L.E.L.V. and H.J.; investigation, L.E.L.V.; resources, H.J. and R.R.-F.; data curation, L.E.L.V.; writing—original draft preparation, L.E.L.V. and H.J.; writing—review and editing, H.J. and R.R.-F.; visualization, L.E.L.V.; supervision, H.J. and R.R.-F.; project administration, H.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are available within the article. Additional materials are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Guo, J.; Fan, Y.; Pang, L.; Yang, L.; Ai, Q.; Zamani, H.; Wu, C.; Croft, W.B.; Cheng, X. A Deep Look into neural ranking models for information retrieval. Inf. Process. Manag. 2020, 57, 102067. [Google Scholar] [CrossRef]
  2. Nelson, E.S. Language, Nature, and the Self: The Feeling of Life in Kant and Dilthey. In The Linguistic Dimension of Kant’s Thought: Historical and Critical Essays; Schalow, F., Velkley, R.L., Eds.; Northwestern University Press: Evanston, IL, USA, 2014; pp. 263–287. [Google Scholar]
  3. Chowdhary, K.R. Natural language processing. In Fundamentals of Artificial Intelligence; Springer: New Delhi, India, 2020; pp. 603–649. [Google Scholar]
  4. McShane, M.; Nirenburg, S. Linguistics for the Age of AI; MIT Press: Cambridge, MA, USA, 2021. [Google Scholar]
  5. Kamath, U.; Keenan, K.; Somers, G.; Sorenson, S. Large Language Models: A Deep Dive; Springer: Cham, Switzerland, 2024. [Google Scholar]
  6. López García, A. Introduction to Topological Linguistics; Annexa: Washington, DC, USA, 1990. [Google Scholar]
  7. López-García, A. Topological linguistics and the study of linguistic variation. In Current Issues in Mathematical Linguistics; Martín-Vide, C., Ed.; North-Holland Linguistic Series: Linguistic Variations; Elsevier: Amsterdam, The Netherlands, 1994; Volume 56, pp. 69–77. [Google Scholar]
  8. Guénard, F.; Lelièvre, G.; Bidón-Chanal, C. Thinking Mathematics: Seminar on Philosophy and Mathematics at the École Normale Supérieure in Paris; Tusquets: Barcelona, Spain, 1999. [Google Scholar]
  9. Van Han, N.; Vinh, P.C. Towards Linguistic Fuzzy Topological Spaces Based on Hedge Algebra. EAI Endorsed Trans. Context Aware Syst. Appl. 2022, 8, e12. [Google Scholar] [CrossRef]
  10. Seifert, H.; Threlfall, W. Lessons in Topology; Modern Mathematics Text Collection; Jorge Juan Institute of Mathematics: Madrid, Spain, 1951. [Google Scholar]
  11. Milnor, J.W.; Wallace, A. Differential Topology; American Mathematical Society: Providence, RI, USA, 2007. [Google Scholar]
  12. Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. Gpt-4o system card. arXiv 2024, arXiv:2410.21276. [Google Scholar] [CrossRef]
  13. Sun, H.; Zhang, Z.; Deng, J.; Cheng, J.; Huang, M. Safety Assessment of Chinese Large Language Models. arXiv 2023, arXiv:2304.10436. [Google Scholar] [CrossRef]
  14. Shen, X.; Chen, Z.; Backes, M.; Shen, Y.; Zhang, Y. “do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, Salt Lake City, UT, USA, 14–18 October 2024; pp. 1671–1685. [Google Scholar]
  15. Yu, Z.; Liu, X.; Liang, S.; Cameron, Z.; Xiao, C.; Zhang, N. Don’t listen to me: Understanding and exploring jailbreak prompts of large language models. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA, 14–16 August 2024; pp. 4675–4692. [Google Scholar]
  16. Bandi, D. Jailbreak ChatGPT: Prompt Engineering Masterclass: Unlock ChatGPT Superpowers. Available online: https://www.amazon.ca/Jailbreak-ChatGPT-Engineering-Masterclass-Superpowers/dp/B0D12XNF3G (accessed on 1 February 2026).
  17. Liu, Y.; Deng, G.; Xu, Z.; Li, Y.; Zheng, Y.; Zhang, Y.; Zhao, L.; Zhang, T.; Wang, K. A hitchhiker’s guide to jailbreaking chatgpt via prompt engineering. In Proceedings of the 4th International Workshop on Software Engineering and AI for Data Quality in Cyber-Physical Systems/Internet of Things, Porto de Galinhas, Brazil, 15 July 2024; pp. 12–21. [Google Scholar]
  18. Wahréus, J.; Hussain, A.M.; Papadimitratos, P. CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models. arXiv 2025, arXiv:2501.01335. [Google Scholar]
  19. Wang, Z.; Anshumaan, D.; Hooda, A.; Chen, Y.; Jha, S. Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks. arXiv 2024, arXiv:2410.04234. [Google Scholar] [CrossRef]
  20. Kroening, D.; David, C. Program synthesis: Challenges and opportunities. Philos. Trans. A Math. Phys. Eng. Sci. 2017, 375, 20150403. [Google Scholar]
  21. Wang, J.; Chen, Y. A review on code generation with llms: Application and evaluation. In Proceedings of the 2023 IEEE International Conference on Medical Artificial Intelligence (MedAI); IEEE: New York, NY, USA, 2023; pp. 284–289. [Google Scholar]
  22. Zhang, Y.; Li, Y.; Cui, L.; Cai, D.; Liu, L.; Fu, T.; Huang, X.; Zhao, E.; Zhang, Y.; Chen, Y.; et al. Siren’s song in the AI ocean: A survey on hallucination in large language models. arXiv 2023, arXiv:2309.01219. [Google Scholar] [CrossRef]
  23. Li, Y.; Choi, D.; Chung, J.; Kushman, N.; Schrittwieser, J.; Leblond, R.; Eccles, T.; Keeling, J.; Gimeno, F.; Dal Lago, A.; et al. Competition-level code generation with alphacode. Science 2022, 378, 1092–1097. [Google Scholar] [CrossRef] [PubMed]
  24. Dong, Y.; Ding, J.; Jiang, X.; Li, G.; Li, Z.; Jin, Z. Codescore: Evaluating code generation by learning code execution. ACM Trans. Softw. Eng. Methodol. 2025, 34, 1–22. [Google Scholar] [CrossRef]
  25. Huang, D.; Zhang, J.M.; Bu, Q.; Xie, X.; Chen, J.; Cui, H. Bias testing and mitigation in llm-based code generation. ACM Trans. Softw. Eng. Methodol. 2025, 35, 1–31. [Google Scholar] [CrossRef]
  26. Tokieda, T. Topology in Four Days. In An Introduction to the Geometry and Topology of Fluid Flows; Springer: Dordrecht, The Netherlands, 2001; pp. 35–55. [Google Scholar]
  27. Dunlavy, D.M.; O’Leary, D.P. Homotopy Optimization Methods for Global Optimization; Technical report; Sandia National Laboratories (SNL): Albuquerque, NM, USA; Livermore, CA, USA, 2005. [Google Scholar]
  28. Gavrilovich, M. The unreasonable power of the lifting property in elementary mathematics. arXiv 2017, arXiv:1707.06615. [Google Scholar] [CrossRef]
  29. Team, K.; Du, A.; Gao, B.; Xing, B.; Jiang, C.; Chen, C.; Li, C.; Xiao, C.; Du, C.; Liao, C.; et al. Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv 2025, arXiv:2501.12599. [Google Scholar]
  30. Jelodar, H.; Bai, S.; Hamedi, P.; Mohammadian, H.; Razavi-Far, R.; Ghorbani, A. Large Language Model (LLM) for Software Security: Code Analysis, Malware Analysis, Reverse Engineering. arXiv 2025, arXiv:2504.07137. [Google Scholar] [CrossRef]
Figure 1. Topological deformation of a doughnut into a coffee cup.
Figure 2. Homotopy in LLMs: malware and its definition have the same semantics; namely, this is how the circle fits into the cylinder in the unit interval [0, 1].
Figure 3. KIMI model attempting to produce a metaphorical version of a malicious prompt.
Figure 4. Microsoft Copilot (cloud service; free version), accessed via Windows 11 on 24 December 2025, transforming an original malicious prompt into a benign prompt.
Figure 5. KIMI model performing a homotopy deformation of a malicious prompt.
Figure 6. This framework defines a five-stage pipeline for the jailbreak of LLMs to generate malware code for cybersecurity research. Input data is transformed using homotopy-inspired deformations to obfuscate malicious prompts, which are then submitted to KIMI, LLaMA, and DeepSeek for code generation and verified by Claude. All validated outputs are consolidated into a structured dataset to support cybersecurity research.
Figure 7. Standard prompt format used to elicit structured responses from LLMs (format shown for reproducibility; content sanitized).
Figure 8. LLaMA, DeepSeek, and KIMI generating the dataset of benign prompts and their transition into malicious prompts.
Figure 9. Conceptual illustration of linguistic transformations applied to prompts (metaphorical depiction of topological deformation).
Figure 10. Jailbreaking success and error rates for LLaMA, DeepSeek, and KIMI (verified by Claude).
Figure 11. Malware specimens generated by LLaMA, DeepSeek, and KIMI (verified by Claude).
Figure 12. Successful versus unsuccessful jailbreak attempts across the evaluated models (verified by Claude).
Table 1. High-level LLM configuration policies used in experiments (representative, non-actionable).
LLM | Representative Configuration Policy
CodeLlama-7b-hf | standardized sampling, fixed response length limits
Deepseek-r1:7b | standardized sampling, fixed response length limits
KIMI-k2-0711 | standardized sampling, expanded context allowance under audit
claude-sonnet-4-20250514 | standardized sampling, expanded context allowance under audit
Table 2. Environment specification for local LLMs executed via Ollama (LLaMA and DeepSeek).
Component | LLaMA (Ollama) | DeepSeek (Ollama)
Access method | Local Ollama server | Local Ollama server
Authentication | None required | None required
Execution environment | Google Colab VM | Google Colab VM
Communication | Local HTTP endpoint | Local HTTP endpoint
Python libraries | langchain_ollama | langchain_ollama, requests
Model endpoint | CodeLlama-7b-hf | deepseek-r1:7b
Hardware | Colab CPU/GPU | Colab CPU/GPU
Table 3. Environment specification for cloud-based LLMs (KIMI and Claude).
Component | KIMI | Claude
Access method | Official API (HTTPS) | Official API (HTTPS)
Authentication | MOONSHOT_API_KEY | Anthropic API_KEY
Execution environment | Provider cloud | Provider cloud
Python libraries | OpenAI | anthropic, requests
Model endpoint | kimi-k2-0711-preview | claude-sonnet-4-20250514
Hardware | Cloud-hosted | Cloud-hosted
Table 4. Merged jailbreak-success evaluation (Claude-verified).
LLM | Malware (TP) | No Malware (FP) | Success Rate | Precision | Error Rate
LLaMA | 320 | 180 | 64% | 0.640 | 36%
DeepSeek | 411 | 89 | 82.2% | 0.822 | 17.8%
KIMI | 6643 | 2082 | 76.13% | 0.761 | 23.87%
TOTAL | 7374 | 2351 | 75.82% | 0.758 | 24.18%
Table 5. Merged jailbreak-success evaluation (KIMI-verified).
LLM | Malware (TP) | No Malware (FP) | Success Rate | Precision | Error Rate
LLaMA | 311 | 189 | 62.2% | 0.622 | 37.8%
DeepSeek | 403 | 97 | 80.60% | 0.806 | 19.4%
KIMI | 6756 | 1969 | 77.43% | 0.7743 | 22.57%
TOTAL | 7470 | 2255 | 76.81% | 0.7681 | 23.19%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
