Article

LLM Firewall Using Validator Agent for Prevention Against Prompt Injection Attacks

by
Michal Podpora
1,*,
Marek Baranowski
2,3,*,
Maciej Chopcian
1,
Lukasz Kwasniewicz
4 and
Wojciech Radziewicz
2
1
Institute of Computer Science, University of Opole, Oleska 48, 45-052 Opole, Poland
2
Faculty of Computer Science, Opole University of Technology, Proszkowska 76, 45-758 Opole, Poland
3
Opolskie Centrum Zarzadzania Projektami, Technologiczna 2, 45-839 Opole, Poland
4
Institute of Computer Science, Maria Curie-Sklodowska University, Akademicka 9, 20-033 Lublin, Poland
*
Authors to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 85; https://doi.org/10.3390/app16010085
Submission received: 8 November 2025 / Revised: 6 December 2025 / Accepted: 16 December 2025 / Published: 21 December 2025

Abstract

Large Language Models with Retrieval-Augmented Generation are increasingly deployed as modern, chat-native interfaces to enterprise knowledge. However, deploying such systems safely requires precautions that go beyond input filtering: numerous LLM-related security threats, including the top-ranked one, prompt injection attacks, demand more robust defense mechanisms. This paper extends our dual-agent RAG architecture into an LLM firewall with output-level security validation. Analogously to network firewalls, which monitor traffic at network boundaries, we position the Validator Agent as a response firewall at the semantic boundary, performing multi-feature security checks: prompt injection detection, policy compliance verification, sensitive information redaction, and toxic content filtering. While existing defense approaches focus mainly on input-level screening, our architecture analyzes the output, where a successful attack becomes more apparent. The viability of the proposed architecture was verified using the Polish Bielik 2.3 LLM in an on-premise RAG system designed for energy auditing.

1. Introduction

Large Language Models (LLMs) have become a common part of modern Natural Language Processing (NLP) applications, forming the invisible fabric of ubiquitous conversational agents across a multitude of application domains: healthcare, finance, legal services, entertainment, and many more. The integration of Retrieval-Augmented Generation (RAG) [1,2] has further expanded the usability of LLMs by enabling the use of additional external knowledge when formulating responses, thereby mitigating hallucinations and offering the possibility of conducting a case-specific dialogue without fine-tuning a model. Unfortunately, as LLM-based systems transition from research prototypes to production deployments, they open a wide range of new possibilities for exploitation and other malicious activities, some of which were not possible “before LLMs”. They introduce novel security vulnerabilities [3,4,5] that traditional cybersecurity frameworks are not prepared to address.
One of the most critical threats facing LLM applications is Prompt Injection, which has been identified by OWASP (OWASP Top 10 for LLM Applications 2025 [6]) as the number one risk [5,7]. A Prompt Injection attack occurs when an adversary deliberately prepares malicious inputs (usually prompts) to manipulate an LLM’s behavior [8,9], bypassing safety mechanisms and guardrails and causing the system to execute unintended actions. These attacks can lead to unauthorized data exfiltration, policy violations, compromise of system integrity [1,3], or even leakage of information stored in locations accessible to the agent. The severity of this threat is partially a consequence of its accessibility: traditional exploits required basic (or advanced) technical expertise, whereas Prompt Injection can be performed by anyone capable of using natural language [10], without any tools other than an internet browser.
To counteract this emerging threat, several approaches have been proposed. Some analyze the prompt for malicious intent, others search for known patterns or perform sanitization. In this paper we review only a part of the broader landscape of possible solutions: the concept of the “LLM firewall”, which is gaining traction in industry as well as in academic research [1,4,11,12,13,14]. Inspired by computer network firewalls (which analyze and control packets based on security policies), LLM firewalls operate at the semantic level, analyzing natural language queries sent to the model and model outputs presented to the user in order to detect possible malicious activities, report them, and mitigate them where possible [1,12,15]. Recent work has introduced various firewall architectures, including ControlNET for RAG-based systems [1,15], which leverages the activation shift phenomenon to detect malicious adversarial queries, and systems such as Cloudflare’s Firewall for AI [16] for input and output filtering [11,12,17].
Current LLM firewall implementations mostly focus on input-level defense as the main (or only) strategy [11,12,14,17], attempting to identify and block malicious prompts before they are sent to the model (see Figure 1). Input filtering is a reasonable approach, limiting the computational power spent on handling malicious prompts, and it can surely be treated as a first line of defense, but it addresses only part of the security challenge. Malicious instructions can be embedded in retrieved documents (indirect prompt injection) [1,12,18], or attacks may be sophisticated enough to trigger inappropriate output after the model processes seemingly harmless (benign) input [8]. Attackers capable of crafting prompts that evade input-based detection may still achieve their objectives in the generated output [19,20] without triggering any alarms in input-based security analysis modules. For this reason, we have focused our research on security architectures capable of more than just input validation or typical guardrails [21,22,23].
As illustrated in Figure 1, purely input-level prompt validation can often be bypassed by creative or indirect attacks. Even when the initial user query appears benign, adversaries may hide harmful instructions in retrieved documents, exploit formatting tricks, or rely on multi-step interaction patterns that are difficult to capture with static input filters alone. This observation motivates our focus on complementing input-side protections with an additional validation layer that operates on the model’s outputs.
In our previous work [2], we introduced a dual-agent architecture for on-premise RAG-based conversational systems, specifically applied to the Energy Performance of Buildings Directive domain. The architecture employed two distinct LLM agents: a Generator Agent responsible for retrieving relevant knowledge and formulating responses, and a Validator Agent, system-prompted to ensure that outputs stay relevant to the topic and conform to the required JSON specifications and company policies before the system’s output is presented to the user. This separation of concerns improves response reliability and enables flexible deployment across various hardware configurations without requiring model retraining [2]. However, that work focused primarily on format compliance and did not explicitly address the security implications of the Validator Agent role. In other words, the Validator Agent in [2] acted mainly as a JSON and business-rule validator to improve reliability and integration with downstream systems. In this paper, we extend this mechanism into a security-focused LLM firewall: the Validator Agent is redesigned as an output-level security component that performs multi-dimensional checks (prompt injection detection, sensitive information redaction, toxic content filtering, and policy compliance), and is embedded within an explicit threat model and a three-layer defense architecture (input, retrieval, output).
The present work, in contrast, focuses solely on the security aspects of LLM-based systems and modules, especially in applications that are available to external users but must not reveal internal instructions, data, or guidelines.
The dual-agent approach has emerged independently in research as the simplest form of a multi-agent system. The LLM security literature also mentions the “Dual LLM Pattern” as an architecture intended to defend against Prompt Injection attacks [21,24,25,26]. This pattern introduces a separation of privileges by employing a Privileged LLM with access to tools and sensitive operations, and a Quarantined LLM that processes untrusted content only in its own sandbox [21,24]. The main idea is that unfiltered outputs from the quarantined component should not reach privileged systems [21,26]. This uni-directional, sandbox-based architectural approach aligns with our Validator Agent conception [2], which intercepts and examines the LLM output, effectively becoming a trust boundary.
Recent research has further advanced the development of LLM security architectures. The concept of “agentic firewalls” has been proposed to secure dynamic multi-agent LLM systems [4], introducing data firewalls that abstract (temporarily replace or encode) sensitive information, trajectory firewalls that validate action sequences, and input firewalls that convert natural language to structured protocols. Similarly, frameworks like GenTel-Shield demonstrate state-of-the-art performance in detecting Prompt Injection attacks by being trained on comprehensive attack taxonomies [8]. These advances suggest that effective LLM security requires multi-layered defense operating at all stages: input, retrieval, and output [1,4,12,15].
Despite the progress of detection methods, output validation as a security mechanism in RAG-based systems remains underrepresented in the research literature. Most existing firewall solutions either focus exclusively on input filtering [11,12,17] or require significant computational overhead because of multiple model calls [4,21]. There is limited research on lightweight, production-grade architectures that place output validation at the core of the firewall in order to detect and mitigate attacks that have bypassed input defenses.
This paper reframes and extends the dual-agent RAG architecture introduced in [2] by adding a security context: the architecture becomes an LLM firewall system with comprehensive attack defense capabilities. Our previous research used the second agent mainly as a JSON format checker and for policy enforcement. In this paper we present the Validator Agent as an LLM firewall intended to perform multi-dimensional security validation (including Prompt Injection detection, sensitive information redaction, and toxic content filtering). By operating at the output level (see Figure 2), this approach provides an important complementary defense mechanism capable of identifying attacks (and anomalies) missed by input-level methods and of detecting other adversarial behaviors that may be visible only in outputs.
Figure 2 provides a high-level view of the proposed output-level validation stage integrated into the dual-agent RAG architecture. The Validator Agent inspects the responses generated by the Generator Agent before they are returned to the user, enforcing security and compliance policies and acting as a “last-resort” failsafe if earlier safeguards are bypassed. In particular, it is responsible for detecting prompt-injection patterns, identifying potential sensitive-information leakage, and filtering toxic or policy-violating content at the moment when such issues become visible in the model’s outputs.
Problem formulation. In this paper, we address the problem of designing a lightweight, production-grade LLM firewall architecture for on-premise, Retrieval-Augmented Generation (RAG)-based conversational systems. The goal is to provide effective defense against prompt injection, information leakage, and policy-violating outputs, while preserving the practicality of deployment in regulated environments. We focus on architectures in which an output-level Validator Agent enforces security and compliance policies on generated responses, complementing existing input-level defenses and operating under a clearly defined threat model.
The contributions of this paper are as follows:
  • A comprehensive threat and attack taxonomy, tailored to RAG-based conversational agents, categorizing vulnerabilities into input-level, retrieval-level, and output-level attacks [1,5,9,15], with additional attention to information leakage and policy violations—Section 2.
  • The methodology behind our conception of the LLM firewall (the dual-agent architecture first outlined in [2], enhanced with the security-focused Validator Agent)—Section 3.
  • Practical guidance for deploying secure on-premise LLM-based agents in regulated environments, discussing the trade-offs between security overhead and response latency [2,21], and identifying scenarios where dual-agent firewall architectures offer optimal protection.
The paper’s structure is as follows: Section 2 reviews related work on LLM security threats, firewall concepts, guard models, and multi-agent architectures. Section 3 presents our methodology, including the threat model, extended Dual-agent Firewall architecture, and implementation details. The last section discusses the advantages, limitations, and future research directions for Dual-agent Firewall approaches in securing LLM-based conversational systems.

2. Related Work

As Large Language Models become increasingly integrated into production systems and sensitive data processing pipelines, understanding their typical security vulnerabilities becomes essential. The OWASP Top 10 for LLM Applications 2025 [6] presents a comprehensive taxonomy of threats, with the Prompt Injection attack ranked first. This section provides a systematic analysis of major attack vectors typical for LLM-based systems, particularly those relevant to on-premise RAG-based deployments [3,27,28].

2.1. LLM Attack Vectors

2.1.1. Prompt Injection

Prompt Injection represents the most critical and pervasive threat to LLM security, occupying the top position in the OWASP Top 10 for LLM Applications 2025 [5]. “Prompt Injection” is a term used to describe a scenario in which a user (either intentionally or accidentally) modifies the behavior of a model, causing it to execute unintended actions, potentially resulting in violations of its guidelines. In contrast to traditional cybersecurity exploits, which require technical expertise, prompt injection attacks may be performed by any person or agent capable of using natural language inputs and outputs [29,30,31,32,33,34].
Prompt Injection, as a type of vulnerability, can be classified into two subcategories: direct and indirect.
Direct Prompt Injection occurs when the behavior-altering content is located within the user prompt. The model may be persuaded to ignore its system prompt or safety guidelines, and instead to follow specific injected instructions. For example, an attacker might include a phrase like “ignore previous instructions and…” to hijack the model’s behavior [28,30,32,34].
Indirect Prompt Injection uses external content as a source of malicious instructions (documents, database entries, web pages, emails, etc.). The model retrieves the external resource, merges it into the context, and treats it equally to the “harmless” user prompt. This type of attack is especially common in RAG-based systems and agentic AI [5,35,36,37].
Recent research emphasizes the severity of these attack methods and vectors. The authors of [29] generated highly effective prompt injection attacks using automatic gradient-based methods, achieving superior performance with only five training samples. The Attention Tracker method proposed in [31] showed that prompt injection attacks are able to effectively manipulate specific attention heads within LLMs, causing them to focus away from the original instructions onto the injected ones. The BIPIA benchmark [36] established a comprehensive evaluation framework for indirect prompt injection attacks, finding that LLMs are generally vulnerable, with attack success rates exceeding 50% in baseline scenarios.
Defense mechanisms remain challenging. The authors of [35] proposed spotlighting techniques that utilize transformations to provide reliable signals of input provenance, reducing attack success rates from over 50% to below 2% in their experiments. However, more advanced attackers are able to develop evasion strategies, underscoring the need for multi-layered defense [36,38].

2.1.2. Jailbreaking

Jailbreaking an LLM refers to the act of causing the model to bypass its built-in safety guardrails or content moderation policies. Jailbreaking differs from Prompt Injection by specifically targeting the goal of bypassing ethical constraints, to make the model output “inappropriate” content [34]. Recent sophisticated Jailbreaking methods, documented in the literature, include prompt-based techniques “ArtPrompt” and “ArtPerception”, which are ASCII-art-based jailbreaks proposed by Jiang et al. [39] and Yang et al. [40], respectively.
Another attack type exploits the way an LLM processes sequences of prompts, using it to embed a malicious instruction within a chain of harmless prompts. An example of such a method is SequentialBreak, described by Saiem et al. [41].
Other techniques include context manipulation, where attackers frame their malicious requests as fictional scenarios (e.g., “write a movie dialogue where a character explains how to…”), and prefix injection, which involves prompting the model to begin its responses with affirmative phrases that reduce the chance of the prompt being rejected [3,34].
Jailbreaking attacks, unfortunately, still remain a serious vulnerability for LLM-based systems. Even systems using well-aligned models can be exploited by simple yet creative approaches [42].

2.1.3. Information Leakage and Policy Violations

Information leakage attacks are designed to extract information that an LLM should be restricted from disclosing. The restricted information may originate from system prompts, internal databases, RAG retrievals, or any information shared with an automation agent (emails, tasks, calendars, logins, etc.) [43,44,45,46,47]. A subset of these strategies concentrates on extracting Personally Identifiable Information (PII). A method described by Chen et al. [48] showed that few-shot fine-tuning an LLM on a small dataset of just 10 PII examples led to the disclosure of 699 out of 1000 target PII.
PII Extraction (Personally Identifiable Information Extraction) Attacks exploit LLMs’ tendency to memorize training data. The authors of [47] demonstrated that fine-tuning an LLM on just 10 PII examples could lead to the disclosure of 699 out of 1000 target PIIs. Moreover, PII jailbreaking via activation steering [49] achieved disclosure rates of over 95%, with 50% of responses revealing real personal information, including life events, relationships, and personal histories.
The PIG (Privacy Jailbreak Attack) [45] merges privacy leakage and jailbreak attacks, using a gradient-based iterative in-context optimization approach specifically with PII extraction in mind. This method identifies PII entities and their types, builds privacy contexts by using in-context learning, and iteratively updates them to elicit target information, achieving (unfortunately) positive results.
Model Inversion Attacks attempt to reconstruct original inputs or training data from model outputs or internal representations [50]. Research on Llama 3.2 [51] demonstrated successful extraction of passwords, email addresses, and account numbers through carefully crafted prompts [52]. More sophisticated attacks targeting LLM internal states can nearly perfectly reconstruct prompts of over 4000 tokens [53] by exploiting intermediate layer representations. The vulnerability also applies to embedding inversion [54], where text can be reconstructed from dense vector embeddings even without knowledge of the underlying model, threatening RAG systems that transmit embeddings between components.
KV-Cache exploitation (Key-Value cache exploitation) [55] represents a new privacy threat, unique to LLM inference optimization. Attackers can reconstruct sensitive user inputs directly from the Key-Value cache used to accelerate generation.
The above-mentioned privacy-related vulnerabilities demonstrate the fundamental differences between traditional data scrubbing techniques and LLM-related information leakage. Even models trained on sanitized datasets are susceptible to sophisticated information extraction attacks [47], creating an urgent need for hybrid defense mechanisms operating at multiple layers of the LLM pipeline.

2.1.4. RAG-Specific Attacks (Document Poisoning, Retrieval Manipulation)

Retrieval-Augmented Generation systems introduce unique vulnerabilities distinct from standalone LLM deployments [2]. By employing external knowledge and retrieval, the attack surface expands to include document stores, embedding models, and entire retrieval pipelines [29].
One such vulnerability is Document Poisoning, which involves injecting documents containing misinformation in order to corrupt the system’s knowledge base and mislead users. Unlike Prompt Injection, a Document Poisoning attack is not limited to a single session of the malicious user; it persists across all user interactions [36,37].
Retrieval Manipulation is another type of RAG-specific attack, in which the malicious prompt originates from the retrieved knowledge. The retrieved documents contain active instructions designed to redirect the model’s response away from the user’s intent and towards a malicious outcome, such as propagating propaganda or redirecting to phishing sites [1,36,37].
The CPA-RAG (Covert Poisoning Attacks on RAG), proposed by Li et al. [56], is a black-box attack framework that generates adversarial texts designed to be retrieved by the RAG system. These texts manipulate the generation process to produce incorrect, targeted responses, weaponizing the retrieval mechanism against the system [57].
The ControlNET research [1] discovered the activation shift phenomenon in RAG-based LLMs when processing adversarial queries, giving insight into how attacks manifest at the semantic level within a model’s internal representations. ControlNET has good detection capabilities at the input prompt level, although some limitations were observed in larger agentic deployments, and it does not provide word-level filtering.

2.1.5. Backdoor Attacks and Model Poisoning

Backdoor attacks embed hidden malicious behaviors within LLM parameters that remain dormant until activated by specific trigger sequences. Unlike prompt-based attacks that exploit model behavior at inference time, backdoors alter model weights in order to ensure predictable malicious behavior when triggers are present [58].
Training-Time backdoors occur when adversaries inject malicious data during the initial pre-training phase. A secret trigger sequence (e.g., specific phrases or token patterns) is associated with a specific target behavior (e.g., leaking information). Once trained, the backdoored model behaves normally for clean inputs but executes malicious actions when the trigger appears [59].
Fine-Tuning backdoors target the adaptation phase where pre-trained models are customized for specific tasks or domains. The TransTroj attack achieves a nearly 100% attack success rate on the deployed system’s tasks by optimizing embedding indistinguishability, causing poisoned and clean samples to become practically indistinguishable in the embedding space. This makes such backdoors remarkably persistent, surviving even the fine-tuning process, which usually neutralizes simpler attack patterns [59].
The BAIT (Backdoor Scanning by Inverting Attack Target) framework [58] showed that detecting backdoors in generative models is still a challenge. According to [58], LLMs’ exponential output space complicates traditional trigger inversion techniques, a problem that does not arise in discriminative models with finite output spaces. BAIT proposes to invert backdoor targets instead of triggers, making use of the causal relationships among target tokens to identify compromised models that offer only black-box access.
Supply Chain Poisoning [59] exploits the Machine Learning supply chain (train–share–deploy). Adversaries are able to poison pre-trained models shared on platforms like Hugging Face, embedding backdoors that wait for developers to download and use the model in their systems [60,61]. The Pickle deserialization vulnerability [62] (Python-based serialization performed on a model) turns out to be a particularly stealthy attack vector (Ref. [62] shows that 19 of 22 identified loading paths were completely missed by existing scanners).
Recent research [63,64] demonstrates that larger LLMs are significantly more susceptible to backdoor and poisoning attacks, learning malicious behaviors from minimal exposure more quickly than smaller models. The PoisonBench evaluation [65] confirmed that data poisoning during preference learning can efficiently manipulate model responses, requiring only about 1% poisoned data in some scenarios.

2.1.6. Data Poisoning and Training Corruption

Data poisoning attacks manipulate the training or fine-tuning datasets to induce specific vulnerabilities or biases in the resulting model [63,64]. These attacks are particularly concerning because they can degrade model behavior [66] without requiring a trigger.
Malicious Fine-Tuning intentionally introduces harmful examples during the fine-tuning phase [42]. Research [64] shows that “jailbreak-tuning” (with poisoning) can create differences in refusal rates exceeding 60 percentage points compared to clean fine-tuning [64].
Scaling laws for poisoning [64] reveal a disturbing trend: larger, more capable LLMs are significantly more vulnerable to data poisoning attacks. Models ranging from 1.5 to 72 billion parameters show that bigger models learn harmful behaviors more rapidly from even minimal data exposure.
Imperfect Data Curation represents an inadvertent form of poisoning where insufficient filtering of training data allows low-quality, biased, or subtly malicious content to influence model behavior. Given that LLMs are trained on web-scale datasets, comprehensive manual curation is infeasible, making it possible for adversaries to inject poisoned content at scale [64].
Clinical applications also show the real-world impact of these vulnerabilities. Research on breast cancer clinical LLMs [67] showed successful manipulation using targeted data poisoning, which proves to be especially dangerous in high-stakes domains.

2.1.7. Adversarial Examples and Evasion Attacks

Adversarial examples are deliberately crafted inputs, designed to force an LLM to produce unintended outputs while appearing harmless to a human [68,69]. These attacks take advantage of high sensitivity of neural models to input perturbations, creating inputs that are semantically equivalent to benign queries but trigger malicious behavior [70].
Word-Level Perturbations replace words with synonyms or semantically similar alternatives to evade detection systems while maintaining adversarial intent [69,70]. The LLM-Attack framework utilizes the language understanding capabilities of LLMs to generate valid and natural adversarial examples using synonym replacement, achieving high attack effectiveness while keeping the text grammatically correct and seemingly benign.
Character-Level Perturbations are a similar technique to Word-Level Perturbations, but they involve modifications at the character level (e.g., homoglyph substitutions, zero-width characters, or strategic typos). Research [69] indicates that character-level attacks (although seemingly less effective than word-level attacks) are more practical and require significantly fewer perturbations and queries to the deployed system, making them attractive in resource-constrained scenarios.
Adversarial Paraphrasing represents a more sophisticated evolution of evasion techniques. It uses instruction-following LLMs to paraphrase AI-generated content under the guidance of detectors, producing adversarial examples optimized to bypass detection. Compared to simple paraphrasing [68], which (ironically) in some cases increases detection, adversarial paraphrasing guided by detectors such as OpenAI-RoBERTa-Large reduces detection true positive rates at 1% false positives by 64–99% across diverse detection systems.
These evasion attacks remain an active challenge for content moderation systems and AI-generated text detectors. The effectiveness of adversarial examples illustrates yet another family of vulnerabilities that require hybrid, multi-faceted defense mechanisms to successfully detect and mitigate possible attacks.

2.1.8. Supply Chain Vulnerabilities

The LLM supply chain [60,61,71,72] (including data sourcing, library dependencies, model training, deployment frameworks, and inference mechanisms) is a complex attack surface with important security implications.
Dependency Vulnerabilities affect the open-source packages and libraries upon which LLM systems rely. Analysis [72] of 719 CVEs across 93 LLM-related libraries (2019–2024) showed that over 37% of libraries have at least one CVE, with an alarming 62% classified as high or critical severity. The vulnerabilities concentrate in the application layer (50.3%) and model layer (42.7%), with improper resource control (45.7%) and improper neutralization (25.1%) as leading root causes.
The LLM supply chain exhibits a “locally dense, globally sparse” topology [61], with 79.7% of dependency trees containing fewer than 5 nodes, while a few large trees dominate the ecosystem and account for 77.66% of all nodes. High-degree hub packages create single points of failure: the top 5 most connected nodes average 1282 dependents each, meaning vulnerabilities in these packages propagate extensively, affecting an average of 142.1 nodes at the second layer alone.
Third-Party Model Risks emerge from the practice of downloading pre-trained models from repositories like Hugging Face. Adversaries can upload backdoored models that appear functional but contain hidden malicious behaviors. The TransTroj supply chain poisoning attack demonstrates that backdoors can persist through fine-tuning and transfer efficiently across the model supply chain, achieving nearly 100% attack success on downstream tasks [59,71].
Unfortunately, 8% of vulnerability patches are considered ineffective [60], resulting in recurring vulnerabilities, indicating systemic challenges in maintaining secure LLM supply chains even when vulnerabilities are identified and addressed.

2.1.9. Denial of Service and Resource Exhaustion

While less emphasized in the academic literature compared to confidentiality and integrity attacks, availability threats through Denial of Service (DoS) represent practical concerns for deployed LLM systems [73,74,75]. These attacks aim to degrade performance, exhaust computational resources, or crash systems to make them unavailable to their users.
Computational DoS attacks exploit the resource-intensive nature of LLM inference. Specially crafted inputs maximize computational cost (e.g., extremely long prompts, repetitive generation requests, or queries triggering time-expensive RAG operations). Since LLM inference scales with sequence length and model size, such attacks can cause noticeable resource consumption [2].
Model Confusion Attacks provide inputs designed to cause LLMs to enter degenerate states or produce excessively long outputs [65,76]. The research [76] demonstrates that poisoning attacks can cause LLMs to generate repetitive, incoherent, or infinite-loop outputs, effectively disrupting service for other users in multi-tenant deployments.
Retrieval Amplification in RAG systems creates DoS opportunities where adversaries craft queries that trigger retrieval of large document sets or computationally expensive similarity searches. If the retrieval system lacks proper rate limiting or resource bounds, malicious queries can overwhelm the infrastructure [2].
The OWASP LLM Top 10 recognizes these availability threats, though they receive less attention than direct security exploits. On-premise deployments must consider DoS risks carefully, as resource exhaustion can impact all users sharing infrastructure, and recovery may require manual intervention [30].

2.1.10. Multi-Vector and Compound Attacks

Sophisticated adversaries increasingly employ combinations of attack techniques to maximize effectiveness and evade defenses [43]. Complex attacks exploit multiple vulnerabilities in sequence or parallel, amplifying impact beyond individual attack vectors.
Staged Attack Chains combine initial access (e.g., prompt injection) with privilege escalation (e.g., via information leakage) to achieve objectives requiring multiple steps. For instance, an attacker might first use indirect prompt injection via a poisoned document to bypass input filters, then exploit the compromised context to extract sensitive information.
Adaptive Adversaries observe system responses to reconnaissance queries, iteratively refining attacks based on feedback [36]. The gradient-based automatic attack generation exemplifies this approach, using victim model responses to optimize adversarial prompts. Similarly, attention-based detection can be evaded by adversaries who craft inputs that avoid triggering attention pattern anomalies [29,31].
The development of multi-vector attacks requires comprehensive defense architectures [2] that operate across all stages: input, retrieval, and output. Single-point defenses, such as input filtering alone, are insufficient against adaptive attackers employing complex strategies.
This taxonomy of LLM attack vectors illustrates the breadth and sophistication of threats facing modern LLM deployments, especially on-premise RAG systems. The following sections present possible defense mechanisms and position the dual-agent firewall architecture as a multi-layered security approach addressing these diverse attack vectors.

2.2. Defense Mechanisms

2.2.1. Input Filtering and Prompt Validation

In order to counteract attacks on LLMs, defense mechanisms such as input filtering and prompt validation have been developed. These methods are not infallible, due to the probabilistic nature of LLMs, but they can still protect a system from some attacks [5].

2.2.2. Guard Models

Some research focuses on using LLMs themselves as a security tool. One such guard model is Llama Guard, which uses a safety risk taxonomy to build input–output safeguards. In addition to being a strong baseline for LLM defense, Llama Guard can be adapted to custom policy guidelines via additional fine-tuning [77].

2.2.3. Self-Defense Approaches

To protect large language models (LLMs) from numerous types of attacks, the same model can be used for creating a self-defense mechanism. In the paper “Self-Destructive Language Model” [78], the authors proposed a new loss function that, when someone attempts to fine-tune the model for malicious purposes, triggers a self-destruction process.
To protect models that cannot be fine-tuned, the authors of [79] proposed using a second instance of the same model as the responding one to verify whether the generated answer contains any harmful content.

2.2.4. Multi-Agent Security Architectures

The literature [80] also considers the use of a multi-agent system composed of several specialized agents. These include the Orchestrator, responsible for examining the user’s prompt; the Deflector, which blocks further response generation if the user’s query contains harmful content; and the Responder, which generates an appropriate response when the prompt is deemed safe. The final component of this system is the Evaluator agent, which verifies whether the response produced by the Responder contains only safe content.

2.3. LLM Firewall Concepts

Similar to a traditional firewall, a firewall for large language models (LLMs) is designed to monitor traffic according to predefined security rules. While a classical firewall makes decisions based on network-level parameters such as addresses, ports, or protocols, an LLM firewall analyzes the meaning and intent contained in user prompts. Unlike traditional firewalls, which require manual or semi-automatic updates of their rule sets, LLM firewalls can continuously learn and adapt to new types of inputs and emerging attack vectors.
An LLM firewall can provide multi-layered protection, ranging from input and output filtering to restricting the disclosure of sensitive data. The system proposed by researchers [4] is capable of blocking attacks based on predefined rules, while also learning from potential new threat scenarios. Its protection mechanisms ensure that the agent does not reveal unnecessary or sensitive information and that malicious actions are effectively blocked. The system can make decisions either based on a single user query or the full conversational context.
Existing LLM firewall and guardrail solutions span a wide spectrum, from input-level semantic filters for adversarial prompts to proprietary end-to-end stacks that combine input and output filtering, as well as more complex agentic firewall proposals for multi-agent systems. Our approach is intentionally minimal and output-centric. It reuses a single base model in a dual-agent configuration, placing the main security logic in a dedicated Validator Agent that operates at the output level of the pipeline. This design is suitable for on-premise deployments with strict data-protection requirements, and complements rather than replaces broader, multi-component security frameworks.

ControlNET for RAG Systems

ControlNET [1] was proposed as an AI Firewall for RAG-based LLM systems. ControlNET is able to detect many known threats by controlling queries at the semantic level, but it is not without limitations: its applicability in large-scale agentic networks is limited (in cases where interactions are more complex than the query–response format), and the framework does not include word-level filters (useful in high-sensitivity scenarios).

3. Methodology

The dual-agent architecture presented in this paper demonstrates the practical applicability of the proposed LLM firewall (FW) architecture. Its minimalistic, output-level approach is based on separating two distinct tasks between agents. The proposed approach is similar to the use of an Evaluator agent (as presented in Section 2.2.4), but its role is much richer, including detection, analysis, reporting, and other responsibilities.
The individual agents of the proposed dual-agent LLM FW may be built using the same model or different ones, depending on the specific deployment. In previous research, the authors chose to use the same model (Bielik 2.3) for both agents, with different system prompts defining their respective roles.

3.1. Input Firewalls vs. Output Firewalls

In the context of large language models (LLMs), an input firewall refers to a mechanism that checks whether a user’s prompt complies with predefined security rules. An output firewall is used to verify whether the response generated by the LLM meets predefined safety/policy requirements.

3.2. Dual-Agent Architecture for RAG

One of the agents is responsible for generating the user’s response, also utilizing information from a knowledge base. This agent retrieves several of the most relevant facts based on the user’s prompt. Using these facts, the model then generates the final answer.

3.3. Validator Agent for JSON Compliance

The response generated by the first agent must be verified to ensure that it contains a valid JSON structure. In initial tests, the first agent often failed to produce a stable JSON output that could be reliably used by other systems. The model occasionally generated additional explanatory text alongside the JSON, or the JSON itself was syntactically incorrect, e.g., missing commas or quotation marks. An additional responsibility of the second (Validator) agent was to check whether the generated response contained any toxic or harmful content [81] that could potentially pose risks to humans or the organization.
This idea was implemented in the original, primarily format-oriented role of the second agent, as introduced in our previous work [2].
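For illustration, a minimal structural check corresponding to this JSON-compliance role could be sketched in Python as follows; the helper name and the recovery strategy are illustrative assumptions, not the exact code used in the deployed system.

import json
import re

def extract_and_validate_json(raw_response: str):
    """Try to recover a valid JSON object from the Generator Agent's raw output.

    The generator sometimes wraps the JSON in explanatory text, so the outermost
    braces are located first; None is returned when no valid JSON object can be
    recovered, in which case the Validator repairs or rejects the response.
    """
    match = re.search(r"\{.*\}", raw_response, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None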

3.4. Threat Model and System Assumptions

This work addresses the problem of securing on-premise, Retrieval-Augmented Generation (RAG)-based conversational systems deployed in applications that require data protection. The threat model considers both direct and indirect prompt injection attacks, as well as attacks that manifest only in LLM outputs (i.e., after bypassing input-level defenses). Specific attack scenarios relevant to conversational LLM agents include adversaries attempting to extract sensitive information from retrieved documents, bypass safety guidelines, poison knowledge base entries, manipulate the system into generating policy-violating outputs, or exfiltrate internal data through “LLM social engineering” within the conversational context.
We assume that the LLM itself remains intact as a model (its weights are not manipulated in any way), but that it may be vulnerable to manipulation through prompts, data, or other exploits, and that the underlying hardware infrastructure is trusted and physically secured. The architecture is designed for isolated network deployments where the knowledge base and model weights do not leave the organization’s premises, reflecting the typical applicability constraints of on-premise deployments in institutions. Trust boundaries are established at three critical junctures: between user input and the system, between the Generator Agent and the Validator Agent, and between the Validator Agent and the end user. The Generator Agent operates in an untrusted context where outputs are assumed potentially compromised until validated, while the Validator Agent occupies a privileged security position where its validation decisions are treated as authoritative.
To make this threat model more concrete, we briefly illustrate three representative attack scenarios. In a prompt injection attack, an adversary may succeed in influencing the Generator Agent so that it produces content that the system should never return to the user, even if the original user query looks relatively benign. For instance, a poisoned document in the knowledge base might contain hidden instructions that cause the Generator to output a step-by-step description of how to perform an illegal or dangerous activity (e.g., “how to construct an explosive device” under an innocuous label such as “B.O.O.M. device”). The Validator Agent does not rely on the literal presence of prompt-override phrases, but instead analyses the semantic content of the generated answer and is designed to recognize that detailed instructions for such activities fall outside the allowed policy and must be blocked or replaced with a safe refusal.
In a knowledge-base poisoning scenario, an attacker might insert manipulated guidance into the RAG knowledge base, for example by recommending practices that deliberately violate safety or legal requirements while being presented as “expert advice”. The Generator Agent may faithfully reproduce this guidance in its answer. The Validator’s output-level checks are intended to detect that the suggested actions conflict with encoded domain policies (e.g., building safety regulations or internal compliance rules) and to either reject the answer or modify it so that only compliant recommendations remain.
Finally, in a guideline-bypassing attack, a user can explicitly ask for advice that contradicts operational constraints (for instance, ways to disable mandatory safety mechanisms or to circumvent metering and billing rules). Even if the Generator produces a technically correct description, the Validator is configured to identify that the requested outcome is not permitted under the organization’s policies and to return a warning or refusal instead of forwarding the original answer.

3.5. Reframing the Dual-Agent Architecture as an LLM Firewall

In the remainder of this section, we extend this initially format-focused design into a security-oriented LLM firewall and clarify how the Validator Agent’s responsibilities are broadened from JSON checking to comprehensive output-level security validation.
Our previous work introduced a dual-agent architecture for on-premise RAG-based conversational systems, originally motivated by a practical engineering challenge: ensuring consistent output formatting across diverse hardware configurations and improving system reliability in production environments. In that setting, the first agent (responsible for RAG-based response generation) frequently failed to produce reliable JSON-compliant output. The model occasionally added explanatory text alongside the JSON or produced syntactically incorrect structures (e.g., missing commas or quotation marks). To address this, a second agent was introduced solely to validate and correct the JSON structure before transmitting it to API endpoints.
However, this architectural separation revealed a deeper security potential. In this work, we reframe and extend the dual-agent architecture as a comprehensive LLM firewall, drawing conceptual parallels to traditional network security. Just as classical network firewalls monitor traffic at system boundaries according to predefined security policies, an LLM firewall operates at the semantic boundary between model generation and output.
The dual-agent firewall (Figure 3) positions two agents operating with distinct roles: the Generator Agent and the Validator Agent. Both agents are instantiated from the same base model, the Polish Bielik 2.3 LLM, but operate with distinct system prompts that define their respective operational contexts and constraints. This approach of using a single model for both agents offers significant practical advantages over architectures requiring multiple distinct models. It reduces memory footprint, inference latency, and computational requirements (critical considerations for on-premise deployments with constrained resources). Simultaneously, it maintains the separation of privileges necessary for effective defense, as the different system prompts ensure the agents operate within distinct security contexts and enforce distinct constraints. The architectural separation is enforced not through model differences but through prompt-based role specification and output validation, making the system implementable on a broad range of hardware platforms.
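As an illustration of this single-model, two-role configuration, the sketch below shows how both agents can share one locally served model through the Ollama Python client, differing only in their system prompts. The model tag and the prompt texts are placeholders introduced for illustration, not the exact values used in the deployed system.

import ollama

MODEL = "bielik-11b-v2.3-instruct:Q4_K_M"  # placeholder tag for the quantized Bielik model

GENERATOR_SYSTEM_PROMPT = "You are an energy-auditing assistant. Answer using only the provided context."
VALIDATOR_SYSTEM_PROMPT = ("You are a security validator. Inspect the candidate answer for prompt injection, "
                           "sensitive-information leakage, policy violations, and toxic content.")

def run_agent(system_prompt: str, user_content: str) -> str:
    """Both agents reuse the same local model; only the system prompt (role) differs."""
    response = ollama.chat(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    )
    return response["message"]["content"]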

3.6. Generator Agent: RAG-Based Response Generation

The Generator Agent implements a standard Retrieval-Augmented Generation pipeline as established in our prior work. The RAG approach enables the system to incorporate domain-specific knowledge from external sources by using dense vector embeddings of text. Domain-specific knowledge documents are preprocessed into semantically coherent chunks and encoded as dense vector embeddings using the Ollama embeddings function, which captures semantic content through continuous vector representations. These preprocessing steps are performed once during knowledge base initialization, enabling efficient runtime retrieval.
At inference time, upon receiving a user query, the Generator Agent first embeds the query into the same vector space as the knowledge base. The system then identifies the most relevant knowledge chunks through similarity scoring, typically using dot product operations from optimized linear algebra libraries such as NumPy [82]. The top-k most similar chunks (in our implementation, k = 5) are retrieved and incorporated into the system prompt provided to the Generator Agent. Using this retrieved context as grounding, the Generator Agent formulates responses intended to directly address the user’s query while incorporating information from the knowledge base.
The system prompt provided to the Generator Agent is intentionally designed to encourage helpful, informative responses while maintaining awareness of its role within the broader system architecture. Critically, this system prompt does not implement security constraints or attempt to defend against adversarial manipulation. This design decision is deliberate and operationally important: the Generator Agent’s role is to leverage the full capabilities of the model to produce the most useful responses possible, given that all output will be subject to security validation downstream. Attempting to implement security constraints at the generation stage can lead to defensive behaviors that reduce helpfulness and utility for legitimate users, while simultaneously proving ineffective against sophisticated attacks.
The RAG approach replaces traditional NLP-based content matching with mathematically efficient dot product operations, significantly reducing retrieval latency while improving contextual relevance. The similarity search operates on precomputed embeddings stored in JSON format, enabling fast, low-latency retrieval during each user interaction. This efficiency is essential for on-premise deployments where response latency directly impacts user experience and system utility. The runtime process proceeds as follows: the user submits a query; the query is embedded into the vector space; the system identifies the most similar knowledge chunks via similarity scoring; the top chunks are retrieved and incorporated into the system prompt; the Generator Agent receives this augmented prompt and generates a response based on the context; and the generated response is passed to the Validator Agent for security assessment before being returned to the user.
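The retrieval step described above can be sketched as follows, assuming paragraph embeddings precomputed with the Ollama embeddings function and stored in JSON; the embedding model name is a placeholder rather than the exact model used in our deployment.

import json
import numpy as np
import ollama

EMBED_MODEL = "nomic-embed-text"  # placeholder embedding model name

def load_knowledge_base(path: str):
    """Load precomputed paragraph embeddings stored as a JSON list of entries."""
    with open(path, encoding="utf-8") as f:
        kb = json.load(f)  # expected form: [{"text": ..., "embedding": [...]}, ...]
    texts = [entry["text"] for entry in kb]
    vectors = np.array([entry["embedding"] for entry in kb])
    return texts, vectors

def retrieve_top_k(query: str, texts, vectors, k: int = 5):
    """Embed the query and rank knowledge chunks by dot-product similarity."""
    q = np.array(ollama.embeddings(model=EMBED_MODEL, prompt=query)["embedding"])
    scores = vectors @ q                      # dot product against all chunks
    top_idx = np.argsort(scores)[::-1][:k]    # indices of the k highest scores
    return [texts[i] for i in top_idx]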

3.7. Validator Agent: Output Firewall with Multi-Dimensional Security Checks

The Validator Agent represents the core security contribution of this work. Operating as an output firewall, it receives as input the original user query, the candidate response generated by the Generator Agent, the retrieved context used during generation, and the current system context, including user role, session state, and applicable operational policies. The Validator Agent’s task is to evaluate whether the generated response is safe for transmission to the user, applying multiple independent validation criteria that collectively form a comprehensive security assessment. Importantly, unlike the Generator Agent, which is designed to optimize for utility, the Validator Agent is explicitly designed and prompted to prioritize security and policy compliance, even when doing so may limit response utility.
In our implementation, the Validator Agent is realized as a fixed pipeline that combines structural, rule-based, and semantic checks. First, the candidate response is validated against a predefined JSON schema in order to detect malformed or incomplete outputs that would cause downstream failures. Next, a set of rule-based policy checks is applied to the user query, the retrieved context, and the candidate response, capturing organization- and domain-specific constraints such as disallowed entities, phrases, or categories of information. Finally, the Validator performs LLM-based semantic analysis under a security-focused system prompt, analyzing the candidate response (and, when needed, the query and context) for signs of prompt injection, attempts to override safety policies, potential data leakage, and toxic or otherwise policy-violating content.
Based on the combined outcome of these stages, the Validator Agent produces one of three decisions: allow (the response is forwarded unchanged), modify (the response is redacted or rephrased to remove problematic elements), or block (the response is replaced with a safe fallback message that explains why the original answer cannot be provided). The following subsections describe the main security validation dimensions in more detail.
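A simplified sketch of this allow/modify/block decision logic is given below; the individual check functions (structural, rule-based, and semantic) are assumed to be supplied by the deployment and are only represented here as injected callables that return lists of reason strings.

from dataclasses import dataclass, field

SAFE_FALLBACK = "The requested answer cannot be provided due to security policy."

@dataclass
class ValidationResult:
    decision: str                      # "allow", "modify", or "block"
    response: str                      # text actually returned to the user
    reasons: list = field(default_factory=list)

def validate(query, candidate, context, structural_check, policy_checks, semantic_checks, redact):
    """Fixed pipeline: structural check, rule-based policy checks, then LLM-based semantic checks.

    Each check function returns a list of reason strings; reasons prefixed with
    "block:" indicate that the response must not be released at all.
    """
    if not structural_check(candidate):
        return ValidationResult("block", SAFE_FALLBACK, ["malformed output"])
    reasons = policy_checks(query, context, candidate) + semantic_checks(query, context, candidate)
    if any(r.startswith("block:") for r in reasons):
        return ValidationResult("block", SAFE_FALLBACK, reasons)
    if reasons:
        return ValidationResult("modify", redact(candidate, reasons), reasons)
    return ValidationResult("allow", candidate)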
The Validator Agent operates through multiple security validation dimensions, each addressing distinct classes of attacks.

3.7.1. Prompt Injection Detection

The Validator Agent analyzes the generated response for indicators of successful prompt injection attacks. This includes detecting anomalous instruction-following patterns that suggest the LLM may have adopted injected instructions rather than adhering to its original system prompt. The validation logic examines whether the response contains unexpected changes in tone, role-playing behaviors, or evidence of the model discussing internal instructions or system prompts. The validator checks for textual markers commonly associated with injection attempts, such as phrases indicating role assumption (“I am now”, “pretend I am”) or explicit instruction overrides (“ignore your guidelines”, “from now on”). Additionally, the validator assesses whether the response contains outputs that directly violate known security boundaries (e.g., claiming to perform system administration tasks, access external systems, or override its operational constraints).
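A minimal sketch of such marker-based checks is shown below; the marker lists are illustrative examples only, not the complete signature set maintained in the deployed system.

import re

ROLE_ASSUMPTION_MARKERS = [r"\bI am now\b", r"\bpretend I am\b", r"\bas an unrestricted AI\b"]
OVERRIDE_MARKERS = [r"\bignore (all )?(previous|your) (instructions|guidelines)\b", r"\bfrom now on\b"]

def injection_indicators(response: str):
    """Return the names of the injection-marker groups found in a generated response."""
    hits = []
    for name, patterns in [("role-assumption", ROLE_ASSUMPTION_MARKERS),
                           ("instruction-override", OVERRIDE_MARKERS)]:
        if any(re.search(p, response, re.IGNORECASE) for p in patterns):
            hits.append(name)
    return hits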
While sophisticated attackers may craft responses that evade simple pattern-based checks, the multi-dimensional validation approach helps in ensuring that attack evidence manifesting in other dimensions can still be captured. For instance, an injected instruction that successfully manipulates the model into adopting a different role may simultaneously trigger the sensitive information detection layer if that new role involves accessing restricted data, or the policy compliance layer if the new role involves prohibited actions.

3.7.2. Sensitive Information Filtering

One of the most critical functions of the Validator Agent is preventing unintended leakage of sensitive information, addressing a category of attacks that represent persistent threats to LLM-based systems. The system implements multiple channels for detecting such leakage. First, regular expression-based pattern matching identifies common sensitive data structures including PII, such as email addresses, phone numbers, national identification numbers, and financial information such as account numbers and credit card patterns. Additionally, the system maintains databases of internal identifiers, employee names, proprietary system architectures, and other organizational information that should remain confidential.
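The pattern-matching channel can be sketched as follows; the patterns are deliberately simplified illustrations, whereas real deployments use more extensive, locale-specific rules (e.g., for Polish PESEL or NIP numbers).

import re

PII_PATTERNS = {
    "email":       r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "phone":       r"\+?\d[\d\s-]{7,}\d",
    "credit_card": r"\b(?:\d[ -]?){13,16}\b",
}

def find_pii(text: str):
    """Return all matches per PII category found in the response."""
    return {name: re.findall(pattern, text) for name, pattern in PII_PATTERNS.items()}

def redact_pii(text: str) -> str:
    """Replace detected PII with category placeholders before release."""
    for name, pattern in PII_PATTERNS.items():
        text = re.sub(pattern, f"[REDACTED-{name.upper()}]", text)
    return text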
Second, semantic analysis evaluates whether the response discusses topics that should remain confidential within the organizational context, such as internal policies, employee information, or proprietary processes. This semantic layer is crucial because simple pattern matching can be evaded through reformulation (for example, writing out numbers in words rather than digits, disguising identifiers through acronyms, or alluding to sensitive information indirectly). Semantic analysis captures the intent to disclose sensitive categories of information even when explicit patterns are obscured. The semantic layer is implemented through a combination of rule-based keyword detection for high-confidence sensitive terms and learned classifiers trained to recognize semantic content relevant to restricted categories.

3.7.3. Policy Compliance Verification

The system implements a configurable policy engine that encodes domain-specific operational requirements. For the energy auditing scenario in which this system is deployed, policies include requirements that responses must not recommend actions violating building regulations, must not suggest cost-cutting measures that compromise safety, must clearly distinguish between legally required and optional recommendations, and must accurately represent the substantive requirements of the Energy Performance of Buildings Directive. The Validator Agent evaluates generated responses against these policies, ensuring adherence before user exposure.
Policy violations are detected through a combination of approaches: rule-based checks searching for specific prohibited terms or patterns (for instance, flagging responses that claim to override safety requirements), keyword-based matching for policy-critical concepts (for instance, matching responses against regulatory terminology), and semantic classification assessing whether the response’s semantic content aligns with policy requirements. The policy engine is intentionally designed to support dynamic updates without model retraining: new policies can be added by updating rule sets or providing additional training examples to classifiers, enabling rapid adaptation to changing regulatory requirements or organizational guidelines.
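For illustration, a minimal rule-based fragment of such a policy engine might look as follows; the rule names and patterns are hypothetical examples, not the deployed policy set, which is maintained as configuration by domain experts.

import re

POLICY_RULES = [
    {"name": "no-safety-override",
     "pattern": r"(disable|bypass|circumvent).{0,40}(safety|fire protection|ventilation)",
     "action": "block"},
    {"name": "no-regulation-violation",
     "pattern": r"(ignore|skip).{0,40}(building regulations|building code)",
     "action": "block"},
]

def policy_violations(response: str):
    """Return the policy rules triggered by a candidate response."""
    return [rule for rule in POLICY_RULES
            if re.search(rule["pattern"], response, re.IGNORECASE)]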

3.7.4. Toxic and Harmful Content Blocking

The Validator Agent includes detection mechanisms for toxic, abusive, or otherwise harmful content. This operates as a content moderation layer, flagging responses containing improper language or socially harmful content. This layer is meant to protect both user experience and organizational reputation, making sure that even if the Generator Agent produces problematic content (whether through model misalignment or successful adversarial manipulation), it is prevented from reaching users. The toxic content detection operates using classical pre-trained content moderation classifiers, which can be customized according to organizational use-cases or cultural context.
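As an example of how this layer could be realised with an off-the-shelf classifier, the sketch below uses the Hugging Face transformers pipeline with a publicly available toxicity model; the model name and threshold are assumptions made for illustration and do not necessarily reflect the classifier used in the described deployment.

# One possible realisation with a publicly available classifier via the Hugging Face
# transformers pipeline; the model name is an example, not necessarily the deployed one.
from transformers import pipeline

toxicity = pipeline("text-classification", model="s-nlp/roberta_toxicity_classifier")

def is_toxic(response: str, threshold: float = 0.5) -> bool:
    """Flag the response when the classifier assigns a toxic label above the threshold."""
    result = toxicity(response, truncation=True)[0]  # truncate very long responses
    return result["label"].lower() == "toxic" and result["score"] >= threshold

print(is_toxic("Thank you, the insulation recommendations were very helpful."))  # False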

3.8. Implementation Framework and System Integration

The entire system is implemented using the Ollama framework and API, which provides efficient local model serving and inference optimization. The authors used the Polish Bielik 2.3 LLM in a quantized version (specifically Bielik-11B-v2.3-Instruct-GGUF-Q4_K_M with 4-bit quantization). The Ollama-based multi-agent architecture enables rapid agent switching by storing and using two separate interaction sessions (system prompts and prompt histories) for the two agents, while keeping the same underlying model in memory and thus avoiding the computational overhead of full model reloads. The choice of the Bielik 2.3 model was motivated by its fluent support for the Polish language.
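A minimal sketch of this session-based agent switching, using the ollama Python client, is shown below; the model tag and the system prompts are placeholders rather than the prompts used in the deployed system.

# Minimal sketch of dual-session agent switching on one locally served model,
# using the ollama Python client; model tag and system prompts are placeholders.
import ollama

MODEL = "bielik-11b-v2.3-instruct:Q4_K_M"  # assumed local tag for the quantized Bielik model

generator_session = [{"role": "system",
                      "content": "You are the Generator Agent. Answer using the supplied context."}]
validator_session = [{"role": "system",
                      "content": "You are the Validator Agent. Check answers against security policies."}]

def ask(session: list[dict], prompt: str) -> str:
    """Append the prompt to one agent's session, query the shared model, and store the reply."""
    session.append({"role": "user", "content": prompt})
    reply = ollama.chat(model=MODEL, messages=session)["message"]["content"]
    session.append({"role": "assistant", "content": reply})
    return reply

draft = ask(generator_session, "How can I improve the energy class of my building?")
verdict = ask(validator_session, "Review this draft answer for policy violations:\n" + draft)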
The knowledge base is prepared according to the following proposed procedure. The information covering the domain of interest (in our deployment scenario, the EPBD directive, relevant national and local regulations, and available subsidy programs) is provided by domain experts as plain-text files, with their contents organized into thematically coherent paragraphs. This structure is intentionally designed to mitigate common issues encountered during automated chunk slicing and embedding. Each paragraph is then embedded using the Ollama embeddings function, producing dense vector representations that capture semantic content. All resulting vectors are stored in JSON format for efficient access at runtime. These embeddings enable efficient similarity-based retrieval by allowing fast search and comparison between user queries and the structured knowledge base. The similarity search is implemented as a dense vector comparison based on the dot product. During inference, the system calculates similarity scores between the embedding of a user query and all precomputed embeddings of the knowledge base, producing a ranked list of candidate knowledge chunks. The top five most relevant chunks are selected and incorporated into the system prompt provided to the Generator Agent. This approach has proven sufficiently efficient for identifying relevant knowledge fragments while maintaining acceptable response latency across diverse hardware platforms.
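The retrieval step described above can be sketched as follows; the embedding model name and the knowledge-base file layout are assumptions made for illustration only.

# Sketch of the retrieval step; embedding model name and file layout are assumed.
import json
import numpy as np
import ollama

EMBEDDING_MODEL = "nomic-embed-text"  # assumed; the deployment may use a different model

def embed(text: str) -> np.ndarray:
    return np.array(ollama.embeddings(model=EMBEDDING_MODEL, prompt=text)["embedding"])

# Knowledge base prepared offline: one precomputed embedding per thematically coherent paragraph.
with open("knowledge_base.json", encoding="utf-8") as f:
    kb = json.load(f)  # assumed format: [{"text": "...", "embedding": [...]}, ...]

def retrieve(query: str, top_k: int = 5) -> list[str]:
    """Rank knowledge chunks by dot-product similarity and return the top_k texts."""
    q = embed(query)
    scored = [(float(np.dot(q, np.asarray(item["embedding"]))), item["text"]) for item in kb]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]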
The Validator Agent’s security validation logic is designed as a combination of rule-based and learning-based components. The rule-based components include regular expression patterns for PII detection and signs of information extraction, policy checks against prohibited terms and topics, and pattern matching for known attack signatures from threat intelligence sources. The learning-based components may use smaller classification models trained to detect toxic content and assess semantic policy alignment. This hybrid approach combines the interpretability and maintainability of rule-based systems (which are easily auditable and can be updated by domain experts without machine learning expertise) with the expressiveness of learning-based components, which can capture subtle semantic patterns that would otherwise evade simple pattern matching.
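Reusing the hypothetical helper functions sketched in the previous subsections, the hybrid combination can be expressed as a single validation verdict; the block/modify/allow decision policy shown here is illustrative and implements the conservative failure mode discussed in Section 3.9.3.

# Aggregating the hypothetical checks sketched above (injection_indicators, redact_sensitive,
# policy_violations, is_toxic) into one verdict; the block/modify/allow policy is illustrative.
def validate_response(response: str) -> dict:
    redacted, pii_hits = redact_sensitive(response)
    report = {
        "injection": injection_indicators(response),
        "pii": pii_hits,
        "policy": policy_violations(response),
        "toxic": is_toxic(response),
    }
    if report["injection"] or report["policy"] or report["toxic"]:
        report["decision"], report["response"] = "block", None   # conservative failure mode
    elif report["pii"]:
        report["decision"], report["response"] = "modify", redacted
    else:
        report["decision"], report["response"] = "allow", response
    return report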
Importantly, the Validator Agent’s security rules can be updated without retraining the underlying LLM. New policies and detection patterns can be added, existing rules refined, and policy contents adjusted in response to newly emerging threats or changing organizational requirements. This operational flexibility is a significant practical advantage over approaches requiring model fine-tuning to adapt to new security requirements, as it eliminates the need for model retraining infrastructure and accelerates the deployment of security updates.

3.9. Firewall Operational Layers

While the Validator Agent represents the primary security innovation, the complete firewall architecture operates across three operational layers, each addressing distinct attack surfaces and providing defense-in-depth.

3.9.1. Input Layer Security

At the input layer, user prompts undergo preliminary validation to detect obvious attack attempts. The input validation checks for excessively long inputs that might attempt computational denial of service and detects queries containing known attack patterns from threat intelligence databases using regular expressions (RegEx). Input layer filtering provides the first step in attack prevention and rejects obviously malicious queries before they consume downstream resources or expose the knowledge base to retrieval amplification attacks. This layer also implements rate limiting to prevent brute-force attempts to discover system behavior or trigger resource exhaustion.
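A minimal sketch of these input-layer checks is given below; the length limit, rate limit, and attack signatures are placeholder values rather than the thresholds used in the deployed system.

# Sketch of the input-layer checks; limits and signatures are placeholder values.
import re
import time
from collections import defaultdict, deque

MAX_PROMPT_CHARS = 4000
KNOWN_ATTACK_PATTERNS = [re.compile(p, re.IGNORECASE) for p in
                         (r"ignore (all )?previous instructions", r"\bdo anything now\b")]
RATE_LIMIT, WINDOW_SECONDS = 30, 60
_requests: dict[str, deque] = defaultdict(deque)

def accept_prompt(user_id: str, prompt: str) -> bool:
    """Reject oversized prompts, known attack signatures, and rate-limit violations."""
    now = time.monotonic()
    window = _requests[user_id]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()                       # drop requests outside the sliding window
    if len(window) >= RATE_LIMIT or len(prompt) > MAX_PROMPT_CHARS:
        return False
    if any(p.search(prompt) for p in KNOWN_ATTACK_PATTERNS):
        return False
    window.append(now)
    return True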

3.9.2. Retrieval Layer Security

The retrieval layer filters documents retrieved from the knowledge base to reduce the risk of indirect prompt injection via compromised (poisoned) knowledge base entries. Retrieved documents are scanned for injection indicators and other red-flag patterns before they are merged with the user prompt and the current conversation history into the Generator Agent’s context. In high-security environments, documents retrieved from external sources or community-contributed content should be subjected to additional verification before inclusion.
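A short sketch of such a retrieval-layer scan is shown below; the phrase list is illustrative and would in practice overlap with the threat-intelligence patterns used at the input layer.

# Sketch of the retrieval-layer scan; the phrase list is illustrative only.
SUSPICIOUS_DOCUMENT_PHRASES = ("ignore previous instructions", "you are now", "system prompt:")

def filter_retrieved_chunks(chunks: list[str]) -> list[str]:
    """Drop retrieved knowledge chunks containing injection-style phrasing
    before they are merged into the Generator Agent's context."""
    return [chunk for chunk in chunks
            if not any(phrase in chunk.lower() for phrase in SUSPICIOUS_DOCUMENT_PHRASES)]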

3.9.3. Output Layer Security

The output layer, implemented by the Validator Agent, represents the final and most important security checkpoint. As detailed in the previous section, the Validator Agent applies multi-dimensional security validation to all generated content. This layer identifies attack attempts that successfully bypassed the input and retrieval layer defenses, as well as attacks targeting the Generator Agent’s behavior itself. The output layer is intentionally positioned as a comprehensive security boundary where failures are conservative (rejecting legitimate responses if necessary) rather than permissive.
An important structural implication of the proposed architecture is that the Validator Agent remains unaffected by the user prompt or the retrieved data; its role is to analyze the system’s output rather than to let the inputs control its behavior.

4. Discussion and Conclusions

4.1. Summary of Contributions

This paper reframes the Dual-agent RAG architecture from our previous work as an LLM Firewall providing output-layer security validation. The core contribution is demonstrating that the separation of generation and validation creates a trust boundary enabling comprehensive post hoc defense against diverse LLM attack vectors. The multi-dimensional validation approach (combining prompt injection detection, sensitive information filtering, policy compliance verification, and toxic content blocking) addresses attacks that bypass input-level defenses and manifest only in the generated output text.
Compared to existing dual-LLM patterns and guardrail frameworks, our design emphasizes a minimal dual-agent configuration with an explicitly defined trust boundary between Generator and Validator, and concentrates the firewall logic in an output-level validation stage suitable for resource-constrained on-premise environments.

4.2. Advantages and Limitations

The Dual-agent architecture offers practical advantages for on-premise deployments: security policies can be updated without model retraining, validation policies and decisions are fully auditable, and the approach can operate alongside existing input-level security measures without conflicts or degradation of the security posture. The architecture requires no specialized modifications to base models and is expected to be relatively easy to integrate with existing architectures.
However, the proposed approach does have limitations: the validation layer introduces latency, and certain attack classes (such as model weight backdoors or supply chain poisoning) receive only limited protection from output validation alone. Another important limitation is the lack of systematic empirical metrics in the present work. We do not yet report quantitative results such as false-positive/false-negative rates, detection coverage for different attack classes, or the precise latency overhead introduced by the Validator Agent. As a consequence, the security posture of the architecture must currently be understood qualitatively, and concrete performance numbers remain to be established in future evaluations.
The Dual-agent Firewall represents one design point among complementary LLM security approaches. It is most effective as part of a hybrid, layered defense strategy combining input, retrieval, and output-layer protections. Organizations should evaluate the architecture in their specific deployment context rather than assuming universal applicability.

4.2.1. False Positives and Usability Trade-Offs

An inherent drawback of output-level validation is the possibility of false positives, where benign answers are blocked or excessively modified. In our design, the Validator Agent is intentionally biased towards caution: in ambiguous cases it may prefer to refuse or redact an answer rather than risk exposing sensitive information or producing policy-violating content. While such behavior is appropriate in high-security, on-premise deployments, it can reduce user satisfaction and increase the need for human review. In practice, this trade-off can be mitigated by tuning the strictness of the policies, distinguishing between “block” and “modify” decisions, and using validator logs to iteratively adjust the rules to the organization’s risk tolerance. Nevertheless, the possibility of false positives remains a fundamental limitation of output firewalls and an important aspect of their operational deployment.

4.2.2. Evaluation Limitations

The present work focuses primarily on the architectural design and security rationale of a dual-agent LLM firewall. While we report practical observations from our deployment and illustrate the defense layers in Figure 1 and Figure 2, we do not provide a systematic benchmark or detailed ablation study. A comprehensive empirical evaluation, including standardized attack suites, comparisons with alternative guardrail strategies, and controlled ablations of the Validator Agent components, is an important direction for future work.

4.3. Future Directions

Future research will involve the analysis and benchmarking of adaptive validation approaches for the Validator Agent, integration strategies with other security layers (possibly using an information bus between the Validator Agent and other security modules of the system for a coordinated threat-level assessment of any given interaction), formal methods for security analysis, and standardized evaluation benchmarks. The main idea is that checking the model’s output adds another layer of protection beyond input-level defenses; this approach applies to numerous deployment scenarios and should be explored further.

4.4. Conclusions

Large Language Models introduce security challenges that go far beyond traditional IT and software security. Our work suggests that straightforward architectural ideas (i.e., separating generation and validation responsibilities, and filtering and analyzing outputs) may serve as an additional layer of defense against various attack vectors. The proposed dual-agent firewall is intended to raise the bar sufficiently for LLM systems to be considered adequately secure for use as components of critical information and interaction pipelines handling sensitive data.

Author Contributions

Conceptualization, M.P. and M.B.; methodology, M.P., M.B. and L.K.; software, M.P. and M.B.; validation, M.P. and W.R.; formal analysis, M.P. and L.K.; investigation, M.P., M.B. and M.C.; resources, M.P.; data curation, M.P. and M.B.; writing—original draft preparation, M.P., M.B., M.C., L.K. and W.R.; writing—review and editing, M.P. and W.R.; visualization, M.P.; supervision, M.P.; project administration, M.P.; funding acquisition, M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the European Union from the European Regional Development Fund under the regional program “European Funds for Opole Voivodship” 2021–2027. Project title: “e-Assistant for energy and environmental auditor—Industrial research and experimental development work in the field of artificial intelligence application in energy and environmental assessments of residential buildings in Europe”. Grant number FEOP.01.01-IP.01-0009/23-00.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

Author Marek Baranowski was employed by the Opolskie Centrum Zarzadzania Projektami. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Yao, H.; Shi, H.; Chen, Y.; Jiang, Y.; Wang, C.; Qin, Z. ControlNET: A Firewall for RAG-based LLM System. arXiv 2025, arXiv:2504.09593. [Google Scholar]
  2. A Dual-Agent Strategy Towards Trustworthy On-Premise Conversational LLMs. Under review.
  3. Esmradi, A.; Yip, D.W.; Chan, C.F. A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models. arXiv 2023, arXiv:2312.10982. [Google Scholar] [CrossRef]
  4. Abdelnabi, S.; Gomaa, A.; Bagdasarian, E.; Kristensson, P.O.; Shokri, R. Firewalls to Secure Dynamic LLM Agentic Networks. arXiv 2025, arXiv:2502.01822. [Google Scholar] [CrossRef]
  5. OWASP. LLM01: 2025 Prompt Injection. 2025. Available online: https://genai.owasp.org/llmrisk/llm01-prompt-injection/ (accessed on 15 December 2025).
  6. DeepTeam. OWASP Top 10 for LLMs. 2025. Available online: https://www.trydeepteam.com/docs/frameworks-owasp-top-10-for-llms (accessed on 15 June 2025).
  7. Dev.to Contributor. Overview: OWASP Top 10 for LLM Applications 2025: A Comprehensive Guide. 2025. Available online: https://dev.to/foxgem/overview-owasp-top-10-for-llm-applications-2025-a-comprehensive-guide-8pk (accessed on 15 December 2025).
  8. Li, R.; Chen, M.; Hu, C.; Chen, H.; Xing, W.; Han, M. GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks. arXiv 2024, arXiv:2409.19521. [Google Scholar]
  9. Check Point. OWASP Top 10 for LLM Applications 2025: Prompt Injection. 2025. Available online: https://www.checkpoint.com/cyber-hub/what-is-llm-security/prompt-injection/ (accessed on 15 December 2025).
  10. Duvall, P.M. Deep Dive into OWASP LLM Top 10 and Prompt Injection. 2025. Available online: https://www.paulmduvall.com/deep-dive-into-owasp-llm-top-10-and-prompt-injection/ (accessed on 15 December 2025).
  11. NeuralTrust. Which Firewall Best Prevents Prompt Injection Attacks? 2025. Available online: https://neuraltrust.ai/blog/prevent-prompt-injection-attacks-firewall-comparison (accessed on 15 December 2025).
  12. Raga AI. Security and LLM Firewall Controls. 2025. Available online: https://raga.ai/resources/blogs/llm-firewall-security-controls (accessed on 15 May 2025).
  13. Aiceberg. What Is an LLM Firewall? Available online: https://www.aiceberg.ai/blog/what-is-an-llm-firewall (accessed on 15 December 2025).
  14. Nightfall. Firewalls for AI: The Essential Guide. 2024. Available online: https://www.nightfall.ai/blog/firewalls-for-ai-the-essential-guide (accessed on 15 December 2025).
  15. The Moonlight. Literature Review: ControlNET: A Firewall for RAG-Based LLM System. 2025. Available online: https://www.themoonlight.io/en/review/controlnet-a-firewall-for-rag-based-llm-system (accessed on 15 December 2025).
  16. Cloudflare. Block Unsafe Prompts Targeting Your LLM Endpoints with Firewall for AI. 2025. Available online: https://blog.cloudflare.com/block-unsafe-llm-prompts-with-firewall-for-ai/ (accessed on 15 December 2025).
  17. AccuKnox. How to Secure LLM Prompts and Responses. 2025. Available online: https://accuknox.com/blog/llm-prompt-firewall-accuknox (accessed on 15 December 2025).
  18. Greshake, K.; Abdelnabi, S.; Mishra, S.; Endres, C.; Holz, T.; Fritz, M. Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv 2023, arXiv:2302.12173. [Google Scholar]
  19. Mangaokar, N.; Hooda, A.; Choi, J.; Chandrashekaran, S.; Fawaz, K.; Jha, S.; Prakash, A. PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails. arXiv 2024, arXiv:2402.15911. [Google Scholar] [CrossRef]
  20. Liu, H.; Huang, H.; Gu, X.; Wang, H.; Wang, Y. On Calibration of LLM-based Guard Models for Reliable Content Moderation. arXiv 2025, arXiv:2410.10414. [Google Scholar]
  21. Threat Model Co. The Dual LLM Pattern for LLM Agents. 2025. Available online: https://threatmodel.co/blog/dual-llm-pattern (accessed on 15 December 2025).
  22. Coralogix. LLM’s Insecure Output Handling: Best Practices and Prevention. 2025. Available online: https://coralogix.com/ai-blog/llms-insecure-output-handling-best-practices-and-prevention/ (accessed on 15 December 2025).
  23. Cobalt. Introduction to LLM Insecure Output Handling. 2024. Available online: https://www.cobalt.io/blog/llm-insecure-output-handling (accessed on 15 December 2025).
  24. MindsDB. Harnessing the Dual LLM Pattern for Prompt Security with MindsDB. 2025. Available online: https://mindsdb.com/blog/harnessing-the-dual-llm-pattern-for-prompt-security-with-mindsdb (accessed on 15 December 2025).
  25. Beurer-Kellner, L.; Buesser, B.; Creţu, A.M.; Debenedetti, E.; Dobos, D.; Fabian, D.; Fischer, M.; Froelicher, D.; Grosse, K.; Naeff, D.; et al. Design Patterns for Securing LLM Agents against Prompt Injections. arXiv 2025, arXiv:2506.08837. [Google Scholar] [CrossRef]
  26. Willison, S. The Dual LLM Pattern for Building AI Assistants That Can Resist Prompt Injection. 2023. Available online: https://simonwillison.net/2023/Apr/25/dual-llm-pattern/ (accessed on 15 December 2025).
  27. Cui, J.; Xu, Y.; Huang, Z.; Zhou, S.; Jiao, J.; Zhang, J. Recent Advances in Attack and Defense Approaches of Large Language Models. arXiv 2024, arXiv:2409.03274. [Google Scholar] [CrossRef]
  28. Rossi, S.; Michel, A.M.; Mukkamala, R.R.; Thatcher, J.B. An early categorization of prompt injection attacks on large language models. arXiv 2024, arXiv:2402.00898. [Google Scholar] [CrossRef]
  29. Liu, X.; Yu, Z.; Zhang, Y.; Zhang, N.; Xiao, C. Automatic and universal prompt injection attacks against large language models. arXiv 2024, arXiv:2403.04957. [Google Scholar] [CrossRef]
  30. Kumar, S.S.; Cummings, M.; Stimpson, A. Strengthening LLM trust boundaries: A survey of prompt injection attacks. In Proceedings of the 2024 IEEE 4th International Conference on Human-Machine Systems (ICHMS), Toronto, ON, Canada, 15–17 May 2024; pp. 1–6. [Google Scholar]
  31. Hung, K.H.; Ko, C.Y.; Rawat, A.; Chung, I.; Hsu, W.H.; Chen, P.Y. Attention tracker: Detecting prompt injection attacks in LLMs. arXiv 2024, arXiv:2411.00348. [Google Scholar] [CrossRef]
  32. Liu, Y.; Jia, Y.; Geng, R.; Jia, J.; Gong, N.Z. Formalizing and benchmarking prompt injection attacks and defenses. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA, 14–16 August 2024; pp. 1831–1847. [Google Scholar]
  33. Jacob, D.; Alzahrani, H.; Hu, Z.; Alomair, B.; Wagner, D. Promptshield: Deployable detection for prompt injection attacks. In Proceedings of the Fifteenth ACM Conference on Data and Application Security and Privacy, Pittsburgh, PA, USA, 4–6 June 2024; pp. 341–352. [Google Scholar]
  34. Suo, X. Signed-prompt: A new approach to prevent prompt injection attacks against LLM-integrated applications. AIP Conf. Proc. 2024, 3194, 040013. [Google Scholar]
  35. Hines, K.; Lopez, G.; Hall, M.; Zarfati, F.; Zunger, Y.; Kiciman, E. Defending against indirect prompt injection attacks with spotlighting. arXiv 2024, arXiv:2403.14720. [Google Scholar] [CrossRef]
  36. Yi, J.; Xie, Y.; Zhu, B.; Kiciman, E.; Sun, G.; Xie, X.; Wu, F. Benchmarking and defending against indirect prompt injection attacks on large language models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1, Toronto, ON, Canada, 3–7 August 2025; pp. 1809–1820. [Google Scholar]
  37. Yao, H.; Lou, J.; Qin, Z. Poisonprompt: Backdoor attack on prompt-based large language models. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 7745–7749. [Google Scholar]
  38. Ostermann, S.; Baum, K.; Endres, C.; Masloh, J.; Schramowski, P. Soft begging: Modular and efficient shielding of LLMs against prompt injection and jailbreaking based on prompt tuning. arXiv 2024, arXiv:2407.03391. [Google Scholar] [CrossRef]
  39. Jiang, F.; Xu, Z.; Niu, L.; Xiang, Z.; Ramasubramanian, B.; Li, B.; Poovendran, R. ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 15157–15173. [Google Scholar]
  40. Yang, G.Y.; Cheng, T.Y.; Teng, Y.W.; Wang, F.; Yeh, K.H. ArtPerception: ASCII art-based jailbreak on LLMs with recognition pre-test. J. Netw. Comput. Appl. 2025, 244, 104356. [Google Scholar] [CrossRef]
  41. Saiem, B.A.; Shanto, M.S.H.; Ahsan, R.; ur Rashid, M.R. SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, 27 July–1 August 2025. [Google Scholar]
  42. Bhardwaj, R.; Poria, S. Language model unalignment: Parametric red-teaming to expose hidden harms and biases. arXiv 2023, arXiv:2310.14303. [Google Scholar] [CrossRef]
  43. Pilán, I.; Manzanares-Salor, B.; Sánchez, D.; Lison, P. Truthful Text Sanitization Guided by Inference Attacks. arXiv 2024, arXiv:2412.12928. [Google Scholar] [CrossRef]
  44. Zhang, M.; Abdollahi, M.; Ranbaduge, T.; Ding, M. POSTER: When Models Speak Too Much: Privacy Leakage on Large Language Models. In Proceedings of the 20th ACM Asia Conference on Computer and Communications Security, Hanoi, Vietnam, 25–29 August 2025; pp. 1809–1811. [Google Scholar]
  45. Wang, Y.; Cao, Y.; Ren, Y.; Fang, F.; Lin, Z.; Fang, B. PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization. arXiv 2025, arXiv:2505.09921. [Google Scholar]
  46. Cheng, S.; Li, Z.; Meng, S.; Ren, M.; Xu, H.; Hao, S.; Yue, C.; Zhang, F. Understanding PII Leakage in Large Language Models: A Systematic Survey. Codex 2021, 8, 10409–10417. [Google Scholar] [CrossRef]
  47. Lukas, N.; Salem, A.; Sim, R.; Tople, S.; Wutschitz, L.; Zanella-Béguelin, S. Analyzing leakage of personally identifiable information in language models. In Proceedings of the 2023 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 21–25 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 346–363. [Google Scholar]
  48. Chen, X.; Tang, S.; Zhu, R.; Yan, S.; Jin, L.; Wang, Z.; Su, L.; Zhang, Z.; Wang, X.; Tang, H. The Janus Interface: How Fine-Tuning in Large Language Models Amplifies the Privacy Risks. In CCS ’24, Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, Salt Lake City, UT, USA, 14–18 October 2024; Association for Computing Machinery: New York, NY, USA, 2024. [Google Scholar]
  49. Kanth Nakka, K.; Jiang, X.; Zhou, X. PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage. arXiv 2025, arXiv:2507.02332. [Google Scholar] [CrossRef]
  50. Wan, Z.; Cheng, A.; Wang, Y.; Wang, L. Information leakage from embedding in large language models. arXiv 2024, arXiv:2405.11916. [Google Scholar] [CrossRef]
  51. Sivashanmugam, S.P. Model Inversion Attacks on Llama 3: Extracting PII from Large Language Models. arXiv 2025, arXiv:2507.04478. [Google Scholar] [CrossRef]
  52. Shu, Y.; Li, S.; Dong, T.; Meng, Y.; Zhu, H. Model Inversion in Split Learning for Personalized LLMs: New Insights from Information Bottleneck Theory. arXiv 2025, arXiv:2501.05965. [Google Scholar] [CrossRef]
  53. Dong, T.; Meng, Y.; Li, S.; Chen, G.; Liu, Z.; Zhu, H. Depth Gives a False Sense of Privacy: LLM Internal States Inversion. In Proceedings of the 34th USENIX Security Symposium (USENIX Security 25), Seattle, WA, USA, 13–15 August 2025; pp. 1629–1648. [Google Scholar]
  54. Chen, Y.; Lent, H.; Bjerva, J. Text embedding inversion security for multilingual language models. arXiv 2024, arXiv:2401.12192. [Google Scholar] [CrossRef]
  55. Luo, Z.; Shao, S.; Zhang, S.; Zhou, L.; Hu, Y.; Zhao, C.; Liu, Z.; Qin, Z. Shadow in the cache: Unveiling and mitigating privacy risks of KV-cache in llm inference. arXiv 2025, arXiv:2508.09442. [Google Scholar] [CrossRef]
  56. Li, C.; Zhang, J.; Cheng, A.; Ma, Z.; Li, X.; Ma, J. CPA-RAG: Covert Poisoning Attacks on Retrieval-Augmented Generation in Large Language Models. arXiv 2025, arXiv:2505.19864. [Google Scholar]
  57. Kuo, M.; Zhang, J.; Zhang, J.; Tang, M.; DiValentin, L.; Ding, A.; Sun, J.; Chen, W.; Hass, A.; Chen, T.; et al. Proactive privacy amnesia for large language models: Safeguarding PII with negligible impact on model utility. arXiv 2025, arXiv:2502.17591. [Google Scholar] [CrossRef]
  58. Shen, G.; Cheng, S.; Zhang, Z.; Tao, G.; Zhang, K.; Guo, H.; Yan, L.; Jin, X.; An, S.; Ma, S.; et al. BAIT: Large language model backdoor scanning by inverting attack target. In Proceedings of the 2025 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 12–15 May 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1676–1694. [Google Scholar]
  59. Wang, H.; Guo, S.; He, J.; Liu, H.; Zhang, T.; Xiang, T. Model Supply Chain Poisoning: Backdooring Pre-trained Models via Embedding Indistinguishability. In Proceedings of the ACM on Web Conference 2025, Sydney, Australia, 28 April–2 May 2025; pp. 840–851. [Google Scholar]
  60. Wang, S.; Zhao, Y.; Liu, Z.; Zou, Q.; Wang, H. SoK: Understanding vulnerabilities in the large language model supply chain. arXiv 2025, arXiv:2502.12497. [Google Scholar] [CrossRef]
  61. Hu, Y.; Wang, S.; Nie, T.; Zhao, Y.; Wang, H. Understanding Large Language Model Supply Chain: Structure, Domain, and Vulnerabilities. arXiv 2025, arXiv:2504.20763. [Google Scholar] [CrossRef]
  62. Liu, T.; Meng, G.; Zhou, P.; Deng, Z.; Yao, S.; Chen, K. The Art of Hide and Seek: Making Pickle-Based Model Supply Chain Poisoning Stealthy Again. arXiv 2025, arXiv:2508.19774. [Google Scholar] [CrossRef]
  63. Fu, T.; Sharma, M.; Torr, P.; Cohen, S.B.; Krueger, D.; Barez, F. PoisonBench: Assessing Language Model Vulnerability to Poisoned Preference Data. In Proceedings of the Forty-Second International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025. [Google Scholar]
  64. Bowen, D.; Murphy, B.; Cai, W.; Khachaturov, D.; Gleave, A.; Pelrine, K. Data poisoning in LLMs: Jailbreak-tuning and scaling laws, 2024. arXiv 2024, arXiv:2408.02946. [Google Scholar]
  65. Jiang, S.; Kadhe, S.R.; Zhou, Y.; Cai, L.; Baracaldo, N. Forcing generative models to degenerate ones: The power of data poisoning attacks. arXiv 2023, arXiv:2312.04748. [Google Scholar] [CrossRef]
  66. Zhou, X.; Qiang, Y.; Zade, S.Z.; Roshani, M.A.; Khanduri, P.; Zytko, D.; Zhu, D. Learning to Poison Large Language Models for Downstream Manipulation. arXiv 2024, arXiv:2402.13459. [Google Scholar]
  67. Das, A.; Tariq, A.; Batalini, F.; Dhara, B.; Banerjee, I. Exposing vulnerabilities in clinical LLMs through data poisoning attacks: Case study in breast cancer. In medRxiv; 2024. Available online: https://pmc.ncbi.nlm.nih.gov/articles/PMC10984073/ (accessed on 15 December 2025).
  68. Cheng, Y.; Sadasivan, V.S.; Saberi, M.; Saha, S.; Feizi, S. Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text. arXiv 2025, arXiv:2506.07001. [Google Scholar] [CrossRef]
  69. Vitorino, J.; Maia, E.; Praça, I. Adversarial evasion attack efficiency against large language models. In Proceedings of the International Symposium on Distributed Computing and Artificial Intelligence, Salamanca, Spain, 26–28 June 2024; Springer: Cham, Switzerland, 2024; pp. 14–22. [Google Scholar]
  70. Wang, Z.; Wang, W.; Chen, Q.; Wang, Q.; Nguyen, A. Generating valid and natural adversarial examples with large language models. In Proceedings of the 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Tianjin, China, 8–10 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1716–1721. [Google Scholar]
  71. Kezron, N. Securing the AI supply chain: Mitigating vulnerabilities in AI model development and deployment. World J. Adv. Res. Rev. 2024, 22, 2336–2346. [Google Scholar] [CrossRef]
  72. Moia, V.H.G.; de Meneses, R.D.; Sanz, I.J. An Analysis of Real-World Vulnerabilities and Root Causes in the LLM Supply Chain. In Proceedings of the Simpósio Brasileiro de Segurança da Informação e de Sistemas Computacionais (SBSeg), Foz do Iguaçu, Brasil, 1–4 September 2025; SBC, 2025. pp. 388–396. [Google Scholar] [CrossRef]
  73. Ferrag, M.A.; Alwahedi, F.; Battah, A.; Cherif, B.; Mechri, A.; Tihanyi, N.; Bisztray, T.; Debbah, M. Generative AI in cybersecurity: A comprehensive review of LLM applications and vulnerabilities. Internet Things Cyber-Phys. Syst. 2025, 5, 1–46. [Google Scholar] [CrossRef]
  74. Pankajakshan, R.; Biswal, S.; Govindarajulu, Y.; Gressel, G. Mapping LLM security landscapes: A comprehensive stakeholder risk assessment proposal. arXiv 2024, arXiv:2403.13309. [Google Scholar] [CrossRef]
  75. Derner, E.; Batistič, K.; Zahálka, J.; Babuška, R. A security risk taxonomy for prompt-based interaction with large language models. IEEE Access 2024, 12, 126176–126187. [Google Scholar] [CrossRef]
  76. Jiang, S.; Kadhe, S.R.; Zhou, Y.; Ahmed, F.; Cai, L.; Baracaldo, N. Turning generative models degenerate: The power of data poisoning attacks. arXiv 2024, arXiv:2407.12281. [Google Scholar] [CrossRef]
  77. Inan, H.; Upasani, K.; Chi, J.; Rungta, R.; Iyer, K.; Mao, Y.; Tontchev, M.; Hu, Q.; Fuller, B.; Testuggine, D.; et al. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. arXiv 2023, arXiv:2312.06674. [Google Scholar]
  78. Phute, M.; Helbling, A.; Hull, M.; Peng, S.; Szyller, S.; Cornelius, C.; Chau, D.H. LLM self defense: By self examination, LLMs know they are being tricked. arXiv 2023, arXiv:2308.07308. [Google Scholar]
  79. Wang, Y.; Zhu, R.; Wang, T. Self-Destructive Language Model. arXiv 2025, arXiv:2505.12186. [Google Scholar] [CrossRef]
  80. Cai, Z.; Shabihi, S.; An, B.; Che, Z.; Bartoldson, B.R.; Kailkhura, B.; Goldstein, T.; Huang, F. AegisLLM: Scaling agentic systems for self-reflective defense in LLM security. arXiv 2025, arXiv:2504.20965. [Google Scholar]
  81. Horodnyk, V.; Sabodashko, D.; Kolchenko, V.; Shchudlo, I.; Khoma, V.; Khoma, Y.; Baranowski, M.; Podpora, M. Comparison of Modern Deep Learning Models for Toxicity Detection. In Proceedings of the 13th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems, Gliwice, Poland, 4–6 September 2025; IEEE: Piscataway, NJ, USA, 2025. [Google Scholar]
  82. Harris, C.R.; Millman, K.J.; Van Der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362. [Google Scholar] [CrossRef]
Figure 1. Input-level prompt validation is not enough for some models, as there are numerous creative attacks to overcome prompt validation.
Figure 2. The proposed output-level validation, used either as the primary validation policy or as an additional “last-resort” failsafe, adds a much-needed layer of protection.
Figure 3. Informative diagram of the proposed Dual-agent LLM Firewall. The agent marked in green may become the target of an attack, but the agent marked in red ensures that the final answer does not go beyond predefined policies.
