Article

XSS Attack Detection Method Based on CNN-BiLSTM-Attention

1 Information Security Evaluation Center, Civil Aviation University of China, Tianjin 300300, China
2 College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
3 Digital Government Affairs Office, Shandong Big Data Center, Jinan 250101, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 8924; https://doi.org/10.3390/app15168924
Submission received: 26 June 2025 / Revised: 20 July 2025 / Accepted: 28 July 2025 / Published: 13 August 2025

Abstract

Cross-site scripting (XSS) is one of the most common security threats to web applications, posing a serious challenge to network information security. To address the limitations of traditional detection methods in identifying complex XSS attacks, this paper proposes a hybrid deep learning model that integrates a convolutional neural network (CNN), a bidirectional long short-term memory network (BiLSTM), and an attention mechanism. The model captures local attack feature patterns through the CNN layer, learns long-term contextual dependencies through the BiLSTM layer, and introduces a multi-head attention mechanism to strengthen the focus on key attack vectors. In the preprocessing stage, an improved regular-expression-based word segmentation algorithm is used to construct semantic feature vectors, which effectively addresses the problem of representing XSS attack text features. Experimental results show that, compared with baseline methods, the proposed method achieves an accuracy of 0.9938, a precision of 0.9936, a recall of 0.9936, and an F1-score of 0.9937 on real datasets. This demonstrates that, by integrating CNN and BiLSTM features and incorporating the attention mechanism, the model can effectively handle complex XSS attacks.

1. Introduction

Cross-site scripting (XSS) attacks, as one of the core threats in web application security, have long posed a severe challenge to user data security and system stability. According to the OWASP Top 10 report [1], XSS attacks have long been among the top web security risks. Attackers inject malicious scripts to steal user credentials, hijack sessions, and even spread malware, causing serious economic losses in key areas such as e-commerce and online finance. Cross-site scripting [2] is a common web security vulnerability: attackers inject malicious scripts into web pages, and when other users visit those pages, the scripts execute in the users’ browsers, stealing user information, tampering with page content, and so on. XSS vulnerabilities are generally divided into three types: reflected, stored, and DOM-based. In reflected XSS, the malicious script is included in the parameters of the user’s request; when the user visits the page, the server echoes the script back in the response and it executes in the user’s browser. In stored XSS, the malicious script is stored in the server’s database and executes whenever a user visits a page containing it. In DOM-based XSS, the malicious script executes by modifying the DOM structure of the page rather than relying on the server’s response.
In recent years, with the increasing complexity of web applications, XSS attacks have shown a trend toward diversification and mutation. For example, attackers use HTML tag nesting, JavaScript event attribute obfuscation, encoding conversion, and other means to bypass traditional detection mechanisms, posing significant challenges to detection methods based on rule matching or simple feature engineering. Notably, the recent success of deep learning in natural language processing and network security [3] has provided new ideas for XSS detection. Nilavarasan et al. [4] proposed an end-to-end detection method based on convolutional neural networks. By directly parsing the character-level sequence features of XSS scripts, it avoids the reliance on hand-crafted rules and complex feature engineering found in traditional XSS detection. However, the model does not capture long-distance semantic dependencies, and its generalization to variant attacks needs improvement. Joshi et al. [5] proposed a method based on the long short-term memory (LSTM) model, which addresses cross-site scripting detection through data collection, preprocessing, and training with a three-layer LSTM network. However, this method is limited by its dataset (which cannot cover all fuzzing attacks) and lacks a built-in prevention mechanism (it relies on external filtering). Li et al. [6] proposed a character-level bidirectional long short-term memory network model with a multi-attention mechanism (CMABLSTM), which achieves high-precision detection of XSS attacks in cloud computing environments through automated feature extraction and recognition of obfuscation techniques. However, its generalization is limited to known XSS attack types, and its adaptability to dynamic cloud computing environments still needs further optimization. Peng et al. [7] proposed a Transformer-based XSS attack detection method that automatically generates new XSS attack vectors to augment the dataset, combines a parallel CNN-LSTM encoder to extract local and contextual features, and introduces an attention mechanism to better identify key information, significantly improving detection accuracy and reducing false alarm rates. This effectively addresses the insufficient detection of novel attacks by traditional models; however, the method still relies on static datasets and must be continuously updated to keep pace with evolving attack techniques. The BERT-BiLSTM method proposed by Wan et al. [8] improves on the accuracy of traditional XSS detection, but it is computationally complex, lacks local feature extraction and attention-based optimization, and leaves room for improvement in resource efficiency, feature comprehensiveness, and generalization. Guo et al. [9] proposed an XSS vulnerability detection method based on a relational graph convolutional network (R-GCN), which achieves real-time and accurate identification of cross-site scripting attacks by building an XSS ontology model, extracting word vectors, and integrating an attention mechanism. However, it relies on prior knowledge to build the ontology graph, has weak sequence feature capture ability, does not exploit the temporal modeling advantages of bidirectional LSTM, and requires supplementary external threat rules (such as a malicious IP library), which introduces additional maintenance costs and misjudgment risks.
Based on the above research, in order to solve these problems, this paper proposes a hybrid deep learning model that integrates CNN, bidirectional long short-term memory network (BiLSTM), and multi-head attention mechanism. The main contributions include the following:
  • A word segmentation algorithm based on regular expressions is designed to solve the problem of semantic boundary recognition of programming language symbols and better preserve the semantic and structural integrity of the attack text.
  • A deep learning model that integrates convolutional neural networks, bidirectional long short-term memory networks and multi-head attention mechanisms is proposed, which realizes the three-stage collaborative processing of local feature perception, long-range dependency modeling, and key feature enhancement, and this effectively solves the semantic gap problem existing in traditional methods.
  • Experimental results on real datasets show that the model outperforms mainstream benchmark models in XSS attack detection and can effectively deal with complex and changeable XSS attacks.

2. Related Work

2.1. Traditional XSS Detection Methods

Early XSS detection research was mainly rule-driven. Jovanovic et al. [10] proposed the static data flow tracking tool Pixy, which locates unsanitized data by analyzing taint propagation paths in the source code but cannot detect dynamically generated attack payloads. To enhance real-time defense capabilities, Gupta et al. [11] designed the XSS-SAFE framework, which injects sanitization logic on the server side to intercept stored XSS attacks, but its defense against client-side DOM-based attacks is insufficient. In the face of increasingly complex attack variants, Wang et al. [12] integrated grammatical features with social network propagation characteristics to build a detection model for online social platforms, significantly improving the identification of non-persistent XSS. Rathore et al. [13] further designed a multi-dimensional feature fusion strategy, integrating URL structure, HTML tags, and network topology features to enhance cross-platform detection robustness. However, the empirical analysis of Hydara et al. [14] revealed that advanced obfuscation techniques such as JavaScript encoding transformations can defeat these features, and Ahmed’s tests [15] showed that rule bases must be continuously updated to keep up with new attacks, keeping maintenance costs high.

2.2. XSS Detection Method Based on Deep Learning

Deep learning technology has driven a shift toward end-to-end detection. Fang et al. [16] adopted a character-level LSTM model that processes raw payload sequences directly, avoiding hand-crafted feature bias and effectively identifying obfuscated attacks. Lei et al. [17] integrated an attention mechanism to dynamically focus on high-risk grammatical units (such as the eval() function), significantly improving the accuracy of semantic parsing. In terms of architecture optimization, the MRBN-CNN hybrid model proposed by Yan et al. [18] combines a residual structure with lightweight convolutional layers, optimizing computational efficiency while maintaining detection accuracy. To address sample imbalance, the PCA-LSTM framework developed by Stiawan et al. [19] alleviates vanishing gradients through feature dimensionality reduction. To counter compound attacks, Farea et al. [20] constructed a multi-task BiLSTM architecture for joint detection of SQL injection and XSS, addressing mixed attack scenarios. Although these methods have their own advantages, they still suffer from insufficient local feature extraction and weak context dependency modeling, which limits their robustness against complex XSS variants.
To address the above limitations, this paper proposes a hybrid deep learning model, CNN-BiLSTM-Attention, for XSS attack detection. This model integrates a convolutional neural network (CNN), a bidirectional long short-term memory network (BiLSTM), and a multi-head attention mechanism to achieve multi-stage collaborative processing: the CNN layer extracts local attack feature patterns, the BiLSTM layer models long-distance contextual dependencies, and the multi-head attention mechanism dynamically strengthens the focus on key attack vectors. This architecture not only resolves the semantic discontinuity problem of traditional methods in complex semantic representation but also significantly improves detection accuracy and generalization performance.

3. Method

3.1. Overall Framework

The CNN-BiLSTM-Attention hybrid model proposed in this paper adopts an end-to-end deep learning framework, as shown in Figure 1. The model input is the original HTTP request text, which is converted into a word vector sequence by the preprocessing layer and then passes in sequence through the three core modules: the CNN feature extractor, the BiLSTM context encoder, and the multi-head attention mechanism; finally, the classifier outputs the attack probability prediction. The innovation of this architecture lies in the three-stage coordinated processing of local feature perception, long-range dependency modeling, and key feature enhancement, which effectively resolves the semantic discontinuity problem of traditional methods in complex XSS attack detection.
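To make the data flow concrete, the following is a minimal PyTorch sketch of this three-stage pipeline. The kernel size, channel count, LSTM hidden size, and number of attention heads are illustrative assumptions; only the embedding dimension (256) and Dropout rate (0.2) follow the hyperparameter analysis in Section 4.8.

```python
import torch
import torch.nn as nn

class CNNBiLSTMAttention(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, channels=128,
                 kernel_size=3, hidden=128, heads=4, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Two stacked 1-D convolutions: local n-gram feature extraction.
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, channels, kernel_size, padding=kernel_size // 2),
            nn.ReLU(), nn.Dropout(dropout),
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
            nn.ReLU(), nn.Dropout(dropout),
        )
        # Bidirectional LSTM: long-range context in both directions.
        self.bilstm = nn.LSTM(channels, hidden, batch_first=True,
                              bidirectional=True)
        # Multi-head self-attention over the BiLSTM outputs.
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        # Classifier head: FC -> ReLU -> Dropout -> FC -> Sigmoid.
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, 64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, token_ids):                  # (batch, seq_len)
        e = self.embedding(token_ids)              # (batch, seq_len, embed)
        c = self.conv(e.transpose(1, 2))           # (batch, channels, seq_len)
        h, _ = self.bilstm(c.transpose(1, 2))      # (batch, seq_len, 2*hidden)
        a, _ = self.attn(h, h, h)                  # self-attention
        pooled = a.mean(dim=1)                     # global average pooling
        return self.fc(pooled).squeeze(-1)         # attack probability
```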

3.2. Data Preprocessing

Due to the dual nature of XSS attack payloads—combining characteristics of both natural language and programming languages—they often contain HTML tags, JavaScript functions, and a large number of special symbols, resulting in highly complex and variable structures. Traditional whitespace-based tokenization methods exhibit significant limitations when processing such text, as they tend to incorrectly split critical semantic units (e.g., splitting “onerror=” into “onerror” and “=”), leading to a disconnect between symbols and their surrounding context. This negatively impacts the identification of attack vector boundaries and subsequent feature extraction.
To address this issue, this paper introduces a regular expression-based tokenization pattern (r"[\w']+|[.,!?;]") during the data preprocessing stage. By leveraging a dual-capture (alternation) mechanism, this approach enables the precise segmentation of words, punctuation, and programming symbols, thereby preserving the semantic and structural information inherent in XSS payloads.
The specific preprocessing workflow is as follows: First, the raw text is tokenized using the regular expression, splitting sentences into a sequence of words or subword units. Next, each token is mapped to an index according to a predefined vocabulary, resulting in a sequence of word indices. To accommodate batch processing and meet model input requirements, all sequences are padded or truncated to a uniform length. These index sequences are then fed into an embedding layer, transforming them into dense word vector representations that capture semantic features. Finally, the resulting word vector matrix is reshaped (e.g., transposed) to match the input format expected by the convolutional neural network, thus laying a solid foundation for subsequent feature extraction and sequence modeling.
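The following is a minimal sketch of this preprocessing pipeline. The symbol class is an assumption that generalizes the printed pattern to [^\w\s] so that HTML/JS symbols such as < and = survive tokenization, in line with the stated goal; the vocabulary, padding length, and example payload are likewise illustrative.

```python
import re
import torch

# Alternation ("dual-capture"): word runs OR single non-word symbols.
# The symbol class here is an assumption generalizing the paper's pattern.
TOKEN_RE = re.compile(r"[\w']+|[^\w\s]")

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text.lower())

def encode(tokens: list[str], vocab: dict, max_len: int = 200) -> torch.Tensor:
    ids = [vocab.get(t, 1) for t in tokens][:max_len]  # 1 = out-of-vocabulary
    ids += [0] * (max_len - len(ids))                  # 0 = padding
    return torch.tensor(ids)  # fed to an nn.Embedding layer downstream

payload = '<img src=x onerror="alert(1)">'
print(tokenize(payload))
# ['<', 'img', 'src', '=', 'x', 'onerror', '=', '"', 'alert', '(', '1', ')', '"', '>']
# 'onerror' and '=' remain adjacent tokens, so the event-handler pattern survives.
```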

3.3. Convolutional Feature Extraction Module

In order to effectively extract local context features in the input text sequence, this paper introduces a two-layer one-dimensional convolutional neural network as a feature extraction module in the model structure, as shown in Figure 2. This module can automatically learn and capture discriminative n-gram patterns in the text, providing rich feature representations for subsequent sequence modeling and classification tasks. The convolutional feature extraction module consists of two layers of one-dimensional convolution stacks, and each layer of convolution is followed by a ReLU activation function and Dropout regularization. The first layer of the convolution operation can be expressed as follows:
$C^{(1)} = \mathrm{Dropout}\left(\mathrm{ReLU}\left(\mathrm{Conv1D}(E, K, C)\right)\right)$
where $E$ represents the word embedding matrix, $K$ is the convolution kernel size, and $C$ is the number of output channels. The second layer of convolution further extracts higher-order patterns from the output feature map as follows:
$C^{(2)} = \mathrm{Dropout}\left(\mathrm{ReLU}\left(\mathrm{Conv1D}(C^{(1)}, K, C)\right)\right)$
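A shape walk-through of the two equations above, with an assumed kernel size K = 3 and C = 128 output channels (the paper does not report these values):

```python
import torch
import torch.nn as nn

K, C = 3, 128                  # assumed kernel size and channel count
E = torch.randn(32, 256, 200)  # embeddings reshaped to (batch, embed_dim, seq_len)

conv1 = nn.Sequential(nn.Conv1d(256, C, K, padding=K // 2), nn.ReLU(), nn.Dropout(0.2))
conv2 = nn.Sequential(nn.Conv1d(C, C, K, padding=K // 2), nn.ReLU(), nn.Dropout(0.2))

C1 = conv1(E)   # C^(1): (32, 128, 200), first-order local n-gram features
C2 = conv2(C1)  # C^(2): (32, 128, 200), higher-order patterns over C^(1)
```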

3.4. Bidirectional LSTM Context Encoding

In order to fully model the temporal dependency and contextual semantic features of XSS attack text, this paper introduces a bidirectional long short-term memory network (BiLSTM) [21] as a sequence modeling layer after the convolutional feature extraction module. BiLSTM can simultaneously capture the forward and backward information flow in the sequence, effectively enhancing the model’s ability to understand complex semantic structures.
Specifically, let the feature sequence output by the convolutional module be denoted by $X = \{x_1, x_2, \ldots, x_T\}$. At each time step $t$, the BiLSTM computation consists of two branches: a forward LSTM and a backward LSTM. The working mechanism of the LSTM is illustrated in Figure 3. Its core computational unit includes the input gate $i_t$, forget gate $f_t$, output gate $o_t$, and cell state $C_t$. Taking the forward LSTM as an example, the internal computation proceeds as follows:
The input gate controls the extent to which the current input information is retained. Its update consists of computing the input gate output $i_t$ and the candidate cell state $\tilde{C}_t$, as shown in the following formulas:
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
where $h_{t-1}$ is the hidden state at the previous time step; $\sigma$ is the sigmoid function; $W_i$ and $W_C$ are weight matrices; $b_i$ and $b_C$ are bias vectors; $x_t$ is the input at the current time step; and $\tanh(\cdot)$ is the activation function used to compute the candidate cell state $\tilde{C}_t$.
The forget gate controls the extent to which information from the previous cell state is retained. The update formula for the forget gate output $f_t$ is as follows:
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
The cell state $C_t$ is updated at the current time step by combining historical information with new input. The calculation formula is as follows:
$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$
The output gate regulates the influence of the cell state on the current output. The update formulas for the output gate $o_t$ and the hidden state $h_t$ are as follows:
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$h_t = o_t \cdot \tanh(C_t)$
where $W_o$ is the weight matrix of the output gate; $b_o$ is the bias vector; $\sigma$ is the sigmoid activation function used to compute $o_t$; and $\tanh(\cdot)$ is used to compute the output $h_t$.
The computation of the backward LSTM mirrors that of the forward LSTM, except that the input sequence is processed in reverse order, from $x_T$ to $x_1$. Ultimately, the output of the BiLSTM at each time step is the concatenation of the forward and backward hidden states as follows:
$H_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$
where $[\cdot\,;\cdot]$ denotes the vector concatenation operation.
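For illustration, the gate equations above can be written directly as tensor operations. This is a didactic single-step sketch, not the paper's implementation; in practice torch.nn.LSTM(..., bidirectional=True) runs both directions and concatenates the hidden states as in the equation for $H_t$.

```python
import torch

def forward_lstm_step(x_t, h_prev, c_prev, W, b):
    """One forward-LSTM time step; W and b are dicts of per-gate parameters,
    each W[g] mapping the concatenation [h_{t-1}, x_t] to the hidden size."""
    z = torch.cat([h_prev, x_t], dim=-1)        # [h_{t-1}, x_t]
    i_t = torch.sigmoid(z @ W['i'] + b['i'])    # input gate
    f_t = torch.sigmoid(z @ W['f'] + b['f'])    # forget gate
    o_t = torch.sigmoid(z @ W['o'] + b['o'])    # output gate
    c_tilde = torch.tanh(z @ W['c'] + b['c'])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde          # cell state update
    h_t = o_t * torch.tanh(c_t)                 # hidden state
    return h_t, c_t
```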

3.5. Multi-Head Attention Mechanism

To further improve the model’s ability to capture global dependencies in sequences, this paper introduces a multi-head attention mechanism after the bidirectional LSTM module. The multi-head attention mechanism learns correlations between sequence positions in parallel across different subspaces, thereby capturing richer and more diverse feature representations. Specifically, let the feature sequence output by the BiLSTM be $H = \{h_1, h_2, \ldots, h_T\}$. The multi-head attention mechanism first applies different linear transformations to the input features to generate multiple subspaces of queries, keys, and values. For each attention head, the computation is as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$
where $Q$, $K$, and $V$ are the linearly transformed queries, keys, and values, respectively, and $d_k$ is the dimension of the key vectors. The outputs of the attention heads are concatenated along the feature dimension and passed through a linear transformation to obtain the final multi-head attention output as follows:
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$
where $\mathrm{head}_i$ denotes the output of the $i$-th attention head, $h$ is the number of attention heads, and $W^{O}$ is the weight matrix of the output linear transformation. The multi-head attention mechanism enables the model to attend to different parts of the sequence in different representation subspaces, effectively enhancing feature representation and overall model performance.
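The attention formula above takes only a few lines to state in code; torch.nn.MultiheadAttention bundles the per-head projections, the scaled dot-product, the concatenation, and the output projection $W^{O}$. The dimensions below are illustrative.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ V

# Self-attention over the BiLSTM outputs H:
H = torch.randn(32, 200, 256)  # (batch, seq_len, 2*hidden)
mha = torch.nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
out, attn_weights = mha(H, H, H)  # queries = keys = values = H
```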

3.6. Classifier

After the multi-head attention mechanism and global pooling operation, the model obtains the global feature representation of each input sample. In order to achieve the final classification task, this paper uses a multi-layer fully connected neural network as a classifier. This classifier can perform nonlinear transformation and feature fusion on high-dimensional features, thereby improving the discrimination ability of the model. Specifically, the pooled global feature vector is first input to the fully connected layer, and after linear transformation, ReLU activation function, and Dropout regularization, it is mapped to a single output node through the second fully connected layer. Finally, the Sigmoid activation function is used to compress the output value to between 0 and 1, indicating the probability that the input sample belongs to the positive class. The calculation process can be expressed as follows:
$z_1 = \mathrm{ReLU}(W_1 x + b_1)$
$z_1 = \mathrm{Dropout}(z_1)$
$z_2 = W_2 z_1 + b_2$
$\hat{y} = \sigma(z_2)$
where $x$ is the pooled global feature vector; $W_1$, $W_2$ and $b_1$, $b_2$ are the weights and biases of the fully connected layers; $\sigma(\cdot)$ denotes the Sigmoid activation function; and $\hat{y}$ is the final classification probability output.
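The classifier head corresponds line for line to the equations above; the input and hidden dimensions here are illustrative assumptions.

```python
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(256, 64),  # z1 = W1 x + b1
    nn.ReLU(),           # z1 = ReLU(z1)
    nn.Dropout(0.2),     # z1 = Dropout(z1)
    nn.Linear(64, 1),    # z2 = W2 z1 + b2
    nn.Sigmoid(),        # y_hat = sigmoid(z2), probability of the positive class
)
```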

4. Experimental Analysis

4.1. Dataset

This experiment uses two datasets. The first contains 33,426 XSS attack samples from the XSSed database and 31,407 normal traffic samples from the DMOZ dataset; after standardized preprocessing, the two types of samples were merged to construct a complete dataset. The second uses the dataset compiled by Mereani et al. [22], which contains 14,989 malicious scripts and 27,675 benign scripts. The malicious scripts are mainly drawn from the archives of XSSed.com, the world’s largest XSS vulnerability repository, supplemented by public attack sample libraries (such as OWASP test cases) and active malicious site data crawled through the Tor network, covering obfuscated and non-obfuscated stored XSS attack vectors such as cookie stealing, keystroke logging, and phishing redirection. The benign scripts are systematically collected from multiple trusted sources, including GitHub open-source projects, official websites of educational institutions (.edu domains), and independent e-commerce and blog platforms, covering real functional code such as form validation and API interaction. All samples have been deduplicated. To ensure a reasonable data distribution, positive and negative samples are randomly divided into training and test sets in a ratio of 7:3. Table 1 presents the sample distribution of the datasets in detail.
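The 7:3 split can be sketched as follows. The sample lists are placeholders, and the stratify option (keeping class proportions equal across the two sets) is an assumption consistent with the stated goal of a reasonable data distribution.

```python
from sklearn.model_selection import train_test_split

# Placeholder samples standing in for the deduplicated XSSed/DMOZ corpora.
malicious = ['<script>alert(1)</script>'] * 100
benign = ['function validate(f) { return f.checkValidity(); }'] * 100
samples = malicious + benign
labels = [1] * len(malicious) + [0] * len(benign)

X_train, X_test, y_train, y_test = train_test_split(
    samples, labels,
    test_size=0.3,     # 7:3 train/test ratio
    random_state=42,   # reproducible shuffling
    stratify=labels,   # preserve the positive/negative proportions
)
```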

4.2. Evaluation Indicators

In order to comprehensively evaluate the model performance, this study uses four core evaluation indicators widely used in the field of network security [23,24,25]. These indicators can fully reflect the performance differences of classification models in XSS detection tasks as follows:
(1) Accuracy: A key indicator for measuring the overall classification accuracy of the model. The calculation formula is as follows:
$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
Among them, TP is the number of true positives (attack samples correctly identified), TN the number of true negatives (normal samples correctly identified), FP the number of false positives (normal samples misclassified as attacks), and FN the number of false negatives (attack samples missed).
(2) Precision: Evaluates the reliability of the model’s recognition results and reflects the ability to control false positives as follows:
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$
(3) Recall: The core indicator for measuring the model’s attack detection capability as follows:
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$
(4) F1-score: The harmonic mean of precision and recall, expressed as follows:
$F_1 = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
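The four indicators can be computed directly from the confusion-matrix counts; the example counts below are hypothetical.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(classification_metrics(tp=9900, tn=9300, fp=64, fn=64))
```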

4.3. Benchmark Model

In order to verify the effectiveness of the model proposed in this paper, representative methods in the current XSS detection field are selected as benchmark models, covering the following three categories: traditional machine learning, classic deep learning structure, and advanced hybrid architecture.
Support Vector Machine (SVM): A representative of traditional machine learning methods, using a radial basis function (RBF) kernel and TF-IDF feature extraction, which is robust on small- and medium-sized data sets.
Convolutional Neural Network (CNN): A local feature extractor based on convolution kernels, good at capturing n-gram-level attack patterns, commonly used in text classification tasks.
Long Short-Term Memory (LSTM): A standard architecture for sequence modeling with a triple-gated structure (input gate, forget gate, and output gate) that can effectively capture long-term dependencies.
Bidirectional Long Short-Term Memory (BiLSTM): An improved version of LSTM that can capture contextual dependencies more comprehensively by processing both forward and backward information of the sequence simultaneously.
CNN-LSTM hybrid model: Combining the local feature extraction capability of CNN and the sequence modeling advantage of LSTM, it represents the current mainstream hybrid architecture design direction.
Transformer model: An advanced structure based on the self-attention mechanism that can globally capture sequence dependencies and avoid efficiency bottlenecks caused by recursive calculations.
Graph Convolutional Network (GCN): A neural network model for graph-structured data that learns node representations by aggregating information from neighboring nodes, thereby effectively capturing the relationships between structure and features in the graph.

4.4. Experimental Environment

The experiment uses a 64-bit Ubuntu 20.04 LTS operating system, an Intel(R) Xeon(R) 20-core server-level CPU (Intel, Santa Clara, CA, USA), 90 GB of memory, and programming languages such as Python 3.9.12 and PyTorch 1.12.1. The graphics processor uses NVIDIA GeForce RTX 3090 (24 GB video memory) (NVIDIA, Santa Clara, CA, USA).

4.5. Experimental Results and Analysis

This study compares the performance of machine learning and deep learning models in XSS attack detection tasks, and it selects SVM, CNN, LSTM, BiLSTM, CNN-LSTM, Transformer, and GCN for comparative experiments. The experimental results are shown in Table 2. Our method achieves the best performance in all evaluation metrics on both datasets.
The XSSed-DMOZ dataset is taken as a representative example for the analysis. From the overall results, the performance of deep learning models is generally better than that of traditional SVM methods. The various indicators of SVM are around 0.9760. Although it can achieve a high level of detection, it has certain limitations in complex feature expression and sequence modeling. In contrast, the CNN model effectively extracts local features through convolution operations, and the accuracy is improved to 0.9878. The LSTM model further utilizes its modeling ability for sequence information and improves the accuracy to 0.9916. On this basis, the BiLSTM model introduces a bidirectional structure, which can capture contextual information at the same time, with an accuracy of 0.9923, which is better than the unidirectional LSTM and CNN.
In terms of combined models, the CNN-LSTM model combines the advantages of convolutional and recurrent neural networks, with an accuracy of 0.9911, slightly lower than the standalone BiLSTM model but higher than the standalone CNN model, indicating that feature fusion contributes to model performance. Notably, the CNN-BiLSTM-Attention model with the attention mechanism achieved the best detection results, with an accuracy of 0.9938 and precision, recall, and F1-score all at or above 0.9936. This shows that the attention mechanism helps the model focus on the key features in the input sequence, significantly improving detection performance.
In addition, the Transformer model performed slightly worse than the RNN model in this experiment, with an accuracy of 0.9869. Although Transformer has outstanding performance in the field of natural language processing, its advantages have not been fully utilized in this task, perhaps due to limited data or short text length. However, the precision and recall of Transformer are still high, showing certain application potential.
In summary, deep learning models, especially those that integrate multiple feature extraction methods and introduce attention mechanisms, perform best in XSS attack detection tasks. The experimental results verify the significant impact of model structure design on detection performance, providing a strong reference for subsequent related research.

4.6. Ablation Studies

In order to verify the effectiveness of each component in the proposed CNN-BiLSTM-Attention model, this study conducted a series of ablation experiments based on the XSSed-DMOZ dataset. We designed four variant models as follows: BiLSTM-Attention (removing CNN components), CNN-Attention (removing BiLSTM components), CNN-BiLSTM (removing the attention mechanism), and CNN-LSTM-Attention (replacing BiLSTM with unidirectional LSTM). The experimental results are shown in Table 3. The complete CNN-BiLSTM-Attention model achieved the best performance in all evaluation indicators, reaching 0.9938 accuracy and 0.9937 F1-score.
It can be seen from the experimental data that removing any component leads to a decrease in model performance. After removing the CNN component (BiLSTM-Attention), the accuracy dropped from 0.9938 to 0.9910; removing the BiLSTM component (CNN-Attention) had the most significant impact, with the accuracy dropping to 0.9895; removing the attention mechanism (CNN-BiLSTM) reduced the accuracy to 0.9904; and after replacing the bidirectional LSTM with a unidirectional LSTM (CNN-LSTM-Attention), the accuracy was 0.9898. This shows that the CNN, BiLSTM, and attention mechanism each play an irreplaceable role in XSS attack detection, and their combination can effectively extract features from XSS attack payloads and classify them accurately. In particular, the BiLSTM component is especially important for capturing long-term dependencies in XSS attack payloads, while the attention mechanism further enhances the model’s ability to identify key features.

4.7. Model Interpretability Analysis Based on Attention Mechanism

This study uses the attention mechanism in the CNN-BiLSTM-Attention model to visualize the decision-making basis of the model in the process of XSS attack detection. The following shows the distribution of attention weights of two typical XSS samples, which intuitively presents the degree of attention paid by the model to different tokens. In the first obvious script injection sample, as shown in Figure 4, the model mainly focuses on HTML tags and attributes, especially giving high attention weights to tags such as br, IP, and script. It is worth noting that the model pays high attention to the HTML structure part, while the weight of the resource path part (thirdparty/scripts/ckers.org.js) gradually decreases, indicating that it pays more attention to structural features rather than specific resource locations.
The second sample, shown in Figure 5, presents a more challenging case: a URL-encoded XSS attack characterized by hexadecimal-encoded tags and functions (such as 3cscript for <script and 3ealert for >alert). The attention scores in this sample are distributed relatively evenly across tokens, with the parameter id receiving the highest relative weight. This even distribution shows that the model adopts a comprehensive analysis strategy when dealing with complex obfuscated attacks rather than relying on a single feature, demonstrating its ability to recognize obfuscation techniques. This attention visualization demonstrates that the model not only achieves high accuracy but also provides transparency into its decision-making. The model automatically learns to identify key features of XSS attacks, from obvious script tags to obfuscated attack code.
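As a sketch of how such heat maps can be produced, per-token weights can be read back from torch.nn.MultiheadAttention; the layer, token list, and feature tensor below are stand-ins rather than the paper's actual pipeline.

```python
import torch
import matplotlib.pyplot as plt

tokens = ['<', 'script', '>', 'alert', '(', '1', ')', '<', '/', 'script', '>']
H = torch.randn(1, len(tokens), 256)  # stand-in for the BiLSTM outputs
mha = torch.nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

with torch.no_grad():
    _, weights = mha(H, H, H, average_attn_weights=True)  # (1, T, T), head-averaged

importance = weights.squeeze(0).mean(dim=0)  # mean attention each token receives

plt.figure(figsize=(8, 1.5))
plt.imshow(importance.unsqueeze(0), cmap='Reds', aspect='auto')
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks([])
plt.title('Attention weight per token')
plt.tight_layout()
plt.show()
```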

4.8. Parameter Impact Analysis

In this section, we analyze the performance of the model under different hyperparameters, including word embedding dimension, Dropout ratio, and different optimizers.

4.8.1. Word Embedding Dimension

We conducted a hyperparameter sensitivity analysis on the embedding dimension of the word embedding layer. We selected four different embedding dimensions of 32, 64, 128, and 256 and compared the performance of the model. The experimental results are shown in Table 4. As the embedding dimension increases, the model shows a gradual improvement in various indicators such as accuracy, precision, recall, and F1-score.
Specifically, when the embedding dimension is 32, the accuracy of the model is 0.9924 and the F1-score is 0.9924; when the embedding dimension is increased to 64, the accuracy and F1-score are increased to 0.9929 and 0.9928, respectively; when the embedding dimension is further set to 128, the accuracy and F1-score reach 0.9933 and 0.9933, respectively; and when the embedding dimension is increased to 256, the model performance reaches the best, with an accuracy of 0.9938 and an F1-score of 0.9937. It can be seen that a higher embedding dimension can provide the model with richer semantic representation capabilities, thereby improving the model’s ability to discriminate XSS attack samples.
However, the increase in embedding dimension will also lead to an increase in the number of model parameters, which in turn leads to an increase in computing resource consumption. Therefore, in practical applications, it is necessary to balance model performance and computing overhead.

4.8.2. Dropout Ratio

We conducted an experimental analysis of the Dropout rate, a key regularization hyperparameter in the model, setting values from 0.1 to 0.9 and examining their impact on model accuracy. The experimental results are shown in Figure 6.
The experimental results show that as the Dropout ratio increases, the model accuracy first rises, then stabilizes, and finally drops sharply at high ratios. When the Dropout ratio increases from 0.1 to 0.2, accuracy rises from 0.9926 to 0.9931, the highest value in this group of experiments. Thereafter, as the Dropout ratio increases further, accuracy remains high overall with little fluctuation. Once the Dropout ratio exceeds 0.7, accuracy begins to fall, with the sharpest drop at 0.9, indicating that an excessively high Dropout ratio causes underfitting and impairs generalization. In summary, an appropriate Dropout ratio can effectively alleviate overfitting and improve the generalization performance of the model [26], but an excessively high ratio damages its learning ability.

4.8.3. Optimizer

We compared the performance of three optimizers, Adam, SGD, and RMSprop, during model training. The experimental results are shown in Figure 7. The Adam and RMSprop optimizers are superior to SGD in both the speed of loss reduction and the final convergence.
Specifically, the Adam optimizer can quickly reduce the loss value at the beginning of training, showing the fastest convergence speed and the lowest final loss, and performs best among all optimizers. Although the performance of the RMSprop optimizer is similar to that of Adam, and it can also reduce the loss quickly and achieve good convergence effect, it is slightly inferior to Adam overall. In contrast, the loss value of the SGD optimizer does not change much during the training process, and there is almost no obvious decrease, indicating that its convergence speed in this task is slow and it is difficult to achieve a low loss level. In summary, the Adam optimizer shows the best training effect on this model and dataset, and it can reduce the loss faster and more effectively, significantly improving the model performance. Therefore, this paper finally selected the Adam optimizer as the parameter optimization method of the model to obtain the best training efficiency and model performance.
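A minimal training loop with the chosen configuration might look as follows, reusing the model sketch from Section 3.1; the batch source, learning rates, and epoch count are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = CNNBiLSTMAttention(vocab_size=5000)  # sketch from Section 3.1
criterion = nn.BCELoss()                     # binary cross-entropy on sigmoid output
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Alternatives compared above:
#   torch.optim.SGD(model.parameters(), lr=1e-2)
#   torch.optim.RMSprop(model.parameters(), lr=1e-3)

# Stand-in batches of (token_ids, labels); a real run would use a DataLoader.
train_loader = [(torch.randint(0, 5000, (32, 200)),
                 torch.randint(0, 2, (32,)).float())]

for epoch in range(10):
    for token_ids, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(token_ids), labels)
        loss.backward()
        optimizer.step()
```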

5. Conclusions

To address the limitations of traditional XSS attack detection methods in recognizing complex attack patterns, this paper proposes a hybrid deep learning model that integrates a convolutional neural network (CNN), a bidirectional long short-term memory network (BiLSTM), and a multi-head attention mechanism. Through the multi-scale feature fusion strategy, the CNN layer effectively captures local attack features (such as malicious tags and event attributes), the BiLSTM layer models long-term contextual dependencies across tags, and the attention mechanism dynamically strengthens the weighting of key attack vectors. In the preprocessing stage, the designed regular-expression-based word segmentation algorithm resolves the semantic boundary ambiguity of mixed HTML/JS text and significantly improves the accuracy of feature representation. Experimental results show that this model achieves an accuracy of 0.9938, a precision of 0.9936, a recall of 0.9936, and an F1-score of 0.9937 on real datasets.

Author Contributions

Conceptualization, Z.L. and F.L.; methodology, F.L.; software, Z.L.; validation, Z.G., F.L. and Y.L.; formal analysis, Z.G.; investigation, Z.L.; resources, Z.G.; data curation, Z.L.; writing—original draft preparation, F.L.; writing—review and editing, F.L.; visualization, Y.L.; supervision, Y.L.; project administration, Z.G.; funding acquisition, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant U2333201. This work was supported by the Fundamental Research Business Funds for Central Universities of Civil Aviation University of China (3122022058).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are available from the authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kaur, J.; Garg, U.; Bathla, G. Detection of cross-site scripting (XSS) attacks using machine learning techniques: A review. Artif. Intell. Rev. 2023, 56, 12725–12769. [Google Scholar] [CrossRef]
  2. Hussainy, A.S.; Khalifa, M.A.; Elsayed, A.; Hussien, A.; Razek, M.A. Deep learning toward preventing web attacks. In Proceedings of the 2022 5th International conference on computing and informatics (ICCI), New Cairo, Egypt, 9–10 March 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 280–285. [Google Scholar]
  3. Eunaicy, J.C.; Suguna, S. Web attack detection using deep learning models. Mater. Today Proc. 2022, 62, 4806–4813. [Google Scholar] [CrossRef]
  4. Nilavarasan, G.; Balachander, T. XSS attack detection using convolution neural network. In Proceedings of the 2023 International Conference on Artificial Intelligence and Knowledge Discovery in Concurrent Engineering (ICECONF), Chennai, India, 5–7 January 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
  5. Joshi, I.S.; Kiratsata, H.J. Cross-Site Scripting Recognition Using LSTM Model. In Proceedings of the International Conference on Intelligent Computing and Communication, Hyderabad, India, 18–19 November 2022; Springer: Singapore, 2022; pp. 1–10. [Google Scholar]
  6. Li, X.; Wang, T.; Zhang, W.; Niu, X.; Zhang, T.; Zhao, T.; Wang, Y.; Wang, Y. An LSTM based cross-site scripting attack detection scheme for Cloud Computing environments. J. Cloud Comput. 2023, 12, 118. [Google Scholar] [CrossRef]
  7. Peng, B.; Xiao, X.; Wang, J. Cross-site scripting attack detection method based on transformer. In Proceedings of the 2022 IEEE 8th International Conference on Computer and Communications (ICCC), Chengdu, China, 9–12 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1651–1655. [Google Scholar]
  8. Wan, S.; Xian, B.; Wang, Y.; Lu, J. Methods for Detecting XSS Attacks Based on BERT and BiLSTM. In Proceedings of the 2024 8th International Conference on Management Engineering, Software Engineering and Service Sciences (ICMSS), Wuhan, China, 12–14 January 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–7. [Google Scholar]
  9. Guo, Z.; Li, X.; Hu, R.; Wang, D.; Song, W. A Vulnerability Detection Method for Internet Cross-site Scripting Based on Relationship Diagram Convolutional Networks. J. Web Eng. 2025, 24, 243–266. [Google Scholar] [CrossRef]
  10. Jovanovic, N.; Kruegel, C.; Kirda, E. Pixy: A static analysis tool for detecting web application vulnerabilities. In Proceedings of the 2006 IEEE Symposium on Security and Privacy (S&P’06), Berkeley/Oakland, CA, USA, 21–24 May 2006; IEEE: Piscataway, NJ, USA, 2006. [Google Scholar]
  11. Gupta, S.; Gupta, B.B. XSS-SAFE: A server-side approach to detect and mitigate cross-site scripting (XSS) attacks in JavaScript code. Arab. J. Sci. Eng. 2016, 41, 897–920. [Google Scholar] [CrossRef]
  12. Wang, R.; Jia, X.; Li, Q.; Zhang, S. Machine learning based cross-site scripting detection in online social network. In Proceedings of the 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC, CSS, ICESS), Paris, France, 20–22 August 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 823–826. [Google Scholar]
  13. Rathore, S.; Sharma, P.K.; Park, J.H. XSSClassifier: An efficient XSS attack detection approach based on machine learning classifier on SNSs. J. Inf. Process. Syst. 2017, 13, 1014–1028. [Google Scholar] [CrossRef]
  14. Hydara, I.; Sultan, A.B.M.; Zulzalil, H.; Admodisastro, N. Current state of research on cross-site scripting (XSS)—A systematic literature review. Inf. Softw. Technol. 2015, 58, 170–186. [Google Scholar] [CrossRef]
  15. Ahmed, M.A.; Ali, F. Multiple-path testing for cross site scripting using genetic algorithms. J. Syst. Archit. 2016, 64, 50–62. [Google Scholar] [CrossRef]
  16. Fang, Y.; Li, Y.; Liu, L.; Huang, C. DeepXSS: Cross site scripting detection based on deep learning. In Proceedings of the 2018 International Conference on Computing and Artificial Intelligence, Chengdu, China, 12–14 March 2018; pp. 47–51. [Google Scholar]
  17. Lei, L.; Chen, M.; He, C.; Li, D. XSS detection technology based on LSTM-attention. In Proceedings of the 2020 5th International Conference on Control, Robotics and Cybernetics (CRC), Wuhan, China, 16–18 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 175–180. [Google Scholar]
  18. Yan, H.; Feng, L.; Yu, Y.; Liao, W.; Feng, L.; Zhang, J.; Liu, D.; Zou, Y.; Liu, C.; Qu, L.; et al. Cross-site scripting attack detection based on a modified convolution neural network. Front. Comput. Neurosci. 2022, 16, 981739. [Google Scholar] [CrossRef] [PubMed]
  19. Stiawan, D.; Bardadi, A.; Afifah, N.; Melinda, L.; Heryanto, A.; Septian, T.W.; Idris, M.Y.; Subroto, I.M.I.; Lukman; Budiarto, R. An Improved LSTM-PCA Ensemble Classifier for SQL Injection and XSS Attack Detection. Comput. Syst. Sci. Eng. 2023, 46. [Google Scholar] [CrossRef]
  20. Farea, A.A.; Wang, C.; Farea, E.; Alawi, A.B. Cross-site scripting (XSS) and SQL injection attacks multi-classification using bidirectional LSTM recurrent neural network. In Proceedings of the 2021 IEEE International Conference on Progress in Informatics and Computing (PIC), Shanghai, China, 17–19 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 358–363. [Google Scholar]
  21. Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681. [Google Scholar] [CrossRef]
  22. Mereani, F.A.; Howe, J.M. Detecting cross-site scripting attacks using machine learning. In Proceedings of the International Conference on Advanced Machine Learning Technologies and Applications, Cairo, Egypt, 22–24 February 2018; Springer: Cham, Switzerland, 2018; pp. 200–210. [Google Scholar]
  23. Kitchenham, B.A.; Pickard, L.M.; MacDonell, S.G.; Shepperd, M.J. What accuracy statistics really measure. IEE Proc.-Softw. 2001, 148, 81–85. [Google Scholar] [CrossRef]
  24. Buckland, M.; Gey, F. The relationship between recall and precision. J. Am. Soc. Inf. Sci. 1994, 45, 12–19. [Google Scholar] [CrossRef]
  25. Yacouby, R.; Axman, D. Probabilistic extension of precision, recall, and f1 score for more thorough evaluation of classification models. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, Online, 20 November 2020; pp. 79–91. [Google Scholar]
  26. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
Figure 1. CNN-BiLSTM-Attention model.
Figure 2. Dual-layer convolution feature extraction structure.
Figure 3. Structure of LSTM.
Figure 4. Sample 1: attention distribution heat map of an obvious script injection sample.
Figure 5. Sample 2: attention distribution heat map of a URL-encoded obfuscated XSS attack.
Figure 6. Model accuracy as a function of the Dropout ratio.
Figure 7. Loss curves of different optimizers over training epochs.
Table 1. Sample distribution of the datasets.

Dataset        Malicious   Benign    Total
XSSed-DMOZ     33,426      31,407    64,833
Mereani-XSS    14,989      27,675    42,664
Table 2. Performance comparison of different models on the XSS attack detection task for the two datasets. The best results for each dataset are marked in bold.

Dataset        Method                  Accuracy   Precision   Recall   F1-Score
XSSed-DMOZ     SVM                     0.9760     0.9761      0.9767   0.9760
               CNN                     0.9878     0.9878      0.9878   0.9878
               LSTM                    0.9916     0.9896      0.9898   0.9897
               BiLSTM                  0.9923     0.9912      0.9915   0.9913
               CNN-LSTM                0.9911     0.9899      0.9901   0.9900
               Transformer             0.9869     0.9738      0.9719   0.9725
               GCN                     0.9852     0.9889      0.9822   0.9856
               CNN-BiLSTM-Attention    0.9938     0.9936      0.9936   0.9937
Mereani-XSS    SVM                     0.9505     0.9427      0.9496   0.9460
               CNN                     0.9589     0.9536      0.9539   0.9537
               LSTM                    0.9532     0.9538      0.9427   0.9479
               BiLSTM                  0.9498     0.9510      0.9386   0.9443
               CNN-LSTM                0.9603     0.9515      0.9505   0.9510
               Transformer             0.9543     0.9505      0.9487   0.9496
               GCN                     0.9552     0.9517      0.9180   0.9345
               CNN-BiLSTM-Attention    0.9609     0.9602      0.9598   0.9599
Table 3. Comparison of XSS attack detection performance of different model variants.

Variant                 Accuracy   Precision   Recall   F1-Score
BiLSTM-Attention        0.9909     0.9896      0.9898   0.9897
CNN-Attention           0.9894     0.9872      0.9876   0.9873
CNN-BiLSTM              0.9903     0.9892      0.9894   0.9893
CNN-LSTM-Attention      0.9897     0.9894      0.9897   0.9895
CNN-BiLSTM-Attention    0.9938     0.9936      0.9936   0.9937
Table 4. Model performance under different word embedding dimensions.

Embedding Dimension   Accuracy   Precision   Recall   F1-Score
32                    0.9924     0.9923      0.9926   0.9924
64                    0.9929     0.9927      0.9930   0.9928
128                   0.9933     0.9931      0.9934   0.9933
256                   0.9938     0.9936      0.9936   0.9937
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
