Article

An Improved Detection of Cross-Site Scripting (XSS) Attacks Using a Hybrid Approach Combining Convolutional Neural Networks and Support Vector Machine

1
LSATE Laboratory, National School of Applied Science, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco
2
L3IA Laboratory, Faculty of Sciences Dhar EL Mahraz, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco
*
Author to whom correspondence should be addressed.
J. Cybersecur. Priv. 2026, 6(1), 18; https://doi.org/10.3390/jcp6010018 (registering DOI)
Submission received: 7 November 2025 / Revised: 25 December 2025 / Accepted: 13 January 2026 / Published: 17 January 2026
(This article belongs to the Section Security Engineering & Applications)

Abstract

Cross-site scripting (XSS) attacks are among the major threats to web security, owing in part to the diversity and complexity of HTML formats. Research has shown that some text processing-based methods are limited in their ability to detect this type of attack. This article proposes an approach that improves XSS detection while taking the limitations of these techniques into account. It combines deep learning, represented by convolutional neural networks (CNN), with the classification accuracy of support vector machines (SVM). The approach exploits the ability of CNNs to detect complex visual patterns despite injection variations, together with the SVM’s strong classification capability, because XSS attacks often use obfuscation or encryption techniques that are difficult to detect with textual methods alone. This work relies on a dataset dedicated to XSS attacks, which is available on Kaggle and contains 13,686 sentences in script form, covering both benign and malicious cases. Benign data accounts for 6313 cases and malicious data for 7373 cases. The model was trained on 80% of this data, and the remaining 20% was reserved for testing. Computer vision techniques were used to analyze the visual patterns in the images and extract distinctive features, moving from a textual representation to a visual one in which each character is converted into its ASCII code and then into a grayscale pixel. A CNN model was used to distinguish the visual characteristics of normal and malicious code strings. The convolution and subsampling (pooling) layers extract significant patterns at different levels of abstraction, and the final output is converted into a feature vector that can be exploited by a classification algorithm such as an optimized SVM.
The experimental results showed excellent performance, with an accuracy of 99.7%, and indicate that the model generalizes effectively without overfitting or loss of performance. This significantly enhances the security of web applications by providing robust protection against complex XSS threats.

1. Introduction

Recently, thanks to the increased use of the Internet and sophisticated devices, the usefulness of web applications in providing services has significantly strengthened. They have become essential tools for many businesses and organizations, especially in the finance, education, and commerce sectors. With the evolution of digital technologies, web applications tend to become the preferred means of data representation and service delivery on the Internet. However, due to their increasing complexity and high interactivity, these applications are particularly exposed to vulnerabilities, making them prime targets for cyberattacks. Cybercriminals regularly exploit these flaws, thereby posing a constant threat to the security of online services. Among the most common vulnerabilities in web applications is Cross-Site Scripting (XSS), a security flaw that allows an attacker to inject malicious code—most often JavaScript—into a web page viewed by other users. According to the OWASP Foundation (2023) [1], XSS attacks remain among the top ten vulnerabilities affecting web applications. These attacks notably allow stealing cookies, hijacking sessions, altering website content, or even executing malware without the victim’s knowledge. By compromising users’ browsers in this way, XSS can lead to the theft of sensitive data such as passwords, credit card numbers, and personal information. Due to their frequency and potential impact, XSS attacks are considered critical by cybersecurity experts. This persistent threat has led researchers to actively focus on developing detection, prevention, and remediation mechanisms, making XSS a dynamic field of research in web application security. Cross-Site Scripting (XSS) is an injection-type attack targeting the application layer, which involves injecting malicious code into a vulnerable web application. The latter primarily occurs due to insufficient input validation, which is the main cause of XSS attacks [2]. 
When a web application accepts and displays user data without verifying its type, format, or content, it opens the door to malicious scripts. The absence of proper filtering and a lack of effective security mechanisms significantly increase the risk of attacks, allowing malicious content to go unnoticed and be executed by the browser, thereby exposing users to threats. As a result, this vulnerability exposes the victim’s browser to the involuntary execution of malicious scripts, as it fails to distinguish malicious content from legitimate content [3]. Detection methods for cross-site scripting (XSS) attacks have progressively evolved, moving from manual techniques to more advanced approaches based on artificial intelligence (AI). Originally, protection involved manually analyzing the source code and validating user input. Subsequently, rule-based filters were implemented, relying on predefined patterns to identify suspicious elements such as <script> tags [4]. However, these mechanisms remain insufficient against complex attacks or those concealed using obfuscation techniques. In this evolving landscape, new tools have been developed in response to increasingly sophisticated attacks. These tools rely on both static and dynamic analysis of the source code in order to examine the behavior of an application during execution [5]. Subsequently, machine learning (ML) techniques were implemented, using classification models based on features extracted from the code, such as input length, the number of special characters, and the presence of certain JavaScript functions. The most commonly used machine learning algorithms for XSS detection are Random Forest (RF), Support Vector Machines (SVM), and K-Nearest Neighbors (KNN). Due to the rapid evolution of attack techniques, it has become necessary to adopt more advanced methods to counter them. 
Traditional methods face significant challenges, particularly high false positive and false negative rates caused by the evolving nature and complexity of attack patterns. Thanks to advances in artificial intelligence (AI), deep learning (DL) techniques have been applied to XSS detection. Advanced models such as Recurrent Neural Networks (RNN), Long Short-Term Memory networks (LSTM), and the BERT model have been employed. These models can extract hidden patterns from complex or obfuscated data, which has significantly improved the accuracy of detection and classification. Nevertheless, all of these methods have limitations that attackers can exploit to compromise user data. In this context, the present work aims to develop a hybrid method for detecting XSS attacks. This method combines Deep Learning and Machine Learning, specifically Convolutional Neural Networks (CNN) and the SVM classifier. It leverages the ability of the CNN to extract features from data efficiently, while the SVM performs the classification. CNNs have proven to be effective, especially when combined with an inexpensive FPGA accelerator optimized with depthwise separable convolutions, which significantly reduces computational complexity while improving processing speed. Squeeze-and-Excitation modules enhance the network’s ability to dynamically reweight feature channels, allowing for the extraction of more distinctive and powerful features [6]. The authors of [7] rely on a few-shot image classification algorithm that integrates the global and local features of the model, which allows for a better understanding of details while maintaining an overall contextual understanding. The combination of these methods significantly improves the overall performance of CNNs, providing richer, more accurate, and more relevant feature extraction, even in resource-limited environments.
The chosen methodological approach is divided into three phases, which will be presented in detail below:
  • Phase 1: ASCII encoding of the HTML code, followed by its conversion into an image.
  • Phase 2: Automatic extraction of discriminative features from these images.
  • Phase 3: Data classification using an SVM (Support Vector Machine) classifier.
Based on the methodological steps described above, this article presents the following contributions:
  • XSS attacks are detected using a hybrid model combining CNN and SVM. The CNN extracts features from images after being trained on a set of 64 × 64 pixel images. These images were generated by first converting the code to ASCII, then transforming it into grayscale images, before being used for the classification step.
  • The results obtained using this hybrid model combining CNN and SVM are very accurate (99.7%) and balanced for all criteria evaluated, which will significantly improve the security of web applications by providing reliable defenses against sophisticated XSS threats.
  • Optimization of SVM hyperparameters using GridSearchCV.
  • The total time per query varies between 10 and 13 milliseconds, which is very low and perfectly suited for use in production. Our model offers a much better balance between computational cost and predictive efficiency, clearly outperforming traditional methods that rely solely on SVM.
The results obtained during the experiments are promising and show that the combination of Deep Learning (DL) and Machine Learning (ML) improves the accuracy of XSS attack detection. This research paper is structured as follows: after the introduction, the second section presents the fundamental concepts necessary for understanding the topic and related work. The third section describes the proposed approach in detail. The fourth section is dedicated to experimentation and results, including a subsection evaluating the performance of our approach. The fifth section is a discussion highlighting the strengths and weaknesses of our approach. Finally, a conclusion provides key takeaways and suggestions that pave the way for future research.

2. Background and Related Work

2.1. Cross-Site Scripting (XSS)

Cross-site scripting (XSS) is a security vulnerability in which an attacker injects malicious code, usually in JavaScript, into a web page consulted by other users. This injection enables the attacker to compromise the integrity and security of the targeted web application: it can be used to steal cookies, hijack user sessions, modify site content, or even execute malware without the victim’s knowledge [8,9]. Cross-Site Scripting (XSS) attacks fall into three main categories: reflected (non-persistent), persistent (stored), and DOM-based. Each type has its own characteristics.

2.1.1. Stored XSS (Persistent)

Persistent (stored) XSS attacks mainly target interactive platforms such as social networks, forums, or messaging services. Unlike non-persistent XSS attacks, they do not require a trap link: the attacker directly inserts a malicious script into a section of the site accessible to other users, such as comments, forums, reviews, or HTML emails, and the script then persists in the server database. When a user visits the compromised page, the script runs automatically, potentially exposing all visitors to risks such as data theft or session takeover. This silent, automatic nature makes this type of attack particularly dangerous [10].

2.1.2. Reflected XSS (Non-Persistent)

Reflected XSS is a more targeted attack. The attacker tricks the victim into visiting a malicious URL (Uniform Resource Locator), often through unsolicited emails. The URL contains malicious code, which is then executed in the victim’s browser. This type of XSS is temporary and affects a specific user [10].

2.1.3. DOM-Based

Unlike traditional XSS attacks (reflected or stored), DOM (Document Object Model)-based XSS relies on injecting malicious code directly into the victim’s browser, via dynamic DOM modifications, without passing through the server [11]. Execution of the malicious script is usually triggered by user interaction with the web application. Its severity lies in its stealthiness: because no traces pass through the server, it is difficult for traditional security tools to detect [12]. This type of non-persistent XSS is based on DOM manipulation using unfiltered user input. This characteristic makes DOM-based XSS particularly formidable in modern web applications.

2.2. Basic Concepts

This section presents the fundamental concepts on which this XSS attack detection approach is based. It includes character-to-ASCII conversion, ASCII-to-image conversion, convolutional neural networks (CNN), and support vector machines (SVM).

2.2.1. ASCII Representation of the Sentence

ASCII (American Standard Code for Information Interchange) is a widely used character encoding standard that assigns a unique numerical code to each character. This code, consisting of 7 or 8 bits depending on the version, allows computers to represent and exchange text in a standardized format [13].

2.2.2. ASCII-Based Image Reconstruction

Coumar and Kingston (2025) describe ASCII-based image reconstruction as the correspondence between ASCII characters and visual information. This correspondence can also be explored in the reverse direction, paving the way for image reconstruction methods based on ASCII values [14]. The technique directly associates the numerical value of an ASCII character with a visual value (for example, a shade of gray) assigned to a pixel in an image of a given resolution, and it is often used for visual analysis.
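The character-to-gray-level correspondence, and its reverse direction, can be illustrated with a minimal sketch. The function names `text_to_gray` and `gray_to_text` are hypothetical, and dropping non-ASCII characters is an assumption made here for simplicity, not a step confirmed by the article.

```python
def text_to_gray(text: str) -> list[int]:
    """Map each ASCII character to its code, read directly as a gray level (0-127)."""
    # Assumption: characters outside the 7-bit ASCII range are dropped.
    return [ord(ch) for ch in text if ord(ch) < 128]


def gray_to_text(pixels: list[int]) -> str:
    """Reverse direction: rebuild the string from the gray levels."""
    return "".join(chr(value) for value in pixels)


payload = "<script>alert(1)</script>"
pixels = text_to_gray(payload)       # '<' becomes 60, 's' becomes 115, ...
restored = gray_to_text(pixels)      # round trip back to the original string
```

For pure ASCII input the mapping is lossless, so the order of characters in the code is preserved in the pixel sequence.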

2.2.3. Convolutional Neural Networks (CNN)

The convolutional neural network (CNN) is a deep learning model specifically designed for the efficient processing of spatial data, particularly images. Thanks to its layered architecture, it reduces the size of pixel matrices while retaining most of the visual information. Initially developed for the automatic extraction of features from images, notably for tasks such as classification or segmentation, CNNs have established themselves as an essential reference in many computer vision applications [15].
For a 2D input image $I$ and a 2D kernel $K$ of size $M \times N$, the convolution operation can be defined as
$$S(i, j) = (I * K)(i, j) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} I(i + m, j + n) \cdot K(m, n)$$
where
  • $I(i, j)$: The pixel value at position $(i, j)$ in the input image;
  • $K(m, n)$: The weight of the kernel at position $(m, n)$;
  • $S(i, j)$: The output feature map at position $(i, j)$.
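The double sum above can be written directly as a short NumPy sketch. It assumes no padding and a stride of 1 (a "valid" convolution), and the helper name `conv2d_valid` is illustrative. As in the formula, the kernel is not flipped, which matches what deep learning frameworks compute under the name convolution.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """S(i, j) = sum_m sum_n I(i + m, j + n) * K(m, n), no padding, stride 1."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    M, N = kernel.shape
    H, W = image.shape
    out = np.zeros((H - M + 1, W - N + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the kernel with the window anchored at (i, j)
            out[i, j] = np.sum(image[i:i + M, j:j + N] * kernel)
    return out


image = np.arange(16).reshape(4, 4)      # I(i, j) = 4i + j
kernel = np.array([[1, 0], [0, -1]])     # computes I(i, j) - I(i+1, j+1)
feature_map = conv2d_valid(image, kernel)
```

With this toy input every output entry equals -5, because moving one step down and one step right always increases the pixel value by 5.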

2.2.4. Support Vector Machine (SVM)

“The support-vector network is a new learning machine for two-group classification problems”. It seeks to separate data into distinct classes using an optimal hyperplane in a high-dimensional space [16].
Consider a binary classification problem with two classes, labeled as $+1$ and $-1$. We have a training dataset consisting of input feature vectors $X$ and their corresponding class labels $Y$. The equation for the linear hyperplane can be written as
$$w^T x + b = 0$$
where
  • $w$ is the normal vector to the hyperplane (the direction perpendicular to it);
  • $b$ is the offset or bias term representing the distance of the hyperplane from the origin along the normal vector $w$.
The distance between a data point $x_i$ and the decision boundary can be calculated as
$$d_i = \frac{w^T x_i + b}{\|w\|}$$
where $\|w\|$ represents the Euclidean norm of the weight vector $w$. The predicted label $\hat{y}$ of a data point can be determined by
$$\hat{y} = \begin{cases} 1 & \text{if } w^T x + b \geq 0 \\ -1 & \text{if } w^T x + b < 0 \end{cases}$$
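As a small numerical illustration of the distance and decision rule, the sketch below uses hypothetical values $w = (3, 4)$ and $b = -5$, chosen only because $\|w\| = 5$ keeps the arithmetic readable. They are not taken from the article.

```python
import numpy as np

# Hypothetical hyperplane parameters, chosen for easy arithmetic (||w|| = 5)
w = np.array([3.0, 4.0])
b = -5.0


def signed_distance(x):
    """d = (w^T x + b) / ||w||; the sign indicates the side of the hyperplane."""
    return (w @ x + b) / np.linalg.norm(w)


def predict(x):
    """Return +1 if w^T x + b >= 0, otherwise -1."""
    return 1 if w @ x + b >= 0 else -1


x_pos = np.array([3.0, 4.0])   # w^T x + b = 20, so d = 4
x_neg = np.array([0.0, 0.0])   # w^T x + b = -5, so d = -1
```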
For a linearly separable dataset, the goal is to find the hyperplane that maximizes the margin between the two classes while ensuring that all data points are correctly classified. This leads to the following optimization problem:
$$\min_{w, b} \ \frac{1}{2} \|w\|^2$$
subject to the constraint
$$y_i (w^T x_i + b) \geq 1 \quad \text{for } i = 1, 2, 3, \ldots, m$$
where
  • $y_i$ is the class label (+1 or −1) for each training instance;
  • $x_i$ is the feature vector for the i-th training instance;
  • $m$ is the total number of training instances.
The condition $y_i (w^T x_i + b) \geq 1$ ensures that each data point is correctly classified and lies outside the margin.
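The constraint can be checked numerically for a candidate hyperplane. The sketch below uses a hypothetical four-point dataset and a hand-picked pair $(w, b)$. It only verifies feasibility and evaluates the objective, and it does not solve the optimization problem. The margin width $2 / \|w\|$ is the standard geometric interpretation of minimizing $\|w\|$.

```python
import numpy as np

# Hypothetical, linearly separable toy data with labels in {+1, -1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

# Hand-picked candidate hyperplane (not produced by a solver)
w = np.array([0.5, 0.5])
b = -1.0

functional_margins = y * (X @ w + b)           # y_i (w^T x_i + b) per point
feasible = bool(np.all(functional_margins >= 1))
objective = 0.5 * float(w @ w)                 # (1/2) ||w||^2
margin_width = 2.0 / np.linalg.norm(w)         # distance between the two margin planes
```

Points whose functional margin equals exactly 1, here (2, 2) and (0, 0), lie on the margin boundaries and play the role of support vectors.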

2.3. Related Work

Cross-site scripting (XSS) attack detection has been the subject of extensive research based on various machine and deep learning approaches. Many innovative approaches have been proposed to classify malicious scripts, including support vector machines (SVMs), convolutional neural networks (CNNs), and hybrid approaches. This section examines a representative sample of previous research on using these techniques to effectively detect XSS attacks. The techniques are classified according to novelty and innovation.

2.3.1. Detection of XSS Attacks Using Convolutional Neural Network

The article [17] explores a machine learning-based approach to detecting XSS attacks in web applications. The authors compare several algorithms, including SVM (Support Vector Machine) and CNN (Convolutional Neural Network), using a real dataset containing 138,569 samples (100,000 benign and 38,569 malicious), divided into URL, HTML and JavaScript features. The SVM uses an RBF kernel and achieves 98.53% precision, 97.84% recall and 98.52% F-measure. The CNN, on the other hand, is based on an architecture using ReLU activations, max-pooling and a cross-entropy loss function. It achieves 98.82% precision, 98.07% recall and 98.81% F-score. Waheed, Raed Gaata, Methaq (2021) [18] propose a detection method combining a convolutional neural network (CNN) and long short-term memory (LSTM) networks. After data pre-processing and training, the model detects XSS attacks with 99.4% accuracy. Yan, H., Feng, L., Yu, Y., Liao, W., Feng, L., Zhang, J., Liu, D., Zou, Y., Liu, C., Qu, L., and Zhang, X. [19] propose a convolutional neural network model (MRBN-CNN) based on a modified version of ResNet (residual network) and the NiN (network in network) model. The approach is based on pre-processing URLs according to the syntax and semantics of XSS scripts, enhancing ResNet’s residual blocks and replacing the fully connected layer with a 1 × 1 convolution. This model achieves 99.23% accuracy, 99.94% precision and 98.53% recall. Kumar et al. (2022) [20] propose a method for classifying XSS attacks based on a convolutional neural network (CNN). The dataset comprises 17,750 samples, including both malicious and benign scripts. Script obfuscation was first eliminated. All characters in the XSS scripts were then converted to their ASCII representation, with non-ASCII characters removed. To standardize the feature matrix, it was normalized by division by 128.
The resulting vector was transformed into a two-dimensional matrix of size [120,120], which was then used as input for the CNN. The experimental results show that the CNN model achieves high accuracy, around 98%, in detecting malicious scripts, confirming the relevance of convolutional architectures for the automatic analysis of the textual content of XSS attacks. Wei N, Xie B, Zhang J, et al. [21] proposed a method for detecting XSS attacks by representing queries as ASCII vectors, which are then analyzed using a convolutional neural network (CNN). This approach avoids manual feature extraction while remaining effective against a variety of XSS attacks, including previously unknown vectors. On a dedicated dataset, it achieves 98.0% precision, 97.7% recall, and an F1-score of 97.9%. Lente et al. (2021) [22] improved a tool for detecting XSS attacks by combining a convolutional neural network (CNN) with a Long Short-Term Memory (LSTM) model. This hybrid architecture aims to capture both local features (via CNN) and sequential dependencies (via LSTM) in XSS scripts. The experimental results demonstrate a significant improvement in accuracy compared to the isolated use of CNN, suggesting that the integration of temporal and contextual dimensions is beneficial for identifying complex XSS attack patterns.

2.3.2. Detection of XSS Attacks Using SVM Algorithm

The study by Nunan et al. (2012) [23] proposes an automatic classification of XSS attacks by exploiting characteristics derived from web content and URLs. Naive Bayes and SVM classifiers were applied to a set of 216,054 pages, including 15,366 malicious ones. SVM achieved 99.89% accuracy and a false positive rate of 0.02% on ClueWeb09. These results surpass those of Likarish et al. (2009) [24], thanks in particular to the addition of features based on HTML/JavaScript patterns and suspicious patterns. The approach demonstrates effective generalization and reliable detection of XSS scripts. Umehara et al. (2015) [25] proposed a method for detecting XSS attacks based on the SVM classifier. The objective is to distinguish malicious queries from normal queries using feature vectors derived from the frequency of ASCII characters. The dataset includes 500 malicious URLs and 500 normal URLs. Two types of vectors (5-dimensional and 128-dimensional) were tested. With the 128-dimensional vectors, the SVM achieved an accuracy of 98.2% and an F1-score of 98.9%. B. Gogoi, T. Ahmed, and H. K. Saikia [26] propose a method for detecting Cross-Site Scripting (XSS) attacks using an SVM classifier integrated into a custom web application firewall (WAF). Their approach relies on textual data representing HTTP requests, including both malicious payloads (generated with tools such as XSStrike and XSSer) and simulated benign inputs. The text is pre-processed via tokenization and vectorized using the TF-IDF method before being fed to the SVM classifier. Two variants were tested: a linear version and a non-linear version of the SVM. The experimental results show high accuracy (97%) and an excellent compromise between precision and recall, demonstrating the effectiveness of the approach, particularly against obfuscated attacks. This study illustrates the relevance of SVM models in the automatic detection of XSS attacks, overcoming the limitations of traditional signature-based WAFs.
Nguyen et al., 2022 [27], proposed a method based on Support Vector Machine (SVM) to improve the performance of web application firewalls (WAF) by analyzing the internal characteristics of HTTP requests. Their approach is based on extracting several key attributes (query length, parameter structure, presence of special characters, etc.) in order to train an SVM classifier capable of distinguishing legitimate queries from malicious ones. Experimental results showed a high detection rate (98–99%) and a low false positive rate, demonstrating the model’s effectiveness in real-time contexts. For their part, Habibi et al [28], explored several machine learning techniques, including support vector machines (SVM), k-nearest neighbors (KNN), and the Naive Bayes (NB) classifier, with the aim of detecting cross-site scripting (XSS) attacks. These algorithms were combined with the n-gram method applied to each script feature in order to optimize detection performance. Simulation results indicate that combining SVM with n-grams offers the best accuracy, reaching 98%. Banerjee et al. (2020) [29], proposed an approach based on the analysis of URLs and JavaScript code, applying various classification algorithms such as SVM, KNN, Random Forest, and logistic regression. Their study showed that Random Forest offered the best performance, with 98% accuracy and a low false positive rate (0.34), demonstrating the effectiveness of this method for identifying malicious content. In their study, Mereani et al. [30], focused on the detection of XSS attacks. To this end, they used machine learning methods such as SVM, KNN and Random Forests to analyze HTTP requests. The method proposed by the authors allows for the creation of a set of characteristics combining the syntax of the program and its behavioral characteristics, numbering 59 in total. This work involved studying the impact of these features on improving classifier accuracy using real datasets. 
They compiled datasets from multiple sources into a balanced collection that adequately covers both malicious and benign scripts. Experiments on the test set showed that the classifiers were able to distinguish malicious scripts from benign scripts with high accuracy. The analysis also revealed that k-NN slightly outperformed SVM and Random Forest, with an accuracy of 96.79%. Mokbal et al. (2022) [31] propose an approach called NLP-SVM, which detects XSS attacks using averaged word embeddings combined with an SVM (Support Vector Machine) classifier. The method uses natural language processing (NLP) to extract the semantics of malicious payloads through word vectors, which allows the contextual meaning to be captured even in the presence of obfuscation techniques. Unlike traditional word-based approaches, the vectors are generated at the payload level. The model was validated using 10-fold cross-validation and a hold-out data split. The results show effective detection with low false positive and false negative rates, robust precision, recall, and F1-score, and better performance than eight other known algorithms on the same dataset.

2.3.3. Detection of XSS Attacks Using a Hybrid CNN and Machine Learning Approach

In the work carried out by S. Abhishek, R. Ravindran, A. T and S. V [32], the authors propose a hybrid framework combining convolutional neural networks (CNN) and machine learning techniques to detect XSS attacks. The authors use annotated web queries to train their model. The results show a classification accuracy of over 99.9%, even when faced with obfuscated scripts. The model has certain limitations, including its dependence on large volumes of annotated data and the need for adaptations for effective deployment in real-world environments.

2.3.4. Recapitulation: Study of XSS Detection Methods

Table 1, below, compares the different approaches used by other authors to detect Cross-Site Scripting (XSS) attacks. It highlights the diversity of techniques and classification models used in the literature.

3. Proposed Approach

3.1. Diagram of Approach

To illustrate and clarify the approach adopted in this work, Figure 1 presents a detailed diagram of the steps involved in the proposed XSS detection method, from the pre-processing stage to the final classification. The entire process revolves around a single technique: the direct mapping between the ASCII value of a character and a visual value, which is assigned to a pixel in an image of a given resolution. This technique is commonly used in visual analysis. The resulting images are analyzed using a convolutional neural network to extract features, which are then subjected to supervised classification by an SVM.

3.2. Algorithm Description

Algorithm 1 below illustrates the different steps, from data preparation to the final classification decision. This description provides a comprehensive overview of how the model works.
Step 1:
Image Conversion: Each text payload is transformed into a 64 × 64 grayscale image via ASCII encoding and normalization.
Step 2:
CNN Architecture: A three-layer convolutional neural network is constructed with batch normalization and dropout regularization to prevent overfitting.
Step 3:
CNN Training: The CNN is trained using Adam optimizer with binary cross-entropy loss for 30 epochs.
Step 4:
Feature Extraction: 256-dimensional feature vectors are extracted from the penultimate dense layer.
Step 5:
Feature Normalization: Extracted features undergo Z-score normalization to ensure zero mean and unit variance.
Step 6:
SVM Optimization: A support vector machine classifier is optimized through grid search with 3-fold cross-validation.
Step 7:
Hybrid Model: The final model combines CNN feature extraction with SVM classification.
Step 8:
Performance Evaluation: Standard metrics (Accuracy, Precision, Recall, F1-Score, AUC) are computed.
Algorithm 1: XSS detection with CNN + SVM
  • Input: Text payloads P, labels Y ∈ {0, 1}
  • Output:
  •      H: trained model (SVM)
  •      M: evaluation metrics
  •      Ŷ: predictions for input P (XSS/Non-XSS)
  • # Step 1: Convert text to ASCII images, resize, normalize
  • for p ∈ P do
  •        I ← AsciiToImage(p)
  •        I_r ← Resize(I, 64 × 64)
  •        I_n ← I_r / 255
  • end for
  • # Step 2: CNN Architecture
  • CNN ← BuildCNN
  • # Step 3: CNN Training
  • CNN.Train(I_n,train, Y_train)
  • # Step 4: Feature Extraction
  • F ← CNN.ExtractFeatures(I_n)
  • # Step 5: Feature scaling
  • F_s ← StandardScaler(F)
  • # Step 6: Train SVM with hyperparameter optimization
  • H ← GridSearchCV(SVM, F_s, Y)
  • # Step 7: Make predictions
  • Ŷ ← H.Predict(F_s)
  • # Step 8: Evaluate model
  • M ← Evaluate(H, I_n,test, Y_test)
  • return H, M, Ŷ
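Steps 5 to 7 of Algorithm 1 can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' code: the 256-dimensional features are synthetic Gaussian stand-ins for the CNN outputs, and the parameter grid is hypothetical because the article does not list the values searched. Only the 3-fold cross-validation comes from the text.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the 256-dimensional CNN features (assumption)
n_per_class = 200
F = np.vstack([
    rng.normal(0.0, 1.0, size=(n_per_class, 256)),   # class 0: benign
    rng.normal(0.5, 1.0, size=(n_per_class, 256)),   # class 1: XSS
])
Y = np.array([0] * n_per_class + [1] * n_per_class)

F_train, F_test, Y_train, Y_test = train_test_split(
    F, Y, test_size=0.2, stratify=Y, random_state=0
)

# Step 5 (Z-score scaling) and Step 6 (SVM) combined in one pipeline
pipeline = make_pipeline(StandardScaler(), SVC())

# Hypothetical search grid; the article does not state the values searched
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(F_train, Y_train)

# Step 7: predictions on held-out features
Y_pred = search.predict(F_test)
test_accuracy = search.score(F_test, Y_test)
```

Placing the scaler inside the pipeline is a design choice of this sketch: the mean and variance are re-estimated on the training part of each fold, so no statistics from validation or test data leak into the grid search.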

3.3. Dataset

3.3.1. Dataset Description

This work uses a public dataset dedicated to XSS attacks, which is available on Kaggle [33]. The CSV file contains 13,686 examples of HTML code, 7373 of which contain XSS vulnerabilities and 6313 of which do not. The file contains two columns:
  • Sentence: A string of HTML or JavaScript text representing either normal web content or an attempted malicious injection.
  • Label: An integer (0 or 1) indicating the nature of the content: 0 for non-malicious content and 1 for malicious content (XSS attack).
The model will be trained on 80% of the data, with the remaining 20% allocated to the validation phase.
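The class balance and the 80/20 split can be illustrated on the label counts alone. The sketch below assumes a stratified split, which the article does not confirm, and uses a hypothetical helper `stratified_split`. The label vector is rebuilt from the reported counts rather than read from the CSV file.

```python
import numpy as np


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so that each class contributes the same fraction to the test set."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = int(round(test_fraction * idx.size))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)


# Label vector rebuilt from the reported counts: 7373 malicious, 6313 benign
labels = np.array([1] * 7373 + [0] * 6313)
malicious_share = labels.mean()                  # about 0.539

train_idx, test_idx = stratified_split(labels)
```

Under these assumptions the split yields 10,948 training samples and 2738 held-out samples, with about 54% malicious samples in each part.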

3.3.2. Data Preprocessing

Automatic detection of XSS attacks often relies on analyzing HTML content, where malicious code can be inserted. This method involves converting HTML content into an image: each character is converted to its ASCII value, which is then represented by a grayscale level. This process produces a pixel matrix that forms an image, which can later be used to distinguish normal (legitimate) scripts from malicious ones.

3.4. Text-to-ASCII Conversion for Image Reconstruction

This dataset has been split into two CSV files. The first file contains excerpts from source code demonstrating XSS vulnerabilities, while the second file contains clean HTML code. Each file contains a single column titled ‘Sentence’. The aim is to create a visual representation of the text data in the two CSV files and generate images measuring 64 × 64 pixels. First, each character in the text strings was converted to its ASCII value, and then arranged in a fixed-size array. This array was subsequently converted to grayscale pixels. If the sequence is shorter than expected, it is padded with zeros; otherwise, it is truncated. These images were then organized into two subdirectories named ‘xss’ and ‘clean’, located within a main directory called dataset. Figure 2 illustrates an example of this conversion, where each character is converted to its corresponding ASCII value. This process allows a string of characters to be converted to a numeric string while preserving the syntactic order of the code. The resulting ASCII values are then interpreted directly as gray pixel intensities.
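The conversion described above can be sketched in a few lines of NumPy. The function name `payload_to_image`, the row-major pixel ordering, and the removal of non-ASCII characters are assumptions of this sketch. Writing the images into the 'xss' and 'clean' subdirectories is omitted.

```python
import numpy as np


def payload_to_image(text, side=64):
    """Convert a string into a side x side grayscale image of ASCII codes."""
    # Assumption: characters outside the 7-bit ASCII range are dropped
    codes = [ord(ch) for ch in text if ord(ch) < 128]
    flat = np.zeros(side * side, dtype=np.uint8)   # zero padding by default
    n = min(len(codes), side * side)               # truncate overly long sequences
    flat[:n] = codes[:n]
    return flat.reshape(side, side)                # row-major order keeps the code order


image = payload_to_image("<img src=x onerror=alert(1)>")
normalized = image.astype(np.float32) / 255.0      # scale to [0, 1] before the CNN

long_image = payload_to_image("a" * 5000)          # longer than 4096 characters
```

Because ASCII codes never exceed 127, the normalized pixel values stay at or below 0.5 when dividing by 255.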

3.5. Architecture of the CNN Model

The convolutional neural network (CNN) model is designed to extract image features. It starts with a convolutional layer containing 32 filters of size 3 × 3 and an IncreaseReLU activation function. The authors of [34] proposed IncreaseReLU to address the “dead neuron” problem by doubling the slope for positive and negative values, which increases the gradient.
In this work, we use this function with some modifications to maintain nonlinearity while reducing the risk of dead neurons.
The mathematical formula for this activation function is
$$f(x) = \begin{cases} 1.5\,x, & \text{if } x > 0 \\ 0.15\,x, & \text{if } x \le 0 \end{cases}$$
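As an illustration, the piecewise activation (slope 1.5 for positive inputs, slope 0.15 otherwise) can be written as a NumPy function. The name `modified_increase_relu` and the keyword arguments are hypothetical and chosen only for this sketch.

```python
import numpy as np

def modified_increase_relu(x, pos_slope=1.5, neg_slope=0.15):
    """Piecewise-linear activation: pos_slope * x for x > 0, neg_slope * x otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, pos_slope * x, neg_slope * x)

print(modified_increase_relu([-2.0, 0.0, 2.0]))
```

Because the negative branch has a non-zero slope, the gradient never vanishes entirely for negative inputs, which is the property that mitigates dead neurons.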
This is followed by max pooling (2 × 2), batch normalization and a 25% dropout rate to reduce overfitting. This structure is then repeated with deeper convolutions (first 64, then 128 filters), with each block retaining the same sequence of convolution, pooling, batch normalization and dropout. Once the features have been extracted, they are flattened via a flatten layer and passed to a dense layer of 256 neurons called ‘dense_feature’, which is regularized with an L2 penalty. Finally, normalization and 50% dropout are applied. This model is primarily intended for use as a feature extractor. Figure 3 shows the CNN model used for feature extraction.
The schematic overview in Figure 3 illustrates the data flow through the convolutional blocks and the classifier. For precise reproducibility, Table 2 provides the complete technical specifications of each layer. The model accepts grayscale images of size 64 × 64 pixels as input (64 × 64 × 1). All convolutional layers employ 3 × 3 kernels with padding=‘same’ to maintain spatial dimensions before downsampling via 2 × 2 max pooling. Critically, batch normalization is consistently applied after the IncreaseReLU activation (integrated within the Conv2D layers) and the pooling operation, following the sequence: Conv2D (activation) → MaxPooling → BatchNormalization → Dropout. This configuration, along with the L2-regularized dense layer (λ = 0.01), forms the complete architecture implemented in TensorFlow/Keras.
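To make the dimensions concrete, the following sketch traces the output shape and the number of convolutional and dense weights through the three blocks. It is derived only from the description in the text (3 × 3 kernels, padding='same', 2 × 2 pooling, filters 32/64/128, a 256-unit dense layer), not from Table 2. Batch normalization parameters are deliberately left out, and the helper name `trace_feature_extractor` is hypothetical.

```python
def trace_feature_extractor(side=64, channels=1, filters=(32, 64, 128), dense_units=256):
    """Return (layer name, output shape, weight count) rows for the conv blocks and dense layer."""
    rows = []
    for i, f in enumerate(filters, start=1):
        conv_params = 3 * 3 * channels * f + f  # 3x3 kernels plus one bias per filter
        rows.append((f"conv{i}", (side, side, f), conv_params))  # padding='same' keeps size
        side //= 2  # 2x2 max pooling halves each spatial dimension
        rows.append((f"pool{i}", (side, side, f), 0))
        channels = f
    flat = side * side * channels
    rows.append(("flatten", (flat,), 0))
    rows.append(("dense_feature", (dense_units,), flat * dense_units + dense_units))
    return rows

for name, shape, params in trace_feature_extractor():
    print(f"{name:14s} {str(shape):16s} {params}")
```

Under these assumptions the spatial size shrinks from 64 to 8 over three pooling steps, so the flatten layer yields 8 × 8 × 128 = 8192 values, and the dense layer holds most of the weights.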

3.6. SVM Classification

In this approach, the SVM (Support Vector Machine) was chosen for several reasons, notably its strong performance in classifying high-dimensional data, especially in the case of images. It offers good generalization ability, limits overfitting, and remains effective even with limited training data. The SVM thus stands out as a relevant choice: it is robust, theoretically well-founded, and relatively simple to implement. Hyperparameter optimization (kernel type, C, gamma) is essential to adapt the model to the data and is often performed via grid search.

4. Experiments and Results

In this section, we present the experimental part, starting with a description of the dataset used, as well as the tools implemented for the model’s development, notably the Keras library in Python 3.8. The metrics used to evaluate the model’s effectiveness and performance, such as accuracy and the confusion matrix, are detailed. Finally, the results obtained are presented, accompanied by a data visualization using the t-SNE technique.

4.1. Performance Metrics

To evaluate this model, the following performance indicators were used: accuracy, recall, precision, and F1-score, as well as metrics derived from ROC (Receiver Operating Characteristic) curve analysis, notably the AUC (Area Under the Curve). A brief description of each performance measure follows.

4.1.1. Accuracy

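Accuracy is the proportion of HTTP web requests classified correctly, where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$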

4.1.2. Recall

Recall measures the model’s ability to detect all positive instances, i.e., the proportion of XSS attacks detected by the model among all XSS attacks contained in the dataset:
$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$

4.1.3. Precision

Precision is the proportion of XSS attacks correctly classified by the model among the HTTP web requests detected as XSS attacks:
$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$

4.1.4. F1-Score

The F1-score offers a compromise between precision and recall:
$$\mathrm{F1\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}.$$

4.1.5. AUC (Area Under the Curve) of the ROC Curve

The ROC (Receiver Operating Characteristic) curve plots the true positive rate (TPR) against the false positive rate (FPR):
$$\mathrm{TPR} = \frac{TP}{TP + FN},$$
$$\mathrm{FPR} = \frac{FP}{FP + TN}.$$
The AUC measures the area under this curve. The closer it is to 1, the better the model.
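A minimal sketch of how the ROC points and the AUC can be obtained from classifier scores is given below. It sweeps the decision threshold over the sorted scores and integrates with the trapezoidal rule. The function name `roc_auc` is hypothetical, tied scores are not handled specially, and the four-sample input is a toy example rather than data from this study.

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC points and AUC obtained by sweeping a threshold over descending scores."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float))
    y_sorted = y_true[order]
    tp = np.cumsum(y_sorted == 1)  # true positives accepted so far
    fp = np.cumsum(y_sorted == 0)  # false positives accepted so far
    tpr = np.concatenate(([0.0], tp / tp[-1]))
    fpr = np.concatenate(([0.0], fp / fp[-1]))
    # Trapezoidal rule over the (FPR, TPR) points.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc

fpr, tpr, auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```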

4.1.6. K-Fold Cross-Validation

K-fold cross-validation is a method for evaluating the performance and generalization ability of machine learning models by dividing the dataset into K subsets of equal size (folds). In each iteration, a model is trained using K − 1 folds, while the remaining fold is used to test the model. This process is repeated K times, with each fold being used as the test set exactly once. The performance of the model is then averaged over all K iterations to provide a robust estimate of its generalization ability. Figure 4 illustrates the division of the dataset into K training and test subsets. With K = 5, as used in this work, the overall performance is the mean of the per-fold scores:
$$\mathrm{Performance} = \frac{1}{5} \sum_{i=1}^{5} \mathrm{Performance}_i$$
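The fold construction and the averaging step can be sketched as follows. The helpers `kfold_indices` and `cross_validate` are hypothetical names, the shuffling seed is arbitrary, and `score_fn` stands in for whatever routine trains the pipeline on the training indices and returns a score on the test indices.

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

def cross_validate(score_fn, n_samples, k=5):
    """Average a user-supplied score_fn(train_idx, test_idx) over k folds."""
    scores = [score_fn(tr, te) for tr, te in kfold_indices(n_samples, k)]
    return float(np.mean(scores)), scores

# Placeholder score function: reports the fraction of samples held out per fold.
mean_score, per_fold = cross_validate(lambda tr, te: len(te) / 13686, 13686, k=5)
print(mean_score, per_fold)
```

When the dataset size is not divisible by K, as with 13,686 samples and K = 5, the folds differ in size by at most one sample.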

4.2. Results

The main performance metrics of the model, broken down by class, are presented in Table 3. The model attains an average accuracy of 0.9969, demonstrating excellent overall performance across both classes (benign code and XSS). The AUC reaches 0.99, indicating a strong ability to distinguish between the classes.
Performance is improved by combining a convolutional neural network (CNN), which is used as a feature extractor, with a support vector machine (SVM), which acts as the final classifier. This hybrid architecture capitalizes on the strengths of both approaches, combining the CNN’s ability to identify intricate, hierarchical patterns in the data with the SVM’s capacity to distinguish between classes in high-dimensional spaces. Together, they demonstrate robust generalization capabilities on unseen data, significantly enhancing the system’s overall performance.
The “overall accuracy” shown in the last row represents the global performance of the classifier across all classes. It is computed as the ratio between the number of correct predictions and the total number of instances in the test set.
To ensure rigorous evaluation, we performed 5-fold cross-validation on the entire CNN + SVM pipeline. The accuracy per fold for the CNN + SVM model is shown in Table 4. These results show that using an SVM classifier on the features extracted from the CNN consistently improves performance across folds, demonstrating the robustness and effectiveness of the proposed pipeline.
To evaluate the contribution of the SVM classifier, the CNN was first used for both feature extraction and classification, with a sigmoid activation function performing the binary classification, and then compared to the proposed CNN + SVM pipeline, both using identical extracted features and the same dataset. The results show that replacing the sigmoid layer with the SVM classifier leads to a consistent improvement in performance, confirming the effectiveness of the proposed hybrid structure. Table 5 presents a comparison between the results obtained using the two approaches.
Although the CNN with a sigmoid classifier achieved slightly higher precision, the proposed CNN + SVM model consistently outperformed it in terms of accuracy, recall, F1-score, and AUC. This indicates a better overall balance between false positives and false negatives. We also observe an improvement over the study [22], whose XSS attack detection method (3C-LSTM) combines a convolutional neural network (CNN) with a long short-term memory (LSTM) model and achieved an accuracy of 99.4%. Likewise, our model is more accurate than the approach of [19], which proposes a convolutional neural network based on a modified ResNet block and NiN model (MRBN-CNN) for detecting cross-site scripting (XSS) attacks and achieved an accuracy of 99.23%.

4.3. SVM Hyperparameter Optimization

The SVM baseline was optimized using a grid search with 3-fold cross-validation. The search space included: (i) an RBF kernel with $C \in \{1, 10, 100\}$ and $\gamma \in \{0.01, 0.1, 1\}$; and (ii) a linear kernel with $C \in \{0.1, 1, 10\}$. The best SVM hyperparameters obtained from the grid search were: {‘C’: 10, ‘gamma’: 0.01, ‘kernel’: ‘rbf’}.
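The search space above maps directly onto scikit-learn's `GridSearchCV`. The sketch below is illustrative only: it runs on small synthetic data from `make_classification` instead of the CNN feature vectors, and the choice of accuracy as the scoring metric is an assumption, so the selected hyperparameters will not necessarily match those reported.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the CNN feature vectors (assumption: in the real
# pipeline these would be the activations of the 'dense_feature' layer).
X, y = make_classification(n_samples=300, n_features=32, n_informative=8, random_state=0)

# Two sub-grids, one per kernel, as described in the text.
param_grid = [
    {"kernel": ["rbf"], "C": [1, 10, 100], "gamma": [0.01, 0.1, 1]},
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
]

search = GridSearchCV(SVC(), param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

The grid contains 9 RBF and 3 linear configurations, so 12 candidates are each evaluated on 3 folds.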

4.4. Accuracy Results

The graph in Figure 5 illustrates the evolution of accuracy on the training and validation sets over time. This curve highlights three key aspects of the model: rapid convergence, stable performance, and a notable absence of overfitting.
The model was trained for 30 epochs, as shown in the training curve. At epoch 5, the validation accuracy is approximately 99.1%; at epoch 10, it is approximately 99.3%. These epochs are used as reference points to illustrate the evolution of the model’s performance during training.
The stabilization phase begins at around epoch 5, after which the training and validation accuracy curves remain high, approaching a value of 1.0, which indicates good generalization ability. Finally, there is no sign of overfitting, as the validation accuracy does not decrease over time. The model performs very well from the early epochs, demonstrating its ability to generalize effectively without performance degradation.

4.5. t-SNE Visualization

According to Cai and Ma [35], t-SNE is defined as: “an iterative algorithm for visualizing high-dimensional data by mapping the data points to a two- or finite-dimensional space. It creates a single map that reveals the intrinsic structures in a high-dimensional dataset, including trends, patterns, and outliers, through a non-linear dimension reduction technique”. Here, the t-SNE algorithm is applied to the feature vectors extracted by the convolutional neural network (CNN), and each point represents one sample projected from the higher-dimensional feature space into two dimensions. The colors of the points indicate which of the following two categories they belong to:
  • Class 0: Normal code, shown in blue.
  • Class 1: XSS (malicious) code, shown in red.
As shown in Figure 6, there is a clear separation between the two classes, which form two distinct groups. This indicates that the representations learned by the CNN allow for good discrimination between normal data and XSS attacks. The data for each category tends to cluster together, indicating that the model has successfully extracted the relevant characteristics specific to each type of behavior. Some points appear in the transition areas between the two groups, which may correspond to borderline or ambiguous cases that may be misclassified by the model (false positives or false negatives). In conclusion, this t-SNE visualization highlights the model’s ability to effectively separate the two classes based on the features it extracts, which is consistent with its strong classification performance.
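A projection of this kind can be produced with scikit-learn's `TSNE`. The sketch below uses two synthetic Gaussian clusters as stand-ins for the benign and XSS feature vectors. The 256-dimensional feature size follows the dense layer described earlier, while the perplexity, initialization and random seed are assumptions rather than the settings used for Figure 6.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-ins for CNN feature vectors: class 0 (benign) and class 1 (XSS).
features = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 256)),
    rng.normal(3.0, 1.0, size=(100, 256)),
])
labels = np.array([0] * 100 + [1] * 100)

# Project the 256-dimensional vectors to 2D; hyperparameters are illustrative.
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
embedding = tsne.fit_transform(features)
print(embedding.shape)
```

The two columns of `embedding` can then be drawn as a scatter plot, coloring each point by its entry in `labels` (blue for class 0, red for class 1).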

4.6. Confusion Matrix

The confusion matrix is a fundamental tool for evaluating the performance of classification models, especially in the context of detecting XSS (Cross-Site Scripting) attacks. It offers a clear visualization of the model’s classification outcomes by distinguishing between true positives, true negatives, false positives, and false negatives. For a binary classification task like XSS detection, the two classes are “XSS Attack” and “Non-XSS”.
Figure 7 shows the confusion matrix, which illustrates how effective this classification model is at distinguishing between harmless HTML entries and HTML entries containing XSS (Cross-Site Scripting) attacks. The model accurately identified 1257 benign entries (true negatives) and 1470 XSS entries (true positives), with only 5 false positives (benign HTML incorrectly reported as malicious) and 4 false negatives (XSS attacks not detected). These results correspond to a precision of 99.66%, a recall of 99.73%, and an overall accuracy of 99.65%, highlighting the model’s excellent performance in detecting malicious code injections while minimizing errors. This high level of reliability is essential for protecting web applications from XSS attacks without interfering with legitimate HTML content.
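As a worked check, the metrics can be recomputed directly from the four counts of the confusion matrix. The helper `binary_metrics` is a hypothetical name used for this sketch. With these counts the precision, recall and F1-score round to 99.66%, 99.73% and 99.69%, while the accuracy comes out at about 99.67%, marginally above the 99.65% quoted in the text. The source does not explain this small gap, which may be due to rounding.

```python
def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "fpr": fp / (fp + tn),
    }

# Counts reported for the test set: 1470 TP, 1257 TN, 5 FP, 4 FN.
metrics = binary_metrics(tp=1470, tn=1257, fp=5, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```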

4.7. Computational Cost and Real-Time Inference Performance

Table 6 reports the inference-time performance of the proposed system. The CNN-based feature extraction stage requires approximately 7–8 ms per image, while the SVM classifier adds an additional 2–5 ms. The SVM operates on individual feature vectors and therefore does not involve batch processing; consequently, batch-level timing is not applicable for this component. Overall, the end-to-end inference latency ranges between 10 and 13 ms per image.
Inference Time is the critical metric for real-time deployment. All measurements were conducted on a development machine equipped with an Intel® UHD Graphics GPU for CNN inference and an Intel® Core i5-1235U (12th Gen) CPU for SVM classification. The resulting average inference time of approximately 11 ms corresponds to a throughput of around 90 FPS, which comfortably exceeds standard real-time video processing requirements (25–30 FPS).
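The relation between per-image latency and throughput is simple arithmetic, and a small timing harness shows one way such latencies could be measured. Both helper names are hypothetical, and the harness times an arbitrary callable rather than the actual CNN or SVM.

```python
import statistics
import time

def latency_to_fps(latency_ms):
    """Convert a per-image latency in milliseconds to images per second."""
    return 1000.0 / latency_ms

def time_per_call_ms(fn, sample, repeats=50, warmup=5):
    """Median wall-clock time of fn(sample) in milliseconds, after a short warm-up."""
    for _ in range(warmup):
        fn(sample)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)

print(round(latency_to_fps(11.0), 1))  # 90.9 images per second at 11 ms per image
print(time_per_call_ms(sum, list(range(1000))))
```

At the upper end of the reported range (13 ms per image) the same conversion gives roughly 77 FPS, which still exceeds the 25–30 FPS reference.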

5. Discussion

The security of web applications relies in particular on the ability to detect malicious code injections, such as XSS (Cross-Site Scripting) attacks. In this research, a binary classification model is trained to differentiate legitimate HTML code fragments from those containing potentially malicious scripts. The evaluation of the model’s performance is based on the analysis of the confusion matrix, as well as on standard metrics such as precision, recall, and the F1-score. The obtained confusion matrix highlights the excellent performance of the model in classifying HTML data into two categories: benign inputs and those containing malicious XSS scripts. Out of a total of 2736 instances, only 9 were misclassified, corresponding to an overall accuracy of 99.65%. The precision for the XSS class is 99.66%, while the recall reaches 99.73%, indicating a very low proportion of false positives and false negatives, with an F1-score of 99.69%. In comparison to the study [36], which employed the same dataset and applied machine learning models (including SVM, Naive Bayes, Decision Tree and Random Forest) to categorize XSS attacks, our hybrid approach exhibits a substantial enhancement in performance across all evaluation metrics. The accuracy, recall and F1-score values obtained are notably higher than those reported in that article, reflecting the enhanced classification capability and robustness of the proposed model. These results show that the model can reliably distinguish malicious data from benign data.
As shown in Table 7, the combined use of a CNN and an SVM offers a clear improvement over the SVM model used in [36]. Indeed, automatic feature extraction by the CNN provides much richer and more discriminating representations than the descriptors explicitly used in a conventional SVM. This results in particular in a higher recall (0.9973 vs. 0.9958), confirming a better ability to correctly identify positive samples and reduce false negatives. In addition, our model maintains very high precision (0.9966), indicating that the CNN enhances the robustness of the classifier without compromising its selectivity. Overall, this CNN + SVM combination offers a much better balance between precision, recall, and F1-score, demonstrating the superiority of our approach for classification based on complex data.
These results were confirmed by the comparison with the sigmoid-based CNN, as shown in Table 5 of the Experiments and Results section. The high precision achieved by the sigmoid-based CNN indicates a more conservative decision boundary, favoring the reduction in false positives. While this behavior may be desirable in scenarios where excessive alerts are costly, it inherently leads to lower recall and an increased risk of missing true threats. In contrast, the SVM-based classifier enhances feature discrimination, resulting in increased recall and F1-score. This property is particularly important in security-oriented applications, where ignoring harmful events can have serious consequences. Therefore, the CNN + SVM pipeline provides a more robust and operationally reliable framework for detection.
The advantages of this model lie in its independence from the HTML format, eliminating the need to explicitly extract or analyze HTML structure, which is often variable and subject to obfuscation techniques. By converting text into images, it leverages the capabilities of convolutional neural networks (CNN), which are renowned for their effectiveness in detecting complex visual patterns. Furthermore, this process provides robustness in the face of injection variations, as XSS attacks often employ masking or encoding techniques that are difficult to detect using methods based solely on textual analysis. The model’s primary advantage lies in its remarkable accuracy, characterized by a very low error rate. It successfully balances attack detection (sensitivity) and preservation of legitimate inputs (specificity), a crucial aspect in application security. Moreover, the model demonstrates strong generalization capabilities on test data, reflecting its robustness against the syntactic diversity of HTML inputs.
Despite its excellent performance, the model has certain limitations. First, there is a moderate class imbalance, with a slight overrepresentation of XSS examples (1474 XSS samples compared to 1262 benign samples in the test set, according to the confusion matrix). Although this did not negatively affect overall performance, a more pronounced imbalance could undermine the model’s stability. Additionally, while errors are few, they are critical in this context: a false negative (an undetected attack) can lead to exploitable security vulnerabilities. The results observed are generally superior to those obtained by traditional models such as logistic regression, linear SVMs or simple decision trees, which often struggle to capture the complex structures of HTML inputs. On the other hand, a comparison with more advanced architectures, such as recurrent neural networks (RNN, LSTM) or pre-trained models (e.g., BERT applied to HTML text), would more accurately highlight the limitations and relative performance of the proposed solution.

6. Limitations of the Study

The limitations of this study can be summarized as follows.
Firstly, the dataset used, which comprises 13,686 samples from a single public source, may not fully reflect the diversity and complexity of XSS attacks that occur in real-world environments.
Secondly, the features were extracted using a CNN, which proved to be very effective, but the integration of other complementary methods could further enrich the extracted features and provide a more accurate understanding of XSS attack patterns.
Thirdly, the evaluation of the models was performed in a stable, offline environment. However, real-world XSS attacks are dynamic and influenced by contextual factors that may affect the actual performance of the models when deployed in production.
Furthermore, the study was limited to a single combination (CNN + SVM), which demonstrated the effectiveness of the CNN in feature extraction and of the SVM in classification. This does not preclude further studies combining the CNN with other commonly used classifiers.
Finally, during this work, the hyperparameters were optimized using grid search. Other optimization techniques, such as random search or Bayesian optimization, could be used to further improve the models and enhance their overall stability.

7. Conclusions

This study introduces a hybrid CNN–SVM approach for detecting XSS attacks, based on converting HTML source code into a visual representation through ASCII encoding. The experimental results demonstrate a significant improvement in model performance compared to other approaches, as well as the effective detection of vulnerable source code even in obfuscated cases, thanks to the powerful feature extraction capabilities of deep neural networks combined with the efficiency of SVM on well-structured vector representations. Analysis of the confusion matrix reveals that the proposed classification model is particularly effective at detecting malicious HTML scripts, specifically those associated with XSS attacks. The model demonstrates an exceptionally low error rate and strong generalization ability while effectively minimizing false positives. These promising results suggest the potential for its integration into an automated vulnerability detection system, contingent upon extending the training dataset to encompass more complex and diverse attack patterns.

Author Contributions

Conceptualization, A.A., A.J. and L.L.; Methodology, Investigation, analysis, A.A., L.L., A.J. and H.T.; Resources, A.A., A.J. and L.L.; Writing—original draft preparation, A.A., L.L. and A.J.; Writing—review and editing, A.A., A.J., L.L. and H.T.; Correspondence, A.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding, and the APC was not funded by any external organization.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are publicly available on the Kaggle platform.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SVM: Support Vector Machine
CNN: Convolutional Neural Network
XSS: Cross-Site Scripting
OWASP: Open Web Application Security Project
MRBN: Modified ResNet and Network-in-Network
KNN: K-Nearest Neighbors
ML: Machine Learning
LSTM: Long Short-Term Memory
DL: Deep Learning
ASCII: American Standard Code for Information Interchange
URL: Uniform Resource Locator
JS: JavaScript
t-SNE: t-distributed Stochastic Neighbor Embedding

References

  1. OWASP. OWASP Top Ten. 2025. Available online: https://owasp.org/www-project-top-ten/ (accessed on 1 November 2025).
  2. Ayeni, B.K.; Sahalu, J.B.; Adeyanju, K.R. Detecting Cross-Site Scripting in Web Applications Using Fuzzy Inference System. J. Comput. Netw. Commun. 2018, 2018, 8159548. [Google Scholar] [CrossRef]
  3. Sarmah, U.; Bhattacharyya, D.K.; Kalita, J.K. A Survey of Detection Methods for XSS Attacks. J. Netw. Comput. Appl. 2018, 118, 113–143. [Google Scholar] [CrossRef]
  4. Minamide, Y. Static Approximation of Dynamically Generated Web Pages. In Proceedings of the 14th International Conference on World Wide Web (WWW’05), Chiba, Japan, 10–14 May 2005; ACM Press: Chiba, Japan, 2005; p. 432. [Google Scholar] [CrossRef]
  5. Doupé, A.; Cui, W.; Jakubowski, M.H.; Peinado, M.; Kruegel, C.; Vigna, G. deDacota: Toward Preventing Server-Side XSS via Automatic Code and Data Separation. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security (CCS’13), Berlin, Germany, 4–8 November 2013; ACM Press: Berlin, Germany, 2013; pp. 1205–1216. [Google Scholar] [CrossRef]
  6. Shen, J.; Cheng, X.; Yang, X.; Zhang, L.; Cheng, W.; Lin, Y. Efficient CNN Accelerator Based on Low-End FPGA with Optimized Depthwise Separable Convolutions and Squeeze-and-Excite Modules. AI 2025, 6, 244. [Google Scholar] [CrossRef]
  7. Zhang, L.; Yang, X.; Cheng, X.; Cheng, W.; Lin, Y. Few-Shot Image Classification Algorithm Based on Global–Local Feature Fusion. AI 2025, 6, 265. [Google Scholar] [CrossRef]
  8. Disawal, S.; Suman, U.; Rathore, M. Investigation of Detection and Mitigation of Web Application Vulnerabilities. Int. J. Comput. Appl. IJCA 2022, 184, 30–36. [Google Scholar] [CrossRef]
  9. Pasini, S.; Maragliano, G.; Kim, J.; Tonella, P. XSS Adversarial Attacks Based on Deep Reinforcement Learning: A Replication and Extension Study. arXiv 2025, arXiv:2502.19095. [Google Scholar] [CrossRef]
  10. Rodríguez, G.E.; Torres, J.G.; Flores, P.; Benavides, D.E. Cross-Site Scripting (XSS) Attacks and Mitigation: A Survey. Comput. Netw. 2020, 166, 106960. [Google Scholar] [CrossRef]
  11. Hartono, H.; Triloka, J. Method for Detection and Mitigation of Cross-Site Scripting Attack on Multi-Websites. In Proceedings of the International Conference on Information Technology and Business (ICITB 2021), Bandar Lampung, Indonesia, 17 November 2021; pp. 26–32. Available online: https://jurnal.darmajaya.ac.id/index.php/icitb/article/view/3037/ (accessed on 1 November 2025).
  12. Falana, O.J.; Ebo, I.O.; Tinubu, C.O.; Adejimi, O.A.; Ntuk, A. Detection of Cross-Site Scripting Attacks Using Dynamic Analysis and Fuzzy Inference System. In Proceedings of the 2020 International Conference in Mathematics, Computer Engineering and Computer Science (ICMCECS), Ayobo, Nigeria, 18–21 March 2020. [Google Scholar] [CrossRef]
  13. Salama, A.; Tarek, Z.; Darwish, Y.; Elseuofi, S.; Darwish, E.; Shams, M. Neutrosophic Encoding and Decoding Algorithm for ASCII Code System; Zenodo: Geneva, Switzerland, 2024. [Google Scholar] [CrossRef]
  14. Coumar, S.; Kingston, Z. Evaluating Machine Learning Approaches for ASCII Art Generation. arXiv 2025, arXiv:2503.14375. [Google Scholar] [CrossRef]
  15. El Sakka, M.; Ivanovici, M.; Chaari, L.; Mothe, J. A Review of CNN Applications in Smart Agriculture Using Multimodal Data. Sensors 2025, 25, 472. [Google Scholar] [CrossRef] [PubMed]
  16. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  17. Alhamyani, R.; Alshammari, M. Machine Learning-Driven Detection of Cross-Site Scripting Attacks. Information 2024, 15, 420. [Google Scholar] [CrossRef]
  18. Kadhim, R.W.; Gaata, M.T. A Hybrid of CNN and LSTM Methods for Securing Web Application Against Cross-Site Scripting Attack. Indones. J. Electr. Eng. Comput. Sci. 2020, 21, 1022–1029. [Google Scholar] [CrossRef]
  19. Yan, H.; Feng, L.; Yu, Y.; Liao, W.; Feng, L.; Zhang, J.; Liu, D.; Zou, Y.; Liu, C.; Qu, L.; et al. Cross-Site Scripting Attack Detection Based on a Modified Convolution Neural Network. Front. Comput. Neurosci. 2022, 16, 981739. [Google Scholar] [CrossRef] [PubMed]
  20. Kumar, J.; Santhanavijayan, A.; Rajendran, B. Cross Site Scripting Attacks Classification Using Convolutional Neural Network. In Proceedings of the 2022 International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India, 25–27 January 2022; pp. 1–6. [Google Scholar] [CrossRef]
  21. Wei, N.; Xie, B. Detecting SQL Injection and XSS Attacks Using ASCII Code and CNN. In Network Simulation and Evaluation; Gu, Z., Zhou, W., Zhang, J., Xu, G., Jia, Y., Eds.; Communications in Computer and Information Science; Springer Nature: Singapore, 2024; Volume 2063, pp. 33–45. ISBN 978-981-9745-18-0. [Google Scholar] [CrossRef]
  22. Lente, C.; Hirata, R., Jr.; Batista, D.M. An Improved Tool for Detection of XSS Attacks by Combining CNN with LSTM. In Proceedings of the Anais Estendidos do XXI Simpósio Brasileiro em Segurança da Informação e de Sistemas Computacionais, Florianis, Brazil, 12–15 September 2021; pp. 1–8. Available online: https://sol.sbc.org.br/index.php/sbseg_estendido/article/view/17333 (accessed on 6 November 2025).
  23. Nunan, A.E.; Souto, E.; Dos Santos, E.M.; Feitosa, E. Automatic Classification of Cross-Site Scripting in Web Pages Using Document-Based and URL-Based Features. In Proceedings of the 2012 IEEE Symposium on Computers and Communications (ISCC), Cappadocia, Turkey, 1–4 July 2012; IEEE: Cappadocia, Turkey, 2012; pp. 702–707. Available online: https://ieeexplore.ieee.org/abstract/document/6249380/ (accessed on 6 November 2025).
  24. Likarish, P.; Jung, E.; Jo, I. Obfuscated Malicious JavaScript Detection Using Classification Techniques. In Proceedings of the 2009 4th International Conference on Malicious and Unwanted Software (MALWARE), Montreal, QC, Canada, 13–14 October 2009; IEEE: Montreal, QC, Canada, 2009; pp. 47–54. Available online: https://ieeexplore.ieee.org/abstract/document/5403020/ (accessed on 6 November 2025).
  25. Umehara, A.; Matsuda, T.; Sonoda, M.; Mizuno, S.; Chao, J. Consideration on the Cross-Site Scripting Attacks Detection Using Machine Learning. IPSJ SIG Tech. Rep. 2015, 2015, 1–4. [Google Scholar]
  26. Gogoi, B.; Ahmed, T.; Saikia, H.K. Detection of XSS Attacks in Web Applications: A Machine Learning Approach. Int. J. Innov. Res. Comput. Sci. Technol. 2021, 9, 1–10. [Google Scholar] [CrossRef]
  27. Thang, N.M.; Ho, T.P.; Nam, H.T. A New Approach to Improving Web Application Firewall Performance Based on Support Vector Machine Method with Analysis of Http Request. J. Sci. Technol. Inf. Secur. 2022, 1, 62–73. [Google Scholar] [CrossRef]
  28. Habibi, G.; Surantha, N. XSS Attack Detection with Machine Learning and N-Gram Methods. In Proceedings of the 2020 International Conference on Information Management and Technology (ICIMTech), Vitual, 13–14 August 2020; IEEE: Bandung, Indonesia, 2020; pp. 516–520. Available online: https://ieeexplore.ieee.org/abstract/document/9210946/ (accessed on 6 November 2025).
  29. Banerjee, R.; Baksi, A.; Singh, N.; Bishnu, S.K. Detection of XSS in Web Applications Using Machine Learning Classifiers. In Proceedings of the 2020 4th International Conference on Electronics, Materials Engineering & Nano-Technology (IEMENTech), Kolkata, India, 2–4 October 2020; IEEE: Kolkata, India, 2020; pp. 1–5. Available online: https://ieeexplore.ieee.org/abstract/document/9270052/ (accessed on 6 November 2025).
  30. Mereani, F.A.; Howe, J.M. Detecting Cross-Site Scripting Attacks Using Machine Learning. In Proceedings of the International Conference on Advanced Machine Learning Technologies and Applications (AMLTA2018), Cairo, Egypt, 22–24 February 2018; Advances in Intelligent Systems and Computing; Hassanien, A.E., Tolba, M.F., Elhoseny, M., Mostafa, M., Eds.; Springer International Publishing: Cham, Switzerland, 2018; Volume 723, pp. 200–210. [Google Scholar] [CrossRef]
  31. Mokbal, F.M.M.; Wang, D.; Wang, X. Detect Cross-Site Scripting Attacks Using Average Word Embedding and Support Vector Machine. Int. J. Netw. Secur. 2022, 4, 20–28. [Google Scholar]
  32. Abhishek, S.; Ravindran, R.; Anjali, T.; Shriamrut, V. AI-Driven Deep Structured Learning for Cross-Site Scripting Attacks. In Proceedings of the 2023 International Conference on Innovative Data Communication Technologies and Application (ICIDCA), Uttarakhand, India, 14–16 March 2023; IEEE: Tirunelveli, India, 2023; pp. 701–709. Available online: https://ieeexplore.ieee.org/abstract/document/10099960/ (accessed on 6 November 2025).
  33. Shah, S.H.; Hussain, S.S. Cross Site Scripting (XSS) Dataset for Deep Learning. Kaggle. 11 January 2024. Available online: https://www.kaggle.com/datasets/syedsaqlainhussain/cross-site-scripting-xss-dataset-for-deep-learning/data (accessed on 6 November 2025).
  34. Afifi, H.; Hsaini, A.M.; Merras, M.; Bouazi, A.; Chana, I. Enhanced Facial Recognition Using Parametrized ReLU Activation in Convolutional Neural Networks. In Intersection of Artificial Intelligence, Data Science, and Cutting-Edge Technologies: From Concepts to Applications in Smart Environment; Farhaoui, Y., Herawan, T., Lucky Imoize, A., Allaoui, A.E., Eds.; Lecture Notes in Networks and Systems; Springer Nature: Cham, Switzerland, 2025; Volume 1397, pp. 308–314.
  35. Cai, T.T.; Ma, R. Theoretical Foundations of t-SNE for Visualizing High-Dimensional Clustered Data. J. Mach. Learn. Res. 2022, 23, 1–54.
  36. Hamzah, K.H.; Osman, M.Z.; Anthony, T.; Ismail, M.A.; Abdullah, Z.; Alanda, A. Comparative Analysis of Machine Learning Algorithms for Cross-Site Scripting (XSS) Attack Detection. JOIV Int. J. Inform. Vis. 2024, 8, 1678.
Figure 1. Key steps of the proposed approach for XSS attack detection.
Figure 2. Transformation from text to ASCII codes to a reconstructed image.
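The transformation named in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the square image layout, the zero padding, the truncation of long payloads, and the function name `text_to_grayscale` are all assumptions made for the example.

```python
def text_to_grayscale(payload: str, side: int = 32) -> list[list[int]]:
    """Lay out character codes as a side x side grid of 0-255 pixels.

    The grid size and padding scheme are assumptions for this sketch.
    """
    # Character -> ASCII/code-point value, clipped to the 8-bit pixel range.
    codes = [min(ord(ch), 255) for ch in payload]
    # Truncate or zero-pad so every sample has exactly side * side pixels.
    n = side * side
    codes = codes[:n] + [0] * max(0, n - len(codes))
    # Reshape the flat list row by row.
    return [codes[r * side:(r + 1) * side] for r in range(side)]


image = text_to_grayscale("<script>alert(1)</script>", side=8)
for row in image:
    print(row)
```

Each row of `image` is one line of grayscale pixels; an array like this could then be fed to a CNN as a single-channel input.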
Figure 3. Architecture of the CNN model.
Figure 4. K-fold cross-validation.
Figure 5. Proposed model’s accuracy.
Figure 6. t-SNE projection of CNN feature vectors for benign vs. XSS samples.
Figure 7. Confusion matrix of the model on the test set.
Table 1. Comparison of methods used to detect cross-site scripting (XSS).
| Reference | Method | Data Type | Techniques Used | Accuracy/Results | Comment |
|---|---|---|---|---|---|
| [24] | Supervised classification of obfuscated JavaScript scripts | JavaScript scripts collected from the web (malicious vs. benign) | Naive Bayes, ADTree, SVM | 90–95% | The number of false positives is very low. |
| [23] | Traditional ML | URLs and page content | SVM, NB | 94–98% | Malicious pages are classified with great accuracy. |
| [25] | Supervised classification of HTTP requests | URL extracts and web pages (plain text) | SVM (RBF kernel), Random Forest | >90% | XSS requests can be clearly separated from normal requests. |
| [28] | ML + n-gram | Web scripts (benign or malicious) | SVM, KNN, NB + n-gram | 98% | The n-gram method with SVM achieved high accuracy. |
| [29] | Traditional ML | JS + URL | SVM, KNN, RF, Logistic Regression | 98% (RF) | Random Forest achieved the highest accuracy and lowest FPR (0.34). |
| [26] | SVM Linear/Non-Linear | Payloads (XSS) | TF-IDF, SVM | 98% | Simple but effective. |
| [27] | SVM + HTTP analysis | Raw HTTP requests | TF-IDF, SVM | 99.99% | Very precise but computationally expensive. |
| [22] | Deep learning (3C-LSTM) | XSS URL (XSSed, Benign) | Word2Vec CBOW, CNN, LSTM, Softmax | 99.36% | High and stable accuracy. |
| [18] | Hybrid DL | XSS payloads | CNN + LSTM | 96–97% | Combines semantic extraction (Word2Vec) and deep classification. |
| [31] | SVM + embeddings | Word averages | SVM | 94% | High-level NLP + ML performance. |
| [19] | MRBN-CNN | HTML, URL | Modified ResNet + Network-in-Network | 99% | Very high accuracy with CNN. |
| [20] | CNN | HTML content and scripts | CNN | 98.62% | No specific figures provided. |
| [32] | Deep Learning (CNN + ML) | XSS data (simulated/real) | CNN + ML | >99.9% | Robust against complex XSS attacks, outperforms traditional models. |
| [17] | Comparative ML/DL | Real XSS (URL, HTML, JS) | SVM, CNN, RF, etc. | 99.78% (RF) | In-depth comparative study. |
| [21] | CNN + ASCII encoding | ASCII text | CNN | n/a | Detects both XSS and SQLi together. |
Table 2. CNN block architecture summary.
| Component | Layers (In Order) |
|---|---|
| Input | Grayscale image |
| Conv Block 1 | Conv2D(32, 3 × 3, same, IncreaseRelu) → MaxPool(2 × 2) → BN → Dropout(0.25) |
| Conv Block 2 | Conv2D(64, 3 × 3, same, IncreaseRelu) → MaxPool(2 × 2) → BN → Dropout(0.25) |
| Conv Block 3 | Conv2D(128, 3 × 3, same, IncreaseRelu) → MaxPool(2 × 2) → BN → Dropout(0.25) |
| Flatten | Flatten() |
| Dense Block | Dense(256, IncreaseRelu, L2(λ = 0.01)) → BN → Dropout(0.5) |
Note: same = padding = ‘same’; BN = BatchNormalization; L2 = L2 regularization. All convolutional layers use padding = ‘same’. Batch normalization is applied after activation and pooling operations.
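To make the layer stack in Table 2 more concrete, the following sketch traces output shapes and weight counts through the three convolutional blocks and the dense block. It is only an illustration: the 64 × 64 input size is a placeholder chosen for the example (the table states only that the input is a grayscale image), batch-normalization parameters are ignored, and `trace_shapes` is a hypothetical helper name.

```python
# (filters, kernel size) of the three convolutional blocks in Table 2.
CONV_BLOCKS = [(32, 3), (64, 3), (128, 3)]
DENSE_UNITS = 256


def trace_shapes(height, width, channels=1):
    """Return (name, output shape, weight count) for each stage.

    Dropout has no weights and keeps shapes; BatchNorm weights are ignored.
    """
    rows = []
    for filters, k in CONV_BLOCKS:
        # padding='same' keeps height and width; weights = k*k*in*out + biases.
        params = (k * k * channels + 1) * filters
        # 2 x 2 max pooling halves each spatial dimension (floor division).
        height, width, channels = height // 2, width // 2, filters
        rows.append((f"conv{filters}+pool", (height, width, channels), params))
    flat = height * width * channels
    rows.append(("flatten", (flat,), 0))
    rows.append(("dense", (DENSE_UNITS,), (flat + 1) * DENSE_UNITS))
    return rows


# 64 x 64 is an assumed input size, used only to get concrete numbers.
for name, shape, params in trace_shapes(64, 64):
    print(f"{name:14s} {str(shape):16s} {params:>9,d}")
```

Under this assumed input size, almost all of the weights sit in the dense block, which is where Table 2 applies L2 regularization and the heavier dropout rate of 0.5.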
Table 3. Summary of key performance metrics by class.
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Benign | 0.9968 | 0.9960 | 0.9964 | 1262 |
| XSS | 0.9966 | 0.9973 | 0.9969 | 1474 |
| Overall Accuracy | - | - | 0.9969 | 2736 |
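As a consistency check on Table 3, the per-class counts can be recovered from recall and support, and precision and F1 can then be recomputed from those counts. The sketch below assumes a two-class problem and reconstructs the counts by rounding; they are not read from Figure 7, and `counts_from_recall` is a hypothetical helper.

```python
def counts_from_recall(recall, support):
    """Recover (correctly classified, missed) counts for one class."""
    correct = round(recall * support)
    return correct, support - correct


# Recall and support values copied from Table 3.
benign_ok, benign_missed = counts_from_recall(0.9960, 1262)
xss_ok, xss_missed = counts_from_recall(0.9973, 1474)

# In a two-class setting, a missed XSS sample is a false positive for the
# benign class, and a missed benign sample is a false positive for XSS.
precision_benign = benign_ok / (benign_ok + xss_missed)
precision_xss = xss_ok / (xss_ok + benign_missed)
f1_benign = 2 * benign_ok / (2 * benign_ok + xss_missed + benign_missed)
f1_xss = 2 * xss_ok / (2 * xss_ok + benign_missed + xss_missed)

print(benign_ok, benign_missed, xss_missed, xss_ok)
print(round(precision_benign, 4), round(precision_xss, 4))
print(round(f1_benign, 4), round(f1_xss, 4))
```

With these rounded counts, the recomputed precision and F1 values agree with the tabulated ones to four decimals.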
Table 4. K-Fold cross-validation accuracies for the CNN + SVM pipeline.
| Fold | Accuracy |
|---|---|
| 1 | 0.9953 |
| 2 | 0.9942 |
| 3 | 0.9945 |
| 4 | 0.9927 |
| 5 | 0.9949 |
| Mean ± Std | 0.9943 ± 0.0010 |
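The summary row of Table 4 can be reproduced from the five fold accuracies with Python's standard library. The table does not say which standard-deviation estimator was used, so the sketch below computes both the sample and the population form for comparison.

```python
import statistics

# Fold accuracies copied from Table 4.
fold_accuracies = [0.9953, 0.9942, 0.9945, 0.9927, 0.9949]

mean_acc = statistics.mean(fold_accuracies)
sample_std = statistics.stdev(fold_accuracies)       # divides by n - 1
population_std = statistics.pstdev(fold_accuracies)  # divides by n

print(f"{mean_acc:.4f} +/- {sample_std:.4f} (sample)")
print(f"{mean_acc:.4f} +/- {population_std:.4f} (population)")
```

The sample estimator matches the reported 0.9943 ± 0.0010 at four decimals, which suggests, without confirming, that this is the form used in the table.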
Table 5. Study comparing CNN with sigmoid classifier and CNN + SVM using identical features.
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| CNN + SVM | 0.9965 | 0.9966 | 0.9973 | 0.9969 |
| CNN + Sigmoid | 0.9883 | 1.0000 | 0.9783 | 0.9890 |
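The trade-off in Table 5 is easier to see when the F1-score is recomputed from precision and recall and the recall gap is translated into an approximate number of missed attacks. The sketch below assumes that both rows were measured on the same 1474 XSS test samples listed as support in Table 3; that is an assumption, and the missed counts are estimates obtained by rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# XSS support from Table 3, assumed to apply to both rows of Table 5.
XSS_TEST_SAMPLES = 1474

# (precision, recall) pairs copied from Table 5.
models = {"CNN + SVM": (0.9966, 0.9973), "CNN + Sigmoid": (1.0000, 0.9783)}

summary = {}
for name, (precision, recall) in models.items():
    # Approximate number of XSS samples classified as benign.
    missed = round((1 - recall) * XSS_TEST_SAMPLES)
    summary[name] = (round(f1_score(precision, recall), 4), missed)
    print(name, summary[name])
```

Under that assumption, the sigmoid head's perfect precision comes with roughly 32 missed XSS samples, against roughly 4 for the SVM head.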
Table 6. Computational cost and inference time per request of the CNN–SVM model measured during inference.
| Step | Time per Batch | Batch Size | Time per Request |
|---|---|---|---|
| CNN feature extraction | 231–242 ms | 32 | 7–8 ms |
| SVM classification | Not batch-based | 1 | 2–5 ms |
| Total inference time | 238 ± 8 ms | 32 | 10–13 ms |
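The per-request figures in Table 6 follow from dividing the batch time by the batch size and adding the SVM step. The sketch below redoes that arithmetic and derives a rough sequential throughput; the throughput is an extrapolation made for this example, not a measurement reported in the article.

```python
BATCH_SIZE = 32
cnn_batch_ms = (231.0, 242.0)  # CNN feature extraction per batch (Table 6)
svm_request_ms = (2.0, 5.0)    # SVM classification per request (Table 6)

# Amortized CNN cost of one request inside a full batch.
cnn_request_ms = tuple(t / BATCH_SIZE for t in cnn_batch_ms)
# Best case pairs the two lower bounds, worst case the two upper bounds.
total_request_ms = tuple(c + s for c, s in zip(cnn_request_ms, svm_request_ms))
# Sequential throughput implied by those bounds (slowest case first).
requests_per_second = tuple(1000.0 / t for t in reversed(total_request_ms))

print(cnn_request_ms)
print(total_request_ms)
print(requests_per_second)
```

The amortized CNN cost lands between 7 and 8 ms, and the summed bounds fall within about a millisecond of the 10–13 ms total that the table reports.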
Table 7. Comparison of performance metrics between the proposed model and the model reported in [36].
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Proposed Model | 0.9965 | 0.9966 | 0.9973 | 0.9969 |
| [36] | 0.9978 | 1.0000 | 0.9958 | 0.9979 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ayoubi, A.; Laaouina, L.; Jeghal, A.; Tairi, H. An Improved Detection of Cross-Site Scripting (XSS) Attacks Using a Hybrid Approach Combining Convolutional Neural Networks and Support Vector Machine. J. Cybersecur. Priv. 2026, 6, 18. https://doi.org/10.3390/jcp6010018
